
Adaptive behavior with stable synapses

Cristiano Capone (corresponding author: cristiano0capone@gmail.com), Natl. Center for Radiation Protection and Computational Physics, Istituto Superiore di Sanità, 00161 Rome, Italy
Luca Falorsi, Natl. Center for Radiation Protection and Computational Physics, Istituto Superiore di Sanità, 00161 Rome, Italy; PhD Program in Mathematics, Dept. of Mathematics, “Sapienza” University of Rome, 00185 Rome, Italy
Maurizio Mattia, Natl. Center for Radiation Protection and Computational Physics, Istituto Superiore di Sanità, 00161 Rome, Italy

Abstract

Behavioral changes in animals and humans, as a consequence of an error or a verbal instruction, can be extremely rapid. In machine learning and reinforcement learning, improvements in behavioral performance are usually attributed to synaptic plasticity and, more generally, to changes and optimization of network parameters. However, such rapid changes are not consistent with the timescales of synaptic plasticity, suggesting that the underlying mechanism could be a dynamical reconfiguration of the network. In the last few years, similar capabilities have been observed in transformers, a foundational architecture in machine learning that is widely used in applications such as natural language and image processing. Transformers are capable of in-context learning, the ability to adapt and acquire new information dynamically within the context of the task or environment they are currently engaged in, without the need for significant changes to their underlying parameters. Building upon the notion that something unique within transformers enables the emergence of this property, we claim that it could be supported by gain modulation, a feature extensively observed in biological networks. We propose an architecture composed of gain-modulated recurrent networks that excels at in-context learning, showing abilities inaccessible to standard networks. We demonstrate that our approach extends to non-linear and temporal tasks and to reinforcement learning. Our framework contributes to understanding the principles underlying in-context learning and adaptive behavior in both natural and artificial intelligence.

1 Introduction

The study of behavioral adaptation, observed in both humans and animals, has been a longstanding subject of research. The rapidity with which behaviors can change in response to new cues challenges the conventional explanation based on synaptic plasticity. This has led researchers to explore the mechanisms that underlie such flexible adaptations [1].

In-context learning as an emerging property in AI

In-context learning (ICL) is the capacity of a model to adapt its behavior, without weight updates, to solve tasks not encountered during training. Initially, ICL was observed in architectures tailored for few-shot learning [2] or even zero-shot learning [3]. The picture changed when it was observed that ICL emerges naturally in large-scale transformers [4, 5]. Their exceptional capability to adapt to contextual information allowed transformer-based architectures to achieve state-of-the-art performance in many domains, such as natural language processing and image analysis [6, 7, 8]. Despite some intriguing explanations [9, 10, 11], the emergence of this property is still not completely understood. In particular, it is not clear how to relate it to the in-context learning abilities observed in biological networks. The major contribution of this work is the definition of a constructive method to induce in-context learning in biologically plausible neural networks, in a broad variety of scenarios.

Limitations of the transformer architecture

Despite their success, transformer architectures present many shortcomings, primarily in their memory requirements, which scale quadratically with sequence length. This limits the scalability of transformers and hinders their applicability to tasks requiring the processing of long sequences, such as full-length document analysis or video understanding. To address these problems, several efficient linearized attention models have emerged, characterized by a forward pass executed in an RNN-like manner with constant inference memory costs. Recently, deep linear RNN architectures [12, 13] have yielded notable performance improvements over transformers, particularly in long-sequence tasks. Zucchet et al. [14] explore the efficacy of these deep linear gated RNNs, which incorporate element-wise multiplications, in approximating attention mechanisms and implementing in-context supervised learning.

Biological support for in-context learning

Our investigation extends this premise, aiming to unravel how in-context learning, as observed in deep artificial neural networks, might also manifest in biological recurrent neural networks. Are there unique features in transformers that are also present in biological networks? We claim that in-context learning can be supported by input segregation and dendritic amplification, features extensively observed in biological networks. We argue that these are biologically plausible ingredients capable of implementing a process similar to the attention mechanism present in transformers. Recent findings on dendritic computational properties [15] and on the complexity of pyramidal neuron dynamics [16] motivated the study of multi-compartment neuron models in the development of new biologically plausible learning rules [17, 18, 19, 20]. It has been proposed that the segregation of dendritic input [18] (i.e., neurons receive sensory information and higher-order feedback in segregated compartments) and the generation of high-frequency bursts of spikes [20] would support backpropagation in biological neurons. The authors of [21] suggest that this neuronal architecture naturally allows for orchestrating “hierarchical imitation learning”, enabling the decomposition of challenging long-horizon decision-making tasks into simpler subtasks. They show a possible implementation of this in a two-level network, where the high-level network produces the contextual signal for the low-level network. Here, we propose an architecture composed of gain-modulated recurrent networks that demonstrates remarkable in-context learning capabilities, which we refer to as “dynamical adaptation”. Specifically, we illustrate that our biologically plausible architecture can dynamically adapt its behavior in response to feedback from the environment without altering its synaptic weights. We present results for supervised learning of temporal trajectories and for reinforcement learning, involving non-trivial input-output temporal relations. This novel architecture aims to bridge the gap between biologically inspired in-context learning and the capabilities of artificial neural networks, offering a promising avenue for advancing our understanding of adaptive behaviour in both natural and artificial intelligence domains.

Our work generalizes and provides a biologically plausible implementation for the type of networks presented in [14]. Notably, our architecture has the same order of magnitude of trainable parameters and hidden units (see Appendix B). In our approach, in-context learning (ICL) can emerge simply by tuning the readout weights of the two involved networks. In contrast, previous implementations require biologically implausible mechanisms of temporal credit assignment for the network weights to converge correctly. We show that we can side-step this problem by dividing the architecture into two separate components and introducing an additional objective, which forces the first network to approximate the gradient of the second.

2 Methods

2.1 Dynamical adaptation replaces synaptic plasticity

Consider a generic learning task in which the response $y$ to a state $x$ is influenced by a set of parameters $w$. The latter are adjusted as a function of an internal error $e$. This can be written in a generic formulation as follows:

$$
\begin{cases}
\tau_{e}\,\dot{e} &= E(e,y,f),\\
\tau_{y}\,\dot{y} &= Y(y,w,x),\\
\tau_{w}\,\dot{w} &= W(w,x,e).
\end{cases} \tag{1}
$$

where $f$ is feedback from the environment (e.g., the reward, the target behavior, …). As an alternative to the standard interpretation of $w$ as the synaptic weights of a neuronal network, we consider them as the state variables of a dynamical system coupled to the dynamics of $y$, expressed by the activity of an auxiliary network. For this reason, we also refer to them as “virtual” weights.

2.2 Dynamical supervised learning for temporal trajectory

We consider the task of learning a target temporal trajectory $y^{targ}(t)$. We define $y(t)=\bm{w}\cdot\bm{x}(t)$ as the current estimate of $y^{targ}(t)$, where $\bm{w}$ and $\bm{x}(t)$ are the virtual weights and the input vector, respectively, since we extended the formulation to a multidimensional dataset. In this case, there is no stationary projection of the training set; instead, the current value of the target signal itself is projected at every time step $t$. We do not separate the target estimation on a test set from that on a training set; as a consequence, $y(t)=y^{train}(t)$. In this case, learning can be formulated as follows:

$$
\begin{cases}
e &= (y^{targ}-y)\\
y &= \bm{x}\cdot\bm{w}\\
\tau_{w}\,\dot{\bm{w}} &= \bm{x}\,e
\end{cases} \tag{2}
$$

where we dropped the dependence on time $t$ for simplicity. The required operations are nonlinear; in standard networks they are naturally implemented by the plasticity rule and by the multiplication between presynaptic activity and synaptic weights. Here, however, $\bm{w}$ are not actual synaptic weights but rather dynamical variables.
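As a sanity check of Eq. (2), the following minimal sketch integrates the virtual-weight dynamics with a forward-Euler step for a fixed linear target. Note that Eq. (2) is the gradient flow of the squared error $\tfrac{1}{2}e^{2}$ with respect to $\bm{w}$, since $\partial(\tfrac{1}{2}e^{2})/\partial\bm{w}=-\bm{x}\,e$. The step size `dt`, the time constant `tau_w`, the uniform input distribution and the function name `simulate_virtual_weights` are illustrative assumptions rather than values from the paper, and exact products are used here in place of any network approximation.

```python
import random


def simulate_virtual_weights(w_true, steps=5000, dt=0.01, tau_w=0.1, seed=0):
    """Forward-Euler integration of Eq. (2): tau_w * dw/dt = x * e."""
    rng = random.Random(seed)
    w = [0.0] * len(w_true)  # virtual weights: state variables, not synapses
    for _ in range(steps):
        # Hypothetical input: i.i.d. uniform samples in [-1, 1]
        x = [rng.uniform(-1.0, 1.0) for _ in w_true]
        y_targ = sum(wt * xi for wt, xi in zip(w_true, x))  # target y^targ
        y = sum(wi * xi for wi, xi in zip(w, x))            # y = x . w
        e = y_targ - y                                      # e = y^targ - y
        # Euler step of tau_w * dw/dt = x * e
        w = [wi + (dt / tau_w) * xi * e for wi, xi in zip(w, x)]
    return w


w_true = [0.5, -1.0, 2.0]
w_hat = simulate_virtual_weights(w_true)
print([round(v, 4) for v in w_hat])
```

With these settings the virtual weights relax towards `w_true` without any parameter of a network being changed, which is the sense in which adaptation is carried by a state variable.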

We propose, as a possible implementation, that these non-linear functions are computed by two neural networks, $W_{\Theta^{w}}$ and $Y_{\Theta^{y}}$:

$$
\begin{cases}
e &= (y^{targ}-y)\\
y &= {\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{x},\bm{w})\simeq Y(\bm{x},\bm{w})=\bm{x}\cdot\bm{w}\\
\tau_{w}\,\dot{\bm{w}} &= {\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(\bm{x},e)\simeq W(\bm{x},e)=\bm{x}\,e
\end{cases} \tag{3}
$$

In particular, we used two RNNs (detailed in the following Methods subsections) that receive $\bm{x},\bm{w},e$ as inputs and provide the proper output thanks to a suitable training of their readout weights ($\Theta^{y}$ and $\Theta^{w}$), following the reservoir computing paradigm. In addition, we introduce a novel concept, the gain-modulated network. This concept is inspired by the ability observed in biological neurons, particularly L5 pyramidal neurons (e.g., [22, 21]), to non-linearly integrate segregated inputs, a process critical for various cognitive functions. More details are given in Section 2.5, “Gain modulated reservoir computing (GM-RC)”. We empirically demonstrate that this network architecture outperforms standard RNNs in approximating and generalising the virtual update rules described above, suggesting that gain modulation might be an important requirement for the adaptive behaviour observed in biological agents.

The formulation shown above can only be used to tackle linear problems, by learning linear relationships between $\bm{x}$ and $y$. However, if we consider $\bm{x}$ to be the activity of another RNN that operates as a reservoir computer extracting nonlinear features of the input sequence, a wider class of tasks can be tackled without changing the constraint of a linear readout $y=wx$.

2.3 Dynamical reinforcement learning

In the context of reinforcement learning, our framework outlines a systematic approach for modelling agents that can dynamically adapt their behaviour across diverse environments. We start by defining a policy network, denoted as $\bm{\pi}=\mathrm{softmax}(\bm{y})$, which implements a policy mapping the agent state, encoded by the vector $\bm{x}$, to a probability distribution over actions. For the sake of simplicity, we assume a linear agent such that $\bm{y}=\bm{w}\cdot\bm{x}$. This assumption is made without loss of generality, since it can easily be relaxed by resorting to a reservoir computer as an intermediate layer, as described above. The policy depends on the virtual weights $\bm{w}$, determined by the activity of an additional RNN. This auxiliary network adjusts its internal activity based on the rewards received, effectively implementing policy gradient updates of the virtual weights and thereby modulating the agent behaviour in real time. This can be formalized in a way that is very similar to the one used above, by changing the definition of $e(t)$ as follows:

$$
\begin{cases}
\bm{e} &= r\left(\bm{\mathds{1}}_{a}-\bm{\pi}\right)\\
\bm{y} &= \bm{w}\cdot\bm{x}\\
\tau_{w}\,\dot{\bm{w}} &= \bm{e}\odot\bm{x}
\end{cases} \tag{4}
$$

This is the dynamics obtained by evaluating the policy gradient with respect to the virtual weights $\bm{w}$. Here, $a(t)$ is an integer value indicating the action at time $t$ among $\mathrm{D}$ possible ones, and $\mathds{1}_{a(t)}$ represents the “one-hot encoded” action (as defined in [23, 24]) at time $t$. It is a $\mathrm{D}$-element vector where the $a(t)$-th element is one and all other elements are zero.
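The update in Eq. (4) can be illustrated on a toy problem. The sketch below applies it to a two-armed bandit with Bernoulli rewards and a single constant state feature. Everything specific to the example is an assumption made for illustration: the bandit itself, the reward probabilities, the step size, and the reading of the product between $\bm{e}$ and $\bm{x}$ as an outer product so that $\bm{w}$ has one row per action. The function names `softmax` and `run_bandit` are hypothetical.

```python
import math
import random


def softmax(y):
    """Numerically stable softmax over a list of floats."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [v / total for v in exps]


def run_bandit(reward_probs, steps=3000, dt=0.01, tau_w=0.1, seed=0):
    """Euler integration of Eq. (4) on a Bernoulli bandit (illustrative)."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    x = [1.0]  # assumed: one constant state feature
    w = [[0.0] * len(x) for _ in range(n_actions)]  # virtual weights, D x n
    for _ in range(steps):
        y = [sum(wkj * xj for wkj, xj in zip(row, x)) for row in w]
        pi = softmax(y)
        a = rng.choices(range(n_actions), weights=pi)[0]  # sample action
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        # e = r * (one_hot(a) - pi)
        e = [r * ((1.0 if k == a else 0.0) - pi[k]) for k in range(n_actions)]
        # tau_w * dw/dt = e x^T (outer-product reading, an assumption)
        for k in range(n_actions):
            for j in range(len(x)):
                w[k][j] += (dt / tau_w) * e[k] * x[j]
    y = [sum(wkj * xj for wkj, xj in zip(row, x)) for row in w]
    return softmax(y)


policy = run_bandit([0.1, 0.9])
print([round(p, 3) for p in policy])
```

In this toy setting the policy concentrates on the arm with the higher reward probability, driven only by the evolution of the virtual weights.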

This formulation holds for a null discount factor [23, 24]; we refer to Appendix C for the description of the general case. The above equations can be rewritten so as to refer to only two networks, one estimating the gradients and one computing the scalar product:

$$
\begin{cases}
\tau_{w}\,\dot{\bm{w}} &= {\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(r,a,\bm{\pi})\simeq W(r,a,\bm{\pi})=r\left(\bm{\mathds{1}}_{a}-\bm{\pi}(\bm{y})\right)\odot\bm{x}\\
\bm{y} &= {\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{x},\bm{w})\simeq Y(\bm{x},\bm{w})=\bm{x}\cdot\bm{w}
\end{cases} \tag{5}
$$

2.4 Algorithm Distillation

To demonstrate the ability of this architecture to learn in context, we train the networks $Y$ and $W$ using an Algorithm Distillation protocol, as defined in [25]. Namely, we consider a family of environments $\mathcal{A}$ with different reward distributions (footnote 1: a family of environments is a set of possible environments with the same action-state transitions but different reward distributions). We then train the aforementioned networks using $N_{s}$ learning histories obtained through policy gradient methods from a subset of these environments, designated as $\mathcal{A}_{\text{ID}}$. The network performance is then tested out of distribution, i.e., on the rest of the environments $\mathcal{A}_{\text{OOD}}:=\mathcal{A}\setminus\mathcal{A}_{\text{ID}}$.
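The data side of this protocol can be sketched as follows: a family of environments is split into an in-distribution subset and an out-of-distribution remainder, and one policy-gradient learning history is recorded per in-distribution environment. This is only an illustration under stated assumptions: the environments are taken to be two-armed Bernoulli bandits, the learning rate and history length are arbitrary, and the names `learning_history` and `build_dataset` as well as the recorded fields are hypothetical, not the authors' data format.

```python
import math
import random


def softmax(y):
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [v / total for v in exps]


def learning_history(reward_probs, steps, eta, rng):
    """Record one policy-gradient learning history on a bandit environment."""
    w = [0.0] * len(reward_probs)  # one virtual weight per action (state x = 1)
    history = []
    for _ in range(steps):
        pi = softmax(w)
        a = rng.choices(range(len(w)), weights=pi)[0]
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        dw = [eta * r * ((1.0 if k == a else 0.0) - pi[k])
              for k in range(len(w))]
        # Store what a network would need to imitate the update
        history.append({"w": list(w), "a": a, "r": r, "pi": pi, "dw": dw})
        w = [wk + dk for wk, dk in zip(w, dw)]
    return history


def build_dataset(family, n_id, steps=200, eta=0.1, seed=0):
    """Split the family into A_ID and A_OOD and collect histories on A_ID."""
    rng = random.Random(seed)
    envs = list(family)
    rng.shuffle(envs)
    env_id, env_ood = envs[:n_id], envs[n_id:]  # A_OOD is the remainder
    histories = [learning_history(env, steps, eta, rng) for env in env_id]
    return env_id, env_ood, histories


# Hypothetical family: bandits with reward probabilities (p, 1 - p)
family = [(p / 10, 1 - p / 10) for p in range(1, 10)]
env_id, env_ood, histories = build_dataset(family, n_id=6)
print(len(env_id), len(env_ood), len(histories))
```

The recorded `dw` entries play the role of the targets that a gradient-estimating network would be trained to reproduce, while the held-out environments are used only for testing.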

2.5 Gain modulated reservoir computing (GM-RC)

Reservoir Computing

Reservoir Computing (RC) represents a paradigm in machine learning that provides a model for biologically plausible computation and learning, drawing inspiration from the information processing mechanisms observed in biological neural networks.

In this approach, a random RNN is employed to extract features from a time-dependent signal $\bm{x}(t)$. The dynamics of the RNN describe the evolution of $N$ hidden units $\bm{z}(t)=(z_{1}(t),\cdots,z_{N}(t))$, governed by the following differential equation:

$$
\tau_{z}\dot{\bm{z}}=\phi\left(J\bm{z}+R\bm{x}\right)-\bm{z} \tag{6}
$$

The value of each unit represents the activity of a population of neurons, following the Wilson and Cowan formulation [26]. Here, $J$ and $R$ are fixed random matrices, representing the recurrent connections and the projection from inputs to hidden units, respectively. Subsequently, the features extracted by the network can be used to predict a target signal $\bm{y}^{targ}(t)$ by learning readout weights $\Theta$ such that $|\bm{y}^{targ}(t)-\Theta\bm{z}(t)|^{2}_{2}$ is minimized. The reservoir then implements a map

$$
\bm{y}(t)=\mathrm{RNN}_{\Theta}(\{\bm{x}(s)\}_{s\leq t}) \tag{7}
$$

where $\Theta$ represents trainable parameters. In the rest of the article, we will write $\bm{y}=\mathrm{RNN}_{\Theta}(\{\bm{x}\})$, dropping the temporal dependencies (footnote 2: for this mapping to be well-defined, we can assume the RNN to have the echo-state property [27]).

When addressing input-output mappings without time dependencies, we consider a network operating in the $\tau_{z},J_{ij}\to 0$ limit. In this scenario, the network computes an instantaneous function of the input, represented as $\mathrm{NN}_{\Theta}(x)=\Theta\phi\left(R\bm{x}\right)$. Essentially, this is equivalent to considering a one-layer feed-forward network with random fixed input weights. This architecture is also referred to in the literature as the Extreme Learning Machine [28].
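Equation (6) and its memoryless limit can be sketched numerically as follows. This is an illustrative forward-Euler discretization, not the authors' implementation: the choice $\phi=\tanh$, the scaling of $J$ and $R$, the step size and the sinusoidal drive are assumptions, and helper names such as `reservoir_step` are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, std, rng):
    """Fixed random Gaussian matrix with zero mean and given std."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def reservoir_step(z, x, J, R, dt, tau_z):
    """One Euler step of Eq. (6): tau_z * dz/dt = phi(J z + R x) - z."""
    drive = [a + b for a, b in zip(matvec(J, z), matvec(R, x))]
    # phi = tanh is an assumption of this sketch
    return [zi + (dt / tau_z) * (math.tanh(d) - zi) for zi, d in zip(z, drive)]


rng = random.Random(0)
N = 20                                   # number of hidden units
J = make_matrix(N, N, 1.0 / math.sqrt(N), rng)  # recurrent connections
R = make_matrix(N, 1, 1.0, rng)                 # input projection

# Drive the reservoir with a scalar sinusoid and record its state
z = [0.0] * N
states = []
for t in range(500):
    x = [math.sin(0.1 * t)]
    z = reservoir_step(z, x, J, R, dt=0.1, tau_z=1.0)
    states.append(z)

# Memoryless limit: with J = 0 and dt = tau_z, one step from z = 0
# returns the instantaneous features phi(R x)
J_zero = [[0.0] * N for _ in range(N)]
z_elm = reservoir_step([0.0] * N, [0.5], J_zero, R, dt=1.0, tau_z=1.0)
print([round(v, 3) for v in z_elm[:3]])
```

A linear readout $\Theta$ applied to `states` (or to `z_elm` in the memoryless case) would then be the only trained component, as in the text.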

Gain modulated network architecture

Analysing Equations (3) and (5), we observe that a network must be able to perform multiplications of its inputs in order to approximate a gradient descent update of the virtual parameters. Building upon this insight, we introduce a gain-modulated reservoir network (GM-RC). This architecture draws inspiration from the morphology and function of pyramidal neurons in the cortex, which nonlinearly integrate inputs from basal and apical dendrites [22, 29]. Here, we consider an additional input source $\bm{x}^{\text{ap}}$, randomly projected into the apical dendrite of each neuron by the matrix $R^{\text{ap}}$. Consistent with experimental observations in L5 pyramidal neurons, we allow the apical inputs to modulate the gain of the activation function, thereby altering its slope. Consequently, the resulting RNN equation is formulated as:

$$
\tau_{z}\dot{\bm{z}}=\phi\left(\left(\alpha\bm{b}^{\text{ap}}+\gamma\cdot R^{\text{ap}}\bm{x}^{\text{ap}}\right)\odot(J\bm{z}+\beta\cdot R^{\text{ap}}\bm{x}^{\text{ap}}+R\bm{x})\right)-\bm{z} \tag{8}
$$

where $\bm{b}^{\text{ap}}$ is a constant bias vector. The hyperparameters $\alpha,\beta,\gamma$ control the effect of the apical inputs. Specifically, when $\gamma=0$, we obtain a network in which $\bm{x}^{\text{ap}}$ does not modulate the gain. As before, the expression $\bm{y}=\mathrm{GMRNN}_{\Theta}(\{\bm{x}\}|\{\gamma\bm{x}^{\text{ap}}\})$ will denote the input $\to$ output mapping implemented by a gain-modulated reservoir. Likewise, removing time dependencies, we obtain an instantaneous function $\bm{y}=\mathrm{GMNN}_{\Theta}(\bm{x}|\gamma\bm{x}^{\text{ap}})=\Theta\phi\left(\left(\alpha\bm{b}^{\text{ap}}+\gamma\cdot R^{\text{ap}}\bm{x}^{\text{ap}}\right)\odot(R\bm{x})\right)$ of the inputs $\bm{x}^{\text{ap}}$ and $\bm{x}$. We explicitly maintain the dependence on $\gamma$ because, in our experiments, we use this parameter to regulate the gain modulation effect of the apical inputs on the network. The name $\bm{x}^{\text{ap}}$ is inspired by the current received in the apical dendrites of L5 pyramidal neurons, which is believed to carry contextual/high-level information [22].
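To make explicit why gain modulation can provide the multiplications required by Eqs. (3) and (5), the following first-order expansion of the instantaneous gain-modulated network may help. It is a sketch under assumptions not stated in the text: $\phi$ is taken to be smooth with $\phi(0)=0$ (e.g. $\tanh$), the pre-activations $u_i$ are assumed small, and the unit index $i$, the output index $m$ and the input indices $j,k$ are introduced only for illustration.

$$
u_i=\Big(\alpha\,b^{\text{ap}}_i+\gamma\sum_j R^{\text{ap}}_{ij}\,x^{\text{ap}}_j\Big)\Big(\sum_k R_{ik}\,x_k\Big)
=\alpha\,b^{\text{ap}}_i\sum_k R_{ik}\,x_k+\gamma\sum_{j,k}R^{\text{ap}}_{ij}R_{ik}\,x^{\text{ap}}_j x_k ,
$$

$$
y_m=\sum_i\Theta_{mi}\,\phi(u_i)\approx\phi'(0)\Big[\alpha\sum_k\Big(\sum_i\Theta_{mi}\,b^{\text{ap}}_i R_{ik}\Big)x_k+\gamma\sum_{j,k}\Big(\sum_i\Theta_{mi}R^{\text{ap}}_{ij}R_{ik}\Big)x^{\text{ap}}_j x_k\Big].
$$

Under these assumptions the bilinear terms $x^{\text{ap}}_j x_k$ appear already at first order, with coefficients that the readout $\Theta$ can shape, so a product such as $\bm{x}\,e$ in Eq. (3) (with $e$ carried by $\bm{x}^{\text{ap}}$) lies within reach of a linear readout. For $\gamma=0$ these terms vanish at this order.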

3 Results

Computing resources

The experiments in this paper were executed on a MacBook Pro with a 12-core M3 CPU and 36 GB of RAM, and on a MacBook Pro with a 2.9 GHz 6-core Intel Core i9 and 32 GB of RAM.

Figure 1: Dynamical adaptation of temporal trajectories. A. Overview of the network architecture employed. A recurrent network composed of $N=20$ units (described by the vector $\bm{z}(t)$, depicted in pink, which follows Eq. (6)) is used for learning a periodic trajectory, following the prescription of reservoir computing. One recurrent network is tasked with estimating the gradient of the virtual weights, while another is dedicated to estimating the behavioral reconfiguration resulting from these updated virtual weights (illustrated in green and orange, respectively). B. Illustration of our architecture dynamically adjusting (without synaptic alterations) to adhere to the desired dynamics. Errors (represented by the blue trajectory) are fed back to the first network to assess the necessary updates to the virtual weights $\delta w$, an $N$-dimensional vector (shown in the green trajectory). Subsequently, these updates are transmitted to the second network, modifying the decoding of the reservoir dynamics (indicated by the orange lines). Initially, the reservoir receives the target trajectory as input (in open loop, before the red vertical line), which is later replaced by the estimated trajectory itself (in closed loop, after the red vertical line). C. Our networks were pre-trained on five target frequencies (marked by red vertical lines) and tested across a range of frequencies, evaluating the mean squared error (MSE) between the target and estimated trajectories in closed-loop scenarios (solid: median; dashed: 20th/80th percentiles; statistics evaluated over 10 realizations).

Dynamical adaptation for temporal trajectories

We consider the task of autonomously predicting a temporal trajectory $\{y^{targ}(t)\}_{t}$. Following the reservoir computing paradigm, we employ an RNN (Fig. 1A, pink), whose dynamics follows Equation (6), to extract temporal features. During training, the reservoir receives the target dynamics $y^{targ}(t)$ as input ($x=y^{targ}(t)$ in Eq. (6), while $R$ is a Gaussian matrix with zero mean and variance $\sigma_{R}^{2}$) and is tasked with predicting the subsequent step of the trajectory via a linear readout of its activity (open loop, see Fig. 1B). This setup can be reformulated as a dynamical supervised learning problem, as described in the Methods section, with the inputs $\bm{x}$ replaced by the features $\bm{z}(t)$ extracted by the RNN. The readout virtual weights can then be dynamically adjusted, minimizing the error between the target and the current prediction $y(t)$. This results in the following dynamics:

$$
\begin{cases}
e &= (y^{targ}-y)\\
y &= {\color[rgb]{1,.5,0}Y_{\Theta^{y}}}({\color[rgb]{1,0.58984375,0.70703125}\bm{z}},\bm{w})\simeq{\color[rgb]{1,0.58984375,0.70703125}\bm{z}}\cdot\bm{w}\\
\tau_{w}\,\dot{\bm{w}} &= {\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}({\color[rgb]{1,0.58984375,0.70703125}\bm{z}},e)\simeq{\color[rgb]{1,0.58984375,0.70703125}\bm{z}}\,e
\end{cases} \tag{9}
$$

Here, the network $W_{\Theta^{w}}(\bm{z},e)=\textrm{GMNN}_{\Theta^{w}}(\bm{z},e\,|\,\gamma_{e}e)$ is dedicated to estimating the gradient of virtual weights, while $Y_{\Theta^{y}}(\bm{z},\bm{w})=\textrm{GMNN}_{\Theta^{y}}(\bm{z},\bm{w}\,|\,\gamma\bm{w})$ is tasked with estimating the predicted $y(t)$ as a function of the new virtual weights (Fig. 1A, green and orange respectively). Here, the parameters $\gamma,\gamma_{e}$ define the strength of the gain modulation, as explained in the Methods section.
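As a minimal sketch of the dynamics in Eq. (9), the snippet below integrates the virtual weights with a forward Euler step in the idealized limit where the two networks are exact, i.e. $y=\bm{z}\cdot\bm{w}$ and $\tau_{w}\dot{\bm{w}}=\bm{z}\,e$. The features `z` are hypothetical stand-ins for the reservoir activity (two sinusoids and a constant), and `w_true`, `dt`, `t_end` and the value of `tau_w` are illustrative assumptions, not values taken from the paper.

```python
import math


def simulate_virtual_weights(w_true, tau_w=1.0, dt=0.01, t_end=50.0):
    """Euler-integrate Eq. (9) with the ideal maps y = z.w and tau_w dw/dt = z e."""
    w = [0.0] * len(w_true)  # virtual weights, initialised at zero
    n_steps = int(round(t_end / dt))
    for k in range(n_steps):
        t = k * dt
        # Hypothetical features standing in for the reservoir activity z(t).
        z = [math.sin(2 * math.pi * t), math.cos(2 * math.pi * t), 1.0]
        # Illustrative target: a fixed linear readout of the same features.
        y_targ = sum(zi * wi for zi, wi in zip(z, w_true))
        y = sum(zi * wi for zi, wi in zip(z, w))  # prediction y = z . w
        e = y_targ - y                            # error e = y_targ - y
        # tau_w * dw/dt = z * e, discretised with a forward Euler step.
        w = [wi + (dt / tau_w) * zi * e for wi, zi in zip(w, z)]
    return w


w_true = [0.5, -1.0, 0.3]
w_final = simulate_virtual_weights(w_true)
max_err = max(abs(a - b) for a, b in zip(w_final, w_true))
```

In this idealized setting no parameter of the two maps changes during the run: adaptation is carried entirely by the state variable `w`, which is the sense in which the synapses remain stable.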

These gain-modulated architectures are pre-trained to replicate, respectively, gradient descent updates and scalar products obtained on a set of $N_{train}$ training sequences $\{\{y_{\alpha}^{targ}(t)\}_{t}:\ \alpha\in[N_{train}]\}$. More specifically, the target sequences are $5$ sinusoidal functions with different frequencies (see vertical red lines in Fig. 1C).

In the closed-loop phase, the features $\bm{z}$ are obtained by directly feeding the network estimate $y(t)$ as the input to the RNN ($x=y(t)$ in Eq. (6)). This results in an autonomous dynamical system that reproduces the target trajectory (see Fig. 1B). In Fig. 1B we report an example of successful dynamical adaptation of our model. Errors, represented by the blue trajectory, are fed back to the first network to assess the necessary updates to the virtual weights, shown in the green trajectory. These updates are then transmitted to the second network, modifying the decoding of the reservoir dynamics, indicated by the orange lines. Before the red vertical line, the reservoir receives the target trajectory as input (open loop); after the red vertical line, this input is replaced by the estimated trajectory itself (closed loop). The networks were pre-trained on five target frequencies, marked by red vertical lines in Fig. 1C, and then tested across a range of frequencies. Performance was evaluated as the mean squared error (MSE) between the target and estimated trajectories in the closed-loop phase (see Fig. 1C; solid and dashed lines represent the median and the 20th/80th percentile range, respectively, and statistics are evaluated over 10 realizations of the experiment).

Figure 2: Dynamical reinforcement learning: Multi Armed Bandits. A. Schematic of network architecture and task. Virtual weights parameterize the agent's policy (orange). At each round, an action (red) is sampled from the policy and played. A reward is then sampled from the current environment. A GM-network (green) then predicts the virtual parameter update. B. Regret comparison between the distilled and the original policy gradient algorithm. We report the distribution of the log regret per round achieved at the 100th round for 100 independently trained models in ID and OOD settings. We compare a gain-modulated network ($\gamma=1$) and a network without gain modulation ($\gamma=0$) with the policy gradient training source (PG). C. OOD model performance for a varying number of hidden dimensions. Solid lines indicate the median score (expected reward), computed over 100 independently trained models. The filled area indicates the 20-80% confidence interval. D. Average regret per round curves in the OOD setting. We compare a gain-modulated network ($\gamma=1$, teal) and a network without gain modulation ($\gamma=0$, orchid) with the policy gradient training source (in black). Each curve represents the average regret per round for one fixed trained model, computed by averaging over 100 independent simulations.

3.1 Multi Armed Bandits

Dynamical adaptation is now investigated within the framework of reinforcement learning. We provide robust evidence supporting the hypothesis that gain modulation represents a fundamental component in implementing in-context learning in biological agents. We first explore stateless environments $\mathcal{A}^{bandits}$, which are represented by Bernoulli K-armed bandits [30]. Within each environment $\alpha\in\mathcal{A}^{bandits}$, there exists a subset $P_{\alpha}\subset[K]$ of arms that yield a reward with high probability ($p=0.95$). During the training phase, the in-distribution environments give a high reward probability to even-numbered arms, whereas during testing, the out-of-distribution environments assign a high reward probability to odd-numbered arms.

In this simplified bandit scenario, where state information is absent, the virtual weights $\bm{w}$ directly parameterize the policy probabilities: $\bm{\pi}=\mathrm{softmax}(\bm{w})$. Consequently, our focus lies primarily on analyzing the behavior of the network $W_{\Theta^{w}}(\cdot)$ acting on the virtual parameters (Fig. 2A, green). This serves as an ideal test bed for evaluating the network's ability to learn the policy gradient update rule and generalize it to out-of-distribution scenarios.

To parameterize $W_{\Theta^{w}}(\cdot)$ we employ a gain-modulated architecture $W_{\Theta^{w}}(\bm{\pi},a,r)=\mathrm{GMNN}_{\Theta^{w}}(\bm{\pi},a,r\,|\,\gamma r)$, where $r$ is the reward and $a$ is the action. We train this network to approximate the policy gradient update rule. As training data, we use policy gradient estimates over one single learning trajectory (1000 rounds, learning rate $lr=0.1$) in an in-distribution environment. We then test the distilled policy gradient networks with a higher learning rate ($lr=1.0$) in out-of-distribution (OOD) environments.
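For reference, the update rule that the network is trained to imitate can be sketched as a plain policy gradient (REINFORCE) learner on a Bernoulli K-armed bandit with a softmax policy over the virtual weights. This is an illustrative sketch, not the authors' implementation: the number of arms `K = 4`, the low reward probability `0.05`, the zero-based arm indexing, the absence of a baseline and the random seed are all assumptions introduced here; only $p=0.95$, the 1000 rounds and $lr=0.1$ come from the text.

```python
import math
import random


def softmax(w):
    """Numerically stable softmax: pi = softmax(w)."""
    m = max(w)
    exps = [math.exp(wi - m) for wi in w]
    total = sum(exps)
    return [x / total for x in exps]


def run_policy_gradient(p_reward, lr=0.1, n_rounds=1000, seed=0):
    """REINFORCE on a Bernoulli bandit; returns the final policy probabilities."""
    rng = random.Random(seed)
    k = len(p_reward)
    w = [0.0] * k  # virtual weights parameterising the policy
    for _ in range(n_rounds):
        pi = softmax(w)
        a = rng.choices(range(k), weights=pi)[0]        # sample an action
        r = 1.0 if rng.random() < p_reward[a] else 0.0  # Bernoulli reward
        # For a softmax policy, grad_w log pi(a) = onehot(a) - pi.
        w = [wi + lr * r * ((1.0 if i == a else 0.0) - pi[i])
             for i, wi in enumerate(w)]
    return softmax(w)


K = 4  # assumed number of arms
# Assumed in-distribution environment: even-indexed arms are the rewarding ones.
p_id = [0.95 if i % 2 == 0 else 0.05 for i in range(K)]
pi_final = run_policy_gradient(p_id)
mass_on_good_arms = sum(p for i, p in enumerate(pi_final) if i % 2 == 0)
```

An out-of-distribution environment in the sense of the text would be obtained by swapping the parity in `p_id`, so that odd-indexed arms carry the high reward probability.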

To systematically investigate the impact of gain modulation ($\gamma=1$) on out-of-distribution performance, we compare it with networks where the reward does not modulate the network gain ($\gamma=0$). For each case, we select the optimal hyperparameters through a grid-based search (additional details in the Appendix).

Our experimental findings are presented in Fig. 2. First, comparing models of different sizes (number $N$ of hidden units), we find that models with gain modulation require significantly fewer neurons to achieve maximum scores in OOD environments (Fig. 2C).

In Fig.2B, we report the regret per round distribution at the 100th round and compare it with the distribution obtained by policy gradient with the same learning rate and iterations. A gain-modulated network achieves a regret distribution comparable to the policy gradient in both ID and OOD environments, with a lower median. In contrast, networks without gain modulation show significantly higher regret. Analyzing the regret curves in Fig.2D, we observe that a model with gain modulation often learns a more data-efficient algorithm than its source, even in OOD environments. Conversely, a model without gain modulation fails to generalize to OOD environments and does not converge to zero regret.

In summary, gain modulation enables the network to consistently and efficiently distill the correct gradient update rule and generalize it to unseen environments, predicting the correct virtual weight update in regions of the input space far from the training data.

Figure 3: Dynamical reinforcement learning: Dark Room. A. Overview of the network architecture employed. One GM-network (depicted in green) is responsible for estimating the necessary update of virtual weights through policy gradient, while another GM-reservoir (shown in orange) estimates the policy reconfiguration resulting from changes in virtual weights. C. Illustration of the task: the agent begins from the center and learns to reach unobservable objects positioned at various locations (represented by colored circles). After several trials, the agent achieves precise targeting of the circles. Our networks are pre-trained on a set of target points (red crosses) and then tested on the same positions (left panel) and on new positions (right panel). E. Reward as a function of the number of trials (or games); blue: policy gradient; left panel, pink: dynamical learning for training positions (ID); right panel, pink: dynamical learning for new positions (OOD). Line: median; shading: 20th/80th percentile range. B, D, F. Same as A, C, E, respectively, but without gain modulation.

3.2 Reinforcement learning in a reaching task.

We examine a reaching task, known as the dark room task (a simple instance of the water maze task [31]), set in a 2D maze within the domain $(-1,1)\times(-1,1)$ with a grid size of $0.1$. Within this grid-like environment, the agent navigates by selecting one of four actions: up, down, left, or right, thereby determining its subsequent position. The primary goal is to locate a concealed object within the maze, with the agent being aware only of its own position. Feedback is provided via rewards: the agent receives a reward of $10$ if the distance to the object is $0.1$, $15$ if the distance is $0$, and $0$ otherwise. Through iterative exploration, the agent develops a strategy to efficiently traverse the maze and pinpoint the object despite the limited visibility. The position of the agent is encoded by 25 input units $\bm{x}$ with Gaussian activation functions, distributed on a 5x5 grid in the maze and with a width of $0.2$.
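The environment described above can be sketched in a few lines. The following is an illustrative reconstruction under explicit assumptions: positions are stored as integer grid indices (one unit equals $0.1$) to avoid floating-point drift, the 5x5 Gaussian centers are assumed to be evenly spaced at $-0.8,-0.4,0,0.4,0.8$, the width $0.2$ is interpreted as the standard deviation, and moves are clipped at the border. None of these choices is specified in the text, and the function names are hypothetical.

```python
import math

GRID_STEP = 0.1  # grid size of the maze
WIDTH = 0.2      # width of the Gaussian units (assumed to be the std)
# Assumed centers of the 5x5 grid of Gaussian input units.
CENTERS = [(-0.8 + 0.4 * i, -0.8 + 0.4 * j) for i in range(5) for j in range(5)]
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def encode_position(cell):
    """Map an integer grid cell to the activity of the 25 Gaussian input units."""
    x, y = cell[0] * GRID_STEP, cell[1] * GRID_STEP
    return [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * WIDTH ** 2))
            for cx, cy in CENTERS]


def clip(value, limit):
    """Keep a grid index inside [-limit, limit]."""
    return max(-limit, min(limit, value))


def step(cell, action, limit=9):
    """Move one grid cell in the chosen direction (assumed clipping at the border)."""
    dx, dy = ACTIONS[action]
    return (clip(cell[0] + dx, limit), clip(cell[1] + dy, limit))


def reward(cell, target):
    """15 on the target, 10 at distance 0.1 (an adjacent cell), 0 otherwise."""
    d2 = (cell[0] - target[0]) ** 2 + (cell[1] - target[1]) ** 2
    if d2 == 0:
        return 15
    if d2 == 1:
        return 10
    return 0


features = encode_position((0, 0))  # agent at the center of the maze
```

Working in integer grid units makes the condition "distance equal to $0.1$" an exact comparison rather than a floating-point one.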

To accomplish this task we consider an architecture composed of two networks. One GM-network, $W_{\Theta^{w}}(r,a,\bm{\pi})=\textrm{GMNN}_{\Theta^{w}}(r,a,\bm{\pi}\,|\,\gamma_{r}r)$, is responsible for estimating the necessary update of virtual weights through policy gradient (Fig. 3A, green network), while another GM-network, $Y_{\Theta^{y}}(\bm{x},\bm{w})=\textrm{GMNN}_{\Theta^{y}}(\bm{x},\bm{w}\,|\,\gamma\bm{w})$, estimates the policy reconfiguration resulting from changes in virtual weights (Fig. 3A, shown in orange). Here, the parameters $\gamma,\gamma_{r}$ define the strength of the gain modulation, as explained in the Methods section. Their dynamics are described by Eq. (5). We compare the case with (Fig. 3A) and without (Fig. 3B) gain modulation. We refer to the Supporting Information for further details on the training procedure.

Firstly, we assessed the performance of the policy gradient algorithm using a set of 8 food locations (refer to Fig. 3C, indicated by red crosses), where the total reward averaged over 200 trials was observed against the number of trials (Fig. 3E, depicted by blue lines, thin lines for individual positions and thick lines for the average across all positions). After multiple trials, the agent successfully achieved precise targeting of the circles. Data collected from these experiments were utilized to train our networks to estimate gradients and scalar products, as defined in Eq. (5).

To validate that our trained model is capable of dynamically implementing policy gradient itself, we tested it on both the training set locations (ID, Fig. 3C, left panel) and new test positions (OOD, Fig. 3C, right panel). For each food position (coded with different colors), we illustrate a sample trajectory executed by the agent (in corresponding colors) to reach the target at the end of the training. The agent's precision closely matches that of the plastic policy gradient learning rule. We present the reward plotted against the number of trials for both training (Fig. 3E, left panel, pink line) and test (Fig. 3E, right panel, pink line) food locations.

We compared this performance against an architecture lacking gain modulation ($\gamma=0$), observing worse performance (see Fig. 3B, D, F). Notably, while performance for ID food locations is acceptable (Fig. 3F, left panel, pink lines), performance for OOD cases is extremely poor (Fig. 3F, right panel, pink lines). This observation, coupled with the results from the preceding section, suggests that gain modulation is a crucial component in facilitating the generalization of adaptive behavioral capabilities in recurrent networks.

We also demonstrate that our architecture can perform the temporal computation required to learn delayed action-reward relations, which requires temporal credit assignment (see Section C in the Appendix).

4 Discussion

It is believed that the remarkable capability of transformers to adapt to contextual information is the key ingredient that allowed transformer-based architectures [4, 32] to achieve state-of-the-art results in many domains, such as the processing and generation of natural language and images [6, 7, 8]. Currently, mechanisms based on attention, associative memory [9], and induction-based copying by attention heads [10] are the predominant tentative explanations for the emergence of in-context learning in transformers. In particular, it has been shown theoretically that the attention mechanism is able to execute gradient descent updates [11]. However, the emergence of this property is still not fully understood, and it is not clear how to relate it to the in-context learning capabilities observed in humans and animals. In this work we proposed a constructive method to induce in-context learning in biologically plausible neural networks, in a broad variety of scenarios.

When tasked with learning temporal trajectories through error feedback and virtual weight updates, the network achieves successful dynamical adaptation without synaptic plasticity. Similarly, in reinforcement learning scenarios, the policy in response to the environment state is dynamically modulated by virtual weights that are updated as a function of the reward. We stress that virtual weights are not synaptic weights: their values and updates are evaluated and encoded by the activity of other RNNs that were pre-trained to perform gradient updates. Another pre-trained network receives those weights as input, along with the current environment features, and computes an adapted in-context response. We find that networks with gain modulation exhibit improved performance and robustness to variations compared to those without gain modulation. Moreover, these gain-modulated networks learn more data-efficient algorithms, outperforming their counterparts in both in-distribution and out-of-distribution environments, which shows their capability to face novel scenarios.

In conclusion, our approach provides an explicit framework to induce ICL in biologically plausible networks, possibly opening the route to a formal understanding of ICL in biological agents.

Limitations of this study

In this study, we focused on the dynamical adaptation of virtual readout weights; in other words, only the last layer is considered. We neglect the dynamics of the recurrent weights and of the weights of the hidden units, which, in general, strongly affect performance and stability. Generalizability to untested real-world conditions is uncertain, and we do not investigate the scalability to larger architectures. We argue that the mechanism we discuss could be present in biological agents, allowing in-context learning. However, the theoretical basis for attention mechanisms executing gradient descent updates lacks comprehensive empirical validation in biological experiments.

Additionally, the study does not fully address how dynamic gradient structures could be learned or developed in biological networks.

5 Acknowledgments

This research has received financial support from the Italian National Recovery and Resilience Plan (PNRR), M4C2, funded by the European Union - NextGenerationEU (Project IR0000011, CUP B51E22000150006, ‘EBRAINS-Italy’) to MM. LF is supported by ICSC – Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing, funded by European Union – NextGenerationEU.

Source code availability

The source code is available under a CC-BY license in the public repository https://github.com/cristianocapone/ABSS.

References

  • [1] Inah Lee and Choong-Hee Lee. Contextual behavior and neural circuits. Frontiers in neural circuits, 7:84, 2013.
  • [2] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
  • [3] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International conference on machine learning, pages 2152–2161. PMLR, 2015.
  • [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [5] Aaditya Singh, Stephanie Chan, Ted Moskovitz, Erin Grant, Andrew Saxe, and Felix Hill. The transient nature of emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 36, 2024.
  • [6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [7] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [8] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.
  • [9] Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, et al. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217, 2020.
  • [10] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
  • [11] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023.
  • [12] Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.
  • [13] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670–26698. PMLR, 2023.
  • [14] Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes Von Oswald, Maxime Larcher, Angelika Steger, and Joao Sacramento. Gated recurrent neural networks discover attention. arXiv preprint arXiv:2309.01775, 2023.
  • [15] Panayiota Poirazi and Athanasia Papoutsi. Illuminating dendritic function with computational models. Nature Reviews Neuroscience, 21(6):303–321, 2020.
  • [16] Matthew Larkum. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in neurosciences, 36(3):141–151, 2013.
  • [17] Robert Urbanczik and Walter Senn. Learning by the dendritic prediction of somatic spiking. Neuron, 81(3):521–528, 2014.
  • [18] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. eLife, 6:e22901, 2017.
  • [19] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8721–8732. Curran Associates, Inc., 2018.
  • [20] Alexandre Payeur, Jordan Guerguiev, Friedemann Zenke, Blake A Richards, and Richard Naud. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature neuroscience, 24(7):1010–1019, 2021.
  • [21] Cristiano Capone, Cosimo Lupo, Paolo Muratore, and Pier Stanislao Paolucci. Beyond spiking networks: The computational advantages of dendritic amplification and input segregation. Proceedings of the National Academy of Sciences, 120(49):e2220743120, 2023.
  • [22] Matthew Larkum. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in Neurosciences, 36(3):141–151, 2013.
  • [23] Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent networks of spiking neurons. Nature communications, 11(1):1–15, 2020.
  • [24] Cristiano Capone and Pier Stanislao Paolucci. Towards biologically plausible dreaming and planning. arXiv preprint arXiv:2205.10044, 2022.
  • [25] Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, et al. In-context reinforcement learning with algorithm distillation. arXiv preprint arXiv:2210.14215, 2022.
  • [26] Hugh R Wilson and Jack D Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical journal, 12(1):1–24, 1972.
  • [27] Izzet B Yildiz, Herbert Jaeger, and Stefan J Kiebel. Re-visiting the echo state property. Neural networks, 35:1–9, 2012.
  • [28] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1-3):489–501, 2006.
  • [29] Adam S Shai, Costas A Anastassiou, Matthew E Larkum, and Christof Koch. Physiology of layer 5 pyramidal neurons in mouse primary visual cortex: coincidence detection through bursting. PLoS computational biology, 11(3):e1004090, 2015.
  • [30] Donald A Berry and Bert Fristedt. Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall, 5(71-87):7–7, 1985.
  • [31] Richard GM Morris. Spatial localization does not require the presence of local cues. Learning and motivation, 12(2):239–260, 1981.
  • [32] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
  • [33] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

Appendix A Relationship with previous work on product-based architectures implementing in-context learning

The relationship between attention-based transformer architectures and in-context learning was first noted in [11], where it was shown through a constructive proof, confirmed by experiments, that linear attention layers can implement a gradient descent update for linear regression in their forward pass. Building upon this insight, [14] showed that a linear two-layer gated RNN can implement the same mechanism and thus replicate a linear transformer. That work proposes an implementation that uses $O(d^{2})$ hidden units and has $O(d^{4})$ trainable weights, which can be reduced to $O(d^{3})$ by side gating.

Our gain-modulated architecture shares significant analogies with the gated RNNs analyzed in [14]. Consider linear gain-modulated network architectures (for simplicity, here we assume $\alpha,\beta=0,\ \gamma=1$ and a linear activation function $\phi=Id$) implementing the functions $Y_{\Theta^{y}}(\bm{z},\bm{w})$ and $W_{\Theta^{w}}(\bm{z},\bm{y}^{targ}-\bm{y})$, which dynamically perform an ICL linear regression task that requires mapping time-dependent features $\bm{z}(t)\in\mathbb{R}^{N_{in}}$ to an output $\bm{y}^{targ}\in\mathbb{R}^{N_{out}}$:

$$
\begin{cases}
\bm{y} &= Y_{\Theta^{y}}(\bm{z},\bm{w})=\Theta^{y}\left(R^{ap}_{y}\bm{w}\right)\odot\left(R_{y}\bm{z}(t)\right) \\
\tau_{w}\,\dot{\bm{w}} &= W_{\Theta^{w}}(\bm{z},\bm{y}^{targ}-\bm{y})=\Theta^{w}\left(R^{ap}_{w}\left(\bm{y}^{targ}-\bm{y}\right)\right)\odot\left(R_{w}\bm{z}(t)\right)
\end{cases}
\tag{10}
$$

This architecture can be seen as a particular instance of the RNN with side gating (see Equation (15) in [14]). Interestingly, a similar mechanism is also present in the recently proposed Mamba layer [33].
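To make the role of the element-wise product explicit, the readout of the linear gain-modulated network can be expanded for a scalar output ($N_{out}=1$). This is a short worked step under the linear assumptions stated above; the index notation, the symbol $N_{h}$ for the number of hidden units in this appendix, and the matrix $M$ are introduced here only for illustration:

$$
y=\sum_{k=1}^{N_{h}}\Theta^{y}_{k}\,\left(R^{ap}_{y}\bm{w}\right)_{k}\left(R_{y}\bm{z}\right)_{k}
=\sum_{i,j=1}^{N_{in}} w_{i}\,M_{ij}\,z_{j},
\qquad
M_{ij}=\sum_{k=1}^{N_{h}}\Theta^{y}_{k}\,\left(R^{ap}_{y}\right)_{ki}\left(R_{y}\right)_{kj}.
$$

The output is therefore a bilinear form in $\bm{w}$ and $\bm{z}$, and it equals the scalar product $\bm{z}\cdot\bm{w}$ exactly whenever $M$ is the identity. This imposes $N_{in}^{2}$ linear constraints on the $N_{h}$ readout weights, which can generically be satisfied for random projections once $N_{h}\geq N_{in}^{2}$, a count of the same order as the $O(d^{2})$ hidden units mentioned above. A network without the multiplicative term and with a linear activation can only produce functions that are additive in $\bm{w}$ and $\bm{z}$, so it cannot represent this bilinear form.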

Appendix B Approximating gradients with gain modulated architectures

In this appendix, we show that an architecture with gain modulation is better suited to approximating the gradient terms involved in dynamical learning.

Scalar product

We first consider the task of approximating a scalar product between virtual weights $\bm{w}\in\mathbb{R}^{N_{in}}$ and features $\bm{x}\in\mathbb{R}^{N_{in}}$. To achieve this, we train the readout weights $\Theta$ of a gain-modulated network $\mathrm{GMNN}_{\Theta}(\{\bm{x}\}|\{\gamma\bm{w}\})$ with $N_{h}$ hidden features to approximate the function $dot(\bm{x},\bm{w})=\sum_{i=1}^{N_{in}}x_{i}w_{i}$.

The training dataset is composed of $1000$ pairs $(\bm{x},\bm{w})$ uniformly sampled in the hypercube $[0,1]^{2N_{in}}$. The test set consists of the same number of pairs sampled in the hypercube $[-1,1]^{2N_{in}}$.

We compare an architecture with gain modulation ($\gamma=1$) with an architecture without gain modulation ($\gamma=0$). For each of the two settings, we select the best hyperparameters by fixing the dimensionality of the inputs ($N_{in}=5$) and the number of hidden units ($N_{h}=101$) and varying the standard deviation of the projection matrices, $R_{ij},R^{ap}_{ij}\sim\mathcal{N}(0,\sigma_{R}^{2})$ with $\log_{10}(\sigma_{R})\in\{-2,-1.8,\cdots,0\}$, the standard deviation of the bias, $b_{i}\sim\mathcal{N}(0,\sigma_{b}^{2})$ with $\sigma_{b}\in\{0,0.1,\cdots,1\}$, and the nonlinearity type $\phi\in\{\mathrm{tanh},\mathrm{softplus}\}$.
The best hyperparameters found were used in the models to test the approximation performance of the dot product while varying the number of input features ($N_{in}\in[1,10]\cap\mathbb{N}$) and hidden units ($N_{h}\in[1,200]\cap\mathbb{N}$). Results are shown in Fig. A1 A, B, C. Both architectures require a number of hidden units that grows quadratically with $N_{in}$ to achieve low error on the test set. However, once a threshold number of hidden units is reached, the gain-modulated network approximates the scalar product function almost perfectly, consistently achieving an RMSE of $\sim 10^{-7}$, whereas the error of the network without gain modulation remains several orders of magnitude higher.
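The comparison above can be illustrated with a minimal sketch. It assumes the simplified linear form of the gain-modulated features, $(R^{ap}\bm{w})\odot(R\bm{x})$, with no bias or nonlinearity, and it uses a generic additive tanh random-feature network as a stand-in for the $\gamma=0$ case, whose exact form is not specified in this appendix. The names `R_cat`, `fit_readout` and the values of `sigma_R` and the bias scale are illustrative choices, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N_in, N_h, n_samples = 5, 101, 1000
sigma_R = 1.0  # illustrative value, not a tuned hyperparameter

# Fixed random projections; only the readout is trained.
R = rng.normal(0.0, sigma_R, size=(N_h, N_in))          # projects the features x
R_ap = rng.normal(0.0, sigma_R, size=(N_h, N_in))       # projects the modulating input w
R_cat = rng.normal(0.0, sigma_R, size=(N_h, 2 * N_in))  # baseline: projects [x, w] jointly
b = rng.normal(0.0, 0.5, size=N_h)                      # baseline bias (illustrative scale)

def gain_modulated_features(x, w):
    # Hidden unit k multiplies its two projected inputs (linear setting).
    return (w @ R_ap.T) * (x @ R.T)

def additive_features(x, w):
    # Stand-in for a network without gain modulation (an assumption).
    return np.tanh(np.concatenate([x, w], axis=1) @ R_cat.T + b)

def fit_readout(H, target):
    # Minimum-norm least-squares readout.
    theta, *_ = np.linalg.lstsq(H, target, rcond=None)
    return theta

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Train on [0, 1]^(2 N_in), test on [-1, 1]^(2 N_in).
x_tr = rng.uniform(0, 1, (n_samples, N_in))
w_tr = rng.uniform(0, 1, (n_samples, N_in))
x_te = rng.uniform(-1, 1, (n_samples, N_in))
w_te = rng.uniform(-1, 1, (n_samples, N_in))
dot_tr = np.sum(x_tr * w_tr, axis=1)
dot_te = np.sum(x_te * w_te, axis=1)

theta_gm = fit_readout(gain_modulated_features(x_tr, w_tr), dot_tr)
theta_add = fit_readout(additive_features(x_tr, w_tr), dot_tr)

rmse_gm = rmse(gain_modulated_features(x_te, w_te) @ theta_gm, dot_te)
rmse_add = rmse(additive_features(x_te, w_te) @ theta_add, dot_te)
print(f"gain-modulated test RMSE: {rmse_gm:.2e}, additive test RMSE: {rmse_add:.2e}")
```

In this linear sketch the multiplicative features span the products $x_i w_j$ exactly, so a readout fitted on the positive orthant carries over to the full hypercube, while the additive stand-in has to extrapolate.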

Scalar-vector product

We then tackle the task of approximating a scalar-vector product between features $\bm{x}\in\mathbb{R}^{N_{in}}$ and an error $e\in\mathbb{R}$. To achieve this, we train the readout weights $\Theta$ of a gain modulated network $\mathrm{GMNN}_{\Theta}(\{\bm{x}\}|\{\gamma e\})$ with $N_{h}$ hidden features to approximate the function $prod(\bm{x},e)=e\bm{x}\in\mathbb{R}^{N_{in}}$.

The training dataset is composed of $1000$ pairs $(\bm{x},e)$ uniformly sampled in the hypercube $[0,1]^{N_{in}}\times[0,1]$. The test set consists of the same number of pairs sampled in the hypercube $[-1,1]^{N_{in}}\times[-1,1]$.

We compare an architecture with gain modulation ($\gamma=1$) with an architecture without gain modulation ($\gamma=0$). For each of the two settings, we select the best hyperparameters by fixing the dimensionality of the inputs ($N_{in}=5$) and the number of hidden units ($N_{h}=21$) and varying the standard deviation of the projection matrices, $R_{ij}\sim\mathcal{N}(0,\sigma_{R}^{2}/N_{in})$ and $R^{ap}_{ij}\sim\mathcal{N}(0,\sigma_{R}^{2})$ with $\log_{10}(\sigma_{R})\in\{-2,-1.8,\cdots,0\}$, the standard deviation of the bias, $b_{i}\sim\mathcal{N}(0,\sigma_{b}^{2})$ with $\sigma_{b}\in\{0,0.1,\cdots,1\}$, and the nonlinearity type $\phi\in\{\mathrm{tanh},\mathrm{softplus}\}$.
The best hyperparameters found were used in the models to test the approximation performance of the scalar-vector product while varying the number of input features ($N_{in}\in[1,10]\cap\mathbb{N}$) and hidden units ($N_{h}\in[1,40]\cap\mathbb{N}$). Results are shown in Fig. A1 D, E, F. We see that the network without gain modulation requires a quadratic number of hidden units to achieve low error on the test set. As in the scalar product case, the gain-modulated network can nearly perfectly approximate the target function after a threshold number of hidden units is reached. Significantly, in this case, the threshold scales linearly with $N_{in}$.
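The linear scaling of the threshold can be checked in a small sketch. It again assumes the simplified linear features $(R^{ap}e)\odot(R\bm{x})$ without bias or nonlinearity; the helper name `scalar_vector_test_rmse`, the seed and $\sigma_R=1$ are illustrative choices rather than the settings used for Fig. A1.

```python
import numpy as np

def scalar_vector_test_rmse(N_in, N_h, seed=0, n_samples=1000):
    """Test RMSE of a linear gain-modulated network fitted to prod(x, e) = e * x."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(N_in), size=(N_h, N_in))  # variance 1 / N_in
    R_ap = rng.normal(0.0, 1.0, size=(N_h, 1))                  # projects the scalar e

    def features(x, e):
        # e has shape (n, 1); each hidden unit multiplies its two projections.
        return (e @ R_ap.T) * (x @ R.T)

    # Train on [0, 1]^N_in x [0, 1], test on [-1, 1]^N_in x [-1, 1].
    x_tr = rng.uniform(0, 1, (n_samples, N_in))
    e_tr = rng.uniform(0, 1, (n_samples, 1))
    x_te = rng.uniform(-1, 1, (n_samples, N_in))
    e_te = rng.uniform(-1, 1, (n_samples, 1))

    # Least-squares readout of shape (N_h, N_in).
    theta, *_ = np.linalg.lstsq(features(x_tr, e_tr), e_tr * x_tr, rcond=None)
    pred = features(x_te, e_te) @ theta
    return float(np.sqrt(np.mean((pred - e_te * x_te) ** 2)))

N_in = 5
below = scalar_vector_test_rmse(N_in, N_h=N_in - 1)  # one hidden unit short
above = scalar_vector_test_rmse(N_in, N_h=2 * N_in)  # comfortably above N_in
print(f"N_h = {N_in - 1}: {below:.2e}   N_h = {2 * N_in}: {above:.2e}")
```

With fewer than $N_{in}$ hidden units the readout cannot reproduce all $N_{in}$ output directions, so the error stays large; once $N_h\ge N_{in}$, the fit becomes exact up to numerical precision in this linear sketch.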

Figure A1: A Test errors for the scalar product approximation task, varying the number of features $N_{in}$ and hidden units $N_{h}$. We compare architectures with gain modulation ($\gamma=1$, bottom) with architectures without gain modulation ($\gamma=0$, bottom). B Test errors for the dot product approximation task, varying the number of features $N_{in}$ and fixing the number of hidden units $N_{h}=101$. Solid lines indicate the median over 100 trained models, while the filled region indicates the 20th/80th-percentile interval. C Same as B but fixing the number of features $N_{in}=5$ and varying the number of hidden units $N_{h}$. D, E, F Same as panels A, B, C but for the scalar-vector product approximation task.

On the number of hidden units and trainable weights needed in our model

Consider the architecture in Eq. (10) with $N_{out}=1$:

$$
\begin{cases}
y &= {\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{z},\bm{w}) = \Theta^{y}\left(R^{ap}_{y}\bm{w}\right)\odot\left(R_{y}\bm{z}(t)\right)\\
\tau_{w}\,\dot{\bm{w}} &= {\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(\bm{z},y^{targ}-y) = \Theta^{w}\left(R^{ap}_{w}\left(\hat{y}-y\right)\right)\odot\left(R_{w}\bm{z}(t)\right)
\end{cases}
\tag{11}
$$

where $R^{ap}_{y},R_{y},R^{ap}_{w},R_{w}$ are fixed random matrices of dimension $N_{y}\times N_{in}$, $N_{y}\times N_{in}$, $N_{w}\times 1$ and $N_{w}\times N_{in}$, respectively, and $\Theta^{y},\Theta^{w}$ are trainable readout matrices of sizes $1\times N_{y}$ and $N_{in}\times N_{w}$. In this simplified linear setting, the features extracted by the network are linear combinations of products of the input entries, which the readout weights must linearly combine into the target function; in both cases the target is itself a specific linear combination of these products.
It is then straightforward to observe that, to approximate the dot product with ${\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{z},\bm{w})$, we need $N_{y}\sim O(N_{in}^{2})$ hidden units and $O(N_{in}^{2})$ readout parameters, as confirmed in the experiment. This number can be further reduced to $O(N_{in})$ for both the hidden units and the trainable weights (footnote 4: this can be done by assuming that $R^{ap},R_{w}$ are diagonal matrices, thus reducing the problem to estimating $N_{in}$ $1\times 1$ products).
As for the network ${\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(\bm{z},y^{targ}-y)$ approximating vector-scalar products, similar considerations support the experimental observation that $N_{w}\sim O(N_{in})$ hidden units and $O(N_{in}^{2})$ trainable readout weights are needed in this case.
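The counting can be made explicit with a short worked derivation. This is a sketch for the linear setting only, assuming that the elementwise product is taken before the readout; the hidden-unit index $k$, the symbols $h_k$ and $\delta_{ij}$, and the identity matrix $I_{N_{in}}$ are notation introduced here for illustration.

$$
\begin{aligned}
h_k &= \big(R^{ap}_{y}\bm{w}\big)_k\,\big(R_{y}\bm{z}\big)_k = \sum_{i,j=1}^{N_{in}} (R^{ap}_{y})_{ki}\,(R_{y})_{kj}\,w_i z_j,\\
y &= \sum_{k=1}^{N_y}\Theta^{y}_{k}h_k = \sum_{i,j=1}^{N_{in}}\Big(\sum_{k=1}^{N_y}\Theta^{y}_{k}\,(R^{ap}_{y})_{ki}\,(R_{y})_{kj}\Big)w_i z_j .
\end{aligned}
$$

Matching $y=\sum_i w_i z_i$ for all inputs requires $\sum_{k}\Theta^{y}_{k}(R^{ap}_{y})_{ki}(R_{y})_{kj}=\delta_{ij}$, that is, $N_{in}^{2}$ linear constraints on the $N_y$ readout weights. For generic random projections these can be satisfied only if $N_y\geq N_{in}^{2}$. For the scalar-vector product, the same steps give $\Theta^{w}\,\mathrm{diag}(R^{ap}_{w})\,R_{w}=I_{N_{in}}$, which needs only $N_w\geq N_{in}$ hidden units but $N_{in}\times N_w$ readout entries.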

For a multivariate $\mathbb{R}^{N_{in}}\to\mathbb{R}^{N_{out}}$ linear regression task, we can consider $N_{out}$ independent modules of the type described above. In this case, the number of trainable parameters is $O(N^{2}_{in}\cdot N_{out})$ with $O(N_{in}\cdot N_{out})$ hidden units ($O(N_{in}^{2}\cdot N_{out})$ in the more biologically plausible case).

Appendix C Reinforcement Learning and the Discount Factor

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. The discount factor, usually denoted by $\gamma$ (where $0\leq\gamma\leq 1$), is crucial in RL as it determines the importance of future rewards. The cumulative reward $R_{t}$ at time step $t$ is given by:

$$
R_{t}=r_{t}+\gamma r_{t+1}+\gamma^{2}r_{t+2}+\gamma^{3}r_{t+3}+\ldots
$$
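As a small worked example of this sum, the sketch below evaluates the cumulative reward for a finite episode using the backward recursion $R_t=r_t+\gamma R_{t+1}$. The reward sequence and the function name `discounted_return` are illustrative.

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward R_0 for a finite list r_0, r_1, ..."""
    total = 0.0
    # Backward recursion: R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A single reward arriving three steps in the future (illustrative episode).
rewards = [0.0, 0.0, 0.0, 1.0]
far_sighted = discounted_return(rewards, 0.9)  # equals 0.9 ** 3
myopic = discounted_return(rewards, 0.0)       # the delayed reward is ignored
print(far_sighted, myopic)
```

With $\gamma=0$ only the immediate reward counts, so a reward that arrives later contributes nothing to $R_0$.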

In our experiment, the policy gradient update rule in the presence of the discount factor could be approximated as (see [23]):

$$
\begin{cases}
\bm{e} &= \left(\bm{\mathds{1}}_{a}-\bm{\pi}\right)\odot\bm{x}\\
\bm{y} &= \bm{w}\cdot\bm{x}\\
\tau_{w}\,\dot{\bm{w}} &= r\,\bm{\hat{e}}
\end{cases}
\tag{12}
$$

where $\bm{\hat{e}}$ is an exponential temporal filtering of $\bm{e}$, with a timescale that is proportional to $-\frac{1}{\log(\gamma)}$. The importance of the discount factor lies in its ability to balance immediate and future rewards. A higher $\gamma$ gives more weight to future rewards, encouraging the agent to consider the long-term consequences of its actions. Conversely, a lower $\gamma$ makes the agent prioritize immediate rewards. This balance is essential for the stability of the learning process. An appropriately chosen $\gamma$ ensures stable learning and convergence; if $\gamma$ is too high, the agent might overvalue distant future rewards, leading to instability, while a very low $\gamma$ can result in short-sighted behavior. If the discount factor is zero, the agent's policy changes only very close to the moment the reward is received. In a reaching task, this means that the policy is updated only near the target, ignoring the need for long-term planning and making it unlikely that the agent ever reaches the goal from distant starting points.
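One way to see why filtering $\bm{e}$ accounts for the discount factor is a standard identity: weighting each reward by the filtered trace gives the same total as weighting each $e_t$ by the discounted return that follows it. The sketch below checks this numerically under simplifying assumptions that are ours: discrete time, a scalar $e_t$, and the filter $\hat{e}_t=\gamma\hat{e}_{t-1}+e_t$. The sequences are random placeholders, not data from the experiment.

```python
import math
import random

def filtered_trace(e, gamma):
    """Exponential filter: e_hat[t] = gamma * e_hat[t-1] + e[t]."""
    e_hat, acc = [], 0.0
    for e_t in e:
        acc = gamma * acc + e_t
        e_hat.append(acc)
    return e_hat

def returns_to_go(r, gamma):
    """Discounted return R_t = r_t + gamma * R_{t+1} for every t."""
    out, acc = [0.0] * len(r), 0.0
    for t in range(len(r) - 1, -1, -1):
        acc = r[t] + gamma * acc
        out[t] = acc
    return out

random.seed(0)
gamma, T = 0.9, 50
e = [random.uniform(-1.0, 1.0) for _ in range(T)]  # placeholder error signal
r = [random.random() for _ in range(T)]            # placeholder rewards

e_hat = filtered_trace(e, gamma)
R = returns_to_go(r, gamma)

# sum_t r_t * e_hat_t  equals  sum_t e_t * R_t
lhs = sum(r_t * eh_t for r_t, eh_t in zip(r, e_hat))
rhs = sum(e_t * R_t for e_t, R_t in zip(e, R))

# Timescale of the filter: an impulse decays to 1/e after tau steps.
tau = -1.0 / math.log(gamma)
print(lhs, rhs, tau)
```

For $\gamma=0.9$ the timescale is about 9.5 steps, and it grows without bound as $\gamma\to 1$.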

To demonstrate that our approach performs the temporal computation required to consider future rewards, we compare the performances of our network (see Fig. A2.A-B) to a network without recurrent connections, which therefore cannot perform temporal computations. Recurrent connections enable a network to maintain a memory of past states and actions, effectively allowing it to use information from previous time steps to inform current decisions. Without these connections, a network operates purely on the current input without any contextual information from prior steps, thus lacking the ability to perform temporal computations (see Fig. A2.D-E).

Performance is higher in the first case, i.e. with recurrent connections. This can be visualized by looking at the policy at the end of the dynamic reinforcement learning for a specific target location. When the recurrent weights are set to zero, the policy points towards the target position only in the vicinity of the target itself (see Fig. A2F), resulting in failure when the agent randomly moves in the wrong direction at the beginning of the task (see Fig. A2F, black line).

On the other hand, in the presence of recurrent weights, the proper policy (pointing towards the target, see Fig. A2C) is known even when far from the target, allowing optimal long-term planning (see Fig. A2C, black line).

Figure A2: Discount Factor and eligibility traces: A. Sample trajectories of the agent at the end of the learning procedure, for various target locations (color-coded), different from the ones used to train the gradient and the scalar product networks (red crosses). B. Reward as a function of number of trials (or games), blue: policy gradient, pink: dynamical learning for new positions (OOD). Line: median, shading: 20-th/80-th percentile range. C. Arrows: policy at the end of the training as a function of the position. Line: sample trajectory of the agent at the end of learning. Small circle: agent position, double circle: target position. D-E-F same as in A-B-C, but the recurrent connections of the gradient network are set to zero.

Appendix D Additional details on the "dynamical learning of a temporal trajectory" experiment

In the "dynamical learning of a temporal trajectory" experiment we test our architecture and non-synaptic learning approach on temporal tasks. The primary goal is to analyze the network’s ability to generalize beyond the target frequencies used to pre-train our networks.

Table 1: Simulation Parameters

| Parameter | Symbol | Reservoir | Gradient Net | Scalar Net |
|---|---|---|---|---|
| Network Size | $N$ | 20 | 500 | 500 |
| Input Dimension | $I$ | 1 | 100 + 10 | 100 + 100 |
| Apical Input Dimension | $I^{ap}$ | 0 | 1 | 100 |
| Output Dimension | $O$ | 1 | 100 | 1 |
| Time Step | $dt$ | 0.005 | | |
| Reservoir Time Constant | $\tau_{m_{f}}$ | 10 $dt$ | 1 $dt$ | 1 $dt$ |
| Input weights var | $\sigma_{input}$ | 0.06 | 0.06 | 0.06 |
| Apical Input weights var | $\sigma_{input}^{ap}$ | 0.0 | 0.1 | 0.1 |
| Recurrent weights | $\sigma_{rec}$ | $0.99/\sqrt{N}$ | $0.5/\sqrt{N}$ | $0/\sqrt{N}$ |
| Gain-modulation factor | $\gamma_{net}$ | 0 | 1 | 1 |

Three networks are defined in this experiment: one uses a reservoir to compute temporal features and encode the target temporal trajectory, one predicts the required weight updates, and one predicts the scalar product, as discussed in the main text. The parameters used in the simulation are summarized in Table 1.

Inputs are projected to the network through Gaussian weights with zero mean and variance $\sigma_{in}^{2}$ to distribute the input information across multiple units in the reservoir.

Data collection and pre-training

The readout weights of the first (reservoir) network are denoted $R_{j}$, and the readout at time $t$ is given by:

$$
y(t)=\sum_{j}R_{j}z_{j}(t),
$$

where $z_{j}(t)$ are the reservoir states.

Training and evaluation involve presenting the networks with five target trajectories $y^{targ}(t)=0.8\sin(\omega_{targ}t)$, with five angular velocities, $\omega_{targ}$, ranging from $0.04$ to $0.08$. The reservoir readout parameters are trained using online gradient descent to minimize the error between the current output and the target output. The loss function is defined as:

$$
L=\frac{1}{2}\left(y(t)-y^{\text{target}}(t)\right)^{2}.
$$

The gradient used for online training is:

$$
\Delta R_{j}=-\eta\frac{\partial L}{\partial R_{j}}=-\eta\left(y(t)-y^{\text{target}}(t)\right)z_{j}(t),
$$

where $\eta$ is the learning rate.
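The online update can be sketched in a few lines. Since the reservoir dynamics are not restated in this appendix, the sketch replaces the reservoir states $z_j(t)$ with a hypothetical stand-in (phase-shifted sinusoids at the target frequency). The learning rate, the number of steps and the use of the step index as time are also illustrative assumptions.

```python
import math
import random

random.seed(0)
N = 20         # number of state variables (Table 1 lists N = 20 for the reservoir)
eta = 0.02     # learning rate (illustrative value)
omega = 0.06   # target angular velocity, inside the 0.04-0.08 range
steps = 5000

# Hypothetical stand-in for reservoir states z_j(t): phase-shifted sinusoids.
phases = [random.uniform(0.0, 2.0 * math.pi) for _ in range(N)]

def states(t):
    return [math.sin(omega * t + phase) for phase in phases]

R = [0.0] * N  # readout weights R_j
losses = []
for t in range(steps):
    z = states(t)
    y = sum(R_j * z_j for R_j, z_j in zip(R, z))   # y(t) = sum_j R_j z_j(t)
    err = y - 0.8 * math.sin(omega * t)            # y(t) - y_target(t)
    losses.append(0.5 * err * err)                 # loss L
    # Delta R_j = -eta * (y(t) - y_target(t)) * z_j(t)
    R = [R_j - eta * err * z_j for R_j, z_j in zip(R, z)]

early_loss = sum(losses[:100]) / 100
late_loss = sum(losses[-100:]) / 100
print(f"mean loss, first 100 steps: {early_loss:.2e}, last 100 steps: {late_loss:.2e}")
```

In this setting the target lies in the span of the states, so the loss decays towards zero. Per the paragraph that follows, it is the states, errors and weight updates produced along such a run that are recorded to train the gradient and scalar-product networks.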

During data collection, trajectories of network states, errors, and weight updates are recorded. The gradient and scalar-product networks are trained on this data to predict the required gradient given a specific temporal error, and to estimate the current trajectory given the current virtual weights.

The networks are pre-trained by estimating only their readout weights, in accordance with a reservoir computing prescription. They are trained using a pseudo-inverse approach.

Dynamical supervised online learning and evaluation

Once the architecture is pre-trained, it can be tested on learning a new trajectory that was not observed during pre-training. The readout weights of the reservoir are no longer changed online using the gradient; instead, the virtual weights are updated following the prescription of the gradient network.

The virtual weights are then used as an input to the scalar-product network to compute the current prediction $y(t)$.

Performance is evaluated by measuring the MSE between the target trajectory and the predicted one, $y(t)$, for different values of the trajectory angular velocity, equally spaced between $0.01$ and $0.1$.

Appendix E Additional details on the bandit experiment

Family of tasks

We consider a family of tasks $\mathcal{A}$, such that every element $\alpha\in\mathcal{A}$ represents a Bernoulli $K$-armed bandit problem with rewards $\{R_{i}^{\alpha}\}_{i=1}^{K}$. Each task $\alpha$ is specified by a set $P_{\alpha}\subset\{1,\cdots,K\}$ of positive arms, such that

$$
R_{i}^{\alpha}=\begin{cases}Ber(p) & i\in P_{\alpha}\\ Ber(1-p) & i\notin P_{\alpha}\end{cases}
\tag{13}
$$

In the main text experiment, we used $K=10$, $P_{ID}:=\{i\in[K]\,|\,i\equiv 0\ (\text{mod}\ 2)\}$ and $P_{OOD}:=\{i\in[K]\,|\,i\equiv 1\ (\text{mod}\ 2)\}$.
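The task family can be written down in a few lines. This sketch assumes that arms outside $P_{\alpha}$ pay with probability $1-p$, and it uses a hypothetical value $p=0.9$, since the value of $p$ is not given in this section. The helper names are illustrative.

```python
import random

K = 10
p = 0.9  # hypothetical value; the reward probability p is not specified here

# [K] = {1, ..., K}: even arms are positive in-distribution, odd arms out-of-distribution.
P_ID = {i for i in range(1, K + 1) if i % 2 == 0}
P_OOD = {i for i in range(1, K + 1) if i % 2 == 1}

def reward_probability(arm, positive_arms, p):
    """Success probability of the Bernoulli reward of a given arm."""
    return p if arm in positive_arms else 1.0 - p

def pull(arm, positive_arms, p, rng):
    """Sample one Bernoulli reward (0 or 1) from the chosen arm."""
    return 1 if rng.random() < reward_probability(arm, positive_arms, p) else 0

rng = random.Random(0)
n = 20000
mean_positive = sum(pull(2, P_ID, p, rng) for _ in range(n)) / n  # arm 2 is in P_ID
mean_negative = sum(pull(1, P_ID, p, rng) for _ in range(n)) / n  # arm 1 is not
print(sorted(P_ID), sorted(P_OOD), mean_positive, mean_negative)
```

Because the two sets are complementary, every arm that is rewarding in the in-distribution task is unrewarding in the out-of-distribution task, and vice versa.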

Network details and hyperparameter search

We compare an architecture with gain modulation (γ=1𝛾1\gamma=1italic_γ = 1) with an architecture without gain modulation γ=0𝛾0\gamma=0italic_γ = 0. For each of the two settings, we select the best hyperparameters by fixing the number of hidden units (Nh=100subscript𝑁100N_{h}=100italic_N start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = 100) and varying the standard deviation of the projection matrices Rij𝒩(0,σR2/20)Rijap𝒩(0,σR2),log10(σR){2,1.8,,0}formulae-sequencesimilar-tosubscript𝑅𝑖𝑗𝒩0superscriptsubscript𝜎𝑅220formulae-sequencesimilar-tosubscriptsuperscript𝑅𝑎𝑝𝑖𝑗𝒩0superscriptsubscript𝜎𝑅2subscript10subscript𝜎𝑅21.80R_{ij}\sim\mathcal{N}(0,\sigma_{R}^{2}/20)\ \ R^{ap}_{ij}\sim\mathcal{N}(0,% \sigma_{R}^{2}),\ \log_{10}(\sigma_{R})\in\{-2,-1.8,\cdots,0\}italic_R start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 20 ) italic_R start_POSTSUPERSCRIPT italic_a italic_p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( italic_σ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ) ∈ { - 2 , - 1.8 , ⋯ , 0 }, the standard deviation of the bias bi𝒩(0,σb2),σb{0,0.1,,1}formulae-sequencesimilar-tosubscript𝑏𝑖𝒩0superscriptsubscript𝜎𝑏2subscript𝜎𝑏00.11b_{i}\sim\mathcal{N}(0,\sigma_{b}^{2}),\ \sigma_{b}\in\{0,0.1,\cdots,1\}italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , italic_σ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ∈ { 0 , 0.1 , ⋯ , 1 } and the nonlinearity type ϕ{tanh,softplus}italic-ϕtanhsoftplus\phi\in\{\mathrm{tanh},\mathrm{softplus}\}italic_ϕ ∈ { roman_tanh , roman_softplus }. 
For each point in the grid, we train 20 models. The best hyperparameters found were used in the subsequent experiments. In Fig. 2 B, D the reported regrets are for models trained with $N_h=100$ hidden features.
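The search grid can be enumerated as in the following minimal sketch. The grid values, the number of hidden units and the variances are taken from the text; the input sizes `n_in` and `n_ap`, the function names and the seed are hypothetical, and the training step is omitted because it is not specified here.

```python
import itertools
import numpy as np

N_H = 100                                   # number of hidden units (from the text)
LOG10_SIGMA_R = np.linspace(-2.0, 0.0, 11)  # log10(sigma_R) in {-2, -1.8, ..., 0}
SIGMA_B = np.linspace(0.0, 1.0, 11)         # sigma_b in {0, 0.1, ..., 1}
NONLINEARITIES = ("tanh", "softplus")
N_MODELS = 20                               # models trained per grid point


def build_grid():
    """Return every (sigma_R, sigma_b, nonlinearity) combination."""
    return [
        (10.0 ** log_s, float(s_b), phi)
        for log_s, s_b, phi in itertools.product(LOG10_SIGMA_R, SIGMA_B, NONLINEARITIES)
    ]


def sample_parameters(sigma_r, sigma_b, n_in, n_ap, rng):
    """Draw R, R^ap and b for one model.

    n_in and n_ap (basal and apical input sizes) are hypothetical arguments.
    """
    # R_ij ~ N(0, sigma_R^2 / 20): the standard deviation is sigma_R / sqrt(20).
    R = rng.normal(0.0, sigma_r / np.sqrt(20.0), size=(N_H, n_in))
    # R^ap_ij ~ N(0, sigma_R^2)
    R_ap = rng.normal(0.0, sigma_r, size=(N_H, n_ap))
    # b_i ~ N(0, sigma_b^2)
    b = rng.normal(0.0, sigma_b, size=N_H)
    return R, R_ap, b


grid = build_grid()
rng = np.random.default_rng(0)
R, R_ap, b = sample_parameters(*grid[0][:2], n_in=3, n_ap=3, rng=rng)
```

With these values the grid has 11 × 11 × 2 = 242 points, so each of the two settings of $\gamma$ corresponds to 242 × 20 = 4840 trained models.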

Regret per round

Given an agent that plays in an environment $\alpha$ receiving rewards $\{r_t\}_{t\in\mathbb{N}_+}$, the regret per round $\rho(T)$ achieved by the agent at round $T$ is defined as:

$$\rho(T)=\mu^{\alpha}_{\star}-\frac{1}{T}\sum_{t=1}^{T}r_t \qquad (14)$$

where $\mu^{\alpha}_{\star}:=\max\{\mathbb{E}[R_k^{\alpha}]\,|\,k\in[K]\}$ is the maximum expected reward that can be obtained in each round by playing the optimal policy, which selects one of the best arms with probability $1$.
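As a worked illustration of Eq. (14), the following sketch computes the regret per round from a list of rewards. The function name and the two-armed Bernoulli example are hypothetical and are chosen only to make the arithmetic easy to follow.

```python
def regret_per_round(rewards, expected_reward_per_arm):
    """Regret per round rho(T) = mu_star - (1/T) * sum_{t=1}^{T} r_t."""
    if not rewards:
        raise ValueError("at least one round is required")
    mu_star = max(expected_reward_per_arm)   # best expected reward over the K arms
    return mu_star - sum(rewards) / len(rewards)


# Hypothetical two-armed Bernoulli bandit with success probabilities 0.2 and 0.8.
rewards = [0, 1, 1, 1]                       # rewards collected over T = 4 rounds
rho = regret_per_round(rewards, [0.2, 0.8])
```

Here $\mu^{\alpha}_{\star}=0.8$ and the average reward is $0.75$, so $\rho(4)\approx 0.05$. Because $\rho(T)$ uses realized rewards, it can be negative on a lucky run.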

Appendix F Additional details on the darkroom experiment

This experiment investigates the capability of our dynamical reinforcement learning approach to improve an agent’s ability to navigate a 2D environment. The agent is trained to locate a randomly placed, unobservable food source, with its policy evolving over multiple episodes.

The environment is a 2D grid in which the agent starts at the center and aims to reach a randomly positioned food item. The food position changes every 600 episodes. The agent’s movements are restricted to the grid, with actions limited to moving left, right, up, or down.
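A minimal sketch of the movement rule follows, assuming integer grid coordinates, unit steps, and clipping at the borders as the way movements are restricted. The grid size, the coordinate convention and the function name are hypothetical.

```python
# Displacement associated with each of the four actions (hypothetical convention).
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def step(position, action, grid_size):
    """Move one cell and clip so that the agent stays inside the grid."""
    dx, dy = MOVES[action]
    x = min(max(position[0] + dx, 0), grid_size - 1)
    y = min(max(position[1] + dy, 0), grid_size - 1)
    return (x, y)


grid_size = 5
start = (grid_size // 2, grid_size // 2)     # the agent starts at the center
after_up = step(start, "up", grid_size)
at_wall = step((0, 0), "left", grid_size)    # blocked by the border
```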

To represent the agent’s position, a place cell encoder converts the agent’s $(x,y)$ coordinates into a higher-dimensional feature vector using Gaussian functions. This encoding helps in effectively capturing spatial information.
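The encoding can be sketched as follows, assuming 25 place cells arranged on a regular 5 × 5 lattice with isotropic Gaussian tuning. The lattice layout, the arena size and the tuning width are hypothetical choices made only for illustration.

```python
import numpy as np


def make_place_cell_centers(grid_size, cells_per_side=5):
    """Centers of cells_per_side**2 place cells tiling a square arena."""
    ticks = np.linspace(0.0, grid_size, cells_per_side)
    cx, cy = np.meshgrid(ticks, ticks)
    return np.stack([cx.ravel(), cy.ravel()], axis=1)   # shape (25, 2)


def encode_position(xy, centers, width):
    """Gaussian place-cell activations for a position (x, y)."""
    sq_dist = np.sum((centers - np.asarray(xy, dtype=float)) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * width ** 2))


centers = make_place_cell_centers(grid_size=10.0)
features = encode_position((5.0, 5.0), centers, width=2.0)
```

The cell whose center coincides with the agent’s position has activation 1, and activations decay smoothly with distance, so nearby positions map to similar feature vectors.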

The agent’s policy is linear, with action probabilities derived from a softmax function applied to the encoded state. For this reason, only two networks are used for the task: the gradient network and the scalar-product network (see Table 2 for details on parameters).
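A linear softmax policy of this kind can be sketched as below. The 4 × 25 weight shape assumes four actions and a 25-dimensional encoded state; the random feature vector is a hypothetical stand-in for the place cell encoding.

```python
import numpy as np


def action_probabilities(W, features):
    """Softmax over the linear scores W @ features (one score per action)."""
    logits = W @ features
    logits = logits - logits.max()           # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()


ACTIONS = ("left", "right", "up", "down")
rng = np.random.default_rng(0)
W = np.zeros((len(ACTIONS), 25))             # 4 x 25 policy weights
features = rng.random(25)                    # stand-in for the encoded state
probs = action_probabilities(W, features)
action = ACTIONS[rng.choice(len(ACTIONS), p=probs)]
```

Since the policy depends on the state only through the products between the weight rows and the encoded state, predicting the action probabilities reduces to computing scalar products.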

Data collection and pre-training

The policy is updated using the policy gradient method, adjusting weights based on received rewards to improve decision-making over time.
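For a linear softmax policy, one policy gradient step can be sketched as follows. This assumes a REINFORCE-style update without a baseline; the learning rate, the reward value and the function names are hypothetical, and the exact variant used in the experiment is not specified in the text.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax."""
    z = np.exp(logits - logits.max())
    return z / z.sum()


def policy_gradient_step(W, features, action, reward, lr=0.1):
    """One REINFORCE-style update: W += lr * reward * grad_W log pi(action | state)."""
    probs = softmax(W @ features)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # For a linear softmax policy, grad_W log pi(a | s) = (one_hot(a) - pi) s^T.
    grad_log_pi = np.outer(one_hot - probs, features)
    return W + lr * reward * grad_log_pi


W = np.zeros((4, 25))
features = np.ones(25) / 25.0                # hypothetical encoded state
W_new = policy_gradient_step(W, features, action=2, reward=1.0)
p_before = softmax(W @ features)[2]
p_after = softmax(W_new @ features)[2]
```

After a rewarded action, the probability of selecting that action in the same state increases (`p_after > p_before`).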

Learning data are collected for 8 different positions of the food, equally distributed on a circle.
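Such a set of positions can be generated as in this short sketch, which places the points at equal angular spacing. The radius, the center and the starting angle are hypothetical, since the text does not report them.

```python
import math


def food_positions_on_circle(n_positions=8, radius=1.0, center=(0.0, 0.0)):
    """Return n_positions points equally spaced in angle on a circle."""
    positions = []
    for k in range(n_positions):
        angle = 2.0 * math.pi * k / n_positions
        positions.append((center[0] + radius * math.cos(angle),
                          center[1] + radius * math.sin(angle)))
    return positions


train_positions = food_positions_on_circle()
```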

Two types of networks are used: one to model the agent’s state dynamics and another to handle gradient updates necessary for learning. These networks influence the agent’s internal state and learning process, enabling policy refinement.

The readouts of the two networks are trained by linear regression on their readout weights, to reproduce the proper weight updates and scalar products.
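Fitting a readout by linear regression can be sketched with an ordinary least-squares solve. The array shapes and the synthetic data below are hypothetical stand-ins for the recorded network states and for the target weight updates or scalar products.

```python
import numpy as np


def fit_readout(states, targets):
    """Least-squares readout W minimizing ||states @ W - targets||^2."""
    W, *_ = np.linalg.lstsq(states, targets, rcond=None)
    return W


rng = np.random.default_rng(0)
states = rng.normal(size=(200, 50))          # hypothetical network states (samples x units)
true_readout = rng.normal(size=(50, 4))      # hypothetical ground-truth mapping
targets = states @ true_readout              # stand-in for the quantities to reproduce
W_out = fit_readout(states, targets)
```

Only the readout weights are fitted in this step; the remaining network parameters are left unchanged.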

Dynamical reinforcement learning and evaluation

The agent is tested again on the 8 positions used for pre-training, and on 8 new positions. As in the previous experiments, policy gradient is no longer used: it is replaced by virtual weight updates estimated by the gradient network, which are then used by the scalar-product network to predict the agent’s policy.

Performance is measured by the total reward accumulated over episodes. Visualizing the agent’s trajectories reveals its navigation efficiency and decision-making process.

Table 2: Simulation Parameters
| Parameter | Symbol | Gradient Net | Scalar Net |
| --- | --- | --- | --- |
| Network Size | $N$ | 500 | 1000 |
| Input Dimension | $I$ | 25+4+10 | 4 $\times$ 25 + 4 $\times$ 25 |
| Apical Input Dimension | $I^{ap}$ | 1 | 4 $\times$ 25 |
| Output Dimension | $O$ | 4 $\times$ 25 | 4 |
| Time Step | $dt$ | 0.005 | |
| Reservoir Time Constant | $\tau_{m_f}$ | 1 $dt$ | 1 $dt$ |
| Input weights var | $\sigma_{input}$ | 0.01 | 0.01 |
| Apical Input weights var | $\sigma_{input}$ | 0.1 | 0.1 |
| Recurrent weights | $\sigma_{rec}$ | 0.5 / $\sqrt{N}$ | 0. / $\sqrt{N}$ |
| Gain-modulation factor | $\gamma_{net}$ | 1. | 1. |