Binding in hippocampal-entorhinal circuits
enables compositionality in cognitive maps

Christopher J. Kymn Sonia Mazelet Redwood Center for Theoretical Neuroscience, UC Berkeley, Berkeley, USA Université Paris-Saclay, ENS Paris-Saclay, Gif-sur-Yvette, France Anthony Thomas Redwood Center for Theoretical Neuroscience, UC Berkeley, Berkeley, USA Denis Kleyko Centre for Applied Autonomous Sensor Systems, Örebro University, Örebro, Sweden E. Paxon Frady Intel Labs, Santa Clara, USA Friedrich T. Sommer Redwood Center for Theoretical Neuroscience, UC Berkeley, Berkeley, USA Intel Labs, Santa Clara, USA Bruno A. Olshausen Redwood Center for Theoretical Neuroscience, UC Berkeley, Berkeley, USA

Abstract

We propose a normative model for spatial representation in the hippocampal formation that combines optimality principles, such as maximizing coding range and spatial information per neuron, with an algebraic framework for computing in distributed representation. Spatial position is encoded in a residue number system, with individual residues represented by high-dimensional, complex-valued vectors. These are composed into a single vector representing position by a similarity-preserving, conjunctive vector-binding operation. Self-consistency between the representations of the overall position and of the individual residues is enforced by a modular attractor network whose modules correspond to the grid cell modules in entorhinal cortex. The vector binding operation can also associate different contexts to spatial representations, yielding a model for entorhinal cortex and hippocampus. We show that the model achieves normative desiderata including superlinear scaling of patterns with dimension, robust error correction, and hexagonal, carry-free encoding of spatial position. These properties in turn enable robust path integration and association with sensory inputs. More generally, the model formalizes how compositional computations could occur in the hippocampal formation and leads to testable experimental predictions.

1 Introduction

The hippocampal formation (HF), consisting of the hippocampus (HC) and the medial and lateral parts of the neighboring entorhinal cortex (MEC and LEC), is critical for forming memories and representing variables such as spatial position [eichenbaum2017integration, moser2017spatial]. Recent work has provided evidence of compositional structure in HF representations, for example, novel recombinations of past experience occurring in replay [kurth2023replay], or the exponential expressivity of the grid cell code [fiete2008grid, sreenivasan2011grid]. In particular, compositional representations afford high expressivity with lower dimensional storage requirements [behrens2018cognitive], less complexity in latent state inference, and generalization to novel scenes with familiar parts.

To gain insight into the possible computational principles and neural mechanisms at play in the HF, we take a normative modeling approach. That is, we seek a set of neural coding principles that effectively achieve the postulated function of the system. With this approach, we can then explain details about the neuroanatomical and neurophysiological structures in light of their particular contributions to an information processing objective. We believe that the resulting model can also lead to new predictions about the neural mechanisms that enable this function.

The postulated function of the HF, as a cognitive map and episodic memory, has a core computational requirement: to represent and navigate space. Here, space is either the actual physical environment or a more abstract conceptual space. We formulate multiple desiderata for an effective representation of space. We then show that a residue number system, incorporated into a compositional encoding scheme, fulfills these desiderata. This scheme is realized by a modular attractor network that factorizes the individual components of encoded locations. This provides an algorithmic-level hypothesis of hippocampal-entorhinal interactions. A core mechanism of this algorithm is binding, which draws inspiration from work in neuroscience, cognitive science, and artificial intelligence.

2 A normative model for the hippocampal formation

2.1 Principles for representing space

Our first set of normative requirements is that space is represented by a compositional code that has high spatial resolution, is noise-robust, and in which algebraic operations on the components can be performed in parallel. Prior work [fiete2008grid, sreenivasan2011grid] has proposed the residue number system (RNS) [garner1959residue] as a candidate for fulfilling these requirements. An RNS expresses an integer $x$ in terms of its remainders relative to a set of co-prime moduli $\{m_{i}\}$. For example, relative to moduli $\{3,5,7\}$, $x=40$ is encoded as $\{1,0,5\}$. The Chinese Remainder Theorem guarantees that all integers in the range $[0,M-1]$, where $M=\prod_{i}m_{i}$, are assigned a unique representation. An RNS provides high spatial resolution, carry-free arithmetic operations, and robust error correction [goldreich1999chinese]. Experimental observations in entorhinal cortex show a discrete multi-scale organization of spatial grid cells [stensola2012entorhinal] that is compatible with an organization into discrete RNS modules.
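As a concrete illustration of the RNS described above, the following minimal sketch encodes an integer into residues, recovers it with the Chinese Remainder Theorem, and adds two numbers module by module without carries. The function names (`rns_encode`, `rns_decode`, `rns_add`) are illustrative choices and not part of the model.

```python
from math import prod

MODULI = (3, 5, 7)  # pairwise co-prime, as in the example in the text


def rns_encode(x, moduli=MODULI):
    """Return the residue of integer x relative to each modulus."""
    return tuple(x % m for m in moduli)


def rns_decode(residues, moduli=MODULI):
    """Recover the unique x in [0, M-1] via the Chinese Remainder Theorem."""
    M = prod(moduli)
    x = 0
    for a, m in zip(residues, moduli):
        M_i = M // m
        # pow(M_i, -1, m) is the modular inverse of M_i mod m (Python 3.8+)
        x += a * M_i * pow(M_i, -1, m)
    return x % M


def rns_add(r1, r2, moduli=MODULI):
    """Carry-free addition: every residue is updated independently."""
    return tuple((a + b) % m for a, b, m in zip(r1, r2, moduli))


print(rns_encode(40))          # (1, 0, 5)
print(rns_decode((1, 0, 5)))   # 40
```

Because `rns_add` never passes information between moduli, each module could be updated in parallel; the sum is only defined modulo $M=105$.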

The second normative principle we adopt is that an individual residue value should be encoded by a neural population in a similarity-preserving fashion. In particular, we require that distinct integer values are represented with nearly orthogonal vectors. To achieve this principle, we use a method similar to random Fourier features [rahimi2007random]. Each modulus, with value $m_{i}$ , is assigned a seed phasor vector, $\mathbf{g}_{i}\in\mathbb{C}^{D}$ , whose elements $(\mathbf{g}_{i})_{j}$ are drawn uniformly from the $m_{i}$ -th roots of unity (i.e., $(\mathbf{g}_{i})_{j}=e^{\sqrt{-1}\,\omega_{ij}}$ , with $\omega_{ij}=\frac{2\pi}{m_{i}}\,k_{j}$ , and $k_{j}$ chosen randomly from $\{0,...,m_{i}-1\}$ ). The representation of a particular residue value $a_{i}\in\{0,\dots,m_{i}-1\}$ is then given by rotating the phases of the seed vector according to [plate1992holographic]:

\mathbf{g}_{i}(a_{i})=(\mathbf{g}_{i})^{a_{i}},

(1)

where we abuse notation slightly to also think of $\mathbf{g}_{i}$ as a function that takes $a_{i}$ as input and produces an embedding as described above. The complex-valued vectors can be mapped to interpretable population vectors via a randomized Fourier transform (Figures 6D and S2).

Our third normative principle concerns the manner in which a unique representation of a particular point in space is formed from the individual residue representations. This requires that we somehow combine the residue vectors for each modulus. Combining via concatenation, though straightforward, is not effective because codes that coincide in subsets of their residue representation would be similar, even when the encoded values are very different. Thus, the method of combining residue codes must be conjunctive. Conjunctive composition is often called binding and is of fundamental importance in neuroscience [MalsburgAssemblies1986], cognitive science [FodorCritical1988], and machine learning [greff2020binding]. An early proposal for binding is the tensor product of representation vectors [SmolenskyTensor1990], with the tensor order equal to the number of bound objects.

Here, we implement binding with component-wise vector multiplication, a dimensionality-preserving operation that represents a lossy compression of the full tensor product [plate1991holographic, kanerva2009hyperdimensional]. The resulting compositional vector representation of an integer $x\in\mathbb{Z}$, using an RNS representation with $K$ moduli and residues $\{a_{1},a_{2},\dots,a_{K}\}$, is:

\mathbf{p}(x)=\bigodot_{i=1}^{K}\mathbf{g}_{i}(a_{i}).

(2)

We prove in Appendix A.1 that this coding scheme represents distinct integer states using nearly orthogonal vectors, and that it generalizes in a natural way to support representation of arbitrary real numbers in a similarity preserving fashion.
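The encoding of Eqs. (1) and (2) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the dimension `D = 2048`, the moduli $\{3,5,7\}$, the random seed, and the helper names are choices made here, and the similarity measure is the normalized magnitude of the complex inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048              # vector dimension (illustrative choice)
MODULI = (3, 5, 7)

# k_ij drawn uniformly from {0, ..., m_i - 1}; seed phasor (g_i)_j = exp(1j*2*pi*k_ij/m_i)
K_IDX = [rng.integers(0, m, size=D) for m in MODULI]


def residue_vector(i, a):
    """Eq. (1): raise the seed vector of module i to the power a."""
    m = MODULI[i]
    return np.exp(2j * np.pi * K_IDX[i] * a / m)


def encode(x):
    """Eq. (2): bind (component-wise multiply) the residue vectors of x."""
    p = np.ones(D, dtype=complex)
    for i, m in enumerate(MODULI):
        p *= residue_vector(i, x % m)
    return p


def similarity(u, v):
    """Normalized magnitude of the complex inner product."""
    return abs(np.vdot(u, v)) / D


print(similarity(encode(40), encode(40)))   # identical states: 1.0 up to rounding
print(similarity(encode(40), encode(41)))   # distinct states: close to 0
```

In this sketch, the similarity between distinct integers is on the order of $1/\sqrt{D}$, and states that differ by a multiple of $M=105$ coincide.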

Eq. 2 represents individual points along a line. In general, however, a spatial representation involves points in 2D or 3D spaces. Conveniently, vector binding can also be used to compose representations of multidimensional lattices from vectors representing individual dimensions. As we will explain, there is still a choice in this composition that determines the resulting lattice structure. Following earlier proposals [wei2015principle, mathis2015probable, anselmi2020computational], our fourth normative principle is to choose the lattice structure so that spatial information is maximized, as described in Section 3.5.

The final normative principle we require is that for computations such as path integration, there should be a simple vector manipulation that results in addition of the encoded variables. Again, vector binding provides this functionality with our coding strategy, because of the following property:

\mathbf{g}(x)\odot\mathbf{g}(y)=\mathbf{g}(x+y).

(3)
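The property in Eq. (3) can be checked directly for a single module. The sketch below assumes a modulus $m=7$ and dimension `D = 1024` (both illustrative); it also shows that the sum wraps around modulo $m$, since the phases are $m$-th roots of unity.

```python
import numpy as np

rng = np.random.default_rng(1)
D, m = 1024, 7                      # illustrative dimension and modulus
k = rng.integers(0, m, size=D)      # random indices into the m-th roots of unity


def g(a):
    """Residue vector for value a: the seed phasor vector raised to the power a."""
    return np.exp(2j * np.pi * k * a / m)


x, y = 3, 6
bound = g(x) * g(y)                 # binding = component-wise multiplication

print(np.allclose(bound, g(x + y)))         # binding adds the encoded values
print(np.allclose(bound, g((x + y) % m)))   # and the result is taken modulo m
```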

2.2 Modular attractor network for spatial representation

A standard model of grid cell circuits is the line attractor, in which states that represent a consistent location lie on a low-energy manifold [fiete2008grid]. When initialized from a noisy location pattern, the circuit dynamics will generate a denoised location representation. Rather than forming a line attractor model for the entire representational space (Eq. 2), we propose a modular network architecture, so that the compositional structure of a residue number representation can scale towards a large range with fewer memory resources (Section 3.2), in a manner robust to noise (Section 3.3).

A starting point for our attractor network model is the Hopfield network, which acts as an associative memory by storing memory patterns as fixed-point attractors. The Rademacher-Hopfield network [hopfield1982neural] is a dynamical system whose state is a vector $\mathbf{x}\in\{-1,+1\}^{D}$ that obeys the following dynamics:

\mathbf{x}(t+1)=\text{sign}(\mathbf{XX}^{T}\mathbf{x}(t))

(4)

with $\mathbf{X}$ as the matrix of memorized patterns (column vectors of $\mathbf{X}$ ). The fixed-point attractor dynamics can be generalized to complex memory patterns $\mathbf{z}\in\mathbb{C}^{D}$ :

\mathbf{z}(t+1)=\sigma(\mathbf{ZZ}^{{\dagger}}\mathbf{z}(t)),

(5)

where $\sigma$ is a non-linearity normalizing the amplitude of each complex-valued component to one [noest1987phasor], and $\mathbf{Z}$ is the corresponding matrix of memorized patterns. The model can also be discretized, such that each component is quantized to an $r$-state phasor [noest1988discrete]. The Rademacher-Hopfield model is the special case where $r=2$ and the phasors are real-valued.
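For intuition, the real-valued dynamics of Eq. (4) can be sketched in a few lines. The sizes (`D = 256`, five stored patterns, 20 flipped components) are illustrative assumptions, chosen so that the network operates well below capacity.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n_patterns = 256, 5
X = rng.choice([-1, 1], size=(D, n_patterns))   # columns are the stored patterns


def hopfield_step(x, X):
    """One synchronous update of Eq. (4): x <- sign(X X^T x)."""
    return np.sign(X @ (X.T @ x))


target = X[:, 0]
noisy = target.copy()
flipped = rng.choice(D, size=20, replace=False)  # corrupt 20 of 256 components
noisy[flipped] *= -1

state = noisy
for _ in range(5):
    state = hopfield_step(state, X)

print(np.array_equal(state, target))  # the stored pattern is restored
```

The complex-valued variant of Eq. (5) follows the same structure, with the transpose replaced by the conjugate transpose and `np.sign` replaced by a normalization of each component to unit magnitude.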

An $r$-state phasor network of the form of Eq. (5) is well-suited to serve as an attractor network for each of the residue vectors in an RNS representation of position, with $r=m_{i}$ for modulus $i$, and the matrix $\mathbf{Z}$ (which we shall denote $\mathbf{G}_{i}$) storing the $\mathbf{g}_{i}(a_{i})$ for $a_{i}\in\{0,\dots,m_{i}-1\}$. However, we desire a method for representing the whole coding range $M:=\prod_{i}^{K}m_{i}$ without storing all $M$ patterns in one large associative memory. For this purpose, we show that a resonator network, a recently proposed recurrent network for unbinding conjunctive codes [frady2020resonator, kent2020resonator], lets us represent this range by storing only $n:=\sum_{i}^{K}m_{i}\ll M$ patterns. Given a vector encoding of position, $\mathbf{p}(x)$, as formulated in Eq. (2), a resonator network will factorize it into its constituent RNS components by iteratively updating each residue vector estimate, $\hat{\mathbf{g}}_{i}$, similar to the attractor dynamics of Eq. (5) but in a way that is also consistent with $\mathbf{p}(x)$ given all other residue estimates $\hat{\mathbf{g}}_{j\neq i}$:

\mathbf{\hat{g}}_{i}(t+1)=\sigma\Bigl(\mathbf{G}_{i}\mathbf{G}_{i}^{\dagger}\bigl(\mathbf{p}\bigodot_{j\neq i}^{K}\mathbf{\hat{g}}_{j}^{*}(t)\bigr)\Bigr)\;\;\;\forall\ i

(6)

Let us now assume that the input $\mathbf{p}(x_{t})$ encodes a spatial position $x_{t}$ using Eq. (2). Given a velocity input $\mathbf{q}_{i}(v_{t})$, estimated from self-motion input, path integration is performed by first running attractor dynamics, then updating attractor states by velocity:

\mathbf{\hat{g}}_{i}(t+1)=\mathbf{q}_{i}(v_{t})\odot\sigma\Bigl(\mathbf{G}_{i}\mathbf{G}_{i}^{\dagger}\bigl(\mathbf{p}(x_{t})\bigodot_{j\neq i}^{K}\mathbf{\hat{g}}_{j}^{*}(t)\bigr)\Bigr)

(7)

After velocity updates, one can update the input state $\mathbf{p}(x_{t})$ with the conjunctive representation of the current factor estimates:

\mathbf{p}(x_{t+1})=\bigodot_{i}^{K}\hat{\mathbf{g}}_{i}(t+1).

(8)

Further explanation and detail is provided in Appendix B.3.
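The updates of Eqs. (6) to (8) can be sketched as follows. Several details here are assumptions made for illustration rather than specifications from the text: estimates are initialized with random phases, modules are updated one after another within each iteration, the velocity vector is taken to be $\mathbf{q}_{i}(v)=\mathbf{g}_{i}(v)$ for an integer step $v$, and residues are read out from the magnitude of the codebook overlap so that a global phase factor on an estimate does not matter.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 1024
MODULI = (3, 5, 7)

# Codebooks G_i: column a holds g_i(a) = exp(1j * 2*pi * k_ij * a / m_i).
K_IDX = [rng.integers(0, m, size=D) for m in MODULI]
G = [np.exp(2j * np.pi * np.outer(k, np.arange(m)) / m)
     for k, m in zip(K_IDX, MODULI)]


def sigma(z):
    """Phasor nonlinearity: normalize every component to unit magnitude."""
    mag = np.abs(z)
    return z / np.where(mag > 0, mag, 1.0)


def bind(vectors):
    """Component-wise product of a collection of vectors."""
    out = np.ones(D, dtype=complex)
    for v in vectors:
        out = out * v
    return out


def resonator(p, n_iter=50):
    """Factorize p into one residue-vector estimate per module (Eq. 6)."""
    # Assumption: random-phase initialization, sequential module updates.
    est = [np.exp(2j * np.pi * rng.random(D)) for _ in MODULI]
    for _ in range(n_iter):
        for i in range(len(MODULI)):
            others = bind(est[j].conj() for j in range(len(MODULI)) if j != i)
            est[i] = sigma(G[i] @ (G[i].conj().T @ (p * others)))
    return est


def decode(est):
    """Read out each residue as the codebook column with the largest overlap."""
    return tuple(int(np.argmax(np.abs(Gi.conj().T @ e))) for Gi, e in zip(G, est))


x, v = 40, 3
p = bind(G[i][:, x % m] for i, m in enumerate(MODULI))   # Eq. (2)

est = resonator(p)
residues = decode(est)

# One path-integration step: bind each factor with q_i(v) (Eq. 7),
# then rebuild the place vector from the updated factors (Eq. 8).
moved = [G[i][:, v % m] * est[i] for i, m in enumerate(MODULI)]
p_next = bind(moved)
residues_next = decode(moved)

print(residues, residues_next)
```

In this sketch only $n=3+5+7=15$ codebook vectors are stored, while the factorization searches over $M=105$ possible positions.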

2.3 Mapping the model to the HF

Although it is not obvious how the components of our normative model should map to the anatomical architecture of HF, we make one proposal as shown in Figure 1. The memory networks for residue representations $\mathbf{\hat{g}}_{i}$ correspond to grid modules in MEC. Similar to the grid modules, a module for context can be added to the architecture, such as a tag for the identity of a specific environment, with the recurrent synapses $\mathbf{C}$ storing tags of different environments.

The context neurons could correspond to the non-grid entorhinal cells, which can contain local, non-spatial information about the environment [latuske2018hippocampal]. The vector $\mathbf{p}(x_{t})$ can be linked to place cells in hippocampus. Internal HC circuitry can either buffer the input as in Eq. (6) or allow it to be updated dynamically according to the MEC input (Section 4.1). The mutual interactions between HC and MEC grid modules require projections between these structures. The binding operations that these interactions involve according to Eq. (6) are hypothesized to be implemented by nonlinear interactions between dendritic inputs in HC and MEC neurons.

Figure 1: Schematic of the proposed attractor model. In MEC, the $\mathbf{g}_{i}$ are residue representations in grid modules, and $\mathbf{c}$ encodes a context label. Input of the velocity estimate $\mathbf{q}(\mathbf{v})$ can produce path integration in grid modules via binding, denoted by $\odot$. In HC, $\mathbf{p}$ represents contextualized place. Binding serves two roles in the MEC/HC interaction (symbolized by bidirectional arrows): a) factorizing $\mathbf{p}$ into the $\mathbf{g}_{i}$’s, and b) generating an update of $\mathbf{p}$ from the $\mathbf{g}_{i}$’s, for example, after path integration. In LEC, $\mathbf{s}$ represents sensory input, interacting with $\mathbf{p}$ through a learned heteroassociative projection.

The model also assumes the ability for sensory cues to provide the initialization signal of the cognitive map, represented by $\mathbf{s}$ in Figure 1. For completeness, we make the basic assumption that heteroassociative memories are formed by the brain that link sensory cues to the place cell representations $\mathbf{p}$ (Section 4.2). This process would require the system to generate a new context vector $\mathbf{c}$ and initialize the cognitive map to a default location in order to learn about new environments. We show that through even a simple heteroassociative mechanism, our modular attractor network can robustly retrieve sensory memories and even protect its compositional structure.

3 Coding properties of the model

3.1 RNS representations have exponential coding range

The compositional RNS vector representation Eq. (2) can encode a coding range of $M$ values using a total of $n$ component patterns for representing the residue of individual modules. The scaling of the coding range is exponential in the number of moduli, $K$ , since if each module has $\mathcal{O}(m)$ patterns, and the co-prime condition is satisfied, the scaling of the coding range is $\mathcal{O}(m^{K})$ . This recovers the expressivity argued by [fiete2008grid, mathis2012resolution].

More generally, it is also exponential in the number of component patterns, $n$. The optimal coding range is given by the best partition of $n$ into a set of positive integers $\{m_{i}\}$. This optimization is identical to that of finding the maximum order of an element in the group of permutations $S_{n}$, because the order of a permutation is the least common multiple of its cycle lengths. The scaling of this value in $n$ is characterized by Landau’s function $f(n)$, which satisfies $\text{ln}\ f(n)\sim\sqrt{n\ \text{ln}\ n}$ as $n\to\infty$ [landau1903maximalordnung]. Figure 2A illustrates how Landau’s function is the upper bound to what is achievable for any fixed number of moduli ($K$).
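The optimal coding range for a given budget of $n$ component patterns can be computed exactly for small $n$. The sketch below is a standard dynamic program over prime powers (the maximal least common multiple is always attained by parts that are powers of distinct primes); the function names are illustrative.

```python
def primes_up_to(n):
    """All primes <= n by trial division (adequate for small n)."""
    return [p for p in range(2, n + 1)
            if all(p % q for q in range(2, int(p ** 0.5) + 1))]


def landau(n):
    """Largest lcm over all partitions of n, i.e. the best coding range M
    obtainable from n component patterns split into co-prime moduli."""
    best = [1] * (n + 1)          # best[b]: optimum using a budget of at most b
    for p in primes_up_to(n):
        # Iterate budgets downwards so that each prime is used at most once.
        for budget in range(n, p - 1, -1):
            power = p
            while power <= budget:
                best[budget] = max(best[budget], best[budget - power] * power)
                power *= p
    return best[n]


print(landau(15))   # 105, attained by the moduli {3, 5, 7}
print(landau(19))   # 420, attained by the moduli {3, 4, 5, 7}
```

For $n=15$, the moduli $\{3,5,7\}$ used in the earlier example are therefore an optimal choice.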

Though other kinds of representations can achieve an exponential coding range, the advantage of the compositional encoding of Eq. (2) comes from the fact that the binding operation implements carry-free vector addition (our final principle). This enables updates of the encoded value without requiring further transformations such as decoding, facilitating tasks such as path integration (Section 4.1, Appendix C.3). Binary representations, by contrast, have exponential coding range but require carry-over operations to implement.

3.2 The modular attractor network has superlinear coding range

The exponential scaling of the coding range of the RNS representation is a prerequisite to obtain a large coding range with the attractor network that has to perform computations on this representation, such as input denoising, working memory, and path integration. To estimate the scaling of the coding range in the proposed attractor network (Eq. 6), we study the critical dimension for which the grid modules converge with high probability. Specifically, we empirically estimate the minimum dimension required to retrieve an arbitrary RNS representation with high probability, given a maximum number of iterations (Figure 2B). Remarkably, we find that the number of component patterns $n$ that can be stored is superlinear in the pattern dimension $D$ ; empirically $\mathcal{O}(D^{\alpha})$ for some $\alpha\geq 1$ . For 2, 3, and 4 moduli, $\alpha\approx 2.05,1.45$ and $1.23$ , respectively (Figure 2C).

These empirical scaling laws are consistent with a simple information-theoretic calculation (Appendix A.2). The minimal number of bits to be stored for the entire RNS vector encoding scheme is of order $\mathcal{O}(M\ \text{log}\ M)$, and the number of synapses in the attractor network is $\mathcal{O}(D\sqrt[K]{M})$. If one makes the cautious assumption of a capacity per synapse of $\mathcal{O}(1)$, the leading order for the coding range $M$ is $\mathcal{O}(D^{\alpha})$, with $\alpha=\frac{K}{K-1}$.
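The exponent follows from equating the two quantities above: requiring $M\ \text{log}\ M\lesssim D\,M^{1/K}$ gives $M^{(K-1)/K}\ \text{log}\ M\lesssim D$, and dropping the logarithmic factor leaves $M\sim D^{K/(K-1)}$. A short comparison with the empirical exponents quoted in the text (the function name is an illustrative choice):

```python
def predicted_alpha(K):
    """Leading-order exponent alpha = K / (K - 1) for K >= 2 moduli,
    obtained from M log M <~ D * M**(1/K) with the log factor dropped."""
    return K / (K - 1)


# Empirical exponents as reported in the text (Figure 2C).
empirical = {2: 2.05, 3: 1.45, 4: 1.23}

for K, alpha_emp in empirical.items():
    print(K, round(predicted_alpha(K), 2), alpha_emp)
```

The predicted values are 2, 1.5, and about 1.33, which follow the same decreasing trend in $K$ as the empirical estimates.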

Note that while the coding range increases with the number of moduli ( $K$ ) for the RNS representation, the superlinear scaling coefficient $\alpha_{K}$ decreases with $K$ for the modular attractor network, reaching maximum superlinearity at the smallest value $K=2$ . This reversal is caused by the fact that increasing $K$ decreases the number of synapses, i.e., the memory resource in the attractor network.

3.3 Robust error correction

In addition, we evaluate the robustness of our attractor model to noise. Because the RNS representations are composed of phasors, which are circular variables, we sample noise from a von Mises distribution with two parameters: mean ($\mu=0$) and concentration parameter $\kappa$ (Figure 3A). Higher $\kappa$ values imply less noise; the distribution approximates a Gaussian with variance $1/\kappa$ for large $\kappa$.
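A minimal sketch of this noise model, assuming that noise is applied as an independent phase rotation of each component: the random phasor vector below is only a stand-in for an encoded state, and `D = 4096` and the tested $\kappa$ values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 4096
clean = np.exp(2j * np.pi * rng.random(D))   # stand-in for a phasor vector p(x)


def add_phase_noise(z, kappa):
    """Rotate each phasor by an angle drawn from von Mises(mu=0, kappa)."""
    return z * np.exp(1j * rng.vonmises(0.0, kappa, size=z.shape))


def similarity(u, v):
    """Real part of the normalized inner product (mean cosine of phase error)."""
    return np.real(np.vdot(u, v)) / len(u)


sims = {kappa: similarity(clean, add_phase_noise(clean, kappa))
        for kappa in (1.0, 8.0, 64.0)}
print(sims)   # similarity to the clean vector increases with kappa
```

The expected similarity equals the mean resultant length of the von Mises distribution, $I_{1}(\kappa)/I_{0}(\kappa)$, which is roughly 0.45 for $\kappa=1$ and 0.94 for $\kappa=8$.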

We consider three cases: noisy input patterns, noise added at each time step of the dynamics, and noisy weights, i.e., corruptions of the patterns in $\mathbf{G}_{i}$ (Appendix B.2). The empirical accuracy of recall varies depending on the type of corruption applied (Figure 3A). We find that for a given dimension $D$ (in this case, $1024$), increasing noise decreases the maximum coding range that can be decoded with high accuracy (Figure 3B-D). For a fixed noise level, the high-accuracy coding range is largest for input noise, followed by update noise and codebook noise. It is perhaps not surprising that codebook noise has the worst coding range, given that noise added to every stored pattern compounds across the dynamics. Fortunately, the demonstrated robustness to input noise enables sensory patterns to be denoised via heteroassociation (Section 4.2).

3.4 Interpolation between patterns enables continuous path integration

In general, there is a sharp difference between point and line attractors. In our attractor model, the RNS representations of integer values are stored as discrete fixed points. Nevertheless, the attractor network also converges to states that represent non-integer values that are not explicitly stored. In other words, the network smoothly interpolates to points on a manifold of states that represent integer and non-integer values encoded by Eq. (2); Figure 4A provides a visualization, showing that the kernel induced by inner product operations retains graded similarity for sub-integer shifts. This kernel enables the modular attractor network to settle to fixed points that correspond to interpolations between integers, and for sub-integer positions to be decoded.
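The graded similarity for sub-integer shifts can be illustrated by evaluating Eq. (2) at real-valued positions, i.e., with fractional powers of the seed phasors. This sketch assumes phase indices drawn from $\{0,\dots,m_{i}-1\}$ as in Section 2.1; other phase conventions would change the exact kernel shape, so the numbers are indicative only.

```python
import numpy as np

rng = np.random.default_rng(5)
D = 2048
MODULI = (3, 5, 7)
K_IDX = [rng.integers(0, m, size=D) for m in MODULI]


def encode(x):
    """p(x) for real-valued x, using fractional powers of the seed phasors."""
    phase = sum(2 * np.pi * k * x / m for k, m in zip(K_IDX, MODULI))
    return np.exp(1j * phase)


def kernel(delta, x0=40.0):
    """Similarity between the codes for x0 and x0 + delta."""
    return abs(np.vdot(encode(x0), encode(x0 + delta))) / D


for delta in (0.0, 0.25, 0.5, 1.0):
    print(delta, round(kernel(delta), 3))
```

The similarity falls off smoothly from 1 at zero shift to near 0 at a full integer shift, so states between two stored integers remain partially similar to both neighbors.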

The resolution of decoding is fundamentally limited by the signal-to-noise ratio. Even so, we find that, up to a fixed noise level, the accuracy regimes of integer decoding and sub-integer decoding coincide. This property enables sub-integer shifts to be encoded within the states of the network, which, as we will show, results in stable, error-correcting path integration (Section 4.1). We quantify the gain in precision in terms of the bits of information that can on average be reconstructed from a vector (Figure 4D, Appendix B.2). Notably, even a moderate noise level of $\kappa=8$ is sufficient to achieve nearly the same information content as in the noiseless case.

3.5 Triangular frames in 2D maximize spatial information

In two-dimensional open field environments, grid cells have firing fields arranged in a hexagonal lattice [hafting2005microstructure]. Work in theoretical neuroscience shows the optimality of this lattice for 2D environments in terms of spatial information [wei2015principle, mathis2015probable, anselmi2020computational]. However, the presence of hexagonal firing fields raises a puzzle for residue number systems. Although a crucial property of an RNS is that its arithmetic is carry-free, most implementations of an RNS will not perform carry-free updates within a module in non-Cartesian coordinate systems. This generally occurs because the updates of different coordinates must interact due to non-orthogonality.

We resolve this issue by showing how to implement a version of vector binding of multiple coordinates in a triangular ‘Mercedes-Benz’ frame that enables carry-free hexagonal coding. Furthermore, we provide a combinatoric argument for the optimality of triangular frames for $\mathbb{R}^{2}$. (A frame is a spanning set for a vector space whose vectors need not be linearly independent.) Our argument relies on the combinatorics of residue numbers, and so for the first time gives an explanation of why the coexistence of RNS and hexagonal codes is optimal.

Forming a hexagonal tiling of 2D position requires two steps: first, projection into a $3$-coordinate frame, and second, choosing phases such that simultaneous, equal movements along all three frame vectors cancel out (Appendix A.3). The resulting Voronoi tessellation for different states is pictured in Figure 5A. This encoding enables higher spatial resolution in terms of the number of discrete states: $3m^{2}-3m+1$ for triangular frames, versus $m^{2}$ for Cartesian frames. This increased expressivity results in a higher-entropy code for space (Figure 5B). It also results in both a periodic hexagonal kernel and the individual grid response fields being arranged in a hexagonal lattice (Figure 6C).
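The first step, projection into a three-coordinate frame, can be sketched as follows. The orientation of the frame (vectors at 90°, 210°, and 330°) is an arbitrary choice made here, and the sketch covers only the projection and the state counts quoted above, not the phase construction of Appendix A.3.

```python
import numpy as np

# Three unit vectors separated by 120 degrees (orientation chosen arbitrarily).
angles = np.deg2rad([90.0, 210.0, 330.0])
FRAME = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2)


def to_frame(pos):
    """Project a 2D position onto the three frame vectors."""
    return FRAME @ np.asarray(pos, dtype=float)


def from_frame(coords):
    """Reconstruct the 2D position; the factor 2/3 is the inverse frame bound."""
    return (2.0 / 3.0) * FRAME.T @ np.asarray(coords, dtype=float)


def n_states_triangular(m):
    return 3 * m * m - 3 * m + 1


def n_states_cartesian(m):
    return m * m


coords = to_frame([0.3, -1.2])
print(coords, coords.sum())          # the three coordinates sum to zero
print(from_frame(coords))            # recovers [0.3, -1.2]
for m in (3, 5, 7):
    print(m, n_states_triangular(m), n_states_cartesian(m))
```

Because the three frame vectors sum to zero, the projected coordinates of any position also sum to zero, which makes the three-coordinate description redundant by exactly one degree of freedom. For $m=3,5,7$ the state counts are 19, 61, and 127 for the triangular frame versus 9, 25, and 49 for the Cartesian one.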

Prior models achieved hexagonal lattices either by circularly symmetric receptive fields (e.g., [fuhs2006spin, burak2009accurate]) arranged on a periodic rectangular sheet or by distorting a square lattice into an oblique one (e.g., [chandra2023high, mosheiff2019velocity]). Importantly, oblique lattices have the same combinatorial complexity as the square grid and, unlike the construction described above, they do not achieve the same level of spatial resolution (Figure 5B).

4 Testing functionalities of the model

4.1 Robust path integration

Given the ability of the attractor model to update its representation of position from velocity inputs, along with its ability to represent continuous space, we evaluate its ability to perform path integration in the presence of noise. We simulate trajectories based on a statistical model for generating plausible rodent movements in an arena [raudies2012modeling, banino2018vector], and we update grid cell and place cell state vectors according to Equations 7 and 8, respectively.

To evaluate the robustness of the model to error (Appendix B.3), we consider both extrinsic noise (e.g., mis-representations of velocity information) and intrinsic noise (e.g., due to noise in weight updates). The robustness of our model to intrinsic noise is tested by comparing our results to the estimated trajectories obtained without the correction by the MEC modules (Figures 6A and 6B). We find that our model strongly limits noise accumulation along the trajectory and allows highly accurate integration for a longer period of time (Figure 6A). Consistent with our previous experiments on noise robustness (Figure 3), we find strong robustness to intrinsic noise, and that extrinsic noise results in progressive drift of estimated position.

We visualize the response fields in different modules and find hexagonal lattices with a module-dependent scaling (Figure 6C, Appendix 4.1). In addition, we show that tethering to external cues (e.g., visual inputs) can significantly increase the accuracy of the attractor network. To study this, we associate visual cues to corresponding patches (see Section 4.2) and observe that integration of information from sensory visual inputs succeeds in correcting drift due to extrinsic noise (Figure 6D).

4.2 Denoising sensory states via a heteroassociative memory

Finally, we describe a simple extension to our model, in which sensory patterns are fed from the lateral entorhinal cortex (LEC) to update the hippocampal state. This is consistent with theories of memory suggesting that LEC provides the content of experiences to hippocampus [manns2006evolution], as well as neuroanatomical evidence [knierim2014functional]. Although the structure of the representations of those sensory patterns is unknown, it is theorized that HF is critical to sensory pattern completion [teyler1986hippocampal].

Consistent with this function, recent work [sharma2022content, chandra2023high] has proposed that a heteroassociative scaffold connects sensory patterns to hippocampal activity, allowing robust denoising of sensory states. Though the main focus of our normative model is not sensory denoising, we show that a simple extension to our model (Appendix B.4) robustly retrieves noisy patterns even under high levels of corruption (Figures 7A and B). In Appendix C.3, we also discuss how this capacity for generalization can serve as a model for sequence retrieval, showing some preliminary experiments.
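One way to picture such a heteroassociative link is a pair of outer-product memories between sensory patterns and place vectors. The sketch below is an illustration under strong assumptions and is not the construction of Appendix B.4: sensory patterns are random bipolar vectors, the weights `W_sp` and `W_ps` are hypothetical names, and a nearest-neighbor lookup over the stored place vectors stands in for the clean-up that the modular attractor network would perform.

```python
import numpy as np

rng = np.random.default_rng(6)
D, N_SENS, N_ITEMS = 512, 512, 20
MODULI = (3, 5, 7)
K_IDX = [rng.integers(0, m, size=D) for m in MODULI]


def encode(x):
    """Place vector p(x) as in Eq. (2)."""
    phase = sum(2 * np.pi * k * (x % m) / m for k, m in zip(K_IDX, MODULI))
    return np.exp(1j * phase)


positions = np.arange(N_ITEMS)
P = np.stack([encode(x) for x in positions], axis=1)   # D x N_ITEMS place vectors
S = rng.choice([-1.0, 1.0], size=(N_SENS, N_ITEMS))    # hypothetical sensory patterns

W_sp = P @ S.T            # sensory -> place (outer-product learning)
W_ps = S @ P.conj().T     # place -> sensory


def denoise(s_noisy):
    """Map a noisy sensory pattern to a place, then read the stored pattern back."""
    p_hat = W_sp @ s_noisy
    # Stand-in for attractor clean-up: pick the most similar stored place vector.
    x_hat = int(np.argmax(np.real(P.conj().T @ p_hat)))
    s_hat = np.sign(np.real(W_ps @ P[:, x_hat]))
    return x_hat, s_hat


s_noisy = S[:, 7].copy()
flipped = rng.choice(N_SENS, size=N_SENS // 4, replace=False)   # corrupt 25%
s_noisy[flipped] *= -1

x_hat, s_hat = denoise(s_noisy)
print(x_hat, np.array_equal(s_hat, S[:, 7]))
```

The near-orthogonality of the place vectors is what keeps crosstalk between stored pairs small in this sketch.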

In addition to robust denoising of single patterns, our model is also well-equipped to deal with compositions of sensory patterns. Two situations are worth emphasizing: first, we can often unmix multiple sensory states corresponding to a sum of patterns, because the compositional structure of binding between grid modules “protects” the items in summation (Figure 7C). This differentiates our model from other heteroassociative memories, in which sums of patterns would have multiple equally valid yet incompatible decodings. Second, the context vector modules allow preservation of different sensory information for different environments (Figure S3).

5 Discussion

We propose a normative model of a cognitive map for the hippocampal formation in the mammalian brain. The core principle of the model is a compositional representation of space that achieves a superlinear coding range, which is expressed by a compact, multi-module attractor network. The compositional mechanism of vector binding provides generalization to multiple spatial dimensions, contextualization, and path integration. This binding mechanism builds on prior work proposed in the field of hyperdimensional computing and vector symbolic architectures [kanerva2009hyperdimensional, gayler2004vector, plate1992holographic, dumont2022model, kymn2023computing], and goes beyond it to develop a specific algorithmic hypothesis about structured operations in HF. Our analyses and experiments confirm that the model can achieve important functions of the hippocampal formation and explain experimental observations, such as hexagonal grid cells, place cells, and remapping phenomena.

The proposed model contributes to, and greatly benefits from, existing work in theoretical neuroscience on residue number systems [fiete2008grid, sreenivasan2011grid], continuous attractor network models of grid cells [fuhs2006spin, zhang1996representation, fiete2008grid], and the optimality of hexagonal representations in 2D [mathis2012resolution, mathis2015probable]. It remains intriguing that biology organized grid cells into multiple discrete modules, rather than pooling all resources into a single module attractor network. This puzzle raises an opportunity for normative models to explain the organization of grid cells into multiple modules. More recent work has focused on the problem of coordinating representations across multiple modules [mosheiff2017efficient, mosheiff2019velocity, kang2019geometric, agmon2020theory, chandra2023high], and large scale recordings of HF [waaga2022grid] may provide new opportunities to evaluate predictions of these different ideas.

Our approach starts from principles of space encoding, in particular, the requirement of compositionality. This strategy is complementary to, but different from, investigations of the emergence of place and grid cells in artificial neural networks (e.g., [banino2018vector, cueva2018emergence, whittington2022disentanglement, dorrell2022actionable, sorscher2023unified, schaeffer2024self, stachenfeld2017hippocampus, whittington2020tolman, chen2022predictive]). These approaches show optimality of biological response features under the model assumptions, such as ANN properties, network architecture, training objective and protocol. Here, we emphasize the role of multiplicative binding, a primitive that is typically difficult to have emerge in an ANN setting. Early suggestions for realizing conjunctive binding already ventured outside the framework of ANNs [SmolenskyTensor1990, plate1992holographic]. A simple extension of ANNs is the sigma-pi neuron [FeldmanConnectionist1982, mel1989sigma], which can implement vector binding [plate2000randomly]. Recent work amplifies the view that full conjunctive binding would be a useful inductive bias to augment deep learning architectures [goyal2022inductive], and various augmentations of ANNs with dedicated binding mechanisms have been proposed [danihelka2016associative, greff2020binding, ganesan2021learning, smolensky2022neurocompositional].

Our model has obvious limitations. Our attractor model for the cognitive map is still a high-level abstraction of spiking neural circuits in the hippocampal formation. In particular, the phasor states in the model are one linear transform removed from vectors that describe neural population activity. Thus, the mapping between model and neurobiological mechanisms is not straightforward, a disadvantage that can be addressed by switching to other encoding schemes, such as sparse real or complex vectors, e.g., [laiho2015high], for which conjunctive binding operations have been proposed [frady2021variable]. Although the model is more comprehensive than typical normative models, which usually focus on a single computation, it is far from covering the many other functional cell types observed in the hippocampal formation or contextual modulations observed during remapping. In addition, the current model includes learning only in the heteroassociative projection to LEC. Most observations regarding plasticity in HF are not captured, e.g., signals from reward, or eligibility traces. Finally, our assumptions about inputs to HF from the sensory pathway are rather simplifying and primarily intended as a proof of concept.

The purpose of the model is to express the fundamental principles of a compositional cognitive map, permitting testable predictions. First, at the biophysical level, the model predicts multiplicative interactions between dendritic inputs providing the conjunctive binding operation. Though some evidence of MEC-LEC binding exists [latuske2018hippocampal], our attractor model also predicts binding between MEC modules. Second, the model predicts relatively fixed attractor weights between place and grid cells, and more plasticity from the hippocampus to sensory observations. Third, we predict that causal perturbations of one grid module can affect the states of other grid modules without involvement of the hippocampus, in a direction that is self-consistent with the update of the attractor state.

We believe that the proposed modeling approach and the specific attractor model have broader applications in neuroscience. The proposed attractor network could also serve as a generative model in sensory systems, implementing the analysis-by-synthesis process postulated in perception. Further, there is an intriguing connection between the proposed phasor models and spiking neural networks [frady2019robust], which could yield normative models with spiking neurons that are potentially implementable on neuromorphic hardware at large scale and that can lead to further quantitative predictions.

Acknowledgments

The work of CJK was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program. The work of SM was carried out as part of the ARPE program of ENS Paris-Saclay. The work of DK and BAO was supported in part by Intel’s THWAI program. The work of CJK and BAO was supported by the Center for the Co-Design of Cognitive Systems (CoCoSys), one of seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. DK has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 839179. FTS discloses support for the research of this work from NIH grant 1R01EB026955-0.

\printbibliography

Supplemental material

Appendix A Mathematical derivations

A.1 Similarity-preserving properties of embeddings

In the following section, we examine the similarity-preserving properties of our coding scheme. Recall from Section 2.1 that our crucial desiderata are that: (1) distinct residue values are represented using vectors which are nearly orthogonal, and that (2) inner products between representations of sub-integer values reflect a reasonable notion of similarity between the encoded values. There is a robust literature on this topic both within the Vector Symbolic Architectures community [plate2003holographic, thomas2021theoretical, frady2022computing, clarkson2023capacity], and the broader ML community [rahimi2007random], which often studies these techniques under the name “random features.” The methods pursued here are in this tradition.

To briefly recapitulate the construction of Equation 1: fix some positive integer $m$ , and let $P(k)$ denote the uniform distribution over $\{0,...,m-1\}$ . Define an embedding $g:\mathbb{R}\to{\mathbb{C}}^{D}$ using the following procedure: draw $k_{1},...,k_{D}$ independently from $P(k)$ , and set:

g(a)_{j}=\exp\left(i\omega k_{j}\right)^{a}/\sqrt{D},\,j=1,...,D,

where $\omega=2\pi/m$ , and $i=\sqrt{-1}$ . To simplify analysis, we here assume that $m$ is odd, in which case the above is equivalent to shifting the support of $P(k)$ to $\{-(m-1)/2,...,(m-1)/2\}$ , and defining the embedding $g:{\mathbb{R}}\to{\mathbb{C}}^{D}$ component-wise via:

g(a)_{j}=\exp\left(i\omega k_{j}a\right)/\sqrt{D},\,j=1,...,D.

The case that $m$ is even is slightly different, but can be handled using similar techniques and the discrepancy does not affect any of our modeling goals.

Our basic claim is that in expectation with respect to randomness in the draw of $k_{1},...,k_{D}$ , inner-products between the embeddings of two numbers $a,a^{\prime}$ recover the periodic sinc-function [smith2011spectral] of their difference. That is:

{\mathbb{E}}[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}]=\frac{\sin(\pi(a-a^{\prime}))}{m\sin(\pi(a-a^{\prime})/m)}:=\text{psinc}(a-a^{\prime}).

This accomplishes goal (1) because, for $t$ an integer which is not an integer multiple of $m$ , $\text{psinc}(t)=0$ . Therefore, distinct integers are represented using vectors which are, in expectation, orthogonal. It also accomplishes goal (2), because $\text{psinc}(t)\approx 1$ for $0<|t|\ll 1$ . The following theorem demonstrates this property more formally, and provides an approximation guarantee for a specific instantiation of $k_{1},...,k_{D}$ .
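The identity between the expected inner product and the periodic sinc can be checked numerically. The following is a minimal sketch, assuming odd $m$ and the symmetric support $\{-(m-1)/2,...,(m-1)/2\}$ for $k$; the function names `psinc` and `expected_similarity` are illustrative and not taken from the authors' code.

```python
import cmath
import math


def psinc(t, m):
    """Periodic sinc sin(pi t) / (m sin(pi t / m)); the limit at multiples of an odd m is 1."""
    denom = m * math.sin(math.pi * t / m)
    if abs(denom) < 1e-9:
        return 1.0
    return math.sin(math.pi * t) / denom


def expected_similarity(t, m):
    """Exact expectation of exp(i w k t) for k uniform on the symmetric support (m odd)."""
    w = 2 * math.pi / m
    half = (m - 1) // 2
    return sum(cmath.exp(1j * w * k * t) for k in range(-half, half + 1)) / m


m = 5
for t in [0.0, 0.1, 0.5, 1.0, 2.0, 3.7]:
    e = expected_similarity(t, m)
    print(f"t={t:4.1f}  expectation={e.real:+.6f}  psinc={psinc(t, m):+.6f}")
```

The two columns should agree to numerical precision, with values near zero at nonzero integers and near one for small $|t|$.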

Theorem 1.

Fix any $D>0$ and $\delta\in(0,1)$ . For any pair $a,a^{\prime}\in{\mathbb{R}}$ such that $a-a^{\prime}$ is not an integer multiple of $m$ , with probability at least $1-\delta$ over randomness in the draw of $k_{1},...,k_{D}$ :

\left|\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}-\frac{\sin(\pi(a-a^{\prime}))}{m\sin(\pi(a-a^{\prime})/m)}\right|\leq\sqrt{\frac{2}{D}\ln\frac{2}{\delta}}.

Proof.

Fix any pair $a,a^{\prime}\in{\mathbb{R}}$ , and denote for concision $t=a-a^{\prime}$ . Taking an expectation with respect to randomness in $k_{1},...,k_{D}$ and using a well-known calculation from the signal processing literature [smith2011spectral]:

\begin{aligned}
{\mathbb{E}}_{k_{1},...,k_{D}}\left[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}\right]&=D\,{\mathbb{E}}_{k_{1}}[g(a)_{1}g(a^{\prime})_{1}^{*}]\\
&=\frac{1}{m}\sum_{k_{1}=-\frac{m-1}{2}}^{\frac{m-1}{2}}\exp\left(i\omega k_{1}(a-a^{\prime})\right)\\
&=\frac{1}{m}\left(\frac{\exp\left(-\frac{i\omega t(m-1)}{2}\right)-\exp\left(\frac{i\omega t(m+1)}{2}\right)}{1-\exp(i\omega t)}\right)\\
&=\frac{\exp(i\omega t/2)}{m\exp(i\omega t/2)}\left(\frac{\exp(-\pi it)-\exp(\pi it)}{\exp(-\pi it/m)-\exp(\pi it/m)}\right)\\
&=\frac{\sin(-\pi t)}{m\sin(-\pi t/m)}\\
&=\frac{\sin(\pi(a-a^{\prime}))}{m\sin(\pi(a-a^{\prime})/m)}.
\end{aligned}

The third equality follows from the second by noting that the latter is a sum of a geometric series with common ratio $r=\exp(i\omega t)$ . The fifth line follows from the fourth by recalling the identity $\sin(x)=(e^{ix}-e^{-ix})/2i$ . In the limit of $t\to 0$ , the expression evaluates to $1$ , consistent with the normalized inner product of a vector with itself.

To show concentration around this value, consider:

\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}=\frac{1}{D}\sum_{j=1}^{D}\exp(i\omega k_{j}(a-a^{\prime})),

and note that since the imaginary part of the sum vanishes in expectation, we may consider, without loss of generality, the average of the real-valued quantities: $\left(\cos(\omega k_{j}(a-a^{\prime}))\right)_{j=1}^{D}$ , which are bounded in the range $\pm 1$ . Therefore, by Hoeffding’s inequality:

{\rm Pr}\left(\left|\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}-{\mathbb{E}}[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}]\right|\geq\epsilon\right)\leq 2\exp\left(-\frac{D\epsilon^{2}}{2}\right),

whereupon we conclude that, with probability at least $1-\delta$ over randomness in the draw of $k_{1},...,k_{D}$ :

\epsilon\leq\sqrt{\frac{2}{D}\ln\frac{2}{\delta}},

as claimed. ∎
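The concentration bound can be probed with a small Monte Carlo experiment. This is a sketch under stated assumptions: odd $m$, the symmetric support for $k$, and a comparison of the real part of the inner product against the psinc value (as in the proof). The parameter values and helper names are illustrative choices, not the authors' settings.

```python
import cmath
import math
import random


def psinc(t, m):
    """Periodic sinc sin(pi t) / (m sin(pi t / m)), with limit 1 at multiples of an odd m."""
    denom = m * math.sin(math.pi * t / m)
    return 1.0 if abs(denom) < 1e-9 else math.sin(math.pi * t) / denom


def embed(a, ks, m):
    """g(a)_j = exp(i w k_j a) / sqrt(D) for the drawn exponents ks."""
    w = 2 * math.pi / m
    scale = 1 / math.sqrt(len(ks))
    return [cmath.exp(1j * w * k * a) * scale for k in ks]


def similarity(u, v):
    """Inner product g(a)^T g(a')^*."""
    return sum(x * y.conjugate() for x, y in zip(u, v))


def hoeffding_radius(D, delta):
    """Deviation bound sqrt((2 / D) ln(2 / delta)) from Theorem 1."""
    return math.sqrt(2 / D * math.log(2 / delta))


rng = random.Random(0)
m, D, delta = 7, 512, 0.05
a, a_prime = 3.3, 1.0
half = (m - 1) // 2
trials, failures = 200, 0
for _ in range(trials):
    ks = [rng.randint(-half, half) for _ in range(D)]
    s = similarity(embed(a, ks, m), embed(a_prime, ks, m))
    if abs(s.real - psinc(a - a_prime, m)) > hoeffding_radius(D, delta):
        failures += 1
failure_rate = failures / trials
print(f"empirical failure rate: {failure_rate:.3f} (bound allows up to {delta})")
```

Because Hoeffding's inequality is conservative, the empirical failure rate is typically well below $\delta$.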

This result can be readily extended to the binding of multiple residue number values. Let $\mathbf{g}(a)=\bigodot_{i=1}^{K}\mathbf{g}_{i}(a)$ , where each $\mathbf{g}_{i}(a)$ is instantiated independently. Then, by independence, we observe that:

\begin{aligned}
{\mathbb{E}}\left[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}\right]&={\mathbb{E}}\left[\prod_{i=1}^{K}\mathbf{g}_{i}(a)^{\top}\mathbf{g}_{i}(a^{\prime})^{*}\right]\\
&=\prod_{i=1}^{K}{\mathbb{E}}\left[\mathbf{g}_{i}(a)^{\top}\mathbf{g}_{i}(a^{\prime})^{*}\right].
\end{aligned}

The implication is that, for integer-valued $a,a^{\prime}$ , ${\mathbb{E}}[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}]=1$ if all residue values agree, and $0$ otherwise. To show concentration around this value, we can again use Hoeffding’s inequality, which recovers the same bound on the sufficient dimension.
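As an illustration of this product structure, the expected similarity of the bound representation can be evaluated as a product of per-modulus psinc kernels. The sketch below assumes odd moduli $\{3,5,7\}$ and integer offsets $t=a-a^{\prime}$; the helper names are illustrative.

```python
import math


def psinc(t, m):
    """Periodic sinc sin(pi t) / (m sin(pi t / m)), with limit 1 at multiples of an odd m."""
    denom = m * math.sin(math.pi * t / m)
    return 1.0 if abs(denom) < 1e-9 else math.sin(math.pi * t) / denom


def expected_bound_similarity(t, moduli):
    """Product over modules of the per-module expected similarity."""
    return math.prod(psinc(t, m) for m in moduli)


moduli = (3, 5, 7)
M = math.prod(moduli)  # coding range, 105
matches = [t for t in range(2 * M + 1)
           if expected_bound_similarity(t, moduli) > 0.5]
print(matches)
```

Only offsets that are multiples of every modulus, i.e., multiples of $M=105$, give an expected similarity near one.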

A.2 Information-theoretic estimate of required pattern dimension

In this section, we describe an information-theoretic estimate on the dimension $D$ necessary to retrieve $n$ patterns within $K$ modules, where the number of patterns $n$ equals the coding range $M$ . The main result we aim to show is that $D=\mathcal{O}(n^{(K-1)/K})$ ; equivalently, the scaling of $n$ for a given $D$ is $\mathcal{O}(D^{K/(K-1)})$ . This scaling roughly matches the dimensions that we empirically find are required to achieve high accuracy, suggesting that the attractor network described here performs close to the theoretical bound.

The minimal total amount of information a network needs to store for denoising an RNS representation with coding range $M$ is $\mathcal{O}(M\log(M))$ . This results from the requirement of content addressability, i.e., to serve as a unique pointer to one of $n=M$ patterns, each pattern must carry information of at least order $\mathcal{O}(\log(M))$ . For simplicity, we now assume that each module is of size $\mathcal{O}(M^{1/K})$ . The total capacity of the network is bounded by the number of synapses, which is $\mathcal{O}(D\cdot K\cdot M^{1/K})=\mathcal{O}(D\cdot M^{1/K})$ (assuming $K$ is constant), times the capacity per synapse. Under the conservative assumption that the capacity per synapse is $\mathcal{O}(1)$ , requiring the capacity to cover the stored information gives $D\cdot M^{1/K}\gtrsim M\log(M)$ , so the dimension is of order $\mathcal{O}(M^{(K-1)/K}\log(M))=\mathcal{O}(e^{\frac{K-1}{K}\log{(M)}+\log{(\log{(M)})}})$ . Thus, the leading order of how $D$ depends on $n=M$ is $\mathcal{O}(M^{(K-1)/K})$ . If the capacity per synapse is assumed to be larger, $\mathcal{O}(\log{(M)})$ bits, only the non-leading term cancels and the resulting order of $D$ is still the same.

A.3 Construction of triangular frames

In order to convert a $2D$ coordinate $\mathbf{x}$ into a $3D$ frame $\mathbf{y}$ , we first multiply it by a matrix, $\Psi$ whose rows are the elements of a $3D$ equiangular frame:

\mathbf{y}=\begin{bmatrix}-1/\sqrt{3}&-1/3\\ 1/\sqrt{3}&-1/3\\ 0&2/3\end{bmatrix}\mathbf{x}

(S1)

(This particular frame is commonly referred to as a ‘Mercedes Benz’ frame due to its resemblance to the iconic symbol.) A consequence of working with an overcomplete frame is that there may exist multiple values of $\mathbf{y}$ that correspond to the same $\mathbf{x}$ . For this frame, the null space of $\Psi^{+}$ is the subspace spanned by ${[1,1,1]^{\intercal}}$ – grounding the intuition that equal movement in all equiangular directions “cancels out.” It therefore might seem that triangular frames require extra operations to determine if two coordinates are equal, but here we show how to avoid this consequence.

The core strategy is to choose seed vectors $\mathbf{g}_{i,1},\mathbf{g}_{i,2},\mathbf{g}_{i,3}$ for each modulus $m_{i}$ that implement this self-cancellation. For a modulus $m_{i}$ , we draw the phasors of seed vectors from the $m_{i}$ -th roots of unity. However, we further require that, for each vector component, the three selected phases sum to $0\ (\text{mod}\ 2\pi)$ . We then form a hexagonal coordinate vector by binding the three seed vectors:

\mathbf{g}_{i}=\mathbf{g}_{i,1}\odot\mathbf{g}_{i,2}\odot\mathbf{g}_{i,3}

(S2)

By enforcing that the phases sum to $0\ (\text{mod}\ 2\pi)$ , we ensure that positions that have an equivalent $\mathbf{x}$ coordinate are mapped to the same $\mathbf{g}_{i}$ . Observe that Hadamard product binding of phasors is equivalent to summing their phases, and that binding $e^{0i}$ corresponds to adding nothing. Hence, a pair of three-dimensional coordinates whose differences are a multiple of $[1,1,1]$ will be mapped to equivalent vector representations. Finally, we then form the residue number representation for different moduli by binding, as in Eq. 2. The presence of multiple modules and self-cancellation properties complement prior work on the efficiency of hexagonal kernels for spatial navigation tasks [KomerNavigation2020, komer2020biologically].
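The self-cancellation property can be illustrated for a single module. The sketch below makes several assumptions that go beyond the text: the seed phases are stored as integer exponents of the $m$-th roots of unity, the third exponent is chosen as the negative sum of the first two (mod $m$), and the coordinates are integers. The function names are hypothetical.

```python
import cmath
import math
import random


def draw_seed_exponents(m, D, rng):
    """Per component, draw exponents (k1, k2, k3) with k1 + k2 + k3 = 0 (mod m)."""
    seeds = []
    for _ in range(D):
        k1 = rng.randrange(m)
        k2 = rng.randrange(m)
        k3 = (-k1 - k2) % m  # forces the three phases to sum to 0 (mod 2 pi)
        seeds.append((k1, k2, k3))
    return seeds


def encode_hex(y, seeds, m):
    """Bind the three seed vectors raised to the integer coordinates y = (a, b, c)."""
    w = 2 * math.pi / m
    a, b, c = y
    return [cmath.exp(1j * w * (k1 * a + k2 * b + k3 * c)) for k1, k2, k3 in seeds]


rng = random.Random(0)
m, D = 5, 64
seeds = draw_seed_exponents(m, D, rng)

v = encode_hex((2, -1, 4), seeds, m)
v_shifted = encode_hex((3, 0, 5), seeds, m)  # same position, offset by [1, 1, 1]
v_other = encode_hex((3, -1, 4), seeds, m)   # a genuinely different position

max_diff_same = max(abs(x - y) for x, y in zip(v, v_shifted))
max_diff_other = max(abs(x - y) for x, y in zip(v, v_other))
print(max_diff_same, max_diff_other)
```

Coordinates that differ by a multiple of $[1,1,1]$ produce the same vector up to floating-point error, whereas a different position does not.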

The equivalence of certain 3D coordinates also helps us count the number of states. Clearly, the redundancy means that we have fewer than $m^{3}$ states, but it also shows us that every position in the hexagonal grid can be represented by a 3D coordinate which contains at least one coordinate equivalent to $0$ . There is one state where all coordinates are $0$ , $3(m-1)$ states where exactly two coordinates are $0$ , and $3(m-1)^{2}$ states where exactly one coordinate is $0$ . Thus, there are $3m^{2}-3m+1$ states for the hexagonal lattice, compared to the $m^{2}$ states for the square lattice.
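The count can be verified by brute-force enumeration. This minimal sketch simply enumerates all 3D residue coordinates with at least one zero entry, following the counting argument above; the function name is illustrative.

```python
from itertools import product


def count_hex_states(m):
    """Count coordinates in {0, ..., m-1}^3 that contain at least one zero entry."""
    return sum(1 for y in product(range(m), repeat=3) if 0 in y)


for m in (3, 5, 7):
    hex_states = count_hex_states(m)
    square_states = m ** 2
    print(f"m={m}: hexagonal={hex_states}, square={square_states}")
```

For $m=3,5,7$ the enumeration gives $19$, $61$, and $127$ states, matching $3m^{2}-3m+1$.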

In the case of square lattices in $2D$ , all states occupy an equal proportion of space; however, this is not the case for the hexagonal lattice (see Figure 5A). This is because states with more zero-valued coordinates occur slightly more frequently. To estimate the effect of unequal proportions on the entropy, we directly calculate the Shannon entropy of hexagonal lattices for finite-size spatial grids of increasing radius $l$ , as an approximation to the infinite lattice. We find that even for $l=1000,m>7$ the hexagonal code has $99$ percent of the entropy of a system that divided all possibilities equally, and that this gap decreases as $m$ grows larger. Thus asymptotically, as $m\to\infty$ , the entropy advantage of hexagonal over square grids tends towards $\log_{2}(3)$ bits, since the ratio of the state counts $(3m^{2}-3m+1)/m^{2}$ tends to $3$ .

Appendix B Experimental details

All experiments were implemented in Python involving standard packages for scientific computing (including NumPy, SciPy, Matplotlib). We describe here the parameters and training setup of our experiments in further detail.

B.1 Scaling in dimension

For each number of moduli, $K$ , we seek to find the smallest dimension $D$ for which our attractor model factorizes its input, $\mathbf{p}$ , into the correct grid states in a fixed time ( $50$ iterations) with high probability (at least $99$ percent empirically). In instances where the network states remain similar over time (at least $0.95$ cosine similarity), we consider that it converged to a fixed point. If such convergence did not occur, we evaluate the accuracy at the last time step.

To evaluate scaling, we first choose our base moduli to be a set of $K$ consecutive primes. We select one of the $M$ numbers at random to serve as the input and set the grid states to be random. We then evaluate a candidate dimension on the factorization task for a set number of trials ( $200$ ) and check accuracy. We measure accuracy by considering whether the amplitude of the complex-valued inner product is highest for the true factor. If the accuracy is below our threshold, we then evaluate performance of a slightly higher dimension (dimensions evaluated are spaced apart on a logarithmic scale). Once a sufficiently high dimension achieves the accuracy threshold, we assume that the scaling is non-decreasing and use the last successful dimension as the first try for the next problem size.

Finally, we fit linear regression to all data points on a log-log scale to estimate the scaling between dimension and problem size. We report the slopes to estimate the scaling coefficients.
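The log-log regression step can be sketched as follows. The data here are synthetic and generated from an exact power law purely to illustrate the fitting procedure; they are not experimental results, and the prefactor `4.0` and the function name `loglog_slope` are arbitrary illustrative choices.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    return sxy / sxx


K = 3
# Coding ranges M for K = 3 consecutive primes: 2*3*5, 3*5*7, 5*7*11, 7*11*13.
problem_sizes = [30, 105, 385, 1001]
# Synthetic dimensions following D = c * M^((K-1)/K) exactly (illustration only).
synthetic_dims = [4.0 * M ** ((K - 1) / K) for M in problem_sizes]

slope = loglog_slope(problem_sizes, synthetic_dims)
print(f"fitted exponent: {slope:.4f}, theoretical (K-1)/K: {(K - 1) / K:.4f}")
```

On real measurements the fitted slope would be compared against the predicted exponent $(K-1)/K$ from Section A.2.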

B.2 Error correction

General experimental setup. We fix in advance the vector dimension, noise level (determined by $1/\kappa$ ), and number of moduli. Given these parameters, we estimate the empirical accuracy of factorization on an arbitrary input known to correspond to one of the patterns. We use the same method for checking convergence as above, though we increase the maximum number of iterations to $100$ . For all experiments in this section, we average over $1{,}000$ trials.

In the case of input noise, the vector $\mathbf{p}$ is multiplied by a noise vector. In the case of update noise, after every time step, each module of the attractor network is corrupted by a von Mises noise update. In the case of codebook noise, all codebooks are corrupted before the start of any iterations.
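A minimal sketch of the input-noise condition is given below, assuming that the noise vector consists of unit-magnitude phasors whose phases are drawn from a zero-mean von Mises distribution and that "multiplied" means component-wise (Hadamard) multiplication. The helper names are illustrative; update noise and codebook noise could plausibly reuse the same corruption function on module states and codebook entries.

```python
import cmath
import math
import random


def random_phasor_vector(D, rng):
    """Random unit-magnitude complex vector with uniformly distributed phases."""
    return [cmath.exp(1j * rng.uniform(0.0, 2 * math.pi)) for _ in range(D)]


def von_mises_noise(D, kappa, rng):
    """Phasor noise vector with phases drawn from a von Mises(0, kappa) distribution."""
    return [cmath.exp(1j * rng.vonmisesvariate(0.0, kappa)) for _ in range(D)]


def bind(u, v):
    """Component-wise (Hadamard) product of two complex vectors."""
    return [x * y for x, y in zip(u, v)]


def similarity(u, v):
    """Real part of the normalized inner product between phasor vectors."""
    return sum(x * y.conjugate() for x, y in zip(u, v)).real / len(u)


rng = random.Random(0)
D = 2000
p = random_phasor_vector(D, rng)

sim_low_noise = similarity(p, bind(p, von_mises_noise(D, 10.0, rng)))   # small 1/kappa
sim_high_noise = similarity(p, bind(p, von_mises_noise(D, 0.5, rng)))   # large 1/kappa
print(f"kappa=10: {sim_low_noise:.3f}   kappa=0.5: {sim_high_noise:.3f}")
```

Larger $\kappa$ (lower noise level $1/\kappa$) leaves the corrupted vector more similar to the original.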

Decoding values between integers. In order to test the ability of the modular attractor network to decode at sub-integer resolution, we fix a spatial resolution $\Delta x$ to decode from. In our experiments, we test $\Delta x\in\{1/3,1/7,1/15,1/31\}$ , and we also report $\Delta x=1$ (integer decoding) as a control. Then, using as input a random integer and random multiple of $\Delta x$ , we let the modules of the attractor network settle until convergence (as in other experiments). To evaluate accuracy, we test if the resulting output of the attractor network, $\bigodot_{i=1}^{K}\mathbf{\hat{g}}_{i}(t)$ , is closer to the ground truth RNS representation than to any other value. We test this with a “coarse-to-fine” approach: first checking if it is within an integer, and then checking all fractional values within one of that integer. We regard the output as correct if both the integer and fraction match, and incorrect otherwise.

Estimation of information content from a vector. To measure the total resolution of our coding scheme in bits, we factor in both the number of states distinguished ( $\tau=\frac{M}{\Delta x}$ ) and the empirical accuracy ( $\rho$ ). To quantitatively estimate this, we report the information decoded in bits according to the following equation [frady2018theory, kleyko2023efficient]:

\begin{split}I(\tau,\rho)=&\rho\log_{2}(\tau\rho)+(1-\rho)\log_{2}\left(\frac{\tau}{\tau-1}(1-\rho)\right).\end{split}

(S3)

A consequence of this equation is that the information decoded is $0$ when the empirical accuracy is at chance ( $1/\tau$ ).
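A minimal sketch of Eq. S3 follows, taking the coefficient of the first logarithm to be the empirical accuracy $\rho$ (which is what makes the information vanish at chance) and using the convention $0\log 0=0$ at the endpoints. The function name and the example values of $M$ and $\Delta x$ are illustrative.

```python
import math


def decoded_information(tau, rho):
    """Information in bits for tau distinguishable states decoded with accuracy rho."""
    bits = 0.0
    if rho > 0:
        bits += rho * math.log2(tau * rho)
    if rho < 1:
        bits += (1 - rho) * math.log2(tau / (tau - 1) * (1 - rho))
    return bits


M, delta_x = 105, 1 / 3
tau = round(M / delta_x)  # 315 states at one-third resolution

info_perfect = decoded_information(tau, 1.0)
info_chance = decoded_information(tau, 1 / tau)
info_partial = decoded_information(tau, 0.9)
print(info_perfect, info_partial, info_chance)
```

Perfect accuracy yields $\log_{2}(\tau)$ bits, chance-level accuracy yields zero, and intermediate accuracies fall in between.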

B.3 Path integration

General experimental setup. We generate paths using a statistical model simulating rodent two-dimensional trajectories in a 50 $\text{cm}^{2}$ closed square environment [raudies2012modeling, banino2018vector], with $\Delta t=100$ ms. The path integration method starts from the ground truth first position $(x_{0},y_{0})$ which is converted to hexagonal coordinates $(a_{0},b_{0},c_{0})$ (see Section A.3) and encoded as an RNS representation $\mathbf{p}(0)$ of dimension $D=3{,}000$ following the method in Section 2.1, for moduli $\{3,5,7\}$ . We then factorize $\mathbf{p}(0)$ into $\{\mathbf{\hat{g}_{i}}(0)\}_{i=1}^{K}$ to produce the estimated representation $\mathbf{\hat{p}}(0)=\bigodot_{i=1}^{K}\mathbf{\hat{g}_{i}}(0)$ .

At each time step $t\geq 0$ , we aim at estimating the position $(x_{t+1},y_{t+1})$ . We give the modular attractor network as input the previous position vector estimate $\mathbf{\hat{p}}(t)$ . It is factorized into the residue components $\{\mathbf{\hat{g}_{j}}(t)\}_{j=1}^{K}$ that are then shifted according to the velocity $(da_{t},db_{t},dc_{t})$ between $(a_{t},b_{t},c_{t})$ and $(a_{t+1},b_{t+1},c_{t+1})$ . Namely, for each residue module, we build a velocity vector $\mathbf{q}_{j}(t)=\mathbf{g}_{j,1}(da(t))\odot\mathbf{g}_{j,2}(db(t))\odot\mathbf{g}_{j,3}(dc(t))$ that is bound to each residue component $\mathbf{\hat{g}_{j}}(t)$ . The estimated position vector is then the binding of the shifted estimated residue components: $\mathbf{\hat{p}}(t+1)=\bigodot_{j=1}^{K}\mathbf{\hat{g}_{j}}(t)\odot\mathbf{q}_{j}(t)$ . The estimated position $(\hat{x}_{t+1},\hat{y}_{t+1})$ is chosen to be the position $(x,y)$ in a grid of $50\times 50$ positions mapping the entire environment, corresponding to the highest similarity between $\mathbf{p}(x,y)$ and $\mathbf{\hat{p}}(t+1)$ .
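The shift-by-binding update can be sketched in a simplified setting. The example below is one-dimensional with integer positions and omits the attractor-based factorization and clean-up entirely, so it only illustrates how binding a velocity vector to each residue component advances the encoded position. All names and parameter values are illustrative assumptions.

```python
import cmath
import math
import random


def residue_vector(a, ks, m):
    """Phasor vector encoding the value a for modulus m with seed exponents ks."""
    w = 2 * math.pi / m
    return [cmath.exp(1j * w * k * a) for k in ks]


def bind(*vectors):
    """Hadamard product of any number of equal-length complex vectors."""
    out = [1 + 0j] * len(vectors[0])
    for v in vectors:
        out = [x * y for x, y in zip(out, v)]
    return out


def similarity(u, v):
    """Real part of the normalized inner product."""
    return sum(x * y.conjugate() for x, y in zip(u, v)).real / len(u)


rng = random.Random(1)
moduli = (3, 5, 7)
D = 256
exponents = {m: [rng.randrange(m) for _ in range(D)] for m in moduli}
M = math.prod(moduli)


def position_vector(a):
    """RNS representation: binding of the residue vectors of all modules."""
    return bind(*(residue_vector(a, exponents[m], m) for m in moduli))


# Module states start at the true initial position (no factorization step here).
states = {m: residue_vector(4, exponents[m], m) for m in moduli}
decoded = []
for velocity in (2, 3, -1):
    for m in moduli:
        q = residue_vector(velocity, exponents[m], m)  # velocity vector for module m
        states[m] = bind(states[m], q)                 # shift the residue component
    p_hat = bind(*states.values())
    decoded.append(max(range(M), key=lambda a: similarity(position_vector(a), p_hat)))
print(decoded)
```

Starting from position $4$ and applying velocities $2$, $3$, and $-1$, the decoded positions follow the integrated path.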

We show the robustness of the path integration dynamics to two different sources of noise. In the case of extrinsic noise (Figure 6D), the hexagonal velocity is corrupted by additive Gaussian noise of variance $0.12$ . In the case of intrinsic noise (Figures 6A and B), the position vector $\tilde{p}_{t}$ is corrupted by binding with a vector sampled from a von Mises distribution with concentration parameter $\kappa=2$ .

Response field visualization. Given a modulus $m_{i}$ and a vector $\mathbf{g}_{i}$ , we visualize its response field by computing the similarity of the modular attractor output $\mathbf{\hat{g}}_{i}(t)$ and $\mathbf{g}_{i}$ along a trajectory. The periodicity in the distribution of random weights and the hexagonal coordinates produce periodic hexagonal receptive fields whose scale depends on $m_{i}$ . The receptive fields of a given modulus are translations of one another, because the inner product between vector states induces a translation-invariant kernel.

Connection to sensory cues. Sensory cues are random binary vectors of size $N_{s}=D$ that are associated with positions along the trajectory. When the true trajectory reaches a sensory cue, the hippocampal state $\mathbf{\hat{p}_{t}}$ is updated using the heteroassociation method described in Appendix B.4.

B.4 Heteroassociation

General experimental setup. We evaluate our model’s performance for pattern denoising using a heteroassociative learning rule [sharma2022content, chandra2023high]. We consider random binary patterns of size $N_{s}={D}$ . We corrupt the patterns by randomly flipping bits with probability $p_{\mathrm{flip}}\in[0,0.5]$ and associate them to place cell representations using heteroassociation with a pseudo-inverse learning rule. Let $\mathbf{S}\in{\mathbb{R}}^{N_{s}\times M}$ be the matrix of $M$ patterns to hook to the scaffold and $\mathbf{H}\in{\mathbb{C}}^{D\times M}$ the matrix whose columns are the $M$ position vectors on which to hook the patterns. We associate pattern $\mathbf{s}$ to a place cell representation $\mathbf{p}=\mathbf{H}\mathbf{S}^{+}\mathbf{s}$ , where $\mathbf{S}^{+}$ is the pseudo-inverse of $\mathbf{S}$ . The model returns a denoised place cell representation $\mathbf{\hat{p}}$ from which we can estimate a denoised pattern by inverting the heteroassociation projection $\mathbf{\hat{s}}=\mathrm{sgn}\left(\mathbf{S}\mathbf{H}^{+}\mathbf{\hat{p}}\right)$ .

Scaling to dimensionality. We evaluate the impact the dimension $D$ has on the denoising performance in Figure 7, for a number of stored patterns $M=60$ (in this case, $3\times 4\times 5$ ) and $M=210$ (in this case, $5\times 6\times 7$ ). For each dimension $D\in\{256,512,1024,2048\}$ , we show the evolution of accuracy for different levels of corruption. For a given dimension $D$ and noise level $p_{\mathrm{flip}}$ , we denoise a pattern and consider that the denoising is correct if the denoised pattern is closest to the ground truth pattern (in terms of cosine similarity). We repeat over 500 trials and report the accuracy as well as the average similarity (normalized inner product) between the denoised pattern and its noiseless version.

Superposition of patterns. We show that our model can denoise a superposition of $n_{p}$ patterns one at a time, for $n_{p}\in\{1,2,3,4,5,10\}$ . We fix the dimension $D$ to $2{,}000$ and for different values of bit flip probability $p_{\text{flip}}\in[0,0.5]$ , we run the model on a superposition $\mathbf{s}$ of random binary patterns $\{\mathbf{s}_{1},...,\mathbf{s}_{n_{p}}\}$ of size $N_{s}=2{,}000$ : $\mathbf{s}=\mathbf{s}_{1}+\dots+\mathbf{s}_{n_{p}}$ . We run the model $n_{p}$ times and between each run the denoised pattern is explained away from the superposition [kymn2024compositional]. Namely, for run $r\in\{1,...,n_{p}-1\}$ we denote by $\mathbf{\hat{s}}(r)$ the denoised pattern. The input to run $r+1$ is then $\mathbf{s}(r+1)=\mathbf{s}(r)-\mathbf{\hat{s}}(r)$ . We find that the more patterns are superposed, the lower the overall denoising accuracy is. This is because, when a pattern is incorrectly denoised, explaining away adds noise or spurious patterns to the representation of the superposition, which makes the following denoising steps more difficult.
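The explaining-away loop can be sketched independently of the attractor model. In the example below, a nearest-neighbor lookup in a known codebook stands in for the model's denoiser (a substitution made purely for illustration), patterns are assumed to take values in $\{-1,+1\}$, and no bit flips are applied. All names are hypothetical.

```python
import random


def random_bipolar(n, rng):
    """Random pattern with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(n)]


def nearest_pattern(query, codebook):
    """Index of the codebook pattern with the largest inner product with the query.

    Stand-in for the attractor-based denoiser described in the text.
    """
    return max(range(len(codebook)),
               key=lambda i: sum(q * c for q, c in zip(query, codebook[i])))


def explain_away(superposition, codebook, n_patterns):
    """Recover patterns one at a time, subtracting each estimate from the residual."""
    residual = list(superposition)
    recovered = []
    for _ in range(n_patterns):
        idx = nearest_pattern(residual, codebook)
        recovered.append(idx)
        # s(r + 1) = s(r) - s_hat(r)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return recovered, residual


rng = random.Random(0)
N_s = 500
codebook = [random_bipolar(N_s, rng) for _ in range(20)]
stored = [2, 7, 11]
s = [sum(codebook[i][j] for i in stored) for j in range(N_s)]

recovered, residual = explain_away(s, codebook, len(stored))
print(sorted(recovered), max(abs(r) for r in residual))
```

When every pattern is identified correctly, the residual is exactly zero after the last run; a wrong estimate would instead leave a spurious component in the residual, which is the failure mode described above.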

Comparison to structured patterns. We evaluate our model’s ability to denoise structured patterns. We consider the FashionMNIST dataset, from which we select $105$ images of size $28\times 28$ that we binarize by setting pixel values to be $-1$ if below $127$ , and $1$ elsewhere. We compare the denoising performance to the performance on random binary patterns of size $28\times 28=784$ for fair comparison (Figure S4).

Appendix C Additional results

C.1 Further visualizations of grid cell modules

We further visualize the receptive fields for path integration by showing receptive fields from different units taken from the same grid module. We simulate a trajectory that traverses the entire environment and represent the activation of different position vectors along the trajectory. For each modulus $m_{i}\in\{3,5,7\}$ , we show the similarity between $4$ different vectors $\mathbf{g}_{i}$ from module $m_{i}$ and the position vectors along the trajectory. We show in Figure S1 that the different receptive fields of a given module are translations of one another.

C.2 Remapping contexts

We demonstrate that the context vector can serve as a model of global remapping in hippocampal place fields, which occurs when there is no relationship between the firing of place cells in different environments [latuske2018hippocampal]. The simplest instance of this is when a place field occurs in context A but not context B, consistent with the observed sparsity of hippocampal activity [thompson1989place]. To model this kind of remapping phenomenon, we consider an instance where there is a gradation of contexts with some phase transition between them; such an instance was observed experimentally [wills2005attractor]. Towards this end, we model linear combinations of these contexts, where the weights each context is given are $\text{sigmoid}(x),1-\text{sigmoid}(x)$ , with $x$ varying from $-5$ to $5$ in $8$ equally spaced increments, and with $\text{sigmoid}(x)=1/(1+\exp(-x))$ . To model hippocampal units, we generate units that prefer one of the two contexts and have a random place field location, using its weight vector, or address, as $\mathbf{c}\odot\bigodot_{i=1}^{K}\mathbf{g}_{i}$ , and compare its output to that of the context/grid system at each location and context. It is worth noting that the original experiment of [wills2005attractor] also exhibited instances of rate remapping for some units, and so there is certainly additional complexity underlying remapping that is not captured by our simple model.
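The graded mixing of two contexts can be sketched as follows. Several details are assumptions made for illustration: the context vectors are random phasor vectors, "8 equally spaced increments" is read as eight evenly spaced values of $x$ including both endpoints, and context A receives the weight $\text{sigmoid}(x)$.

```python
import cmath
import math
import random


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def random_phasor_vector(D, rng):
    """Random unit-magnitude complex vector with uniformly distributed phases."""
    return [cmath.exp(1j * rng.uniform(0.0, 2 * math.pi)) for _ in range(D)]


def similarity(u, v):
    """Real part of the normalized inner product."""
    return sum(x * y.conjugate() for x, y in zip(u, v)).real / len(u)


rng = random.Random(0)
D = 1000
c_A = random_phasor_vector(D, rng)
c_B = random_phasor_vector(D, rng)

n_steps = 8  # assumption: eight values of x, endpoints included
xs = [-5 + 10 * i / (n_steps - 1) for i in range(n_steps)]

sims_to_A = []
for x in xs:
    w_A, w_B = sigmoid(x), 1 - sigmoid(x)
    mixed = [w_A * a + w_B * b for a, b in zip(c_A, c_B)]  # linear combination of contexts
    sims_to_A.append(similarity(mixed, c_A))
    print(f"x={x:+.2f}  w_A={w_A:.3f}  similarity to context A={sims_to_A[-1]:.3f}")
```

The similarity of the mixed context to context A rises sigmoidally with $x$, giving a sharp transition between the two contexts around $x=0$.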

C.3 Storing and retracing sequences

We demonstrate that our model can recover sequences by heteroassociation of patterns to positions and path integration in a conceptual space (Figure S5A). This is consistent with the postulated role of the hippocampal formation in performing navigation in conceptual spaces [constantinescu2016organizing, bellmund2018navigating], and the role of entorhinal cortex in generating sequences of neural firing in hippocampus [schlesiger2015medial, yamamoto2017direct]. To evaluate our attractor model’s fidelity at sequence memorization and retrieval, we simulate trajectories to form sequences of random binary patterns and recall the sequence using the path integration mechanism following the method in Section 4.1, for $D=10{,}000$ and moduli $\{3,5\}$ . We add extrinsic noise to the velocity input, which accumulates along the trajectory and induces a drift. This implies that patterns at the end of sequences are less well recovered than ones at the beginning (Figure S5B and C).
