Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
\addbibresource

references.bib

Binding in hippocampal-entorhinal circuits
enables compositionality in cognitive maps

Christopher J. Kymn Sonia Mazelet Redwood Center for Theoretical Neuroscience, UC Berkeley, Berkeley, USA Université Paris-Saclay, ENS Paris-Saclay, Gif-sur-Yvette, France Anthony Thomas Redwood Center for Theoretical Neuroscience, UC Berkeley, Berkeley, USA Denis Kleyko Centre for Applied Autonomous Sensor Systems, Örebro University, Örebro, Sweden E. Paxon Frady Intel Labs, Santa Clara, USA Friedrich T. Sommer Redwood Center for Theoretical Neuroscience, UC Berkeley, Berkeley, USA Intel Labs, Santa Clara, USA Bruno A. Olshausen Redwood Center for Theoretical Neuroscience, UC Berkeley, Berkeley, USA
Abstract

We propose a normative model for spatial representation in the hippocampal formation that combines optimality principles, such as maximizing coding range and spatial information per neuron, with an algebraic framework for computing in distributed representation. Spatial position is encoded in a residue number system, with individual residues represented by high-dimensional, complex-valued vectors. These are composed into a single vector representing position by a similarity-preserving, conjunctive vector-binding operation. Self-consistency between the representations of the overall position and of the individual residues is enforced by a modular attractor network whose modules correspond to the grid cell modules in entorhinal cortex. The vector binding operation can also associate different contexts to spatial representations, yielding a model for entorhinal cortex and hippocampus. We show that the model achieves normative desiderata including superlinear scaling of patterns with dimension, robust error correction, and hexagonal, carry-free encoding of spatial position. These properties in turn enable robust path integration and association with sensory inputs. More generally, the model formalizes how compositional computations could occur in the hippocampal formation and leads to testable experimental predictions.

1 Introduction

The hippocampal formation (HF), consisting of hippocampus (HC) and the medial and lateral part of the neighboring entorhinal cortex, (MEC) and (LEC), is critical for forming memories and representing variables such as spatial position [eichenbaum2017integration, moser2017spatial]. Recent work has provided evidence of compositional structure in HF representations, for example, novel recombinations of past experience occurring in replay [kurth2023replay], or the exponential expressivity of the grid cell code [fiete2008grid, sreenivasan2011grid]. In particular, compositional representations afford high expressivity with lower dimensional storage requirements [behrens2018cognitive], less complexity in latent state inference, and generalization to novel scenes with familiar parts.

To gain insight into the possible computational principles and neural mechanisms at play in the HF, we take a normative modeling approach. That is, we seek a set of neural coding principles that effectively achieve the postulated function of the system. With this approach, we can then explain details about the neuroanatomical and neurophysiological structures in light of their particular contributions to an information processing objective. We believe that the resulting model can also lead to new predictions about the neural mechanisms that enable this function.

The postulated function of the HF —as a cognitive map and episodic memory— has a core computational requirement, to represent and navigate space. Here, space is either the actual physical environment or a more abstract conceptual space. We formulate multiple desiderata for an effective representation of space. We then show that a residue number system, incorporated into a compositional encoding scheme, fulfills these desiderata. It is achieved by a modular attractor network that factorizes the individual components of encoded locations. This provides an algorithmic-level hypothesis of hippocampal-entorhinal interactions. A core mechanism of this algorithm is binding, which draws inspiration from work in neuroscience, cognitive science, and artificial intelligence.

2 A normative model for the hippocampal formation

2.1 Principles for representing space

Our first set of normative requirements is that space is represented by a compositional code that has high spatial resolution, is noise-robust, and in which algebraic operations on the components can be updated in parallel. Prior work [fiete2008grid, sreenivasan2011grid] has proposed the residue number system (RNS) [garner1959residue] as a candidate for fulfilling these requirements. An RNS expresses an integer x𝑥xitalic_x in terms of its remainder relative to a set of co-prime moduli {mi}subscript𝑚𝑖\{m_{i}\}{ italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }. For example, relative to moduli {3,5,7}357\{3,5,7\}{ 3 , 5 , 7 }, x=40𝑥40x=40italic_x = 40 is encoded as {1,0,5}105\{1,0,5\}{ 1 , 0 , 5 }. The Chinese Remainder Theorem guarantees that all integers in the range [0,M1]0𝑀1[0,M-1][ 0 , italic_M - 1 ], where M=imi𝑀subscriptproduct𝑖subscript𝑚𝑖M=\prod_{i}m_{i}italic_M = ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, are assigned a unique representation. An RNS provides high spatial resolution, carry-free arithmetic operations, and robust error correction [goldreich1999chinese]. Experimental observations in entorhinal cortex show a discrete multi-scale organization of spatial grid cells [stensola2012entorhinal] that is compatible with an organization into discrete RNS modules.

The second normative principle we adopt is that an individual residue value should be encoded by a neural population in a similarity-preserving fashion. In particular, we require that distinct integer values are represented with nearly orthogonal vectors. To achieve this principle, we use a method similar to random Fourier features [rahimi2007random]. Each modulus, with value misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, is assigned a seed phasor vector, 𝐠iDsubscript𝐠𝑖superscript𝐷\mathbf{g}_{i}\in\mathbb{C}^{D}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_C start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT, whose elements (𝐠i)jsubscriptsubscript𝐠𝑖𝑗(\mathbf{g}_{i})_{j}( bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT are drawn uniformly from the misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT-th roots of unity (i.e., (𝐠i)j=e1ωijsubscriptsubscript𝐠𝑖𝑗superscript𝑒1subscript𝜔𝑖𝑗(\mathbf{g}_{i})_{j}=e^{\sqrt{-1}\,\omega_{ij}}( bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_e start_POSTSUPERSCRIPT square-root start_ARG - 1 end_ARG italic_ω start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, with ωij=2πmikjsubscript𝜔𝑖𝑗2𝜋subscript𝑚𝑖subscript𝑘𝑗\omega_{ij}=\frac{2\pi}{m_{i}}\,k_{j}italic_ω start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = divide start_ARG 2 italic_π end_ARG start_ARG italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, and kjsubscript𝑘𝑗k_{j}italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT chosen randomly from {0,,mi1}0subscript𝑚𝑖1\{0,...,m_{i}-1\}{ 0 , … , italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 }). The representation of a particular residue value ai{0,,mi1}subscript𝑎𝑖0subscript𝑚𝑖1a_{i}\in\{0,\dots,m_{i}-1\}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 0 , … , italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 } is then given by rotating the phases of the seed vector according to [plate1992holographic]:

𝐠i(ai)=(𝐠i)ai,subscript𝐠𝑖subscript𝑎𝑖superscriptsubscript𝐠𝑖subscript𝑎𝑖\mathbf{g}_{i}(a_{i})=(\mathbf{g}_{i})^{a_{i}},bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ( bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , (1)

where we abuse notation slightly to also think of 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as a function that takes aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as input and produces an embedding as described above. The complex-valued vectors can be mapped to interpretable population vectors via a randomized Fourier transform (Figures 6D and S2).

Our third normative principle concerns the manner in which a unique representation of a particular point in space is formed from the individual residue representations. This requires that we somehow combine the residue vectors for each modulus. Combining via concatenation, though straightforward, is not effective because codes that coincide in subsets of their residue representation would be similar, even when the encoded values are very different. Thus, the method of combining residue codes must be conjunctive. Conjunctive composition is often called binding and is of fundamental importance in neuroscience [MalsburgAssemblies1986], cognitive science [FodorCritical1988], and machine learning [greff2020binding]. An early proposal for binding is the tensor product of representation vectors [SmolenskyTensor1990], with the tensor order equal to the number of bound objects.

Here, we implement binding with component-wise vector multiplication, a dimensionality preserving operation that represents a lossy compression of the full tensor product [plate1991holographic, kanerva2009hyperdimensional]. The resulting compositional vector representation of an integer x𝑥x\in\mathbb{Z}italic_x ∈ blackboard_Z using an RNS representation with K𝐾Kitalic_K moduli, {a1,a2,..,aK}\{a_{1},a_{2},..,a_{K}\}{ italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , . . , italic_a start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT }, is:

𝐩(x)=i=1K𝐠i(ai).𝐩𝑥superscriptsubscript𝑖1𝐾subscript𝐠𝑖subscript𝑎𝑖\mathbf{p}(x)=\bigodot_{i=1}^{K}\mathbf{g}_{i}(a_{i}).bold_p ( italic_x ) = ⨀ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) . (2)

We prove in Appendix A.1 that this coding scheme represents distinct integer states using nearly orthogonal vectors, and that it generalizes in a natural way to support representation of arbitrary real numbers in a similarity preserving fashion.

Eq. 2 represents individual points along a line. In general, however, a spatial representation involves points in 2D or 3D spaces. Conveniently, vector binding can be also used to compose representations of multidimensional lattices from vectors representing individual dimensions. As we will explain, there is still a choice in this composition that determines the resulting lattice structure. Following earlier proposals [wei2015principle, mathis2015probable, anselmi2020computational], our fourth normative principle is to choose the lattice structure so that spatial information is maximized, as described in Section 3.5.

The final normative principle we require is that for computations such as path integration, there should be a simple vector manipulation that results in addition of the encoded variables. Again, vector binding provides this functionality with our coding strategy, because of the following property:

𝐠(x)𝐠(y)=𝐠(x+y).direct-product𝐠𝑥𝐠𝑦𝐠𝑥𝑦\mathbf{g}(x)\odot\mathbf{g}(y)=\mathbf{g}(x+y).bold_g ( italic_x ) ⊙ bold_g ( italic_y ) = bold_g ( italic_x + italic_y ) . (3)

2.2 Modular attractor network for spatial representation

A standard model of grid cell circuits is the line attractor, in which states that represent a consistent location lie on a low-energy manifold [fiete2008grid]. When initialized from a noisy location pattern, the circuit dynamics will generate a denoised location representation. Rather than forming a line attractor model for the entire representational space (Eq. 2), we propose a modular network architecture, so that the compositional structure of a residue number representation can scale towards a large range with fewer memory resources (Section 3.2), in a manner robust to noise (Section 3.3).

A starting point for our attractor network model is the Hopfield network, which acts as an associative memory by storing memory patterns as fixed-point attractors. The Rademacher-Hopfield network [hopfield1982neural] is a dynamical system whose state is a vector 𝐱{1,+1}D𝐱superscript11𝐷\mathbf{x}\in\{-1,+1\}^{D}bold_x ∈ { - 1 , + 1 } start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT that obeys the following dynamics:

𝐱(t+1)=sign(𝐗𝐗T𝐱(t))𝐱𝑡1signsuperscript𝐗𝐗𝑇𝐱𝑡\mathbf{x}(t+1)=\text{sign}(\mathbf{XX}^{T}\mathbf{x}(t))bold_x ( italic_t + 1 ) = sign ( bold_XX start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_x ( italic_t ) ) (4)

with 𝐗𝐗\mathbf{X}bold_X as the matrix of memorized patterns (column vectors of 𝐗𝐗\mathbf{X}bold_X). The fixed-point attractor dynamics can be generalized to complex memory patterns 𝐳D𝐳superscript𝐷\mathbf{z}\in\mathbb{C}^{D}bold_z ∈ blackboard_C start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT:

𝐳(t+1)=σ(𝐙𝐙𝐳(t)),𝐳𝑡1𝜎superscript𝐙𝐙𝐳𝑡\mathbf{z}(t+1)=\sigma(\mathbf{ZZ}^{{\dagger}}\mathbf{z}(t)),bold_z ( italic_t + 1 ) = italic_σ ( bold_ZZ start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT bold_z ( italic_t ) ) , (5)

where σ𝜎\sigmaitalic_σ is a non-linearity normalizing the amplitude of each complex-valued component to one [noest1987phasor], and 𝐙𝐙\mathbf{Z}bold_Z the corresponding matrix of memorized patterns. The model can also be discretized, such that each component is often quantized to a r𝑟ritalic_r-state phasor [noest1988discrete]. The Rademacher-Hopfield model is the special case where r=2𝑟2r=2italic_r = 2 and the phasors happen to be real-valued.

An r𝑟ritalic_r-state phasor network of the form of Eq. 5 is well-suited to serve as an attractor network for each of the residue vectors in an RNS representation of position, with r=mi𝑟subscript𝑚𝑖r=m_{i}italic_r = italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for modulus i𝑖iitalic_i, and the matrix 𝐙𝐙\mathbf{Z}bold_Z (which we shall denote 𝐆isubscript𝐆𝑖\mathbf{G}_{i}bold_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) storing the 𝐠i(ai)subscript𝐠𝑖subscript𝑎𝑖\mathbf{g}_{i}(a_{i})bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) for ai{0,..,mi1}a_{i}\in\{0,..,m_{i}-1\}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 0 , . . , italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 }. However, we desire a method for representing the whole coding range M:=iKmiassign𝑀superscriptsubscriptproduct𝑖𝐾subscript𝑚𝑖M:=\prod_{i}^{K}m_{i}italic_M := ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT without storing all M𝑀Mitalic_M patterns in one large associative memory. For this purpose we show that a resonator network, a recently proposed recurrent network for unbinding conjunctive codes [frady2020resonator, kent2020resonator], lets us represent this range by storing only n:=iKmiMassign𝑛superscriptsubscript𝑖𝐾subscript𝑚𝑖much-less-than𝑀n:=\sum_{i}^{K}m_{i}\ll Mitalic_n := ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≪ italic_M patterns. Given a vector encoding of position, 𝐩(x)𝐩𝑥\mathbf{p}(x)bold_p ( italic_x ), as formulated in Eq. (2), a resonator network will factorize it into its constituent RNS components by iteratively updating each residue vector estimate, 𝐠^isubscript^𝐠𝑖\hat{\mathbf{g}}_{i}over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, similar to the attractor dynamics of Eq. (5) but in a way that it is also consistent with 𝐩(x)𝐩𝑥\mathbf{p}(x)bold_p ( italic_x ) given all other residue estimates 𝐠^jisubscript^𝐠𝑗𝑖\hat{\mathbf{g}}_{j\neq i}over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_j ≠ italic_i end_POSTSUBSCRIPT:

𝐠^i(t+1)=σ(𝐆i𝐆i(𝐩jiK𝐠^j(t)))isubscript^𝐠𝑖𝑡1𝜎subscript𝐆𝑖superscriptsubscript𝐆𝑖𝐩superscriptsubscript𝑗𝑖𝐾superscriptsubscript^𝐠𝑗𝑡for-all𝑖\mathbf{\hat{g}}_{i}(t+1)=\sigma\Bigl{(}\mathbf{G}_{i}\mathbf{G}_{i}^{{\dagger% }}\bigl{(}\mathbf{p}\bigodot_{j\neq i}^{K}\mathbf{\hat{g}}_{j}^{*}(t)\bigr{)}% \Bigr{)}\;\;\;\forall\ iover^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t + 1 ) = italic_σ ( bold_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT ( bold_p ⨀ start_POSTSUBSCRIPT italic_j ≠ italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_t ) ) ) ∀ italic_i (6)

Let us now assume that the input 𝐩(xt)𝐩subscript𝑥𝑡\mathbf{p}(x_{t})bold_p ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) encodes a spatial position xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT using Eq. (2). Given a velocity input 𝐪i(vt)subscript𝐪𝑖subscript𝑣𝑡\mathbf{q}_{i}(v_{t})bold_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), estimated from self-motion input, path integration is performed by first running attractor dynamics, then updating attractor states by velocity.

𝐠^i(t+1)=𝐪i(vt)σ(𝐆i𝐆i𝐩(xt)ijK𝐠^j(t))subscript^𝐠𝑖𝑡1direct-productsubscript𝐪𝑖subscript𝑣𝑡𝜎subscript𝐆𝑖superscriptsubscript𝐆𝑖𝐩subscript𝑥𝑡superscriptsubscript𝑖𝑗𝐾superscriptsubscript^𝐠𝑗𝑡\mathbf{\hat{g}}_{i}(t+1)=\mathbf{q}_{i}(v_{t})\odot\sigma(\mathbf{G}_{i}% \mathbf{G}_{i}^{{\dagger}}\mathbf{p}(x_{t})\bigodot_{i\neq j}^{K}\mathbf{\hat{% g}}_{j}^{*}(t))over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t + 1 ) = bold_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ⊙ italic_σ ( bold_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT bold_p ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ⨀ start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_t ) ) (7)

After velocity updates, one can update the input state 𝐩(xt)𝐩subscript𝑥𝑡\mathbf{p}(x_{t})bold_p ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) with the conjunctive representation of the current factor estimates:

𝐩(xt+1)=iK𝐠^i(t+1).𝐩subscript𝑥𝑡1superscriptsubscript𝑖𝐾subscript^𝐠𝑖𝑡1\mathbf{p}(x_{t+1})=\bigodot_{i}^{K}\hat{\mathbf{g}}_{i}(t+1).bold_p ( italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) = ⨀ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t + 1 ) . (8)

Further explanation and detail is provided in Appendix B.3.

2.3 Mapping the model to the HF

Although it is not obvious how the components of our normative model should map to the anatomical architecture of HF, we make one proposal as shown in Figure 1. The memory networks for residue representations 𝐠^isubscript^𝐠𝑖\mathbf{\hat{g}}_{i}over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT correspond to grid modules in MEC. Similar to the grid modules, a module for context can be added to the architecture, such as a tag for the identity of a specific environment, with the recurrent synapses 𝐂𝐂\mathbf{C}bold_C storing tags of different environments.

The context neurons could correspond to the non-grid entorhinal cells, which can contain local, non-spatial information about the environment [latuske2018hippocampal]. The vector 𝐩(xt)𝐩subscript𝑥𝑡\mathbf{p}(x_{t})bold_p ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) can be linked to place cells in hippocampus. Internal HC circuitry can either buffer the input as in Eq. (6) or allow it to be updated dynamically according to the MEC input (Section 4.1). The mutual interactions between HC and MEC grid modules require projections between these structures. The binding operations that these interactions involve according to Eq. (6) are hypothesized to be implemented by nonlinear interactions between dendritic inputs in HC and MEC neurons.

\floatbox

[\capbeside\thisfloatsetupcapbesideposition=right,top,capbesidewidth=7cm]figure[\FBwidth] Refer to caption

Figure 1: Schematic of proposed attractor model. In MEC, the 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are residue representations in grid modules, and c encodes a context label. Input of velocity estimate 𝐪(𝐯)𝐪𝐯\mathbf{q}(\mathbf{v})bold_q ( bold_v ) can produce path integration in grid modules via binding, denoted by direct-product\odot. In HC, p represents contextualized place. Binding serves two roles in the MEC/HC interaction (symbolized by bidirectional arrows): a) factorizing p into 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s, and b) generating an update of p from the 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s, for example, after path integration. In LEC, s represents sensory input, interacting with p through a learned heteroassociative projection.

The model also assumes the ability for sensory cues to provide the initialization signal of the cognitive map, represented by 𝐬𝐬\mathbf{s}bold_s in Figure 1. For completeness, we make the basic assumption that heteroassociative memories are formed by the brain that link sensory cues to the place cell representations 𝐩𝐩\mathbf{p}bold_p (Section 4.2). This process would require the system to generate a new context vector 𝐜𝐜\mathbf{c}bold_c and initialize the cognitive map to a default location in order to learn about new environments. We show that through even a simple heteroassociative mechanism, our modular attractor network can robustly retrieve sensory memories and even protect its compositional structure.

3 Coding properties of the model

3.1 RNS representations have exponential coding range

The compositional RNS vector representation Eq. (2) can encode a coding range of M𝑀Mitalic_M values using a total of n𝑛nitalic_n component patterns for representing the residue of individual modules. The scaling of the coding range is exponential in the number of moduli, K𝐾Kitalic_K, since if each module has 𝒪(m)𝒪𝑚\mathcal{O}(m)caligraphic_O ( italic_m ) patterns, and the co-prime condition is satisfied, the scaling of the coding range is 𝒪(mK)𝒪superscript𝑚𝐾\mathcal{O}(m^{K})caligraphic_O ( italic_m start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ). This recovers the expressivity argued by [fiete2008grid, mathis2012resolution].

More generally, it is also exponential in the number of component patterns, n𝑛nitalic_n. The optimal coding range is given by the best partition of n𝑛nitalic_n into a set of positive {mi}subscript𝑚𝑖\{m_{i}\}{ italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }. This optimization is identical to that of finding the maximum order of an element in the group of permutations Snsubscript𝑆𝑛S_{n}italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, because the maximum order can be found by finding the longest cycle. The scaling of this value in n𝑛nitalic_n is characterized by Landau’s function f(n)𝑓𝑛f(n)italic_f ( italic_n ), which is known to converge to exp(nlnn)exp𝑛ln𝑛\text{exp}(\sqrt{n\ \text{ln}\ n})exp ( square-root start_ARG italic_n ln italic_n end_ARG ) as n𝑛n\to\inftyitalic_n → ∞ [landau1903maximalordnung]. Figure 2A illustrates how Landau’s function is the upper bound to what is achievable for any fixed number of moduli (K𝐾Kitalic_K).

Though other kinds of representations can achieve an exponential coding range, the advantage of the compositional encoding of Eq. (2) comes from the fact that the binding operation implements carry-free vector addition (our fourth principle). This enables updates of the encoded value without requiring further transformations such as decoding, facilitating tasks such as path integration (Section 4.1, Appendix C.3). Binary representations, by contrast, have exponential coding range but require carry-over operations to implement.

3.2 The modular attractor network has superlinear coding range

Refer to caption
Figure 2: Residue number systems, combined with a modular attractor network (resonator network), result in a new kind of attractor neural network with favorable scaling for a large combinatorial range. A) Number of encoding states, M𝑀Mitalic_M, grows rapidly in the number of modules, up to a maximum established by Landau’s function (black dots). B) Coefficient of coding range, M, scales roughly as 𝒪(DαK)𝒪superscript𝐷subscript𝛼𝐾\mathcal{O}(D^{\alpha_{K}})caligraphic_O ( italic_D start_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ), depending on the number of moduli, K𝐾Kitalic_K, but with αK>1subscript𝛼𝐾1\alpha_{K}>1italic_α start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT > 1. C) Estimation of scaling from slopes of linear regression (fit to log-log scale). Higher values of K𝐾Kitalic_K require a higher dimension to achieve a particular coding range; empirical values are close to αK=KK1subscript𝛼𝐾𝐾𝐾1\alpha_{K}=\frac{K}{K-1}italic_α start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT = divide start_ARG italic_K end_ARG start_ARG italic_K - 1 end_ARG.

The exponential scaling of the coding range of the RNS representation is a prerequisite to obtain a large coding range with the attractor network that has to perform computations on this representation, such as input denoising, working memory, and path integration. To estimate the scaling of the coding range in the proposed attractor network (Eq. 6), we study the critical dimension for which the grid modules converge with high probability. Specifically, we empirically estimate the minimum dimension required to retrieve an arbitrary RNS representation with high probability, given a maximum number of iterations (Figure 2B). Remarkably, we find that the number of component patterns n𝑛nitalic_n that can be stored is superlinear in the pattern dimension D𝐷Ditalic_D; empirically 𝒪(Dα)𝒪superscript𝐷𝛼\mathcal{O}(D^{\alpha})caligraphic_O ( italic_D start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ) for some α1𝛼1\alpha\geq 1italic_α ≥ 1. For 2, 3, and 4 moduli, α2.05,1.45𝛼2.051.45\alpha\approx 2.05,1.45italic_α ≈ 2.05 , 1.45 and 1.231.231.231.23, respectively (Figure 2C).

These empirical scaling laws are consistent with a simple information-theoretic calculation (Appendix A.2). The minimal amount of bits to be stored for the entire RNS vector encoding scheme is of order 𝒪(MlogM)𝒪𝑀log𝑀\mathcal{O}(M\ \text{log}\ M)caligraphic_O ( italic_M log italic_M ), and the number of synapses in the attractor network is 𝒪(DMK)𝒪𝐷𝐾𝑀\mathcal{O}(D\sqrt[K]{M})caligraphic_O ( italic_D nth-root start_ARG italic_K end_ARG start_ARG italic_M end_ARG ). If one makes the cautious assumption of a capacity per synapse of 𝒪(1)𝒪1\mathcal{O}(1)caligraphic_O ( 1 ), the leading order for the coding range M𝑀Mitalic_M is 𝒪(Dα)𝒪superscript𝐷𝛼\mathcal{O}(D^{\alpha})caligraphic_O ( italic_D start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ), with α=KK1𝛼𝐾𝐾1\alpha=\frac{K}{K-1}italic_α = divide start_ARG italic_K end_ARG start_ARG italic_K - 1 end_ARG.

Note that while the coding range increases with the number of moduli (K𝐾Kitalic_K) for the RNS representation, the superlinear scaling coefficient αKsubscript𝛼𝐾\alpha_{K}italic_α start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT decreases with K𝐾Kitalic_K for the modular attractor network, reaching maximum superlinearity at the smallest value K=2𝐾2K=2italic_K = 2. This reversal is caused by the fact that increasing K𝐾Kitalic_K decreases the number of synapses, i.e., the memory resource in the attractor network.

3.3 Robust error correction

Refer to caption
Figure 3: Recovery of encoded positions is robust to various sources of noise. A) Visualization of the von Mises weight distribution. Note that the magnitude of the noise is inversely proportional to κ𝜅\kappaitalic_κ, and that the variance of the phase perturbation is much larger than the distance between the discrete states of phasors. B-D) Visualizations of accuracy as a function of coding range and κ𝜅\kappaitalic_κ for three separate cases: input noise (B), update noise (C), and codebook noise (D). Cases are shown in order of increasing difficulty. The resonator network maintains perfect accuracy up to a point, after which accuracy decays at an earlier point than the noiseless dynamics (black curve).

In addition, we evaluate the robustness of our attractor model to noise. Because the RNS representations are composed of phasors, which are circular variables, we sample noise from a von Mises distribution with two parameters: mean (μ=0𝜇0\mu=0italic_μ = 0) and concentration pattern κ𝜅\kappaitalic_κ (Figure 3A). Higher κ𝜅\kappaitalic_κ values imply less noise; the distribution approximates a Gaussian with variance 1/κ1𝜅1/\kappa1 / italic_κ for large κ𝜅\kappaitalic_κ.

We consider three cases: noisy input patterns, noise added to each time step, and noisy weights corruptions of patterns in 𝐆isubscript𝐆𝑖\mathbf{G}_{i}bold_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (Appendix B.2). The empirical accuracy of recall varies depending on the type of corruption applied (Figure 3A). We find that for a given dimension D𝐷Ditalic_D (in this case, 1024102410241024), increasing noise decreases the maximum coding range that can be decoded with high accuracy (Figure 3B-D). For a fixed noise level, the high-accuracy coding range is largest for input noise, followed by update noise and codebook noise. It is perhaps not surprising that codebook noise has the worst coding range, given that noise added to every stored pattern compounds across the dynamics. Fortunately, the demonstrated robustness to input noise enables sensory patterns to be denoised via heteroassociation (Section 4.2).

3.4 Interpolation between patterns enables continuous path integration

In general, there is a sharp difference between point and line attractors. In our attractor model, the RNS representations of integer values are stored as discrete fixed points. Nevertheless, the attractor network also converges to states that represent non-integer values that are not explicitly stored. In other words, the network smoothly interpolates to points on a manifold of states that represent integer and non-integer values encoded by (2); Figure 4A provides a visualization, showing that the kernel induced by inner product operations retains graded similarity for sub-integer shifts. This kernel enables the modular attractor network to settle to fixed points that correspond to interpolations between integers, and for sub-integer positions to be decoded.

The resolution of decoding is fundamentally limited by the signal to noise ratio. Even so, we find that, up to a fixed noise level, the accuracy regimes of integer decoding and sub-integer decoding coincide. This property enables sub-integer shifts to be encoded within the states of the network, which, as we will show, results in stable, error-correcting path integration (Section 4.1). We quantify the gain in precision in terms of the bits of information that can on average be reconstructed from a vector (Figures  4D, Appendix B.2). Notably, even a moderate noise level of κ=8𝜅8\kappa=8italic_κ = 8 is sufficient to achieve nearly the same information content as in the noiseless case.

Refer to caption
Figure 4: Continuity of attractor landscape enables sub-integer decoding and path integration. A) Visualization of interpolation between two integer states. The position of the fractional value can be estimated by fitting a periodic sinc function (Appendix A.1) based on the inner products with integer codebooks (visualized in dots), then finding the location of the peak. B, C) Sub-integer states can be be decoded, up to a precision set by the noise level. Note that in both cases, sub-integer decoding can be just as accurate as integer decoding for the same range, even though the sub-integer decoding problem is strictly harder. Even κ=4𝜅4\kappa=4italic_κ = 4 is sufficient to achieve accuracy within a precision of Δx=0.07Δ𝑥0.07\Delta x=0.07roman_Δ italic_x = 0.07, but for higher noise (κ=2𝜅2\kappa=2italic_κ = 2), the precision is worse. D) The best spatial precision (in bits) that can be decoded for a fixed noise level. Less noise achieves both a higher coding range and higher information content per vector.

3.5 Triangular frames in 2D maximize spatial information

In two-dimensional open field environments, grid cells have firing fields arranged in a hexagonal lattice [hafting2005microstructure]. Work in theoretical neuroscience shows the optimality of this lattice for 2D environments in terms of spatial information [wei2015principle, mathis2015probable, anselmi2020computational]. However, the presence of hexagonal firing fields raises a puzzle for residue number systems. Although a crucial property of a RNS is the carry-free property, most implementations of RNS will not perform carry-free updates within a module in non-Cartesian coordinate systems. This generally occurs because the updates of different coordinates must interact due to non-orthogonality.

We resolve this issue by showing how to implement a version of vector binding of multiple coordinates in a triangular ‘Mercedes-Benz’ frame that enables carry-free hexagonal coding. Furthermore, we provide a combinatoric argument for the optimality of triangular frames for 2.superscript2\mathbb{R}^{2}.blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (A frame is a spanning set for a vector space in which the basis vectors need not be linearly independent.) Our argument relies on the combinatorics of residue numbers, and so for the first time gives an explanation of why the coexistence of RNS and hexagonal codes is optimal.

To form a hexagonal tiling of 2D position requires two steps: first, projection into a 3333-coordinate frame, and second, choosing phases such that simultaneous, equal movements along all three frames cancel out (Appendix A.3). The resulting Voronoi tessellation for different states is pictured in Figure 5A. This encoding enables higher spatial resolution in terms of the number of discrete states: 3m23m+13superscript𝑚23𝑚13m^{2}-3m+13 italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 3 italic_m + 1 for triangular frames, versus m2superscript𝑚2m^{2}italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT for Cartesian frames. This increased expressivity results in a higher entropy) code for space (Figure 5B). It also results in both a periodic hexagonal kernel and the individual grid response fields being arranged in a hexagonal lattice (Figure 6C).

Refer to caption
Figure 5: Hexagonal coding improves spatial resolution. A) Voronoi tessellation for m=5𝑚5m=5italic_m = 5. Each distinct color corresponds to a unique codeword in Dsuperscript𝐷\mathbb{C}^{D}blackboard_C start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT. Black arrows show the coordinate axes of the triangular ‘Mercedes-Benz’ frame in 2D. B) Hexagonal lattices have higher entropy than square lattices, allowing each state to carry higher resolution in its spatial output.

Prior models achieved hexagonal lattices either by circularly symmetric receptive fields (e.g., [fuhs2006spin, burak2009accurate]) arranged on a periodic rectangular sheet or by distorting a square lattice into an oblique one (e.g., [chandra2023high, mosheiff2019velocity]). Importantly, oblique lattices have the combinatorial complexity as the square grid and, unlike the construction described above, they do not achieve the same level of spatial resolution (Figure 5B).

4 Testing functionalities of the model

4.1 Robust path integration

Refer to caption
Figure 6: Velocity shift mechanism enables robust path integration. A) Example of path integration of a 2D trajectory in the case of intrinsic input noise on the place cell representation. The grid cell modules correct the noise that would otherwise induce drift after a short period of time. B) Path integration results averaged over multiple trajectories in the case of intrinsic input noise on the place cell representation. Grid cell modules limit noise accumulation along the trajectory. Solid lines report the median error over 100100100100 trials, with shaded intervals reporting 25thsuperscript25th25^{\text{th}}25 start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT and 75thsuperscript75th75^{\text{th}}75 start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT percentile. C) Simulated trajectory, along which colors represent the similarity between the 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of different modules and vectors representing each position in the environment. We see hexagonal response fields, similar to those obtained from single unit recordings of MEC. D) Sensory patterns (symbolized by red dots), representing visual cues, are associated to positions in the environment. Presentation of visual cues helps correct drifted positions due to extrinsic noise.

Given the ability of the attractor model to update its representation of position from velocity inputs, along with its ability to represent continuous space, we evaluate its ability to perform path integration in the presence of noise. We simulate trajectories based on a statistical model for generating plausible rodent movements in an arena [raudies2012modeling, banino2018vector], and we update grid cell and place cell state vectors according to Equations 7 and 8, respectively.

To evaluate the robustness of the model to error (Appendix B.3), we consider both sources of extrinsic noise (e.g., mis-representations of velocity information), and intrinsic noise (e.g., due to noise in weight updates). The robustness of our model to intrinsic noise is tested by comparing our results to the estimated trajectories obtained without the correction by the MEC modules (Figure 6 A and B). We find that our model strongly limits noise accumulation along the trajectory and allows highly accurate integration for a longer period of time (Figure 6A). Consistent with our previous experiments on noise robustness (Figure 3), we find strong robustness to intrinsic noise, and that extrinsic noise results in progressive drift of estimated position.

We visualize the response fields in different modules and find hexagonal lattices with a module dependent scaling (Figure 6C, Appendix 4.1). In addition, we show that tethering to external cues (e.g., visual inputs), can significantly increase the accuracy of the attractor network. To study this, we associate visual cues to corresponding patches see Section 4.2) and observe that integration of information from sensory visual inputs succeeds in correcting drift due to extrinsic noise (Figure 6D).

4.2 Denoising sensory states via a heteroassociative memory

Finally, we describe a simple extension to our model, in which sensory patterns are fed from the lateral entorhinal cortex (LEC) to update the hippocampal state. This is consistent with theories of memory suggesting that LEC provides the content of experiences to hippocampus [manns2006evolution], as well as neuroanatomical evidence [knierim2014functional]. Although the structure of the representations of those sensory patterns is unknown, it is theorized that HF is critical to sensory pattern completion [teyler1986hippocampal].

Consistent with this function, recent work [sharma2022content, chandra2023high] has proposed that a heteroassociative scaffold connects sensory patterns to hippocampal activity, allowing robust denoising of sensory states. Though the main focus of our normative model is not sensory denoising, we show that a simple extension to our model (Appendix B.4) robustly retrieves noisy pattern even under high levels of corruption (Figures  7A and B). In Appendix C.3, we also discuss how this capacity for generalization can serve as a model for sequence retrieval, showing some preliminary experiments.

In addition to robust denoising of single patterns, our model is also well-equipped to deal with compositions of sensory patterns. Two situations are worth emphasizing: first, we can often unmix multiple sensory states corresponding to a sum of patterns, because the compositional structure of binding between grid modules “protects” the items in summation (Figure 7C). This differentiates our model from other heteroassociative memories, in which sums of patterns would have multiple equally valid yet incompatible decodings. Second, the context vector modules allow preservation of different sensory information for different environments (Figure S3).

Refer to caption
Figure 7: Heteroassociation enables recovery of sensory patterns under noise and superposition. A) Accuracy for denoising 60606060 different random binary patterns for different vector dimension D𝐷Ditalic_D. The dotted line is the average similarity between the decoded and ground truth patterns. B) Same experiment as in panel A, but with 210210210210 different possible binary patterns. The accuracy is lower on average. C) Accuracy for denoising multiple patterns from a single input. This task is especially challenging, because sums of patterns combined in this way interfere with each other in retrieval (a phenomenon known as cross-talk noise). However, the compositional structure of our modular attractor network enables multiple patterns to be decoded with high probability.

5 Discussion

We propose a normative model of a cognitive map for the hippocampal formation in the mammalian brain. The core principle of the model is a compositional representation of space that achieves a superlinear coding range, which is expressed by a compact, multi-module attractor network. The compositional mechanism of vector binding provides generalization to multiple spatial dimensions, contextualization, and path integration. This binding mechanism builds on prior work proposed in the field of hyperdimensional computing and vector symbolic architectures [kanerva2009hyperdimensional, gayler2004vector, plate1992holographic, dumont2022model, kymn2023computing] — and goes beyond it to develop a specific algorithmic hypothesis about structured operations in HF. Our analyses and experiments confirm that the model can achieve important functions of the hippocampal formation and explains experimental observations, such as hexagonal grid cells, place cells, and remapping phenomena.

The proposed model contributes to, and greatly benefits from, existing work in theoretical neuroscience on residue number systems [fiete2008grid, sreenivasan2011grid], continuous attractor network models of grid cells [fuhs2006spin, zhang1996representation, fiete2008grid], and the optimality of hexagonal representations in 2D [mathis2012resolution, mathis2015probable]. It remains intriguing that biology organized grid cells into multiple discrete modules, rather than pooling all resources into a single module attractor network. This puzzle raises an opportunity for normative models to explain the organization of grid cells into multiple modules. More recent work has focused on the problem of coordinating representations across multiple modules [mosheiff2017efficient, mosheiff2019velocity, kang2019geometric, agmon2020theory, chandra2023high], and large scale recordings of HF [waaga2022grid] may provide new opportunities to evaluate predictions of these different ideas.

Our approach starts from principles of space encoding, in particular, the requirement of compositionality. This strategy is complimentary to, but different from, investigations of the emergence of place and grid cells in artificial neural networks (e.g., [banino2018vector, cueva2018emergence, whittington2022disentanglement, dorrell2022actionable, sorscher2023unified, schaeffer2024self, stachenfeld2017hippocampus, whittington2020tolman, chen2022predictive]). These approaches show optimality of biological response features under the model assumptions, such as ANN properties, network architecture, training objective and protocol. Here, we emphasize the role of multiplicative binding, a primitive that is typically difficult to have emerge in an ANN setting. Early suggestions for realizing conjunctive binding already ventured outside the framework of ANNs [SmolenskyTensor1990, plate1992holographic]. A simple extension of ANNs are sigma-pi neurons [FeldmanConnectionist1982, mel1989sigma] that can implement vector binding [plate2000randomly]. Recent work amplifies the view that full conjunctive binding would be a useful inductive bias to augment deep learning architectures [goyal2022inductive], and various augmentations of ANNs with dedicated binding mechanisms have been proposed [danihelka2016associative, greff2020binding, ganesan2021learning, smolensky2022neurocompositional].

Our model has obvious limitations. Our attractor model for the cognitive map is still a high-level abstraction of spiking neural circuits in the hippocampal formation. In particular, the phasor states in the model are one linear transform removed from vectors that describe neural population activity. Thus, the mapping between model and neurobiological mechanisms is not straight-forward, a disadvantage that can be addressed by switching to other encoding schemes, such as sparse real or complex vectors, e.g., [laiho2015high], for which conjunctive binding operations have been proposed [frady2021variable]. Although the model is more comprehensive than typical normative models, which usually focus on a single computation, it is far from covering the many other functional cell types observed in the hippocampal formation or contextual modulations observed during remapping. In addition, the current model includes learning only in the heteroassociative projection to LEC. Most observations regarding plasticity in HF are not captured, i.e., signals from reward, or eligibility traces. Finally, our assumptions about inputs to HF from the sensory pathway are rather simplifying and primarily intended as a proof of concept.

The purpose of the model express the fundamental principles of a compositional cognitive map, permitting testable predictions: First, at the biophysical level, the model predicts multiplicative interactions between dendritic inputs providing the conjunctive binding operation. Though some evidence of MEC-LEC binding exists [latuske2018hippocampal], our attractor model also predicts binding between MEC modules. Second, the model predicts relatively fixed attractor weights between place and grid cells, and more plasticity from the hippocampus to sensory observations. Third, we predict that causal perturbations of one grid module can affect the states of other grid modules without involvement of the hippocampus, in a direction that is self-consistent with the update of the attractor state.

We believe that the proposed modeling approach and the specific attractor model have broader applications in neuroscience. The proposed attractor network can also model generative models in sensory systems to implement analysis by synthesis postulated in perception. Further, there is a intriguing connection between the proposed phasor models and spiking neural networks [frady2019robust], which could yield normative models with spiking neurons, potentially implementable on neuromorphic hardware at large scale that can lead to further quantitative predictions.

Acknowledgments

The work of CJK was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program. The work of SM was carried out as part of the ARPE program of ENS Paris-Saclay. The work of DK and BAO was supported in part by Intel’s THWAI program. The work of CJK and BAO was supported by the Center for the Co-Design of Cognitive Systems (CoCoSys), one of seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. DK has received funding from theEuropean Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 839179. FTS discloses support for the research of this work from NIH grant 1R01EB026955-0.

\printbibliography

Supplemental material

Appendix A Mathematical derivations

A.1 Similarity-preserving properties of embeddings

In the following section, we examine the similarity-preserving properties of our coding scheme. Recall from Section 2.1 that our crucial desiderata are that: (1) distinct residue values are represented using vectors which are nearly orthogonal, and that (2) the inner-product between representations of sub-integer values are reflective of a reasonable notion of similarity between the encoded values. There is a robust literature on this topic both within the Vector Symbolic Architectures community [plate2003holographic, thomas2021theoretical, frady2022computing, clarkson2023capacity], and the broader ML community [rahimi2007random] who often study these techniques under the name “random features.” The methods pursued here are in this tradition.

To briefly recapitulate the construction of Equation 1: fix some positive integer m𝑚mitalic_m, and let P(k)𝑃𝑘P(k)italic_P ( italic_k ) denote the uniform distribution over {0,,m1}0𝑚1\{0,...,m-1\}{ 0 , … , italic_m - 1 }. Define an embedding g:D:𝑔superscript𝐷g:\mathbb{R}\to{\mathbb{C}}^{D}italic_g : blackboard_R → blackboard_C start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT using the following procedure: draw k1,,kDsubscript𝑘1subscript𝑘𝐷k_{1},...,k_{D}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT independently from P(k)𝑃𝑘P(k)italic_P ( italic_k ), and set:

g(a)j=exp(iωkj)a/D,j=1,,D,g(a)_{j}=\exp\left(i\omega k_{j}\right)^{a}/\sqrt{D},\,j=1,...,D,italic_g ( italic_a ) start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = roman_exp ( italic_i italic_ω italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT / square-root start_ARG italic_D end_ARG , italic_j = 1 , … , italic_D ,

where ω=2π/m𝜔2𝜋𝑚\omega=2\pi/mitalic_ω = 2 italic_π / italic_m, and i=1𝑖1i=\sqrt{-1}italic_i = square-root start_ARG - 1 end_ARG. To simplify analysis, we here assume that m𝑚mitalic_m is odd, in which case the above is equivalent to shifting the support of P(k)𝑃𝑘P(k)italic_P ( italic_k ) to {(m1)/2,,(m1)/2}𝑚12𝑚12\{-(m-1)/2,...,(m-1)/2\}{ - ( italic_m - 1 ) / 2 , … , ( italic_m - 1 ) / 2 }, and defining the embedding g:D:𝑔superscript𝐷g:{\mathbb{R}}\to{\mathbb{C}}^{D}italic_g : blackboard_R → blackboard_C start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT component-wise via:

g(a)j=exp(iωkja)/D,j=1,,D.formulae-sequence𝑔subscript𝑎𝑗𝑖𝜔subscript𝑘𝑗𝑎𝐷𝑗1𝐷g(a)_{j}=\exp\left(i\omega k_{j}a\right)/\sqrt{D},\,j=1,...,D.italic_g ( italic_a ) start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = roman_exp ( italic_i italic_ω italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_a ) / square-root start_ARG italic_D end_ARG , italic_j = 1 , … , italic_D .

The case that m𝑚mitalic_m is even is slightly different, but can be handled using similar techniques and the discrepancy does not affect any of our modeling goals.

Our basic claim is that in expectation with respect to randomness in the draw of k1,,kDsubscript𝑘1subscript𝑘𝐷k_{1},...,k_{D}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT, inner-products between the embeddings of two numbers a,a𝑎superscript𝑎a,a^{\prime}italic_a , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT recover the periodic sinc-function [smith2011spectral] of their difference. That is:

𝔼[𝐠(a)𝐠(a)]=sin(π(aa))msin(π(aa)/m):=psinc(aa),𝔼delimited-[]𝐠superscript𝑎top𝐠superscriptsuperscript𝑎𝜋𝑎superscript𝑎𝑚𝜋𝑎superscript𝑎𝑚assignpsinc𝑎superscript𝑎{\mathbb{E}}[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}]=\frac{\sin(\pi(a-% a^{\prime}))}{m\sin(\pi(a-a^{\prime})/m)}:=\text{psinc}(a-a^{\prime}),blackboard_E [ bold_g ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ] = divide start_ARG roman_sin ( italic_π ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) end_ARG start_ARG italic_m roman_sin ( italic_π ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) / italic_m ) end_ARG := psinc ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ,

This accomplishes goal (1) because, for t𝑡titalic_t an integer which is not an integer multiple of m𝑚mitalic_m, psinc(t)=0psinc(t)0\text{psinc(t)}=0psinc(t) = 0. Therefore, distinct integers are represented using vectors which are, in expectation, orthogonal. It also accomplishes goal (2), because psinc(t)1psinc(t)1\text{psinc(t)}\approx 1psinc(t) ≈ 1 for 0<|t|10𝑡much-less-than10<|t|\ll 10 < | italic_t | ≪ 1. The following theorem demonstrates this property more formally, and provides an approximation guarantee for a specific instantiation of k1,,kDsubscript𝑘1subscript𝑘𝐷k_{1},...,k_{D}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT.

Theorem 1.

Fix any D>0𝐷0D>0italic_D > 0 and δ(0,1)𝛿01\delta\in(0,1)italic_δ ∈ ( 0 , 1 ). For any pair a,a𝑎superscript𝑎a,a^{\prime}\in{\mathbb{R}}italic_a , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_R such that aa𝑎superscript𝑎a-a^{\prime}italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is not an integer multiple of m𝑚mitalic_m, with probability at least 1δ1𝛿1-\delta1 - italic_δ over randomness in the draw of k1,,kDsubscript𝑘1subscript𝑘𝐷k_{1},...,k_{D}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT:

|𝐠(a)𝐠(a)sin(π(aa))msin(π(aa)/m)|2Dln2δ.𝐠superscript𝑎top𝐠superscriptsuperscript𝑎𝜋𝑎𝑎𝑚𝜋𝑎superscript𝑎𝑚2𝐷2𝛿\left|\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}-\frac{\sin(\pi(a-a))}{m% \sin(\pi(a-a^{\prime})/m)}\right|\leq\sqrt{\frac{2}{D}\ln\frac{2}{\delta}}.| bold_g ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - divide start_ARG roman_sin ( italic_π ( italic_a - italic_a ) ) end_ARG start_ARG italic_m roman_sin ( italic_π ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) / italic_m ) end_ARG | ≤ square-root start_ARG divide start_ARG 2 end_ARG start_ARG italic_D end_ARG roman_ln divide start_ARG 2 end_ARG start_ARG italic_δ end_ARG end_ARG .
Proof.

Fix any pair a,a𝑎superscript𝑎a,a^{\prime}\in{\mathbb{R}}italic_a , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_R, and denote for concision t=aa𝑡𝑎superscript𝑎t=a-a^{\prime}italic_t = italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. Taking an expectation with respect to randomness in k1,,kDsubscript𝑘1subscript𝑘𝐷k_{1},...,k_{D}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT and using a well-known calculation from the signal processing literature [smith2011spectral]:

𝔼k1,,kd[𝐠(a)𝐠(a)]subscript𝔼subscript𝑘1subscript𝑘𝑑delimited-[]𝐠superscript𝑎top𝐠superscriptsuperscript𝑎\displaystyle{\mathbb{E}}_{k_{1},...,k_{d}}\left[\mathbf{g}(a)^{\top}\mathbf{g% }(a^{\prime})^{*}\right]blackboard_E start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ bold_g ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ] =D𝔼k1[g(a)1g(a)1]absent𝐷subscript𝔼subscript𝑘1delimited-[]𝑔subscript𝑎1𝑔superscriptsubscriptsuperscript𝑎1\displaystyle=D{\mathbb{E}}_{k_{1}}[g(a)_{1}g(a^{\prime})_{1}^{*}]= italic_D blackboard_E start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_g ( italic_a ) start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ]
=1mk1=m12m12exp(iωk1(aa))absent1𝑚superscriptsubscriptsubscript𝑘1𝑚12𝑚12𝑖𝜔subscript𝑘1𝑎superscript𝑎\displaystyle=\frac{1}{m}\sum_{k_{1}=-\frac{m-1}{2}}^{\frac{m-1}{2}}\exp\left(% i\omega k_{1}(a-a^{\prime})\right)= divide start_ARG 1 end_ARG start_ARG italic_m end_ARG ∑ start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = - divide start_ARG italic_m - 1 end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT divide start_ARG italic_m - 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT roman_exp ( italic_i italic_ω italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )
=1m(exp(iωt(m1)2)exp(iωt(m+1)2)1exp(iωt))absent1𝑚𝑖𝜔𝑡𝑚12𝑖𝜔𝑡𝑚121𝑖𝜔𝑡\displaystyle=\frac{1}{m}\left(\frac{\exp\left(-\frac{i\omega t(m-1)}{2}\right% )-\exp\left(\frac{i\omega t(m+1)}{2}\right)}{1-\exp(i\omega t)}\right)= divide start_ARG 1 end_ARG start_ARG italic_m end_ARG ( divide start_ARG roman_exp ( - divide start_ARG italic_i italic_ω italic_t ( italic_m - 1 ) end_ARG start_ARG 2 end_ARG ) - roman_exp ( divide start_ARG italic_i italic_ω italic_t ( italic_m + 1 ) end_ARG start_ARG 2 end_ARG ) end_ARG start_ARG 1 - roman_exp ( italic_i italic_ω italic_t ) end_ARG )
=exp(iωt/2)mexp(iωt/2)(exp(πit)exp(πit)exp(πit/m)exp(πit/m))absent𝑖𝜔𝑡2𝑚𝑖𝜔𝑡2𝜋𝑖𝑡𝜋𝑖𝑡𝜋𝑖𝑡𝑚𝜋𝑖𝑡𝑚\displaystyle=\frac{\exp(i\omega t/2)}{m\exp(i\omega t/2)}\left(\frac{\exp(-% \pi it)-\exp(\pi it)}{\exp(-\pi it/m)-\exp(\pi it/m)}\right)= divide start_ARG roman_exp ( italic_i italic_ω italic_t / 2 ) end_ARG start_ARG italic_m roman_exp ( italic_i italic_ω italic_t / 2 ) end_ARG ( divide start_ARG roman_exp ( - italic_π italic_i italic_t ) - roman_exp ( italic_π italic_i italic_t ) end_ARG start_ARG roman_exp ( - italic_π italic_i italic_t / italic_m ) - roman_exp ( italic_π italic_i italic_t / italic_m ) end_ARG )
=sin(πt)msin(πt/m)absent𝜋𝑡𝑚𝜋𝑡𝑚\displaystyle=\frac{\sin(-\pi t)}{m\sin(-\pi t/m)}= divide start_ARG roman_sin ( - italic_π italic_t ) end_ARG start_ARG italic_m roman_sin ( - italic_π italic_t / italic_m ) end_ARG
=sin(π(aa))msin(π(aa)/m),absent𝜋𝑎superscript𝑎𝑚𝜋𝑎superscript𝑎𝑚\displaystyle=\frac{\sin(\pi(a-a^{\prime}))}{m\sin(\pi(a-a^{\prime})/m)},= divide start_ARG roman_sin ( italic_π ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) end_ARG start_ARG italic_m roman_sin ( italic_π ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) / italic_m ) end_ARG ,

The third equality follows from the second by noting that the latter is a sum of a geometric series with common ratio r=exp(ωt)𝑟𝜔𝑡r=\exp(\omega t)italic_r = roman_exp ( italic_ω italic_t ). The fifth line follows from the fourth by recalling the identity sin(x)=(eixeix)/2i𝑥superscript𝑒𝑖𝑥superscript𝑒𝑖𝑥2𝑖\sin(x)=(e^{ix}-e^{-ix})/2iroman_sin ( italic_x ) = ( italic_e start_POSTSUPERSCRIPT italic_i italic_x end_POSTSUPERSCRIPT - italic_e start_POSTSUPERSCRIPT - italic_i italic_x end_POSTSUPERSCRIPT ) / 2 italic_i. In the limit of t0𝑡0t\to 0italic_t → 0, the expression evaluates to 1111, consistent with the normalized inner product of a vector with itself.

To show concentration around this value, consider:

𝐠(a)𝐠(a)=1Dj=1Dexp(iωkj(aa)),𝐠superscript𝑎top𝐠superscriptsuperscript𝑎1𝐷superscriptsubscript𝑗1𝐷𝑖𝜔subscript𝑘𝑗𝑎superscript𝑎\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}=\frac{1}{D}\sum_{j=1}^{D}\exp(i% \omega k_{j}(a-a^{\prime})),bold_g ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_D end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT roman_exp ( italic_i italic_ω italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) ,

and note that since the complex part of the sum vanishes in expectation, we may consider, without loss of generality, the average of the real-valued quantities: (cos(ωkj(aa)))j=1Dsuperscriptsubscript𝜔subscript𝑘𝑗𝑎superscript𝑎𝑗1𝐷\left(\cos(\omega k_{j}(a-a^{\prime}))\right)_{j=1}^{D}( roman_cos ( italic_ω italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_a - italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) ) start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT, which are bounded in the range ±1plus-or-minus1\pm 1± 1. Therefore, by Hoeffding’s inequality:

Pr(|𝐠(a)𝐠(a)𝔼[𝐠(a)𝐠(a)]|ϵ)2exp(Dϵ22),Pr𝐠superscript𝑎top𝐠superscriptsuperscript𝑎𝔼delimited-[]𝐠superscript𝑎top𝐠superscriptsuperscript𝑎italic-ϵ2𝐷superscriptitalic-ϵ22{\rm Pr}\left(\left|\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}-{\mathbb{E}% }[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}]\right|\geq\epsilon\right)% \leq 2\exp\left(-\frac{D\epsilon^{2}}{2}\right),roman_Pr ( | bold_g ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - blackboard_E [ bold_g ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ] | ≥ italic_ϵ ) ≤ 2 roman_exp ( - divide start_ARG italic_D italic_ϵ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG ) ,

whereupon we conclude that, with probability at least 1δ1𝛿1-\delta1 - italic_δ over randomness in the draw of k1,,kDsubscript𝑘1subscript𝑘𝐷k_{1},...,k_{D}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT:

ϵ2Dln2δ,italic-ϵ2𝐷2𝛿\epsilon\leq\sqrt{\frac{2}{D}\ln\frac{2}{\delta}},italic_ϵ ≤ square-root start_ARG divide start_ARG 2 end_ARG start_ARG italic_D end_ARG roman_ln divide start_ARG 2 end_ARG start_ARG italic_δ end_ARG end_ARG ,

as claimed. ∎

This result can be readily extended to the binding of multiple residue number values. Let 𝐠(a)=i=1K𝐠i(a)𝐠𝑎superscriptsubscript𝑖1𝐾subscript𝐠𝑖𝑎\mathbf{g}(a)=\bigodot_{i=1}^{K}\mathbf{g}_{i}(a)bold_g ( italic_a ) = ⨀ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a ), where each 𝐠i(a)subscript𝐠𝑖𝑎\mathbf{g}_{i}(a)bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a ) is instantiated independently. Then, by independence, we observe that:

𝔼[𝐠(a)𝐠(a)]𝔼delimited-[]𝐠superscript𝑎top𝐠superscriptsuperscript𝑎\displaystyle{\mathbb{E}}\left[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}\right]blackboard_E [ bold_g ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ] =𝔼[i=1K𝐠i(a)𝐠i(a)]absent𝔼delimited-[]superscriptsubscriptproduct𝑖1𝐾subscript𝐠𝑖superscript𝑎topsubscript𝐠𝑖superscriptsuperscript𝑎\displaystyle={\mathbb{E}}\left[\prod_{i=1}^{K}\mathbf{g}_{i}(a)^{\top}\mathbf% {g}_{i}(a^{\prime})^{*}\right]= blackboard_E [ ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ]
=i=1K𝔼[𝐠i(a)𝐠i(a)]absentsuperscriptsubscriptproduct𝑖1𝐾𝔼delimited-[]subscript𝐠𝑖superscript𝑎topsubscript𝐠𝑖superscriptsuperscript𝑎\displaystyle=\prod_{i=1}^{K}{\mathbb{E}}\left[\mathbf{g}_{i}(a)^{\top}\mathbf% {g}_{i}(a^{\prime})^{*}\right]= ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E [ bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ]

The implication is that 𝔼[𝐠(a)𝐠(a)]=1𝔼delimited-[]𝐠superscript𝑎top𝐠superscriptsuperscript𝑎1{\mathbb{E}}[\mathbf{g}(a)^{\top}\mathbf{g}(a^{\prime})^{*}]=1blackboard_E [ bold_g ( italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_g ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ] = 1 if and only if all residue values agree, and zero otherwise. To show concentration around this value, we can again use Hoeffding’s inequality, which recovers the same bound on the sufficient dimension.

A.2 Information-theoretic estimate of required pattern dimension

In this section, we describe an information-theoretic estimate on the dimension D𝐷Ditalic_D necessary to retrieve n𝑛nitalic_n patterns within K𝐾Kitalic_K modules. The main result we aim to show is that D=𝒪(n(K1)/K)𝐷𝒪superscript𝑛𝐾1𝐾D=\mathcal{O}(n^{(K-1)/K})italic_D = caligraphic_O ( italic_n start_POSTSUPERSCRIPT ( italic_K - 1 ) / italic_K end_POSTSUPERSCRIPT ); equivalently, the scaling of n𝑛nitalic_n for a given D𝐷Ditalic_D is 𝒪(DK/(K1))𝒪superscript𝐷𝐾𝐾1\mathcal{O}(D^{K/(K-1)})caligraphic_O ( italic_D start_POSTSUPERSCRIPT italic_K / ( italic_K - 1 ) end_POSTSUPERSCRIPT ). This scaling roughly predicts our empirical results of finding the dimension required to achieve high accuracy, suggesting that the attractor network described here performs close to the theoretical bound.

The minimal total amount of information a network needs to store for denoising an RNS representation with coding range M𝑀Mitalic_M is 𝒪(Mlog(M))𝒪𝑀log𝑀\mathcal{O}(M\ \text{log}(M))caligraphic_O ( italic_M log ( italic_M ) ). This results from the requirement of content addressability, i.e., for serving as a unique pointer to one of n𝑛nitalic_n patterns, each pattern must at least carry information of the order of 𝒪(log(M))𝒪log𝑀\mathcal{O}(\text{log}(M))caligraphic_O ( log ( italic_M ) ). For simplicity, we now assume that each module is of size 𝒪(M1/K)𝒪superscript𝑀1𝐾\mathcal{O}(M^{1/K})caligraphic_O ( italic_M start_POSTSUPERSCRIPT 1 / italic_K end_POSTSUPERSCRIPT ). The total capacity of the network is bounded by the number of synapses, which is 𝒪(DKM1/K)=𝒪(DM1/K)𝒪𝐷𝐾superscript𝑀1𝐾𝒪𝐷superscript𝑀1𝐾\mathcal{O}(D*K*M^{1/K})=\mathcal{O}(D*M^{1/K})caligraphic_O ( italic_D ∗ italic_K ∗ italic_M start_POSTSUPERSCRIPT 1 / italic_K end_POSTSUPERSCRIPT ) = caligraphic_O ( italic_D ∗ italic_M start_POSTSUPERSCRIPT 1 / italic_K end_POSTSUPERSCRIPT ) (assuming K𝐾Kitalic_K is constant), times the capacity per synapse. Under the conservative assumption that the capacity per synapse is 𝒪(1)𝒪1\mathcal{O}(1)caligraphic_O ( 1 ), the dimension is of order 𝒪(eK1Klog(M)+log(log(M)))𝒪superscript𝑒𝐾1𝐾𝑀𝑀\mathcal{O}(e^{\frac{K-1}{K}\log{(M)}+\log{(\log{(M)})}})caligraphic_O ( italic_e start_POSTSUPERSCRIPT divide start_ARG italic_K - 1 end_ARG start_ARG italic_K end_ARG roman_log ( italic_M ) + roman_log ( roman_log ( italic_M ) ) end_POSTSUPERSCRIPT ). Thus, the leading order of how D𝐷Ditalic_D depends on n𝑛nitalic_n is 𝒪(M(K1)/K)𝒪superscript𝑀𝐾1𝐾\mathcal{O}(M^{(K-1)/K})caligraphic_O ( italic_M start_POSTSUPERSCRIPT ( italic_K - 1 ) / italic_K end_POSTSUPERSCRIPT ). If the capacity per synapse is assumed to be larger, O(log(M))𝑂𝑀O(\log{(M)})italic_O ( roman_log ( italic_M ) ) bits, only the non-leading term cancels and the resulting order of D𝐷Ditalic_D is still the same.

A.3 Construction of triangular frames

In order to convert a 2D2𝐷2D2 italic_D coordinate 𝐱𝐱\mathbf{x}bold_x into a 3D3𝐷3D3 italic_D frame 𝐲𝐲\mathbf{y}bold_y, we first multiply it by a matrix, ΨΨ\Psiroman_Ψ whose rows are the elements of a 3D3𝐷3D3 italic_D equiangular frame:

𝐲=[1/31/31/31/302/3]𝐱𝐲matrix13131313023𝐱\mathbf{y}=\begin{bmatrix}-1/\sqrt{3}&-1/3\\ 1/\sqrt{3}&-1/3\\ 0&2/3\end{bmatrix}\mathbf{x}bold_y = [ start_ARG start_ROW start_CELL - 1 / square-root start_ARG 3 end_ARG end_CELL start_CELL - 1 / 3 end_CELL end_ROW start_ROW start_CELL 1 / square-root start_ARG 3 end_ARG end_CELL start_CELL - 1 / 3 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 2 / 3 end_CELL end_ROW end_ARG ] bold_x (S1)

(This particular frame is commonly referred to as a ‘Mercedes Benz’ frame due to its resemblance to the iconic symbol.) A consequence of working with an overcomplete frame is that there may exist multiple values of 𝐲𝐲\mathbf{y}bold_y that correspond to the same 𝐱𝐱\mathbf{x}bold_x. For this frame, the null space of Ψ+superscriptΨ\Psi^{+}roman_Ψ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT is the subspace spanned by [1,1,1]superscript111{[1,1,1]^{\intercal}}[ 1 , 1 , 1 ] start_POSTSUPERSCRIPT ⊺ end_POSTSUPERSCRIPT – grounding the intuition that equal movement in all equiangular directions “cancels out.” It therefore might seem that triangular frames require extra operations to determine if two coordinates are equal, but here we show how to avoid this consequence.

The core strategy is to choose seed vectors 𝐠i,1,𝐠i,2,𝐠i,3subscript𝐠𝑖1subscript𝐠𝑖2subscript𝐠𝑖3\mathbf{g}_{i,1},\mathbf{g}_{i,2},\mathbf{g}_{i,3}bold_g start_POSTSUBSCRIPT italic_i , 1 end_POSTSUBSCRIPT , bold_g start_POSTSUBSCRIPT italic_i , 2 end_POSTSUBSCRIPT , bold_g start_POSTSUBSCRIPT italic_i , 3 end_POSTSUBSCRIPT for each modulus misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT that implement this self-cancellation. For a modulus misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we draw the phasors of seed vectors from the m𝑚mitalic_m-th roots of unity. However, we further require that, for each vector component, the three selected phases sum to 0(mod 2π)0mod2𝜋0\ (\text{mod}\ 2\pi)0 ( mod 2 italic_π ). We then form a hexagonal coordinate vector by binding the three seed vectors:

𝐠i=𝐠i,1𝐠i,2𝐠i,3subscript𝐠𝑖direct-productsubscript𝐠𝑖1subscript𝐠𝑖2subscript𝐠𝑖3\mathbf{g}_{i}=\mathbf{g}_{i,1}\odot\mathbf{g}_{i,2}\odot\mathbf{g}_{i,3}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = bold_g start_POSTSUBSCRIPT italic_i , 1 end_POSTSUBSCRIPT ⊙ bold_g start_POSTSUBSCRIPT italic_i , 2 end_POSTSUBSCRIPT ⊙ bold_g start_POSTSUBSCRIPT italic_i , 3 end_POSTSUBSCRIPT (S2)

By enforcing that the phases sum to 0(mod 2π)0mod2𝜋0\ (\text{mod}\ 2\pi)0 ( mod 2 italic_π ), we ensure that positions that have an equivalent 𝐱𝐱\mathbf{x}bold_x coordinate are mapped to the same 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Observe that Hadamard product binding of phasors is equivalent to summing their phases, and that binding e0isuperscript𝑒0𝑖e^{0i}italic_e start_POSTSUPERSCRIPT 0 italic_i end_POSTSUPERSCRIPT corresponds to adding nothing. Hence, a pair of three-dimensional coordinates whose differences are a multiple of [1,1,1]111[1,1,1][ 1 , 1 , 1 ] will be mapped to equivalent vector representations. Finally, we then form the residue number representation for different moduli by binding, as in Eq. 2. The presence of multiple modules and self-cancellation properties complement prior work on the efficiency of hexagonal kernels for spatial navigation tasks [KomerNavigation2020, komer2020biologically].

The equivalence of certain 3D coordinates also helps us count the number of states. Clearly, the redundancy means that we have less than m3superscript𝑚3m^{3}italic_m start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT states, but it also shows us that every position in the hexagonal grid can be represented by a 3D coordinate which contains at least one coordinate equivalent to 00. There is one state where all coordinates are 00, 3(m1)3𝑚13(m-1)3 ( italic_m - 1 ) states where exactly two coordinates are 0, and 3(m1)23superscript𝑚123(m-1)^{2}3 ( italic_m - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT states where exactly one coordinate is zero. Thus, there are 3m23m+13superscript𝑚23𝑚13m^{2}-3m+13 italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 3 italic_m + 1 states for the hexagonal lattice, compared to the m2superscript𝑚2m^{2}italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT states for the square lattice.

In the case of square lattices in 2D2𝐷2D2 italic_D, all states occupy an equal proportion of space; however, this is not the case for the hexagonal lattice (see Figure 5A). This is because states with more zero-valued coordinates occur slightly more frequently. To estimate the effect of unequal proportions on the entropy, we directly calculate the Shannon entropy of hexagonal lattices for finite size spatial grids of increasing radius l𝑙litalic_l, as an approximation to the infinite lattice. We find that even for l=1000,m>7formulae-sequence𝑙1000𝑚7l=1000,m>7italic_l = 1000 , italic_m > 7 the hexagonal code has 99999999 percent of the entropy of a system that divided all possibilities equally, and that this gap decreases as m𝑚mitalic_m grows larger. Thus asymptotically, as m𝑚m\to\inftyitalic_m → ∞, the ratio of entropy for hexagonal vs. square grids tends towards log2(3)subscriptlog23\text{log}_{2}(3)log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( 3 ).

Appendix B Experimental details

All experiments were implemented in Python involving standard packages for scientific computing (including NumPy, SciPy, Matplotlib). We describe here the parameters and training setup of our experiments in further detail.

B.1 Scaling in dimension

For each number of moduli, K𝐾Kitalic_K, we seek to find the smallest dimension D𝐷Ditalic_D for which our attractor model factorizes its input, p, into the correct grid states in a fixed time (50505050 iterations) with high probability (at least 99999999 percent empirically). In instances where the network states remain similar over time (at least 0.950.950.950.95 cosine similarity), we consider that it converged to a fixed point. If such convergence did not occur, we evaluate the accuracy at the last time step.

To evaluate scaling, we first choose our base moduli to be a set of K𝐾Kitalic_K consecutive primes. We randomly select one of M𝑀Mitalic_M random numbers to serve as the input and set the grid states to be random. We then evaluate a candidate dimension on the factorization task for a set number of trials (200200200200) and check accuracy. We compare accuracy by considering whether the amplitude of the complex-valued inner products are highest for the true factor. If the accuracy is above our threshold, we then evaluate performance of a slightly higher dimension (dimensions evaluated are spaced apart on a logarithmic scale). Once a sufficiently high dimension achieves the accuracy threshold, we assume that the scaling is non-decreasing and use the last successful dimension as the first try.

Finally, we fit linear regression to all data points on a log-log scale to estimate the scaling between dimension and problem size. We report the slopes to estimate the scaling coefficients.

B.2 Error correction

General experimental setup. We fix in advance the vector dimension, noise level (determined by 1/κ1𝜅1/\kappa1 / italic_κ), and number of moduli. Given these parameters, we estimate the empirical accuracy of factorization on an arbitrary input known to correspond to one of the patterns. We use the same method for checking convergence as above, though we increase the maximum number of iterations to 100100100100. For all experiments in this section, we average over 1,00010001{,}0001 , 000 trials.

In the case of input noise, the vector 𝐩𝐩\mathbf{p}bold_p is multiplied by a noise vector. In the case of update noise, after every time step, each module of the attractor network is corrupted by a von Mises noise update. In the case of codebook noise, all codebooks are corrupted before the start of any iterations.

Decoding values between integers. In order to test the ability of the modular attractor network to decode at sub-integer resolution, we fix a spatial resolution ΔxΔ𝑥\Delta xroman_Δ italic_x to decode from. In our experiments, we test Δx={1/3,1/7,1/15,1/31}Δ𝑥1317115131\Delta x=\{1/3,1/7,1/15,1/31\}roman_Δ italic_x = { 1 / 3 , 1 / 7 , 1 / 15 , 1 / 31 }, and we also report Δx=1Δ𝑥1\Delta x=1roman_Δ italic_x = 1 (integer decoding) as a control. Then, using as input a random integer and random multiple of ΔxΔ𝑥\Delta xroman_Δ italic_x, we let the modules of the attractor network settle until convergence (as in other experiments). To evaluate accuracy, we test if the resulting output of the attractor network, i𝐠^i=1K(t)subscriptdirect-product𝑖absentsuperscriptsubscript^𝐠𝑖1𝐾𝑡\odot_{i}\mathbf{\hat{g}}_{i=1}^{K}(t)⊙ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_t ), is closer to the ground truth RNS representation than to any other value. We test this with a “coarse-to-fine” approach: first checking if it is within an integer, and then checking all fractional values within one of that integer. We regard the output as correct if both the integer and fraction match, and incorrect otherwise.

Estimation of information content from a vector. To measure the total resolution of our coding scheme in bits, we factor in both the number of states distinguished (τ=MΔx𝜏𝑀Δ𝑥\tau=\frac{M}{\Delta x}italic_τ = divide start_ARG italic_M end_ARG start_ARG roman_Δ italic_x end_ARG and the empirical accuracy (ρ𝜌\rhoitalic_ρ). To quantitatively estimate this, we report the information decoded in bits according to the following equation [frady2018theory, kleyko2023efficient]:

I(τ,ρ)=alog2(τρ)+(1ρ)log2(ττ1(1ρ)).𝐼𝜏𝜌𝑎subscript2𝜏𝜌1𝜌subscript2𝜏𝜏11𝜌\begin{split}I(\tau,\rho)=&a\log_{2}(\tau\rho)+(1-\rho)\log_{2}\left(\frac{% \tau}{\tau-1}(1-\rho)\right).\end{split}start_ROW start_CELL italic_I ( italic_τ , italic_ρ ) = end_CELL start_CELL italic_a roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_τ italic_ρ ) + ( 1 - italic_ρ ) roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_τ end_ARG start_ARG italic_τ - 1 end_ARG ( 1 - italic_ρ ) ) . end_CELL end_ROW (S3)

A consequence of this equation is that the information decoded is 00 when the empirical accuracy is at chance (1/τ1𝜏1/\tau1 / italic_τ).

B.3 Path integration

General experimental setup. We generate paths using a statistical model simulating rodent two-dimensional trajectories in a 50 cm2superscriptcm2\text{cm}^{2}cm start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT closed square environment [raudies2012modeling, banino2018vector], with Δt=100Δ𝑡100\Delta t=100roman_Δ italic_t = 100 ms. The path integration method starts from the ground truth first position (x0,y0)subscript𝑥0subscript𝑦0(x_{0},y_{0})( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) which is converted to hexagonal coordinates (a0,b0,c0)subscript𝑎0subscript𝑏0subscript𝑐0(a_{0},b_{0},c_{0})( italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) (see Section A.3) and encoded as an RNS representation 𝐩(0)𝐩0\mathbf{p}(0)bold_p ( 0 ) of dimension D=3,000𝐷3000D=3{,}000italic_D = 3 , 000 following the method in Section 2.1, for moduli {3,5,7}357\{3,5,7\}{ 3 , 5 , 7 }. We then factorize 𝐩(0)𝐩0\mathbf{p}(0)bold_p ( 0 ) into {𝐠^𝐢(0)}i=1Ksuperscriptsubscriptsubscript^𝐠𝐢0𝑖1𝐾\{\mathbf{\hat{g}_{i}}(0)\}_{i=1}^{K}{ over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT ( 0 ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT to produce the estimated representation 𝐩^(0)=i=1K𝐠^𝐢(0)^𝐩0superscriptsubscript𝑖1𝐾subscript^𝐠𝐢0\mathbf{\hat{p}}(0)=\bigodot_{i=1}^{K}\mathbf{\hat{g}_{i}}(0)over^ start_ARG bold_p end_ARG ( 0 ) = ⨀ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT ( 0 ).

At each time step t0𝑡0t\geq 0italic_t ≥ 0, we aim at estimating the position (xt+1,yt+1)subscript𝑥𝑡1subscript𝑦𝑡1(x_{t+1},y_{t+1})( italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ). We give the modular attractor network as input the previous position vector estimate 𝐩^(t)^𝐩𝑡\mathbf{\hat{p}}(t)over^ start_ARG bold_p end_ARG ( italic_t ). It is factorized into the residue components {𝐠^𝐣(t)}j=1Ksuperscriptsubscriptsubscript^𝐠𝐣𝑡𝑗1𝐾\{\mathbf{\hat{g}_{j}}(t)\}_{j=1}^{K}{ over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT bold_j end_POSTSUBSCRIPT ( italic_t ) } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT that are then shifted according to the velocity (dat,dbt,dct)𝑑subscript𝑎𝑡𝑑subscript𝑏𝑡𝑑subscript𝑐𝑡(da_{t},db_{t},dc_{t})( italic_d italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_d italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_d italic_c start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) between (at,bt,ct)subscript𝑎𝑡subscript𝑏𝑡subscript𝑐𝑡(a_{t},b_{t},c_{t})( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) and (at+1,bt+1,ct+1)subscript𝑎𝑡1subscript𝑏𝑡1subscript𝑐𝑡1(a_{t+1},b_{t+1},c_{t+1})( italic_a start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ). Namely, for each residue module, we build a velocity vector 𝐪j(t)=𝐠j,1(da(t))𝐠j,2(db(t))𝐠j,3(dc(t))subscript𝐪𝑗𝑡direct-productdirect-productsubscript𝐠𝑗1𝑑𝑎𝑡subscript𝐠𝑗2𝑑𝑏𝑡subscript𝐠𝑗3𝑑𝑐𝑡\mathbf{q}_{j}(t)=\mathbf{g}_{j,1}(da(t))\odot\mathbf{g}_{j,2}(db(t))\odot% \mathbf{g}_{j,3}(dc(t))bold_q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_t ) = bold_g start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT ( italic_d italic_a ( italic_t ) ) ⊙ bold_g start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT ( italic_d italic_b ( italic_t ) ) ⊙ bold_g start_POSTSUBSCRIPT italic_j , 3 end_POSTSUBSCRIPT ( italic_d italic_c ( italic_t ) ) that is binded to each residue component 𝐠^𝐣(t)subscript^𝐠𝐣𝑡\mathbf{\hat{g}_{j}}(t)over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT bold_j end_POSTSUBSCRIPT ( italic_t ). The estimated position vector is then the binding of the shifted estimated residue components: 𝐩^(t+1)=j=1K𝐠^𝐣(t)𝐪j(t)^𝐩𝑡1superscriptsubscript𝑗1𝐾direct-productsubscript^𝐠𝐣𝑡subscript𝐪𝑗𝑡\mathbf{\hat{p}}(t+1)=\bigodot_{j=1}^{K}\mathbf{\hat{g}_{j}}(t)\odot\mathbf{q}% _{j}(t)over^ start_ARG bold_p end_ARG ( italic_t + 1 ) = ⨀ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT bold_j end_POSTSUBSCRIPT ( italic_t ) ⊙ bold_q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_t ). The estimated position (x^t+1,y^t+1)subscript^𝑥𝑡1subscript^𝑦𝑡1(\hat{x}_{t+1},\hat{y}_{t+1})( over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) is chosen to be the position (x,y)𝑥𝑦(x,y)( italic_x , italic_y ) in a grid of 50×50505050\times 5050 × 50 positions mapping the entire environment, corresponding to the highest similarity between 𝐩(x,y)𝐩𝑥𝑦\mathbf{p}(x,y)bold_p ( italic_x , italic_y ) and 𝐩^(t+1)^𝐩𝑡1\mathbf{\hat{p}}(t+1)over^ start_ARG bold_p end_ARG ( italic_t + 1 ).

We show the robustness of the path integration dynamics to two different sources of noise. In the case of extrinsic noise (Figure 6D), the hexagonal velocity is corrupted by additive Gaussian noise of variance 0.120.120.120.12. In the case of intrinsic noise (Figures 6A and B), the position vector p~tsubscript~𝑝𝑡\tilde{p}_{t}over~ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is corrupted by binding with a vector sampled from a von Mises distribution with concentration parameter κ=2𝜅2\kappa=2italic_κ = 2.

Response field visualization. Given a moduli misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and a vector 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we visualize its response field by computing the similarity of the modular attractor output 𝐠^i(t)subscript^𝐠𝑖𝑡\mathbf{\hat{g}}_{i}(t)over^ start_ARG bold_g end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) and 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT along a trajectory. The periodicity in the distribution of random weights and the hexagonal coordinates produce periodic hexagonal receptive fields whose scale depends on misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The receptive fields of a given moduli are translations of one another, because the inner product between vector states induces a translation-invariant kernel.

Connection to sensory cues. Sensory cues are random binary vectors of size Ns=Dsubscript𝑁𝑠𝐷N_{s}=Ditalic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = italic_D that are associated with positions along the trajectory. When the true trajectory reaches a sensory cue, the hippocampal state 𝐩^𝐭subscript^𝐩𝐭\mathbf{\hat{p}_{t}}over^ start_ARG bold_p end_ARG start_POSTSUBSCRIPT bold_t end_POSTSUBSCRIPT is updated using the heteroassociation method described in Appendix B.4

B.4 Heteroassociation

General experimental setup. We evaluate our model’s performance for pattern denoising using a heteroassociative learning rule [sharma2022content, chandra2023high]. We consider random binary patterns of size Ns=Dsubscript𝑁𝑠𝐷N_{s}={D}italic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = italic_D. We corrupt the patterns by randomly flipping bits with probability pflip[0,0.5]subscript𝑝flip00.5p_{\mathrm{flip}}\in[0,0.5]italic_p start_POSTSUBSCRIPT roman_flip end_POSTSUBSCRIPT ∈ [ 0 , 0.5 ] and associate them to place cell representations using heteroassociation with a pseudo-inverse learning rule. Let 𝐒Ns×M𝐒superscriptsubscript𝑁𝑠𝑀\mathbf{S}\in{\mathbb{R}}^{N_{s}\times M}bold_S ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT be the matrix of M𝑀Mitalic_M patterns to hook to the scaffold and 𝐇M×D𝐇superscript𝑀𝐷\mathbf{H}\in{\mathbb{C}}^{M\times D}bold_H ∈ blackboard_C start_POSTSUPERSCRIPT italic_M × italic_D end_POSTSUPERSCRIPT the matrix of M𝑀Mitalic_M position vectors on which to hook the patterns. We associate pattern 𝐬𝐬\mathbf{s}bold_s to a place cell representation 𝐩=𝐇𝐒+𝐬𝐩superscript𝐇𝐒𝐬\mathbf{p}=\mathbf{H}\mathbf{S}^{+}\mathbf{s}bold_p = bold_HS start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT bold_s, where 𝐒+superscript𝐒\mathbf{S}^{+}bold_S start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT is the pseudo-inverse of 𝐒𝐒\mathbf{S}bold_S. The model returns a denoised place cell representation 𝐩^^𝐩\mathbf{\hat{p}}over^ start_ARG bold_p end_ARG from which we can estimate a denoised pattern by inverting the heteroassociation projection 𝐬^=sgn(𝐒𝐇+𝐩^)^𝐬sgnsuperscript𝐒𝐇^𝐩\mathbf{\hat{s}}=\mathrm{sgn}\left(\mathbf{S}\mathbf{H}^{+}\mathbf{\hat{p}}\right)over^ start_ARG bold_s end_ARG = roman_sgn ( bold_SH start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT over^ start_ARG bold_p end_ARG ).

Scaling to dimensionality. We evaluate the impact the dimension D𝐷Ditalic_D has on the denoising performance in Figure 7, for a number of stored patterns M=60(in this case, 3×4×5M=60\ (\text{in this case, }3\times 4\times 5italic_M = 60 ( in this case, 3 × 4 × 5) and 210(in this case, 5×6×7)210in this case, 567210\ (\text{in this case, }5\times 6\times 7)210 ( in this case, 5 × 6 × 7 ). For each dimension D{256,512,1024,2048}𝐷25651210242048D\in\{256,512,1024,2048\}italic_D ∈ { 256 , 512 , 1024 , 2048 }, we show the evolution of accuracy for different levels of corruption. For a given dimension D𝐷Ditalic_D and noise level pflipsubscript𝑝flipp_{\mathrm{flip}}italic_p start_POSTSUBSCRIPT roman_flip end_POSTSUBSCRIPT, we denoise a pattern and consider that the denoising is correct if the denoised pattern is closest to the ground truth pattern (in terms of cosine similarity). We repeat over 500 trials and report the accuracy as well as the average similarity (normalized inner product) between the denoised pattern and its noiseless version.

Superposition of patterns. We show that our model can denoise a superposition of npsubscript𝑛𝑝n_{p}italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT patterns one at a time, for np{1,2,3,4,5,10}subscript𝑛𝑝1234510n_{p}\in\{1,2,3,4,5,10\}italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ∈ { 1 , 2 , 3 , 4 , 5 , 10 }. We fix the dimension D𝐷Ditalic_D to 2,00020002{,}0002 , 000 and for different values of bit flip probability pflip[0,,0.5]subscript𝑝flip00.5p_{\text{flip}}\in[0,...,0.5]italic_p start_POSTSUBSCRIPT flip end_POSTSUBSCRIPT ∈ [ 0 , … , 0.5 ], we run the model on a superposition 𝐬𝐬\mathbf{s}bold_s of random binary patterns {𝐬1,,𝐬np}subscript𝐬1subscript𝐬subscript𝑛𝑝\{\mathbf{s}_{1},...,\mathbf{s}_{n_{p}}\}{ bold_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_s start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUBSCRIPT } of size Ns=2,000subscript𝑁𝑠2000N_{s}=2{,}000italic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = 2 , 000: 𝐬=𝐬1+.+𝐬npformulae-sequence𝐬subscript𝐬1subscript𝐬subscript𝑛𝑝\mathbf{s}=\mathbf{s}_{1}+....+\mathbf{s}_{n_{p}}bold_s = bold_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + … . + bold_s start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUBSCRIPT. We run the model npsubscript𝑛𝑝n_{p}italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT times and between each run the denoised pattern is explained away from the superposition [kymn2024compositional]. Namely, for run r{1,,np1}𝑟1subscript𝑛𝑝1r\in\{1,...,n_{p}-1\}italic_r ∈ { 1 , … , italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - 1 } we denote 𝐬^(r)^𝐬𝑟\mathbf{\hat{s}}(r)over^ start_ARG bold_s end_ARG ( italic_r ) the denoised pattern. The input to run r+1𝑟1r+1italic_r + 1 is then 𝐬(r+1)=𝐬(r)𝐬^(r)𝐬𝑟1𝐬𝑟^𝐬𝑟\mathbf{s}(r+1)=\mathbf{s}(r)-\mathbf{\hat{s}}(r)bold_s ( italic_r + 1 ) = bold_s ( italic_r ) - over^ start_ARG bold_s end_ARG ( italic_r ). We find that the more patterns are superposed, the lower the overall denoising accuracy is. This is due to the fact that when a pattern is incorrectly denoised, explaining away adds noise or spurious patterns to the representation of the superposition which makes the following denoising steps more difficult.

Comparison to structured patterns. We evaluate our model’s ability to denoise structured patterns. We consider the FashionMNIST dataset, from which we select 105105105105 images of size 28×28282828\times 2828 × 28 that we binarize by setting pixel values to be 11-1- 1 if below 127127127127, and 1111 elsewhere. We compare the denoising performance to the performance on random binary patterns of size 28×28=784282878428\times 28=78428 × 28 = 784 for fair comparison (Figure S4).

Appendix C Additional results

C.1 Further visualizations of grid cell modules

We further visualize the receptive fields for path integration by showing receptive fields from different units taken from the same grid module. We simulate a trajectory that traverses the entire environment and represent the activation of different position vectors along the trajectory. For each modulus mi{3,5,7}subscript𝑚𝑖357m_{i}\in\{3,5,7\}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 3 , 5 , 7 }, we show the similarity between 4444 different vectors 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT from module misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and the position vectors along the trajectory. We show in Figure (S1) that the different receptive fields of a given module are translations of one another.

Refer to caption
Figure S1: Response fields visualization of 4444 different 𝐠isubscript𝐠𝑖\mathbf{g}_{i}bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in 3333 different modules mi=3,5subscript𝑚𝑖35m_{i}=3,5italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 3 , 5 and 7777. For a fixed module, the response fields are translations of one another.

C.2 Remapping contexts

Refer to caption
Figure S2: Remapping of place cells depending on context, similar to what was observed experimentally in an experimental study of attractor network dynamics in hippocampus [wills2005attractor].

We demonstrate that the context vector can serve as a model of global remapping in hippocampal place fields, which occurs when there is no relationship between the firing of place cells in different environments [latuske2018hippocampal]. The simplest instance of this is when a place field occurs in context A but not context B, consistent with the observed sparsity of hippocampal activity [thompson1989place]. To model this kind of remapping phenomenon, we consider an instance where there is a gradation of contexts with some phase transition between them; such an instance was observed experimentally [wills2005attractor]. Towards this end, we model linear combinations of these contexts, where the weights each context is given are sigmoid(x),1sigmoid(x)sigmoid𝑥1sigmoid𝑥\text{sigmoid}(x),1-\text{sigmoid}(x)sigmoid ( italic_x ) , 1 - sigmoid ( italic_x ), with x varying from 55-5- 5 to 5555 in 8888 equally spaced increments, and with sigmoid(x)=1/(1+exp(x))sigmoid𝑥11exp𝑥\text{sigmoid}(x)=1/(1+\text{exp}(-x))sigmoid ( italic_x ) = 1 / ( 1 + exp ( - italic_x ) ). To model hippocampal units, we generate units that prefer one of the two contexts and have a random place field location, using its weight vector, or address, as 𝐜i=1K𝐠i𝐜superscriptsubscript𝑖1𝐾subscript𝐠𝑖\mathbf{c}\bigodot_{i=1}^{K}\mathbf{g}_{i}bold_c ⨀ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and compare its output to that of the context/grid system at each location and context. It is worth noting that the original experiment of [wills2005attractor] also exhibited instances of rate remapping for some units, and so there is certainly additional complexity underlying remapping that is not captured by our simple model.

Refer to caption
Figure S3: Heteroassociation with contexts on the MNIST Dataset, at varying degrees of corruption.
Refer to caption
Figure S4: Comparison of heteroassociation on random patterns vs. binarized version of the FashionMNIST dataset. For different levels of corruption, we denoise flattened binarized FashionMNIST images as well and random binary vectors of the same size. The overall denoising accuracy is lower for FashionMNIST.

C.3 Storing and retracing sequences

We demonstrate that our model can recover sequences by heteroassociation of patterns to positions and path integration in a conceptual space (Figure S5A). This is consistent with the postulated role of the hippocampal formation in performing navigation in conceptual spaces [constantinescu2016organizing, bellmund2018navigating], and the role of entorhinal cortex in generating sequences of neural firing in hippocampus [schlesiger2015medial, yamamoto2017direct]. To evaluate our attractor model’s fidelity at sequence memorization and retrieval, we simulate trajectories to form sequences of random binary patterns and recall the sequence using the path integration mechanism following the method in Section 4.1, for D=10,000𝐷10000D=10{,}000italic_D = 10 , 000 and moduli {3,5}35\{3,5\}{ 3 , 5 }. We add extrinsic noise to the velocity input, which accumulates along the trajectory and induces a drift. This implies that patterns at the end of sequences are less well recovered than ones at the beginning (Figure S5B and C).

Refer to caption
Figure S5: Flexible sequence retrieval via path integration in a conceptual space. A) An example of a hexagonal lattice with sensory observations associated with different states. Having knowledge of the underlying graph enables generalization to new trajectories in the space [behrens2018cognitive, whittington2020tolman]. B) Accuracy of random binary pattern retrieval as a function of position in the sequence for a fixed error rate and one context tag. The noiseless case achieves perfect accuracy, but errors accumulate after incorrect sequence predictions. C) Same as B), but with the additional task of inferring the context tag.