Quantifying stimulus-relevant representational drift using cross-modality contrastive learning

Siwei Wang (Department of Organismal Biology and Anatomy, University of Chicago); Elizabeth A de Laittre (Committee on Computational Neuroscience, University of Chicago); Jason MacLean (Committee on Computational Neuroscience, University of Chicago; Department of Neurobiology, University of Chicago); Stephanie E Palmer (Department of Organismal Biology and Anatomy, University of Chicago; Department of Physics, University of Chicago; Physics Frontier Center for Living Systems, University of Chicago)

(Feb 2024)

Abstract

Previous works investigating representational drift from sensory to central nervous systems converged to show that neural coding, especially at the population level, readily overcomes these session-to-session fluctuations. However, representational drift in the primary visual cortex is more prominent when presenting naturalistic stimuli than artificial stimuli. Animals have continuously navigated natural environments over evolutionary timescales. Why did evolution not get rid of representational drift if it was just an inconvenience? Here, we investigate how representational drift simultaneously influences the encoding of multiple behaviorally relevant features in a natural movie stimulus. Although previous work already used this natural movie stimulus, the prohibitive challenge of parameterizing a natural movie forced those analyses to use only averaged neural activity and stimulus binned into a single coarse time scale. Because natural environments contain multiple interacting spatio-temporal features, this simplification left previous works with an incomplete understanding of representational drift. Here, we use cross-modality contrastive learning, a cutting-edge machine learning method, to circumvent this parameterization challenge. This method enables us to learn an embedding of neural activity that retains only the relevant components of the natural movie stimulus without making explicit assumptions. We also observe that our learned embedding is near-optimal in decoding a whole suite of natural features (scene, optic flow, complex spatio-temporal features, and time) and generalizes to decoding those features from single-trial or novel hold-out data. Using this embedding as a surrogate model, we observe that representational drift perturbs the local geometry of the embedding, and this results in feature-dependent changes in performance when we decode from a different session (90 min later), even at the population level. For example, the decoding decay is three times greater for optic flow features than for scene features. Our work further suggests that a separate compensation mechanism may be necessary for the optic flow features, as their autocorrelation scale is shorter than the minimum time needed to discriminate scene texture features. Thus, representational drift may encourage neural processing flexibility rather than be a mere nuisance.

1 Introduction

As recording technology has improved, the neuroscience community has been able to record larger neural populations over longer periods of time. These experiments have revealed that the correlations of neural activity with external variables change between separate recording sessions, over a timescale of hours to days to weeks, in a wide range of brain areas (Marks and Goard (2021); Driscoll et al. (2017); Schoonover et al. (2021); Ziv et al. (2013); Lütcke et al. (2013); Mankin et al. (2015)). Despite the prevalence of this representational drift, previous works have also shown that behavioral variables can be stably decoded at the population level in many different brain areas (Gallego et al. (2020); Funamizu et al. (2023); Levy et al. (2021); Ziv et al. (2013); Dhawale et al. (2017); Tolhurst et al. (1983); Huber et al. (2012); Rokni et al. (2007); Arieli et al. (1996); Faisal et al. (2008); Engel and Steinmetz (2019); Clopath et al. (2017); Lütcke et al. (2013); Schölvinck et al. (2015); Cohen and Maunsell (2010); Montijn et al. (2015); Rubin et al. (2015); Sheintuch et al. (2020); Driscoll et al. (2017); Rule et al. (2020); Chambers and Rumpel (2017)). Similarly, analyses of neural activity in the primary visual cortex (V1) have attempted to establish that the stimulus may also be stably decoded in the presence of week-to-week representational drift (Marks and Goard (2021); Sadeh and Clopath (2022)). However, representational drift is also more prominent in the processing of natural stimuli than of artificial stimuli. Animals have continuously navigated natural environments over evolutionary timescales. If representational drift is merely an inconvenience, why did evolution not get rid of it?

To answer the above question, it is necessary to answer a first question: what is the effect of representational drift on the encoding of behaviorally relevant features from naturalistic stimuli in V1? Naturalistic stimuli contain multiple spatio-temporal features. These features may drive distinct behaviors. For example, detecting a transient, expansive optic flow would elicit survival behaviors like arrest or escape (Zhang et al. (2014); Liang et al. (2015)), whereas discriminating between different textures is crucial for foraging (Glickfeld et al. (2013); Poort et al. (2015); Resulaj et al. (2018)). These features fluctuate at drastically different timescales. For example, in a commonly used natural movie stimulus, "Touch of Evil", the optic flow features fluctuate at 33 ms whereas the scene texture features fluctuate at 500-1000 ms. The complexity of parameterizing these features prevented previous works from directly decoding them from neural activity. Instead, these works simplified their analysis of the influence of representational drift by using averaged neural activity and stimulus binned into a single, coarse (1 second long) temporal window. This kind of binning discards many features in naturalistic stimuli that are important for behavior (for example, transient stimuli as brief as 1.6 ms can cause escape or arrest behavior (Liang et al. (2015))). Therefore, these analyses provided an incomplete picture of representational drift. How representational drift changes the encoding of multiple behaviorally relevant features is still unexplored.

In this work, we use a new machine learning method to extract stimulus-relevant components from neural activity and circumvent the above parameterization challenge. This method, cross-modality, weakly supervised contrastive learning, allows us to learn an embedding that retains only stimulus-relevant components from neural activity. It is guided by "co-occurrence", meaning it extracts the shared features between neural activity across animals and the natural movie stimulus within the same temporal window (we use the 33 ms, single-frame window here). Recent theory in Bayesian risk minimization showed that embeddings learned with this method are near-optimal in retaining the shared features between different modalities (i.e., neural activity and natural movie in our context). Namely, the learned embedding retains only the components of the neural activity that concern the natural movie and removes individual biases (e.g., behavioral variability between animals) that are not present in the natural movie stimulus. Using the Neuropixels recordings of V1 from the Allen Brain Observatory, we demonstrate that our learned embedding is near-optimal for decoding multiple natural features (scene, motion, time, and the combination of scene and motion features, at 99% accuracy). We also show that this learned embedding can decode natural features when we project neural activity from single trials and novel pseudomouses into the trained model. Therefore, the learned embedding extracts stimulus-relevant components available in the neural code while compressing away irrelevant variability from individual behaviors or mental states.

Using this learned embedding as a surrogate model, we quantify the representational drift across the two sessions in this dataset, which are 90 min apart. Using linear decoders trained on the embedding of session 1 neural activity to read out stimulus features from the embedding of session 2 neural activity, we observe a nearly 50% decrease in decoding performance for all features. We also find that the decoding decay differs between features fluctuating at fast and slow timescales: the decay ratio for fast features is three times greater than that for slow features. We further elucidate the geometry of the learned embedding and how it changes from session 1 to session 2, revealing a potential cause for the drop in decoding features that fluctuate at tens of milliseconds.

2 Result

We use the publicly available recording of mouse primary visual cortex responding to a natural movie within the Allen Brain Observatory. This recording contains two sessions of neural responses during passive viewing of a natural movie that are 90 minutes apart. Previous work reported the presence of representational drift, even with such a short 90-minute interval. Although the visual coding dataset in the Allen Brain Observatory includes neurons from both V1 and higher order visual areas (de Vries et al. (2019); brain-map.org (2019)), we focus on V1 in this study. V1 is the visual area that has been sampled the most thoroughly within this dataset. In addition, V1 not only sends behaviorally relevant features to higher visual areas, it also connects directly to subcortical structures that are responsible for driving survival behaviors (Zhao et al. (2014); Liang et al. (2015)). As a result, we hypothesize that V1 encodes the behaviorally relevant features that are used by both higher order areas of the cortex and subcortical structures.

2.1 The natural movie stimulus contains features with multiple spatiotemporal scales

Unlike artificial stimuli, naturalistic stimuli consist of features that fluctuate across multiple spatiotemporal scales. Previous works showed that similar multi-scale structures may be present for both scene textures (Saremi and Sejnowski (2013)) and dynamic motion (Salisbury and Palmer (2023)) in natural visual environments. These features give natural environments their complex structures within which organisms must navigate and maximize their chances of survival. Understanding how the neural population encodes stimulus features within the natural movie requires us to first understand the variety of features and their spatiotemporal scales in the natural movie stimulus.

Parametrizing a natural movie is computationally prohibitive. Here we dissect the natural movie into two streams: one corresponds to the static, textural components, and the other corresponds to the dynamic (optic flow) components (Figure 1A, see methods for details). Using independent unsupervised clustering on static and dynamic components, we group frames with similar static or dynamic features together into separate clusters. The cluster labels therefore give us an estimate of how many discrete static and dynamic features are present in the natural movie stimulus. We decode these cluster labels as a proxy for decoding static or dynamic features in subsequent analyses. We also decode joint features, whose labels result from aggregating the scene and flow clusterings.
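To make the joint features concrete, the following is a minimal sketch of one way to aggregate two per-frame labelings. It assumes that a joint label is simply a distinct (scene, flow) pair; the function name `joint_labels` and the toy label values are hypothetical, and the actual aggregation is described in the methods.

```python
from typing import Dict, List, Tuple


def joint_labels(
    scene: List[int], flow: List[int]
) -> Tuple[List[int], Dict[Tuple[int, int], int]]:
    """Give every distinct (scene, flow) pair its own joint label.

    Assumption: joint labels are numbered in order of first appearance.
    """
    if len(scene) != len(flow):
        raise ValueError("scene and flow need exactly one label per frame")
    mapping: Dict[Tuple[int, int], int] = {}
    joint: List[int] = []
    for pair in zip(scene, flow):
        if pair not in mapping:
            mapping[pair] = len(mapping)  # next unused joint label
        joint.append(mapping[pair])
    return joint, mapping


# Toy example: five frames, two scene clusters, two flow clusters.
scene = [0, 0, 0, 1, 1]
flow = [2, 5, 2, 2, 5]
labels, mapping = joint_labels(scene, flow)
print(labels)  # [0, 1, 0, 2, 3]
```

With this construction, the number of joint labels is at most the product of the number of scene and flow clusters, and two frames share a joint label only if they agree on both.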

When we compare the autocorrelation time scales between scene and flow labels, we find that the flow labels fluctuate one order of magnitude faster than the scene labels. In particular, the autocorrelation time scale of the flow labels is as short as 33-100 ms (1-3 frames in the 30 Hz movie), while the scene labels fluctuate at either 500 ms or 1000 ms. This emergence of fast and slow autocorrelation time scales may result from the changes in optic flow (e.g., moving left or right) in the natural movie being much faster than the switch between different scenes. Therefore, our work aims to understand how representational drift changes the neural coding of these features and whether it influences features fluctuating at fast time scales differently from features fluctuating at slower time scales.
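The comparison of time scales can be illustrated with a small sketch. How the autocorrelation of categorical labels is computed is not specified in this section, so the code below assumes an agreement-based definition: the probability that two frames `lag` apart share a label, corrected for chance and normalized to 1 at lag 0. The function names, the 0.5 threshold, and the toy sequences are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence

FRAME_MS = 1000.0 / 30.0  # one frame of the 30 Hz movie, about 33 ms


def label_autocorrelation(labels: Sequence[int], max_lag: int) -> List[float]:
    """Chance-corrected label agreement at lags 0..max_lag (assumed definition)."""
    n = len(labels)
    # Probability that two independently drawn frames share a label.
    chance = sum((c / n) ** 2 for c in Counter(labels).values())
    if chance >= 1.0:
        raise ValueError("need at least two distinct labels")
    acf = []
    for lag in range(max_lag + 1):
        pairs = n - lag
        agree = sum(labels[t] == labels[t + lag] for t in range(pairs)) / pairs
        acf.append((agree - chance) / (1.0 - chance))
    return acf


def first_lag_below(acf: Sequence[float], threshold: float) -> Optional[int]:
    """First lag (in frames) at which the autocorrelation is at or below threshold."""
    for lag, value in enumerate(acf):
        if value <= threshold:
            return lag
    return None


# Toy sequences: a label that changes every frame and one that changes every 15 frames.
fast = [0, 1, 2, 3] * 15
slow = [0] * 15 + [1] * 15 + [0] * 15 + [1] * 15

fast_lag = first_lag_below(label_autocorrelation(fast, 20), 0.5)
slow_lag = first_lag_below(label_autocorrelation(slow, 20), 0.5)
print(fast_lag, slow_lag)  # 1 5
print(round(fast_lag * FRAME_MS), round(slow_lag * FRAME_MS))  # 33 167
```

In this toy case the fast sequence decorrelates after a single frame, mirroring the qualitative gap between flow and scene labels described above; the specific numbers carry no meaning for the real movie.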

Because the autocorrelation time scale of flow labels is only a few frames, consecutive frames in the movie may contain different flow labels. Hence, differentiating neighboring frames that are merely 33 ms apart is necessary to trace changes in stimulus features in the movie. Indeed, when we cluster the natural movie after first binning movie frames into 1 second windows (analogous to previous works (Xia et al. (2021); Sadeh and Clopath (2022))), we observe that the resulting clusters do not trace the change of stimulus features. For example, even if similar optic flow features occur in different time windows, they are still assigned to distinct clusters. As a result, whether a sample of neural activity belongs to a broad binned temporal window may not reflect which stimulus features are encoded. Instead, we develop new tools to decode neural activity at 33 ms, i.e., a single frame, in the subsequent analysis. Decoding features at single-frame resolution allows us to observe and compare the impact of representational drift on stimulus features whose autocorrelation scales span a few tens to a few hundred milliseconds.

Figure 1: A) Three example (static) frames from the movie (top), with their corresponding optic flow frames (bottom). We perform hierarchical clustering on the static and optic flow frames to obtain discrete "scene" and "flow" labels (separate clusterings on scene and optic flow frames; see main text and Supplementary for details). All three example static frames here have the same scene label, but their optic flow frames each have a different flow label. Thus the optic flow provides more information for discriminating these three frames from each other than the textures in the static frames. B) These two behaviorally relevant features of interest in the natural movie stimulus vary over different timescales: the autocorrelation decays differently for scene and flow labels (schematized at the top, with actual data below). The optic flow labeling changes quickly, i.e., its autocorrelation decays to near zero after 33-100 ms (1-3 frames). In comparison, the decay for scene labeling is much slower. It only reaches zero after 400-500 ms and reaches a negative peak around 1 s (likely corresponding to a scene change in the movie). This figure shows results for the first 400 frames (first half of the movie), and we observe the same difference in decay timescales for the second half of the movie (see Supplementary Information).

2.2 Learning a generalizable representation from weakly supervised contrastive training

Because of the complex structure present in the natural movie, we aim to learn an embedding of the features of the natural movie that are selectively encoded in V1 without making explicit assumptions. This encourages us to use weakly supervised contrastive learning to extract stimulus-relevant components from neural activity and the natural movie. The weak supervision is guided by "co-occurrence" (Radford et al. (2021)), i.e., whether samples of neural activity or samples of a natural movie happen within the same temporal window. Thus, it extracts shared features across animals and maps them to features of natural movie frames within the same temporal window. We use 33 ms, i.e., a single frame, as the temporal window because flow features only autocorrelate over a couple of frames. This means that given a sample of neural activity $a$ from pseudomouse A and a sample $b$ from pseudomouse B, along with the natural movie features $m$ that happen at frame $t$, contrastive learning tries to bring the elements of the tuple $(a,b,m)$ together in the embedding.

To obtain V1 populations broad enough to cover the entire visual field, we generate two pseudomouses by pooling neural activity from different mice; each pseudomouse consists of data from a non-overlapping set of five mice. We then use two training phases to obtain a neural representation that is relevant to the stimulus (see Figure 2). In the first phase, we use contrastive learning to extract features encoded by both $a$ and $b$ into the embedding, while removing the bias from individual pseudomouses. Because the embedding retains features shared by $a$ and $b$, the resulting embedding pulls $a$ and $b$ together. In machine learning, this is known as self-supervised learning (Chen et al. (2020)). Meanwhile, we also train a separate backbone to obtain an efficient representation of the movie by contrasting its scene and optic flow features. In other words, the single-modality phase learns efficient representations of the neural activity and of the natural movie separately. In the second phase, we align the representations of neural activity with the natural movie. To do this, we contrast samples from all possible pairs of modalities (neural activity, scene, and optic flow frames). Similar to the single-modality phase, this cross-modality contrastive learning learns a compressed representation that only keeps the shared features between the natural movie and the neural activity.
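The idea of pulling co-occurring samples together can be sketched with a standard contrastive objective. The section does not state the loss, so the code below assumes an InfoNCE-style loss of the kind used in the cited contrastive learning literature (Chen et al. (2020); Radford et al. (2021)): embeddings from the same frame are positives, and embeddings from other frames in the batch are negatives. The names `info_nce`, `symmetric_info_nce`, `tuple_loss`, and the temperature value are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def info_nce(anchors: Sequence[Vector], positives: Sequence[Vector],
             temperature: float = 0.1) -> float:
    """anchors[i] and positives[i] come from the same frame; other rows are negatives."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[i]  # cross-entropy with the co-occurring row
    return total / len(a)


def symmetric_info_nce(x, y, temperature: float = 0.1) -> float:
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def tuple_loss(a, b, m, temperature: float = 0.1) -> float:
    """Average loss over all pairs in the (a, b, m) tuple (assumed aggregation)."""
    return (symmetric_info_nce(a, b, temperature)
            + symmetric_info_nce(a, m, temperature)
            + symmetric_info_nce(b, m, temperature)) / 3.0


# Toy embeddings for three frames: aligned modalities give a near-zero loss,
# while pairing each frame with the wrong frame gives a large loss.
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]
print(tuple_loss(aligned, aligned, aligned) < 0.01)  # True
print(symmetric_info_nce(aligned, shuffled) > 1.0)   # True
```

In an actual training loop the embeddings would come from the learned backbones and the loss would be minimized by gradient descent; this sketch only evaluates the objective on fixed vectors.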

Because we construct contrastive pairs based on whether they co-occur within a single frame, time at the single-frame level, i.e., the frame number, is a natural choice of decoding feature (i.e., the mapping between neural activity and the changes in the natural scene). In Figure 3, we show that we can decode time in a held-out session 1 test set with nearly 99% accuracy using both the single-modality and cross-modality trained models. This shows that the neural activity responding to different frames in the movie is sufficiently different from one frame to another for our contrastive learning backbone to learn an embedding where each sample can be easily (linearly) differentiated from the others. Demonstrating the near-optimality of our learned embedding (for linearly decoding time) is a necessary step to ensure that we can use it to quantify the effect of drift based on the change in decoding performance, as we do in the next section.

Additionally, the learned embedding generalizes to decode time using other "out of distribution" formats of the session 1 neural activity. Our single-trial decoding accuracy reached around 93% for known pseudomouses (i.e., trained using the PSTH) and 92% for a novel pseudomouse (trained without PSTH or single-trial data). Therefore, our encoding models trained with PSTH capture the majority of the stimulus-relevant components of neural activity. Notably, we compare this performance to simple methods (PCA, NMF, see Supplementary) based on the neural population activity directly. The difference between our decoding performance and these off-the-shelf methods shows that our learned embedding uniquely reformats stimulus-relevant features to facilitate linear decoding among them.

2.3 Quantify the magnitude of representational drift via changes in decoding performance

Because our learned embedding is nearly optimal for decoding a whole suite of natural features, we use it as a surrogate model. We then use the corresponding change in decoding performance between different sessions to quantify how much representational drift changes the encoding of stimulus features. This decoding task differs from those in previous works. Previous works either decoded only averaged neural activity in a broad window of 1 second, which cannot track stimulus features that fluctuate at faster time scales, or reported only pairwise discriminability between different binned windows, which is a much easier decoding task than the one we pose here. When we decode session 2 with a linear decoder trained on session 1, we observe a significant accuracy decay of up to 50% in decoding all four natural features (time, scene, flow, and joint features combining scene and flow; see Fig 4). The results are symmetric across the two sessions: training on session 2 data and decoding session 1 data yields a quantitatively similar drop in decoding performance (see supplementary Table 2).

We also investigate whether the change in decoding flow features (or features fluctuating at fast time scales) is different from the change in decoding scene features (or features fluctuating at slow time scales). We look into how much improvement in decoding performance one may obtain from forgiving decoding errors within a short temporal window. We refer to this particular decoding performance as "error-tolerating accuracy". For any given sample of time $t$, the error-tolerating accuracy within an $n$-frame temporal window regards the predicted time $t^{\prime}$ as correct if $t^{\prime}\in(t-n,t+n)$, as opposed to regarding $t^{\prime}$ as correct only if $t^{\prime}=t$. We expand the error tolerance window up to 30 frames (corresponding to the longest autocorrelation scales) to understand whether any specific decoding errors occur within a certain autocorrelation timescale (shown in Figure 5B). We find that the error-tolerating accuracy increases by 12.1% for flow features within their autocorrelation scale $\tau_{1}$ (and by 27.2% overall toward the 1 second window). This increase in accuracy is also present in two other features that contain components of flow features, i.e., joint features and time. Meanwhile, the change in decoding scene features is smaller in magnitude as we expand the error tolerance window to 1 second. Comparing the decoding accuracy between the two autocorrelation scales for the scene, i.e., $\tau_{2}$ and $\tau_{3}$, we observe a change of 10%. Therefore, representational drift changes the encoding of flow features much more than that of scene features. In the next section, we further analyze how representational drift changes the geometry of the learned embedding to gain insight into why it leads to such a substantial change in the decoding of flow features.
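The definition of error-tolerating accuracy translates directly into a few lines of code. The sketch below follows the open interval $(t-n,t+n)$ stated above, so $n=1$ reduces to exact-match accuracy; the function name and the toy predictions are hypothetical.

```python
from typing import Sequence


def error_tolerating_accuracy(
    true_frames: Sequence[int], predicted_frames: Sequence[int], n: int
) -> float:
    """Fraction of samples whose predicted frame t' lies in the open interval (t - n, t + n)."""
    if n < 1:
        raise ValueError("n must be at least 1 (n = 1 is exact-match accuracy)")
    if len(true_frames) != len(predicted_frames):
        raise ValueError("need one prediction per sample")
    hits = sum(abs(p - t) < n for t, p in zip(true_frames, predicted_frames))
    return hits / len(true_frames)


# Toy predictions with errors of 0, 1, 0, 3, and 30 frames.
true_frames = [10, 11, 12, 13, 14]
predicted = [10, 12, 12, 16, 44]
for n in (1, 2, 4, 31):
    print(n, error_tolerating_accuracy(true_frames, predicted, n))
# 1 0.4
# 2 0.6
# 4 0.8
# 31 1.0
```

Plotting this accuracy against $n$ shows at which temporal distance the decoding errors concentrate: a steep rise at small $n$ indicates confusion among neighboring frames.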

2.4 The representational drift perturbs the local geometry helpful for decoding fast features

In this section, we analyze whether the geometry of the learned embedding changes in the presence of representational drift. We focus on two geometric features that are relevant to decoding stimulus features. The first is the "coding smoothness" for similar stimulus features, as first characterized in Stringer et al. (2019a). This smoothness suggests that the neural representations of similar stimulus features are located nearby in the representation space (Stringer et al. (2019a)). How the variance of the $n$-th principal component scales with $n$ determines whether such smoothness exists. Specifically, if the variance of the principal components decays faster than the "smoothness threshold", i.e., a power law in $n$ with exponent $(-1-2/d)$ for a stimulus with $d$ dimensions, then the representation is smooth. Stringer et al. (2019a) showed that any smooth representation is also a differentiable manifold. This allows two similar stimulus features to be found close to each other on the manifold (see Stringer et al. (2019a) for more information). In addition, Stringer et al. (2019a) also showed that neural activity generally obeys such a smoothness condition in encoding static stimuli like natural images. Here, we investigate whether our learned embedding inherits this coding smoothness from the neural activity for encoding spatio-temporal features from the natural movie. As we show in Fig. 5A (we use $d=400$ to calculate the smoothness threshold), the variance explained by the PC dimensions of our learned embedding decays much faster than the corresponding threshold, indicating that the learned embedding for session 1 is globally smooth. Meanwhile, representational drift does not change this global smoothness, suggesting that global geometry does not play a role in the drop in decoding performance from session 1 to session 2.
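The smoothness check can be sketched as a comparison between a fitted power-law exponent and the threshold $-1-2/d$. The code below is a minimal illustration that assumes the exponent is estimated by an ordinary least-squares fit of log variance against log rank over all components; the fitting range, the function names, and the synthetic spectra are assumptions rather than the procedure used for Fig. 5A.

```python
import math
from typing import Sequence


def smoothness_threshold(d: int) -> float:
    """Power-law exponent -1 - 2/d for a stimulus with d dimensions."""
    return -1.0 - 2.0 / d


def power_law_slope(variances: Sequence[float]) -> float:
    """Least-squares slope of log(variance) against log(rank), with rank starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(variances) + 1)]
    ys = [math.log(v) for v in variances]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def is_smooth(variances: Sequence[float], d: int) -> bool:
    """True if the spectrum decays faster (more negative slope) than the threshold."""
    return power_law_slope(variances) < smoothness_threshold(d)


# Synthetic spectra: one decaying as n^-1.5 and one decaying as n^-0.8.
steep = [n ** -1.5 for n in range(1, 51)]
shallow = [n ** -0.8 for n in range(1, 51)]
print(round(smoothness_threshold(400), 3))  # -1.005
print(is_smooth(steep, 400), is_smooth(shallow, 400))  # True False
```

For $d=400$ the threshold is close to $-1$, so any spectrum that falls off appreciably faster than $1/n$ passes the check.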

Next, we investigate whether representational drift changes the local geometry between the embeddings of similar stimulus features. Because our learned embedding is near-optimal for decoding (see Section 2.2), it exhibits another geometry (Papyan et al. (2020)). This geometry arranges the samples corresponding to the same stimulus feature into a distinct cluster that is separable from other stimulus features (see supplementary for theoretical analysis). Within such a geometry, recent theory (Papyan et al. (2020)) suggests a specific optimal linear decoder whose weight for time $t$ is the mean activation of all samples belonging to $t$. In our case, this mean activation linear decoder achieves 99% accuracy on decoding time (see Section 2.2 and supplementary). Because of the global smoothness, this mean activation clustering suggests that the learned embedding also has a local $K$-nearest neighborhood within which the $k$-th closest neighbor encodes the $k$-th most similar stimulus feature. For example, for 96% of the test samples, the 2nd nearest cluster mean belongs to a temporally adjacent cluster (i.e., either $t-1$ or $t+1$ for a sample belonging to time $t$, shown as the top portion of the cyan line in Figure 5B). Such a local neighborhood pertains to 75% of samples for $t\pm 2$ and quickly drops beyond 2 frames (shown as the magenta line in Figure 5B). This local neighborhood is much weaker in the presence of representational drift (magenta line) within the range $k\in(1,4)$ (shown as the cyan line in Figure 5B). First, only 61.5% of the samples can be linearly decoded by the respective mean activation of all samples belonging to $t$ in the drifted activations. Second, only 41.9% and 30.3% of the test samples can be read out as the 2nd or 3rd nearest neighbor. Note that this perturbation from representational drift is prominent in the local neighborhood, but not as obvious in the global structure. The inset of Figure 5B shows that if we expand $K$ to 10-30 frames, the neighborhood structures become similar with and without representational drift.
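The mean activation decoder and the neighborhood analysis can be sketched as follows. This is an illustration under simplifying assumptions: samples are assigned to the nearest cluster mean by Euclidean distance, and a sample counts as having an adjacent $k$-th neighbor if the label of its $k$-th nearest mean is within `max_offset` frames of its true frame. The function names, the one-dimensional toy embedding, and the uniform shift standing in for drift are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def class_means(samples: Sequence[Vector], labels: Sequence[int]) -> Dict[int, List[float]]:
    """Mean embedding of all samples that share a frame label."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for x, t in zip(samples, labels):
        if t not in sums:
            sums[t] = [0.0] * len(x)
        sums[t] = [s + v for s, v in zip(sums[t], x)]
        counts[t] += 1
    return {t: [s / counts[t] for s in total] for t, total in sums.items()}


def labels_by_distance(x: Vector, means: Dict[int, List[float]]) -> List[int]:
    """Frame labels ordered from nearest to farthest cluster mean."""
    return sorted(means, key=lambda t: math.dist(x, means[t]))


def nearest_mean_accuracy(samples, labels, means) -> float:
    hits = sum(labels_by_distance(x, means)[0] == t for x, t in zip(samples, labels))
    return hits / len(labels)


def kth_neighbor_adjacent_fraction(samples, labels, means, k=2, max_offset=1) -> float:
    """Fraction of samples whose k-th nearest mean is a different frame within max_offset."""
    hits = 0
    for x, t in zip(samples, labels):
        kth = labels_by_distance(x, means)[k - 1]
        hits += 0 < abs(kth - t) <= max_offset
    return hits / len(labels)


# Toy 1-D embedding in which frame t sits near coordinate t.
frames = list(range(5))
train = [[t - 0.1] for t in frames] + [[t + 0.1] for t in frames]
means = class_means(train, frames + frames)

session1 = [[t + 0.2] for t in frames]  # small within-session variability
session2 = [[t + 0.7] for t in frames]  # a shift standing in for drift

print(nearest_mean_accuracy(session1, frames, means))  # 1.0
print(nearest_mean_accuracy(session2, frames, means))  # 0.2
print(kth_neighbor_adjacent_fraction(session1, frames, means, k=2))  # 1.0
```

In the toy example, the shifted samples are mostly assigned to the neighboring frame, which is the kind of local confusion that leaves coarse, long-range structure intact.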

We further illustrate why such a difference may occur (Figure 5C and supplementary). Without representational drift, samples belonging to time $t$ and nearby times (e.g., $t\pm 1$ or $t\pm 2$) form well-separated clusters with regular shapes. With representational drift, the shapes of these clusters become irregular. The overlap of these irregularly shaped clusters results in errors in differentiating time $t$ from its immediate neighbors (see supplementary for theoretical analysis). Notably, the width $K$ corresponds to 30-60 ms. Within this neighborhood, it is likely that flow labels would change (because of the fast autocorrelation scale of optic flow). Representational drift would make it challenging to use the optimal linear decoder to read out these different flow labels because of the confusion between these nearby clusters. Because the scene labels fluctuate at much longer time scales, different times within this local temporal window may share the same scene label; thus, representational drift does not change the linear decoding of scene labels as significantly as that of flow labels (shown in Section 2.3). Therefore, the perturbation of this local neighborhood is a possible cause of the decoding decay in flow features.

3 Discussion

This study shows that representational drift brings varying changes to the multiple natural features simultaneously present in a natural movie stimulus. Observing that there are features that fluctuate from tens (fast, a single frame) to hundreds (slow, 30 frames) of milliseconds, we first use weakly supervised contrastive learning to extract all available stimulus-relevant components from neural activity. Because this embedding is near-optimal for linearly decoding a whole suite of natural features, we are able to directly quantify the magnitude of drift through the change in decoding. This is unique because previous works only investigated averaged neural activity on a single time scale without considering the intricate structure of the natural movie stimulus. We observe a 50% decrease in decoding performance on all natural features, suggesting that there is a significant change in the encoding of these features with representational drift. In addition, the magnitude of decoding decay for the fast fluctuating flow features is three times greater than that for the slow fluctuating scene features, suggesting that representational drift changes the encoding of fast and slow features differently. We then analyze the learned embedding geometry to show that the substantial decoding decay of the fast features corresponds to representational drift perturbing the local geometry, which may support near-optimal linear decoding.

The fast and slow features in the natural movie stimulus may support different visually guided behaviors, considering their distinct autocorrelation time scales. The fast features fluctuate as fast as 33 ms (or possibly faster, but 33 ms is the limit from the sampling rate of the neural activity), which corresponds to transient changes in the visual field. Previous experiments demonstrated that V1 can modulate the superior colliculus (SC) in the presence of such transient stimuli (even as brief as 1.6 ms, Liang et al. (2015)) to induce temporary arrest. Meanwhile, the same circuit was also hypothesized to be responsible for detecting looming threats (Zhang et al. (2014)), suggesting that the V1 encoding of these fast features is informative for driving survival-critical behaviors. The slow features that fluctuate at 500-1000 ms are scene textures. Such slow time scales allow temporal integration in V1 to discriminate target textural features from distractors (Resulaj et al. (2018); Goris et al. (2018)). They may play important roles in foraging when animals navigate complex natural environments.

Our research encourages future work to explore the potential impact of representational drift in the primary visual cortex (V1) on visually guided behaviors. There are some thought-provoking questions. A crucial initial question is whether compensation mechanisms for representational drift are required. This is contingent upon whether the accuracy of V1 in encoding naturalistic stimuli is necessary for stable behavioral outputs. Prior studies have demonstrated that V1 encodes orientation with a precision of 0.1 degrees in response to drifting grating stimuli. However, downstream systems that directly drive behaviors can only use 5 degrees of orientation selectivity. This represents a 50-fold coarsening. Representational drift changes the encoding of naturalistic stimuli significantly more than that of artificial stimuli, with a 50% difference in our case. Therefore, it is currently unknown whether behavioral output will remain stable when using an encoding that is 100 times coarser (or two orders of magnitude coarser) than V1. This is an intriguing and unexplored future question. Our findings suggest that if compensating mechanisms are required, then the mechanism for optic flow features is distinct from that for scene texture features. Because V1 requires a minimum of 80 ms to exhibit discriminability (Resulaj et al. (2018)), and considering that optic flow features change on a time scale of 33 ms or even shorter, it is possible that mechanisms designed to compensate for the encoding of texture features may not be fast enough to handle optic flow features. Thus, if the superior colliculus uses those rapid features to drive appropriate actions, it must develop a compensation mechanism distinct from mechanisms known to improve discriminability, like pooling (Stringer et al. (2019b)).

This approach has the unique advantage of extracting stimulus-relevant components from neural activity without making explicit assumptions. In addition, we show that the learned embedding exhibits Bayesian optimality. This optimality ensures that the resulting embedding retains all stimulus features commonly encoded across multiple animals. This optimality also makes this method suitable for learning neural representations of natural behaviors. In neural recording experiments of sensory systems, the experience is usually consistent across multiple trials. The variance we observe at the single-trial level is mostly noise in the neural representation. This is not the case with natural behaviors. Natural behaviors themselves exhibit significant variability in trial-by-trial execution. As a result, it is unclear whether the variability observed in the neural representation of natural behaviors comes from noise in the neural code or from a difference in behavioral variables. Moreover, it is unknown whether all features exhibited in natural behavior execution (based on video) are encoded by the motor-relevant brain areas (i.e., the motor cortex) or can be attributed to stereotypical motor tape within local control circuits. We can adapt this method to discover behaviorally relevant features represented by the neural population. For example, we can first use single-modality training to obtain an efficient representation of natural behaviors based on video or posture recording. We can then obtain another efficient representation through cross-modality training between video and neural representation. Bayesian optimality suggests that the efficient representation after cross-modality training retains all behavioral features selectively encoded by the neural population. We can determine which features the motor cortex selectively encodes by comparing this representation with the representation from single-modality training.

This Bayesian optimality also distinguishes our approach from previous methods that focused on learning compressed latent embeddings to reconstruct neural population activity (Schneider et al. (2023); Azabou et al. (2021); Zhou and Wei (2020)). These encoder-decoder based methods obtain a latent embedding by training both an encoder and a decoder. Because of the extra training for the decoder, those embeddings are not Bayesian optimal. Furthermore, the dimensionality of the latent embedding is manually controlled as a parameter. Although the learned embedding may include stimulus features encoded by the neuronal population, there is no guarantee that all of these features will be retained in it. Therefore, these previous methods could not be used to discover novel features about external variables.

The finding that representational drift changes the encoding of features differently depending on the time scale at which they fluctuate implies that representational drift may serve a useful purpose rather than being only a nuisance in neural processing. Features that fluctuate at different time scales enable different behaviors. Organisms may develop a specific mechanism to compensate for the unique effect of representational drift on a single stimulus feature rather than relying on a general mechanism that applies to all features. As previously discussed, the range of time scales across these features suggests that the compensating mechanisms may also differ. Consequently, if one specific compensating mechanism fails, it would only impact one behavior rather than all behaviors that are essential for survival. Hence, our observation corroborates the notion that the brain carries out flexible, robust, and efficient computations as organisms explore and interact with the external world.

Acknowledgement

This work was supported by the National Science Foundation through the Physics Frontier Center for Living Systems (PHY-2317138) and the NSF-Simons National Institute for Theory and Mathematics in Biology, awards NSF DMS-2235451 and Simons Foundation MP-TMPS-00005320. This material is also based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1746045. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Arieli et al. [1996] A. Arieli, A. Sterkin, A. Grinvald, and A. Aertsen. Dynamics of ongoing activity: explanation of the large variability in evoked cortical responses. Science (New York, N.Y.), 273:1868–1871, September 1996. ISSN 0036-8075. doi: 10.1126/science.273.5283.1868.
Azabou et al. [2021] Mehdi Azabou, Mohammad Gheshlaghi Azar, Ran Liu, Chi-Heng Lin, Erik C. Johnson, Kiran Bhaskaran-Nair, Max Dabagia, Bernardo Avila-Pires, Lindsey Kitchell, Keith B. Hengen, William Gray-Roncal, Michal Valko, and Eva L. Dyer. Mine your own view: Self-supervised learning through across-sample prediction. February 2021.
brain-map.org [2019] brain-map.org. Allen brain observatory – neuropixels visual coding. Technical report, 2019. URL https://portal.brain-map.org/explore/circuits/visual-coding-neuropixels.
Chambers and Rumpel [2017] Anna R. Chambers and Simon Rumpel. A stable brain from unstable components: Emerging concepts and implications for neural computation. Neuroscience, 357:172–184, August 2017. ISSN 1873-7544. doi: 10.1016/j.neuroscience.2017.06.005.
Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. February 2020.
Clopath et al. [2017] Claudia Clopath, Tobias Bonhoeffer, Mark Hübener, and Tobias Rose. Variance and invariance of neuronal long-term representations. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 372(1715):20160161, March 2017. ISSN 1471-2970. doi: 10.1098/rstb.2016.0161.
Cohen and Maunsell [2010] Marlene R. Cohen and John H. R. Maunsell. A neuronal population measure of attention predicts behavioral performance on individual trials. The Journal of neuroscience : the official journal of the Society for Neuroscience, 30:15241–15253, November 2010. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.2171-10.2010.
de Vries et al. [2019] Saskia E. J. de Vries, Jerome A. Lecoq, Michael A. Buice, Peter A. Groblewski, Gabriel K. Ocker, Michael Oliver, David Feng, Nicholas Cain, Peter Ledochowitsch, Daniel Millman, Kate Roll, Marina Garrett, Tom Keenan, Leonard Kuan, Stefan Mihalas, Shawn Olsen, Carol Thompson, Wayne Wakeman, Jack Waters, Derric Williams, Chris Barber, Nathan Berbesque, Brandon Blanchard, Nicholas Bowles, Shiella D. Caldejon, Linzy Casal, Andrew Cho, Sissy Cross, Chinh Dang, Tim Dolbeare, Melise Edwards, John Galbraith, Nathalie Gaudreault, Terri L. Gilbert, Fiona Griffin, Perry Hargrave, Robert Howard, Lawrence Huang, Sean Jewell, Nika Keller, Ulf Knoblich, Josh D. Larkin, Rachael Larsen, Chris Lau, Eric Lee, Felix Lee, Arielle Leon, Lu Li, Fuhui Long, Jennifer Luviano, Kyla Mace, Thuyanh Nguyen, Jed Perkins, Miranda Robertson, Sam Seid, Eric Shea-Brown, Jianghong Shi, Nathan Sjoquist, Cliff Slaughterbeck, David Sullivan, Ryan Valenza, Casey White, Ali Williford, Daniela M. Witten, Jun Zhuang, Hongkui Zeng, Colin Farrell, Lydia Ng, Amy Bernard, John W. Phillips, R. Clay Reid, and Christof Koch. A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex. Nature Neuroscience, 23(1):138–151, December 2019. doi: 10.1038/s41593-019-0550-9.
Dhawale et al. [2017] Ashesh K. Dhawale, Rajesh Poddar, Steffen B. E. Wolff, Valentin A. Normand, Evi Kopelowitz, and Bence P. Ölveczky. Automated long-term recording and analysis of neural activity in behaving animals. eLife, 6, September 2017. ISSN 2050-084X. doi: 10.7554/eLife.27702.
Driscoll et al. [2017] Laura N. Driscoll, Noah L. Pettit, Matthias Minderer, Selmaan N. Chettih, and Christopher D. Harvey. Dynamic reorganization of neuronal activity patterns in parietal cortex. Cell, 170(5):986–999.e16, August 2017. doi: 10.1016/j.cell.2017.07.021.
Engel and Steinmetz [2019] Tatiana A. Engel and Nicholas A. Steinmetz. New perspectives on dimensionality and variability from large-scale cortical dynamics. Current opinion in neurobiology, 58:181–190, October 2019. ISSN 1873-6882. doi: 10.1016/j.conb.2019.09.003.
Faisal et al. [2008] A. Aldo Faisal, Luc P. J. Selen, and Daniel M. Wolpert. Noise in the nervous system. Nature reviews. Neuroscience, 9:292–303, April 2008. ISSN 1471-0048. doi: 10.1038/nrn2258.
Funamizu et al. [2023] Akihiro Funamizu, Fred Marbach, and Anthony M Zador. Stable sound decoding despite modulated sound representation in the auditory cortex. bioRxiv, 2023. doi: 10.1101/2023.01.31.526457. URL https://www.biorxiv.org/content/early/2023/02/07/2023.01.31.526457.
Gallego et al. [2020] Juan A. Gallego, Matthew G. Perich, Raeed H. Chowdhury, Sara A. Solla, and Lee E. Miller. Long-term stability of cortical population dynamics underlying consistent behavior. Nature neuroscience, 23:260–270, February 2020. ISSN 1546-1726. doi: 10.1038/s41593-019-0555-4.
Glickfeld et al. [2013] Lindsey L. Glickfeld, Mark H. Histed, and John H. R. Maunsell. Mouse primary visual cortex is used to detect both orientation and contrast changes. The Journal of neuroscience : the official journal of the Society for Neuroscience, 33:19416–19422, December 2013. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.3560-13.2013.
Goris et al. [2018] Robbe L. T. Goris, Corey M. Ziemba, J. Anthony Movshon, and Eero P. Simoncelli. Slow gain fluctuations limit benefits of temporal integration in visual cortex. Journal of Vision, 18(8):8, August 2018. ISSN 1534-7362. doi: 10.1167/18.8.8.
Huber et al. [2012] D. Huber, D. A. Gutnisky, S. Peron, D. H. O’Connor, J. S. Wiegert, L. Tian, T. G. Oertner, L. L. Looger, and K. Svoboda. Multiple dynamic representations in the motor cortex during sensorimotor learning. Nature, 484:473–478, April 2012. ISSN 1476-4687. doi: 10.1038/nature11039.
Levy et al. [2021] Samuel J. Levy, Nathaniel R. Kinsky, William Mau, David W. Sullivan, and Michael E. Hasselmo. Hippocampal spatial memory representations in mice are heterogeneously stable. Hippocampus, 31:244–260, March 2021. ISSN 1098-1063. doi: 10.1002/hipo.23272.
Liang et al. [2015] Feixue Liang, Xiaorui R. Xiong, Brian Zingg, Xu-ying Ji, Li I. Zhang, and Huizhong W. Tao. Sensory cortical control of a visually induced arrest behavior via corticotectal projections. Neuron, 86:755–767, May 2015. ISSN 1097-4199. doi: 10.1016/j.neuron.2015.03.048.
Lütcke et al. [2013] Henry Lütcke, David J. Margolis, and Fritjof Helmchen. Steady or changing? Long-term monitoring of neuronal population activity. Trends in neurosciences, 36(7):375–384, July 2013. ISSN 1878-108X. doi: 10.1016/j.tins.2013.03.008.
Mankin et al. [2015] Emily A. Mankin, Geoffrey W. Diehl, Fraser T. Sparks, Stefan Leutgeb, and Jill K. Leutgeb. Hippocampal CA2 activity patterns change over time to a larger extent than between spatial contexts. Neuron, 85:190–201, January 2015. ISSN 1097-4199. doi: 10.1016/j.neuron.2014.12.001.
Marks and Goard [2021] Tyler D. Marks and Michael J. Goard. Stimulus-dependent representational drift in primary visual cortex. Nature Communications, 12(1), August 2021. doi: 10.1038/s41467-021-25436-3.
Montijn et al. [2015] Jorrit S. Montijn, Pieter M. Goltstein, and Cyriel M. A. Pennartz. Mouse V1 population correlates of visual detection rely on heterogeneity within neuronal response patterns. eLife, 4:e10163, December 2015. ISSN 2050-084X. doi: 10.7554/eLife.10163.
Papyan et al. [2020] Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences of the United States of America, 117:24652–24663, October 2020. ISSN 1091-6490. doi: 10.1073/pnas.2015509117.
Poort et al. [2015] Jasper Poort, Adil G. Khan, Marius Pachitariu, Abdellatif Nemri, Ivana Orsolic, Julija Krupic, Marius Bauza, Maneesh Sahani, Georg B. Keller, Thomas D. Mrsic-Flogel, and Sonja B. Hofer. Learning enhances sensory and multiple non-sensory representations in primary visual cortex. Neuron, 86(6):1478–1490, June 2015. ISSN 0896-6273. doi: 10.1016/j.neuron.2015.05.037.
Resulaj et al. [2018] Arbora Resulaj, Sarah Ruediger, Shawn R Olsen, and Massimo Scanziani. First spikes in visual cortex enable perceptual discrimination. eLife, 7, April 2018. ISSN 2050-084X. doi: 10.7554/elife.34044.
Rokni et al. [2007] Uri Rokni, Andrew G. Richardson, Emilio Bizzi, and H. Sebastian Seung. Motor learning with unstable neural representations. Neuron, 54:653–666, May 2007. ISSN 0896-6273. doi: 10.1016/j.neuron.2007.04.030.
Rubin et al. [2015] Alon Rubin, Nitzan Geva, Liron Sheintuch, and Yaniv Ziv. Hippocampal ensemble dynamics timestamp events in long-term memory. eLife, 4, December 2015. ISSN 2050-084X. doi: 10.7554/eLife.12247.
Rule et al. [2020] Michael E. Rule, Adrianna R. Loback, Dhruva V. Raman, Laura N. Driscoll, Christopher D. Harvey, and Timothy O’Leary. Stable task information from an unstable neural population. eLife, 9, July 2020. ISSN 2050-084X. doi: 10.7554/eLife.51121.
Sadeh and Clopath [2022] Sadra Sadeh and Claudia Clopath. Contribution of behavioural variability to representational drift. eLife, 11:e77907, August 2022. ISSN 2050-084X. doi: 10.7554/eLife.77907. URL https://doi.org/10.7554/eLife.77907.
Salisbury and Palmer [2023] Jared M. Salisbury and Stephanie E. Palmer. A dynamic scale-mixture model of motion in natural scenes. October 2023. doi: 10.1101/2023.10.19.563101.
Saremi and Sejnowski [2013] Saeed Saremi and Terrence J. Sejnowski. Hierarchical model of natural images and the origin of scale invariance. Proceedings of the National Academy of Sciences, 110(8):3071–3076, February 2013. ISSN 1091-6490. doi: 10.1073/pnas.1222618110.
Schneider et al. [2023] Steffen Schneider, Jin Hwa Lee, and Mackenzie Weygandt Mathis. Learnable latent embeddings for joint behavioural and neural analysis. Nature, May 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06031-6. URL https://doi.org/10.1038/s41586-023-06031-6.
Schoonover et al. [2021] Carl E. Schoonover, Sarah N. Ohashi, Richard Axel, and Andrew J. P. Fink. Representational drift in primary olfactory cortex. Nature, 594(7864):541–546, June 2021. doi: 10.1038/s41586-021-03628-7.
Schölvinck et al. [2015] Marieke L. Schölvinck, Aman B. Saleem, Andrea Benucci, Kenneth D. Harris, and Matteo Carandini. Cortical state determines global variability and correlations in visual cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience, 35:170–178, January 2015. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.4994-13.2015.
Sheintuch et al. [2020] Liron Sheintuch, Nitzan Geva, Hadas Baumer, Yoav Rechavi, Alon Rubin, and Yaniv Ziv. Multiple maps of the same spatial context can stably coexist in the mouse hippocampus. Current biology : CB, 30:1467–1476.e6, April 2020. ISSN 1879-0445. doi: 10.1016/j.cub.2020.02.018.
Stringer et al. [2019a] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Matteo Carandini, and Kenneth D. Harris. High-dimensional geometry of population responses in visual cortex. Nature, 571:361–365, July 2019a. ISSN 1476-4687. doi: 10.1038/s41586-019-1346-5.
Stringer et al. [2019b] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Charu Bai Reddy, Matteo Carandini, and Kenneth D. Harris. Spontaneous behaviors drive multidimensional, brainwide activity. Science, 364(6437), April 2019b. doi: 10.1126/science.aav7893.
Tolhurst et al. [1983] D. J. Tolhurst, J. A. Movshon, and A. F. Dean. The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision research, 23:775–785, 1983. ISSN 0042-6989. doi: 10.1016/0042-6989(83)90200-6.
van den Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. July 2018.
Xia et al. [2021] Ji Xia, Tyler D. Marks, Michael J. Goard, and Ralf Wessel. Stable representation of a naturalistic movie emerges from episodic activity with gain variability. Nature communications, 12:5170, August 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-25437-2.
Zhang et al. [2014] Siyu Zhang, Min Xu, Tsukasa Kamigaki, Johnny Phong Hoang Do, Wei-Cheng Chang, Sean Jenvay, Kazunari Miyamichi, Liqun Luo, and Yang Dan. Selective attention. Long-range and local circuits for top-down modulation of visual cortex processing. Science (New York, N.Y.), 345:660–665, August 2014. ISSN 1095-9203. doi: 10.1126/science.1254126.
Zhao et al. [2014] Xinyu Zhao, Mingna Liu, and Jianhua Cang. Visual cortex modulates the magnitude but not the selectivity of looming-evoked responses in the superior colliculus of awake mice. Neuron, 84:202–213, October 2014. ISSN 1097-4199. doi: 10.1016/j.neuron.2014.08.037.
Zhou and Wei [2020] Ding Zhou and Xue-Xin Wei. Learning identifiable and interpretable latent models of high-dimensional neural activity using pi-VAE. NeurIPS 2020, November 2020.
Ziv et al. [2013] Yaniv Ziv, Laurie D Burns, Eric D Cocker, Elizabeth O Hamel, Kunal K Ghosh, Lacey J Kitch, Abbas El Gamal, and Mark J Schnitzer. Long-term dynamics of CA1 hippocampal place codes. Nature Neuroscience, 16(3):264–266, February 2013. doi: 10.1038/nn.3329.