
Self-Attention in Transformer Networks Explains Monkeys’ Gaze Pattern in Pac-Man Game

Zhongqiao Lin, Institute of Neuroscience, Key Laboratory of
Brain Cognition and Brain-inspired Intelligence Technology,
Center for Excellence in Brain Science and Intelligence Technology,
Chinese Academy of Sciences, Shanghai, China
Yunwei Li, Institute of Neuroscience, Key Laboratory of
Brain Cognition and Brain-inspired Intelligence Technology,
Center for Excellence in Brain Science and Intelligence Technology,
Chinese Academy of Sciences, Shanghai, China
University of Chinese Academy of Sciences
{zqlin, liyw2021, tyang}@ion.ac.cn
Tianming Yang, Institute of Neuroscience, Key Laboratory of
Brain Cognition and Brain-inspired Intelligence Technology,
Center for Excellence in Brain Science and Intelligence Technology,
Chinese Academy of Sciences, Shanghai, China
Abstract

We proactively direct our eyes and attention to collect information during problem solving and decision making. Understanding gaze patterns is crucial for gaining insights into the computation underlying the problem-solving process. However, there is a lack of interpretable models that can account for how the brain directs the eyes to collect information and utilize it, especially in the context of complex problem solving. In the current study, we analyzed the gaze patterns of two monkeys playing the Pac-Man game. We trained a transformer network to mimic the monkeys’ gameplay and found its attention pattern captures the monkeys’ eye movements. In addition, the prediction based on the transformer network’s attention outperforms the human subjects’ predictions. Importantly, we dissected the computation underlying the attention mechanism of the transformer network, revealing its layered structures reflecting a value-based attention component and a component that captures the interactions between Pac-Man and other game objects. Based on these findings, we built a condensed attention model that is not only as accurate as the transformer network but also fully interpretable. Our results highlight the potential of using transformer neural networks to model and understand the cognitive processes underlying complex problem solving in the brain, opening new avenues for investigating the neural basis of cognition.

Footnote 1: Corresponding author.
Figure 1: Pac-Man game and transformer model. A. The adapted monkey version of Pac-Man. Compared to the original version, it has only two ghosts. Monkeys receive a juice reward, the size of which is indicated in the top-right corner of each object, for eating pellets and other objects in the game. Getting caught by a ghost incurs a time-out as a penalty. B. Transformer network model. The inputs are two frames before Pac-Man reaches the junction, and the network is trained to predict the monkeys’ directional choice at the junction. C. Determining the model size. We varied the number of layers, the number of heads, and the dimensionality of embedding tokens and tested the models for their performance. The chosen model is indicated by the red frame (2 layers, 2 heads, 48 dimensions).

1 Introduction

Our eyes are not merely a camera. They are an integral component of the brain’s decision-making and problem-solving, during which the brain proactively directs the eyes to collect information. The gaze patterns of humans and animals have a long history of research and a substantial body of literature in neuroscience and cognitive science tanenhaus1995integration ; gold2000representation ; loetscher2010eye ; lakshminarasimhan2020tracking ; andersen1989visual . Understanding gaze patterns is not only crucial for understanding the cognitive processes in the brain corbetta2002control ; adams2015active , but also advances the field of deep learning in solving complex problems karessli2017gaze ; sood2023multimodal .

One important approach to understanding gaze patterns is to build a generative model that can predict eye movements in a behavioral context. However, this has been an extremely challenging task. Traditional modeling approaches formalize eye movements as the joint result of bottom-up and top-down processes gluth2020value . The salience of visual objects treisman1980feature ; buchner2009geometry and the need to collect information and reduce uncertainty in the task rutishauser2007probabilistic ; adams2012smooth ; callaway2021fixation are two major driving forces behind eye movements. These models typically require heavy heuristic design to characterize the visual features. They are often limited to simple cognitive tasks, and their prediction accuracy is less than ideal.

An alternative approach is to fit an artificial neural network (ANN) directly to the eye movement data assens2017saltinet ; huang2018predicting ; sun2019visual ; wang2024transgop . While these models often predict gaze patterns more accurately even in complex situations, they are typically treated as black boxes and are usually trained on eye data collected from behavioral contexts that are not goal-directed. Several recent studies incorporated heuristic inductive biases of eye movements back into the deep learning models (e.g., hahn2018modeling ; li2023modeling ). While this approach offers better interpretability on top of the ANN models, its performance may depend on how accurately the introduced bias captures the underlying mechanism.

While these approaches directly model the eye-tracking data, recent studies have begun to investigate the parallels between the self-attention mechanism in the deep learning literature and human attention. Correlations with human gaze patterns have been demonstrated in convolutional neural networks (CNN) lai2020understanding ; sood2020interpreting , recurrent neural networks (RNN) sood2020interpreting ; sen2020human , and transformers eberle2022transformer ; brandl2022every ; bensemann2022eye ; lampinen2024multimodality , none of which is explicitly trained with human eye-tracking data. Furthermore, some studies report that integrating a gaze predictor module into language models and decision models enhances the performance of the primary tasks sood2020improving ; koorathota2023fixating ; thammineni2023selective . However, the underlying computation in these models remains a black box, which limits their use in understanding how the brain directs its attention.

Figure 2: Example scenarios. The network’s attention rollout scores (indicated by brightness) closely match the monkeys’ gaze locations (red dots), even during complex planning sequences. A. As Pac-Man approached a junction with two branches ahead, the monkey’s gaze was fixated on a scared ghost further away, rather than just focusing on the immediate branches. This gaze pattern indicated the monkey’s plan to pursue and hunt the ghost. B. When Pac-Man was far from the remaining dots and the respawn spot, the monkey ultimately committed suicide to take advantage of the respawn mechanic. We observed that the monkey’s gaze alternated between the ghost and the pellets near the respawn spot, suggesting that it was planning its suicide strategy. See Figure 10 in the Appendix for more scenarios.

In this work, we investigated the gaze patterns in monkeys that were trained to play the Pac-Man game. We trained transformer networks to predict the animals’ choices at maze junctions. We hypothesized that the networks’ attention mechanism should reflect how the monkeys direct their attention and collect game information to play the game. We aimed to dissect the computation of the network to understand the cognitive process underlying the monkeys’ eye movements and gameplay.

The results show that despite not being trained explicitly to replicate the monkeys’ gaze pattern, the attention pattern of the transformer network closely resembles the animals’ gaze pattern. Furthermore, we dissected the computation underlying the attention mechanism in the transformer network into a condensed model. The condensed model is as accurate as the transformer network in predicting the monkeys’ gaze pattern and fully interpretable. The model can be potentially validated in future experiments. Our study demonstrates the potential of using transformer models to understand the cognitive processes in the brain.

2 Methods

2.1 Behavior experiments

The monkey behavior experiments and data collection were reported in a previous paper yang2022monkey . Briefly, two rhesus monkeys were trained to play an adapted Pac-Man game. They gained juice rewards by clearing all the dots in a maze and avoiding being caught by ghosts. Taking an energizer turned the ghosts into a scared mode, in which the ghosts could be beaten for reward. During the entire experiment, the eye traces of the monkeys were recorded and processed with an infrared eye-tracking system (EyeLink® 1000 Plus). The eye-tracking data from 20 sessions of monkey O and 15 sessions of monkey D are used for the analyses in the current study. The main text shows the results for monkey O, while the analyses of monkey D produce similar results and are included in the Appendix.

2.2 Training data

We trained transformer networks to predict the monkeys’ directional choice when Pac-Man is at a junction tile. The input to the networks includes two game frames in which Pac-Man is one and two tiles away from entering the junction. Each frame is a $32\times 28\times 17$ tensor, where 32 and 28 are the height and width of the maze in the unit of tiles, and 17 is the number of feature channels, each containing a binary indicator for the respective object. The two frames are concatenated into a tensor ${\bf S}\in\mathbb{R}^{32\times 28\times 34}$. Details about the channels can be found in the Appendix.
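The input layout described above can be sketched in NumPy as follows. This is only an illustration: the function name `stack_frames`, the frame ordering along the channel axis, and the random placeholder frames are assumptions, not the authors' preprocessing code.

```python
import numpy as np

# Maze size in tiles and number of feature channels, as stated in the text.
HEIGHT, WIDTH, CHANNELS = 32, 28, 17


def stack_frames(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Concatenate two (32, 28, 17) binary frames along the channel axis."""
    assert frame_a.shape == frame_b.shape == (HEIGHT, WIDTH, CHANNELS)
    return np.concatenate([frame_a, frame_b], axis=-1)


# Placeholder frames with random binary indicators (assumption: real frames
# would be built from the game log, two and one tiles before the junction).
rng = np.random.default_rng(0)
frame_t2 = rng.integers(0, 2, size=(HEIGHT, WIDTH, CHANNELS)).astype(np.float32)
frame_t1 = rng.integers(0, 2, size=(HEIGHT, WIDTH, CHANNELS)).astype(np.float32)

S = stack_frames(frame_t2, frame_t1)
print(S.shape)  # (32, 28, 34)
```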

2.3 Transformer Network

The networks that we studied are based on the standard vision transformers (ViT) dosovitskiy2020image ; beyer2022better . The input ${\bf S}$ was reshaped into a sequence of flattened patches $s_{t}\in\mathbb{R}^{2^{2}\times 34}$, where $2\times 2$ is the size of each patch. The flattened patches were mapped to $d$ dimensions with a trainable embedding projection. We used fixed 2D sin-cos position embeddings and a learnable classification (CLS) token.
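A minimal NumPy sketch of this tokenization step is given below. It assumes a row-major ordering of patches, a particular ordering of values inside each flattened patch, and one common formulation of the fixed 2D sin-cos embedding; the helper names `patchify` and `sincos_2d` and the random stand-ins for the trained parameters are hypothetical.

```python
import numpy as np


def patchify(S: np.ndarray, patch: int = 2) -> np.ndarray:
    """Split an (H, W, C) state into flattened patches, row-major over the patch grid."""
    H, W, C = S.shape
    assert H % patch == 0 and W % patch == 0
    x = S.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape((H // patch) * (W // patch), patch * patch * C)


def sincos_2d(n_rows: int, n_cols: int, dim: int, temperature: float = 10000.0) -> np.ndarray:
    """Fixed 2D sin-cos position embedding with one row per patch (assumed formulation)."""
    assert dim % 4 == 0 and dim >= 8
    rows, cols = np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij")
    omega = 1.0 / temperature ** (np.arange(dim // 4) / (dim // 4 - 1))
    r = rows.reshape(-1, 1) * omega
    c = cols.reshape(-1, 1) * omega
    return np.concatenate([np.sin(c), np.cos(c), np.sin(r), np.cos(r)], axis=1)


rng = np.random.default_rng(0)
S = rng.integers(0, 2, size=(32, 28, 34)).astype(np.float32)  # placeholder game state
d = 48                                                        # embedding size of the chosen model
W_emb = rng.normal(scale=0.02, size=(2 * 2 * 34, d))          # stand-in for the trainable projection
x_cls = np.zeros((1, d))                                      # stand-in for the learnable CLS token

S_r = patchify(S)                               # (224, 136)
tokens = S_r @ W_emb + sincos_2d(16, 14, d)     # (224, 48)
X = np.concatenate([x_cls, tokens], axis=0)     # (225, 48), CLS token first
print(S_r.shape, X.shape)  # (224, 136) (225, 48)
```

The resulting 224 patches plus one CLS token give the 225 tokens that appear in the attention matrices analyzed later.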

We determined the size of the transformer network by varying the number of transformer layers, the number of embedding dimensions, and the number of attention heads. We trained the model on 140 sessions and tested it on a held-out test set of 5 sessions. Based on Figure 1C, we chose the model with 2 layers, 2 attention heads, and 48 token dimensions for the rest of the study.

Figure 3: Attention rollout score correlates with eye movements. A. We computed the attention rollout score of each maze patch in each input state and plotted the distance of the monkeys’ gaze location from the patch against the patch’s attention score rank order. Maze patches with higher attention scores are on the left and have smaller distances from the gaze location. Each thin line is a monkey testing session, the blue line is the session average, and the error bars (standard error of the mean) are too small to be visible. The p-value is below machine precision. B. The distribution of the correlation between the attention score and the gaze distance is significantly greater than 0. C. The correlation between the attention score and the gaze distance is significantly greater than the correlation computed with the shuffled data. The middle bar is the data shuffled across all maze patches, and the right bar is the data shuffled only among those with game objects. The error bars (standard error of the mean) are too small to be visible.
Figure 4: Human experiments. We asked human subjects to judge where the monkeys would look when making directional decisions at a junction. A. Correlation between the human subjects’ judgements, the attention rollout scores, and a random control, which assigns attention randomly to the objects in the maze. The humans made similar judgements (bright color), but differed from the network’s attention rollout scores. B. Compared to the human baseline, the prediction based on the attention rollout score had a smaller error. The random control had much greater errors. Each data point is the difference between the model’s prediction error and a human individual’s. A positive difference indicates greater error and worse prediction performance compared to the human. Blue bars are the average differences across all subjects.

3 Results

3.1 Attention rollout

The trained network reached a performance of 87.664% in predicting the monkeys’ choice when Pac-Man is at a junction. To study the network’s attention, we adopted attention rollout abnar2020quantifying to visualize it.

We found that the attention in the network reflects the monkeys’ gaze patterns, even in situations of complex planning. Figure 2 shows two interesting examples. We selected and plotted the monkeys’ gaze locations during a 3-tile timespan before Pac-Man entered a junction along with the network’s attention on the same maze. This was when the monkeys were looking around the maze to decide which direction to turn at the approaching junction. Figure 2A shows a situation in which Pac-Man was facing two branches at the junction. Instead of simply weighing the available rewards in the two branches, the monkey was looking at a scared ghost far away from Pac-Man, suggesting that it was planning a hunt. The attention of the model captures this long-range planning. Figure 2B is another example. It involves an interesting strategy that the monkey learned to use, namely suicide yang2022monkey . Sometimes, when Pac-Man is far away from the respawn spot but the remaining pellets are near the respawn spot, the monkey would choose to run into a ghost to die. After the respawn, it could collect the remaining pellets more easily. We observed in this example that the monkey was looking at both the ghost and the pellets around the respawn spot, suggesting that it was planning the suicide. Again, the network’s attention score captures the same pattern. More examples are included in the Appendix (Figure 10).

To quantify the similarity between the networks’ attention and the monkeys’ gaze, we divided the maze into patches, sorted them by their attention rollout score, and plotted their average distance to the eye positions across samples. Patches with the highest attention rollout scores were closest to the monkeys’ gaze location (Figure 3A, Spearman correlation, corr=0.9995, $p$ is below machine precision). The distribution of the Spearman correlation over all samples is significantly shifted above zero (Figure 3B). We further calculated the 2-d cross-correlation between the attention rollout score map and the gaze heatmap. The correlation was significantly higher than that of the attention scores shuffled across patches or shuffled only among patches that contain objects (Figure 3C).
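The rank-order analysis can be illustrated with a short NumPy sketch on synthetic data. The helper names, the single synthetic gaze location, and the noise model are assumptions for illustration; the paper averages over many samples and sessions, and the ordinal ranking here does not handle ties.

```python
import numpy as np


def rank(values: np.ndarray) -> np.ndarray:
    """Ordinal ranks (0 = smallest); ties are not averaged in this sketch."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])


def gaze_distance_by_rank(scores: np.ndarray, centers: np.ndarray, gaze: np.ndarray) -> np.ndarray:
    """Gaze-to-patch distances, ordered from the highest to the lowest attention score."""
    dist = np.linalg.norm(centers - gaze, axis=1)
    return dist[np.argsort(-scores)]


# Synthetic example on a 16 x 14 patch grid (coordinates in patch units).
rng = np.random.default_rng(0)
rows, cols = np.meshgrid(np.arange(16), np.arange(14), indexing="ij")
centers = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
gaze = np.array([5.3, 8.7])

# Synthetic attention scores that decrease with distance from the gaze, plus noise.
dist = np.linalg.norm(centers - gaze, axis=1)
scores = -dist + rng.normal(scale=0.1, size=dist.shape)

curve = gaze_distance_by_rank(scores, centers, gaze)
rho = spearman(np.arange(len(curve)), curve)
print(rho > 0.9)  # True for this synthetic example
```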

Predicting the monkeys’ gaze patterns is not trivial. To demonstrate this, we conducted an experiment in which we asked ten human subjects to judge where the monkeys should look during the game. We showed each participant 100 game states that were drawn from the sample pool used to train the transformer network and asked them to mark what they thought were the tiles most relevant to the impending decision. Using Jaccard similarity, we observed that the human subjects agreed with each other on these tiles, but their guesses differed from the network’s choices (Figure 4A). Using the human subjects’ prediction error as a baseline, we found that the model’s prediction error was lower, suggesting its advantage over humans (Figure 4B). In comparison, a random model had significantly larger error.
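For reference, the Jaccard similarity between two sets of marked tiles is the size of their intersection divided by the size of their union. The sketch below uses made-up tile coordinates, and returning 1.0 for two empty sets is a convention chosen here, not something stated in the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical (row, column) tiles marked by two subjects for the same game state.
subject_1 = {(3, 4), (3, 5), (10, 12)}
subject_2 = {(3, 5), (10, 12), (11, 12)}
print(jaccard(subject_1, subject_2))  # 0.5
```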

Figure 5: Layer 1 attention. A. Object-feature component in the two heads. Weights for Pac-Man are largest in head 1 (bright color), and head 2’s weights are the largest for the objects that yield the greatest rewards. The weights for the two frames ($t-1$ and $t-2$) are shown separately. B. Left: head 2’s mean attention score (left y-axis) correlates with the objects’ value (right y-axis). Right: head 1’s mean attention score for Pac-Man (blue bar) dominates over the other objects (gray bar). C. The maze component of layer 1’s attention.

3.2 Interpreting attention

Next, we aimed to reveal the computation in the attention rollout score in our model. We first expanded the equation for rollout score by the layers (Note that we only care about the attention that affects the CLS token, i.e. the first row of the attention matrix. We omit the head notations for simplicity):

$$\tilde{a}_{CLS}=0.25\left({\bf A}^{(1)}_{1,2:}+{\bf A}^{(2)}_{1,2:}+{\bf A}^{(2)}_{1,:}{\bf A}^{(1)}_{:,2:}\right), \tag{1}$$

where ${\bf A}^{(1)}$ and ${\bf A}^{(2)}\in\mathbb{R}^{225\times 225}$ are the attention score matrices of the first and the second layer.
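The expansion in Equation 1 can be checked numerically. The sketch below assumes the usual attention rollout recursion, in which each layer contributes $0.5({\bf I}+{\bf A})$ and later layers multiply from the left, and it uses random row-stochastic matrices as stand-ins for the head-averaged attention maps. Indices are 0-based in the code, so the CLS row is row 0 and the patch columns are `1:`.

```python
import numpy as np


def random_attention(n: int, rng: np.random.Generator) -> np.ndarray:
    """Random row-stochastic matrix standing in for a head-averaged attention map."""
    logits = rng.normal(size=(n, n))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def rollout(attentions: list) -> np.ndarray:
    """Attention rollout: multiply 0.5 * (I + A) over layers, later layers on the left."""
    n = attentions[0].shape[0]
    result = np.eye(n)
    for A in attentions:
        result = 0.5 * (np.eye(n) + A) @ result
    return result


rng = np.random.default_rng(0)
n_tokens = 225  # CLS token + 224 patches
A1 = random_attention(n_tokens, rng)
A2 = random_attention(n_tokens, rng)

full = rollout([A1, A2])[0, 1:]  # CLS row, patch columns
expanded = 0.25 * (A1[0, 1:] + A2[0, 1:] + A2[0, :] @ A1[:, 1:])
print(np.allclose(full, expanded))  # True
```

The identity term drops out because the CLS row of the identity matrix is zero at every patch column.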

We noticed that the multiplication term can be ignored. The norm of the multiplication term between ${\bf A}^{(1)}$ and ${\bf A}^{(2)}$ is much smaller than those of the other two attention vectors (Appendix Figure 9A). The multiplication term contributed only a small proportion in all samples (Appendix Figure 9B). Approximating the rollout score by removing the multiplication term is nearly perfect (Appendix Figure 9C). Therefore, we will only examine the remaining two attention vectors.

3.2.1 First layer

The attention score of the first layer is:

$${\bf A}^{(1)}_{1,2:}=\mathrm{softmax}(\alpha^{1}_{CLS,2:}), \tag{2}$$

$$\alpha^{1}_{CLS,2:}=x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf X}_{2:,:}^{T}=x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{emb}^{T}{\bf S}_{r}^{T}+x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf E}_{pos}^{T}, \tag{3}$$

where $x_{CLS}$ is the learnable CLS token, ${\bf W}^{1}_{Q}$, ${\bf W}^{1}_{K}\in\mathbb{R}^{d\times{d/h\cdot h}}$ and ${\bf W}_{emb}\in\mathbb{R}^{136\times{d}}$ are the query and key weight matrices of the first layer and the linear embedding matrix, respectively. ${\bf S}_{r}\in\mathbb{R}^{224\times{136}}$ is the reshaped input game state, and ${\bf E}_{pos}$ is the positional embedding matrix. As the positional encoding and the maze channels in the input encode maze information but are otherwise invariant in the game, $\alpha^{1}_{CLS,2:}$ can be rewritten as the sum of a maze component and an object-dependent component:

$$\begin{aligned}
\alpha^{1}_{CLS,2:} &= x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{maze}^{T}{\bf S}_{r,maze}^{T}+x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf E}_{pos}^{T}+x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{obj}^{T}{\bf S}_{r,obj}^{T} \\
&= \alpha_{maze}+x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{obj}^{T}{\bf S}_{r,obj}^{T},
\end{aligned} \tag{4}$$

where ${\bf W}_{maze}\in\mathbb{R}^{8\times{d}}$ and ${\bf W}_{obj}\in\mathbb{R}^{128\times{d}}$ are sub-matrices that consist of the rows corresponding to the maze channels and the other object channels of the embedding matrix ${\bf W}_{emb}$, respectively. ${\bf S}_{r,maze}$ and ${\bf S}_{r,obj}$ are sub-matrices of the corresponding dimensions of ${\bf S}_{r}$. $\alpha_{maze}$ is the maze component of the CLS token attention and reflects the attention bias within the maze.
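The decomposition in Equation 4 follows from the linearity of the pre-softmax attention logits in the input. A small NumPy check is sketched below with random stand-ins for all trained parameters. The positions of the maze channels within the 136 flattened input dimensions (`maze_idx`) are a hypothetical layout, and the sketch ignores the multi-head split, layer normalization, and any scaling, as the equations above do.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches, n_in = 48, 224, 136

# Hypothetical layout: the first 8 flattened input dimensions are maze channels.
maze_idx = np.arange(8)
obj_idx = np.arange(8, n_in)

# Random stand-ins for the trained parameters and a placeholder game state.
x_cls = rng.normal(size=(1, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_emb = rng.normal(size=(n_in, d))
E_pos = rng.normal(size=(n_patches, d))
S_r = rng.integers(0, 2, size=(n_patches, n_in)).astype(float)

q = x_cls @ W_Q @ W_K.T          # (1, d): CLS query mapped into key space
X_patches = S_r @ W_emb + E_pos  # patch tokens entering layer 1
alpha = q @ X_patches.T          # Eq. 3, shape (1, 224)

# Maze component: maze channels plus positional embedding (fixed for a given maze).
alpha_maze = q @ (S_r[:, maze_idx] @ W_emb[maze_idx]).T + q @ E_pos.T
# Object component: depends on where the game objects are in the current state.
alpha_obj = q @ (S_r[:, obj_idx] @ W_emb[obj_idx]).T

print(np.allclose(alpha, alpha_maze + alpha_obj))  # True
```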

Figure 6: Targeted attention removal. A. An illustration of our targeted attention removal protocol. B. Manipulations that destroy the attention to Pac-Man lead to big effects and low correlations with the normal network.

We examined separately the two heads’ attention weights on the game objects based on the second term, $x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{obj}^{T}$ (Figure 5A), which is a $1\times{136}$ vector. In head 1, the attention weight of Pac-Man is predominantly high, suggesting that the first head’s attention is mostly on Pac-Man. Head 2’s attention weights are high for objects that yield rewards. They include pellets, energizers, fruits, and scared ghosts, but not ghosts in the normal or the dead mode. The mean attention scores correlate with the reward value of these objects (Figure 5B). Therefore, head 2’s attention weights create a reward salience map for the objects in the game. Interestingly, a similar value-based salience map has been demonstrated in the orbitofrontal cortex zhang2022reward .

The maze component ${\bf A}_{maze}$ in the two heads also reveals the functional division between the two heads. Head 1’s maze component captures the junctions, passes, and the general structure of the maze. Combined with its focus on Pac-Man, head 1 may be important in computing where Pac-Man can go based on the maze structure. In contrast, the maze component of head 2 does not reflect any fine structures. It might represent a coarse-grained positional bias of the monkey.

Overall, the first-layer attention is driven by where Pac-Man is and what the non-Pac-Man objects’ reward values are. This is similar to the type of attention termed bottom-up in cognitive sciences to describe the attention driven by sensory inputs.

3.2.2 Second layer

Due to the transformer’s architecture, the attention of CLS in layer 1 does not capture the interactions between the game objects, which are essential for the network to play the game. This should be achieved at layer 2, where each token contains the information from the other tokens through the attention mechanism of layer 1. To understand how the attention in layer 2 captures the interactions between the game objects, we introduce a manipulation procedure called targeted attention removal (Figure 6), which removes the effect of interaction between specific objects in the game from layer 2 attention by manipulating the attention block of the first layer.

To achieve this, we first created a game state by manipulating the targeted objects and then fed it to the embedding layer. In a separate network, we also generated embeddings for the normal inputs. The manipulation can be seen as computing a variant of cross-attention vaswani2017attention between the embedding of a patch from the normal game state and the embedding of the tampered game state, where the targeted objects were removed from contributing to the attention. Thus, the manipulation restricts the scope of attention from the objects in the game to the objects that are intact in the manipulation and to themselves. We then replaced the output of the first-layer attention block with this manipulated output and continued the forward pass.

To be concrete, we computed the manipulated output of attention $\overline{Attn}^{i}$ at layer $i$ as follows:

$$\overline{Attn}^{i}(x_{m},\overline{\bf X})=a^{i}_{m,m}x_{m}{\bf W}^{i}_{V}+\sum_{n}{\overline{a}^{i}_{m,n\neq m}\overline{x}_{n}{\bf W}^{i}_{V}} \tag{5}$$

$$\overline{a}^{i}_{m,n}=\mathrm{softmax}({x}_{m}{\bf W}_{Q}^{i}{\bf W}_{K}^{i}\overline{x}_{n}^{T}), \tag{6}$$

where 𝐖Visubscriptsuperscript𝐖𝑖𝑉{\bf W}^{i}_{V}bold_W start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT is the value weight matrix of the i-th𝑖-𝑡i\text{-}thitalic_i - italic_t italic_h transformer attention block, am,nisubscriptsuperscript𝑎𝑖𝑚𝑛a^{i}_{m,n}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m , italic_n end_POSTSUBSCRIPT and a¯m,nisubscriptsuperscript¯𝑎𝑖𝑚𝑛\overline{a}^{i}_{m,n}over¯ start_ARG italic_a end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m , italic_n end_POSTSUBSCRIPT are the original and manipulated attention score between the m-th𝑚-𝑡m\text{-}thitalic_m - italic_t italic_h patch and the n-th𝑛-𝑡n\text{-}thitalic_n - italic_t italic_h patch of the i-th𝑖-𝑡i\text{-}thitalic_i - italic_t italic_h layer, 𝐗¯¯𝐗\overline{\bf X}over¯ start_ARG bold_X end_ARG is the embedding of the manipulated input, xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and x¯msubscript¯𝑥𝑚\overline{x}_{m}over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT are the embeddings of the m-th𝑚-𝑡m\text{-}thitalic_m - italic_t italic_h tile of the original input and the manipulated input, respectively.
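The computation in equations 5 and 6 can be illustrated with a minimal single-head NumPy sketch. The function names, the array shapes, and the choice to normalize each query row with a single softmax (self term computed from the normal embedding, all other terms from the manipulated embedding) are assumptions made for illustration; the text does not specify the normalization, and the usual scaling by the square root of the head dimension is omitted to match equation 6.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def manipulated_attention(X, X_bar, W_Q, W_K, W_V):
    """Single-head sketch of equations 5 and 6 (hypothetical helper).

    X, X_bar      : (n_patches, d_model) embeddings of the normal and
                    manipulated inputs.
    W_Q, W_K, W_V : (d_model, d_head) projection matrices.
    """
    Q = X @ W_Q                          # queries always come from the normal input
    logits = Q @ (X_bar @ W_K).T         # keys come from the manipulated input
    # Assumption: a patch attends to itself through its own normal embedding.
    np.fill_diagonal(logits, np.sum(Q * (X @ W_K), axis=1))
    A = softmax(logits, axis=1)          # assumption: one softmax per query row

    self_weight = np.diag(A)[:, None]         # a_{m,m}
    cross_weight = A - np.diag(np.diag(A))    # a-bar_{m,n} for n != m
    return self_weight * (X @ W_V) + cross_weight @ (X_bar @ W_V)
```

Under these assumptions, passing the normal embedding as `X_bar` recovers ordinary self-attention, which serves as a quick sanity check of the sketch.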

Figure 7: Object attention map. The example case shown depicts Pac-Man at a junction in the top-left corner of the maze. Panels A, B, C, and D represent the attention maps for 4 objects: energizer, apple, normal ghost 1, and scared ghost 1, respectively. Darker colors indicate stronger attention on the corresponding locations.

For a given state ${\bf S}$ drawn from the dataset, we defined four cases of targeted attention removal. (1) ${\bf S}_{0}$: all objects including Pac-Man are removed from ${\bf S}$, leaving only the maze channels intact; (2) ${\bf S}_{P}$: similar to ${\bf S}_{0}$, but Pac-Man is kept intact; (3) ${\bf S}_{PR}$: same as ${\bf S}_{P}$, except that Pac-Man is assigned to a random junction; (4) ${\bf S}_{PJ}$: same as ${\bf S}_{P}$, except that Pac-Man is approaching the same junction from a different direction.
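The first two manipulations can be sketched as channel masking on the input tensor described in Appendix A. This is an illustration only: it assumes the 34 channels are two consecutive 17-channel frames in the order of Table 1 (maze first, Pac-Man last), and the helper names are hypothetical.

```python
import numpy as np

N_CHANNELS = 17          # feature channels per frame (Table 1)
MAZE, PACMAN = 0, 16     # 0-based positions of channels 1 and 17

def keep_channels(state, keep):
    """Zero every channel except those listed in `keep`, in every frame.

    Assumes the last axis stacks whole frames one after another.
    """
    out = np.zeros_like(state)
    n_frames = state.shape[-1] // N_CHANNELS
    for frame in range(n_frames):
        for c in keep:
            idx = frame * N_CHANNELS + c
            out[..., idx] = state[..., idx]
    return out

def make_S0(state):
    """S_0: only the maze channels remain."""
    return keep_channels(state, [MAZE])

def make_SP(state):
    """S_P: the maze and Pac-Man channels remain."""
    return keep_channels(state, [MAZE, PACMAN])
```

The remaining two cases, ${\bf S}_{PR}$ and ${\bf S}_{PJ}$, additionally require relocating Pac-Man, which depends on the maze layout and is not sketched here.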

We then computed the correlation between the attention scores of the normal network and those of the network under each of these four manipulations. A low correlation suggests that the manipulation removes attention critical for the network to function, while a high correlation suggests that the removed attention is not necessary.

We found that only the attention between non-Pac-Man objects and Pac-Man is critical (Figure 6B). Manipulations that destroy Pac-Man-related attention (${\bf S}_{0}$ and ${\bf S}_{PR}$) lead to low correlations with the normal control. Removing attention between non-Pac-Man objects (${\bf S}_{P}$) or moving Pac-Man around a junction (${\bf S}_{PJ}$) does not lower the correlation significantly. Therefore, the computation in the layer 2 attention is centered around the attention from the non-Pac-Man objects to Pac-Man, and the attention between the non-Pac-Man objects is irrelevant. Because this attention depends on how an object is related to Pac-Man, reflecting the game rule, it can be considered top-down.

The discovery justifies a significant simplification to the computation. For each non-Pac-Man object, we can compute its attention map by averaging the attention score across samples with respect to Pac-Man’s location (Figure 8A). Given Pac-Man’s location, this attention map reveals how important an object is for the network’s decision if it appears at a certain location.

3.3 A condensed model of attention

With the understanding of what the two layers in our transformer do, we can now describe the network’s attention with two components: a bottom-up component contributed by layer 1 attention and an object-interaction component contributed by layer 2. The former considers Pac-Man’s location and the value of the non-Pac-Man objects, and the latter considers only the interactions between non-Pac-Man objects and Pac-Man. Based on these findings, we formed a condensed model that describes the attention distribution at tile $T$ in any game state $S$ as a function of only the set of objects on tile $T$ and the junction $tile_{pacman}$ where Pac-Man is located:

$$P(attended\;tile=T \mid S)=P(attended\;tile=T \mid obj_{T},tile_{pacman},T)\propto\sum_{j\in obj_{T}}V_{j}+\sum_{j\in obj_{T}}M_{j,tile_{pacman},T}\;, \tag{7}$$

where $V$ is the value-based attention component of game objects, including both Pac-Man and other game objects, extracted from the two heads in the first layer (Figure 5B), and $M$ is an attention map that describes the game object-Pac-Man interaction term of object $j$ on tile $T$ when Pac-Man is at junction $tile_{pacman}$. Here, we also omitted the maze components of layer 1, as their contribution is negligible in most cases.
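Equation 7 amounts to a table lookup followed by a normalization, as the following sketch illustrates. The array layout (tiles flattened to a single index, a binary object-by-tile mask, and an interaction map indexed by object, Pac-Man tile, and target tile) and the function name are assumptions chosen for illustration, not the authors’ data format; the sketch also assumes non-negative scores.

```python
import numpy as np

def condensed_attention(obj_mask, pacman_tile, V, M):
    """Sketch of equation 7 (hypothetical array layout).

    obj_mask    : (n_objects, n_tiles) binary, 1 if object j is on tile T.
    pacman_tile : flattened index of the junction tile Pac-Man occupies.
    V           : (n_objects,) value-based attention of each object.
    M           : (n_objects, n_tiles, n_tiles) interaction map,
                  indexed as M[object, pacman_tile, tile].
    Returns a (n_tiles,) attention distribution over tiles.
    """
    # Contribution of object j if it were on each tile: V_j + M[j, pacman, T].
    per_object = V[:, None] + M[:, pacman_tile, :]
    # Sum over the objects actually present on each tile.
    score = (obj_mask * per_object).sum(axis=0)
    total = score.sum()
    # Normalize so the scores form a distribution (eq. 7 is a proportionality).
    return score / total if total > 0 else score
```

Tiles without objects receive zero attention by construction, which matches the object-level nature of the condensed model.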

The attention reconstructed based on this model correlates with the attention rollout score almost perfectly (Figure 8B). The correlation between the gaze pattern and the network’s attention (Figure 3A) is well preserved in the model (Figure 8C).

Figure 8: A condensed model of attention. A. The model combines both the bottom-up component and the top-down component. The bottom-up component includes the attention to Pac-Man and the reward salience, and the top-down component describes how each object interacts with Pac-Man. B. The Spearman correlation between the original rollout score and the attention predicted by the condensed model is near perfect. C. Attention predicted by the condensed model faithfully captures the gaze pattern.

Conclusion

In summary, we demonstrate that a two-layer transformer network trained to predict monkeys’ choices in the Pac-Man game accurately captures the monkeys’ eye movements during gameplay. We provide a mechanistic explanation of how attention is computed in the network, revealing two distinct components. The first component, computed in the first layer, is driven by the reward saliency of the objects and can be considered bottom-up. Similar saliency signals have been demonstrated in the prefrontal cortex of the brain [36]. The second component, computed in the second layer, focuses on the interaction between non-Pac-Man objects and Pac-Man itself. This interaction depends on the current game state and may be considered top-down. We further created an accurate and condensed attention model based on the computation mechanism of the transformer. Overall, our study provides an interpretable model for the monkeys’ gaze patterns during the Pac-Man game, highlighting the usefulness of transformers as a tool in the investigation of neural and cognitive sciences.

Acknowledgement

This work was supported by National Science and Technology Innovation 2030 Major Program (Grant No. 2021ZD0203701) to T.Y.

References

  • [1] Michael K Tanenhaus, Michael J Spivey-Knowlton, Kathleen M Eberhard, and Julie C Sedivy. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217):1632–1634, 1995.
  • [2] Joshua I Gold and Michael N Shadlen. Representation of a perceptual decision in developing oculomotor commands. Nature, 404(6776):390–394, 2000.
  • [3] Tobias Loetscher, Christopher J Bockisch, Michael ER Nicholls, and Peter Brugger. Eye position predicts what number you have in mind. Current Biology, 20(6):R264–R265, 2010.
  • [4] Kaushik J Lakshminarasimhan, Eric Avila, Erin Neyhart, Gregory C DeAngelis, Xaq Pitkow, and Dora E Angelaki. Tracking the mind’s eye: Primate gaze behavior during virtual visuomotor navigation reflects belief dynamics. Neuron, 106(4):662–674, 2020.
  • [5] Richard A Andersen. Visual and eye movement functions of the posterior parietal cortex. Annual review of neuroscience, 12(1):377–403, 1989.
  • [6] Maurizio Corbetta and Gordon L Shulman. Control of goal-directed and stimulus-driven attention in the brain. Nature reviews neuroscience, 3(3):201–215, 2002.
  • [7] Rick A Adams, Eduardo Aponte, Louise Marshall, and Karl J Friston. Active inference and oculomotor pursuit: The dynamic causal modelling of eye movements. Journal of neuroscience methods, 242:1–14, 2015.
  • [8] Nour Karessli, Zeynep Akata, Bernt Schiele, and Andreas Bulling. Gaze embeddings for zero-shot image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4525–4534, 2017.
  • [9] Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, and Andreas Bulling. Multimodal integration of human-like attention in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2648–2658, 2023.
  • [10] Sebastian Gluth, Nadja Kern, Maria Kortmann, and Cécile L Vitali. Value-based attention but not divisive normalization influences decisions with multiple alternatives. Nature human behaviour, 4(6):634–645, 2020.
  • [11] Anne M Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive psychology, 12(1):97–136, 1980.
  • [12] Simon Buchner, Christoph Holscher, Lars Konieczny, and Jan Wiener. How the geometry of space controls visual attention during spatial decision making. In Proceedings of the annual meeting of the cognitive science society, volume 31, 2009.
  • [13] Ueli Rutishauser and Christof Koch. Probabilistic modeling of eye movement data during conjunction search via feature-based attention. Journal of Vision, 7(6):5–5, 2007.
  • [14] Rick A Adams, Laurent U Perrinet, and Karl Friston. Smooth pursuit and visual occlusion: active inference and oculomotor control in schizophrenia. PloS one, 7(10):e47502, 2012.
  • [15] Frederick Callaway, Antonio Rangel, and Thomas L Griffiths. Fixation patterns in simple choice reflect optimal information sampling. PLoS computational biology, 17(3):e1008863, 2021.
  • [16] Marc Assens Reina, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. Saltinet: Scan-path prediction on 360 degree images using saliency volumes. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2331–2338, 2017.
  • [17] Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In Proceedings of the European conference on computer vision (ECCV), pages 754–769, 2018.
  • [18] Wanjie Sun, Zhenzhong Chen, and Feng Wu. Visual scanpath prediction using ior-roi recurrent mixture density network. IEEE transactions on pattern analysis and machine intelligence, 43(6):2101–2118, 2019.
  • [19] Binglu Wang, Chenxi Guo, Yang Jin, Haisheng Xia, and Nian Liu. Transgop: Transformer-based gaze object prediction. arXiv preprint arXiv:2402.13578, 2024.
  • [20] Michael Hahn and Frank Keller. Modeling task effects in human reading with neural network-based attention. arXiv preprint arXiv:1808.00054, 2018.
  • [21] Jason Li, Nicholas Watters, Hansem Sohn, and Mehrdad Jazayeri. Modeling human eye movements with neural networks in a maze-solving task. In Annual Conference on Neural Information Processing Systems, pages 98–112. PMLR, 2023.
  • [22] Qiuxia Lai, Salman Khan, Yongwei Nie, Hanqiu Sun, Jianbing Shen, and Ling Shao. Understanding more about human and machine attention in deep neural networks. IEEE Transactions on Multimedia, 23:2086–2099, 2020.
  • [23] Ekta Sood, Simon Tannert, Diego Frassinelli, Andreas Bulling, and Ngoc Thang Vu. Interpreting attention models with human visual attention in machine reading comprehension. arXiv preprint arXiv:2010.06396, 2020.
  • [24] Cansu Sen, Thomas Hartvigsen, Biao Yin, Xiangnan Kong, and Elke Rundensteiner. Human attention maps for text classification: Do humans and neural networks focus on the same words? In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4596–4608, 2020.
  • [25] Oliver Eberle, Stephanie Brandl, Jonas Pilot, and Anders Søgaard. Do transformer models show similar attention patterns to task-specific human gaze? In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4295–4309, 2022.
  • [26] Stephanie Brandl and Nora Hollenstein. Every word counts: A multilingual analysis of individual human alignment with model attention. arXiv preprint arXiv:2210.04963, 2022.
  • [27] Joshua Bensemann, Alex Peng, Diana Benavides-Prado, Yang Chen, Neset Tan, Paul Michael Corballis, Patricia Riddle, and Michael J Witbrock. Eye gaze and self-attention: How humans and transformers attend words in sentences. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 75–87, 2022.
  • [28] Andrew Lampinen, Samuel Nastase, Christopher Edwards, Quitterie D’Elascombe, Akilles Rechardt, Jeremy Skipper, Gabriella Vigliocco, et al. Multimodality and attention increase alignment in natural language prediction between humans and computational models. 2024.
  • [29] Ekta Sood, Simon Tannert, Philipp Müller, and Andreas Bulling. Improving natural language processing tasks with human gaze-guided neural attention. Advances in Neural Information Processing Systems, 33:6327–6341, 2020.
  • [30] Sharath Koorathota, Nikolas Papadopoulos, Jia Li Ma, Shruti Kumar, Xiaoxiao Sun, Arunesh Mittal, Patrick Adelman, and Paul Sajda. Fixating on attention: Integrating human eye tracking into vision transformers. arXiv preprint arXiv:2308.13969, 2023.
  • [31] Chaitanya Thammineni, Hemanth Manjunatha, and Ehsan T Esfahani. Selective eye-gaze augmentation to enhance imitation learning in atari games. Neural Computing and Applications, 35(32):23401–23410, 2023.
  • [32] Qianli Yang, Zhongqiao Lin, Wenyi Zhang, Jianshu Li, Xiyuan Chen, Jiaqi Zhang, and Tianming Yang. Monkey plays pac-man with compositional strategies and hierarchical decision-making. Elife, 11:e74500, 2022.
  • [33] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [34] Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Better plain vit baselines for imagenet-1k. arXiv preprint arXiv:2205.01580, 2022.
  • [35] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. arXiv preprint arXiv:2005.00928, 2020.
  • [36] Wenyi Zhang, Yang Xie, and Tianming Yang. Reward salience but not spatial attention dominates the value representation in the orbitofrontal cortex. Nature Communications, 13(1):6306, 2022.
  • [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Appendix A Training Details

Training

We used AdamW as the optimizer, with an initial learning rate of $10^{-4}$ and a weight decay of $10^{-7}$. The batch size was 768. All models were trained on an A100 GPU for 800 epochs. 140 sessions from monkey O and 125 sessions from monkey D were used for training; five held-out sessions from each monkey were used for testing. The ratio of overlapped samples in the training and the testing set was negligible.

Data specification

The input of the model is a $32\times 28\times 34$ tensor, where the last dimension corresponds to 17 feature channels in two frames. The description of each channel can be found in Table 1.

Table 1: Channel Descriptions of the Model Input
| Channel Number | Channel Name | Description |
| --- | --- | --- |
| 1 | maze | 1 if the tile is a wall tile, 0 if the tile is a walkable tile. |
| 2 | bean | 1 if a bean is on the tile, 0 otherwise. |
| 3 | energizer | 1 if an energizer is on the tile. Eating an energizer leads to 4 drops of juice reward and turns the ghosts into the scared mode. |
| 4 | apple | 1 if an apple is on the tile. Eating an apple leads to 12 drops of juice reward. |
| 5 | cherry | 1 if a cherry is on the tile. Eating a cherry leads to 3 drops of juice reward. |
| 6 | melon | 1 if a melon is on the tile. Eating a melon leads to 17 drops of juice reward. |
| 7 | orange | 1 if an orange is on the tile. Eating an orange leads to 8 drops of juice reward. |
| 8 | strawberry | 1 if a strawberry is on the tile. Eating a strawberry leads to 5 drops of juice reward. |
| 9 | ghost 1 (normal) | 1 if ghost 1 in normal mode is on the tile. Colliding with a ghost in normal mode leads to a time-out penalty. After the time-out, Pac-Man respawns at the start tile and continues the previous game. |
| 10 | ghost 1 (beaten) | 1 if ghost 1 in beaten mode is on the tile. A beaten ghost moves back to the ghost house. Colliding with a beaten ghost does not trigger any game events. |
| 11 | ghost 1 (scared) | 1 if ghost 1 in scared mode is on the tile. Eating a scared ghost turns the ghost into the beaten mode and produces 8 drops of juice reward. |
| 12 | ghost 1 (flashing) | 1 if ghost 1 in flashing mode is on the tile. A flashing ghost is a scared ghost that will turn back into the normal mode very soon. |
| 13 | ghost 2 (normal) | 1 if ghost 2 in normal mode is on the tile. |
| 14 | ghost 2 (beaten) | 1 if ghost 2 in beaten mode is on the tile. |
| 15 | ghost 2 (scared) | 1 if ghost 2 in scared mode is on the tile. |
| 16 | ghost 2 (flashing) | 1 if ghost 2 in flashing mode is on the tile. |
| 17 | Pac-Man | 1 if Pac-Man is on the tile, 0 otherwise. |
Figure 9: Attention rollout decomposition. A. The distributions of vector norms of the three components (top) and their relative ratio (bottom) reveal that the multiplicative term is negligible. B. Histogram of the correlation between the original rollout score and the score after ignoring the multiplicative term. The correlation is very close to 1.

Appendix B More Details on the Attention Analyses

20 sessions from monkey O and 15 sessions from monkey D were used for attention analyses. The sessions were chosen for their high-quality eye-tracking data.

Correlation between the gaze pattern and the model attention

To compare the attention of the transformer model and the gaze pattern of the monkeys, we computed the correlation between the attention rollout score of the model and the gaze data in two different ways. First, we report the Spearman correlation between each patch’s rollout score and its distance from the gaze positions. Gaze positions in the time window of the three tiles before the junction were used for the analysis. The distance between the gaze positions and a patch was computed by averaging the distance between each gaze position and its closest tile within the patch. The Spearman correlation and the statistical inference were implemented with the scipy.stats module.

To account for the systematic drift of eye data during the recording, we also computed a second type of correlation based on the 2D cross-correlation. We first computed a heat map of the gaze positions at the same resolution as the patches. We then computed the maximum 2D cross-correlation between the attention rollout score and the gaze heat map with the offset varying from -4 to +4 tiles (Figure 3C). The controls were computed similarly, with the attention scores shuffled across all patches or across only patches with objects.
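The offset search can be sketched as follows. This is an illustrative version under stated assumptions: both maps are 2D arrays on the same grid, the similarity at each offset is the Pearson correlation over the overlapping region (the text does not specify the normalization), and the function name is hypothetical.

```python
import numpy as np

def max_shifted_correlation(attn, gaze, max_shift=4):
    """Best correlation between two 2D maps over integer offsets.

    attn, gaze : 2D arrays of identical shape.
    Returns (best correlation, (dy, dx)), where attn[y, x] is paired
    with gaze[y + dy, x + dx].
    """
    H, W = attn.shape
    best, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Rows and columns of attn that still overlap gaze after shifting.
            y0, y1 = max(0, -dy), min(H, H - dy)
            x0, x1 = max(0, -dx), min(W, W - dx)
            if y1 - y0 < 2 or x1 - x0 < 2:
                continue  # overlap too small to correlate
            a = attn[y0:y1, x0:x1].ravel()
            g = gaze[y0 + dy:y1 + dy, x0 + dx:x1 + dx].ravel()
            if a.std() == 0 or g.std() == 0:
                continue  # correlation undefined for constant maps
            r = np.corrcoef(a, g)[0, 1]
            if r > best:
                best, best_shift = r, (dy, dx)
    return best, best_shift
```

The shuffled controls could be obtained by permuting the entries of `attn` before calling the same function.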

Attention rollout

Following [35], the attention rollout score was computed as below:

$$\tilde{\bf A}^{i}=\begin{cases}(0.5{\bf A}^{i}+0.5{\bf I})\tilde{\bf A}^{i-1}&\text{if}\;i>0\\ (0.5{\bf A}^{i}+0.5{\bf I})&\text{if}\;i=0\end{cases} \tag{8}$$

where ${\bf A}^{i}$ is the attention matrix of layer $i$ summed across all attention heads and ${\bf I}$ is the identity matrix.
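The recursion in equation 8 can be written in a few lines of NumPy. This is a minimal sketch: the function names are hypothetical, the input is assumed to be a list of per-layer attention matrices already aggregated over heads, and the CLS token is assumed to sit at index 0.

```python
import numpy as np

def attention_rollout(attn_layers):
    """Attention rollout following equation 8.

    attn_layers : list of (n_tokens, n_tokens) attention matrices,
                  ordered from the first layer to the last.
    """
    n = attn_layers[0].shape[0]
    identity = np.eye(n)
    # Base case: the first layer mixed with the identity (residual path).
    rollout = 0.5 * attn_layers[0] + 0.5 * identity
    # Recursive case: left-multiply by each later layer's mixed matrix.
    for A in attn_layers[1:]:
        rollout = (0.5 * A + 0.5 * identity) @ rollout
    return rollout

def cls_to_patches(rollout):
    """Row of the CLS token (assumed index 0) over the non-CLS tokens."""
    return rollout[0, 1:]
```

For a two-layer network the result expands to a quarter of the sum of the two attention matrices, their product, and the identity.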

As we only considered the first row of the rollout matrix, i.e., the score depicting how the CLS token attends to the maze patches, we derived equation 1 from equation 8. To measure the relative contribution of the three terms in equation 1, we computed their vector norms. The top row of Figure 9A shows the distributions of the vector norms. The multiplicative term is smaller than the other two terms. The bottom row of Figure 9A further shows that the multiplicative term contributes only minimally in all samples tested.

Figure 10: Additional example scenarios. A. A sequence of game states showing Pac-Man starting a long-distance raid toward far-away pellets. The monkey initially attended to the pellet right above the ghost house, but then switched its attention to the pellet in the top-left corner and went for it. B. In this sequence of game states, the monkey attended to the energizer below Pac-Man at first, but then abandoned the plan and moved Pac-Man upwards. As a result, its attention quickly shifted to the new targets above. C. Two examples where the monkey focused on the immediate rewards and ignored nearby ghosts. In the left case, the ghost caught Pac-Man and the monkey failed to reach the target pellet. In the right case, the ghosts missed Pac-Man and the monkey successfully completed the game.

We approximated the attention rollout score as:

$$\tilde{a}_{approxCLS}={\bf A}^{(1)}_{1,2:}+{\bf A}^{(2)}_{1,2:} \tag{9}$$

The approximate score correlates with the original rollout score almost perfectly, indicating that our simplification is justified.
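To see what the approximation discards, one can expand equation 8 for two layers: the CLS row of the rollout equals one quarter of the sum of the two layers’ CLS rows plus the CLS row of their product, since the identity contributes nothing off the diagonal. The sketch below checks this identity numerically. The function names are hypothetical, and the CLS token is assumed to be at index 0 (row 1 in the 1-based notation of equation 9).

```python
import numpy as np

def cls_rollout_exact(A1, A2):
    """CLS row of the two-layer rollout (equation 8), non-CLS columns only."""
    identity = np.eye(A1.shape[0])
    rollout = (0.5 * A2 + 0.5 * identity) @ (0.5 * A1 + 0.5 * identity)
    return rollout[0, 1:]

def cls_rollout_approx(A1, A2):
    """Additive approximation of equation 9."""
    return A1[0, 1:] + A2[0, 1:]

def multiplicative_term(A1, A2):
    """Term dropped by the approximation: CLS row of the layer product."""
    return (A2 @ A1)[0, 1:]
```

With these definitions, `cls_rollout_exact` equals `0.25 * (cls_rollout_approx + multiplicative_term)`. The constant factor does not affect a rank correlation, so the quality of the approximation depends only on how small the multiplicative term is.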

Layer 1 attention

Because the attention score of layer 1 is fully determined by the objects in a patch and a weight matrix that describes the influence of each object on the attention score, averaging the attention scores of each object across samples and locations provides a reliable estimate of the layer 1 attention. We did this in Figure 5B and later used it as the bottom-up attention component in the simplified model.

Appendix C Human Experiment

Experiment details

Ten healthy human subjects (two male, eight female; age, 24-32) participated in the experiment. During the experiment, they sat in front of a monitor and watched 100 game states that were randomly chosen from monkey O’s dataset. The participants were asked to choose 1-10 tiles that were most relevant to the impending decision at the next junction by clicking the mouse (Figure 11). There was no set time limit for the task, and the participants finished the experiment in 30-55 minutes.

Jaccard similarity

We used the Jaccard index to measure the similarity between the choices of two subjects or between the choices of a subject and a model. The Jaccard index is defined as below:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}, \tag{10}$$

where $A$ and $B$ are the sets of tiles chosen by the two subjects being compared. For the transformer network, we chose the tiles with the highest rollout scores, with the number of tiles matched to that of the human subject being compared against. We also included a control that chose tiles randomly, with the number of chosen tiles also matched.
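The comparison can be sketched in plain Python. The function names and the representation of tiles as hashable keys (for example coordinate tuples) are assumptions for illustration; returning 1.0 for two empty sets is a convention chosen here, not something the text specifies.

```python
def jaccard(a, b):
    """Jaccard index of two collections of tiles (equation 10)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(union)

def top_k_tiles(scores, k):
    """Tiles with the k highest scores.

    scores : dict mapping a tile key, e.g. (row, col), to its rollout score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def model_vs_subject(scores, subject_tiles):
    """Jaccard index with the model's tile count matched to the subject's."""
    return jaccard(top_k_tiles(scores, len(set(subject_tiles))), subject_tiles)
```

For example, a subject who chose three tiles would be compared against the three highest-scoring tiles of the model.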

Figure 11: Human experiment. A screenshot of the experiment.

Performance comparison

For each agent, the prediction error was quantified by first assigning the nearest selected tile to each gaze data point and then averaging the distance between the assigned tiles and the gaze positions. To compare the prediction accuracy between the models and the participants, we subtracted the participants’ prediction error from the models’. A negative difference indicates better model performance. Note that this analysis is sensitive to the number of selected tiles, so we matched the number of tiles for each comparison.
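A minimal sketch of this error measure is given below. It assumes Euclidean distance in tile coordinates, which the text does not state, and the function names are hypothetical.

```python
import numpy as np

def prediction_error(gaze_xy, selected_xy):
    """Mean distance from each gaze point to its nearest selected tile.

    gaze_xy     : (n_gaze, 2) gaze positions in tile coordinates.
    selected_xy : (n_selected, 2) centers of the selected tiles.
    Distances are Euclidean (an assumption).
    """
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    selected_xy = np.asarray(selected_xy, dtype=float)
    # Pairwise distances, shape (n_gaze, n_selected).
    diff = gaze_xy[:, None, :] - selected_xy[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Assign each gaze point to its nearest selected tile, then average.
    return dist.min(axis=1).mean()

def error_difference(model_error, participant_error):
    """Negative values indicate that the model predicts gaze better."""
    return model_error - participant_error
```

Because adding tiles can only reduce the nearest-tile distance, the measure favors larger selections, which is why the number of tiles has to be matched.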

We also observed that human participants tended not to choose the Pac-Man tile. This was most likely because the instruction given to the human subjects encouraged them not to include Pac-Man in their answers. Both the monkeys and the transformer model assigned a large amount of gazes or attention to Pac-Man. Therefore, we included the Pac-Man tile in the human responses when evaluating human-model differences.

Condensed model

We constructed a model that combines the bottom-up and top-down attention components based on the computation abstracted from the transformer network. While the original rollout score is a patch-level measurement, the condensed model calculates attention at object-level. To compare the condensed model with the original rollout score, we accounted for the difference in terms of resolution by assigning all the rollout scores to the objects within the corresponding patch and setting all other tiles to zero. We then computed the Spearman correlation between the object-level rollout score and the attention predicted by the condensed model.

To validate the model prediction in the monkeys, we showed the correlation between the predicted attention and the distance to the gaze positions of each tile. Since the reconstruction results in a lot of zero-attention tiles, we only show the top-n tiles that are non-zero in at least 10% of the samples.

Appendix D Results for Monkey D

Shown in Figure LABEL:fig:detron are the results when the transformer network was trained with monkey D’s behavior data. Overall, the results from the two monkeys are very similar. Interestingly, monkey D’s layer 1 head 2 attention shows a stronger correlation with the object value and generally larger weights (Figure LABEL:fig:detron H and Figure 5B), suggesting that monkey D might have a stronger value-based bottom-up attention component compared to monkey O.