Self-Attention in Transformer Networks Explains Monkeys’ Gaze Pattern in Pac-Man Game

Zhongqiao Lin
Institute of Neuroscience, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China

Yunwei Li
Institute of Neuroscience, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China
University of Chinese Academy of Sciences

Tianming Yang^†
Institute of Neuroscience, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China

{zqlin, liyw2021, tyang}@ion.ac.cn

Abstract

We proactively direct our eyes and attention to collect information during problem solving and decision making. Understanding gaze patterns is crucial for gaining insights into the computation underlying the problem-solving process. However, there is a lack of interpretable models that can account for how the brain directs the eyes to collect information and utilize it, especially in the context of complex problem solving. In the current study, we analyzed the gaze patterns of two monkeys playing the Pac-Man game. We trained a transformer network to mimic the monkeys’ gameplay and found its attention pattern captures the monkeys’ eye movements. In addition, the prediction based on the transformer network’s attention outperforms the human subjects’ predictions. Importantly, we dissected the computation underlying the attention mechanism of the transformer network, revealing its layered structures reflecting a value-based attention component and a component that captures the interactions between Pac-Man and other game objects. Based on these findings, we built a condensed attention model that is not only as accurate as the transformer network but also fully interpretable. Our results highlight the potential of using transformer neural networks to model and understand the cognitive processes underlying complex problem solving in the brain, opening new avenues for investigating the neural basis of cognition.

^† Corresponding author.

Figure 1: Pac-Man game and transformer model. A. The adapted monkey version of Pac-Man. Compared to the original version, it has only two ghosts. Monkeys receive juice rewards for eating pellets and other objects in the game; the size of each reward is indicated in the top-right corner of the object. Getting caught by a ghost incurs a time-out as a penalty. B. Transformer network model. The inputs are two frames before Pac-Man reaches the junction, and the network is trained to predict the monkeys’ directional choice at the junction. C. Determining the model size. We varied the number of layers, the number of heads, and the dimensionality of the embedding tokens and tested the models’ performance. The chosen model is indicated by the red frame (2 layers, 2 heads, 48 dimensions).

1 Introduction

Our eyes are not merely a camera. They are an integral component of the brain’s decision-making and problem-solving, during which the brain proactively directs the eyes to collect information. The study of gaze patterns in humans and animals has a long history and a substantial body of literature in neuroscience and cognitive science tanenhaus1995integration ; gold2000representation ; loetscher2010eye ; lakshminarasimhan2020tracking ; andersen1989visual . Understanding gaze patterns is not only crucial for understanding the cognitive processes in the brain corbetta2002control ; adams2015active , but also advances the field of deep learning in solving complex problems karessli2017gaze ; sood2023multimodal .

One important approach to understanding gaze patterns is to build a generative model that can predict eye movements in a given behavioral context. However, this has been an extremely challenging task. Traditional modeling approaches formalize eye movements as the joint result of bottom-up and top-down processes gluth2020value . The salience of visual objects treisman1980feature ; buchner2009geometry and the need to collect information and reduce uncertainty in the task rutishauser2007probabilistic ; adams2012smooth ; callaway2021fixation are two major driving forces behind eye movements. These models typically rely on heavily hand-designed heuristics to characterize the visual features. They are often limited to simple cognitive tasks, and their prediction accuracy is less than ideal.

An alternative approach is to fit an artificial neural network (ANN) directly to the eye movement data assens2017saltinet ; huang2018predicting ; sun2019visual ; wang2024transgop . While these models often predict gaze patterns more accurately even in complex situations, they are typically treated as black boxes and are usually trained on eye data collected from behavioral contexts that are not goal-directed. Several recent studies incorporated heuristic inductive biases of eye movements back into the deep learning models (e.g., hahn2018modeling ; li2023modeling ). While this approach offers better interpretability on top of the ANN models, its performance may depend on how accurately the introduced bias captures the underlying mechanism.

While these approaches directly model the eye-tracking data, recent studies have begun to investigate the parallels between the self-attention mechanism in the deep learning literature and human attention. Correlations with human gaze patterns have been demonstrated in convolutional neural networks (CNN) lai2020understanding ; sood2020interpreting , recurrent neural networks (RNN) sood2020interpreting ; sen2020human , and transformers eberle2022transformer ; brandl2022every ; bensemann2022eye ; lampinen2024multimodality , none of which were explicitly trained with human eye-tracking data. Furthermore, some studies report that integrating a gaze predictor module into language models and decision models enhances performance on the primary tasks sood2020improving ; koorathota2023fixating ; thammineni2023selective . However, the underlying computation in these models remains a black box, which limits their use in understanding how the brain directs its attention.

In this work, we investigated the gaze patterns of monkeys that were trained to play the Pac-Man game. We trained transformer networks to predict the animals’ choices at maze junctions. We hypothesized that the networks’ attention mechanism should reflect how monkeys direct their attention and collect game information to play the game. We aimed to dissect the computation of the network to understand the cognitive process underlying the monkeys’ eye movements and gameplay.

The results show that despite not being trained explicitly to replicate the monkeys’ gaze pattern, the attention pattern of the transformer network closely resembles the animals’ gaze pattern. Furthermore, we dissected the computation underlying the attention mechanism in the transformer network into a condensed model. The condensed model is as accurate as the transformer network in predicting the monkeys’ gaze pattern and fully interpretable. The model can be potentially validated in future experiments. Our study demonstrates the potential of using transformer models to understand the cognitive processes in the brain.

2 Methods

2.1 Behavior experiments

The monkey behavior experiments and data collection were reported in a previous paper yang2022monkey . Briefly, two rhesus monkeys were trained to play an adapted Pac-Man game. They gained juice rewards by clearing all the dots in a maze and avoiding being caught by ghosts. Taking an energizer turned the ghosts into a scared mode, in which the ghosts could be beaten for reward. During the entire experiment, the eye traces of the monkeys were recorded and processed with an infrared eye-tracking system (EyeLink® 1000 Plus). The eye-tracking data from 20 sessions of monkey O and 15 sessions of monkey D were used for the analyses in the current study. The main text shows the results for monkey O; the analyses of monkey D produce similar results and are included in the Appendix.

2.2 Training data

We trained transformer networks to predict the monkeys’ directional choice when Pac-Man is at a junction tile. The input to the networks includes two game frames in which Pac-Man is one and two tiles away from entering the junction. Each frame is a $32\times 28\times 17$ tensor, where 32 and 28 are the height and width of the maze in units of tiles, and 17 is the number of feature channels, each containing a binary indicator for the respective object. The two frames are concatenated into a tensor ${\bf S}\in\mathbb{R}^{32\times 28\times 34}$ . Details about the channels can be found in the Appendix.

2.3 Transformer Network

The networks that we studied are based on the standard vision transformer (ViT) dosovitskiy2020image ; beyer2022better . The input ${\bf S}$ was reshaped into a sequence of flattened patches $s_{t}\in\mathbb{R}^{2^{2}\cdot 34}$ , where $2\times 2$ is the size of each patch. The flattened patches were mapped to $d$ dimensions with a trainable embedding projection. We used fixed 2D sin-cos position embeddings and a learnable classification (CLS) token.
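The reshaping step can be illustrated with a short NumPy sketch. It is not the authors' code: the function name `patchify`, the ordering of values inside a flattened patch, and the random stand-in for the trainable projection are assumptions made for illustration. It shows how a $32\times 28\times 34$ state yields 224 patch tokens of 136 values each, which become 225 tokens once the CLS token is prepended.

```python
import numpy as np

def patchify(state, patch=2):
    """Split an (H, W, C) state into flattened, non-overlapping patch vectors."""
    h, w, c = state.shape
    assert h % patch == 0 and w % patch == 0
    x = state.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (patch rows, patch cols, patch, patch, C)
    return x.reshape((h // patch) * (w // patch), patch * patch * c)

rng = np.random.default_rng(0)
# Two random binary frames stand in for real game frames (illustrative data only).
frame_a = rng.integers(0, 2, size=(32, 28, 17)).astype(np.float32)
frame_b = rng.integers(0, 2, size=(32, 28, 17)).astype(np.float32)
state = np.concatenate([frame_a, frame_b], axis=-1)  # (32, 28, 34)

patches = patchify(state)                            # (224, 136)
d = 48                                               # token dimension of the chosen model
w_emb = rng.normal(scale=0.02, size=(136, d)).astype(np.float32)  # stand-in for the learned projection
tokens = patches @ w_emb                             # (224, 48), before adding CLS and positions
```

Row-major ordering over the patch grid is one reasonable choice; any fixed ordering works as long as the position embeddings use the same one.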

We determined the size of the transformer network by varying the number of transformer layers, the number of embedding dimensions, and the number of attention heads. We trained each model on 140 sessions and tested it on 5 held-out sessions. Based on Figure 1C, we chose the model with 2 layers, 2 attention heads, and 48 token dimensions for the rest of the study.

3 Results

3.1 Attention rollout

The trained network reached a performance of 87.664% in predicting the monkeys’ choice when Pac-Man is at a junction. To visualize the network’s attention, we adopted attention rollout abnar2020quantifying .

We found that the attention in the network reflects the monkeys’ gaze patterns, even in situations involving complex planning. Figure 2 shows two examples. We selected and plotted the monkeys’ gaze locations during a 3-tile timespan before Pac-Man entered a junction, along with the network’s attention on the same maze. During this period, the monkeys were looking around the maze to decide which direction to turn at the upcoming junction. Figure 2A shows a situation in which Pac-Man was facing two branches at the junction. Instead of simply weighing the available rewards in the two branches, the monkey was looking at a scared ghost far away from Pac-Man, suggesting that it was planning a hunt. The attention of the model captures this long-range planning. Figure 2B is another example. It involves a strategy that the monkey learned to use, namely suicide yang2022monkey . Sometimes, when Pac-Man is far away from the respawn spot but the remaining pellets are near it, the monkey would choose to run into a ghost to die. After the respawn, it could collect the remaining pellets more easily. In this example, the monkey was looking at both the ghost and the pellets around the respawn spot, suggesting that it was planning the suicide. Again, the network’s attention score captures the same pattern. More examples are included in the Appendix (Figure 10).

To quantify the similarity between the network’s attention and the monkeys’ gaze, we divided the maze into patches, sorted them by their attention rollout score, and plotted their average distance to the eye positions across samples. Patches with the highest attention rollout scores were closest to the monkeys’ gaze location (Figure 3A, Spearman correlation, corr=0.9995, $p$ below machine precision). The distribution of the per-sample Spearman correlations is significantly shifted above zero (Figure 3B). We further calculated the 2D cross-correlation between the attention rollout score map and the gaze heatmap. The correlation was significantly higher than that obtained when the attention scores were shuffled across all patches or only among patches that contain objects (Figure 3C).
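A per-sample version of this comparison can be sketched as follows. This is an illustrative reading, not the authors' analysis code: the helper names, the use of patch centres in tile coordinates, and the simple tie handling in the rank function are assumptions. The sketch correlates each patch's attention score with its negative distance to a gaze location, so a value near 1 means that highly attended patches lie close to the gaze.

```python
import numpy as np

def rank(values):
    """Ordinal ranks (0 = smallest); ties are broken by position, a simplification."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

def patch_centres(grid_h=16, grid_w=14, patch=2):
    """Centres of the 2x2 patches in tile coordinates, in row-major patch order."""
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=1) * patch + (patch - 1) / 2

def attention_gaze_correlation(attention, gaze_tile):
    """Correlate patch attention with closeness (negative distance) to the gaze."""
    dist = np.linalg.norm(patch_centres() - np.asarray(gaze_tile), axis=1)
    return spearman(attention, -dist)

# Toy check: attention that decays with distance from the gaze point.
centres = patch_centres()
gaze = (10.5, 6.5)
toy_attention = 1.0 / (1.0 + np.linalg.norm(centres - np.asarray(gaze), axis=1))
corr = attention_gaze_correlation(toy_attention, gaze)
```

With real data, `attention` would be the 224 rollout scores for one sample and `gaze_tile` a recorded eye position converted to tile coordinates.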

Predicting the monkeys’ gaze patterns is not trivial. To demonstrate this, we conducted an experiment in which we asked ten human subjects to judge where the monkeys should look during the game. We showed each participant 100 game states drawn from the sample pool used to train the transformer network and asked them to mark the tiles that they thought were most relevant to the impending decision. Using Jaccard similarity, we observed that the human subjects agreed with each other on these tiles, but their guesses differed from the network’s choices (Figure 4B). Using the human subjects’ prediction error as a baseline, we found that the model’s prediction error was lower, suggesting its advantage over humans. In comparison, a random model had a significantly larger error.
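For reference, the Jaccard similarity between two sets of marked tiles is the size of their intersection divided by the size of their union. The sketch below uses made-up tile coordinates purely for illustration; how the network's "choices" were turned into a tile set is not specified here, so `model_top` is a hypothetical stand-in.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of tiles."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical (row, col) tiles marked for one game state.
subject_1 = {(10, 6), (10, 7), (12, 6)}
subject_2 = {(10, 6), (10, 7), (3, 20)}
model_top = {(10, 6), (25, 1), (3, 20)}

between_subjects = jaccard(subject_1, subject_2)  # 2 shared tiles out of 4 distinct
subject_vs_model = jaccard(subject_1, model_top)  # 1 shared tile out of 5 distinct
```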

3.2 Interpreting attention

Next, we aimed to reveal the computation behind the attention rollout score in our model. We first expanded the equation for the rollout score by layer. Note that we only consider the attention that affects the CLS token, i.e., the first row of the attention matrix, and that we omit the head notation for simplicity:

$$
\tilde{a}_{CLS}=0.25\left({\bf A}^{(1)}_{1,2:}+{\bf A}^{(2)}_{1,2:}+{\bf A}^{(2)}_{1,:}{\bf A}^{(1)}_{:,2:}\right)\text{,} \tag{1}
$$

where ${\bf A}^{(1)}$ and ${\bf A}^{(2)}\in\mathbb{R}^{225\times 225}$ are the attention score matrices of the first and the second layer.
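The expansion in Eq. 1 can be checked numerically. The sketch below assumes the usual attention rollout formulation, in which each layer's head-averaged attention is mixed equally with the identity matrix (to account for the residual connection) and the resulting matrices are multiplied across layers; random row-stochastic matrices stand in for the trained network's attention.

```python
import numpy as np

def random_attention(n, rng):
    """Random row-stochastic matrix standing in for a layer's attention."""
    logits = rng.normal(size=(n, n))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def attention_rollout(attentions):
    """Multiply (0.5 * A + 0.5 * I) across layers, from the first layer upwards."""
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for a in attentions:
        rollout = (0.5 * a + 0.5 * np.eye(n)) @ rollout
    return rollout

rng = np.random.default_rng(0)
n = 225  # 224 patch tokens plus the CLS token
a1, a2 = random_attention(n, rng), random_attention(n, rng)

# CLS row of the rollout, restricted to the patch tokens.
cls_rollout = attention_rollout([a1, a2])[0, 1:]

# Right-hand side of Eq. 1, written with 0-based indexing.
cross_term = a2[0, :] @ a1[:, 1:]
expanded = 0.25 * (a1[0, 1:] + a2[0, 1:] + cross_term)
```

The identity term of the product vanishes in the CLS row once the CLS column is dropped, which is why only three terms remain.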

We noticed that the multiplication term can be ignored. The norm of the multiplication term between ${\bf A}^{(1)}$ and ${\bf A}^{(2)}$ is much smaller than those of the other two attention vectors (Appendix Figure 9A), and the term contributed only a small proportion of the rollout score in all samples (Appendix Figure 9B). The approximation obtained by removing the multiplication term matches the rollout score almost perfectly (Appendix Figure 9C). Therefore, we only examine the remaining two attention vectors.

3.2.1 First layer

The attention score of the first layer is:

$$
{\bf A}^{(1)}_{1,2:}=\mathrm{softmax}(\alpha^{1}_{CLS,2:})\text{,} \tag{2}
$$

$$
\alpha^{1}_{CLS,2:}=x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf X}_{2:,:}^{T}=x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{emb}^{T}{\bf S}_{r}^{T}+x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf E}_{pos}^{T}\text{,} \tag{3}
$$

where $x_{CLS}$ is the learnable CLS token; ${\bf W}^{1}_{Q}$ , ${\bf W}^{1}_{K}\in\mathbb{R}^{d\times{d/h\cdot h}}$ are the query and key weight matrices of the first layer; and ${\bf W}_{emb}\in\mathbb{R}^{136\times{d}}$ is the linear embedding matrix. ${\bf S}_{r}\in\mathbb{R}^{224\times{136}}$ is the reshaped input game state, and ${\bf E}_{pos}$ is the positional embedding matrix. Because the positional encoding and the maze channels of the input encode maze information and are otherwise invariant during the game, $\alpha^{1}_{CLS,2:}$ can be rewritten as the sum of a maze component and an object-dependent component:

$$
\begin{aligned}
\alpha^{1}_{CLS,2:} &= x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{maze}^{T}{\bf S}_{r,maze}^{T}+x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf E}_{pos}^{T}+x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{obj}^{T}{\bf S}_{r,obj}^{T} \\
&= {\bf\alpha}_{maze}+x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{obj}^{T}{\bf S}_{r,obj}^{T}\text{,}
\end{aligned} \tag{4}
$$

where ${\bf W}_{maze}\in\mathbb{R}^{8\times{d}}$ , ${\bf W}_{obj}\in\mathbb{R}^{128\times{d}}$ are sub-matrices that consist of the rows corresponding to the maze channels and the other object channels of the embedding matrix ${\bf W}_{emb}$ , respectively. ${\bf S}_{r,maze}$ and ${\bf S}_{r,obj}$ are sub-matrices of the corresponding dimensions of ${\bf S}_{r}$ . ${\bf\alpha}_{maze}$ is the maze component of the CLS token attention and reflects the attention bias within the maze.
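Because the pre-softmax score is linear in the input, the split into a maze component and an object component is exact. The following sketch verifies this with random stand-ins for the learned matrices. The index layout is an assumption: it supposes that the maze channel is the first of the 17 channels in each frame and that each flattened patch stores its four tiles one after another with 34 values per tile, which places the maze entries at every 17th position.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches, n_in = 48, 224, 136

# Assumed layout: maze entries at positions 0, 17, 34, ... of each flattened patch.
maze_idx = np.arange(0, n_in, 17)                  # 8 maze entries per patch
obj_idx = np.setdiff1d(np.arange(n_in), maze_idx)  # 128 object entries per patch

# Random stand-ins for the learned parameters and a binary game state.
x_cls = rng.normal(size=(1, d))
w_q = rng.normal(size=(d, d))
w_k = rng.normal(size=(d, d))
w_emb = rng.normal(size=(n_in, d))
e_pos = rng.normal(size=(n_patches, d))
s_r = rng.integers(0, 2, size=(n_patches, n_in)).astype(float)

query = x_cls @ w_q @ w_k.T                        # shared prefix x_CLS W_Q W_K^T

# Pre-softmax score of Eq. 3, computed on the full patch embeddings.
alpha_full = query @ (s_r @ w_emb + e_pos).T

# Components of Eq. 4.
alpha_maze = query @ w_emb[maze_idx].T @ s_r[:, maze_idx].T + query @ e_pos.T
alpha_obj = query @ w_emb[obj_idx].T @ s_r[:, obj_idx].T
```

Since the maze channels and position embeddings do not change during a game, `alpha_maze` acts as a fixed spatial bias, while `alpha_obj` varies with the objects present.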

We examined separately the two heads’ attention weights on the game objects based on the second term, $x_{CLS}{\bf W}^{1}_{Q}{\bf W}_{K}^{1T}{\bf W}_{obj}^{T}$ (Figure 5A), which is a $1\times{128}$ vector. In head 1, the attention weight of Pac-Man is predominantly high, suggesting that the first head’s attention is mostly on Pac-Man. Head 2’s attention weights are high for objects that yield rewards. They include pellets, energizers, fruits, and scared ghosts, but not ghosts in the normal or the dead mode. The mean attention scores correlate with the reward value of these objects (Figure 5B). Therefore, head 2’s attention weights create a reward salience map for the objects in the game. Interestingly, a similar value-based salience map has been demonstrated in the orbitofrontal cortex zhang2022reward .

The maze component ${\bf\alpha}_{maze}$ also reveals a functional division between the two heads. Head 1’s maze component captures the junctions, passes, and the general structure of the maze. Combined with its focus on Pac-Man, head 1 may be important in computing where Pac-Man can go based on the maze structure. In contrast, the maze component of head 2 does not reflect any fine structures. It might represent a coarse-grained positional bias of the monkey.

Overall, the first-layer attention is driven by where Pac-Man is and what the non-Pac-Man objects’ reward values are. This is similar to the type of attention termed bottom-up in cognitive sciences to describe the attention driven by sensory inputs.

3.2.2 Second layer

Due to the transformer’s architecture, the attention of CLS in layer 1 does not capture the interactions between the game objects, which are essential for the network to play the game. These interactions should be captured at layer 2, where each token contains information from the other tokens through the attention mechanism of layer 1. To understand how the attention in layer 2 captures the interactions between the game objects, we introduce a manipulation procedure called targeted attention removal (Figure 6), which removes the effect of the interaction between specific objects in the game from layer 2 attention by manipulating the attention block of the first layer.

To achieve this, we first created a game state by manipulating the targeted objects and then fed it to the embedding layer. In a separate network, we also generated embeddings for the normal inputs. The manipulation can be seen as computing a variant of cross-attention vaswani2017attention between the embedding of a patch from the normal game state and the embedding of the tampered game state, in which the targeted objects were removed from contributing to the attention. Thus, the manipulation restricts the attention of the objects in the game to the objects that are left intact by the manipulation and to themselves. We then replaced the output of the first-layer attention block with this manipulated output and continued the forward pass.

To be concrete, we computed the manipulated output of attention $\overline{Attn}^{i}$ at layer $i$ as follows:

$$
\overline{Attn}^{i}(x_{m},\overline{\bf X})=a^{i}_{m,m}x_{m}{\bf W}^{i}_{V}+\sum_{n\neq m}\overline{a}^{i}_{m,n}\overline{x}_{n}{\bf W}^{i}_{V}\text{,} \tag{5}
$$

$$
\overline{a}^{i}_{m,n}=\mathrm{softmax}({x}_{m}{\bf W}_{Q}^{i}{\bf W}_{K}^{iT}\overline{x}_{n}^{T})\text{,} \tag{6}
$$

where ${\bf W}^{i}_{V}$ is the value weight matrix of the $i$-th transformer attention block, $a^{i}_{m,n}$ and $\overline{a}^{i}_{m,n}$ are the original and manipulated attention scores between the $m$-th patch and the $n$-th patch in the $i$-th layer, $\overline{\bf X}$ is the embedding of the manipulated input, and $x_{m}$ and $\overline{x}_{m}$ are the embeddings of the $m$-th patch of the original input and the manipulated input, respectively.
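A literal single-head reading of Eqs. 5 and 6 can be sketched as below. Several details are assumptions rather than facts from the text: the softmax is taken over all key positions of a row, no scaling factor is applied (none appears in the equations), multi-head splitting and the output projection are left out, and the function name is hypothetical. Queries always come from the original embeddings; keys and values come from the manipulated embeddings, except for each token's attention to itself.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def manipulated_attention(x, x_bar, w_q, w_k, w_v):
    """Literal single-head reading of Eqs. 5 and 6 (illustrative assumptions).

    x:     (n, d) embeddings of the original game state
    x_bar: (n, d) embeddings of the manipulated game state
    """
    a = softmax(x @ w_q @ w_k.T @ x.T)          # original scores a_{m,n}
    a_bar = softmax(x @ w_q @ w_k.T @ x_bar.T)  # manipulated scores, Eq. 6
    is_self = np.eye(x.shape[0], dtype=bool)
    self_term = np.diag(a)[:, None] * (x @ w_v)                # a_{m,m} x_m W_V
    cross_term = np.where(is_self, 0.0, a_bar) @ (x_bar @ w_v)  # sum over n != m
    return self_term + cross_term

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
x_bar = x.copy()
x_bar[2] = 0.0  # toy manipulation: blank out one token
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out_manipulated = manipulated_attention(x, x_bar, w_q, w_k, w_v)
out_unchanged = manipulated_attention(x, x, w_q, w_k, w_v)
standard = softmax(x @ w_q @ w_k.T @ x.T) @ (x @ w_v)
```

When the manipulated state equals the original one, the expression reduces to ordinary self-attention, which serves as a sanity check on this reading.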

For a given state ${\bf S}$ drawn from the dataset, we defined four cases of targeted attention removal. (1) ${\bf S}_{0}$ : all objects including Pac-Man are removed from ${\bf S}$ , leaving only the maze channels intact; (2) ${\bf S}_{P}$ : similar to ${\bf S}_{0}$ , but Pac-Man is kept intact; (3) ${\bf S}_{PR}$ : same as ${\bf S}_{P}$ , except that Pac-Man is assigned to a random junction; (4) ${\bf S}_{PJ}$ : same as ${\bf S}_{P}$ , except that Pac-Man is approaching the same junction from a different direction.

We then computed the correlation between the attention scores of the normal network and those of the network under each of the four manipulations. A low correlation suggests that the manipulation removes attention that is critical for the network to function, while a high correlation suggests that the removed attention is not necessary.

We found that only the attention between non-Pac-Man objects and Pac-Man is critical (Figure 6B). Manipulations that destroy Pac-Man-related attention lead to low correlations with the normal control ( ${\bf S}_{0}$ and ${\bf S}_{PR}$ ). Removing attention between non-Pac-Man objects ( ${\bf S}_{P}$ ) or moving Pac-Man around a junction ( ${\bf S}_{PJ}$ ) does not lower the correlation significantly. Therefore, the computation in the layer 2 attention is centered around the attention from the non-Pac-Man objects to Pac-Man, and the attention between the non-Pac-Man objects is irrelevant. Because this attention depends on how an object is related to Pac-Man, reflecting the game rules, it can be considered top-down.

The discovery justifies a significant simplification to the computation. For each non-Pac-Man object, we can compute its attention map by averaging the attention score across samples with respect to Pac-Man’s location (Figure 8A). Given Pac-Man’s location, this attention map reveals how important an object is for the network’s decision if it appears at a certain location.
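One possible way to build such maps is sketched below, under stated assumptions: locations and junctions are indexed by integers, each sample provides Pac-Man's junction, the objects present at every location, and the layer 2 attention per location, and combinations that never occur are left at zero. The function name and data layout are hypothetical, and the averaging in the original analysis may differ in detail.

```python
import numpy as np

def build_interaction_maps(samples, n_obj, n_loc):
    """Average attention per (object type, Pac-Man location, object location).

    samples: iterable of (pacman_loc, objects_at, attention), where
      objects_at[t] lists the object types at location t and
      attention[t] is the layer 2 attention score at location t.
    """
    sums = np.zeros((n_obj, n_loc, n_loc))
    counts = np.zeros((n_obj, n_loc, n_loc))
    for pacman_loc, objects_at, attention in samples:
        for loc, objs in enumerate(objects_at):
            for j in objs:
                sums[j, pacman_loc, loc] += attention[loc]
                counts[j, pacman_loc, loc] += 1
    # Mean where data exist; unseen combinations stay at zero.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

# Toy data: object type 0 sits at location 1 in two samples with Pac-Man at location 0.
toy_samples = [
    (0, [[], [0], []], np.array([0.1, 0.2, 0.7])),
    (0, [[], [0], [1]], np.array([0.1, 0.4, 0.5])),
]
maps = build_interaction_maps(toy_samples, n_obj=2, n_loc=3)
```

In this toy case, the entry for object 0 at location 1 with Pac-Man at location 0 is the mean of 0.2 and 0.4.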

3.3 A condensed model of attention

With the understanding of what the two layers in our transformer do, we can now describe the network’s attention with two components: a bottom-up component contributed by layer 1 attention and an object-interaction component contributed by layer 2. The former considers Pac-Man’s location and the value of the non-Pac-Man objects, and the latter considers only the interactions between non-Pac-Man objects and Pac-Man. Based on these findings, we formed a condensed model that describes the attention distribution at tile $T$ in any game state $S$ , which is a function only of the set of objects on tile $T$ , the junction $tile_{pacman}$ where Pac-Man is, and the tile $T$ itself:

$$
P(attended\;tile=T|S)=P(attended\;tile=T|obj_{T},tile_{pacman},T)\propto\sum_{j\in obj_{T}}{V_{j}}+\sum_{j\in obj_{T}}M_{j,tile_{pacman},T}\;, \tag{7}
$$

where $V$ is the value-based attention component of game objects, including both Pac-Man and other game objects, extracted from the two heads in the first layer (Figure 5B), and $M$ is an attention map that describes the game object-Pac-Man interaction term of object $j$ on tile $T$ when Pac-Man is at junction $tile_{pacman}$ . Here, we also omitted the maze components of layer 1, as their contribution is negligible in most cases.
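Eq. 7 can be evaluated with a few lines of code. In the sketch below, the array layout, the function name, and the normalization by the sum of scores (Eq. 7 only states proportionality) are assumptions, and the numbers are invented for illustration.

```python
import numpy as np

def condensed_attention(objects_on_tile, pacman_tile, value, interaction):
    """Evaluate Eq. 7 for every tile and normalize to a distribution.

    objects_on_tile: list with one entry per tile, each a list of object indices
    value:           (n_obj,) value-based component V_j
    interaction:     (n_obj, n_tiles, n_tiles) map M[j, pacman_tile, T]
    """
    scores = np.zeros(len(objects_on_tile))
    for tile, objs in enumerate(objects_on_tile):
        for j in objs:
            scores[tile] += value[j] + interaction[j, pacman_tile, tile]
    total = scores.sum()
    # Normalizing by the sum assumes non-negative scores.
    return scores / total if total > 0 else scores

# Toy example: 4 tiles, 2 object types, Pac-Man at junction tile 0.
value = np.array([1.0, 3.0])
interaction = np.zeros((2, 4, 4))
interaction[1, 0, 2] = 1.0  # object 1 on tile 2 matters more when Pac-Man is at tile 0
objects_on_tile = [[], [0], [1], [0]]

p_attend = condensed_attention(objects_on_tile, 0, value, interaction)
```

Tiles without objects receive zero attention in this form, which reflects the omission of the maze components noted above.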

The attention reconstructed based on this model correlates with the attention rollout score almost perfectly (Figure 8B). The correlation between the gaze pattern and the network’s attention (Figure 3A) is well preserved in the model (Figure 8C).

Conclusion

In summary, we demonstrate that a two-layer transformer network trained to predict monkeys’ choices in the Pac-Man game accurately captures the monkeys’ eye movements during gameplay. We provide a mechanistic explanation of how attention is computed in the network, revealing two distinct components. The first component, computed in the first layer, is driven by the reward saliency of the objects and can be considered as bottom-up. Similar saliency signals have been demonstrated in the prefrontal cortex of the brain zhang2022reward . The second component, computed in the second layer, focuses on the interaction between non-Pac-Man objects and Pac-Man itself. This interaction depends on the current game state and may be considered top-down. We further created an accurate and condensed attention model based on the computation mechanism of the transformer. Overall, our study provides an interpretable model for the monkeys’ gaze patterns during the Pac-Man game, highlighting the usefulness of transformers as a tool in the investigation of neural and cognitive sciences.

Acknowledgement

This work was supported by National Science and Technology Innovation 2030 Major Program (Grant No. 2021ZD0203701) to T.Y.

References

[1] Michael K Tanenhaus, Michael J Spivey-Knowlton, Kathleen M Eberhard, and Julie C Sedivy. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217):1632–1634, 1995.
[2] Joshua I Gold and Michael N Shadlen. Representation of a perceptual decision in developing oculomotor commands. Nature, 404(6776):390–394, 2000.
[3] Tobias Loetscher, Christopher J Bockisch, Michael ER Nicholls, and Peter Brugger. Eye position predicts what number you have in mind. Current Biology, 20(6):R264–R265, 2010.
[4] Kaushik J Lakshminarasimhan, Eric Avila, Erin Neyhart, Gregory C DeAngelis, Xaq Pitkow, and Dora E Angelaki. Tracking the mind’s eye: Primate gaze behavior during virtual visuomotor navigation reflects belief dynamics. Neuron, 106(4):662–674, 2020.
[5] Richard A Andersen. Visual and eye movement functions of the posterior parietal cortex. Annual review of neuroscience, 12(1):377–403, 1989.
[6] Maurizio Corbetta and Gordon L Shulman. Control of goal-directed and stimulus-driven attention in the brain. Nature reviews neuroscience, 3(3):201–215, 2002.
[7] Rick A Adams, Eduardo Aponte, Louise Marshall, and Karl J Friston. Active inference and oculomotor pursuit: The dynamic causal modelling of eye movements. Journal of neuroscience methods, 242:1–14, 2015.
[8] Nour Karessli, Zeynep Akata, Bernt Schiele, and Andreas Bulling. Gaze embeddings for zero-shot image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4525–4534, 2017.
[9] Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, and Andreas Bulling. Multimodal integration of human-like attention in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2648–2658, 2023.
[10] Sebastian Gluth, Nadja Kern, Maria Kortmann, and Cécile L Vitali. Value-based attention but not divisive normalization influences decisions with multiple alternatives. Nature human behaviour, 4(6):634–645, 2020.
[11] Anne M Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive psychology, 12(1):97–136, 1980.
[12] Simon Buchner, Christoph Holscher, Lars Konieczny, and Jan Wiener. How the geometry of space controls visual attention during spatial decision making. In Proceedings of the annual meeting of the cognitive science society, volume 31, 2009.
[13] Ueli Rutishauser and Christof Koch. Probabilistic modeling of eye movement data during conjunction search via feature-based attention. Journal of Vision, 7(6):5–5, 2007.
[14] Rick A Adams, Laurent U Perrinet, and Karl Friston. Smooth pursuit and visual occlusion: active inference and oculomotor control in schizophrenia. PloS one, 7(10):e47502, 2012.
[15] Frederick Callaway, Antonio Rangel, and Thomas L Griffiths. Fixation patterns in simple choice reflect optimal information sampling. PLoS computational biology, 17(3):e1008863, 2021.
[16] Marc Assens Reina, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. Saltinet: Scan-path prediction on 360 degree images using saliency volumes. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2331–2338, 2017.
[17] Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In Proceedings of the European conference on computer vision (ECCV), pages 754–769, 2018.
[18] Wanjie Sun, Zhenzhong Chen, and Feng Wu. Visual scanpath prediction using ior-roi recurrent mixture density network. IEEE transactions on pattern analysis and machine intelligence, 43(6):2101–2118, 2019.
[19] Binglu Wang, Chenxi Guo, Yang Jin, Haisheng Xia, and Nian Liu. Transgop: Transformer-based gaze object prediction. arXiv preprint arXiv:2402.13578, 2024.
[20] Michael Hahn and Frank Keller. Modeling task effects in human reading with neural network-based attention. arXiv preprint arXiv:1808.00054, 2018.
[21] Jason Li, Nicholas Watters, Hansem Sohn, and Mehrdad Jazayeri. Modeling human eye movements with neural networks in a maze-solving task. In Annual Conference on Neural Information Processing Systems, pages 98–112. PMLR, 2023.
[22] Qiuxia Lai, Salman Khan, Yongwei Nie, Hanqiu Sun, Jianbing Shen, and Ling Shao. Understanding more about human and machine attention in deep neural networks. IEEE Transactions on Multimedia, 23:2086–2099, 2020.
[23] Ekta Sood, Simon Tannert, Diego Frassinelli, Andreas Bulling, and Ngoc Thang Vu. Interpreting attention models with human visual attention in machine reading comprehension. arXiv preprint arXiv:2010.06396, 2020.
[24] Cansu Sen, Thomas Hartvigsen, Biao Yin, Xiangnan Kong, and Elke Rundensteiner. Human attention maps for text classification: Do humans and neural networks focus on the same words? In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4596–4608, 2020.
[25] Oliver Eberle, Stephanie Brandl, Jonas Pilot, and Anders Søgaard. Do transformer models show similar attention patterns to task-specific human gaze? In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4295–4309, 2022.
[26] Stephanie Brandl and Nora Hollenstein. Every word counts: A multilingual analysis of individual human alignment with model attention. arXiv preprint arXiv:2210.04963, 2022.
[27] Joshua Bensemann, Alex Peng, Diana Benavides-Prado, Yang Chen, Neset Tan, Paul Michael Corballis, Patricia Riddle, and Michael J Witbrock. Eye gaze and self-attention: How humans and transformers attend words in sentences. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 75–87, 2022.
[28] Andrew Lampinen, Samuel Nastase, Christopher Edwards, Quitterie D’Elascombe, Akilles Rechardt, Jeremy Skipper, Gabriella Vigliocco, et al. Multimodality and attention increase alignment in natural language prediction between humans and computational models. 2024.
[29] Ekta Sood, Simon Tannert, Philipp Müller, and Andreas Bulling. Improving natural language processing tasks with human gaze-guided neural attention. Advances in Neural Information Processing Systems, 33:6327–6341, 2020.
[30] Sharath Koorathota, Nikolas Papadopoulos, Jia Li Ma, Shruti Kumar, Xiaoxiao Sun, Arunesh Mittal, Patrick Adelman, and Paul Sajda. Fixating on attention: Integrating human eye tracking into vision transformers. arXiv preprint arXiv:2308.13969, 2023.
[31] Chaitanya Thammineni, Hemanth Manjunatha, and Ehsan T Esfahani. Selective eye-gaze augmentation to enhance imitation learning in atari games. Neural Computing and Applications, 35(32):23401–23410, 2023.
[32] Qianli Yang, Zhongqiao Lin, Wenyi Zhang, Jianshu Li, Xiyuan Chen, Jiaqi Zhang, and Tianming Yang. Monkey plays pac-man with compositional strategies and hierarchical decision-making. Elife, 11:e74500, 2022.
[33] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[34] Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Better plain vit baselines for imagenet-1k. arXiv preprint arXiv:2205.01580, 2022.
[35] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. arXiv preprint arXiv:2005.00928, 2020.
[36] Wenyi Zhang, Yang Xie, and Tianming Yang. Reward salience but not spatial attention dominates the value representation in the orbitofrontal cortex. Nature Communications, 13(1):6306, 2022.
[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Appendix A Training Details

Training

We used AdamW as the optimizer, with an initial learning rate of $10^{-4}$ and a weight decay of $10^{-7}$. The batch size was 768. All models were trained on an A100 GPU for 800 epochs. We used 140 sessions from monkey O and 125 sessions from monkey D for training; five held-out sessions from each monkey were used for testing. The ratio of samples that overlapped between the training and the testing sets was negligible.

Data specification

The input of the model is a $32\times{28}\times{34}$ tensor, where the last dimension corresponds to 17 feature channels in two frames. The description of each channel can be found in Table 1.

Table 1: Channel Descriptions of the Model Input

Channel Number	Channel Name	Description
1	maze	1 if the tile is a wall tile, 0 if the tile is a walkable tile.
2	bean	1 if a bean is on the tile, 0 otherwise.
3	energizer	1 if an energizer is on the tile. Eating an energizer leads to 4 drops of juice reward and turns the ghosts into the scared mode.
4	apple	1 if an apple is on the tile. Taking an apple leads to 12 drops of juice reward.
5	cherry	1 if a cherry is on the tile. Eating a cherry leads to 3 drops of juice reward.
6	melon	1 if a melon is on the tile. Eating a melon leads to 17 drops of juice reward.
7	orange	1 if an orange is on the tile. Eating an orange leads to 8 drops of juice reward.
8	strawberry	1 if a strawberry is on the tile. Eating a strawberry leads to 5 drops of juice reward.
9	ghost 1 (normal)	1 if ghost 1 in normal mode is on the tile. Colliding into ghosts in normal mode leads to a time-out penalty. After the time-out, Pac-Man respawns at the start tile and continues the previous game.
10	ghost 1 (beaten)	1 if ghost 1 in beaten mode is on the tile. A beaten ghost moves back to the ghost house. Colliding into a beaten ghost does not trigger any game events.
11	ghost 1 (scared)	1 if ghost 1 in scared mode is on the tile. Eating a scared ghost turns the ghost into the beaten mode and produces 8 drops of juice reward.
12	ghost 1 (flashing)	1 if ghost 1 in flashing mode is on the tile. A flashing ghost is a scared ghost that will turn back into the normal mode very soon.
13	ghost 2 (normal)	1 if ghost 2 in normal mode is on the tile.
14	ghost 2 (beaten)	1 if ghost 2 in beaten mode is on the tile.
15	ghost 2 (scared)	1 if ghost 2 in scared mode is on the tile.
16	ghost 2 (flashing)	1 if ghost 2 in flashing mode is on the tile.
17	Pac-Man	1 if Pac-Man is on the tile, 0 otherwise.

Appendix B More Details on the Attention Analyses

20 sessions from monkey O and 15 sessions from monkey D were used for attention analyses. The sessions were chosen for their high-quality eye-tracking data.

Correlation between the gaze pattern and the model attention

To compare the attention of the transformer model and the gaze pattern of the monkeys, we computed the correlation between the attention rollout score of the model and the gaze data in two different ways. First, we report the Spearman correlation between each patch's distance from the gaze positions and its rollout score. Gaze positions recorded during the three tiles preceding the junction were used for the analysis. The distance between the gaze positions and a patch was computed by averaging the distance between each gaze position and its closest tile within the patch. The Spearman correlation and the statistical inference were implemented with the scipy.stats module.

To account for the systematic drift of eye data during the recording, we also computed a second type of correlation based on the 2D cross-correlation. We first computed a heatmap of the gaze positions at the same resolution as the patches. We then computed the maximum 2D cross-correlation between the attention rollout score and the gaze heatmap with the offset varying from -4 to +4 tiles (Figure 3C). The controls were computed similarly, with the attention scores shuffled across all patches or across only patches with objects.
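The offset search can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the attention map and the gaze heatmap are equally sized 2D grids, scores each offset by the Pearson correlation over the overlapping cells (the exact normalization used in the paper is not stated), and uses the hypothetical names `pearson` and `max_shifted_correlation`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant map counts as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def max_shifted_correlation(attention, gaze, max_offset=4):
    """Return (best correlation, (row offset, column offset)).

    The gaze map is shifted by every offset in [-max_offset, max_offset]
    along both axes, and only cells where the two maps overlap are scored.
    """
    rows, cols = len(attention), len(attention[0])
    best = (-1.0, (0, 0))
    for dr in range(-max_offset, max_offset + 1):
        for dc in range(-max_offset, max_offset + 1):
            xs, ys = [], []
            for r in range(rows):
                for c in range(cols):
                    r2, c2 = r + dr, c + dc
                    if 0 <= r2 < rows and 0 <= c2 < cols:
                        xs.append(attention[r][c])
                        ys.append(gaze[r2][c2])
            if len(xs) >= 2:
                best = max(best, (pearson(xs, ys), (dr, dc)))
    return best
```

Taking the maximum over offsets makes the score tolerant to a constant spatial drift of the eye signal. Note that very small overlaps can produce spuriously high correlations, so the grids should be much larger than the offset range.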

Attention rollout

Following [35], the attention rollout score was computed as below:

\tilde{\bf A}^{i}=\begin{cases}(0.5{\bf A}^{i}+0.5{\bf I})\tilde{\bf A}^{i-1}&\text{if}\;i>0\\ (0.5{\bf A}^{i}+0.5{\bf I})&\text{if}\;i=0\end{cases}

(8)

where ${\bf A}^{i}$ is the attention matrix of layer $i$ summed across all attention heads and ${\bf I}$ is the identity matrix.
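The recursion in equation 8 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: each layer's attention is passed in as a square nested list already combined across heads, the CLS token is assumed to be the first token, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def attention_rollout(attentions):
    """Apply equation 8 to a list of per-layer attention matrices.

    `attentions` is ordered from the first layer to the last. Starting from
    the identity makes the first iteration reproduce the i = 0 case.
    """
    n = len(attentions[0])
    rollout = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for att in attentions:
        # Mix the attention with the identity to model the residual path.
        mixed = [[0.5 * att[r][c] + (0.5 if r == c else 0.0) for c in range(n)]
                 for r in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


def cls_rollout_scores(attentions):
    """First row of the rollout matrix without the CLS column."""
    return attention_rollout(attentions)[0][1:]
```

With row-normalized attention matrices, each row of the rollout matrix still sums to one, so the CLS row can be read as a distribution of attention over the input tokens.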

As we only considered the first row of the rollout matrix, i.e. the score depicting how the CLS token attends to the maze patches, we derived equation 1 from equation 8. To measure the relative contribution of the three terms in equation 1, we computed their vector norms. The top row of Figure 9A shows the distributions of the vector norms. The multiplicative term is smaller than the other two terms. The bottom row of Figure 9 further shows that the multiplicative term contributes only minimally in all samples tested.

We approximated the attention rollout score as:

\tilde{a}_{approxCLS}={\bf A}^{(1)}_{1,2:}+{\bf A}^{(2)}_{1,2:}

(9)

The approximate score correlates with the original rollout score almost perfectly, indicating that our simplification is justified.
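For reference, the expansion behind this approximation can be written out explicitly. The sketch below assumes a network with two attention layers, as the two terms of equation 9 suggest, and uses the layer indexing of equation 9; the correspondence of its three terms to those of equation 1 is likewise an assumption.

```latex
% Two-layer rollout from equation 8, expanded:
\tilde{\bf A}
  = (0.5{\bf A}^{(2)}+0.5{\bf I})(0.5{\bf A}^{(1)}+0.5{\bf I})
  = 0.25\left({\bf A}^{(2)}{\bf A}^{(1)}+{\bf A}^{(2)}+{\bf A}^{(1)}+{\bf I}\right)

% First (CLS) row, excluding the CLS column, where {\bf I}_{1,2:}={\bf 0}:
\tilde{\bf A}_{1,2:}
  = 0.25\left[\left({\bf A}^{(2)}{\bf A}^{(1)}\right)_{1,2:}
  +{\bf A}^{(2)}_{1,2:}+{\bf A}^{(1)}_{1,2:}\right]
```

Dropping the multiplicative term and the constant factor 0.25 gives equation 9. The constant factor does not affect rank-based measures such as the Spearman correlation.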

Layer 1 attention

Because the attention score of layer 1 is fully determined by the objects in a patch and a weight matrix that describes the influence of each object on the attention score, averaging the attention scores of each object across samples and locations provides a reliable estimate of the layer 1 attention. We did this in Figure 5B and later used it as the bottom-up attention component in the simplified model.

Appendix C Human Experiment

Experiment details

Ten healthy human subjects (two male, eight female; aged 24-32) participated in the experiment. During the experiment, they sat in front of a monitor and watched 100 game states that were randomly chosen from monkey O’s dataset. The participants were asked to choose 1-10 tiles that were most relevant to the impending decision at the next junction by clicking the mouse (Figure 11). There was no set time limit for the task, and the participants finished the experiment in 30-55 minutes.

Jaccard similarity

We used the Jaccard index to measure the similarity between the choices of two subjects or between the choices of a subject and a model. The Jaccard index is defined as below:

J(A,B)=\frac{|A\cap B|}{|A\cup B|},

(10)

where A and B are the sets of tiles chosen by the two subjects being compared. For the transformer network, we chose the tiles with the highest rollout scores, with the number of tiles matched to that of the human subject being compared against. We also included a control that chose tiles randomly, with the number of chosen tiles also matched.
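The comparison between a participant and the model can be sketched as below. This is a minimal illustration: tiles are assumed to be `(row, column)` tuples, the model scores are assumed to be available per tile in a dictionary, and `jaccard` and `top_k_tiles` are hypothetical helper names.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of tiles."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty selections count as dissimilar
    return len(a & b) / len(union)


def top_k_tiles(scores, k):
    """Select the k tiles with the highest scores from a {tile: score} dict."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Hypothetical example: match the number of model tiles to the human choice.
human_choice = {(1, 1), (2, 3), (5, 5)}
model_scores = {(1, 1): 0.9, (2, 3): 0.5, (4, 4): 0.7, (5, 5): 0.1}
model_choice = top_k_tiles(model_scores, len(human_choice))
similarity = jaccard(human_choice, model_choice)
```

Matching `k` to the size of the human selection keeps the index from being biased by one agent simply choosing more tiles than the other.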

Performance comparison

For each agent, the prediction error was quantified by first assigning the nearest selected tile to each gaze data point and then averaging the distance between the assigned tiles and the gaze positions. To compare the prediction accuracy between the models and the participants, we subtracted the participants’ prediction error from the models’. A negative difference indicates better model performance. Note that this analysis is sensitive to the number of selected tiles, so we matched the number of tiles for each comparison.
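The error measure described above can be sketched as follows. This is an illustration under stated assumptions: gaze points and tiles are `(x, y)` coordinates in tile units, distances are Euclidean (the metric is not specified in the text), and `prediction_error` is a hypothetical name.

```python
import math


def prediction_error(gaze_points, selected_tiles):
    """Mean distance from each gaze point to its nearest selected tile."""
    total = 0.0
    for gx, gy in gaze_points:
        # Assign the nearest selected tile to this gaze point.
        total += min(math.hypot(gx - tx, gy - ty) for tx, ty in selected_tiles)
    return total / len(gaze_points)


# Hypothetical example with the same number of tiles for both agents.
gaze = [(0, 0), (4, 0)]
model_tiles = [(0, 0), (4, 1)]
human_tiles = [(0, 3), (4, 4)]

# Model error minus participant error: negative means the model is closer.
difference = prediction_error(gaze, model_tiles) - prediction_error(gaze, human_tiles)
```

Because adding tiles can only lower the nearest-tile distance, the error shrinks as more tiles are selected, which is why the number of tiles has to be matched between the agents being compared.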

We also observed that human participants tended not to choose the Pac-Man tile. This was most likely because the instruction given to the human subjects encouraged them not to include Pac-Man in their answers. Both the monkeys and the transformer model directed a large share of their gaze or attention to Pac-Man. Therefore, we included the Pac-Man tile in the human responses when evaluating human-model differences.

Condensed model

We constructed a model that combines the bottom-up and top-down attention components based on the computation abstracted from the transformer network. While the original rollout score is a patch-level measurement, the condensed model calculates attention at the object level. To compare the condensed model with the original rollout score, we accounted for the difference in resolution by assigning all the rollout scores to the objects within the corresponding patch and setting all other tiles to zero. We then computed the Spearman correlation between the object-level rollout score and the attention predicted by the condensed model.

To validate the model prediction in the monkeys, we showed the correlation between the predicted attention and the distance to the gaze positions of each tile. Since the reconstruction results in many zero-attention tiles, we only show the top-n tiles that are non-zero in at least 10% of the samples.

Appendix D Results for Monkey D

Shown in Figure LABEL:fig:detron are the results when the transformer network was trained with monkey D’s behavior data. Overall, the results from the two monkeys are very similar. Interestingly, monkey D’s layer 1 head 2 attention shows a stronger correlation with the object value and generally larger weights (Figure LABEL:fig:detron H and Figure 5 B), suggesting that monkey D might have a stronger value-based bottom-up attention component compared with monkey O.