Image Reconstruction from Electroencephalography Using Latent Diffusion

Teng Fei
Department of Cognitive Science
University of California, San Diego
La Jolla, CA 92092
tfei@ucsd.edu

Virginia R. de Sa
Department of Cognitive Science
University of California, San Diego
La Jolla, CA 92092
desa@ucsd.edu

Abstract

In this work, we adapted the diffusion-based image reconstruction pipeline previously used for functional magnetic resonance imaging (fMRI) and applied it to electroencephalography (EEG). The EEG encoding method is very simple and forms a baseline against which more sophisticated EEG encoding methods can be compared. We also evaluated the fidelity of the generated images using the same metrics used in previous fMRI and magnetoencephalography (MEG) works. Our results show that while reconstructions from EEG recorded during rapidly presented images are not as good as reconstructions from fMRI recorded during more slowly presented images, the EEG holds a surprising amount of information that could be applied in specific use cases. Also, EEG-based image reconstruction works better in some categories, such as land animals and food, than in others, shedding new light on previous findings of EEG’s sensitivity to those categories and revealing the potential of these methods to further our understanding of EEG responses in human visual coding. Further investigation should use longer-duration image stimulation to elucidate the later components that might be salient to the different image categories.

Keywords EEG $\cdot$ visual-evoked potential $\cdot$ latent diffusion $\cdot$ rapid-serial visual presentation $\cdot$ visual perception

Code Availability

Code can be accessed on the GitHub repository:

https://github.com/desa-lab/EEG-Image-Reconstruction

1 Introduction

Visual perception is an important aspect of human cognition and a gateway to understanding more complex cognitive processes such as visual imagery and dream visuals. However, it has been a challenge to come up with an objective and reliable metric for measuring visual perception that is also commensurate with its complexity. The advent of functional neuroimaging allows for putting theories into practice, such as decoding visuals using the receptive field model (Kay et al., 2008). The introduction of the latent diffusion model (Xu et al., 2022) into the decoding scene transformed the previously blurry reconstructions into something vivid and humanly interpretable (Takagi and Nishimoto, 2023). As specific semantic contents are now rendered onto the images, it is also possible to compare the high-level semantic similarity of those images with their original images using deep neural network models trained to perform these tasks.

Compared to fMRI, which has well-defined source localization in space and fine-grained spatial resolution, EEG not only has an under-determined source space but is also constrained by volume conduction across different types of tissue between the neurons and the electrodes, which limits its functional spatial resolution to a few centimeters. Under such constraints, it is unlikely that EEG would contain remotely sufficient retinotopic information to reconstruct the images. On the other hand, it is common knowledge that the early components in visual-evoked potentials (VEP) reflect low-level visual features such as visual field (Halliday and Michael, 1970; Jeffreys and Axford, 1972), color (Paulus et al., 1988) and contrast (Schechter et al., 2005), and that certain ERPs such as the N170 are sensitive to certain visual categories such as faces (Taylor, 2002). This, combined with the wide availability and low cost of EEG, makes it appealing to test image reconstruction from EEG despite its shortcomings.

It is worth noting that even though latent diffusion models are generally complex and cost-prohibitive to train, the mapping from the brain signals (fMRI, MEG, EEG, etc.) to the embedding space of the diffusion models is generally simple and can be achieved using linear models. We applied regularized linear regression to map EEG signals from viewing 16740 images in an RSVP paradigm (Gifford et al., 2022) onto latent embeddings of the Versatile Diffusion model (Xu et al., 2022).

Earlier study

At the time of writing, a paper on EEG image reconstruction using latent diffusion has already been published; however, the dataset it used has been shown to be problematic due to its blocked experiment design (Li et al., 2020). The experiment was designed so that all presentations of one object (in both training and test data) were close in time and separated from presentations of other objects. Because EEG is a non-stationary signal that drifts over time, features in the EEG signals can correlate with time. This is evident in the high classifiability of that dataset and is further shown by extensive experiments in Li et al. (2020). Notably, visual inspection of the reconstructions from that dataset also reveals similarity in class, but no similarity in visual features separate from the class.

Concurrent study

At the time of writing, a different team had published a paper about 2 weeks earlier (Li et al., 2024). Their approach uses a transformer-based model called the adaptive thinking mapper (ATM). We show that even a linear model is feasible for the mapping.

2 Methods

We applied the method of Ozcelik and VanRullen (2023), with minimal modifications, to the THINGS-EEG2 dataset. To summarize their method, the image reconstruction is a 2-stage process. The first stage maps the brain signal onto the latent space of a variational auto-encoder (VAE), specifically the Very Deep Variational Auto-Encoder (VDVAE) (Child, 2020), which provides a rough visual representation that is passed on to the diffusion model. The second stage maps the same brain signal onto the CLIP-Vision and CLIP-Text embeddings of the Versatile Diffusion model (Xu et al., 2022), which combines the CLIP embeddings and the encoded images from the VDVAE to produce the reconstructed images. The Natural Scenes Dataset (Allen et al., 2022) used in the Ozcelik and VanRullen (2023) paper is built on the COCO image dataset, which has captions for each image that can be used for extracting CLIP-Text embeddings. For the THINGS dataset, the images only come with their corresponding category names rather than a full-sentence description, so those category names were used to generate the training CLIP-Text embeddings.
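As noted in the Introduction, each mapping from EEG to a latent embedding is a regularized linear regression. The following is a minimal sketch of such a mapping using closed-form ridge regression in NumPy. The array shapes, the regularization strength `alpha`, the synthetic data, and the function names are illustrative assumptions, not the settings or code used in this study.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    X: (n_samples, n_features) flattened EEG, Y: (n_samples, n_latent).
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for the weight matrix W.
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b

def predict(X, W, b):
    """Map flattened EEG onto the latent embedding space."""
    return X @ W + b

# Synthetic stand-in data (assumed sizes: 17 channels x 80 time points).
rng = np.random.default_rng(0)
n_train, n_channels, n_times, n_latent = 200, 17, 80, 32
eeg = rng.standard_normal((n_train, n_channels, n_times))
X = eeg.reshape(n_train, -1)  # one feature vector per image
true_W = 0.01 * rng.standard_normal((X.shape[1], n_latent))
Y = X @ true_W + 0.1 * rng.standard_normal((n_train, n_latent))

W, b = fit_ridge(X, Y, alpha=10.0)
pred = predict(X, W, b)
```

In this sketch, one such regression would be fitted per target embedding (VDVAE latents, CLIP-Vision, CLIP-Text), with the predicted embeddings then handed to the generative models.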

2.1 Dataset

We used the preprocessed version of THINGS-EEG2 (Gifford et al., 2022), which has 17 posterior EEG channels compared to the 63 total channels in the raw dataset.

https://osf.io/anp5v/

Each trial in the data is from -0.2 seconds to 0.8 seconds relative to the onset of the stimulus. The training images and test images are presented in separate sessions, but within training and test images all the orders are pseudo-randomized. Each training image is shown 4 times, and each test image is shown 80 times. We took the average of all trials for each image to form the final dataset.
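The averaging step can be sketched as follows. The array layout `(n_images, n_repetitions, n_channels, n_times)` and the random stand-in data are assumptions made for illustration; the actual file layout of the preprocessed dataset may differ.

```python
import numpy as np

def average_repetitions(trials):
    """Average over the repetition axis.

    trials: (n_images, n_repetitions, n_channels, n_times)
    returns: (n_images, n_channels, n_times)
    """
    return trials.mean(axis=1)

rng = np.random.default_rng(0)
# Stand-in noise: 4 repetitions per training image, 80 per test image,
# 17 channels, 100 samples per epoch (assumed shapes).
train_trials = rng.standard_normal((10, 4, 17, 100))
test_trials = rng.standard_normal((3, 80, 17, 100))

train_avg = average_repetitions(train_trials)
test_avg = average_repetitions(test_trials)
```

For independent noise, averaging n repetitions shrinks the noise standard deviation by a factor of about the square root of n, which is why the 80-trial test averages are considerably cleaner than the 4-trial training averages.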

2.2 Computing Basic Performance Metrics

We used the same performance metrics (See Figure 1) as in Ozcelik and VanRullen (2023). Quoting from their paper:

PixCorr is the pixel-level correlation of reconstructed and groundtruth images. SSIM (Wang et al., 2004) is the structural similarity index metric. AlexNet(2) and AlexNet(5) are the 2-way comparisons of the second and fifth layers of AlexNet (Krizhevsky et al., 2017), respectively. Inception is the 2-way comparison of the last pooling layer of InceptionV3 (Szegedy et al., 2015). CLIP is the 2-way comparison of the output layer of the CLIP-Vision (Radford et al., 2021) model. EffNet-B and SwAV are distance metrics gathered from EfficientNet-B1 (Tan and Le, 2019) and SwAV-ResNet50 (Caron et al., 2020) models, respectively. The first four can be considered as low-level metrics, while the last four reflect higher-level properties. For PixCorr and SSIM metrics, we downsampled generated images from 512×512 resolution to 425×425 resolution. For the rest of the measures, generated images are preprocessed according to the input properties of each network.
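To make two of the quoted metrics concrete, the sketch below shows one common way to compute a pixel correlation and a 2-way identification score from feature vectors. The function names and the exact comparison rule (own correlation versus correlation with every other ground-truth item) are assumptions for illustration; the evaluation code of the cited pipeline may differ in detail.

```python
import numpy as np

def pix_corr(recon, target):
    """Pearson correlation between the flattened pixels of two images."""
    return float(np.corrcoef(recon.ravel(), target.ravel())[0, 1])

def two_way_identification(recon_feats, true_feats):
    """Fraction of pairs (i, j != i) in which reconstruction i correlates
    more strongly with its own ground truth i than with ground truth j.

    Both inputs: (n_images, n_features). Chance level is about 0.5.
    """
    n = len(true_feats)
    # corrcoef stacks the rows of both inputs; the upper-right block holds
    # corr(recon_i, true_j).
    r = np.corrcoef(recon_feats, true_feats)[:n, n:]
    own = np.diag(r)[:, None]
    wins = (own > r).sum()  # diagonal entries never count as wins
    return float(wins) / (n * (n - 1))

rng = np.random.default_rng(0)
true_feats = rng.standard_normal((50, 100))
good = true_feats + 0.1 * rng.standard_normal((50, 100))
random_feats = rng.standard_normal((50, 100))

score_good = two_way_identification(good, true_feats)
score_random = two_way_identification(random_feats, true_feats)
```

In this sketch the feature vectors would come from the relevant network layer (for example AlexNet or CLIP-Vision) applied to the reconstructed and ground-truth images.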

Figure 1 (a) is computed by models trained on each subject’s 4-trial-averaged training data, and tested on their corresponding 80-trial-averaged test data. For Figure 1 (b), the "first 200ms", "first 400ms", "first 600ms" and "first 800ms" models use those corresponding time ranges after the onset of the stimulus. The chance level performance is computed by passing the 200ms before the onset of the stimulus onto the trained "first 200ms" model.
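The time-range selection can be sketched as below, assuming epochs stored as `(n_images, n_channels, n_times)` arrays spanning -200 ms to 800 ms at the 100 Hz sampling rate mentioned in Section 2.4. The helper name `crop_ms` is hypothetical.

```python
import numpy as np

def crop_ms(epochs, start_ms, stop_ms, tmin_ms=-200, sfreq=100):
    """Return the samples in [start_ms, stop_ms) relative to stimulus onset."""
    i0 = (start_ms - tmin_ms) * sfreq // 1000
    i1 = (stop_ms - tmin_ms) * sfreq // 1000
    return epochs[..., i0:i1]

epochs = np.zeros((5, 17, 100))  # stand-in for averaged epochs

# Post-stimulus windows used by the duration models.
windows = {f"first {ms}ms": crop_ms(epochs, 0, ms) for ms in (200, 400, 600, 800)}

# Pre-stimulus baseline used for the chance-level estimate.
baseline = crop_ms(epochs, -200, 0)
```

Because the 200 ms baseline has the same number of samples as the "first 200ms" window, it can be flattened and passed through the model trained on that window without any change of input dimension.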

2.3 Model Ablation

As described in Ozcelik and VanRullen (2023) and Takagi and Nishimoto (2023), the image generation pipeline maps each trial onto 3 latent embeddings used for the reconstruction: AutoKL, CLIP-Vision, and CLIP-Text. The AutoKL embedding is used by VDVAE and is responsible for the general shapes and colors of the reconstructed images. Both CLIP-Vision and CLIP-Text are used by Versatile Diffusion; CLIP-Vision is more responsible for the visual features of the images, and CLIP-Text is more responsible for the general semantic alignment.

To assess the relative contribution of the 3 kinds of latent embeddings to the model performance, we tested 5 kinds of ablated models: "CLIP-Text only" uses only the CLIP-Text embedding; "no CLIP-Vision" uses AutoKL and CLIP-Text; "CLIP-Vision only" uses only the CLIP-Vision embedding; "no CLIP-Text" uses AutoKL and CLIP-Vision; "no AutoKL" uses CLIP-Vision and CLIP-Text.

The ablation of AutoKL is done by feeding a blank gray image (127, 127, 127 in RGB) instead of the VDVAE output as the input image to Versatile Diffusion, and increasing the diffusion strength from 0.75 to 0.99. The ablation of CLIP-Vision is done by encoding a blank black image as CLIP-Vision instead of using the CLIP-Vision embedding predicted from EEG data, and changing the mixing ratio from 0.4 to 0. The ablation of CLIP-Text is done by encoding an empty string as CLIP-Text instead of using the CLIP-Text embedding predicted from EEG data, and changing the mixing ratio from 0.4 to 0.99.
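The settings implied by these rules for each model variant can be tabulated with a short sketch. The class and property names are hypothetical and only encode the values stated above; they are not identifiers from the released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    use_autokl: bool
    use_clip_vision: bool
    use_clip_text: bool

    @property
    def strength(self):
        # Without AutoKL the input is a blank gray image, so the diffusion
        # strength is raised from 0.75 to 0.99.
        return 0.75 if self.use_autokl else 0.99

    @property
    def mixing(self):
        # Ratio 0.4 when both CLIP embeddings are used; 0 when CLIP-Vision
        # is ablated; 0.99 when CLIP-Text is ablated.
        if not self.use_clip_vision:
            return 0.0
        if not self.use_clip_text:
            return 0.99
        return 0.4

VARIANTS = {
    "full": Variant(True, True, True),
    "CLIP-Text only": Variant(False, False, True),
    "no CLIP-Vision": Variant(True, False, True),
    "CLIP-Vision only": Variant(False, True, False),
    "no CLIP-Text": Variant(True, True, False),
    "no AutoKL": Variant(False, True, True),
}

settings = {name: (v.strength, v.mixing) for name, v in VARIANTS.items()}
```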

2.4 Pairwise Data Segment Replacement

Each trial in the EEG data contains 800 milliseconds of data, or 80 sample points at a sampling rate of 100 Hz. We took a few trials and paired them together to see the effect of swapping a small segment of the data between the two trials in each pair (see Figure 4). Take the top two rows in the figure, for example: the left-most images in the first and second rows have their 0-50 ms segments swapped. The second images from the left in the first and second rows have their 10-60 ms segments swapped. The third images from the left have their 20-70 ms segments swapped, and so forth. At a 10 ms step size and a sliding window length of 50 ms, there would be 75 images for the 800 ms trial. There is 1 additional image at the very right, which is reconstructed from the original data without the pairwise swapping, as a reference. Each subsequent pair of rows follows the same pairwise data segment replacement procedure.
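The swapping procedure can be sketched as follows, assuming each trial is a `(n_channels, n_times)` array at 100 Hz so that a 50 ms window is 5 samples and a 10 ms step is 1 sample. The function names are hypothetical, and the default of 75 windows simply follows the count given above.

```python
import numpy as np

def swap_segment(a, b, start, length):
    """Return copies of trials a and b with samples [start, start + length)
    exchanged along the time axis. The inputs are left unmodified."""
    a_new, b_new = a.copy(), b.copy()
    a_new[..., start:start + length] = b[..., start:start + length]
    b_new[..., start:start + length] = a[..., start:start + length]
    return a_new, b_new

def sliding_swaps(a, b, n_windows=75, step=1, length=5):
    """Yield one swapped pair per window position (units: samples)."""
    for k in range(n_windows):
        yield swap_segment(a, b, k * step, length)

# Stand-in trials that are easy to tell apart.
trial_a = np.zeros((17, 80))
trial_b = np.ones((17, 80))
pairs = list(sliding_swaps(trial_a, trial_b))
```

Each swapped trial would then be passed through the same trained mapping and generation pipeline as an unmodified trial, producing one reconstructed image per window position.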

3 Results

3.1 Basic Performance Metrics

The performance across the 10 subjects is relatively consistent, with a reasonable amount of variation. The performance across durations shows that using 400ms of data achieves slightly higher performance than using 200ms, despite the fact that other images have started showing by this time. See Appendix A for examples of reconstructions from subject 1. To put the performance in context, the reported THINGS-MEG data performance is slightly higher than ours (Benchetrit et al., 2024). However, they did not use the provided test set but rather held out parts of the training set as the test set, and thus did not have multiple trials to average at test time. Using a 3-second duration averaged over 3 NSD presentations with 7T fMRI recordings achieves significantly higher performance (Scotti et al., 2023).

3.1.1 Ablation

The full model achieved the best all-around performance compared to the ablated models. In general, models without CLIP-Text achieved a similar level of performance and were slightly better than models without CLIP-Vision.

3.1.2 ICA Components

We used the ICA algorithm in the MNE-Python package (Gramfort et al., 2013) on the averaged test EEG data for subject 1. Specifically, we used the "extended-infomax" method with a maximum of 1000 iterations (for reproducibility, we used random state 1). We then reconstructed the test EEG data using the top 1-16 independent components. We are currently investigating the enigmatic behavior of the Structural Similarity (SSIM) metric, which is inconsistent with the rest of the performance metrics. See Appendix C for more details on the ICA components.
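Reconstructing the EEG from a subset of components amounts to a back-projection through the mixing matrix. The sketch below illustrates only that linear algebra in NumPy: ICA itself is not run, and a random orthogonal matrix stands in for the unmixing matrix that a fitted ICA would supply. The function name and the component ordering are assumptions.

```python
import numpy as np

def reconstruct_from_components(data, unmixing, keep):
    """Back-project the first `keep` components into channel space.

    data: (n_channels, n_samples); unmixing: square (n_channels, n_channels).
    """
    mixing = np.linalg.inv(unmixing)  # columns are component scalp maps
    sources = unmixing @ data         # component time courses
    return mixing[:, :keep] @ sources[:keep]

rng = np.random.default_rng(1)
n_channels, n_samples = 17, 200
data = rng.standard_normal((n_channels, n_samples))
# Stand-in for a fitted ICA unmixing matrix (assumption, not a real fit).
unmixing, _ = np.linalg.qr(rng.standard_normal((n_channels, n_channels)))

recon_1 = reconstruct_from_components(data, unmixing, keep=1)
recon_16 = reconstruct_from_components(data, unmixing, keep=16)
recon_all = reconstruct_from_components(data, unmixing, keep=n_channels)
```

Keeping every component returns the original data, while keeping fewer components restricts the signal to the subspace spanned by their scalp maps.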

3.2 Feature Transfer Effect from Narrow Time Segment Swapping

To investigate the salient features in the data and to find the time ranges that are most sensitive to disturbance, we used pairs of image-viewing EEG data and swapped a 50ms sliding time window between them (see Figure 4). We moved the time window along the 800ms trial to see how much each image is disturbed by the other image at each time point. Refer to Section 2.4 for more details on the method.

Take the "gorilla-gopher" pair, for example: there is a clear fur color swapping effect in the 100-210ms time range. The feature swapping effect is not present for all pairs of images, and the exchange of features is not always symmetric. However, the examples indicate that a disturbance of the reconstructed image is generally present in the 100-380ms range, as can be seen even from a distance.

The examples are hand-picked from image classes that were well visualized, but without knowing the outcome of the swapping in advance.

4 Discussion

The similarity of the generated images to the ground truth images not only relies on the mapping algorithm but also the specific diffusion model and settings. Therefore to facilitate fair comparisons across studies, it would be helpful to adapt previous studies to newer diffusion models if there are significant improvements to the diffusion models. This study presents a baseline performance achievable using a relatively new diffusion model with a minimal amount of machine learning involved in the mapping.

EEG has well known limitations in spatial resolution which restrict the fidelity of the image decoding. The dataset used in this work used RSVP-style (rapid) presentation with stimuli lasting 100ms and new stimuli appearing every 200ms. While the rapid presentation increases the data quantity and thus the signal-to-noise ratio which can help greatly with the model performance, the restricted processing time (interrupted by new visual presentations) reduces any later cognitive processes that could be potentially useful in decoding. To investigate what later VEP components might be salient to different visual categories, it might be useful to prolong the duration of each image presentation to a little less than a second. Results from THINGS-MEG (Hebart et al., 2023) indicated that MEG data with longer ISI and presentation times achieved better reconstructions when longer windows of MEG were used (Li et al., 2024).

In its current form, EEG image reconstruction may be used in a few limited cases, such as entertainment and artwork generation. However, any application needs to take into account the nature of the signal as a response to sudden changes in the visual image. To apply it to real-world visual scenes, it might be necessary to add a rapid, automatic shutter goggle in front of the eyes to create artificially flashed visual stimuli that mimic the data format used in this study. Alternatively, similar methods could be explored with more natural image presentation paradigms.

Future research may also look into the direction of video reconstruction as video generation AI is approaching a similar level of sophistication as image generation AI. Video stimuli would be more representative of natural visual experiences. Different motions occurring in a video may also contain decodable time courses, something that would take advantage of the temporal resolution of EEG and MEG. Decoding a sub-section of a video clip could provide insight into mechanisms of ongoing visual processing, which, compared to evoked transient visual responses, might be more similar to internal visual representations such as visual imagery and dream visuals.

References

Kay et al. [2008] Kendrick N. Kay, Thomas Naselaris, Ryan J. Prenger, and Jack L. Gallant. Identifying natural images from human brain activity. Nature, 452(7185):352–355, March 2008. ISSN 0028-0836, 1476-4687. doi:10.1038/nature06713.
Xu et al. [2022] Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model, 2022. URL https://arxiv.org/abs/2211.08332.
Takagi and Nishimoto [2023] Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14453–14463, 2023. doi:10.1109/CVPR52729.2023.01389.
Halliday and Michael [1970] A. M. Halliday and W. F. Michael. Changes in pattern-evoked responses in man associated with the vertical and horizontal meridians of the visual field. The Journal of Physiology, 208(2):499–513, June 1970. ISSN 0022-3751, 1469-7793. doi:10.1113/jphysiol.1970.sp009134.
Jeffreys and Axford [1972] D.A. Jeffreys and J.G. Axford. Source locations of pattern-specific components of human visual evoked potentials. i. component of striate cortical origin. Experimental Brain Research, 16(1), November 1972. ISSN 0014-4819, 1432-1106. doi:10.1007/BF00233371. URL http://link.springer.com/10.1007/BF00233371.
Paulus et al. [1988] W.M. Paulus, H. Plendl, and S. Krafczyk. Spatial dissociation of early and late colour evoked components. Electroencephalography and Clinical Neurophysiology/Evoked Potentials Section, 71(2):81–88, March 1988. ISSN 01685597. doi:10.1016/0168-5597(88)90009-3.
Schechter et al. [2005] Isaac Schechter, Pamela D. Butler, Vance M. Zemon, Nadine Revheim, Alice M. Saperstein, Maria Jalbrzikowski, Roey Pasternak, Gail Silipo, and Daniel C. Javitt. Impairments in generation of early-stage transient visual evoked potentials to magno- and parvocellular-selective stimuli in schizophrenia. Clinical Neurophysiology, 116(9):2204–2215, September 2005. ISSN 13882457. doi:10.1016/j.clinph.2005.06.013.
Taylor [2002] Margot J. Taylor. Non-spatial attentional effects on p1. Clinical Neurophysiology, 113(12):1903–1908, December 2002. ISSN 13882457. doi:10.1016/S1388-2457(02)00309-7.
Gifford et al. [2022] Alessandro T. Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M. Cichy. A large and rich eeg dataset for modeling human visual object recognition. NeuroImage, 264:119754, December 2022. ISSN 10538119. doi:10.1016/j.neuroimage.2022.119754.
Li et al. [2020] Ren Li, Jared S. Johansen, Hamad Ahmed, Thomas V. Ilyevsky, Ronnie B. Wilbur, Hari M. Bharadwaj, and Jeffrey Mark Siskind. The perils and pitfalls of block design for eeg classification experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1–1, 2020. ISSN 0162-8828, 2160-9292, 1939-3539. doi:10.1109/TPAMI.2020.2973153.
Li et al. [2024] Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, and Quanying Liu. Visual decoding and reconstruction via eeg embeddings with guided diffusion. (arXiv:2403.07721), March 2024. URL http://arxiv.org/abs/2403.07721. arXiv:2403.07721 [cs, eess, q-bio].
Ozcelik and VanRullen [2023] Furkan Ozcelik and Rufin VanRullen. Natural scene reconstruction from fmri signals using generative latent diffusion. Scientific Reports, 13(1):15666, September 2023. ISSN 2045-2322. doi:10.1038/s41598-023-42891-8.
Child [2020] Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. 2020. doi:10.48550/ARXIV.2011.10650. URL https://arxiv.org/abs/2011.10650.
Allen et al. [2022] Emily J. Allen, Ghislain St-Yves, Yihan Wu, Jesse L. Breedlove, Jacob S. Prince, Logan T. Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, J. Benjamin Hutchinson, Thomas Naselaris, and Kendrick Kay. A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1):116–126, January 2022. ISSN 1097-6256, 1546-1726. doi:10.1038/s41593-021-00962-x.
Wang et al. [2004] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi:10.1109/TIP.2003.819861.
Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, may 2017. ISSN 0001-0782. doi:10.1145/3065386. URL https://doi.org/10.1145/3065386.
Szegedy et al. [2015] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015. URL https://arxiv.org/abs/1512.00567.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020.
Tan and Le [2019] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. 2019. doi:10.48550/ARXIV.1905.11946. URL https://arxiv.org/abs/1905.11946.
Caron et al. [2020] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments, 2020. URL https://arxiv.org/abs/2006.09882.
Benchetrit et al. [2024] Yohann Benchetrit, Hubert Banville, and Jean-Remi King. Brain decoding: Toward real-time reconstruction of visual perception. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
Scotti et al. [2023] Paul S. Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan J. Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth A. Norman, and Tanishq Mathew Abraham. Reconstructing the mind’s eye: fmri-to-image with contrastive learning and diffusion priors. (arXiv:2305.18274), October 2023. URL http://arxiv.org/abs/2305.18274. arXiv:2305.18274 [cs, q-bio].
Gramfort et al. [2013] Alexandre Gramfort, Martin Luessi, Eric Larson, Denis A. Engemann, Daniel Strohmeier, Christian Brodbeck, Roman Goj, Mainak Jas, Teon Brooks, Lauri Parkkonen, and Matti S. Hämäläinen. MEG and EEG data analysis with MNE-Python. Frontiers in Neuroscience, 7(267):1–13, 2013. doi:10.3389/fnins.2013.00267.
Hebart et al. [2023] Martin N Hebart, Oliver Contier, Lina Teichmann, Adam H Rockter, Charles Y Zheng, Alexis Kidder, Anna Corriveau, Maryam Vaziri-Pashkam, and Chris I Baker. Things-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife, 12:e82580, February 2023. ISSN 2050-084X. doi:10.7554/eLife.82580.
McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction, 2018. URL https://arxiv.org/abs/1802.03426.

Appendix A Example Reconstructions

Appendix B UMAP of Final CLIP Embeddings

The same images from Figure 5 can also be visualized in a self-organized map (Figure 6) by taking their final CLIP embeddings (the same final CLIP embeddings used in the performance metrics) and performing UMAP dimensionality reduction (McInnes et al., 2018) on them. The performance score based on the CLIP embeddings for each image is passed through a sigmoid function, the result of which is then multiplied by a baseline transparency and image size. The transparency and size of ground truth images (shaded blue) are always constant.
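The scaling of transparency and size can be written as a short sketch. The function names, the baseline values, and the absence of any centering or rescaling of the score before the sigmoid are assumptions for illustration only.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def marker_style(score, base_alpha, base_size):
    """Scale a baseline transparency and image size by sigmoid(score)."""
    weight = sigmoid(score)
    return base_alpha * weight, base_size * weight

# Hypothetical baseline values: a score of 0 halves both quantities.
alpha_mid, size_mid = marker_style(0.0, base_alpha=0.8, base_size=100)
alpha_hi, size_hi = marker_style(3.0, base_alpha=0.8, base_size=100)
```

Under this scheme, reconstructions with higher scores are drawn larger and more opaque, so well-reconstructed images stand out in the map.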