License: arXiv.org perpetual non-exclusive license
arXiv:2402.10251v3 [q-bio.NC] 06 Mar 2024

Brant-2: Foundation Model for Brain Signals

Zhizhang Yuan (Zhejiang University, zhizhangyuan@zju.edu.cn), Daoze Zhang (Zhejiang University, zhangdz@zju.edu.cn), Junru Chen (Zhejiang University, jrchen_cali@zju.edu.cn), Gefei Gu (Zhejiang University, frankgu@zju.edu.cn), and Yang Yang† (Zhejiang University, yangya@zju.edu.cn)
Abstract.

Foundation models benefit from pre-training on large amounts of unlabeled data and enable strong performance in a wide variety of applications with a small amount of labeled data. Such models can be particularly effective in analyzing brain signals, as this field encompasses numerous application scenarios, and it is costly to perform large-scale annotation. In this work, we present the largest foundation model in brain signals, Brant-2. Compared to Brant, a foundation model designed for intracranial neural signals, Brant-2 not only exhibits robustness towards data variations and modeling scales but also can be applied to a broader range of brain neural data. By experimenting on an extensive range of tasks, we demonstrate that Brant-2 is adaptive to various application scenarios in brain signals. Further analyses reveal the scalability of Brant-2, validate each component’s effectiveness, and showcase our model’s ability to maintain performance in scenarios with scarce labels. The source code and pre-trained weights are available at: https://github.com/yzz673/Brant-2.

Keywords: Foundation model, Brain signal, Pre-training
† Corresponding author.

1. Introduction

Brain signals refer to the biometric information collected from the brain (Zhang et al., 2021). Their patterns provide valuable insights into the physiological function of the brain and the mechanisms of related diseases, leading to applications in areas such as neurological disorders (Alturki et al., 2020; Chen et al., 2022), sleep health research (Supratak and Guo, 2020; Phan et al., 2021, 2023), and emotion recognition (Song et al., 2023, 2020). Brain signals are usually measured by invasive methods like stereoelectroencephalography (SEEG) or non-invasive methods like scalp electroencephalography (EEG). SEEG requires extra surgeries to implant the recording devices, resulting in a high cost (Kovács et al., 2021). However, it has the advantage of providing stereotactic and detailed information about deep brain structures. As a non-invasive method, EEG fails to capture deep brain information accurately and contains more noise due to the placement of electrodes on the scalp. However, compared to SEEG, EEG is easier to deploy because it requires no surgery, leading to more application scenarios. Despite these differences, both SEEG and EEG data use the same principle of electrical activity recording (Caune et al., 2012) and share the same physiological basis.

The field of brain signals encompasses a wide array of downstream tasks. It represents a cutting-edge domain that will continue to unveil new research directions and application scenarios in the future (Zhang et al., 2021). In addition, after collecting the brain signals, the annotation work highly relies on experts in the corresponding field, making large-scale data labeling infeasible (Diachenko et al., 2022). However, existing works in this field are mainly designed to solve specific tasks (Chen et al., 2022; Yuan et al., 2023; Zheng et al., 2023). Moreover, many of them (Jia et al., 2023; Lopes et al., 2023) require training models from scratch, which tends to have a high dependency on labels. Very limited research provides an off-the-shelf model, known as the foundation model, that can be applied to multiple scenarios in this field and serve as a tool for further investigation of the brain. Foundation models have shown great potential in language (Touvron et al., 2023a, b; Yang et al., 2023) and vision (Wang et al., 2023; Yuan et al., 2021), which not only allow for customization for diverse applications but also reduce costs for data annotation (Bommasani et al., 2022). Furthermore, by leveraging foundation models, researchers and developers no longer need to build models from scratch, which saves time and costs, benefiting the advancement of the field. Therefore, we aim to build a foundation model for brain signals that can effectively solve numerous downstream tasks for both SEEG and EEG data. However, when building a foundation model for brain signals, it is inevitable to encounter challenging issues along the way.

Firstly, different data exhibit differences in terms of sampling rates as well as the positions and quantities of electrodes. SEEG data is often recorded at a high sampling rate of at least 1000Hz (Gavvala et al., 2022) and exhibits significant inter-individual variability in electrode numbers and locations. EEG data is usually sampled at a lower frequency than SEEG (Dasgupta et al., 2022) and can vary significantly in terms of montage (the number and the places of electrodes placed on the scalp) (Yi et al., 2023). Secondly, brain signals collected from different scenarios contain distinct physiological characteristics, leading to varying modeling scales. For example, a sleep stage in sleep studies is often defined as lasting up to 30 seconds (Supratak and Guo, 2020; Phan et al., 2021, 2023; Jia et al., 2023), seizure detection may utilize a time scale of less than 10 seconds (Chen et al., 2022; Yuan et al., 2023), and existing works for emotion recognition adopt modeling scales of 5 seconds or shorter (Yi et al., 2023; Song et al., 2023). Thirdly, there is substantial diversity among different tasks in the field of brain signals. For example, in seizure detection (identifying whether a segment includes seizure waves), the model is required to extract information from the target signal, such as capturing spikes and sharp waves within the signal. On the other hand, in seizure prediction (predicting whether there will be future epileptic seizures), the model needs to anticipate future changes of the target signal.


Figure 1. Overview of our work. We first use approximately 4 TB of brain neural data from over 15k subjects to construct our pre-training corpus. We then use the corpus to train Brant-2 with two pre-training tasks. The pre-trained model can then be fine-tuned and applied to various application scenarios of brain signals.

As Fig. 1 shows, to build such a foundation model, the first step is to gather a large amount of unlabeled brain neural data, which is then used for large-scale pre-training, overcoming the challenges above. For applications, as an off-the-shelf model, the pre-trained model can be applied to various downstream scenarios through fine-tuning. In the field of brain signals, Zhang et al. (2023) propose a foundation model for SEEG, Brant, which can capture long-term dependency, spatial correlation, and time-frequency information from SEEG signals. However, Brant has some limitations that prevent it from addressing the above challenges. Therefore, we propose Brant-2, a foundation model for brain signals, which excels in three main aspects. Firstly, the pre-training corpus of Brant-2 is large and diverse. As shown in Fig. 1(a), Brant-2 utilizes nearly 4 TB of mixed SEEG and EEG data from more than 15k subjects. Due to the large volume and diversity of data, Brant-2 contains over 1 billion parameters. Secondly, Brant-2 is robust to data variations and different modeling scales. Brant is pre-trained on multi-channel data with a fixed sampling rate and window length, so it struggles to handle data variations and adapt to changes in modeling scales. During the pre-training process of Brant-2, we design a data augmentation module to further expand the diversity of the pre-training corpus, which enhances the robustness of Brant-2 towards data variations and modeling scales. Thirdly, Brant-2 can be applied to a broad range of tasks and scenarios. As shown in Fig. 1(b), compared to Brant, which is only pre-trained with mask-prediction, Brant-2 learns more comprehensive semantic knowledge through two pre-training tasks, leading to better generalization to a wider set of downstream tasks (shown in Fig. 1(c)). In summary, our key contributions comprise:

  • We propose a foundation model, Brant-2, the first off-the-shelf model that can be applied to scenarios of both SEEG and EEG. Brant-2 is the largest model in brain signals, pre-trained with nearly 4 TB of brain signal data from more than 15k subjects.

  • The pre-training framework we designed not only enhances the robustness of the model to significant data variations and different modeling scales but also enables it to adapt to diverse downstream tasks in brain signals.

  • We evaluate Brant-2 on a wide range of downstream tasks to illustrate the generalization ability of our model. Through additional analysis experiments, we demonstrate our model’s scalability, confirm each component’s efficacy, and highlight its ability to sustain performance in scenarios with limited labels.

2. Method

As mentioned above, building a foundation model for brain signals primarily requires handling data variations, different modeling scales, and diverse tasks. To tackle the variations of the data and modeling scales, we employ data augmentation during pre-training to enhance the diversity of the training data, improving our model’s robustness. To learn complex semantic representations and adapt to diverse downstream tasks, Brant-2 integrates time and frequency information and simultaneously focuses on reconstructing the input and forecasting future signals based on partial observations.

Notations. We use $\bm{s}_i \in \mathbb{R}^{C \times (B+F)},\ i \in \{1, 2, \dots, N\}$ with $C$ channel(s) and $B+F$ time steps to represent a segment of SEEG or EEG signals obtained from the pre-training corpus, where $N$ denotes the total sample number. The sample $\bm{s}_i$ is divided into two consecutive parts: the look back window $\bm{x}_i \in \mathbb{R}^{C \times B}$ and the future values $\bm{x}_i^{\text{fut}} \in \mathbb{R}^{C \times F}$, where the look back window $\bm{x}_i$ serves as the input.


Figure 2. The architecture and pre-training framework of Brant-2. The input raw signal $\bm{x}_i$ is first processed to subseries-level patches $\bm{p}_i$. Then we conduct data augmentation to increase the diversity of the training data and mask a subset of patches. We combine the information from both time and frequency domains to obtain the input embedding $\bm{h}_i, \hat{\bm{h}}_i$, which are then fed into the temporal and spatial encoder sequentially. The output representations $\bm{z}_i, \hat{\bm{z}}_i$ are linearly mapped to reconstruct the masked patches and forecast future signals.

2.1. Overall Architecture

The overall architecture of Brant-2 is shown in Fig. 2, which mainly involves four modules: 1) patching; 2) data augmentation and masking module; 3) input embedding module; 4) encoder.

Patching. Aggregating time steps into subseries-level patches can not only enhance the locality and capture comprehensive semantic information, but also reduce computation cost (Nie et al., 2022; Zhang et al., 2023). Thus, we divide the input sample $\bm{x}_i$ into non-overlapped patches with length $P$ and generate a set of patches $\bm{p}_i \in \mathbb{R}^{C \times L \times P}$, where $L = \lfloor B/P \rfloor$ is context length (i.e., the number of consecutive patches).
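The split into look back window and future values, followed by patching, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `make_patches` and the toy sizes are hypothetical, and dropping the trailing samples that do not fill a whole patch is an assumption (the text only states $L = \lfloor B/P \rfloor$).

```python
import numpy as np


def make_patches(x: np.ndarray, patch_len: int) -> np.ndarray:
    """Split a (C, B) look back window into (C, L, P) non-overlapping patches."""
    num_channels, num_steps = x.shape
    num_patches = num_steps // patch_len  # L = floor(B / P)
    # Assumption: trailing samples that do not fill a whole patch are dropped.
    usable = x[:, : num_patches * patch_len]
    return usable.reshape(num_channels, num_patches, patch_len)


# Toy sample s_i with C = 2 channels, B = 10 look back steps, F = 4 future steps.
C, B, F, P = 2, 10, 4, 3
s = np.arange(C * (B + F), dtype=float).reshape(C, B + F)
x, x_fut = s[:, :B], s[:, B:]  # look back window and future values
p = make_patches(x, P)         # shape (C, L, P) with L = 3
```

With $B = 10$ and $P = 3$, the last time step of each channel is not covered by any patch, which is the practical consequence of the floor in $L$.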

Data augmentation and masking. The quality of the data is of paramount importance for training a foundation model (Li et al., 2023). When assessing data quality, the diversity of the data is a crucial metric. High data diversity is beneficial for improving model performance, while low diversity can introduce biases and inaccuracies in the training process. In the field of language, LLMs (Large Language Models) are pre-trained using text corpus sourced from various domains (Touvron et al., 2023a, b; Yang et al., 2023). Furthermore, Lee et al. (2023) measure the diversity of publicly available LLM datasets and conclude that these datasets are highly diverse, which emphasizes the significance of data diversity for a foundation model.

In view of the significance of data diversity during pre-training, we introduce a data augmentation module to enhance the pre-training corpus in both the temporal and spatial dimensions, aiming to generate more diverse data. For the obtained patches $\bm{p}_i \in \mathbb{R}^{C \times L \times P}$, the temporal augmentation involves a random resampling to adjust the sampling rate of the input sample. The variation in sampling rates enriches the temporal scale of the samples, allowing the model to become more robust to handle changes in modeling scales. Formally, we choose an adjustment factor $m_k$ from $\mathcal{M} = \{m_1, \dots, m_K\}$ uniformly at random, then the input patches and future values are resampled by a factor of $m_k$, which are denoted as $\hat{\bm{p}}_i \in \mathbb{R}^{C \times L \times P_k}$ and $\hat{\bm{x}}_i^{\text{fut}} \in \mathbb{R}^{C \times F_k}$, where $P_k = P \times m_k$, $F_k = F \times m_k$. For the spatial augmentation, our goal is to enable the model to handle various numbers of channels, including single-channel data. Specifically, we first select a channel number $C_{k'}$ from $\mathcal{C} = \{C_1, \dots, C_{K'}\}$ ($C_{k'}$ is set equal to $C$ when $C_{k'} > C$), where $C_1 = 1$. Then, we shuffle the channel dimension of the resampled patches $\hat{\bm{p}}_i$ and the future values $\hat{\bm{x}}_i^{\text{fut}}$ according to the same rule and the data is reorganized into $n$ non-overlapping subset(s) along the channel dimension with $C_{k'}$ channel(s), where $n = \lfloor C / C_{k'} \rfloor$. In the masking procedure, we randomly mask a subset of $C_{k'} \times L$ patches with a fixed masking ratio, where the values of the masked patches are replaced by zeros. We denote the outputs of data augmentation and masking as $\tilde{\bm{p}}_i \in \mathbb{R}^{n \times C_{k'} \times L \times P_k}$ and $\tilde{\bm{x}}_i^{\text{fut}} \in \mathbb{R}^{n \times C_{k'} \times F_k}$.
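The three steps described here (temporal resampling, channel shuffling and regrouping, and zero-masking) can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not fix: resampling is done per patch by linear interpolation with `np.interp`, resampled lengths are rounded to integers, masks are drawn independently for each channel subset, and all function and variable names are hypothetical.

```python
import numpy as np


def resample_last_axis(a: np.ndarray, factor: float) -> np.ndarray:
    """Linearly resample the last axis to round(len * factor) points (assumed method)."""
    old_len = a.shape[-1]
    new_len = max(1, int(round(old_len * factor)))
    old_t = np.linspace(0.0, 1.0, old_len)
    new_t = np.linspace(0.0, 1.0, new_len)
    flat = a.reshape(-1, old_len)
    out = np.stack([np.interp(new_t, old_t, row) for row in flat])
    return out.reshape(*a.shape[:-1], new_len)


def augment_and_mask(p, x_fut, factors, channel_choices, mask_ratio, rng):
    """Return masked patches (n, C_k', L, P_k), futures (n, C_k', F_k), and the mask."""
    C, L, _ = p.shape
    # Temporal augmentation: one factor m_k, drawn uniformly, applied to both parts.
    m_k = float(rng.choice(factors))
    p_hat = resample_last_axis(p, m_k)        # (C, L, P_k)
    fut_hat = resample_last_axis(x_fut, m_k)  # (C, F_k)
    # Spatial augmentation: clip C_k' to C, shuffle channels with one shared
    # permutation, then form n = floor(C / C_k') non-overlapping subsets.
    c_k = min(int(rng.choice(channel_choices)), C)
    n = C // c_k
    perm = rng.permutation(C)[: n * c_k]
    p_grp = p_hat[perm].reshape(n, c_k, L, -1)
    fut_grp = fut_hat[perm].reshape(n, c_k, -1)
    # Masking: zero out a fixed ratio of the C_k' * L patches in each subset.
    num_masked = int(round(mask_ratio * c_k * L))
    mask = np.zeros((n, c_k * L), dtype=bool)
    for g in range(n):
        mask[g, rng.choice(c_k * L, size=num_masked, replace=False)] = True
    mask = mask.reshape(n, c_k, L)
    p_tilde = np.where(mask[..., None], 0.0, p_grp)
    return p_tilde, fut_grp, mask


rng = np.random.default_rng(0)
p = rng.standard_normal((5, 4, 6)) + 10.0  # C = 5, L = 4, P = 6
x_fut = rng.standard_normal((5, 8))        # F = 8
p_tilde, fut_tilde, mask = augment_and_mask(
    p, x_fut, factors=[0.5, 1.0, 2.0], channel_choices=[1, 2, 4, 8],
    mask_ratio=0.5, rng=rng,
)
```

Note that when $C$ is not a multiple of $C_{k'}$, the floor in $n$ means some shuffled channels are left out of every subset; the sketch simply discards them.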

Input embedding. For neural recordings, the time domain provides insights into the amplitude and duration of neural signals, while the frequency domain unveils oscillatory patterns and rhythmic activity (Kalaivani et al., 2014). By modeling neural signals in both domains, we can obtain a more comprehensive understanding of the underlying neurophysiological mechanisms (Morales and Bowers, 2022). Therefore, as shown in the top left corner of Fig. 2, we combine the features from both time and frequency domains to obtain the input embedding. We generate the frequency features $\tilde{\bm{p}}_i^{\text{F}}$ from the augmented data $\tilde{\bm{p}}_i$ by calculating the power spectral density (Youngworth et al., 2005) (PSD) that reveals the spectral power distribution in different frequency bands of the signals, which is associated with different brain functional states (Zhang et al., 2023). For example, during wakefulness, $\alpha$ (8-13 Hz) and $\beta$ (13-30 Hz) waves are more active; during sleep, $\delta$ (less than 4 Hz) and $\theta$ (4-8 Hz) waves are more prominent.
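To make the frequency features concrete, the sketch below estimates a PSD for each patch with a plain periodogram. The estimator is an assumption, since the text does not say which PSD method is used, and the function name `patch_psd`, the 250 Hz sampling rate, and the 10 Hz test tone are hypothetical choices for illustration.

```python
import numpy as np


def patch_psd(p: np.ndarray, fs: float):
    """One-sided periodogram PSD along the last (time) axis of each patch."""
    n = p.shape[-1]
    spec = np.fft.rfft(p, axis=-1)
    psd = (np.abs(spec) ** 2) / (fs * n)
    # One-sided spectrum: double every bin except DC (and Nyquist when n is even).
    if n % 2 == 0:
        psd[..., 1:-1] *= 2.0
    else:
        psd[..., 1:] *= 2.0
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd


# A 2-second patch containing a pure 10 Hz oscillation, i.e. inside the alpha band.
fs, n = 250.0, 500
t = np.arange(n) / fs
patch = np.sin(2 * np.pi * 10.0 * t)[None, None, :]  # shape (C=1, L=1, P=500)
freqs, psd = patch_psd(patch, fs)
peak_hz = freqs[np.argmax(psd[0, 0])]                # frequency of the largest bin
```

The frequency resolution is `fs / n`, so short patches give coarse spectra; a windowed or averaged estimator could be swapped in without changing the output shape.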

We use non-linear encoders to map the time and frequency data $\tilde{\bm{p}}_i,\ \tilde{\bm{p}}_i^{\text{F}}$ to $\frac{D}{2}$-dimensional latent representations $\tilde{\bm{h}}_i,\ \tilde{\bm{h}}_i^{\text{F}} \in \mathbb{R}^{n \times C_{k'} \times L \times \frac{D}{2}}$, which are then concatenated and added to a learnable positional encoding $\mathbf{W}_{\text{pos}} \in \mathbb{R}^{L \times D}$, which tracks the temporal order of patches, to obtain the input embedding $\bm{h}_i \in \mathbb{R}^{n \times C_{k'} \times L \times D}$:

$$\bm{h}_i = \text{Concat}(\tilde{\bm{h}}_i, \tilde{\bm{h}}_i^{\text{F}}) + \text{Broadcast}(\mathbf{W}_{\text{pos}}), \tag{1}$$

where the $\text{Broadcast}(\cdot)$ operator broadcasts $\mathbf{W}_{\text{pos}}$ to the same shape as $\text{Concat}(\tilde{\bm{h}}_i, \tilde{\bm{h}}_i^{\text{F}})$. Finally, we make a clone of the input embedding $\bm{h}_i$ and obtain $\hat{\bm{h}}_i$, preparing for the subsequent encoding process.
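The shapes in Eq. (1) can be checked with a small NumPy sketch. The one-layer tanh encoder, the random weights, and the toy dimensions are all hypothetical stand-ins; the text only states that the encoders are non-linear and that the positional encoding is learnable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c_k, L, P_k, n_freq, D = 2, 3, 4, 8, 5, 6  # toy sizes, D must be even


def nonlinear_encoder(x, w, b):
    """Hypothetical one-layer encoder: linear map followed by tanh."""
    return np.tanh(x @ w + b)


p_tilde = rng.standard_normal((n, c_k, L, P_k))    # time-domain patches
p_freq = rng.standard_normal((n, c_k, L, n_freq))  # stand-in for PSD features

w_t, b_t = rng.standard_normal((P_k, D // 2)), np.zeros(D // 2)
w_f, b_f = rng.standard_normal((n_freq, D // 2)), np.zeros(D // 2)
w_pos = rng.standard_normal((L, D))                # learnable in the real model

h_time = nonlinear_encoder(p_tilde, w_t, b_t)      # (n, c_k, L, D/2)
h_freq = nonlinear_encoder(p_freq, w_f, b_f)       # (n, c_k, L, D/2)
concat = np.concatenate([h_time, h_freq], axis=-1) # (n, c_k, L, D)
h = concat + w_pos      # NumPy broadcasting repeats w_pos over the (n, c_k) axes
h_hat = h.copy()        # clone used as the second encoder stream
```

Because $\mathbf{W}_{\text{pos}}$ has shape $L \times D$, every channel and every subset receives the same positional offsets, so the encoding carries patch order only.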

Encoder. In order to generalize to various scenarios, we incorporate both mask-prediction and forecasting tasks during pre-training to learn representations with rich semantic information. For the purpose of simultaneously accomplishing these two pre-training tasks, we design a multi-feed-forward (multi-FFN) Transformer block, as illustrated in the top right corner of Fig. 2. The block contains two FFNs, where one is used for signal reconstruction, and the other is employed for forecasting. We utilize a temporal encoder to capture time dependencies and a spatial encoder to capture channel correlations, both of which are composed of stacked multi-FFN Transformer blocks.

For temporal encoding, we model series of patches of length $L$: $\bm{h}_{i,j} \in \bm{h}_i,\ \hat{\bm{h}}_{i,j} \in \hat{\bm{h}}_i,\ j = 1, 2, \dots, n \times C_{k'}$, where $\bm{h}_{i,j},\ \hat{\bm{h}}_{i,j} \in \mathbb{R}^{L \times D}$. We denote the outputs of the $(l-1)$-th layer of the temporal encoder as $\bm{o}_{i,j}^{l-1},\ \hat{\bm{o}}_{i,j}^{l-1} \in \mathbb{R}^{L \times D}$, where $\bm{o}_{i,j}^{0} = \bm{h}_{i,j},\ \hat{\bm{o}}_{i,j}^{0} = \hat{\bm{h}}_{i,j}$. For the $l$-th layer, the outputs of the previous layer $\bm{o}_{i,j}^{l-1},\ \hat{\bm{o}}_{i,j}^{l-1}$ go through the same multi-head attention (Vaswani et al., 2023) followed by a residual addition and a normalization to obtain the attention outputs $\bm{a}_{i,j}^{l},\ \hat{\bm{a}}_{i,j}^{l} \in \mathbb{R}^{L \times D}$, which are then fed separately into two FFNs (denoted as $\text{FFN}_{\text{mask}}^{l}$ and $\text{FFN}_{\text{fcst}}^{l}$). Because the mask operation leaves the data incomplete during pre-training, we utilize the information reconstructed by $\text{FFN}_{\text{mask}}^{l}$ to assist forecasting. Therefore, we establish a residual connection without gradients between the two attention outputs $\bm{a}_{i,j}^{l}$ and $\hat{\bm{a}}_{i,j}^{l}$:

$$\bm{f}_{i,j}^{l} = \text{FFN}_{\text{mask}}^{l}(\bm{a}_{i,j}^{l}), \tag{2}$$
$$\hat{\bm{f}}_{i,j}^{l} = \text{FFN}_{\text{fcst}}^{l}(\hat{\bm{a}}_{i,j}^{l} + \text{Detach}(\bm{a}_{i,j}^{l})), \tag{3}$$

where Detach()Detach\text{Detach}(\cdot)Detach ( ⋅ ) operator returns a clone of the original input without gradients. Followed by a residual addition and a normalization, we derive the outputs of the l𝑙litalic_l-th layer 𝒐i,jl,𝒐^i,jlL×Dsuperscriptsubscript𝒐𝑖𝑗𝑙superscriptsubscriptbold-^𝒐𝑖𝑗𝑙superscript𝐿𝐷\bm{o}_{i,j}^{l},\ \bm{\hat{o}}_{i,j}^{l}\in\mathbb{R}^{L\times D}bold_italic_o start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , overbold_^ start_ARG bold_italic_o end_ARG start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_L × italic_D end_POSTSUPERSCRIPT. The whole outputs of the temporal encoder are denoted as 𝒕i,𝒕^in×Ck×L×Dsubscript𝒕𝑖subscriptbold-^𝒕𝑖superscript𝑛subscript𝐶superscript𝑘𝐿𝐷\bm{t}_{i},\ \bm{\hat{t}}_{i}\in\mathbb{R}^{n\times C_{k^{\prime}}\times L% \times D}bold_italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , overbold_^ start_ARG bold_italic_t end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_C start_POSTSUBSCRIPT italic_k start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT × italic_L × italic_D end_POSTSUPERSCRIPT. 
For spatial encoding, we take the outputs of the temporal encoder and model sequences of $C_{k'}$ patches taken from different channels: $\bm{t}_{i,j} \in \bm{t}_{i},\ \bm{\hat{t}}_{i,j} \in \bm{\hat{t}}_{i},\ j = 1, 2, \ldots, n \times L$, where $\bm{t}_{i,j},\ \bm{\hat{t}}_{i,j} \in \mathbb{R}^{C_{k'} \times D}$. The spatial encoder applies the same encoding process as the temporal encoder described above.
The outputs of the spatial encoder, $\bm{z}_{i},\ \bm{\hat{z}}_{i} \in \mathbb{R}^{n \times C_{k'} \times L \times D}$, serve as the latent representations of Brant-2.
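
The regrouping of patches between the two encoders can be illustrated with a minimal sketch. It assumes a nested-list layout of shape [n][C][L][D] and uses hypothetical helper names (`temporal_sequences`, `spatial_sequences`); it only shows how the same tensor is viewed as n×C sequences of length L for the temporal encoder and n×L sequences of length C for the spatial encoder, not how Brant-2 implements it.

```python
# Illustrative only: the layout [n][C][L][D] and both helper names are assumptions.

def temporal_sequences(x):
    """Return n*C sequences, each holding the L patches of one channel."""
    return [channel for sample in x for channel in sample]


def spatial_sequences(x):
    """Return n*L sequences, each holding the C patches at one time position."""
    sequences = []
    for sample in x:
        n_patches = len(sample[0])
        for t in range(n_patches):
            sequences.append([channel[t] for channel in sample])
    return sequences


n, C, L = 2, 3, 4
# Each "embedding" stores its own indices (D = 3) so the regrouping is easy to check.
x = [[[[i, c, t] for t in range(L)] for c in range(C)] for i in range(n)]

temporal = temporal_sequences(x)
spatial = spatial_sequences(x)
print(len(temporal), len(temporal[0]))  # number of sequences, sequence length L
print(len(spatial), len(spatial[0]))    # number of sequences, sequence length C
```

In this toy layout, `spatial[i * L + t][c]` is the embedding of sample `i`, channel `c`, patch `t`, which matches the index range j = 1, ..., n×L given in the text.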

2.2. Pre-training and Fine-tuning

Pre-training. We adopt mask-prediction and forecasting tasks during pre-training so that the model extracts rich semantic information and can adapt to different downstream tasks. Mask prediction lets the model learn the patterns within a given segment of the signal, whereas forecasting lets it learn future trends from the currently observed series. We use two linear heads, $\mathbf{W}_{\text{rec}} \in \mathbb{R}^{D \times P_{k}}$ and $\mathbf{W}_{\text{fcst}} \in \mathbb{R}^{D \times F_{k}}$, to map the latent representations back to the original signal space. During pre-training, the model performs patch-level reconstruction and series-level forecasting:

(4) $\bm{p}_{i}^{\text{rec}} = \bm{z}_{i}\mathbf{W}_{\text{rec}},$
(5) $\bm{x}_{i}^{\text{fcst}} = \text{MeanPool}(\bm{\hat{z}}_{i})\mathbf{W}_{\text{fcst}},$

where $\bm{p}_{i}^{\text{rec}} \in \mathbb{R}^{n \times C_{k'} \times L \times P_{k}}$, $\bm{x}_{i}^{\text{fcst}} \in \mathbb{R}^{n \times C_{k'} \times F_{k}}$, and the $\text{MeanPool}(\cdot)$ operation aggregates each group of $L$ consecutive patches in $\bm{\hat{z}}_{i}$. Following the masked modeling and forecasting paradigms, Brant-2 is then supervised by two MSE losses in the pre-training stage:

(6) $\mathcal{L}_{rec} = \sum_{i=1}^{N} \|\tilde{\bm{p}}_{i} - \bm{p}_{i}^{\text{rec}}\|_{2}^{2},$
(7) $\mathcal{L}_{fcst} = \sum_{i=1}^{N} \|\bm{x}_{i}^{\text{fut}} - \bm{x}_{i}^{\text{fcst}}\|_{2}^{2},$

where $N$ is the number of training samples. The joint optimization objective is the sum of the losses $\mathcal{L}_{rec}$ and $\mathcal{L}_{fcst}$.
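
As a minimal sketch of the joint objective, the snippet below sums the squared L2 errors of Eqs. (6) and (7) over a toy batch. Tensors are flattened Python lists and the function names are hypothetical. The sketch applies the reconstruction error to every entry it is given; restricting it to masked patches is left to the caller, since the equations here do not spell that out.

```python
# Illustrative only: flat lists stand in for tensors; names are assumptions.

def squared_l2(target, prediction):
    """Squared L2 distance between two equally sized flat vectors."""
    if len(target) != len(prediction):
        raise ValueError("target and prediction must have the same length")
    return sum((t - p) ** 2 for t, p in zip(target, prediction))


def joint_loss(rec_targets, rec_preds, fut_targets, fcst_preds):
    """Return (L_rec, L_fcst, L_rec + L_fcst), each summed over the samples."""
    loss_rec = sum(squared_l2(t, p) for t, p in zip(rec_targets, rec_preds))
    loss_fcst = sum(squared_l2(t, p) for t, p in zip(fut_targets, fcst_preds))
    return loss_rec, loss_fcst, loss_rec + loss_fcst


# Two toy samples (N = 2): flattened patches and flattened future series.
rec_targets = [[1.0, 2.0], [0.0, 1.0]]
rec_preds = [[1.0, 1.0], [0.0, 3.0]]
fut_targets = [[2.0], [4.0]]
fcst_preds = [[1.0], [4.0]]

loss_rec, loss_fcst, total = joint_loss(rec_targets, rec_preds, fut_targets, fcst_preds)
print(loss_rec, loss_fcst, total)  # prints: 5.0 1.0 6.0
```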

Fine-tuning. When fine-tuning the model, we first apply mean pooling over each group of $L$ consecutive patches of the latent representations $\bm{z}_{i},\ \bm{\hat{z}}_{i} \in \mathbb{R}^{n \times C_{k'} \times L \times D}$; the pooled representations are then aggregated by a weighted sum:

(8) $\bm{r}_{i} = \lambda\,\text{MeanPool}(\bm{z}_{i}) + (1 - \lambda)\,\text{MeanPool}(\bm{\hat{z}}_{i}),$

where $\bm{r}_{i} \in \mathbb{R}^{n \times C_{k'} \times D}$ and $\lambda$ is a learnable parameter. The aggregated representation $\bm{r}_{i}$ is fed into a linear or non-linear head for the downstream tasks.
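
Eq. (8) can be illustrated for a single channel with the following sketch. The helper names are hypothetical, and $\lambda$ is passed as a fixed number for illustration, whereas in Brant-2 it is learned during fine-tuning.

```python
# Illustrative only: one channel, L patches of dimension D, fixed lambda.

def mean_pool(patches):
    """Average L patch embeddings (each of dimension D) into one D-vector."""
    length = len(patches)
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / length for d in range(dim)]


def aggregate(z, z_hat, lam):
    """Compute lam * MeanPool(z) + (1 - lam) * MeanPool(z_hat)."""
    pooled = mean_pool(z)
    pooled_hat = mean_pool(z_hat)
    return [lam * a + (1.0 - lam) * b for a, b in zip(pooled, pooled_hat)]


z = [[1.0, 2.0], [3.0, 4.0]]      # L = 2 patches, D = 2
z_hat = [[0.0, 0.0], [2.0, 2.0]]
r = aggregate(z, z_hat, lam=0.5)
print(r)  # prints: [1.5, 2.0]
```

Pooling removes the patch axis, which is why the aggregated representation has shape n × C_{k'} × D rather than n × C_{k'} × L × D.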

3. Experiments

3.1. Pre-training Setup

Pre-training datasets. The pre-training corpus of Brant-2 is a mixed dataset of SEEG and EEG data from over 15k subjects, with a total data size approaching 4 TB. The SEEG corpus used for pre-training contains intracranial neural data recorded from 26 subjects. The original corpus has a total size of 12 TB. After removing unused channels and applying preprocessing such as denoising and filtering, we obtain 2.3 TB of SEEG data for pre-training. The surgical procedure involves the implantation of invasive electrodes with 47 to 238 channels. The corpus contains SEEG data at 1000Hz, 2000Hz, and 4000Hz sampling rates. The EEG corpus used for pre-training is a publicly available dataset, TUEG (Harati et al., 2014), which comprises 1,643 GB of clinical recordings from 14,987 individuals with a total of 27,063 hours of data. The dataset contains over 40 different channel configurations, and approximately 95% of the data includes a 10/20 configuration as a subset of the available channels. The sampling rate of the recordings varies between 250Hz and 1024Hz.

Pre-training details. In the encoder block of Brant-2, we apply RMSNorm (Zhang and Sennrich, 2019) and use the Swish activation function (Ramachandran et al., 2017). We set the context length $L$ of Brant-2 to 16 patches, the masking ratio to 40%, and the forecasting length to 1/4 of the context length (the hyperparameter analysis of the masking ratio and forecasting length is shown in App. B). The adjustment factor of the sampling rate is uniformly chosen from $\mathcal{M} = \{0.25, 0.5, 1.0, 2.0\}$ and the reorganized channel number is chosen from $\mathcal{C} = \{1, 2, 4, 8, 16, 32, 64, 128\}$. Brant-2 is trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with $\beta_{1} = 0.9$, $\beta_{2} = 0.95$, and $eps = 10^{-5}$. For learning rate scheduling, we use a linear warmup of 1k steps to reach a peak learning rate of $1.0 \times 10^{-5}$, followed by a cosine decay of 150k steps that brings the final learning rate to $0$. During the pre-training process, the model parameters are updated for a total of 105k steps. Our models are trained on a Linux system with 2 CPUs (AMD EPYC 9654 96-Core Processor) and 4 GPUs (NVIDIA Tesla A100 80G). Brant-2 contains 1 billion parameters and takes over 100 hours to pre-train.
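
The learning rate schedule described above can be sketched as a small function. This is an illustration under two assumptions that the text does not state explicitly: the cosine decay starts right after the warmup ends, and it follows the common half-cosine form. The constant and function names are hypothetical.

```python
import math

# Values taken from the text; the exact functional form is an assumption.
PEAK_LR = 1.0e-5
WARMUP_STEPS = 1_000
DECAY_STEPS = 150_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 76_000, 105_000, 151_000):
    print(step, learning_rate(step))
```

Under this reading, the reported 105k update steps would end before the schedule reaches zero, so the final learning rate actually used would still be positive.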

3.2. Evaluation Setup

We conduct evaluation experiments on nine diverse SEEG and EEG datasets, encompassing five downstream tasks: seizure detection, seizure prediction, sleep stage classification, emotion recognition, and motor imagery classification. We divide each dataset into several non-overlapping groups and conduct n-fold cross-validation over these groups: each time, one group is held out for evaluation and the others are used for fine-tuning. Fine-tuning mainly updates the last two layers of the temporal encoder and the classification head, while the remaining parameters of Brant-2 are frozen (more details are shown in App. D.2).
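
The group-wise cross-validation protocol can be sketched as follows. The round-robin assignment of subjects to groups and the subject identifiers are assumptions made for illustration; the text only states that the groups do not overlap and that each group is held out once.

```python
# Illustrative only: grouping strategy and subject IDs are hypothetical.

def split_into_groups(subjects, n_groups):
    """Assign subjects round-robin to n_groups non-overlapping groups."""
    groups = [[] for _ in range(n_groups)]
    for index, subject in enumerate(subjects):
        groups[index % n_groups].append(subject)
    return groups


def cross_validation_folds(groups):
    """Yield (fine_tune_subjects, eval_subjects), holding out one group per fold."""
    for held_out in range(len(groups)):
        fine_tune = [s for g, group in enumerate(groups) if g != held_out for s in group]
        yield fine_tune, list(groups[held_out])


subjects = [f"sub-{i:02d}" for i in range(1, 11)]
groups = split_into_groups(subjects, n_groups=5)
folds = list(cross_validation_folds(groups))
for fine_tune, evaluation in folds:
    print(len(fine_tune), evaluation)
```

Splitting by subject rather than by clip keeps every subject's data on one side of the split in each fold.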

Seizure detection. Accurate seizure detection is crucial for diagnosing and treating individuals with epilepsy and other seizure-related disorders. Seizure detection aims to identify instances of seizures in brain signals recorded from epilepsy patients, and is formalized as a binary classification between physiological and pathological samples. We employ 2 SEEG and 2 EEG datasets to evaluate model performance on seizure detection. The SEEG datasets, MAYO and FNUSA (Nejedly et al., 2020), contain 5000Hz SEEG recordings from 18 and 13 subjects, respectively. For each dataset, the subjects are divided into six groups and the data is segmented into 3-second clips. We preserve the physiological and pathological activities and remove the artifacts and power line noise. CHB-MIT (Guttag, 2010; Shoeb, 2009) consists of 23-channel EEG recordings with a sampling rate of 256Hz from 22 subjects with intractable seizures. We divide the subjects into four groups and segment the signals into 8-second clips. Siena (Detti et al., 2020; Detti, 2020) consists of 27-channel EEG recordings from 14 patients with a sampling rate of 512Hz. The subjects are divided into five groups, and the signals are segmented into 4-second clips. We use precision, recall, F1 and F2 scores as evaluation metrics. In the epilepsy scenario, F2 is valued more than F1 because it weights recall more heavily, and missing a seizure is costly in diagnosis.
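
To make the difference between F1 and F2 concrete, the sketch below uses the standard F-beta definition, which we assume is the one intended here. The precision and recall values are made up for illustration and are not taken from the result tables.

```python
# Standard F-beta score; the example inputs are illustrative, not reported results.

def f_beta(precision, recall, beta):
    """F-beta score: recall is weighted beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.5, 1.0
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(round(f1, 4), round(f2, 4))  # prints: 0.6667 0.8333
```

A detector that catches every seizure at the cost of some false alarms is rewarded more under F2 than under F1. Because the tables report averages over folds, an F-score recomputed from the averaged precision and recall will generally not match the reported average F-score.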

Seizure prediction. Different from seizure detection, seizure prediction is conducted under a more challenging setting where the task is to predict the likelihood of future seizures based on the current observations. Seizure prediction is crucial for providing early warnings and alerts for individuals with epilepsy. We utilize a clinical SEEG dataset from a first-class hospital labeled by professional neurosurgeons. The dataset contains 5 subjects with a sampling rate of 1000Hz, and we adopt a 5-fold cross-validation. We sample 16-second segments and predict whether a seizure occurs within the next 1 minute. We adopt the same evaluation metrics as in seizure detection (i.e., precision, recall, F1 and F2 score) to measure the performance of the models.
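
The labelling rule for seizure prediction can be sketched as below. The function name, the use of non-overlapping segments, and the treatment of segments that overlap an ongoing seizure (labelled negative here) are all assumptions; the text only specifies 16-second segments and a 1-minute prediction horizon.

```python
# Illustrative only: times are in seconds; segmentation details are assumptions.

def label_segments(n_seconds, seizure_onsets, segment_len=16, horizon=60):
    """Label a segment 1 if a seizure starts within `horizon` seconds after it ends."""
    labels = []
    for start in range(0, n_seconds - segment_len + 1, segment_len):
        end = start + segment_len
        positive = any(end <= onset < end + horizon for onset in seizure_onsets)
        labels.append((start, end, int(positive)))
    return labels


labels = label_segments(n_seconds=160, seizure_onsets=[100])
for start, end, label in labels:
    print(start, end, label)
```

With a seizure onset at 100 s, the segments starting at 32 s, 48 s, 64 s and 80 s are positive in this toy recording, because each of them ends no more than 60 s before the onset.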

Sleep stage classification. In sleep health research, sleep staging plays a critical role in enhancing our understanding of sleep states and patterns, contributing to the prevention and diagnosis of sleep-related disorders (Phan and Mikkelsen, 2022). The American Academy of Sleep Medicine (AASM) manual divides sleep into five stages: wake, N1, N2, N3, and REM (Berry et al., 2017). Thus, sleep stage classification is formalized as a 5-class classification. We choose 2 EEG datasets, SleepEDF (Kemp et al., 2000) and the Haaglanden Medisch Centrum sleep staging database (Alvarez-Estevez and Rijsman, 2021, 2022) (referred to as HMC), to verify model performance on sleep stage classification. For SleepEDF, we adopt the SleepEDF-78 dataset, which contains 153 whole-night polysomnographic (PSG) sleep recordings from 78 subjects collected during the sleep cassette studies. The EEG data has a sampling rate of 100Hz, contains 1 EEG channel, and is segmented into 30-second epochs. We randomly divide the subjects into five groups. HMC is a collection of 151 whole-night PSG sleep recordings from 151 subjects. The data is sampled at 256Hz and contains 4 EEG channels. The subjects are split into five groups, and the signals are segmented into 30-second epochs. As evaluation metrics, we use accuracy, sensitivity, specificity, macro F1 score, and Cohen's kappa $\kappa$.
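
Two of the metrics above, macro F1 and Cohen's kappa, can be computed from a confusion matrix as in the following sketch. It uses the standard definitions, which we assume match the ones used in the evaluation. The 2-class matrix is a toy example chosen for easy checking and is not data from the paper, where the task has five classes.

```python
# Standard definitions; confusion[i][j] counts true class i predicted as class j.

def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    k = len(confusion)
    scores = []
    for c in range(k):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(k)) - tp
        fn = sum(confusion[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


def cohen_kappa(confusion):
    """Agreement between predictions and labels, corrected for chance agreement."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[c][c] for c in range(k)) / total
    expected = sum(
        sum(confusion[c]) * sum(confusion[r][c] for r in range(k)) for c in range(k)
    ) / (total * total)
    if expected == 1.0:
        return 0.0  # degenerate case: a single class covers everything
    return (observed - expected) / (1.0 - expected)


confusion = [[40, 10], [20, 30]]
macro = macro_f1(confusion)
kappa = cohen_kappa(confusion)
print(round(macro, 4), round(kappa, 4))
```

In the toy matrix the accuracy is 0.70 while kappa is about 0.40, which illustrates how kappa discounts the agreement expected by chance.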

Emotion recognition. EEG-based emotion recognition is attracting growing interest among researchers and has driven advances in various domains, including biomedical research and brain-computer interfaces (BCIs) (Kamble and Sengupta, 2023). The SEED dataset (Zheng and Lu, 2015; Duan et al., 2013) contains 62-channel EEG data recorded from 15 subjects while they watch film clips. The film clips are carefully selected to induce different types of emotion (positive, negative, and neutral). Thus, we conduct discrete emotion recognition formalized as a 3-class classification. The data is down-sampled to 200Hz, segmented into 5-second segments, and split into five groups. The evaluation metrics include accuracy and macro F1 score.

Motor imagery classification. Motor imagery classification aims to classify brain activity patterns related to imagined movements, and has gained significant attention due to its potential applications in BCIs, rehabilitation therapies, and assistive technologies. We select EEG Motor Movement/Imagery (Schalk et al., 2004; Goldberger et al., 2000) (referred to as Motor Imagery) as the dataset for this task. Motor Imagery consists of over 1500 one- and two-minute 64-channel EEG recordings with a 160Hz sampling rate obtained from 109 volunteers. For each subject, a target appears on either the left or the right side of the screen, and the subject imagines opening and closing the corresponding fist until the target disappears. The model aims to determine from the collected EEG signals whether the subject is imagining opening and closing the left fist or the right fist. We split the subjects into five groups and clip the data into 6.4-second segments. The evaluation metrics include accuracy and F1 score.
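
The clip lengths quoted for the nine datasets translate into the following numbers of samples per channel. The sampling rates and durations are taken from the dataset descriptions above; the helper names are hypothetical, and the sketch assumes non-overlapping clips cut at the stated rate, before any resampling the model may apply.

```python
# (sampling rate in Hz, clip length in seconds) as stated in the dataset descriptions.
CLIPS = {
    "MAYO": (5000, 3.0),
    "FNUSA": (5000, 3.0),
    "CHB-MIT": (256, 8.0),
    "Siena": (512, 4.0),
    "Clinical": (1000, 16.0),
    "SleepEDF": (100, 30.0),
    "HMC": (256, 30.0),
    "SEED": (200, 5.0),
    "Motor Imagery": (160, 6.4),
}


def samples_per_clip(rate_hz, seconds):
    """Number of samples per channel in one clip."""
    return round(rate_hz * seconds)


def segment(signal, clip_len):
    """Cut a 1-D signal into non-overlapping clips, dropping the remainder."""
    return [signal[i:i + clip_len] for i in range(0, len(signal) - clip_len + 1, clip_len)]


sizes = {name: samples_per_clip(rate, secs) for name, (rate, secs) in CLIPS.items()}
for name, size in sizes.items():
    print(name, size)
```

For example, a 6.4-second Motor Imagery clip at 160Hz holds 1024 samples, and a 3-second MAYO clip at 5000Hz holds 15000 samples.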

3.3. Baselines

We extensively compare our model with 12 advanced methods, which are divided into three categories, including 1) 3 methods aimed at time series universal modeling, 2) 3 methods based on self-supervised pre-training on brain signals, and 3) 6 methods specifically designed for each task. The methods in the first two categories are evaluated on all downstream tasks, while the methods in the third category are only evaluated on the specific tasks.

To be precise, we choose TF-C (Zhang et al., 2022), SimMTM (Dong et al., 2023) and One Fits All (Zhou et al., 2023) as universal modeling methods for time series. For the pre-training works on brain signals, we choose BrainBERT (Wang et al., 2022), Brant (Zhang et al., 2023) and MBrain (Cai et al., 2023). In addition, we select PPi (Yuan et al., 2023) (seizure detection on SEEG data), ScatterFormer (Zheng et al., 2023) (seizure detection on EEG data), Lopes et al. (2023) (seizure prediction), SleepHGNN (Jia et al., 2023) (sleep stage classification), EEG Conformer (Song et al., 2023) (emotion recognition) and TSFF-Net (Miao and Zhao, 2023a) (motor imagery classification) as the task-specific methods for each downstream task. More details of the baselines are shown in App. C.


Figure 3. The overall performance comparison of our model and other baseline methods on all downstream datasets.

Table 1. Average performance on seizure detection.

| Methods | MAYO Pre. | MAYO Rec. | MAYO F1 | MAYO F2 | FNUSA Pre. | FNUSA Rec. | FNUSA F1 | FNUSA F2 |
|---|---|---|---|---|---|---|---|---|
| TF-C (Zhang et al., 2022) | 55.12 ± 10.37 | 70.72 ± 14.49 | 55.84 ± 8.89 | 61.48 ± 9.43 | 69.80 ± 4.91 | 81.46 ± 7.63 | 74.76 ± 6.30 | 78.50 ± 8.16 |
| SimMTM (Dong et al., 2023) | 51.03 ± 10.75 | 78.97 ± 9.31 | 56.66 ± 10.44 | 67.13 ± 10.04 | 69.33* ± 5.71 | 82.44 ± 12.05 | 74.74 ± 6.03 | 79.00 ± 9.39 |
| One Fits All (Zhou et al., 2023) | 59.60 ± 11.21 | 77.62 ± 10.71 | 66.26* ± 3.23 | 72.22 ± 4.02 | 63.65 ± 11.03 | 82.18 ± 9.22 | 70.90 ± 6.00 | 76.95 ± 5.02 |
| BrainBERT (Wang et al., 2022) | 54.47 ± 11.25 | 89.70* ± 11.43 | 64.65 ± 7.54 | 76.02 ± 6.31 | 58.11 ± 8.60 | 87.62* ± 8.79 | 67.91 ± 6.46 | 77.67 ± 4.95 |
| Brant (Zhang et al., 2023) | 49.24 ± 12.56 | 93.30 ± 7.84 | 63.38 ± 8.22 | 77.92* ± 2.36 | 48.15 ± 5.13 | 94.72 ± 4.94 | 63.61 ± 3.72 | 79.07 ± 3.31 |
| MBrain (Cai et al., 2023) | 55.39 ± 10.46 | 80.08 ± 10.97 | 61.81 ± 4.55 | 70.32 ± 10.12 | 67.88 ± 5.43 | 85.89 ± 9.51 | 75.25* ± 6.78 | 81.10* ± 10.68 |
| PPi (Yuan et al., 2023) | 68.94 ± 12.14 | 88.73 ± 8.49 | 73.77 ± 6.24 | 80.74 ± 4.63 | 72.77 ± 5.06 | 86.60 ± 13.67 | 78.39 ± 8.29 | 82.93 ± 9.49 |
| Brant-2 | 55.41* ± 9.74 | 95.88 ± 3.86 | 69.90 ± 8.45 | 83.30 ± 6.07 | 68.78 ± 9.86 | 95.96 ± 4.43 | 79.76 ± 7.61 | 88.59 ± 5.21 |

| Methods | CHB-MIT Pre. | CHB-MIT Rec. | CHB-MIT F1 | CHB-MIT F2 | Siena Pre. | Siena Rec. | Siena F1 | Siena F2 |
|---|---|---|---|---|---|---|---|---|
| TF-C (Zhang et al., 2022) | 17.82 ± 8.44 | 61.36 ± 10.66 | 18.01 ± 6.72 | 27.61 ± 6.52 | 7.98 ± 1.34 | 61.63 ± 10.92 | 13.68 ± 3.20 | 24.92 ± 3.78 |
| SimMTM (Dong et al., 2023) | 54.31 ± 8.12 | 36.77 ± 8.71 | 42.82 ± 8.95 | 38.84 ± 8.79 | 32.30 ± 10.51 | 43.37 ± 11.72 | 27.47 ± 6.10 | 32.97 ± 9.35 |
| One Fits All (Zhou et al., 2023) | 51.14 ± 12.20 | 56.58 ± 5.23 | 51.21 ± 9.95 | 53.69 ± 5.47 | 47.27 ± 10.26 | 43.68 ± 8.55 | 43.00 ± 5.42 | 42.97 ± 5.29 |
| BrainBERT (Wang et al., 2022) | 52.47 ± 8.04 | 59.44 ± 12.25 | 55.09 ± 8.95 | 57.41 ± 10.06 | 47.33* ± 9.81 | 60.61 ± 10.15 | 48.54* ± 6.20 | 53.52 ± 6.92 |
| Brant (Zhang et al., 2023) | 57.42 ± 8.42 | 61.41* ± 10.07 | 57.99* ± 7.63 | 59.65* ± 8.09 | 43.07 ± 11.05 | 66.04* ± 6.81 | 48.36 ± 8.42 | 55.27* ± 4.82 |
| MBrain (Cai et al., 2023) | 53.39 ± 14.42 | 52.87 ± 8.08 | 49.35 ± 7.40 | 50.38 ± 6.18 | 38.94 ± 8.83 | 60.99 ± 5.61 | 45.88 ± 7.06 | 53.32 ± 6.62 |
| ScatterFormer (Zheng et al., 2023) | 55.73* ± 8.03 | 64.92 ± 10.14 | 59.81 ± 8.80 | 62.72* ± 9.52 | 50.33 ± 11.38 | 67.49 ± 7.21 | 53.49 ± 7.04 | 58.98 ± 4.15 |
| Brant-2 | 56.44 ± 7.53 | 73.04 ± 7.83 | 63.42 ± 7.48 | 68.78 ± 7.65 | 50.35 ± 8.52 | 70.42 ± 4.53 | 58.14 ± 4.77 | 64.70 ± 1.92 |

Table 2. Average performance on sleep stage classification.

| Methods | SleepEDFx Acc. | SleepEDFx Sens. | SleepEDFx Spec. | SleepEDFx Macro F1 | SleepEDFx Kappa | HMC Acc. | HMC Sens. | HMC Spec. | HMC Macro F1 | HMC Kappa |
|---|---|---|---|---|---|---|---|---|---|---|
| TF-C (Zhang et al., 2022) | 65.96 ± 1.28 | 51.42 ± 2.37 | 90.26 ± 0.47 | 49.37 ± 1.92 | 50.73 ± 2.04 | 48.04 ± 4.89 | 35.30 ± 4.67 | 84.35 ± 1.31 | 30.15 ± 6.67 | 23.90 ± 7.08 |
| SimMTM (Dong et al., 2023) | 63.85 ± 2.30 | 36.29 ± 1.54 | 88.97 ± 0.68 | 31.37 ± 2.62 | 44.40 ± 3.54 | 44.64 ± 2.01 | 31.52 ± 2.48 | 83.18 ± 0.76 | 27.27 ± 3.51 | 17.76 ± 3.90 |
| One Fits All (Zhou et al., 2023) | 68.45 ± 1.95 | 56.32 ± 3.04 | 91.20 ± 0.71 | 54.77 ± 1.70 | 55.03 ± 2.75 | 58.64 ± 1.55 | 51.12 ± 2.85 | 88.42 ± 0.70 | 50.54 ± 2.97 | 43.52 ± 3.00 |
| BrainBERT (Wang et al., 2022) | 69.56 ± 1.85 | 59.40* ± 2.40 | 91.80 ± 0.55 | 58.66* ± 1.66 | 57.13 ± 2.51 | 60.69 ± 1.67 | 53.06 ± 2.13 | 89.04 ± 0.61 | 51.95* ± 2.01 | 46.48 ± 2.58 |
| Brant (Zhang et al., 2023) | 69.06 ± 2.69 | 58.25 ± 3.62 | 91.63 ± 0.83 | 56.84 ± 3.32 | 56.55 ± 3.77 | 51.02 ± 3.15 | 41.90 ± 2.20 | 85.89 ± 0.60 | 38.12 ± 3.74 | 30.42 ± 2.90 |
| MBrain (Cai et al., 2023) | 71.91* ± 0.98 | 58.35 ± 1.99 | 92.16* ± 0.38 | 58.22 ± 3.63 | 59.83* ± 2.10 | 62.33* ± 1.24 | 53.97* ± 1.87 | 89.50* ± 0.14 | 51.48 ± 2.94 | 48.65* ± 0.88 |
| SleepHGNN (Jia et al., 2023) | 77.56 ± 2.06 | 70.38 ± 2.57 | 94.18 ± 0.60 | 69.79 ± 3.98 | 69.72 ± 3.79 | 64.87 ± 2.34 | 57.27 ± 2.48 | 90.24 ± 0.56 | 56.93 ± 3.50 | 52.21 ± 3.56 |
| Brant-2 | 77.15 ± 1.39 | 67.01 ± 2.51 | 93.90 ± 0.43 | 67.20 ± 2.42 | 68.05 ± 2.22 | 68.76 ± 2.41 | 63.74 ± 1.74 | 91.52 ± 0.43 | 63.87 ± 1.94 | 58.20 ± 2.65 |

Table 3. Average performance on seizure prediction.

| Methods | Clinical Pre. | Clinical Rec. | Clinical F1 | Clinical F2 |
|---|---|---|---|---|
| TF-C (Zhang et al., 2022) | 34.92 ± 6.61 | 49.35 ± 13.53 | 37.28 ± 9.50 | 42.88 ± 11.10 |
| SimMTM (Dong et al., 2023) | 60.39 ± 10.76 | 37.94 ± 6.72 | 45.37 ± 7.69 | 40.42 ± 6.99 |
| One Fits All (Zhou et al., 2023) | 55.84 ± 8.48 | 43.41 ± 10.39 | 48.04* ± 9.42 | 45.01* ± 10.02 |
| BrainBERT (Wang et al., 2022) | 61.89* ± 13.92 | 29.38 ± 12.47 | 39.05 ± 13.98 | 32.55 ± 13.05 |
| Brant (Zhang et al., 2023) | 55.39 ± 11.28 | 40.89 ± 9.97 | 40.01 ± 8.80 | 39.29 ± 9.29 |
| MBrain (Cai et al., 2023) | 54.75 ± 13.29 | 44.26 ± 11.86 | 41.79 ± 9.59 | 41.67 ± 9.82 |
| Lopes et al. (2023) | 62.82 ± 9.17 | 46.86* ± 10.21 | 50.84 ± 8.46 | 47.85 ± 10.36 |
| Brant-2 | 62.67 ± 7.25 | 49.94 ± 8.04 | 55.20 ± 7.93 | 51.87 ± 8.05 |

Table 4. Average performance on emotion recognition.

| Methods | SEED Acc. | SEED Macro F1 |
|---|---|---|
| TF-C (Zhang et al., 2022) | 82.87 ± 5.21 | 82.13 ± 5.66 |
| SimMTM (Dong et al., 2023) | 81.69 ± 7.06 | 81.26 ± 7.20 |
| One Fits All (Zhou et al., 2023) | 87.80 ± 3.35 | 87.67 ± 3.38 |
| BrainBERT (Wang et al., 2022) | 85.98 ± 5.46 | 85.81 ± 5.98 |
| Brant (Zhang et al., 2023) | 89.50* ± 3.57 | 89.43* ± 3.71 |
| MBrain (Cai et al., 2023) | 84.60 ± 7.47 | 84.52 ± 7.45 |
| EEG Conformer (Song et al., 2023) | 93.17 ± 4.20 | 93.10 ± 4.22 |
| Brant-2 | 93.47 ± 3.09 | 93.42 ± 3.08 |

Table 5. Average performance on motor imagery classification.

| Methods | Motor Imagery Acc. | Motor Imagery F1 |
|---|---|---|
| TF-C (Zhang et al., 2022) | 60.06 ± 1.62 | 57.79 ± 3.00 |
| SimMTM (Dong et al., 2023) | 57.48 ± 1.82 | 57.37 ± 2.80 |
| One Fits All (Zhou et al., 2023) | 71.25 ± 3.50 | 72.56* ± 3.06 |
| BrainBERT (Wang et al., 2022) | 64.84 ± 4.19 | 70.32 ± 3.55 |
| Brant (Zhang et al., 2023) | 72.00* ± 1.93 | 71.84 ± 2.42 |
| MBrain (Cai et al., 2023) | 61.06 ± 2.09 | 60.42 ± 4.65 |
| TSFF-Net (Miao and Zhao, 2023a) | 73.00 ± 4.32 | 73.87 ± 2.10 |
| Brant-2 | 74.33 ± 3.61 | 74.30 ± 3.83 |

3.4. Evaluation Results

Main Results. Fig. 3 summarizes the overall results of Brant-2 compared with the baseline methods on all the downstream tasks. The radar chart shows that Brant-2 outperforms all universal time series modeling methods and all pre-training methods on brain signals, and surpasses a majority of the task-specific methods, indicating that our method generalizes well across various brain signal scenarios. Detailed statistics and comparisons for each task are discussed in the following paragraphs; in all the tables, we mark the values ranking first (v), second (v), and third (v*) in each column.

Seizure Detection. Tab. 1 shows the results of seizure detection on SEEG and EEG datasets. On MAYO and FNUSA, Brant-2 achieves the best recall and F2 score among all models, demonstrating our model's ability in seizure detection on SEEG data. Regarding the F2 score, PPi secures the second position, which can be attributed to the fact that PPi contains a pre-training framework specifically designed for seizure detection and dedicated techniques to address inter-subject variability. On CHB-MIT and Siena, Brant-2 ranks first in almost all performance metrics, showing a strong ability in EEG-based seizure detection.

Seizure Prediction. Tab. 3 shows the average performance on the seizure prediction task on the clinical dataset. Brant-2 ranks first in terms of F1 and F2 scores, with relative improvements of 37.97% and 32.02% over Brant, indicating that the forecasting task in pre-training enhances the predictive capability. Lopes et al. (2023) achieves the second-best F1 and F2 scores, demonstrating the effectiveness of combining original signals and handcrafted features. However, as a fully supervised method, Lopes et al. (2023) relies heavily on labels, which are often scarce in clinical settings. We will investigate the impact of scarce labels on model performance in Sec. 4.3. Apart from Brant-2 and Lopes et al. (2023), One Fits All achieves the highest F2 score, which could be attributed to its use of a pre-trained GPT-2 (Radford et al., 2019) with strong predictive abilities as the backbone.
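
The relative improvements quoted above can be reproduced from the F1 and F2 averages of Brant-2 and Brant in Tab. 3. The short check below uses a hypothetical helper name and the usual definition of relative improvement, (new − baseline) / baseline.

```python
# F1 and F2 averages of Brant-2 and Brant on the clinical dataset (Tab. 3).

def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


f1_gain = relative_improvement(55.20, 40.01)
f2_gain = relative_improvement(51.87, 39.29)
print(f"{f1_gain:.2f}% {f2_gain:.2f}%")  # prints: 37.97% 32.02%
```

The corresponding absolute gains are 15.19 and 12.58 points, respectively.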

Sleep Stage Classification. The results of sleep stage classification on SleepEDFx and HMC are shown in Tab. 2. Overall, Brant-2 and SleepHGNN exhibit comparable performance, with Brant-2 outperforming SleepHGNN on HMC and SleepHGNN having a slight edge on SleepEDFx. As a specialized model designed for sleep stage classification, SleepHGNN incorporates EEG signals and synchronously collected EOG, ECG, and EMG signals, thereby leveraging multiple modalities for improved performance. Brant-2 achieves comparable performance using only EEG signals, highlighting the effectiveness of our large-scale pre-training.

Emotion Recognition. Tab. 4 contains the results of emotion recognition on the SEED dataset. Our model obtains the best results and EEG Conformer ranks second. Like Brant-2, EEG Conformer also considers temporal dependencies and spatial correlations, which it models with a convolution module made of temporal and spatial convolutional layers.

Motor Imagery Classification. The performance of motor imagery classification is shown in Tab. 5. Brant-2 achieves the highest accuracy and F1 score, demonstrating its effectiveness in motor imagery classification. TSFF-Net, which uses time-frequency spectrograms, achieves the second-best accuracy and F1 score, highlighting the significance of time-frequency domain information in this scenario.

4. Analysis

4.1. Scalability Analysis

Large language and vision models (Bai et al., 2023; Yang et al., 2023; Touvron et al., 2023a, b) have shown strong scaling behavior. Since Brant-2 is a large model for brain signals, we investigate its scaling behavior in terms of pre-training loss and downstream task performance.

Setup. In addition to Brant-2, we pre-train two smaller versions with 200 million and 460 million parameters, following the same training configurations described in Sec. 3.1. We then evaluate the models on all five downstream tasks, with each task using one dataset (CHB-MIT for seizure detection, Clinical for seizure prediction, SleepEDFx for sleep stage classification, SEED for emotion recognition and Motor Imagery for motor imagery classification).


Figure 4. The results of scalability analysis.

Results. The training losses of the two pre-training objectives (calculated every 5k steps) are shown in Fig. 4(a). As training progresses, 1) the training losses of all models, regardless of their size, continue to decrease; and 2) the larger the model, the faster the losses decrease. These observations indicate that Brant-2 scales well during pre-training. Furthermore, as shown in Fig. 4(b), larger models attain better performance across all tasks, showing that the benefits of scale transfer to a range of downstream tasks.

4.2. Ablation Study

We perform ablation experiments to assess the effectiveness of the architectural design of the model and the pre-training tasks.

Setup. We set three model variants to validate the effectiveness of our architectural design: 1) Brant-2 w/o temporal encoder: remove the temporal encoder; 2) Brant-2 w/o spatial encoder: remove the spatial encoder; 3) Brant-2 w/o multi-FFN: replace the multi-FFN Transformer encoder block of Brant-2 with the vanilla Transformer encoder block (Vaswani et al., 2023). For each model variant, we keep the parameter count approximately the same to ensure a fair comparison. To illustrate the usefulness of both pre-training tasks, we set two further variants: 4) Brant-2 w/o mask: pre-train with forecasting only; 5) Brant-2 w/o forecast: pre-train with mask-prediction only. Brant-2 and the above five variants are evaluated on all five downstream tasks, with each task using the same dataset as in the scalability analysis (i.e., CHB-MIT for seizure detection, Clinical for seizure prediction, SleepEDFx for sleep stage classification, SEED for emotion recognition and Motor Imagery for motor imagery classification). Since each model variant requires pre-training, and this process alone takes over 100 hours for a 1-billion-scale model as described in Sec. 3.1, all the experiments in the ablation study are based on Brant-2-460M.
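One way to keep the six configurations straight is to express each variant as a set of switches. The sketch below is illustrative only: the class, field, and function names are hypothetical, and combining the two pre-training losses as an unweighted sum is an assumption, since this section does not state how the objectives are weighted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantConfig:
    """Hypothetical switches describing one ablation variant."""
    use_temporal_encoder: bool = True
    use_spatial_encoder: bool = True
    use_multi_ffn: bool = True       # False means a vanilla Transformer block
    use_mask_task: bool = True       # mask-prediction pre-training objective
    use_forecast_task: bool = True   # forecasting pre-training objective


VARIANTS = {
    "Brant-2": VariantConfig(),
    "w/o temporal encoder": VariantConfig(use_temporal_encoder=False),
    "w/o spatial encoder": VariantConfig(use_spatial_encoder=False),
    "w/o multi-FFN": VariantConfig(use_multi_ffn=False),
    "w/o mask": VariantConfig(use_mask_task=False),
    "w/o forecast": VariantConfig(use_forecast_task=False),
}


def pretraining_loss(mask_loss: float, forecast_loss: float, cfg: VariantConfig) -> float:
    """Sum only the objectives enabled for this variant (assumed unweighted)."""
    total = 0.0
    if cfg.use_mask_task:
        total += mask_loss
    if cfg.use_forecast_task:
        total += forecast_loss
    return total


for name, cfg in VARIANTS.items():
    print(name, pretraining_loss(1.0, 2.0, cfg))
```

Each variant differs from the full model in exactly one switch, so a performance drop can be attributed to the component that was removed.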


Figure 5. The results of ablation study.

Results. The ablation results are shown in Fig. 5, in which Brant-2 outperforms the other variants across all five downstream tasks, demonstrating the effectiveness of each component of our work. Brant-2 w/o temporal encoder exhibits poor overall performance in these downstream tasks, highlighting the importance of temporal dependency for brain signals. In certain tasks (e.g., seizure detection, emotion recognition), Brant-2 w/o mask outperforms Brant-2 w/o forecast, indicating that these tasks require a better understanding of patterns within a signal segment. On the other hand, in some tasks (e.g., seizure prediction), Brant-2 w/o forecast performs better, demonstrating that these tasks prioritize predicting future changes based on the currently observed series. Therefore, joint training on the two pre-training tasks enhances the model's ability to adapt to different downstream tasks.

4.3. Label Scarcity Scenario Exploration

The results in Sec. 3.4 have demonstrated that Brant-2 can generalize well to various tasks. Since Brant-2 is a foundation model, we also investigate whether it can reduce over-reliance on labels and remain applicable to scenarios with scarce labels.

Setup. We conduct experiments on the Clinical dataset, which originates from a real-world clinical epilepsy scenario where the annotation cost is high. By choosing this dataset, we intend to simulate real-world scenarios closely and address the challenges associated with expensive annotations in clinical settings. We compare our model with the best-performing baseline method, Lopes et al. (2023), which is fully supervised. We conduct three sets of experiments on each model with 100%, 10%, and 1% of the training data.
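Reduced training sets of this kind can be drawn as in the sketch below. The section does not say how the 10% and 1% subsets were selected, so the class-stratified sampling used here is an assumption, and the function name and the toy label vector are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices of a class-stratified subset of the training set."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    selected = []
    for indices in by_class.values():
        # Keep at least one example per class so that no class disappears.
        n_keep = max(1, round(fraction * len(indices)))
        selected.extend(rng.sample(indices, n_keep))
    return sorted(selected)


# Hypothetical imbalanced labels: 900 negative and 100 positive segments.
labels = [0] * 900 + [1] * 100
subsets = {frac: stratified_subsample(labels, frac) for frac in (1.0, 0.1, 0.01)}
for frac, idx in subsets.items():
    positives = sum(labels[i] for i in idx)
    print(f"{frac:>5}: {len(idx)} samples, {positives} positive")
```

Stratifying keeps the class ratio roughly constant across the three settings, so a drop in F1 or F2 reflects fewer labels rather than a shifted class balance.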


Figure 6. The performance changes of the model as the training labels decrease.

Results. The variation in model performance with decreasing training labels is shown in Fig. 6. Overall, performance declines to some degree as the training labels decrease. When moving from 100% to 1% of the labels, the F1 and F2 scores of Brant-2 and Brant-2-460M decrease by less than 10% and 15%, respectively. In contrast, the F1 and F2 scores of Lopes et al. (2023) decline by 50.6% and 32.6%, respectively. The results indicate that Brant-2 reduces reliance on labels and thereby maintains performance in scenarios with scarce labels.

5. Conclusion

We propose a foundation model, Brant-2, the first off-the-shelf model that can be applied to both SEEG and EEG scenarios. Brant-2 is able to handle significant data variations and generate powerful representations of brain signals from a broad range of application scenarios. We experiment on five downstream tasks to illustrate the generalization ability of Brant-2. In addition, Brant-2 exhibits scalability in both pre-training and downstream tasks. Furthermore, we explore how model performance changes in scenarios with scarce labels, in which the performance of Brant-2 remains much more stable than that of the supervised SOTA method designed for the scenario, indicating that our model alleviates the reliance on labels. The field of brain signals is continuously evolving, with emerging research directions and scenarios. In the future, we aim to train our model on a more diverse and extensive corpus, enabling its application in more research areas and scenarios.

References

  • Alturki et al. (2020) Fahd A. Alturki, Khalil AlSharabi, Akram M. Abdurraqeeb, and Majid Aljalal. 2020. EEG Signal Analysis for Diagnosing Neurological Disorders Using Discrete Wavelet Transform and Intelligent Techniques. Sensors 20 (2020).
  • Alvarez-Estevez and Rijsman (2022) Diego Alvarez-Estevez and Roselyne Rijsman. 2022. Haaglanden Medisch Centrum sleep staging database (version 1.1). https://doi.org/10.13026/t79q-fr32.
  • Alvarez-Estevez and Rijsman (2021) Diego Alvarez-Estevez and Roselyne M Rijsman. 2021. Inter-database validation of a deep learning approach for automatic sleep scoring. PloS one 16 (2021).
  • Bai et al. (2023) Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. 2023. Sequential Modeling Enables Scalable Learning for Large Vision Models. arXiv:2312.00785 [cs.CV]
  • Berry et al. (2017) Richard B Berry, Rita Brooks, Charlene Gamaldo, Susan M Harding, Robin M Lloyd, Stuart F Quan, Matthew T Troester, and Bradley V Vaughn. 2017. AASM scoring manual updates for 2017 (version 2.4).
  • Bommasani et al. (2022) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2022. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 [cs.LG]
  • Cai et al. (2023) Donghong Cai, Junru Chen, Yang Yang, Teng Liu, and Yafeng Li. 2023. MBrain: A Multi-Channel Self-Supervised Learning Framework for Brain Signals. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Caune et al. (2012) Vairis Caune, Juris Zagars, and Radu Ranta. 2012. EEG/SEEG Signal Modelling using Frequency and Fractal Analysis.. In BIOSIGNALS.
  • Chen et al. (2022) Junru Chen, Yang Yang, Tao Yu, Yingying Fan, Xiaolong Mo, and Carl Yang. 2022. BrainNet: Epileptic Wave Detection from SEEG with Hierarchical Graph Diffusion Learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Dasgupta et al. (2022) Debjani Dasgupta, Anna Miserocchi, Andrew W McEvoy, and John S Duncan. 2022. Previous, current, and future stereotactic EEG techniques for localising epileptic foci. Expert Review of Medical Devices 19 (2022), 571–580.
  • Detti (2020) Paolo Detti. 2020. Siena Scalp EEG Database (version 1.0.0). https://doi.org/10.13026/5d4a-j060.
  • Detti et al. (2020) Paolo Detti, Giampaolo Vatti, and Garazi Zabalo Manrique de Lara. 2020. EEG Synchronization Analysis for Seizure Prediction: A Study on Data of Noninvasive Recordings. Processes 8 (2020).
  • Diachenko et al. (2022) Marina Diachenko, Simon J Houtman, Erika L Juarez-Martinez, Jennifer R Ramautar, Robin Weiler, Huibert D Mansvelder, Hilgo Bruining, Peter Bloem, and Klaus Linkenkaer-Hansen. 2022. Improved manual annotation of EEG signals through convolutional neural network guidance. Eneuro 9 (2022).
  • Dong et al. (2023) Jiaxiang Dong, Haixu Wu, Haoran Zhang, Li Zhang, Jianmin Wang, and Mingsheng Long. 2023. SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Duan et al. (2013) Ruo-Nan Duan, Jia-Yi Zhu, and Bao-Liang Lu. 2013. Differential entropy feature for EEG-based emotion classification. In 6th International IEEE/EMBS Conference on Neural Engineering (NER).
  • Gavvala et al. (2022) Jay Gavvala, Muhammad Zafar, Saurabh R. Sinha, Giridhar Kalamangalam, and Stephan Schuele. 2022. Stereotactic EEG Practices: A Survey of United States Tertiary Referral Epilepsy Centers. Journal of Clinical Neurophysiology 39 (2022).
  • Goldberger et al. (2000) Ary L. Goldberger, Luís A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, …, and H. Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101 (2000).
  • Guttag (2010) John Guttag. 2010. CHB-MIT Scalp EEG Database. PhysioNet. https://doi.org/10.13026/C2K01R
  • Harati et al. (2014) A Harati, S Lopez, I Obeid, J Picone, MP Jacobson, and S Tobochnik. 2014. The TUH EEG CORPUS: A big data resource for automated EEG interpretation. In 2014 IEEE signal processing in medicine and biology symposium (SPMB).
  • Jia et al. (2023) Ziyu Jia, Youfang Lin, Yuhan Zhou, Xiyang Cai, Peng Zheng, Qiang Li, and Jing Wang. 2023. Exploiting Interactivity and Heterogeneity for Sleep Stage Classification Via Heterogeneous Graph Neural Network. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • Kalaivani et al. (2014) M K Kalaivani, V. Kalaivani, V. Anusuya Devi, Ismail M Gursoy, André L. V. Coelho, Clodoaldo A. M. Lima, André L. V. Coelho, Deng Wang, Duoqian Miao, Kai-Cheng Hsu and Sung Nien, Reza Boostani, and Ahmad Ghanizadeh. 2014. Analysis of EEG Signal for the Detection of Brain Abnormalities.
  • Kamble and Sengupta (2023) Kranti Kamble and Joydeep Sengupta. 2023. A comprehensive survey on emotion recognition based on electroencephalograph (EEG) signals. Multimedia Tools and Applications (2023), 1–36.
  • Kemp et al. (2000) Bob Kemp, Aeilko H Zwinderman, Bert Tuk, Hilbert AC Kamphuisen, and Josefien JL Oberye. 2000. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE Transactions on Biomedical Engineering 47 (2000).
  • Kovács et al. (2021) Sándor Kovács, Márton Tóth, József Janszky, Tamás Dóczi, Dániel Fabó, István Boncz, Lajos Botz, and Antal Zemplényi. 2021. Cost-effectiveness analysis of invasive EEG monitoring in drug-resistant epilepsy. Epilepsy & Behavior 114 (2021).
  • Lee et al. (2023) Alycia Lee, Brando Miranda, Sudharsan Sundar, and Sanmi Koyejo. 2023. Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data. arXiv:2306.13840 [cs.CL]
  • Li et al. (2023) Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2023. From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. arXiv:2308.12032 [cs.CL]
  • Lopes et al. (2023) Fábio Lopes, Adriana Leal, Mauro F. Pinto, António Dourado, Andreas Schulze-Bonhage, Matthias Dümpelmann, and César Teixeira. 2023. Removing artefacts and periodically retraining improve performance of neural network-based seizure prediction models. Scientific Reports (2023).
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv:1711.05101
  • Miao and Zhao (2023a) Zhengqing Miao and Meirong Zhao. 2023a. Time-space-frequency feature Fusion for 3-channel motor imagery classification. arXiv:2304.01461 [cs.LG]
  • Miao and Zhao (2023b) Zhengqing Miao and Meirong Zhao. 2023b. Time-space-frequency feature Fusion for 3-channel motor imagery classification. arXiv:2304.01461 [cs.LG]
  • Morales and Bowers (2022) Santiago Morales and Maureen Bowers. 2022. Time-frequency analysis methods and their application in developmental EEG data. Developmental Cognitive Neuroscience 54 (2022).
  • Nejedly et al. (2020) Petr Nejedly, Vaclav Kremen, Vladimir Sladky, Jan Cimbalnik, Petr Klimes, Filip Plesinger, Filip Mivalt, Vojtech Travnicek, Ivo Viscor, Martin Pail, et al. 2020. Multicenter intracranial EEG dataset for classification of graphoelements and artifactual signals. Scientific data 7 (2020).
  • Nie et al. (2022) Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2022. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In The Eleventh International Conference on Learning Representations.
  • Phan et al. (2021) Huy Phan, Oliver Y Chén, Minh C Tran, Philipp Koch, Alfred Mertins, and Maarten De Vos. 2021. XSleepNet: Multi-view sequential model for automatic sleep staging. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021).
  • Phan and Mikkelsen (2022) Huy Phan and Kaare Mikkelsen. 2022. Automatic sleep staging of EEG signals: recent development, challenges, and future directions. Physiological Measurement 43 (2022).
  • Phan et al. (2023) Huy P Phan, Kristian P. Lorenzen, Elisabeth Roxane Marie Heremans, Oliver Y. Chén, Minh C. Tran, Philipp Koch, Alfred Mertins, Mathias Baumert, Kaare B. Mikkelsen, and Maarten De Vos. 2023. L-SeqSleepNet: Whole-cycle Long Sequence Modeling for Automatic Sleep Staging. IEEE Journal of Biomedical and Health Informatics 27 (2023).
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 (2019).
  • Ramachandran et al. (2017) Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2017. Searching for Activation Functions. CoRR abs/1710.05941 (2017).
  • Schalk et al. (2004) Gerwin Schalk, Dennis J. McFarland, Thilo Hinterberger, Niels Birbaumer, and Jonathan R. Wolpaw. 2004. BCI2000: A General-Purpose Brain-Computer Interface (BCI) System. IEEE Transactions on Biomedical Engineering 51 (2004).
  • Shoeb (2009) Ali Hossam Shoeb. 2009. Application of machine learning to epileptic seizure onset detection and treatment. Ph. D. Dissertation. Massachusetts Institute of Technology.
  • Song et al. (2020) Tengfei Song, Wenming Zheng, Peng Song, and Zhen Cui. 2020. EEG Emotion Recognition Using Dynamical Graph Convolutional Neural Networks. IEEE Transactions on Affective Computing 11 (2020).
  • Song et al. (2023) Yonghao Song, Qingqing Zheng, Bingchuan Liu, and Xiaorong Gao. 2023. EEG Conformer: Convolutional Transformer for EEG Decoding and Visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31 (2023).
  • Supratak and Guo (2020) Akara Supratak and Yike Guo. 2020. TinySleepNet: An efficient deep learning model for sleep stage scoring based on raw single-channel EEG. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC).
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
  • Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
  • Wang et al. (2022) Christopher Wang, Vighnesh Subramaniam, Adam Uri Yaari, Gabriel Kreiman, Boris Katz, Ignacio Cases, and Andrei Barbu. 2022. BrainBERT: Self-supervised representation learning for intracranial recordings. In The Eleventh International Conference on Learning Representations.
  • Wang et al. (2023) Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, and Yu Qiao. 2023. InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Wu et al. (2022) Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. 2022. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In The Eleventh International Conference on Learning Representations.
  • Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, JunTao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, and Zhiying Wu. 2023. Baichuan 2: Open Large-scale Language Models. arXiv:2309.10305 [cs.CL]
  • Yi et al. (2023) Ke Yi, Yansen Wang, Kan Ren, and Dongsheng Li. 2023. Learning Topology-Agnostic EEG Representations with Geometry-Aware Modeling. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Youngworth et al. (2005) Richard N. Youngworth, Benjamin B. Gallagher, and Brian L. Stamper. 2005. An overview of power spectral density (PSD) calculations. In Optical Manufacturing and Testing VI.
  • Yuan et al. (2021) Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A New Foundation Model for Computer Vision. CoRR abs/2111.11432 (2021).
  • Yuan et al. (2023) Zhizhang Yuan, Daoze Zhang, Yang Yang, Junru Chen, and Yafeng Li. 2023. PPi: Pretraining Brain Signal Model for Patient-independent Seizure Detection. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. 2019. Root Mean Square Layer Normalization. CoRR abs/1910.07467 (2019).
  • Zhang et al. (2023) Daoze Zhang, Zhizhang Yuan, Yang Yang, Junru Chen, Jingjing Wang, and Yafeng Li. 2023. Brant: Foundation Model for Intracranial Neural Signal. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Zhang et al. (2021) X. Zhang, L. Yao, X. Wang, J. Monaghan, D. McAlpine, and Y. Zhang. 2021. A survey on deep learning-based non-invasive brain signals: recent advances and new frontiers. Journal of Neural Engineering 18 (2021).
  • Zhang et al. (2022) Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. 2022. Self-supervised contrastive pre-training for time series via time-frequency consistency. Advances in Neural Information Processing Systems 35 (2022).
  • Zheng et al. (2023) Ruizhe Zheng, Jun Li, Yi Wang, Tian Luo, and Yuguo Yu. 2023. ScatterFormer: Locally-Invariant Scattering Transformer for Patient-Independent Multispectral Detection of Epileptiform Discharges. arXiv:2304.14919 [eess.SP]
  • Zheng and Lu (2015) Wei-Long Zheng and Bao-Liang Lu. 2015. Investigating Critical Frequency Bands and Channels for EEG-based Emotion Recognition with Deep Neural Networks. IEEE Transactions on Autonomous Mental Development 7, 3 (2015).
  • Zhou et al. (2023) Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. 2023. One Fits All: Power General Time Series Analysis by Pretrained LM. In Thirty-seventh Conference on Neural Information Processing Systems.

Ethics Statement

The data collection and experiments conducted in our work on the private datasets (i.e., the pre-training SEEG corpus and the clinical dataset for seizure prediction) have been approved by the Institutional Review Board (IRB) and passed ethical review. All participants have signed informed consent forms. All publicly available datasets used in this paper are not associated with any privacy or security concerns. Furthermore, we have followed guidelines on responsible use specified by the primary authors of the datasets used in the current work.

Appendix A Related Work

Scenario-specific methods for brain signals. In view of the diverse application scenarios of brain signals, researchers have designed various methods specifically tailored for these contexts. Yuan et al. (2023) propose a self-supervised learning framework and two techniques for SEEG-based patient-independent seizure detection. Zheng et al. (2023) propose ScatterFormer, an invariant scattering transform-based hierarchical Transformer for patient-independent seizure detection on EEG data that pays particular attention to subtle features. Lopes et al. (2023) conduct seizure prediction by encoding the original signals and hand-crafted features with a deep network and a shallow network, respectively. Jia et al. (2023) propose a novel Sleep Heterogeneous Graph Neural Network (SleepHGNN) to capture the heterogeneity and the interactivity of physiological signals for sleep stage classification. Song et al. (2023) design a compact Convolutional Transformer, named EEG Conformer, to encapsulate local and global features for emotion recognition and motor imagery classification. Miao and Zhao (2023b) propose a shallow, lightweight decoding architecture (TSFF-img) based on time-frequency spectrograms for motor imagery classification.

Universal modeling for brain signals. Universal modeling exhibits great advantages by learning highly general representations that can be customized for various applications. Such techniques are well suited to brain signals, which have a broad range of applications. Wang et al. (2022) propose an off-the-shelf model, BrainBERT, that provides embeddings for intracranial recordings. Zhang et al. (2023) propose a foundation model, Brant, for SEEG modeling, which is the largest model for intracranial recordings. Both BrainBERT and Brant are limited to SEEG data, with a relatively narrow range of application scenarios. Cai et al. (2023) design a unified self-supervised learning framework for brain signals which can be used on either SEEG or EEG data. However, their work cannot model different kinds of brain signals simultaneously.

Appendix B Hyperparameter Analysis

The pre-training performance of a foundation model is of utmost importance, as it significantly impacts the performance on downstream tasks. Therefore, it is necessary to select the optimal masking ratio and forecasting length, which are two crucial hyperparameters of Brant-2. However, due to limited computational resources, it is impractical to perform a grid search for these two parameters on a model with over 1 billion parameters (as the pre-training time for Brant-2 exceeds 100 hours). We therefore conduct experiments on two smaller-scale models (100M and 200M) to observe their performance on downstream tasks and determine the optimal hyperparameters, which are then applied to larger-scale models. As for the search strategy, since the two pre-training tasks are relatively independent, we adopt the following two-stage procedure instead of a grid search. First, we focus solely on the mask-prediction task during pre-training and select the optimal masking ratio. Next, with the chosen masking ratio fixed, we determine the optimal forecasting length.

Setup. To comprehensively evaluate the model and determine the optimal hyperparameters, we pre-train two models with 100M and 200M parameters and conduct experiments on all five downstream tasks. Each task is evaluated on the same dataset as the one used in the scalability analysis (i.e., CHB-MIT for seizure detection, Clinical for seizure prediction, SleepEDFx for sleep stage classification, SEED for emotion recognition and Motor Imagery for motor imagery classification) described in Sec. 4.1. We follow the same pre-training configurations described in Sec. 3.1. For the masking ratio, we experiment with settings of 20%, 40%, 60%, and 80%. Regarding the forecasting length, we try lengths of $L/16$, $L/4$, $L/2$, and $L$, where $L$ is the context length.
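The two-stage search described above can be sketched as follows. The candidate values are the ones listed in the setup; everything else is hypothetical. In particular, `evaluate` stands in for a full pre-train-and-fine-tune run, and `toy_score` is a synthetic scorer whose peak was chosen arbitrarily and does not reflect the reported results.

```python
MASK_RATIOS = (0.2, 0.4, 0.6, 0.8)
FORECAST_FRACTIONS = (1 / 16, 1 / 4, 1 / 2, 1.0)  # fractions of the context length L


def sequential_search(evaluate, mask_ratios=MASK_RATIOS,
                      forecast_fractions=FORECAST_FRACTIONS):
    """Pick the masking ratio first, then the forecasting length.

    evaluate(mask_ratio, forecast_fraction) returns a score to maximize;
    forecast_fraction=None means the forecasting task is disabled.
    """
    # Stage 1: mask-prediction only.
    best_ratio = max(mask_ratios, key=lambda r: evaluate(r, None))
    # Stage 2: masking ratio fixed, vary the forecasting length.
    best_fraction = max(forecast_fractions, key=lambda f: evaluate(best_ratio, f))
    return best_ratio, best_fraction


calls = []


def toy_score(mask_ratio, forecast_fraction):
    """Synthetic scorer used only to exercise the search loop."""
    calls.append((mask_ratio, forecast_fraction))
    score = -abs(mask_ratio - 0.6)
    if forecast_fraction is not None:
        score -= abs(forecast_fraction - 0.5)
    return score


best = sequential_search(toy_score)
print(best, len(calls))
```

With four candidates per hyperparameter, this procedure needs 4 + 4 = 8 evaluations, whereas a full grid search would need 4 × 4 = 16. The saving relies on the assumption that the two pre-training tasks are relatively independent.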


Figure 7. The hyperparameter analysis.

Results. Fig. 7 illustrates the results of the hyperparameter analysis, in which the top five graphs represent the analysis for the masking ratio, and the bottom five graphs depict the analysis for the forecasting length. It can be observed that with a masking ratio of 40% and a forecasting length of $L/4$, the models achieve the best overall performance across the five downstream tasks.

Since brain signals are natural signals with heavy redundancy, a missing patch can be recovered from neighboring patches with little high-level understanding of the semantic information. Thus, masking with a relatively low ratio (20% or lower) or conducting very short-term forecasting (e.g., $L/16$) may hinder the learning of high-level representations, which are crucial for time series classification (Wu et al., 2022). However, since brain signals are non-stationary time series, their values and the associations between variables change significantly over time. Therefore, an excessively high masking ratio (e.g., 80%) makes it very difficult for the model to reconstruct the original signals. Similarly, when the forecasting length is equal to the context length, the prediction becomes extremely challenging. Both settings may destabilize the pre-training process and lead to a decline in performance on downstream tasks.
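To make the two hyperparameters concrete, the sketch below shows how a masking ratio and a forecasting length translate into a patch mask and a context/target split. It is a simplified illustration, not the model's actual pre-processing: uniform random patch masking is an assumption, the function names are hypothetical, and the sizes are toy values.

```python
import random


def sample_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans, True where a patch is masked."""
    rng = random.Random(seed)
    num_masked = round(mask_ratio * num_patches)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def split_context_and_target(series, context_len, forecast_fraction):
    """Split a series into a context window and the future values to forecast."""
    horizon = max(1, round(forecast_fraction * context_len))
    if context_len + horizon > len(series):
        raise ValueError("series is too short for this context and horizon")
    return series[:context_len], series[context_len:context_len + horizon]


# Toy example with a 40% masking ratio and a forecasting length of L/4.
mask = sample_patch_mask(num_patches=20, mask_ratio=0.4)
context, target = split_context_and_target(list(range(40)), context_len=16,
                                           forecast_fraction=1 / 4)
print(sum(mask), len(context), target)
```

At a 20% ratio most masked patches have visible neighbors on both sides, which is the redundancy argument made above; at 80% long runs of consecutive patches are hidden and little local context remains.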

Appendix C Details of Baselines

We extensively compare our model with 12 advanced methods, which are divided into three categories: 1) 3 methods aimed at universal time series modeling; 2) 3 methods based on self-supervised pre-training on brain signals; and 3) 6 methods specifically designed for individual downstream tasks. These methods are described in detail below.

For the first category:

  • TF-C (Zhang et al., 2022): A decomposable pre-training model for general time series modeling, where the self-supervised signal is provided by the distance between time and frequency components.

  • SimMTM (Dong et al., 2023): A pre-training framework on time series to recover masked time points by the weighted aggregation of multiple neighbors outside the manifold.

  • One Fits All (Zhou et al., 2023): A unified model that leverages language models for time series analysis, leading to a comparable or SOTA performance in all main time series analysis tasks.

For the second category:

  • BrainBERT (Wang et al., 2022): A reusable Transformer for intracranial field potential recordings that enables classifying complex concepts and decoding neural data.

  • Brant (Zhang et al., 2023): A foundation model for intracranial neural recordings, which is a large-scale, off-the-shelf model for medicine.

  • MBrain (Cai et al., 2023): A multi-channel self-supervised learning framework that explicitly captures the spatial and temporal correlations of brain signals to learn a unique representation for each channel.

For the third category:

  • PPi (Yuan et al., 2023): A pre-training-based model for patient-independent seizure detection on SEEG data, which contains two novel self-supervised tasks to extract rich information from abundant SEEG data and two techniques to tackle the domain shift problem.

  • ScatterFormer (Zheng et al., 2023): An invariant scattering transform-based hierarchical Transformer that specifically attends to subtle features, designed for patient-independent detection of epileptic activity based on a visual spectral representation of continuous EEG.

  • Lopes et al. (2023): A deep convolutional neural network-based EEG artefact removal model designed for seizure prediction. It is used with a deep convolutional neural network connected to a bidirectional long short-term memory layer (CNN-BiLSTM) that takes time series as input, and with a shallow artificial neural network trained on established handcrafted features.

  • SleepHGNN (Jia et al., 2023): A novel sleep heterogeneous graph neural network designed to capture interactivity and heterogeneity of physiological signals for accurate sleep stage classification.

  • EEG Conformer (Song et al., 2023): A compact convolutional Transformer to encapsulate local and global features in a unified EEG classification framework for motor imagery and emotion recognition.

  • TSFF-Net (Miao and Zhao, 2023a): A novel network architecture designed for motor imagery classification that integrates time-space-frequency features, effectively compensating for the limitations of single-mode feature extraction networks based on time-series or time-frequency modalities.

TF-C, SimMTM, BrainBERT, Brant and MBrain need to be pre-trained before being applied to the downstream tasks. For a fair comparison, we pre-train these baselines on the same pre-training corpus as Brant-2. During fine-tuning, we fine-tune all the parameters of these baselines.

Appendix D Details of Experimental setup

D.1. Evaluation Metrics

In the seizure detection and prediction tasks, following existing works (Cai et al., 2023; Yuan et al., 2023; Chen et al., 2022), we adopt precision, recall, F1 score and F2 score as the evaluation metrics. For sleep stage classification, following existing works (Supratak and Guo, 2020; Phan et al., 2021, 2023), we use accuracy, sensitivity, specificity, macro F1 score, and Cohen’s kappa $\kappa$ as evaluation metrics. For emotion recognition and motor imagery classification, we use accuracy and F1 score as evaluation metrics. Details of these metrics are given as follows:

  • Precision: Also known as positive predictive value (PPV), precision is the proportion of predicted positive outcomes that are truly positive. It is a crucial metric when the cost of a false positive is high. The higher the value, the more relevant the results returned by the model; a lower value means that the model returns more false positives.

    (9) $Precision = \frac{TP}{TP + FP},$

    where $TP$ is the number of true positives and $FP$ is the number of false positives.

  • Recall (Sensitivity): Also known as the true positive rate or sensitivity, recall measures the proportion of actual positive observations that are correctly identified as such. It helps us understand the predictive capacity of the model concerning the positive class. The higher the sensitivity, the fewer real positive cases the model will miss. A value of 1 means the model has perfect sensitivity and is not missing any real positives.

    (10) $Sensitivity = Recall = \frac{TP}{TP + FN},$

    where $TP$ is the number of true positives and $FN$ is the number of false negatives.

  • Specificity: Also known as the true negative rate, specificity measures the proportion of actual negatives that are correctly identified. This provides insight into the predictive capacity of the model for the negative class. A higher specificity value means that the model is good at avoiding false positives, whilst a lower specificity indicates that the model often predicts a positive outcome when it’s actually negative.

    (11) $Specificity = \frac{TN}{TN + FP},$

    where $TN$ is the number of true negatives and $FP$ is the number of false positives.

  • F-measure: The F-measure is a metric defined as the weighted harmonic mean of precision and recall, with the following equation:

    (12) $F_\beta = \frac{(1 + \beta^{2}) \times precision \times recall}{\beta^{2} \times precision + recall}.$

    In the epilepsy scenario, F2 is valued more than F1, since missing a seizure is costly in diagnosis and F2 weights recall more heavily than precision. In other scenarios such as sleep or emotion, the F1 score is a more useful reference because it balances precision and recall. For multi-class problems, the macro $F_\beta$ score calculates $F_\beta$ for each class independently and then averages the results. A higher macro $F_\beta$ score indicates that the classifier has both good precision and good recall.

  • Cohen’s Kappa: Kappa $\kappa$ is a statistic that measures inter-rater agreement for qualitative items. Here it measures how much better the model performs than random prediction. The value lies between -1 and 1. A high positive value (close to 1) signifies that the model’s predictions align with the actual results well beyond what would be expected by chance, a value of 0 indicates agreement no better than chance, and a negative value indicates agreement worse than chance.
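The metrics above can be computed directly from confusion counts. The following is a minimal sketch rather than the evaluation code used in the paper: the function names, the convention that `confusion[i][j]` counts samples of true class `i` predicted as class `j`, and the example counts are all assumptions made for illustration. Cohen's kappa uses the standard definition $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance.

```python
def binary_metrics(tp, fp, tn, fn, beta=1.0):
    """Precision, recall, specificity and F-beta as in Eqs. (9)-(12)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f_beta": f_beta}


def macro_f_beta(confusion, beta=1.0):
    """Average the one-vs-rest F-beta over all classes."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true class k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp   # predicted k, true class differs
        tn = total - tp - fn - fp
        scores.append(binary_metrics(tp, fp, tn, fn, beta)["f_beta"])
    return sum(scores) / n


def cohen_kappa(confusion):
    """kappa = (p_o - p_e) / (1 - p_e); assumes chance agreement p_e < 1."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(n)) / total
    p_e = sum(sum(confusion[i]) * sum(row[i] for row in confusion)
              for i in range(n)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


# Illustrative counts only: 6 detected seizures, 2 missed, 4 false alarms.
f1 = binary_metrics(tp=6, fp=4, tn=88, fn=2, beta=1.0)
f2 = binary_metrics(tp=6, fp=4, tn=88, fn=2, beta=2.0)
confusion = [[6, 2], [4, 88]]  # rows: true class, columns: predicted class
macro_f1 = macro_f_beta(confusion)
kappa = cohen_kappa(confusion)
```

In this example recall (0.75) exceeds precision (0.6), so the F2 score is higher than the F1 score, which reflects why F2 is preferred when missed seizures are the more costly error.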

D.2. Fine-tuning Details

For most of the datasets (MAYO, FNUSA, CHB-MIT, Siena, Clinical, SEED and Motor Imagery), we find that fine-tuning only the last two layers of the temporal encoder while freezing the remaining parameters of the Brant-2 encoder achieves satisfactory results. Thus, for the evaluation on these datasets, we fine-tune only the last two layers of the temporal encoder with a learning rate of $1.0 \times 10^{-5}$ and the classification head with a learning rate of $1.0 \times 10^{-3}$. For the SleepEDFx and HMC datasets, we fine-tune all the parameters of Brant-2 with a learning rate of $1.0 \times 10^{-6}$.
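The fine-tuning scheme described above can be sketched as a mapping from parameter names to learning rates. This is an illustration under stated assumptions, not the released training code: the parameter naming scheme (`temporal_encoder.layers.<i>...`, `spatial_encoder...`, `head...`) and the function `assign_learning_rates` are hypothetical, and the text does not specify the classification-head learning rate for full fine-tuning, so the sketch simply reuses $1.0 \times 10^{-3}$ there.

```python
ENCODER_LR = 1.0e-5   # last two temporal-encoder layers (partial fine-tuning)
HEAD_LR = 1.0e-3      # classification head
FULL_LR = 1.0e-6      # all Brant-2 parameters (SleepEDFx and HMC)


def assign_learning_rates(param_names, num_temporal_layers, full_finetune=False):
    """Map each parameter name to a learning rate; None means frozen.

    Assumes hypothetical names such as 'temporal_encoder.layers.9.ffn.weight'.
    """
    last_two = {num_temporal_layers - 2, num_temporal_layers - 1}
    rates = {}
    for name in param_names:
        parts = name.split(".")
        if parts[0] == "head":
            # Assumption: the head keeps HEAD_LR in both modes.
            rates[name] = HEAD_LR
        elif full_finetune:
            rates[name] = FULL_LR
        elif (len(parts) >= 3 and parts[0] == "temporal_encoder"
              and parts[1] == "layers" and int(parts[2]) in last_two):
            rates[name] = ENCODER_LR
        else:
            rates[name] = None  # frozen
    return rates


names = [
    "temporal_encoder.layers.0.attn.weight",
    "temporal_encoder.layers.8.attn.weight",
    "temporal_encoder.layers.9.ffn.weight",
    "spatial_encoder.layers.1.attn.weight",
    "head.linear.weight",
]
partial = assign_learning_rates(names, num_temporal_layers=10)
full = assign_learning_rates(names, num_temporal_layers=10, full_finetune=True)
```

In a deep learning framework, parameters mapped to `None` would have gradients disabled, and the remaining ones would be grouped by learning rate before being passed to the optimizer.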

D.3. Model Configurations

Table 6. Configurations of Brant-2 with different sizes.
| Model Config | Temporal/Spatial Encoder Layers | Model Dimension | Inner Dimension | Parameter Count |
| --- | --- | --- | --- | --- |
| Brant-2-100M | 10/2 | 768 | 2304 | 115M |
| Brant-2-200M | 10/2 | 1024 | 3072 | 204M |
| Brant-2-460M | 10/2 | 1536 | 4608 | 459M |
| Brant-2 | 8/2 | 2560 | 7680 | 1065M |

The model configurations of Brant-2 with different sizes are shown in Tab. 6.