1. Introduction
Recognizing mental states by measuring brain activity has long been a critical challenge in neuroscience. Its complexity arises from the intricate nature of brain activity, the necessity of employing noninvasive methods when measuring brain signals, and the requirement of keeping the individual conscious and comfortable for accurate brain activity assessment. Electroencephalography (EEG) has been extensively studied as a standard method, allowing neuronal electrical signals to be conveniently measured through electrodes placed on the scalp [1,2,3]. Another significant brain response, the blood oxygen level-dependent (BOLD) signal, is derived from the brain’s blood oxygen levels and is coupled with the brain’s underlying electrical activity. Functional near-infrared spectroscopy (fNIRS) has emerged as a low-cost method for measuring the BOLD response, replacing the expensive and intricate acquisition of functional magnetic resonance imaging (fMRI) data. fNIRS records changes in optical density through the skull to determine the blood oxygen levels in different brain regions [4,5,6]. A notable feature of fNIRS is its compatibility with EEG, allowing for simultaneous data recording using sensors placed on the head. The availability of skullcaps equipped with both NIRS and EEG sensors facilitates this simultaneous measurement, enabling the analysis of data from both modalities on the same subject and functional task [7,8].
In recent years, the advent of modern deep learning techniques has significantly impacted the field of brain analysis [9,10,11]. Unlike traditional machine learning methods, which rely heavily on manual feature extraction, deep learning approaches can analyze EEG and fNIRS data multimodally without requiring exhaustive preliminary processing. This advancement is particularly beneficial in handling the coherent data generated by these two modalities, allowing for more efficient and comprehensive analysis [12]. EEG is known for its low spatial and high temporal resolution, providing near-immediate measurement of spikes related to brain activity. Conversely, while fNIRS offers high spatial resolution, it has limited temporal resolution owing to the delay in detecting changes in blood oxygen levels. Employing a multimodal deep learning model allows for the extraction of both spatial and temporal features, leveraging the complementary strengths of EEG and fNIRS in brain activity analysis.
However, existing methods in the field of brain activity learning have predominantly focused on subject-dependent and subject-semidependent analysis [12,13,14], where samples from the same subject can appear in both the training and test sets. While these approaches are effective in certain scenarios, such as creating a personalized model, they place significant limits on broader applicability to unseen subjects. Consequently, there is an essential need for methodologies that are capable of generalizing across a diverse range of subjects, including those not encountered during the model’s training phase. Implementing such methodologies would substantially enhance the universality and practicality of brain activity analysis, contributing to a more comprehensive understanding of neural patterns and behaviors. However, due to the low signal-to-noise ratio of these recordings, achieving generalization to unseen subjects is quite challenging. Each subject has unique characteristics and patterns that may obscure the acquisition of general features across subjects, and the distribution of data for each subject can be markedly different even for those with the same label [15,16].
These limitations and challenges motivate us to explore the effectiveness of the multimodal model in cross-subject brain activity learning. In this paper, we introduce EF-Net, a convolutional neural network (CNN)-based deep learning method tailored to multimodal cross-subject analysis of mental state recognition via EEG and fNIRS data. Our approach includes analyses in various settings: subject-dependent, subject-semidependent, and subject-independent. This multifaceted strategy allows for a more flexible and versatile application of the models, empowering them to learn and adapt to a broader spectrum of neural data and mental states irrespective of the subjects’ previous interactions with the model [15]. We evaluated our EF-Net model on the EEG-fNIRS WG dataset from [17], achieving F1 scores of 99.36% in subject-dependent analyses, 98.31% in subject-semidependent analyses, and 65.05% in subject-independent analyses. In the respective training–testing splits, EF-Net surpassed the baseline models by margins of 1.83%, 4.34%, and 2.13% when using both EEG and fNIRS data. Furthermore, we observed that the combined use of EEG and fNIRS data yields superior results, particularly in the subject-independent setting. This finding underscores EF-Net’s effectiveness in generalizing to unseen subjects, demonstrating its potential applicability in real-world scenarios.
In summary, our paper’s key contributions are as follows:
We introduce a CNN-based deep-learning method named EF-Net for fNIRS-EEG multimodal cross-subject analysis in brain activity representation learning, and apply our method to a mental state recognition task.
We explore the challenges associated with learning a general model for cross-subject brain activity analysis, then conduct detailed experiments across various settings, including the employment of single or multiple modalities and different subject-dependent, subject-semidependent, and subject-independent training–testing splits.
The effectiveness of EF-Net is empirically demonstrated, showcasing its potential in leveraging multiple modalities in order to generalize to unseen subjects through representation learning.
The rest of this document is organized as follows. Section 2 provides an overview of the current literature and the state of research in this field. Section 3 describes the preprocessing of the dataset, followed by an in-depth description of the proposed model architecture. In Section 4, we present our experiments conducted with EF-Net and all baselines; this section additionally covers the training and testing setups and reports the obtained results. Section 5 discusses our findings, and Section 7 concludes the paper.
2. Related Work
Brain–computer interfaces (BCIs) have revolutionized our interactions with technology, creating direct links between the brain and electronic devices [18]. Incorporating various noninvasive methods such as EEG [19,20], fNIRS [21], eye tracking [22,23], and VR/AR integrations [24,25], BCIs promise wide-ranging applications. These include facilitating communication for those with disabilities [19,26] and enriching immersive gaming and virtual reality experiences [27], as well as roles in disease diagnosis [28,29] and mental state monitoring [30,31]. BCI technology’s expanding abilities herald a new frontier in healthcare, entertainment, and education, marking a significant leap in human–computer interaction.
Existing work in the hybrid EEG-fNIRS domain is relatively sparse, with experiments varying significantly in quality and methodology. Most studies to date have employed traditional machine learning methods, with linear discriminant analysis (LDA) and support vector machine (SVM) classifiers being particularly prominent. Only a few studies have reported subject-semidependent results [17,32]. Many of these studies have instead relied on subject-dependent models, where the mean accuracy is reported as an average of individual subjects’ accuracies, as seen with LDA in [33,34], k-nearest neighbors (KNN) in [35], and SVM in [13,36,37]. The mean accuracy for subject-dependent hybrid tests has reached as high as 91.02% in certain studies [34,36]. However, a significant limitation of traditional ML methods is the extensive feature extraction process that they require, with additional techniques such as principal component analysis (PCA), channel selection, spectral features, and wavelet transforms being used as well.
Deep learning applied to multimodal EEG-fNIRS data has only been explored in a small number of studies. The study by [12] employed a straightforward fully connected network trained on EEG-fNIRS data for motor imagery tasks in a subject-semidependent setting. The M2NN study [38] showcased feature extraction using a custom-branched CNN, testing it on both EEG and fNIRS modalities for motor imagery tasks within a leave-one-out subject-independent setting. The research in [39] introduced an EEG-fNIRS hybrid BCI, applying CNNs to classify overt and imagined speech by first creating two separate subnets for each modality, then fusing and processing them through gated recurrent units (GRUs). FGANet [14] implements an fNIRS-guided attention feature in a CNN along with EEG and fNIRS convolution branches to form a fusion network for classifying mental arithmetic and imagery tasks in a subject-semidependent setting. A study by [40] explored the use of a deep recurrent neural network (RNN) for seizure detection with EEG-fNIRS data, again in a subject-semidependent setting. The research in [17] employed an LSTM to classify mental workload using n-back tests, with a focus on subject-semidependent analysis.
Existing works reveal a notable gap in the multimodal deep learning literature, particularly around studies conducting extensive testing comparisons across subject-dependent, subject-semidependent, and subject-independent setups, as our paper does. For the differences between these three training–testing split settings, see Section 4.3 and Section 5. Additionally, comprehensive comparisons between traditional machine learning and deep learning methods are scarce. Below, we outline our deep learning approach and the results from our systematic comparison of various machine learning techniques.
4. Experiments
We compared our method with five baseline approaches, including both traditional machine learning (ML) and deep learning (DL) techniques. For the ML baselines, we selected three widely used algorithms: support vector machine (SVM) classification, random forest (RF), and k-nearest neighbors (KNN). In terms of DL baselines, we chose two popular benchmark methods: Visual Geometry Group—Very Deep Convolutional Networks (VGG) [47] and Residual Network—50 Layers (ResNet50) [46].
All experiments were conducted using three random seeds, which were employed in order to shuffle either the subjects (in the case of subject-independent settings) or the samples themselves (in all other settings). We used seeds of 38, 43, and 45, and report the average results across these three seeds. The results of each experiment are presented with scores that include Accuracy, Precision, Recall, F1, and/or ROC-AUC. All scores represent the model’s performance on the testing set. Our evaluation encompassed three distinct data usage settings: only the fNIRS dataset, only the EEG dataset, and a combination of both. Additionally, we conducted experiments in subject-dependent, subject-semidependent, and subject-independent training–testing split settings. This comprehensive approach resulted in nine distinct settings (3 × 3) for evaluation.
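As a hedged sketch of this protocol (the function names and the dummy metric values below are ours, purely for illustration), each experiment is run once per seed and the reported scores are the means across the three runs:

```python
import numpy as np

SEEDS = [38, 43, 45]  # the three random seeds used in the paper

def run_experiment(seed):
    """Placeholder for one full training/evaluation run under a given seed."""
    rng = np.random.default_rng(seed)
    # ... shuffle subjects (subject-independent setting) or samples (other
    #     settings) with rng, train the model, evaluate on the testing set ...
    return {"accuracy": 0.90, "f1": 0.88}  # dummy scores for illustration

def average_over_seeds(run_fn, seeds=SEEDS):
    """Average each reported metric across the per-seed runs."""
    results = [run_fn(s) for s in seeds]
    return {k: float(np.mean([r[k] for r in results])) for k in results[0]}

scores = average_over_seeds(run_experiment)
```

The same averaging applies uniformly to all nine (3 × 3) data usage and split settings.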
The implementation details of our model are illustrated in Figure 2 and discussed in Section 3.3. The parameter tuning for our method and the deep learning baselines is included in Table A1 of Appendix A. The implementation specifics of the ML and DL baselines are covered in Section 4.1 and Section 4.2, respectively. The training–testing split settings are elaborated in Section 4.3. Section 4.4, Section 4.5 and Section 4.6 analyze the importance of different data split settings and report the corresponding results.
4.1. Machine Learning Baselines
The input shape of the EEG samples is (500, 30) and that of the fNIRS samples is (25, 72). To ensure compatibility with the ML baselines, we flattened our 2D samples into 1D vectors. This process yielded vectors of length 15,000 for the EEG samples and 1800 for the fNIRS samples. For the multimodal input, we concatenated the fNIRS and EEG vectors, resulting in a combined input vector of length 16,800. We utilized scikit-learn [41] to implement SVM, RF, and KNN with default parameters. We additionally performed some parameter tuning for the three machine learning baselines with the combined EEG-fNIRS modalities; the results are reported in Table A2, Table A3 and Table A4 of Appendix A. Overall, these baselines did not outperform the deep learning baselines, which is expected, as flattening discards the spatial and temporal structure that the deep models can exploit.
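The flatten-and-concatenate pipeline above can be sketched as follows (toy random arrays stand in for the real dataset; only the shapes match those reported in the paper):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

n = 8  # toy batch size; the real dataset is far larger
rng = np.random.default_rng(0)
eeg = rng.standard_normal((n, 500, 30))   # (timestamps, channels)
fnirs = rng.standard_normal((n, 25, 72))
y = np.array([0, 1] * (n // 2))           # toy binary labels

# Flatten each 2D sample into a 1D vector, then concatenate the modalities.
eeg_flat = eeg.reshape(n, -1)             # length 15,000 per sample
fnirs_flat = fnirs.reshape(n, -1)         # length 1800 per sample
x = np.concatenate([fnirs_flat, eeg_flat], axis=1)  # length 16,800

# Fit the three scikit-learn baselines with default parameters.
models = [SVC(), RandomForestClassifier(), KNeighborsClassifier()]
for clf in models:
    clf.fit(x, y)
```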
4.2. Deep Learning Baselines
In addition to traditional ML models, we conducted experiments with deep learning models, specifically VGG and ResNet. Deep learning methods are typically optimized for datasets with a large number of samples. Considering that VGG and ResNet are primarily designed for image representation learning and expect 3D input matrices (height, width, channels), our 2D EEG and fNIRS datasets (timestamps, channels) required adaptation. For the DL baseline methods, we transformed our inputs into a three-channel format as follows. First, each 5 s window of samples was segmented into the three channels, with some repetition involved. For EEG, this involved 500 timestamps split into three channels as 0–200, 150–350, and 300–500, while for fNIRS it involved a simple repeat of all timestamps as 0–25, 0–25, and 0–25. This process resulted in each input sample being reshaped into a three-dimensional form, enabling the use of the VGG and ResNet models with our data.
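The windowing just described can be sketched as follows (a hedged reconstruction; the function names are ours):

```python
import numpy as np

def eeg_to_three_channels(sample):
    """Split a (500, 30) EEG window into three overlapping temporal segments."""
    segments = [sample[0:200], sample[150:350], sample[300:500]]  # each (200, 30)
    return np.stack(segments, axis=-1)                            # (200, 30, 3)

def fnirs_to_three_channels(sample):
    """Repeat a (25, 72) fNIRS window three times along a new channel axis."""
    return np.repeat(sample[:, :, np.newaxis], 3, axis=2)         # (25, 72, 3)

eeg_img = eeg_to_three_channels(np.zeros((500, 30)))
fnirs_img = fnirs_to_three_channels(np.zeros((25, 72)))
```

Note that the EEG segments overlap by 50 timestamps, which is the "repetition" mentioned above.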
Our approach includes two branches for the two modalities: EEG and fNIRS. Due to the size of the ResNet50 model, we used it to process the EEG data and employed VGG for the fNIRS branch in one baseline configuration. In another baseline setup, we simply used VGG for both the EEG and fNIRS branches. Thus, our two baseline configurations were ResNet50 (EEG) + VGG16 (fNIRS) and VGG19 (EEG) + VGG16 (fNIRS). The last layer of each branch was flattened and concatenated to form a unified representation vector. We then introduced a fully connected network that utilized this representation for the final classification task.
In the single-modality experiments, we deactivated one branch of the full model and only input the data from a single modality into the remaining active branch. Specifically, for the EEG-only experiments we utilized one baseline incorporating a VGG19 branch and another employing a ResNet50 branch, while for the fNIRS-only experiments we conducted one baseline with a VGG16 branch and another with a ResNet50 branch. All other aspects of the single-modality experiments were the same as in the original model.
These two baselines were implemented using the TensorFlow 2.0 Keras API. The learning rates were adjusted based on the rate of change in model accuracy. Batch sizes were generally maintained at 64 samples per batch, although some experiments used 32 samples per batch. The Adam optimizer was used to train the models, with a binary cross-entropy loss function guiding the learning. The highest testing accuracy and corresponding model weights from each experiment were retained.
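The two-branch fusion pattern can be sketched in Keras as follows. This is a simplified stand-in: the small convolutional branches below replace the full VGG/ResNet backbones, and the layer sizes are illustrative rather than the baselines' actual hyperparameters:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_branch(shape, name):
    """A small stand-in for a VGG/ResNet backbone branch."""
    inp = layers.Input(shape=shape, name=name)
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)  # the last layer of each branch is flattened
    return inp, x

eeg_in, eeg_feat = conv_branch((200, 30, 3), "eeg")
fnirs_in, fnirs_feat = conv_branch((25, 72, 3), "fnirs")

# Concatenate the flattened branch outputs into a unified representation,
# then classify with a fully connected head.
fused = layers.Concatenate()([eeg_feat, fnirs_feat])
head = layers.Dense(64, activation="relu")(fused)
out = layers.Dense(1, activation="sigmoid")(head)

model = models.Model(inputs=[eeg_in, fnirs_in], outputs=out)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy", metrics=["accuracy"])
```

A single-modality variant simply drops one input branch and feeds the remaining flattened features to the same fully connected head.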
4.3. Training and Testing Settings
We evaluated five baseline models and our EF-Net across three different training–testing split settings: subject-dependent, subject-semidependent, and subject-independent. The results for each setting are presented in Table 2, Table 3, and Table 4, respectively.
- 1.
Subject-Dependent: In this setting, both training and testing are performed using disjoint subsets of samples from the same individual subject.
- 2.
Subject-Semidependent: Here, the training and testing sets may include samples from any subject. We shuffled all samples from all subjects together before splitting them into training and testing sets.
- 3.
Subject-Independent: This approach involves using certain subjects exclusively for training and others exclusively for testing, ensuring no overlap of samples from any specific subject between the training and testing splits. With an 80–20 split, approximately twenty subjects were utilized for training, while a separate set of six subjects was reserved for testing.
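The three split settings can be sketched over a toy index structure (the subject and per-subject sample counts below are illustrative, not the dataset's actual sizes):

```python
import numpy as np

rng = np.random.default_rng(38)  # one of the seeds used in the paper

# Toy index structure: 26 subjects with 100 sample indices each.
subjects = {s: list(range(s * 100, (s + 1) * 100)) for s in range(26)}

def subject_dependent_split(subject_id, train_frac=0.8):
    """Train and test on disjoint samples of one subject."""
    idx = rng.permutation(subjects[subject_id])
    cut = int(len(idx) * train_frac)
    return idx[:cut], idx[cut:]

def subject_semidependent_split(train_frac=0.8):
    """Pool and shuffle all subjects' samples before splitting."""
    idx = rng.permutation([i for v in subjects.values() for i in v])
    cut = int(len(idx) * train_frac)
    return idx[:cut], idx[cut:]

def subject_independent_split(train_frac=0.8):
    """Assign whole subjects to either the training or the testing set."""
    order = rng.permutation(list(subjects))
    cut = int(len(order) * train_frac)  # 20 training vs. 6 testing subjects
    train = [i for s in order[:cut] for i in subjects[s]]
    test = [i for s in order[cut:] for i in subjects[s]]
    return train, test
```

Only the last function guarantees that no subject contributes samples to both splits, which is why it is the hardest setting.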
4.4. Subject-Dependent Results
The subject-dependent setting ensures that the model effectively learns the unique brain activity patterns of a specific individual. This capability allows the model to accurately classify any new data received from the same subject. Such an approach is particularly relevant in Brain–Computer Interface (BCI) applications, where patient-specific calibrations can assist the model in fine-tuning its response to that particular subject, leading to more accurate responses to future BCI commands.
Setup. We conducted experiments individually on subjects 1, 2, and 3 from our dataset. For each subject, we applied all baseline models using three random seeds and report the average results. In each experiment, the data from each single subject were divided into two parts, with 80% allocated for training and the remaining 20% used for testing.
Results. The average results for subjects 1, 2, and 3 are presented in Table 2. EF-Net demonstrates success in 10 out of 15 tests across the three data usage scenarios. Notably, EF-Net’s F1 score surpasses those of the best baselines, VGG16 and VGG16 + VGG19, by 7% on the fNIRS data and 1.83% on the combined fNIRS and EEG data, respectively. It can be observed that EF-Net yields excellent results in the subject-dependent setting when using only fNIRS data, achieving the best performance across all baselines and data settings. We speculate that this might be because the subject-dependent setting involves a smaller number of samples, where utilizing multimodal approaches could lead to overfitting.
4.5. Subject-Semidependent Results
Subject-semidependent experiments are designed to test models’ generalization capabilities across all subjects, thereby ensuring that the overall model performs effectively and consistently among different individuals.
Setup. We extracted all samples from subjects 1 to 26 and shuffled the entire dataset. The samples were split into two subsets, with 80% used for training and 20% for testing.
Results. The outcomes of the subject-semidependent experiment are detailed in Table 3. EF-Net outperforms the others in 12 out of 15 tests across the three data usage scenarios in this setting. Specifically, EF-Net’s F1 score outperforms those of the best baselines, VGG16 and VGG16 + VGG19, by margins of 2.63% for the fNIRS data and 4.34% for the combined fNIRS and EEG data modalities, respectively. Consistent with our previous findings, utilizing fNIRS data exclusively yields the most favorable results, surpassing all other baselines and settings.
4.6. Subject-Independent Results
The subject-independent setting, especially in the context of evaluating unseen patients or subjects, is crucial for developing a diagnostic or medical aid that is both resilient and universally applicable [15]. Each subject exhibits unique characteristics and cognitive activities, based on which certain consistent brain activity patterns may be identified and generalized. In order to effectively learn these patterns in a generic way, it is essential to conduct extensive training across a diverse range of subjects. Moreover, the model structure needs to be able to recognize features in any unseen subject, ensuring its applicability across different individuals.
Setup. After shuffling all the subjects using consistent random seeds, we partitioned the dataset into two groups, with 80% (twenty subjects) used to train the model and the remaining 20% (six subjects) used for testing. After partitioning the subjects, we gathered each of their respective samples for training and evaluation.
Results. The results for the subject-independent experiments are presented in Table 4. EF-Net outperforms the others on 11 out of 15 tests across all three data usage scenarios in this setting. Specifically, EF-Net’s F1 score exceeds those of the best baseline models, VGG16 and VGG16 + VGG19, by margins of 2.48% for the fNIRS data and 2.13% for the combined fNIRS and EEG data modalities, respectively. EF-Net demonstrates robust multimodal learning capability in the subject-independent setting, particularly when utilizing both EEG and fNIRS data. This outcome substantiates the hypothesis that integrating the EEG and fNIRS modalities can enhance the model’s ability to learn cross-subject features and generalize to unseen subjects, making it highly applicable to real-world scenarios.
5. Discussion
The challenges in subject-independent settings stem from the unique noisy characteristics of each subject. Even when sharing the same class label, subjects might exhibit different data distributions [15,16]. The higher accuracy and F1 scores observed in the subject-dependent and subject-semidependent settings can be attributed to potential “information leakage”, where the model learns specific subjects’ distributions during training. Consequently, these settings are best suited for training personalized models or verifying dataset effectiveness, while their real-world applicability remains somewhat limited. In contrast, the difficulty in subject-independent settings arises from the need to perform effectively on unseen subjects with potentially varying distributions. The noisy characteristics unique to each subject can obscure the generalizable features across subjects. The objective is then to learn overarching features while discarding subject-specific noise.
Recent work on this topic, such as the M2NN method proposed by [38], shares a similar objective with ours, namely, to automatically extract features using deep learning. They utilized another open-access dataset made available by [33] for motor imagery classification, which includes a comparable number of subjects to the dataset used in our study. Their model employs convolutional blocks to extract features from each modality, subsequently combining them to make a class prediction. The primary distinction lies in their use of significantly smaller one-dimensional kernels, whereas our approach incorporates both one-dimensional temporal kernels and two-dimensional spatial kernels.
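The distinction between temporal and spatial kernels can be illustrated with a short Keras sketch (the kernel and filter sizes here are illustrative, not EF-Net's actual hyperparameters):

```python
from tensorflow.keras import layers, models

# Input laid out as (time, EEG channels, 1).
inp = layers.Input(shape=(500, 30, 1))

# One-dimensional temporal kernel: spans 25 time steps within a single channel.
t = layers.Conv2D(8, kernel_size=(25, 1), padding="same")(inp)

# Two-dimensional spatial kernel: additionally mixes neighboring channels.
s = layers.Conv2D(8, kernel_size=(5, 5), padding="same")(t)

demo = models.Model(inp, s)
```

The (25, 1) kernel only sees one channel at a time, while the (5, 5) kernel learns correlations across adjacent channels as well.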
In addition, there are other types of brain–computer interface (BCI) applications beyond EEG-fNIRS that could be integrated with EEG-fNIRS for various engineering purposes. For instance, combining EEG-fNIRS data with eye tracking could substantially enhance the capabilities of VR devices. A VR helmet could, for example, be engineered to simultaneously gather these types of data. Eye tracking data could then be employed to trace the user’s gaze, allowing the virtual environment to dynamically adapt and align with the user’s line of sight. Furthermore, EEG-fNIRS data could unlock a wide range of potential applications within VR devices. For instance, when integrated with VR, EEG-fNIRS technology could offer novel approaches for managing stress, anxiety, and other mental health issues by facilitating therapeutic scenarios that adjust to the user’s mental state in real time. Moreover, this technology could provide adaptive interfaces or control mechanisms specifically designed for users with disabilities, thereby improving the accessibility and enjoyment of VR by modifying the experience to suit the user’s cognitive and emotional states.
7. Conclusions
This paper introduces EF-Net, a convolutional neural network (CNN)-based multimodal learning framework designed for the analysis of brain signals in mental state recognition tasks. Leveraging both EEG and fNIRS data, EF-Net is engineered to capture the temporal and spatial features inherent in these modalities. We conducted experiments under various settings, exploring the utilization of EEG alone, fNIRS alone, and their combination. This study evaluates different training–testing split scenarios, including subject-dependent, subject-semidependent, and subject-independent configurations. We report promising results in both the subject-dependent and subject-semidependent settings, underscoring EF-Net’s effectiveness in facilitating personalized models. Although the performance in the subject-independent setting is modest and highlights areas for future improvement, our findings affirm the benefits of integrating the EEG and fNIRS modalities. This integration notably enhances the model’s ability to learn and apply cross-subject representations to unseen subjects. These results underscore EF-Net’s potential for real-world applications in mental state recognition using brain signals, and pave the way for future research on combining multiple modalities for brain activity learning.