In this section, we outline the methodologies and techniques used in our research on emotion classification using feature-selection and machine-learning algorithms. This section is divided into three main subsections corresponding to the distinct experimental approaches applied in the study.
Figure 1 describes our proposed experiment for the effective classification of the EEG brainwave datasets. We applied three different feature selection methods to obtain effective performance. First, SelectKBest with the ANOVA F-test was applied to calculate the correlation between the target variable and the features, and the features with the highest correlation were selected. Second, we applied LASSO for feature selection to remove multicollinearity between features through L1 regularization. Lastly, we employed a GA for wrapper-based feature selection to select the best feature subset for the classification of the EEG dataset. The classification models were constructed using RF, logistic regression, XGBoost, and SVM. We tuned the hyperparameters of the machine learning models using Bayesian optimization (BO), which has been widely used for hyperparameter tuning in machine learning [29]. Classification metrics were used to assess the performance of the models. For the DEAP dataset, we used only accuracy and F1 score as performance metrics. We employed 5-fold cross-validation to evaluate the performance of the machine learning models. All feature selection methods and machine learning models were trained and tested on a system with an MSI Intel® Core™ i7-7700HQ CPU @ 2.80 GHz and 16 GB of RAM (MSI Global, New Taipei City, Taiwan).
3.1. Filter-Based Feature Selection Using SelectKBest with ANOVA F-Test
SelectKBest with the ANOVA F-test, categorized as a filter method, evaluates the relevance of features based on statistical methods. Specifically, it employs the ANOVA F-test, a univariate measure that quantifies the significance of each feature with respect to the target variable. The goal is to select the ‘k’ best features, where ‘k’ is a user-defined parameter [30].
The ANOVA F-test function is utilized to compute the ANOVA F-statistic between each feature and the target variable. This statistic measures the degree of linear dependency between the feature and the target, enabling the identification of features most likely to be informative for classification. The formula for the F-statistic is as follows:

F = \frac{SSB / (k - 1)}{SSW / (n - k)}

where the sum of squares between (SSB) measures the variance between classes, the sum of squares within (SSW) measures the variance within each group, k is the number of classes, and n is the total number of data points [30]. SelectKBest is a feature selection algorithm, and when combined with the ANOVA F-test, it becomes SelectKBest with the ANOVA F-test. The primary purpose of this method is to evaluate the significance of individual features with respect to a target variable. It operates as a filter method, meaning it ranks features based on statistical measures without involving the learning algorithm. The ANOVA F-test, in this context, is a statistical test that assesses whether the means of different groups are equal. In feature selection, it helps quantify the relationship between each feature and the target variable. The methodology employed in this study was designed to explore and evaluate the effectiveness of various feature selection and machine learning methods for emotion classification using EEG brainwave data.
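As a minimal sketch of this step (using a synthetic matrix in place of the EEG features, which are not reproduced here), the filter can be implemented with scikit-learn’s SelectKBest and f_classif:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the EEG feature matrix and emotion labels.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)

# Rank features by their ANOVA F-statistic and keep the k highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)              # (200, 10)
print(selector.get_support().sum())  # 10 features retained
```

The selected column indices are available via `selector.get_support(indices=True)`, which is useful for mapping retained features back to EEG channels or frequency bands.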
3.2. Embedded-Based Feature Selection Using LASSO
LASSO, categorized as an embedded-based method, incorporates feature selection as an integral part of the model training process. This approach is particularly effective in handling high-dimensional datasets [
31]. Its primary objective is to add a penalty term to the loss function to encourage sparsity in the model. This is achieved by minimizing the following objective function:
\min_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top} \beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

where y_i represents the observed output for the i-th instance, \mathbf{x}_i denotes the input feature vector, and \beta is the vector of coefficients to be estimated. The first term, \frac{1}{2n} \sum_{i=1}^{n} ( y_i - \mathbf{x}_i^{\top} \beta )^2, represents the ordinary least squares loss, which aims to minimize the squared differences between the predicted and observed values. The second term, \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert, introduces the LASSO penalty, where \lambda is a hyperparameter controlling the strength of regularization. The key innovation of LASSO regularization lies in the regularization term \lVert \beta \rVert_1, which enforces sparsity by penalizing the absolute values of the coefficients [32].
This drives some coefficients to become exactly zero. This is beneficial because it eliminates less important predictors, thereby simplifying the model and enhancing interpretability. The zeroed-out coefficients correspond to features that do not significantly contribute to the model’s predictive power. By excluding these features, LASSO reduces model complexity and helps prevent overfitting, especially in scenarios with high-dimensional data. Thus, the presence of zero coefficients is crucial to achieving an effective and robust predictive model. Consequently, LASSO not only aids in fitting the model to the data but also serves as a valuable tool for identifying and emphasizing the most relevant features. The objective of LASSO is to minimize the mean squared error between the predicted and actual values while imposing a penalty on the absolute values of the model coefficients [33]. LASSO can help select the most relevant EEG features by pushing some feature coefficients to zero, effectively performing feature selection. This is crucial for optimizing the performance of emotion classification models based on EEG data, as it reduces overfitting and enhances the interpretability of the model [34]. It is particularly useful when dealing with high-dimensional data, as it helps create more parsimonious models that are easier to interpret and potentially more efficient. Owing to the absolute value, the LASSO penalty term is nondifferentiable, but methods nevertheless exist to minimize it. LASSO is also robust to outliers; it can effectively handle noisy data by eliminating less important features and preventing the inclusion of irrelevant features [35]. Additionally, it addresses multicollinearity, a strong correlation between features that affect the label concurrently, by tending to select only one feature from a group of highly correlated features [36].
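The sparsity-inducing behavior described above can be illustrated with scikit-learn’s Lasso on synthetic data (a sketch; the alpha value here is arbitrary, not the tuned value from our experiments):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 features, only 5 of which are truly informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# The L1 penalty (alpha) drives uninformative coefficients exactly to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of surviving features

print(f"{len(selected)} of {X.shape[1]} features retained")
```

Features whose coefficients survive the penalty form the selected subset; the rest are discarded before classification.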
3.4. Hyperparameters Used for Feature Selection Methods and Machine Learning
Table 1 lists the hyperparameters used for the various feature selection methods in this study, detailing the specific values for SelectKBest, LASSO, and GA. It includes parameters such as scoring function, number of top features, regularization strength, population size, number of generations, crossover rate, and mutation rate.
SelectKBest with an ANOVA F-test played a crucial role in the methodology, aiming to identify the most informative features for emotion classification. This feature selection process involved ranking features based on the ANOVA F-value and selecting the top k features for further modeling. For the EEG Emotion dataset and the DEAP dataset, we applied the SelectKBest method to choose 100, 500, 1000, 1500, and 2000 features and 50, 100, and 150 features, respectively. The varying numbers of selected features were chosen to systematically explore the impact of feature reduction on model performance, ensuring classification accuracy in the experiments for both the Emotion and DEAP datasets. This approach was necessary because filter-based feature selection requires the pre-determination of the number of selected features.
Among the representative embedded-based feature selection methods, we employed LASSO, followed by hyperparameter optimization using BO. We optimized the ‘alpha’ value of LASSO using BO over a search space ranging from 10⁻⁶ to 10¹ with a log-uniform distribution. We used 5-fold cross-validation to find the most predictive and robust ‘alpha’ value.
A basic GA comprises three genetic operators: selection, mutation, and crossover.
In GAs, a solution is typically represented as a binary string, called a chromosome. It is essential to evaluate and select the most effective solutions to a particular problem. Each solution is assigned a fitness value that reflects its proximity to the overall specifications of the desired solution. We used the accuracy of each machine learning model as the fitness value of wrapper-based feature selection using a GA.
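A minimal sketch of such a fitness function follows, assuming synthetic data and RF as the wrapped classifier (any of the four models could be substituted); the function name and constants are illustrative, not taken from our implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the EEG feature matrix and labels.
X, y = make_classification(n_samples=150, n_features=20,
                           n_informative=5, random_state=0)

def fitness(chromosome, X, y):
    """Mean 5-fold CV accuracy of the model trained on the feature
    subset encoded by a binary chromosome (1 = feature kept)."""
    mask = np.asarray(chromosome, dtype=bool)
    if not mask.any():  # empty subsets get the worst possible score
        return 0.0
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X[:, mask], y, cv=5).mean()

chromosome = np.random.default_rng(0).integers(0, 2, size=X.shape[1])
score = fitness(chromosome, X, y)
print(score)
```

The GA then evolves chromosomes toward subsets that maximize this cross-validated accuracy.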
Selection: This operator examines a set of individuals in a population based on their fitness values. It preferentially retains the best individuals but must also give less fit individuals a chance, to avoid premature convergence. We used a GA with a population size of 30, 20 generations, and 5-fold cross-validation.
Mutation: This operation introduces a small perturbation into the chromosome of an individual. We set the mutation probability to start at 0.8 and decay to 0.2, with a decay rate of 0.01.
Crossover: This operator explores the search space by diversifying the population. It typically manipulates the chromosomes of two parents to generate two children. These operations are applied iteratively in the GA, as shown in the flowchart (Figure 2). In our implementation, the crossover rate started at 0.2 and increased exponentially, approaching 0.8 over 20 generations. These parameters were set to balance the exploration and exploitation capabilities of the GA during the feature selection process. Initially, a low crossover rate (0.2) encouraged broader exploration of the solution space, which helps identify diverse and potentially high-quality solutions early in the optimization process. As generations progressed, the crossover rate increased exponentially, reaching 0.8. A higher crossover rate in later generations promotes exploitation, in which the algorithm focuses on refining and combining the best solutions. By starting with a lower crossover rate and increasing it gradually, we ensure that the algorithm does not converge to suboptimal solutions and has a higher chance of finding the optimum. This adaptive strategy helps maintain a good balance between diversity and convergence throughout the evolutionary process.
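One way to realize this schedule is exponential interpolation between the two endpoints; the exact curve is an assumption, since the text states only the start value, end value, and exponential growth:

```python
def crossover_rate(generation, start=0.2, end=0.8, n_generations=20):
    """Exponential schedule from `start` toward `end` across n_generations.

    rate(g) = start * (end/start) ** (g / (n_generations - 1)),
    so generation 0 yields `start` and the final generation yields `end`.
    """
    frac = generation / (n_generations - 1)
    return start * (end / start) ** frac

rates = [crossover_rate(g) for g in range(20)]
print(round(rates[0], 2), round(rates[-1], 2))  # 0.2 0.8
```

The rate rises slowly in early generations (favoring exploration) and steepens toward the end (favoring exploitation), matching the behavior described above.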
Table 2 lists the hyperparameters for the classification algorithms used in the experiment. As introduced above, we applied hyperparameter tuning using BO across several machine learning models, including random forest, logistic regression, XGBoost, and SVM. For RF, the number of trees (n_estimators) is set to an integer value between 50 and 200, the maximum depth of a tree (max_depth) is set to a categorical choice of 10, 20, or 30, and the minimum number of samples required for a split is set to an integer value between 2 and 10. For logistic regression, the penalty parameter is set to a categorical choice of ‘l1’ and ‘l2’, C is set to a real number between 10⁻³ and 10³, and the solver parameter is set to ‘liblinear’. For XGBoost, the number of gradient-boosted trees is set to an integer value between 50 and 200, the learning rate is set to a real number between 10⁻³ and 1, and the maximum tree depth (max_depth) is set to an integer value between 3 and 9. For SVM, the kernel parameter is set to a categorical choice of ‘linear’ and ‘rbf’, and the C parameter is set to a real number between 10⁻³ and 10³. The BO process was iterated over 30 trials for each model to tune the hyperparameters. For the baseline models, we used the default hyperparameter values provided by scikit-learn.