1. Introduction
The increasing popularity of the Internet and the continuous development of information technology have brought network security issues to greater prominence, giving rise to a multitude of network attacks and malicious behaviors that pose a significant challenge to the security and stability of network systems [1]. Intrusion detection systems (IDSs) are a crucial network security protection measure: by monitoring network traffic and system activities, they can detect and respond to potential security threats in a timely manner, making them an invaluable tool for protecting network security [2]. Network traffic can be categorized as normal or abnormal, so intrusion detection can be regarded as a binary classification problem, and the accuracy of an IDS can be significantly improved by improving the performance of the classifier [3]. Existing intrusion detection techniques follow three approaches: signature detection, anomaly detection, and statistical detection [4]. Signature detection is one of the most prevalent techniques employed in intrusion detection systems. Its fundamental premise is to identify known attack patterns and malicious behaviors using a predefined feature library or rule set. Nevertheless, signature detection is limited when addressing unknown and variant attacks. Anomaly detection takes the opposite approach: potential attacks are identified by constructing a model of normal behavior and detecting activities that deviate significantly from it. Although anomaly detection is effective at identifying unknown attacks, it is also prone to a higher false alarm rate. Statistical detection is an intrusion detection method based on statistical principles that identifies abnormal behavior by analyzing the statistical characteristics of network traffic or system activities. Compared with signature detection and anomaly detection, statistical detection places greater emphasis on the analysis of data distribution and patterns, which can enhance detection accuracy to a certain extent.
In recent years, a proliferation of intrusion detection methods based on machine learning (ML) [5], deep learning (DL) [6], ensemble learning (EL) [7], and other techniques has been employed in intrusion detection research, demonstrating enhanced performance. Machine learning models (e.g., decision trees [8], SVM [9], KNN [10]) can identify pivotal features within network traffic and system logs and detect anomalous activities by learning the patterns of normal network traffic and system behavior. Deep learning models (e.g., DNN [11], RNN [12], LSTM [13]) can automatically learn feature representations through the hierarchical structure of deep neural networks and provide better generalization when trained on large-scale data. Ensemble learning methods, including bagging [14], boosting [15], and stacking [16], enhance robustness and generalizability by combining the predictions of diverse base estimators.
Although intrusion detection systems can detect and respond to potential security threats in a timely manner to a certain extent, significant challenges remain in real-world, large-scale, high-speed, and complex network environments. The first challenge is the asymmetry between informative and redundant features in datasets [17,18]. In practical data analysis and model training tasks, the features in a dataset should ideally provide a substantial amount of information to support model training and prediction. However, the presence of redundant features may increase computational complexity and reduce the model's generalization ability. Consequently, accurately and efficiently identifying redundant features for filtering is a significant challenge. The second challenge is the asymmetry of the network traffic distribution [19]. In realistic network traffic data, there is usually a large amount of normal traffic and a small amount of abnormal traffic. Consequently, the model may tend to predict the majority class, which can degrade its performance and generalization ability. Improving the stability and generalization ability of the model in the face of imbalanced samples is therefore another challenge.
Consequently, conventional network malicious traffic detection techniques currently face significant challenges. In light of the asymmetry between informative and redundant features and the asymmetry of the network traffic distribution in the field of network intrusion detection, there is an urgent need for detection methods with a high detection rate in order to enhance cybersecurity. To address these issues, this paper employs feature engineering, ensemble learning, optimization algorithms, and other methodologies to conduct in-depth research on the effective detection of anomalous traffic. This paper's contributions are as follows:
The ERT method was employed to rank the traffic data features by their Gini importance, removing irrelevant and superfluous features and constructing an optimal feature subset.
The BO algorithm was used to optimize the parameters of the KNN base estimators to improve the performance of the ensemble model.
A bagging ensemble approach was proposed for intrusion detection. Based on feature selection and parameter optimization, training samples were randomly sampled using the bootstrap method to construct KNN base estimators for hard voting integration. This approach avoided the instability of a single base classifier due to the imbalance of data categories, effectively reducing the variance and improving the model’s generalizability.
A series of comprehensive experiments were conducted on the intrusion detection dataset NSL-KDD with the objective of comparing the performance of different machine learning models integrated as base estimators. This was performed to validate the effectiveness of the proposed model.
The rest of this paper is organized as follows:
Section 2 presents related work.
Section 3 explains our proposed model.
Section 4 describes the dataset and presents and analyzes the experimental results.
Section 5 summarizes the paper and outlines future work.
2. Related Work
The existing literature on intrusion detection techniques is primarily concerned with three principal areas of research. The first is the development of feature engineering techniques for intrusion detection datasets. Safaldin et al. [20] proposed an enhanced IDS using a modified binary grey wolf optimizer with a support vector machine (GWOSVM-IDS), which improves the accuracy and detection rate of intrusion detection and reduces processing time by decreasing the false alarm rate and the number of features generated by the IDS. Kumar et al. [21] proposed a penalized reward-based ant colony optimization (PRACO) feature selection method, which achieves a better exploration-exploitation trade-off by rewarding useful features and penalizing others; the proposed model achieved an accuracy of 81.682% on the NSL-KDD dataset. Ghosh et al. [22] proposed a new wormwood grouse mating (SGM) algorithm in 2022 and applied it to IDS, reducing the original 41 features to 14 and achieving an average accuracy of 81.429%. Ye et al. [23] applied the meta-heuristic hybrid breeding optimization (HBO) algorithm to IDS and proposed an integrated feature selection framework based on the improved HBO. The framework assigns a subpopulation to each feature space and identifies the optimal subset of features, with the objective of enhancing detection accuracy through the integration of the subpopulations. Herrera et al. [24] conducted a comprehensive examination of existing feature selection algorithms to identify their shortcomings and developed novel multimetric feature selection algorithms that reduce the dimensionality of the training dataset by leveraging the qualitative information provided by multiple feature selection metrics; their experiments demonstrated the efficacy of the proposed approach. Nazir et al. [25] applied feature selection to reduce data dimensionality and improve classifier performance by proposing a wrapper-based feature selection method, tabu search-random forest (TS-RF), and testing it on the UNSW-NB15 dataset. The experimental results demonstrated that TS-RF enhanced classification accuracy while reducing the number of features and the false alarm rate. In 2024, Akhiat et al. [26] proposed an efficient ensemble feature selection algorithm for intrusion detection, the intrusion detection efficient ensemble feature selection (IDS-EFS) algorithm, which enhances model interpretability by decreasing the dimensionality of network data, thereby reducing resource requirements and improving generalization. Khammassi et al. [27] tested three different decision tree classifiers and used binomial and multinomial logistic regression for binary and multiclass datasets, respectively, successfully reducing the feature space and improving classification performance. Yang et al. [28] combined feature selection techniques with ensemble methods to develop an adaptive ensemble model for intrusion detection systems. Specifically, the neighborhood dependency of the neighborhood rough set (NRS) was introduced into the salp swarm algorithm (SSA) to produce a heuristic feature selection algorithm (NRS-SSA), and SSA was further used to optimize the weight matrix when setting the voting weights. The results demonstrated that the model achieved a state-of-the-art level of detection. Mohammad et al. [29] proposed an innovative feature selection algorithm called "the highest wins" (HW), which demonstrated advantages in a variety of evaluation metrics, including recall, precision, and error rate, compared with the well-known chi-square and information gain strategies.
The second area of research is the investigation of sample category imbalance in intrusion detection datasets. Qian et al. [30] identified that sample category imbalance can result in suboptimal detection performance of intrusion detection models. To address this, they proposed an improved hybrid sampling (IHS) method based on the chaotic particle swarm optimization (CPSO) algorithm, together with a deep long short-term memory (DLSTM) model, which achieved high accuracy in classifying intrusion behaviors and outperformed the comparative models. Jiang et al. [31] employed one-sided selection (OSS) to reduce the influence of noisy samples in the majority class, and then augmented the minority class samples through the synthetic minority oversampling technique (SMOTE). The proposed method enabled the model to fully learn the features of the minority class samples while reducing training time. In 2022, Jung et al. [32] proposed a hybrid resampling method that adds minority class samples and removes noisy data using SMOTE and edited nearest neighbors to generate a more balanced dataset. The proposed method was validated on two publicly available intrusion detection datasets, PKDD2007 and CSIC2012, and the results demonstrated a clear advance on previous work. Zhang et al. [33] integrated deep learning methods with statistical techniques to address the challenge of detecting minority class samples, proposing an intrusion detection method, ICVAE-BSM, which employs an improved conditional variational auto-encoder (ICVAE) and a boundary synthetic minority class oversampling technique (BSM). The method was designed to improve Internet of Things (IoT) attack detection under sample imbalance and accuracy constraints. In 2024, Liu et al. [34] proposed a multiconstraint migration method with additional auxiliary domains and designed a multiscale, multilevel sample augmentation discriminator to accomplish IoT intrusion detection under an imbalanced sample distribution. The approach achieved an average accuracy of 96.398% on four datasets and can be effectively used for intrusion detection in real IoT environments.
The third area of research concerns the improvement and fusion of intrusion detection model structures with the objective of enhancing detection performance. Esmaeili et al. [35] explored the potential of deep learning-based intrusion detection systems and demonstrated the superiority of LSTM and BiLSTM models over other models. Zaryn et al. [36] conducted a comparative study of the performance and efficiency of IoT anomaly detection models on the NSL-KDD dataset, and the results indicated that the ensemble model XGBoost was the most accurate and efficient. Lee et al. [37] proposed a two-stage fine-tuning algorithm based on the WGAN-GP model to enhance the recognition accuracy of sparse data probes by fine-tuning the classification algorithm and model parameters. Their experiments demonstrated that the MLP classifier's accuracy rose from 74% to 80% after fine-tuning, significantly outperforming all other classifiers. Farooq et al. [5] proposed an intrusion detection scheme, IDS-FMLT, which incorporates machine learning techniques to detect intrusions in heterogeneous networks comprising disparate source networks and to safeguard the network from malevolent attacks. The scheme achieved a validation accuracy of 95.18% and a miss detection rate of 4.82%. Sarnovsky et al. [38] proposed a hierarchical intrusion detection system based on a symmetric combination of machine learning methods and knowledge-based approaches, combining several different machine learning models. The system can predict specific types of attacks and select the appropriate model to perform the prediction at a selected level. Alotaibi et al. [39] proposed an intelligent intrusion detection model fusing machine learning methods, improving accuracy and decision making by combining the predictions of different models through a fuzzy inference system. Elnakib et al. [40] proposed an enhanced anomaly-based intrusion detection deep learning multiclass classification model (EIDM), which can classify different traffic behaviors, including multiple attack types, and achieved an accuracy of 95% on the CICIDS2017 dataset. Wang et al. [41] proposed two ensemble deep learning models, SDAE-ELM and DBN-Softmax, employing a small-batch gradient descent method for network training and optimization, which enhanced classification accuracy and enabled real-time response to intrusion behaviors. Praveena et al. [42] proposed a deep reinforcement learning technique based on the black widow optimization (DRL-BWO) algorithm and used an improved reinforcement learning-based deep belief network (DBN) for intrusion detection, achieving an accuracy of 98.5%.
Our approach considers all three of the aforementioned research aspects simultaneously. First, the ERT method is an effective solution to the asymmetry between informative and redundant features: it improves classification performance while reducing data dimensionality, and its highly parallelized training is effective on large-scale data. Second, the bagging ensemble method addresses the asymmetry of the sample distribution: it increases model diversity, avoids the instability of individual base estimators caused by imbalanced data categories, and enhances the generalization ability of the fused model. Finally, the parameters of the base estimators are tuned with the BO algorithm, further improving model performance.
3. Proposed Method
The proposed BO-KNN-Bagging framework is illustrated in Figure 1. It encompasses four principal stages: (1) Data preprocessing. The original attack categories are first converted into binary labels, the categorical labels are then converted into numerical values through label encoding, and finally the data are scaled using min-max normalization. (2) Feature selection. The ERT algorithm is employed to evaluate feature importance and determine feature weights, thereby identifying the most informative features. (3) Construction and optimization of base estimators. Between 5 and 200 base estimators are constructed, and multiple bootstrap samples are generated by drawing samples from the original dataset with replacement. Tenfold cross-validation is performed, the parameters of the base estimators are tuned with BO, and each model is then trained independently on its own bootstrap sample. (4) Model ensemble. For the binary classification problem, the independent predictions of the KNN base estimators are combined by hard voting, i.e., majority voting, and the resulting classification is the output of the entire ensemble model. The trained BO-KNN-Bagging model is then employed to detect attacks.
3.1. Data Preprocessing
To facilitate the training and prediction of machine learning models, we employed label encoding to convert category-based label data into numerical form. Integer encoding maps each feature label to an integer value, which is more space efficient than binary encoding and performs well in many machine learning algorithms.
Concurrently, different features tend to have disparate sizes and ranges, which can result in some features exerting a considerable influence on the model while others have a relatively minor impact. To address this issue, all features are scaled to a uniform range using min-max normalization, also known as deviation normalization. This is a common data preprocessing technique employed to scale numerical data to a specific range, typically within the bounds of [0, 1] or [−1, 1]. The method entails a linear transformation of the original data, whereby the minimum value is mapped to the desired minimum value and the maximum value is mapped to the desired maximum value. The calculation procedure for min-max normalization is presented below.
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ represents each sample in the original dataset, $x'$ is the normalized sample value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values in the dataset, respectively. This method is suitable for processing features of different scales, ensuring that they have similar importance within the same range. It also helps to eliminate dimensional effects between features, thereby improving the performance of the machine learning model and making it more robust and accurate.
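As a minimal sketch of this preprocessing stage (assuming scikit-learn and a pandas DataFrame; the categorical column names protocol_type, service, and flag are the usual NSL-KDD ones but are illustrative here):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Label-encode categorical columns, then min-max scale all features to [0, 1]."""
    df = df.copy()
    # Integer-encode each categorical column (illustrative NSL-KDD column names).
    for col in ["protocol_type", "service", "flag"]:
        df[col] = LabelEncoder().fit_transform(df[col])
    # Linear rescaling: x' = (x - x_min) / (x_max - x_min).
    # In practice the scaler should be fit on the training split only.
    df[df.columns] = MinMaxScaler(feature_range=(0, 1)).fit_transform(df)
    return df
```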
3.2. Bootstrap Aggregating
Bagging is an algorithm proposed by Breiman that trains multiple base classifiers in parallel and performs ensemble learning based on bootstrapping [43]. As shown in Algorithm 1, bagging combines multiple models to improve the overall prediction performance and reduce the risk of overfitting. Here, bootstrap means that a new sample set $D'$ is obtained by drawing one sample at a time, with replacement, from the original $n$-sample set $D = (X, y)$ and repeating the draw $n$ times:

$$D' = \{(x'_i, y'_i) \mid (x'_i, y'_i) \sim \mathrm{Uniform}(D),\ i = 1, \dots, n\}$$

where $X$ and $y$ are the feature matrix and label vector of the original samples, respectively. The number of samples is identical to that of the original sample set, and the same sample can be drawn more than once into the new sample set. Although the distribution of each bootstrap sample set differs to some extent, each retains part of the information in the original training data, enabling each base estimator to learn slightly different content and thereby giving the bagging ensemble its diversity. Bootstrapping allows multiple models to be trained on different samples, reducing the overall variance of the model and yielding a more generalizable ensemble. Additionally, by exposing the models to slightly different sample sets, the method ensures that the ensemble generalizes better to unseen data and is less sensitive to the idiosyncrasies of a single training dataset. Furthermore, the randomness introduced by the bootstrap itself helps to counter overfitting: models trained on different samples produce different errors, and combining them averages these errors out, improving generalization performance.
Aggregation denotes the integration strategy, which in classification problems is typically accomplished through voting: each model provides a prediction, and the final prediction is the category that receives the greatest number of votes. In regression problems, bagging typically takes the average of the models' outputs as the final prediction. By combining multiple base estimators, the instability of a single base classifier caused by imbalanced data categories can be effectively avoided and the variance effectively reduced. Applying bagging ensemble learning to a fused model enhances its generalizability compared with a single estimator [44,45]. In our study, for the KDDTrain+ training set in NSL-KDD, comprising 125,973 data points, random sampling with replacement was performed, with the same number of data points drawn for each estimator for the subsequent training step.
Furthermore, the probability that a given sample is never drawn from a training dataset containing $n$ samples when sampling with replacement can be expressed as follows:

$$P = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow[n \to \infty]{} \frac{1}{e} \approx 0.368$$

This indicates that approximately 36.8% of the original samples are not included in a newly constructed sample set. Consequently, these samples can be utilized as out-of-bag (OOB) data for evaluating the generalization capacity of the model. The out-of-bag error (OOB error) is the discrepancy between the observed and predicted values on the OOB data; it is essentially the error in predicting new samples, i.e., an estimate of the generalization error rather than the training error. The lower the OOB error, the better the model's predictive performance. In particular, the OOB error reflects the performance of the model on data not seen during training: a lower OOB error indicates greater generalization capacity and higher accuracy on data that did not inform the model's development. Consequently, we use the OOB error as a supplementary metric for evaluating the model.
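This limit is easy to verify numerically; a quick check in plain Python (125,973 is the KDDTrain+ size mentioned above) shows the out-of-bag fraction approaching 1/e:

```python
import math

for n in (10, 1_000, 125_973):       # 125,973 = size of KDDTrain+
    p = (1 - 1 / n) ** n             # probability a given sample is never drawn
    print(f"n = {n:>7}: P(out-of-bag) = {p:.4f}")
print(f"limit 1/e       = {math.exp(-1):.4f}")   # ~0.3679, i.e., ~36.8%
```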
Algorithm 1 Bagging
Input: D //The NSL-KDD dataset; M //The number of models; n //The number of instances in the training set
Output: H(x) //Prediction results of the ensemble model
1: Initialize R ← ∅ //The initialization set is used to store the prediction results
2: for m = 1 to M
3:   for i = 1 to n
4:     (x'_i, y'_i) ← RandomSample(D) //Random sampling
5:     D_m ← D_m ∪ {(x'_i, y'_i)} //Bootstrap
6:   end for
7:   h_m ← Train(D_m) //Train with optimal parameters
8:   R ← R ∪ {h_m(x)} //Result integration
9: end for
10: H(x) ← MajorityVote(R) //Aggregating
11: Return H(x)
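A compact scikit-learn rendering of Algorithm 1 might look as follows. This is a sketch rather than the exact implementation: the KNN hyperparameters shown are placeholders for the values later found by Bayesian optimization (Section 3.7), and X_train, y_train, X_test stand for the preprocessed, feature-selected NSL-KDD arrays.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Base estimator: KNN with placeholder hyperparameters (tuned by BO in the paper).
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")

# Each of the M estimators is trained on a bootstrap sample as large as the
# training set; predict() aggregates the base estimators by majority vote.
bagging = BaggingClassifier(
    estimator=knn,          # "base_estimator" in scikit-learn < 1.2
    n_estimators=100,       # M; the paper explores 5-200
    max_samples=1.0,        # bootstrap samples the size of the original set
    bootstrap=True,         # sampling with replacement
    oob_score=True,         # evaluate on out-of-bag data (Section 3.2)
    n_jobs=-1,
)
bagging.fit(X_train, y_train)             # assumed preprocessed arrays
print("OOB accuracy:", bagging.oob_score_)
y_pred = bagging.predict(X_test)          # hard (majority) voting
```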
3.3. Extremely Randomized Trees
The ERT algorithm is a machine learning algorithm based on bagging ensemble models, initially proposed by Geurts et al. in 2006 [46]. It typically employs the CART algorithm to construct the base estimators, introducing greater randomness than random forests. In contrast to the random forest algorithm, ERT randomly selects a subset of features when choosing node splits and randomly selects a threshold within that subset for splitting. This increases the diversity of the model and further reduces its variance.
The execution of traditional decision tree algorithms in a serial manner often results in suboptimal utilization of computational resources. In contrast, ERT has clear advantages in the feature selection task [47]: it employs a highly parallelized training method that effectively utilizes multicore processors and distributed computing resources, enabling it to perform well on large-scale data and to be robust to noise and outliers. Furthermore, since each tree is trained on a random subset, the degree of overfitting of a single tree is relatively low and the overall model generalizes better. This enables ERT to maintain stability and reliability when dealing with complex data.
The ERT algorithm assesses the significance of features by quantifying the impact of each feature on the model. The importance of a feature is typically gauged by measuring its influence on the model when the decision tree is split, with each node split on a specific feature: the reduction in impurity (typically calculated using the Gini coefficient) before and after splitting on that feature is determined at the time of the split, and the cumulative value of this reduction is used as the measure of the feature's importance. The Gini coefficient is calculated using the following formula:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2$$

where $\mathrm{Gini}(D)$ is the Gini coefficient of the dataset $D$, $K$ represents the total number of categories, and $p_k$ is the proportion of samples in the dataset belonging to category $k$. A smaller Gini coefficient signifies lower impurity, with a Gini coefficient of 0 indicating that the dataset is pure, i.e., all samples belong to the same category. In our study, the importance of each feature is calculated and, after sorting in descending order of importance, relevant features are selected according to the performance of the model.
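In scikit-learn terms, this Gini-importance ranking could be sketched as below; the number of trees and the cut-off k are assumptions, and X, y stand for the preprocessed NSL-KDD features and binary labels.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

ert = ExtraTreesClassifier(n_estimators=100, criterion="gini",
                           n_jobs=-1, random_state=0)
ert.fit(X, y)                              # assumed preprocessed data
# Mean Gini-impurity decrease per feature, sorted in descending order.
ranking = np.argsort(ert.feature_importances_)[::-1]

k = 20                                     # illustrative cut-off, chosen from model performance
X_selected = X[:, ranking[:k]]             # optimal feature subset (X as a NumPy array)
```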
3.4. K-Nearest Neighbor
KNN is a frequently utilized supervised learning algorithm, typically employed for classification and regression problems. Its fundamental concept is to categorize an unknown sample according to the most prevalent category among its nearest neighbors, or to predict the mean value of those neighbors in regression, using a distance metric in the feature space. The primary steps of the KNN algorithm are determining the number of neighbors $K$, calculating the distances between the unknown sample and all training samples, selecting the $K$ neighbors with the closest distances, and predicting the classification or regression output from their labels. Euclidean distance and Manhattan distance are commonly used as the distance metric. The Euclidean distance between two n-dimensional vectors $\mathbf{a} = (a_1, \dots, a_n)$ and $\mathbf{b} = (b_1, \dots, b_n)$ can be expressed as follows:

$$d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

The Manhattan distance is:

$$d(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{n} |a_i - b_i|$$
As a traditional machine learning algorithm, KNN is relatively straightforward to comprehend and implement, with commendable prediction outcomes. However, it is sensitive to the choice of the number of neighbors and the distance metric, and its limited capacity to handle large datasets may result in inefficiency and suboptimal performance on high-dimensional and imbalanced data [48].
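For concreteness, the two metrics correspond to direct NumPy computations and map onto the metric parameter of scikit-learn's KNN classifier (K = 5 is a placeholder, not the tuned value):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
print(np.sqrt(np.sum((a - b) ** 2)))   # Euclidean distance: 5.0
print(np.sum(np.abs(a - b)))           # Manhattan distance: 7.0

# The same metrics selected by name in KNN:
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
```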
3.5. Support Vector Machine
SVM is a supervised learning model for classification and regression analysis. It performs well in problems such as pattern recognition, classification, and regression, and has unique advantages, particularly for applications in high-dimensional spaces. The basic idea is to find an optimal separating hyperplane to categorize the data into different classes. This hyperplane should not only maximize the margin between categories, but also avoid misclassification as much as possible. SVM achieves this goal by the following steps.
When the data are linearly separable, the SVM searches for a linear hyperplane that maximizes the margin between the two classes of data points, which can be represented as:

$$\mathbf{w}^{T}\mathbf{x} + b = 0$$

where $\mathbf{w}$ is the normal vector, which determines the orientation of the hyperplane, and $b$ is the bias, which determines the distance of the hyperplane from the origin. The optimization problem is to maximize the margin, that is, to minimize $\frac{1}{2}\|\mathbf{w}\|^{2}$, while satisfying the constraint on all data points:

$$y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \geq 1, \quad i = 1, \dots, n$$

When the data are not linearly separable, slack variables $\xi_i$ are introduced, allowing some data points to lie on the wrong side of the margin while controlling the total error through a penalty term. The optimization problem becomes minimizing $\frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{n}\xi_i$, where $C$ is the penalty coefficient that controls the trade-off between margin width and misclassification error.
When the data are nonlinearly separable, a kernel function is employed to map the data from the low-dimensional space to a high-dimensional space, where a linear hyperplane is identified. Commonly utilized kernel functions include the polynomial kernel, the RBF kernel, and the sigmoid kernel. The optimization problem then becomes margin maximization in the high-dimensional space.
SVM can efficiently process data in a high-dimensional space, and overfitting is effectively avoided by maximizing the margin. However, the approach is not well suited to large-scale datasets, as training time and memory consumption may become excessive for very large amounts of data. In our study, the RBF kernel was employed as the mapping kernel function, and the SVM was assembled using the bagging method to assess the performance of distinct ensemble models.
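The bagged-SVM comparison model described here could be assembled along the following lines (a sketch; C, gamma, and the estimator count are assumed values, not those of the original experiments):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# RBF-kernel SVM as the base estimator (C and gamma are illustrative).
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
bagged_svm = BaggingClassifier(estimator=svm, n_estimators=10,
                               bootstrap=True, n_jobs=-1)
```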
3.6. Classification and Regression Trees
CART is a tree structure based on recursive partitioning that is employed to partition a dataset and construct predictive models. Its fundamental concept is the recursive partitioning of the dataset into progressively smaller subsets until the samples within each subset are sufficiently homogeneous. In constructing a CART classification tree, the process commences at the root node with the selection of a feature and a threshold for partitioning the data, typically determined through information gain, the Gini coefficient, or the sum of squared errors. The dataset is then divided into two subsets based on the selected feature and threshold, and the same splitting rule is recursively applied to each subset until a stopping condition is met. The stopping condition may be that the number of samples in a node falls below a threshold or that the purity of the node reaches a threshold (e.g., all samples belong to the same category).
The CART classification tree can handle high-dimensional data and complex classification scenarios and is relatively robust to outliers. However, it is sensitive to minor variations in the data, which may significantly alter the structure of the generated tree. In our study, the Gini coefficient is employed to split the data, and CART is assembled using the bagging ensemble method to assess the efficacy of distinct ensemble models.
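Analogously, the bagged CART comparison model reduces to a few lines (a sketch; the stopping threshold min_samples_leaf is an assumed value):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# CART with Gini-based splits; min_samples_leaf acts as a stopping condition.
cart = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5)
bagged_cart = BaggingClassifier(estimator=cart, n_estimators=100,
                                bootstrap=True, n_jobs=-1)
```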
3.7. Bayesian Optimization
BO is a global optimization method based on Bayes' theorem, suitable for situations where the objective function is difficult to evaluate or computationally expensive. The key idea is to establish a probabilistic model of the objective function $f(x)$ to guide the search, finding the parameter configuration $x^{*}$ that optimizes the objective:

$$x^{*} = \arg\max_{x \in \mathcal{X}} f(x)$$

To optimize the objective function, BO typically employs a Gaussian process as the prior model of the unknown objective. A Gaussian process characterizes the distribution of the data by means of a mean function, typically considered an approximation of the objective function, and a covariance function, which represents the uncertainty [49]. In contrast to traditional grid search or random search, BO can identify optimal solutions in high-dimensional spaces and for complex objective functions with greater efficiency [50]. In our study, the optimization objective is to maximize classification accuracy, so $f(x)$ is the accuracy of the model under parameter configuration $x$. Tenfold cross-validation is used to optimize the number of neighbors $K$ and the distance metric of the KNN.
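One way to realize this step is scikit-optimize's BayesSearchCV, which drives a surrogate-model search over the KNN hyperparameters with cross-validated accuracy as the objective. This is a sketch under assumptions: the search ranges are illustrative, and X, y stand for the preprocessed, feature-selected training data.

```python
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer
from sklearn.neighbors import KNeighborsClassifier

# Search space: the number of neighbors K and the distance metric.
search_space = {
    "n_neighbors": Integer(1, 50),                      # illustrative range
    "metric": Categorical(["euclidean", "manhattan"]),
}

opt = BayesSearchCV(
    KNeighborsClassifier(),
    search_space,
    n_iter=30,              # number of BO evaluations (illustrative)
    cv=10,                  # tenfold cross-validation, as in the paper
    scoring="accuracy",     # the objective f(x): classification accuracy
    n_jobs=-1,
)
opt.fit(X, y)               # assumed preprocessed, feature-selected data
print(opt.best_params_)     # tuned K and metric for the KNN base estimators
```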
5. Conclusions and Future Work
Aiming at the asymmetry between informative and redundant features in datasets and the asymmetry of the network traffic distribution in the field of network intrusion detection, which lead to the low accuracy and poor generalization of traditional machine learning detection methods in IDS, a network intrusion detection method based on a bagging ensemble is proposed. Using ERT for feature selection, we explored the performance of different machine learning models as bagging base estimators and found that the KNN-Bagging ensemble model performed best. We then used BO to tune the parameters of the KNN base estimators before integrating them. The results show that the BO-KNN-Bagging model achieves an accuracy of 82.48%, which is higher than that of traditional machine learning algorithms, and it also performs better than other comparable methods. Nevertheless, some limitations remain. The effectiveness of bagging depends on the variability among the base estimators; although we performed random sampling to ensure diversity in the training data, the base estimators use the same model and parameters, so the performance gains of the ensemble model may be limited. Moreover, bagging requires training multiple base estimators and therefore incurs a high computational overhead, especially for complex learning algorithms or large-scale datasets.
The next step is to construct more complex artificial neural network (ANN) models as base estimators and to integrate them with different methods (e.g., boosting, stacking). We will also consider using oversampling techniques so that the model better learns the features of minority categories, further improving its generalization performance and accuracy.