It is well known that feature selection (FS) is an NP-hard problem: finding the best subset in a high-dimensional search space is a combinatorial optimization problem. In this study, we utilize EGFA to solve this combinatorial optimization problem, i.e., feature selection for datasets with high-dimensional feature spaces; the resulting method is called EGFAFS. To verify the performance of EGFAFS, we test it on eight gene expression datasets with high-dimensional features. It is noted that the concept of “feature selection” for a general dataset corresponds to the concept of “gene selection” for gene expression data. Since gene expression data usually consist of relatively few samples characterized by high-dimensional features, simple feature selection methods, such as filter-based methods, are not well suited to gene expression data. Our proposed EGFAFS constructs a recommended feature pool for initialization. Paying more attention to the features in the recommended feature pool, rather than to all features in the original feature space, helps find the best solution more efficiently. In addition, considering all features in the original search space during the Explosion process of EGFAFS decreases the probability of being trapped in local optima.
2.3.1. Construct a Recommended Feature Pool
Since gene expression data consist of high-dimensional features, we construct a recommended feature pool based on a series of Random Forests (RFs) [51]. Based on this pool, we are able to pay more attention to the features in the recommended feature pool rather than to all features in the original feature space. To build the recommended feature pool, we utilize each Random Forest to measure the importance of its input features based on the Gini index. The pseudo-code of this strategy is given in Algorithm 2.
Algorithm 2. The pseudo-code of constructing a recommended feature pool.
1. Input: The No. of RFs $N_{RF}$, the No. of features selected for every RF $n_f$, the size of the recommended feature pool $S_{pool}$
2. While $i \le N_{RF}$ do
3. Sample $n_f$ features from the original feature space randomly for $RF_i$
4. Feed the samples sliced by the $n_f$ features to $RF_i$
5. Compute the importance scores of the $n_f$ features in $RF_i$ using Equations (6)–(10)
6. End while
7. Merge the importance scores of all features over the $N_{RF}$ RFs
8. Rank all features by sorting their importance scores
9. Return the $S_{pool}$ features with the maximum importance scores to build the recommended feature pool
The procedural steps of this strategy are described in detail, as follows:
Step 1. Use the Python package sklearn.ensemble to initialize $N_{RF}$ Random Forests (RFs).
Step 2. Randomly sample $n_f$ features from the original feature space (19,214 features) $N_{RF}$ times as the input features for the RFs initialized in Step 1.
Step 3. Feed the samples, sliced by the $n_f$ features picked in Step 2, to every RF generated in Step 1 for training.
Step 4. Compute the importance scores of the $n_f$ features in each RF based on the Gini index.
Step 5. Merge the importance scores of all features computed by the $N_{RF}$ RFs.
Step 6. Rank all the features by sorting the importance scores obtained in Step 5; the $S_{pool}$ features with the maximum importance scores compose the recommended feature pool. A sketch of these steps is given below.
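The following Python sketch illustrates Steps 1–6. It is a minimal illustration, not the paper's implementation: the parameter values n_rf, n_f, and s_pool are placeholders, and sklearn's Gini-based feature_importances_ attribute is used as a stand-in for Equations (6)–(10).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_feature_pool(X, y, n_rf=100, n_f=500, s_pool=200, seed=0):
    """Build a recommended feature pool by merging Gini-based importance
    scores from a series of Random Forests (placeholder settings)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    scores = np.zeros(n_features)  # accumulated importance per feature (Step 5)
    for i in range(n_rf):
        # Step 2: randomly sample n_f candidate features for this RF.
        idx = rng.choice(n_features, size=n_f, replace=False)
        # Steps 1 and 3: train an RF on the samples sliced by these features.
        rf = RandomForestClassifier(random_state=i).fit(X[:, idx], y)
        # Step 4: feature_importances_ is the normalized mean decrease in
        # Gini impurity, corresponding to Equations (6)-(10).
        scores[idx] += rf.feature_importances_
    # Step 6: indices of the s_pool highest-scoring features.
    return np.argsort(scores)[::-1][:s_pool]
```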
In this study, $N_{RF}$, $n_f$, and $S_{pool}$ are set to fixed values. Since the recommended feature pool is jointly determined by the $N_{RF}$ Random Forests, the default values of all other Random Forest parameters provided by sklearn.ensemble are adopted. The detailed description of Step 4 is as follows.
A Random Forest is composed of several decision trees (binary trees), and the selected features are used to decide to which class the input data belong. In this study, the Gini index is adopted as the impurity criterion for growing the decision trees. The Gini index of a split node $m$ in a decision tree, denoted $GI_m$, can be calculated using Equation (6):

$GI_m = \sum_{k=1}^{K} p_{mk}(1 - p_{mk}) = 1 - \sum_{k=1}^{K} p_{mk}^2$, (6)

where $K$ is the number of classes, and $p_{mk}$ is the proportion of samples belonging to the $k$-th class in node $m$. Then, the importance score of feature $j$ in node $m$, denoted $VIM_{jm}$, is the decrease in Gini index caused by splitting node $m$ on feature $j$, which can be calculated by Equation (7):

$VIM_{jm} = GI_m - GI_l - GI_r$, (7)

where $GI_l$ and $GI_r$ are the Gini indexes of the two newly generated nodes after node $m$ splits. Let $M_{ij}$ be defined as the set of nodes that select feature $j$ in the $i$-th tree. Then, for the $i$-th tree, the importance score of feature $j$ can be calculated by Equation (8):

$VIM_{ij} = \sum_{m \in M_{ij}} VIM_{jm}$. (8)

Therefore, if a Random Forest consists of $n_t$ trees, the importance score of feature $j$ can be calculated using Equation (9):

$VIM_j = \sum_{i=1}^{n_t} VIM_{ij}$. (9)

The final score is given by Equation (10) after normalization over all candidate features:

$\widetilde{VIM}_j = VIM_j / \sum_{j'} VIM_{j'}$. (10)
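As a quick numerical illustration of Equations (6) and (7), with made-up values, consider a node $m$ holding samples from two classes with proportions 0.6 and 0.4, which is split on feature $j$ into children with Gini indexes 0.20 and 0.10:

```latex
% Toy check of Equations (6) and (7); all numbers are invented.
\[ GI_m = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48 \]                % Equation (6)
\[ VIM_{jm} = GI_m - GI_l - GI_r = 0.48 - 0.20 - 0.10 = 0.18 \]   % Equation (7)
```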
2.3.2. EGFA for Feature Selection Based on a Recommended Feature Pool
The original EGFA is a novel nature-inspired heuristic search algorithm proposed by our research team in 2019 for continuous optimization problems; detailed information about EGFA is given in Section 2.2. This paper utilizes EGFA to solve the combinatorial optimization problem of feature selection; the resulting method is called EGFAFS. To investigate the performance of EGFAFS, we test it on eight gene expression datasets. Because gene expression data consist of high-dimensional features, we construct a recommended feature pool by ranking the features' importance based on a series of Random Forests, which allows us to pay more attention to the features in the recommended feature pool rather than to all features in the original search space. This strategy is described in Section 2.3.1 in detail. The overall flow chart of EGFAFS is depicted in Figure 2, and the pseudo-code of EGFAFS is given in Algorithm 3.
Step 1. Construct a recommended feature pool based on a series of Random Forests by ranking the features' importance scores. This step is described in Section 2.3.1 in detail. After this step, we obtain the subset $F_{pool}$ of features with size $S_{pool}$, which is the recommended feature pool.
Step 2. Initialize $n$ dust particles, where the location of the $i$-th dust particle is defined by Equation (11):

$location_i = (f_{i1}, f_{i2}, \ldots, f_{iD})$, with $f_{ij} \in F_{pool}$, (11)

where $f_{ij}$ is the $j$-th feature selected randomly from the recommended feature pool $F_{pool}$ built in Step 1, and $D$ is the number of selected features. The location of each dust particle is thus a feature subset of size $D$. In this study, $n$ and $D$ are set to fixed values. The mass value of each dust particle is calculated based on the Matthews Correlation Coefficient (MCC) [52], as given in Equation (12), where MCC is a metric used to evaluate a dust particle (solution). A detailed description of MCC is given in Section 3.2. In addition, to simplify the process of EGFAFS, the number of groups is set to 1 in this study, so there is only one center in the population; except for the center dust particle, all other dust particles are surrounding dust particles.
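A minimal Python sketch of this initialization and mass computation is given below. The helper names are ours, and using the validation MCC directly as the mass is only one plausible reading of Equation (12):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.svm import SVC

def init_population(pool, n, d, rng):
    """Equation (11): each dust particle is a size-d feature subset
    drawn at random from the recommended feature pool."""
    return [rng.choice(pool, size=d, replace=False) for _ in range(n)]

def mass(dust, X_tr, y_tr, X_val, y_val):
    """Train an SVC on the selected features and score it with MCC on
    the validation set; the MCC is used here as the particle's mass."""
    clf = SVC().fit(X_tr[:, dust], y_tr)
    return matthews_corrcoef(y_val, clf.predict(X_val[:, dust]))
```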
Algorithm 3. The pseudo-code of EGFA for feature selection (EGFAFS).
1. Input: The size of the recommended feature pool $S_{pool}$, the size of the dust population $n$, the size of the feature subset found by EGFAFS $D$, the No. of Random Forests (RFs) $N_{RF}$, the No. of iterations $T$
2. Construct a recommended feature pool by $N_{RF}$ RFs based on the Gini index
3. Initialize the dust population of size $n$ by Equation (11), and calculate the mass of each particle by Equation (12)
4. While $t \le T$ do
5. Move the surrounding dust particles toward their center, as in Figure 3
6. Some surrounding dust particles are absorbed by their center
7. The explosion strategy produces some new dust particles, as in Figure 4
8. $t = t + 1$
9. End while
10. Return the optimal subset of features
Step 3. Move the surrounding dust particles toward their center. For the $i$-th surrounding dust particle $dust_i$ ($i = 1, 2, \ldots, n - 1$), this step is depicted in Figure 3 and consists of the following procedures (a code sketch follows below):
Select a feature randomly from the location of $dust_i$, named $f_{old}$.
Select a feature randomly from the location of its center, named $f_{new}$.
Replace the feature $f_{old}$ with the feature $f_{new}$.
In addition, the features represented by the center are shuffled in this step, which is the rotating strategy used in this study.
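Under the subset representation above, this move might be sketched as follows (the handling of $f_{old}$/$f_{new}$ is simplified):

```python
def move_toward_center(dust, center, rng):
    """Step 3: swap one random feature of a surrounding dust particle
    for one random feature of the center, then shuffle the center
    (the rotating strategy). Duplicate features are not handled here."""
    moved = dust.copy()
    pos = rng.integers(len(moved))    # position of f_old in the dust
    moved[pos] = rng.choice(center)   # f_new taken from the center
    rng.shuffle(center)               # rotate: shuffle the center's features
    return moved
```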
Step 4. The surrounding dust particles with relatively smaller mass values are absorbed by their center. Define $m_p$ as the $p$-th percentile of the mass values of the dust population. For each dust particle $dust_i$ (assuming it is a surrounding dust particle, i.e., not the center), if $mass(dust_i) < m_p$, $dust_i$ will be absorbed; otherwise, it will remain for the next iteration. In this step, the size of the population decreases. In addition, $p$ is set to a fixed value in this study.
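A compact sketch of this absorption step, with $p$ as a placeholder value:

```python
import numpy as np

def absorb(surrounding, masses, p=25):
    """Step 4: keep only the surrounding particles whose mass reaches
    the p-th percentile; the rest are absorbed by the center
    (p = 25 is a placeholder, not the paper's setting)."""
    threshold = np.percentile(masses, p)
    return [d for d, m in zip(surrounding, masses) if m >= threshold]
```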
Step 5. Several new dust particles are generated by the explosion strategy based on the center dust. A new dust particle $dust_{new}$ is generated as shown in Figure 4, which consists of the following procedures (see the sketch after this list):
Copy the location of the center to $dust_{new}$.
Select $n_e$ features from $dust_{new}$ randomly as $F_{out}$ (the red highlighted features in Figure 4), and select $n_e$ features from the original feature space as $F_{in}$ (the green highlighted features in Figure 4). In this study, $n_e$ is set to a fixed value.
For $dust_{new}$, replace the features in $F_{out}$ with the features in $F_{in}$ one by one.
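The explosion step might be sketched as follows ($n_e$ and the helper name are ours):

```python
def explode(center, n_total_features, n_e, rng):
    """Step 5: clone the center, then swap n_e of its features (F_out)
    for n_e features drawn from the whole original space (F_in); this
    reintroduces features outside the recommended pool."""
    new_dust = center.copy()
    out_pos = rng.choice(len(new_dust), size=n_e, replace=False)               # F_out
    new_dust[out_pos] = rng.choice(n_total_features, size=n_e, replace=False)  # F_in
    return new_dust
```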
The main loop from Step 3 to Step 5 runs for several epochs (50 epochs in this study). During the main loop, to evaluate a dust particle (a solution, i.e., a subset of features), we feed the training data sliced by the feature subset represented by the dust to a Support Vector Classifier (SVC) [53] for training. After training, we evaluate the dust with the Matthews Correlation Coefficient (MCC), a metric for classification performance, on the validation dataset. A detailed description of MCC is given in Section 3.2. Once the main loop ends, the best subset of features is found, and the selected features are tested on the independent test dataset to obtain the final performance metrics.
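Combining the sketches above, a hypothetical end-to-end driver could look as follows; X_train, y_train, X_val, y_val and all sizes are assumed placeholders rather than the paper's settings:

```python
rng = np.random.default_rng(0)
pool = build_feature_pool(X_train, y_train)
dusts = init_population(pool, n=30, d=50, rng=rng)
evaluate = lambda ds: [mass(d, X_train, y_train, X_val, y_val) for d in ds]
for epoch in range(50):                                        # main loop: Steps 3-5
    masses = evaluate(dusts)
    c = int(np.argmax(masses))                                 # one group -> one center
    center = dusts[c]
    surrounding = [move_toward_center(d, center, rng)          # Step 3
                   for k, d in enumerate(dusts) if k != c]
    survivors = absorb(surrounding, evaluate(surrounding))     # Step 4
    born = [explode(center, X_train.shape[1], n_e=5, rng=rng)  # Step 5
            for _ in range(len(surrounding) - len(survivors))]
    dusts = [center] + survivors + born
best_subset = dusts[int(np.argmax(evaluate(dusts)))]           # final feature subset
```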
It is noted that the original EGFA was proposed for continuous optimization problems, such as the Ackley and Sphere benchmark problems, whereas EGFAFS is an improved version of EGFA for solving combinatorial optimization problems (i.e., feature selection) on real-world data, such as gene expression data. Because EGFAFS and the original EGFA solve different types of optimization problems, the Move and Explode processes in EGFAFS differ from those in the original EGFA introduced in Section 2.2. In addition, the original EGFA performs well on continuous optimization problems in low-dimensional search spaces (2, 3, 5, 10, or 20 dimensions), whereas the EGFAFS proposed in this manuscript is improved and implemented to solve a combinatorial optimization problem, i.e., feature selection, in a high-dimensional search space containing 19,214 dimensions.
Finally, we emphasize that EGFAFS is a general feature selection algorithm capable of processing any type of numerical dataset, such as gene expression data.