Genes: Classification of Microarray Gene Expression Data Using An Infiltration Tactics Optimization (ITO) Algorithm
Article
Classification of Microarray Gene Expression Data
Using an Infiltration Tactics Optimization
(ITO) Algorithm
Javed Zahoor * and Kashif Zafar
Department of Computer Science, National University of Computer and Emerging Sciences (NUCES),
Lahore 54000, Pakistan; kashif.zafar@nu.edu.pk
* Correspondence: javed.zahoor@gmail.com; Tel.: +92-321-462-5747
Received: 20 May 2020; Accepted: 9 July 2020; Published: 18 July 2020
Abstract: A number of feature selection and classification techniques have been proposed in the
literature, including parameter-free and parameter-based algorithms. The former are quick
but may get stuck in local maxima, while the latter use dataset-specific parameter tuning for higher
accuracy. However, higher accuracy does not necessarily mean higher reliability of the model. Thus,
generalized optimization is still a challenge open for further research. This paper presents a
warzone-inspired “infiltration tactics” based optimization algorithm (ITO), not to be confused with
the ITO algorithm based on the Itô process in the field of stochastic calculus. The proposed ITO
algorithm combines parameter-free and parameter-based classifiers to produce a
high-accuracy-high-reliability (HAHR) binary classifier. The algorithm produces results in two
phases: (i) a Lightweight Infantry Group (LIG) converges quickly to find non-local maxima and
produces comparable results (i.e., 70 to 88% accuracy); (ii) a Follow-up Team (FT) uses advanced
tuning to enhance the baseline performance (i.e., 75 to 99%). Every soldier of the ITO army is a base
model with its own independently chosen subset selection method, pre-processing method, validation
method, and classifier. The successful soldiers are combined through heterogeneous ensembles for
optimal results. The proposed approach addresses the data scarcity problem, is flexible in the choice
of heterogeneous base classifiers, and is able to produce HAHR models comparable to the established
MAQC-II results.
1. Introduction
Microarray experiments produce a huge amount of gene-expression data from a single sample.
The ratio of the number of genes (features) to the number of patients (samples) is very skewed, which
results in the well-known curse-of-dimensionality problem [1]. This imposes two conflicting
limitations on any proposed model: (i) processing all the data is not always feasible; and (ii) processing
only a subset of the data may result in loss of information, overfitting, and local maxima. These two
limitations directly impact the accuracy and reliability of any machine learning model. To address the
curse-of-dimensionality, a lot of research has been done in the past to identify the most impactful feature
subset [2–5]. Both evolutionary as well as statistical methods have been proposed in the literature
for this purpose. Feature Subset Selection (FSS) techniques like Minimum Redundancy Maximum
Relevance (mRMR), Joint Mutual Information (JMI), and Joint Mutual Information Maximization
(JMIM) are amongst the most prominent statistical methods [6–8] while advanced approaches
like Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Deep Neural Networks (DNN),
Transfer Learning, mining techniques, etc. have also been shown in the literature to produce highly
accurate results [9–11]. The microarray data classification process is typically carried out in two
major phases: (i) Feature Selection, which focuses on selecting the most relevant features from
an otherwise huge dataset to reduce noise, computational overhead, and overfitting; and (ii) Classifier
Training, which builds a model from the selected features to classify a given microarray sample
accurately and reliably [12]. Advanced techniques like Deep Neural Network (DNN), Convolutional
Neural Network (CNN), Transfer learning, Image processing, ANT Miner, and other exploratory
approaches have been proposed in the literature [13–21]. While the advanced approaches for both FSS
and Classifier training are capable of producing high accuracies, they need to be tuned according to the
underlying dataset in a controlled setup to achieve these good results. However, in practice, there are
a number of factors that can impact the accuracy and reliability of a model. These include the different
cancer types that need analysis of different tissues, the differences in microarray toolkits/hardware
e.g., data ranges and durabilities, experimental setups, number of samples, number of features used,
type of preprocessing methods applied, validation method used, etc. Due to these variations and the
No Free Lunch (NFL) theorem, many of the existing methods cannot be generalized across datasets.
Thus, it is still a challenging problem for researchers to develop a generalized approach that can
enhance both the reliability and accuracy of the model across datasets and variations. The algorithm
proposed in this paper turns these variations into an advantage by using ensembles for the
classification of microarray gene expression data.
The Infiltration Tactics Optimization (ITO) algorithm proposed in this paper is inspired by classic
war-zone tactics [22] and is not to be confused with the ITO algorithm based on the Itô process in the
field of stochastic calculus [23,24]. It comprises four phases: Find, Fix, Flank/Fight, and Finish, i.e.,
the so-called Four Fs of basic war strategy. A small light-infantry group (LIG) penetrates into the
enemy areas to set up a quick command and control center while the follow-up troops (FT) launch a
detailed offensive with heavier and more sophisticated weapons to gain finer control and victory over
the enemy. Both the LIG and FT members independently identify enemy weak-points and choose their
own routes, targets, movements, and methods of attack. The “successful” LIG members are then
combined to form a heterogeneous group that can become operational in a short time interval. This
LIG group is joined by the “successful” survivors from the FT to gain full control. The following text
describes the Four Fs (i.e., the Find, Fix, Flank/Fight, and Finish stages):
1. Find: In this stage, the LIG members analyze the field position to make a strategy and find the
most appropriate target to attack.
2. Fix: In this stage, the LIG members use different light-weight weapons to infiltrate into
enemy areas.
3. Flank/Fight: In this stage, the LIG members keep the enemy pinned down so that it cannot
reorganize its forces while the FT performs a detailed offensive in the area independently.
4. Finish: In this stage, the FT members apply heavier weapons to clean up the area and gain full
control over the enemy.
The proposed ITO algorithm is inspired by the Super Learner algorithm [25] but works in
two phases to build the overall model. In the first phase, ITO builds a heterogeneous ensemble of
parameter-free classifiers which can produce comparable results in a very short time-span. This sets a
bar for the minimum accuracy and reliability of the overall ensemble which is further refined when
fully tuned parameterized classifiers are available. The final model is guaranteed to meet this bar for
accuracy and reliability at the minimum. Parameter tuning is generally very time-consuming, but it
mostly produces the best results.
Microarray technology produces thousands of gene expressions in a single experiment.
However, the number of samples/patients is much smaller (up to a few hundred) than
the number of features (several thousand). Such a small number of samples (training data) is not
sufficient to build an efficient model. This is known as data scarcity in the field
of machine learning. The ITO algorithm overcomes the data scarcity problem by building multiple
heterogeneous base classifiers. ITO does not restrict the use of any base classifiers as LIG and/or
Genes 2020, 11, 819 3 of 28
FT members. It is possible to use the most performant classifiers from literature with this algorithm.
The LIG and FT use exploration to learn about the different configurations and gain knowledge about
rewards while the ensembling phase exploits the best performers from both LIG and FT to build an
optimal model. The ITO algorithm achieves generalization and reliability by addressing data scarcity
problems and producing HAHR models.
The rest of the paper is organized as follows: Section 2 provides a background of the microarray-based
cancer classification domain and a literature review, Section 3 presents the proposed algorithm,
Section 4 describes the experimental setup, Section 5 discusses the results and analysis, and
Section 9 presents the conclusions and future directions.
cluster with the highest accuracy and lowest root mean square for further processing. This approach
may not scale well for a very large number of features because of computational and working memory
requirements. It will further require a way to strike a balance between the cluster size and the number
of clusters required for such large datasets.
other methods for music sentiment analysis, bird acoustics, and scenery datasets. Again, the diversity
of problems it addresses shows the potential of heterogeneous ensembles to overcome NFL constraints.
In 2019, Yu et al. proposed a novel method using medical imaging, advanced machine learning
algorithms, and Heterogeneous Ensembles to accurately predict diagnostically complex cases of cancer
patients. They also used this system to explain what imaging features make them difficult to diagnose
even with typical Computer-Aided Diagnosis (CAD) programs [39]. Their work takes lung images as
input, performs segmentation of the image, and extracts features from them. These features are used
to train the heterogeneous base classifiers and build an ensemble of trained classifiers. Their work
improved the overall prediction accuracy to 88.90% as opposed to the highest accuracy reported in
literature as 81.17%.
In 2019, Liao et al. presented a novel Multi-task Deep Learning (MTDL) method that can reliably
predict rare cancer types by exploiting cross-cancer gene-expression profiling [21]. They used different
datasets, one for each type of cancer, and common hidden layers that are extracted from these datasets
to train the model. The trained model’s learning is then transferred as additional input to the prediction
model. Their work showed significant improvement in correct diagnosis when there is inadequate
data available. The performance improvements were evident in all but the Leukemia database where
multi-class data are used. The proposed model learns common features from 12 different types of
cancers to effectively exploit the right features for a given cancer type. Their work also showed the way
to generalize a model across cancer-type and across datasets. The simplified approach of combining
single task learners through a DNN and use of Transfer learning makes it a scalable model for two-class
problems. For multi-class problems, further improvement will need to be done.
3. Proposed Algorithm
The proposed algorithm is inspired by warzone tactics. It comprises the Four Fs (Find, Fix,
Flank/Fight, and Finish) of basic war strategy for infiltration into enemy areas, i.e., a small
light-infantry group (LIG) backed by follow-up troops (FT) is used to conquer the area.
In our case, the LIG members are parameter-free classifiers that can be trained quickly to classify
a sample with reasonable accuracy and reliability. The LIG members independently identify
enemy weak-points and choose their own routes, targets, movements, and methods of attack. The
overall approach does not restrict the user to any particular classifiers, and any set of parameter-free
classifiers can be used. For this research, Decision Tree Classifier (DTC) [43], Adaptive Boosting
(AdaBoost) [13,44–46], and Extra Tree Classifier (also known as Extremely Randomized Trees) [47] were
used as LIG members with default settings.
The “successful” LIG members are then combined to form a heterogeneous ensemble which
can reliably classify a given unseen sample. In parallel, the FT applies heavier and sophisticated
techniques (i.e., parameter tuning) to find a better model. Random Forest [48], Deep Neural Network
(DNN) a.k.a. Multi-layer Perceptron (MLP) [16,49] and Support Vector Machine (SVM) [50–52] were
used as FT members with Grid Search and Random Grid Search for parameter tuning for binary
classification. The “successful” FT members are used to update the overall ensemble for enhanced
accuracy and reliability.
In the following text, we map the Four Fs (i.e., Find, Fix, Flank/Fight, and Finish stages) onto the
proposed algorithm:
1. Find: In this stage, a random grid search is applied on the 4-dimensional search space comprising
pre-processing methods, FSS methods, subset sizes, and validation methods to generate “attack
vectors” (tuples of length 4, each a different combination drawn from the search space) for LIG and
FT members, e.g., (Quantile method, mRMR, 50 features, 10-Fold CV) is one such tuple. Details of
the options used for each of these dimensions are given below.
2. Fix: In this stage, each LIG member uses one of the attack vectors to construct an individual
model. An efficiency index ρ is calculated from the Matthews Correlation Coefficient (MCC) and
the average classification accuracy (score) as:
$$\rho_{LIG(i)} = MCC_{LIG(i)} \times score_{LIG(i)} \qquad (1)$$
where LIG(i) is the i-th member of LIG. Similar to the MAQC-II benchmarks, MCC and accuracy
are used to compute $\rho_{LIG}$. In addition, our analysis from earlier experimentation showed that,
in the case of overfitting, the average accuracy/score of the model seemingly improves
while the MCC of the model simultaneously decreases. Hence, these two measures were used to
decide the trade-off between accuracy and MCC at the time of base classifier selection. The value
of MCC ranges between −1 and +1, but, for our experimentation, we used only the range (0, 1],
i.e., MCC > 0, which is anything better than a random guess. The accuracy lies in the range [0, 1].
Both measures are equally important, thus we use their product as a statistical conjunction function.
It helps balance the trade-offs between MCC and accuracy. In our experiments, we observed that
$\rho_{LIG}$ helped in improving both the MCC and accuracy in some cases and helped achieve a good
trade-off between MCC and accuracy in other cases. A fitness threshold $e_{LIG}$ is used to filter in
“successful” members from the whole LIG, i.e.,
$$LIG_{Filtered} = \{ LIG(i) : \rho_{LIG(i)} > e_{LIG} \} \qquad (2)$$
The value of $e_{LIG}$ is chosen such that it filters in at least the top 33% of the LIG members for
$LIG_{Ensemble}$. Once the ensemble is formed, the value of e can be adjusted to tune the ensemble for
maximum $\rho_{LIG\text{-}Ensemble}$ yield as explained below.
3. Flank/Fight: In this stage, a heterogeneous ensemble of a subset of “successful” LIG members
(MCC > 0) is formed such that:
$$\rho_{LIG\text{-}Ensemble} \geq \rho_{LIG(i)} \quad \forall i \qquad (3)$$
The ensemble is formed iteratively using a majority-vote method. In each iteration, the top LIG(i)
is added to the $LIG_{Ensemble}$ and $\rho_{LIG\text{-}Ensemble}$ is computed to ensure that the newly
added LIG(i) did not deteriorate the ensemble performance. If an LIG(i) causes a decline in
$\rho_{LIG\text{-}Ensemble}$, it is discarded.
The $LIG_{Ensemble}$ takes a relatively short time to build, while each FT(i) may take several hours
to days to train (depending upon the parameter space). Thus, for time-sensitive cases, e.g.,
in the domain of pandemic diseases where an early prediction may be required, $LIG_{Ensemble}$ can
be used while the FT(i) are being trained. When the FT(i) are trained and the ensemble is updated
for improved performance, a follow-up prediction can be made, which will either strengthen
the confidence in the prediction if both $LIG_{Ensemble}$ and $Ensemble_{Final}$ agree, or
$Ensemble_{Final}$ can be used to override the earlier prediction.
4. Finish: In this stage, the FT members apply advanced classifiers such as Deep Neural Networks,
SVM, etc. to build fine-tuned models. The “successful” FT members are filtered in using:
$$\rho_{FT(i)} = MCC_{FT(i)} \times score_{FT(i)} \qquad (4)$$
where FT(i) is the i-th member of FT. A fitness threshold $e_{FT}$ is used to filter in “successful” FT
members, i.e.,
$$FT_{Filtered} = \{ FT(i) : \rho_{FT(i)} > e_{FT} \} \qquad (5)$$
Finally, an $Ensemble_{Final}$ is formed using the filtered-in LIG(i) and the filtered-in FT(i). The
following approaches can be used to build the $Ensemble_{Final}$:
(a) Simply combine all the LIG and FT members from $LIG_{Ensemble}$ and $FT_{Ensemble}$, respectively.
However, through empirical analysis, it was found that this approach actually causes a
decline in MCC and/or average accuracy of the model.
(b) Start with one of $LIG_{Ensemble}$ or $FT_{Ensemble}$ and call it $Ensemble_{Final}$. Choose base
classifiers from the other ensemble with $\rho \geq \rho_{Final\text{-}Ensemble}$ and add them to
$Ensemble_{Final}$. However, starting with the ensemble with the higher ρ would cause all candidates
from the other ensemble to fail the condition $\rho \geq \rho_{Final\text{-}Ensemble}$, thus resulting in
no further improvement. In addition, our experiments showed that, when starting with the ensemble
with the lower ρ, the optimization gain was not as good as with the next approach because the
condition $\rho \geq \rho_{Final\text{-}Ensemble}$ filtered out many classifiers which could still help
reduce the misclassifications of the ensemble and hence improve both the accuracy and MCC.
(c) Rebuild the $Ensemble_{Final}$ from scratch using LIG(i) ∪ FT(i) ordered by ρ. This approach
was found effective in further enhancing the performance.
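The efficiency index and the threshold filter described in the Fix and Finish stages can be
sketched in plain Python as follows. This is only an illustrative sketch and not the authors'
implementation: the function names (`mcc`, `accuracy`, `efficiency_index`, `filter_successful`),
the `top_percent` parameter, and the member names in the usage lines are hypothetical, and the
filter assumes that a rank cut-off at the top 33% of members with ρ > 0 is an acceptable stand-in
for choosing the threshold e.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def efficiency_index(y_true, y_pred):
    """rho = MCC x accuracy, the product described in the text."""
    return mcc(y_true, y_pred) * accuracy(y_true, y_pred)


def filter_successful(rho_by_member, top_percent=33):
    """Keep members with rho > 0, limited to the top `top_percent` of all members.

    Assumption: the fitness threshold is realised as a rank cut-off.
    """
    n_keep = (len(rho_by_member) * top_percent + 99) // 100  # ceiling division
    ranked = sorted(rho_by_member.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, rho in ranked[:n_keep] if rho > 0]


# Toy predictions: MCC = 1/3 and accuracy = 2/3, so rho = 2/9.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
rho = efficiency_index(y_true, y_pred)

# Hypothetical rho values for six trained members.
scores = {"dt": 0.74, "ada": 0.80, "et": -0.10, "m4": 0.20, "m5": 0.50, "m6": 0.10}
successful = filter_successful(scores)
```

Because ρ is a product of two values that both lie in (0, 1] for a filtered-in member, it can never
exceed either of them, which is why it is only meaningful as a relative ranking score.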
While the proposed algorithm is flexible to allow the choice of any classifiers, the pre-processing
method, validation method, subset size, and FSS methods, etc., the following configurations were used
in this study for LIG and FT members to carry out the Four Fs.
The imputer method [51] was used for data normalization. During feature exclusion, the features
with any missing values were completely removed from the dataset because (i) features are already
abundant and (ii) due to missing values, these features do not represent the sample
space sufficiently. For scaling the data, the Quantile, Robust, and Standard methods
were used [51].
Although the most recent studies [53,54] have shown multi-objective feature selection methods to
outperform single-objective methods, their implementations are not widely
available for public use. Thus, for feature subset selection (FSS), also known as variable selection,
three publicly available single-objective methods, namely Joint Mutual Information (JMI), Joint Mutual
Information Maximization (JMIM), and minimum Redundancy Maximum Relevance (mRMR), were
used [6–8,51]. The minimum number of features that should be chosen largely depends upon the
dataset being used. For basic techniques like JMI and JMIM, the produced subset may contain
some level of redundancy, whereas mRMR ensures that the chosen features in a subset have minimum
redundancy and maximum relevance to the class label [6,8]. These are brute-force techniques: all
the features are considered in order to compute a ranked list of features based on statistical
relevance, which makes this a computationally expensive step [8]. These algorithms were selected
because of their out-of-the-box availability for Python, which avoided an implementation from scratch.
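To make the “minimum redundancy, maximum relevance” idea concrete, the following is a minimal
sketch of a greedy selection using the difference form of the mRMR criterion [6,8] (relevance to the
class label minus mean redundancy with the already selected features). It is not the library
implementation used in this study. It assumes that the expression values have already been
discretized, and the names `mutual_information`, `mrmr`, `g1`, `g2`, and `g3` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr(features, labels, k):
    """Greedily pick k feature names from a dict of name -> discrete values."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            # Mean mutual information with the features chosen so far.
            redundancy = (
                sum(mutual_information(col, features[s]) for s in selected) / len(selected)
                if selected
                else 0.0
            )
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data: g2 duplicates g1, while g3 carries complementary information.
features = {
    "g1": [0, 0, 1, 1, 0, 0, 1, 1],
    "g2": [0, 0, 1, 1, 0, 0, 1, 1],
    "g3": [0, 1, 0, 1, 0, 1, 0, 1],
}
labels = [0, 1, 1, 1, 0, 1, 1, 1]  # label = g1 OR g3
chosen = mrmr(features, labels, k=2)
```

In this toy case the duplicate feature is never chosen alongside its twin, because its redundancy
term cancels its relevance. Each greedy step evaluates every remaining feature against every
selected one, which illustrates why the step is expensive for hundreds of thousands of features.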
For validation, 10-Fold Cross Validation (CV) and Leave-one-out CV (LOOCV) were considered,
both of which have been proven in the literature to be amongst the best validation techniques [36].
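The relationship between the two validation schemes can be illustrated with a small index
generator. This is a sketch under simplifying assumptions (no shuffling and no stratification), and
the helper names `k_fold_indices` and `loocv_indices` are hypothetical rather than taken from the
study's code.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    # Spread the remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def loocv_indices(n_samples):
    """Leave-one-out CV is k-fold CV with k equal to the number of samples."""
    return k_fold_indices(n_samples, n_samples)


ten_fold = list(k_fold_indices(25, 10))
leave_one_out = list(loocv_indices(5))
```

With only a few hundred samples at most, LOOCV trains one model per sample, so it uses nearly all the
data for training in every round at the cost of many more training runs than 10-fold CV.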
Pseudo Code
The ITO Algorithm (Algorithm 1) computes $LIG_{Ensemble}$ using Algorithm 2. This produces an
initial baseline result which is either the best of the LIG members or an improved output from the
ensemble. The grid G on line 2 of Algorithm 1 consists of 4-tuples, i.e., it is four elements wide,
and the length of G is |preps| × |searchRadius| × |searchStrategy| × |successEvaluation| so that it
holds all possible combinations of these four sets. The variables $t_{LIG}$ and $t_{FT}$ (sets of
4-tuples) are subsets of G; for our experiments, we used half the size of G for each.
ComputeEnsembleFinal (Algorithm 3) conditionally updates this ensemble using a ranked list of FT(i)
if they improve the overall results.
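The construction of the grid G and of the randomly drawn attack-vector subsets can be sketched
with the standard library. The option lists below only use values that are mentioned in this paper,
but their grouping into four lists, the tuple order, the helper names, and the seeds are assumptions
made for illustration.

```python
import itertools
import random

# The four dimensions of the search space (illustrative option lists).
preps = ["Quantile", "Robust", "Standard"]
fss_methods = ["JMI", "JMIM", "mRMR"]
subset_sizes = [10, 50, 250]
validations = ["10-Fold CV", "LOOCV"]


def build_grid(*dimensions):
    """Return every combination of the given option lists as a list of tuples."""
    return list(itertools.product(*dimensions))


def sample_attack_vectors(grid, fraction=0.5, seed=0):
    """Draw a random subset of the grid without replacement."""
    rng = random.Random(seed)
    return rng.sample(grid, int(len(grid) * fraction))


grid = build_grid(preps, fss_methods, subset_sizes, validations)
t_lig = sample_attack_vectors(grid, seed=1)  # attack vectors for LIG members
t_ft = sample_attack_vectors(grid, seed=2)   # attack vectors for FT members
```

Here the grid has 3 × 3 × 3 × 2 = 54 tuples, and each subset holds half of them. Drawing the two
subsets independently means that LIG and FT members may or may not share attack vectors.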
Algorithm 2: ComputeLIGEnsemble
input :
    T: t × f matrix - training dataset with t samples and f features;
    V: v × f matrix - validation dataset with v samples and f features;
    $t_{LIG}$: subset of configuration tuples, each representing a combination of a preprocessing
        method, FSS size, FSS method, and validation method;
    LIGOptions = {DT, AdaBoost, Extra Tree, ...} - set of parameter-free classifiers
output : $LIG_{Ensemble}$
BEGIN
    $LIG_{Ensemble}$ ← {};
    Train ∀ LIG(i) as LIG ∈ LIGOptions using every tuple from $t_{LIG}$;
    Compute ∀ $\rho_{LIG(i)}$ using Equation (1);
    Sort descending on $\rho_{LIG(i)}$;
    Pick the top (50 OR 33%, whichever is larger) LIG members;
    if $\rho_{LIG(i)}$ > $e_{LIG}$ then
        //i.e., Equation (2);
        $LIG_{Filtered}$ ← $LIG_{Filtered}$ ∪ LIG(i);
    Update $LIG_{Ensemble}$ such that $\rho_{LIG\text{-}Ensemble}$ ≥ $\rho_{LIG(i)}$ using Equation (3);
    return $LIG_{Ensemble}$;
END
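The iterative ensemble update in Algorithm 2 can be sketched as a greedy forward selection over
the ranked members. This is a simplified illustration, not the authors' code: it works directly on
stored validation-set predictions instead of trained classifiers, breaks majority-vote ties in favour
of class 1, and uses hypothetical names (`majority_vote`, `build_ensemble`, `m_a` to `m_d`).

```python
import math


def efficiency_index(y_true, y_pred):
    """rho = MCC x accuracy for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return mcc * (tp + tn) / len(pairs)


def majority_vote(predictions):
    """Sample-wise majority vote over members; ties go to class 1 (assumption)."""
    n_members = len(predictions)
    return [1 if 2 * sum(col) >= n_members else 0 for col in zip(*predictions)]


def build_ensemble(members, y_true):
    """Add members in descending rho order, discarding any that lower the ensemble rho."""
    ranked = sorted(members, key=lambda m: efficiency_index(y_true, members[m]), reverse=True)
    ensemble, best_rho = [], -math.inf
    for name in ranked:
        candidate = ensemble + [name]
        rho = efficiency_index(y_true, majority_vote([members[m] for m in candidate]))
        if rho >= best_rho:  # keep the member only if performance did not decline
            ensemble, best_rho = candidate, rho
    return ensemble, best_rho


# Toy validation labels and stored predictions of four hypothetical members.
y_val = [1, 1, 1, 1, 0, 0, 0, 0]
members = {
    "m_a": [1, 1, 1, 0, 0, 0, 0, 1],
    "m_b": [1, 1, 0, 1, 0, 0, 1, 0],
    "m_c": [1, 0, 1, 1, 0, 1, 0, 0],
    "m_d": [0, 0, 0, 0, 1, 1, 1, 1],
}
ensemble, ensemble_rho = build_ensemble(members, y_val)
```

In this toy case the three moderately accurate members err on different samples, so their vote
is perfect, while the fourth member would lower ρ and is discarded.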
4. Experimental Setup
The ITO Algorithm was run on each of the Datasets A, B, and C to compute the $Ensemble_{Final}$ for
each dataset.
Table 2. e Thresholds.
Dataset   |LIG|   LIG(i) Accuracy   LIG(i) MCC   LIG Ensemble Accuracy   LIG Ensemble MCC
A         98      75%               0.89         95%                     0.90
B         7       69%               0.18         72%                     0.21
C         80      88%               0.76         90%                     0.82
Note that the value of ρ (Equation (1)) will always fall below the accuracy and MCC, but, due to
its relative nature, the max value of ρ will always indicate the best LIG(i) (or FT(i)).
5.2. FT Optimizations
As a next step, the FTs were trained on each dataset under the same configuration options
except for the set of classifiers, which, in this case, were parameterized classifiers, each requiring
its own parameter tuning. Figure 3 shows the filtered-in FT(i) and their ρ. For Dataset A, the top 75
FT(i) were filtered in based on their ρ. Similar to the LIG(i) selection, as a heuristic, the top 33% of
“successful” FT(i) were chosen to construct the $FT_{Ensemble}$, which achieved an accuracy of 97% as
compared to an average accuracy of 70% (ranging from 58–79%) and an MCC of 0.92 as compared to an
average MCC of 0.82 (0.65–0.90). For Dataset C, the $FT_{Ensemble}$ improved the average accuracy to
91% and the average MCC to 0.84, as shown in Figure 3c. For Dataset B, like before, only 12 members
were chosen from the FTs due to poor MCC values for all other members. The accuracies, MCC, and
$\rho_{FT\text{-}Ensemble}$ of FT(i) can be seen in Figure 3b. Dataset B is a hard dataset to model [12],
and it might be possible to get better individual results through the use of advanced base classifiers
such as CNN or PSO based implementations. The limited choice of LIG and FT models used in this study
(due to their out-of-the-box availability) did not produce higher MCC. ITO works independently of
these choices, hence any better models can be used as members of both LIG and FT. The proposed
optimization method was still able to produce comparable overall accuracy and enhance the MCC value
through optimization as shown in Figure 3c. Table 4 shows that, for dataset B, the ITO algorithm
produced a HAHR model with comparable reliability and accuracy. Table 4 summarizes the performance
improvements through $FT_{Ensemble}$:
Dataset   |FT|   FT(i) Accuracy   FT(i) MCC   FT Ensemble Accuracy   FT Ensemble MCC
A         199    70%              0.82        97%                    0.92
B         12     68%              0.16        72%                    0.08
C         108    90%              0.81        91%                    0.84
From the raw results, it was interesting to note that a noticeable majority of the successful FT(i)
used RandomForest, followed by a relatively small number of FT(i) using SVM. The $FT_{Ensemble}$
constructed from the FT(i) resulted in a relatively high efficiency index as shown in Figure 4a. It is
also interesting to note that, except for the first few LIG(i) and FT(i), the MCC values and average
accuracies of the individual LIG(i) or FT(i) seemed to be inversely related to each other, i.e., the
higher the accuracy, the lower the reliability, and vice versa. This is a clear indication of
over/under-fitting of the individual LIG(i) or FT(i).
Finally, Figure 5 shows that the proposed algorithm produced an overall best result.
It is interesting to note that, instead of choosing only $\rho_{FT(i)} > \rho_{LIG\text{-}Ensemble}$,
updating the $LIG_{Ensemble}$ with the top FT(i) without this constraint improved the
$\rho_{LIG\text{-}Ensemble}$, i.e., $\rho_{overall\text{-}Ensemble} \geq \max(\rho_{LIG\text{-}Ensemble},
\rho_{FT\text{-}Ensemble}, \rho_{LIG(i)}, \rho_{FT(i)})$. Tables 5 and 6 show the values of, and the
percentage improvement in, MCC, accuracy, and ρ of the ITO Tuned Ensemble against LIG(i), FT(i),
$LIG_{Ensemble}$, $FT_{Ensemble}$, and the Combined Ensemble (i.e., an ensemble of all LIG(i) and
FT(i)), respectively. Tables 7 and 8 show that, for datasets A and C respectively, the ITO algorithm
produced a HAHR model with significantly higher reliability and accuracy, whereas, for dataset B
(Table 9), the accuracy increased but the MCC decreased slightly.
Dataset   Measure          LIG(i)   FT(i)   LIG Ensemble   FT Ensemble   LIG & FT Combined Ensemble   ITO Tuned Ensemble
                           (X1)     (X2)    (X3)           (X4)          (X5)                         (X6)
A         Accuracy         0.75     0.70    0.95           0.97          0.98                         0.99
A         MCC              0.89     0.82    0.90           0.92          0.95                         0.97
A         Efficiency (ρ)   0.76     0.69    0.86           0.89          0.93                         0.96
B         Accuracy         0.69     0.68    0.71           0.72          0.71                         0.75
B         MCC              0.18     0.16    0.06           0.08          0.00                         0.33
B         Efficiency (ρ)   0.15     0.14    0.05           0.06          0.00                         0.24
C         Accuracy         0.88     0.90    0.90           0.91          0.90                         0.92
C         MCC              0.76     0.81    0.82           0.84          0.83                         0.87
C         Efficiency (ρ)   0.74     0.76    0.74           0.76          0.75                         0.80
Table 6. All Datasets—ITO Improvement %age over LIG(i), FT(i), LIG Ensemble, FT Ensemble, and
Combined Ensemble.
Figure 5 and Tables 5 and 6 show that ITO was able to enhance both the accuracy as well as MCC
(and hence ρ) for all the datasets regardless of the base LIG and FT classifiers.
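As a quick consistency check, the ensemble columns of Table 5 (X3 to X6) can be compared against
the product of accuracy and MCC, and a relative improvement can be computed from them. The sketch
below is illustrative only: the relative-change formula in `pct_improvement` is an assumption about
how an improvement percentage could be computed, not the confirmed definition behind Table 6, and
the function names are hypothetical. The individual-member columns (X1, X2) are left out because the
surrounding text describes them as averages over members, and a product of averages need not equal
an average of products.

```python
# Ensemble columns of Table 5: LIG (X3), FT (X4), LIG & FT Combined (X5), ITO Tuned (X6).
table5 = {
    "A": {"accuracy": [0.95, 0.97, 0.98, 0.99],
          "mcc": [0.90, 0.92, 0.95, 0.97],
          "rho": [0.86, 0.89, 0.93, 0.96]},
    "B": {"accuracy": [0.71, 0.72, 0.71, 0.75],
          "mcc": [0.06, 0.08, 0.00, 0.33],
          "rho": [0.05, 0.06, 0.00, 0.24]},
    "C": {"accuracy": [0.90, 0.91, 0.90, 0.92],
          "mcc": [0.82, 0.84, 0.83, 0.87],
          "rho": [0.74, 0.76, 0.75, 0.80]},
}


def max_product_gap(table):
    """Largest absolute difference between accuracy x MCC and the reported rho."""
    return max(
        abs(acc * mcc - rho)
        for row in table.values()
        for acc, mcc, rho in zip(row["accuracy"], row["mcc"], row["rho"])
    )


def pct_improvement(baseline, tuned):
    """Relative change in percent; undefined (None) for a zero baseline."""
    if baseline == 0:
        return None
    return 100.0 * (tuned - baseline) / baseline


gap = max_product_gap(table5)
# Accuracy gain of the ITO Tuned Ensemble (X6) over the LIG Ensemble (X3) on Dataset A.
gain_a = pct_improvement(table5["A"]["accuracy"][0], table5["A"]["accuracy"][3])
```

The reported ρ values agree with the product of the rounded accuracy and MCC to within 0.01, which
is what one would expect from two-decimal rounding of the inputs.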
6. Machine Specifications
The experiments were performed on a shared machine with a 64-bit ASUS GPU, 32 GB RAM, and a
quad-core 64-bit Intel i7-4790K CPU with a speed of 800 MHz–4.4 GHz.
7. Time Complexity
The overall time complexity of the algorithm depends on:
8. Execution Times
FSS was a very time-consuming step because, for the chosen methods, all pairwise correlations
are computed between features to rank the most relevant features for final selection. To stay focused
on the generalization problem, the FSS method was chosen solely considering the availability of
out-of-the-box implementation or library for Python. Table 10 shows the minimum and maximum
times it took to generate FSS for datasets A, B, and C.
Table 10. Min and Max times for FSS Generation for Datasets A, B, and C.
Dataset   Number of Features   Min Time for FSS (Size 10)   Max Time for FSS (Size 250)
A         1,004,004            26 h                         >72 h
B         10,560               50 min                       2.6 h
C         695,556              24 h                         72 h
LIG training and filtering: As can be seen from Figures 6–8, the training time for each filtered-in
LIG(i) was under 15 s.
LIG ensemble formation: In this phase, an ensemble is formed iteratively using the
majority-voting method. The execution time for this step was under 500 s.
FT training and filtering: The training times for the filtered-in FT(i) are much larger
than those for LIG(i), as shown in Figures 9–11. However, the total execution times for ITO also
included training and parameter tuning for SVM and DNN members, which may have been filtered out for
Datasets A and B. For example, for Dataset C, SVM training times fell around 3000 s to 4000 s (about
0.8–1.1 h each) while DNN training times fell around 30,000 s to 35,000 s (about 8.3–9.7 h each).
approach which balances the exploration through LIG and exploitation through FT to find a promising
initial baseline and optimizes the results beyond this baseline. It leaves the choice of underlying LIG
and FT members open to the user. A more advanced LIG or FT selection can further enhance the
optimality of the overall model. Further study can be conducted to apply the proposed algorithm on
datasets other than MAQC-II for wider comparisons.
For the LIG members, both majority-voting and soft ensembles produced the same results.
This is because the underlying classifiers return predicted class labels instead of raw
prediction values. It would be interesting to measure the impact of replacing the predicted class labels
with the raw prediction values for soft ensembles. The advantage of soft ensembles was evident when
they were used for FT members. Another future direction is to cluster the erroneous instances
separately and construct a focused model for those hard instances. Once a model is trained on this
cluster, it can be added to the beginning of the classification pipeline to route the instances
accordingly. The use of GPUs/parallel computing for FSS generation and classification should be
explored to reduce the overall execution time. Finally, the use of LIG as a filtering step for FT
attack vectors should also be explored as a potential area of improvement for the ITO Algorithm.
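The observation about majority-voting and soft ensembles can be illustrated with a small sketch.
When every member only reports a hard class label, the averaged “probabilities” are just vote
fractions, so both schemes agree; they can differ once graded prediction values are available. The
function names, the 0.5 cut-off, and the tie rule below are assumptions made for illustration.

```python
def hard_vote(member_labels):
    """Majority vote over 0/1 labels, one list per member; ties go to class 1."""
    n_members = len(member_labels)
    return [1 if 2 * sum(col) >= n_members else 0 for col in zip(*member_labels)]


def soft_vote(member_probs):
    """Average the members' class-1 scores and threshold the mean at 0.5."""
    n_members = len(member_probs)
    return [1 if sum(col) / n_members >= 0.5 else 0 for col in zip(*member_probs)]


# Case 1: members return class labels only, so their scores are exactly 0.0 or 1.0.
labels = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
one_hot = [[float(v) for v in member] for member in labels]
same_result = hard_vote(labels) == soft_vote(one_hot)

# Case 2: members return graded scores; one confident member can outweigh two unsure ones.
graded = [[0.9], [0.4], [0.4]]
thresholded = [[1 if p >= 0.5 else 0 for p in member] for member in graded]
hard_result = hard_vote(thresholded)
soft_result = soft_vote(graded)
```

In the second case the hard vote follows the two members that lean towards class 0, while the soft
vote follows the mean score of about 0.57 and predicts class 1.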
Author Contributions: Conceptualization, J.Z. and K.Z.; methodology, J.Z.; software, J.Z.; validation, J.Z.;
formal analysis, J.Z.; investigation, J.Z.; resources, J.Z. and K.Z.; data curation, J.Z.; writing—original draft
preparation, J.Z.; writing—review and editing, J.Z. and K.Z.; visualization, J.Z.; supervision, K.Z. All authors have
read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Acknowledgments: We would like to acknowledge and thank Usman Shahid for providing and maintaining
the lab for experimentation, and Sarim Zafar for assisting in the Python setup and the initial setup
of experimentation. We would also like to acknowledge the following open-source contributions that
were used as external libraries: Pairwise distance calculation, Vinnicyus Gracindo; PCA in Python
(http://stackoverflow.com/questions/13224362/principal-component-analysis-pca-in-python); Angle
between two vectors
(http://stackoverflow.com/questions/2827393/angles-between-two-$n$-dimensional-vectors-in-python) and
(https://newtonexcelbach.wordpress.com/2014/03/01/the-angle-between-two-vectors-python-version/);
parallelized MI/JMIM/MRMR implementation (https://github.com/danielhomola/mifs); ACO based FSS
(https://github.com/pjmattingly/ant-colony-optimization).
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Alanni, R.; Hou, J.; Azzawi, H.; Xiang, Y. Deep gene selection method to select genes from microarray
datasets for cancer classification. BMC Bioinform. 2019, 20, 608.
2. Zhao, Z.; Morstatter, F.; Sharma, S.; Alelyani, S.; Anand, A.; Liu, H. Advancing feature selection research.
ASU Feature Sel. Repos. 2010, 1–28, doi 10.1.1.642.5862
3. Elloumi, M.; Zomaya, A.Y. Algorithms in Computational Molecular Biology: Techniques, Approaches and
Applications; John Wiley & Sons: Hoboken, NJ, USA, 2011; Volume 21.
4. Bolón-Canedo, V.; Sánchez-Marono, N.; Alonso-Betanzos, A.; Benítez, J.M.; Herrera, F. A review of
microarray datasets and applied feature selection methods. Inf. Sci. 2014, 282, 111–135.
5. Almugren, N.; Alshamlan, H. A survey on hybrid feature selection methods in microarray gene expression
data for cancer classification. IEEE Access 2019, 7, 78533–78548.
6. Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data.
J. Bioinform. Comput. Biol. 2005, 3, 185–205.
7. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective.
ACM Comput. Surv. (CSUR) 2017, 50, 94.
8. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency,
max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
9. Fakoor, R.; Ladhak, F.; Nazi, A.; Huber, M. Using deep learning to enhance cancer diagnosis and classification.
In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013;
ACM: New York, NY, USA, 2013; Volume 28.
10. Chen, Y.; Li, Y.; Narayan, R.; Subramanian, A.; Xie, X. Gene expression inference with deep learning.
Bioinformatics 2016, 32, 1832–1839.
11. Sevakula, R.K.; Singh, V.; Verma, N.K.; Kumar, C.; Cui, Y. Transfer learning for molecular cancer classification
using deep neural networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018, 16, 2089–2100.
12. Shi, L.; Campbell, G.; Jones, W.D.; Campagne, F.; Wen, Z.; Walker, S.J.; Su, Z.; Chu, T.M.; Goodsaid, F.M.;
Pusztai, L.; et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development
and validation of microarray-based predictive models. Nat. Biotechnol. 2010, 28, 827.
13. Djebbari, A.; Culhane, A.C.; Armstrong, A.J.; Quackenbush, J. AI Methods for Analyzing Microarray Data;
Dana-Farber Cancer Institute: Boston, MA, USA, 2007.
14. Selvaraj, C.; Kumar, R.S.; Karnan, M. A survey on application of bio-inspired algorithms. Int. J. Comput. Sci.
Inf. Technol. 2014, 5, 366–370.
15. Duncan, J.; Insana, M.; Ayache, N. Biomedical Imaging and Analysis in the Age of Sparsity, Big Data,
and Deep Learning. Proc. IEEE 2020, 108, doi:10.1109/JPROC.2019.2956422.
16. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.;
Muller, U.; Zhang, J.; et al. End to end learning for self-driving cars. arXiv 2016, arXiv:1604.07316.
17. Huynh, B.Q.; Li, H.; Giger, M.L. Digital mammographic tumor classification using transfer learning from
deep convolutional neural networks. J. Med. Imaging 2016, 3, 034501.
18. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. Breast cancer histopathological image
classification using Convolutional Neural Networks. In Proceedings of the 2016 International Joint
Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 2560–2567.
doi:10.1109/IJCNN.2016.7727519.
19. Han, Z.; Wei, B.; Zheng, Y.; Yin, Y.; Li, K.; Li, S. Breast cancer multi-classification from histopathological
images with structured deep learning model. Sci. Rep. 2017, 7, 4172.
20. Lévy, D.; Jain, A. Breast mass classification from mammograms using deep convolutional neural networks.
arXiv 2016, arXiv:1612.00542.
21. Liao, Q.; Ding, Y.; Jiang, Z.L.; Wang, X.; Zhang, C.; Zhang, Q. Multi-task deep convolutional neural network
for cancer diagnosis. Neurocomputing 2019, 348, 66–73.
22. Chapman, A. Digital Games as History: How Videogames Represent the Past and Offer Access to Historical Practice;
Routledge Advances in Game Studies, Taylor & Francis: Abingdon, UK, 2016; p. 185.
23. Ikeda, N.; Watanabe, S.; Fukushima, M.; Kunita, H. Itô’s Stochastic Calculus and Probability Theory; Springer:
Tokyo, Japan, 2012.
24. Sato, I.; Nakagawa, H. Approximation analysis of stochastic gradient Langevin dynamics by using
Fokker–Planck equation and Ito process. In International Conference on Machine Learning; PMLR: Beijing,
China, 2014; pp. 982–990.
25. Polley, E.C.; Van Der Laan, M.J. Super Learner in Prediction. U.C. Berkeley Division of Biostatistics Working
Paper Series. Working Paper 266. May 2010. Available online: https://biostats.bepress.com/ucbbiostat/paper266/
(accessed on 15 March 2010).
26. Sollich, P.; Krogh, A. Learning with ensembles: How overfitting can be useful. In Advances in Neural
Information Processing Systems; NIPS: Denver, CO, USA, 1995; pp. 190–196.
27. Shi, L.; Reid, L.H.; Jones, W.D.; Shippy, R.; Warrington, J.A.; Baker, S.C.; Collins, P.J.; De Longueville, F.;
Kawasaki, E.S.; Lee, K.Y.; et al. The MicroArray Quality Control (MAQC) project shows inter- and
intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 2006, 24, 1151.
28. Chen, J.J.; Hsueh, H.M.; Delongchamp, R.R.; Lin, C.J.; Tsai, C.A. Reproducibility of microarray data:
A further analysis of microarray quality control (MAQC) data. BMC Bioinform. 2007, 8, 412.
29. Guilleaume, B. Microarray Quality Control. By Wei Zhang, Ilya Shmulevich and Jaakko Astola. Proteomics
2005, 5, 4638–4639.
30. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and
accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
31. Su, Z.; Łabaj, P.P.; Li, S.; Thierry-Mieg, J.; Thierry-Mieg, D.; Shi, W.; Wang, C.; Schroth, G.P.; Setterquist, R.A.;
Thompson, J.F.; et al. SEQC/MAQC-III Consortium: A comprehensive assessment of RNA-seq accuracy,
reproducibility and information content by the Sequencing Quality Control Consortium. Nat. Biotechnol.
2014, 32, 903–914.
32. Nguyen, X.V.; Chan, J.; Romano, S.; Bailey, J. Effective global approaches for mutual information based
feature selection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; ACM: New York, NY, USA, 2014;
pp. 512–521.
33. Potharaju, S.P.; Sreedevi, M. Distributed feature selection (DFS) strategy for microarray gene expression data
to improve the classification performance. Clin. Epidemiol. Glob. Health 2019, 7, 171–176.
34. Wang, Z.; Palade, V.; Xu, Y. Neuro-fuzzy ensemble approach for microarray cancer gene expression data
analysis. In Proceedings of the 2006 International Symposium on Evolving Fuzzy Systems, Ambleside, UK,
7–9 September 2006; pp. 241–246.
35. Chen, W.; Lu, H.; Wang, M.; Fang, C. Gene expression data classification using artificial neural network
ensembles based on samples filtering. In Proceedings of the 2009 International Conference on Artificial
Intelligence and Computational Intelligence, Shanghai, China, 7–8 November 2009; Volume 1, pp. 626–628.
36. Bosio, M.; Salembier, P.; Bellot, P.; Oliveras-Verges, A. Hierarchical clustering combining numerical
and biological similarities for gene expression data classification. In Proceedings of the Engineering
in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, Osaka,
Japan, 3–7 July 2013; pp. 584–587.
37. Gashler, M.; Giraud-Carrier, C.; Martinez, T. Decision tree ensemble: Small heterogeneous is better than
large homogeneous. In Proceedings of the 2008 Seventh International Conference on Machine Learning and
Applications, San Diego, CA, USA, 11–13 December 2008; pp. 900–905.
38. Wu, Y. Multi-Label Super Learner: Multi-Label Classification and Improving Its Performance Using Heterogenous
Ensemble Methods; Wellesley College: Wellesley, MA, USA, 2018.
39. Yu, Y.; Wang, Y.; Furst, J.; Raicu, D. Identifying Diagnostically Complex Cases Through Ensemble Learning.
In International Conference on Image Analysis and Recognition (ICIAR); Lecture Notes in Computer Science,
Volume 11663; Springer: Cham, Switzerland, 2019; pp. 316–324.
40. Ayadi, W.; Elloumi, M. Biclustering of microarray data. In Algorithms in Computational Molecular Biology:
Techniques, Approaches and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2011; pp. 651–663.
41. Mohapatra, P.; Chakravarty, S.; Dash, P. Microarray medical data classification using kernel ridge regression
and modified cat swarm optimization based gene selection system. Swarm Evol. Comput. 2016, 28, 144–160.
42. Ravishankar, H.; Sudhakar, P.; Venkataramani, R.; Thiruvenkadam, S.; Annangi, P.; Babu, N.; Vaidya, V.
Understanding the mechanisms of deep transfer learning for medical images. arXiv 2017, arXiv:1704.06040.
43. Polat, K.; Güneş, S. A novel hybrid intelligent method based on C4.5 decision tree classifier and
one-against-all approach for multi-class classification problems. Expert Syst. Appl. 2009, 36, 1587–1592.
44. Friedman, N.; Linial, M.; Nachman, I.; Pe’er, D. Using Bayesian networks to analyze expression data.
J. Comput. Biol. 2000, 7, 601–620.
45. Hastie, T.; Rosset, S.; Zhu, J.; Zou, H. Multi-class adaboost. Stat. Its Interface 2009, 2, 349–360.
46. Kégl, B. The return of AdaBoost.MH: multi-class Hamming trees. arXiv 2013, arXiv:1312.6086.
47. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
48. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
49. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436.
50. Jin, C.; Wang, L. Dimensionality dependent PAC-Bayes margin bound. In Advances in Neural Information
Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2012; pp. 1034–1042.
51. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.;
Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011,
12, 2825–2830.
52. Brown, M.P.; Grundy, W.N.; Lin, D.; Cristianini, N.; Sugnet, C.W.; Furey, T.S.; Ares, M.; Haussler, D.
Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl.
Acad. Sci. USA 2000, 97, 262–267.
53. Zhang, Y.; Gong, D.W.; Cheng, J. Multi-objective particle swarm optimization approach for cost-based
feature selection in classification. IEEE/ACM Trans. Comput. Biol. Bioinform. 2015, 14, 64–75.
54. Annavarapu, C.S.R.; Dara, S.; Banka, H. Cancer microarray data feature selection using multi-objective
binary particle swarm optimization algorithm. EXCLI J. 2016, 15, 460.
55. Plagianakos, V.; Tasoulis, D.; Vrahatis, M. Gene Expression Data Classification Using Computational
Intelligence Techniques. 2005. Available online: https://thalis.math.upatras.gr/~dtas/papers/PlagianakosTV2005b.pdf
(accessed on 15 March 2005).
56. Bosio, M.; Bellot, P.; Salembier, P.; Verge, A.O.; et al. Ensemble learning and hierarchical data representation
for microarray classification. In Proceedings of the 13th IEEE International Conference on BioInformatics
and BioEngineering, Chania, Greece, 10–13 November 2013; pp. 1–4.
57. Luo, J.; Schumacher, M.; Scherer, A.; Sanoudou, D.; Megherbi, D.; Davison, T.; Shi, T.; Tong, W.; Shi, L.;
Hong, H.; et al. A comparison of batch effect removal methods for enhancement of prediction performance
using MAQC-II microarray gene expression data. Pharmacogenomics J. 2010, 10, 278–291.
58. Bosio, M.; Bellot, P.; Salembier, P.; Oliveras-Verges, A. Gene expression data classification combining
hierarchical representation and efficient feature selection. J. Biol. Syst. 2012, 20, 349–375.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).