Grid Search-Based Hyperparameter Tuning and Classification of Microarray Cancer Data
Abstract—Cancer is a group of diseases caused by abnormal cell growth. Owing to advances in microarray technology, a large variety of microarray cancer datasets have been produced, opening avenues for research across several disciplines such as statistics, computational biology, genomic studies, and other related fields. The main challenges in analyzing microarray cancer data are the curse of dimensionality, small sample size, noisy data, and class imbalance. In this work, we propose grid search-based hyperparameter tuning (GSHPT) for random forest parameters to classify microarray cancer data. The grid search is defined over a set of fixed parameter values, and the combination that provides the best accuracy under n-fold cross-validation is selected. In our work, 10-fold cross-validation is used. The grid search provides the best parameters, such as the number of features to consider at each split, the number of trees in the forest, the maximum depth of the tree, and the minimum number of samples required to be split at the leaf node. The maximum number of trees considered is 10, 20, and 70 for the Ovarian, 3-class Leukemia, and 2-class Leukemia cancer data, respectively. In the case of MLL and SRBCT, 50 trees are generated to achieve the maximum classification accuracy. The Gini index is employed as the criterion to split the nodes, and the maximum depth of the tree is set to 2 for all datasets. Experimental results of the proposed work show an improvement over state-of-the-art methods. The performance of the proposed method is evaluated using standard metrics such as classification accuracy, precision, recall, f1-score, confusion matrix, and misclassification rate, and a comparative analysis is provided to show the performance of the proposed method.

Index Terms—Grid Search, Random Forest, Feature Selection, Classification, Microarray

I. INTRODUCTION

Since the introduction of microarray technology in the late 1980s, a massive volume of gene expression cancer data has been produced, opening an active research area across several disciplines, including machine learning, pattern recognition, and computational biology, aimed at the diagnosis of cancer patients and the identification and differentiation of cancer types [1]. The main constraints to be addressed in the analysis of microarray cancer data are the curse of dimensionality, noisy data, class imbalance, and small sample size [2]. To address these challenges, research directions include feature selection, dimensionality reduction, classification, and optimization techniques. In this work, we propose grid search-based hyperparameter tuning to optimize the parameters of the random forest classifier and apply it to classify binary and multi-class microarray cancer datasets. The core challenge tackled in this work is to find the hyperparameter values that produce the optimal model. Determining optimal hyperparameters ahead of time is challenging and demands model tuning on a trial-and-error basis. Hyperparameters are variables that govern the training phase and are tuned over several trial-and-error runs to obtain optimized average values. To overcome the overfitting constraint of ordinary grid search, stratified cross-validation is applied, in which samples are divided into K folds at random. The GridSearchCV model from Scikit-learn [3] is used to obtain the best parameters. The focus is on four parameters, namely the maximum number of features considered when splitting a node, the number of estimators (the number of trees in the forest), the split criterion (Gini index), and the depth of the trees in the forest. To validate the method, five standard microarray cancer datasets are used. Extensive experiments are carried out, and promising results are obtained across most of the test datasets. To measure the performance of the method, several standard metrics are employed, such as classification accuracy, precision, recall, f1-score, misclassification error, out-of-bag (OOB) error, and confusion matrix.

The rest of the paper is organized as follows. Section 2 deals with related works. Section 3 and its subsections present the proposed method. Experimental results and discussion are covered in Sections 4 and 5, respectively. Finally, Section 6 covers the concluding remarks.

II. RELATED WORKS

Analysis of microarray cancer data is becoming an active research area across several disciplines, including computer science, computational biology, genomic studies, machine learning, pattern recognition, statistics, and other related fields including engineering. Medjahed et al. [4] proposed a complete cancer diagnostic method through kernel-based learning. Salem et al. [5] proposed a classification of human cancer by combining Information Gain (IG) and Standard Genetic Algorithm (SGA). Liu et al. [6] proposed a hybrid method to handle class imbalance at the feature and algorithmic level. Ramos-González et al. [7] introduced an application of supervised machine learning for classification of cancer via deep learning. Farid et al. [8] proposed an adaptive combination of feature selection with a dissimilarity-based representation paradigm using classifiers such as Decision Tree (DT), Naive Bayes (NB), and KNN. Dashtban and Balafar [9] introduced an evolutionary-based genetic algorithm and AI to identify predictive genes for cancer classification applying
Authorized licensed use limited to: Institut Teknologi Sepuluh Nopember. Downloaded on June 26,2023 at 12:54:56 UTC from IEEE Xplore. Restrictions apply.
978-1-5386-7989-0/19/$31.00 ©2019 IEEE
2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP)
TABLE II: Best parameters of the grid search for all datasets
Dataset            Maximum # of features   Criterion   Maximum depth   # of trees
3-class Leukemia   log2                    Gini        2               20
2-class Leukemia   log2                    Gini        2               70
MLL                None                    Gini        2               50
Ovarian            auto                    Gini        2               10
SRBCT              auto                    Gini        2               50
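
A grid search of the kind summarized in Table II could be set up roughly as follows. This is a minimal sketch, not the authors' exact configuration: the data are synthetic (generated with `make_classification` as a stand-in for a microarray dataset), the grid values are assumptions chosen to cover the entries of Table II, and `"sqrt"` is used in place of `"auto"` because recent scikit-learn releases no longer accept `"auto"` (for classifiers it used to mean the square root of the number of features).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for a microarray dataset: few samples, many features.
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10,
    n_classes=2, random_state=0,
)

# Assumed grid covering the values reported in Table II.
# "sqrt" replaces "auto", which newer scikit-learn versions reject.
param_grid = {
    "max_features": ["sqrt", "log2", None],
    "criterion": ["gini"],
    "max_depth": [2],
    "n_estimators": [10, 20, 50, 70],
}

# Stratified 10-fold cross-validation keeps class proportions in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_  # best combination found on this toy data
best_score = search.best_score_    # mean cross-validated accuracy of that combination
print(best_params, round(best_score, 3))
```

The selected combination depends on the data, so the values printed here will generally differ from those in Table II.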
Fig. 2: OOB error and number of trees for 2-class Leukemia
Fig. 3: OOB error and number of trees for 3-class Leukemia
Fig. 4: OOB error and number of trees for MLL
Fig. 5: OOB error and number of trees for Ovarian
Fig. 6: OOB error and number of trees for SRBCT
Fig. 7: Accuracy of 2-class Leukemia cancer data
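
Curves of OOB error against the number of trees, like those in the figures listed above, can be traced by growing a forest incrementally and reading its out-of-bag score at each size. The sketch below is illustrative only: it uses synthetic data, and the tree counts and the `max_features="log2"` setting are assumptions rather than the authors' exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a microarray dataset.
X, y = make_classification(
    n_samples=80, n_features=200, n_informative=10,
    n_classes=2, random_state=0,
)

# warm_start=True adds trees to the existing forest on each fit call;
# oob_score=True evaluates every sample on the trees that did not see it.
forest = RandomForestClassifier(
    max_depth=2, max_features="log2",
    oob_score=True, warm_start=True, random_state=0,
)

tree_counts = list(range(20, 101, 10))
oob_errors = []
for n_trees in tree_counts:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    oob_errors.append(1.0 - forest.oob_score_)  # OOB error = 1 - OOB accuracy

for n_trees, err in zip(tree_counts, oob_errors):
    print(f"{n_trees:3d} trees: OOB error = {err:.3f}")
```

Plotting `oob_errors` against `tree_counts` gives a curve of the same kind; the number of trees at which the curve flattens is a reasonable candidate for `n_estimators`.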
method achieves an accuracy of 100% on three datasets, namely 2-class Leukemia, Ovarian, and SRBCT, and a test accuracy of 0.97 on both MLL and 3-class Leukemia.
Moreover, the misclassification rate is employed as an evaluation metric, which measures the proportion of misclassified samples in the test set. The misclassification rate is computed as the ratio of false positives plus false negatives to the size of the test set, as shown in Equation 6. It is computed at every 10% increment of the training size, and an average of the 10 errors is computed to
TABLE V: Experimental results of the proposed method: training and test accuracy, precision, recall, and F-score
Dataset            Training accuracy   Test accuracy   Precision   Recall   F-score
2-class Leukemia   1.00                1.00            1.00        1.00     1.00
3-class Leukemia   1.00                0.97            0.97        0.97     0.96
MLL                1.00                0.97            0.97        0.97     0.97
Ovarian            1.00                1.00            1.00        1.00     1.00
SRBCT              1.00                1.00            1.00        1.00     1.00
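
The scores reported in Table V can all be derived from a confusion matrix. The following sketch assumes macro-averaging across classes (the averaging scheme is not stated in the text) and uses a made-up 3-class confusion matrix; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(confusion):
    """Accuracy and macro-averaged precision, recall and F-score.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))

    precisions, recalls, f_scores = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f_scores.append(f)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_score": sum(f_scores) / n,
    }


# Made-up confusion matrix: one sample of class 1 is predicted as class 0.
example_confusion = [
    [10, 0, 0],
    [1, 9, 0],
    [0, 0, 10],
]
metrics = classification_metrics(example_confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With a single misclassified sample out of 30, the accuracy is 29/30, which rounds to 0.97.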
Fig. 8: Accuracy of 3-class Leukemia cancer data
Fig. 9: Accuracy of 3-class MLL cancer data
Fig. 10: Accuracy of 2-class Ovarian cancer data
Fig. 11: Accuracy of 4-class SRBCT cancer data
Fig. 12: Confusion Matrix of 2-class Leukemia
Fig. 13: Confusion Matrix of 3-class Leukemia
get the final error. Even when the classification accuracy is 100%, there is still a misclassification error, which indicates the penalty of overconfidence during correct and wrong predictions. Figures 17, 18, 19, 20, and 21 show the misclassification error rate of the 2-class Leukemia, 3-class Leukemia, MLL, Ovarian, and SRBCT cancer data.

$$E = \frac{FP + FN}{TP + TN + FP + FN} \tag{6}$$
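
Equation 6 can be evaluated directly from confusion counts and then averaged over the ten training-size steps described in the text. The sketch below is only an illustration: the function name `misclassification_rate` and all (TP, TN, FP, FN) counts are hypothetical.

```python
def misclassification_rate(tp, tn, fp, fn):
    """Equation 6: E = (FP + FN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion counts must not all be zero")
    return (fp + fn) / total


# Hypothetical (TP, TN, FP, FN) counts on a 20-sample test set,
# one tuple per training-size step (10%, 20%, ..., 100%).
counts_per_step = [
    (8, 7, 3, 2),
    (8, 8, 2, 2),
    (9, 8, 2, 1),
    (9, 8, 1, 2),
    (9, 9, 1, 1),
    (9, 9, 1, 1),
    (10, 9, 1, 0),
    (10, 9, 0, 1),
    (10, 10, 0, 0),
    (10, 10, 0, 0),
]

errors = [misclassification_rate(*counts) for counts in counts_per_step]
final_error = sum(errors) / len(errors)  # average of the ten per-step errors
print(f"final misclassification error: {final_error:.3f}")
```

Because E equals one minus the accuracy, the same quantity can also be obtained from an accuracy score at each step.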
Fig. 14: Confusion Matrix of MLL
Fig. 17: Misclassification error of 2-class Leukemia cancer data
Fig. 20: Misclassification error of 2-class Ovarian cancer data
Fig. 21: Misclassification error of 4-class SRBCT cancer data
Fig. 22: 5-fold CV ROC of 2-class Leukemia
Fig. 23: 5-fold CV ROC of Ovarian

VI. CONCLUSION

In this work, we propose microarray cancer classification using random forest hyperparameters optimized through a grid search approach. The Random Forest algorithm is optimized to obtain the best parameters, which are then applied to validate the method. The proposed method provides the best values for the maximum number of features considered when splitting a node, the number of decision trees in the forest, the depth of the trees, and the criterion used to split a node into child nodes. To check which parameters lead to maximum classification accuracy and minimum error, the out-of-bag error is employed. In the proposed approach, we have used five standard microarray medical datasets. Performance measures such as classification accuracy, precision, recall, f-score, misclassification error, and confusion matrix are employed to confirm the validity of the method. The experimental results of the proposed method exhibit perfect classification (100%) on three datasets, namely 2-class Leukemia, Ovarian, and SRBCT, and a test accuracy of 0.97 on two datasets, namely 3-class Leukemia and MLL.
[4] S. A. Medjahed, T. A. Saadi, A. Benyettou, and M. Ouali, “Kernel-based learning and feature selection analysis for cancer diagnosis,” Applied Soft Computing, vol. 51, pp. 39–48, 2017.
[5] H. Salem, G. Attiya, and N. El-Fishawy, “Classification of human cancer diseases by gene expression profiles,” Applied Soft Computing, vol. 50, pp. 124–134, 2017.
[6] Z. Liu, D. Tang, Y. Cai, R. Wang, and F. Chen, “A hybrid method based on ensemble WELM for handling multi class imbalance in cancer microarray data,” Neurocomputing, vol. 266, pp. 641–650, 2017.
[7] J. Ramos-González, D. López-Sánchez, J. A. Castellanos-Garzón, J. F. de Paz, and J. M. Corchado, “A CBR framework with gradient boosting based feature selection for lung cancer subtype classification,” Computers in Biology and Medicine, vol. 86, pp. 98–106, 2017.
[8] D. M. Farid, M. A. Al-Mamun, B. Manderick, and A. Nowe, “An adaptive rule-based classifier for mining big biological data,” Expert Systems with Applications, vol. 64, pp. 305–316, 2016.
[9] M. Dashtban and M. Balafar, “Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts,” Genomics, vol. 109, no. 2, pp. 91–107, 2017.
[10] R. Dash and B. B. Misra, “Pipelining the ranking techniques for microarray data classification: A case study,” Applied Soft Computing, vol. 48, pp. 298–316, 2016.
[11] H. Wang, X. Jing, and B. Niu, “A discrete bacterial algorithm for feature selection in classification of microarray gene expression cancer data,” Knowledge-Based Systems, vol. 126, pp. 8–19, 2017.
[12] R. Aziz, C. Verma, and N. Srivastava, “A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data,” Genomics Data, vol. 8, pp. 4–15, 2016.
[13] A. K. Das, S. Das, and A. Ghosh, “Ensemble feature selection using bi-objective genetic algorithm,” Knowledge-Based Systems, vol. 123, pp. 116–127, 2017.
[14] T. Nguyen, A. Khosravi, D. Creighton, and S. Nahavandi, “Hidden Markov models for cancer classification using gene expression profiles,” Information Sciences, vol. 316, pp. 293–307, 2015.
[15] M. Khashei, A. Z. Hamadani, and M. Bijari, “A fuzzy intelligent approach to the classification problem in gene expression data analysis,” Knowledge-Based Systems, vol. 27, pp. 465–474, 2012.
[16] Z. Yu, H. Chen, J. Liu, J. You, H. Leung, and G. Han, “Hybrid k-nearest neighbor classifier,” IEEE Transactions on Cybernetics, vol. 46, no. 6, pp. 1263–1275, 2016.
[17] Z. Y. Algamal and M. H. Lee, “Penalized logistic regression with the adaptive lasso for gene selection in high-dimensional cancer classification,” Expert Systems with Applications, vol. 42, no. 23, pp. 9326–9332, 2015.
[18] M. Dashtban, M. Balafar, and P. Suravajhala, “Gene selection for tumor classification using a novel bio-inspired multi-objective approach,” Genomics, vol. 110, no. 1, pp. 10–17, 2018.
[19] J. Lv, Q. Peng, X. Chen, and Z. Sun, “A multi-objective heuristic algorithm for gene expression microarray data classification,” Expert Systems with Applications, vol. 59, pp. 13–19, 2016.
[20] S. Sasikala, S. A. alias Balamurugan, and S. Geetha, “A novel adaptive feature selector for supervised classification,” Information Processing Letters, vol. 117, pp. 25–34, 2017.
[21] E. Pashaei and N. Aydin, “Binary black hole algorithm for feature selection and classification on biological data,” Applied Soft Computing, vol. 56, pp. 94–106, 2017.
[22] S. Kar, K. D. Sharma, and M. Maitra, “Gene selection from microarray gene expression data for classification of cancer subgroups employing PSO and adaptive k-nearest neighborhood technique,” Expert Systems with Applications, vol. 42, no. 1, pp. 612–627, 2015.