2015-Elsevier-A-Hamming-distance-based-binary-particle-swarm-optimization-HDBPSO-algorithm-for-high-dimensional-feature-selection-classification-and-validation

Uploaded by

chandreshgovind

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views

2015-Elsevier-A-Hamming-distance-based-binary-particle-swarm-optimization-HDBPSO-algorithm-for-high-dimensional-feature-selection-classification-and-validation

Uploaded by

chandreshgovind

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

Pattern Recognition Letters 52 (2015) 94–100

Contents lists available at ScienceDirect

Pattern Recognition Letters

journal homepage: www.elsevier.com/locate/patrec

A Hamming distance based binary particle swarm optimization

(HDBPSO) algorithm for high dimensional feature selection,
classiﬁcation and validation ✩
Haider Banka∗, Suresh Dara
Department of Computer Science and Engineering, Indian School of Mines, Dhanbad 826004, Jharkhand, India

a r t i c l e i n f o a b s t r a c t

Article history: Gene expression data typically contain fewer samples (as each experiment is costly) and thousands of ex-
Received 22 March 2014 pression values (or features) captured by automatic robotic devices. Feature selection is one of the important
Available online 22 October 2014
and challenging tasks for this kind of data where many traditional methods failed and evolutionary based
Keywords:
methods were succeeded. In this study, the initial datasets are preprocessed using a quartile based fast
Feature selection heuristic technique to reduce the crude domain features which are less relevant in categorizing the samples
Hamming distance of either group. Hamming distance is introduced as a proximity measure to update the velocity of particle(s)
High dimensional data in binary PSO framework to select the important feature subsets. The experimental results on three bench-
Binary particle swarm optimization mark datasets vis-á-vis colon cancer, defused B-cell lymphoma and leukemia data are evaluated by means
Classification of classification accuracies and validity indices as well. Detailed comparative studies are also made to show
Stability indices the superiority and effectiveness of the proposed method. The present study clearly reveals that by choosing
proper preprocessing method, fine tuned by HDBPSO with Hamming distance as a proximity measure, it is
possible to find important feature subsets in gene expression data with better and competitive performances.
© 2014 Published by Elsevier B.V.

1. Introduction describe the target concepts. The primary purpose of feature selection
is to design a more compact classifier with little or no performance
One of the important problems in extracting and analyzing in- degradation possibly [3].
formation from large databases is the associated high complexity. There are various approaches for feature selection categorized as
Feature selection is helpful as a pre-processing step for reducing di- filter [4], wrapper [5], embedded [6] and ensemble based [6]. The two
mensionality, removing irrelevant data, improving learning accuracy important issues in feature learning are sparsity [7] and structure
and enhancing output comprehensibility [1,2]. Gene expression data within the features [8–11]. DNA microarray technologies have also
are a typical example of presenting an overwhelmingly large num- been utilized in literature to evaluate the classification, clustering,
ber of features (genes), the majority of which are not relevant to the and feature selection problems [1,12].
description of the problem and could potentially degrade the clas- Feature selection is reported to be NP-hard problem. The high
sification performance by masking the contribution of the relevant complexity of this problem has motivated investigators to apply var-
features. The key informative features represent a base of reduced ious approximation techniques to find near-optimal solutions [13].
cardinality, for subsequent analysis aimed at determining their pos- Xue et al. [14] proposed a multi objective PSO based feature selec-
sible role in the analyzed phenotype. This highlights the importance of tion problem. The authors used 12 benchmark datasets produced
feature selection, with particular emphasis on gene expression data. with comparative study results. However, the datasets used are not
The idea is to retain only those genes that play a major role in arriving convincingly large. The authors [15] proposed a multi objective PSO,
at a decision about the output classes. Feature selection can serve as based on mutual information and entropy as two evaluation criteria,
a pre-processing tool of great importance before solving the classifi- using some low dimensional benchmark datasets, and the number of
cation problems. The selected feature subsets should be sufficient to features ranging from 18 to 42 which are of smaller dimension. Other
PSO based feature selection techniques were reported in literature
✩
[16] using support vector machines (SVM).
This paper has been recommended for acceptance by Qian Xiaoning.
∗
Corresponding author. Tel.: +91 9471191233.
Genetic algorithm (GA) based feature selection algorithm for gene
E-mail addresses: banka.h.cse@ismdhanbad.ac.in (H. Banka), darasuresh@live.in expression data is reported in Ref. [17]. Results are reported on five
(S. Dara). benchmark high dimensional datasets. Multi objective GA (MOGA)

http://dx.doi.org/10.1016/j.patrec.2014.10.007
0167-8655/© 2014 Published by Elsevier B.V.
H. Banka, S. Dara / Pattern Recognition Letters 52 (2015) 94–100 95

based feature selection reported in Ref. [13] for feature selection on from the data and query distributions, and the expected distance be-
high dimensional data shows promising results with non-dominated tween them converges to zero as the dimension k goes to infinity
Variance[d(p,q)]
sorting GA (NSGA-II) along with experimental results and compara- that is, limk→∞ Expected[d (p,q)] ≈ 0. In other words, the distance to the
tive studies with existing methods. The main disadvantage of above nearest neighbor and the distance to the farthest neighbor tend to
method is time to converge (e.g., 15,000 number of generations) converge as the dimension increases [22]. The Euclidean distance has
and at each generation incurs time consuming operations (i.e., non- an intuitive appeal as it is commonly used to evaluate the proximity
dominated sorting, crowding distance calculation etc.). However, the of objects in two or three-dimensional space. It works well when a
proposed method needed only 50 iterations to converge. dataset has compact or isolated clusters [23]. The drawback to di-
Chiang et al. [18] proposed ant colony optimization (ACO) based rect use of the Minkowski metric is the tendency of the largest-scaled
feature selection algorithm to classify the tumor samples. Hybrid ACO feature to dominate others. Solutions to this problem include normal-
for feature selection is reported in Ref. [19] with an effective balance ization of the continuous features (to a common range or variance) or
between exploration and exploitation of ants in global search space. other weighting schemes [24].
The authors have used eight benchmark datasets with varying feature The Hamming distance is the proportion of positions at which two
cardinality from 9 to 2000. binary vector sequences differ. Computing the Hamming distance
Many feature selection algorithms have been proposed, they do between two vector requires two steps: (i) compute the XOR and (ii)
not necessarily identify the same candidate of feature subsets. Even count the number of 1s in the resulting vector. For high dimensional
for the same data, one may find many different feature subsets that data, normal Euclidian distance measure is not a suitable choice as two
can achieve the same prediction accuracy [20]. The stability of feature dissimilar objects may show their similarity in large feature space.
selection algorithms needs to prove evidence that the selected fea- This problem may be dealt with Hamming distance measure to some
tures are relatively robust to the variations in training data. Measuring extent as a fruitful proximity measure.
the stability of feature selection algorithms requires some similarity If feature vectors are high-dimensional, many data structures for
measures for two sets of feature selection results. In this paper, we similarity queries and other tasks may not work properly using Eu-
used some similarity measures based on stability indices related to clidian distance [22]. Many methods (such as feature selection, clus-
this problem [21]. tering, and classification) fail to work due to the high dimensionality
Feature selection is one of the challenging tasks for gene expres- of feature vectors based on their spatial properties [25,26]. This raises
sion datasets. For the same reason, many traditional feature selection the issues of different searching strategies. The Euclidean distance has
methods failed and evolutionary algorithm based methods were suc- an intuitive appeal as it is commonly used to evaluate the proximity
ceeded (such as GA, MOGA, PSO, ACO, etc.). In this study, a fast Ham- of objects in low dimensional space. It works well when a dataset has
ming distance based binary PSO based algorithm is proposed to select compact or isolated clusters. Moreover, when high dimensional data
important features from gene expression data. The initial datasets are is considered, it is not so effective to work through the search space
preprocessed using a quartile based fast heuristic technique to re- and Hamming distance may be good choice in some cases [24]. The
duce the crude domain features which are less important and mostly focus of the current study is to explore further in this direction.
contains redundant values. The present study clearly shows that by
choosing the proper pre-processing technique, fine tuned by the pro- 2.2. Stability measurements
posed HDBPSO algorithm on the pre-processed data and by incorpo-
rating Hamming distance as a proximity measure to update velocity of The selection stability is a desired characteristic for feature se-
particle, it is possible to find important features with better accuracy lection algorithms. Since the target concept of a datum is fixed, the
and/or competitive performance which have been further validated relevant features should not change across different samples of the
by 10-fold cross validation method. data. Several stability measurements have been proposed to calcu-
The rest of this paper is organized as follows. Section 2 describes late stability of feature selection algorithms and these methods are
the preliminaries of Hamming distance as a proximity measure and categorized as: index-based, rank-based and weight-based.
stability indices on the chosen feature subsets relevant to this study. In this study, few commonly used stability measurements such as
The proposed HDBPSO algorithm for feature selection incorporating Dice, Tanimoto, Jaccard’s and Kuncheva indices are discussed below:
preprocessing of gene expression data, fitness function and algorithm
implementation are described in Section 3. The experimental results 1. Dice’s coefficient: It is used to calculate the overlap between two
on colon, lymphoma and leukemia data, along with the external vali- feature sets. Dice index takes value between 0 and 1, where 0
dation using with different machine learning classifiers, are presented means no overlap and 1 means the two sets are identical. The Dice
in Section 4. Finally, Section 5 concludes this paper. index between two feature sets F1 and F2 is given by: Dice(F1 , F2 ) =
2|F1 ∩F2 |
|F1 |+|F2 |
2. Preliminaries 2. Tanimoto distance and Jaccard’s index: Measures the amount
of overlap between two datasets and produces value in the
This section explains some basics of Hamming distance and binary same range as Dice index and defined as: Tanimoto(F1 , F2 ) =
|F1 |+|F2 |−2|F1 ∩F2 | 2|F ∩F |
particle swarm optimization as a continuation and understanding of 1− F1 |+|F2 |−|F1 ∩F2 |
, Jaccard(F1 , F2 ) = |F 1∪F 2| . In general, Dice,
the present work. 1 2
Tanimoto, and Jaccard indices behave similarly in all cases although
it is noticeable that Dice index sometimes gives slightly higher and
2.1. Hamming distance more meaningful stability results with respect to the intersection
between the two subsets and can deal with sets of different car-
The curse of dimensionality has a direct bearing on similarity dinalities. Beside that, they do not take the dimensionality m in
searching in high dimensions in the sense that it raises the issue account, yet, they comprise the number of selected features k in
of whether or not nearest neighbor searching is even meaningful in the measurement.
such a domain. One reason is that most of us are not particularly 3. Kuncheva index (KI): As the cardinality of the selected feature sub-
adept at visualizing high-dimensional data. In particular, letting d de- set increases hence the chance of overlap between them increases
note a distance function that need not necessarily be a metric [22], the as well. KI helps to avoid intersection by chance between the two
nearest neighbor searching is not meaningful when the ratio of the subsets of the features to overcome the drawback of the previous
variance of the distance between two random points p and q, drawn measurements. KI’s result values in the range of [−1,1], where 1
96 H. Banka, S. Dara / Pattern Recognition Letters 52 (2015) 94–100

means that F1 and F2 are identical which means the cardinality of For a distinction table F with N condition attributes and a sin-
the intersection set equals k. KI achieves −1 when there is no com- gle decision attribute d, the problem of finding a minimal subset
monalities between the lists and k = m 2 . KI assumes values close of columns R(⊆ {1, 2, . . . , N}) in the distinction table (1), satisfying
to zero for independently drawn lists [27]. The KI is defined by ∀(k, j)∃i ∈ R : b((k, j), i) = 1, whenever d(xk ) = d(xj ).
2|F1 ∩F2 |.m−k2 So, with this, we may consider the distinction table to consist of N
KI(F1 , F2 ) = k(m−k)
.
columns, and rows corresponding to only those object pairs (xk , xj )
such that d(xk ) = d(xj ).
3. The proposed approach
(i) As object pairs corresponding to the same class do not constitute
In this section, the preprocessing of gene expression data, gener- a row of the distinction table, there is a considerable reduction
ation of distinction table, formulation of fitness functions, and finally in its size thereby leading to a decrease in computational cost.
the proposed HDBPSO algorithm are described with illustration. (ii) Additionally, if either of the objects in a pair has ‘∗’ as an entry
under an attribute in table Fr then in the distinction table, put
3.1. Preprocessing of gene expression data ‘0’ at the entry for that attribute and pair.
(iii) The entries ‘1’ in the matrix correspond to the attributes of in-
Gene expression data typically consist of huge number of features terest for arriving at a classification decision.
and limited number of samples. The majority of features were not
relevant to the description of the problem, hence could potentially Let the number of objects initially in the two classes be C1 and C2 re-
degrade the classification performance by masking the contribution spectively. Then the number of rows in the distinction table becomes
of the relevant features. Preprocessing directs to eliminating of am- (C1 ∗ C2 ) < C∗(C−1
2
) , where C + C = C. This reduces time complexity
1 2
biguously expressed genes as well as the constantly expressed genes of fitness computation to O(N ∗ C1 ∗ C2 ). Table 1 describes how a sam-
across the tissue classes. ple distinction table looks like. Here, assume that there are seven con-
Different normalization methods have been proposed for gene ditional features { f1 , f2 , f3 , f4 , f5 , f6 , f7 }, the length of vector is N = 7.
expression data such as global normalization, lowess normalization, In a vector v, the binary data ‘1’ represents if the corresponding fea-
housekeeping gene normalization, invariant set normalization, LVS ture is present, and a ‘0’ represents its absence. The two classes are
normalization, and quantile normalization each having their own C1 (C11 , C12 ) and C2 (C21 , C22 , C23 ). The rows represent the object pairs
merits and demerits [28]. After checking several experiments on and columns represent the features or attributes. The objective is to
global based normalization, a quartile based normalization method choose minimal number of columns (features) from the table that
was selected for this study as done in Ref. [13]. The normalization is covers all the rows (i.e., object pairs in the table).
performed on each of the attributes so that it falls between 0.0 and Note that, if there are three classes with C1 , C2 , and C3 number of
1.0. This helps us to give equal priority to each of the attributes as samples, there will be C1 ∗ C2 ∗ C3 number of rows in the distinction
there is no way of knowing important or unimportant features. We table. Therefore, the proposed HDBPSO can be extended to solve multi
used quantile normalization, implemented as follows, attribute-wise class problem as well. However, the present work is focused on two
aj (xi )−minj
normalization is done by aj (xi ) = maxj −minj
, ∀i, where maxj and minj class problems for benchmark datasets as available in literature.
correspond to the maximum and minimum gene expression values
for attribute aj over all samples. This constitutes the normalized gene 3.2. Fitness function
dataset, i.e., continuous attribute value table between [0,1]. Then we
choose thresholds Thi and Thf , based on the idea of quartiles [13]. The feature selection can be done by HDBPSO Algorithm (1) us-
Let the N patterns be sorted in the ascending order of their values ing the following objective function. We proposed a fitness function,
along the jth axis. In order to determine the partitions, we divide the which includes two sub functions (F1 , F2 ). Where F1 finds number of
measurements into a number of small class intervals of equal width features (i.e. number of 1’s), F2 decides the extent to which the feature
δ and count the corresponding class frequencies f rc . The position of can recognize among the object pairs. The proposed fitness function
the kth partition value (k = 1, 2, 3 for four partitions) is calculated is as follows:
R −cf r
as Thk = lc + k f r c−1 ∗ δ where lc is the lower limit of the cth class Fit = α1 F1 (v) + (1 − α1 )F2 (v) (2)
c
interval, Rk = N∗k4 is the rank of the kth partition value, and cf rc−1 is N−O Rv
the cumulative frequency of the immediately preceding class interval, where the two sub functions F1 (v) = N v , and F2 (v) = C ∗C under
1 2
such that cf rc−1 ≤ Rk ≤ cf rc . Here we use Thi = Th1 and Thf = Th3 . As the condition 0 < α1 < 1. Here, v is the chosen feature subset, Ov
a result, Th1 is statistically chosen (for four partition i.e., k = 3) such represents the number of 1’s in v, C1 and C2 are the number of objects
that 1/3 of the sample values lies below Th1 . Similarly, Th3 is statisti- in the two classes, and Rv is the number of object pairs (i.e., rows in
cally chosen in such a way so that 2/3 of the sample values lies below the distinction table) v can discern between. The fitness function F1
Th3 under that particular feature. Find the average number of ‘∗s’ as gives the candidate credit for containing less number of features or
‘Tha ’ over all the features. Remove from the table those attributes for attributes in v, and F2 determines the extent to which the candidates
which the number of ‘∗s’ are ≥ Tha . This is the modified (reduced) can discern among object pairs in the distinction table.
attribute value table F r [13]. As an example to calculate F1 and F2 , let us take a sample in-
put vector v = (1, 0, 1, 1, 0, 1, 1), Two classes are C1 and C2 , where
3.1.1. Distinction table generation
The distinction table is a binary matrix with dimensions (C 2−C ) ×
2
Table 1
N, where N is the number of features in F, C is objects. An entry An example of a simple distinction table.
b((k, j), i) of the matrix corresponds to the attribute ai and pair of
f1 f2 f3 f4 f5 f6 f7
objects (xk , xj ).
(C11 , C21 ) 1 1 1 0 1 0 1
1, if ai (xk ) = ai (xj ). (C11 , C22 ) 0 1 0 1 0 1 0
b((k, j), i) = (1) (C11 , C23 )
0, if ai (xk ) = ai (xj ).
0 1 1 0 1 0 0
(C12 , C21 ) 1 0 1 0 1 0 1
The presence of ‘1’ signifies the ability of the attribute ai to discern (C12 , C22 ) 0 1 0 0 1 0 0
(C12 , C23 ) 1 0 1 0 1 0 0
(or distinguish) between the pair of objects (xk , xj ).
H. Banka, S. Dara / Pattern Recognition Letters 52 (2015) 94–100 97

Fig. 1. Plot of number of generations versus cardinality of the feature subsets with different population size on three datasets.

class lengths are C1 = 2, C2 = 3, and length of vector is N = 7 (as de- Table 2

Description of the three cancer datasets.
picted in Table 1). The number of 1’s in v is Ov = 5, and Rv is cal-
culated as compared with input vector v matching number of the Datasets Total features Reduced featuresa Classes Samples
presented 1’s from each row in distinction table, i.e. Rv = 5. Therefore Colon 2000 1102 Colon cancer 40
N−O Rv
F1 (v) = N v = 7−5
7 = 0.29, and F2 (v) = C1 ∗C2 = 6 = 0.84. Finally we
5
Normal 22
calculate ﬁtness function (2) using F1 and F2 . Lymphoma 4026 1867 Other type 54
B-cell 42

3.3. HDBPSO algorithm for feature selection Leukemia 7129 3783 ALL 47
AML 25
a
The HDBPSO Algorithm (1) starts with an initialization of ran- After preprocessing.
dom population called particles. We updated the velocity function
Table 3
as shown in Algorithm (1). At each iteration, the Pbest (i.e. Pi ) and
The confusion matrix.
gbest (i.e. Pg ) are updated accordingly. For each dimension of particle,
update the positions based on its corresponding velocities. Note that, Positive Negative
PSO velocity function may generate negative values. However, using Positive 10 1
our proposed one, it does not generate any negative value, hence no Negative 2 18
use of Vmin to set velocity boundary. Here, X and P are binary stings,
the difference after the operation between X and P is also binary.
Hence the proposed approach is more logical. boundary Vmax was set to 4, from literature [30] related to this prob-
lem, and notice that there is no Vmin parameter required in this study
Algorithm 1: The proposed HDBPSO algorithm for feature se- as mentioned earlier. The inertia weight (w) is one of the most impor-
lection. tant parameters in BPSO which can improve performance by properly
balancing its local and global parameters [31]. The inertia weight(w)
Input: c1 , c2 , w, Vmax , distinction table
was set to 0.9 after several runs. The varied population size was taken,
Output: Feature subsets
to check feature subsets, and the swarm size is set equal to popula-
Initialize population randomly
tion size based on Ref. [32]. We also tested different population sizes
while (maximum iterations) do
like 10, 20, 30, 50, 100, 150 and 200. The process was repeated finite
for (i=1 to number of Particles) do
Evaluate fitness value of particle using (2) number of iterations i.e. 50. It is observed that when the number of
if (fitness value of Xi > Pi ) then iterations exceeds 50 there is no further improvement.
Pi = Xi
if (fitness value of Xi > Pg ) then 4. Experimental results
Pg = Xi
for (d=1 to n) do We have implemented the HDBPSO algorithm to find minimal
Vid (t + 1) = w × ∗Vid (t) + c1 ρ1 ∗ feature subsets on high dimensional microarray data. The microarray
HD(Pid (t), Xid (t)) + c2 ρ2 ∗ HD(Pgd (t), Xid (t)) data consists of three different cancer datasets; colon, lymphoma, and
if Vid (t + 1) > Vmax then leukemia. We are interested in two-class problems (i.e., normal and
Vid (t + 1) = Vmax diseased samples), taken three high dimensional i.e., colon cancer,1
if S(Vid (t + 1)) > rand(0, 1) then lymphoma2 and leukemia dataset.3 Here, 50% of samples is used for
Xid (t + 1) = 1 training and remaining 50% of the samples for testing. The details of
else all datasets are reported in Table 2. Fig. 1 depicts the plot of number
Xid (t + 1) = 0 of generations versus cardinality of the reduced feature subsets by
the proposed HDBPSO algorithm.
We reported the results in correct classification accuracy over two
class selected features. Using confusion matrix (see Table 3), for exam-
ple to calculate average of two class correct classifications accuracy =
(no. of correctly classified sample of both group/total no. of test
3.4. Implementation 10+18
sample)∗100 = (10+1+2+18 )))∗100 = 90.32%.

The HDBPSO algorithm is coded in C language with GCC compiler

and using the classifiers available in Weka [29]. Parameter selection 1
http://microarray.princeton.edu/oncology.
of algorithm may influence the quality of computational results. We 2
http://llmpp.nih.gov/lymphoma/data/figure1/figure1.cdt.
set two accelerator coefficient parameters (c1 , c2 ) to 2, and velocity 3
http://www.genome.wi.mit.edu/MPR.
98 H. Banka, S. Dara / Pattern Recognition Letters 52 (2015) 94–100

Table 4 Table 6
The 10-fold cross validation results on three datasets. Comparative study of k-NN classiﬁer using HDBPSO, NSGA-II and GA on three datasets.

Data correct score variance Mean absolute Standard Dataset Feature subset Used k-NN classiﬁcation
deviation deviation size method (%) on test set

Colon 90.28 0.1498 0.2792 0.3871 k=1 k=3 k=5

Lymphoma 93.25 0.1909 0.36 0.4369
Colon ≤10 Proposed 100 93.55 90.33
Leukemia 91.25 0.2645 0.4096 0.514
≤10 NSGA-II [13] 90.3 90.3 87.1
≤15 GA [13] 71.0 58.10 48.40
Lymphoma ≤20 Proposed 100 93.75 95.80
Table 5
≤2 NSGA-II 93.8 95.80 95.80
Comparative study of stability indices on HDBPSO, NSGA-II and GA for
≤18 GA 89.59 89.59 93.76
three datasets.
Leukemia ≤10 Proposed 100 94.74 94.74
Dataset Used method Stability indexes
≤5 NSGA-II 94.1 91.2 91.2
Dice Tanimoto Jaccard KI ≤19 GA 73.50 73.53 60.77

Colon Proposed 1.0 1.0 1.0 1.0

NSGA-II 1.0 1.0 1.0 1.0
GA 1.0 1.0 1.0 1.0 algorithm (GSA) for selecting informative feature subsets. They have
Lymphoma Proposed 1.0 1.0 1.0 1.0 tested on four high dimensional benchmark datasets, and the num-
NSGA-II 1.0 1.0 1.0 1.0 ber of selected feature subsets are in the range of 25–101. The pro-
GA 0.95 0.9 0.9 0.95
posed HDBPSO selects feature subsets in the range of 6–17 which
Leukemia Proposed 1.0 1.0 1.0 1.0 are quite smaller. Moreover, the classification accuracies for three
NSGA-II 1.0 1.0 1.0 1.0
datasets using GSA are 87.1%, 94.81% and 97.22% respectively. The
GA 1.0 1.0 1.0 1.0
proposed HDBPSO performs 100% classification on these datasets us-
ing various classifiers. Hence, the proposed algorithm is better with
respect to the classification accuracy and feature subset cardinality
Table 4 shows 10-fold cross validation results, reported variance, as well.
mean absolute deviation and standard deviation. Cross-validation Several other existing filter based methods have been imple-
mimics the use of training and test sets by repeatedly training the mented for comparisons. Some of the filter based feature selection
algorithm K times with a fraction 1/K of training examples left out algorithms are: Chi-square [37], fast correlation based filter (FCBF)
for testing purposes. We use 10-fold cross validation (i.e., K = 10) [35], Fisher’s score [38], Gini [39], information theoretic [41] and t-
in each experimental run, nine folds are used for training (90%) and test [40]. Few of the embedded methods for feature selection are
remaining one fold is used for testing (10%). Here, we perform 10- Bayes logistic regression [34], Kruskal Wallis [42] and sparse multi-
fold cross validation on three datasets with feature subsets which nomial logistic regression (SBMLR) [36].
are generated by our proposed algorithm using BLR classifier. We The experimental results with different feature cardinalities and
executes 10 runs of 10-fold cross validation method and used the accuracies are summarized in Tables 7–9 for three datasets. Table 7
random seed for % split options (seed value is taken odd numbers like shows the results for colon data using various classifiers. It is seen
1, 3, 5 for 10 runs) which produces a different set of cross validation that the proposed algorithm with feature subset size 10 or less is
folds for every run. Finally we calculate the standard deviation of all able to recognize correctly in more than 90% cases, whereas most
runs. of the other classifiers fail to achieve the same score. However,
Stability indices (such as Dice, Tanimato, Jaccard and Kuncheva) Bayes LogReg is able to get four feature subsets but the classifica-
were also measured on these three datasets with few other meth- tion accuracy is inferior than the proposed one. Other filter based
ods for comparison. The Dice, Tanimato and Jaccard indices vary be- algorithms have reported based on top 10 features, and their cor-
tween [0,1]. The proposed algorithm achieves the stability value of responding classification accuracies are inferior than the proposed
1.0 indicating identical behavior of the selected feature subsets. Sim- algorithm as well.
ilarly, Kuncheva index value varies between [−1,1], and the proposed Table 8 shows the results for lymphoma data using various classi-
algorithm achieves the best value i.e., 1.0. Therefore, the proposed fiers. Here it is observed that NSGA-II is able to select a subset of three
HDBPSO algorithm performs better with respect to stability. features but with an approximate accuracy of 85% on an average case,
Table 5 depicts the different stability values on three datasets. as compared to the proposed one with an accuracy more than 95%,
Here, F1 and F2 are two subsequences of feature subsets indexed at on an average. It is to be mentioned here that NSGA-II converges after
the end of the algorithm. Notice that GA and NSGA-II results were 15,000 generations whereas the proposed HDBPSO converges after
taken after 15,000 generations, on an average, with varied popula- 50 iterations only. The proposed algorithm performs better for most
tion size(s). The proposed HDBPSO algorithm iterates only 200 iter- of the cases except SMBLR [36] which is able to find a subset of length
ations, on an average, with varied population(s). However, all these 5 with good accuracy for lymphoma data.
methods achieve the desirable high stability value on three datasets, Similarly, Table 9 shows the results for leukemia data using various
except for the case of GA which fails on lymphoma dataset. How- classifiers. It is seen that the proposed algorithm with feature subset
ever, the proposed algorithm achieves good stability value for all the size 10 or less is able to recognize correctly in more than 95% cases,
cases. on an average, whereas most of the other classifiers fail to achieve
the same score. However, NSGA-II, Bayes LogReg, FCBF are able to
4.1. Comparisons get feature subsets of length 5, 3, 1 respectively but the classification
accuracy is inferior than the proposed one.
Table 6 depicts comparative performance among HDBPSO, GA, and All the above tables on three benchmark datasets clearly demon-
multi objective GA (NSGA-II [13]) using with k-NN classifier on test strate the superiority and competitiveness of the proposed method.
data. As shown in table, in most of the cases, on all the three datasets, Hence, it may be concluded that by proper choice of pre-processing
the proposed HDBPSO performs better as compared to others. technique fine tuned by the proposed HDBPSO algorithm on the pre-
Wen et al. [33] introduce a new evolving personalized model- processed data, it is possible to find important features with better
ing method and system (evoPM) that integrates gravitational search performance.
H. Banka, S. Dara / Pattern Recognition Letters 52 (2015) 94–100 99

Table 7
Performance of different feature selection algorithms using different classiﬁers on colon data.

Used method Feature subset size Correct classiﬁcation accuracy on test dataset in (%)

BLR BayesNet NB LibLinear SVM MLP J48 LMT RF

Proposed ≤10 90.33 93.55 93.55 100 100 100 100 93.55 96.78
GA [13] ≤15 64.51 45.16 67.74 51.61 64.51 61.29 54.83 58.06 51.61
NSGA-II [13] ≤10 67.74 64.51 70.96 77.42 64.51 77.42 58.83 67.74 54.83
Bayes LogReg [34] 4 64.51 74.19 64.51 70.96 64.51 64.51 61.29 61.29 54.83
FCBF [35] 11 70.96 74.19 70.96 74.19 64.51 74.19 61.29 58.06 74.19
SMBLR [36] 9 77.41 77.41 80.64 77.41 64.51 70.96 74.1 77.41 80.64
Chi-square [37] Top 10 74.19 64.51 74.19 67.64 64.51 54.83 58.06 64.51 61.29
Fisher [38] Top 10 64.51 64.51 64.29 58.06 64.51 64.51 67.74 67.74 54.83
GiniIndex [39] Top 10 64.51 64.51 54.83 64.51 64.51 41.93 64.51 64.51 58.06
t-test [40] Top 10 61.29 64.51 67.74 77.41 64.51 67.41 64.51 64.51 64.51
Information gain [41] Top 10 74.19 64.51 74.19 67.64 64.51 54.83 58.06 64.51 61.29
Kruskal Wallis [42] Top 10 64.51 64.51 64.51 80.64 64.51 70.96 67.74 90.32 70.96

Table 8
Performance of different feature selection algorithms using different classiﬁers on lymphoma data.

Used method Feature subset size Correct classiﬁcation accuracy on test dataset in (%)

BLR BayesNet NB LibLinear SVM MLP J48 LMT RF

Proposed ≤10 93.55 93.55 93.55 100 91.66 97.92 97.92 97.92 97.92
GA ≤15 67.55 67.55 83.75 64.7 83.33 85.41 81.25 73.75 79.16
NSGA-II ≤3 81.25 87.5 85.41 75 85.41 85.41 89.58 81.25 79.16
Bayes LogReg 3 91.66 91.66 89.58 93.75 93.75 93.75 91.66 93.75 91.66
FCBF 37 93.75 93.75 95.83 93.75 95.83 95.83 83.33 95.83 89.58
SMBLR 5 95.83 95.83 95.83 93.75 95.83 95.83 93.75 93.75 93.75
Chi-square Top 10 81.25 56.25 72.91 75 81.25 77.08 56.25 77.08 64.58
Fisher Top 10 41.66 56.25 77.08 41.66 58.33 50 56.25 43.75 62.5
GiniIndex Top 10 60.41 56.25 58.33 62.5 56.25 39.58 56.25 52.8 33.33
Information gain Top 10 81.25 56.25 72.91 75 81.25 77.08 56.25 77.08 64.58
t-test Top 10 83.33 77.08 83.33 85.41 87.5 83.33 85.41 75 77.08
Kruskal Wallis Top 10 60.41 62.5 66.66 60.41 62.5 60.41 58.33 64.58 60.41

Table 9
Performance of different feature selection algorithms using different classiﬁers on leukemia data.

Used method Feature subset size Correct classiﬁcation accuracy on test dataset in (%)

BLR BayesNet NB LibLinear SVM MLP J48 LMT RF

Proposed ≤10 97.37 95.84 94.74 100 100 100 97.37 96.78 100
GA ≤15 64.7 64.7 64.7 55.88 83.33 85.41 81.25 73.75 79.16
NSGA-II ≤5 71.02 71.02 64.7 78.72 85.42 85.41 81.25 73.55 79.16
Bayes LogReg 3 67.64 91.17 88.23 76.47 58.82 91.17 91.17 61.76 58.82
FCBF 1 52.94 91.17 91.17 76.47 58.82 91.17 91.17 64.51 67.74
SMBLR 8 73.52 82.35 64.7 64.7 58.82 70.58 79.41 61.76 58.82
Chi-square Top 10 64.7 58.82 70.58 58.82 58.82 76.47 52.94 73.52 73.52
Fisher Top 10 55.88 58.82 58.82 76.47 58.82 76.47 52.94 61.76 60.52
GiniIndex Top 10 58.82 58.82 58.82 76.47 58.82 50 58.82 69.54 60.52
t-test Top 10 61.76 58.52 70.58 76.47 58.82 64.7 52.94 70.58 61.76
Information Gain Top 10 64.7 58.82 70.58 58.82 58.82 76.47 52.94 73.52 73.52
Kruskal Wallis Top 10 58.82 52.94 61.76 58.82 58.82 58.82 55.88 55.88 58.82

5. Conclusion Acknowledgments

We have presented a Hamming distance based binary PSO algo- The authors are thankful to the anonymous reviewers for shar-
rithm for feature selection and classification in gene expression data. ing their valuable comments that definitely improved the quality of
The experimental results validate that the proposed HDBPSO per- the paper. This work was partially supported by Council of Scientific
forms better using Hamming distance as proximity measure for this and Industrial Research (CSIR), New Delhi, India, under the Grant No.
problem. The main objective of the feature selection is selecting min- 22(0586)/12/EMR-II.
imal number of features and get higher classification accuracy. Here,
this has been achieved by the two fitness functions. The performances
of the proposed method and few of the existing methods were com-
pared to show the superiority of proposed algorithm. Experimental References
results on three high dimensional benchmark cancer datasets demon-
[1] C. Lazar, et al., A survey on filter techniques for feature selection in gene expression
strate the feasibility and effectiveness of the proposed algorithm. The microarray analysis, IEEE/ACM Trans. Comput. Biol. Bioinform. 9 (4) (2012) 1106–
results are validated by stability indices as well. The proposed HDBPSO 1119.
algorithm is not only suitable for feature selection and classification [2] B.M.E. Moret, L.S. Wang, T. Warnow, Special Issue on Bioinformatics, vol. 35, IEEE
Computer Society, 2002.
in high dimensional data but also for other application domains such [3] I.I. Yvan Saeys, P. Larranaga, A review of feature selection techniques in Bioinfor-
as face recognition or any other high dimensional data classification. matics, Int. J. Comput. Sci. (IAENG) 23 (19) (2007) 2507–2517.
100 H. Banka, S. Dara / Pattern Recognition Letters 52 (2015) 94–100

[4] P. Saengsiri, S. Wichian, P. Meesad, U. Herwig, Comparison of hybrid feature [22] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is “nearest neighbor”
selection models on gene expression data, in: 8th International Conference on meaningful? in: Database Theory—ICDT’99, Springer, 1999, pp. 217–235.
ICT and Knowledge Engineering, 2010, pp. 13–18. [23] J. Mao, A.K. Jain, A self-organizing network for hyperellipsoidal clustering (hec),
[5] W. Altidor, T. Khoshgoftaar, J. Van Hulse, An empirical study on wrapper-based IEEE Trans. Neural Netw., 7 (1) (1996) 16–29.
feature ranking, in: 21st International Conference on Tools with Artificial Intelli- [24] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv.
gence (ICTAI’09), 2009, pp. 75–82. 31 (3) (1999) 264–323.
[6] C. Wahid, A. Ali, K. Tickle, A novel hybrid approach of feature selection through [25] P. Maji, C. Das, Relevant and significant supervised gene clusters for microarray
feature clustering using microarray gene expression data, in: 11th International cancer classification, IEEE Trans. NanoBiosci., 11 (2) (2012) 161–168.
Conference on Hybrid Intelligent Systems (HIS), 2011, pp. 121–126. [26] S. Mitra, S. Ghosh, Feature selection and clustering of gene expression profiles
[7] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc.: Ser. using biological knowledge, IEEE Trans. Syst., Man, Cybern., C, 42 (6) (2012)
B (1996) 267–288. 1590–1599.
[8] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped vari- [27] L.I. Kuncheva, A stability index for feature selection, in: Proceedings of the 25th
ables, J. R. Stat. Soc.: Ser. B 68 (1) (2006) 49–67. IASTED International Multi-Conference on Artificial Intelligence and Applications,
[9] L. Jacob, G. Obozinski, J.P. Vert, Group lasso with overlap and graph lasso, in: February 12–14, Austria, 2007, pp. 390–395.
Proceedings of the 26th Annual International Conference on Machine Learning, [28] S. Calza, Y. Pawitan, Normalization of gene-expression microarray data, in: Com-
ACM, 2009, pp. 433–440. putational Biology, Springer, 2010, pp. 37–52.
[10] H. Liu, J. Zhang, X. Jiang, J. Liu, The group dantzig selector, in: International Con- [29] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The weka
ference on Artificial Intelligence and Statistics, 2010, pp. 461–468. data mining software: an update, SIGKDD Explor. 11 (2009) 10–18.
[11] J. Liu, C. Zhang, C.A. Mccarty, P.L. Peissig, E.S. Burnside, D. Page, High-dimensional [30] D. Sudholt, C. Witt, Runtime analysis of binary pso, in: Proceedings of 10th An-
structured feature screening using binary markov random fields, in: International nual Conference on Genetic and Evolutionary Computation, New York, 2008,
Conference on Artificial Intelligence and Statistics, 2012, pp. 712–721. pp. 135–142.
[12] D. Stekel, Microarray Bioinformatics, Oxford University and Bius, 2003. [31] Y. Shi, R. Eberhart, Empirical study of particle swarm optimization, in: Proceedings
[13] M. Banerjee, S. Mitra, H. Banka, Evolutionary rough feature selection in gene of IEEE Congress, vol. 3, 1999, pp. 1945–1950.
expression data, IEEE Trans. Syst., Man, Cybern. C 37 (2007) 622–632. [32] M. Clerc, Binary particle swarm optimisers: toolbox, derivations, and mathemat-
[14] B. Xue, M. Zhang, W. Browne, Particle swarm optimization for feature selection ical insights, Hal-00122809, 2007.
in classification: a multi-objective approach, IEEE Trans. Cybern. 43 (6) (2013) [33] W. Liang, Y. Hu, N. Kasabov, Evolving personalized modeling system for integrated
1656–1671. feature, neighborhood and parameter optimization utilizing gravitational search
[15] B. Xue, L. Cervante, L. Shang, M. Zhang, A particle swarm optimisation based algorithm, Evolv. Syst. (2013) 1–14.
multi-objective filter approach to feature selection for classification, in: PRICAI [34] G.C. Cawley, N.L. Talbot, Gene selection in cancer classification using sparse
2012: Trends in Artificial Intelligence, Lecture Notes in Computer Science, vol. logistic regression with bayesian regularization, Bioinformatics 22 (19) (2006)
7458, Springer, Berlin/Heidelberg, 2012, pp. 673–685. 2348–2355.
[16] S.M. Vieira, L.F. Mendona, G.J. Farinha, J.M. Sousa, Modified binary {PSO} for feature [35] L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation-based
selection using {SVM} applied to mortality prediction of septic patients, Appl. Soft filter solution, in: ICML, vol. 3, 2003, pp. 856–863.
Comput. 13 (8) (2013) 3494–3504. [36] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene
[17] A. El Akadi, A. Amine, A. El Ouardighi, D. Aboutajdine, A two-stage gene selection expression data, J. Bioinform. Comput. Biol. 3 (02) (2005) 185–205.
scheme utilizing mrmr filter and ga wrapper, Knowl. Inform. Syst. 26 (3) (2011) [37] H. Liu, R. Setiono, Chi2: Feature selection and discretization of numeric attributes,
487–500. in: IEEE 24th International Conference on Tools with Artificial Intelligence, IEEE
[18] Y.M. Chiang, H.M. Chiang, S.Y. Lin, The application of ant colony optimization for Computer Society, 1995, p. 388.
gene selection in microarray-based cancer classification, in: International Con- [38] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2012.
ference on Machine Learning and Cybernetics, vol. 7, 2008, pp. 4001–4006. [39] C. Gini, Reprinted in Memorie di Metodologia Statistica, in: E. Pizetti and T.
[19] M.M. Kabir, M. Shahjahan, K. Murase, A new hybrid ant colony optimization Salvemini (Eds.), 1955.
algorithm for feature selection, Expert Syst. Appl. 39 (3) (2012) 3747–3763. [40] D.C. Montgomery, G.C. Runger, N.F. Hubele, Engineering Statistics, John Wiley &
[20] L. Yu, C. Ding, S. Loscalzo, Stable feature selection via dense feature groups, in: Sons, 2009.
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge [41] T.M. Cover, J.A. Thomas, Elements of Information Theory, John Wiley & Sons, 2012.
Discovery and data mining, ACM, 2008, pp. 803–811. [42] L. Wei, Asymptotic conservativeness and efficiency of kruskal-wallis test for k
[21] A. Kalousis, J. Prados, M. Hilario, Stability of feature selection algorithms: a study dependent samples, J. Am. Stat. Assoc. 76 (376) (1981) 1006–1009.
on high-dimensional spaces, Knowl. Inform. Syst. 12 (1) (2007) 95–116.