A model-based relevance estimation
approach for feature selection in
microarray datasets
Gianluca Bontempi, Patrick E. Meyer
{gbonte,pmeyer}@ulb.ac.be
Machine Learning Group,
Computer Science Department
ULB, Université Libre de Bruxelles
Boulevard de Triomphe - CP 212
Bruxelles, Belgium
http://www.ulb.ac.be/di/mlg

Outline
• Feature selection in microarray classification tasks
• Definition of relevance
• Relevance and feature selection
• Our approach to relevance estimation: between filter and wrapper
• Experimental results

Feature selection in microarrays
• The availability of massive amounts of experimental data based on genome-wide studies has given impetus in recent years to a large effort in developing mathematical, statistical and computational techniques to infer biological models from data.
• In many bioinformatics problems, the number of features is significantly larger than the number of samples (high feature-to-sample ratio datasets).
• This is typical of cancer classification tasks, where a systematic investigation of the correlation of the expression patterns of thousands of genes with specific phenotypic variations is expected to provide an improved taxonomy of cancer.
• In this context, the number of features n corresponds to the number of expressed gene probes (up to several thousand) and the number of observations N to the number of tumor samples (typically on the order of hundreds).
• Feature selection, and consequently gene selection, is required to perform classification in such a high-dimensional task.

State-of-the-art
Feature selection requires an accurate assessment of a large number of alternative subsets in terms of predictive power or relevance to the output class.
The three main state-of-the-art approaches are:
Filters: preprocessing methods which assess the merits of features from the data without having recourse to any learning algorithm. Examples: ranking, PCA, t-test.
Wrappers: methods that rely on a learning algorithm to assess and compare subsets of variables. They conduct a search for a good subset using the learning algorithm itself as part of the evaluation function. Examples are the forward/backward methods proposed in classical regression analysis.
Embedded methods: these perform variable selection as part of the learning procedure and are usually specific to a given learning machine. Examples are classification trees and methods based on regularization techniques (e.g. the lasso).

Between filters and wrappers
• Filter approaches rely on learner-independent estimators to assess the relevance of a set of features. The rationale of filter techniques is that the importance of a set of features should be independent of the prediction technique.
Our contribution: we propose a model-based strategy to assess the relevance of a set of features.
• Wrappers depend on a specific learner to assess a set of features and end up returning a quantity which confounds the relevance of a subset (the desired quantity) with the quality of the learner (not required). In other terms, wrappers return a biased estimate of the relevance of a subset.
Our contribution: since the wrapper bias may have a strong negative impact on the selection procedure, we propose a model-based technique for relevance assessment which is low-biased.

Feature selection and relevance
• Let us consider a binary classification problem where $x \in \mathcal{X} \subset \mathbb{R}^n$ and $y \in \mathcal{Y} = \{y_0, y_1\}$. Let $s \subseteq x$, $s \in \mathcal{S}$, be a subset of the input vector.
• Let us denote
$$p_1(s) = \text{Prob}\{y = y_1 \mid s\}, \qquad p_0(s) = \text{Prob}\{y = y_0 \mid s\}$$
• A feature selection problem can be formalized as a problem of (learner-independent) relevance maximization
$$s^* = \arg\max_{s \subseteq x,\, |s| \le d} R_s$$
where the goal is to find the subset $s$ that maximizes the relevance quantity $R_s$, which accounts for the predictive power that the input $s$ has on the target $y$.

Relevance definitions
• A well-known example of a relevance measure is the mutual information $I(s; y) = H(y) - H(y \mid s)$.
• Here we will focus on the quantity
$$R_s = \int_{\mathcal{S}} \left[ p_0^2(s) + p_1^2(s) \right] dF_s(s) = \int_{\mathcal{S}} r(s)\, dF_s(s)$$
where $r(s) = 1 - g(s)$ and $g(s)$ is the Gini index of diversity.
• Note that
$$r(s) = p_0^2(s) + p_1^2(s) = 1 - 2 p_0(s)\big(1 - p_0(s)\big) = 1 - 2 \text{Var}\{y \mid s\}$$
where $\text{Var}\{y \mid s\}$ is the conditional variance of $y$.
• Also, a monotone function $G_H(\cdot) : [0, 1] \to [0, 0.5]$ maps the entropy $H(y \mid s)$ of a binary variable $y$ to the related Gini index $g$.
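To make these quantities concrete, here is a small Python sketch (ours, not from the slides; all names are illustrative) that computes $r(s) = 1 - 2p_0(1 - p_0)$ and a numerical version of the map $G_H$, obtained by inverting the binary entropy via bisection:

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a binary variable with P(y = y1) = p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def gini(p):
    """Gini index g = 2 p (1 - p) of a binary variable."""
    return 2 * p * (1 - p)

def G_H(h, tol=1e-12):
    """Monotone map G_H : [0, 1] -> [0, 0.5] from entropy to Gini.
    Both H and g are monotone in min(p, 1 - p), so we invert H by
    bisection on p in [0, 0.5] and evaluate g at the recovered p."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < h:
            lo = mid
        else:
            hi = mid
    return gini(0.5 * (lo + hi))

# Example: p0 = 0.9 gives r(s) = 1 - 2*0.9*0.1 = 0.82, i.e. g(s) = 0.18
p0 = 0.9
print(1 - 2 * p0 * (1 - p0))       # 0.82
print(G_H(binary_entropy(p0)))     # ~0.18, consistent with g = 1 - r
```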


Bias of the wrapper approach
Given a learner $h$ trained on a dataset of size $N$, the wrapper approach translates the (learner-independent) relevance maximization problem into a (learner-dependent) minimization problem
$$\arg\min_{s \subseteq x,\, |s| \le d} M^h_s = \arg\min_{s \subseteq x,\, |s| \le d} \int_{\mathcal{S}} \text{MME}^h(s)\, dF_s(s)$$
where the Mean Misclassification Error is decomposed as follows (Wolpert, Kohavi, 96):
$$\text{MME}^h(s) = \frac{1}{2}\Big[1 - \big(p_0^2(s) + p_1^2(s)\big)\Big] + \frac{1}{2}\Big[\big(p_0(s) - \hat{p}_0(s)\big)^2 + \big(p_1(s) - \hat{p}_1(s)\big)^2\Big] + \frac{1}{2}\Big[1 - \big(\hat{p}_0^2(s) + \hat{p}_1^2(s)\big)\Big] = \frac{1}{2}\big(n(s) + b(s) + v(s)\big)$$
where $\hat{p}_0 = \text{Prob}\{\hat{y} = y_0 \mid s\}$, $n(s) = 1 - r(s)$ is the noise variance term, $b(s)$ is the learner squared bias and $v(s)$ is the learner variance.
NB: the term $b(s)$ is NOT dependent on relevance.
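This decomposition is an algebraic identity for a stochastic classifier whose prediction $\hat{y}$ is drawn independently of $y$ given $s$ (so that $\text{MME}^h(s) = p_0\hat{p}_1 + p_1\hat{p}_0$). The short numeric check below is our own sanity check, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary conditional probabilities p0(s) and model estimates phat0(s)
p0, phat0 = rng.uniform(size=5), rng.uniform(size=5)
p1, phat1 = 1 - p0, 1 - phat0

# Misclassification probability of the stochastic classifier:
mme = p0 * phat1 + p1 * phat0

noise = 1 - (p0**2 + p1**2)                 # n(s) = 1 - r(s)
bias  = (p0 - phat0)**2 + (p1 - phat1)**2   # b(s), learner squared bias
var   = 1 - (phat0**2 + phat1**2)           # v(s), learner variance

assert np.allclose(mme, 0.5 * (noise + bias + var))  # identity holds
```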

Bias of the wrapper approach
• In real classification tasks, the zero-one misclassification error $M^h_s$ of a learner $h$ for a subset $s$ cannot be derived analytically but only estimated (typically by cross-validation).
• A wrapper selection returns
$$s^h = \arg\min_{s \subset x,\, |s| \le d} \hat{M}^h_s \qquad (1)$$
where $\hat{M}^h_s$ is the estimate of the misclassification error of the learner $h$ (e.g. computed by cross-validation).
If a wrapper strategy relies on a generic learner $h$, that is, a learner where the bias term $b(s)$ is significantly different from zero, the returned feature selection will depend on a quantity which is a biased estimate of the term $r(s)$ and consequently of the relevance $R_s$. In other words, wrappers do not maximize relevance.

Unbiased wrapper approach
• Intuitively, the bias would be reduced if we adopted a learner having a small bias term. A low-bias, yet high-variance, learner is the k-nearest neighbour classifier (kNN) for small values of k.
• In particular, it has been shown that for a 1NN learner and a binary classification problem
$$\lim_{N \to \infty} M^{1NN}_s = 1 - R_s$$
where $M^{1NN}_s$ is the misclassification error of a nearest-neighbour classifier.
• Since cross-validation returns a consistent estimate of $M^h_s$, and since $M^{1NN}_s$ asymptotically converges to one minus the relevance $R_s$, the quantity $1 - \hat{M}^{1NN}_s$ is a consistent estimator of the relevance $R_s$.
• We propose then as relevance estimator
$$\hat{R}^{kNN}_s = 1 - \hat{M}^{kNN}_s$$
where $\hat{M}^{kNN}_s$ is the cross-validation estimate of the misclassification error of a kNN learner with low k. This term returns an (asymptotically) unbiased, yet high-variance, estimate of the relevance of the subset $s$.
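In practice $\hat{R}^{kNN}_s$ amounts to a few lines of code. A minimal sketch with scikit-learn (a library choice of ours; the slides name none), where X_subset holds the columns of X indexed by s:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def relevance_knn(X_subset, y, k=1, cv=10):
    """R_hat^kNN_s = 1 - cross-validated misclassification error of a
    low-k kNN; cross_val_score returns accuracy for classifiers,
    which is exactly 1 - error."""
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_subset, y, cv=cv)
    return scores.mean()
```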


Reducing the variance of the estimator
The low-bias, high-variance nature of the $\hat{R}^{kNN}_s$ estimator suggests that the best way to employ it is by combining it with other relevance estimators.
We will take into consideration two possible estimators to combine with:
1. a direct model-based estimator $\hat{p}_1$ of the conditional probability $p_1(s) = \text{Prob}\{y = y_1 \mid s\}$ and consequently of the quantity $r(s)$.
This estimator first samples a set of $N$ unclassified input vectors $s_i$ according to the empirical distribution $\hat{F}_s$ and then computes the Monte Carlo estimate
$$\hat{R}^D_s = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{p}_1^2(s_i) + \hat{p}_0^2(s_i) \right] = 1 - \frac{2}{N} \sum_{i=1}^{N} \hat{p}_1(s_i)\big(1 - \hat{p}_1(s_i)\big)$$
A similar estimator was proposed by Fukunaga in 1973 to estimate the Bayes error.
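A plug-in sketch of this estimator in Python, under the assumption that sampling from $\hat{F}_s$ reduces to reusing the $N$ observed inputs $s_i$; the choice of probability model (a 5-NN here) is ours, not the slides':

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def relevance_direct(X_subset, y, model=None):
    """Monte Carlo plug-in estimate R_hat^D_s = 1 - (2/N) sum phat1(1 - phat1),
    assuming binary labels. In practice, out-of-sample probabilities would
    avoid the optimism of scoring the training points themselves."""
    if model is None:
        model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_subset, y)
    p1 = model.predict_proba(X_subset)[:, 1]   # phat1(s_i) at observed inputs
    return 1 - 2 * np.mean(p1 * (1 - p1))
```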

2. a filter estimator based on the notion of mutual information: several filter algorithms exploit this notion in order to estimate the relevance. An example is the MRMR algorithm (Peng et al., 05), where the relevance of a feature subset $s$, expressed in terms of the mutual information $I(s; y) = H(y) - H(y \mid s)$, is approximated by the incremental formulation
$$I_{MRMR}(s; y) = I_{MRMR}(s_i; y) + I(x_i; y) - \frac{1}{m - 1} \sum_{x_j \in s_i} I(x_j; x_i) \qquad (2)$$
where $x_i$ is a feature belonging to the subset $s$, $s_i$ is the set $s$ with the feature $x_i$ set aside and $m$ is the number of components of $s$. Now, since $H(y \mid s) = H(y) - I(s; y)$ and $G_s = 1 - R_s = G_H(H(y \mid s))$, we obtain that
$$\hat{R}^{MRMR}_s = 1 - G_H\big(H(y) - I_{MRMR}(s; y)\big)$$
is an MRMR estimator of the relevance $R_s$, where $G_H(\cdot)$ is the monotone mapping between the entropy $H$ and the Gini index.
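One possible rendering of the incremental formulation (2), assuming the features have already been discretized into integer codes (e.g. quantile bins); note that scikit-learn's mutual_info_score works in nats, so combining the result with an entropy-to-Gini map such as the G_H sketched earlier requires consistent units:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def i_mrmr(X_disc, y, subset):
    """MRMR approximation of I(s; y): add each feature's relevance
    I(x_i; y) minus its average redundancy with the features already in.
    X_disc: (N, n) integer-coded matrix; subset: ordered column indices."""
    total, chosen = 0.0, []
    for xi in subset:
        redundancy = (np.mean([mutual_info_score(X_disc[:, xj], X_disc[:, xi])
                               for xj in chosen])
                      if chosen else 0.0)
        total += mutual_info_score(X_disc[:, xi], y) - redundancy
        chosen.append(xi)
    return total
```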

Proposed relevance estimators
We propose two novel relevance estimators based on the principle of averaging,
$$\hat{R}'_s = \frac{\hat{R}^{CV}_s + \hat{R}^D_s}{2}, \qquad \hat{R}''_s = \frac{\hat{R}^{CV}_s + \hat{R}^{MRMR}_s}{2},$$
where $\hat{R}^{CV}_s$ denotes the cross-validated kNN estimator $\hat{R}^{kNN}_s$ introduced above, and the associated feature selection algorithms:
$$s^{R'} = \arg\max_{s \subset x,\, |s| \le d} \hat{R}'_s, \qquad s^{R''} = \arg\max_{s \subset x,\, |s| \le d} \hat{R}''_s$$
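As a sketch of how these pieces compose (using the helpers above; the greedy forward search is one plausible strategy of ours, since the slides do not fix how the arg max is explored):

```python
def relevance_avg(X, y, subset, k=1):
    """R_hat'_s: average of the kNN cross-validation estimator and the
    direct plug-in estimator sketched earlier."""
    Xs = X[:, subset]
    return 0.5 * (relevance_knn(Xs, y, k=k) + relevance_direct(Xs, y))

def forward_select(X, y, d, score=relevance_avg):
    """Greedy forward search for arg max over subsets with |s| <= d."""
    selected = []
    for _ in range(d):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = max(remaining, key=lambda j: score(X, y, selected + [j]))
        selected.append(best)
    return selected
```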

Experimental session
• 20 public-domain microarray expression datasets
• external three-fold cross-validation scheme (a sketch of the protocol follows this list)
• to avoid any dependency between the learning algorithm employed by the wrapper and the classifier used for prediction, the experimental session is composed of two parts:
• Part 1: comparison with the wrapper WSVM, using the set of classifiers C1 = {TREE, NB, SVMSIGM, LDA, LOG}, which does not include the SVMLIN learner;
• Part 2: comparison with the wrapper WNB, using the set of classifiers C2 = {TREE, SVMSIGM, SVMLIN, LDA, LOG}, which does not include the NB learner.
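One possible reading of this protocol in Python (names such as select and classifier_factory are ours): the selection step is re-run inside every training fold, so the reported error is an honest external estimate:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def external_cv_error(X, y, select, classifier_factory, d=10, n_splits=3):
    """External three-fold CV: feature selection sees only the training
    fold, and the held-out fold is used purely for error estimation."""
    errs = []
    for tr, te in StratifiedKFold(n_splits=n_splits).split(X, y):
        subset = select(X[tr], y[tr], d)              # e.g. forward_select
        clf = classifier_factory().fit(X[tr][:, subset], y[tr])
        errs.append(np.mean(clf.predict(X[te][:, subset]) != y[te]))
    return float(np.mean(errs))
```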

Experiments with cancer datasets
(N = number of samples, n = number of features, K = number of classes)
Name N n K
Golub 72 7129 2
Alon 62 2000 2
Notterman 36 7457 2
Nutt 50 12625 2
Shipp 77 7129 2
Singh 102 12600 2
Sorlie 76 7937 2
Wang 286 22283 2
Van’t Veer 65 24481 2
VandeVijver 295 24496 2
Sotiriou 99 7650 2
Pomeroy 60 7129 2
Khan 63 2308 4
Hedenfalk 22 3226 3
West 49 7129 4
Staunton 60 7129 9
Su 174 12533 11
Bhattacharjee 203 12600 5
Armstrong 72 12582 3
Ma 60 22575 3

Results: 1st part (comparison with the wrapper WSVM)
Name R’ WSVM R” MRMR RANK
Golub 0.0917 0.1177 0.1 0.1079 0.1225
Alon 0.2704 0.2658 0.2267 0.1996 0.2281
Notterman 0.1966 0.0985 0.1494 0.1472 0.1432
Nutt 0.3798 0.4171 0.3873 0.3847 0.4189
Shipp 0.1429 0.1319 0.1322 0.1362 0.1873
Singh 0.1619 0.1517 0.1266 0.1374 0.1328
Sorlie 0.3835 0.4314 0.3963 0.4004 0.3987
Wang 0.4282 0.4111 0.4218 0.4232 0.4181
Van’t Veer 0.2786 0.2638 0.2492 0.2217 0.2277
VandeVijver 0.454 0.4724 0.4365 0.4636 0.4482
Sotiriou 0.5279 0.5796 0.5351 0.5708 0.5339
Pomeroy 0.428 0.4191 0.4141 0.3876 0.4181
Khan 0.0878 0.1143 0.0582 0.0686 0.131
Hedenfalk 0.5475 0.5263 0.452 0.5273 0.5389
West 0.6463 0.6109 0.6186 0.5746 0.6109
Staunton 0.6822 0.71 0.6511 0.6865 0.7407
Su 0.2568 0.307 0.2549 0.3772 0.3352
Bhattacharjee 0.1232 0.1347 0.1105 0.1057 0.1515
Armstrong 0.1082 0.1199 0.1306 0.115 0.1122
Ma 0.2456 0.2041 0.2257 0.2413 0.2317
AVG 0.323 0.331 0.310 0.326 0.331
Worse/Better (W/B) than R' (R'') 10/7 9/6 9/2

Results: 2nd part (comparison with the wrapper WNB)
Name R’ WNB R” MRMR RANK
Golub 0.0886 0.1114 0.0971 0.1019 0.0904
Alon 0.2376 0.2568 0.2181 0.2109 0.221
Notterman 0.1852 0.2059 0.1491 0.1512 0.1645
Nutt 0.3929 0.3402 0.36 0.3898 0.4258
Shipp 0.1261 0.127 0.1198 0.1338 0.1734
Singh 0.1495 0.1454 0.1297 0.1377 0.1245
Sorlie 0.3848 0.4254 0.3808 0.3953 0.3838
Wang 0.4363 0.4345 0.4298 0.4281 0.4255
Van’t Veer 0.2747 0.2715 0.2421 0.2253 0.2325
VandeVijver 0.4626 0.44 0.4763 0.4721 0.4358
Sotiriou 0.5126 0.5578 0.5505 0.5732 0.5611
Pomeroy 0.4367 0.4389 0.4007 0.3902 0.4224
Khan 0.0804 0.0896 0.0628 0.0631 0.0901
Hedenfalk 0.5379 0.5187 0.4369 0.4904 0.4949
West 0.6413 0.6696 0.5542 0.5882 0.6728
Staunton 0.6689 0.8298 0.6981 0.6661 0.83
Su 0.2544 0.3096 0.2646 0.3739 0.3529
Bhattacharjee 0.1235 0.1209 0.101 0.1061 0.1186
Armstrong 0.1079 0.1668 0.125 0.1148 0.1034
Ma 0.2565 0.2635 0.2335 0.2443 0.2681
AVG 0.322 0.3335 0.315 0.327 0.331
Worse/Better (W/B) than R' (R'') 9/2 10/3 11/2

Conclusions
• Feature selection demands an accurate estimation of the relevance of subsets of features.
• Wrapper methods use a cross-validation estimation of the misclassification error with generic learners. We show that this entails a biased estimation of relevance.
• The cross-validation assessment $\hat{R}^{kNN}_s$ returned by kNN techniques with low k provides a low-bias yet high-variance estimator of relevance.
• The variance can be reduced by combining it with other estimators.
• Experiments on real datasets showed that the resulting relevance estimator can outperform both conventional wrapper and filter algorithms.