Perspectives of feature selection in bioinformatics:
from relevance to causal inference
Gianluca Bontempi
Machine Learning Group,
Interuniversity Institute of Bioinformatics in Brussels (IB)²
Computer Science Department
ULB, Université Libre de Bruxelles
The long way from data to knowledge
Which information can be extracted from data?
1 Descriptive statistics.
2 Parameters of a given model (model fitting, parameter
estimation, least squares).
3 Best predictive model among a set of candidates (validation,
assessment, bias/variance).
4 Most relevant features (multivariate statistics, regularization,
5 Causal information.
This is also a good outline for a statistical machine learning course
as well as my personal research journey.
Causality in science
A major goal of scientific activity is to model real
phenomena by studying the dependency between entities,
objects or, more generally, variables.
Sometimes the goal of the modelling activity is simply
predicting future behaviours. Sometimes the goal is to
understand the causes of a phenomenon (e.g. a disease).
Understanding the causes of a phenomenon means
understanding the mechanisms by which the observed
variables take their values and predicting how the values of
those variables would change if the mechanisms were subject
to manipulations (what-if scenarios).
Applications: understanding which actions to perform on a
system to obtain a desired effect (e.g. understanding the causes
of a tumor, the causes of the activation of a gene, the causes of
different survival rates in a cohort of patients).
Causal knowledge
Most of human knowledge is causal: it concerns how things
work in the world, i.e. mechanisms and behaviors.
This knowledge is causal in the sense that it is about the
mechanisms which lead from causes to effects.
Mechanism: it is characterized by some inputs and outputs;
the setting of the inputs determines the outputs but not vice versa.
Causal discovery aims to understand the mechanism by which
variables came to take on the values they have and to predict
what the values of those variables would be if the naturally
occurring mechanisms were subject to manipulations.
Intelligent behaviour should be related to the ability to
infer cause and effect relationships from observations.
Prediction by supervised learning
Relevance vs. causality
The design of predictive models is one of the main
contributions of machine learning.
The design of a model able to predict the value of a target
variable (e.g. phenotype, survival time) requires the definition
of a set of input variables (e.g. genome expression, weight,
age, smoking habits, nationality, frequency of vacations) which
are relevant, in the sense that they provide information about
the target.
It is easy to observe that the features which are good predictors
are not always the causes of the variable to be predicted.
In other words, while causal variables are always relevant, the
converse is not necessarily true. Sometimes, effects appear to
be better predictors than causes. Sometimes, good predictors
do not have a direct causal link with the target.
Relevance and causality: common cause pattern
The height of a child provides information (i.e. is relevant)
about his reading capability though it is not causing it.
Your child will not read better by pulling his legs... and reading
books doesn't make him taller...
Both height and reading capability depend on a common cause, the
age of the child. The variable age is called a confounding variable.
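The common cause pattern can be illustrated with a small simulation. The sketch below is only an illustration under assumed numbers: the age range, the slopes and the noise levels are invented, and linear Gaussian relationships are assumed so that a partial correlation can stand in for conditional dependence.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def partial_corr(a, b, c):
    """Correlation between a and b once the linear effect of c is removed."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Hypothetical generating mechanism: age drives both height and reading score.
rng = random.Random(0)
age = [rng.uniform(5, 12) for _ in range(5000)]          # common cause
height = [100 + 6 * a + rng.gauss(0, 4) for a in age]    # age -> height
reading = [10 * a + rng.gauss(0, 8) for a in age]        # age -> reading

r_marginal = pearson(height, reading)
r_given_age = partial_corr(height, reading, age)
print(f"corr(height, reading)       = {r_marginal:.2f}")
print(f"corr(height, reading | age) = {r_given_age:.2f}")
```

Height is strongly correlated with reading when age is ignored, while the association essentially vanishes once age is accounted for.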
Other examples
Some examples of false correlation may serve to illustrate the
difference between relevance and causality. In all these examples
the input is informative about the output (i.e. relevant) though it is
not a cause.
Input: number of firemen intervening in an accident. Target:
number of casualties.
Input: number of Cokes drunk per day by a person. Target:
her sport performance.
Input: sales of ice-cream in a country. Target: number of
drowning deaths.
Input: sleeping with shoes on. Target: waking up with a headache.
Input: chocolate consumption. Target: life expectancy.
Input: expression of gene 1. Target: expression of coregulated
gene 2.
Large dimensionality and causality
The problem of finding causes is even more difficult in
high-dimensional tasks (bioinformatics) where the number of
features (e.g. number of probes, variants) is often very large with
respect to the number of samples.
Even when experimental interventions are possible, performing
thousands of experiments to discover causal relationships
between thousands of variables is often not practical.
Dimensionality reduction techniques have been widely
discussed in statistics and machine learning. However, most of
the time they focused on improving prediction accuracy.
Open issue: can these techniques be useful also for causal
feature selection? Is prediction accuracy compatible with
causal discovery?
Feature selection: state of the art
Filter methods: preprocessing methods which assess the
merits of features from the data, ignoring the effects of the
selected feature subset on the performance of the learning
algorithm: ranking, PCA or clustering.
Wrapper methods: assess subsets of variables according to
their usefulness to a given predictor. Search for a good subset
using the learning algorithm itself as part of the evaluation
function: stepwise methods in linear regression.
Embedded methods: perform variable selection as part of the
learning procedure and are specific to a given learning machine:
classification trees, random forests, and methods based on
regularization techniques (e.g. lasso).
Ranking: the simplest feature selection
The most common feature selection strategy in bioinformatics
is ranking, where each variable is scored by its univariate
association with the target as returned by a measure of relevance,
like mutual information, correlation, or p-value.
Ranking is simple and fast but:
1 it cannot take into consideration higher-order interaction terms
(e.g. complementarity)
2 it disregards redundancy between features
3 it does not distinguish between causes and effects. This is due
to the fact that univariate correlation (or relevance) does not
imply causation
Causality is not addressed either in multivariate feature
selection approaches since their cost function typically takes
into consideration accuracy but disregards causal aspects.
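A minimal sketch of univariate ranking follows. It assumes absolute Pearson correlation as the relevance measure and uses an invented toy dataset in which x0 and x1 are causes of y, x2 is irrelevant and x3 is an effect of y; none of these names or numbers come from the slides.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def rank_features(columns, target):
    """Return feature indices sorted by decreasing |correlation| with target."""
    scores = [abs(pearson(col, target)) for col in columns]
    return sorted(range(len(columns)), key=lambda k: scores[k], reverse=True)

# Hypothetical toy data: x0, x1 cause y; x2 is noise; x3 is an effect of y.
rng = random.Random(1)
n = 500
x0 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x0, x1)]
x3 = [v + rng.gauss(0, 0.5) for v in y]

order = rank_features([x0, x1, x2, x3], y)
print(order)
```

In this toy setting the effect x3 is ranked ahead of the true causes, which is the third limitation listed above.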
Causality vs. dependency in a stochastic setting
A variable y is dependent on a variable x if the distribution of y is
different from the marginal one when we observe the value x = x
Prob {y|x = x} ≠ Prob {y}
Dependency is symmetric. If y is dependent on x, then x is
dependent on y.
Prob {x|y = y} ≠ Prob {x}
A variable x is a cause of a variable y if the distribution of y is
different from the marginal one when we set the value x = x
Prob {y|set(x = x)} ≠ Prob {y}
Causality is asymmetric: setting the effect y does not change the distribution of the cause x
Prob {x|set(y = y)} = Prob {x}
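The difference between observing and setting can be checked on a simulated mechanism. The sketch below assumes an invented binary mechanism x → y (y copies x with 10% noise); the `set_y` argument is a hypothetical way to mimic the set(·) operation.

```python
import random

rng = random.Random(0)

def sample(set_y=None):
    """Draw (x, y) from the mechanism x -> y; optionally force y by intervention."""
    x = 1 if rng.random() < 0.5 else 0
    if set_y is None:
        y = x if rng.random() < 0.9 else 1 - x   # y copies x with 10% noise
    else:
        y = set_y                                # set(y = y): the mechanism of y is replaced
    return x, y

n = 20000

# Observation: conditioning on y = 1 changes what we believe about x.
obs = [sample() for _ in range(n)]
x_given_y1 = [x for x, y in obs if y == 1]
p_x_obs = sum(x_given_y1) / len(x_given_y1)      # estimate of Prob{x = 1 | y = 1}

# Manipulation: setting y = 1 leaves the distribution of its cause x untouched.
intv = [sample(set_y=1) for _ in range(n)]
p_x_set = sum(x for x, _ in intv) / n            # estimate of Prob{x = 1 | set(y = 1)}

print(f"Prob{{x=1 | y=1}}      ~ {p_x_obs:.2f}")
print(f"Prob{{x=1 | set(y=1)}} ~ {p_x_set:.2f}")
```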
Main properties of causal relationships
Given causes (inputs) x and effects (output) y
stochastic dependency: changing x is likely to end up with a
change in y, in probabilistic terms the effects y are dependent
on the causes x
asymmetry: changing y won’t modify (the distribution of ) x
conditional independency: the effect y is independent of all
the other variables (apart from its effects) given the direct
causes x. In other words the direct causes screen off the
indirect causes from the effects. Note the analogy with the
notion of (Markov) state in (stochastic) dynamic systems.
temporality: the variation of y does not occur before that of x.
All this makes Directed Acyclic Graphs a convenient formalism to
represent causality.
Graphical model (excerpt from the paper by Guyon et al.)
[Figure: causal graph around the target variable Lung cancer; panel labels (c) and (d).]
Causation and data
Causation is much harder to measure than dependency (e.g.
correlation or mutual information). Correlations can be
estimated directly in a single uncontrolled observational study,
while causal conclusions are stronger with controlled experiments.
Data may be collected in an experimental or an observational setting.
Manipulation of variables is possible only in the experimental
setting. Two types of experimental configurations exist:
randomised and controlled. These are the typical settings
allowing causal discovery.
Most statistical studies are confronted with observational
static settings. Notwithstanding, causal knowledge is more
and more in demand by final users.
Entropy and conditional entropy
Consider a binary output class y ∈ {c1 = 0, c2 = 1}
The entropy of y is
H(y) = −p0 log p0 − p1 log p1
This quantity is greater than or equal to zero and measures the
uncertainty of y
Once we introduce the conditional probabilities
Prob {y = 1|x = x} = p1(x), Prob {y = 0|x = x} = p0(x)
we can define the conditional entropy for a given x
H[y|x] = −p0(x) log p0(x) − p1(x) log p1(x)
which measures the lack of predictability of y given x.
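As a small worked example, the binary entropy can be computed directly from p1. The sketch assumes base-2 logarithms (the slides do not fix the base) and the usual convention 0 log 0 = 0; the conditional entropy H[y|x] for a given x is the same function evaluated at p1(x).

```python
import math

def binary_entropy(p1):
    """Entropy (in bits) of a binary variable with Prob{y = 1} = p1."""
    h = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0.0:                  # convention: 0 log 0 = 0
            h -= p * math.log2(p)
    return h

print(binary_entropy(0.5))   # maximal uncertainty
print(binary_entropy(0.9))   # y is more predictable
print(binary_entropy(1.0))   # no uncertainty
```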
Information and dependency
Let us use the formalism of information theory to quantify the
dependency between variables.
Given two continuous rvs x1 ∈ X1, x2 ∈ X2, the mutual information
I(x1; x2) = H(x1) − H(x1|x2)
measures stochastic dependence between x1 and x2.
In the case of Gaussian distributed variables
I(x1; x2) = −(1/2) log(1 − ρ²)
where ρ is the Pearson correlation coefficient.
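The Gaussian formula is easy to evaluate. The sketch below assumes natural logarithms, so the result is in nats; the function name is illustrative.

```python
import math

def gaussian_mi(rho):
    """Mutual information (in nats) of two jointly Gaussian variables."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log(1.0 - rho ** 2)

for rho in (0.0, 0.3, 0.6, 0.9):
    print(rho, round(gaussian_mi(rho), 4))
```

The information is zero for uncorrelated Gaussian variables, depends only on the magnitude of ρ, and grows without bound as |ρ| approaches 1.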
Conditional information
The conditional mutual information
I(x1; x2|y) = H(x1|y) − H(x1|x2, y)
quantifies how the dependence between two variables depends
on the context.
The conditional mutual information is null iff x1 and x2 are
conditionally independent given y.
This is the case of the example with x1 =reading, x2 =height
and y =age.
The information that a (set of) variable(s) brings about
another is
1 conditional on the context (i.e. which other variables are
2 non monotone: it can increase or decrease according to the
Interaction information
The interaction information quantifies the amount of trivariate
dependence that cannot be explained by bivariate information.
I(x1; x2; y) = I(x1; y) − I(x1; y|x2).
When it is different from zero, we say that x1, x2 and y interact.
A non-zero interaction can be either negative, and in this case we
say that there is a synergy or complementarity between the
variables, or positive, and we say that there is redundancy.
I(x1; x2; y) = I(x1; y) − I(x1; y|x2) =
= I(x2; y) − I(x2; y|x1) = I(x1; x2) − I(x1; x2|y)
Complementary configuration: negative interaction
I(x1, y) = 0, I(x2, y) = 0, I(x1, y|x2) is maximal.
Confounding configuration: positive interaction
I(x1, x2) > 0, I(x1, x2|y) = 0
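The sign of the interaction can be verified on two textbook discrete distributions. The sketch below is an illustration with invented examples: an XOR target (complementarity) and three identical copies of one binary variable (redundancy). Joint distributions are dictionaries mapping (x1, x2, y) to probabilities, and entropies are in bits.

```python
import math
from collections import defaultdict

def entropy(joint, keep):
    """Entropy (bits) of the marginal over the coordinate positions in `keep`."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in keep)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """I(a; b) = H(a) + H(b) - H(a, b)."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

def cmi(joint, a, b, c):
    """I(a; b | c) = H(a, c) + H(b, c) - H(a, b, c) - H(c)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

def interaction(joint):
    """I(x1; x2; y) = I(x1; y) - I(x1; y | x2), coordinates ordered (x1, x2, y)."""
    return mi(joint, (0,), (2,)) - cmi(joint, (0,), (2,), (1,))

# Complementarity: y = x1 XOR x2, with x1 and x2 independent fair coins.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
# Redundancy: x1, x2 and y are copies of the same fair coin.
copy = {(a, a, a): 0.5 for a in (0, 1)}

print("XOR :", interaction(xor))    # negative interaction
print("copy:", interaction(copy))   # positive interaction
```

In the XOR case each input alone is uninformative about y while the pair determines it, which is the complementary configuration described above.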
Causal patterns: negative interaction
(a) Common effect pattern: I(x1, x2) = 0, I(x1, x2|y) > 0, I(x1; x2; y) < 0
(b) Spouse pattern: I(x2, y) = 0, I(x2, y|x1) > 0, I(x1; x2; y) < 0
Causal patterns: positive interaction
(c) Common cause pattern: I(x1, x2) > 0, I(x1, x2|y) = 0, I(x1; x2; y) > 0
(d) Brotherhood pattern: I(x2, y) > 0, I(x2, y|x1) = 0, I(x1; x2; y) > 0
(e) Causal chain pattern: I(x1, x2) > 0, I(x1, x2|y) = 0, I(x1; x2; y) > 0
Joint information and interaction
I((x1, x2); y) = I(x2; y) + I(x1; y|x2)
I(x1; y|x2) = I(x1; y) − I(x1; x2; y)
it follows that
I((x1, x2); y) = I(x1; y) + I(x2; y) − I(x1; x2; y) =
= I(x1; y) + I(x2; y) − [I(x1; x2) − I(x1; x2|y)]
where the left-hand side is the joint information of (x1, x2) about y.
Note that the above relationships hold also when either x1 or x2 are
vectorial random variables.
min-Interaction Max-Relevance (mIMR) filter
Let X+ = {xi ∈ X : I(xi ; y) > 0} be the subset of X containing all
variables having non null mutual information (i.e. non null
relevance) with y.
The mIMR forward step is
x*_{d+1} = arg max_{xk ∈ X+ − XS} [I(xk; y) − λ I(XS; xk; y)] ≈
≈ arg max_{xk ∈ X+ − XS} [I(xk; y) − (λ/d) Σ_{xi ∈ XS} I(xi; xk; y)]
where λ measures the amount of causation that we want to take
into consideration.
Note that λ = 0 boils down to the conventional ranking approach.
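The forward step can be sketched as follows. This is an illustrative implementation under stated assumptions, not the author's code: mutual information terms use the Gaussian approximation, conditional terms use first-order partial correlations, the interaction with the selected set is averaged over its members, the first variable is chosen by relevance alone (XS is empty), and the restriction to X+ is omitted because estimated relevances are always positive. All function names and the toy data are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def gmi(r):
    """Gaussian approximation of mutual information from a correlation r."""
    r2 = min(r * r, 1.0 - 1e-12)          # clip to avoid log(0)
    return -0.5 * math.log(1.0 - r2)

def partial_corr(a, b, c):
    """Correlation between a and b given c (linear Gaussian assumption)."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    den = math.sqrt(max(1 - r_ac ** 2, 1e-12) * max(1 - r_bc ** 2, 1e-12))
    return (r_ab - r_ac * r_bc) / den

def interaction(xi, xk, y):
    """I(xi; xk; y) = I(xi; xk) - I(xi; xk | y)."""
    return gmi(pearson(xi, xk)) - gmi(partial_corr(xi, xk, y))

def mimr(columns, y, v, lam):
    """Forward selection of v feature indices with the mIMR score."""
    relevance = [gmi(pearson(col, y)) for col in columns]
    remaining = list(range(len(columns)))
    selected = []
    while remaining and len(selected) < v:
        def score(k):
            if not selected:               # empty XS: relevance only
                return relevance[k]
            inter = sum(interaction(columns[i], columns[k], y) for i in selected)
            return relevance[k] - lam * inter / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: x0, x1 cause y; x2 is noise; x3 is an effect of y.
rng = random.Random(1)
n = 500
x0 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x0, x1)]
x3 = [t + rng.gauss(0, 0.5) for t in y]
cols = [x0, x1, x2, x3]

print("lambda = 0:", mimr(cols, y, 3, lam=0.0))   # plain ranking
print("lambda = 1:", mimr(cols, y, 3, lam=1.0))
```

With lam = 0 the interaction term is switched off and the selection coincides with ranking by relevance.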
Experimental setting
Goal: identification of a causal signature of breast cancer
Two steps:
1 compare the generalization accuracy of conventional ranking
with mIMR
2 interpret the causal signature
Each experiment was conducted in a meta-analytical and
cross-validation framework.
6 public microarray datasets (i.e., n = 13,091 unique genes)
derived from different breast cancer clinical studies.
All the microarray studies are characterized by the collection of
gene expression data and the survival data.
In order to adopt a classification framework, the survival of the
patients was transformed into a binary class (low or high
risk) given their clinical outcome at five years.
Two sets of meta-analysis validation experiments:
Holdout: 100 training-and-test repetitions.
Leave-one-dataset-out where for each dataset the features
used for classification are selected without considering the
patients of the dataset itself.
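The leave-one-dataset-out scheme can be sketched as a simple loop. The dataset names below are hypothetical placeholders for the six studies, and feature selection and classification are left out.

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out_name, training_names) for each dataset in turn."""
    names = list(datasets)
    for held_out in names:
        training = [name for name in names if name != held_out]
        yield held_out, training

# Hypothetical names standing in for the six microarray studies.
datasets = {f"study_{i}": None for i in range(1, 7)}
folds = list(leave_one_dataset_out(datasets))
for held_out, training in folds:
    # Features would be selected on `training` only, then assessed on `held_out`.
    print(held_out, "<- features selected on", training)
```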
All the experiments were repeated for three sizes of the gene
signature (number of selected features): v = 20, 50, 100.
All the mutual information terms are computed by using the
Gaussian approximation.
The quality of the selection is represented by the accuracy of a
Naive Bayes classifier measured by four different criteria to be maximized:
1 the Area Under the ROC curve (AUC),
2 1-RMSE where RMSE stands for Root Mean Squared Error
3 the SAR (Squared error, Accuracy, and ROC score)
4 the precision-recall F score measure.
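The four criteria can be computed from class labels and predicted probabilities as in the sketch below. It rests on assumptions not stated in the slides: a decision threshold of 0.5 for accuracy and F, and SAR taken as the average of accuracy, AUC and 1-RMSE, which is its usual definition. The labels and scores are invented.

```python
import math

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def rmse(y_true, scores):
    """Root mean squared error between predicted probabilities and labels."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(scores, y_true)) / len(y_true))

def accuracy(y_true, scores, thr=0.5):
    """Fraction of correct decisions at threshold thr (assumed 0.5)."""
    return sum((s >= thr) == (t == 1) for s, t in zip(scores, y_true)) / len(y_true)

def f_score(y_true, scores, thr=0.5):
    """Harmonic mean of precision and recall at threshold thr."""
    tp = sum(1 for s, t in zip(scores, y_true) if s >= thr and t == 1)
    fp = sum(1 for s, t in zip(scores, y_true) if s >= thr and t == 0)
    fn = sum(1 for s, t in zip(scores, y_true) if s < thr and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def sar(y_true, scores):
    """SAR = (accuracy + AUC + (1 - RMSE)) / 3."""
    return (accuracy(y_true, scores) + auc(y_true, scores)
            + (1 - rmse(y_true, scores))) / 3

labels = [0, 0, 1, 1]                 # invented example
probs = [0.1, 0.4, 0.35, 0.8]
print("AUC   ", auc(labels, probs))
print("1-RMSE", 1 - rmse(labels, probs))
print("SAR   ", sar(labels, probs))
print("F     ", f_score(labels, probs))
```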
Holdout results
v = 20    λ = 0    λ = 0.2  λ = 0.4  λ = 0.6  λ = 0.8  λ = 0.9  λ = 1    λ = 2
AUC       0.688    0.688    0.694    0.699    0.703    0.704    0.705    0.707
1-RMSE    0.460    0.466    0.481    0.493    0.504    0.510    0.515    0.542
SAR       0.559    0.561    0.569    0.575    0.580    0.583    0.585    0.595
F         0.255    0.254    0.260    0.262    0.265    0.265    0.266    0.274
W-L                1-0      3-0      5-0      6-0      5-0      5-0      5-0

v = 50    λ = 0    λ = 0.2  λ = 0.4  λ = 0.6  λ = 0.8  λ = 0.9  λ = 1    λ = 2
AUC       0.693    0.698    0.702    0.706    0.709    0.710    0.711    0.715
1-RMSE    0.451    0.458    0.465    0.471    0.477    0.479    0.482    0.503
SAR       0.552    0.556    0.562    0.567    0.571    0.572    0.574    0.583
F         0.263    0.265    0.268    0.270    0.272    0.271    0.273    0.277
W-L                2-0      3-0      3-0      2-0      2-0      3-0      4-0

v = 100   λ = 0    λ = 0.2  λ = 0.4  λ = 0.6  λ = 0.8  λ = 0.9  λ = 1    λ = 2
AUC       0.699    0.704    0.708    0.711    0.714    0.715    0.715    0.716
1-RMSE    0.454    0.457    0.459    0.463    0.467    0.470    0.472    0.487
SAR       0.545    0.549    0.553    0.557    0.561    0.563    0.564    0.573
F         0.272    0.271    0.272    0.274    0.274    0.274    0.275    0.284
W-L                1-0      1-0      1-0      2-0      3-0      4-1      4-1

W-L is the number of datasets for which the causal filter is significantly
more (W) or less (L) accurate than the conventional ranking filter (λ = 0).
Causal interpretation
The introduction of a causality term leads to a prioritization of
the genes according to their causal role.
Since genes are not acting in isolation but rather in pathways,
we analyzed the gene rankings in terms of gene set enrichment
analysis (GSEA).
By quantifying how the causal rank of genes diverges from the
conventional one (λ = 0) with respect to λ we can identify the
gene sets that are potential causes or effects of breast cancer.
Causal characterization of genes
Genes that remain among the top ranked ones for increasing
λ can be considered as individually relevant (i.e. they
contain predictive information about survival) and causal.
Genes whose rank increases for increasing λ are putative
causes: they have less individual relevance than other genes
(for example, those being direct effects) but they are causal
together with others. These genes would have been missed by
conventional ranking (false negatives).
Genes whose rank decreases for increasing λ are putative
effects in the sense that they are individually relevant but
probably not causal. This set of genes could be erroneously
considered as causal (false positives) by conventional ranking.
GSEA analysis
[Figure: three panels showing the Normalized Enrichment Score (NES, axis from −2 to 2) of GO terms; legend values 0, 0.5, 1, 2.
Panel 1: Microtubule cytoskeleton organization and biogenesis; Coenzyme metabolic process; Regulation of cyclin dependent protein kinase activity.
Panel 2: Cellular defense; Inflammatory response; Defense response.
Panel 3: M phase of mitotic cycle; DNA replication; M phase.]
The larger the NES of a GO term, the stronger the association of this gene
set with survival; the sign of the NES reflects the direction of
association of the GO term with survival, a positive score meaning
that over-expression of the genes implies worse survival and a negative score the opposite.
Individually causal genes
The first group of GO terms is implicated in cell movement and
division, cellular respiration and regulation of the cell cycle. It was
shown that this family of proteins may cause dysregulation of cell
proliferation to promote tumor progression.
The second GO term represents the co-enzyme metabolic process
which includes proteins shown to be early indicators of breast
cancer; perturbation of these co-enzymes might cause cancers by
compromising the structure of important enzyme complexes
implicated in mitochondrial functions.
The genes of the third GO term, regulation of cyclin-dependent protein
kinase activity, are key players in cell cycle regulation, and inhibition
of such kinases proved to block proliferation of human breast cancer
Jointly causal genes
Counterintuitively, the three GO terms in this category are related
to the immune system that is thought to be more an effect of the
tumor growth as lymphocytes strike cancer cells as they proliferate.
However, several findings support the idea that the immune system
might have a causal role in tumorigenesis.
There is strong evidence of interplay between the immune system and
tumors since solid tumors are commonly infiltrated by immune cells;
in contrast to infiltration of cells responsible for chronic
inflammation, the presence of high numbers of lymphocytes,
especially T cells, has been reported to be an indicator of good
prognosis in many cancers, which concurs with the sign of the NES.
Putative effects
The last group of GO terms is related to the cell cycle and
proliferation.
In our previous research, we have shown that a quantitative
measurement of proliferation genes using mRNA gene expression
could provide an accurate assessment of prognosis of breast cancer.
The enrichment of these proliferation-related genes seems to be a
downstream effect of the breast tumorigenesis instead of its cause.
Indistinguishable cases
mIMR shows that some causal patterns (e.g. open triplets or
unshielded colliders) can be discriminated by using notions
based on conditional independence.
These notions are exploited also by structural identification
approaches (e.g. PC algorithm in Bayesian networks) which
rely on notions of independence and conditional independence
to detect causal patterns in the data.
Unfortunately, these approaches cannot deal with
indistinguishable configurations like the two-variable setting
and the completely connected triplet configuration where it is
impossible to distinguish between cause and effects by means
of conditional or unconditional independence tests.
From dependency to causality
However, indistinguishability does not prevent the existence of
statistical algorithms able to reduce the uncertainty about the
causal pattern even in indistinguishable configurations.
In recent years a series of approaches has appeared to deal with
the two-variable setting, such as ANM and IGCI.
What is common to these approaches is that they use
alternative statistical features of the data to detect causal
patterns and reduce the uncertainty about their directionality.
A further important step in this direction has been represented
by the recent ChaLearn cause-effect pair challenge (YouTube
video "CauseEffectPairs" by I. Guyon).
ChaLearn cause-effect pair challenge
Hundreds of pairs of real variables with known causal
relationships from several domains (chemistry, climatology,
ecology, economy, engineering, epidemiology, genomics,
Those were intermixed with controls (pairs of independent
variables and pairs of variables that are dependent but not
causally related) and semi-artificial cause-effect pairs (real
variables mixed in various ways to produce a given outcome).
The good rate of accuracy obtained by the competitors shows
that learning strategies can infer with success (or at least
significantly better than random) indistinguishable configurations.
We took part in the ChaLearn challenge and we developed a
Dependency to Causality (D2C) learning approach for bivariate
settings which ranked 8th in the final leaderboard.
Is A cause of B?
... or B cause of A?
The D2C approach
Given two variables, the D2C approach infers from a number
of observed statistical features of the bivariate distribution
(e.g. the empirical estimation of the copula) or the n-variate
distribution (e.g. dependencies between members of the
Markov Blankets) the probability of the existence and then of
the directionality of the causal link between two variables.
The approach is an example of how the problem of causal
inference can be formulated as a supervised machine learning
approach where the inputs are features describing the
probabilistic dependency and the output is a class denoting the
existence (or not) of the causal link.
Once sufficient training data are made available, conventional
feature selection algorithms and classifiers can be used to
return a prediction.
The D2C approach: n > 2 variables
Some asymmetrical relationships
By using d-separation we can write down a set of asymmetrical
relations between the members of the two Markov Blankets:
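With zi the cause and zj the effect, c denoting direct causes and e direct effects in each Markov Blanket (⊥⊥ independent, ⊥̸⊥ dependent):

Unconditional            Conditioning on the effect     Conditioning on the cause
∀k c_i^(k) ⊥⊥ c_j^(k)    ∀k c_i^(k) ⊥̸⊥ c_j^(k) | zj    ∀k c_i^(k) ⊥⊥ c_j^(k) | zi
∀k e_i^(k) ⊥⊥ c_j^(k)    ∀k e_i^(k) ⊥̸⊥ c_j^(k) | zj    ∀k e_i^(k) ⊥⊥ c_j^(k) | zi
∀k c_i^(k) ⊥̸⊥ e_j^(k)    ∀k c_i^(k) ⊥⊥ e_j^(k) | zj    ∀k c_i^(k) ⊥⊥ e_j^(k) | zi
∀k zi ⊥⊥ c_j^(k)         ∀k zi ⊥̸⊥ c_j^(k) | zj
∀k c_i^(k) ⊥̸⊥ zj                                        ∀k c_i^(k) ⊥⊥ zj | zi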
The algorithm
1 infers the Markov Blankets MBi = {m_i^(ki), ki = 1, . . . , Ki} and
MBj = {m_j^(kj), kj = 1, . . . , Kj} of zi and zj,
2 computes the positions Pi(ki) of the members m_i^(ki) of MBi in MBj and the
positions Pj(kj) of the members m_j^(kj) of MBj in MBi,
3 computes
  1 I = [I(zi; zj), I(zi; zj | MBj \ zi), I(zi; zj | MBi \ zj)] where \
  denotes the set difference operator,
  2 Ii(ki; kj) = I(m_i^(ki); m_j^(kj) | zi) and Ij(ki; kj) = I(m_i^(ki); m_j^(kj) | zj)
  where ki = 1, . . . , Ki, kj = 1, . . . , Kj,
4 creates a vector of descriptors
x = [Q(P̂i), Q(P̂j), I, Q(Îi), Q(Îj), N, n]
where P̂i and P̂j are the empirical distributions of Pi and Pj,
Îi and Îj are the empirical distributions of Ii(ki, kj) and
Ij(ki, kj) (ki = 1, . . . , Ki, kj = 1, . . . , Kj), and Q returns the
sample quantiles of a distribution.
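Step 4 can be sketched as the concatenation of sample quantiles with the three information terms. The sketch is illustrative only: the choice of quantile levels, the interpolation rule, the reading of N and n as number of samples and number of variables, and all input values are assumptions, and the estimation of Markov Blankets and information terms (steps 1 to 3) is not shown.

```python
def quantiles(values, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Sample quantiles by linear interpolation between order statistics."""
    data = sorted(values)
    out = []
    for p in probs:
        pos = p * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        out.append(data[lo] + (pos - lo) * (data[hi] - data[lo]))
    return out

def descriptor(P_i, P_j, I, I_i, I_j, N, n):
    """x = [Q(P_i), Q(P_j), I, Q(I_i), Q(I_j), N, n] as a flat list."""
    return (quantiles(P_i) + quantiles(P_j) + list(I)
            + quantiles(I_i) + quantiles(I_j) + [N, n])

# Hypothetical inputs for one pair (zi, zj).
P_i = [1, 3, 4, 7]                    # positions of the members of MBi in MBj
P_j = [2, 2, 5, 6]                    # positions of the members of MBj in MBi
I = [0.30, 0.05, 0.12]                # the three information terms of step 3.1
I_i = [0.01, 0.02, 0.20, 0.15]        # values of Ii(ki; kj)
I_j = [0.00, 0.03, 0.01, 0.02]        # values of Ij(ki; kj)

x = descriptor(P_i, P_j, I, I_i, I_j, N=200, n=25)
print(len(x), x)
```

Each pair of variables is thus summarized by a fixed-length vector regardless of the sizes of the two Markov Blankets, which is what allows a standard classifier to be trained on it.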
The algorithm in words
Asymmetries between MBi and MBj induce an asymmetry on
( ˆPi , ˆPj ), and (ˆIi , ˆIj ) and the quantiles provide information
about the directionality of causal link (zi → zj or zj → zi .)
The distribution of these variables should return useful
information about which is the cause and the effect.
These distributions would be more informative if we were able to
rank the terms of the Markov Blankets by prioritizing the
direct causes (i.e. the terms ci and cj ) since these terms play
a major role in the asymmetries.
The D2C algorithm can then be improved by choosing mIMR
to prioritize the direct causes in the MB set.
Training data generation
[Figure: training data generation. Observed data matrices (rows x11, x12, …, x1d; x21, x22, …, x2d) and node pairs (1→2, 1→3, 3→4, 4→3, 5→7, 5→6, 6→7) are turned into a table with columns "Descriptor vector" (rows z11, …, z1n through zN1, …, zNn) and "Class".]
Experimental validation
The training set is made of D = 6000 pairs (xd, yd) and is obtained
by generating 750 DAGs and storing for each of them the
descriptors associated with 4 positive examples (i.e. a pair
where the node zi is a direct cause of zj) and 4 negative
examples (i.e. a pair where zi is not a direct cause of zj).
Dependency between children and parents modelled by 3 types
of additive relationships (linear, quadratic, nonlinear)
A Random Forest classifier is trained on the balanced dataset
and assessed on the test set.
The test set is made of 190 independent DAGs for the small
configuration and 90 for the large configuration. For each
DAG we select 4 positive examples (i.e. a pair where the node
zi is a direct cause of zj) and 6 negative examples (i.e. a pair
where the node zi is not a direct cause of zj).
Comparison with state-of-the-art approaches implemented by
the package bnlearn: ANM, DAGL, GS, IAMB, PC, HC,
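The generation of labelled pairs can be sketched as below. This is a simplified illustration, not the experimental code: the DAG sampler, the edge probability, the unit-variance Gaussian noise and the use of tanh as the generic nonlinearity are all assumptions, and the computation of the descriptors from the simulated data is omitted.

```python
import math
import random

def random_dag(n_nodes, edge_prob, rng):
    """Random DAG: node j may have any earlier node i < j as a parent."""
    return {j: [i for i in range(j) if rng.random() < edge_prob]
            for j in range(n_nodes)}

MECHANISMS = {
    "linear": lambda s: s,
    "quadratic": lambda s: s * s,        # unscaled: deep graphs may need rescaling
    "nonlinear": lambda s: math.tanh(s), # stand-in for a generic nonlinearity
}

def simulate(parents, n_samples, kind, rng):
    """Each node = f(sum of its parents) + Gaussian noise (additive model)."""
    f = MECHANISMS[kind]
    rows = []
    for _ in range(n_samples):
        z = {}
        for j in sorted(parents):        # topological order by construction
            signal = sum(z[i] for i in parents[j])
            z[j] = (f(signal) if parents[j] else 0.0) + rng.gauss(0, 1)
        rows.append([z[j] for j in sorted(parents)])
    return rows

def labelled_pairs(parents, n_pos, n_neg, rng):
    """Pairs (i, j, label): label 1 if i is a direct cause of j, else 0."""
    nodes = sorted(parents)
    pos = [(i, j) for j in nodes for i in parents[j]]
    neg = [(i, j) for i in nodes for j in nodes
           if i != j and i not in parents[j]]
    chosen_pos = rng.sample(pos, min(n_pos, len(pos)))
    chosen_neg = rng.sample(neg, min(n_neg, len(neg)))
    return [(i, j, 1) for i, j in chosen_pos] + [(i, j, 0) for i, j in chosen_neg]

rng = random.Random(0)
dag = random_dag(20, 0.2, rng)
data = simulate(dag, 100, "nonlinear", rng)
pairs = labelled_pairs(dag, 4, 4, rng)   # 4 positives and 4 negatives per DAG
print(pairs)
```

Repeating this over many DAGs and replacing each pair by its descriptor vector would give a balanced training set of the kind described above.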
[Figure: balanced error rate (BER) results, small networks (nodes n ∈ [20, 30]). Methods compared: ANM, DAGL1, DAGSearch, DAGSearchSparse, gs, hc, iamb, mmhc, si.hiton.pc, tabu, D2C400_lin, D2C3000_lin, D2C60000_lin, D2C400, D2C3000, D2C60000.]
[Figure: balanced error rate (BER) results, large networks (nodes n ∈ [500, 1000]). Methods compared: ANM, DAGL1, DAGSearch, DAGSearchSparse, hc, mmhc, si.hiton.pc, tabu, D2C400_lin, D2C3000_lin, D2C60000_lin, D2C400, D2C3000, D2C60000.]
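For reference, the balanced error rate averages the error rates of the two classes, so that it is not dominated by the more frequent class. The sketch below uses invented predictions for a test DAG with 4 positive and 6 negative pairs.

```python
def balanced_error_rate(y_true, y_pred):
    """Mean of the error rates on the positive and on the negative class."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (fn / n_pos + fp / n_neg)

y_true = [1] * 4 + [0] * 6                    # 4 positives, 6 negatives
y_pred = [1, 1, 1, 0] + [0, 0, 0, 0, 1, 1]    # hypothetical classifier output
print(balanced_error_rate(y_true, y_pred))
```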
The scientific community demands learning algorithms
able to detect, in a fast and reliable manner, subsets of
informative and causal features from observational data.
Pessimistic point of view: Correlation (or dependency) does
not imply causation.
Optimistic point of view: Causation implies correlation (or
dependency).
Causality leaves footprints on the patterns of stochastic
dependency which can be (hopefully) retrieved from data.
This implies that inferring causes without designing
experiments is possible once we look for such constraints.
Quote from Scheines
Statisticians are often skeptical about causality but in almost every
case these same people (over beer later) confess their heresy by
concurring that their real ambitions are causal and their public
agnosticism is a prophylactic against the abuse of statistics by their
clients or less careful practitioners. (Scheines 2002)

