Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
Perspectives of feature selection in bioinformatics:
from relevance to causal inference
Gianluca Bontempi
Machine Learning Group,
Interuniversity Institute of Bioinformatics in Brussels (IB)2
Computer Science Department
ULB, Université Libre de Bruxelles
http://mlg.ulb.ac.be
The long way from data to knowledge
Which information can be extracted from data?
1 Descriptive statistics.
2 Parameters of a given model (model fitting, parameter
estimation, least squares).
3 Best predictive model among a set of candidates (validation,
assessment, bias/variance).
4 Most relevant features (multivariate statistics, regularization,
search).
5 Causal information.
This is also a good outline for a statistical machine learning course
as well as my personal research journey.
Causality in science
A major goal of the scientific activity is to model real
phenomena by studying the dependency between entities,
objects or more in general variables.
Sometimes the goal of the modelling activity is simply
predicting future behaviours. Sometimes the goal is to
understand the causes of a phenomenon (e.g. a disease).
Understanding the causes of a phenomenon means
understanding the mechanisms by which the observed
variables take their values and predicting how the values of
those variables would change if the mechanisms were subject
to manipulations (what-if scenarios).
Applications: understanding which actions to perform on a
system to have a desired effect (eg. understanding the causes
of tumor, the causes of the activation of a gene, the causes of
different survival rates in a cohort of patients.)
Causal knowledge
Most of human knowledge is causal and concerns how things
work in the world, about mechanisms, behaviors.
This knowledge is causal in the sense it is about the
mechanisms which bring from causes to effects.
Mechanism: it is characterized by some inputs and outputs,
the setting of inputs determines the outputs but not viceversa.
Causal discovery aims to understand the mechanism by which
variables came to take on the values they have and to predict
what the values of those variables would be if the naturally
occurring mechanisms were subject to manipulations.
Intelligent behaviour should be related to the ability of
inferring from observations cause and effect relationships
Prediction by supervised learning
ERROR
PREDICTION
MODEL
STOCHASTIC
DEPENDENCY
INPUT
PREDICTION
OUTPUT
DATA
Relevance vs. causality
The design of predictive models is one of the main
contributions of machine learning.
The design of a model able to predict the value of a target
variable (e.g. phenotype, survival time) requires the definition
of a set of input variables (e.g. genome expression, weight,
age, smoking habits, nationality, frequency of vacations) which
are relevant, in the sense that they provide information about
the target.
It is easy to observe that the features which are good predictors
are not always the causes of the variable to be predicted.
In other terms if causal variables are always relevant, the
contrary is not necessarily true. Sometimes, effects appears to
be better predictors than causes. Sometimes, good predictors
do not have a direct causal link with the target.
Relevance and causality: common cause pattern
age
reading
capability
height
The height of a child provides information (i.e. is relevant)
about his reading capability though it is not causing it.
Your child will not read better by pulling his legs...and reading
books doesn’t make him taller...
The variable age is called confounding variable.
Other examples
Some examples of false correlation may serve to illustrate the
difference between relevance and causality. In all these examples
the input is informative about the output (i.e. relevant) though it is
not a cause.
Input: number of firemen intervening in an accident. Target:
number of casualties.
Input: amount of Cokes drunk per day by a person. Target:
her sport performance.
Input: sales of ice-cream in a country. Target: number of
drowning deaths.
Input: sleeping with shoes. Target: wake up with an headache.
Input: chocolate consumption. Target: life expectancy.
Input: expression of gene 1. Target: expression of coregulated
gene 2.
Large dimensionality and causality
The problem of finding causes is still more difficult in large
dimensional tasks (bioinformatics) where often the number of
features (e.g. number of probes, variants) is very large with
respect to the number of samples.
Even when experimental interventions are possible, performing
thousands of experiments to discover causal relationships
between thousands of variables is often not practical.
Dimensionality reduction techniques have been largely
discussed in statistics and machine learning. However, most of
the time they focused on improving prediction accuracy.
Open issue: can these techniques be useful also for causal
feature selection? Is prediction accuracy compatible with
causal discovery?
Feature selection: state of the art
Filter methods: preprocessing methods which assess the
merits of features from the data, ignoring the effects of the
selected feature subset on the performance of the learning
algorithm: ranking, PCA or clustering.
Wrapper methods: assess subsets of variables according to
their usefulness to a given predictor. Search for a good subset
using the learning algorithm itself as part of the evaluation
function: stepwise methods in linear regression.
Embedded methods: variable selection as part of the
learning procedure and are specific to learning machines:
classification trees, random forests, and methods based on
regularization techniques (e.g. lasso)
Ranking: the simplest feature selection
The most common feature selection strategies in bioinformatics
is ranking where each variable is scored with the univariate
association with the target returned by a measure of relevance,
like mutual information, correlation, or p-value.
Ranking is simple and fast but:
1 it cannot take into consideration higher-order interaction terms
(e.g. complementarity)
2 it disregards redundancy between features
3 it does not distinguish between causes and effects. This is due
to the fact that univariate correlation (or relevance) does not
imply causation
Causality is not addressed either in multivariate feature
selection approaches since their cost function typically takes
into consideration accuracy but disregards causal aspects.
Causality vs. dependency in a stochastic setting
A variable x is dependent on a variable y if the distribution of y is
different from the marginal one when we observe the value x = x
Prob {y|x = x} = Prob {y}
Dependency is symmetric. If x is dependent of y, then y is
dependent on x.
Prob {x|y = y} = Prob {x}
A variable x is a cause of a variable y if the distribution of y is
different from the marginal one when we set the value x = x
Prob {y|set(x = x)} = Prob {y}
Causality is asymmetric:
Prob {x|set(y = y)} = Prob {x}
Main properties of causal relationships
Given causes (inputs) x and effects (output) y
stochastic dependency: changing x is likely to end up with a
change in y, in probabilistic terms the effects y are dependent
on the causes x
asymmetry: changing y won’t modify (the distribution of ) x
conditional independency: the effect y is independent of all
the other variables (apart from its effects) given the direct
causes x. In other words the direct causes screen off the
indirect causes from the effects. Note the analogy with the
notion of (Markov) state in (stochastic) dynamic systems.
temporality: the variation of y does not occur before x.
All this make Directed Acyclic Graphs a convenient formalism to
represent causality.
Graphical model (exc. from Guyon et al paper.)
Coughing
Allergy
Smoking
Anxiety
Genetic
factor
(a)
(d) (c)
Hormonal
factor
Metastasis
(b)
Other
cancers
Lung cancer
Causation and data
Causation is much harder to measure than dependency (e.g.
correlation or mutual information). Correlations can be
estimated directly in a single uncontrolled observational study,
while causal conclusions are stronger with controlled
experiments.
Data may be collected in experimental or observational setting.
Manipulation of variable is possible only in the experimental
setting. Two types of experimental configurations exist:
randomised and controlled. These are the typical settings
allowing causal discovery.
Most statistical study are confronted with observational
static settings. Notwithstanding, causal knowledge is more
and more demanded by final users.
Entropy and conditional entropy
Consider a binary output class y ∈ {c1 = 0, c2 = 1}
The entropy of y is
H(y) = −p0 log p0 − p1 log p1
This quantity is greater equal than zero and measures the
uncertainty of y
Once introduced the conditional probabilities
Prob {y = 1|x = x} = p1(x), Prob {y = 0|x} = p0(x)
we can define the conditional entropy for a given x
H[y|x] = −p0(x) log p0(x) − p1(x) log p1(x)
which measures the lack of predictability of y given x.
Information and dependency
Let us use the formalism of information theory to quantify the
dependency between variables.
Given two continuous rvs x1 ∈ X1, x2 ∈ X2, the mutual
information
I(x1; x2) = H(x1) − H(x1|x2)
measures stochastic dependence between x1 and x2.
In the case of Gaussian distributed variables
I(x1; x2) = −
1
2
log(1 − ρ2
)
where ρ is the Pearson correlation coefficient.
Conditional information
The conditional mutual information
I(x1; x2|y) = H(x1|y) − H(x1|x2, y)
quantifies how the dependence between two variables depends
on the context.
The conditional mutual information is null iff x1 and x2 are
conditionally independent given y.
This is the case of the example with x1 =reading, x2 =height
and y =age.
The information that a (set of) variable(s) brings about
another is
1 conditional on the context (i.e. which other variables are
known).
2 non monotone: it can increase or decrease according to the
context.
Interaction information
The interaction information quantifies the amount of trivariate
dependence that cannot be explained by bivariate information.
I(x1; x2; y) = I(x1; y) − I(x1; y|x2).
When it is different from zero, we say that x1, x2 and y
three-interact.
A non-zero interaction can be either negative, and in this case we
say that there is a synergy or complementarity between the
variables, or positive, and we say that there is redundancy.
I(x1; x2; y) = I(x1; y) − I(x1; y|x2) =
= I(x2; y) − I(x2; y|x1) = I(x1; x2) − I(x1; x2|y)
Complementary configuration: negative interaction
x1
x2
I(x1, y) = 0, I(x2, y) = 0, I(x1, y|x2) is maximal.
Confounding configuration: positive interaction
x1
x2
I(x1, x2) > 0, I(x1, x2|y) = 0
Causal patterns: negative interaction
y
x2x1
x1
x2y
(a) Common effect pattern, (b) spouse pattern
I(x1, x2) = 0, I(x1, x2|y) > 0 I(x2, y) = 0, I(x2, y|x1) > 0
I(x1; x2; y) < 0 I(x1; x2; y) < 0
Causal patterns: positive interaction
y
x2x1
x1
x2y
c) common cause pattern d) brotherhood pattern
I(x1, x2) > 0, I(x1, x2|y) = 0 I(x2, y) > 0, I(x2, y|x1) = 0
I(x1; x2; y) > 0 I(x1; x2; y) > 0
y x2x1
I(x1, x2) > 0, I(x1, x2|y) = 0
(e) causal chain pattern I(x1; x2; y) > 0
Joint information and interaction
Since
I((x1, x2); y) = I(x2; y) + I(x1; y|x2)
and
I(x1; y|x2) = I(x1; y) − I(x1; x2; y)
it follows that
I((x1, x2); y)
Joint information
= I(x1; y) + I(x2; y) − I(x1; x2; y) =
= I(x1; y)
Relevance
+ I(x2; y)
Relevance
− [I(x1; x2) − I(x1; x2|y)]
Interaction
(1)
Note that the above relationships hold also when either x1 or x2 are
vectorial random variables.
min-Interaction Max-Relevance (mIMR) filter
Let X+ = {xi ∈ X : I(xi ; y) > 0} the subset of X containing all
variables having non null mutual information (i.e. non null
relevance) with y.
The mIMR forward step is
x∗
d+1 = arg max
xk ∈X+−XS
[I(xk; y) − λI(XS; xk ; y)] ≈
≈ arg max
xk ∈X+−XS
I(xk ; y) −
λ
d
xi ∈XS
(I(xi ; xk ; y)
where λ measures the amount of causation that we want to take
into consideration.
Note that λ = 0 boils down to the conventional ranking approach.
Experimental setting
Goal: identification of a causal signature of breast cancer
survival.
Two steps:
1 compare the generalization accuracy of conventional ranking
with mIMR
2 interpret the causal signature
Each experiment was conducted in a meta-analytical and
cross-validation framework.
Datasets
6 public microarray datasets (i.e., n = 13, 091 unique genes)
derived from different breast cancer clinical studies.
All the microarray studies are characterized by the collection of
gene expression data and the survival data.
In order to adopt a classification framework, the survival of the
patients was transformed in a binary class such as low or high
risk of the patients given their clinical outcome at five years
Experiments
Two sets of meta-analysis validation experiments:
Holdout: 100 training-and-test repetitions.
Leave-one-dataset-out where for each dataset the features
used for classification are selected without considering the
patients of the dataset itself.
All the experiments were repeated for three sizes of the gene
signature (number of selected features): v = 20, 50, 100.
All the mutual information terms are computed by using the
Gaussian approximation.
Assessment
The quality of the selection is represented by the accuracy of a
Naive Bayes classifier measured by four different criteria to be
maximized:
1 the Area Under the ROC curve (AUC),
2 1-RMSE where RMSE stands for Root Mean Squared Error
3 the SAR (Squared error, Accuracy, and ROC score)
4 the precision-recall F score measure.
Holdout results
v = 20 λ = 0 λ = 0.2 λ = 0.4 λ = 0.6 λ = 0.8 λ = 0.9 λ = 1 λ = 2
AUC 0.688 0.688 0.694 0.699 0.703 0.704 0.705 0.707
1-RMSE 0.460 0.466 0.481 0.493 0.504 0.510 0.515 0.542
SAR 0.559 0.561 0.569 0.575 0.580 0.583 0.585 0.595
F 0.255 0.254 0.260 0.262 0.265 0.265 0.266 0.274
W-L 1-0 3-0 5-0 6-0 5-0 5-0 5-0
v = 50 λ = 0 λ = 0.2 λ = 0.4 λ = 0.6 λ = 0.8 λ = 0.9 λ = 1 λ = 2
AUC 0.693 0.698 0.702 0.706 0.709 0.710 0.711 0.715
1-RMSE 0.451 0.458 0.465 0.471 0.477 0.479 0.482 0.503
SAR 0.552 0.556 0.562 0.567 0.571 0.572 0.574 0.583
F 0.263 0.265 0.268 0.270 0.272 0.271 0.273 0.277
W-L 2-0 3-0 3-0 2-0 2-0 3-0 4-0
v = 100 λ = 0 λ = 0.2 λ = 0.4 λ = 0.6 λ = 0.8 λ = 0.9 λ = 1 λ = 2
AUC 0.699 0.704 0.708 0.711 0.714 0.715 0.715 0.716
1-RMSE 0.454 0.457 0.459 0.463 0.467 0.470 0.472 0.487
SAR 0.545 0.549 0.553 0.557 0.561 0.563 0.564 0.573
F 0.272 0.271 0.272 0.274 0.274 0.274 0.275 0.284
W-L 1-0 1-0 1-0 2-0 3-0 4-1 4-1
W-L is the number of datasets for which the causal filter is significantly
more (W) or less (L) accurate than the conventional ranking filter .
Causal interpretation
The introduction of a causality term leads to a prioritization of
the genes according to their causal role.
Since genes are not acting in isolation but rather in pathways,
we analyzed the gene rankings in terms of gene set enrichment
analysis (GSEA).
By quantifying how the causal rank of genes diverges from the
conventional one (λ = 0) with respect to λ we can identify the
gene sets that are potential causes or effects of breast cancer.
Causal characterization of genes
Genes that remains among the top ranked ones for increasing
λ can be considered as individually relevant (i.e. they
contain predictive information about survival) and causal.
Genes whose rank increases for increasing λ are putative
causes: they have less individual relevance than other genes
(for example, those being direct effects) but they are causal
together with other. These genes would have been missed by
conventional ranking (false negatives).
Genes whose rank decreases for increasing λ are putative
effects in the sense that they are individually relevant but
probably not causal. This set of genes could be erroneously
considered as causal (false positives ) by conventional
ranking.
GSEA analysis
Normalized Enrichment Score
−2 −1 0 1 2
Normalized Enrichment Score
−2 −1 0 1 2
Normalized Enrichment Score
−2 −1 0 1 2
Microtubule cytoskeleton
organization and biogenesis
Coenzyme metabolic process
Regulation of cyclin dependent
protein kinase activity
A B C
Cellular defense
response
Inflammatory response
Defense response
M phase of mitotic cycle
DNA replicaton
NES NES NES
M phase
0 0.5 1 2
λ
Larger the NES of a GO term, stronger the association of this gene
set with survival; the sign of NES reflects the direction of
association of the GO term with survival, a positive score meaning
that over-expression of the genes implies worst survival and
inversely.
Individually causal genes
The first group of GO terms are implicated in cell movement and
division, cellular respiration and regulation of cell cycle. It was
shown that this family of proteins may cause dysregulation of cell
proliferation to promote tumor progression.
The second GO term represents the co-enzyme metabolic process
which includes proteins showed to be early indicators of breast
cancer; perturbation of these co-enzymes might cause cancers by
compromising the structure of important enzyme complexes
implicated in mitochondrial functions.
The genes of the third GO term regulation cyclin-dependent protein
kinase activity are key players in cell cycle regulation and inhibition
of such kinases proved to block proliferation of human breast cancer
cells
Jointly causal genes
Counterintuitively, the three GO terms in this category are related
to the immune system that is thought to be more an effect of the
tumor growth as lymphocytes strike cancer cells as they proliferate.
However, several findings support the idea that the immune system
might have a causal role in tumorigenesis.
There is strong evidence of interplay between immune system and
tumors since solid tumors are commonly infiltrated by immune cells;
in contrast to infiltration of cells responsible for chronic
inflammation, the presence of high numbers of lymphocytes,
especially T cells, has been reported to be an indicator of good
prognosis in many cancers what concours with the sign of the
enrichment.
Putative effects
The last group of GO terms are are related to cell-cycle and
proliferation.
In our previous research, we have shown that a quantitative
measurement of proliferation genes using mRNA gene expression
could provide an accurate assessment of prognosis of breast cancer
patients.
The enrichment of these proliferation-related genes seems to be a
downstream effect of the breast tumorigenesis instead of its cause.
Indistinguishable cases
mIMR shows that some causal patterns (e.g. open triplets or
unshielded colliders) can be discriminated by using notions
based on conditional independence.
These notions are exploited also by structural identification
approaches (e.g. PC algorithm in Bayesian networks) which
rely on notions of independence and conditional independence
to detect causal patterns in the data.
Unfortunately, these approaches cannot deal with
indistinguishable configurations like the two-variable setting
and the completely connected triplet configuration where it is
impossible to distinguish between cause and effects by means
of conditional or unconditional independence tests.
From dependency to causality
However indistinguishability does not prevent the existence of
statistical algorithms able to reduce the uncertainty about the
causal pattern even in indistinguishable configurations.
In recent years of a series of approaches appeared to deal with
the two variable setting like ANM and IGCI.
What is common to these approaches is that they use
alternative statistical features of the data to detect causal
patterns and reduce the uncertainty about their directionality.
A further important step in this direction has been represented
by the recent ChaLearn cause-effect pair challenge (YouTube
video "CauseEffectPairs" by I. Guyon).
ChaLearn cause-effect pair challenge
Hundreds of pairs of real variables with known causal
relationships from several domains (chemistry, climatology,
ecology, economy, engineering, epidemiology, genomics,
medicine).
Those were intermixed with controls (pairs of independent
variables and pairs of variables that are dependent but not
causally related) and semi-artificial cause-effect pairs (real
variables mixed in various ways to produce a given outcome).
The good rate of accuracy obtained by the competitors shows
that learning strategies can infer with success (or at least
significantly better than random) indistinguishable
configuration.
We took part to the ChaLearn challenge and we developed a
Dependency to Causality (D2C) learning approach for bivariate
settings which ranked 8th in the final leader board.
Is A cause of B?
... or B cause of A?
The D2C approach
Given two variables, the D2C approach infers from a number
of observed statistical features of the bivariate distribution
(e.g. the empirical estimation of the copula) or the n-variate
distribution (e.g. dependencies between members of the
Markov Blankets) the probability of the existence and then of
the directionality of the causal link between two variables.
The approach is an example of how the problem of causal
inference can be formulated as a supervised machine learning
approach where the inputs are features describing the
probabilistic dependency and the output is a class denoting the
existence (or not) of the causal link.
Once sufficient training data are made available, conventional
feature selection algorithms and classifiers can be used to
return a prediction.
The D2C approach: n > 2 variables
zi
zj
c
(1)
i c
(2)
i
e
(1)
i
c
(1)
j
c
(2)
j
e
(1)
j
s
(1)
i
s
(1)
j
Some asymmetrical relationships
By using d-separation we can write down a set of asymmetrical
relations between the members of the two Markov Blankets:
Unconditional Conditioning on the effect Conditioning on the cause
∀k c
(k)
i ⊥⊥ c
(k)
j ∀k c
(k)
i ⊥⊥ c
(k)
j |zj ∀k c
(k)
i ⊥⊥ c
(k)
j |zi
∀k e
(k)
i ⊥⊥ c
(k)
j ∀k e
(k)
i ⊥⊥ c
(k)
j |zj ∀k e
(k)
i ⊥⊥ c
(k)
j |zi
∀k c
(k)
i ⊥⊥ e
(k)
j ∀k c
(k)
i ⊥⊥ e
(k)
j |zj ∀k c
(k)
i ⊥⊥ e
(k)
j |zi
∀k zi ⊥⊥ c
(k)
j ∀k zi ⊥⊥ c
(k)
j |zj
∀k c
(k)
i ⊥⊥ zj ∀k c
(k)
i ⊥⊥ zj|zi
The algorithm
1 infers the Markov Blankets MBi = {m(ki ), ki = 1, . . . , Ki } and
MBj = {m(kj ), kj = 1, . . . , Kj } of zi and zj ,
2 computes the positions Pi (ki ) of m(ki ) of MBi in MBj and the
positions Pj (kj ) of m(kj ) in MBi .
3 computes
1 I = [I(zi ; zj ), I(zi ; zj |MBj  zi ), I(zi ; zj |MBi  zj)] where 
denotes the set difference operator,
2 Ii (ki ; kj) = I(m
(ki )
i ; m
(kj )
j |zi ) and Ij (ki ; kj) = I(m
(ki )
i ; m
(kj )
j |zj)
where ki = 1, . . . , Ki , kj = 1, . . . , Kj
4 creates a vector of descriptors
x = [Q( ˆPi ), Q( ˆPj ), I, Q(ˆIi ), Q(ˆIj ), N, n]
where ˆPi and ˆPj are the empirical distributions of Pi and Pj ,
ˆIi and ˆIj are the empirical distributions of Ii (ki , kj ) and
Ij (ki , kj ) (ki = 1, . . . , Ki , kj = 1, . . . , Kj ), and Q returns the
sample quantiles of a distribution.
The algorithm in words
Asymmetries between MBi and MBj induce an asymmetry on
( ˆPi , ˆPj ), and (ˆIi , ˆIj ) and the quantiles provide information
about the directionality of causal link (zi → zj or zj → zi .)
The distribution of these variables should return useful
information about which is the cause and the effect.
These distributions would more informative if we were able to
rank the terms of the Markov Blankets by prioritizing the
direct causes (i.e. the terms ci and cj ) since these terms play
a major role in the asymmetries.
The D2C algorithm can then be improved by choosing mIMR
to prioritize the direct causes in the MB set.
Training data generation
2
3
1
4
6
5
7
x11, x12, …, x1d
x21, x22, …, x2d
1
0
0
1
1-> 2
1-> 3
3-> 4
4-> 3
5-> 7
5-> 6
6-> 7
0
1
1
Descriptor vector Class
z11, z21, …, z1n
z21, z22, …, z2n
DAG 1
DAG 2
z11, z21, …, z1n
z21, z22, …, z2n
zN1, zN2, …, zNn
Simulation
Simulation
Experimental validation
Training set made of D = 6000 pairs xd , yd and is obtained
by generating 750 DAGs and storing for each of them the
descriptors associated to 4 positives examples (i.e. a pair
where the node zi is a direct cause of zj) and 4 negatives
examples (i.e. a pair where zi is not a direct cause of zj ).
Dependency between children and parents modelled by 3 types
of additive relationships (linear, quadratic, nonlinear)
A Random Forest classifier is trained on the balanced dataset
and assessed on the test set.
Test set made of 190 independent DAGs for the small
configuration and 90 for the large configuration. For each
DAG we select 4 positives examples (i.e. a pair where the node
zi is a direct cause of zj) and 6 negatives examples (i.e. a pair
where the node zi is not a direct cause of zj ).
Comparison with state-of-the-art approaches implemented by
the package bnlearn: ANM? DAGL, GS, IAMB, PC, HC,
Balanced error rate (BER) results: small networks (nodes
n ∈ [20, 30])
0.0
0.1
0.2
0.3
0.4
0.5
0.0
0.1
0.2
0.3
0.4
0.5
0.0
0.1
0.2
0.3
0.4
0.5
linearquadraticsigmoid
mediummediummedium
ANM DAGL1 DAGSearch DAGSearchSparse gs hc iamb mmhc si.hiton.pc tabu D2C400_lin D2C3000_lin D2C60000_lin D2C400 D2C3000 D2C60000
method
BER
Balanced error rate (BER) results: large networks (nodes
n ∈ [500, 1000])
0.0
0.1
0.2
0.3
0.4
0.5
0.0
0.1
0.2
0.3
0.4
0.5
0.0
0.1
0.2
0.3
0.4
0.5
linearquadraticsigmoid
largelargelarge
ANM DAGL1 DAGSearch DAGSearchSparse hc mmhc si.hiton.pc tabu D2C400_lin D2C3000_lin D2C60000_lin D2C400 D2C3000 D2C60000
method
BER
Conclusions
The scientific community is demanding of learning algorithms
able to detect in a fast and reliable manner subsets of
informative and causal features from observational data.
Pessimistic point of view: Correlation (or dependency) does
not imply causation.
Optimistic point of view: Causation implies correlation (or
dependency).
Causality leaves footprints on the patterns of stochastic
dependency which can be (hopefully) retrieved from
data.
This implies that inferring causes without designing
experiments is possible once we look for such constraints.
Quote from Scheines
Statisticians are often skeptical about causality but in almost every
case these same people (over beer later) confess their heresy by
concurring that their real ambitions are causal and their public
agnosticism is a prophylactic against the abuse of statistics by their
clients or less careful practitioners. (Scheines 2002)

More Related Content

Perspective of feature selection in bioinformatics

  • 1. Perspectives of feature selection in bioinformatics: from relevance to causal inference Gianluca Bontempi Machine Learning Group, Interuniversity Institute of Bioinformatics in Brussels (IB)2 Computer Science Department ULB, Université Libre de Bruxelles http://mlg.ulb.ac.be
  • 2. The long way from data to knowledge Which information can be extracted from data? 1 Descriptive statistics. 2 Parameters of a given model (model fitting, parameter estimation, least squares). 3 Best predictive model among a set of candidates (validation, assessment, bias/variance). 4 Most relevant features (multivariate statistics, regularization, search). 5 Causal information. This is also a good outline for a statistical machine learning course as well as my personal research journey.
  • 3. Causality in science A major goal of the scientific activity is to model real phenomena by studying the dependency between entities, objects or more in general variables. Sometimes the goal of the modelling activity is simply predicting future behaviours. Sometimes the goal is to understand the causes of a phenomenon (e.g. a disease). Understanding the causes of a phenomenon means understanding the mechanisms by which the observed variables take their values and predicting how the values of those variables would change if the mechanisms were subject to manipulations (what-if scenarios). Applications: understanding which actions to perform on a system to have a desired effect (eg. understanding the causes of tumor, the causes of the activation of a gene, the causes of different survival rates in a cohort of patients.)
  • 4. Causal knowledge Most of human knowledge is causal and concerns how things work in the world, about mechanisms, behaviors. This knowledge is causal in the sense it is about the mechanisms which bring from causes to effects. Mechanism: it is characterized by some inputs and outputs, the setting of inputs determines the outputs but not viceversa. Causal discovery aims to understand the mechanism by which variables came to take on the values they have and to predict what the values of those variables would be if the naturally occurring mechanisms were subject to manipulations. Intelligent behaviour should be related to the ability of inferring from observations cause and effect relationships
  • 5. Prediction by supervised learning ERROR PREDICTION MODEL STOCHASTIC DEPENDENCY INPUT PREDICTION OUTPUT DATA
  • 6. Relevance vs. causality The design of predictive models is one of the main contributions of machine learning. The design of a model able to predict the value of a target variable (e.g. phenotype, survival time) requires the definition of a set of input variables (e.g. genome expression, weight, age, smoking habits, nationality, frequency of vacations) which are relevant, in the sense that they provide information about the target. It is easy to observe that the features which are good predictors are not always the causes of the variable to be predicted. In other terms if causal variables are always relevant, the contrary is not necessarily true. Sometimes, effects appears to be better predictors than causes. Sometimes, good predictors do not have a direct causal link with the target.
  • 7. Relevance and causality: common cause pattern age reading capability height The height of a child provides information (i.e. is relevant) about his reading capability though it is not causing it. Your child will not read better by pulling his legs...and reading books doesn’t make him taller... The variable age is called confounding variable.
  • 8. Other examples Some examples of false correlation may serve to illustrate the difference between relevance and causality. In all these examples the input is informative about the output (i.e. relevant) though it is not a cause. Input: number of firemen intervening in an accident. Target: number of casualties. Input: amount of Cokes drunk per day by a person. Target: her sport performance. Input: sales of ice-cream in a country. Target: number of drowning deaths. Input: sleeping with shoes. Target: wake up with an headache. Input: chocolate consumption. Target: life expectancy. Input: expression of gene 1. Target: expression of coregulated gene 2.
  • 9. Large dimensionality and causality The problem of finding causes is still more difficult in large dimensional tasks (bioinformatics) where often the number of features (e.g. number of probes, variants) is very large with respect to the number of samples. Even when experimental interventions are possible, performing thousands of experiments to discover causal relationships between thousands of variables is often not practical. Dimensionality reduction techniques have been largely discussed in statistics and machine learning. However, most of the time they focused on improving prediction accuracy. Open issue: can these techniques be useful also for causal feature selection? Is prediction accuracy compatible with causal discovery?
  • 10. Feature selection: state of the art Filter methods: preprocessing methods which assess the merits of features from the data, ignoring the effects of the selected feature subset on the performance of the learning algorithm: ranking, PCA or clustering. Wrapper methods: assess subsets of variables according to their usefulness to a given predictor. Search for a good subset using the learning algorithm itself as part of the evaluation function: stepwise methods in linear regression. Embedded methods: variable selection as part of the learning procedure and are specific to learning machines: classification trees, random forests, and methods based on regularization techniques (e.g. lasso)
  • 11. Ranking: the simplest feature selection The most common feature selection strategies in bioinformatics is ranking where each variable is scored with the univariate association with the target returned by a measure of relevance, like mutual information, correlation, or p-value. Ranking is simple and fast but: 1 it cannot take into consideration higher-order interaction terms (e.g. complementarity) 2 it disregards redundancy between features 3 it does not distinguish between causes and effects. This is due to the fact that univariate correlation (or relevance) does not imply causation Causality is not addressed either in multivariate feature selection approaches since their cost function typically takes into consideration accuracy but disregards causal aspects.
  • 12. Causality vs. dependency in a stochastic setting A variable x is dependent on a variable y if the distribution of y is different from the marginal one when we observe the value x = x Prob {y|x = x} = Prob {y} Dependency is symmetric. If x is dependent of y, then y is dependent on x. Prob {x|y = y} = Prob {x} A variable x is a cause of a variable y if the distribution of y is different from the marginal one when we set the value x = x Prob {y|set(x = x)} = Prob {y} Causality is asymmetric: Prob {x|set(y = y)} = Prob {x}
  • 13. Main properties of causal relationships Given causes (inputs) x and effects (output) y stochastic dependency: changing x is likely to end up with a change in y, in probabilistic terms the effects y are dependent on the causes x asymmetry: changing y won’t modify (the distribution of ) x conditional independency: the effect y is independent of all the other variables (apart from its effects) given the direct causes x. In other words the direct causes screen off the indirect causes from the effects. Note the analogy with the notion of (Markov) state in (stochastic) dynamic systems. temporality: the variation of y does not occur before x. All this make Directed Acyclic Graphs a convenient formalism to represent causality.
  • 14. Graphical model (exc. from Guyon et al paper.) Coughing Allergy Smoking Anxiety Genetic factor (a) (d) (c) Hormonal factor Metastasis (b) Other cancers Lung cancer
  • 15. Causation and data Causation is much harder to measure than dependency (e.g. correlation or mutual information). Correlations can be estimated directly in a single uncontrolled observational study, while causal conclusions are stronger with controlled experiments. Data may be collected in experimental or observational setting. Manipulation of variable is possible only in the experimental setting. Two types of experimental configurations exist: randomised and controlled. These are the typical settings allowing causal discovery. Most statistical study are confronted with observational static settings. Notwithstanding, causal knowledge is more and more demanded by final users.
  • 16. Entropy and conditional entropy Consider a binary output class y ∈ {c1 = 0, c2 = 1} The entropy of y is H(y) = −p0 log p0 − p1 log p1 This quantity is greater equal than zero and measures the uncertainty of y Once introduced the conditional probabilities Prob {y = 1|x = x} = p1(x), Prob {y = 0|x} = p0(x) we can define the conditional entropy for a given x H[y|x] = −p0(x) log p0(x) − p1(x) log p1(x) which measures the lack of predictability of y given x.
  • 17. Information and dependency Let us use the formalism of information theory to quantify the dependency between variables. Given two continuous rvs x1 ∈ X1, x2 ∈ X2, the mutual information I(x1; x2) = H(x1) − H(x1|x2) measures stochastic dependence between x1 and x2. In the case of Gaussian distributed variables I(x1; x2) = − 1 2 log(1 − ρ2 ) where ρ is the Pearson correlation coefficient.
  • 18. Conditional information The conditional mutual information I(x1; x2|y) = H(x1|y) − H(x1|x2, y) quantifies how the dependence between two variables depends on the context. The conditional mutual information is null iff x1 and x2 are conditionally independent given y. This is the case of the example with x1 =reading, x2 =height and y =age. The information that a (set of) variable(s) brings about another is 1 conditional on the context (i.e. which other variables are known). 2 non monotone: it can increase or decrease according to the context.
  • 19. Interaction information The interaction information quantifies the amount of trivariate dependence that cannot be explained by bivariate information. I(x1; x2; y) = I(x1; y) − I(x1; y|x2). When it is different from zero, we say that x1, x2 and y three-interact. A non-zero interaction can be either negative, and in this case we say that there is a synergy or complementarity between the variables, or positive, and we say that there is redundancy. I(x1; x2; y) = I(x1; y) − I(x1; y|x2) = = I(x2; y) − I(x2; y|x1) = I(x1; x2) − I(x1; x2|y)
  • 20. Complementary configuration: negative interaction x1 x2 I(x1, y) = 0, I(x2, y) = 0, I(x1, y|x2) is maximal.
  • 21. Confounding configuration: positive interaction x1 x2 I(x1, x2) > 0, I(x1, x2|y) = 0
  • 22. Causal patterns: negative interaction y x2x1 x1 x2y (a) Common effect pattern, (b) spouse pattern I(x1, x2) = 0, I(x1, x2|y) > 0 I(x2, y) = 0, I(x2, y|x1) > 0 I(x1; x2; y) < 0 I(x1; x2; y) < 0
  • 23. Causal patterns: positive interaction y x2x1 x1 x2y c) common cause pattern d) brotherhood pattern I(x1, x2) > 0, I(x1, x2|y) = 0 I(x2, y) > 0, I(x2, y|x1) = 0 I(x1; x2; y) > 0 I(x1; x2; y) > 0 y x2x1 I(x1, x2) > 0, I(x1, x2|y) = 0 (e) causal chain pattern I(x1; x2; y) > 0
  • 24. Joint information and interaction Since I((x1, x2); y) = I(x2; y) + I(x1; y|x2) and I(x1; y|x2) = I(x1; y) − I(x1; x2; y) it follows that I((x1, x2); y) Joint information = I(x1; y) + I(x2; y) − I(x1; x2; y) = = I(x1; y) Relevance + I(x2; y) Relevance − [I(x1; x2) − I(x1; x2|y)] Interaction (1) Note that the above relationships hold also when either x1 or x2 are vectorial random variables.
  • 25. min-Interaction Max-Relevance (mIMR) filter Let X+ = {xi ∈ X : I(xi ; y) > 0} the subset of X containing all variables having non null mutual information (i.e. non null relevance) with y. The mIMR forward step is x∗ d+1 = arg max xk ∈X+−XS [I(xk; y) − λI(XS; xk ; y)] ≈ ≈ arg max xk ∈X+−XS I(xk ; y) − λ d xi ∈XS (I(xi ; xk ; y) where λ measures the amount of causation that we want to take into consideration. Note that λ = 0 boils down to the conventional ranking approach.
  • 26. Experimental setting Goal: identification of a causal signature of breast cancer survival. Two steps: 1 compare the generalization accuracy of conventional ranking with mIMR 2 interpret the causal signature Each experiment was conducted in a meta-analytical and cross-validation framework.
  • 27. Datasets 6 public microarray datasets (i.e., n = 13, 091 unique genes) derived from different breast cancer clinical studies. All the microarray studies are characterized by the collection of gene expression data and the survival data. In order to adopt a classification framework, the survival of the patients was transformed in a binary class such as low or high risk of the patients given their clinical outcome at five years
  • 28. Experiments Two sets of meta-analysis validation experiments: Holdout: 100 training-and-test repetitions. Leave-one-dataset-out where for each dataset the features used for classification are selected without considering the patients of the dataset itself. All the experiments were repeated for three sizes of the gene signature (number of selected features): v = 20, 50, 100. All the mutual information terms are computed by using the Gaussian approximation.
  • 29. Assessment The quality of the selection is represented by the accuracy of a Naive Bayes classifier measured by four different criteria to be maximized: 1 the Area Under the ROC curve (AUC), 2 1-RMSE where RMSE stands for Root Mean Squared Error 3 the SAR (Squared error, Accuracy, and ROC score) 4 the precision-recall F score measure.
  • 30. Holdout results v = 20 λ = 0 λ = 0.2 λ = 0.4 λ = 0.6 λ = 0.8 λ = 0.9 λ = 1 λ = 2 AUC 0.688 0.688 0.694 0.699 0.703 0.704 0.705 0.707 1-RMSE 0.460 0.466 0.481 0.493 0.504 0.510 0.515 0.542 SAR 0.559 0.561 0.569 0.575 0.580 0.583 0.585 0.595 F 0.255 0.254 0.260 0.262 0.265 0.265 0.266 0.274 W-L 1-0 3-0 5-0 6-0 5-0 5-0 5-0 v = 50 λ = 0 λ = 0.2 λ = 0.4 λ = 0.6 λ = 0.8 λ = 0.9 λ = 1 λ = 2 AUC 0.693 0.698 0.702 0.706 0.709 0.710 0.711 0.715 1-RMSE 0.451 0.458 0.465 0.471 0.477 0.479 0.482 0.503 SAR 0.552 0.556 0.562 0.567 0.571 0.572 0.574 0.583 F 0.263 0.265 0.268 0.270 0.272 0.271 0.273 0.277 W-L 2-0 3-0 3-0 2-0 2-0 3-0 4-0 v = 100 λ = 0 λ = 0.2 λ = 0.4 λ = 0.6 λ = 0.8 λ = 0.9 λ = 1 λ = 2 AUC 0.699 0.704 0.708 0.711 0.714 0.715 0.715 0.716 1-RMSE 0.454 0.457 0.459 0.463 0.467 0.470 0.472 0.487 SAR 0.545 0.549 0.553 0.557 0.561 0.563 0.564 0.573 F 0.272 0.271 0.272 0.274 0.274 0.274 0.275 0.284 W-L 1-0 1-0 1-0 2-0 3-0 4-1 4-1 W-L is the number of datasets for which the causal filter is significantly more (W) or less (L) accurate than the conventional ranking filter .
  • 31. Causal interpretation The introduction of a causality term leads to a prioritization of the genes according to their causal role. Since genes are not acting in isolation but rather in pathways, we analyzed the gene rankings in terms of gene set enrichment analysis (GSEA). By quantifying how the causal rank of genes diverges from the conventional one (λ = 0) with respect to λ we can identify the gene sets that are potential causes or effects of breast cancer.
  • 32. Causal characterization of genes Genes that remains among the top ranked ones for increasing λ can be considered as individually relevant (i.e. they contain predictive information about survival) and causal. Genes whose rank increases for increasing λ are putative causes: they have less individual relevance than other genes (for example, those being direct effects) but they are causal together with other. These genes would have been missed by conventional ranking (false negatives). Genes whose rank decreases for increasing λ are putative effects in the sense that they are individually relevant but probably not causal. This set of genes could be erroneously considered as causal (false positives ) by conventional ranking.
  • 33. GSEA analysis Normalized Enrichment Score −2 −1 0 1 2 Normalized Enrichment Score −2 −1 0 1 2 Normalized Enrichment Score −2 −1 0 1 2 Microtubule cytoskeleton organization and biogenesis Coenzyme metabolic process Regulation of cyclin dependent protein kinase activity A B C Cellular defense response Inflammatory response Defense response M phase of mitotic cycle DNA replicaton NES NES NES M phase 0 0.5 1 2 λ Larger the NES of a GO term, stronger the association of this gene set with survival; the sign of NES reflects the direction of association of the GO term with survival, a positive score meaning that over-expression of the genes implies worst survival and inversely.
  • 34. Individually causal genes The first group of GO terms are implicated in cell movement and division, cellular respiration and regulation of cell cycle. It was shown that this family of proteins may cause dysregulation of cell proliferation to promote tumor progression. The second GO term represents the co-enzyme metabolic process which includes proteins showed to be early indicators of breast cancer; perturbation of these co-enzymes might cause cancers by compromising the structure of important enzyme complexes implicated in mitochondrial functions. The genes of the third GO term regulation cyclin-dependent protein kinase activity are key players in cell cycle regulation and inhibition of such kinases proved to block proliferation of human breast cancer cells
  • 35. Jointly causal genes Counterintuitively, the three GO terms in this category are related to the immune system that is thought to be more an effect of the tumor growth as lymphocytes strike cancer cells as they proliferate. However, several findings support the idea that the immune system might have a causal role in tumorigenesis. There is strong evidence of interplay between immune system and tumors since solid tumors are commonly infiltrated by immune cells; in contrast to infiltration of cells responsible for chronic inflammation, the presence of high numbers of lymphocytes, especially T cells, has been reported to be an indicator of good prognosis in many cancers what concours with the sign of the enrichment.
  • 36. Putative effects The last group of GO terms are are related to cell-cycle and proliferation. In our previous research, we have shown that a quantitative measurement of proliferation genes using mRNA gene expression could provide an accurate assessment of prognosis of breast cancer patients. The enrichment of these proliferation-related genes seems to be a downstream effect of the breast tumorigenesis instead of its cause.
  • 37. Indistinguishable cases mIMR shows that some causal patterns (e.g. open triplets or unshielded colliders) can be discriminated by using notions based on conditional independence. These notions are exploited also by structural identification approaches (e.g. PC algorithm in Bayesian networks) which rely on notions of independence and conditional independence to detect causal patterns in the data. Unfortunately, these approaches cannot deal with indistinguishable configurations like the two-variable setting and the completely connected triplet configuration where it is impossible to distinguish between cause and effects by means of conditional or unconditional independence tests.
  • 38. From dependency to causality However indistinguishability does not prevent the existence of statistical algorithms able to reduce the uncertainty about the causal pattern even in indistinguishable configurations. In recent years of a series of approaches appeared to deal with the two variable setting like ANM and IGCI. What is common to these approaches is that they use alternative statistical features of the data to detect causal patterns and reduce the uncertainty about their directionality. A further important step in this direction has been represented by the recent ChaLearn cause-effect pair challenge (YouTube video "CauseEffectPairs" by I. Guyon).
  • 39. ChaLearn cause-effect pair challenge Hundreds of pairs of real variables with known causal relationships from several domains (chemistry, climatology, ecology, economy, engineering, epidemiology, genomics, medicine). Those were intermixed with controls (pairs of independent variables and pairs of variables that are dependent but not causally related) and semi-artificial cause-effect pairs (real variables mixed in various ways to produce a given outcome). The good rate of accuracy obtained by the competitors shows that learning strategies can infer with success (or at least significantly better than random) indistinguishable configuration. We took part to the ChaLearn challenge and we developed a Dependency to Causality (D2C) learning approach for bivariate settings which ranked 8th in the final leader board.
  • 40. Is A cause of B?
  • 41. ... or B cause of A?
  • 42. The D2C approach Given two variables, the D2C approach infers from a number of observed statistical features of the bivariate distribution (e.g. the empirical estimation of the copula) or the n-variate distribution (e.g. dependencies between members of the Markov Blankets) the probability of the existence and then of the directionality of the causal link between two variables. The approach is an example of how the problem of causal inference can be formulated as a supervised machine learning approach where the inputs are features describing the probabilistic dependency and the output is a class denoting the existence (or not) of the causal link. Once sufficient training data are made available, conventional feature selection algorithms and classifiers can be used to return a prediction.
  • 43. The D2C approach: n > 2 variables zi zj c (1) i c (2) i e (1) i c (1) j c (2) j e (1) j s (1) i s (1) j
  • 44. Some asymmetrical relationships By using d-separation we can write down a set of asymmetrical relations between the members of the two Markov Blankets: Unconditional Conditioning on the effect Conditioning on the cause ∀k c (k) i ⊥⊥ c (k) j ∀k c (k) i ⊥⊥ c (k) j |zj ∀k c (k) i ⊥⊥ c (k) j |zi ∀k e (k) i ⊥⊥ c (k) j ∀k e (k) i ⊥⊥ c (k) j |zj ∀k e (k) i ⊥⊥ c (k) j |zi ∀k c (k) i ⊥⊥ e (k) j ∀k c (k) i ⊥⊥ e (k) j |zj ∀k c (k) i ⊥⊥ e (k) j |zi ∀k zi ⊥⊥ c (k) j ∀k zi ⊥⊥ c (k) j |zj ∀k c (k) i ⊥⊥ zj ∀k c (k) i ⊥⊥ zj|zi
  • 45. The algorithm 1 infers the Markov Blankets MBi = {m(ki ), ki = 1, . . . , Ki } and MBj = {m(kj ), kj = 1, . . . , Kj } of zi and zj , 2 computes the positions Pi (ki ) of m(ki ) of MBi in MBj and the positions Pj (kj ) of m(kj ) in MBi . 3 computes 1 I = [I(zi ; zj ), I(zi ; zj |MBj zi ), I(zi ; zj |MBi zj)] where denotes the set difference operator, 2 Ii (ki ; kj) = I(m (ki ) i ; m (kj ) j |zi ) and Ij (ki ; kj) = I(m (ki ) i ; m (kj ) j |zj) where ki = 1, . . . , Ki , kj = 1, . . . , Kj 4 creates a vector of descriptors x = [Q( ˆPi ), Q( ˆPj ), I, Q(ˆIi ), Q(ˆIj ), N, n] where ˆPi and ˆPj are the empirical distributions of Pi and Pj , ˆIi and ˆIj are the empirical distributions of Ii (ki , kj ) and Ij (ki , kj ) (ki = 1, . . . , Ki , kj = 1, . . . , Kj ), and Q returns the sample quantiles of a distribution.
  • 46. The algorithm in words Asymmetries between MBi and MBj induce an asymmetry on ( ˆPi , ˆPj ), and (ˆIi , ˆIj ) and the quantiles provide information about the directionality of causal link (zi → zj or zj → zi .) The distribution of these variables should return useful information about which is the cause and the effect. These distributions would more informative if we were able to rank the terms of the Markov Blankets by prioritizing the direct causes (i.e. the terms ci and cj ) since these terms play a major role in the asymmetries. The D2C algorithm can then be improved by choosing mIMR to prioritize the direct causes in the MB set.
  • 47. Training data generation 2 3 1 4 6 5 7 x11, x12, …, x1d x21, x22, …, x2d 1 0 0 1 1-> 2 1-> 3 3-> 4 4-> 3 5-> 7 5-> 6 6-> 7 0 1 1 Descriptor vector Class z11, z21, …, z1n z21, z22, …, z2n DAG 1 DAG 2 z11, z21, …, z1n z21, z22, …, z2n zN1, zN2, …, zNn Simulation Simulation
  • 48. Experimental validation Training set made of D = 6000 pairs xd , yd and is obtained by generating 750 DAGs and storing for each of them the descriptors associated to 4 positives examples (i.e. a pair where the node zi is a direct cause of zj) and 4 negatives examples (i.e. a pair where zi is not a direct cause of zj ). Dependency between children and parents modelled by 3 types of additive relationships (linear, quadratic, nonlinear) A Random Forest classifier is trained on the balanced dataset and assessed on the test set. Test set made of 190 independent DAGs for the small configuration and 90 for the large configuration. For each DAG we select 4 positives examples (i.e. a pair where the node zi is a direct cause of zj) and 6 negatives examples (i.e. a pair where the node zi is not a direct cause of zj ). Comparison with state-of-the-art approaches implemented by the package bnlearn: ANM? DAGL, GS, IAMB, PC, HC,
  • 49. Balanced error rate (BER) results: small networks (nodes n ∈ [20, 30]) 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 linearquadraticsigmoid mediummediummedium ANM DAGL1 DAGSearch DAGSearchSparse gs hc iamb mmhc si.hiton.pc tabu D2C400_lin D2C3000_lin D2C60000_lin D2C400 D2C3000 D2C60000 method BER
  • 50. Balanced error rate (BER) results: large networks (nodes n ∈ [500, 1000]) 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 linearquadraticsigmoid largelargelarge ANM DAGL1 DAGSearch DAGSearchSparse hc mmhc si.hiton.pc tabu D2C400_lin D2C3000_lin D2C60000_lin D2C400 D2C3000 D2C60000 method BER
  • 51. Conclusions The scientific community is demanding of learning algorithms able to detect in a fast and reliable manner subsets of informative and causal features from observational data. Pessimistic point of view: Correlation (or dependency) does not imply causation. Optimistic point of view: Causation implies correlation (or dependency). Causality leaves footprints on the patterns of stochastic dependency which can be (hopefully) retrieved from data. This implies that inferring causes without designing experiments is possible once we look for such constraints.
  • 52. Quote from Scheines Statisticians are often skeptical about causality but in almost every case these same people (over beer later) confess their heresy by concurring that their real ambitions are causal and their public agnosticism is a prophylactic against the abuse of statistics by their clients or less careful practitioners. (Scheines 2002)