Introduction to Machine Learning
Isabelle Guyon
isabelle@clopinet.com
What is Machine Learning?
[Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; the trained machine maps a query to an answer.]
What for?
• Classification
• Time series prediction
• Regression
• Clustering
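A minimal Python sketch contrasting these task types, assuming scikit-learn is installed; the toy data and all names below are made up for illustration:

```python
# Illustrative sketch: the same inputs viewed as different learning tasks.
# Time series prediction can be cast as regression on past values.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

X = np.array([[0.1, 1.0], [0.4, 0.8], [0.9, 0.2], [1.2, 0.1]])  # 4 patterns, 2 features
y_class = np.array([-1, -1, +1, +1])    # classification: discrete labels
y_reg = np.array([0.3, 0.5, 1.1, 1.4])  # regression: continuous outcome

clf = LogisticRegression().fit(X, y_class)   # classification
reg = LinearRegression().fit(X, y_reg)       # regression
km = KMeans(n_clusters=2, n_init=10).fit(X)  # clustering: no labels needed

print(clf.predict([[0.5, 0.7]]), reg.predict([[0.5, 0.7]]), km.labels_)
```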
Some Learning Machines
• Linear models
• Kernel methods
• Neural networks
• Decision trees
Applications
[Chart: application domains placed on log-log axes of number of training examples (10 to 10^5) vs. number of inputs (10 to 10^5): Market Analysis, Ecology, Machine Vision, Text Categorization, OCR, HWR, System diagnosis, Bioinformatics.]
Banking / Telecom / Retail
• Identify:
– Prospective customers
– Dissatisfied customers
– Good customers
– Bad payers
• Obtain:
– More effective advertising
– Less credit risk
– Less fraud
– Decreased churn rate
Biomedical / Biometrics
• Medicine:
– Screening
– Diagnosis and prognosis
– Drug discovery
• Security:
– Face recognition
– Signature / fingerprint / iris verification
– DNA fingerprinting
Computer / Internet
• Computer interfaces:
– Troubleshooting wizards
– Handwriting and speech
– Brain waves
• Internet
– Hit ranking
– Spam filtering
– Text categorization
– Text translation
– Recommendation
Challenges
[Chart: the challenge datasets placed on the same log-log axes of number of training examples vs. number of inputs: Dexter, Nova, Gisette, Gina, Madelon, Arcene, Dorothea, Hiva.]
Ten Classification Tasks
[Histograms: distribution of test BER (%) across challenge entries for each of the ten datasets: Arcene, Ada, Dexter, Gina, Dorothea, Hiva, Gisette, Nova, Madelon, Sylva.]
Challenge Winning Methods
[Bar chart: normalized test error BER/&lt;BER&gt; of the winning methods, grouped by method family (Linear/Kernel, Neural Nets, Trees/RF, Naïve Bayes), for Gisette (HWR), Gina (HWR), Dexter (Text), Nova (Text), Madelon (Artificial), Arcene (Spectral), Dorothea (Pharma), Hiva (Pharma), Ada (Marketing), Sylva (Ecology).]
Conventions
X = {x_ij}: data matrix (m × n); y = {y_j}: target vector; x_i: one pattern (row of X); w: weight vector.
Learning problem
Data matrix: X
– m lines = patterns (data points, examples): samples, patients, documents, images, …
– n columns = features (attributes, input variables): genes, proteins, words, pixels, …
Unsupervised learning
Is there structure in the data?
Supervised learning
Predict an outcome y.
[Example data: colon cancer, Alon et al., 1999.]
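A minimal sketch of these conventions in code; the sizes and values are illustrative placeholders, not the actual data:

```python
# Conventions sketch: X is (m patterns) x (n features); y holds one outcome per pattern.
import numpy as np

m, n = 5, 3                      # 5 patterns, 3 features (illustrative sizes)
X = np.random.rand(m, n)         # data matrix: rows x_i are patterns, columns are features
y = np.sign(np.random.randn(m))  # supervised learning: an outcome y_j per pattern

x_i = X[0]           # one pattern (one row)
feature_j = X[:, 0]  # one feature (one column)
print(X.shape, y.shape, x_i.shape, feature_j.shape)
```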
Linear Models
• f(x) = w · x + b = Σ_{j=1..n} w_j x_j + b
Linearity is in the parameters, NOT in the input components:
• f(x) = w · Φ(x) + b = Σ_j w_j φ_j(x) + b (Perceptron)
[Diagram: biological neuron analogy: activations of other neurons arrive through synapses and dendrites with weights w_1 … w_n and bias b (constant input 1); the activation function's output leaves along the axon.]
[3-D plot: a separating hyperplane in the space of inputs (x1, x2, x3).]
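A hedged numpy sketch of the linear decision function above; the weights and bias are made-up values:

```python
# Linear model sketch: f(x) = w . x + b; sign(f(x)) gives the class,
# and f(x) = 0 is the separating hyperplane.
import numpy as np

w = np.array([0.5, -1.0, 0.25])  # illustrative weight vector
b = 0.1                          # illustrative bias

def f(x):
    return np.dot(w, x) + b

x = np.array([0.2, 0.4, -0.3])
print(f(x), np.sign(f(x)))  # score and predicted class in {-1, +1}
```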
Perceptron
Rosenblatt, 1957
[Diagram: inputs x_1 … x_n feed feature units φ_1(x) … φ_N(x); their outputs are combined with weights w_1 … w_N and bias b (constant input 1) to form f(x).]
f(x) = w · Φ(x) + b
Non-Linear Decision Boundary
[3-D scatter plot: a non-linear decision boundary separating the classes in the space of three gene-expression features (Hs.7780, Hs.234680, Hs.128749).]
Kernel Method
Potential functions, Aizerman et al 1964
[Diagram: the input x is compared to each training pattern by kernel units k(x_1, x) … k(x_m, x); their outputs are combined with weights α_1 … α_m and bias b.]
f(x) = Σ_i α_i k(x_i, x) + b
k(·, ·) is a similarity measure or “kernel”.
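A minimal sketch of this kernel expansion; a Gaussian kernel is used for concreteness, and the training patterns and coefficients α_i are made up:

```python
# Kernel method sketch: f(x) = sum_i alpha_i * k(x_i, x) + b
import numpy as np

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # patterns x_i (illustrative)
alpha = np.array([0.7, -0.4, 0.2])                        # coefficients (illustrative)
b = 0.05

def k(s, t, sigma=1.0):
    # Gaussian kernel: exp(-||s - t||^2 / (2 sigma^2))
    return np.exp(-np.sum((s - t) ** 2) / (2 * sigma ** 2))

def f(x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b

print(f(np.array([0.5, 0.5])))
```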
Hebb’s Rule
w_j ← w_j + y_i x_ij
[Diagram: the activation x_j of another neuron crosses a synapse onto a dendrite with weight w_j; the neuron's output y leaves along the axon.]
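A hedged sketch of Hebb's rule as a training loop over made-up data; this is essentially a single perceptron-style pass:

```python
# Hebb's rule sketch: w_j <- w_j + y_i * x_ij, applied over all patterns.
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])  # toy patterns
y = np.array([+1, +1, -1, -1])                                    # toy labels

w = np.zeros(X.shape[1])
for x_i, y_i in zip(X, y):  # one Hebbian pass over the data
    w += y_i * x_i          # strengthen weights where input and output co-fire

print(w, np.sign(X @ w))    # learned weights and resulting predictions
```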
Dual forms
• f(x) = w · Φ(x)
• w = Σ_i α_i Φ(x_i)
• f(x) = Σ_i α_i k(x_i, x)
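A quick numerical check of this equivalence, sketched with a linear kernel k(s, t) = s · t so that Φ is the identity; all numbers are illustrative:

```python
# Dual-form check: w = sum_i alpha_i * phi(x_i) implies
# f(x) = w . phi(x) = sum_i alpha_i * k(x_i, x) when k(s, t) = phi(s) . phi(t).
import numpy as np

X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])  # patterns x_i
alpha = np.array([0.3, -0.8, 0.5])                    # dual coefficients
x = np.array([0.7, 0.1])                              # query point

phi = lambda v: v                # identity feature map -> linear kernel
w = (alpha[:, None] * X).sum(0)  # primal weights: w = sum_i alpha_i phi(x_i)

primal = np.dot(w, phi(x))
dual = sum(a * np.dot(xi, x) for a, xi in zip(alpha, X))
print(np.isclose(primal, dual))  # True: the two forms agree
```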
What is a Kernel?
A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s) · Φ(t)
Examples:
• k(s, t) = exp(−‖s − t‖² / 2σ²) Gaussian kernel
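To see the "dot product in feature space" reading concretely, here is a hedged check for the polynomial kernel k(s, t) = (s · t)², whose feature map for 2-d inputs is Φ(x) = (x1², √2·x1·x2, x2²); this is a standard textbook identity, not taken from the slides:

```python
# Kernel = dot product in feature space: k(s, t) = (s . t)^2 equals Phi(s) . Phi(t)
# with Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) for 2-d inputs.
import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

s, t = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.isclose(np.dot(s, t) ** 2, np.dot(phi(s), phi(t))))  # True
```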
[Diagram: a neural network in which the inputs x_j feed internal “latent” variables, the “hidden units”.]
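A minimal sketch of such hidden units as learned features, i.e. a one-hidden-layer network computing f(x) = w · Φ(x) + b where Φ is itself parameterized; all weights here are random placeholders, not trained values:

```python
# Hidden-units sketch: the hidden layer h = tanh(W1 x + b1) plays the role of
# the feature map Phi(x); the output is a linear model on top of it.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)  # 2 inputs -> 4 hidden units
w2, b2 = rng.normal(size=4), 0.0               # hidden units -> scalar output

def f(x):
    h = np.tanh(W1 @ x + b1)  # internal "latent" variables
    return w2 @ h + b2

print(f(np.array([0.3, -0.7])))
```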
Chessboard Problem
Tree Classifiers
CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
[Diagram: a tree classifier recursively splits all the data, choosing feature f2 then f1; the (f1, f2) plane is partitioned into axis-aligned regions.]
At each step, choose the feature that “reduces entropy” most. Work towards “node purity”.
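A hedged scikit-learn sketch of such a tree, grown with the entropy criterion on Fisher's Iris data (the dataset of the next slide); the depth limit is an illustrative choice:

```python
# Decision tree sketch: split on the feature that reduces entropy most,
# working towards pure leaves. Assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)
tree.fit(iris.data, iris.target)

print(export_text(tree, feature_names=list(iris.feature_names)))
```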
Iris Data (Fisher, 1936)
Figure from Norbert Jankowski and Krzysztof Grabczewski
[Figure: the three Iris classes (setosa, versicolor, virginica) in the (x1, x2) plane, separated by a linear discriminant (left) and a tree classifier (right).]
Performance evaluation
[Plots: decision surfaces f(x) = 0, f(x) = −1, f(x) = +1 in the (x1, x2) plane, and the ROC axes: positive class success rate (hit rate, sensitivity) on the y-axis vs. 1 − negative class success rate (false alarm rate, 1 − specificity) on the x-axis, from 0 to 100%, with the random-guess diagonal.]
ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
[Plot: the ideal ROC curve (AUC = 1), an actual ROC curve, and the random-guess diagonal (AUC = 0.5); y-axis: positive class success rate (hit rate, sensitivity); x-axis: 1 − negative class success rate (false alarm rate, 1 − specificity); 0 ≤ AUC ≤ 1.]
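A hedged sketch of building the ROC curve by sweeping the threshold on f(x); the scores and labels are made up, and scikit-learn's roc_curve/auc would compute the same thing:

```python
# ROC sketch: each threshold on the scores f(x) gives one (FPR, TPR) point;
# the AUC is the area under the resulting curve.
import numpy as np

scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.2])  # f(x), illustrative
labels = np.array([+1, +1, -1, +1, -1, -1])         # true classes, illustrative

points = []
for thr in np.r_[np.inf, np.sort(scores)[::-1]]:  # sweep thresholds high -> low
    pred = scores >= thr
    tpr = pred[labels == +1].mean()               # hit rate (sensitivity)
    fpr = pred[labels == -1].mean()               # false alarm rate
    points.append((fpr, tpr))

fpr, tpr = np.array(points).T
auc = np.trapz(tpr, fpr)         # area under the ROC curve
print(auc, 2 * auc - 1)          # AUC, and Gini = 2 AUC - 1 (next slide)
```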
Lift Curve
[Plot: lift curve; x-axis: fraction of customers selected, from 0 to 100%.]
Gini = 2·AUC − 1, with 0 ≤ Gini ≤ 1
Performance Assessment
Predictions: F(x)
Cost matrix
[Table: cost matrix with columns Class −1, Class +1, Total, Class +1 / Total.]
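As a hedged companion to this slide, a minimal sketch of cross-tabulating predictions F(x) against the true classes; the counts and labels below are illustrative, not from the slides:

```python
# Confusion-matrix sketch: tabulate predictions F(x) against true classes.
import numpy as np

y_true = np.array([-1, -1, +1, +1, +1, -1])  # illustrative
y_pred = np.array([-1, +1, +1, +1, -1, -1])  # F(x), illustrative

tp = np.sum((y_pred == +1) & (y_true == +1))
fp = np.sum((y_pred == +1) & (y_true == -1))
fn = np.sum((y_pred == -1) & (y_true == +1))
tn = np.sum((y_pred == -1) & (y_true == -1))

print("          Class -1  Class +1  Total")
print(f"Pred -1      {tn}         {fn}       {tn + fn}")
print(f"Pred +1      {fp}         {tp}       {fp + tp}")
```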