ML2 Script v2
Contents

1 Lecture 15/04
  1.1 Goals
  1.2 Notation
  1.3 Linear Models for Classification
    1.3.1 LDA
    1.3.2 Directly learn the decision function
2 Lecture 17/04
3 Lecture 22/04
  3.1 Algorithms for Logistic Regression
  3.2 Why is SGD fast for large N?
4 Lecture 24/04
  4.1 Neural Networks
  4.2 NN architecture
5 Lecture 29/04
  5.1 Theoretical Capabilities of NN
  5.2 The Practice
  5.3 Backpropagation
  5.4 Loss functions depend on application
6 Lecture 06/05
  6.1 NN training algorithm
  6.2 Classical Tricks to make this work
7 Lecture 08/05
  7.1 RPROP
  7.2 Dropout [Srivastava & Hinton 2012]
  7.3 Piecewise linear activation functions
  7.4 MaxOut [Goodfellow et al. 2013]
  7.5 PReLU "parametric ReLU" [He et al. 2015]
8 Lecture 13/05
  8.1 Multi-class Classification
9 Lecture 20/05
  9.1 Coding Matrices for multi-class problems
  9.2 Gaussian Processes (or The Statistical Theory of Interpolation)
10 Lecture 22/05
  10.1 Gaussian Processes
11 Lecture 27/05
12 Lecture 29/05
  12.1 Uncertainty of GP interpolation
  12.2 Application [Snoek et al. 2012]: GP to optimize the hyperparameters of a learning algorithm
  12.3 Application: GP classification
13 Lecture 03/06
  13.1 GP classification
  13.2 The Bayesian Interpretation of GP regression (and their relation to "reproducing kernel Hilbert spaces" (RKHS))
14 Lecture 10/06
  14.1 Graphical Models
15 Lecture 12/06
  15.1 Bayesian Networks (directed graphical models)
16 Lecture 17/06
  16.1 Inference in BN
17 Lecture 19/06
  17.1 Temporal Models/Belief Networks
18 Lecture 24/06
  18.1 Markov Chains
  18.2 Hidden Markov Models (HMM)
19 Lecture 26/06
  19.1 Learning the parameters (= transition probabilities) of a HMM
20 Lecture 01/07
  20.1 Causality
21 Lecture 03/07
  21.1 Create BNs from data
22 Lecture 08/07
  22.1 Detecting conditional independence by statistical tests
1 Lecture 15/04
1.1 Goals
• Find a function Y = f(X), where X ∈ R^D:
  – Y ∈ R or R^M: Regression,
  – Y ∈ {1, ..., C}: Classification,
  and learn the desired function f from training data:
  – {(X_i, Y_i)}_{i=1}^N: Supervised (correct answer known),
  – {X_i}_{i=1}^N: Unsupervised (must infer interesting categories).
• generalization vs. overfitting: Loss on independent data (test set) may be much
bigger than loss on training data.
1.2 Notation

• feature matrix X of dimension N × D (with N the instance count and D the feature count)

p(Y|X) = p(X|Y) p(Y) / p(X),   where p(X) = Σ_k p(X|Y = k) p(Y = k)

– Learn p(Y = k) = N_k / N with N_k: number of class-k instances in the training data.

8 Independent, identically distributed
9 If this assumption is violated, probabilistic graphical models are a possibility.
10 Special case C = 2: p(Y = 1|X; θ) = 1 − p(Y = 0|X; θ)
11 for C = 2:
1.3 Linear Models for Classification
– Learn data likelihood per class ∀k : p(X |Y = k; θk ). How are the instances
of class k distributed in feature space (= density estimation problem)?
This approach implies 2. via Bayes’ rule and 1. via “winner takes all”.
• 1. and 2. are called discriminative models.
1.3.1 LDA
• Assume that p(X|Y = k; θ_k) is a Gaussian distribution for all k with a single shared covariance matrix (otherwise: QDA, see ML1).
Figure 1.1: Two examples for linear decision boundaries in 2D for the 2 class case of LDA.
Total mean: µ = (1/N) Σ_i X_i
β = Σ^{-1} (µ_1^T − µ_{-1}^T),   b = −µβ.

⇒ The decision rule is defined by:

arg max_k p(Y = k|X)  ⇔  Ŷ = f(X) = { +1 if Xβ + b > 0;  −1 if Xβ + b < 0 }

where Xβ + b = 0 is called the decision boundary.
• This is a very good model if Gaussian assumption holds (approximately).
Loss(X_i β, Y_i) = { −Y_i X_i β if Y_i X_i β < 0;  0 if Y_i X_i β ≥ 0 } = (−Y_i X_i β)_+

Loss(β) = Σ_{i: Y_i ≠ Ŷ_i} (−Y_i X_i β) = Σ_i (−Y_i X_i β)_+

– Perceptron algorithm: gradient descent on the loss function

∂Loss/∂β = Σ_{i: Y_i ≠ Ŷ_i} (−Y_i X_i)
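A minimal sketch of this update in Python (numpy), assuming labels Y_i ∈ {−1, +1} and a bias feature already appended to X; the function name and stopping rule are illustrative:

import numpy as np

def perceptron(X, Y, tau=0.1, epochs=100):
    """Gradient descent on the perceptron loss sum_i (-Y_i X_i beta)_+."""
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        wrong = (Y * (X @ beta)) <= 0          # misclassified instances
        if not wrong.any():                    # separable case: done
            break
        beta += tau * (Y[wrong, None] * X[wrong]).sum(axis=0)  # -dLoss/dbeta
    return beta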
2 Lecture 17/04
• example generative model: LDA
Figure 2.1: Depiction of the perceptron loss max(0, −Y_i X_i β) and the hinge loss max(0, 1 − Y_i X_i β), as well as the squared hinge loss max(0, 1 − Y_i X_i β)² for comparison.
∗ The maximum-margin plane is found by minimizing ||β||² = β^T β under the hinge loss if the data are separable.
∗ For non-separable data: minimization of the loss and of ||β||² cannot be achieved simultaneously. ⇒ control the trade-off by a regularization parameter λ
β̂ = arg min_β  β^T β / 2 + (λ/N) Σ_i max(0, 1 − Y_i X_i β)
many algorithms:
· standard solvers for quadratic programming
· primal space algorithm: (stochastic) gradient descent, e.g. Pegasos
· dual space algorithms: sequential minimal optimization (SMO, LIBSVM); dual coordinate ascent (LIBLINEAR)
∗ Advantages:
· good practical performance, relatively easy training (choose λ by cross-validation)
· the dual formulation can be kernelized (non-linear classifier)
∗ Disadvantage: the confidence is |X_i β|, but this cannot be interpreted as a probability of being correct
β̂ = arg min_β  β^T β / 2 + (λ/N) Σ_i log(1 + exp(−Y_i X_i β))
• Many algorithms can solve this (see Minka 2003/2007, Bottou 2007), but since there
is no closed-form solution, iterative algorithms are needed.
1 use 1 − σ (z) = σ (−z)
1. gradient descent-type algorithms: work well because the objective is convex

β^{(t+1)} = β^{(t)} + τ ( (λ/N) Σ_i (1 − σ(Y_i X_i β^{(t)})) Y_i X_i^T − β^{(t)} )

where the factor (1 − σ(Y_i X_i β^{(t)})) is ≈ 0 if instance i is classified correctly and ≈ 1 if it is misclassified.

stochastic variant with averaged gradient:

β^{(t+1)} = β^{(t)} + τ_t g^{(t+1)}

here τ_t = τ_0 / (1 + t)^{3/4}, i.e. the learning rate decays slowly; g is initialized to zero.
∗ What does the averaging mean?

g^{(t+1)} = Σ_{t'=0}^{T_0} w_{t'} · f(t − t'),   with weights w_{t'} = µ^{t'} = exp(−t'/η)
β̄^{(t+1)} = (1 − µ) β̄^{(t)} + µ β^{(t+1)}

g_i^{(t+1)} = { g_i^{(t)} if i ≠ i_0;   λ (1 − σ(Y_{i_0} X_{i_0} β^{(t)})) Y_{i_0} X_{i_0} − β^{(t)} if i = i_0 }
3 Lecture 22/04
Newton's method: expand f around the current iterate a^{(t)}:

f(a^{(t)} + ∆a) ≈ f(a^{(t)}) + f'(a^{(t)}) ∆a + (f''(a^{(t)})/2) ∆a²  → min

∂/∂(∆a):  f'(a^{(t)}) + f''(a^{(t)}) ∆a =! 0

⇒ ∆a = −f'(a^{(t)}) / f''(a^{(t)})

if a is a vector: ∆a = −H^{-1}|_{a^{(t)}} ∇f|_{a^{(t)}} ⇒ update: a^{(t+1)} = a^{(t)} + ∆a; we need the gradient and the Hessian:
objective: min_β  β^T β / 2 + (λ/N) Σ_i log(1 + exp(−Y_i X_i β))

gradient: β + (λ/N) Σ_i (1 − σ(Y_i X_i β)) (−Y_i X_i^T),   note 1 − σ(Y_i X_i β) = σ(−Y_i X_i β)

Hessian:
∂²Loss/∂β² = I − (λ/N) Σ_i (−σ'(Y_i X_i β)) Y_i² X_i^T X_i,   with σ'(t) = σ(t)(1 − σ(t)) and Y_i² = 1
           = I + (λ/N) Σ_i σ(X_i β)(1 − σ(X_i β)) X_i^T X_i = I + X^T W X

where W = (λ/N) diag(σ(X_i β)(1 − σ(X_i β))) is an N × N matrix
• simplify the gradient using W:

(λ/N) (1 − σ(Y_i X_i β)) Y_i X_i^T = (λ/N) σ(X_i β)(1 − σ(X_i β)) (Y_i / σ(Y_i X_i β)) X_i^T,  summed over i:  X^T W Ỹ

1 Ỹ_i = Y_i / σ(Y_i X_i β) is N × 1
β^{(t+1)} = β^{(t)} + (I + X^T W^{(t)} X)^{-1} (X^T W^{(t)} Ỹ^{(t)} − β^{(t)})
          = (I + X^T W^{(t)} X)^{-1} ( (I + X^T W^{(t)} X) β^{(t)} + X^T W^{(t)} Ỹ^{(t)} − β^{(t)} )
          = (I + X^T W^{(t)} X)^{-1} X^T W^{(t)} (X β^{(t)} + Ỹ^{(t)})
          = (I + X^T W^{(t)} X)^{-1} X^T W^{(t)} Z^{(t)}
where Z^{(t)} = X β^{(t)} + Ỹ^{(t)}
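A compact sketch of this Newton update in Python (numpy); the regularization convention (λ/N, and the identity matrix from the β^T β / 2 term) follows the formulas above, everything else (names, iteration count) is illustrative:

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_logreg(X, Y, lam=1.0, iters=20):
    """Newton / reweighted least-squares updates for regularized logistic regression."""
    N, D = X.shape
    beta = np.zeros(D)
    for _ in range(iters):
        p = sigma(X @ beta)                       # sigma(X_i beta)
        W = (lam / N) * p * (1 - p)               # diagonal of W, length N
        Ytil = Y / sigma(Y * (X @ beta))          # Ytilde_i = Y_i / sigma(Y_i X_i beta)
        H = np.eye(D) + X.T @ (W[:, None] * X)    # I + X^T W X
        Z = X @ beta + Ytil                       # Z = X beta + Ytilde
        beta = np.linalg.solve(H, X.T @ (W * Z))  # (I + X^T W X)^{-1} X^T W Z
    return beta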
• even faster: use a fast approximation of the Hessian ⇒ "quasi-Newton", e.g. the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm
• in the primal space, we approach the optimum from above: all β^{(t)} are upper bounds in the sense Loss(β*) ≤ Loss(β^{(t)})
• In difficult optimization problems, one often brackets the (unknown) global optimum
between a primal upper bound and a dual lower bound. The difference between the
bounds is the “duality gap”.
• If the dual bound is tight, the duality gap is zero, and primal and dual solutions
agree, e.g. LR.
3.1 Algorithms for Logistic Regression
Figure 3.1: Plot showing the logistic loss function in comparison to the above introduced
hinge loss.
⇒ Lagrangian(β, α) = β^T β / 2 + (λ/N) Σ_i ( −α_i Y_i X_i β − α_i log α_i − (1 − α_i) log(1 − α_i) ),
i.e. replace the loss with its lower bound, s.t. α_i ∈ [0, 1]²

∂Lagrangian/∂β = β + (λ/N) Σ_i (−α_i Y_i X_i^T) =! 0
⇒ β = (λ/N) Σ_i α_i Y_i X_i^T

DualLoss(α) = −(λ²/2N²) Σ_{i,i'} α_i α_{i'} Y_i Y_{i'} X_i X_{i'}^T − (λ/N) Σ_i ( α_i log α_i + (1 − α_i) log(1 − α_i) )

2 with 0 log 0 := 0
∂Dual/∂α_i = −(λ²/N²) Y_i X_i Σ_{i'} (α_{i'} Y_{i'} X_{i'}) − (λ/N) ( log α_i + 1 − log(1 − α_i) − 1 )
           = −(λ²/N²) Y_i X_i Σ_{i'} (α_{i'} Y_{i'} X_{i'}) − (λ/N) log( α_i / (1 − α_i) )  =! 0

∂²Dual/∂α_i² = −(λ²/N²) Y_i² X_i X_i^T − (λ/N) ( 1/α_i + 1/(1 − α_i) )
             = −(λ²/N²) X_i X_i^T − (λ/N) · 1/(α_i (1 − α_i))
∗ clip α_i at [0, 1]
– LIBLINEAR implements a slightly improved (numerically more stable) version
– seems to be the fastest algorithm for large N
• Both errors should decrease at about the same rate; otherwise our effort in minimizing one of them is wasted.
3.2 Why is SGD fast for large N?

• numerical analysis: ε_est ∈ O( (log N)/N ) in the best case, O( √((log N)/N) ) in the worst case

ε ~ ε_est ~ ε_opt ~ (log N)/N  or  √((log N)/N)

N ~ (1/ε) log N
log N ~ log(1/ε) + log log N,   where log log N ≈ 0
N ~ (1/ε) log(1/ε)   (best case)
N ~ (1/ε²) log(1/ε)  (worst case)
Algorithm  | Time per step | Steps to accuracy ε_opt | Time to accuracy ε_opt  | Time to total accuracy ε
SGD + Dual | O(D)          | O(1/ε_opt)              | O(D/ε_opt)              | O(D/ε)
GD         | O(ND)         | O(log(1/ε_opt))         | O(DN log(1/ε_opt))      | O(D (1/ε²)(log(1/ε))²)
Newton     | O(D²N)        | O(log log(1/ε_opt))     | O(D²N log log(1/ε_opt)) | O((D²/ε²) log(1/ε) log log(1/ε))
4 Lecture 24/04
4.1 Neural Networks

• neural networks (NN) combine the ideas in 2.: connect linear classifiers in parallel ("layers") and layers in series ("multi-layer", or "deep" if ≥ 4 layers)
• history:
– 1940s/50s: first neuron models (Hebb, McCulloch/Pitts) and the idea of multi-layered architectures, inspired by brain research and meant to explain the brain
– 1958: perceptron and multi-layer perceptron (Rosenblatt): first working training algorithm (gradient descent on centered hinge loss), but no good algorithm for multi-layer training
– 1969: book by Minsky & Papert: proved limitations of the single-layer perceptron (cannot solve the XOR problem) and conjectured (falsely) that multi-layered architectures are not much better.
⇒ first death
– 1986: Rumelhart & Hinton: popularized backpropagation training for multi-layer NN; first practical training algorithm for multi-layer NN ⇒ first rebirth
– ... 1995:
4.2 NN architecture
• each neuron has arbitrarily many inputs and a single output
• originally: a neuron computes a weighted sum of its inputs and "fires" if a threshold is exceeded (threshold activation function), inspired by the brain

Z_i = φ(X_i β + β_0)

where Z_i is the response, φ the activation function, X_i the features, β the weights and β_0 the bias
[Figure: feed-forward network mapping four inputs (Input #1 to #4) to a single output.]
• “network architecture” = how many neurons and how to combine them (must be
fixed by network designer)
• in a “feed forward network” all connections are directed from input → output ⇒ NN
is a DAG (directed acyclic graph)
• notation:
number of layers: L;
layer index: l = 0, 1, ..., L;
number of inputs/features: D, j = 0, ..., D;
number of outputs: M, m = 1, ..., M;
number of hidden neurons in layer l: H_l; if there is only 1 hidden layer: H, h = 0, ..., H;
B: 3-dimensional array of weights;
B_l: matrix of weights between layer (l − 1) and l;
B_lh: column vector of input weights of neuron h in layer l;
B_lhj: single weight from neuron j in layer (l − 1) to neuron h in layer l;
output (row) vector of all neurons in layer l: Z_l;
output of neuron h in layer l: Z_lh;
φ_l: activation function in layer l (all neurons of the layer use the same one)
Ŷ_i = Z_21 = φ_2 ( Σ_{h=0}^{H_1} B_{21h} · φ_1 ( Σ_{j=0}^{D} B_{1hj} X_{ij} ) )
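A minimal numpy sketch of this two-layer forward pass; the bias handling (prepending a constant 1 to each layer input, matching the h = 0 / j = 0 convention) and the choice φ_1 = tanh, φ_2 = identity are illustrative assumptions:

import numpy as np

def forward(Xi, B1, B2, phi1=np.tanh, phi2=lambda t: t):
    """Forward pass of a one-hidden-layer network:
    Yhat = phi2( B2 . phi1( B1 . [1, Xi] ) ), biases as 0-th components."""
    Z0 = np.concatenate(([1.0], Xi))   # input layer with bias component
    Z1 = phi1(B1 @ Z0)                 # hidden layer, shape (H,)
    Z1 = np.concatenate(([1.0], Z1))   # prepend bias for the next layer
    return phi2(B2 @ Z1)               # output layer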
5 Lecture 29/04
• prediction:
input layer: Z_0 ∈ [(D + 1) × 1]
hidden layers: Z̃_l = B_l Z_{l−1} ∈ [H_l × 1] = [H_l × H_{l−1}][H_{l−1} × 1]  (layer (l − 1) to l)
               Z_l = φ_l(Z̃_l)  pointwise, [H_l × 1]
output layer L depends on the application:
  regression: linear activation, Ŷ_i = Z̃_L^T (= Z_L^T) ∈ [1 × M]
  decision rule: ŷ = arg max_k Z̃_{Lk}, k ∈ 1, ..., C (number of classes)
  2-class posterior: p(Ŷ_i = 1|X_i) = Z_L = σ(Z̃_L), scalar output
  multi-class posterior (C ≥ 2): ∀k ∈ {1, ..., C}: p(Ŷ_i = k|X_i) = Z_{Lk} = e^{Z̃_{Lk}} / Σ_{k'} e^{Z̃_{Lk'}} ⁴
5.3 Backpropagation
fancy name for gradient descent training of the weights
3 The problem with this theorem is that it only gives a statement of existence, with no indication of how to construct f̂.
4 known as the "soft-max function", a generalization of the sigmoid function
∂Z_{l+1}/∂Z_l = (∂Z̃_{l+1}/∂Z_l) φ'_{l+1}(Z̃_{l+1}) = B_{l+1}^T φ'_{l+1}(Z̃_{l+1})

• backpropagation algorithm:
– init: δ_L = ∂Loss/∂Z_L,   δ̃_L = δ_L φ'_L(Z̃_L)
– for l = L, ..., 1:
    δ_{l−1} = B_l^T δ̃_l
    δ̃_{l−1} = δ_{l−1} φ'_{l−1}(Z̃_{l−1})
for the 2-class cross-entropy loss:

δ_L = ∂Loss/∂Z_L = { −1/Z_L if Y = 1;  1/(1 − Z_L) if Y = 0 },   ∂Z_L/∂Z̃_L = Z_L (1 − Z_L)

δ̃_L = δ_L ∂Z_L/∂Z̃_L = { Z_L − 1 if Y = 1;  Z_L if Y = 0 }
• regularization: RLoss = Loss + Regularizer. Popular regularizers: L2, L1
6 Lecture 06/05
6.1 NN training algorithm

Z_0 = [1 X_i^T]^T
for l = 1, ..., L: compute Z̃_l = B_l Z_{l−1}, Z_l = φ_l(Z̃_l)
– weight update: gradient step on the B_l using the backpropagated δ̃_l (a sketch follows below)
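A minimal end-to-end sketch (one hidden layer, squared loss, plain gradient descent) combining the forward pass and the backpropagation rules from the previous lecture; all names and the loss choice are illustrative assumptions:

import numpy as np

def train_step(B1, B2, Xi, Yi, tau=0.01):
    """One GD step for a 1-hidden-layer regression net with squared loss."""
    # forward pass
    Z0 = np.concatenate(([1.0], Xi))
    Zt1 = B1 @ Z0                       # Ztilde_1
    Z1 = np.concatenate(([1.0], np.tanh(Zt1)))
    Yhat = B2 @ Z1                      # linear output layer
    # backward pass
    delta2 = Yhat - Yi                  # dLoss/dZtilde_2 for squared loss
    delta1 = (B2.T @ delta2)[1:]        # back through B2, drop bias component
    dtilde1 = delta1 * (1 - np.tanh(Zt1) ** 2)   # times phi'(Ztilde_1)
    # weight update: dLoss/dB_l = outer(dtilde_l, Z_{l-1})
    B2 -= tau * np.outer(delta2, Z1)
    B1 -= tau * np.outer(dtilde1, Z0)
    return B1, B2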
• weight initialization: init such that neurons are not saturated at the beginning
– standardize features (zero mean, unit variance)
– assume that the neuron activations are zero mean and unit variance1
– initialize weights with zero mean and variance s 2
– input properties of next layer neuron
– for gradient training to work, E(Blh Zl−1 ) ± std(Blh Zl−1 ) should not be in the
saturated region.
⇒ √(H_{l−1} + 1) · s ≤ 1 (or ≤ 2), solve for s: s = 1/√(H_{l−1} + 1)
⇒ init B_lh ~ N(0, (H_{l−1} + 1)^{-1}) or ~ U( −√(3/(H_{l−1} + 1)), √(3/(H_{l−1} + 1)) )
• optimization algorithms
– plain gradient descent (“batch training”)
– stochastic gradient descent (“online training”)
– mini-batch SGD
⇒ need to adjust learning rate and momentum (→ later)
– methods that automatically adjust the step size
– usual suspects:
∗ Newton, quasi Newton (BFGS), conjugate gradient with line search
∗ RPROP:
1 only assume this for the weight initialization... in general it is wrong
6.2 Classical Tricks to make this work
· idea: adjust the log of the learning rate by gradient descent: log(τ_t) = log(τ_{t−1}) + ∆_t
⇒ multiplicative update on τ; after some math and simplifications:

η_lhj^{(t)} = { η⁺ (= 1.25) if ∇B_lhj^{(t)} ∇B_lhj^{(t−1)} > 0;  η⁻ (= 0.7) else }

τ_{t,lhj} = min( τ_max, max( τ_min, τ_{t−1,lhj} η_lhj^{(t)} ) )

B_lhj^{(t)} = B_lhj^{(t−1)} − τ_{t,lhj} sign(∇B_lhj^{(t)})
• termination criterion: training easily overfits ⇒ keep a separate validation set and
monitor the validation error
⇒ stop when validation error starts to go up
• learning rate and momentum for GD and SGD [Wilson & Martinez 2003]
– we want ∆B^{(t)} = −τ_t E[∂Loss/∂B]_{t−1}; finite-data estimate:

E[∂Loss/∂B] = (1/N_B) Σ_{i ∈ Batch} ∂Loss_i/∂B |_{t−1}
– for full GD: Batch = training set, N_B = N ⇒ accurate estimate of E[grad] at cost O(N)
– SGD: Batch = single instance, N_B = 1 ⇒ inaccurate estimate at cost O(1)
– minibatch is between these extremes
– rule of thumb: the more accurate E[grad], the bigger we can choose τ: τ_GD ≈ √N τ_SGD ⇒ the time for equal progress with GD is √N times longer than with SGD because of the O(N) cost per step.
– learning rate schedules:
∗ keep the learning rate constant
∗ divide τ → τ/10 when learning stalls (2×)
∗ τ_t = τ_0 / (1 + t/t_0)
7 Lecture 08/05
7.1 RPROP
• normal update of a single weight: B_lhj^{(t)} = B_lhj^{(t−1)} − τ ∆B_lhj^{(t)}
• give each weight an individual learning rate τ_lhj^{(t)} and train it via GD on the log: log τ_lhj^{(t)} = log τ_lhj^{(t−1)} + ∆(log τ)_lhj^{(t)} ⇒ multiplicative update in τ itself
⇒ update rule: B_lhj^{(t)} = B_lhj^{(t−1)} − τ_lhj^{(t)} ∆B_lhj^{(t)}
• after some math and approximations, we find: ∆B_lhj^{(t)} = sign( ∂Loss/∂B_lhj |_t )
⇒ the gradient w.r.t. B only determines the step direction, not the length
• the step length is completely absorbed into τ_lhj^{(t)}:

τ_lhj^{(t)} = max( τ_min, min( τ_max, τ_lhj^{(t−1)} η_lhj^{(t)} ) )

η_lhj^{(t)} = { η⁺ (= 1.25) if ∇B_lhj^{(t)} ∇B_lhj^{(t−1)} > 0;  η⁻ (= 0.7) else }
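A sketch of this per-weight update in Python (numpy), vectorized over a whole weight array; the constants follow the lecture, the state handling and names are illustrative:

import numpy as np

def rprop_step(B, grad, prev_grad, tau, eta_plus=1.25, eta_minus=0.7,
               tau_min=1e-6, tau_max=1.0):
    """One RPROP step: per-weight multiplicative learning-rate adaptation,
    update direction from sign(grad) only."""
    agree = grad * prev_grad                    # sign agreement with last step
    eta = np.where(agree > 0, eta_plus, eta_minus)
    tau = np.clip(tau * eta, tau_min, tau_max)  # tau <- clip(tau * eta)
    B = B - tau * np.sign(grad)                 # step length lives entirely in tau
    return B, tau, grad                         # grad becomes prev_grad next time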
7.2 Dropout [Srivastava & Hinton 2012]

[Figure 7.1: Depiction of a deep neural network demonstrating the influence of applying dropout with p = 0.5; crossed-out neurons are dropped.]
• at the end of training: downscale all weights B → p · B and use all neurons for prediction.
– this is an approximation to the statistical interpretation of dropout:
– there are 2^H possible subnetworks ⇒ dropout trains a (small) random fraction of these
– all subnetworks share O(H²) weights
– at prediction time: sample M of the 2^H possible subnetworks and return the average of their predictions.
– but: this is expensive ⇒ approximate the average of the predictions by the prediction of the average network ⇒ a single prediction with the average network instead of M predictions from subnetworks
– if all activations φ_l(t) were linear, the average network would just be B → p · B
– the inventors showed experimentally that this also works for nonlinear activations
– equivalent alternative (easier to implement): upscale all active weights during training by B_active → (1/p) B_active (a sketch follows below)
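A sketch of the "upscale during training" variant (inverted dropout) for one layer's activations; the mask convention and names are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def dropout(Z, p=0.5, train=True):
    """Inverted dropout: keep each neuron with probability p and rescale by 1/p,
    so no weight rescaling is needed at prediction time."""
    if not train:
        return Z                      # prediction: use all neurons unchanged
    mask = rng.random(Z.shape) < p    # active neurons
    return Z * mask / p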
• practical recommendations:
– the learning rate must be increased by a factor of 10...100, and we need more iterations
– the learning rate must decrease over time: τ_t = τ_0 / (1 + t/t_0)
– use max-norm regularization: B_lh → B_lh · min(C, ||B_lh||_2) / ||B_lh||_2 with C ≈ 3..4
• theory:
– observation:
∗ since neurons cannot rely on the presence of any input neuron during
training, subtle co-adaptation effects (huge weights that cancel each other)
cannot occur ⇒ strong regularization
7.3 Piecewise linear activation functions
• ReLU (“rectified linear unit”) [Nair & Hinton, 2010 / Glorot et al. 2011]
ReLU(t ) = max(0,t )
Figure 7.2: Rectified Linear Unit (ReLU) activation function, which is zero when t < 0 and
then linear with slope 1 when t > 0.
• any continuous function can be expressed as the difference of two convex functions¹
⇒ we can approximate any function arbitrarily well by a piecewise linear function, taking the difference of two max(linear functions)
7.4 MaxOut [Goodfellow et al. 2013]

l = 1, 3, 5, ...:  Z_l = Z̃_l
l = 2, 4, 6, ...:  Z_lh = max_{S_lh ⊂ [0, H_{l−1}]} [Z_{l−1}]

where S_lh is a subset of the neurons in layer l − 1 (typically: each neuron Z_lh uses k neurons of Z_{l−1} without sharing, H_{l−1} = k H_l)
• k ∈ [2, ..., 20] is the number of linear segments after each maxout neuron.
• main application: convolutional neural networks, "max pooling": reduce a √k × √k window to a single pixel by taking the maximum.
7.5 PReLU “parametric ReLU” [He et al 2015]
[Figure 7.3: Graphical depiction of the max pooling function used in convolutional neural networks: each 2 × 2 window of the input array is reduced to a single pixel by taking the maximum.]
Figure 7.4: Leaky ReLUs or PReLUs are one attempt to fix the “dying ReLU” problem.
Instead of the function being zero when t < 0, a leaky ReLU or PReLU will
instead have a small negative slope a. For PReLUs the value of a is made into a
parameter which is adaptively learned during training.
8 Lecture 13/05
• distinguish from multi-label problems: each instance can have several labels. (image: sunset, beach, surfing, ...; document: several topics)
– scores: Y ∈ R^C
• goal: predict Ŷ_i = Y_i; we can obtain hard decisions from posteriors and scores: arg max_k Y_k
– random forest:
∗ ensembles of decision trees:
· train each tree on a random subset of the instances
· only consider a random subset of the features when selecting a split
∗ prediction: average over all tree predictions, Ŷ = (1/T) Σ_t Ŷ_t
– neural network:
∗ define the non-linearity of the output layer via the "soft-max" function:

Z_Lk = exp(Z̃_Lk) / Σ_{k'} exp(Z̃_{Lk'})
∗ backpropagation:

δ_Lk = ∂Loss/∂Z_Lk = { −1/Z_Lk if k = Y_i;  1/(1 − Z_Lk) if k ≠ Y_i }

∂Z_Lk/∂Z̃_Lk = ... = Z_Lk − Z_Lk²
∂Z_Lk/∂Z̃_Lk'' = ... = −Z_Lk Z_Lk''

⇒ Jacobian J_kk' = { Z_Lk (1 − Z_Lk) if k = k';  −Z_Lk Z_Lk' if k ≠ k' }

δ̃_Lk = ∂Loss/∂Z̃_Lk = Σ_{k''} δ_Lk'' J_k''k
      = { Z_Lk − 1 − Σ_{k'' ≠ k} ( Z_Lk'' / (1 − Z_Lk'') ) Z_Lk    if Y_i = k;
          2 Z_Lk − Σ_{k'' ≠ k, k'} ( Z_Lk'' / (1 − Z_Lk'') ) Z_Lk  if Y_i = k' ≠ k }
equivalently:

min_{β,b}  (1/2) ||β||²_2 + (λ/N) Σ_i ξ_i   s.t. ∀i: Y_i (X_i β + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
⇒ N (C − 1) slack variables
– compute the dual and train in the dual space ⇒ difficult
• traditional belief: CS is better than WW because there are only N constraints instead of N(C − 1)
• but: [Drogan et al. 2011] claim that the crucial difference is actually the elimination of the intercepts b, because they lead to difficult equality constraints in the dual ⇒ they eliminated b in WW and got better results than CS
9 Lecture 20/05
e.g. (rows = classes, columns = binary classifiers):

M_ovr = (  1 −1 ... −1
          −1  1 −1 ...
          ...
          −1 ... −1  1 )

M_ovo = (  1  1 ...  0  0 ...
          −1  0 ...  1  1 ...
           0 −1 ... −1  0 ...
           0  0 ...  0 −1 ...
          ... )
• in principle, M can be arbitrary; the best choice of M is an open problem (restriction: the rows of M must differ)
– OVO, OVR are still good choices
– [Dietterich et al. 1995]: Error-correcting output codes (ECOC). Idea: make the rows of M pairwise as different as possible. ⇒ classification becomes robust against errors in a few of the L classifiers. If the rows of M for classes k and k' differ in at least L' elements, we can recover from ⌊(L' − 1)/2⌋ errors by majority vote.
• stagewise optimization (various authors): add a new column to M until the over-
all performance is satisfying, choose the new column “optimally” with respect to
existing columns.
• prediction ("decoding"):
1 a 0 entry means that the instance is not used to train classifier l
– if all h_l(X) return crisp binary labels: assign X to the row of M with minimal Hamming distance.
– if all h_l(X) return posteriors or scores: compute the loss of all rows and choose the k which minimizes the loss.
• This is easy for binary classification; h_l(X) just has to be a bit better than guessing.
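A small sketch of Hamming decoding in Python (numpy); M and the classifier outputs are assumed ±1-coded, the names are illustrative:

import numpy as np

def ecoc_decode(M, h):
    """Assign to the class (row of M) with minimal Hamming distance to the
    vector h of crisp binary classifier outputs (entries in {-1, +1})."""
    dist = (M != h).sum(axis=1)   # disagreements per row
    return int(np.argmin(dist))

# usage: C = 3 classes, one-vs-rest code matrix, L = 3 classifiers
M = np.array([[ 1, -1, -1],
              [-1,  1, -1],
              [-1, -1,  1]])
print(ecoc_decode(M, np.array([1, -1, -1])))   # -> 0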
9.2 Gaussian Processes (or The Statistical Theory of Interpolation)
2. factorize p((X_1,Y_1), ..., (X_N,Y_N)) as well as possible, e.g. = (1/Z) p((X_1,Y_1),(X_2,Y_2)) · ... · p((X_{N−1},Y_{N−1}),(X_N,Y_N))  ("Markov assumption: only neighboring trace-points are related")
– graphical models find such factorizations systematically (⇒ chapter "GM")
– this is only tractable when we use a simple model for p(..)
– obvious choice: multi-variate Gaussian distribution ⇒ everything an be
computed in closed form
– typical application is regression, classification is modeled via regression
of the posterior class probability.
10 Lecture 22/05
– Under our model p(f; f̄, S), all functions with finite norm ⟨f − f̄, f − f̄⟩_{H(S)} have non-zero probability.²
– A function f has high probability if it is similar to f̄ and conforms to the covariance structure given by S ⇔ ||f − f̄||_{H(S)} is small.
– S(X, X + ∆X) decreases slowly with ∆X ⇒ neighboring points are highly correlated ⇒ f should be smooth.³
– S(X, X + ∆X) decreases quickly ⇒ noisy functions are also probable⁴
– S must be chosen by the designer to model the properties of the application.
• in practice, we are only interested in finitely many points of f (training & test points)
⇒ We create marginal distributions of N(f̄, S) by integrating out all points we don't care about.
⇒ general property: any marginal of a Gaussian is again Gaussian
let [X_1, ..., X_N] be the training locations with observed responses [Y_1, ..., Y_N]^T = Y, and [X_{N+1}, ..., X_{N+N'}] the test locations where we want to find [Ŷ_{N+1}, ..., Ŷ_{N+N'}]^T = Y'
⇒ the marginal distribution of the stacked vector [Y; Y'] is N(f̄, S_{1:N+N'})⁵
• we are interested in p(Y'|Y)
• we can compute the test responses one point at a time, i.e. we can set N' = 1

p(Y'|Y) ~ N( k(X')^T K^{-1} Y,  κ − k(X')^T K^{-1} k(X') )

• Ŷ = k(X')^T Ỹ
t = X' − ⌊X'⌋
11 Lecture 27/05
Reminder: E(Y') = Ŷ = k(X')^T K^{-1} Y = k(X')^T Ỹ
Kernel functions: a function K(X, X') is a kernel iff it is positive definite (Mercer's condition).
• New kernel functions can be constructed from existing ones with easy operations:
– a positive linear combination of kernels is a kernel: K_new(X, X') = α_1 K_1(X, X') + ... + α_M K_M(X, X'), α_1, ..., α_M > 0
– a product of kernels is a kernel: K_new(X, X') = K_1(X, X') K_2(X, X')
– we may map the X into an arbitrary feature space before applying the kernel function: K_new(X, X') = K_1(ϕ(X), ϕ(X'))
– the exponential of a kernel is again a kernel: K_new(X, X') = exp(K_1(X, X'))
– [...]
• two big classes of popular kernels: "radial basis functions" and "tensor product kernels"

∗ α = 1/2:  K(r) = 1 / √(1 + r²/ρ)
∗ α = 1:  K(r) = 1 / (1 + r²/(2ρ))
• all kernels so far have infinite support: K(r) > 0 even for very large r ⇒ the kernel matrix K is dense, i.e. expensive to invert when N is big (solution of the linear system K Ỹ = Y takes O(N³))
• A kernel with compact support (i.e. K(r) = 0 if r > r_max) leads to a sparse kernel matrix. ⇒ sparse solvers need O(N)
– truncate a non-compact kernel (an approximation)
– define compact kernels, e.g. Wendland splines: K(r) = (1 − r/ρ)_+^γ · poly_γ(r/ρ)
∗ choose γ and the polynomial according to the feature space dimension and the required number of derivatives:
· smooth, but not differentiable: K(r) = (1 − r/ρ)_+^γ,  γ = ⌊D/2⌋ + 1
· once differentiable: K(r) = (1 − r/ρ)_+^γ (γ r/ρ + 1),  γ = ⌊D/2⌋ + 2
· [...]
• radial basis functions are best if the training points X_i are irregularly arranged: "scattered data interpolation"
• if the X_i form a regular grid, tensor product kernels allow working in one dimension at a time
• squared exponential:
K(X, X') = exp(−½ ||X − X'||²) = exp(−½ (X_1 − X_1')²) · · · exp(−½ (X_D − X_D')²)
B_1(x) = { 1 − |x|,  |x| ≤ 1
         { 0,        otherwise

B_2(x) = { 3/4 − x²,            |x| ≤ 1/2
         { (1/2)(3/2 − |x|)²,   1/2 ≤ |x| ≤ 3/2
         { 0,                   otherwise

B_3(x) = { 2/3 − x² + |x|³/2,   |x| ≤ 1
         { (1/6)(2 − |x|)³,     1 < |x| ≤ 2
         { 0,                   otherwise

B_∞(x) ~ e^{−(1/2)(x/ρ)²}
12 Lecture 29/05
• on the grid, Gaussian process interpolation is just filtering (convolution)¹

Y ~ N( f̄ (= 0), K + σ² I )

where K represents the uncertainty about the true function f, whereas σ² I is the uncertainty in the measurements of the Y_i.

1 Y values at grid points, K = I for Catmull-Rom
• We need the conditional probability for unseen points, given the training points: p(Y'|Y) = p(Y, Y')/p(Y).² The computations are almost the same as with σ² = 0, giving us

Ŷ = E(Y'(X')) = k(X')^T (K + σ² I_N)^{-1} Y + E[ε']   (the last term = 0)

var(Y') = k(X', X') + σ² − k(X')^T (K + σ² I_N)^{-1} k(X')

(a code sketch follows below)
• repeat

2 where S = ( A + σ² I_N,  B;  B^T,  C + σ² I_{N'} ); compare to the earlier lecture
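A compact numpy sketch of the two formulas above, assuming a squared-exponential kernel; all names and the kernel choice are illustrative:

import numpy as np

def sqexp(A, B, rho=1.0):
    """Squared-exponential kernel matrix between row-point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / rho**2)

def gp_predict(X, Y, Xtest, sigma2=0.1):
    """GP regression with noise: posterior mean and variance at Xtest."""
    K = sqexp(X, X) + sigma2 * np.eye(len(X))   # K + sigma^2 I_N
    k = sqexp(X, Xtest)                         # k(X'), shape (N, N')
    alpha = np.linalg.solve(K, Y)               # (K + sigma^2 I)^{-1} Y
    mean = k.T @ alpha                          # E(Y')
    var = sqexp(Xtest, Xtest).diagonal() + sigma2 - (k * np.linalg.solve(K, k)).sum(0)
    return mean, var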
12.3 Application: GP classification
H = −∂²/∂Z² log p(Z|X, D) |_{Z = Ẑ}

p(Z|X, D) ≈ e^{log p(Ẑ|X,D)} · e^{−½ (Z − Ẑ)^T H (Z − Ẑ)} = N(Ẑ, H^{-1})   (the first factor is const)
13 Lecture 03/06
13.1 GP classification
• p(y' = ±1|X') = ∫ p(y'|Z') p(Z'|X') dZ';  the latent variable Z simplifies matters
• in order to make predictions p(Z'|X') = p(Z'|X', D), we need the Z_i values for the training points: E(Z') = k^T(X') K^{-1} Ẑ
⇒ training = determine Ẑ
• p(Ẑ, {X_i, Y_i}_{i=1}^N) = p(Ẑ|{X_i, Y_i}_{i=1}^N) p({X_i, Y_i}_{i=1}^N) = p({Y_i}|Ẑ) p(Ẑ|{X_i}) p({X_i})

Ẑ = arg max_Z p(Z|{X_i, Y_i}_{i=1}^N) = arg max_Z p({Y_i}|Z) p(Z|{X_i})
  = arg max_Z log p({Y_i}|Z) + log p(Z|{X_i})

with log p({Y_i}|Z) = Σ_i log p(Y_i|Z_i) = Σ_i log σ(Y_i Z_i) and the GP prior
log p(Z|{X_i}) = −½ Z^T K^{-1} Z − ½ log|K| − (N/2) log 2π

⇒ Ẑ = arg max_Z ψ(Z),   ψ(Z) = −Σ_i log(1 + exp(−Y_i Z_i)) − ½ Z^T K^{-1} Z + const

∂ψ(Z)/∂Z = v − K^{-1} Z =! 0

where v = (t_1 − π_1, ..., t_N − π_N)^T,  t_i = (Y_i + 1)/2 ∈ {0, 1},  π_i = σ(Z_i)

∂²ψ(Z)/∂Z² = W − K^{-1}

where W = diag( −π_1(1 − π_1), ..., −π_N(1 − π_N) )

Update step: Z^{(t+1)} = Z^{(t)} − (W^{(t)} − K^{-1})^{-1} (v^{(t)} − K^{-1} Z^{(t)}),   Ẑ = Z^{(t→∞)}
• predictions:
– solve p(Y|X') = ∫ σ(Y Z') GP(Z'|Ẑ, X') dZ'; no closed-form solution
⇒ solve numerically, or use the normal CDF instead of σ(t) (but then we must adjust ∂ψ/∂Z and ∂²ψ/∂Z² during training)
– if we only need a decision function:
Figure 13.1: A depiction of parallel coordinates as a plotting technique for multivariate data. It allows one to see clusters in the data and to estimate other statistics visually. In parallel coordinates, points are represented as connected line segments: each vertical line represents one attribute of the car data set, and one set of connected line segments represents one data point. Points that cluster tend to appear close together. The data set is colored by the number of cylinders, given in the legend in the upper right (MPG = miles per gallon).
13.2 The Bayesian Interpretation of GP regression (and their relation to “reproducing kernel Hilbert spaces”(RKHS))
∀x_0, ∀f(x) ∈ H(k):  ⟨f, K_{x_0}⟩_{H(k)} = f(x_0)
• to actually compute k_0^{-1}, it is best to use the Fourier transform; convolution theorem: F(f ∗ g) = F(f) F(g)

k_0^{-1} ⇔ F(k_0^{-1}) F(k_0) = 1,   k_0^{-1} = F^{-1}( 1 / F(k_0) )

but often no closed-form expression for k_0^{-1}(x) exists; no problem in practice, because we can always explicitly invert the kernel matrix for our finite training set
• scalar product:

⟨f, g⟩_{H(k)} := ∫ f(x') (k_0^{-1} ∗ g)(x') dx' = ∫ (k_0^{-1/2} ∗ f)(x') · (k_0^{-1/2} ∗ g)(x') dx'

(∗)  ∫ k_0^{-1}(x'') k_{x_0}(x' − x'') dx'' = ∫ k_0^{-1}(x'') k_0((x' − x_0) − x'') dx'' = δ(x' − x_0)
• application to GP regression:
– given training data D, find the function f(x) that has maximum a posteriori probability p(f|D)
– expand according to Bayes: p(f|D) ∝ p(D|f) · p(f)  (training error × prior for f)
– data probability: squared loss, p(D|f) = exp( −(1/2σ²) Σ_i (Y_i − f(X_i))² )
– prior: choose a Gaussian process, p(f) = exp( −½ ⟨f, f⟩_{H(k)} )
– the prior encodes the expected smoothness of f in the kernel k and prefers f that conforms to this smoothness requirement: ⟨f, f⟩ small ⇔ p(f) high
f̂ = arg max_f p(f|D) = arg min_f (1/σ²) Σ_i (Y_i − f(X_i))² + ⟨f, f⟩_{H(k)}   (∗)

where the second term acts as regularization
– to solve this, we need the "representer theorem" [Kimeldorf & Wahba 1971]
Thm: In any problem of the form (∗), the optimal solution can be expressed as a linear combination of kernel functions centered at the training points: f̂(x) = Σ_i α_i K_{x_i}(x); we just have to determine the α_i.
• insert the representer theorem into (∗):

⟨f̂, f̂⟩_{H(k)} = ⟨ Σ_i α_i K_{x_i}, Σ_j α_j K_{x_j} ⟩ = Σ_{i,j} α_i α_j ⟨K_{x_i}, K_{x_j}⟩ = Σ_{i,j} α_i α_j K(X_i, X_j) = α^T K α
14 Lecture 10/06
p(A=boy ∧ B=boy ∧ (A=Sun ∨ B=Sun) | (A=boy ∨ B=boy) ∧ (A=Sun ∨ B=Sun))
1 There will probably be a lot of plots in this chapter which won't be reproduced here; see e.g. Barber or Koller & Friedman for those.
2 We could just ask Bob.
⇒ missing: that it is the same person who is a boy and born on a Sunday
⇒ correct model:

p(A=boy ∧ B=boy ∧ (A=Sun ∨ B=Sun) | (A=boy ∧ A=Sun) ∨ (B=boy ∧ B=Sun))

p_4 = 13/27 = (2·7 − 1)/(4·7 − 1)
5. see exercise
14.1 Graphical Models
Figure 14.1: Three possible models for smoking and cancer (with a latent "Gene" variable): (a) direct causal influence; (b) indirect causal influence via a latent common cause (Gene); (c) a model incorporating both influences.
2. omitted variable bias: an apparent association could be causal, but could also be due to a common cause (smoking → lung cancer, gene → lung cancer and gene → smoking) or a mediating variable (sex → admission, but sex → field → admission);
if the additional variable is ignored (marginalized out), very misleading conclusions will be drawn
• trick: we can drop arcs when variables are conditionally independent (remember: conditional independence does not in general imply unconditional independence)
15 Lecture 12/06
• catch:
– every permutation of the variables results in a different, but equivalent, factorization: D! possibilities
– but in some factorizations we can remove many more edges
⇒ goal: use the permutation that results in the fewest edges after step 3; the best permutation tends to be the one that results in a causal graph (i.e. arc direction = cause → effect)
4. learn the parameters of the distributions p(X_j|PA(X_j)) for all j; they can be represented by conditional probability tables (CPT) or parametric models
⇒ Bayesian or Belief Network (BN)
• chain (A → B → C):
– if B is marginalized, we just lose information: p(C|A) = Σ_B p(C|B) p(B|A)  (uncertainty increases)
– if B is known (B = b), then C is independent of A: A ⊥⊥ C | B (dictated by the graph structure; otherwise A would have to be in PA(C))
15.1 Bayesian Networks (directed graphical models)
• common cause:
– if B is marginalized, an association between A and C results (the arrow direction does not follow from the graph, but often from the application); example: Simpson's paradox, Berkeley admission
– if B is known: A ⊥⊥ C | B
• common effect:
– if B is not marginalized but unknown: A ⊥⊥ C
– if B is marginalized: unconditional independence still holds
– if B is known, A and C become conditionally dependent: A ⊥̸⊥ C | B ("Berkson's paradox")
Now suppose you live in California: the alarm can also be triggered by an earthquake.
marginalize out B:
Berkson's paradox: given A = 1, we learn (e.g. from the news) that there was an earthquake E = 1. Compute p(B|A = 1, E = 1) = p(A=1, B, E=1) / p(A=1, E=1):

B    p(B|A = 1, E = 1)
0    0.97
1    0.03
• Consider a set S ⊂ {X_1, ..., X_D} such that evidence is available for all nodes in S. Let X, Y ∉ S. Then an undirected path X ⇝ Y is blocked by S if any of the following is true for some node A_i on the path:
1. A_{i−1} → A_i → A_{i+1} is a chain and A_i ∈ S
2. A_{i−1} ← A_i → A_{i+1} (common cause) and A_i ∈ S
3. A_{i−1} → A_i ← A_{i+1} (collider) and neither A_i nor any descendant Z ∈ DE(A_i) is in S
16 Lecture 17/06
• What independence assumptions does a BN encode?
– when X and Y are associated, information must flow between them
– in a BN, information can only flow along the arcs (in both directions!)
⇒ we must consider undirected paths between X and Y (X ⇝ Y)
– if information can flow along X ⇝ Y, the path is "active", otherwise "blocked"
– if all paths between X and Y are blocked ⇒ X ⊥⊥ Y (unconditionally)
– given a set S of nodes where we have evidence (we know the variable's value), path activation can change¹
• algorithm to check for d-separation:
Given: directed graph G, evidence nodes S; X, Y ∉ S
1. Build the ancestral subgraph G' of G: remove all nodes not in {X, Y, S, ancestors(X, Y, S)} and their arcs.
2. Build the moral graph G'' of G': for each node in G', connect all unconnected ("unmarried") parents by an undirected arc and remove all arrowheads.
3. Construct G''' from G'' by removing all nodes of S from G''.
4. X and Y are d-separated given S iff they are disconnected in G'''.
• Def: A joint probability p(X_1, ..., X_D) satisfies the (directed global) Markov property w.r.t. a graph if "X_j and X_j' are d-separated by S" implies X_j ⊥⊥ X_j' | S in p(X_1, ..., X_D).
• Theorem: If p(X_1, ..., X_D) is Markov w.r.t. a DAG G, then it can be factorized as p(X_1, ..., X_D) = Π_j p(X_j|PA(X_j)).
• The converse is not generally true: conditional independence does not always imply d-separation.
Example: X and Y are not d-separated but independent: X → Z → Y ← X.
linear model:
X = ε_X ~ N(0, σ_X²)
Z = aX + ε_Z
Y = bZ + cX + ε_Y = abX + bε_Z + cX + ε_Y = (ab + c)X + ε'_Y
if (ab + c) = 0 ⇒ X ⊥⊥ Y
1 see end of last lecture
• Claim: many important models for p(X j |PA(X j )) are faithful with probability 1.
Advantage: d-separation, i.e. the structure of the graph, fully specifies all indepen-
dence assumptions ⇒ we can separate the two problems
1. define/learn the structure of the graph
2. learn the probabilities, given the graph
16.1 Inference in BN
• Inference: compute interesting properties that are implicitly represented by the
BN (graph structure and p(X j |PA(X j )) are known)
• basic algorithm: variable elimination: split the variables into 3 (disjoint & complete)
sets
– T : variables we are interested in (“targets”)
– V : variables where we have evidence (“visible”)
– U : variables we are not interested in
• solution idea: use the distributive law to minimize the number of operations in the sum over products
– ab + ac = a(b + c) (three operations vs. two). In a complex network, proper grouping of terms can give dramatic gains:

q(X_1, X_3) = Σ_{X_2, X_4} q_1(X_1, X_2, X_3) q_2(X_1, X_4)

naive evaluation costs ~b⁴ operations (b = number of states per variable); grouping as (Σ_{X_2} q_1(X_1, X_2, X_3)) · (Σ_{X_4} q_2(X_1, X_4)) gives b³ + 2b²
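A tiny numpy illustration of this grouping, assuming b states per variable and the factors stored as arrays (names illustrative):

import numpy as np

b = 10
q1 = np.random.rand(b, b, b)   # q1(X1, X2, X3)
q2 = np.random.rand(b, b)      # q2(X1, X4)

# naive: build the full product, then sum out X2 and X4  -> O(b^4) work
naive = (q1[:, :, :, None] * q2[:, None, None, :]).sum(axis=(1, 3))

# grouped (distributive law): sum out X2 and X4 separately -> O(b^3) work
grouped = q1.sum(axis=1) * q2.sum(axis=1)[:, None]

assert np.allclose(naive, grouped)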
17 Lecture 19/06
• marginalization in Bayesian networks is generally done by “variable elimination”
• but: VE has exponential complexity in the # of eliminated variables when applied
naively
• We can take advantage of the distributive law to group sums and products such that
the complexity is minimized.
• “belief propagation” finds an optimal evaluation order automatically for tree-shaped
graphs.
• original algorithm [Pearl, 1988] for BN, here: use the generalization to factor graphs
by [Kschischang et al., 2001]
– factor graph:
∗ two types of nodes: variables (X_j) and factors (functions) f_l (drawn as small squares)

Figure 17.1: The left figure shows the undirected graph with a single clique potential Φ(X_1, X_2, X_3); the middle figure is the factor graph of Φ(X_1, X_2, X_3) = f_a(X_1, X_2) f_b(X_2, X_3) f_c(X_3, X_1); the right figure is the factor graph for Φ(X_1, X_2, X_3) = f(X_1, X_2, X_3).
∗ bipartite, i.e. edges exist only between nodes of different types (undirected edges)
∗ an edge X_j — f_l exists ⇔ X_j is an argument of f_l
∗ example: burglary alarm
– Variable elimination is implemented by "message passing": each node sends messages to and receives messages from its neighbors. messages = reduced probability tables = partial variable elimination
[Figure: factor graph of the burglary-alarm example, with unary factors f_E and f_B attached to E and B, and the factor f_A connecting them to the alarm.]
1. variable to factor:

µ_{X_j → f_l}(X_j) = Π_{f' ∈ Ne(X_j)\f_l} µ_{f' → X_j}(X_j)

2. factor to variable:

µ_{f_l → X_j}(X_j) = Σ_{{X'} ∈ Ne(f_l)\X_j} f_l(Ne(f_l)) Π_{X' ∈ Ne(f_l)\X_j} µ_{X' → f_l}(X')
– message scheduling: a message can be sent as soon as all required incoming messages at the sender have been received
⇒ leaf nodes in a tree can send messages to their only neighbor without waiting for prerequisites
⇒ message passing proceeds in rounds:
. round 0: send messages from leaf nodes
. round t: send messages where last prerequisites were received in round
(t − 1)
termination time T = diameter of the tree (longest path)
– finalization rule: compute the marginals from all incoming messages at the variable nodes:

q(X_j) = Π_{f ∈ Ne(X_j)} µ_{f → X_j}(X_j)

p(X_j) = q(X_j) / Σ_{X_j'} q(X_j')
– For small graphs this is no improvement over naive variable elimination, but it is easy to implement as an algorithm for arbitrarily large graphs: the "sum-product algorithm".
– If the graph has cycles (is not a tree) ⇒ belief propagation is generalized to "loopy belief propagation":
∗ due to cycles, a node can receive messages through the same edge repeatedly (either through alternative paths or by repeated winding around a cycle)
⇒ whenever this happens, the node sends updated messages through its other edges.
∗ no hard termination condition, but it converges to a fixed point (locally optimal solution) = a reasonable approximation of full variable elimination, but it depends on the initial state
∗ two possibilities to incorporate evidence (observed states):
· set f_l(X) = 0 if the evidence is incompatible with the variable states: for f(A, B, E) and evidence A = 1: ∀B, E: f(A = 0, B, E) = 0
· attach unary factors to the variable nodes where we have evidence; example: Alice's children, version (2) (we know that at least one of the children is a boy)¹
• The system is stationary if all X_j have the same set of states and the transition probabilities are identical: R^{(j)} = R^{(j')} for all j, j'.
[Figure: state-transition diagram of a weather Markov chain (state "Rain"), with transition probabilities 0.2, 0.4, 0.7, 0.1, 0.2.]
18 Lecture 24/06
This lecture is actually two lectures.
3. on any page there is a constant probability λ that the monkey jumps to another page chosen uniformly at random instead of clicking a link:

∀k: Σ_{k'} R'_{k'k} = 1

R'_{k'k} = λ · (1/C) + (1 − λ) R_{k'k}
• instead we can observe features Y_j that depend on the "hidden" or "latent" variables X_j (the dependency is causal, but probabilistic)
⇒ BN: (X_1) → (X_2) → ... → (X_D) and (X_j) → (Y_j) ∀j
the probability factorizes: p(X_1, ..., X_D, Y_1, ..., Y_D) = Π_j p(X_j|X_{j−1}) p(Y_j|X_j)
• example:
– speech recognition: X_1, ..., X_D is what the speaker said (phonemes), Y_1, ..., Y_D is what you heard
– wireless communication (e.g. cell phones): X_j symbol sent, Y_j symbol received
• major tasks:
– compute the marginals for the hidden states, given the observations Y = O¹: p(X_j|Y = O), j = 1, ..., D
1 observed state vector
(note: the global maximum-likelihood sequence â generally differs from the sequence of pointwise maxima ã_j = arg max_a p(X_j = a|Y = O))
– learn the probabilities p(X_j|X_{j−1}) and p(Y_j|X_j) from training data
• factor graph:
f_1(X_1) = p(X_1)   prior of X_1
f_j(X_j, X_{j−1}) = p(X_j|X_{j−1})   transition probability
g_j(Y_j, X_j) = p(Y_j|X_j)   observation probability

[Figure: HMM factor graph, a chain f_1 — X_1 — f_2 — X_2 — f_3 — X_3 — ..., with an observation factor attaching Y_j below each X_j.]
• all factor nodes have degree 2 ("pairwise factors"); the products (∗) contain only a single term, so we can simplify message passing by concatenating two consecutive messages²
• message schedule:
– round 0: send all γ messages (in parallel) and α_1(X_1) (these messages have no prerequisites, because f_1 and the Y_j are leaves)
– round j: send α_{j+1}(X_{j+1}) and β_{D−j}(X_{D−j})
⇒ forward-backward algorithm
α_1(X_1) = p(X_1)

β_j(X_j) = Σ_{X_{j+1}} f_{j+1}(X_{j+1}, X_j) µ_{X_{j+1} → f_{j+1}}(X_{j+1})
         = Σ_{X_{j+1}} f_{j+1}(X_{j+1}, X_j) γ_{j+1}(X_{j+1}) β_{j+1}(X_{j+1})

β_D(X_D) = 1
• algorithm:
– round 0: propagate γ j and α 1
– round j: propagate α j+1 and βD−j
2 ◦ represents the concatenation
• Remark: most textbooks derive the F/B algorithm directly, without factor graphs. Then the messages α_j and γ_j are usually merged into α̃_j(X_j) = α_j(X_j) γ_j(X_j).
• computing p(X_j|Y = O) is also called "smoothing"; intuition: X and Y have the same state space (symbols of an alphabet), Y is a noisy version of the true message X ⇒ p(X_j|Y = O) is a denoised (smoothed) version of Y = O
• in smoothing, we condition on all observations Y_1 = O_1, ..., Y_D = O_D
• in an online system, we can only condition on the observations received so far, Y_1 = O_1, ..., Y_j = O_j (we do not yet know the values of the future observations Y_{j+1}, ..., Y_D) ⇒ (online) filtering
• Derive filtering from scratch and show that it gives the same results as belief propagation on factor graphs:

p(X_j|Y_1 = O_1, ..., Y_j = O_j) = p(X_j, Y_1, ..., Y_j) / p(Y_1, ..., Y_j)
q(X_j|Y_1 = O_1, ..., Y_j = O_j) := p(X_j, Y_1 = O_1, ..., Y_j = O_j)

p(X_j, Y_1, ..., Y_j) = Σ_{X\X_j, Y\{Y_1,...,Y_j}} p(X_1, ..., X_D, Y_1, ..., Y_D)
                      = Σ_{X_{j−1}} p(X_{j−1}, X_j, Y_1, ..., Y_j)

α̃_j(X_j) = p(Y_j = O_j|X_j) Σ_{X_{j−1}} p(X_j|X_{j−1}) α̃_{j−1}(X_{j−1})

where the first factor is γ_j(X_j) and the sum is α_j(X_j)
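A short numpy sketch of this forward (filtering) recursion for a discrete HMM; pi, R, and Mu stand for the initial distribution, transition matrix, and observation matrix, all names illustrative:

import numpy as np

def forward_filter(pi, R, Mu, O):
    """alpha_tilde_j(x) ∝ p(X_j = x, Y_1..Y_j = O_1..O_j).
    pi: (K,) initial p(X_1); R: (K, K) with R[k', k] = p(X_j=k | X_{j-1}=k');
    Mu: (M, K) with Mu[m, k] = p(Y_j=m | X_j=k); O: observed symbol indices."""
    alpha = pi * Mu[O[0]]                    # alpha_tilde_1
    posteriors = [alpha / alpha.sum()]
    for o in O[1:]:
        alpha = Mu[o] * (R.T @ alpha)        # gamma_j * sum over X_{j-1}
        posteriors.append(alpha / alpha.sum())
    return np.array(posteriors)              # p(X_j | Y_1..Y_j) for each j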
• Kalman filter:
– HMM with continuous state space for X and Y: X_j ∈ R^N, Y_j ∈ R^{N'}
– objective:
⇒ it works for other tasks that have the required algebraic properties.
• This obviously applies to ordinary addition and multiplication with "0" = 0, "1" = 1.
• for the min-sum algebra, ⊕ = min with "0" = +∞ and ⊗ = + with "1" = 0; in particular:
(ii) a ⊗ b = a + b, "1" = 0, a + 0 = a
(iv) a ⊗ "0" = a + ∞ = ∞
• using this algebra, belief propagation becomes the “min-sum algorithm”, intuition:
replace all products with sums and all sums with “min” in the sum product algorithm
⇒ reuse the messages α, β,γ and update scheduling
– round 0: initialization
finally:

â_1 = arg min_{X_1 = a_1} ( α_1(X_1) + β_1(X_1) + γ_1(X_1) )

with α_1(X_1) = −log p(X_1 = a_1) and γ_1(X_1) = −log p(Y_1 = O_1|X_1 = a_1)

"Viterbi algorithm"
• remark 1: one can also do a forward sweep first, followed by a backward sweep ⇒ same result
• remark 2: in principle, one can also maximize the likelihood directly (instead of minimizing the negative log-likelihood):
– use the "max-product algebra" ⊕ = max, "0" = −∞; ⊗ = ·, "1" = 1 to get the "max-product algorithm"
– numerically not advisable, because it involves products of small numbers ⇒ loss of precision;
better: work with logarithms and the min-sum algorithm
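A sketch of Viterbi in min-sum form (negative log probabilities, as the remark above recommends); arguments as in the filtering sketch earlier, names illustrative:

import numpy as np

def viterbi(pi, R, Mu, O):
    """Most likely state sequence via min-sum over -log probabilities.
    pi: (K,) p(X_1); R[k', k] = p(X_j=k | X_{j-1}=k'); Mu[m, k] = p(Y_j=m | X_j=k)."""
    nlR = -np.log(R)
    cost = -np.log(pi) - np.log(Mu[O[0]])          # alpha_1 + gamma_1
    back = []
    for o in O[1:]:
        step = cost[:, None] + nlR                 # candidate costs, (K_prev, K)
        back.append(step.argmin(axis=0))           # best predecessor per state
        cost = step.min(axis=0) - np.log(Mu[o])    # plus observation term
    path = [int(cost.argmin())]                    # backtrack from the best end state
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]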
19 Lecture 26/06
Note: This lecture contains a lot of equations given in a very short amount of time. The
frequency of typing errors is therefore probably higher. Proceed with caution!
• Difference between the point-wise marginals p(X_j|Y = O) and the global MAP solution â = arg max_a p(X = a|Y = O):
• Example 1:
– consider a problem where X_j ∈ {1, ..., C}, the labels are ordinal (e.g. discretized values of a continuous phenomenon), and Y_j ∈ {1, ..., C} are noisy observations of the X_j.
– The point-wise marginals describe the local uncertainty about X_j. For example, X̄_j = E[X_j] and std[X_j] can be computed from p(X_j|Y = O) and give local error bars¹.
– The MAP solution is the most likely global solution within the local error bars.
• Example 2: consider a random walk in a maze: the room entered most frequently² is not necessarily part of the most likely way out.
19.1 Learning the parameters (= transition probabilities) of a HMM

• training data:
– N sequences, n = 1, ..., N, of lengths D_n, j = 1, ..., D_n
– observations Y = O: Y_j^{(n)} = O_j^{(n)} ∈ {0, 1, 2}
– the hidden states X_j^{(n)} are unknown
• Kullback-Leibler (KL) divergence between two distributions p_1(ω) and p_2(ω) over the same domain ω ∈ Ω:

KL(p_2|p_1) = Σ_ω p_1(ω) log( p_1(ω) / p_2(ω) ) ≥ 0
We choose p_1(ω) = p_θ(X|Y = O) = p_θ(X, Y = O)/p_θ(Y = O) and p_2(ω) = p_θ'(X|Y = O) = p_θ'(X, Y = O)/p_θ'(Y = O):

KL(p_2|p_1) = Σ_X ( p_θ(X, Y = O) / p_θ(Y = O) ) log( p_θ(X, Y = O) p_θ'(Y = O) / ( p_θ'(X, Y = O) p_θ(Y = O) ) )
            = log( p_θ'(Y = O) / p_θ(Y = O) ) + (1/p_θ(Y = O)) Σ_X p_θ(X, Y = O) log( p_θ(X, Y = O) / p_θ'(X, Y = O) )  ≥ 0

abbreviation:³

Q(θ_1, θ_2) = Σ_X p_{θ_1}(X, Y = O) log p_{θ_2}(X, Y = O)

KL(p_2|p_1) = log( p_θ'(Y = O) / p_θ(Y = O) ) + ( Q(θ, θ) − Q(θ, θ') ) / p_θ(Y = O)  ≥ 0

3 marginalize over all possible assignments of X
⇒ ( Q(θ, θ') − Q(θ, θ) ) / p_θ(Y = O)  ≤  log( p_θ'(Y = O) / p_θ(Y = O) )

The left-hand side is a lower bound for the right-hand side; we want the ratio p_θ'(Y = O)/p_θ(Y = O) ≥ 1, i.e. its log ≥ 0.
⇒ Improve the objective p_θ'(Y = O)/p_θ(Y = O) as much as possible by maximizing the lower bound.
⇒ define θ̂' = arg max_{θ'} Q(θ, θ')
• EM algorithm⁴ outline:
1. define an initial guess θ^{(0)}
2. for t = 1, ..., T (or until convergence):
– E-step: define Q(θ^{(t−1)}, θ') = E_{θ^{(t−1)}}[log p_θ'] = Σ_X p_{θ^{(t−1)}}(X, Y = O) log p_θ'(X, Y = O)
– M-step: set θ^{(t)} = θ̂' = arg max_{θ'} Q(θ^{(t−1)}, θ')
• thanks to the BN factorization of our HMM, the calculations simplify tremendously:

Q(θ, θ') = Σ_X p_θ(X, Y = O) log p_θ'(X, Y = O),
with p_θ'(X, Y = O) = Π_n Π_j p_θ'(X_j^{(n)}|X_{j−1}^{(n)}) p_θ'(Y_j^{(n)} = O_j^{(n)}|X_j^{(n)})

= Σ_X p_θ(X, Y = O) Σ_{n=1}^N ( log p_θ'(X_1^{(n)}) + Σ_{j=2}^{D_n} log p_θ'(X_j^{(n)}|X_{j−1}^{(n)}) + Σ_{j=1}^{D_n} log p_θ'(Y_j^{(n)} = O_j^{(n)}|X_j^{(n)}) )
• when maximizing w.r.t. θ' = {ρ', π', µ'} we also need to preserve the normalization of these probabilities ⇒ Lagrangian:

L(θ') = Σ_X p_θ(X, Y = O) Σ_{n=1}^N ( log p_θ'(X_1^{(n)})  [→ ρ']
        + Σ_{j=2}^{D_n} log p_θ'(X_j^{(n)}|X_{j−1}^{(n)})  [→ π']
        + Σ_{j=1}^{D_n} log p_θ'(Y_j^{(n)} = O_j^{(n)}|X_j^{(n)})  [→ µ'] )
        + λ_ρ (1 − Σ_k ρ'_k) + Σ_k λ_k (1 − Σ_{k'} π'_{k'k}) + Σ_k η_k (1 − Σ_m µ'_{mk})

4 E for expectation, M for maximization
setting the derivatives to zero yields:

ρ'_k ∝ Σ_n p_θ(X_1^{(n)} = k | Y = O)

µ'_mk ∝ Σ_n Σ_{j=1}^{D_n} p_θ(X_j^{(n)} = k, Y_j^{(n)} = m | Y = O)
      ∝ Σ_n Σ_{j=1}^{D_n} p_θ(X_j^{(n)} = k | Y = O) · 1(Y_j^{(n)} = m)

where the marginals p_θ(X_j^{(n)} = k | Y = O) come from the standard F/B algorithm
20 Lecture 01/07
20.1 Causality
• causality is the second major application of BN (the first: temporal models)
– we may not be able to actively intervene ⇒ the groups {X_i = a_1} and {X_i = a_0} are outside of our control
20.1 Causality
X X X
Y Z Y Z Y Z
Figure 20.1: Common- Figure 20.2: Causal-Chain Figure 20.3: Common-
Cause Model Model Effect Model
X X X
Y Z Y Z Y Z
do Y do Y do Z
Figure 20.4: Influence of interventions on the three basic causal models
⇒ "structural interventions"⁵
– marginalization (under a hard intervention δ(X_j = a_j)):

p(X_1, ..., X_{j−1}, X_{j+1}, ..., X_D | do(X_j = a_j)) = Σ_{X_j} p(X_1, ..., X_D | do(X_j = a_j))
= Π_{j' ≠ j} p(X_{j'} | PA(X_{j'})),   where X_j = a_j is plugged in whenever X_j ∈ PA(X_{j'})
[X d-separated from Y | Z]_G ⇔ [X ⊥⊥ Y | Z]_p
21 Lecture 03/07
• Theorem: The minimal true causal model is unique [Peters et al. 2014].
• problem: given data, infer the true model, or (weaker) a member of the Markov
equivalence class
– Theorem: This produces the true skeleton if the oracle is always correct. In practice, the oracle is some statistical test (⇒ later) that may be erroneous ⇒ we only get an approximate skeleton.
– optimization 1: PC algorithm³
∗ start with small conditioning sets S: the CI test ("conditional independence test") is then faster and more accurate
∗ it can be shown that S only needs to include neighbors of either X_j or X_j' ⇒ after a few edge removals, far fewer sets S need to be considered
∗ early termination: after some removals, it may become impossible to create new S.
∗ algorithm:
· start with the complete graph
· S = ∅: for all pairs (X_j, X_j'), remove the edge if X_j ⊥⊥ X_j' | ∅
· work on S with |S| = 1: for all nodes X_j that have at least 2 neighbors:
  + for all X_j' ∈ NE(X_j) and all X_j'' ∈ NE(X_j)\X_j': remove edge (X_j, X_j') if X_j ⊥⊥ X_j' | X_j''
· work on S with |S| = 2: for all X_j with at least 3 neighbors:
  + for all X_j' ∈ NE(X_j) and all X_{j1}, X_{j2} ∈ NE(X_j)\X_j': remove edge (X_j, X_j') if X_j ⊥⊥ X_j' | (X_{j1}, X_{j2})
· and so on, until no X_j has the required number of neighbors
· and so on, until no X j has the required number of neighbors
∗ in the worst case this is not faster than the IC algorithm, but it can be proven [Classen et al. 2013] that a variant of PC has worst-case complexity O(D^{2(deg+2)})⁴
⇒ if the skeleton is sparse (deg is small) ⇒ polynomial runtime; in practice this is usually the case
∗ If the oracle is always correct, PC creates the correct skeleton; otherwise the result is order-dependent, because errors lead to different subsequent tests and errors.
– optimization 2: stable parallel PC algorithm: eliminate the order dependence by performing all CI tests for a given |S| in parallel, only removing edges once each round |S| is finished
∗ don't remove edges in parallel, but sort by the confidence of the CI test (increasing p-value) and remove in that order, skipping removals whose preconditions no longer hold (i.e. an edge a particular test relies upon has already been removed) (no, this requires more thought)
∗ faster than PC if the CI tests are performed concurrently
3 PC = “Peter & Clark”
4 deg: maximum degree of any node in the skeleton
[Figure 21.1: Example (triangle over A, B, C) where applying stable-PC gives a different outcome.]
[Figures 21.2/21.3: True causal graph and true causal pattern of example 2, over variables A, B, C, D, E.]
example⁷
example 2 [Spirtes et al. 2010]⁸
4. get interventional data to orient the remaining edges: we know that after do(X_j = a_j) there can be no incoming arcs at X_j; all edges at X_j must point outward in the interventional graph
– [Eberhardt et al. 2006] showed: in the worst case, two experiments are needed for every edge (X_j, X_j'):
a) experiment 1: intervene on X_j but not on X_j'; experiment 2: intervene on X_j' but not on X_j
5 excludes turning j − j'' − j' into j → j'' ← j'
6 if one edge is already directed, this implies the orientation of the other
7 the lecture contains a couple of graphical examples here
8 see "Automated Search for Causal Relations: Theory and Practice", figure 1, for the graph
Figure 21.4: On the left, the true unknown complete graph among the variables A, B, C. In one experiment, the researcher simultaneously and independently performs a parametric intervention on A and B (I_A and I_B, respectively, shown on the right). Since the interventions do not break any edges, the graph on the right represents the post-manipulation graph.
5. Theorem:
– If one intervenes on exactly one variable per experiment, at most D − 1 experiments are needed to identify the full BN.
– If one can intervene on up to D/2 variables simultaneously, log₂ D + 1 experiments are sufficient.
practical problems are usually not worst case
22 Lecture 08/07
– Mutual Information:

MI(X, Y) = H(X) + H(Y) − H(X, Y)

– usual null hypothesis H_0: X ⊥⊥ Y
Under H_0, the actually observed counts N follow a multinomial distribution; the test statistic is

Ĝ = 2N M̂I(X, Y)

which is asymptotically χ²-distributed with dof¹ = (|X| − 1)(|Y| − 1)
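A small sketch of this test in Python, computing M̂I (in nats) from a contingency table and the p-value from the χ² distribution; scipy's chi2 is used for the tail probability, all names are illustrative:

import numpy as np
from scipy.stats import chi2

def g_test(counts):
    """G-test of independence: G = 2N * MI_hat.
    counts: contingency table of observed counts N_xy."""
    N = counts.sum()
    p_xy = counts / N
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                                  # 0 log 0 := 0
    mi = (p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum()
    G = 2 * N * mi
    dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    return G, chi2.sf(G, dof)                      # statistic and p-value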
• Problems:
– Normally this test is conservative: we reject H_0 only when we are confident (small p-value). This means we assume independence in case of doubt. We would actually like a test for X ⊥̸⊥ Y, but this is difficult.
– continuous variables must be discretized, and MI is very sensitive to the particular discretization ⇒ active area of research
1 dof = degrees of freedom
2 see https://xkcd.com/882/
– if the conditioning set is large, only few samples fulfill the condition ⇒ high variance of M̂I; in practice, |S| = 4 is the maximum
– errors also propagate in PC
• map the data into an augmented feature space: X̃ = ϕ(X), Ỹ = ψ(Y), where ϕ, ψ are nonlinear
– implementations vary in:
∗ allowed moves
∗ score functions
∗ amount of randomness
– initialization: usually the empty graph

X_j = f_j(PA(X_j), N_j)

where N_j is noise
• Theorem: If p(X_1, ..., X_D) is strictly positive and Markov with respect to a DAG G, then there exists an SEM on G that generates p(X_1, ..., X_D).
• identifiability:
if f_j is linear and N_j is Gaussian ⇒ we cannot distinguish between X → Y and Y → X;
the same holds if f_j is asymptotically constant and strictly monotone and N_j is exponentially distributed³
3 N_j ~ e^{−|x|}
4 [Peters et al. 2014]
23 Lecture 15/07
• algorithm:
– phase 1: determine the optimal ordering of the variables for the Bayesian factorization
∗ S = {X_1, ..., X_D}
∗ for t = D, ..., 1 (construct the order backwards):
· for each X_j ∈ S:
  + regress X_j on S\X_j using a suitable regression method² and compute the residual R_j
  + conduct an independence test for R_j ⊥⊥ S\X_j and store the p-value p_j

1 structural equation model
2 (non-linear least squares, kernel regression, ...)
3 The lecture contains an example graph here: the causal model X_j → X_j' via g(X_j) and the anti-causal model X_j' → X_j via f(X_j'). Our wrong SEM predicts X̂_j = f̂(g(X_j)); if g is not invertible, information is lost ⇒ R_j = X_j − X̂_j contains information that could be used to predict X_j' from X̂_j and R_j.
  + set π(t) = arg max_j p_j (place the variable with the biggest p-value, i.e. the highest certainty of independence, at position t)
  + PA(X_{π(t)}) = S\X_{π(t)}
  + update S = S\X_{π(t)}
– phase 2: determine the SEM (remove as many edges from the graph as possible and determine the regression functions)
∗ initialize G as the complete DAG for the order from phase 1 (add arcs X_j' → X_j for all X_j' ∈ PA(X_j))
∗ for t = 2, ..., D:
· for each X_j' ∈ PA(X_{π(t)}) (try to get rid of as many parents as possible):
  + regress X_{π(t)} on PA(X_{π(t)})\X_j' and compute the residuals R_j'
  + if R_j' ⊥⊥ PA(X_{π(t)}): remove X_j' from the parents, PA(X_{π(t)}) = PA(X_{π(t)})\X_j'
– output the resulting graph and regression functions
• properties:
– only marginal dependence tests are needed (no conditional ones) → easier
– one can prove: the true causal model is identifiable when the regression functions are non-linear, provided that the regression is sufficiently powerful and the additive noise assumption is fulfilled
– efficient: O(D²Q) operations, where Q is the complexity of the regressions and independence tests
• if the variables are continuous (or have too many discrete states): use an SEM
  and regression
23.3 Drawing Conclusions from a BN
• handling missing values:
  ∗ if the missing value is irrelevant for the instance at hand (e.g. physicians
    wouldn’t run a useless diagnostic test)
    ⇒ introduce “irrelevant” as an additional state of the variable
  ∗ otherwise: a non-trivial problem ⇒ later
8 i.e. the BN is still not complete, because it doesn’t explain the treatment choice
  ⇒ more hidden factors, e.g. risk factors, speed of recovery, cost, ...
24 Lecture 17/07
• illustration using potential outcomes: imagine that for each individual, the
  reaction to treatments A and B is pre-determined but unknown to us.
  – By applying a treatment, we will observe one of the potential outcomes, but
    since we cannot rewind time, the other outcomes (“counterfactual” outcomes)
    are missing = “fundamental missing data problem of causal inference”.
  – If the variables are binary (2 treatments A and B, 2 outcomes true and false),
    there are 4 types of people according to their potential outcomes:

    type | outcome under A | outcome under B | description
    a    | 1               | 0               | A-responsive
    b    | 0               | 1               | B-responsive
    c    | 1               | 1               | completely responsive
    d    | 0               | 0               | doomed1
  – two observation groups: T_A, the N_A individuals who received treatment A2
  – the (unknown) proportions of the 4 types in both groups: p_a, p_b, p_c, p_d
    in T_A; q_a, q_b, q_c, q_d in T_B
  – we cannot distinguish the types a from c and b from d in T_A, nor a from d
    and b from c in T_B, because the counterfactual outcome is unknown.
  – compute the number of recovered people: R_A = N_A(p_a + p_c);
    R_B = N_B(q_b + q_c)
  – naive estimate: compare the proportions:
    R_A/N_A − R_B/N_B = p_a + p_c − (q_b + q_c)
  – we are actually interested in the treatment effect: p_a − p_b
  – we must make sure that the two groups T_A and T_B are comparable
    (“exchangeable” treatment assignment)
    ⇒ p_a ≈ q_a, p_b ≈ q_b, p_c ≈ q_c ⇒ R_A/N_A − R_B/N_B ≈ p_a − p_b as desired.
1 insert dark laughter here
2 N_B, T_B analogously
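The argument can be checked numerically. A small simulation (my own illustration, not from the lecture): with exchangeable (randomized) assignment the naive proportion difference recovers p_a − p_b; with confounded assignment it does not.

```python
# Potential-outcome simulation: naive estimate vs. the true effect p_a - p_b.
import numpy as np

rng = np.random.default_rng(42)
types = np.array(["a", "b", "c", "d"])           # a: A-responsive, b: B-responsive,
outcome_A = {"a": 1, "b": 0, "c": 1, "d": 0}     # c: completely responsive,
outcome_B = {"a": 0, "b": 1, "c": 1, "d": 0}     # d: doomed

def naive_estimate(p_treat_A, p_type, n=200_000):
    t = rng.choice(types, size=n, p=p_type)
    in_A = rng.random(n) < np.vectorize(p_treat_A.get)(t)
    y = np.where(in_A, np.vectorize(outcome_A.get)(t),
                       np.vectorize(outcome_B.get)(t))
    return y[in_A].mean() - y[~in_A].mean()      # R_A/N_A - R_B/N_B

p_type = [0.3, 0.2, 0.3, 0.2]                    # true effect p_a - p_b = 0.1
# Randomized assignment: exchangeable groups, estimate ~ 0.1.
print(naive_estimate({"a": .5, "b": .5, "c": .5, "d": .5}, p_type))
# Confounded assignment (A-responsive people tend to choose A): biased.
print(naive_estimate({"a": .8, "b": .2, "c": .5, "d": .5}, p_type))
```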
24.1 Confounder Adjustment
iii) propensity score matching: arrange the two groups into a complete bipartite
     graph
     ∗ weight the edges by the absolute difference of the propensity scores:
       w_ii' = |F_A(Z_i) − F_A(Z_i')|7
     ∗ remove edges whose weight is above some threshold
     ∗ find the minimum-cost bipartite matching with the Hungarian algorithm
       (or a greedy approximation8)
7 analogously for B
8 results do not really change according to literature
[Figure: complete bipartite matching between the treated instances 1–5 and the
control instances 6–10]
     ∗ form subgroups T_A' and T_B' from all matched pairs: within T_A' and T_B'
       the naive formula is equivalent to the exact formula
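A sketch of the matching step with made-up propensity scores; in practice F_A(Z) would come from, e.g., a logistic regression of the treatment indicator on the covariates Z. scipy's `linear_sum_assignment` solves the minimum-cost bipartite matching.

```python
# Propensity-score matching sketch with hypothetical scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
fA_treated = rng.random(5)                   # propensity scores, treated group
fA_control = rng.random(8)                   # propensity scores, control group

# Edge weights w_ii' = |F_A(Z_i) - F_A(Z_i')| for all treated/control pairs.
w = np.abs(fA_treated[:, None] - fA_control[None, :])

# "Remove" edges above the threshold by making them prohibitively expensive.
threshold = 0.2
cost = np.where(w <= threshold, w, 1e6)

# Minimum-cost bipartite matching (Hungarian-type algorithm).
rows, cols = linear_sum_assignment(cost)
pairs = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]
print("matched pairs (treated, control):", pairs)
```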
25 Lecture 22/07
• augmented features (D = domain indicator, O = zero block of the same width as X):
  X̃ = [X, (1 − D)·X, D·X] = [X, X, O] if i ∈ source; [X, O, X] if i ∈ target
  – since we don’t have labels for the unlabeled instances, we add two augmented
    instances for each unlabeled instance, one with each label:
    (X̃_u, Ỹ_u) = ([O, X, −X], +1) and (X̃_u', Ỹ_u') = ([O, X, −X], −1)
    (this encourages the source and target classifiers to make similar
    predictions on unlabeled points)
  – train normally and predict Ŷ = X·(β_c + β_t) for target points
  – for a linear SVM, we get the corresponding loss functions
– outperforms many complex methods
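A minimal sketch of the augmentation map above (my own code, not the lecture's): D is the domain indicator (0 = source, 1 = target), and the zero block O has the same width as X.

```python
# Feature augmentation for domain adaptation: [common, source-only, target-only].
import numpy as np

def augment(X, domain):
    """X: (N, d) features; domain: (N,) array, 0 = source, 1 = target."""
    D = domain[:, None].astype(float)
    return np.hstack([X, (1 - D) * X, D * X])

X = np.ones((3, 2))
print(augment(X, np.array([0, 0, 1])))
# source rows -> [X, X, O], target rows -> [X, O, X]
```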
25.4 Importance sampling by reweighting
• ...
• especially popular for neural networks since they need a lot of training data anyway,
and can create augmented data on the fly
• replace the question by a counterfactual question: “How would the BN have behaved
  had the parameters been θ′, provided the hidden mechanisms didn’t change?”
  ⇒ virtually replay the data to simulate BN(θ′)
• example: advertisement placement in Bing search results [Bottou et al. 2013]
– three conflicting goals:
∗ don’t annoy the user (few and relevant ads)
∗ attract advertisers (high click rates at reasonable price)
∗ maximize Bing’s revenue
  – BN: p(X | θ) = ∏_j p(X_j | PA(X_j), θ)
  – changing the parameters to θ′ reweights each factor:
    p(X | θ′) = p(X | θ) · ∏_j [ p(X_j | PA(X_j), θ′) / p(X_j | PA(X_j), θ) ],
    where w_j = p(X_j | PA(X_j), θ′) / p(X_j | PA(X_j), θ) is the reweighting
    factor for node j
  – E[Y | θ′] = E[ Y ∏_j w_j | θ ] ≈ (1/N) ∑_i Y_i ∏_j w_ij
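A toy numerical check of this identity (my own two-node example, not from the lecture): data logged under θ is replayed with weights to estimate E[Y | θ′] without re-running the system.

```python
# Importance-sampling reweighting on a 2-node BN X -> Y.
import numpy as np

rng = np.random.default_rng(1)
N = 500_000

p_x = 0.3                                    # logging parameter theta: p(X=1)
p_x_new = 0.7                                # counterfactual theta': p(X=1)
p_y_given_x = np.array([0.2, 0.6])           # fixed mechanism p(Y=1|X)

X = (rng.random(N) < p_x).astype(int)
Y = (rng.random(N) < p_y_given_x[X]).astype(int)

# Reweighting factor of the one node whose CPT changed:
# w = p(X | theta') / p(X | theta)
w = np.where(X == 1, p_x_new / p_x, (1 - p_x_new) / (1 - p_x))

estimate = np.mean(Y * w)                    # (1/N) sum_i Y_i w_i
exact = p_x_new * 0.6 + (1 - p_x_new) * 0.2
print(f"reweighted: {estimate:.4f}  exact E[Y|theta']: {exact:.4f}")
```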
[Figure: BN of the ad-placement system with nodes ads a, bids b, scores q,
slate s, prices c, clicks y, revenue z]
• examples: the lecture contains some examples here, but since they aren’t reproduced
here, the corresponding calculations are left out as well
25.5 Causal Theory of transferability [Bareinboim, Pearl, Tian, 2012-2015]
• intuitive goal: transform the expressions such that no conditional probability
  contains S = 1 and do(X) simultaneously
26 Lecture 24/07
• typical potentials:
  – unary potentials E_c(X_c) = E(X_j) encode the local evidence for the probable
    state of X_j
  – pairwise potentials E(X_j, X_j') encode the desire of X_j, X_j' to take
    similar values (= attractive potential) or different values (= repulsive
    potential)
  – higher-order potentials (|C| ≥ 3) encode preferences for the
    structure/pattern of the variables involved (often neglected by assumption)
• typical inference algorithms for discrete X
  – exact:
    ∗ reformulate the problem as an integer linear program (MAP1 inference):
      arg min_X w·X s.t. linear inequality constraints are met
      (usually NP-hard, but often tractable by heuristics in practice)
    ∗ special cases:
      · tree-shaped models ⇒ belief propagation gives the exact solution in one
        forward/backward sweep
      · sub-modular models, i.e.
        ∑_{X_j,X_j'} 1[X_j = X_j'] E(X_j,X_j') ≤ ∑_{X_j,X_j'} 1[X_j ≠ X_j'] E(X_j,X_j')
        ⇒ the graph-cut algorithm is exact [maximum flow in a graph = standard
        problem]
1 maximum a posteriori
  – approximations:
    ∗ relaxation, i.e. allow real values for X in the linear program (rounded
      later)
    ∗ move making: given a guess X^(t), define elementary moves (changes of few
      variables) and accept the best one (iterated conditional modes ICM, Lazy
      Flipper2, tree submodels); see the ICM sketch below
    ∗ move making: reduce the problem to a tractable subproblem
      (one-label-against-the-rest = α-expansion, one-label-against-one =
      α-β swap3)
    ∗ loopy belief propagation: iterate message passing until convergence
    ∗ sampling methods: randomly simulate the model and keep the best solution
      seen (Markov chain Monte Carlo (MCMC), Swendsen-Wang, Gibbs sampling)
– learning the potentials:
∗ learn them in isolation, independently of the others
∗ better but much more difficult: learn potentials jointly, s.t. they reinforce
each other towards a global loss function (on large patterns)
∗ details: watch Fred Hamprecht’s new video lecture4
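As referenced above, a minimal ICM sketch on a chain-shaped pairwise model with made-up unary energies and an attractive Potts pairwise term (my own toy example, not from the lecture).

```python
# ICM (iterated conditional modes): accept single-variable moves that
# strictly lower the energy, until no such move exists.
import numpy as np

rng = np.random.default_rng(0)
D, L = 8, 3                                  # 8 variables, 3 labels
unary = rng.random((D, L))                   # E(X_j): local evidence
lam = 0.5                                    # strength of the pairwise term

def energy(x):
    pairwise = lam * np.sum(x[:-1] != x[1:]) # Potts: penalize label changes
    return unary[np.arange(D), x].sum() + pairwise

x = rng.integers(L, size=D)                  # initial guess X^(t)
improved = True
while improved:                              # sweep until no move helps
    improved = False
    for j in range(D):
        for l in range(L):                   # elementary move: relabel X_j
            candidate = x.copy()
            candidate[j] = l
            if energy(candidate) < energy(x):
                x, improved = candidate, True
print("ICM labeling:", x, "energy:", energy(x))
```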
• multiple instance learning: a single label for a group (“bag”) of instances (e.g. one
label per image for all pixels jointly) with the understanding that only some instances
in each bag conform to the label ⇒ find these instances and the corresponding
classifier
• sparse annotation:5 for each instance, only a few true labels are known (e.g. movies
that someone likes) ⇒ infer the missing labels and learn a model (e.g. recommender
systems)
• active learning: minimize the required training set size by actively selecting
  the most informative instances (don’t waste annotator effort on easy decisions)
2 see lazyflipper.pdf
3 both can be solved by graph-cut
4 https://www.youtube.com/playlist?list=PLuRaSnb3n4kSgSV35vTPDRBH81YgnF3Dd
5 related to multiple instance learning
26.1 The omitted chapters (aka. “Machine Learning III”)
• semi-supervised learning: combine a small labeled training set with a big
  unlabeled one (combine supervised & unsupervised learning)
• transfer learning: combine a small labeled training set with a big labeled training
set from a similar domain
26.1.0.8 Features
• feature learning:
  – initial layers of a neural network
  – kernel approximation: k(x, x') = ⟨ϕ(x), ϕ(x')⟩ ⇒ use feature selection to
    find the important coordinates in ϕ(x) and compute ϕ̃(x) explicitly (without
    the kernel)
• random features:
  – random projections have a lot of interesting structural properties (they are
    not chaos)
    ⇒ use these properties (e.g. the Johnson-Lindenstrauss lemma6)
    ⇒ multi-dimensional (randomized) hashing for similarity
    ⇒ extreme learning machine: a 2-layer NN where the visible → hidden weights
      are random and the hidden → output weights are analytically optimized;
      a small sketch follows below
6 https://en.wikipedia.org/wiki/Johnson-Lindenstrauss_lemma
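A minimal sketch of the extreme-learning-machine idea above (my own toy regression example): random first layer, least-squares output layer.

```python
# Extreme learning machine: random hidden layer, analytic output layer.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))        # toy 1-d regression problem
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

n_hidden = 50
W = rng.normal(size=(1, n_hidden))           # visible -> hidden: random, fixed
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W + b)                       # hidden activations

beta, *_ = np.linalg.lstsq(H, y, rcond=None) # hidden -> output: least squares
print("train MSE:", np.mean((y - H @ beta) ** 2))
```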