
Advanced Machine Learning

Script of

PD Dr. rer. nat., Dipl.-Phys. Ullrich Köthe

at the Heidelberg Collaboratory for Image Processing,
Interdisciplinary Center for Scientific Computing,
Image Analysis and Learning Group

Transcript by: Manuel Haussmann

15 August 2017


Heidelberg University
Heidelberg Collaboratory for Image Processing
Berliner Str. 43
D-69120 Heidelberg
Contents

1 Lecture 15/04
  1.1 Goals
  1.2 Notation
  1.3 Linear Models for Classification
      1.3.1 LDA
      1.3.2 Directly learn the decision function

2 Lecture 17/04

3 Lecture 22/04
  3.1 Algorithms for Logistic Regression
  3.2 Why is SGD fast for large N?

4 Lecture 24/04
  4.1 Neural Networks
  4.2 NN architecture

5 Lecture 29/04
  5.1 Theoretical Capabilities of NN
  5.2 The Practice
  5.3 Backpropagation
  5.4 Loss functions depend on application

6 Lecture 06/05
  6.1 NN training algorithm
  6.2 Classical Tricks to make this work

7 Lecture 08/05
  7.1 RPROP
  7.2 Dropout [Srivastava & Hinton 2012]
  7.3 Piecewise linear activation functions
  7.4 MaxOut [Goodfellow et al. 2013]
  7.5 PReLU "parametric ReLU" [He et al. 2015]

8 Lecture 13/05
  8.1 Multi-class Classification

9 Lecture 20/05
  9.1 Coding Matrices for multi-class problems
  9.2 Gaussian Processes (or The Statistical Theory of Interpolation)

10 Lecture 22/05
  10.1 Gaussian Processes

11 Lecture 27/05

12 Lecture 29/05
  12.1 Uncertainty of GP interpolation
  12.2 Application [Snoek et al. 2012]: GP to optimize the hyperparameters of a learning algorithm
  12.3 Application: GP classification

13 Lecture 03/06
  13.1 GP classification
  13.2 The Bayesian Interpretation of GP regression (and its relation to "reproducing kernel Hilbert spaces" (RKHS))

14 Lecture 10/06
  14.1 Graphical Models

15 Lecture 12/06
  15.1 Bayesian Networks (directed graphical models)

16 Lecture 17/06
  16.1 Inference in BN

17 Lecture 19/06
  17.1 Temporal Models/Belief Networks

18 Lecture 24/06
  18.1 Markov Chains
  18.2 Hidden Markov Models (HMM)

19 Lecture 26/06
  19.1 Learning the parameters (= transition probabilities) of a HMM

20 Lecture 01/07
  20.1 Causality

21 Lecture 03/07
  21.1 Create BNs from data

22 Lecture 08/07
  22.1 Detecting conditional independence by statistical tests

23 Lecture 15/07
  23.1 RESIT algorithm (regression with subsequent independence test)
  23.2 Parameter estimation in BNs
  23.3 Drawing Conclusions from a BN

24 Lecture 17/07
  24.1 Confounder Adjustment

25 Lecture 22/07
  25.1 Hidden Confounders
  25.2 Transfer Learning = Domain Adaptation
  25.3 Data augmentation
  25.4 Importance sampling by reweighting
  25.5 Causal Theory of transferability [Bareinboim, Pearl, Tian, 2012–2015]

26 Lecture 24/07
  26.1 The omitted chapters (a.k.a. "Machine Learning III")
1 Lecture 15/04

1.1 Goals

• Find a function Y = f(X), where X ∈ R^D:
  – Y ∈ R or R^M: Regression,
  – Y ∈ {1, ..., C}¹: Classification²

  and learn the desired function f from training data:
  – {(X_i, Y_i)}_{i=1}^N: Supervised (correct answer known),
  – {X_i}_{i=1}^N: Unsupervised (must infer interesting categories).

• The function f stems from a model class (predefined, parameterized by θ), i.e.
  f(X|θ). The optimal θ are defined by the loss function Loss(X_i, Y_i|θ) → R:
  choose θ such that Σ_i Loss(X_i, Y_i|θ) is minimized.³

• loss/gain are selected according to the application

• generalization vs. overfitting: the loss on independent data (test set) may be much
  bigger than the loss on the training data.

• predict the generalization error:
  – theoretical models (e.g. Vapnik-Chervonenkis (VC) dimension)
  – empirically on independent test data or via cross-validation

• use models that generalize well:
  – simple models (|θ| < N)⁴
  – regularization (restrict the search space for θ)⁵
  – ensembles (combine several classifiers)⁶
  – randomization

¹ C: class count
² We will mainly concentrate on classification.
³ ... or a gain is maximized.
⁴ e.g. linear regression
⁵ e.g. Lasso
⁶ e.g. random forests, boosting


1.2 Notation

feature matrix X of dimension N × D⁷

instance index i (also i′, i₁, i₂), i = 1, ..., N; X_i is a row of X

feature index j (also j′, j₁, j₂), j = 1, ..., D; X_j is a column of X

class index k = 1, ..., C. Depending on the algorithm/context we also use k ∈ {0, 1} or
k ∈ {−1, 1}.

⁷ With N the instance count and D the feature count.

1.3 Linear Models for Classification

• Assume that the elements contained in the training data are i.i.d.⁸ ⁹, i.e.
  (X_i, Y_i) ⊥ (X_i′, Y_i′) for i ≠ i′.

• In general, there are 3 approaches:

  1. Learn a decision function: Ŷ = f(X|θ), f: R^D → {1, ..., C}.
     This gives us a hard decision on class membership with no confidence estimate.

  2. Learn class posterior probabilities: p(Y = k|X; θ) for all k: R^D → [0, 1], s.t.
     Σ_k p(Y = k|X; θ) = 1.¹⁰
     This approach always implies 1. with

       f(X) = arg max_k p(Y = k|X; θ) = Ŷ,

     "winner takes all", but with an added confidence

       p(Ŷ|X; θ) − max_{k≠Ŷ} p(Y = k|X; θ),¹¹

     giving us a soft class membership.

  3. Generative model (can be used to generate new data): learn the per-class
     likelihood + prior probability and combine them with Bayes' theorem:

       p(Y|X) = p(X|Y) p(Y) / p(X)   where   p(X) = Σ_k p(X|Y = k) p(Y = k).

     – Learn p(Y = k) = N_k/N, with N_k the number of class-k instances in the training
       data.
     – Learn the data likelihood per class, ∀k: p(X|Y = k; θ_k). How are the instances
       of class k distributed in feature space (= density estimation problem)?
     This approach implies 2. via Bayes' rule and 1. via "winner takes all".

• 1. and 2. are called discriminative models.

Application to linear models:¹²

⁸ Independent, identically distributed
⁹ If this assumption is violated, probabilistic graphical models are a possibility.
¹⁰ Special case C = 2: p(Y = 1|X; θ) = 1 − p(Y = 0|X; θ)
¹¹ For C = 2: p(Ŷ|X) − (1 − p(Ŷ|X)) = 2p(Ŷ|X) − 1 ⇔ p(Ŷ|X) − 1/2
¹² In the order 3. (linear discriminant analysis (LDA)), 1. (perceptron, linear support vector machine), 2. (logistic regression)

1.3.1 LDA
• Assume that p(X|Y = k; θ_k) is a Gaussian distribution for all k with a single joint
  covariance¹³ (otherwise: QDA (see ML1)).

Figure 1.1: Two examples for linear decision boundaries in 2D for the 2-class case of LDA.

• Fit the Gaussians with:

  k different means:            µ_k = (1/N_k) Σ_{i: Y_i = k} X_i
  Total mean:                   µ = (1/N) Σ_i X_i
  One joint covariance matrix:  Σ = (1/N) Σ_i (X_i − µ_{Y_i})ᵀ (X_i − µ_{Y_i})

• The per-class likelihood is given by:

    p(X|Y = k) = 1/√((2π)^D |Σ|) · exp(−½ (X − µ_k) Σ⁻¹ (X − µ_k)ᵀ)

¹³ It means classes differ only by their location and not by their shape. In the picture we also see an example with varying covariances.


• This leads to a linear posterior if C = 2:

    p(Y = 1|X) = σ(Xβ + b)

  with the logistic function σ(t) = (1 + exp(−t))⁻¹ and

    β = Σ⁻¹(µ₁ᵀ − µ₋₁ᵀ),   b = −µβ.

  ⇒ The decision rule is defined by:

    arg max_k p(Y = k|X)  ⇔  Ŷ = f(X) = {  1,  Xβ + b > 0
                                          { −1,  Xβ + b < 0

  where Xβ + b = 0 is called the decision boundary.

• This is a very good model if the Gaussian assumption holds (approximately).

1.3.2 Directly learn the decision function


• For simplicity, we augment the data matrix X with a column of 1s ⇒ thereby we
  can absorb b into β.

• We consider C = 2, Y ∈ {−1, 1}.

  Perceptron (Rosenblatt, 1958)

  – If the classifier is always correct, we have ∀i: Y_i X_i β > 0. ⇒ We should
    pay a penalty when Y_i X_i β < 0:

      Loss(X_i β, Y_i) = { −Y_i X_i β,  Y_i X_i β < 0    = (−Y_i X_i β)₊
                         {  0,          Y_i X_i β ≥ 0

      Loss(β) = Σ_{i: Y_i ≠ Ŷ_i} −Y_i X_i β = Σ_i (−Y_i X_i β)₊

  – Perceptron algorithm: gradient descent on the loss function

      ∂Loss/∂β = Σ_{i: Y_i ≠ Ŷ_i} −Y_i X_i

    ∗ choose β(0) randomly, learning rate τ
    ∗ repeat until convergence (t = 1, ..., T_max):

        β(t) = β(t−1) + τ Σ_{i: Y_i ≠ sign(X_i β(t−1))} Y_i X_i

  – Converges in finitely many steps if the training data are linearly separable
    (zero training error).
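The update rule above can be written out in a few lines of NumPy — a minimal sketch (my own illustration, not from the lecture; all names are arbitrary):

import numpy as np

def perceptron(X, Y, tau=1.0, T_max=1000, rng=np.random.default_rng(0)):
    # X: (N, D) feature matrix, already augmented with a column of 1s
    # Y: (N,) labels in {-1, +1}
    N, D = X.shape
    beta = rng.normal(size=D)             # random beta(0)
    for t in range(T_max):
        wrong = np.sign(X @ beta) != Y    # instances with Y_i != sign(X_i beta)
        if not wrong.any():               # zero training error -> converged
            break
        # beta(t) = beta(t-1) + tau * sum over misclassified i of Y_i X_i
        beta += tau * (Y[wrong][:, None] * X[wrong]).sum(axis=0)
    return beta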

2 Lecture 17/04
• example generative model: LDA

• example decision function: Perceptron

  – better: linear support vector machine (SVM)
  – disadvantages of the Perceptron:
    ∗ the solution is not unique if the data are separable, and most solutions generalize
      badly
    ∗ it may not converge if the data are non-separable (oscillation)
  – solution: require a "safety margin" around the decision plane and maximize its
    size ⇒ hinge loss: we already pay a penalty if the classification is correct but has
    low confidence.

Figure 2.1: Depiction of the perceptron loss (max(0, −Y_i X_i β)) and the hinge loss (max(0, 1 − Y_i X_i β)), as well as the squared hinge loss (max(0, 1 − Y_i X_i β)²) for comparison.

    ∗ The maximum-margin plane is found by minimizing ||β||² = βᵀβ under the
      hinge loss if the data are separable.
    ∗ For non-separable data: minimization of the loss and of ||β||² cannot be
      achieved simultaneously. ⇒ control the trade-off by a regularization parameter
      λ


    ∗ SVM objective function

        β̂ = arg min_β  βᵀβ/2 + (λ/N) Σ_i max(0, 1 − Y_i X_i β)

      many algorithms:
      · standard solvers for quadratic programming
      · primal space algorithms: (stochastic) gradient descent, e.g. Pegasos
      · dual space algorithms: sequential minimal optimization (SMO, LIBSVM);
        dual coordinate ascent (LIBLINEAR)
    ∗ Advantages:
      · good practical performance, relatively easy training (choose λ by
        cross-validation)
      · the dual formulation can be kernelized (non-linear classifier)
    ∗ Disadvantage: the confidence is |X_i β|, but this cannot be interpreted as the
      probability of being correct

• alternative: learn the posterior probability p(Y|X; β) ⇒ if C = 2: confidence
  2p(Ŷ|X) − 1

• The posterior probability of LDA is the logistic function (σ(z) = (1 + exp(−z))⁻¹),

    p(Y_i = 1|X; β) = σ(X_i β).

• choose β according to the maximum likelihood rule: maximize the likelihood of the
  training data¹

    p({X_i, Y_i}; β) = Π_i p(Y_i|X_i; β) = Π_i σ(Y_i X_i β)

  define the loss as the negative log-likelihood:

    −log p({X_i, Y_i}; β) = −Σ_i log σ(Y_i X_i β) = Σ_i log(1 + exp(−Y_i X_i β))

• Performance improves if we combine the loss with a regularization of β ⇒ regularized
  logistic regression (LR) objective:

    β̂ = arg min_β  βᵀβ/2 + (λ/N) Σ_i log(1 + exp(−Y_i X_i β))

  λ → ∞ gives us the traditional LR without regularization

• Many algorithms can solve this (see Minka 2003/2007, Bottou 2007), but since there
  is no closed-form solution, iterative algorithms are needed.

¹ use 1 − σ(z) = σ(−z)

1. gradient descent-type algorithms: work well because the objective is convex

     ∂loss(X_i, Y_i, β)/∂β = ∂/∂β log(1 + exp(−Y_i X_i β))
                           = 1/(1 + exp(−Y_i X_i β)) · exp(−Y_i X_i β) · (−Y_i X_iᵀ)
                           = (−1 + 1 + exp(−Y_i X_i β))/(1 + exp(−Y_i X_i β)) · (−Y_i X_iᵀ)
                           = (1 − σ(Y_i X_i β)) (−Y_i X_iᵀ)

   – (plain/batch) gradient descent: repeat until convergence², t = 1, ..., T_max:

       β(t+1) = β(t) + τ ( (λ/N) Σ_i (1 − σ(Y_i X_i β(t))) Y_i X_iᵀ − β(t) )

     where the factor (1 − σ(Y_i X_i β(t))) > 0 is ≈ 0 if the prediction is correct and ≈ 1
     if it is false.

   – stochastic gradient descent (SGD): we want to minimize E[loss(X_i, Y_i, β)]
     repeat until convergence³, t = 1, ..., T_max:
     ∗ choose i ∈ 1, ..., N at random
     ∗ β(t+1) = β(t) + τ_t ( λ(1 − σ(Y_i X_i β(t))) Y_i X_i − β(t) )

   – SGD with momentum⁴:

       g(t+1) = (1 − µ) g(t) + µ f(t),   f(t) := λ(1 − σ(Y_i X_i β(t))) Y_i X_i − β(t)
       β(t+1) = β(t) + τ_t g(t+1)

     here τ_t = τ₀/(1+t)^(3/4), i.e. the learning rate should decay more slowly than for
     plain SGD; g is initialized to zero.
     ∗ What does the averaging mean? Unrolling the recursion,

         g(t+1) = Σ_{t′=0}^{t} w_{t′} f(t−t′),   w_{t′} = µ(1−µ)^{t′} ∝ exp(−t′/η),

       i.e. the weights w_{t′} decay exponentially, with η the half-life
       ⇒ choosing η ≈ N: the behavior should be similar to plain GD

² τ represents the learning rate
³ τ_t = τ₀/(1+t)
⁴ µ ∈ (0, 1)


– mini-batch SGD: choose N_B instances at random, put them into a mini-batch B:

    β(t+1) = β(t) + τ ( (λ/N_B) Σ_{i∈B} (1 − σ(Y_i X_i β(t))) Y_i X_iᵀ − β(t) )

– averaged SGD: similar to momentum, but smooth β instead of the gradient
  (β̃ is the raw iterate, β the averaged one):

    β̃(t+1) = β̃(t) + τ_t ( λ(1 − σ(Y_i X_i β̃(t))) Y_i X_iᵀ − β̃(t) )
    β(t+1) = (1 − µ) β(t) + µ β̃(t+1)

– stochastic averaged gradient (SAG): pick a random i′ and update only its stored
  gradient:

    g_i(t+1) = { g_i(t),                                              i ≠ i′
               { λ(1 − σ(Y_{i′} X_{i′} β(t))) Y_{i′} X_{i′} − β(t),   i = i′

    g(t+1) = (1/N) Σ_i g_i(t+1) = g(t) + (g_{i′}(t+1) − g_{i′}(t))/N
    β(t+1) = β(t) + τ_t g(t+1)
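A minimal NumPy sketch of the mini-batch SGD update above (my own illustration; the function name, constants, and learning-rate schedule are arbitrary choices):

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_logreg(X, Y, lam=1.0, tau0=0.1, N_B=32, T_max=10000,
               rng=np.random.default_rng(0)):
    # Mini-batch SGD for the regularized LR objective
    # beta^T beta / 2 + (lam/N) sum_i log(1 + exp(-Y_i X_i beta)).
    # X: (N, D) with the 1s column absorbed; Y: (N,) in {-1, +1}.
    N, D = X.shape
    beta = np.zeros(D)
    for t in range(T_max):
        tau = tau0 / (1 + t)                  # decaying learning rate tau_t
        B = rng.integers(0, N, size=N_B)      # random mini-batch
        margins = Y[B] * (X[B] @ beta)
        # (1 - sigma(Y_i X_i beta)) Y_i X_i, summed over the batch
        grad_loss = ((1 - sigma(margins)) * Y[B])[:, None] * X[B]
        beta += tau * (lam / N_B * grad_loss.sum(axis=0) - beta)
    return beta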

3 Lecture 22/04

3.1 Algorithms for Logistic Regression

• objective:

    min_β  βᵀβ/2 + (λ/N) Σ_i log(1 + exp(−Y_i X_i β))

• gradient descent algorithm, stochastic GD

• Newton-type algorithms in primal and dual space

• reminder: Newton-Raphson algorithm: optimize a nonlinear function f(a) via a Taylor
  series expansion around the current guess:

    f(a(t) + ∆a) ≈ f(a(t)) + f′(a(t)) ∆a + f″(a(t))/2 · ∆a²  →  min
    ∂/∂∆a f(a(t) + ∆a) ≈ f′(a(t)) + f″(a(t)) ∆a = 0   (set to zero)
    ⇒ ∆a = −f′(a(t)) / f″(a(t))

  if a is a vector: ∆a = −H⁻¹|_{a(t)} ∇f|_{a(t)} ⇒ update: a(t+1) = a(t) + ∆a.
  We need the gradient

    ∇Loss = β + (λ/N) Σ_i (1 − σ(Y_i X_i β)) (−Y_i X_iᵀ),   using 1 − σ(Y_i X_i β) = σ(−Y_i X_i β),

  and the Hessian (using σ′(t) = σ(t)(1 − σ(t)) and Y_i² = 1):

    H = ∂²Loss/∂β² = I − (λ/N) Σ_i (−σ′(Y_i X_i β)) Y_i² X_iᵀ X_i
                   = I + (λ/N) Σ_i σ(X_i β)(1 − σ(X_i β)) X_iᵀ X_i = I + XᵀWX

  where W = (λ/N) diag(σ(X_i β)(1 − σ(X_i β))) is an N × N matrix

• simplify the gradient using W¹:

    (λ/N) Σ_i (1 − σ(Y_i X_i β)) Y_i X_iᵀ = (λ/N) Σ_i σ(X_i β)(1 − σ(X_i β)) · Y_i/σ(Y_i X_i β) · X_iᵀ
                                          = XᵀWỸ

¹ Ỹ_i = Y_i / σ(Y_i X_i β) is N × 1


• insert into the Newton-Raphson update:

    β(t+1) = β(t) + (I + XᵀW(t)X)⁻¹ (XᵀW(t)Ỹ(t) − β(t))
           = (I + XᵀW(t)X)⁻¹ ( (I + XᵀW(t)X) β(t) + XᵀW(t)Ỹ(t) − β(t) )
           = (I + XᵀW(t)X)⁻¹ ( XᵀW(t) (Xβ(t) + Ỹ(t)) )
           = (I + XᵀW(t)X)⁻¹ XᵀW(t) Z(t)

  where Z(t) = Xβ(t) + Ỹ(t)

• this is the formal solution of the weighted ridge regression problem

    β(t+1) = arg min_β (Z(t) − Xβ)ᵀ W(t) (Z(t) − Xβ) + ||β||²/2

⇒ Iterated Reweighted Least-Squares (IRLS) algorithm:

  repeat until convergence, t = 1, ..., T_max:
  – compute W(t) and Z(t)
  – V(t) = (W(t))^(1/2), X̃(t) = V(t)X, Z̃(t) = V(t)Z(t)
  – use a standard solver for min_β (Z̃(t) − X̃(t)β)² + ||β||²/2
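A compact NumPy sketch of IRLS as derived above (my own illustration; it solves the Newton linear system directly instead of calling a ridge-regression solver):

import numpy as np

def irls_logreg(X, Y, lam=1.0, T_max=25, tol=1e-8):
    # X: (N, D), Y: (N,) in {-1, +1}
    N, D = X.shape
    beta = np.zeros(D)
    sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
    for _ in range(T_max):
        p = sigma(X @ beta)                    # sigma(X_i beta)
        w = lam / N * p * (1 - p)              # diagonal of W(t)
        Y_tilde = Y / sigma(Y * (X @ beta))    # Ytilde_i = Y_i / sigma(Y_i X_i beta)
        Z = X @ beta + Y_tilde                 # Z(t) = X beta(t) + Ytilde(t)
        # Newton step: beta(t+1) = (I + X^T W X)^{-1} X^T W Z
        A = np.eye(D) + X.T @ (w[:, None] * X)
        beta_new = np.linalg.solve(A, X.T @ (w * Z))
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta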

• faster than GD or SGD on small datasets

• even faster: use fast approximation of the Hessian ⇒ “quasi-Newton”, e.g. BFGS
(Broyden–Fletcher–Goldfarb–Shanno) algorithm

• Newton in dual space

• in the primal space, we approach the optimum from above: all β(t) yield upper
  bounds, Loss(β*) ≤ Loss(β(t))

• the dual problem approaches the optimum from below: ∀α(t): DualLoss(α*) ≥
  DualLoss(α(t))

• In difficult optimization problems, one often brackets the (unknown) global optimum
  between a primal upper bound and a dual lower bound. The difference between the
  bounds is the "duality gap".

• If the dual bound is tight, the duality gap is zero, and the primal and dual solutions
  agree, e.g. for LR.

• requirements on the dual for LR:

  – tight lower bound
  – simple in β, s.t. it can be solved for β in closed form


Figure 3.1: Plot showing the logistic loss function in comparison to the hinge loss introduced above.

• obvious choice: tangents of the loss, parameterized by their slope:

    log(1 + exp(−t)) ≥ −αt − α log α − (1 − α) log(1 − α),   α ∈ [0, 1]

  ⇒ replace the loss with its lower bound, s.t. α_i ∈ [0, 1]²:

    Lagrangian(β, α) = βᵀβ/2 + (λ/N) Σ_i ( −α_i Y_i X_i β − α_i log α_i − (1 − α_i) log(1 − α_i) )

    ∂Lagrangian/∂β = β + (λ/N) Σ_i (−α_i Y_i X_iᵀ) = 0   (set to zero)
    ⇒ β = (λ/N) Σ_i α_i Y_i X_iᵀ

• insert into the Lagrangian:

    DualLoss(α) = −(λ²/2N²) Σ_{i,i′} α_i α_{i′} Y_i Y_{i′} X_i X_{i′}ᵀ
                  − (λ/N) Σ_i ( α_i log α_i + (1 − α_i) log(1 − α_i) )

• dual optimization problem: α* = arg max_α DualLoss(α) s.t. α_i ∈ [0, 1]

² with 0 log 0 := 0


• solution via coordinate-wise Newton (one α_i at a time):

    ∂Dual/∂α_i = −(λ²/N²) Y_i X_i Σ_{i′} α_{i′} Y_{i′} X_{i′}ᵀ − (λ/N) (log α_i + 1 − log(1 − α_i) − 1)
               = −(λ²/N²) Y_i X_i Σ_{i′} α_{i′} Y_{i′} X_{i′}ᵀ − (λ/N) log( α_i/(1 − α_i) )

    ∂²Dual/∂α_i² = −(λ²/N²) Y_i² X_i X_iᵀ − (λ/N)/α_i − (λ/N)/(1 − α_i)
                 = −(λ²/N²) X_i X_iᵀ − (λ/N) · 1/(α_i(1 − α_i))

• Dual coordinate-wise Newton algorithm:

  – choose α(0) randomly
  – repeat until convergence, t = 1, ..., T_max:
    ∗ α_i(t+1) = α_i(t) − (∂Dual/∂α_i) / (∂²Dual/∂α_i²)
    ∗ clip at [0, 1]
  – LIBLINEAR implements a slightly improved version (numerically more stable)
  – seems to be the fastest algorithm for large N

3.2 Why is SGD fast for large N?

• 3 sources of error:
  1. modeling error ε_mod: How far away is the (unknown) best model in our model
     family from the truth³?
  2. estimation/generalization error ε_est: How far away is our empirical optimum
     (from the finite training set) from the theoretical one (from infinite data)⁴?
  3. optimization error ε_opt: How far away is our solution (after finitely many
     iterations) from the true optimum (after infinitely many iterations)⁵?

• our choice of algorithm influences 2. and 3., ε = ε_est + ε_opt

• Both errors should decrease at about the same rate, otherwise our efforts on
  minimizing one of them are useless.

³ can be reduced by a larger model family
⁴ can be reduced by a smaller model family or more training data
⁵ can be reduced by more iterations


• numerical analysis: ε_est ∈ O(log N / N) in the best case, O(√(log N / N)) in the worst case

    ε ∼ ε_est ∼ ε_opt ∼ log N / N   ( or √(log N / N) )

    N ∼ (1/ε) log N
    log N ∼ log(1/ε) + log log N    (the second term is ≈ 0)
    N ∼ (1/ε) log(1/ε)              (best case)
    N ∼ (1/ε²) log(1/ε)             (worst case)

  Algorithm  | Time per step | Steps to accuracy ε_opt | Time to accuracy ε_opt   | Time to total accuracy ε
  SGD + Dual | O(D)          | O(1/ε_opt)              | O(D/ε_opt)               | O(D/ε)
  GD         | O(ND)         | O(log(1/ε_opt))         | O(DN log(1/ε_opt))       | O((D/ε²)(log(1/ε))²)
  Newton     | O(D²N)        | O(log log(1/ε_opt))     | O(D²N log log(1/ε_opt))  | O((D²/ε²) log(1/ε) log log(1/ε))

• Conclusion: in the long run, SGD wins

4 Lecture 24/04

4.1 Neural Networks


• linear classifiers only work when the data are approximately linearly separable;
  otherwise we need a nonlinear method

• two approaches to construct nonlinear methods from linear ones:

  1. augment the feature space
     – measure more properties
     – compute new features as nonlinear functions of the existing ones (e.g.
       kernel SVM) (⇒ later)
  2. non-linearly combine several linear classifiers
     – boosting: ŷ = sign(Σ_l α_l f_l(X)), with linear classifiers f_l(X) = sign(Xβ_l + b_l);
       training: greedily add one classifier at a time, minimize the exponential loss
       (ML1)
     – decision tree: hierarchy of linear classifiers;
       training: greedily maximize purity (minimize Gini impurity) (ML1)

• neural networks (NN) combine the ideas in 2.: connect linear classifiers in parallel
  ("layers") and layers in series ("multi-layer", or "deep" if ≥ 4)

• history:
  – 1940s/50s: first neuron models (Hebb, McCulloch/Pitts) and the idea of multi-layered
    architectures, inspired by brain research and meant to explain the brain
  – 1958: Perceptron and multi-layer perceptron (Rosenblatt): first working training
    algorithm (gradient descent on centered hinge loss), but no good algorithm for
    multi-layer training
  – 1969: book by Minsky & Papert: proved limitations of the single-layer perceptron
    (cannot solve the XOR problem) and conjectured (falsely) that multi-layered
    architectures are not much better.
    ⇒ first death
  – 1986: Rumelhart & Hinton: popularized backpropagation training for multi-layer
    NN; first practical training algorithm for multi-layer NN ⇒ first rebirth
  – ... 1995:


∗ proof of universial approximation capability


∗ solved several difficult toy problems
but:
∗ proof that exact training is difficult (NP hard in worst case)
⇒ need training heuristics, but they are very difficult to apply effectively
(“black art”)1
∗ success on real problems was limited
∗ discovery of SVM, boosting and random forests (much better on practical
problems)
⇒ second death
– 2006:
∗ much larger training sets (less overfitting)
∗ GPU-based parallelization (100× speed-up)
∗ Hinton: discovery of unsupervised pre-training
⇒ second rebirth
∗ NN won several prestigious benchmark competitions
∗ training was still difficult
– 2010 ...:
∗ interesting ideas to simplify training (dropout, dropconnect, ReLU activa-
tion, max out, ...)
∗ simpler architectures (fewer layers)
⇒ NN start to get interesting

4.2 NN architecture

• each neuron has arbitrarily many inputs and a single output

• originally: a neuron computes a weighted sum of the inputs and "fires" if a threshold
  is exceeded (threshold activation function), inspired by the brain

• today: generalize to arbitrary activation functions:

    Z_i = φ(X_i β + β₀)

  Z_i is the response, φ the activation function, X_i the features, β the weights and β₀
  the bias²

¹ "Training is easy as long as you let Hinton do it."
² Usually, the bias is absorbed into β.

Figure: a feed-forward network with an input layer (inputs #1–#4), one hidden layer, and an output layer with a single output.

– activation functions motivated by brain research: step function, sign function
  ⇒ threshold "on-off" behavior
– activation functions motivated by training algorithms: logistic function³,
  hyperbolic tangent⁴
  ⇒ smooth versions of the step & sign functions
– modern choices: piecewise linear functions: hinge function⁵ (usually called
  ReLU, "rectified linear unit"); maxout activation (⇒ later)
  almost everywhere differentiable ⇒ sparse activation patterns, better
  generalization

• a neuron with sigmoid activation is simply logistic regression ⇒ several neurons
  needed

• "network architecture" = how many neurons and how to combine them (must be
  fixed by the network designer)

• in a "feed-forward network" all connections are directed from input → output ⇒ the
  NN is a DAG (directed acyclic graph)

• opposite: recurrent network: information flows forward and backward (popular in
  time series analysis; not treated in this lecture)

• authors cannot agree on how to count layers:

  1. count input and output as layers ⇒ L = #hidden + 2
     popular notation: D − H₁ − H₂ − ... − M to specify the number of neurons in
     each layer
  2. most do not count the inputs
  3. to avoid confusion, some use different terms:

³ φ(t) = σ(t)
⁴ φ(t) = tanh(t) = 2σ(2t) − 1
⁵ φ(t) = max(0, t)


     – stages = # transitions between layers
     – hidden layers
  ⇒ we take approach 2., but start counting at 0: input l = 0, output l = L

• notation:
  number of layers L;
  layer index l = 0, 1, ..., L;
  number of inputs/features: D, j = 0, ..., D⁶;
  number of outputs: M, m = 1, ..., M;
  number of hidden neurons in layer l: H_l; if there is only 1 hidden layer:
  H, h = 0, ..., H⁷;
  B⁸: 3-dimensional array of weights;
  B_l: matrix of weights between layer (l − 1) and l;
  B_lh: column vector⁹ of input weights of neuron h in layer l;
  B_lhj: single weight from neuron j in layer (l − 1) to neuron h in layer l;
  output (row) vector of all neurons in layer l: Z_l;
  output of neuron h in layer l: Z_lh;
  φ_l: activation functions in layer l (all identical)

• example: 2-layer NN with 1 output neuron:

    Ŷ_i = Z_21 = φ₂( Σ_{h=0}^{H₁} B_21h · φ₁( Σ_{j=0}^{D} B_1hj X_ij ) )

⁶ 0 being the bias neuron
⁷ 0 being the bias neuron
⁸ actually a capital β
⁹ (= β in a single linear classifier)
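The example network can be evaluated in a few lines of NumPy — a sketch of the forward pass only (my own illustration, assuming a tanh hidden activation and a linear output):

import numpy as np

def forward_2layer(x, B1, B2, phi1=np.tanh, phi2=lambda t: t):
    # x : (D,) feature vector; the bias is handled by prepending a 1
    # B1: (H1, D+1) weights input -> hidden, B2: (1, H1+1) hidden -> output
    z0 = np.concatenate(([1.0], x))      # Z_0 with bias neuron
    z1 = phi1(B1 @ z0)                   # hidden responses Z_1
    z1 = np.concatenate(([1.0], z1))     # prepend the hidden bias neuron
    return phi2(B2 @ z1)                 # hat{Y}_i = Z_21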

18
5 Lecture 29/04

5.1 Theoretical Capabilities of NN

• A 1-layer NN is just a set of independent LR¹ instances (or linear regressions).
  ⇒ to be better, we need hidden layers

• The VC-dimension of a linear classifier with D features is D + 1: N_VC = D + 1 is the
  largest training set size where zero training error is always achievable, regardless of
  the labels, if C = 2 (ML1).
  ⇒ we can always reduce the training error by adding more hidden neurons, but
  beware of overfitting

• consider the first stage: each hidden neuron in the first layer splits the feature space
  into two half-spaces.
  – their union partitions the feature space into convex cells ("polytopes")
  – encode the cells by a binary number according to the side of each hyperplane
  – this projects the feature space onto the corners of an H₁-dimensional hypercube

• these 2 properties can be used to construct a 2-layer (difficult) or 3-layer (easy)
  network with zero training error ⇒ exercise

• universal approximation theorems (various versions): NN can learn arbitrary
  functions
  – e.g. Hornik, 1991 is one of the most general:
    ∗ regression setting (includes the classification setting via regression of the
      posterior probabilities)
    ∗ one hidden layer, and one output neuron with linear activation
    ∗ assume that the activation function of the hidden neurons is continuous,
      bounded and non-constant on every compact subset of R^D
      ⇒ output²: f̂(x) = Σ_{h=0}^{H₁} B_21h φ₁(B_1h Z₀)
    ∗ consider the function space L_p, i.e. the set of all functions s.t.

        L_p = { f : ||f||_p = ( ∫_{R^D} |f(x)|^p dX )^(1/p) < ∞ }

¹ logistic regression
² Notation update: Z₀ = [1 Xᵀ]ᵀ, i.e. a column vector


    ∗ Theorem: for every function f ∈ L_p there exist parameters H₁, B₁, B₂
      such that

        ( ∫_ρ |f − f̂|^p dx )^(1/p) < ε

      where ρ ⊂ R^D is an arbitrary compact subset and ε > 0 arbitrarily small

  ⇒ in principle, one hidden layer is sufficient³

³ The problem with this theorem is that it only gives a statement of existence, with no indication of how to construct f̂.

5.2 The Practice

• intuition: transform the data via several layers until they cluster cleanly into a few
  easily separable clusters.

• prediction:
  input layer:    Z₀ ∈ [(D + 1) × 1]
  hidden layers:  Z̃_l = B_l Z_{l−1} ∈ [H_l × 1] = [H_l × H_{l−1}][H_{l−1} × 1]   (layer (l − 1) to l)
                  Z_l = φ_l(Z̃_l) pointwise, ∈ [H_l × 1]
  output layer L depends on the application:
  – regression: linear activation, Ŷ_i = Z̃_Lᵀ (= Z_Lᵀ) ∈ [1 × M]
  – decision rule: ŷ = arg max_k Z̃_Lk, k ∈ 1, ..., C (number of classes)
  – 2-class posterior: p(Ŷ_i = 1|X_i) = Z_L = σ(Z̃_L), k ∈ {0, 1}, scalar output
  – multi-class posterior (C ≥ 2): k ∈ {1, ..., C}, ∀k: p(Ŷ_i = k|X_i) = Z_Lk =
    e^{Z̃_Lk} / Σ_{k′} e^{Z̃_Lk′} ⁴

⁴ known as the "soft-max function", a generalization of the sigmoid function

5.3 Backpropagation

fancy name for gradient descent training of the weights

• Define the Loss (application specific ⇒ later) and its derivatives:

    δ_l := ∂Loss/∂Z_l = ∂Loss/∂Z_{l+1} · ∂Z_{l+1}/∂Z_l = δ_{l+1} ∂Z_{l+1}/∂Z_l

• Derivatives w.r.t. the weights:

    ∂Loss/∂B_l = ∂Loss/∂Z_l · ∂Z_l/∂Z̃_l · ∂Z̃_l/∂B_l = δ_l φ_l′(Z̃_l) Z_{l−1}ᵀ ∈ [H_l × H_{l−1}]

  (δ_l φ_l′(Z̃_l) is [H_l × 1], Z_{l−1}ᵀ is [1 × H_{l−1}])


• common activation functions and their derivatives:

    σ′(Z̃_l) = Z_l(1 − Z_l),
    tanh′(Z̃_l) = 1 − Z_l²,
    ReLU: φ(t) = max(0, t),   φ′(Z̃_l) = step(Z̃_l) = { 1, Z̃_l > 0
                                                      { 0, else

Figure 5.1: Some depictions of widely used activation functions.

• derivatives w.r.t. the previous layers:

    ∂Z_{l+1}/∂Z_l = ∂Z̃_{l+1}/∂Z_l · φ′_{l+1}(Z̃_{l+1}) = B_{l+1}ᵀ φ′_{l+1}(Z̃_{l+1})

• backpropagation algorithm:
  – init δ_L = ∂Loss/∂Z_L,  δ̃_L = δ_L φ′_L(Z̃_L)
  – for l = L, ..., 1:

      ∆B_l = δ̃_l Z_{l−1}ᵀ
      δ_{l−1} = B_lᵀ δ̃_l
      δ̃_{l−1} = δ_{l−1} φ′_{l−1}(Z̃_{l−1})

  – B_l(t+1) = B_l(t) − τ ∆B_l(t+1)


5.4 Loss functions depend on application:

regression: Loss = ½(Y − Ŷ)²  ⇒  δ_L = Ŷ − Y = δ̃_L (because φ_L(t) = t)

2-class posterior: Loss = −Y log p̂ − (1 − Y) log(1 − p̂)

    δ_L = ∂Loss/∂Z_L = { −1/Z_L,       if Y = 1
                        { 1/(1 − Z_L),  if Y = 0 ,     ∂Z_L/∂Z̃_L = Z_L(1 − Z_L)

    δ̃_L = δ_L ∂Z_L/∂Z̃_L = { Z_L − 1,  if Y = 1
                            { Z_L,      if Y = 0

• regularization: RLoss = Loss + Regularizer. Popular regularizers: L₂, L₁

6 Lecture 06/05

6.1 NN training algorithm

• Initialization:
  – choose the network architecture (# layers, # neurons) and learning rate (schedule)
  – init the weights B(0)

• repeat until convergence, t = 1, ..., T_max:
  – ∆B(t) = 0
  – for i in instances(t) = { a single random i (SGD) | a random mini-batch |
                              the full training set (GD) }:

    ∗ forward sweep: prediction

        Z₀ = [1 X_iᵀ]ᵀ
        for l = 1, ..., L:
          Z̃_l = B_l(t−1) Z_{l−1}
          Z_l = φ_l(Z̃_l)

    ∗ compute the loss gradient according to the application:

        δ̃_L = ∂Loss(Z_L, Y_i)/∂Z̃_L
        ∆B_L(t) += δ̃_L Z_{L−1}ᵀ

    ∗ backward sweep ("error (gradient) backpropagation"):

        for l = L − 1, ..., 1:
          δ_l = (B_{l+1}(t−1))ᵀ δ̃_{l+1}
          δ̃_l = δ_l ∗ φ′_l(Z̃_l)   (pointwise)
          ∆B_l(t) += δ̃_l Z_{l−1}ᵀ

  – weight update:

      B(t) = B(t−1) − τ_t ∆B(t) + µ (B(t−1) − B(t−2))
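For concreteness, a minimal NumPy sketch of this loop for a 1-hidden-layer regression network with tanh activation and squared loss (my own illustration; momentum and mini-batches are omitted for brevity):

import numpy as np

def train_nn(X, Y, H=16, tau=0.01, T_max=5000, rng=np.random.default_rng(0)):
    # X: (N, D), Y: (N,) regression targets
    N, D = X.shape
    B1 = rng.normal(0, 1 / np.sqrt(D + 1), size=(H, D + 1))   # input -> hidden
    B2 = rng.normal(0, 1 / np.sqrt(H + 1), size=(1, H + 1))   # hidden -> output
    for t in range(T_max):
        i = rng.integers(N)                    # a single random i (SGD)
        # forward sweep
        z0 = np.concatenate(([1.0], X[i]))
        z1_tilde = B1 @ z0
        z1 = np.concatenate(([1.0], np.tanh(z1_tilde)))
        y_hat = (B2 @ z1)[0]                   # linear output activation
        # squared loss => delta_tilde_L = y_hat - Y_i
        d2 = np.array([y_hat - Y[i]])
        dB2 = np.outer(d2, z1)
        # backward sweep: delta_1 = B2^T d2 (drop bias row), times tanh'(z1_tilde)
        d1 = (B2.T @ d2)[1:] * (1 - np.tanh(z1_tilde) ** 2)
        dB1 = np.outer(d1, z0)
        B2 -= tau * dB2
        B1 -= tau * dB1
    return B1, B2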


6.2 Classical Tricks to make this work

• regularization, e.g. (λ/2N)||B||_F², (λ/N)||B||₁, or max-norm regularization: with B_lh
  the weights of neuron h in layer l, set B_lh → B_lh · min(C, ||B_lh||₂)/||B_lh||₂
  (so that ||B_lh||₂ ≤ C, typically C = 3, ..., 4) to keep the expected input in the
  non-constant range of the sigmoids:

    σ′(t) = σ(t)(1 − σ(t)) ≈ 0  if σ(t) ≈ 0 or 1
    tanh′(t) = 1 − tanh²(t) ≈ 0  if tanh(t) ≈ ±1

  ⇒ no gradient is propagated when the nonlinearity is saturated

• weight initialization: init such that the neurons are not saturated at the beginning
  – standardize the features (zero mean, unit variance)
  – assume that the neuron activations are zero mean and unit variance¹
  – initialize the weights with zero mean and variance s²
  – input properties of a next-layer neuron:

      E(B_lh Z_{l−1}) = E(B_lh) E(Z_{l−1}) = 0
      var(B_lh Z_{l−1}) = Σ_{h′=0}^{H_{l−1}} var(B_lhh′) var(Z_{l−1,h′}) = (H_{l−1} + 1) s²

    (using var(Z_{l−1,h′}) = 1)
  – for gradient training to work, E(B_lh Z_{l−1}) ± std(B_lh Z_{l−1}) should not be in the
    saturated region:
    ⇒ √(H_{l−1} + 1) · s ≤ 1 (≤ 2), solve for s = 1/√(H_{l−1} + 1)
    ⇒ init B_lh ∼ N(0, (H_{l−1} + 1)⁻¹) or ∼ U(−√(3/(H_{l−1} + 1)), √(3/(H_{l−1} + 1)))
  – always set B_lh0 = 0 (weight of the bias neuron)
  – variant: s = √(2/(H_{l−1} + H_l + 1))

¹ we only assume this for the weight initialization... in general it is wrong
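A small NumPy sketch of this initialization (my own illustration; function and argument names are arbitrary):

import numpy as np

def init_weights(H_prev, H, rng=np.random.default_rng(0), uniform=False):
    # Weight matrix B_l between layer l-1 (H_prev neurons + bias) and layer l
    # (H neurons), zero mean and variance s^2 = 1/(H_prev + 1).
    if uniform:
        a = np.sqrt(3.0 / (H_prev + 1))        # U(-a, a) has variance 1/(H_prev+1)
        B = rng.uniform(-a, a, size=(H, H_prev + 1))
    else:
        B = rng.normal(0.0, 1.0 / np.sqrt(H_prev + 1), size=(H, H_prev + 1))
    B[:, 0] = 0.0                              # weight of the bias neuron = 0
    return B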

• optimization algorithms
  – plain gradient descent ("batch training")
  – stochastic gradient descent ("online training")
  – mini-batch SGD
    ⇒ need to adjust learning rate and momentum (→ later)
  – methods that automatically adjust the step size
  – usual suspects:
    ∗ Newton, quasi-Newton (BFGS), conjugate gradient with line search
    ∗ RPROP:


      · idea: adjust the log of the training rate by gradient descent: log τ_t =
        log τ_{t−1} + ∆_t
        ⇒ multiplicative update on τ; after some math and simplifications:

          η_lhj(t) = { η⁺ (= 1.25),  if ∇B_lhj(t) ∇B_lhj(t−1) > 0
                     { η⁻ (= 0.7),   else

          τ_t,lhj = min( τ_max, max( τ_min, τ_{t−1,lhj} η_lhj(t) ) )
          B_lhj(t) = B_lhj(t−1) − τ_t,lhj sign(∇B_lhj(t))

      · converges very quickly or diverges
      · use with large minibatches

• termination criterion: training easily overfits ⇒ keep a separate validation set and
  monitor the validation error
  ⇒ stop when the validation error starts to go up

• learning rate and momentum for GD and SGD [Wilson & Martinez 2003]
  – we want ∆B(t) = −τ_t E[∂Loss/∂B]_{t−1}; finite-data estimate:

      E[∂Loss/∂B] = (1/N_B) Σ_{i ∈ Batch} ∂Loss_i/∂B |_{t−1}

  – for full GD: Batch = training set, N_B = N ⇒ accurate estimate of
    E[grad] at cost O(N)
  – SGD: Batch = single instance, N_B = 1 ⇒ inaccurate estimate at cost O(1)
  – minibatch is between these extremes
  – rule of thumb: the more accurate E[grad], the bigger we can choose τ: τ_GD ≈
    √N τ_SGD ⇒ the time for equal progress in GD is √N times longer than for SGD
    because of the O(N) cost per step.
  – learning rate schedules:
    ∗ keep the learning rate constant
    ∗ divide τ → τ/10 when learning stalls (2×)
    ∗ τ_t = τ₀/(1 + t/t₀)

7 Lecture 08/05

7.1 RPROP

• normal update of a single weight: B_lhj(t) = B_lhj(t−1) − τ ∆B_lhj(t)

• give each weight an individual training rate τ_lhj and train it via GD on the log:
  log τ_lhj(t) = log τ_lhj(t−1) + ∆(log τ)_lhj(t) ⇒ multiplicative update in τ itself
  ⇒ update rule: B_lhj(t) = B_lhj(t−1) − τ_lhj(t) ∆B_lhj(t)

• after some math and approximations, we find: ∆B_lhj(t) = sign( ∂Loss/∂B_lhj |_t )
  ⇒ the gradient w.r.t. B only determines the step direction, not the length

• the step length is completely absorbed into τ_lhj(t):

    τ_lhj(t) = max( τ_min, min( τ_max, τ_lhj(t−1) η_lhj(t) ) )

    η_lhj(t) = { η⁺ (= 1.25),  if ∇B_lhj(t) ∇B_lhj(t−1) > 0
               { η⁻ (= 0.7),   else

  with τ_min = 10⁻⁷, τ_max = 10⁻²
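A minimal NumPy sketch of one RPROP step (my own illustration; it assumes the gradients of the current and previous iteration are given as arrays):

import numpy as np

def rprop_step(B, grad, grad_prev, tau,
               eta_plus=1.25, eta_minus=0.7, tau_min=1e-7, tau_max=1e-2):
    # B, grad, grad_prev, tau: arrays of equal shape; tau holds the
    # per-weight learning rates tau_lhj.
    same_dir = grad * grad_prev               # > 0: gradient kept its sign
    eta = np.where(same_dir > 0, eta_plus, eta_minus)
    tau = np.clip(tau * eta, tau_min, tau_max)  # multiplicative update on tau
    B = B - tau * np.sign(grad)               # only the sign of the gradient matters
    return B, tau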

7.2 Dropout [Srivastava & Hinton 2012]

Figure 7.1: Depiction of a deep neural network demonstrating the influence of applying dropout with p = 0.5 (crossed-out neurons are dropped).


• new regularization technique (breakthrough), similar to the randomization in
  decision trees ⇒ random forest

• idea: randomly switch off part of the neurons in each training step
  ⇒ forward sweep:
  for l = 1, ..., L:

    r_{l−1} ∼ Bernoulli(p)^{H_{l−1}+1} ∈ [(H_{l−1} + 1) × 1]
    Z̃_l = B_l (r_{l−1} ∗ Z_{l−1})   (pointwise)

  i.e. keep each neuron with probability p.

• backpropagation only on the subnetwork of active neurons (weights going in/out of
  an inactive neuron are not changed)

• at the end of training: downscale all weights B → p · B and use all neurons for
  prediction.
  – this is an approximation to the statistical interpretation of dropout:
  – there are 2^H possible subnetworks ⇒ dropout trains a (small) random fraction
    of these
  – all subnetworks share the O(H²) weights
  – at prediction time: sample M of the 2^H possible subnetworks and return the
    average of their predictions.
  – but: this is expensive ⇒ approximate the average of the predictions by the
    prediction of the average network ⇒ a single prediction with the average
    network instead of M predictions from subnetworks
  – if all activations φ_l(t) were linear, the average network would be exactly B → p · B
  – the inventors showed experimentally that this also works for nonlinear
    activations
  – equivalent alternative (easier to implement): upscale all active weights during
    training by B_active → (1/p) B_active (see the sketch below)
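As announced, a minimal NumPy sketch of the upscaling ("inverted dropout") variant for one layer (my own illustration; the function name and arguments are arbitrary):

import numpy as np

def dropout_forward(Z_prev, B, phi, p=0.5, train=True,
                    rng=np.random.default_rng(0)):
    # Forward step of one layer with dropout on its inputs; the 1/p upscaling
    # during training means the weights need no rescaling at prediction time.
    if train:
        r = rng.binomial(1, p, size=Z_prev.shape)   # r_{l-1} ~ Bernoulli(p)
        Z_prev = r * Z_prev / p                     # keep with prob. p, upscale
    return phi(B @ Z_prev)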

• practical recommendations:
  – the learning rate must be increased by a factor of 10...100, and we need more
    iterations
  – the learning rate must decrease over time: τ_t = τ₀/(1 + t/t₀)
  – use max-norm regularization B_lh → B_lh · min(C, ||B_lh||₂)/||B_lh||₂ with C ≈ 3..4

• theory:
  – observation:
    ∗ since neurons cannot rely on the presence of any input neuron during
      training, subtle co-adaptation effects (huge weights that cancel each other)
      cannot occur ⇒ strong regularization

    ∗ weights tend to become sparse, e.g. if applied to images, the weights B_l become
      local filters.
    ∗ dropout reduces the Rademacher complexity of the network exponentially
      [Gao & Zhou, 2014]
    ∗ reminder (ML1): the Rademacher complexity 2R̂ measures the expected
      training success rate on nonsensical data, i.e. the features X_i are fixed and the
      labels Y_i are random (if the success rate here is high, the algorithm will
      strongly overfit)
    ∗ optimism: opt = 2R̂ + O(1/√N), test error ≤ train error + opt.
    ∗ without dropout: R̂ ∈ O( Π_{l=1}^{L} ||B_l||₁ ) ⇒
      · classical regularization (= reducing ||B||) helps
      · more layers (with the total # of weights fixed) increase overfitting
    ∗ with dropout: R̂ ∈ O( p^{L/2} Π_{l=1}^{L} ||B_l||₁ )
      ⇒ the factor p^{L/2} decreases exponentially with L ⇒ we can use many layers
      (L > 20)

• variant: DropConnect [Wan et al. 2013]: randomly drops weights (= graph edges)
  instead of neurons (= graph nodes)
  – small gains in performance, but a lot more complicated

7.3 Piecewise linear activation functions

• the other recent breakthrough

• ReLU ("rectified linear unit") [Nair & Hinton, 2010 / Glorot et al. 2011]

    ReLU(t) = max(0, t)

  – empirically works better than sigmoids
  – convex, only saturated for negative t
  – can approximate a sigmoid by two ReLUs, e.g. (ReLU(t + θ) − ReLU(t − θ))/(2θ)
  – effect: the features select a subnetwork "specialized" for that input (by driving
    some neurons into saturation)
    ⇒ for each input, we select a linear subclassifier that is an "expert" for that
    particular region of the feature space.


Figure 7.2: Rectified Linear Unit (ReLU) activation function, which is zero when t < 0 and
then linear with slope 1 when t > 0.

7.4 MaxOut [Goodfellow et al. 2013]

• any linear function is convex, and the pointwise max of a set of convex functions is
  also convex
  ⇒ we can produce an arbitrary piecewise linear convex function as the max of
  linear functions

• any continuous function can be expressed as the difference of two convex functions¹
  ⇒ we can approximate any function arbitrarily well by a piecewise linear function,
  taking the difference of two such max(linear functions)

• maxout: alternate linear layers with maxout layers:

    l = 1, 3, 5, ...:  Z_l = Z̃_l
    l = 2, 4, 6, ...:  Z_lh = max_{h′ ∈ S_lh} Z_{l−1,h′},   S_lh ⊂ [0, H_{l−1}]

  where S_lh is a subset of the neurons in layer l − 1 (typically: each neuron Z_lh uses k
  neurons of Z_{l−1} without sharing, H_{l−1} = kH_l)

• k ∈ [2, ..., 20] is the number of linear segments after each maxout neuron.

• main application: convolutional neural networks: "max pooling": reduce a √k × √k
  window to a single pixel by taking the maximum.

¹ under mild assumptions


    9 2 9 6 4 3
    5 0 9 3 7 5      max pooling      9 9 7
    0 7 0 0 9 0      ──────────→      9 5 9
    7 9 3 5 9 4

Figure 7.3: Graphical depiction of the max pooling function used in convolutional neural networks to reduce a 2 × 2 window to a single pixel by taking the maximum.

7.5 PReLU "parametric ReLU" [He et al. 2015]

• compromise between ReLU and Maxout: two flexible linear segments.

    φ(t; a) = { t,      t ≥ 0
              { a · t,  t < 0

Figure 7.4: Leaky ReLUs or PReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when t < 0, a leaky ReLU or PReLU will instead have a small negative slope a. For PReLUs the value of a is made into a parameter which is adaptively learned during training.

• each neuron has its own a

• the a's are trained via backpropagation:

    ∂φ/∂Z̃_lh = φ′(Z̃_lh; a_lh) = { 1,     Z̃_lh ≥ 0
                                  { a_lh,  Z̃_lh < 0

    ∂φ(Z̃_lh; a_lh)/∂a_lh = { 0,     Z̃_lh ≥ 0
                            { Z̃_lh,  Z̃_lh < 0
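A minimal NumPy sketch of these forward/backward formulas (my own illustration):

import numpy as np

def prelu_forward(z_tilde, a):
    # phi(t; a) = t for t >= 0, a*t for t < 0 (one slope a per neuron)
    return np.where(z_tilde >= 0, z_tilde, a * z_tilde)

def prelu_backward(z_tilde, a, delta):
    # gradients w.r.t. the pre-activations and the slopes a
    d_z = delta * np.where(z_tilde >= 0, 1.0, a)        # dphi/dz_tilde
    d_a = delta * np.where(z_tilde >= 0, 0.0, z_tilde)  # dphi/da
    return d_z, d_a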


• the weight initialization must be changed to:

    B_l ∼ N( 0, 2/((1 + a₀²) H_{l−1}) )

  with a₀ = 0.25 the recommended initial a

8 Lecture 13/05

8.1 Multi-class Classification

• task: assign each instance to exactly one of C classes

• distinguish from multi-label problems: each instance can have several labels
  (image: sunset, beach, surfing, ...; document: several topics)

• general approach: extend Y to a "one-hot", "one-of-C" vector of size C:
  – hard decisions: Y ∈ {−1, 1}^C, contains exactly one +1
  – posterior probabilities: Y ∈ [0, 1]^C, Σ_{k=1}^{C} Y_k = 1
  – scores: Y ∈ R^C

• goal: predict Ŷ_i = Y_i; we can obtain hard decisions from posteriors and scores via
  arg max_k Y_k

• some classifiers have natural generalizations to the multi-class case:
  – nearest neighbor: predict the class of the nearest neighbor (or the majority of
    several near neighbors)
  – Naive Bayes:
    ∗ learn 1D likelihoods for each feature and each class (DC likelihoods)
    ∗ computation of the posteriors via Bayes' rule is easy
  – decision trees:
    ∗ split selection criteria are naturally multi-class:
      · entropy: minimize N_left Σ_k (−p_left,k log p_left,k) + N_right Σ_k (−p_right,k log p_right,k)
      · Gini impurity: minimize N_left (1 − Σ_k p²_left,k) + N_right (1 − Σ_k p²_right,k)
    ∗ prediction of each leaf: ŷ = [p_leaf,k]
  – random forest:
    ∗ ensembles of decision trees:
      · train each tree on a random subset of the instances
      · only consider a random subset of the features when selecting a split
    ∗ prediction: average over all tree predictions, Ŷ = (1/T) Σ_t Ŷ_t (T = # trees)


  – neural network:
    ∗ define the non-linearity of the output layer via the "soft-max" function

        Z_Lk = exp(Z̃_Lk) / Σ_{k′} exp(Z̃_Lk′)

    ∗ train with the "cross-entropy loss"

        Loss = Σ_k ( −I[Y_i = k] log Z_Lk − I[Y_i ≠ k] log(1 − Z_Lk) )

    ∗ backpropagation (a numerical sketch follows below):

        δ_Lk = ∂Loss/∂Z_Lk = { −1/Z_Lk,       k = Y_i
                              { 1/(1 − Z_Lk),  k ≠ Y_i

        ∂Z_Lk/∂Z̃_Lk = ... = Z_Lk − Z_Lk²
        ∂Z_Lk/∂Z̃_Lk″ = ... = −Z_Lk Z_Lk″   (k″ ≠ k)

        ⇒ Jacobian J_kk′ = { Z_Lk(1 − Z_Lk),  k = k′
                            { −Z_Lk Z_Lk′,     k ≠ k′

        δ̃_Lk = ∂Loss/∂Z̃_Lk = Σ_{k″} δ_Lk″ J_k″k
             = { Z_Lk − 1 − Σ_{k″≠k} (Z_Lk″/(1 − Z_Lk″)) Z_Lk,        Y_i = k
               { 2Z_Lk − Σ_{k″≠k,k′} (Z_Lk″/(1 − Z_Lk″)) Z_Lk,        Y_i = k′ ≠ k

  – logistic regression: is just the special NN with a single layer, L = 1
  – both NN and LR are true multi-class algorithms because the predictions Z_Lk are
    not independent:
    ∗ trained jointly and coupled via the Jacobian
    ∗ the prediction is coupled via the softmax normalization
    ∗ all Z_Lk share the hidden weights
  – Support vector machine, 2-class objective:

      min_{β,b} ½||β||₂² + (λ/N) Σ_i max(0, 1 − Y_i(X_i β + b))

    equivalently

      min_{β,b} ½||β||₂² + (λ/N) Σ_i ξ_i   s.t.  Y_i(X_i β + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

    the ξ_i are called slack variables


  – generalization of [Weston & Watkins, 1999]:
    – we now have a (β, b) pair for each class: β → B [D × C], b → vector [b_k]
    – now the constraint must hold for every pair of classes
    – predict Ŷ_i = arg max_k (X_i B_k + b_k) (optional: "don't know" if too small)
    ⇒ objective:

        min_{B,b} ½||B||_F² + (λ/N) Σ_i Σ_{k≠Y_i} ξ_ik
        s.t.  X_i B_{Y_i} + b_{Y_i} ≥ X_i B_k + b_k + 2 − ξ_ik   f.a. k ≠ Y_i

    ⇒ N(C − 1) slack variables
    – compute the dual and train in the dual space ⇒ difficult

• generalization of [Crammer & Singer, 2001]:
  – it is sufficient when the constraint holds for Y_i's closest competitor
  – also absorb the threshold parameters b into B (by adding a feature X_i,0 = 1)
  ⇒ objective:

      min_B ½||B||_F² + (λ/N) Σ_i ξ_i   s.t.  X_i B_{Y_i} ≥ max_{k≠Y_i} (X_i B_k) + 1 − ξ_i

  – again train in the dual ⇒ easier, more popular than WW

• traditional belief: CS is better than WW because there are only N constraints instead
  of N(C − 1)

• but: [Drogan et al. 2011] claim that the crucial difference is actually the elimination
  of the intercepts b, because they lead to difficult equality constraints in the dual ⇒
  they eliminated b in WW and obtained better results than CS

• if your classifier is not generalizable to C > 2: reduce the multi-class problem to a
  set of 2-class problems
  – "one-vs-all" or "one-vs-rest": train C classifiers, where classifier k′ gets labels

      Y_i,k′ = { +1, k′ = Y_i
               { −1, k′ ≠ Y_i

    i.e. train binary classifiers h_k′(X) for k′ vs. the rest
    ∗ predict: Ŷ_i = arg max_{k′} h_k′(X_i)
    ∗ this only works if the outputs of the h_k′(X) are comparable in magnitude:
      · h_k′(X) return hard decisions: return Ŷ if exactly one h_k′(X) returns +1,
        otherwise "don't know"
      · h_k′(X) return posteriors: just take the max
      · h_k′(X) return scores: make the scores comparable; example: all h_k′(X)
        are linear classifiers h_k′(X) = X B_k′ + b_k′, which are comparable when
        ||B_k′||₂ = 1

  – "one-vs-one" or "all pairs": train C(C−1)/2 classifiers for all possible pairs (k′, k″)
    ⇒ train h_{k′,k″}(X) with labels

        Y_i = { +1, Y_i = k′
              { −1, Y_i = k″

    and don't use the instances of the other classes
    ∗ isn't this too expensive?
      Not always: if the h are kernel SVMs, training takes Ω(N²) time; OVR takes
      Ω(CN²), but OVO takes Ω( C(C−1)/2 · (2N/C)² ) = Ω(N²) ⇒ faster
  – prediction:
    ∗ variant 1: apply all classifiers and return the class with the most +1 votes
    ∗ variant 2 [Platt et al. 2000]: fill a vector with the class labels (in any order),
      apply the classifier for the first and last entry in the list ⇒ pop the losing
      label from the list, repeat until only one label remains ⇒ Ŷ_i. This can be
      written as a decision DAG (directed acyclic graph)

9 Lecture 20/05

9.1 Coding Matrices for multi-class problems

• we had one-vs-rest (OVR) and one-vs-one (OVO): we train L binary classifiers,
  where the classes get temporary labels Ỹ_il ∈ {−1, 0, 1}¹

  ⇒ write the labels as a C × L matrix M, s.t.

      M_kl = {  1, class k is positive in classifier l
             { −1, class k is negative in classifier l
             {  0, class k is not used in classifier l

  e.g. (M_ovr for C classes; for M_ovo the first columns encode the pairs (1,2), (1,3), ...):

      M_ovr = (  1 −1 ... −1 )      M_ovo = (  1  1 ...  0  0 ... )
              ( −1  1 ... −1 )              ( −1  0 ...  1  1 ... )
              ( ... ... ... )               (  0 −1 ... −1  0 ... )
              ( −1 ... −1  1 )              (  0  0 ...  0 −1 ... )

• in principle, M can be arbitrary; the best choice of M is an open problem (restriction:
  the rows of M must differ)
  – OVO, OVR are still good choices
  – [Dietterich et al. 1995]: Error-correcting output codes (ECOC). Idea: make
    the rows of M pairwise as different as possible ⇒ classification becomes robust
    against errors in a few of the L classifiers. If, for classes k and k′, at least L′
    elements of M differ, we can recover from ⌊(L′−1)/2⌋ errors by majority vote.

¹ 0 means that the instances of that class are not used to train classifier l
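A minimal NumPy sketch of Hamming-distance decoding for such a coding matrix (my own illustration):

import numpy as np

def ecoc_decode(h, M):
    # h: (L,) vector of crisp binary outputs in {-1, +1} from the L classifiers,
    # M: (C, L) coding matrix. Returns the class whose code row has minimal
    # Hamming distance (zero entries of M are ignored).
    dist = ((M != 0) & (M != h)).sum(axis=1)
    return int(np.argmin(dist))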

• [Sun et al. 2005]: choose M at random; simple and works well

• optimize M for your data and classifiers: current research
  – [Bautista et al. 2015]: compute the between-class covariance S ∈ [C × C], compute
    the PCA (eigenvector matrix EV). Initial guess M = sign(EV), then iterate to
    maximize error-correction while staying similar to EV

• stagewise optimization (various authors): add a new column to M until the overall
  performance is satisfying; choose the new column "optimally" with respect to the
  existing columns.

• prediction ("decoding"):


  – if all h_l(X) return crisp binary labels, assign X to the row of M with minimal
    Hamming distance.
  – if all h_l(X) return posteriors or scores: compute the loss of all rows and choose
    the k which minimizes the loss.

• coding via M is especially critical for boosting:

    h(X) = Σ_l a_l h_l(X)

  h(X) will be correct when the majority of the h_l are correct

• This is easy for binary classification: h_l(X) just must be a bit better than guessing.

• but: for multi-class, better than guessing means p_correct = 1/C + ε ⇒ a majority
  vote will not be correct

• We recover the "weak-learning condition" by reducing to a set of binary problems
  via M.

9.2 Gaussian Processes (or The Statistical Theory of Interpolation)

• so far, we always assumed that the training data are iid (p(X, Y) is stationary, but
  unknown)
  ⇒ the probability of the training set factorizes: p((X₁, Y₁), ..., (X_N, Y_N)) = Π_{i=1}^{N} p(X_i, Y_i)
  ⇒ the NLL is a sum: −log p(..) = −Σ_{i=1}^{N} log p(X_i, Y_i)
  ⇒ the loss (training error) is additive over instances ⇒ convenient optimization of
  the loss

• but: many applications do not fulfill the iid assumption:
  – time series
  – images: neighboring pixels usually belong to the same object ⇒ not independent

• three ways to deal with dependent data:
  1. define features that capture a neighborhood of each instance, e.g. image filters
     on a window of pixels.
     ⇒ the relationship between neighboring instances (pixels) is recorded in features
     describing the local changes
     ⇒ we can treat the data as approximately iid, given ("conditional on") these new
     features (⇒ chapter "Features")


  2. factorize p((X₁, Y₁), ..., (X_N, Y_N)) as well as possible, e.g.
     = (1/Z) p((X₁, Y₁), (X₂, Y₂)) · ... · p((X_{N−1}, Y_{N−1}), (X_N, Y_N))
     ("Markov assumption: only neighboring points are related")
     – graphical models find such factorizations systematically (⇒ chapter "GM")
  3. learn the full joint probability p((X₁, Y₁), ..., (X_N, Y_N)) ⇒ Gaussian processes
     – this is only tractable when we use a simple model for p(..)
     – obvious choice: multi-variate Gaussian distribution ⇒ everything can be
       computed in closed form
     – the typical application is regression; classification is modeled via regression
       of the posterior class probability.

• Consider a vector of values Y = [Y₁, ..., Y_N]. These are eventually function values
  Y_i = f(X_i), but we ignore the X_i for the moment. The indices i are now fixed and
  no longer permutable.
  – model their distribution by an N-dimensional Gaussian,

      p(Y) ∼ N(Ȳ, S) = (1/Z) exp( −½ (Y − Ȳ)ᵀ S⁻¹ (Y − Ȳ) )

    take N → ∞: Y becomes a function; we still write formally Y_∞ ∼ N(Ȳ_∞, S_∞),
    an "infinite-dimensional Gaussian"
  – in practice, we only work with finitely many points: these can be interpreted as a
    finite-dimensional marginal of the infinite-dimensional Gaussian (i.e. integrate
    out all points we are not interested in)
  – fortunately, for a Gaussian all marginals are again Gaussian.

10 Lecture 22/05

10.1 Gaussian Processes

• Generalize from a finite vector [Y₁, ..., Y_N] of dependent variables Y_i to a function
  Y = f(X) by taking N → ∞.

• Model the probability of f(X) as a Gaussian, f(X) ∼ N(f̄(X), S)¹:

    f(X) ∼ p(f; f̄, S) = (1/Z) exp( −½ ⟨f − f̄, f − f̄⟩_{H(S)} )

  – Under our model p(f; f̄, S), all functions with finite norm ⟨f − f̄, f − f̄⟩_{H(S)}
    have non-zero probability.²
  – A function f has high probability if it is similar to f̄ and conforms to the
    covariance structure given by S ⇔ ||f − f̄||_{H(S)} is small.
  – S(X, X + ∆X) decreases slowly with ∆X ⇒ neighboring points are highly
    correlated ⇒ f should be smooth.³
  – S(X, X + ∆X) decreases quickly ⇒ noisy functions are also probable⁴
  – S must be chosen by the designer to model the properties of the application.

• in practice, we are only interested in finitely many points of f (training & test points)
  ⇒ we create marginal distributions of N(f̄, S) by integrating out all points we don't
  care about
  ⇒ general property: any marginal of a Gaussian is again Gaussian
  let [X₁, ..., X_N] be the training locations with observed responses [Y₁, ..., Y_N]ᵀ = Y,
  and [X_{N+1}, ..., X_{N+N′}] the test locations where we want to find
  [Ŷ_{N+1}, ..., Ŷ_{N+N′}]ᵀ = Y′

  ⇒ the marginal distribution of [Y₁, ..., Y_{N+N′}] is N( [Ȳ; Ȳ′], S_{1:N+N′} )⁵

¹ f̄ = E(f), S = E[(f(X) − f̄(X))(f(X′) − f̄(X′))], the covariance or kernel function
² ⟨f, g⟩_{H(S)} := ∫ F(f) F(g) / F(S₀) dω, where S₀(X) = S(X, 0) is the kernel function centered at the origin and F denotes the Fourier transform
³ the plot shows a rather smooth function
⁴ the plot shows a "squiggly" function
⁵ specialize the kernel to the points [X₁, .., X_{N+N′}], and Ȳ = [f̄(X_i)]


• we simplify by setting f̄ = 0, because we can always subtract f̄ in preprocessing
  and add it back after the analysis: Ŷ_final = Ŷ + f̄(X)

    p(Y, Y′) ∝ exp( −½ [Y; Y′]ᵀ S⁻¹_{1:N+N′} [Y; Y′] )
    p(Y) ∝ exp( −½ Yᵀ S⁻¹_{1:N} Y )

• we are interested in

    p(Y′|Y) = p(Y, Y′)/p(Y)

  to compute it, we need S̃ = S⁻¹; we partition S according to known and unknown:

    S = [ A  B ; Bᵀ C ],   S̃ = [ Ã  B̃ ; B̃ᵀ C̃ ],   by definition S̃S = SS̃ = 1_{N+N′} ⁶

• we get

    p(Y′|Y) ∼ N( BᵀA⁻¹Y, C − BᵀA⁻¹B )

• introduce the kernel function k(X, X′), the kernel matrix K with K_ij = k(X_i, X_j),
  and the kernel vector k⁷ with k_i = k(X′, X_i), where X′ is a test point; κ = k(X′, X′)

• we can compute the test responses one point at a time, i.e. we can set N′ = 1:

    p(Y′|Y) ∼ N( k(X′)ᵀ K⁻¹ Y, κ − k(X′)ᵀ K⁻¹ k(X′) )

  fundamental interpolation equation:

    Ŷ = E[Y′ = f(X′)] = k(X′)ᵀ K⁻¹ Y

  uncertainty of the interpolated point⁸:

    var[Ŷ] = κ − k(X′)ᵀ K⁻¹ k(X′)

• define interpolation coefficients: Ỹ = K⁻¹Y can be precomputed⁹ because it is
  independent of X′

• Ŷ = k(X′)ᵀ Ỹ

⁶ see Bishop p. 307 for the derivation of S̃
⁷ actually a cursive k
⁸ note: independent of Y
⁹ in practice: solve the linear system K Ỹ = Y ... avoid computing K⁻¹. Two typical algorithms:
  1. if K is dense: Cholesky decomposition K = LLᵀ
  2. if K is sparse: conjugate gradients
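A minimal NumPy sketch of these equations (my own illustration; following footnote 9 it factorizes K once with a Cholesky decomposition instead of computing K⁻¹; the squared-exponential kernel and the jitter argument are my own choices):

import numpy as np

def gp_interpolate(X_train, Y, X_test, kernel, jitter=0.0):
    # GP posterior mean and variance at 1D test points.
    K = kernel(X_train[:, None], X_train[None, :]) + jitter * np.eye(len(X_train))
    L = np.linalg.cholesky(K)                               # K = L L^T
    Y_tilde = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # Ytilde = K^{-1} Y
    mean, var = [], []
    for x0 in X_test:
        k = kernel(x0, X_train)                  # kernel vector k(X')
        kappa = kernel(x0, x0)                   # k(X', X')
        v = np.linalg.solve(L, k)                # v^T v = k^T K^{-1} k
        mean.append(k @ Y_tilde)                 # hat{Y} = k(X')^T K^{-1} Y
        var.append(kappa - v @ v)                # kappa - k^T K^{-1} k
    return np.array(mean), np.array(var)

# e.g. a squared-exponential kernel with bandwidth rho:
rbf = lambda x, x2, rho=1.0: np.exp(-0.5 * ((x - x2) / rho) ** 2)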

• example: linear interpolation. Assume that X is 1D and the X_i are equidistant
  (a grid); w.l.o.g. we set X_i = i for the training points

• k(X, X′) = (1 − |X − X′|)₊,  K_ij = k(X_i, X_j) = k(i, j) = (1 − |i − j|)₊,  i, j ∈ {1, ..., N}
  ⇒ K = 1_N, κ = 1, and with i = ⌊X′⌋¹⁰:

    k(X′, X_i) = { 0,      |X′ − X_i| ≥ 1
                 { 1 − t,   i = ⌊X′⌋
                 { t,       i = ⌊X′⌋ + 1

    Ŷ(X′) = k(X′)ᵀ Y = (1 − t) Y_i + t Y_{i+1}

    var(Ŷ) = κ − k(X′)ᵀ K⁻¹ k(X′) = 1 − k(X′)ᵀ k(X′) = 1 − (1 − t)² − t² = 2t(1 − t)

¹⁰ t = X′ − ⌊X′⌋

11 Lecture 27/05
Reminder: E(Y′) = Ŷ = k(X′)ᵀ K⁻¹ Y = k(X′)ᵀ Ỹ

Kernel functions: A function K(X, X′) is a kernel iff it is positive definite (Mercer's
condition).

• New kernel functions can be constructed from existing ones with easy operations:
  – a positive linear combination of kernels is a kernel: K_new(X, X′) = α₁K₁(X, X′) +
    ... + α_M K_M(X, X′), α₁, ..., α_M > 0
  – a product of kernels is a kernel: K_new(X, X′) = K₁(X, X′) K₂(X, X′)
  – we may map the X into an arbitrary feature space before applying the kernel
    function: K_new(X, X′) = K₁(φ(X), φ(X′))
  – the exponential of a kernel is again a kernel: K_new(X, X′) = exp(K₁(X, X′))
  – [...]

• two big classes of popular kernels: "radial basis functions" and "tensor product
  kernels"

• radial basis functions (RBF): K(X, X′) = K(r), r = ||X − X′|| the distance between
  X and X′ in some norm (usually Euclidean or weighted Euclidean)
  – squared exponential (aka Gaussian): K(r) = exp(−½(r/ρ)²), where ρ = "bandwidth"
    of the kernel
  – γ-exponential: K(r) = exp(−(r/ρ)^γ), γ ∈ [0, 2]
  – Matérn kernels (less smooth than the squared exponential):

      K(r) ∼ (√(2γ) r/ρ)^γ k_γ(√(2γ) r/ρ)

    with k_γ a modified Bessel function
    ∗ γ = 1/2: K(r) = exp(−r/ρ), the Ornstein-Uhlenbeck kernel for Brownian
      motion — very rough
    ∗ γ = 3/2: K(r) = (1 + √3 r/ρ) exp(−√3 r/ρ)
    ∗ γ = 5/2: K(r) = (1 + √5 r/ρ + (5/3)(r/ρ)²) exp(−√5 r/ρ)
  – inverse quadrics: smoother than the squared exponential: K(r) =
    1/(1 + (1/2α)(r/ρ)²)^α, α > 0.
    ∗ α = 1/2: K(r) = 1/√(1 + (r/ρ)²)


    ∗ α = 1: K(r) = 1/(1 + ½(r/ρ)²)
  – for 2D feature spaces: thin plate spline K(r) = r² log r
    ∗ only "conditionally positive definite" — it is positive definite after performing
      linear regression (i.e. TPS is applied to the residuals of linear regression)
    ∗ advantage: it's usually not necessary to optimize the bandwidth
    ∗ TPS is the minimum-energy surface of an (infinitely thin) elastic plate
      attached to the training points:
      · it minimizes the curvature integral ∫_{R²} (f_xx² + 2f_xy² + f_yy²) dx dy

• all kernels so far have infinite support: K(r) > 0 even for very large r ⇒ the kernel
  matrix K is dense, i.e. expensive to invert when N is big (the solution of the linear
  system K Ỹ = Y takes O(N³))

• A kernel with compact support (i.e. K(r) = 0 if r > r_max) leads to a sparse kernel
  matrix ⇒ sparse solvers need O(N)
  – truncate a non-compact kernel — an approximation
  – define compact kernels, e.g. Wendland splines K(r) = (1 − r/ρ)₊^γ · poly_γ(r/ρ)
    ∗ choose γ and poly according to the feature space dimension and the required
      # of derivatives
      · smooth, but not differentiable: K(r) = (1 − r/ρ)₊^γ, γ = ⌊D/2⌋ + 1
      · 1-times differentiable: K(r) = (1 − r/ρ)₊^γ (γ r/ρ + 1), γ = ⌊D/2⌋ + 2
      · [...]

• radial basis functions are best if the training points Xi are irregularly arranged —
“scattered data interpolation”

• if the Xi form a regular grid, tensor product kernels — allow to work in one dimension
at a time

• tensor product kernel: K (X ,X 0 ) = K 1 (X 1 ,X 10 ) · · · KD (X D ,X D0 ), each being 1D kernels

• squared exponential:
     
K (x,x 0 ) = exp − 21 ||X − X 0 || 2 = exp − 12 (X 1 − X 10 ) 2 · · · exp − 21 (X D − X D0 ) 2

– only kernel that is both a RBF and a tensor product

• B-splines (Xi+1,j − Xi,j = 1 unit grid, ρ = 1, i.e. preprocess data accordingly)2 :


Z x+1/2  1, −1/2 < x ≤ 1/2

k (x ) = Bγ (x ) = Bγ −1 (x )dx = B 0 ∗ Bγ −1 , B 0 (x ) = 
x−1/2
0

2 to simplify notation

46
 1 − |x |, |x | ≤ 1

B 1 (x ) = 
0

4 −x ,
3




2 |x | ≤ 12
B 2 (x ) =  ( 2 − |x |) 2 , 12 ≤ |x | ≤
 1 3 3

 2 2

0

3 − x + 2 |x | , |x | ≤ 1
 2 2 1 3



B 3 (x ) =  (2 − |x |) 3 , 1 < |x | ≤ 2
 1

 6

0

1 x 2
 
−2 ρ
B ∞ (x ) ∼ e

• kernel matrix is sparse, e.g. B 0 and B 1 : K = IN , B 2 and B 3 : K is tridiagonal. Tri-


diagonal systems are easy to solve:
– Thomas algorithm
– recursive filters



 1, x =0
• cardinal functions K (x ) =  x ∈ {±1, ±2, ±3, ...}

 0,


, 0

– B 0 and B 1 are cardinal
sin(πx )
– ideal interpolator: sinc(x ) = πx



 1 − 25 x 2 + 32 |x | 3 , |x | ≤ 1
– Catmull-Rom spline : C (x ) =  2 − 4|x | + 2 x − 2 |x | , 1 < |x | ≤ 2
3  5 2 1 3



0

– advantage: the kernel matrix K = IN , giving us Ŷ (X 0 ) = k(X 0 ) TY
– any 1D kernel defines a cardinal function: κ (X 0 ) T = k(X 0 ) TK −1

3 compact support version of the sinc function

47
12 Lecture 29/05
• on the grid, Gaussian process interpolation is just filtering (convolution)

• example: Catmull-Rom spline C (X ), y 0 = k(X 0 ) TK −1Y 1


To compute k(X 0 ), place a kernel function centered at X 0, i 0 = bX 0c, t = X 0 − i 0 ,
k(X 0 ) = [0, ..., 0,C (−t − 1) ,C (−t ) ,C (−t + 1) ,C (−t + 2) , 0, ..., 0]. The interpolation is
| {z } |{z} | {z } | {z }
index −2 −1 0 1
then the following filter

y 0 = k(X 0 )Y = C (−t − 1)Yi 0 −1 + C (−t )Yi 0 + C (−t + 1)Yi 0 +1 + C (−t + 2)Yi 0 +2 .

• example: B-spline B 3 : now K , I, but a tridiagonal matrix. Define interpolation


coefficients Ỹ = K −1Y
⇒ interpolation is analog to C-Rom:

y 0 = k(X 0 )Y = B 3 (−t − 1)Ỹi 0 −1 + B 3 (−t )Ỹi 0 + B 3 (−t + 1)Ỹi 0 +1 + B 3 (−t + 2)Ỹi 0 +2

Computing Ỹ is a preprocessing of Y . Since K is tridiagonal K −1Y can also be


implemented by filtering (specifically, a pair of recursive filters [Unser et al. 1991]).
Intuitive effect of K −1 : since B 3 is a smoothing filter, we would not get interpolation
when Y 0 = k(X 0 ) TY . The pre-filtering of Y with K −1 exactly counters the smoothing
effect at the grid points. Ỹ = K −1Y is sharpening

12.1 Uncertainty of GP interpolation


case 1 Yi are assumed to be noise-free ⇒ we have to keep the values intact ⇒ interpolation
Ŷi = k(Xi ) TK −1Y = Yi and the variance var[Y 0] = k (X 0,X 0 ) − k(X 0 ) TK −1 k(X 0 ) = 0 is
X 0 = Xi

case 2 Yi are noisy: Yi = f (Xi ) + εi , εi ∼ N (0,σ 2 ). The Gaussian process becomes:


|{z}
noise-free solution

Y ∼ N ( f (= 0),K + σ I)

where K is the uncertainty about the true function f , whereas σ 2 I is the uncertainty
in the measurements of Yi .
1Y values at grid points, K = I for Catmull-Rom

49
12 Lecture 29/05

• We need the conditional probability for unseen points, given the training
points: p(Y 0 |Y ) = p(Y ,Y 0 )/p(Y ) 2 . The computations are almost the same as
with σ 2 = 0, giving us

K +σ IN )Y + |{z}
Ŷ = E(Y 0 (X 0 )) = k(X 0 ) T (|{z} Eε 0
=A =0
var(Y ) = k (X ,X ) + σ − k(X ) (K + σ IN ) k(X ).
0 0 0 2 0 T 2 −1 0

This does not interpolate anymore: Ŷ = K (K + σ 2 I) −1Y , Y , giving us a


denoised version of Y .
• σ 2 acts as a regularization parameter, it shrinks Y 0 towards 0 (more generally,
towards f ).
• Ŷ = k(X 0 ) T (K + σ 2 I) −1Y is exactly kernel ridge regression (see ML1), but now
derived statistically.

12.2 Application [Snoek et al. 2012]: GP to optimize the hyper


parameters of a learning algorithm
• Learning Algorithm 1 (LA1) solves some problem of interest, Learning Algorithm 2
(LA2) optimizes LA1.

• standard approach to hyperparameter optimization of LA1: grid search with cross-


validation, but: CV is expensive, grid search is expensive, because exponentially
many candidate parameter sets (in the # of parameters)

• do CV for a few hyperparameter sets θi and compute Lossi

• use {(θ 1 ,Lossi )} as training data for LA2

• find θ 0 which minimizes Loss 0 in LA2


⇒ we can try many candidates θ 0 (≈ 106 ) by cheap interpolation in LA2 (= GP)

• perform CV on LA1 only with our best candidate θ best


0

• repeat

• precisely, the best candidate minimizes StdLoss


0
(Loss 0 ) “probability of improvement cri-
terium” or a more sophisticated criterion (better).

• suggest to use Matérn- 52 kernel in LA2

A + σ IN B
!
2 where S= , compare to an earlier lecture
BT C + σ 2 IN 0

50
12.3 Application: GP classification

12.3 Application: GP classification


• a standard GP learns a function Y 0 = f (X 0 ) : RD → R

• for classification, we need a posterior class probability p(Y |X 0 ) : RD → [0, 1]

• idea: inspired by logistic regression p(Y |X 0 ) = σ (X 0β ), f (X 0 ) = X 0β, p(Y |X 0 ) =


σ ( f (X 0 )). Now we replace f (X 0 ) by a GP estimate: instead of f (X 0 ) = X 0β we use
f (X 0 ) = k(X 0 ) T (K +σ 2 I) −1 [f (X 1 ), ..., f (X N )]T but it’s not so easy because f appears
on the LHS and RHS

• introduce a latent variable Z and define Zi = f (Xi ).


Z
p(Y |X ) = p(Y |X ,Z )p(Z |X )dZ

simplify by making independence assumptions: Yi ⊥ ⊥ X |Z (X ) ⇒ P (Y |X ,Z ) =


p(Y |Z ), and Y ⊥
⊥ Y |Z ,Z ⇒ p({Yi }|{Zi }) = i p(Yi |Zi ) giving us
0 0 Q
Z
p(Y |X ) = p(Y |Z ) p(Z |X ) dZ
| {z } | {z }
σ (Z ) GP reg.

• problem: how to determine Zi


p(Z |X ) actually depends on the training data D = {(Xi ,Yi )}, p(Z |X ,D).
[Zi ] = arg max p(Z |X ,D)
Z

• to model p(Z |X ,D), we make the Laplace approximation: in a neighborhood of the


optimum [Ẑi ] p(Z |X ,D) looks like a Gaussian. ⇔ we use the second order Taylor
expansion of log p(Z |X ,D)

log p(Z |X ,D) ≈ log p(Ẑ |X ,D) − 21 (Z − Ẑ ) TH (Z − Ẑ )

[the linear term is missing, because Ẑ is a maximum]


H is the negative Hessian of log p(Z |X ,D) at Ẑ

∂2
H =− log p(Z |X ,D)
∂Z 2

Z =Ẑ

1
Ẑ |X ,D) − (Z −Ẑ )H (Z −Ẑ )
p(Z |X ,D) ≈ e|log p(
{z } e 2 = N (Ẑ ,H −1 )
const

⇒ choose H −1 as an appropriate kernel and estimate Ẑ

51
13 Lecture 03/06

13.1 GP classification
• p(y 0 = ±1|X 0 ) = Z 0 p(y|Z 0,X )p(Z 0 |X )dZ 0 latent variable Z : simplifies matters
R

because of independence assumptions 1

• in order to make predictions p(Z 0 |X 0 ) = p(Z 0 |X 0, D) we need the Zi values for the
training points E(Z 0 ) = kT (X 0 )K −1Ẑ
⇒ training = determine Ẑ
N ) = p(Ẑ |{X ,Y } N )p({X ,Y } N ) = p({Y }|Ẑ )p(Ẑ |{X })p({X })
• p(Ẑ , {Xi ,Yi }i=1 i i i=1 i i i=1 i i i

N
Ẑ = arg max p(Z |{Xi ,Yi }i=1 ) = arg max p({Yi }|Z )p(Z |{Xi })
Z Z
= arg max = log p({Y }|Z ) + log p(Z |{Xi })
Z | {zPi } | {z }
log p(Yi |Z i )= log σ (Yi Z i )
P
i i | {zGP }
1 1 N
=− 2 Z T K −1Z + 2 log(K )+ 2 log 2π

X
⇒Ẑ = arg max = − log (1 + exp(−Yi Zi )) − 21 Z TK −1Z + const = ψ (Z )
Z i
∂ψ (Z )
= v − K −1Z = 0
!
∂Z
 t 1 − π1 
where v =  ...  , ti = 2Yi − 1 ∈ {0, 1},πi = σ (Zi )
t N − π N 

∂ 2ψ (Z )
= W − K −1
∂Z2 2

−π1 (1 − π1 ) 0 
where W =   ... ... 
−π N (1 − π N ) 

 0
Update step: Z (t+1) = Z (t ) − (W (t ) − K −1 ) −1 (v (t ) − K −1Z (t ) ), Ẑ = Z (t→∞)

[numerically better formulation ⇒ Rasmussen & Williams 2006]


1 see last lecture

53
13 Lecture 03/06

• predictions:
– solve p(Y |X 0 ) = Z 0 σ (YZ 0 )GP (Z 0 |Ẑ ,X 0 )dZ 0, no closed form solution
R

⇒ solve numerically or use the normal CDF instead of σ (t ) (but then we must
∂ψ ∂ 2ψ
adjust ∂Z , ∂Z 2 during training)
– if we only need a decision function:

ŷ = arg max p(Y |X 0 ) = sign(E(Z 0 )) = sign(kT (X 0 )K −1Ẑ )


Y

13.2 The Bayesian Interpretation of GP regression (and their relation to


“reproducing kernel Hilbert spaces”(RKHS))

• vector space Y ∈ RN , scalar prod. hY ,Y 0i = Y TY 0, how to visualize Y if N > 3,


“parallel coordinates”

Figure 13.1: A depiction of the use of parallel coordinates as plotting technique for multi-
variate data. It allows one to see clusters in data and to estimate other statistics
visually. When we are using parallel coordinates points are represented as
connected line segments. Each vertical line represents one attribute of the
car data set. One set of connected line segments represents one data point.
Points that tend to cluster will appear closer together. The dataset is clustered
in dependence of the number of cylinders given in the legend in the upper
right (MPG-miles per gallon).

• Hilbert space: take N → ∞ ⇒ parallel coordinates turn into a function f (X ),


f ,д = f (x 0 )д(x 0 )dx 0, x 0 ∈ RD

R

• vector with generalized scalar product: arbitrary bilinear form hY ,Y 0i = Y TAY 0


example: PCA, QDA

54
13.2 The Bayesian Interpretation of GP regression (and their relation to “reproducing kernel Hilbert spaces”(RKHS))

• doing the same in a Hilbert space gives the RKHS2 :


– kernel function K (X ,X 0 ) = K (X 0,X ) pos.def.,
– centered kernel function at x 0 : Kx 0 (X ) = K (X ,X 0 = x 0 )
H(k) is of RKHS:
(i) ∀x 0 ,Kx 0 ∈ H (K )
(ii) reproducing property of the scalar product:

∀x 0 ,∀f (x ) ∈ H (k ) : f ,Kx 0 H (k ) = f (x 0 )

• to define the scalar product explicitly, we need the convolution operator


Z Z
( f ∗ д)(x ) = f (x − x )д(x )dx =
0 0 0
f (x 0 )д(x − x 0 )dx 0

• use convolution to define the inverse kernel function centered at x 0 = 03


Z
k 0 ⇔ (k 0 ∗ k 0 )(x ) = δ (x ) =
−1 −1
k 0−1k (x − x 0 )dx 0

• to actually compute k 0−1 , its best to use Fourier transform: convolution theorem:
F ( f ∗ д)(x ) = F ( f )F (д)
!
1
k 0 ⇔ F (k 0 )F (k 0 ) = 1,k 0 = F
−1 −1 −1 −1
F (k 0 )
but often, no closed form expression for k 0−1 (x ) exists, no problem in practice, because
we can always explicitly invert the kernel matrix for our finite training set
• scalar product:
Z Z
f ,д f (x 0
)(k 0−1 ∗ д)(x )dx =
0 0
(k 0−1/2 ∗ f )(x 0 ) · (k 0−1/2 ∗ д)(x 0 )dx 0


H (k ) :=

• this fulfills the reproducing property: f ,kx 0 H (x ) = f (x 0 )




Z
f ,kx 0 H (x ) = f (x 0 ) (k 0−1 ∗ kx 0 )(x 0 ) dx 0


| {z }
(∗)
Z
= f (x 0 )δ (x 0 − x 0 )dx 0 = f (x 0 )

Z Z
(∗) k 0−1 (x 00 )kx 0 (x 0 − x )dx =
00 00
k 0−1 (x 00 )k 0 ((x 0 − x 0 ) − x 00 )dx 00 = δ (x 0 − x 0 )

if f (x ) = kx 1 (x ): kx 1 ,kx 0 H (k ) = Kx 1 (x 0 ) = Kx 0 (x 1 ) = K (x 0 ,x 1 ) = K 01 (kernel matrix




element)
2 Note on the notation: k,K might sometimes need to be exchanged
3 analog to inverse matrix M −1 ⇔ M −1 M = 1

55
13 Lecture 03/06

• application to GP regression:
– given training data D, find the function f (x ) that has maximum a posteriori
probability p( f |D)
– expand according to Bayes p( f |D) ∝ p(D| f ) p( f )
| {z } |{z}
training error prior for f
 
– data probability: squared loss p(D| f ) = exp i (Yi − f (Xi )) 2
1 P
2σ 2


– prior: choose a Gaussian process p( f ) = exp − 21 f , f

H (k )
– prior experience encodes the expected smoothness of f in the kernel K and
prefers f that conforms to this smoothness requirement. ⇔ f , f small


⇔ p( f ) high

1 X
fˆ = arg max p( f |D) = 2 (Yi − f (xi )) 2 + f , f H (k ) (∗)


f σ i | {z }
regularization

– to solve this, we need the “representer theorem” [Kimeldorf & Wahba 1971]
Thm: In any problem (∗), the optimal solution can be expressed as a linear com-
bination of kernel functions centered at the training points fˆ (x ) = i αi Kxi (X ),
P
just determine the αi
• insert the representer theorem into (∗)
D E DX X E X D E
fˆ, fˆ = αi Kx i , αi Kx i = αi α j Kxi ,Kx j
H (k )
i,j
X
= αi α j K (Xi ,X j ) = α TKα
i,j

Expansion of the first term of (∗):


1 X 2 2 X 1 X
= 2 Yi − 2 Yi f (Xi ) + 2 f (Xi ) 2
σ i σ i σ i
extending the second and third summand
X XX X X
( f (x )) =
2
( α j Kx j (xi )) =
2
α j αk kx j (Xi )κxk (Xi ) = α TK 2α
i i j jk i
X X X
Yi f (Xi ) = Yi α j kx j (Xi ) = α TKY
i i j

inserting into (∗) again gives:


1 X 2 2 T 1
(∗) = 2 Yi − 2 α KY + 2 α TK 2α + α TKα
σ i σ σ
∂(∗) 2 2
= − 2 KY + 2 K 2α + 2Kα = Y Kα + σ 2α = 0
!
∂α σ σ
⇒ α = (k + σ 2 1) −1Y Ŷ = k(X 0 ) Tα
aka fundamental interpolation equation

56
14 Lecture 10/06

14.1 Graphical Models


• task: model joint probability p(X 1 , ...,X D ), but:
– direct modeling is intractable
– no obvious factorization exists (e.g. for iid p(X 1 , ...,X D ) = j p(X j ))
Q

• idea: use conditional independence between variables to factorize as good as possible,


which is much weaker than unconditional independence, but our only chance

• “graphical”: represent conditional independence by means of a graph1

• example 1: correct handling of independence/association is not at all obvious


– problem: we know that Alice has two children that are not twins. What’s the
probability that both are boys?2
1. you have no additional information: p1
2. we meet Alice with one of her children, who is a boy: p2
3. we meet Alice with one of her children, who is a boy and she says “This is
my first-born”: p3
4. we meet Alice with one of her children, who is a boy and she says “He
was born on a Sunday”: p4
5. we meet Alice with one of her children, who is a boy and she says “Today
is his birthday”: p5
p1 , p2 , p3 , p4 , p5 . The probabilities are (p1 ,p2 ,p3 ,p4 ,p5 ) = ( 41 , 12 , 31 , 13
27 , 1459 )
729

Let A be first-born, B second-born child:


1. p(A = boy,B = boy) = P (A = boy)P (B = boy) = 1
4
P (A=boy,B=boy)
2. p(A = boy,B = boy|A = boy) = P (A=boy) = 1/4
1/2 = 1
2
p(A=boy,B=boy)
3. p(A = boy,B = boy|A = boy ∨ B = boy) = P (A=boy∨B=boy) = 1/4
3/4 = 1
3
4. wrong model:

p(A=boy∧B=boy∧(A=Sun∨B=Sun)|(A=boy∨B=boy)∧(A=Sun∨B=Sun) )
1 There will probably be a lot of plots in this chapter, which won’t be reproduced here, see e.g. Barber,
Koller&Friedman for those.
2 We could just ask Bob.

57
14 Lecture 10/06

p(A = boy,B = boy)


=
p(A = boy ∨ B = boy)

⇒ missing, that it’s the same person who is a boy and born on a sunday

⇒ correct model:

p(A=boy,B=boy∧(A=Sun∨B=Sun)|(A=boy∧A=Sun)∨(B=boy∧B=Sun) )

13 2 · 7 − 1
p4 = =
27 4 · 7 − 1

5. see exercise

• example 2: Simpson’s paradox: if dependencies are treated incorrectly, you can


turn a statement into its opposite, using the same data
– ≈ 1970 U Berkley was sued for preferring men over women
male:app male:adm male:% fem:app fem:adm fem:%
total 2590 1192 46 1835 557 30.4
A 825 512 62 108 89 82
B 560 353 63 25 17 68
C 325 120 37 593 202 34
D 417 138 33 375 131 35
E 191 53 28 393 94 24
F 272 16 6 341 24 7
– 4 out of 6 departments prefer women
– 5 out of 6 departments prefer the minority
– in total: men are highly preferred
⇒ explanation: women tend to apply for highly competitive fields
– statistical mistakes:
1. an association does not in general imply causality
∗ to determine causality, better methods are needed
· preferred: randomized controlled experiment (group applicants at
random and force each group into a particular field ⇒ dependency
between sex & field is broken by “active intervention” (⇒ inter-
ventioned dataset))
· often this is illegal or unethical or impossible ⇒ have only “observational
dataset” ⇒ causality is a very difficult problem ⇒ later

58
14.1 Graphical Models

Gene Gene

Smoking Cancer Smoking Cancer Smoking Cancer

(a) (b) (c)

Figure 14.1: Three possible models for smoking and cancer. (a) Direct causal influence; (b)
indirect causal influence via a latent common cause (Gene); (c) incorporated
model with both influences.

2. omitted variable bias: apparent association could be causal, but can also
have a common cause (smoking → lung cancer, gene → lung cancer and
gene → smoking) or a mediating property (sex → admission, but sex→field
→ admission)
if the additional variable is ignored (marginalized out), very misleading
conclusions will be drawn

• graphical models are a tool to treat conditional independence systematically


– two kinds:
∗ directed (graphs): based on chain rule of probability
∗ undirected (graphs): based on the Gibbs probability distribution p(X ) =
Z exp(−E (X ))
1

• chain rule: p(X 1 , ...,X D ) = p(X D |X D−1 , ...,X 1 ) · · · p(X 2 |X 1 )p(X 1 )


draw the decomposition as a directed graph3 which is called a “Bayesian network”

• trick: can drop arcs when variables are conditionally independent (remember condi-
tional independence does not in general imply general independence)

• goal: drop as many arcs as possible ⇒ simplest problem representation


How many parameters are needed to specify the probability? Let X j ∈ {1, ..,C j }.
– p(X 1 , ...,X D ) needs j C j − 1 parameters
Q

– full factorization needs as many parameters


– if we drop arcs, the number of parameters reduces

3 see earlier comment

59
15 Lecture 12/06

15.1 Bayesian Networks (directed graphical models)


• idea:
1. factorize the joint probability p(X 1 , ...,X D ) according to the chain rule
2. represent the factorization as a directed graph
3. use conditional independence to remove as many edges as possible ⇒ simpler
problem (reduced)

• catch:
– every permutation of the variables results in a different, but equivalent factor-
ization: D! possibilities
– but in some factorizations we can remove many more edges
⇒ goal: use the permutation that results in the fewest edges after step 3. the
best permutation tends to be the one that results in a causal graph (i.e. arc
direction = cause → effect)

• how to identify causal relationships:


1. use domain knowledge (past → present→ future, property → measurement1 )
2. perform randomized controlled experiments: experimenter intervenes to
break potential dependencies, so that other dependencies can be analyzed in
isolation (exclude the possibility of a common cause aka. confounder)
3. when controlled experiments are impossible/illegal/unethical, estimate causal-
ity from purely observational data
– this is very difficult and a hot research topic ⇒ later

• main task in BN:


– prediction: in contrast to traditional methods, where prediction is relatively
easy, here sophisticated inference algorithms are needed
∗ compute probabilities not explicitly represented in the model:
· marginals p(X j ) = p(X 1 , ...,X D )
P

· marginals given evidence on some variables p(X j |X j 0 = e j 0 )


· likewise for uncertain evidence
1 can be violated in quantum mechanics

61
15 Lecture 12/06

∗ compute the most probable variable assignment (maximum a posteriori


(MAP) solution ) or several highly probable solutions (k-best)
∗ support decision making (“will surgery help?”)
– training:
∗ parameter learning: given the graph, learn the conditional probabilities of
the decomposition
∗ structure learning: identify the optimal (ideally: causal) graph

• two popular kinds of BN:


– temporal models: causality is implied by time, e.g. speech recognition
– causal models: give an explanation of the observed behavior that can be
understood by domain experts

• Pearl’s basic network construction algorithm


1. identify all variables relevant to the problem (missing variables may lead to
Simpson’s paradox)
2. arrange the variables in a useful order (ideally: causal)
3. for j = 1, ...,D (D = # variables)
– add a node for X j to the network
– find a minimal subset PA(X j ) ⊆ {X 1 , ...,X j−1 } = S j−1 such that
Xj ⊥⊥ (S j−1 \PA(X j )|PA(X j )) PA(X j ) are called the “parents” (e.g. use a
statistical test like χ 2 test for conditional independence)
– add arcs ∀X j 0 ∈ PA(X j ) : X j 0 → X j
⇒ the graph represents the factorization p(X 1 , ...,X D ) = j p(X j |PA(X j ))
Q

4. learn the parameters of the distributions p(X j |PA(X j )) for all j (p(X j |PA(X j ))),
can be represented by conditional probability tables (CPT) or parametric models
⇒ Bayesian or Belief Network (BN)

• there are three fundamental configurations in a BN


– chain (“causal chain”) A → B → C
– diverging connection (“common cause”) A ← B → C
– converging connection (“common effect”) A → B ← C
⇒ behave interestingly when B is marginalized out or there is evidence on B

• chain:
– if B is marginalized, we just loose information: p(C |A) = B p(C |B)p(B|A)
P
(uncertainty increases)
– if B is known (B = b), then C is independent of A : A ⊥
⊥ C |B (dictated by the
graph structure, otherwise A must be in PA(C))

62
15.1 Bayesian Networks (directed graphical models)

• common cause:
– if B is marginalized, an association between A and C results (the arrow direction
does not follow from the graph, but often from the application) example:
Simpson’s paradox, Berkley admission
– if B is known: A ⊥
⊥ C |B

• common effect:
– if B is not marginalized but unknown A ⊥
⊥C
– if B is marginalized: unconditional independence still holds
– if B is known, A and C become conditionally dependent A 6⊥
⊥ C |B (“Bergson’s
paradox”)

• example: Burglary alarm2


p(B = 1) = 0.01, marginal p(A = 1) = 0.016

B p(A = 1|B) p(A = 1,B) p(B|A = 1)


0 0.007 0.0069 0.43
1 0.9 0.009 0.57

Now suppose you live in California: the alarm can be triggered by an earthquake

B E p(A = 1|B,E) p(A = 1,B,E) p(B,E|A = 1) p(B|A = 1)p(E|A = 1)


0 0 0.001 0.00097 0.06 , 0.27
0 1 0.3 0.0059 0.37 , 0.16
1 0 0.9 0.0088 0.55 , 0.35
1 1 0.95 0.00019 0.01 , 0.22

marginalize out B:

E p(A = 1,E) p(E|A = 1)


0 0.0098 0.62
1 0.0061 0.38

Bergson’s paradox: given A = 1, we learn (e.g. from the news) that there was an
p(A=1,B,E=1)
earthquake E = 1. Compute p(B|A = 1,E = 1) = p(A=1|E=1)

B p(B|A = 1,E = 1)
0 0.97
1 0.03

This is known as the “explaining away effect”

• The effect also occurs when we get evidence on any descendent of A.


2 The tables are not in the right order. Figuring out the correct order is left as an exercise to the reader.

63
15 Lecture 12/06

• The three fundamental configurations can be combined into a systematic criterion


to identify all independence assumptions that are implicitly represented in a given
graph. “d-separation”
– directed path from X Y sequence of nodes A0 = X ,A1 , ...,Ak−1 ,Ak = Y such
that Aj−1 → Aj is an arc
– transitive closure (descendants) of X : DE(X ) = {Y : X Y}
– ascendants of X : transitive closure of the transposed graph, nodes that can
reach X : AS (x ) = {Y |Y X}
– undirected path (X ! Y ): A0 = X ,A1 , ...,Ak−1 ,Ak = Y , such that Aj−1 → Aj
or Aj−1 ← Aj is an arc in the graph

• Consider a set S ⊂ {X 1 , ...,X D } such that evidence is available for all nodes in S. Let
X ,Y < S. Then, an undirected path X ! Y is blocked by S if any of the following
is true:
1. Ai−1 — Ai — Ai+1 is a chain and Ai ∈ S
2. Ai−1 ← Ai → Ai+1 and Ai ∈ S
3. Ai−1 → Ai ← Ai+1 and neither Ai < S nor for Z ∈ DE(Ai ) Z < S

• Def: X and Y are d-separated3 by S, if S blocks every path X ! Y .

3 Note: according to wiki d stands for directional

64
16 Lecture 17/06
• What independence assumptions does a BN encode?
– when X and Y are associated, information must flow between them:
– in a BN information can only flow along the arcs (in both directions!)
⇒ we must consider an undirected path between X and Y (X ! Y )
– if information can flow along X ! Y the path is “active”, otherwise “blocked”
– if all paths between X and Y are blocked ⇒ X ⊥
⊥ Y (unconditionally)
– given a set S of nodes where we have evidence (know the variable value), path
activation can change1
• algorithm to check for d-separateness:
Given: directed graph G, nodes S; X ,Y < S
1. Define the ancestral subgraph G 0 of G: remove all nodes not in {X ,Y ,S, ancestors(X ,Y ,S )}
and their arcs.
2. Define moral graph G 00 of G 0: for each node in G 0 connect all unconnected
parents (“unmarried”) by an undirected arc and remove all arrows.
3. Construct G 000 of G 00 by removing all nodes from S in G 00.
4. X and Y are d-separated given S if they are unconnected given G 000.
• Def: A joint probability p(X 1 , ...,X D ) satisfies (directed global) Markov property
w.r.t a graph, if X j and X j 0 , are d-separated by S implies X j ⊥
⊥ X j 0 |S in p(X 1 , ...,X D ).
• Theorem: If p(X 1 , ...,X D ) is Markov w.r.t a DAG G, then it can be factorized as
p(X 1 , ...,X D ) = j p(X j |PA(X j )).
Q

• The converse is not generally true: conditional independence does not always imply
d-separation.
Example: X and Y are not d-separated but independent: X → Z → Y ← X .
linear model:

X = εX ∼ N (0,σX2 )
Z = aX + εZ
Y = bZ + cX + εY
= abX + bεZ + cX + εY = (ab + c)X + εZ0

if (ax + c) = 0 ⇒ X ⊥
⊥Y
1 see end of last lecture

65
16 Lecture 17/06

• this is undesirable: define the problem away

• Def: A distribution p(X 1 , ...,X D ) is faithful to a DAG G if X j ⊥


⊥ X j 0 |S implies d-
separation of X j ,X j 0 , given S.

• Claim: many important models for p(X j |PA(X j )) are faithful with probability 1.
Advantage: d-separation, i.e. the structure of the graph, fully specifies all indepen-
dence assumptions ⇒ we can separate the two problems
1. define/learn the structure of the graph
2. learn the probabilities, given the graph

16.1 Inference in BN
• Inference: compute interesting properties that are implicitly represented by the
BN (graph structure and p(X j |PA(X j )) are known)

• basic algorithm: variable elimination: split the variables into 3 (disjoint & complete)
sets
– T : variables we are interested in (“targets”)
– V : variables where we have evidence (“visible”)
– U : variables we are not interested in

• when V is empty: variable elimination = marginalization over U : p(T ) = p(T ,U )


P
U

• when V is not empty:


P (X j |PA(X j )), if var. assignment is comp. w/ V

1. define new functions q(X j |PA(X j )) = 
 0,
 otherwise
2. marginalize over U q(T |V ) = U j q(X j |PA(X j ),V )
P Q

3. turn into probability by normalization p(T |V ) = q(T |V )/ T 0 q(T 0 |V )


P

• example 1: last week’s computations in the burglary alarm network

• example 2: Naive Bayes classifier. Assumptions:


– class membership causes feature observations
– features are independent, given the class X j ⊥
⊥ X j 0 |C.
Prediction:
∗ V = {X 1 = o 1 , ...,X D = oD } what is p(C |V )?
p(X j |C), X j = o j = p(X j |C)δ (X j = o j )

∗ q(X j |C,V ) = 
 0,
 otherwise

66
16.1 Inference in BN

∗ p(C |X 1 , ...,X D ) ∝ p(X 1 , ...,Xd |C)p(C)


∗ p(C |X 1 = o 1 , ...,X D = oD ) = q(C |X 1 , ...,X D )/ q(C = k |X 1 , ...,X D )
P

problem: variable elimination has exponential complexity in the size of U (# of


variables to eliminate)

• solution, idea: use the distributive law to minimize the complexity in the sum over
products
– ab + ac = a(b + c) (three operations vs 2). In a complex network, proper
grouping of terms can give dramatic gains:
X
q(X 1 ,X 3 ) = q 1 (X 1 ,X 2 ,X 3 )q(X 1 ,X 4 )
X 2 ,X 4

assume each X j takes b different values. Then we have in total b 2 · 2b 2 = 2b 4 ,


but grouping
X X
q 1 (X 1 ,X 2 ,X 3 ) q 2 (X 1 ,X 4 )
X2 X4

gives us b 3 + 2b 2

• but: finding the optimal grouping is in general NP-hard.

• but: it is easy for a very important special case: if the BN is a tree.


⇒ use “belief propagation algorithm” to find the optimal computation mechani-
cally

• why is this relevant:


– many practical BNs are trees
– some can be transformed into trees by duplicating and grouping variables
using “junction tree algorithm”
– belief propagation also works when the graph is not a tree (“loopy belief
propagation”) (relevant cycles of the undirected graph corresponding to BN)
but gives approximate solution (quality is application dependent)

• belief propagation is also known as message passing.


– it passes around (between neighboring nodes) reduced (marginalized) proba-
bility tables

67
17 Lecture 19/06
• marginalization in Bayesian networks is generally done by “variable elimination”
• but: VE has exponential complexity in the # of eliminated variables when applied
naively
• We can take advantage of the distributive law to group sums and products such that
the complexity is minimized.
• “belief propagation” finds an optimal evaluation order automatically for tree-shaped
graphs.
• original algorithm [Pearl, 1988] for BN, here: use the generalization to factor graphs
by [Kschischang et al., 2001]
– factor graph:
∗ two types of nodes: variables (X j ), factors (functions) fl (small squares)

X3 X3 X3

fc fb f (X 1 ,X 2 ,X 3 )

X1 X2 X1 fa X2 X1 X2

Figure 17.1: The left figure shows the undirected graph for the middle and right picture
with single clique potential Φ (X 1 ,X 2 ,X 3 ). The picture in the middle is the
factor graph of Φ (X 1 ,X 2 ,X 3 ) = fa (X 1 ,X 2 ) fb (X 2 ,X 3 ) fc (X 3 ,X 1 ) and the right
figure is the factor graph for Φ (X 1 ,X 2 ,X 3 ) = f (X 1 ,X 2 ,X 3 ).

∗ bipartite, i.e. edges are only between nodes of different types (undirected
edges)
∗ edge X j —fl exist ⇔ X j is an argument of fl
∗ example: Burglary alarm
– Variable elimination is implemented by “message passing”. Each node sends
and receives messages to/from its neighbors. messages = reduced probability
tables = partial variable elimination

69
17 Lecture 19/06

fE fB

E B

fA

Figure 17.2: Factor graph for the Burglary alarm example

– message passing has two simple rules:

1. variable to factor:
Y
µX j →fl (X j ) = µ f 0→X j (X j )
f 0 ∈N e (X
j )\fl

2. factor to variable:
X Y
µ fl →X j (X j ) = fl (N e ( fl )) µX 0→fl (X 0 )
{X 0 }∈N e ( fl )\X i {X 0 }∈N e ( fl )\X j

Note: N e (·) represents the neighborhood of ·

– shorthand notation {X 0 } ∈ N ( fl )\X j =: Nl\j

– Summary of the principle: message sent by a node to a receiver depends on


the messages coming in from all neighbors of the sender except the receiver.

– message scheduling: a message can be sent as soon as all required incoming messages
| {z } | {z }
LHS RHS
at the sender have been received
⇒ leaf nodes in a tree can send messages to its only neighbor without waiting
or prerequisites
⇒ message passing proceeds in rounds:
. round 0: send messages from leaf nodes
. round t: send messages where last prerequisites were received in round
(t − 1)
termination time T = diameter of the tree (longest path)

70
17.1 Temporal Models/Belief Networks

– finalization rule: compute the marginals from all incoming messages of the
variable nodes
Y
q(X j ) = µ f →X j (X j )
f ∈N e (X j )
X
p(X j ) = q(X j )/ q(X j0 )
X j0

– For small graphs, this is no improvement over naive variable elimination, but it
is easy to implement as an algorithm for arbitrary large graphs “sum-product
algorithm”.
– If the graph has cycles (no tree) ⇒ belief propagation is generalized to “loopy
belief propagation”.
∗ due to cycles, a node can receive messages through the same edge repeat-
edly (either through alternative paths or by repeated winding around a
cycle)
⇒ Whenever this happens, the node sends updated messages through the
other edges.
∗ no hard termination condition, but converges to a fixed point (local optimal
solution) = reasonable approximation of full variable elimination, but
depends on the initial state
∗ two possibilities to incorporate evidence (observed states)
· set fl (X ) = 0 if the evidence is incompatible with the variable states
f (A,B,C),A = 1 ∀B,E : f (A = 0,B,E) = 0
· attach unary factors to the variable nodes where we have evidence
example: Alice’s children, version (2) (We know that at least one of
the children is a boy.) 1

17.1 Temporal Models/Belief Networks


• causality goes past → present → future, we know the arrow directions
⇒ simplest possible model: Markov chain (MC) PA(X j ) := {X j−1 , ...,X j−M }, M-th
order Markov chain
M = 1 : (X 1 ) → (X 2 ) → (X 3 ) → ...
M represents how much memory the system has, for example M = 1 there is no
memory and the future only depends on the present, not the past
If p(X j |PA(X j )) = p(X j |X j−1 ) = R (j) , Ri,i 0 = p(X j = ai |X j−1=i 0 )
p(X D |X 1 ) = R (D) · · · R (2)p(X 1 )

• The system is stationary if all X j have the same set of states and R (j) = R (j ) .
0

1 there is again a graph that won’t be reproduced here

71
17 Lecture 19/06

• a stationary system can be represented as a probabilistic state machine, example: the


weather homework from exercise06.pdf

0.2

Rain

0.4 0.7

0.1 0.2

Sunny 0.3 Cloudy

0.3 0.5 0.3

Figure 17.3: Probabilistic State Machine for weather homework.

• for D → ∞ p(X D |X 1 ) becomes the stationary distribution p∞ which can be shown


to be independent of X 1 .
’stationary’ means that it doesn’t change anymore, p∞ = Rp∞ , i.e. p∞ is the eigen-
vector of R corresponding to eigenvalue 1.

72
18 Lecture 24/06
This lecture is actually two lectures.

18.1 Markov Chains


(X 1 ) → (X 2 ) → ... → (X D ) if stationary: probabilistic state machine, state transition
matrix R, stationary distribution p∞ = Rp∞ eigenvector of R with eigenvalue 1.

p(X1 | X1 ) p(X... | X... ) p(XD | XD )

X1 p(X... | X1 ) X... p(XD | X... ) XD

Figure 18.1: Schematic of Markov Chain

18.1.0.1 Google PageRank algorithm

• a search engine works in two steps:


1. find pages related to the query
2. rank these pages according to importance/relevance

• how to measure importance?


– today: probably use actual click statistics
– ≈ 1995 : no statistics available ⇒ simulate user clicks by a random walk =
monkey user clicking at random
⇒ consider pages as important if they are frequently reached in the random
walk ⇔ high prob. in p∞ .

• define transition matrix:


– state k = user looks at webpage k, k = 1, ...,C, C = # of pages

1. the monkey clicks on each link on page k uniformly at random


2. if page k contains no links the monkey goes to any page uniformly at random.

73
18 Lecture 24/06

3. on any page there is a constant prob λ that the monkey goes to any other page
uniformly at random instead of clicking a link.

 1, page k links to page k 0



– adjacency matrix A: Ak 0k = 
 0, else

– out-degree of page k: k 0 Ak 0k
P

– from 1. and 2. we define the transition matrix R 0:

A 0/ k 0 Ak 0k , if page k has Links


P
kk

P (X j = k |X j−1 = k ) =
0
Rk0 0k =

 1/C, otherwise

X
∀k : Rk0 0k = 1.
k0

– incorporate rule 3. to define “Google matrix” R:

1
Rk 0k = λ + (1 − λ)Rk0 0k
C

– importance of page k: (p∞ )k (numerically difficult if C >> 1)

18.2 Hidden Markov Models (HMM)


• make Markov chain a bit more complicated: the interesting variables X j are not
observable anymore

• instead we can observe features Yj that depend on the “hidden” or “latent” vari-
ables X j (dependency is causal, but probabilistic)
⇒ BN: (X 1 ) → (X 2 ) → ... → (X D ) and (Xi ) → (Yi ),∀i
probability factorizes: p(X 1 , ...,X D ,Y1 , ...,YD ) = j p(X j |X j−1 )p(Yj |X j )
Q

• example:
– speech recognition: X 1 , ...,X D is what the speaker said (phonemes), Y1 , ...,YD
is what you heard
– wireless communication (e.g. cell phones): Xi symbol sent, Yj symbol received

• major task:
– compute marginals for the hidden states, given observations Y = O 1 : p(X j |Y =
O ), j = 1, ...,D
1 observed state vector

74
18.2 Hidden Markov Models (HMM)

– compute the most likely sequence of hidden states, given observations

â = arg max p(X = a|Y = O )


a

(note: the global ML sequence â generally differs from the sequence of pointwise
maxima α̃ j = arg max p(X j |Y = O ))
a
– learn the probabilities p(X j |X j−1 ) and p(Yj |X j ) from training data

• compute pointwise marginals

X \X j p(X 1 , ...,X D ,Y1 = O 1 , ...,YD = O D )


P
p(X j |Y = O ) = P
X p(X 1 , ...,X D ,Y1 = O 1 , ...,YD = O D )
X
∝ p(X 1 , ...,X D ,Y1 = O 1 , ...,YD = O D ) = q(X j |Y = O )
X \X j

• factor graph

f 1 (X 1 ) = p(X 1 ) prior of X 1
f j (X j ,X j−1 ) = p(X j |X j−1 ) transition probability
дj (Yj ,X j ) = p(Yj |X j ) observation probability

f 1 (X 1 ) X1 f 2 (X 2 | X 1 ) X2 f 3 (X 3 | X 2 ) X3

д1 (Y1 | X 1 ) д2 (Y2 | X 2 ) д3 (Y3 | X 3 )

Y1 Y2 Y3

Figure 18.2: Illustration of the factor graph for a HMM.

• BP message passing rules:


Y
µX →f (X ) = µ f 0→X (X )
f 0 ∈N E (X )\f
X Y
µ f →X (X ) = f (X ) µX 0→f (X 0 )
X 0 ∈N E ( f )\X X 0 ∈N E ( f )\X
| {z }
(∗)

75
18 Lecture 24/06

• all factor nodes have degree 2 (“pairwise factors”) the products (∗) contain only a
single term, we can simplify message passing by concatenating two consecutive
messages2

γ j (X j ) = (µYj →дj ◦ µдj →X j )(X j )


α j (X j ) = (µX j−1 →f j ◦ µ f j →X j )(X j )
β j (X j ) = (µX j+1 →f j+1 ◦ µ f j+1 →X j )(X j )

• message schedule:
– round 0: send all γ messages (in parallel) and α 1 (X 1 ) (these messages have no
prerequisites, because f 1 and Yj are leaves)
– round j: send α j+1 (X j+1 ) and βD−j (X D−j )
⇒ forward-backward algorithm

• expand the message definitions:


X
γ j (X j ) = дj (Yj ,X j ) µYj →дj (Yj ) = дj (Yj = O j ,X j )
X 0 ∈N E (дj )\X j
| {z }
| {z } δ (Yj =O j )
Yj
X
α j (X j ) = f j (X j ,X j−1 ) µX →f j (X j−1 )
X 0 ∈N E ( f j )\X j Y| j−1 {z }
| {z } µ f 0→X j−1 (X j−1 )
X j−1
f 0 ∈N E (X j−1 )\f j−1
| {z }
α j−1 (X j−1 )γ j−1 (X j−1 )
X
= f j (X j ,X j−1 )α j−1 (X j−1 )γ j−1 (X j−1 )
X j−1

α 1 (X 1 ) = p(X 1 )
X
β j (X j ) = f j+1 (X j+1 ,X j )µX j+1 →f j+1 (X j+1 )
X 0 ∈N E ( f j+1 )\X j
| {z }
X j+1
X
= f j+1 (X j+1 ,X j )γ j+1 (X j+1 )β j+1 (X j+1 )
X j+1

βD (X D ) = 1

• algorithm:
– round 0: propagate γ j and α 1
– round j: propagate α j+1 and βD−j
2◦ represents the concatenation

76
18.2 Hidden Markov Models (HMM)

– finalization: compute marginals:


Y
q(X j |Y = O ) = µ f 0→X j (X j ) = α j (X j )β j (X j )γ j (X j )
f 0 ∈N E (X j )

• Remark: most textbooks derive the F/B algorithm directly, without factor graphs.
Then, the messages α j and γ j are usually merged to α̃ j (X j ) = α j (X j )γ j (X j ).
• computing p(X j |Y = O ) is also called “smoothing”, intuition: X and Y have the
same statespace (symbols of an alphabet), Y is a noisy version of the true message
X ⇒ p(X j |Y = O ) is a denoised (smoothed) version of Y = O
• in smoothing, we condition on all observations Y1 = O 1 , ...,YD = O D
• in an online system, we can only condition on the observations received so far Y1 =
O 1 , ...,Yj = O j (we do not yet know the values of future observations Yj+1 , ....,YD )
⇒ (online) filtering
• Derive filtering from scratch and show that it gives the same results as belief propa-
gation on factor graphs:
p(X j |Y1 = O 1 , ...,Yj = O j ) = p(X j , .Y1 , ...,Yj )/p(Y1 , ...,Yj )
q(X 1 |Y1 = O 1 , ...,Yj = O j ) = p(X j ,Y1 = O 1 , ...,Yj = O j )
X
p(X j ,Y1 , ...,Yj ) = p(X 1 , ..,X D ,Y1 , ...,YD )
X \X j ,Y \{Y1 ,...,Yj }
X
= p(X j−1 ,X j ,Y1 , ...,Yj ).
X j−1

• BN factorization and Bayes rule:


p(X j−1 ,X j ,Y1 , ...,Yj )
= p(Yj |X j−1 ,X j , ...,Yj−1 )p(X j |X j−1 ,Y1 , ...,Yj−1 )p(X j−1 ,Y1 , ...,Yj−1 )
= p(Yj |X j )p(X j |X j−1 )p(X j−1 ,Y1 , ....,Yj−1 )

q(X j |Y1 = O 1 , ...,Yj = O j )


| {z }
=:α̃ j (X j )
X
= p(Yj = O j |X j )p(X j |X j−1 )q(X j−1 |Y1 = O 1 , ....,Yj−1 = O j−1 )
X j−1
X
= p(Yj = O j |X j ) p(X j |X j−1 ) q(X j−1 |Y1 = O 1 , ...,Yj−1 = O j−1 )
X j−1
| {z }
α̃ j−1 (X j−1 )

X
α̃ j (X j ) = p(Yj = O j |X j ) p(X j |X j−1 )α̃ j−1 (X j−1 )
X j−1
| {z }
γ j (X j ) | {z }
α j (X j )

77
18 Lecture 24/06

translate this to our notation:

q(X j |Y1 = O 1 , ...,Yj = O j ) = γ j (X j ) α j (X j )


|{z} |{z}
corrector predictor
X
α j (X j ) = p(X j |X j−1 ) γ j−1 (X j−1 )α j−1 (X j−1 )
X j−1
| {z }
f j (X j ,X j−1 )

predictor: updated prior for X j representing our expectations of the next X j


corrector: noisy observation of X j
q(X j |Y1 , ...,Yj ): compromise between our expectations and observations

• Kalman filter3 :
– HMM with continuous state space for X and Y . X j ∈ RN , Yj ∈ RN
0

– transitions are defined by linear (matrix) equations + additive Gaussian noise


⇒ nice analytical matrix expressions for all probabilities

• task: determine the most likely sequence of X , given Y = O, “maximum likelihood


detection”
â = arg max p(X = a|Y = O )
a
– “decoding the noise received message Y ”
– Viterbi-algorithm, widely used in all digital communications

• as usual, instead of maximizing the likelihood, we minimize the negative log-


likelihood
– redefine the factors:

f j (X j ,X j−1 ) = − log p(X j |X j−1 )


дj (Yj ,X j ) = − log p(Yj |X j )

– objective

α̂ = arg max p(X = a|Y = O )


a
X D XD
⇔ â = arg min *. дj (Yj = O j ,X j = a j ) + f j (X j = a j ,X j−1 = a j−1 ) +/
a
, j=1 j=1 -

– surprisingly, this can also be solved by a variant of belief propagation


– crucial insight: sum-product algorithm (= standard belief propagation) auto-
matically groups terms to minimize computations, but this grouping only relies
on the algebraic properties of addition and multiplication

3 might be treated later if there is enough time

78
18.2 Hidden Markov Models (HMM)

⇒ it works for other tasks that have the required algebraic properties

• specifically addition and multiplication form a semi-ring


Def of a semi-ring4 : (R, ⊕, ⊗) is a semi-ring over domain R, if

(i) ⊕ is a commutative and associative operator with neutral element “0”

(ii) ⊗ is a commutative and associative operator with neutral element “1”

(iii) ⊕ and ⊗ are distributive: (a ⊗ b) ⊕ (a ⊗ c) = a ⊗ (b ⊕ c)

(iv) “0” annihilates ⊗: a ⊗ “0”=“0”

• This obviously applies to ordinary addition and multiplication with “0” = 0, “1” = 1.

• for maximum a posteriori estimation = minimal negative log-likelihood we need


the “min-sum algebra”

(i) a ⊕ b = min(a,b), “0” = ∞ (min(a, ∞) = a)

(ii) a ⊗ b = a + b, “1” = 0, a + 0 = a

(iii) min(a + b,a + c) = a + min(b,c)

(iv) a ⊗ “0” = a + ∞ = ∞

• using this algebra, belief propagation becomes the “min-sum algorithm”, intuition:
replace all products with sums and all sums with “min” in the sum product algorithm
⇒ reuse the messages α, β,γ and update scheduling

– round 0: initialization

γ j (X j ) = дj (Yj = O j ,X j ) = − log p(Yj = O j |X j )


βD (X D ) = 0 = “1” of min-sum

4 add inverses for a “complete” ring

79
18 Lecture 24/06

– backward sweep, rounds 1, ...,D − 1 (compute the likelihood right to left):


D−1
X
min . дj (Yj = O j ,X j ) + f j (X j ,X j−1 ) +/ =
*
{X 1 ,...,X D }
, j=1 -
X D
= min *. дj (X j ) + f j (X j ,X j−1 ) +/
{X 1 ,...,X D−1 }
, j=1 -

+ min ..дD (X D ) +fD (X D ,X D−1 ) + βD (X D ) //


* +
X D | {z } | {z }
, γ D (X D ) =0
| {z }-
=:β D−1 (X D−1 )
D−2
X
= min . дj (X j ) + f j (X j ,X j−1 ) +/
*
{X 1 ,...,X D−2 }
, j=1 -
+ min (дD−1 (X D−1 ) + fD−1 (X D−1 ,X D−2 ) + βD−1 (X D−1 ))
X D−1
| {z }
β D−2 (X D−2 )
[...]
 
β j (X j ) = min дj+1 (X j+1 ) + f j+1 (X j+1 ,X j ) + β j+1 (X j+1 )
X j+1

finally:

â 1 = arg min .. α 1 (X 1 ) +β 1 (X 1 ) + γ 1 (X 1 )
* +/
/
X 1 =a 1 | {z } | {z }
,− log p(X 1 =a1 ) − log p(Y1 =O 1 |X 1 =a 1 ) -

– forward sweep: propagate the solution from left to right


α j (X j ) = f j (X j ,X j−1 = â j−1 )
 
â j = arg min α j (X j ) + β j (X j ) + γ j (X j )
X j =a j

“Viterbi algorithm”
• remark 1: one can also do a forward sweep first, followed by a backward sweep ⇒
same result
• remark 2: in principle, one can also use this to maximize the likelihood directly
(instead of negative log-likelihood)
– use the “max-product algebra” ⊕ = max, “0” = −∞, ⊗ = ·, “1” = 1 to get the
“max-product algorithm”
– numerically not advisable, because it involves products of small numbers ⇒
loss of precision
better: work with logarithms and min-sum algorithm

80
19 Lecture 26/06
Note: This lecture contains a lot of equations given in a very short amount of time. The
frequency of typing errors is therefore probably higher. Proceed with caution!
• Difference between point-wise marginals p(X j |Y = O ) and the global MAP
solution â = arg min p(X = a|Y = O )
a

• Example 1:
– consider a problem where X j ∈ 1, ...,C and labels are ordinal (e.g. discretized
values of a continuous phenomenon) and Yj ∈ 1, ...,C are noisy observations
of the X j .
– Point-wise marginals describe the local uncertainty about X j . For example
X̄ j = E[X j ], std[X j ] can be computed from p(X j |Y = O ) and give local error
bars1 .
– The MAP solution is the most likely global solution, within the local error bars.
• Example 2: consider a random walk in a maze: the room entered most frequently2 is
not necessarily part of the most likely way out.

19.1 Learning the parameters (= transition probabilities) of a


HMM
• case 1: supervised training: the states X j are known in the training data → estimate
the transition probabilities by counting transition frequencies
• case 2: unsupervised training: X j are unknown
– example: wildlife photographer wants to get footage of a interesting chim-
 1, chimp is north of the river

panzee. X j = 
 2, chimp is south of the river at time j.

– photographer needs to predict X j+1 to set up equipment at the right location
– observations Yj : any evidence where chimp is/was (sightings, excrements, ...)



 0: no evidence or contradictory evidence on day j
Yj = 

 1: was seen north (but may be wrong!)


2 :
 was seen south (but may be wrong!)
1 example plot here
2 max of point-wise marginals

81
19 Lecture 26/06

– task: create a HMM from N observation sequences

• stationary HMM: transition probabilities are independent of j and constant ρk =


p(X 1 = k ) prior k ∈ {N ,S }; πk 0k = p(X j = k 0 |X j−1 = k ); µmk = p(Yj = m|X j = k ),
m ∈ {0, N ,S }
full parameter set: θ = {ρ,π , µ}

• training data:
– N sequences, n = 1, ..., N , length Dn j = 1, ..,Dn
– Observations Y = O, Yj(n) = O j(n) ∈ {0, 1, 2}
– hidden states X j(n) are unknown

• maximum likelihood parameter estimation3 :


X
θˆ = arg max pθ (Y = O ) = arg max p(X ,Y = O )
θ θ X

• no closed form solution, need an iterative algorithm: EM algorithm (known from


Gaussian mixture models/clustering in ML1)
given a current guess θ , try to get a better guess θ 0, such that pθ 0 (Y = O ) ≥ pθ (Y = O )

• Kullback-Leibler (KL) divergence between two distributions p1 (ω) and p2 (ω) over
the same domain ω ∈ Ω
X p1 (ω)
KL(p2 |p1 ) = p1 (ω) log ≥0
ω
p2 (ω)

pθ (X ,Y =O ) pθ 0 (X ,Y =O )
We choose: p1 (ω) = pθ (X |Y = O ) = pθ (Y =O ) , p 2 (ω) = pθ 0 (X |Y = O ) = pθ 0 (Y =O )

X pθ (X ,Y = O ) pθ (X ,Y = O )pθ 0 (Y = O )
KL(p2 |p1 ) = log
X
pθ (Y = O ) pθ 0 (X ,Y = O )pθ (Y = O )
pθ 0 (Y = O ) 1 X pθ (X ,Y = O )
= log + pθ (X ,Y = O ) log
pθ (Y = O ) pθ (Y = O ) X pθ 0 (X ,Y = O )
≥0

abbreviation:
X
Q (θ 1 ,θ 2 ) = pθ 1 (X ,Y = O ) log pθ 2 (X ,Y = O )
X

pθ 0 (Y = O ) Q (θ ,θ ) − Q (θ ,θ 0 )
KL(p2 |p1 ) = log + ≥0
pθ (Y = O ) pθ (Y = O )
3 marginalize over all possible assignments X

82
19.1 Learning the parameters (= transition probabilities) of a HMM

Q (θ ,θ 0 ) − Q (θ ,θ ) pθ 0 (Y = O )
⇒ ≤ log
p (Y = O ) p (Y = O )
| θ {z } | θ {z }
lower bound for RHS
| ≥{z 1 desired
}
≥ 0 desired
p 0 (Y =O )
⇒ Improve the objective pθθ (Y =O ) as much as possible by maximizing the lower
bound.
⇒ define θˆ0 = arg max Q (θ ,θ 0 )
θ0

• EM algorithm4 outline:
1. define initial guess θ (0)
2. for t = 1, ...,T (or until convergence)
– E-step: define
Q (θ (t−1) ,θ 0 ) = Eθ (t −1) [log pθ 0 ] =
X
pθ (t −1) (X ,Y = O ) log pθ 0 (X ,Y = O )
X

– M-step: find θˆ0 = arg max Q (θ (t − 1),θ 0 )


θ0

– set θ (t ) = θˆ0
• thanks to the BN factorization of our HMM; the calculations simplify tremendously
X
Q (θ ,θ 0 ) = pθ (X ,Y = O ) log pθ 0 (X ,Y = O )
X
| {z }
j pθ 0 (X j |X j−1 )pθ 0 (Yj =O j |X j )
Q Q
n

N Dn
* log pθ 0 (X (n) ) + log pθ 0 X j(n) |X j−1
(n)
X X X  
= pθ (X ,Y = O ) 1
X n=1 , j=2
Dn
log pθ 0 Yj(n) = O j(n) |X j(n) +
X  
+
j=1 -
• when minimizing w.r.t θ 0 = {ρ 0,π 0, µ 0 } we also need to preserve the normalization
of these probabilities ⇒ Lagrangian
N Dn
* log pθ 0 (X (n) ) + log pθ 0 X j(n) |X j−1
(n)
X X X  
L(θ ) =
0
pθ (X ,Y = O )
X n=1 ,
| {z1 } j=2 | {z }
ρ0 π0
Dn
log pθ 0 Yj(n) = O j(n) |X j(n) + + λ ρ *1 −
X   X
+ ρk0 +
j=1 | {z }- k
µ0
, -
X  X X 
+ λk 1 −
 * πk 0k + ηk 1 −
0 + * µmk 
0 +

k  , k 0 - , m -
4E for expectation, M for maximization

83
19 Lecture 26/06

• optimize by setting the derivative to 0

D n δ X (n) = k 0,X (n) = k 


 
∂L(θ 0 ) X X X j j−1
= pθ (X ,Y = O )   − λ =! 0
k
∂πk0 0k X n  j=2
 π 0
k 0k


Dn
δ (X j(n) = k 0,X j−1
(n)
X XX
πk0 0k λk = pθ (X ,Y = O ) = k)
| {z } n j=2
X
pθ (X |Y =O )pθ (Y =O )
Dn X
pθ (X |Y = O )δ (X j(n) = k 0,X j−1
(n)
XX
= pθ (Y = O ) = k)
n j=2 X
Dn
pθ (X j(n) = k 0,X j−1
(n)
XX
= pθ (Y = O ) = k |Y = O )
n j=2
Dn
pθ (X j(n) = k 0,X j−1
(n)
XX
πk0 0k ∝ = k |Y = O )
n j=2

This can be computed by a variant of the forward/backward algorithm ⇒ homework.


πk0 0k
And then normalize: → πk0 0k
k 0 πk 0k
P

pθ (X 1(n) |Y = O )
X
ρk0 ∝
n
Dn
pθ (X j(n) = k,Yj(n) = m|Y = O )
XX
µmk
0

n j=1
Dn
pθ (X j(n) = k |Y = O ) 1(Yj(n) = m)
XX

n j=1 | {z }
standard F/B algorithm

• Baum-Welch algorithm: repeat until convergence:


– compute marginals pθ (X j(n) = k |Y = O ) and pθ (X j(n) = k 0,X j−1
(n)
= k |Y = O )
under the current guess θ , using the forward-backward algorithm.
– Update ρ 0,π 0, µ 0 by pretending that the marginals are the ground truth, using
counting followed by normalization.

• BW converges only to a local optimum of pθˆ (Y = O ) ⇒ quality depends on the


quality of the initial guess.
– don’t initialize with 0, unless this is a constraint of the model, because 0 probs
stay 0 probs.
– method 1:
∗ random initialization: θ (0) is a uninformative prior plus noise, e.g. π (0)k 0k =
(1 − λ) C1 + λU (0, 1)

84
19.1 Learning the parameters (= transition probabilities) of a HMM

∗ repeat with several initializations and keep best solution


– method 2: Viterbi counting
∗ define counting matrices ρC , π C , µC (count how often each transition
occurred)
∗ initialize the counting matrices π C = LC×C , ρC = LC×1 , µC = LM×C (L =
regularization parameter, minimum count for each transition)
∗ for t = 1, ...,T
· choose a training sequence n at random
· define the transition matrices for the current iteration5 :

π (t ) = (1 − λ)norm(π C ) + λnorm(U (0, 1)C×C )


µ (t ) = (1 − λ)norm(π C ) + λnorm(U (0, 1)M×C )
ρ (t ) = (1 − λ)norm(ρC ) + λnorm(U (0, 1)C×1 )

· compute the MAP solution using Viterbi

X = â,Y = Ô, Ô j = arg max p(Yj = O j |X j = â j )


oj

· update the counting matrices π C , ρC , µC as if the MAP solution was


the ground truth
∗ init Baum-Welch with norm(π C ), norm(ρC ), norm(µC )

5norm = normalization

85
20 Lecture 01/07

20.1 Causality
• causality is second major application of BN: (first: temporal models)

• three approaches to causality:


– understanding the underlying mechanism (but: may be beyond our techno-
logical capabilities, too expensive, overkill, not yet possible in early stages of
investigation)
– statistical experiment: actively intervene into the system and analyze the effect
statistically (but may be impossible, illegal, unethical, too expensive, too early)
– observational analysis: measure properties

• example: cholera epidemics in London ≈ 1850


– root cause (bacterium) was discovered in 1854 by Pacini (but not widely known),
settled by Robert Koch in 1884
– many hypothesis about cause (air quality, elevation of homes, social status,...)
– Farr (head statistician): derived from data: Yi /Yi 0 = (Ei0 − E 0 )/(Ei − E 0 ) 1
– Snow (physician - inventor of anesthesia) believed (contrary to every one) that
cholera was transmitted by contaminated water.
⇒ identified highly significant association between illness and water pump
– Farr: this hypothesis is very plausible, but not forced by the data in 1854
– By 1866 Farr had collected enough data to confirm Snow’s claim.

• Why is it difficult to derive causality here?


– In physical systems (or standard ML) we can reset the experiment and repeat
under different conditions.
⇒ can identify2 E[Yi (Xi = a 1 ) − Yi (Xi = a 0 )] > 0 ⇒ a 1 is better than a 0 (E
goes over individuals i)
– in living systems it is impossible to replay the data ⇒ we can at best compute
EX i =a1 [Yi ] − EX i =a0 [Yi ] (E is over groups that received either treatment)
How can we assure that this is ≈ E[Yi (Xi = a 1 ) − Yi (Xi = a 0 )]?
1Y = prop i gets ill, Ei = elevation of i’s home
i
2i is a data instance and X i = a j are different interventions

87
20 Lecture 01/07

– we may not be able to actively intervene ⇒ the groups {Xi = a 1 } and {Xi = a 0 }
are outside of our control

• goals of causality analysis:


– prediction: What will happen if we apply treatment a or implement policy a?
(e.g. raise cigarette taxes?)
– counterfactual queries: What would have happened if Xi had been ak instead
of the actual ak 0 ?
– decision making: Is it “better” to apply treatment ak or ak 0 or no treatment at
all?

• BNs are a very useful tool here if


– they include all relevant variables (no hidden causes) → difficult to achieve
– the arrows in the BN represent causal directions → “identifiability”
– the probs are known → “parameter searching”

• always remember: association (“correlation”) if not causality (but: analysis of many


associations can reveal causality)

• Reichenbach’s “common cause principle”: if X and Y are associated then either X


causes Y or Y causes X or there exists Z such that Z causes X and Y 3
– almost complete: correct when data X and Y are not conditioned on a common
effect of X and Y (“explaining away effect”)
– example: are lectures useful? Top researchers find self-learning more efficient
“selection bias”

• remember the definition of independence: X ⊥⊥ Y ⇔ p(X ,Y ) = p(X )p(Y ), X ⊥ ⊥


Y |Z ⇔ p(X ,Y |Z ) = p(X |Z )p(Y |Z )
⇒ no causation is possible between independent variables (no direct causation if
conditional independent)

• modeling of interventions (active state changes by the investigator)


– suppose we have a “correct” BN, i.e. p(X 1 , ...,X D ) = j p(X j |PA(X j ))
Q

– Pearl’s “do” operator: do(X j = a j ) = X j was actively set into state a j


more general do(X j ∼ p̃(X j )) = X j was actively drawn from p̃(X j ) instead of
p(X j |PA(X j )) 4
– “do” changes the factorization by replacing the distribution of X j
  Y
p X 1 , ...,X D |do(X j ∼ p̃(X j )) = p̃(X j ) p(X j 0 |PA(X j 0 ))

graphical: incoming arcs of X j are deleted

88
20.1 Causality

X X X

Y Z Y Z Y Z
Figure 20.1: Common- Figure 20.2: Causal-Chain Figure 20.3: Common-
Cause Model Model Effect Model

X X X

Y Z Y Z Y Z

do Y do Y do Z
Figure 20.4: Influence of interventions on the three basic causal models

⇒ “structural interventions”5
– marginalization: (under hard intervention δ (X j = a j ))
X
p(X 1 , ...,X j−1 ,X j+1 , ...,X D |do(X j = a j )) = p(X 1 , ...,X D |do(X j = a j ))
Xj
Y
= p(X j 0 | PA(X j 0 ,X j = a j ) )
j 0 ,j
| {z }
X j = a j , whenever X j ∈ PA(X j 0 )

– examples of the effect of “do”

p(...|X = a) = p(...|do(X = a)) ⇔ X is a root in the BN (no incoming arcs)


| {z } | {z }
cond prob intervent prop

– otherwise, the two probs are different:


First example p(X ,Y ) = p(Y )p(X |Y )

p(Y )p(X = a|Y )


p(Y |X = a) = P
Y p(Y )p(X = a|Y )
,
X
p(Y |do(X = a)) = p(Y )δ (X = a) = p(Y )
X
3X → Y, Y → X, X ← Z → Y
4 hardintervention X j = a j is a special case: p̃(X j ) = δ (X j = a j )
5 opposite: “parametric intervention” p(X |PA(X )) = p̃(X |PA(X ))
j j j j

89
20 Lecture 01/07

Second Example67 p(X ,Y ,Z ) = p(Z )p(X |Z )p(Y |X ,Z )


X X p(X = a,Y ,Z )
p(Y |X = a) = p(Y ,Z |X = a) =
Z Z
p(X = a)
= (∗)
X
p(Z |X = a)p(Y |X = a,Z )
Z
,
X
p(Y |do(X = a)) = p(Z )δ (X = a)p(Y |X ,Z )
X ,Z
X
= p(Z )p(Y |X = a,Z )
Z

⇒ be careful when making interventional predictions from observational


probabilities (⇒ later)
• What's the "correct" BN?
  – Let G be a DAG with nodes X_1, ..., X_D and p(X_1, ..., X_D) a joint distribution; p
    is "Markov and faithful" with respect to G if:

        [X d-separated from Y | Z]_G ⇔ [X ⊥⊥ Y | Z]_p

  – in general, G is not uniquely determined:
    "Markov equivalence class": M(G) = {G' | p is Markov and faithful w.r.t. G'}
  – "skeleton" of G: the undirected graph obtained by removing all directions in G
  – "moral graph" of G: connect all parents by an undirected edge and remove all
    directions afterwards
  – Theorem: Two DAGs G, G' are Markov equivalent iff their skeletons and
    moral graphs are equal.
  – "Markov minimality": a DAG G is Markov minimal w.r.t. p if p is Markov and
    faithful w.r.t. G, but not w.r.t. any subgraph of G8
  – "causal effect"9: there is a (total) causal effect from X_j to X_j' in p(X_1, ..., X_D) iff
    X_j ⊥̸⊥ X_j' in p(X_1, ..., X_D | do(X_j ∼ p̃(X_j)))
  – "true causal graph":
    ∗ Let G be a DAG s.t. p(X_1, ..., X_D) is Markov and faithful.
    ∗ For all subsets S ⊆ {X_1, ..., X_D} let

          p_G(X_1, ..., X_D | do(S ∼ p̃(S))) = p̃(S) ∏_{j∉S} p(X_j | PA(X_j))

6 X ⊥̸⊥ Y partly due to the direct effect X → Y, partly due to the common cause Z
7 (∗) there are some basic calculations needed here, which are left as an exercise for the reader
8 i.e. we cannot remove any edges from G without changing the probability
9 intuitively: there is a path X_j ⇝ X_j' in the graph where all incoming arcs of X_j were removed
j


be the interventional distribution obtained from G, and p(X_1, ..., X_D | do(S ∼ p̃(S)))
the true interventional distribution.
G is a true causal graph of p iff ∀S, ∀p̃(S):

      p_G(X_1, ..., X_D | do(S ∼ p̃(S))) = p(X_1, ..., X_D | do(S ∼ p̃(S)))

• Theorem: The minimal true causal graph for p is unique.

21 Lecture 03/07

21.1 Create BNs from data


• A true causal model reproduces the joint probability p(X_1, ..., X_D) (what you get
  from passive observation) and all interventional distributions p(X_1, ..., X_D | do(S ∼ p̃(S))),
  S ⊂ X (what you get from experiments).

• Theorem: The minimal true causal model is unique [Peters et al. 2014].

• problem: given data, infer the true model, or (weaker) a member of the Markov
equivalence class

• IC algorithm1: idealized exact algorithm to identify the Markov equivalence class.
  – assumptions:
    ∗ the nodes X_1, ..., X_D of the graph are known
    ∗ we have access to a test oracle that answers (conditional) independence queries
  – steps:
    1. start with the complete graph and remove edges whose end points are
       (conditionally) independent
       ⇒ skeleton of G
    2. detect "common effect" situations and orient the arrows accordingly
       ("v-structures")
    3. use BN constraints (e.g. cycle-freeness) to orient as many additional edges as
       possible
    4. perform experiments to obtain the orientation of the remaining edges (or
       orient arbitrarily)
1. compute the skeleton (conceptual approach, exponential complexity)
   – for all pairs (X_j, X_j'):
     ∗ for every subset S ⊆ X \ {X_j, X_j'} (including ∅):
       · ask the oracle if X_j ⊥⊥ X_j' | S; if yes:
         + call S ⇒ S_jj' and remember it2
         + remove the edge (X_j, X_j') from the graph
1 IC = “inductive causation”
2 remember the smallest one if there are multiple


– Theorem: This produces the true skeleton if the oracle is always correct.
  In practice, the oracle is some statistical test (⇒ later) that may be erroneous
  ⇒ we get only an approximate skeleton.
– optimization 1: PC algorithm3
  ∗ start with small conditioning sets S: the CI test ("conditional independence
    test") is then faster and more accurate
  ∗ one can show that S only needs to include neighbors of either X_j or X_j' ⇒
    after a few edge removals, far fewer sets S need to be constructed
  ∗ early termination: after some removals, it may become impossible to
    create new sets S
  ∗ algorithm (a minimal sketch follows below):
    · start with the complete graph
    · set S = ∅: for all pairs (X_j, X_j'), remove the edge if X_j ⊥⊥ X_j' | ∅
    · work on S with |S| = 1: for all nodes X_j that have at least 2 neighbors
      + for all X_j' ∈ NE(X_j) and all X_j'' ∈ NE(X_j) \ X_j': remove edge
        (X_j, X_j') if X_j ⊥⊥ X_j' | X_j''
    · work on S with |S| = 2: for all X_j with at least 3 neighbors
      + for all X_j' ∈ NE(X_j) and all X_j1, X_j2 ∈ NE(X_j) \ X_j': remove edge
        (X_j, X_j') if X_j ⊥⊥ X_j' | (X_j1, X_j2)
    · and so on, until no X_j has the required number of neighbors
  ∗ in the worst case this is not faster than the IC algorithm, but it can be proven
    [Classen et al. 2013] that a variant of PC has worst-case complexity
    O(D^(2(deg+2)))4
    ⇒ if the skeleton is sparse (deg is small) ⇒ polynomial runtime; in
    practice this is usually the case
  ∗ if the oracle is always correct, PC creates the correct skeleton; otherwise
    the result is order-dependent, because errors lead to different subsequent
    tests and errors.
– optimization 2: stable parallel PC algorithm: eliminate the order dependence
  by performing all CI tests for a given |S| in parallel, and only remove edges once
  each round |S| is finished
  ∗ don't remove edges in parallel, but sort by confidence of the CI test
    (increasing p-value) and remove in that order, skipping removals whose
    preconditions no longer hold (i.e. the edges a particular test relies upon
    have already been removed) [note from the lecture: this requires more thought]
∗ faster than PC if CI tests are performed concurrently
3 PC = “Peter & Clark”
4 deg: maximum degree of any node in the skeleton
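A minimal Python sketch of the PC skeleton phase; the CI oracle ci_test is assumed to be supplied (e.g. the G-test discussed later):

    from itertools import combinations

    def pc_skeleton(nodes, ci_test):
        """PC skeleton phase. ci_test(i, j, S) answers 'X_i indep. X_j given S?'."""
        adj = {i: set(nodes) - {i} for i in nodes}     # start with the complete graph
        sepset = {}
        size = 0                                       # |S|, grown incrementally
        while any(len(adj[i]) > size for i in nodes):
            for i in nodes:
                for j in list(adj[i]):
                    # condition only on current neighbors of i (excluding j)
                    for S in combinations(adj[i] - {j}, size):
                        if ci_test(i, j, set(S)):
                            adj[i].discard(j); adj[j].discard(i)
                            sepset[frozenset((i, j))] = set(S)   # remember S_jj'
                            break
            size += 1
        return adj, sepset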


Figure 21.1: Example where applying stable-PC gives a different outcome. (graph omitted)

2. detect "v-structures" (common effects)
   – for all pairs X_j, X_j' that are not connected but have a common neighbor X_j'':
     if X_j'' ∉ S_jj' ⇒ X_j'' cannot be a common cause of, or a mediator between, X_j
     and X_j' ⇒ j → j'' ← j'
   – assume that step 2 finds all v-structures5
3. orient as many edges as possible using the following constraints:
   – the BN must be acyclic (⇒ some orientations are infeasible ⇒ use the other
     direction)
   – when X_j and X_j' are not connected but have a common neighbor X_j'', this
     cannot be a v-structure6

Figure 21.2: True Causal Graph of Example 2. Figure 21.3: True Causal Pattern of
Example 2. (graphs over A, B, C, D, E omitted)

example7; example 2 [Spirtes et al. 2010]8
4. get interventional data to orient the remaining edges: we know that after do(X_j = a_j)
   there can be no incoming arcs at X_j; all edges at X_j must point outwards in the
   interventional graph
   – [Eberhardt et al. 2006] showed: in the worst case, two experimental situations
     are needed for every edge (X_j, X_j'):
     a) exp 1: intervene on X_j, but not on X_j'; exp 2: intervene on X_j', but not
        on X_j
5 excludes turning j − j'' − j' into j → j'' ← j'
6 if one edge is already oriented, this implies the orientation of the other
7 the lecture contains a couple of graphical examples here
8 see Automated Search for Causal Relations_ Theory and Practice.pdf, figure 1, for the graph


     b) exp 1: intervene on neither of X_j, X_j'; exp 2: intervene on exactly one
        of X_j, X_j'

Figure 21.4: On the left we see the true unknown complete graph among the variables A, B, C.
In one experiment, the researcher performs simultaneously and independently a parametric
intervention on A and B (I_A and I_B, respectively, shown on the right). Since the
interventions do not break any edges, the graph on the right represents the
post-manipulation graph.

5. Theorem:
   – If one intervenes on exactly one variable per experiment, at most D − 1
     experiments are needed to get the full BN.9
   – If one can intervene on up to D/2 variables simultaneously, log2(D) + 1
     experiments are sufficient.10
   Practical problems are usually not worst case.

9 assuming they do not err


10 again assuming correctness

22 Lecture 08/07

22.1 Detecting conditional independence by statistical tests


• standard method: G-test
  – X can have states 1, ..., C_x
  – Y can have states 1, ..., C_y

        H(X) = − Σ_{k=1}^{C_x} p(X = k) log p(X = k)

        H(X,Y) = − Σ_{k=1}^{C_x} Σ_{l=1}^{C_y} p(X = k, Y = l) log p(X = k, Y = l)

  – Mutual Information:

        MI(X,Y) = H(X) + H(Y) − H(X,Y)

  – usual null hypothesis H_0: X ⊥⊥ Y
    Under H_0 the actually observed counts N follow a multinomial distribution, and

        Ĝ = 2N · M̂I(X,Y)

    has a χ² distribution with (C_x − 1) · (C_y − 1) dof1


  – conditional independence:
    ∗ repeat the test for every state of the conditioning set
    ∗ take the result with the maximal p-value
    ∗ better: Bonferroni correction2
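A minimal sketch of the (marginal) G-test from a contingency table, assuming scipy is available:

    import numpy as np
    from scipy.stats import chi2

    def g_test(table):
        """table: contingency counts n[k, l]; returns (G, p-value)."""
        n = np.asarray(table, dtype=float)
        N = n.sum()
        pxy = n / N
        px = pxy.sum(axis=1, keepdims=True)       # marginal p(X)
        py = pxy.sum(axis=0, keepdims=True)       # marginal p(Y)
        mask = pxy > 0
        # MI = H(X) + H(Y) - H(X,Y), written as sum p log(p / (p_x p_y))
        mi = (pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum()
        G = 2 * N * mi
        dof = (n.shape[0] - 1) * (n.shape[1] - 1)
        return G, chi2.sf(G, dof)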

• Problems:
  – Normally this test is conservative: we reject H_0 only when we are confident
    (small p-value). This means we assume independence in case of doubt. We
    would rather have a test for X ⊥̸⊥ Y, but this is difficult.
  – Continuous variables must be discretized, and MI is very sensitive to the
    particular discretization ⇒ active area of research
1 dof = degrees of freedom
2 see https://xkcd.com/882/


  – if the conditioning set is large, only few samples fulfill the condition ⇒ high
    variance of MI; in practice |S| = 4 is the maximum
  – errors propagate in PC as well

22.1.0.2 Kernel-based independence test

• map the data into an augmented feature space X̃ = ϕ(X), Ỹ = ψ(Y), where ϕ, ψ are
  nonlinear

• compute the cross-covariance matrix

      CV = E[(X̃ − E X̃)^T (Ỹ − E Ỹ)]

• compute the biggest eigenvalue of CV

• perform a statistical test whether this eigenvalue is 0;
  if yes ⇒ X ⊥⊥ Y, because zero correlation implies independence after nonlinear
  mapping [Fukumizu et al. 2008]

• eliminate the explicit mapping by the kernel trick
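A sketch of this idea with an explicit random-feature map instead of the kernel trick (the feature map and its bandwidth are illustrative choices, not the lecture's construction):

    import numpy as np

    def cross_cov_stat(x, y, n_feat=50, gamma=1.0, rng=np.random.default_rng(0)):
        """x, y: (n, d) samples; statistic is ~0 under independence."""
        def phi(v):                      # random Fourier features for an RBF kernel
            w = rng.normal(scale=np.sqrt(2 * gamma), size=(v.shape[1], n_feat))
            b = rng.uniform(0, 2 * np.pi, n_feat)
            return np.cos(v @ w + b)
        X, Y = phi(x), phi(y)
        X -= X.mean(0); Y -= Y.mean(0)   # center in feature space
        C = X.T @ Y / len(x)             # empirical cross-covariance matrix
        return np.linalg.svd(C, compute_uv=False)[0]   # largest singular value

In practice the null distribution of this statistic would be calibrated, e.g. by permuting one of the samples.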

22.1.0.3 Approximation algorithm for BN construction

• move-making algorithms
  – given our current guess of the BN, define "moves" that transform the BN into
    a similar one (typical moves: add arc, remove arc, reverse arc, ...)
  – compute a score for all candidates produced by the moves, e.g.:

        BIC = − log p(D|θ) + (|θ|/2) log N

    (where |θ| is the number of free parameters)
  – implementations vary in
    ∗ allowed moves
    ∗ score functions
    ∗ amount of randomness
  – initialization: usually the empty graph

22.1.0.4 Using structural equation models (SEMs)

• basic claim: ambiguity in BN construction is caused by definitions that are too
  general. It may not be the best idea to allow for arbitrary conditional distributions
  p(X_j | PA(X_j)) ⇒ restrict the function class using SEMs:

      X_j = f_j(PA(X_j), N_j)

  where N_j is noise

98
22.1 Detecting conditional independence by statistical tests

• Theorem: If p(X_1, ..., X_D) is strictly positive and Markov with respect to a DAG G,
  then there exists a SEM on G that generates p(X_1, ..., X_D).

• advantage: we have control over the function class, e.g. X_j = f_j(PA(X_j)) + N_j

• identifiability:
  if f_j is linear and N_j is Gaussian ⇒ one cannot distinguish between X → Y and Y → X;
  the same holds if f_j is asymptotically constant and strictly monotone and N_j is
  exponentially distributed3

• when identifiability holds:
  1. f_j is linear and N_j is non-Gaussian ⇒ LiNGAM algorithm [Kano & Shimizu 2003]
     (similar to ICA)
  2. f_j is nonlinear and N_j is Gaussian ⇒ RESIT4 algorithm

3 N_j ∼ e^(−|x|)
4 [Peters et al. 2014]

23 Lecture 15/07

23.1 RESIT algorithm (regression with subsequent independence test)
• approximation algorithm to construct a BN

• an example of hot research: finding new ways to detect causality
  – traditional (PC algorithm): analyze the dependence between variables:
    X ⊥⊥ Y? X ⊥⊥ Y | Z?
  – new idea: define a SEM1 and analyze the dependency between the predictors and
    the residuals of the SEM regression.
    SEM: X_j = f(PA(X_j)) + N_j (additive noise model)
    regression2: f̂ = arg min_f Σ_i (X_i^j − f(PA(X_j)_i))²
    residuals: R_i^j = X_i^j − f̂(PA(X_j)_i)
    By definition of an additive-noise SEM, we require N_j ⊥⊥ PA(X_j). If the model
    is correct, the same must be true for the residuals: R_j ⊥⊥ PA(X_j), meaning that
    there is no information in PA(X_j) that could be used to reduce the residual R_j.
  – in particular, R_j ⊥̸⊥ PA(X_j) if the model is not causal3

• algorithm:
  – phase 1: determine the optimal ordering of the variables for the Bayesian
    factorization
    ∗ S = {X_1, ..., X_D}
    ∗ for t = D, ..., 1 (construct the order backwards):
      · for each X_j ∈ S:
        + regress X_j on S \ X_j using a suitable regression method and compute
          the residual R_j
        + conduct an independence test for R_j ⊥⊥ S \ X_j and store the p-value p_j
1 structural equation model
2 (non-linear least squares, kernel regression, ...)
3 The lecture contains an example graph here: X_j → X_j' via g(X_j), and the anti-causal model X_j' → X_j via
f(X_j'). Our wrong SEM predicts X̂_j = f̂(g(X_j)); if g is not invertible, information is lost ⇒ R_j = X_j − X̂_j
contains information that could be used to predict X_j' from X̂_j and R_j.


      · set π(t) = arg max_j p_j (place the variable with the biggest p-value, i.e.
        the highest certainty of independence, at position t)
      · PA(X_π(t)) = S \ X_π(t)
      · update S = S \ X_π(t)
  – phase 2: determine the SEM (remove as many edges from the graph as possible
    and determine the regression functions)
    ∗ initialize G as the complete DAG for the order of phase 1 (add arcs X_j' → X_j
      for all X_j' ∈ PA(X_j))
    ∗ for t = 2, ..., D:
      · for each X_j' ∈ PA(X_π(t)) (try to get rid of as many parents as possible):
        + regress X_π(t) on PA(X_π(t)) \ X_j' and compute the residuals R_j';
          if R_j' ⊥⊥ PA(X_π(t)): remove X_j' from the parents, PA(X_π(t)) = PA(X_π(t)) \ X_j'4
  – output the resulting graph and regression functions (a minimal sketch of
    phase 1 follows below)

• properties:
  – only marginal dependence tests are needed (no conditional ones) → easier
  – one can prove: the true causal model is identifiable when the regression functions
    are non-linear, provided that the regression is sufficiently powerful and the
    additive noise assumption is fulfilled.5
  – efficient: O(D²Q) operations, where Q is the complexity of the regressions and
    independence tests6
4 typically: kernel-based independence tests
5 typical regression methods: kernel SVR, Gaussian processes, generalized additive models, linear regression
6 compare PC: O(2^D)
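A minimal Python sketch of RESIT phase 1; the helpers regress (returning fitted values) and indep_pvalue are assumed to be supplied, e.g. a Gaussian-process regression and a kernel independence test:

    import numpy as np

    def resit_order(data, regress, indep_pvalue):
        """data: dict var -> 1d sample array; returns a causal order (first -> last)."""
        S, order = set(data), []
        while len(S) > 1:
            best, best_p = None, -1.0
            for j in S:
                X = np.column_stack([data[k] for k in sorted(S - {j})])
                r = data[j] - regress(data[j], X)   # residuals R_j
                p = indep_pvalue(r, X)              # test R_j indep. S \ {X_j}
                if p > best_p:
                    best, best_p = j, p
            order.append(best)                      # most plausible sink: place last
            S.remove(best)
        order.append(S.pop())
        return order[::-1]                          # positions t = 1, ..., D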

23.2 Parameter estimation in BNs


Estimate the probabilities p(X j |PA(X j )) from training data, given the structure of the BN.

• if the variables are discrete: estimate the conditional probabilities by counting
  (a minimal sketch follows after this list)

• if the variables are continuous (or have too many discrete states): use an SEM and
  regression

• if data is partially missing:
  – if missing at random (the fact that a value is missing is statistically independent
    of the missing value): EM algorithm (replace missing data by our current guess
    of their expected value)
– if systematically missing:


    ∗ if the missing value is irrelevant for the instance at hand (e.g. physicians
      wouldn't run a useless diagnostic test)
      ⇒ introduce the class "irrelevant" as an additional state of the variable
    ∗ otherwise: non-trivial problem ⇒ later
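A minimal sketch of counting-based CPT estimation; the data layout (list of dicts) and the Laplace smoothing parameter are illustrative choices:

    from collections import Counter, defaultdict

    def estimate_cpt(samples, child, parents, n_states, alpha=1.0):
        """samples: list of dicts var -> state; returns p(child | parents) by counting."""
        counts = defaultdict(Counter)
        for s in samples:
            counts[tuple(s[p] for p in parents)][s[child]] += 1
        cpt = {}
        for pa, c in counts.items():
            total = sum(c.values()) + alpha * n_states      # Laplace smoothing
            cpt[pa] = {k: (c[k] + alpha) / total for k in range(n_states)}
        return cpt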

23.3 Drawing Conclusions from a BN


Two typical cases (in both cases it is critical to avoid omitted variable bias (Simpson’s
paradox)):
1. has X a direct effect on Y ? (i.e. is there an arc X → Y , and if yes, how strong is the
association?)
2. what is the total causal effect of X on Y along all paths X Y combined?

• omitted variable bias in case 1: Berkeley admission example
  – proposed BN of the plaintiffs: sex → admission. The G-test for this model is highly
    significant for sex ⊥̸⊥ admission and discrimination against women, p-value < 10^(−5)
  – proposed BNs of the defense: sex → field → admission, and maybe even a link
    sex → admission?
    Q: is there a direct effect of sex on admission?
    G-test for sex ⊥⊥ admission | field:
    ∗ conditional independence for 5 out of 6 fields, p-value > 0.3
    ∗ in one field: conditional dependence with p-value = 10^(−5), but here women
      are significantly preferred (82% vs. 62%)
    ⇒ sex has a large total causal effect, but only a small direct effect
    ⇒ rule: omitted "mediating" factors7 (here: field) cause bias when the direct
      effect is of interest
• in case 2, omitted variable bias arises from missing common causes ("confounders")
  example: kidney stone data: recovery rates for two treatments A (open surgery) and
  B (minimally invasive surgery)

               A: treated   A: recovered   B: treated   B: recovered
      all         350        273 (78%)        350        289 (83%)

  B seems to be better (but the G-test says that the difference is not yet significant)
  BN: treatment → recovery. But an important confounder is missing: stone size,
  giving us: recovery ← size → treatment → recovery

               A: treated   A: recovered   B: treated   B: recovered
      all         350        273 (78%)        350        289 (83%)
      small        87         81 (93%)        270        234 (87%)
      big         263        192 (73%)         80         55 (69%)
7 variables on a directed path X ⇝ Y


Conditional on stone size, treatment A is superior. But: physicians prefer treatment
B for the less severe cases8
The problem arises due to the difference between the conditional probability
p(rec | treatm.) and the interventional probability p(rec | do(treatm.)). We derived earlier:

      p(Y | X) = Σ_Z p(Z | X) p(Y | X, Z)
      p(Y | do(X)) = Σ_Z p(Z) p(Y | X, Z)

the latter being the definition of the total causal effect of X on Y. Here
(verified numerically below):

      p(rec | do(treat = A)) = 83%
      p(rec | do(treat = B)) = 78%

i.e. exactly the reverse of the conditional probabilities.

Computing p(Y | do(X)) in the presence of confounders is called adjustment.

8 i.e. the BN is still not complete, because it doesn't explain the treatment choice ⇒ more hidden factors,
e.g. risk factors, speed of recovery, cost, ...
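Verifying the adjustment computation on the numbers from the table above:

    # (treated, recovered) counts per treatment and stone size
    n = {('A', 'small'): (87, 81),   ('A', 'big'): (263, 192),
         ('B', 'small'): (270, 234), ('B', 'big'): (80, 55)}

    N = sum(t for t, r in n.values())                                 # 700 patients
    pZ = {z: sum(n[(x, z)][0] for x in 'AB') / N for z in ('small', 'big')}

    for x in 'AB':
        # p(rec | do(treat=x)) = sum_z p(z) p(rec | x, z)
        do = sum(pZ[z] * n[(x, z)][1] / n[(x, z)][0] for z in ('small', 'big'))
        print(x, round(do, 2))   # A: 0.83, B: 0.78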

24 Lecture 17/07

24.1 Confounder Adjustment


• the naive estimate of the treatment effect,

      E[Y | X = A] − E[Y | X = B] ≈ (1/N_A) Σ_{i: X_i = A} Y_i − (1/N_B) Σ_{i: X_i = B} Y_i,

  is biased when the treatment decision X_i depends on features Z_i of the
  individual instance i, because the groups who received either treatment are not
  comparable

• illustration using potential outcomes: imagine that for each individual, the reaction
  to treatments A and B is pre-determined but unknown to us.
  – By applying a treatment, we will observe one of the potential outcomes, but
    since we cannot rewind time, the other outcomes ("counter-factual" outcomes)
    are missing = "fundamental missing data problem of causal inference".
  – If we have binary variables (2 treatments A and B, 2 outcomes true and false),
    there are 4 different types of people according to their potential outcomes:

        type | outcome under A | under B |
        a    |        1        |    0    |  A responsive
        b    |        0        |    1    |  B responsive
        c    |        1        |    1    |  complete responsive
        d    |        0        |    0    |  doomed1

    two observation groups: N_A individuals who received treatment A: T_A2
  – the (unknown) proportions of the 4 types in the two groups: p_a, p_b, p_c, p_d for T_A;
    q_a, q_b, q_c, q_d for T_B
  – we cannot distinguish the types a from c and b from d in T_A, nor a from d and
    b from c in T_B, because the counter-factual outcome is unknown.
  – compute the number of recovered people: R_A = N_A(p_a + p_c); R_B = N_B(q_b + q_c)
  – naive estimate: compare the proportions: R_A/N_A − R_B/N_B = p_a + p_c − (q_b + q_c)
  – we are actually interested in the treatment effect: p_a − p_b
  – we must make sure that the two groups T_A and T_B are comparable ("exchangeable"
    treatment assignment)
    ⇒ p_a ≈ q_a, p_b ≈ q_b (and likewise for c, d) ⇒ R_A/N_A − R_B/N_B ≈ p_a − p_b, as desired.
1 insert dark laughter here
2 N_B, T_B analogously


– the preferred strategy to achieve this: randomized experiments = decide about the
  treatment uniformly at random, without considering the features Z_i
  ⇒ asymptotically, the feature distributions p(Z | X = A) = p(Z | X = B) = p(Z),
  because X ⊥⊥ Z by design
  ⇒ we have p_a = q_a, etc. asymptotically
  (for finite samples, this may not be achieved ⇒ avoid this by rejection sampling,
  i.e. check whether p(Z | X = A) ≈ p(Z | X = B) and draw a new random assignment if
  this is not the case)
– often, randomized assignment is impossible
  example: placebo surgery ⇒ we must explicitly adjust for the differences between
  T_A and T_B.
• possibility 1: confounder adjustment: p(Y | do(X = A)) = Σ_Z p(Y | X = A, Z) p(Z)
  – The confounder Z can actually be a set of features: Z = {Z_1, ..., Z_L}.
  – question: what is a valid adjustment set Z, given the BN?
  – graphical criterion: "backdoor criterion" (sufficient, but not necessary):
    a) Z must not contain a descendant of X, to avoid Berkson's paradox3 and
       mediating variables
    b) remove the outgoing arcs of X (= remove the causal effects of X) ⇒ all
       remaining associations between X and Y are spurious common-cause
       effects of the confounders.
       If X ⊥⊥ Y | Z in the modified graph, such effects cannot occur if we adjust
       for Z ⇒ Z is a valid adjustment set.4
  – problem: the number of joint states of Z grows exponentially with the number
    of variables in Z: #states = Ω(2^L)
    ⇒ if #states is big, we will be unable to estimate p(Y | X, Z) from a realistic
    amount of training data
• possibility 2: stratification: define subgroups of instances with similar Z ("clusters",
  "strata"), estimate the probabilities in each stratum separately, and sum over the strata
  – typically achieved by coarse quantization of the Z_l into at most 5 levels.5
  – still doesn't work when Z has too many variables
• possibility 3: propensity score [Rosenbaum & Rubin 1983], [Austin 2011]6
  – introduce a new variable F by splitting X (F becomes the only parent of X,
    receiving all the arrows from Z)
    ⇒ if Z was a valid adjustment set, so is F, because it blocks exactly the same
    backdoor paths.
3 https://en.wikipedia.org/wiki/Berkson's_paradox
4 special case: Z = PA(X)
5 age groups: 5-15, 15-25, ...
6 see austin2011.pdf


– define a structural equation model: F is a deterministic function7 of Z,
  F_A(Z) = p(X = A | Z), called the propensity score for A;
  simply train a classifier that gives a posterior distribution (logistic regression,
  random forest, neural network), then X ∼ Bernoulli(F_A(Z))
– surprising result: when F_A(Z_i) ≈ F_A(Z_i'), then the two individuals are
  comparable, even if Z_i ≠ Z_i' ⇒ it only matters that the propensity scores match
– use this in three ways:
  i) stratify on propensity score intervals
  ii) weight adjustment: consider

        E[1(X = A) Y / F_A(Z)] = Σ_{X,Y,Z} [1(X = A) Y / F_A(Z)] p(X,Y,Z)
                               = Σ_{X,Y,Z} [1(X = A) Y / F_A(Z)] p(Y | X, Z) p(X | Z) p(Z)
                               = Σ_{Y,Z} Y p(Y | X = A, Z) [p(X = A | Z) / F_A(Z)] p(Z)   (p(X = A | Z) = F_A(Z))
                               = Σ_Y Y Σ_Z p(Y | X = A, Z) p(Z)
                               = E[Y | do(X = A)]

      likewise, we have E[1(X = B) Y / F_B(Z)] = E[Y | do(X = B)]

      ⇒ E[Y | do(X = A)] − E[Y | do(X = B)]
           ≈ (1/N) Σ_i 1(X_i = A) Y_i / F_A(Z_i) − (1/N) Σ_i 1(X_i = B) Y_i / F_B(Z_i)
           = (1/N) Σ_{i: X_i = A} Y_i / F_A(Z_i) − (1/N) Σ_{i: X_i = B} Y_i / F_B(Z_i)
           ≠ (1/N_A) Σ_{i: X_i = A} Y_i − (1/N_B) Σ_{i: X_i = B} Y_i

      (a sketch of this estimator follows after this list)
  iii) propensity score matching: arrange the two groups into a complete bipartite
       graph
       ∗ weight the edges by the absolute difference w_ii' = |F_A(Z_i) − F_A(Z_i')|
       ∗ remove edges whose weight is above some threshold
       ∗ find the minimum-cost bipartite matching with the Hungarian algorithm
         (or a greedy approximation8)
7 analogously for B
8 results do not really change, according to the literature


(Figure omitted: complete bipartite graph between treated instances 1-5 and control
instances 6-10.)

       ∗ form subgroups T'_A and T'_B from all matched pairs: in T'_A and T'_B the
         naive formula is equivalent to the exact formula

             E[Y | do(X = A)] = E_pairs[Y | X = A]

       ∗ the matched partner is our best guess at the counterfactual outcome
       ∗ the ultimate pairing: twin studies
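A minimal sketch of the weight adjustment (ii); the logistic-regression propensity model is one possible choice, not prescribed by the lecture:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ipw_effect(Z, X, Y):
        """Z: (n, d) confounders; X: binary treatment (1 = A, 0 = B); Y: outcomes."""
        prop = LogisticRegression().fit(Z, X)
        fA = prop.predict_proba(Z)[:, 1]          # F_A(Z_i) = p(X = A | Z_i)
        n = len(Y)
        ey_do_A = np.sum((X == 1) * Y / fA) / n
        ey_do_B = np.sum((X == 0) * Y / (1 - fA)) / n
        return ey_do_A - ey_do_B                  # E[Y|do(A)] - E[Y|do(B)]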

25 Lecture 22/07

25.1 Hidden Confounders


• variables we cannot measure (accurately) or don’t even know that they exist/are
relevant
• gold standard: randomized controlled experiment (RCE): do(X ) cuts all incoming
arcs of X , regardless of known or hidden
• if there is no opportunity to do RCE, not all is lost:
– sometimes, p(Y |do(X )) is still identifiable, e.g. front-door adjustment formula
– It may be possible to intervene on an instrumental variable W that influences
X . ⇒ p(Y |do(X )) may be identifiable
– But: the success of these methods depends on very narrow conditions.

25.2 Transfer Learning = Domain Adaptation


• suppose we cannot get data of the desired quality in the target domain
• instead of asking "How can we get away with bad data?", we ask "Can we combine
  the bad data with good data from a similar domain to get better results?"
• we want to be better than the naive baselines:
  – learn the model from target data alone (high variance, possibly high bias if
    confounding is unadjusted)
  – use the model learned in the source domain (high bias, because the domains differ)
• typical scenario: we got high-quality annotations from experts, but the experts won't
  do this again for another dataset
• there was a carefully designed experiment in one country: are the results transferable
  to another country, where only observational data exists?
• can we transfer results on a limited cohort (students) to the population at large?1
• Frustratingly Easy Domain Adaptation: EasyAdapt [Daumé III 2007]2, EasyAdapt++
  [Daumé III et al. 2010]
1 e.g. WEIRD students in psychological studies, or transfer from lab animals to humans
2 see daumedomainAdapt.pdf


– standard 2-class classification
– we have lots of annotated training data in domain D = 0
– we have little annotated training data in domain D = 1, optionally lots of
  unlabeled data (semi-supervised)
– idea:
  ∗ centralize the features X
  ∗ create an augmented feature space X̃ by replicating the features:

        X̃ = [X, (1 − D)X, DX] = [X, X, O]  if i ∈ source
                                 [X, O, X]  if i ∈ target

    ⇒ treat transfer learning as a missing-data problem: one of the copies is
    missing for each instance and replaced by the expected values O.
  ∗ the first copy should capture the common properties of both domains, the
    others the differences
– consider a linear classifier: training gives a weight vector β̃ = [β_c, β_s, β_t]^T
– prediction of a source instance: Ŷ = X̃ β̃ = X(β_c + β_s)
– prediction of a target instance: Ŷ = X̃ β̃ = X(β_c + β_t)
– also works for any blackbox classifier = EasyAdapt; works well in experiments
– semi-supervised version EasyAdapt++: we augment the training set also with
  unlabeled data
– claim: the predictions of the source and target classifiers should be similar on
  unlabeled data:

      X_u(β_c + β_s) ≈ X_u(β_c + β_t) ⇔ X_u(β_s − β_t) ≈ 0 ⇔ X̃_u β̃ ≈ 0, where X̃_u = [O, X_u, −X_u]

– since we don't have the label 0, we add two augmented instances for each unlabeled
  instance, one with either label: (X̃_u, Ỹ_u) = ([O, X_u, −X_u], +1) and
  (X̃_u', Ỹ_u') = ([O, X_u, −X_u], −1)
– train normally and predict Ŷ = X(β_c + β_t) for target points
– for a linear SVM, this yields the corresponding loss terms for the augmented instances
– outperforms many complex methods
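A minimal sketch of the EasyAdapt feature augmentation (and the extra unlabeled instances of EasyAdapt++):

    import numpy as np

    def easy_adapt(X, domain):
        """X: (n, d) centered features; domain: (n,) array, 0 = source, 1 = target."""
        D = domain[:, None].astype(float)
        return np.hstack([X, (1 - D) * X, D * X])   # [common, source-only, target-only]

    def easy_adapt_pp_unlabeled(Xu):
        """EasyAdapt++: two augmented instances per unlabeled point, labels +1/-1,
        pushing X_u (beta_s - beta_t) towards 0."""
        Z = np.hstack([np.zeros_like(Xu), Xu, -Xu])
        return np.vstack([Z, Z]), np.concatenate([np.ones(len(Xu)), -np.ones(len(Xu))])

The augmented matrix can then be fed to any standard linear classifier, e.g. a linear SVM.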

25.3 Data augmentation


idea: to make a ML algorithm robust against certain systematic transformations of the
data X , create additional training data using these transformations without changing the
outcome
⇒ algorithm must learn that the transformations are irrelevant


• robustness against noise: add noise,

• illumination changes: linear intensity transformations

• rotational invariance: rotate the data

• shape invariance: randomly morph the shape

• ...

• especially popular for neural networks, since they need a lot of training data anyway
  and can create augmented data on the fly

• preliminary finding: performs as well as explicitly designed invariance, but is much
  simpler
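A minimal sketch of on-the-fly augmentation for images; the noise level, intensity range, and use of scipy.ndimage are illustrative choices:

    import numpy as np
    from scipy.ndimage import rotate

    def augment(img, rng=np.random.default_rng()):
        out = img + rng.normal(0, 0.05, img.shape)                   # noise robustness
        out = out * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1)   # illumination change
        return rotate(out, rng.uniform(0, 360), reshape=False)       # rotational invariance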

25.4 Importance sampling by reweighting


• given a BN with fixed structure and parameterized probabilities/structural equations:

• we have high-quality data of the BN's behavior for some parameterization θ, and want
  to know "How does the BN behave for θ'?", without redoing the (expensive) experiment

• replace the question by the counterfactual question: "How would the BN have behaved
  had the parameters been θ', provided the hidden mechanisms didn't change?"
  ⇒ virtually replay the data to simulate BN(θ')

• example: advertisement placement on the Bing search results page [Bottou et al. 2013]
  – three conflicting goals:
    ∗ don't annoy the user (few and relevant ads)
    ∗ attract advertisers (high click rates at a reasonable price)
    ∗ maximize Bing's revenue
  – BN:

        p(X | θ) = ∏_j p(X_j | PA(X_j), θ)

  – if we change the parameters of a single factor (one arc of the causal graph), we get

        p(X | θ') = [p(X_j | PA(X_j), θ') / p(X_j | PA(X_j), θ)] · p(X | θ),

    where w_j = p(X_j | PA(X_j), θ') / p(X_j | PA(X_j), θ) is the reweighting factor for j, and

        E[Y | θ'] = E[Y w_j | θ] ≈ (1/N) Σ_i Y_i w_ij

    This is known as "importance sampling".
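A minimal sketch of the replay estimate; p_old and p_new are assumed callables returning the old and new conditional probability of the changed factor:

    import numpy as np

    def replay_estimate(Y, Xj, PA, p_old, p_new):
        """E[Y | theta'] ~ (1/N) sum_i Y_i w_i, with w_i = p_new / p_old."""
        w = np.array([p_new(x, pa) / p_old(x, pa) for x, pa in zip(Xj, PA)])
        return float(np.mean(Y * w)), float(w.max())   # monitor max weight for divergence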


Figure 25.1: Causal graph of the ad-placement network, with nodes: user intent u,
query x, ad inventory v, ads a, bids b, scores q, slate s, prices c, clicks y, revenue z.

  – critical: w_j must not diverge, which happens when p(X_j | PA(X_j), θ) ≈ 0
    ⇒ the researchers artificially increased the variance in the θ-experiment, so that
    p(X_j | PA(X_j), θ) spans a larger part of the feature space
  – performs well in experiments

25.5 Causal Theory of Transferability [Bareinboim, Pearl, Tian, 2012-2015]
• addresses3 whether the causal effect p*(Y | do(X)) in domain ∗ is identifiable by
  combining observational data from ∗ (lower quality) with high-quality (experimental)
  data from a related domain

• graphical notation extensions: dotted bidirectional arrows for hidden confounding,
  selection variable S ∈ {0, 1} for the domain

• examples: the lecture contains some examples here, but since they aren't reproduced
  here, the corresponding calculations are left out as well

• intuitive goal: transform the expressions such that no conditional probability contains
  both S = 1 and do(X) simultaneously

• the inventors presented a complete theory (graphical criteria and algorithms) to
3 see e.g. AAAI-14-r425.pdf


  – decide if p(Y | do(X), S = 1) is identifiable in a given graph (including dotted arrows)
  – find the appropriate adjustment expression by automatic symbolic calculations

26 Lecture 24/07

26.1 The omitted chapters (aka. “Machine Learning III”)


26.1.0.5 Markov Random Fields

• undirected graphical models: decompose according to the Gibbs distribution

      p(X_1, ..., X_D) = (1/Z) exp(− Σ_c E_c(X_c)),

  where the E_c are the energies/potentials and the c are the cliques of variables in a
  given undirected graph (= maximal fully connected subgraphs); Z is the partition
  function:

      Z = Σ_X exp(− Σ_c E_c(X_c))

  The partition function is usually intractable
  ⇒ most popular task: finding the MAP solution1 = minimal-energy state = "best"
  solution:

      X̂ = arg max_X p(X) = arg min_X Σ_c E_c(X_c) + log Z,    log Z = const.

• typical potentials:
  – unary potentials E_c(X_c) = E(X_j) encode the local evidence for the probable state
    of X_j
  – pairwise potentials E(X_j, X_j') encode the desire of X_j, X_j' to take similar values
    (= attractive potentials) or different values (= repulsive potentials)
  – higher-order potentials |c| ≥ 3 encode preferences for the structure/pattern of
    the variables involved (often neglected by assumption)
• typical inference algorithms for discrete X:
  – exact:
    ∗ reformulate the problem as an integer linear program arg min_X w · X s.t.
      linear inequality constraints are met
      (usually NP-hard, but often tractable by heuristics in practice)
    ∗ special cases:
      · tree-shaped models ⇒ belief propagation gives the exact solution in
        one forward/backward sweep
      · sub-modular models (for binary pairwise potentials: E(0,0) + E(1,1) ≤
        E(0,1) + E(1,0), i.e. agreeing labels are never more expensive than
        disagreeing ones) ⇒ the graph-cut algorithm is exact [maximum flow in
        a graph = standard problem]
1 maximum a posteriori


  – approximations:
    ∗ relaxation, i.e. allow real values for X in the linear program (round later)
    ∗ move making: given a guess X^(t), define elementary moves (changes of a
      few variables) and accept the best one (iterated conditional modes (ICM,
      see the sketch below), Lazy Flipper2, tree submodels)
    ∗ move making: reduce the problem to a tractable subproblem (one-label-
      against-the-rest = α-expansion, one-label-against-one = α-β swap3)
    ∗ loopy belief propagation: iterate message passing until convergence
    ∗ sampling methods: randomly simulate the model and choose the best
      solution seen so far (Markov chain Monte Carlo (MCMC), Swendsen-Wang,
      Gibbs sampling)
  – learning the potentials:
    ∗ learn each potential in isolation, independently of the others
    ∗ better but much more difficult: learn the potentials jointly, s.t. they reinforce
      each other towards a global loss function (on large patterns)
    ∗ details: watch Fred Hamprecht's new video lecture4
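A minimal sketch of ICM for a chain-shaped pairwise MRF with a Potts term; the chain topology and the weight beta are illustrative:

    import numpy as np

    def icm_chain(unary, beta, n_iter=10):
        """unary: (n, K) energies; pairwise Potts term beta * [x_i != x_j] on a chain."""
        x = unary.argmin(axis=1)                  # start from unary-optimal labels
        n, K = unary.shape
        for _ in range(n_iter):
            for i in range(n):
                e = unary[i].copy()
                for j in (i - 1, i + 1):          # chain neighbors
                    if 0 <= j < n:
                        e = e + beta * (np.arange(K) != x[j])
                x[i] = e.argmin()                 # greedy elementary move
        return x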

26.1.0.6 Weak Annotations

getting annotated training data is expensive


• one-class learning: only provide annotations for the target class (assuming this is
  easy) ⇒ non-targets = outliers w.r.t. the target distribution
  ⇒ generative model, one-class SVM (one contour of the PDF, e.g. the 95% confidence
  contour, or several coupled one-class SVMs for nested contours)

• multiple instance learning: a single label for a group (“bag”) of instances (e.g. one
label per image for all pixels jointly) with the understanding that only some instances
in each bag conform to the label ⇒ find these instances and the corresponding
classifier

• similarity (metric) learning: just annotate "is A more similar to B or to C?" ⇒ learn
  the similarity function and the clustering

• sparse annotation:5 for each instance, only a few true labels are known (e.g. movies
that someone likes) ⇒ infer the missing labels and learn a model (e.g. recommender
systems)

• active learning: minimize the required training set size by actively selecting the
  most informative instances (don't waste annotator effort on the easy decisions)
2 see lazyflipper.pdf
3 both can be solved by graph-cut
4 https://www.youtube.com/playlist?list=PLuRaSnb3n4kSgSV35vTPDRBH81YgnF3Dd
5 related to multiple instance learning


• semi-supervised learning: combine a small labeled training set with a big unlabeled
  one (combine supervised & unsupervised learning)

• transfer learning: combine a small labeled training set with a big labeled training
  set from a similar domain

• reinforcement learning: for state machines: instead of learning transition probabilities
  that maximize the data likelihood (Baum-Welch), learn transition probabilities
  that optimize an "expected reward"; problem: the reward can only be computed after
  many transitions, not for each transition individually (delayed annotation); e.g.
  games: win or loss, robots: goal achieved or crashed

26.1.0.7 Matrix factorization

many phenomena can be explained by a linear superposition of elementary phenomena

    (observed matrix) = (elementary phenomena) · (weights)
                        "what could happen"       "what was selected"

general idea: use application-specific constraints to make the decomposition well-posed;
most popular: sparsity (each elementary entry only explains a local part of the domain,
and only few elementary things are active in each instance)

26.1.0.8 Features

• designed features: we have to select from infinitely many possibilities

• feature learning:
  – initial layers of a neural network
  – kernel approximation: k(x, x') = ⟨ϕ(x), ϕ(x')⟩ ⇒ use feature selection to find
    the important coordinates of ϕ(x) and compute ϕ̃(x) explicitly (without the kernel)

• random features:
  – random projections have a lot of interesting structural properties (they are not
    chaos)
    ⇒ use these properties (e.g. the Johnson-Lindenstrauss lemma6)
    ⇒ multi-dimensional (randomized) hashing for similarity
    ⇒ extreme learning machine: a 2-layer NN where visible → hidden is random and
      hidden → output is analytically optimized

6 https://en.wikipedia.org/wiki/Johnson-Lindenstrauss_lemma
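A minimal sketch of such a random projection:

    import numpy as np

    def jl_project(X, k, rng=np.random.default_rng(0)):
        """Random projection to k dims; pairwise distances are preserved up to
        (1 +/- eps) for k = O(log n / eps^2) (Johnson-Lindenstrauss)."""
        R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
        return X @ R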
