
Statistical Machine Learning (BE4M33SSU)

Lecture 1.
Czech Technical University in Prague
Course format
2/10
Teachers: Jan Drchal, Boris Flach, Vojtech Franc and Daniel Bonilla
Format: 1 lecture & 1 tutorial per week (6 credits); tutorials are of two types:
seminars: discussing solutions of theoretical assignments (published a week before the class). You are expected to work on them in advance.
practical labs: explaining and discussing practical homeworks, i.e. implementations of selected methods in Python (or Matlab). You have to submit
1. a report in PDF format (typeset preferably in LaTeX). Exception: if necessary, you may include lengthy formula derivations as handwritten scans.
2. your code, either as a source file or as a Python notebook. The code must be executable.

Grading: 40% homeworks + 60% written exam = 100% (+ bonus points)
Prerequisites:
probability theory and statistics (A0B01PSI)
pattern recognition and machine learning (AE4B33RPZ)
optimisation (AE4B33OPT)
More details: https://cw.fel.cvut.cz/wiki/courses/be4m33ssu/start


Goals
3/10
The aim of statistical machine learning is to develop systems (models and algorithms) for
solving prediction tasks given a set of examples and some prior knowledge about the task.

Machine learning has been successfully applied, e.g., in areas such as:
text and document classification,
speech recognition and natural language processing,
computational biology (genes, proteins), biological imaging & medical diagnosis,
computer vision,
fraud detection, network intrusion detection,
and many others.

You will gain skills to construct learning systems for typical applications by successfully
combining appropriate models and learning methods.
Characters of the play
4/10
object features x ∈ X are observable; x can be: a categorical variable, a scalar, a real valued vector, a tensor, a sequence of values, an image, a labelled graph, ...
state of the object y ∈ Y is usually hidden; y can be: see above
prediction strategy (a.k.a. inference rule) h : X → Y; depending on the type of Y:
• y is a categorical variable ⇒ classification
• y is a real valued variable ⇒ regression
training examples T = {(x, y) | x ∈ X, y ∈ Y}
loss function ℓ : Y × Y → R+ penalises wrong predictions, i.e. ℓ(y, h(x)) is the loss for predicting y′ = h(x) when y is the true state

Goal: an optimal prediction strategy h : X → Y that minimises the loss

Q: give meaningful application examples for combinations of different X, Y and related loss functions
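
To make these objects concrete, here is a small Python sketch (not part of the slides) with a 0/1 loss for classification, a squared loss for regression, and a toy prediction strategy; all names and numbers are illustrative.

```python
# Illustrative only: the basic objects of a prediction task in code.
# Here X = real-valued feature vectors and Y = {0, 1}; the rule and numbers are made up.
import numpy as np

def zero_one_loss(y, y_pred):
    """0/1 loss for classification: 1 if the prediction is wrong, 0 otherwise."""
    return float(y != y_pred)

def squared_loss(y, y_pred):
    """Squared loss, the usual choice for regression."""
    return (y - y_pred) ** 2

def h(x):
    """A toy prediction strategy h : X -> Y, here a simple threshold rule."""
    return int(np.sum(x) > 0.0)

x, y = np.array([0.2, -0.1, 0.4]), 1   # one observed example (x, y)
print(zero_one_loss(y, h(x)))          # loss of the prediction h(x) when y is the true state
```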
Statistical machine learning
5/10
Main assumption:
X, Y are random variables,
X, Y are related by an unknown joint p.d.f. p(x, y),
we can collect examples (x, y) drawn from p(x, y).

Typical concepts:
regression: Y = f(X) + ε, where f is unknown and ε is a random error,
classification: p(x, y) = p(y) p(x | y), where p(y) is the prior class probability and p(x | y) the conditional feature distribution.

Consequences and problems:
the inference rule h(X) and the loss ℓ(Y, h(X)) become random variables,
the risk of an inference rule h(X) ⇒ expected loss

R(h) = E[ℓ(Y, h(X))] = ∑_{x∈X} ∑_{y∈Y} p(x, y) ℓ(y, h(x))
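As a concrete illustration (not part of the slides), the sketch below computes this risk exactly for an invented discrete joint distribution p(x, y), a fixed rule h and the 0/1 loss; all numbers are hypothetical.

```python
# Illustrative only: the risk R(h) = sum_{x,y} p(x, y) * loss(y, h(x)) computed exactly
# for an invented discrete joint distribution over X = {0, 1, 2} and Y = {0, 1}.

p = {(0, 0): 0.30, (0, 1): 0.10,
     (1, 0): 0.10, (1, 1): 0.20,
     (2, 0): 0.05, (2, 1): 0.25}               # p[(x, y)] = p(x, y), sums to 1

loss = lambda y, y_pred: float(y != y_pred)    # 0/1 loss
h = lambda x: 0 if x == 0 else 1               # some fixed inference rule

R = sum(p_xy * loss(y, h(x)) for (x, y), p_xy in p.items())
print(R)                                       # expected loss of h under p(x, y)
```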

how to estimate R(h) if p(x, y) is unknown?
how to choose an optimal predictor h(x) if p(x, y) is unknown?



Statistical machine learning
6/10
Estimating R(h):

collect an i.i.d. test sample S^m = {(x^i, y^i) ∈ X × Y | i = 1, ..., m} drawn from the distribution p(x, y),
estimate the risk R(h) of the strategy h by the empirical risk

R(h) ≈ R_{S^m}(h) = (1/m) ∑_{i=1}^{m} ℓ(y^i, h(x^i))

Q: how strongly can the two deviate from each other? (see next lectures)

P( |R_{S^m}(h) − R(h)| > ε ) ≤ ??
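
A minimal sketch of this estimator (not part of the slides): draw a test sample and average the losses. The data distribution, the strategy h and the sample size below are all made up for illustration.

```python
# Illustrative only: estimate R(h) by the empirical risk on an i.i.d. test sample S^m.
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m examples from a made-up p(x, y): y ~ Bernoulli(0.5), x ~ N(y, 1)."""
    y = rng.integers(0, 2, size=m)
    x = y + rng.normal(size=m)
    return x, y

h = lambda x: (x > 0.5).astype(int)                   # some fixed prediction strategy
loss = lambda y, y_pred: (y != y_pred).astype(float)  # 0/1 loss

x_test, y_test = sample(m=10_000)
R_emp = loss(y_test, h(x_test)).mean()                # R_{S^m}(h), the empirical risk
print(R_emp)                                          # approximates R(h) for large m
```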
Statistical machine learning
7/10
Choosing an optimal inference rule h(x)

If p(x, y) is known:

The smallest possible risk is

R* = inf_{h∈Y^X} R(h) = inf_{h∈Y^X} ∑_{x∈X} ∑_{y∈Y} p(x, y) ℓ(y, h(x)) = ∑_{x∈X} p(x) inf_{y′∈Y} ∑_{y∈Y} p(y | x) ℓ(y, y′)

The corresponding best possible inference rule is the Bayes inference rule

h*(x) = argmin_{y′∈Y} ∑_{y∈Y} p(y | x) ℓ(y, y′)
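Purely as an illustration (not from the slides), the sketch below evaluates this rule for a small discrete toy problem: for each x it picks the prediction y′ with the smallest expected loss under p(y | x). The posterior and the asymmetric loss matrix are invented.

```python
# Illustrative only: Bayes inference rule h*(x) = argmin_{y'} sum_y p(y | x) * loss(y, y')
import numpy as np

# invented posterior p(y | x) for X = {0, 1, 2} and Y = {0, 1}: rows are x, columns are y
posterior = np.array([[0.9, 0.1],
                      [0.4, 0.6],
                      [0.2, 0.8]])

# invented asymmetric loss matrix loss[y, y']: predicting 1 when y = 0 costs 5, the opposite costs 1
loss = np.array([[0.0, 5.0],
                 [1.0, 0.0]])

expected_loss = posterior @ loss       # entry (x, y') = sum_y p(y | x) * loss(y, y')
h_star = expected_loss.argmin(axis=1)  # Bayes prediction for each x
print(h_star)
```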

But p(x, y) is not known and we can only collect examples drawn from it. We need:

Learning algorithms that use training data and prior assumptions/knowledge about the task
Learning types
8/10
Training data:

if T^m = {(x^i, y^i) ∈ X × Y | i = 1, ..., m} ⇒ supervised learning
if T^m = {x^i ∈ X | i = 1, ..., m} ⇒ unsupervised learning
if T^m = T_l^{m1} ∪ T_u^{m2}, with labelled training data T_l^{m1} and unlabelled training data T_u^{m2} ⇒ semi-supervised learning

Prior knowledge about the task:

Discriminative learning: assume that the optimal inference rule h* lies in some class of rules H ⇒ replace the true risk by the empirical risk

R_T(h) = (1/|T|) ∑_{(x,y)∈T} ℓ(y, h(x))

and minimise it w.r.t. h ∈ H, i.e. h*_T = argmin_{h∈H} R_T(h).

Q: How strongly can R(h*_T) deviate from R(h*)? How does this deviation depend on H?

P( |R(h*_T) − R(h*)| > ε ) ≤ ??
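
A minimal sketch of this recipe (not part of the slides): H is a made-up family of one-dimensional threshold classifiers, the training data are synthetic, and the ERM solution is found by a grid search over the thresholds.

```python
# Illustrative only: empirical risk minimisation over a small made-up class H
# of threshold rules h_t(x) = [x > t], with the 0/1 loss on a synthetic training set T.
import numpy as np

rng = np.random.default_rng(1)
y_train = rng.integers(0, 2, size=200)
x_train = y_train + rng.normal(size=200)       # made-up training data T

thresholds = np.linspace(-2.0, 3.0, 101)       # H = {h_t | t in this grid}

def empirical_risk(t):
    """R_T(h_t): average 0/1 loss of the threshold rule h_t on T."""
    y_pred = (x_train > t).astype(int)
    return float(np.mean(y_pred != y_train))

risks = [empirical_risk(t) for t in thresholds]
t_best = thresholds[int(np.argmin(risks))]     # h*_T = argmin over H of R_T(h)
print(t_best, min(risks))
```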
Learning types
9/10
Generative learning: assume that the true p.d. p(x, y) is in some parametrised family of distributions, i.e. p = p_{θ*} ∈ P_Θ ⇒ use the training set T to estimate θ ∈ Θ:

1. θ*_T = argmax_{θ∈Θ} log p_θ(T), i.e. the maximum likelihood estimator,
2. set h*_T = h_{θ*_T}, where h_θ denotes the Bayes inference rule for the p.d. p_θ.

Q: How strongly can θ*_T deviate from θ*? How does this deviation depend on P_Θ?
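
To illustrate both steps (not from the slides), the sketch below fits a deliberately simple family p_θ(x, y) = p(y)·N(x; μ_y, σ²) by maximum likelihood and then uses the plug-in Bayes rule for the 0/1 loss (argmax of the posterior). The data and the family are made up.

```python
# Illustrative only: maximum likelihood fit of a simple family
# p_theta(x, y) = p(y) * N(x; mu_y, sigma^2), then the plug-in Bayes rule.
import numpy as np

rng = np.random.default_rng(2)
y_train = rng.integers(0, 2, size=500)
x_train = np.where(y_train == 1, 1.0, -1.0) + rng.normal(size=500)  # made-up training set T

# ML estimates of theta = (class priors, class means, shared variance)
priors = np.array([np.mean(y_train == c) for c in (0, 1)])
means = np.array([x_train[y_train == c].mean() for c in (0, 1)])
var = np.mean((x_train - means[y_train]) ** 2)   # pooled ML variance estimate

def h_theta(x):
    """Plug-in Bayes rule for the 0/1 loss: argmax_y p(y) * N(x; mu_y, var) (constants dropped)."""
    log_post = np.log(priors) - 0.5 * (x - means) ** 2 / var
    return int(np.argmax(log_post))

print(h_theta(0.8), h_theta(-0.3))   # predicted classes for two example inputs
```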

Possible combinations (training data vs. learning type):

                   discriminative   generative
supervised              yes            yes
semi-supervised        (yes)           yes
unsupervised             no            yes

In this course:
discriminative: Support Vector Machines, Deep Neural Networks
generative: mixture models, Hidden Markov Models
other: Bayesian learning, Ensembling



Example: Classification of handwritten digits
10/10

x ∈ X - grey valued images, 28×28; y ∈ Y - categorical variable with 10 values

discriminative: Specify a class of strategies H and a loss function ℓ(y, y′). How would you estimate the optimal inference rule h* ∈ H?

generative: Specify a parametrised family p_θ(x, y), θ ∈ Θ, and a loss function ℓ(y, y′). How would you estimate the optimal θ* by using the MLE? What is the Bayes inference rule for p_{θ*}?
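
One possible answer to the generative question, sketched below (purely illustrative, not the course's reference solution): model each digit class by independent per-pixel Bernoulli distributions, estimate the parameters from T by maximum likelihood (with a small smoothing term added here to avoid zero probabilities), and predict with the 0/1-loss Bayes rule, i.e. the argmax of the posterior. The data at the bottom are random stand-ins for real binarised 28×28 images.

```python
# Illustrative sketch only: a naive Bayes generative model p_theta(x, y) for binarised digits.
# Assumes X has shape (m, 784) with entries in {0, 1} and Y has shape (m,) with labels 0..9.
import numpy as np

def fit_theta(X, Y, alpha=1.0):
    """Smoothed ML estimates of the priors p(y) and the per-pixel probabilities p(x_j = 1 | y)."""
    classes = np.arange(10)
    priors = np.array([np.mean(Y == c) for c in classes])
    probs = np.array([(X[Y == c].sum(axis=0) + alpha) / ((Y == c).sum() + 2 * alpha)
                      for c in classes])
    return priors, probs

def h_theta(x, priors, probs):
    """Plug-in Bayes rule for the 0/1 loss: argmax_y log p(y) + sum_j log p(x_j | y)."""
    log_post = np.log(priors) + np.log(probs) @ x + np.log(1.0 - probs) @ (1 - x)
    return int(np.argmax(log_post))

# demo on random stand-in data (each class appears equally often)
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(100, 784))
Y = np.repeat(np.arange(10), 10)
priors, probs = fit_theta(X, Y)
print(h_theta(X[0], priors, probs))
```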
