
Statistical Machine Learning (BE4M33SSU)

Lecture 1.
Czech Technical University in Prague
Course format
2/10
Teachers: Jan Drchal, Boris Flach, Vojtech Franc and Daniel Bonilla
Format: 1 lecture & 1 tutorial per week (6 credits); tutorials are of two types:
seminars: discussing solutions of theoretical assignments (published a week before the class). You are expected to work on them in advance.
practical labs: explaining and discussing practical homeworks, i.e. implementations of selected methods in Python (or Matlab). You have to submit
1. a report in PDF format (typeset preferably in LaTeX). Exception: if necessary, you may include lengthy formula derivations as handwritten scans.
2. your code, either as a source file or as a Python notebook. The code must be executable.

Grading: 40% homeworks + 60% written exam = 100% (+ bonus points)
Prerequisites:
probability theory and statistics (A0B01PSI)
pattern recognition and machine learning (AE4B33RPZ)
optimisation (AE4B33OPT)
More details: https://cw.fel.cvut.cz/wiki/courses/be4m33ssu/start


Goals
3/10
The aim of statistical machine learning is to develop systems (models and algorithms) for
solving prediction tasks given a set of examples and some prior knowledge about the task.

Machine learning has been successfully applied, e.g., in areas such as:
text and document classification,
speech recognition and natural language processing,
computational biology (genes, proteins), biological imaging & medical diagnosis,
computer vision,
fraud detection, network intrusion detection,
and many others.

You will gain skills to construct learning systems for typical applications by successfully
combining appropriate models and learning methods.
Characters of the play
4/10
object features x ∈ X are observable; x can be: a categorical variable, a scalar, a real valued vector, a tensor, a sequence of values, an image, a labelled graph, ...
state of the object y ∈ Y is usually hidden; y can be: see above
prediction strategy (a.k.a. inference rule) h : X → Y; depending on the type of Y:
• y is a categorical variable ⇒ classification
• y is a real valued variable ⇒ regression
training examples T = {(x, y) | x ∈ X, y ∈ Y}
loss function ℓ : Y × Y → R+ penalises wrong predictions, i.e. ℓ(y, h(x)) is the loss for predicting y′ = h(x) when y is the true state

Goal: an optimal prediction strategy h : X → Y that minimises the loss

Q: give meaningful application examples for combinations of different X, Y and related loss functions
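
To make these objects concrete, here is a small Python sketch (not part of the slides) with a 0/1 loss for classification, a squared loss for regression, and a toy prediction strategy; all names and numbers are illustrative.

```python
# Illustrative only: the basic objects of a prediction task in code.
# Here X = real-valued feature vectors and Y = {0, 1}; the rule and numbers are made up.
import numpy as np

def zero_one_loss(y, y_pred):
    """0/1 loss for classification: 1 if the prediction is wrong, 0 otherwise."""
    return float(y != y_pred)

def squared_loss(y, y_pred):
    """Squared loss, the usual choice for regression."""
    return (y - y_pred) ** 2

def h(x):
    """A toy prediction strategy h : X -> Y, here a simple threshold rule."""
    return int(np.sum(x) > 0.0)

x, y = np.array([0.2, -0.1, 0.4]), 1   # one observed example (x, y)
print(zero_one_loss(y, h(x)))          # loss of the prediction h(x) when y is the true state
```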
Statistical machine learning
5/10
Main assumption:
X, Y are random variables,
X, Y are related by an unknown joint p.d.f. p(x, y),
we can collect examples (x, y) drawn from p(x, y).

Typical concepts:
regression: Y = f(X) + ε, where f is unknown and ε is a random error,
classification: p(x, y) = p(y) p(x | y), where p(y) is the prior class probability and p(x | y) the conditional feature distribution.

Consequences and problems:
the inference rule h(X) and the loss ℓ(Y, h(X)) become random variables,
the risk of an inference rule h(X) ⇒ expected loss

R(h) = E[ℓ(Y, h(X))] = ∑_{x∈X} ∑_{y∈Y} p(x, y) ℓ(y, h(x))
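As a concrete illustration (not part of the slides), the sketch below computes this risk exactly for an invented discrete joint distribution p(x, y), a fixed rule h and the 0/1 loss; all numbers are hypothetical.

```python
# Illustrative only: the risk R(h) = sum_{x,y} p(x, y) * loss(y, h(x)) computed exactly
# for an invented discrete joint distribution over X = {0, 1, 2} and Y = {0, 1}.

p = {(0, 0): 0.30, (0, 1): 0.10,
     (1, 0): 0.10, (1, 1): 0.20,
     (2, 0): 0.05, (2, 1): 0.25}               # p[(x, y)] = p(x, y), sums to 1

loss = lambda y, y_pred: float(y != y_pred)    # 0/1 loss
h = lambda x: 0 if x == 0 else 1               # some fixed inference rule

R = sum(p_xy * loss(y, h(x)) for (x, y), p_xy in p.items())
print(R)                                       # expected loss of h under p(x, y)
```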

how to estimate R(h) if p(x, y) is unknown?
how to choose an optimal predictor h(x) if p(x, y) is unknown?



Statistical machine learning
6/10
Estimating R(h):

collect an i.i.d. test sample S^m = {(x^i, y^i) ∈ X × Y | i = 1, ..., m} drawn from the distribution p(x, y),
estimate the risk R(h) of the strategy h by the empirical risk

R(h) ≈ R_{S^m}(h) = (1/m) ∑_{i=1}^{m} ℓ(y^i, h(x^i))

Q: how strongly can the two deviate from each other? (see next lectures)

P( |R_{S^m}(h) − R(h)| > ε ) ≤ ??
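
A minimal sketch of this estimator (not part of the slides): draw a test sample and average the losses. The data distribution, the strategy h and the sample size below are all made up for illustration.

```python
# Illustrative only: estimate R(h) by the empirical risk on an i.i.d. test sample S^m.
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m examples from a made-up p(x, y): y ~ Bernoulli(0.5), x ~ N(y, 1)."""
    y = rng.integers(0, 2, size=m)
    x = y + rng.normal(size=m)
    return x, y

h = lambda x: (x > 0.5).astype(int)                   # some fixed prediction strategy
loss = lambda y, y_pred: (y != y_pred).astype(float)  # 0/1 loss

x_test, y_test = sample(m=10_000)
R_emp = loss(y_test, h(x_test)).mean()                # R_{S^m}(h), the empirical risk
print(R_emp)                                          # approximates R(h) for large m
```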
Statistical machine learning
7/10
Choosing an optimal inference rule h(x)

If p(x, y) is known:

The smallest possible risk is

R* = inf_{h∈Y^X} R(h) = inf_{h∈Y^X} ∑_{x∈X} ∑_{y∈Y} p(x, y) ℓ(y, h(x)) = ∑_{x∈X} p(x) inf_{y′∈Y} ∑_{y∈Y} p(y | x) ℓ(y, y′)

The corresponding best possible inference rule is the Bayes inference rule

h*(x) = argmin_{y′∈Y} ∑_{y∈Y} p(y | x) ℓ(y, y′)
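Purely as an illustration (not from the slides), the sketch below evaluates this rule for a small discrete toy problem: for each x it picks the prediction y′ with the smallest expected loss under p(y | x). The posterior and the asymmetric loss matrix are invented.

```python
# Illustrative only: Bayes inference rule h*(x) = argmin_{y'} sum_y p(y | x) * loss(y, y')
import numpy as np

# invented posterior p(y | x) for X = {0, 1, 2} and Y = {0, 1}: rows are x, columns are y
posterior = np.array([[0.9, 0.1],
                      [0.4, 0.6],
                      [0.2, 0.8]])

# invented asymmetric loss matrix loss[y, y']: predicting 1 when y = 0 costs 5, the opposite costs 1
loss = np.array([[0.0, 5.0],
                 [1.0, 0.0]])

expected_loss = posterior @ loss       # entry (x, y') = sum_y p(y | x) * loss(y, y')
h_star = expected_loss.argmin(axis=1)  # Bayes prediction for each x
print(h_star)
```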

But p(x, y) is not known and we can only collect examples drawn from it. We need:

Learning algorithms that use training data and prior assumptions/knowledge about the task
Learning types
8/10
Training data:

if T^m = {(x^i, y^i) ∈ X × Y | i = 1, ..., m} ⇒ supervised learning
if T^m = {x^i ∈ X | i = 1, ..., m} ⇒ unsupervised learning
if T^m = T_l^{m1} ∪ T_u^{m2}, with labelled training data T_l^{m1} and unlabelled training data T_u^{m2} ⇒ semi-supervised learning

Prior knowledge about the task:

Discriminative learning: assume that the optimal inference rule h* lies in some class of rules H ⇒ replace the true risk by the empirical risk

R_T(h) = (1/|T|) ∑_{(x,y)∈T} ℓ(y, h(x))

and minimise it w.r.t. h ∈ H, i.e. h*_T = argmin_{h∈H} R_T(h).

Q: How strongly can R(h*_T) deviate from R(h*)? How does this deviation depend on H?

P( |R(h*_T) − R(h*)| > ε ) ≤ ??
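
A minimal sketch of this recipe (not part of the slides): H is a made-up family of one-dimensional threshold classifiers, the training data are synthetic, and the ERM solution is found by a grid search over the thresholds.

```python
# Illustrative only: empirical risk minimisation over a small made-up class H
# of threshold rules h_t(x) = [x > t], with the 0/1 loss on a synthetic training set T.
import numpy as np

rng = np.random.default_rng(1)
y_train = rng.integers(0, 2, size=200)
x_train = y_train + rng.normal(size=200)       # made-up training data T

thresholds = np.linspace(-2.0, 3.0, 101)       # H = {h_t | t in this grid}

def empirical_risk(t):
    """R_T(h_t): average 0/1 loss of the threshold rule h_t on T."""
    y_pred = (x_train > t).astype(int)
    return float(np.mean(y_pred != y_train))

risks = [empirical_risk(t) for t in thresholds]
t_best = thresholds[int(np.argmin(risks))]     # h*_T = argmin over H of R_T(h)
print(t_best, min(risks))
```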
Learning types
9/10
Generative learning: assume that the true p.d. p(x, y) is in some parametrised family of distributions, i.e. p = p_{θ*} ∈ P_Θ ⇒ use the training set T to estimate θ ∈ Θ:

1. θ*_T = argmax_{θ∈Θ} log p_θ(T), i.e. the maximum likelihood estimator,
2. set h*_T = h_{θ*_T}, where h_θ denotes the Bayes inference rule for the p.d. p_θ.

Q: How strongly can θ*_T deviate from θ*? How does this deviation depend on P_Θ?
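
To illustrate both steps (not from the slides), the sketch below fits a deliberately simple family p_θ(x, y) = p(y)·N(x; μ_y, σ²) by maximum likelihood and then uses the plug-in Bayes rule for the 0/1 loss (argmax of the posterior). The data and the family are made up.

```python
# Illustrative only: maximum likelihood fit of a simple family
# p_theta(x, y) = p(y) * N(x; mu_y, sigma^2), then the plug-in Bayes rule.
import numpy as np

rng = np.random.default_rng(2)
y_train = rng.integers(0, 2, size=500)
x_train = np.where(y_train == 1, 1.0, -1.0) + rng.normal(size=500)  # made-up training set T

# ML estimates of theta = (class priors, class means, shared variance)
priors = np.array([np.mean(y_train == c) for c in (0, 1)])
means = np.array([x_train[y_train == c].mean() for c in (0, 1)])
var = np.mean((x_train - means[y_train]) ** 2)   # pooled ML variance estimate

def h_theta(x):
    """Plug-in Bayes rule for the 0/1 loss: argmax_y p(y) * N(x; mu_y, var) (constants dropped)."""
    log_post = np.log(priors) - 0.5 * (x - means) ** 2 / var
    return int(np.argmax(log_post))

print(h_theta(0.8), h_theta(-0.3))   # predicted classes for two example inputs
```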

Possible combinations (training data vs. learning type):

                   discriminative   generative
supervised              yes            yes
semi-supervised        (yes)           yes
unsupervised             no            yes

In this course:
discriminative: Support Vector Machines, Deep Neural Networks
generative: mixture models, Hidden Markov Models
other: Bayesian learning, Ensembling



Example: Classification of handwritten digits
10/10

x ∈ X - grey valued images, 28×28; y ∈ Y - categorical variable with 10 values

discriminative: Specify a class of strategies H and a loss function ℓ(y, y′). How would you estimate the optimal inference rule h* ∈ H?

generative: Specify a parametrised family p_θ(x, y), θ ∈ Θ, and a loss function ℓ(y, y′). How would you estimate the optimal θ* by using the MLE? What is the Bayes inference rule for p_{θ*}?
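
One possible answer to the generative question, sketched below (purely illustrative, not the course's reference solution): model each digit class by independent per-pixel Bernoulli distributions, estimate the parameters from T by maximum likelihood (with a small smoothing term added here to avoid zero probabilities), and predict with the 0/1-loss Bayes rule, i.e. the argmax of the posterior. The data at the bottom are random stand-ins for real binarised 28×28 images.

```python
# Illustrative sketch only: a naive Bayes generative model p_theta(x, y) for binarised digits.
# Assumes X has shape (m, 784) with entries in {0, 1} and Y has shape (m,) with labels 0..9.
import numpy as np

def fit_theta(X, Y, alpha=1.0):
    """Smoothed ML estimates of the priors p(y) and the per-pixel probabilities p(x_j = 1 | y)."""
    classes = np.arange(10)
    priors = np.array([np.mean(Y == c) for c in classes])
    probs = np.array([(X[Y == c].sum(axis=0) + alpha) / ((Y == c).sum() + 2 * alpha)
                      for c in classes])
    return priors, probs

def h_theta(x, priors, probs):
    """Plug-in Bayes rule for the 0/1 loss: argmax_y log p(y) + sum_j log p(x_j | y)."""
    log_post = np.log(priors) + np.log(probs) @ x + np.log(1.0 - probs) @ (1 - x)
    return int(np.argmax(log_post))

# demo on random stand-in data (each class appears equally often)
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(100, 784))
Y = np.repeat(np.arange(10), 10)
priors, probs = fit_theta(X, Y)
print(h_theta(X[0], priors, probs))
```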
