
CS-E4715 Supervised Machine Learning

Lecture 2: Statistical learning theory


Generalization

- Our aim is to predict as well as possible the outputs of future examples, not only those of the training sample
- We would like to minimize the generalization error, or the (true) risk

  R(h) = E_{(x,y)~D}[L(h(x), y)],

  where L(y, y′) is a suitable loss function (e.g. the zero-one loss)
- We assume that future examples are drawn independently from the same distribution D that generated the training examples (the i.i.d. assumption)
- But we do not know D!
- What can we say about R(h) based on the training examples and the hypothesis class H alone? Two possibilities:
  - Empirical evaluation through testing (a small sketch of estimating the risk empirically follows this list)
  - Statistical learning theory (Lectures 2 and 3)
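
To make these quantities concrete, here is a minimal Python sketch (not from the lecture) that estimates R(h) under the zero-one loss by Monte Carlo sampling; the choice of distribution D (a 2-d Gaussian), the true concept and the hypothesis are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    def true_concept(x):
        # assumed "true" labelling rule (illustrative): positive inside the unit disc
        return x[0] ** 2 + x[1] ** 2 <= 1.0

    def hypothesis(x):
        # an imperfect hypothesis: a slightly smaller disc
        return x[0] ** 2 + x[1] ** 2 <= 0.8

    def estimated_risk(h, concept, n_samples=100_000):
        # Monte Carlo estimate of R(h) = E_{x~D}[L_{0/1}(h(x), concept(x))],
        # with D taken to be a standard 2-d Gaussian (an assumption of this sketch)
        xs = rng.normal(size=(n_samples, 2))
        errors = sum(h(x) != concept(x) for x in xs)
        return errors / n_samples

    print("estimated risk R(h):", estimated_risk(hypothesis, true_concept))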

1
Additional reading

- Lectures 2-4 are mostly based on the book by Mohri et al.: chapters 2-4
- Available online via Aalto Ebook Central (link on the course home page)
- The book goes much deeper into the theory (e.g. proofs of theorems) than we do in this course

2
Probably approximately correct learning

Probably Approximately Correct Learning framework

- The Probably Approximately Correct (PAC) learning framework formalizes the notion of generalization in machine learning
- Ingredients:
  - input space X containing all possible inputs x
  - set of possible labels Y (in binary classification Y = {0, 1} or Y = {-1, +1})
  - concept class C containing concepts C : X → Y (to be learned); a concept C gives a label C(x) for each input x
  - unknown probability distribution D
  - training sample S = (x_1, C(x_1)), ..., (x_m, C(x_m)) drawn independently from D
  - hypothesis class H; in the basic case H = C, but this assumption can be relaxed
- The goal in PAC learning is to learn a hypothesis with a low generalization error

  R(h) = E_{x~D}[L_{0/1}(h(x), C(x))] = Pr_{x~D}(h(x) ≠ C(x))
3
PAC learnability

- A class C is PAC-learnable if there exists an algorithm A that, given a training sample S, outputs a hypothesis h_S ∈ H whose generalization error satisfies

  Pr(R(h_S) ≤ ε) ≥ 1 − δ

  - for any distribution D, for arbitrary ε, δ > 0, and a sample size m = |S| that grows polynomially in 1/ε and 1/δ
  - for any concept C ∈ C
- In addition, if A runs in time polynomial in m, 1/ε, and 1/δ, the class is called efficiently PAC-learnable
- The learned hypothesis is Probably (1 − δ) Approximately Correct (ε).

4
Interpretation

Let us interpret the bound

  Pr(R(h_S) ≤ ε) ≥ 1 − δ

- ε sets the level of generalization error that is of interest to us; say we are content with predicting incorrectly on 10% of new data points: ε = 0.1
- 1 − δ sets a level of confidence; if we are content with the training algorithm failing to provide a good hypothesis 5% of the time: δ = 0.05
- Sample size and running time should not explode when we make ε and δ stricter: hence the requirement of polynomial growth
- The event "low generalization error", {R(h_S) ≤ ε}, is random because we cannot know beforehand which hypothesis h_S ∈ H will be selected by the algorithm
5
Generalization error bound vs. test error

- Generalization error bounds concern the tail of the error distribution
  - We wish a high generalization error to be a rare event
- The expected generalization error may be considerably lower
  - It reflects average behaviour, where most data distributions and concepts are "not bad"
- We expect there to be a gap between the expected error and the tail
  - The smaller the failure probability δ, the larger the gap

6
Example: learning axis-aligned rectangles

Learning setup

- The goal is to learn a rectangle R (representing the true concept) that includes all blue points and excludes all red points
- The hypotheses will also be rectangles (here R′), which will in general make both false positive predictions (predicting blue when the true label is red) and false negative predictions (predicting red when the true label is blue).

7
Example: learning axis-aligned rectangles

- We will use a simple algorithm: choose the tightest rectangle that contains all blue points (a minimal sketch of this fit is given below)
- Note that this yields a consistent hypothesis: no errors on the training data
- The hypothesis R′ will only make false negative errors relative to the true concept R, no false positive errors (Q: why is that?)
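
A minimal sketch of this tightest-rectangle fit, assuming the sample is given as 2-d points with labels 1 (blue, inside R) and 0 (red, outside); the function names are my own.

    import numpy as np

    def fit_tightest_rectangle(X, y):
        # Return the smallest axis-aligned rectangle containing all positive (blue) points.
        # X: array of shape (m, 2); y: array of 0/1 labels.
        pos = X[y == 1]
        if len(pos) == 0:
            return None  # empty hypothesis: predict 0 everywhere
        return (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())

    def predict(rect, x):
        # Predict 1 (blue) iff x lies inside the learned rectangle
        if rect is None:
            return 0
        x_min, x_max, y_min, y_max = rect
        return int(x_min <= x[0] <= x_max and y_min <= x[1] <= y_max)

    # tiny usage example with made-up points
    X = np.array([[0.2, 0.3], [0.5, 0.7], [0.9, 0.1], [1.5, 1.5]])
    y = np.array([1, 1, 1, 0])
    rect = fit_tightest_rectangle(X, y)
    print(rect, predict(rect, [0.4, 0.4]), predict(rect, [2.0, 2.0]))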

8
Example: learning axis-aligned rectangles

Questions:
- Is the class of axis-aligned rectangles PAC-learnable?
- How much training data will be needed to learn?
- We need to show that, with probability at least 1 − δ, the algorithm does not output a bad hypothesis with R(R′) > ε
- We can assume Pr_D(R) > ε (otherwise R(R′) ≤ ε trivially, since errors can only occur inside R)

9
Example: learning axis-aligned rectangles

Let r_1, r_2, r_3, r_4 be rectangles along the sides of R such that Pr_D(r_i) = ε/4

- Their union satisfies Pr_D(r_1 ∪ r_2 ∪ r_3 ∪ r_4) ≤ ε
- Errors can only occur within R − R′ (false negative predictions)
- If R′ intersects all four regions r_1, ..., r_4, then R − R′ ⊆ r_1 ∪ r_2 ∪ r_3 ∪ r_4, so we know that R(R′) ≤ ε
- Thus, if R(R′) > ε, then R′ must miss at least one of the four regions

10
Example: learning axis-aligned rectangles

- The events A = {R′ intersects all four rectangles r_1, ..., r_4} and B = {R(R′) ≤ ε} satisfy A ⊆ B
- The complement events A^c = {R′ misses at least one rectangle} and B^c = {R(R′) > ε} satisfy B^c ⊆ A^c
- B^c is the bad event (high generalization error); we want it to have low probability
- Since B^c ⊆ A^c, we have Pr(B^c) ≤ Pr(A^c)
- Our task is to upper bound Pr(A^c)

11
Example: learning axis-aligned rectangles

- Each r_i has probability mass ε/4 by our design
- Probability of one example missing one fixed rectangle: 1 − ε/4
- Probability of all m examples missing that rectangle: (1 − ε/4)^m (m independent trials)
- Probability that at least one of the four rectangles is missed by all m examples (union bound); a small simulation below checks this numerically:

  Pr(A^c) ≤ 4(1 − ε/4)^m
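
The bound can be checked numerically. The following simulation sketch (my own, not part of the slides) assumes D is uniform on the unit square and R is a fixed sub-rectangle, learns the tightest rectangle from m points, and compares the empirical frequency of the bad event R(R′) > ε with 4(1 − ε/4)^m.

    import numpy as np

    rng = np.random.default_rng(1)

    # true concept: a fixed axis-aligned rectangle inside the unit square (an assumption)
    R = (0.1, 0.8, 0.2, 0.9)  # (x_min, x_max, y_min, y_max)

    def area(rect):
        x0, x1, y0, y1 = rect
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)

    def run_trial(m):
        # draw m points uniformly on the unit square; points inside R are the blue points
        X = rng.random((m, 2))
        inside = (X[:, 0] >= R[0]) & (X[:, 0] <= R[1]) & (X[:, 1] >= R[2]) & (X[:, 1] <= R[3])
        pos = X[inside]
        if len(pos) == 0:
            return area(R)  # empty hypothesis: the error equals the mass of R
        Rp = (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())
        # R' is contained in R, so the error region is R \ R' and D is uniform:
        return area(R) - area(Rp)

    m, eps, n_trials = 100, 0.1, 2000
    bad = sum(run_trial(m) > eps for _ in range(n_trials)) / n_trials
    print("empirical Pr(R(R') > eps):", bad)
    print("bound 4*(1 - eps/4)^m    :", 4 * (1 - eps / 4) ** m)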


12
Example: learning axis-aligned rectangles

- We can use the general inequality ∀x: (1 − x) ≤ exp(−x) to obtain

  Pr(R(h) > ε) ≤ 4(1 − ε/4)^m ≤ 4 exp(−mε/4)

- We want this probability to be small (≤ δ):

  4 exp(−mε/4) ≤ δ  ⇔  m ≥ (4/ε) log(4/δ)

- The last inequality is our first generalization error bound, a sample complexity bound to be exact (see the sketch below)
- Note: following Mohri et al. (2018), log denotes the natural logarithm: log(exp(x)) = x
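
The two quantities plotted on the next slide can be computed directly; a minimal sketch (function names are mine) of m(ε, δ) = (4/ε) log(4/δ) and ε(m, δ) = (4/m) log(4/δ):

    import math

    def sample_complexity(eps, delta):
        # m(eps, delta) = (4/eps) * log(4/delta), rounded up to an integer sample size
        return math.ceil((4.0 / eps) * math.log(4.0 / delta))

    def error_bound(m, delta):
        # eps(m, delta) = (4/m) * log(4/delta)
        return (4.0 / m) * math.log(4.0 / delta)

    # e.g. epsilon = 0.1 and delta = 0.05 from the interpretation slide
    print(sample_complexity(0.1, 0.05))  # ~176 examples
    print(error_bound(176, 0.05))        # ~0.0996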

13
Plotting the behaviour of the bound

- Left: the sample complexity, i.e. the number of examples needed to reach a given generalization error level, m(ε, δ) = (4/ε) log(4/δ)
- Right: the generalization bound as a function of the training sample size, ε(m, δ) = (4/m) log(4/δ)
- Three different confidence levels (δ) are plotted

14
Plotting the behaviour of the bound

Typical behaviour of machine learning algorithms is revealed:

- increasing the sample size decreases the generalization error
- extra data gives less and less additional benefit as the sample size grows (law of diminishing returns)
- requiring a high level of confidence (small δ) requires more data for the same level of error

15
Generalization error bound vs. expected test error

- The error bounds hold for any concept from the class
  - including concepts that are harder to learn than the "average" concept
- They hold for any distribution D generating the data
  - including adversarially generated distributions (aiming to make learning harder)
- We bound the probability of being in the high-error tail of the distribution (not the convergence to the mean or median generalization error)

For these reasons, empirically estimated test errors may be considerably lower than the bounds suggest

16
Half-time poll: Personalized email spam filtering system
A company is developing a personalized email spam filtering system. The system is tuned personally for each customer using the customer's data on top of the existing training data. The company has a choice of three machine learning algorithms with different performance characteristics. So far, the company has tested the three algorithms on a small set of test users.

Which algorithm should the company choose?

1. Algorithm 1, which guarantees an error rate of less than 10% for 99% of the future customer base
2. Algorithm 2, which guarantees an error rate of less than 5% for 90% of the future customer base
3. Algorithm 3, which empirically has an error rate of 1% on the current user base

Answer the poll in MyCourses by 11:15: go to the Lectures page and scroll down to "Lecture 2 poll". Answers are anonymous and do not affect grading of the course.
Guarantees for finite hypothesis sets

Finite hypothesis classes

- Finite concept classes arise when:
  - input variables have finite domains, or are converted to such in preprocessing (e.g. by discretizing real values), and
  - the representations of the hypotheses have finite size (e.g. a bound on the number of times a single variable can appear)
- A typical case is subclasses of Boolean formulae: expressions over binary input variables (literals) combined with logical operators (AND, OR, NOT, ...)
- Finite concept classes are among the most thoroughly analyzed hypothesis classes in statistical learning theory

17
Example: Boolean conjunctions

- Aldo likes to do sport only when the weather is suitable
- Aldo has given examples of suitable and unsuitable weather
- Let us build a classifier for Aldo to decide whether to do sports today
- As the classifier we use rules in the form of Boolean conjunctions (Boolean formulae containing AND and NOT, but not OR operators), e.g. if (Sky=Sunny) AND NOT(Wind=Strong) then (EnjoySport=1); a small sketch of such a rule as code follows
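
Such a rule is easy to write down as code; below is a minimal sketch (the dictionary encoding and function name are my own assumptions, attribute names follow the classic EnjoySport example):

    # a conjunction is encoded as a dict: attribute -> (value, negated)
    # e.g. "if (Sky=Sunny) AND NOT(Wind=Strong) then EnjoySport=1"
    rule = {"Sky": ("Sunny", False), "Wind": ("Strong", True)}

    def predict_enjoy_sport(rule, day):
        # the conjunction outputs 1 only if every literal in it is satisfied
        for attr, (value, negated) in rule.items():
            matches = (day[attr] == value)
            if matches == negated:  # non-negated literal must match, negated must not
                return 0
        return 1

    day = {"Sky": "Sunny", "Wind": "Weak", "Humidity": "High"}
    print(predict_enjoy_sport(rule, day))  # -> 1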

18
Finite hypothesis class - consistent case

- Sample complexity bound relying on the size of the hypothesis class (Mohri et al., 2018): Pr(R(h_S) ≤ ε) ≥ 1 − δ if

  m ≥ (1/ε) (log|H| + log(1/δ))

- An equivalent generalization error bound:

  R(h) ≤ (1/m) (log|H| + log(1/δ))

- Holds for any finite hypothesis class, assuming there is a consistent hypothesis, i.e. one with zero empirical risk
- The extra term compared to the rectangle learning example is (1/ε) log|H|
- The more hypotheses there are in H, the more training examples are needed (a small sketch computing this bound follows)
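
A minimal sketch of these two bounds as functions (helper names are mine):

    import math

    def sample_complexity_consistent(H_size, eps, delta):
        # m >= (1/eps) * (log|H| + log(1/delta))  -- consistent (zero empirical risk) case
        return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

    def error_bound_consistent(H_size, m, delta):
        # equivalent form: R(h) <= (1/m) * (log|H| + log(1/delta))
        return (math.log(H_size) + math.log(1.0 / delta)) / m

    print(sample_complexity_consistent(729, 0.1, 0.05))  # ~96 examples for |H| = 729
    print(error_bound_consistent(729, 96, 0.05))         # ~0.0999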

19
Example: Boolean conjunctions

- How many different conjunctions can be built (= |H|)?
- Each variable can appear as is, appear negated with "NOT", or be excluded from the rule: 3 possibilities
- The total number of hypotheses is thus 3^d, where d is the number of variables
- We have six variables in total, giving |H| = 3^6 = 729 different hypotheses (plugged into the bound in the sketch below)
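
Plugging |H| = 3^6 into the consistent-case bound gives concrete numbers; a small self-contained computation (my own, in the spirit of the plots described on the next slide):

    import math

    d, delta = 6, 0.05
    H_size = 3 ** d  # 729 conjunction hypotheses over d = 6 variables

    for m in (50, 100, 200, 500):
        # consistent-case bound: R(h) <= (1/m) * (log|H| + log(1/delta))
        eps = (d * math.log(3) + math.log(1.0 / delta)) / m
        print(m, round(eps, 3))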

20
Plotting the bound for Aldo's problem using Boolean conjunctions

- On the left, the generalization bound is shown for different values of δ, using d = 6 variables
- On the right, the bound is shown for an increasing number of input variables d, using δ = 0.05

21
Arbitrary boolean formulae

- What about using arbitrary Boolean formulae?
- How many Boolean formulae of d variables are there?
- There are 2^d possible input vectors, so the size of the input space is |X| = 2^d
- We can define a Boolean formula that outputs 1 on an arbitrary subset S ⊂ X and 0 outside that subset:

  f_S(x) = (x = x_1) OR (x = x_2) OR ... OR (x = x_|S|)

- We can pick the subset in 2^|X| ways (why?)
- Thus we have |H| = 2^(2^d) different Boolean formulae
- Our generalization bound gives

  m ≥ (1/ε) (2^d log 2 + log(1/δ))

- Thus we need an exponential number of examples with respect to the number of variables; the hypothesis class is not considered PAC-learnable! (The sketch below shows how quickly the bound grows.)
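
A small sketch (my own computation) showing how quickly the bound m ≥ (1/ε)(2^d log 2 + log(1/δ)) grows with the number of variables d:

    import math

    eps, delta = 0.1, 0.05

    for d in (6, 10, 15, 20):
        # |H| = 2^(2^d), so log|H| = 2^d * log 2
        m = math.ceil(((2 ** d) * math.log(2) + math.log(1.0 / delta)) / eps)
        print(f"d = {d:2d}  ->  m >= {m}")
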
22
Plotting the bound for arbitrary Boolean formulae

- With d = 6 variables we need ca. 500 examples to get the bound below 0.07 (left picture)
- Increasing the number of variables quickly raises the sample complexity to 10^6 and beyond (right picture)

23
Proof outline of the PAC bound for finite hypothesis classes

Proof outline (Mohri et al., 2018)

- Consider any hypothesis h ∈ H with R(h) > ε
- For h to be consistent, R̂(h) = 0, all training examples need to miss the region where h makes an error.
- The probability of this event is

  Pr(R̂(h) = 0 | R(h) > ε) ≤ (1 − ε)^m

  - m independent trials, each missing the error region with probability at most 1 − ε
- This bounds the probability that a single fixed high-error hypothesis appears consistent (a small numerical check of this step is sketched below)
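
This step can be checked numerically; the tiny simulation sketch below (my own, not from the book) fixes a single hypothesis whose error region has probability mass exactly ε and estimates how often it looks consistent on m points:

    import numpy as np

    rng = np.random.default_rng(2)
    eps, m, n_trials = 0.1, 30, 100_000

    # each training point independently lands in the error region of h with probability eps;
    # h looks consistent (empirical risk 0) iff none of the m points land there
    hits = rng.random((n_trials, m)) < eps
    consistent_fraction = np.mean(~hits.any(axis=1))

    print("empirical Pr(empirical risk = 0):", consistent_fraction)
    print("(1 - eps)^m                     :", (1 - eps) ** m)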

24
Proof outline

- But we do not know which consistent hypothesis h is selected by our learning algorithm
- Hence our result needs to hold for all consistent hypotheses
- This is an example of a uniform convergence bound
- We wish to upper bound the probability that some h ∈ H is consistent, R̂(h) = 0, and has a high generalization error, R(h) > ε, for a fixed ε > 0:

  Pr(∃h ∈ H : R̂(h) = 0 ∧ R(h) > ε)

- Above, ∧ is the logical "and"

25
Proof outline

- We can replace ∃ by enumerating all hypotheses in H using logical or (∨):

  Pr(∃h ∈ H : R̂(h) = 0 ∧ R(h) > ε) =
  Pr({R̂(h_1) = 0 ∧ R(h_1) > ε} ∨ {R̂(h_2) = 0 ∧ R(h_2) > ε} ∨ ...)

- Using the facts that Pr(A ∪ B) ≤ Pr(A) + Pr(B) and Pr(A ∩ C) ≤ Pr(A | C) for any events A, B and C, the above is upper bounded by

  Σ_{h∈H} Pr(R̂(h) = 0 ∧ R(h) > ε) ≤ Σ_{h∈H} Pr(R̂(h) = 0 | R(h) > ε) ≤ |H| (1 − ε)^m

- The last inequality follows from using Pr(R̂(h) = 0 | R(h) > ε) ≤ (1 − ε)^m for each of the |H| summands
26
Proof outline

- We have established

  Pr(∃h ∈ H : R̂(h) = 0 ∧ R(h) > ε) ≤ |H| (1 − ε)^m ≤ |H| exp(−mε)

- Set the right-hand side equal to δ and solve for m to obtain the bound:

  δ = |H| exp(−mε)
  log δ = log|H| − mε
  m = (1/ε) (log|H| + log(1/δ))

27
Finite hypothesis class - inconsistent case

- So far we have assumed that there is a consistent hypothesis h ∈ H, one that achieves zero empirical risk on the training sample
- In practice this is often not the case
- However, as long as the empirical risk R̂(h) is small, a low generalization error can still be achieved
- Generalization error bound (Mohri et al., 2018): Let H be a finite hypothesis set. Then for any δ > 0, with probability at least 1 − δ, we have for all h ∈ H:

  R(h) ≤ R̂(h) + sqrt( (log|H| + log(2/δ)) / (2m) )

- We see the same dependence on log|H| and m as in the consistent case, but now under a square root
- Hence slower convergence w.r.t. the number of examples (a small sketch evaluating this bound follows)
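
A minimal sketch evaluating this bound (the function name is mine), e.g. for the 729 conjunctions of the earlier example with non-zero training error:

    import math

    def generalization_bound(emp_risk, H_size, m, delta):
        # R(h) <= R_hat(h) + sqrt((log|H| + log(2/delta)) / (2m))   (inconsistent case)
        return emp_risk + math.sqrt((math.log(H_size) + math.log(2.0 / delta)) / (2 * m))

    # e.g. |H| = 3^6 conjunctions, 5% training error, m = 1000 examples, delta = 0.05
    print(generalization_bound(0.05, 3 ** 6, 1000, 0.05))  # ~0.12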

28
Summary

- Probably approximately correct (PAC) learning is a theoretical framework for analysing the generalization performance of machine learning algorithms
- PAC theory is concerned with upper bounding the probability δ of "bad" events, those of high generalization error (> ε)
- For finite hypothesis classes, the logarithm of the number of hypotheses, log|H|, affects the number of examples needed to obtain a given level of risk with high probability

29
