
Foundations of Machine Learning

Introduction to ML

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu

Foundations of Machine Learning page 1


Logistics
Prerequisites: basics in linear algebra, probability, and
analysis of algorithms.
Workload: about 3-4 homework assignments + project.
Mailing list: join as soon as possible.

Foundations of Machine Learning page 2


Course Material
Textbook.
Slides: course web page.
http://www.cs.nyu.edu/~mohri/ml23

Foundations of Machine Learning page 3


This Lecture
Basic definitions and concepts.

Introduction to the problem of learning.

Probability tools.

Foundations of Machine Learning page 4


Machine Learning
Definition: computational methods using experience to improve performance.

Experience: data-driven task, thus statistics, probability, and optimization.

Computer science: learning algorithms, analysis of complexity, theoretical guarantees.

Example: use document word counts to predict its topic.

Foundations of Machine Learning page 5


Examples of Learning Tasks
Text: document classification, spam detection.

Language: NLP tasks (e.g., morphological analysis, POS tagging, context-free parsing, dependency parsing).

Speech: recognition, synthesis, verification.

Image: annotation, face recognition, OCR, handwriting recognition.

Games (e.g., chess, backgammon, go).

Unassisted control of vehicles (robots, cars).

Medical diagnosis, fraud detection, network intrusion.

Foundations of Machine Learning page 6


Some Broad ML Tasks
Classification: assign a category to each item (e.g., document classification).

Regression: predict a real value for each item (prediction of stock values, economic variables).

Ranking: order items according to some criterion (relevant web pages returned by a search engine).

Clustering: partition data into 'homogeneous' regions (analysis of very large data sets).

Dimensionality reduction: find a lower-dimensional manifold preserving some properties of the data.

Foundations of Machine Learning page 7


General Objectives of ML
Theoretical questions:

• what can be learned, under what conditions?

• are there learning guarantees?

• analysis of learning algorithms.

Algorithms:

• more efficient and more accurate algorithms.

• deal with large-scale problems.

• handle a variety of different learning problems.

Foundations of Machine Learning page 8


This Course
Theoretical foundations:

• learning guarantees.

• analysis of algorithms.

Algorithms:

• main mathematically well-studied algorithms.

• discussion of their extensions.

Applications:

• illustration of their use.

Foundations of Machine Learning page 9


Topics
Probability tools, concentration inequalities.

PAC learning model, Rademacher complexity, VC-dimension, generalization bounds.

Support vector machines (SVMs), margin bounds, kernel methods.

Ensemble methods, boosting.

Logistic regression and conditional maximum entropy models.

On-line learning, weighted majority algorithm, Perceptron algorithm, mistake bounds.

Regression, generalization, algorithms.

Ranking, generalization, algorithms.

Reinforcement learning, MDPs, bandit problems and algorithms.


Foundations of Machine Learning page 10
Definitions and Terminology
Example: item, instance of the data used.

Features: attributes associated to an item, often represented as a vector (e.g., word counts).

Labels: category (classification) or real value (regression) associated to an item.

Data:

• training data (typically labeled).

• test data (labeled but labels not seen).

• validation data (labeled, for tuning parameters).

Foundations of Machine Learning page 11


General Learning Scenarios
Settings:

• batch: learner receives the full (training) sample, which he uses to make predictions for unseen points.

• on-line: learner receives one sample at a time and makes a prediction for that sample.

Queries:

• active: the learner can request the label of a point.

• passive: the learner receives labeled points.

Foundations of Machine Learning page 12


Standard Batch Scenarios
Unsupervised learning: no labeled data.

Supervised learning: uses labeled data for prediction on unseen points.

Semi-supervised learning: uses labeled and unlabeled data for prediction on unseen points.

Transduction: uses labeled and unlabeled data for prediction on seen points.

Foundations of Machine Learning page 13


Example - SPAM Detection
Problem: classify each e-mail message as SPAM or non-SPAM (binary classification problem).

Potential data: large collection of SPAM and non-SPAM messages (labeled examples).

Foundations of Machine Learning page 14


Learning Stages
Diagram: labeled data is split into a training sample, validation data, and a test sample. The learning algorithm A(Θ) is trained on the training sample, using features chosen from prior knowledge; its free parameters Θ are tuned on the validation data (parameter selection, yielding A(Θ₀)); the selected hypothesis is then evaluated on the test sample.

Foundations of Machine Learning page 15
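To make these stages concrete, here is a minimal sketch (an editor's illustration, not from the slides): it assumes a synthetic regression task and a ridge-regression learner playing the role of A(Θ), with the regularization strength as the free parameter Θ tuned on the validation data.

```python
import numpy as np

# Illustrative sketch of the learning stages on synthetic data (the data,
# the ridge model, and the parameter grid are assumptions for this example).
rng = np.random.default_rng(0)
n, d = 300, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)      # noisy labels

# Split the labeled data into training sample, validation data, test sample.
X_tr, y_tr = X[:200], y[:200]
X_va, y_va = X[200:250], y[200:250]
X_te, y_te = X[250:], y[250:]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: argmin_w ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Parameter selection: pick the free parameter lam on the validation data.
lams = [0.01, 0.1, 1.0, 10.0]
best_lam = min(lams, key=lambda lam: mse(ridge_fit(X_tr, y_tr, lam), X_va, y_va))

# Final evaluation on the held-out test sample.
w_hat = ridge_fit(X_tr, y_tr, best_lam)
print("selected lambda:", best_lam, " test MSE:", mse(w_hat, X_te, y_te))
```

Any other learner and parameter grid could be substituted; the point is only the separation of training, parameter selection, and final evaluation.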


This Lecture
Basic definitions and concepts.

Introduction to the problem of learning.

Probability tools.

Foundations of Machine Learning page 16


Definitions
Spaces: input space $X$, output space $Y$.

Loss function: $L \colon Y \times Y \to \mathbb{R}$.

• $L(\hat{y}, y)$: cost of predicting $\hat{y}$ instead of $y$.

• binary classification: 0-1 loss, $L(y, y') = 1_{y \neq y'}$.

• regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$.

Hypothesis set: $H \subseteq Y^X$, subset of functions out of which the learner selects his hypothesis.

• depends on features.

• represents prior knowledge about task.

Foundations of Machine Learning page 17
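As a small illustration (an editor's sketch, not part of the lecture), the two losses above can be written directly:

```python
def zero_one_loss(y_pred, y):
    """0-1 loss for binary classification: 1 if y_pred != y, else 0."""
    return float(y_pred != y)

def squared_loss(y_pred, y):
    """Squared loss for regression: (y' - y)^2."""
    return (y_pred - y) ** 2

print(zero_one_loss(1, 0), zero_one_loss(1, 1))   # 1.0 0.0
print(squared_loss(2.5, 3.0))                     # 0.25
```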


Supervised Learning Set-Up
Training data: sample $S$ of size $m$ drawn i.i.d. from $X \times Y$ according to distribution $D$:
$$S = ((x_1, y_1), \ldots, (x_m, y_m)).$$

Problem: find hypothesis $h \in H$ with small generalization error.

• deterministic case: output label deterministic function of input, $y = f(x)$.

• stochastic case: output probabilistic function of input.

Foundations of Machine Learning page 18


Errors
Generalization error: for $h \in H$, it is defined by
$$R(h) = \operatorname*{\mathbb{E}}_{(x,y) \sim D}[L(h(x), y)].$$

Empirical error: for $h \in H$ and sample $S$, it is
$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i).$$

Bayes error:
$$R^\star = \inf_{h \text{ measurable}} R(h).$$

• in the deterministic case, $R^\star = 0$.

Foundations of Machine Learning page 19
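The empirical error is directly computable from the sample. A minimal sketch (editor's illustration) with a made-up threshold hypothesis and sample:

```python
# Empirical error of a hypothesis h on a labeled sample S, using the 0-1 loss.
def empirical_error(h, S, loss):
    return sum(loss(h(x), y) for x, y in S) / len(S)

h = lambda x: 1 if x >= 0.5 else 0              # a threshold hypothesis
S = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 1)]    # sample of (x, y) pairs
zero_one = lambda y_pred, y: float(y_pred != y)

print(empirical_error(h, S, zero_one))          # 0.25: one point is misclassified
```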


Noise
Noise:

• in binary classification, for any $x \in X$,
$$\mathrm{noise}(x) = \min\{\Pr[1 \mid x], \Pr[0 \mid x]\}.$$

• observe that $\mathbb{E}[\mathrm{noise}(x)] = R^\star$.

Foundations of Machine Learning page 20
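The identity $\mathbb{E}[\mathrm{noise}(x)] = R^\star$ can be checked numerically on a small discrete distribution (the distribution below is an arbitrary choice for illustration): the Bayes classifier predicts the more likely label at each $x$, so its error at $x$ is exactly $\mathrm{noise}(x)$.

```python
# Sketch: checking E[noise(x)] = R* on a small, made-up discrete distribution.
# p_x is the marginal Pr[x]; p1[x] is Pr[y = 1 | x].
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
p1  = {"a": 0.9, "b": 0.5, "c": 0.2}

# Bayes classifier: predict the more likely label at each x.
h_bayes = {x: int(p1[x] >= 0.5) for x in p_x}

# Its generalization error: sum over x of Pr[x] * Pr[y != h_bayes(x) | x].
r_star = sum(p_x[x] * (p1[x] if h_bayes[x] == 0 else 1 - p1[x]) for x in p_x)

# E[noise(x)] with noise(x) = min{Pr[1|x], Pr[0|x]}.
e_noise = sum(p_x[x] * min(p1[x], 1 - p1[x]) for x in p_x)

print(r_star, e_noise)   # both equal 0.5*0.1 + 0.3*0.5 + 0.2*0.2 = 0.24
```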


Learning ≠ Fitting

Notion of simplicity/complexity.
How do we define complexity?

Foundations of Machine Learning page 21


Generalization
Observations:

• the best hypothesis on the sample may not be the best overall.

• generalization is not memorization.

• complex rules (very complex separation surfaces) can be poor predictors.

• trade-off: complexity of hypothesis set vs. sample size (underfitting/overfitting).

Foundations of Machine Learning page 22


Model Selection
General equality: for any $h \in H$, with $h^\star$ the best-in-class hypothesis,
$$R(h) - R^\star = \underbrace{[R(h) - R(h^\star)]}_{\text{estimation}} + \underbrace{[R(h^\star) - R^\star]}_{\text{approximation}}.$$

Approximation: not a random variable, only depends on $H$.

Estimation: only term we can hope to bound.

How should we choose H ?

Foundations of Machine Learning page 23


Empirical Risk Minimization
Select hypothesis set H.

Find hypothesis $h \in H$ minimizing empirical error:
$$h = \operatorname*{argmin}_{h \in H} \widehat{R}(h).$$

• but H may be too complex.

• the sample size may not be large enough.

Foundations of Machine Learning page 24
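A minimal ERM sketch over a finite hypothesis set of threshold classifiers $h_t(x) = 1_{x \geq t}$ (the data and the grid of thresholds are assumptions made for illustration):

```python
import numpy as np

# ERM over a small, finite hypothesis set H = {h_t : t in a grid}.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=50)
y = (x >= 0.3).astype(int)                 # deterministic labels, f(x) = 1[x >= 0.3]

thresholds = np.linspace(0, 1, 21)

def empirical_error(t, x, y):
    return np.mean((x >= t).astype(int) != y)

errors = [empirical_error(t, x, y) for t in thresholds]
t_hat = thresholds[int(np.argmin(errors))]  # ERM solution: argmin of empirical error
print("selected threshold:", t_hat, " empirical error:", min(errors))
```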


Generalization Bounds

Definition: upper bound on
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big].$$

Bound on the estimation error for the hypothesis $h_0$ returned by ERM:
$$\begin{aligned}
R(h_0) - R(h^\star) &= R(h_0) - \widehat{R}(h_0) + \widehat{R}(h_0) - R(h^\star) \\
&\leq R(h_0) - \widehat{R}(h_0) + \widehat{R}(h^\star) - R(h^\star) \\
&\leq 2 \sup_{h \in H} \big|R(h) - \widehat{R}(h)\big|.
\end{aligned}$$

How should we choose H ? (model selection problem)

Foundations of Machine Learning page 25


Model Selection

Figure: error as a function of the complexity parameter $\gamma$, showing the estimation error, the approximation error, and their upper bound, for a hypothesis set decomposed as
$$H = \bigcup_{\gamma} H_\gamma.$$

Foundations of Machine Learning page 26


Structural Risk Minimization
(Vapnik, 1995)

Principle: consider an infinite sequence of hypothesis sets ordered for inclusion,
$$H_1 \subset H_2 \subset \cdots \subset H_n \subset \cdots$$

$$h = \operatorname*{argmin}_{h \in H_n,\, n \in \mathbb{N}} \widehat{R}(h) + \mathrm{penalty}(H_n, m).$$

• strong theoretical guarantees.

• typically computationally hard.

Foundations of Machine Learning page 27


General Algorithm Families
Empirical risk minimization (ERM):
$$h = \operatorname*{argmin}_{h \in H} \widehat{R}(h).$$

Structural risk minimization (SRM): $H_n \subseteq H_{n+1}$,
$$h = \operatorname*{argmin}_{h \in H_n,\, n \in \mathbb{N}} \widehat{R}(h) + \mathrm{penalty}(H_n, m).$$

Regularization-based algorithms: $\lambda \geq 0$,
$$h = \operatorname*{argmin}_{h \in H} \widehat{R}(h) + \lambda \|h\|^2.$$

Foundations of Machine Learning page 28
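The SRM objective can be illustrated with nested hypothesis sets of polynomials of increasing degree. This is only a sketch: the data are synthetic and the penalty term sqrt((n+1)/m) is a stand-in for the capacity-based penalties studied later in the course, not a prescription from the slides.

```python
import numpy as np

# Sketch of SRM with nested hypothesis sets H_n = {polynomials of degree <= n}.
rng = np.random.default_rng(2)
m = 40
x = rng.uniform(-1, 1, size=m)
y = np.sin(3 * x) + 0.2 * rng.normal(size=m)

def srm_objective(n):
    coeffs = np.polyfit(x, y, deg=n)           # ERM within H_n (least squares)
    emp_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    return emp_err + np.sqrt((n + 1) / m)      # empirical error + penalty(H_n, m)

degrees = range(0, 10)
n_hat = min(degrees, key=srm_objective)
print("degree selected by SRM:", n_hat)
```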


This Lecture
Basic definitions and concepts.

Introduction to the problem of learning.

Probability tools.

Foundations of Machine Learning page 29


Basic Properties
Union bound: $\Pr[A \vee B] \leq \Pr[A] + \Pr[B]$.

Inversion: if $\Pr[X \geq \epsilon] \leq f(\epsilon)$, then, for any $\delta > 0$, with probability at least $1 - \delta$, $X \leq f^{-1}(\delta)$.

Jensen's inequality: if $f$ is convex, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$.

Expectation: if $X \geq 0$, $\mathbb{E}[X] = \int_0^{+\infty} \Pr[X > t]\, dt$.

Foundations of Machine Learning page 30


Basic Inequalities
Markov’s inequality: if X 0 and ✏ > 0 , then

Pr[X ✏]  E[X]
✏ .

Chebyshev’s inequality: for any ✏ > 0 ,


2
Pr[|X E[X]| ✏]  X
✏2 .

Foundations of Machine Learning page 31
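Both inequalities are easy to check by simulation. A quick sketch with an exponential random variable (an arbitrary choice for illustration):

```python
import numpy as np

# Monte Carlo sanity check of Markov's and Chebyshev's inequalities.
rng = np.random.default_rng(3)
X = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, Var[X] = 1

eps = 3.0
markov_lhs = np.mean(X >= eps)
markov_rhs = X.mean() / eps
print(f"Pr[X >= {eps}] = {markov_lhs:.4f} <= E[X]/eps = {markov_rhs:.4f}")

cheb_lhs = np.mean(np.abs(X - X.mean()) >= eps)
cheb_rhs = X.var() / eps**2
print(f"Pr[|X - E[X]| >= {eps}] = {cheb_lhs:.4f} <= Var[X]/eps^2 = {cheb_rhs:.4f}")
```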


Hoe ding’s Inequality
Theorem: Let X1 , . . . , Xm be indep. rand. variables with the
same expectation µ and Xi 2 [a, b] , (a < b ). Then, for any ✏ > 0 ,
the following inequalities hold:
 Xm ✓ 2

1 2m✏
Pr µ Xi > ✏  exp
m i=1 (b a)2
 m ✓ ◆
1 X 2m✏ 2
Pr Xi µ > ✏  exp .
m i=1 (b a)2

Foundations of Machine Learning page 32
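A quick numerical check of the bound for averages of Bernoulli(1/2) variables (the values of m and ε are arbitrary choices for illustration):

```python
import numpy as np

# Compare the Hoeffding bound with the observed deviation frequency.
rng = np.random.default_rng(4)
m, eps, trials = 100, 0.1, 20_000
a, b, mu = 0.0, 1.0, 0.5

means = rng.binomial(1, mu, size=(trials, m)).mean(axis=1)
observed = np.mean(means - mu > eps)
bound = np.exp(-2 * m * eps**2 / (b - a) ** 2)

print(f"observed Pr[mean - mu > eps] = {observed:.4f}")
print(f"Hoeffding bound              = {bound:.4f}")   # exp(-2) ~ 0.135
```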


McDiarmid’s Inequality
(McDiarmid, 1989)
Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \leq c_i.$$

Then, for all $\epsilon > 0$,
$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \leq 2 \exp\Big(\!-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$

Foundations of Machine Learning page 33


Appendix

Foundations of Machine Learning page 34


Markov’s Inequality
Theorem: let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$; then, for all $t > 0$,
$$\Pr[X \geq t\,\mathbb{E}[X]] \leq \frac{1}{t}.$$
Proof:
$$\begin{aligned}
\Pr[X \geq t\,\mathbb{E}[X]] &= \sum_{x \geq t\,\mathbb{E}[X]} \Pr[X = x] \\
&\leq \sum_{x \geq t\,\mathbb{E}[X]} \frac{x}{t\,\mathbb{E}[X]} \Pr[X = x] \\
&\leq \sum_{x} \frac{x}{t\,\mathbb{E}[X]} \Pr[X = x] \\
&= \mathbb{E}\Big[\frac{X}{t\,\mathbb{E}[X]}\Big] = \frac{1}{t}.
\end{aligned}$$
Foundations of Machine Learning page 35
Chebyshev’s Inequality
Theorem: let $X$ be a random variable with $\mathrm{Var}[X] < \infty$; then, for all $t > 0$,
$$\Pr\big[|X - \mathbb{E}[X]| \geq t\,\sigma_X\big] \leq \frac{1}{t^2}.$$

Proof: Observe that
$$\Pr\big[|X - \mathbb{E}[X]| \geq t\,\sigma_X\big] = \Pr\big[(X - \mathbb{E}[X])^2 \geq t^2 \sigma_X^2\big].$$

The result follows from Markov's inequality.

Foundations of Machine Learning page 36


Weak Law of Large Numbers
Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$; then, for any $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr\big[|\bar{X}_n - \mu| \geq \epsilon\big] = 0.$$

Proof: Since the variables are independent,
$$\mathrm{Var}[\bar{X}_n] = \sum_{i=1}^{n} \mathrm{Var}\Big[\frac{X_i}{n}\Big] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$

Thus, by Chebyshev's inequality,
$$\Pr\big[|\bar{X}_n - \mu| \geq \epsilon\big] \leq \frac{\sigma^2}{n\epsilon^2}.$$

Foundations of Machine Learning page 37
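The proof suggests a simple experiment: the deviation probability shrinks with n and stays below the Chebyshev bound σ²/(nε²). A sketch with Uniform[0, 1] variables (an arbitrary choice for illustration):

```python
import numpy as np

# Deviation probability of the sample mean vs. the Chebyshev bound.
rng = np.random.default_rng(5)
eps, trials = 0.05, 10_000
mu, sigma2 = 0.5, 1.0 / 12.0                    # mean and variance of Uniform[0, 1]

for n in [10, 100, 1000]:
    means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
    observed = np.mean(np.abs(means - mu) >= eps)
    bound = min(sigma2 / (n * eps**2), 1.0)
    print(f"n = {n:5d}: observed = {observed:.4f}, Chebyshev bound = {bound:.4f}")
```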


Concentration Inequalities
Some general tools for error analysis and bounds:

• Hoeffding's inequality (additive).

• Chernoff bounds (multiplicative).

• McDiarmid's inequality (more general).

Foundations of Machine Learning page 38


Hoe ding’s Lemma
Lemma: Let X 2 [a, b] be a random variable with E[X] = 0
and b 6= a . Then for any t > 0 ,
t2 (b a)2
E[etX ]  e 8 .

Proof: by convexity of x 7! etx , for all a  x  b ,

tx b x ta x a tb
e  e + e .
b a b a
Thus,
E[etX ] E[ bb X ta
ae + b a e ]
X a tb
= b
b
a e ta
+ a tb
b ae =e (t)
,
with,
(t) = log( b b a eta + a tb
b ae ) = ta + log( b b a + a t(b a)
b ae ).

Foundations of Machine Learning page 39


Taking the derivative gives:
$$\phi'(t) = a - \frac{a\, e^{t(b-a)}}{\frac{b}{b-a} - \frac{a}{b-a}\, e^{t(b-a)}} = a - \frac{a}{\frac{b}{b-a}\, e^{-t(b-a)} - \frac{a}{b-a}}.$$

Note that $\phi(0) = 0$ and $\phi'(0) = 0$. Furthermore, with $\alpha = \frac{-a}{b-a}$,
$$\begin{aligned}
\phi''(t) &= \frac{-ab\, e^{-t(b-a)}}{\big[\frac{b}{b-a}\, e^{-t(b-a)} - \frac{a}{b-a}\big]^2}
= \frac{\alpha(1-\alpha)\, e^{-t(b-a)}\, (b-a)^2}{\big[(1-\alpha)\, e^{-t(b-a)} + \alpha\big]^2} \\
&= \frac{\alpha}{(1-\alpha)\, e^{-t(b-a)} + \alpha} \cdot \frac{(1-\alpha)\, e^{-t(b-a)}}{(1-\alpha)\, e^{-t(b-a)} + \alpha}\, (b-a)^2
= u(1-u)(b-a)^2 \leq \frac{(b-a)^2}{4}.
\end{aligned}$$

There exists $0 \leq \theta \leq t$ such that:
$$\phi(t) = \phi(0) + t\,\phi'(0) + \frac{t^2}{2}\,\phi''(\theta) \leq t^2\, \frac{(b-a)^2}{8}.$$

Foundations of Machine Learning page 40


Hoe ding’s Theorem
Theorem: Let X1 , . . . , Xm be independent random variables.
Then for Xi 2 [ai , bi ], the following inequalities hold
Pm
for Sm = i=1 Xi , for any ✏ > 0 ,
Pm
2✏2 / (b a ) 2
Pr[Sm E[Sm ] ✏]  e i=1 i i

Pm
2✏2 / i=1 (bi a i )2
Pr[Sm E[Sm ]  ✏]  e .

Proof: The proof is based on Cherno ’s bounding


technique: for any random variable X and t > 0 , apply
Markov’s inequality and select t to minimize
tX t✏ E[etX ]
Pr[X ✏] = Pr[e e ] t✏
.
e

Foundations of Machine Learning page 41


Using this scheme and the independence of the random variables gives
$$\begin{aligned}
\Pr[S_m - \mathbb{E}[S_m] \geq \epsilon]
&\leq e^{-t\epsilon}\, \mathbb{E}\big[e^{t(S_m - \mathbb{E}[S_m])}\big] \\
&= e^{-t\epsilon}\, \prod_{i=1}^{m} \mathbb{E}\big[e^{t(X_i - \mathbb{E}[X_i])}\big] \\
&\leq e^{-t\epsilon}\, \prod_{i=1}^{m} e^{t^2 (b_i - a_i)^2 / 8} \qquad \text{(lemma applied to $X_i - \mathbb{E}[X_i]$)} \\
&= e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^{m} (b_i - a_i)^2 / 8} \\
&\leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2},
\end{aligned}$$
choosing $t = 4\epsilon / \sum_{i=1}^{m} (b_i - a_i)^2$.

The second inequality is proved in a similar way.

Foundations of Machine Learning page 42


Hoe ding’s Inequality
Corollary: for any ✏ > 0 , any distribution D and any
hypothesis h : X ! {0, 1}, the following inequalities hold:
b 2m✏2
Pr[R(h) R(h) ✏]  e
b 2m✏2
Pr[R(h) R(h)  ✏]  e .

Proof: follows directly Hoe ding’s theorem.

Combining these one-sided inequalities yields


h i
b 2m✏2
Pr R(h) R(h) ✏  2e .

Foundations of Machine Learning page 43
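Setting the right-hand side $2e^{-2m\epsilon^2}$ equal to $\delta$ and solving for $\epsilon$ gives the familiar form: with probability at least $1 - \delta$, $|\widehat{R}(h) - R(h)| \leq \sqrt{\log(2/\delta)/(2m)}$. A small sketch of this inversion:

```python
import math

# Deviation radius from inverting the two-sided bound 2*exp(-2*m*eps^2) = delta.
def hoeffding_radius(m, delta):
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

for m in [100, 1000, 10000]:
    print(f"m = {m:6d}: |R_hat(h) - R(h)| <= {hoeffding_radius(m, delta=0.05):.4f} w.p. 0.95")
```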


Cherno ’s Inequality
Theorem: for any ✏ > 0 , any distribution D and any
hypothesis h : X ! {0, 1} , the following inequalities hold:

Proof: proof based on Cherno ’s bounding technique.


b m R(h) ✏2 /3
Pr[R(h) (1 + ✏)R(h)]  e
b m R(h) ✏2 /2
Pr[R(h)  (1 ✏)R(h)]  e .

Foundations of Machine Learning page 44
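The two bounds are complementary: when R(h) is small, the multiplicative (Chernoff) bound can be much sharper than the additive (Hoeffding) bound for the same deviation. A rough numerical comparison (the numbers are arbitrary, and the relative deviation here sits at the boundary ε = 1):

```python
import math

# Compare Pr[R_hat >= R + eps_add] bounds: additive (Hoeffding) vs.
# multiplicative (Chernoff), for a small true error R(h).
m, r = 1000, 0.01                  # sample size and true error R(h)
eps_add = 0.01                     # additive deviation
eps_rel = eps_add / r              # the same deviation, expressed relatively

hoeffding = math.exp(-2 * m * eps_add**2)      # e^{-2 m eps^2}
chernoff = math.exp(-m * r * eps_rel**2 / 3)   # e^{-m R(h) eps^2 / 3}

print(f"Hoeffding bound: {hoeffding:.4f}")      # ~0.82: nearly vacuous
print(f"Chernoff bound:  {chernoff:.2e}")       # ~3.6e-02: much sharper here
```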


McDiarmid’s Inequality
(McDiarmid, 1989)
Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \leq c_i.$$

Then, for all $\epsilon > 0$,
$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \leq 2 \exp\Big(\!-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$

Foundations of Machine Learning page 45


Comments:

• Proof: uses Hoeffding's lemma.

• Hoeffding's inequality is a special case of McDiarmid's with
$$f(x_1, \ldots, x_m) = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{and} \quad c_i = \frac{|b_i - a_i|}{m}.$$

Foundations of Machine Learning page 46


Jensen’s Inequality
Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then,
$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)].$$

Proof: definition of convexity, continuity of convex functions, and density of finite distributions.

Foundations of Machine Learning page 47


