
Notes on Machine Learning

Vincenzo Gargano

January 13, 2023


Contents

1 Neural Networks
  1.1 Regularization
  1.2 Stopping Criteria
  1.3 VC-Dimension
    1.3.1 Statistical Learning Theory
  1.4 Intro to SVM
    1.4.1 Linear SVM
    1.4.2 Quadratic Optimization Problem

Chapter 1

Neural Networks

1.1 Regularization
The loss is the error term plus a regularization term, and there are many approaches, such as coefficient shrinkage, also known as Tikhonov regularization:

Loss(W) = \sum_{p=1}^{l} (y_p - \hat{y}_p)^2 + \lambda \|W\|^2

We control the regularization with the hyperparameter λ, and we can use it to avoid
overfitting. Recalling the discussion of the VC-dimension, this is a way to manage
that complexity.

A small λ allows a high norm of the weights, hence a model that is too complex and
tends to overfit the data. A large λ lets the second term dominate, forcing the
weights toward zero and increasing the error on the data, which tends to underfit.
We call it the penalty term because it penalizes large weight values; some weights can
even go to zero. If you use only the L2 norm the method is called Ridge, if you use
only L1 it is called Lasso, and if you use both it is called Elastic Net. L2 tends to
shrink the weights toward smaller values; L1 penalizes the absolute values, driving
some weights exactly to 0 while allowing others to stay larger, but the Lasso loss is
not differentiable at zero.
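
As a minimal illustrative sketch (not from the original notes), the following Python/NumPy code runs gradient descent on the Tikhonov loss above for a linear model; the toy data and the values of η and λ are invented for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # l = 100 patterns, 5 inputs (toy data)
true_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(5)
eta, lam = 1e-3, 10.0                            # learning rate eta and penalty lambda

for epoch in range(2000):
    y_hat = X @ w
    # gradient of sum_p (y_p - y_hat_p)^2 + lambda * ||w||^2
    grad = -2 * X.T @ (y - y_hat) + 2 * lam * w
    w -= eta * grad

print(w)   # the learned weights are shrunk toward zero compared with true_w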


Exercises. Connect the regularization term with the VC-dimension:

- Why can this give a better bound on the risk R?
- How do the values of λ govern the underfitting and overfitting cases?

Derive the new delta rule with weight decay using the Tikhonov loss: compute the partial
derivative of the Loss with respect to w, separating the learning-rate (η) and
regularization (λ) terms; a sketch is given below.
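
A sketch of that derivation (not in the original notes), assuming a linear unit with output \hat{y}_p = w^T x_p; factors of 2 are often absorbed into η and λ:

\[
\frac{\partial\,\mathrm{Loss}}{\partial w_i}
   = -2\sum_{p=1}^{l}\bigl(y_p - \hat{y}_p\bigr)\,x_{p,i} + 2\lambda\,w_i,
\qquad
\Delta w_i = -\eta\,\frac{\partial\,\mathrm{Loss}}{\partial w_i}
   = \eta\sum_{p=1}^{l}\bigl(y_p - \hat{y}_p\bigr)\,x_{p,i} - 2\eta\lambda\,w_i
\]

so each update is the usual delta rule plus a weight-decay term:
w_i \leftarrow (1 - 2\eta\lambda)\,w_i + \eta \sum_p (y_p - \hat{y}_p)\,x_{p,i}.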

1.2 Stopping Criteria


The most basic criterion is stopping when the mean error drops below a threshold E
(mean error < E); it is the best choice if you know the tolerance of the data (expert
knowledge), but often we have no information about the tolerance or about when to
stop, and our project is like this. We then need to use internal criteria: for example,
the weight changes become very small (near-zero gradient), or the error decreases by
less than, say, 0.1% in an epoch. NOTE: this may stop prematurely (for a small η); it
can be applied by observing k epochs (patience).

We also simply stop after an excessive number of epochs, as a safeguard against slow
convergence, rather than relying on a fixed number of epochs as the stopping rule.
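
A minimal sketch of such an internal criterion (patience-based early stopping); model, train_step and val_error are hypothetical callables, not names from the notes:

def train_with_early_stopping(model, train_step, val_error,
                              max_epochs=1000, patience=20, min_rel_improvement=1e-3):
    # Stop when the validation error has not improved by at least
    # min_rel_improvement (e.g. 0.1%) for `patience` consecutive epochs,
    # or after max_epochs as a safeguard against slow convergence.
    best = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step(model)                 # one epoch of training
        err = val_error(model)
        if err < best * (1 - min_rel_improvement):
            best = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return model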

1.3 VC-Dimension
Theorem 1. The VC-dimension of a class of functions H is the maximum
cardinality of a set of points in X that can be shattered by H.

For linear functions in the plane, VC(H) ≥ 3: it suffices that at least one
configuration of 3 points can be shattered, as in the three-points example. On the
other hand VC(H) < 4: to shatter 4 points (e.g. the XOR configuration) we would need
non-linear functions.
In general the VC-dimension of the class of linear separating hyperplanes
(LTU) in an n-dimensional space is n + 1.

The VC-dimension is not the number of free parameters, although the two are related:
there are models with one parameter and infinite VC-dimension (see Haykin's book).
For Nearest-Neighbour the VC-dimension is infinite: for example, with 1-Nearest-
Neighbour we can have infinitely many points and 0 training error! In this sense
1-Nearest-Neighbour is not really a model.
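
As an illustrative check (not in the original notes), the following Python snippet verifies that 3 non-collinear points in the plane can be shattered by an LTU, by running a simple perceptron on all 2^3 labelings; since every labeling here is linearly separable, the perceptron is guaranteed to converge.

import itertools
import numpy as np

def perceptron_separates(X, y, epochs=1000):
    # Try to find w, b with sign(w . x + b) matching y on every point.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:    # misclassified (or on the boundary)
                w, b = w + yi * xi, b + yi
                errors += 1
        if errors == 0:
            return True
    return False

# three non-collinear points in the plane
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# the set is shattered if every one of the 2^3 labelings is realizable
shattered = all(perceptron_separates(X, np.array(y))
                for y in itertools.product([-1, 1], repeat=3))
print("3 points shattered by an LTU:", shattered)   # expected: True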

1.3.1 Statistical Learning Theory


Let N be the number of data points (denoted l before).

The guaranteed risk is R[h] = R_{emp}[h] + ε(VC, N, δ).

The second term is the VC-confidence

ε(VC, N, δ) = C \sqrt{\frac{VC\,(\ln(2N/VC) + 1) - \ln(\delta/4)}{N}},

which holds with probability at least (1 − δ) for every VC < N.

Remember the U-shaped plot against the VC-dimension: ε goes to zero as N increases,
ε grows with the VC-dimension, and the guaranteed risk R traces a U shape as the
VC-dimension increases (the empirical risk falls while the VC-confidence rises).

This gives us a way to estimate the error on future data based only on the training
error and the VC-dimension of H. It provides useful guidance for model selection and
assessment because it does not require the costly process of cross-validation.
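
A small sketch (not from the notes) of the VC-confidence formula above, with the constant C set to 1 for illustration:

import numpy as np

def vc_confidence(vc, n, delta=0.05, c=1.0):
    # epsilon(VC, N, delta) = C * sqrt((VC * (ln(2N/VC) + 1) - ln(delta/4)) / N)
    return c * np.sqrt((vc * (np.log(2 * n / vc) + 1) - np.log(delta / 4)) / n)

for n in (100, 1000, 10000):
    print(n, vc_confidence(vc=10, n=n))   # decreases as N grows; grows with VC for fixed N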

1.4 Intro to SVM


1.4.1 Linear SVM
We now make a connection between the VC-dimension, SLT and the Linear
Threshold Unit (LTU). We follow Haykin, chapter 6.

• N is the number of examples (previously l)

• m is the dimension of the input vector (previously n)

• b instead of w0 as the intercept or bias

Assumptions: hard-margin SVM, i.e. we assume a linearly separable problem
and no errors in the data.

Separating hyperplane: w^T x + b = 0,
where w^T x_i + b ≥ 0 for d_i = +1 and w^T x_i + b < 0 for d_i = −1.

g(x) = w^T x + b is the discriminant function and h(x) = sign(g(x)) is the
hypothesis.

The separation margin ρ is defined as twice the distance between the hyperplane and
the closest data point; we can call it the "safe zone".

Not all separating hyperplanes have equal safe zones: some have larger margins, some
smaller. We prefer larger margins, and we define the optimal hyperplane as the one
that maximizes the margin ρ:

w_o^T x + b_o = 0   (1.1)

We want to find ρ = 2/∥w∥, so maximizing ρ means minimizing ∥w∥. This is an
optimization problem in terms of minimizing the norm, in order to obtain a new LTU
that maximizes the margin. We will see later why this is relevant.

We can rescale w and b (normalizing so that the closest points sit at value 1 or −1)
so that the closest points to the hyperplane satisfy |g(x_i)| = |w^T x_i + b| = 1,
and then write in compact form:

d_i (w^T x_i + b) ≥ 1, ∀ i = 1, ..., N   (1.2)

with equality for the closest points.

A support vector x^{(s)} satisfies the previous constraint with equality:

d^{(s)} (w^T x^{(s)} + b) = 1   (1.3)

There may be many support vectors: they are the data points closest to the boundary
(hyperplane).
Let g(x) = w^T x + b be the discriminant, and recall that w_o is a vector orthogonal
to the hyperplane. Let x_p be the projection of x onto the optimal hyperplane and r
the distance between x and the optimal hyperplane, so that

x = x_p + r \frac{w_o}{\|w_o\|}   (1.4)

We now derive the margin by evaluating g(x) = w_o^T x + b_o at x = x_p + r \frac{w_o}{\|w_o\|}:

g(x) = g\left(x_p + r \frac{w_o}{\|w_o\|}\right)
     = w_o^T x_p + b_o + r\, w_o^T \frac{w_o}{\|w_o\|}
     = g(x_p) + r\, \frac{w_o^T w_o}{\|w_o\|}
     = r\, \frac{\|w_o\|^2}{\|w_o\|} = r\,\|w_o\|   (1.5)

To reach the second line we expand g by its definition, distributing w_o^T over the
two terms. The first part equals g(x_p), which lets us write the third line. By
definition of the decision boundary g(x_p) = 0, so only the second part remains, and
simplifying we obtain the final result.
Thus

r = \frac{g(x)}{\|w_o\|}   (1.6)
Now consider the distance between the hyperplane and a positive support vector x^{(s)}:

r \big|_{x^{(s)}} = \frac{g(x^{(s)})}{\|w_o\|} = \frac{1}{\|w_o\|} = \frac{\rho}{2}   (1.7)

Here ρ is divided by two because we are taking half of the maximum margin.
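
A quick numerical check of equations (1.6) and (1.7), not from the notes; the values of w_o, b_o and the support vector are invented:

import numpy as np

w_o = np.array([1.0, 1.0])          # assumed optimal weights (illustrative values)
b_o = -3.0
x_s = np.array([2.0, 2.0])          # a positive support vector: g(x_s) = +1

g = w_o @ x_s + b_o                 # discriminant g(x) = w_o^T x + b_o
r = g / np.linalg.norm(w_o)         # distance from the hyperplane, eq. (1.6)
rho = 2.0 / np.linalg.norm(w_o)     # margin rho = 2 / ||w_o||

print(g, r, rho)                    # 1.0, 0.707..., 1.414...  (so r = rho / 2, eq. (1.7))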

1.4.2 Quadratic Optimization Problem


Finding the optimum values of w and b that maximize the margin leads to a quadratic
optimization problem. Given the training samples T = {(x^{(i)}, d^{(i)})}, find the
values of w and b that minimize

Ψ(w) = \frac{1}{2} w^T w   (i.e. minimize ∥w∥)   (1.8)

subject to the constraints (zero classification errors):

d^{(i)} (w^T x^{(i)} + b) ≥ 1, ∀ i = 1, ..., N

We are searching for the hyperplane that correctly classifies all the data and at the
same time has the largest margin. The Perceptron can also solve the separation
problem, but it returns an arbitrary solution, whereas here we get the one with the
maximum margin. The objective is quadratic and convex in w, the constraints are
linear in w, and the problem scales with the size m of the input space.
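
A hedged sketch (not from the notes) of this quadratic program using the cvxpy modelling library on a tiny, invented, linearly separable training set:

import cvxpy as cp
import numpy as np

# toy linearly separable training set T = {(x_i, d_i)} (invented data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
d = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))       # Psi(w) = 1/2 w^T w
constraints = [cp.multiply(d, X @ w + b) >= 1]         # d_i (w^T x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()

rho = 2.0 / np.linalg.norm(w.value)                    # the maximum margin
print(w.value, b.value, rho)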

Why is this related to Structural Risk Minimization? The next theorem is fundamental
to understanding why we presented the SVM; otherwise we would have no reason to
introduce it instead of the Perceptron, Least Mean Squares, and so on.
We have fixed the training error on linearly separable problems. Then minimizing the
norm of w is equivalent to minimizing the VC-dimension and thus the capacity term
(VC-confidence) ε(VC, N, δ).

Theorem 2 (Vapnik). Let D be the diameter of the smallest ball containing the data
points x_1, ..., x_N. For the class of separating hyperplanes described by the
equation w^T x + b = 0, the upper bound on the VC-dimension is

VC ≤ min\left(\frac{D^2}{\rho^2}, m_0\right) + 1   (1.9)
For our class of models, the linear models, the standard VC-dimension equals the
dimension of the problem plus one (remember the example of the 3 linearly separable
points). The VC-dimension can be reduced by increasing the margin, because the upper
bound is the minimum of the two terms, plus one, consistent with the fact that the
VC-dimension of the class of linear separating hyperplanes in an n-dimensional space
is n + 1, as in Section 1.3.
So we are optimizing the VC bound while the empirical risk is fixed at 0, as we said
before about fixing the training error. This corresponds to regularization, and thus
to reducing the VC-dimension as we did with Tikhonov, but reached from a completely
different line of reasoning.
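
A small sketch (not from the notes) of the bound (1.9); D, ρ and m_0 are invented values chosen only to show how a larger margin tightens the bound:

def vc_upper_bound(D, rho, m0):
    # min(D^2 / rho^2, m0) + 1, as in equation (1.9)
    return min(D ** 2 / rho ** 2, m0) + 1

# assumed values: data of diameter D = 10 in an m0 = 50 dimensional input space
for rho in (1.0, 2.0, 5.0):
    print(rho, vc_upper_bound(D=10.0, rho=rho, m0=50))
# rho = 1 -> 51 (capped by m0), rho = 2 -> 26, rho = 5 -> 5: a larger margin lowers the bound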
What is the best solution among all the possible separating hyperplanes? The one in
the middle. This theorem relates the maximum margin to the idea of minimum
VC-dimension.
Why "support vector" in the name? We will discuss that in the next lesson.
