ML Notes (Appunti ML)
Vincenzo Gargano
Contents
1 Neural Networks
1.1 Regularization
1.2 Stopping Criteria
1.3 VC-Dimension
1.3.1 Statistical Learning Theory
1.4 Intro to SVM
1.4.1 Linear SVM
1.4.2 Quadratic Optimization Problem
Chapter 1
Neural Networks
1.1 Regularization
The loss is an error term plus a regularization term, and there are many approaches, such as coefficient shrinkage, also known as Tikhonov regularization:
\[
Loss(W) = \sum_{p=1}^{l} (y_p - \hat{y}_p)^2 + \lambda \, \|W\|^2
\]
A small lambda allows a high norm of the weights, hence a too complex model that tends to overfit the data. A large lambda makes the second term dominate, increasing the data error and leading to underfitting. It is called the penalty term because it penalizes high values of the weights; some weights can even go to zero. Using only the L2 penalty is called Ridge, using only L1 is called Lasso, and using both is called Elastic Net. L2 tends to shrink all weights towards smaller values, while L1 penalizes the absolute values, driving some weights to 0 and allowing others to stay larger; however, the Lasso loss is not differentiable at zero.
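As a quick illustration (a minimal sketch, not from the notes: a linear model with squared loss in numpy), the closed-form ridge solution shows how increasing lambda shrinks the weight norm at the cost of a larger data error:

```python
# Minimal sketch: Tikhonov/ridge shrinkage on a toy linear regression problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

for lam in [0.0, 1.0, 10.0, 100.0]:
    # Closed-form minimizer of sum_p (y_p - w^T x_p)^2 + lam * ||w||^2
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    data_err = np.sum((y - X @ w) ** 2)
    print(f"lambda={lam:6.1f}  ||w||={np.linalg.norm(w):.3f}  data error={data_err:.3f}")
```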
1.2 Stopping Criteria
Stop anyway after an excessive number of epochs, to escape from slow convergence, instead of relying on a fixed number of epochs.
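A minimal sketch of such a criterion (the train_epoch/validate callables are hypothetical placeholders, and the patience-based rule is an assumption here, not the notes' exact rule): a hard cap on epochs prevents endlessly slow convergence, while the stop itself is driven by lack of improvement.

```python
# Sketch: stop when improvement stalls, with a hard cap on the number of epochs.
def fit(train_epoch, validate, max_epochs=10_000, patience=20, tol=1e-4):
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):            # hard cap: escape from slow convergence
        train_epoch()                          # one epoch of training (hypothetical callable)
        err = validate()                       # e.g. validation error after this epoch
        if err < best_err - tol:               # meaningful improvement: remember it
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:   # no improvement for `patience` epochs: stop
            break
    return best_err, best_epoch
```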
1.3 VC-Dimension
Theorem 1. The VC-Dimension of a class of functions H is the maximum cardinality of a set of points in X that can be shattered by H.
The VC-dimension is not the number of free parameters, although they are related: there are models with one parameter and infinite VC-dim (see the Haykin book). For nearest-neighbour the VC-dim is infinite: for example, 1-Nearest-Neighbour can shatter arbitrarily many points with 0 training error. In this sense 1-Nearest-Neighbour is not really a model at all.
1.3.1 Statistical Learning Theory
The VC bound gives us a way to estimate the error on future data based only on the training error and the VC-Dimension of H. This provides useful information for model selection and assessment because it does not require the costly process of cross-validation.
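One common form of this bound (as in Vapnik's statistical learning theory; the exact expression of the capacity term may differ slightly from the one used in the course) states that, with probability at least $1 - \delta$,
\[
R(h) \;\le\; R_{emp}(h) \;+\; \underbrace{\sqrt{\frac{VC\left(\ln\frac{2N}{VC} + 1\right) + \ln\frac{4}{\delta}}{N}}}_{\epsilon(VC,\,N,\,\delta)}
\]
where $R_{emp}$ is the training error, $N$ the number of samples, and $\epsilon(VC, N, \delta)$ the capacity term (VC confidence) used later in these notes.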
1.4 Intro to SVM
1.4.1 Linear SVM
Separating hyperplane: $w^T x + b = 0$, where $w^T x_i + b \ge 0$ for $d_i = +1$ and $w^T x_i + b < 0$ for $d_i = -1$.
Moreover, not all separating hyperplanes have equally safe zones: some have a larger and some a smaller margin. We prefer the ones with larger margins, and we define the optimal hyperplane as the one that maximizes the margin ρ:
\[
w_o^T x + b_o = 0 \qquad (1.1)
\]
The margin is $\rho = \frac{2}{\|w\|}$, so to maximize ρ we need to minimize $\|w\|$. This is an optimization problem in terms of minimizing the norm, in order to get a new LTU that maximizes the margin. We will see later why this is relevant.
We can rescale w and b (normalizing so that the closest points get value 1 or −1), so that the points closest to the hyperplane satisfy $g(x_i) = |w^T x_i + b| = 1$, and then write the constraints in the compact form $d_i(w^T x_i + b) \ge 1$ for all i. The support vectors are exactly these closest points/datapoints to the boundary or hyperplane.
Let us call $g(x) = w^T x + b$ the discriminant, and recall that $w_o$ is a vector orthogonal to the hyperplane. Denote by r the distance between x and the optimal hyperplane, and by $x_p$ the projection of x onto the hyperplane, so that
\[
x = x_p + r \, \frac{w_o}{\|w_o\|} \qquad (1.4)
\]
We now derive the margin by evaluating $g(x) = w_o^T x + b_o$ at this decomposition:
\[
\begin{aligned}
g(x) &= g\!\left(x_p + r \, \frac{w_o}{\|w_o\|}\right) \\
     &= w_o^T x_p + b_o + r \, \frac{w_o^T w_o}{\|w_o\|} \\
     &= g(x_p) + r \, \frac{w_o^T w_o}{\|w_o\|} \\
     &= r \, \frac{\|w_o\|^2}{\|w_o\|} = r \, \|w_o\|
\end{aligned}
\qquad (1.5)
\]
To get to the second line we use the definition of g(x), distributing $w_o^T$ over the two terms. The first part equals $g(x_p)$, which gives the third line. Since $x_p$ lies on the decision boundary, by definition $g(x_p) = 0$, so only the second term remains; simplifying $w_o^T w_o = \|w_o\|^2$ we obtain the final result.
Thus
\[
r = \frac{g(x)}{\|w_o\|} \qquad (1.6)
\]
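A quick numeric check of (1.4) and (1.6) (toy values chosen here, not from the notes): the signed distance $g(x)/\|w\|$ recovers a projection point that lies exactly on the hyperplane.

```python
# Toy check that r = g(x)/||w|| is the distance from x to the hyperplane w^T x + b = 0.
import numpy as np

w = np.array([3.0, 4.0])              # ||w|| = 5
b = -5.0
x = np.array([4.0, 2.0])

g = w @ x + b                         # discriminant g(x) = w^T x + b  -> 15
r = g / np.linalg.norm(w)             # signed distance via (1.6)      -> 3
x_p = x - r * w / np.linalg.norm(w)   # invert (1.4): project x onto the hyperplane
print(r, w @ x_p + b)                 # 3.0 and ~0.0: x_p lies on the hyperplane
```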
Now we consider the distance between the hyperplane and a positive support vector $x^{(s)}$:
\[
r = \frac{g(x^{(s)})}{\|w_o\|} = \frac{1}{\|w_o\|} = \frac{\rho}{2} \qquad (1.7)
\]
Here ρ is divided by two because the distance from the hyperplane to a single support vector is only half of the full margin.
1.4.2 Quadratic Optimization Problem
We are searching for the hyperplane that correctly classifies all the data and at the same time has the largest margin. The Perceptron is also able to find a separating hyperplane, but here we do not take just some random feasible solution: we take the one with the max margin. The resulting problem, minimizing $\|w\|$ (equivalently $\frac{1}{2}\|w\|^2$) subject to $d_i(w^T x_i + b) \ge 1$, has an objective that is quadratic and convex in w, constraints that are linear in w, and it scales with the size m of the input space.
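A minimal sketch of this quadratic problem (not the notes' code; the toy dataset and the use of scipy's generic SLSQP solver are choices made here for illustration):

```python
# Hard-margin primal QP:  min 1/2 ||w||^2  s.t.  d_i (w^T x_i + b) >= 1
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
d = np.array([1, 1, 1, -1, -1, -1])      # class labels d_i in {-1, +1}

def objective(theta):                    # theta packs [w, b]
    w = theta[:-1]
    return 0.5 * np.dot(w, w)            # 1/2 ||w||^2: quadratic, convex in w

def margin_constraints(theta):
    w, b = theta[:-1], theta[-1]
    return d * (X @ w + b) - 1.0         # every entry must be >= 0 (linear in w, b)

res = minimize(objective,
               x0=np.ones(X.shape[1] + 1),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")

w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin rho =", 2.0 / np.linalg.norm(w))
```

Dedicated QP or SVM solvers would be used in practice; the point of the sketch is only that the objective is quadratic and the margin constraints are linear.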
Why is this related to Structural Risk Minimization? The next theorem is fundamental to understand why we presented the SVM; otherwise we would have no reason to introduce it instead of the Perceptron, Least Mean Squares...
We fixed the training error (to zero) on linearly separable problems. Then minimizing the norm of w is equivalent to minimizing the VC dimension, and thus the capacity term (VC confidence) $\epsilon(VC, N, \delta)$.
Theorem 2 (Vapnik). Let D be the diameter of the smallest ball around the data points $x_1, \dots, x_N$. For the class of separating hyperplanes described by the equation $w^T x + b = 0$, the upper bound to the VC-dimension is
\[
VC \le \min\!\left(\frac{D^2}{\rho^2},\, m_0\right) + 1 \qquad (1.9)
\]
This means that, while for our class of models (the linear models) the standard VC-dim equals the dimension of the problem plus one (remember the example of the 3 linearly separable points), the VC-dimension can now be reduced by increasing the margin, because the upper bound is the minimum between the two quantities, plus one. The "+1" accounts for the fact that the VC dimension of the class of separating hyperplanes in n-dimensional space is n + 1, as in Section 1.3.
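As an illustrative calculation (numbers chosen here for the example, not taken from the notes), a large margin can make the bound much smaller than the dimension-based value:
\[
D = 10,\ \rho = 5,\ m_0 = 100 \;\Rightarrow\; VC \le \min\!\left(\frac{10^2}{5^2},\ 100\right) + 1 = \min(4,\, 100) + 1 = 5 \;\ll\; m_0 + 1 = 101
\]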
So we are optimizing the VC bound while the empirical risk is fixed to 0, as we said before about fixing the training error. This corresponds to regularization, thus reducing the VC dimension as we did with Tikhonov regularization, but arrived at from a completely different reasoning.
What is the best solution among all the possible separating hyperplanes? The one in the middle. This theorem relates the maximum margin to the idea of minimum VC dimension.
Why "support vector" in the name? We will discuss it in the next lesson.