Support Vector Machines
Jeff Wu
1 Support Vector Machines: history
SVMs close to their current form were first introduced in COLT-92 by Boser, Guyon & Vapnik, and have become rather popular since then. But the history of SVMs goes way back.
In 1936, R. A. Fisher suggested the first algorithm for pattern recognition
(Fisher 1936), called “discriminant analysis”.
Aronszajn (1950) introduced the ‘Theory of Reproducing Kernels’.
Vapnik and Lerner (1963) introduced the Generalized Portrait algorithm (the
algorithm implemented by support vector machines is a nonlinear
generalization of the Generalized Portrait algorithm).
Vapnik and Chervonenkis (1964) further developed the Generalized Portrait algorithm.
Cover (1965) discussed large margin hyperplanes in the input space and also
sparseness.
1 Support Vector Machines: history
The field of ‘statistical learning theory’ began with Vapnik and Chervonenkis
(1974) (in Russian).
SVMs can be said to have started when statistical learning theory was
developed further with Vapnik (1979) (in Russian).
Vapnik (1982) wrote an English translation of his 1979 book.
In 1995 the soft margin classifier was introduced by Cortes and Vapnik (1995);
in the same year the algorithm was extended to the case of regression by
Vapnik (1995) in The Nature of Statistical Learning Theory.
The papers by Bartlett (1998) and Shawe-Taylor et al. (1998) gave the first rigorous statistical bounds on the generalisation of hard margin SVMs.
Shawe-Taylor and Cristianini (2000) gave statistical bounds on the
generalisation of soft margin algorithms and for the regression case.
2 Why SVMs
Empirically good performance: successful applications in many fields.
(bioinformatics, text, image recognition, . . . )
For the pattern recognition case, SVMs have been used for isolated
handwritten digit recognition (Cortes and Vapnik, 1995; Scholkopf, Burges and
Vapnik, 1995; Scholkopf, Burges and Vapnik, 1996; Burges and Scholkopf,
1996), speaker identification (Schmidt,
1996), charmed quark detection, face detection in images (Osuna, Freund
and Girosi, 1997a), and text categorization (Joachims, 1997).
For the regression estimation case, SVMs have been compared on benchmark
time series prediction tests (Muller et al., 1997; Mukherjee, Osuna and Girosi,
1997), the Boston housing problem (Drucker et al., 1997), and (on artificial
data) on the PET operator inversion problem (Vapnik, Golowich and Smola, 1996).
In most of these cases, SVM generalization performance (i.e. error rates on
test sets) either matches or is significantly better than that of competing
methods.
Today’s Agenda
VC dimension
Linear SVMs (with both separable and non-separable cases)
Non-linear SVMs (Kernel trick)
Preliminaries
Empirical Risk and the true Risk
We can try to learn about f(x, α) by choosing a function that performs well on the training data:
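In the usual setup (a sketch: n training pairs (x_i, y_i) drawn i.i.d. from an unknown distribution P(x, y), with labels y_i ∈ {−1, 1}), the true risk and the empirical risk of f(·, α) are
$$R(\alpha) = \int \tfrac{1}{2}\,|y - f(x,\alpha)|\, dP(x,y), \qquad R_{\mathrm{emp}}(\alpha) = \frac{1}{2n}\sum_{i=1}^{n} |y_i - f(x_i,\alpha)| .$$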
Empirical Risk and the true Risk
Vapnik & Chervonenkis showed that an upper bound on the true risk can be
given by the empirical risk + an additional term:
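A common statement of this bound (holding with probability at least 1 − η, where h is the VC dimension of the function class and n the number of training points):
$$R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\,(\log(2n/h) + 1) - \log(\eta/4)}{n}} .$$
The second term grows with the capacity h, which is what the following analogy is about.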
The capacity of a machine is its ability to learn any training set without error.
The best generalization performance will be achieved if the right balance between fitting the training data and limiting capacity is achieved.
A machine with too much capacity is like a botanist with a photographic
memory who, when presented with a new tree, concludes that it is not a tree
because it has a different number of leaves from anything he/she has seen
before.
A machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree.
VC dimension:
The VC dimension of a set of functions is the maximum number of points that can be separated (shattered) in all possible ways by that set of functions. For hyperplanes in $\mathbb{R}^n$, the VC dimension can be shown to be n + 1. (For example, lines in $\mathbb{R}^2$ can shatter three points in general position but never four, giving VC dimension 3.)
Vapnik & Chervonenkis also showed:
Linear SVMs
Classification Margin
Distance from example $x_i$ to the separator is $r = \dfrac{|w^T x_i + b|}{\lVert w \rVert}$.
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between the support vectors of the two classes.
Maximum Margin Classification
Only the support vectors matter; the other training examples are ignorable (more details later).
Linear SVM Mathematically
Let the training set $\{(x_i, y_i)\}_{i=1..n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, be separated by a hyperplane with margin ρ. Then for each training example $(x_i, y_i)$:
$w^T x_i + b \le -\rho/2$ if $y_i = -1$
$w^T x_i + b \ge \rho/2$ if $y_i = +1$
or, equivalently, $y_i(w^T x_i + b) \ge \rho/2$.
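A sketch of the standard next step: rescale w and b so that the support vectors satisfy $|w^T x_s + b| = 1$. The margin is then $\rho = 2/\lVert w\rVert$, and maximizing it is equivalent to the quadratic program
$$\min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1,\quad i = 1,\dots,n .$$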
Solving the Optimization Problem
Lagrangian Duality
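For the quadratic program above, the Lagrangian (one multiplier α_i ≥ 0 per inequality constraint; a sketch of the standard construction) is
$$L(w, b, \alpha) = \tfrac{1}{2}\lVert w\rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right],$$
to be minimized over (w, b) and maximized over α ≥ 0.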
KKT conditions
If there exists some saddle point of L, then the saddle point satisfies the
following “Karush-Kuhn-Tucker” (KKT) conditions:
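For the SVM Lagrangian above, these conditions read (in their standard form):
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i} \alpha_i y_i = 0,$$
$$\alpha_i \ge 0, \qquad y_i(w^T x_i + b) - 1 \ge 0, \qquad \alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0 .$$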
Dual Problem of SVMs
Standard form:
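Substituting the stationarity conditions back into the Lagrangian eliminates w and b and gives the dual in its usual form:
$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_{i=1}^{n} \alpha_i y_i = 0 .$$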
The optimization problem solution
Problem becomes:
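Given the optimal multipliers α_i, the solution is usually written as (a sketch; x_k denotes any support vector, i.e. any example with α_k > 0):
$$w = \sum_{i} \alpha_i y_i x_i, \qquad b = y_k - w^T x_k, \qquad f(x) = \operatorname{sign}\!\Big(\sum_{i} \alpha_i y_i\, x_i^T x + b\Big) .$$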
Interpretation of SVMs
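One reading, following from the complementarity condition above: α_i can be non-zero only for examples lying exactly on the margin,
$$\alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0 \;\Rightarrow\; \alpha_i > 0 \text{ only if } y_i(w^T x_i + b) = 1,$$
which is why only the support vectors matter and the solution is sparse.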
Non-separable linear SVMs
Slack variables ξ_i measure by how much each example violates the margin constraint.
Soft Margin Classification Mathematically
The old formulation:
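In the usual notation (a sketch; C is a penalty parameter trading margin width against training errors), the old hard-margin problem and the new soft-margin problem are
$$\min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1,$$
$$\min_{w,b,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 .$$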
Non-linear SVMs
Datasets that are linearly separable with some noise work out great (picture a 1-D dataset along the x-axis).
How about… mapping the data to a higher-dimensional space, e.g. x → (x, x²)?
Non-linear SVMs: Feature spaces
Φ: x → φ(x)
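A small illustrative example of such a map: the quadratic feature map on $\mathbb{R}^2$,
$$\varphi(x_1, x_2) = \left( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \right),$$
sends the data to $\mathbb{R}^3$, where sets separable only by a quadratic curve in the original space become linearly separable; its inner products satisfy $\varphi(u)^T \varphi(v) = (u^T v)^2$.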
The “Kernel Trick”
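The idea in brief: in the dual problem the data appear only through inner products $x_i^T x_j$. If a kernel function computes the inner product in feature space directly,
$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j),$$
then every occurrence of $x_i^T x_j$ can be replaced by $K(x_i, x_j)$, and the SVM is trained in the feature space without ever computing $\varphi(x)$ explicitly.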
Mercer’s theorem
For some functions $K(x_i, x_j)$, checking that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ for some feature map φ can be cumbersome.
Mercer’s theorem:
Every positive semi-definite symmetric function is a kernel.
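Stated a little more precisely (the standard Mercer condition): a symmetric function K corresponds to an inner product in some feature space, $K(x, z) = \varphi(x)^T \varphi(z)$, if and only if
$$\iint K(x, z)\, g(x)\, g(z)\, dx\, dz \ \ge\ 0 \quad \text{for every } g \text{ with } \int g(x)^2\, dx < \infty;$$
equivalently, every Gram matrix $[K(x_i, x_j)]_{i,j}$ built from the data is positive semi-definite.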
Polynomial-SVMs
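The polynomial kernel of degree p is commonly written as (the constant term varies between formulations)
$$K(x_i, x_j) = \left( x_i^T x_j + 1 \right)^p .$$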
RBF(radial basis function)-SVMs
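The Gaussian RBF kernel, with width parameter σ, is commonly written as
$$K(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right),$$
and corresponds to an infinite-dimensional feature space.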
Non-linear SVMs Mathematically
The solution is
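In its usual kernelized form (a sketch; b is obtained from any support vector as before):
$$f(x) = \operatorname{sign}\!\Big( \sum_{i} \alpha_i y_i\, K(x_i, x) + b \Big),$$
with the α_i found by solving the dual with $x_i^T x_j$ replaced by $K(x_i, x_j)$.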
SVMs: more results
There is much more in the field of SVMs and kernel machines than we could cover here, including:
Regression, clustering, semi-supervised learning and other domains.
Lots of other kernels, e.g. string kernels to handle text.
Lots of research in modifications, e.g. to improve generalization ability, or
tailoring to a particular task.
Lots of research in speeding up training.
Readings: