A Short SVM (Support Vector Machine) Tutorial
j.p.lewis
CGIT Lab / IMSC
U. Southern California
version 0.zz dec 2004
This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/Lagrange multipliers. It explains the more general KKT (Karush-Kuhn-Tucker) conditions for an optimum with inequality constraints, dual optimization, and the “kernel trick”.
I wrote this to solidify my knowledge after reading several presentations of SVMs: the Burges tutorial (comprehensive and
difficult, probably not for beginners), the presentation in the Forsyth and Ponce computer vision book (easy and short, less
background explanation than here), the Cristianini and Shawe-Taylor SVM book, and the excellent Scholkopf/Smola Learning
with Kernels book.
’ means transpose.
This is also (somewhat) intuitive – if this were not the case (and the Lagrangian were to be minimized wrt λ), then, subject to the previously stated constraint λ ≥ 0, the ideal solution λ = 0 would remove the constraints entirely.
Optimization problems can be converted to their dual form by differentiating the Lagrangian wrt the original variables, solving
the results so obtained for those variables if possible, and substituting the resulting expression(s) back into the Lagrangian,
thereby eliminating the variables. The result is an equation in the Lagrange multipliers, which must be maximized. Inequality constraints in the original variables also change to equality constraints in the multipliers.
The dual form may or may not be simpler than the original (primal) optimization. In the case of SVMs, the dual form has
simpler constraints, but the real reason for using the dual form is that it puts the problem in a form that allows the kernel trick
to be used, as described below.
The fact that the dual problem requires maximization wrt the multipliers carries over from the KKT condition. Conversion from
the primal to the dual converts the problem from a saddle point to a simple maximum.
A worked example: a general quadratic with a general linear constraint.
With α the Lagrange multiplier vector,

    L_p = (1/2) x'Kx + c'x + α'(Ax + d)

    dL_p/dx = Kx + c + A'α = 0

    Kx = −A'α − c

    x = −K⁻¹A'α − K⁻¹c        (assume K is symmetric)

Substitute this x into L_p:

    (1/2)(−K⁻¹A'α − K⁻¹c)'K(−K⁻¹A'α − K⁻¹c) − c'(K⁻¹A'α + K⁻¹c) + α'(A(−K⁻¹A'α − K⁻¹c) + d)

Writing X ≡ K(−K⁻¹A'α − K⁻¹c), this is

    −(1/2)α'AK⁻¹X − (1/2)c'K⁻¹X − c'K⁻¹A'α − c'K⁻¹c − α'AK⁻¹A'α − α'AK⁻¹c + α'd

    = (1/2)(α'AK⁻¹KK⁻¹A'α + α'AK⁻¹KK⁻¹c) + (1/2)(c'K⁻¹KK⁻¹A'α + c'K⁻¹KK⁻¹c) + · · ·

    = (1/2)α'AK⁻¹A'α + c'K⁻¹A'α + (1/2)c'K⁻¹c − c'K⁻¹A'α − c'K⁻¹c − α'AK⁻¹A'α − α'AK⁻¹c + α'd

    = L_d = −(1/2)α'AK⁻¹A'α − (1/2)c'K⁻¹c − α'AK⁻¹c + α'd

The −(1/2)c'K⁻¹c term can be ignored because it is a constant term wrt the independent variable α, so the result is of the form

    −(1/2)α'Qα − α'Rc + α'd

with Q = AK⁻¹A' and R = AK⁻¹.
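As a quick sanity check of this derivation, here is a small numeric sketch of my own (not part of the original text), assuming numpy is available: it maximizes L_d over α for a random problem, recovers x = −K⁻¹(A'α + c), and compares the result against a direct solve of the KKT system Kx + c + A'α = 0, Ax + d = 0.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 5, 2                                   # x in R^5, two linear constraints
    M = rng.standard_normal((n, n))
    K = M @ M.T + n * np.eye(n)                   # a symmetric positive definite K
    A = rng.standard_normal((m, n))
    c = rng.standard_normal(n)
    d = rng.standard_normal(m)

    Kinv = np.linalg.inv(K)
    # stationary point of L_d:  A K^-1 A' alpha = d - A K^-1 c
    alpha = np.linalg.solve(A @ Kinv @ A.T, d - A @ Kinv @ c)
    x_dual = -Kinv @ (A.T @ alpha + c)

    # direct KKT solve of  min (1/2) x'Kx + c'x  subject to  Ax + d = 0
    kkt = np.block([[K, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(kkt, np.concatenate([-c, -d]))
    x_primal = sol[:n]

    print(np.allclose(x_dual, x_primal))          # True
    print(np.allclose(A @ x_dual + d, 0))         # True: the constraint is satisfied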
SVMs start from the goal of separating the data with a hyperplane, and extend this to non-linear decision boundaries using the kernel trick described below. The equation of a general hyperplane is w'x + b = 0 with x being the point (a vector), w the weights (also a vector). The hyperplane should separate the data, so that w'x_k + b > 0 for all the x_k of one class, and w'x_j + b < 0 for all the x_j of the other class. If the data are in fact separable in this way, there is probably more than one way to do it.
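As a tiny illustration (my own, with made-up w and b rather than values from the text, assuming numpy), the hyperplane test is just the sign of w'x + b:

    import numpy as np

    w, b = np.array([1.0, -2.0]), 0.5             # assumed example weights and offset
    points = np.array([[3.0, 0.0], [0.0, 2.0]])   # one point on each side of the plane
    labels = np.sign(points @ w + b)              # +1 on one side, -1 on the other
    print(labels)                                  # [ 1. -1.]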
Among the possible hyperplanes, SVMs select the one where the distance of the hyperplane from the closest data points (the
“margin”) is as large as possible (Fig. 1). This sounds reasonable, and the resulting line in the 2D case is similar to the line I
would probably pick to separate the classes. The Scholkopf/Smola book describes an intuitive justification for this criterion:
suppose the training data are good, in the sense that every possible test vector is within some radius r of a training vector.
Then, if the chosen hyperplane is at least r from any training vector it will correctly separate all the test data. By making the
hyperplane as far as possible from any data, r is allowed to be correspondingly large. The desired hyperplane (that maximizes
the margin) is also the bisector of the line between the closest points on the convex hulls of the two data sets.
Now, find this hyperplane. By labeling the training points by y_k ∈ {−1, 1}, with 1 being a positive example, −1 a negative training example,

    y_k(w'x_k + b) ≥ 0    for all points

Both w, b can be scaled without changing the hyperplane. To remove this freedom, scale w, b so that

    y_k(w'x_k + b) ≥ 1    ∀k
Next we want an expression for the distance between the hyperplane and the closest points; w, b will be chosen to maximize
this expression. Imagine additional “supporting hyperplanes” (the dashed lines in Fig. 1) parallel to the separating hyperplane
and passing through the closest points (the support points). These are y_j(w'x_j + b) = 1, y_k(w'x_k + b) = 1 for some points j, k (there may be more than one such point on each side).
Figure 2: a - b = 2m
The distance between the separating hyperplane and the nearest points (the margin) is half of the distance between these support
hyperplanes, which is the same as the difference between the distances to the origin of the closest point on each of the support
hyperplanes (Fig. 2).
The distance of the closest point on a hyperplane to the origin can be found by minimizing x'x subject to x being on the hyperplane,

    min_x  x'x + λ(w'x + b − 1)

    d/dx = 2x + λw = 0    →    x = −(λ/2)w

Now substitute this x into the hyperplane equation w'x + b − 1 = 0:

    −(λ/2)w'w + b = 1    →    λ = 2(b − 1)/(w'w)

Substitute this λ back into x:

    x = ((1 − b)/(w'w)) w

    x'x = ((1 − b)²/(w'w)²) w'w = (1 − b)²/(w'w)

    ‖x‖ = sqrt(x'x) = (1 − b)/sqrt(w'w) = (1 − b)/‖w‖

Similarly, working out the case w'x + b = −1 gives ‖x‖ = (−1 − b)/‖w‖.

Lastly, subtract these two distances, which gives the summed distance from the separating hyperplane to the nearest points: 2/‖w‖.
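A quick numeric check of the 2/‖w‖ result (my own sketch, with arbitrary w and b, assuming numpy): build the closest point on each support hyperplane as derived above and measure its distance to the separating hyperplane w'x + b = 0.

    import numpy as np

    w = np.array([2.0, 1.0])
    b = -0.5                                      # arbitrary example values
    x_plus = (1 - b) / (w @ w) * w                # closest point on  w'x + b = 1
    x_minus = (-1 - b) / (w @ w) * w              # closest point on  w'x + b = -1
    dist = lambda x: abs(w @ x + b) / np.linalg.norm(w)
    print(dist(x_plus) + dist(x_minus))           # equals ...
    print(2 / np.linalg.norm(w))                  # ... 2/||w||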
To maximize this distance, we need to minimize ‖w‖, subject to all the constraints y_k(w'x_k + b) ≥ 1. Following the standard KKT setup, use positive multipliers and subtract the constraints:

    min_{w,b}  L = (1/2)w'w − Σ_k λ_k (y_k(w'x_k + b) − 1)
This does not yet give b. By the KKT complementarity condition, either the Lagrange multiplier is zero (the constraint is inactive), or the multiplier is positive and the constraint is zero (active). b can be obtained by finding one of the active constraints, i.e. a k for which λ_k is nonzero, and solving y_k(w'x_k + b) − 1 = 0 for b. With w, b known the separating hyperplane is defined.
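The text finds w, b through the Lagrangian machinery; purely as a cross-check, here is a minimal sketch of my own that hands the primal problem, minimize (1/2)w'w subject to y_k(w'x_k + b) ≥ 1, to a general-purpose solver on a tiny separable data set (assumes numpy and scipy are available).

    import numpy as np
    from scipy.optimize import minimize

    # two linearly separable clusters, labels +1 / -1
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                  [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
    y = np.array([1, 1, 1, -1, -1, -1])

    def objective(p):                             # p = [w1, w2, b]
        w = p[:2]
        return 0.5 * w @ w

    constraints = [{'type': 'ineq',               # y_k (w'x_k + b) - 1 >= 0
                    'fun': (lambda p, xk=xk, yk=yk: yk * (p[:2] @ xk + p[2]) - 1)}
                   for xk, yk in zip(X, y)]

    res = minimize(objective, np.zeros(3), constraints=constraints)
    w, b = res.x[:2], res.x[2]
    print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))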
In a real problem it is unlikely that a line will exactly separate the data, and even if a curved decision boundary is possible (as
it will be after adding the nonlinear data mapping in the next section), exactly separating the data is probably not desirable: if
the data have noise and outliers, a smooth decision boundary that ignores a few data points is better than one that loops around
the outliers.
This issue is handled in different ways by different flavors of SVMs. In the simplest(?) approach, instead of requiring

    y_k(w'x_k + b) ≥ 1
Reducing α allows more of the data to lie on the wrong side of the hyperplane and be treated as outliers, which gives a smoother
decision boundary.
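The α-parameterized variant referred to here is only one of several soft-margin formulations. As a hedged sketch of my own using the most common alternative (slack variables ξ_k with a penalty C, related in spirit but not in notation to the α above; assumes numpy and scipy): smaller C tolerates more violations and gives a smoother boundary.

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.2],        # a +1 point lying close to the other class
                  [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
    y = np.array([1, 1, 1, -1, -1, -1])
    C = 0.1                                                   # small C -> smoother boundary
    n = len(X)

    def objective(p):                             # p = [w1, w2, b, xi_1..xi_n]
        w, xi = p[:2], p[3:]
        return 0.5 * w @ w + C * xi.sum()

    cons = [{'type': 'ineq',                      # y_k (w'x_k + b) >= 1 - xi_k
             'fun': (lambda p, k=k: y[k] * (p[:2] @ X[k] + p[2]) - 1 + p[3 + k])}
            for k in range(n)]
    bounds = [(None, None)] * 3 + [(0, None)] * n             # xi_k >= 0

    res = minimize(objective, np.zeros(3 + n), bounds=bounds, constraints=cons)
    w, b, xi = res.x[:2], res.x[2], res.x[3:]
    print("w =", w, "b =", b)
    print("slacks =", np.round(xi, 3))            # nonzero slack marks a tolerated violation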
Kernel trick
With w, b obtained the problem is solved for the simple linear case in which the data are separated by a hyperplane. The “kernel
trick” allows SVMs to form nonlinear boundaries. There are several parts to the kernel trick.
1. The algorithm has to be expressed using only the inner products of data items. For a hyperplane test w'x this can be done by recognizing that w itself is always some linear combination of the data x_k (“representer theorem”), w = Σ λ_k x_k, so w'x = Σ λ_k x_k'x (a small numeric check of this appears after the list).
2. The original data are passed through a nonlinear map to form new data with additional dimensions, e.g. by adding the
pairwise product of some of the original data dimensions to each data vector.
3. Instead of doing the inner product on these new, larger vectors, think of storing the inner product of two elements x_j'x_k in a table k(x_j, x_k) = x_j'x_k, so now the inner product of these large vectors is just a table lookup. But instead of doing this, just “invent” a function K(x_j, x_k) that could represent the dot product of the data after doing some nonlinear map on them. This function is the kernel.
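Here is the check promised under part 1, a sketch of my own assuming numpy: if w is a linear combination of the training points, the hyperplane test w'x can be computed from the inner products x_k'x alone.

    import numpy as np

    rng = np.random.default_rng(1)
    Xtrain = rng.standard_normal((4, 3))          # four training points in R^3
    lam = rng.standard_normal(4)                  # assumed expansion coefficients lambda_k
    x = rng.standard_normal(3)                    # a test point

    w = lam @ Xtrain                              # w = sum_k lambda_k x_k
    direct = w @ x                                # w'x computed directly
    via_inner_products = lam @ (Xtrain @ x)       # sum_k lambda_k (x_k'x)
    print(np.allclose(direct, via_inner_products))   # True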
To solve the problem the dual L_D should be maximized wrt λ_k as described earlier.
The dual form sometimes simplifies the optimization, as it does in this problem – the constraints for this version are simpler than the original constraints. One thing to notice is that this result depends on the 1/2 added for convenience in the original Lagrangian. Without this, the big double sum terms would cancel out entirely! For SVMs the major point of the dual formulation, however, is that the data (see L_D) appear in the form of their dot product x_k'x_l. This will be used in part 3 below.
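For reference, the L_D referred to above takes the standard hard-margin form, consistent with the properties just described:

    L_D = Σ_k λ_k − (1/2) Σ_k Σ_l λ_k λ_l y_k y_l x_k'x_l

which is maximized subject to λ_k ≥ 0 and Σ_k λ_k y_k = 0.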
Kernel trick part 2: nonlinear map. In the second part of the kernel trick, the data are passed through a nonlinear mapping.
For example in the two-dimensional case suppose that data of one class is near the origin, surrounded on all sides by data of the
second class. A ring of some radius will separate the data, but it cannot be separated by a line (hyperplane).
The x, y data can be mapped to three dimensions u, v, w:
u←x
v←y
w ← x² + y²
The new “invented” dimension w (squared distance from origin) allows the data to now be linearly separated by a u-v plane
situated along the w axis. The problem is solved by running the same hyperplane-finding algorithm on the new data points
(u, v, w)k rather than on the original two dimensional (x, y)k data. This example is misleading in that SVMs do not require
finding new dimensions that are just right for separating the data. Rather, a whole set of new dimensions is added and the
hyperplane uses any dimensions that are useful.
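A brief numeric illustration of this example (my own sketch with made-up data, assuming numpy): points from an inner cluster and a surrounding ring are not linearly separable in (x, y), but after the (u, v, w) map above a plane w = const separates them.

    import numpy as np

    rng = np.random.default_rng(2)
    inner = rng.normal(0.0, 0.3, (50, 2))                    # class 1: near the origin
    theta = rng.uniform(0, 2 * np.pi, 50)
    outer = np.column_stack([3 * np.cos(theta), 3 * np.sin(theta)])
    outer += rng.normal(0.0, 0.2, outer.shape)               # class 2: a noisy ring of radius 3

    def lift(p):                                  # the (u, v, w) map from the text
        return np.column_stack([p[:, 0], p[:, 1], (p ** 2).sum(axis=1)])

    w_inner = lift(inner)[:, 2]
    w_outer = lift(outer)[:, 2]
    # any plane w = const with max(w_inner) < const < min(w_outer) separates the lifted data
    print(w_inner.max() < w_outer.min())          # True for this data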
Kernel trick part 3: the “kernel” summarizes the inner product. The third part of the “kernel trick” is to make use of the fact that only the dot products of the data vectors are used. The dot product of the nonlinearly feature-enhanced data from step two can be expensive, especially in the case where the original data have many dimensions (e.g. image data) – the nonlinearly mapped data will have even more dimensions. One could think of precomputing and storing the dot products in a table K(x_j, x_k) = x_j'x_k, or of finding a function K(x_j, x_k) that reproduces or approximates the dot product.
The kernel trick goes in the opposite direction: it just picks a suitable function K(x_j, x_k) that corresponds to the dot product of some nonlinear mapping of the data. The commonly chosen kernels are
    (x_j'x_k)^d,  d = 2 or 3

    exp(−‖x_j − x_k‖²/σ)

    tanh(x_j'x_k + c)
Each of these can be thought of as expressing the result of adding a number of new nonlinear dimensions to the data and then
returning the inner product of two such extended data vectors.
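For the first kernel in the list with d = 2 and two-dimensional inputs, this correspondence can be made explicit (a sketch of my own, assuming numpy): the map φ(x) = (x1², √2·x1·x2, x2²) satisfies φ(x_j)'φ(x_k) = (x_j'x_k)².

    import numpy as np

    def phi(p):                                   # explicit feature map for the d = 2 kernel
        return np.array([p[0] ** 2, np.sqrt(2) * p[0] * p[1], p[1] ** 2])

    xj = np.array([1.5, -0.5])
    xk = np.array([2.0, 3.0])
    print(phi(xj) @ phi(xk))                      # dot product after the nonlinear map ...
    print((xj @ xk) ** 2)                         # ... equals the kernel value (x_j'x_k)^2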
An SVM is the maximum margin linear classifier (as described above) operating on the nonlinearly extended data. The particular
nonlinear “feature” dimensions added to the data are not critical as long as there is a rich set of them. The linear classifier will
figure out which ones are useful for separating the data. The particular kernel is to be chosen by trial and error on the test set,
but on at least some benchmarks these kernels are nearly equivalent in performance, suggesting the choice of kernel is not too
important.
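Putting the pieces together, here is a closing sketch of my own (not from the text; assumes numpy and scipy, and uses the standard hard-margin dual form quoted earlier with the quadratic kernel): it trains a kernel SVM on a ring-style example by maximizing L_D subject to λ_k ≥ 0 and Σ λ_k y_k = 0, recovers b from an active constraint, and classifies two test points.

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5],    # inner class, +1
                  [2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]])   # outer class, -1
    y = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
    n = len(X)

    def kernel(u, v):
        return (u @ v) ** 2                       # the (x_j'x_k)^2 kernel from the list above

    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    Q = (y[:, None] * y[None, :]) * K             # Q_kl = y_k y_l K(x_k, x_l)

    neg_dual = lambda lam: 0.5 * lam @ Q @ lam - lam.sum()   # minimize -L_D
    cons = [{'type': 'eq', 'fun': lambda lam: lam @ y}]      # sum_k lambda_k y_k = 0
    res = minimize(neg_dual, np.zeros(n), bounds=[(0, None)] * n, constraints=cons)
    lam = res.x

    # recover b from an active constraint (a clearly nonzero lambda_k)
    k = int(np.argmax(lam))
    b = y[k] - sum(lam[j] * y[j] * kernel(X[j], X[k]) for j in range(n))

    def classify(x):                              # sign of sum_k lambda_k y_k K(x_k, x) + b
        return np.sign(sum(lam[j] * y[j] * kernel(X[j], x) for j in range(n)) + b)

    print(classify(np.array([0.1, -0.2])))        # inner-style point -> 1.0
    print(classify(np.array([1.8, 1.1])))         # outer-style point -> -1.0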