Support Vector Machines
Arthur Gretton
January 26, 2016
1 Outline
• Review of convex optimization
• Support vector classification. C-SV and ν-SV machines
Figure 2.1: Examples of convex and non-convex sets (taken from [1, Fig. 2.2]).
The first set is convex, the last two are not.
Figure 2.2: Convex function (taken from [1, Fig. 3.1])
The function is strictly convex if the inequality is strict for $x \neq y$. See Figure 2.2.
$$
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p.
\end{aligned}
\tag{2.1}
$$
We denote by $p^*$ the optimal value of (2.1), and define $D := \bigcap_{i=0}^{m} \mathrm{dom} f_i \cap \bigcap_{i=1}^{p} \mathrm{dom} h_i$, where we require the domain¹ $D$ to be nonempty.
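The Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ associated with problem (2.1) is written
$$
L(x, \lambda, \nu) := f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x),
$$
and has domain $\mathrm{dom} L := D \times \mathbb{R}^m \times \mathbb{R}^p$. The vectors $\lambda$ and $\nu$ are called Lagrange multipliers or dual variables. The Lagrange dual function (or just “dual function”) is written
$$
g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu).
$$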
If the Lagrangian is unbounded below in $x$, then the dual function takes the value $-\infty$. The domain of $g$, dom(g), is the set of values $(\lambda, \nu)$ such that $g > -\infty$. The dual function is a pointwise infimum of affine² functions of $(\lambda, \nu)$, hence it is concave in $(\lambda, \nu)$ [1, p. 83].
When³ $\lambda \succeq 0$, then for all $\nu$ we have
$$
g(\lambda, \nu) \le p^*. \tag{2.2}
$$
¹ The domain is the set on which a function is well defined. E.g. the domain of $\log x$ is $\mathbb{R}_{++}$, the strictly positive real numbers [1, p. 639].
² A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is affine if it takes the form $f(x) = Ax + b$.
³ The notation $a \succeq b$ for vectors $a, b$ means that $a_i \ge b_i$ for all $i$.
Figure 2.3: Example: Lagrangian with one inequality constraint, L(x, λ) =
f0 (x) + λf1 (x), where x here can take one of four values for ease of illustration.
The infimum of the resulting set of four affine functions is concave in λ.
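The construction in Figure 2.3 can be checked numerically. The sketch below is only an illustration: the four pairs of values (f0(x), f1(x)) are made up, and the names `candidates`, `dual` and `is_midpoint_concave` are hypothetical. It forms the pointwise minimum of four affine functions of λ and tests midpoint concavity on a grid.

```python
# Each candidate x contributes one affine function of lam:
#   L(x, lam) = f0(x) + lam * f1(x).
# The (f0, f1) pairs below are invented for illustration only.
candidates = [(1.0, -2.0), (0.5, -0.5), (0.0, 1.0), (2.0, -3.0)]


def dual(lam):
    """Pointwise infimum of the four affine functions at lam."""
    return min(f0 + lam * f1 for f0, f1 in candidates)


def is_midpoint_concave(g, grid, tol=1e-12):
    """Check g((a + b) / 2) >= (g(a) + g(b)) / 2 for all pairs on the grid."""
    return all(
        g(0.5 * (a + b)) >= 0.5 * (g(a) + g(b)) - tol
        for a in grid
        for b in grid
    )


grid = [0.1 * k for k in range(0, 51)]  # lam in [0, 5]
concave_on_grid = is_midpoint_concave(dual, grid)
```

A grid check is not a proof, but it matches the general argument: a minimum of affine functions can only bend downwards.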
See Figure 2.4 for an illustration on a toy problem with a single inequality constraint. A dual feasible pair $(\lambda, \nu)$ is a pair for which $\lambda \succeq 0$ and $(\lambda, \nu) \in \mathrm{dom}(g)$.
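Proof. (of eq. (2.2)) Assume $\tilde{x}$ is feasible for the optimization, i.e. $f_i(\tilde{x}) \le 0$, $h_i(\tilde{x}) = 0$, $\tilde{x} \in D$, and let $\lambda \succeq 0$. Then
$$
\sum_{i=1}^{m} \lambda_i f_i(\tilde{x}) + \sum_{i=1}^{p} \nu_i h_i(\tilde{x}) \le 0
$$
and so
$$
\begin{aligned}
g(\lambda, \nu) &:= \inf_{x \in D} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right) \\
&\le f_0(\tilde{x}) + \sum_{i=1}^{m} \lambda_i f_i(\tilde{x}) + \sum_{i=1}^{p} \nu_i h_i(\tilde{x}) \\
&\le f_0(\tilde{x}).
\end{aligned}
$$
Since this holds for every feasible $\tilde{x}$, it holds in particular at the optimum, which gives $g(\lambda, \nu) \le p^*$.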
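The bound (2.2) can be seen on a small example. The sketch below assumes a toy problem that is not in the notes: minimize x² subject to 1 − x ≤ 0, whose optimum is p* = 1 at x = 1. For this problem the Lagrangian is a convex quadratic in x, so the infimum is attained at x = λ/2; the function names are hypothetical.

```python
def f0(x):
    """Objective of the assumed toy problem."""
    return x * x


def f1(x):
    """Inequality constraint f1(x) <= 0, i.e. x >= 1."""
    return 1.0 - x


def lagrangian(x, lam):
    return f0(x) + lam * f1(x)


def dual(lam):
    # d/dx (x^2 + lam * (1 - x)) = 2x - lam = 0  =>  x = lam / 2
    return lagrangian(lam / 2.0, lam)


p_star = f0(1.0)  # optimum of the primal, attained at x = 1

# Evaluate the gap p* - g(lam) for a range of dual feasible lam >= 0.
lams = [0.25 * k for k in range(0, 41)]
gaps = [p_star - dual(lam) for lam in lams]
```

Every entry of `gaps` is non-negative, and the gap closes at λ = 2, where the dual function touches p*.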
where
$$
I_-(u) = \begin{cases} 0 & u \le 0 \\ \infty & u > 0 \end{cases}
$$
and I0 (u) is the indicator of 0. This would then give an infinite penalization
when a constraint is violated. Instead of these sharp indicator functions (which
are hard to optimize), we replace the constraints with a set of soft linear con-
straints, as shown in Figure 2.5. It is now clear why λ must be positive for the
inequality constraint: a negative λ would not yield a lower bound. Note also
that as well as being penalized for fi > 0, the linear lower bounds reward us for
achieving fi < 0.
$$
\begin{aligned}
\text{maximize} \quad & g(\lambda, \nu) \\
\text{subject to} \quad & \lambda \succeq 0.
\end{aligned}
\tag{2.3}
$$
Figure 2.4: Illustration of the dual function for a simple problem with one
inequality constraint (from [1, Figs. 5.1 and 5.2]). In the right hand plot, the
dashed line corresponds to the optimum p∗ of the original problem, and the
solid line corresponds to the dual as a function of λ. Note that the dual as a
function of λ is concave.
Figure 2.5: Linear lower bounds on indicator functions. Blue functions represent
linear lower bounds for different slopes λ and ν, for the inequality and equality
constraints, respectively.
We use dual feasible to describe $(\lambda, \nu)$ with $\lambda \succeq 0$ and $g(\lambda, \nu) > -\infty$. The solutions to the dual problem are written $(\lambda^*, \nu^*)$, and are called dual optimal. Note that (2.3) is a convex optimization problem, since the function being maximized is concave and the constraint set is convex. We denote by $d^*$ the optimal value of the dual problem. The property of weak duality always holds:
$$
d^* \le p^*.
$$
The difference $p^* - d^*$ is called the optimal duality gap. If the duality gap is zero, then strong duality holds:
$$
d^* = p^*.
$$
Conditions under which strong duality holds are called constraint qualifications. As an important case: strong duality holds if the primal problem is convex, i.e. of the form
$$
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& Ax = b
\end{aligned}
\tag{2.4}
$$
for convex $f_0, \ldots, f_m$, and if Slater’s condition holds: there exists some strictly feasible point $\tilde{x} \in \mathrm{relint}(D)$ such that
$$
f_i(\tilde{x}) < 0, \quad i = 1, \ldots, m, \qquad A\tilde{x} = b.
$$
A weaker version of Slater’s condition is sufficient for strong duality when some of the constraint functions $f_1, \ldots, f_k$ are affine (note the inequality constraints are no longer strict for these):
$$
f_i(\tilde{x}) \le 0, \quad i = 1, \ldots, k, \qquad f_i(\tilde{x}) < 0, \quad i = k + 1, \ldots, m, \qquad A\tilde{x} = b.
$$
A proof of this result is given in [1, Section 5.3.2].
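As a worked illustration (a toy problem chosen here, not taken from the notes), consider minimizing $x^2$ subject to $1 - x \le 0$. The problem is convex and has a strictly feasible point, so Slater's condition applies and the duality gap should vanish:

```latex
\begin{aligned}
&\text{minimize } x^2 \quad \text{subject to } 1 - x \le 0,
  \qquad p^* = 1 \text{ at } x^* = 1, \\
&\text{Slater: } \tilde{x} = 2 \text{ gives } 1 - \tilde{x} = -1 < 0, \\
&g(\lambda) = \inf_x \left( x^2 + \lambda (1 - x) \right)
  = \lambda - \tfrac{\lambda^2}{4}
  \quad (\text{attained at } x = \lambda / 2), \\
&d^* = \max_{\lambda \ge 0} g(\lambda) = g(2) = 1 = p^*.
\end{aligned}
```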
In other words, we recover the primal problem when the inequality constraint holds, and get infinity otherwise. We can therefore write
$$
p^* = \inf_{x} \sup_{\lambda \succeq 0} L(x, \lambda).
$$
We already know
$$
d^* = \sup_{\lambda \succeq 0} \inf_{x} L(x, \lambda).
$$
Weak duality therefore corresponds to the max-min inequality:
$$
\sup_{\lambda \succeq 0} \inf_{x} L(x, \lambda) \le \inf_{x} \sup_{\lambda \succeq 0} L(x, \lambda), \tag{2.5}
$$
which holds for general functions, and not just $L(x, \lambda)$. Strong duality occurs at a saddle point, and the inequality becomes an equality.
There is also a game interpretation: $L(x, \lambda)$ is a sum that must be paid by the person adjusting $x$ to the person adjusting $\lambda$. On the right hand side of (2.5), player $x$ plays first. Knowing that player 2 ($\lambda$) will maximize their return, player 1 ($x$) chooses their setting to give player 2 the worst possible options over all $\lambda$. The max-min inequality says that whoever plays second has the advantage.
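The game interpretation can be made concrete with a finite payoff table. In the sketch below the matrix is an assumed example (the classic matching-pennies payoffs), with rows standing in for the choices of x and columns for the choices of λ; it has no saddle point, so the inequality (2.5) is strict.

```python
# payoff[i][j]: amount paid by the x-player (row i) to the lambda-player (column j).
# Matching-pennies payoffs, used here purely as an illustration.
payoff = [
    [1.0, -1.0],
    [-1.0, 1.0],
]

rows = range(len(payoff))
cols = range(len(payoff[0]))

# Left side of (2.5): lambda commits first, then x minimizes the payment.
max_min = max(min(payoff[i][j] for i in rows) for j in cols)

# Right side of (2.5): x commits first, then lambda maximizes the payment.
min_max = min(max(payoff[i][j] for j in cols) for i in rows)
```

Here `max_min` is −1 and `min_max` is +1: whichever player moves second does better, as the text describes.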
Consider now the case where the functions fi , hi are differentiable, and the
duality gap is zero. Since x∗ minimizes L(x, λ∗ , ν ∗ ), the derivative at x∗ should
be zero,
$$
\nabla f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla f_i(x^*) + \sum_{i=1}^{p} \nu_i^* \nabla h_i(x^*) = 0.
$$
We now gather the various conditions for optimality we have discussed. The
KKT conditions for the primal and dual variables (x, λ, ν) are
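$$
\begin{aligned}
f_i(x) &\le 0, \quad i = 1, \ldots, m \\
h_i(x) &= 0, \quad i = 1, \ldots, p \\
\lambda_i &\ge 0, \quad i = 1, \ldots, m \\
\lambda_i f_i(x) &= 0, \quad i = 1, \ldots, m \\
\nabla f_0(x) + \sum_{i=1}^{m} \lambda_i \nabla f_i(x) + \sum_{i=1}^{p} \nu_i \nabla h_i(x) &= 0
\end{aligned}
$$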
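The KKT conditions are easy to verify at a candidate point. The sketch below assumes a toy problem that does not appear in the notes: minimize x₁² + x₂² subject to 1 − x₁ − x₂ ≤ 0, with candidate solution x = (0.5, 0.5) and λ = 1. The helper `kkt_residuals` is a hypothetical name.

```python
def f0_grad(x):
    """Gradient of the assumed objective x1^2 + x2^2."""
    return [2.0 * x[0], 2.0 * x[1]]


def f1(x):
    """Assumed inequality constraint f1(x) = 1 - x1 - x2 <= 0."""
    return 1.0 - x[0] - x[1]


def f1_grad(x):
    return [-1.0, -1.0]


def kkt_residuals(x, lam, tol=1e-12):
    """Evaluate each KKT condition for a single inequality constraint."""
    stationarity = [g0 + lam * g1 for g0, g1 in zip(f0_grad(x), f1_grad(x))]
    return {
        "primal_feasible": f1(x) <= tol,
        "dual_feasible": lam >= 0.0,
        "comp_slack": abs(lam * f1(x)),
        "stationarity": max(abs(s) for s in stationarity),
    }


report = kkt_residuals([0.5, 0.5], 1.0)   # candidate optimum
bad_report = kkt_residuals([1.0, 1.0], 0.0)  # feasible but not stationary
```

At the candidate optimum all residuals vanish; at the second point the constraint is satisfied but the gradient condition fails, so it cannot be optimal.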
where
$$
J(f) = L_y(f(x_1), \ldots, f(x_n)) + \Omega\left( \|f\|_H^2 \right),
$$
$\Omega$ is non-decreasing, and $y$ is the vector of $y_i$. Examples of loss functions might be
• Classification: $L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^{n} \mathbb{I}_{y_i f(x_i) \le 0}$ (the number of points for which the sign of $y$ disagrees with that of the prediction $f(x)$),
• Regression: $L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^{n} (y_i - f(x_i))^2$, the sum of squared errors (e.g. when $\Omega\left( \|f\|_H^2 \right) = \|f\|_H^2$, we are back to the standard ridge regression setting).
The representer theorem states that a solution to (3.1) takes the form
$$
f^* = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot).
$$
If $\Omega$ is strictly increasing, all solutions have this form.
Proof: We denote by $f_s$ the projection of $f$ onto the subspace
$$
\mathrm{span} \{ k(x_i, \cdot) : 1 \le i \le n \}, \tag{3.2}
$$
such that
$$
f = f_s + f_\perp,
$$
where $f_s = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$. Consider first the regularizer term. Since
$$
\|f\|_H^2 = \|f_s\|_H^2 + \|f_\perp\|_H^2 \ge \|f_s\|_H^2,
$$
and $\Omega$ is non-decreasing, then
$$
\Omega\left( \|f\|_H^2 \right) \ge \Omega\left( \|f_s\|_H^2 \right),
$$
so this term is minimized for $f = f_s$. Next, consider the individual terms $f(x_i)$ in the loss. These satisfy
$$
f(x_i) = \langle f, k(x_i, \cdot) \rangle_H = \langle f_s + f_\perp, k(x_i, \cdot) \rangle_H = \langle f_s, k(x_i, \cdot) \rangle_H,
$$
so
$$
L_y(f(x_1), \ldots, f(x_n)) = L_y(f_s(x_1), \ldots, f_s(x_n)).
$$
Hence the loss $L(\ldots)$ only depends on the component of $f$ in the subspace (3.2), and the regularizer $\Omega(\ldots)$ is minimized when $f = f_s$. If $\Omega$ is strictly increasing, then $\|f_\perp\|_H = 0$ is required at the minimum; otherwise this may be one of several minima.
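The two steps of the proof can be illustrated in a finite-dimensional setting. The sketch below assumes a linear kernel k(x, x') = x·x' on R³, so that functions in H are represented by vectors and f(x) is a dot product; the particular vectors are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Training points; with a linear kernel, k(x_i, .) is represented by x_i itself.
xs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

f_s = (2.0, 3.0, 0.0)     # lies in span{x_1, x_2}
f_perp = (0.0, 0.0, 5.0)  # orthogonal to both training points
f = tuple(a + b for a, b in zip(f_s, f_perp))

# Step 1: evaluations at the training points ignore the orthogonal part.
evals_f = [dot(f, x) for x in xs]
evals_fs = [dot(f_s, x) for x in xs]

# Step 2: the orthogonal part can only increase the squared norm.
norm_sq_f = dot(f, f)
norm_sq_fs = dot(f_s, f_s)
```

The loss sees identical values for `f` and `f_s`, while the norm of `f` is strictly larger, so dropping `f_perp` never hurts the objective.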
and negative classes): consider two points $x_+$ and $x_-$ of opposite label, located on the margins. The width of the margin, $d_m$, is the difference $x_+ - x_-$ projected onto the unit vector in the direction $w$, or
$$
d_m = (x_+ - x_-)^\top \frac{w}{\|w\|}. \tag{4.1}
$$
Subtracting the two equations in the constraints (4.3) from each other, we get
$$
w^\top (x_+ - x_-) = 2.
$$
Substituting this into (4.1) proves the result.
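A quick numerical check of this argument, under assumed values: the weight vector, offset and the two margin points below are hypothetical and chosen so that w·x₊ + b = 1 and w·x₋ + b = −1.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


w = (3.0, 4.0)  # hypothetical weight vector, ||w|| = 5
b = 0.0         # hypothetical offset

x_plus = (1.0 / 3.0, 0.0)  # on the positive margin: w.x + b = +1
x_minus = (0.0, -0.25)     # on the negative margin: w.x + b = -1

norm_w = math.sqrt(dot(w, w))

# Equation (4.1): project the difference onto the unit vector w / ||w||.
diff = tuple(p - m for p, m in zip(x_plus, x_minus))
margin_width = dot(diff, w) / norm_w
```

The projected width agrees with 2/‖w‖ even though the two points are not aligned with w, since only the component along w matters.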
Figure 4.1: The linearly separable case. There are many linear separating hy-
perplanes, but only one max. margin separating hyperplane.
subject to
$$
\begin{cases}
\min_{i : y_i = +1} \left( w^\top x_i + b \right) = 1, \\
\max_{i : y_i = -1} \left( w^\top x_i + b \right) = -1.
\end{cases}
\tag{4.3}
$$
The resulting classifier is
$$
y = \mathrm{sign}(w^\top x + b),
$$
where sign takes value $+1$ for a positive argument, and $-1$ for a negative argument (its value at zero is not important, since for non-pathological cases we will not need to evaluate it there). We can rewrite to obtain
$$
\max_{w, b} \frac{1}{\|w\|} \quad \text{or} \quad \min_{w, b} \|w\|^2
$$
subject to
$$
y_i (w^\top x_i + b) \ge 1. \tag{4.4}
$$
where $C$ controls the tradeoff between maximum margin and loss, and $I(A) = 1$ if $A$ holds true, and 0 otherwise (the factor of 1/2 is to simplify the algebra later, and is not important: we can adjust $C$ accordingly). This is a combinatorial optimization problem, which would be very expensive to solve. Instead, we replace the indicator function with a convex upper bound,
$$
\min_{w, b} \left( \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \theta\left( y_i \left( w^\top x_i + b \right) \right) \right),
$$
although obviously other choices are possible (e.g. a quadratic upper bound).
See Figure 4.2.
Substituting in the hinge loss, we get
$$
\min_{w, b} \left( \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \theta\left( y_i \left( w^\top x_i + b \right) \right) \right).
$$
Figure 4.2: The hinge loss is an upper bound on the step loss.
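The claim in Figure 4.2 can be checked directly. The sketch below assumes the standard hinge loss, max(0, 1 − a) evaluated at the margin a = y(wᵀx + b); this explicit formula is the usual definition and is an assumption here, as is the grid of margin values.

```python
def step_loss(margin):
    """Indicator loss: 1 when y * f(x) <= 0, else 0."""
    return 1.0 if margin <= 0.0 else 0.0


def hinge_loss(margin):
    """Standard hinge loss (assumed form): max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)


# Margin values y * (w.x + b) on a grid from -3 to 3.
margins = [0.1 * k for k in range(-30, 31)]
upper_bound_holds = all(hinge_loss(m) >= step_loss(m) for m in margins)
```

The hinge loss is also positive for correctly classified points with margin between 0 and 1, which is what makes it a bound rather than a copy of the step loss.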
Figure 4.3: The nonseparable case. Note the red point which is a distance ξ/kwk
from the margin.
subject to⁷
$$
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0
$$
(compare with (4.4)). See Figure 4.3.
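Now let’s write the Lagrangian for this problem, and solve it.
$$
L(w, b, \xi, \alpha, \lambda) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \left( 1 - y_i \left( w^\top x_i + b \right) - \xi_i \right) + \sum_{i=1}^{n} \lambda_i (-\xi_i) \tag{4.6}
$$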
⁷ To see this, we can write it as $\xi_i \ge 1 - y_i \left( w^\top x_i + b \right)$. Thus either $\xi_i = 0$, and $y_i \left( w^\top x_i + b \right) \ge 1$ as before, or $\xi_i > 0$, in which case to minimize (4.5), we’d use the smallest possible $\xi_i$ satisfying the inequality, and we’d have $\xi_i = 1 - y_i \left( w^\top x_i + b \right)$.
with dual variable constraints
αi ≥ 0, λi ≥ 0.
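Derivative wrt $b$:
$$
\frac{\partial L}{\partial b} = \sum_{i} y_i \alpha_i = 0. \tag{4.8}
$$
Derivative wrt $\xi_i$:
$$
\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \lambda_i = 0 \quad \Longrightarrow \quad \alpha_i = C - \lambda_i. \tag{4.9}
$$
We can replace the final constraint by noting $\lambda_i \ge 0$, hence
$$
\alpha_i \le C.
$$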
Before writing the dual, we look at what these conditions imply about the scalars
αi that define the solution (4.7).
Non-margin SVs: αi = C:
Remember complementary slackness:
1. We immediately have $1 - \xi_i = y_i \left( w^\top x_i + b \right)$.
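We now write the dual function, by substituting equations (4.7), (4.8), and (4.9) into (4.6), to get
$$
\begin{aligned}
g(\alpha, \lambda) &= \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \left( 1 - y_i \left( w^\top x_i + b \right) - \xi_i \right) + \sum_{i=1}^{n} \lambda_i (-\xi_i) \\
&= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j - b \underbrace{\sum_{i=1}^{n} \alpha_i y_i}_{0} \\
&\quad + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i \xi_i - \sum_{i=1}^{n} (C - \alpha_i) \xi_i \\
&= \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j.
\end{aligned}
$$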
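The final expression can be evaluated on a problem small enough to solve by hand. The sketch below assumes a two-point data set, and takes (4.7) to be w = Σᵢ αᵢ yᵢ xᵢ (the same form that appears in the ν-SVM derivation); the data, the value of α and the function names are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Two separable points on opposite sides of the origin (assumed toy data).
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]


def dual_objective(alpha):
    """g(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i.x_j"""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def primal_w(alpha):
    """Assumed form of (4.7): w = sum_i alpha_i y_i x_i."""
    dim = len(X[0])
    return tuple(
        sum(alpha[i] * y[i] * X[i][d] for i in range(len(alpha)))
        for d in range(dim)
    )


alpha = [0.5, 0.5]                 # satisfies sum_i y_i alpha_i = 0
w = primal_w(alpha)                # (1, 0): both points sit on the margin
primal_value = 0.5 * dot(w, w)     # slacks are zero for this data
dual_value = dual_objective(alpha)
```

For this α the primal and dual values coincide at 0.5, consistent with strong duality for the SVM problem.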
So far we have defined the solution for $w$, but not for the offset $b$. This is simple to compute: for the margin SVs, we have $1 = y_i \left( w^\top x_i + b \right)$. Thus, we can obtain $b$ from any of these, or take an average for greater numerical stability.
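The averaging step might look as follows. This is a sketch under two assumptions: margin SVs are taken to be the points with 0 < αᵢ < C, and labels are ±1 so that 1 = yᵢ(wᵀxᵢ + b) rearranges to b = yᵢ − wᵀxᵢ. The function name and the data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def offset_from_margin_svs(w, X, y, alpha, C, tol=1e-9):
    """Average b = y_i - w.x_i over points with 0 < alpha_i < C."""
    estimates = [
        y[i] - dot(w, X[i])
        for i in range(len(y))
        if tol < alpha[i] < C - tol
    ]
    if not estimates:
        raise ValueError("no margin support vectors found")
    return sum(estimates) / len(estimates)


# Assumed toy solution: w = sum_i alpha_i y_i x_i = (1, 0).
X = [(2.0, 0.0), (0.0, 0.0)]
y = [1.0, -1.0]
alpha = [0.5, 0.5]
w = (1.0, 0.0)

b = offset_from_margin_svs(w, X, y, alpha, C=1.0)
```

Both points give the same estimate here, b = −1; with noisy numerical solutions the individual estimates differ slightly, which is why averaging helps.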
for the RKHS $\mathcal{H}$ with kernel $k(x, \cdot)$. When we kernelize, we use the result of the representer theorem,
$$
w = \sum_{i=1}^{n} \beta_i k(x_i, \cdot). \tag{4.10}
$$
Substituting (4.10) and introducing the $\xi_i$ variables, we get
$$
\min_{\beta, b} \left( \frac{1}{2} \beta^\top K \beta + C \sum_{i=1}^{n} \xi_i \right) \tag{4.11}
$$
Thus, the primal variables w are replaced with β. The problem remains convex
since K is positive definite. With some calculation (exercise!), the dual becomes
$$
g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j),
$$
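The kernelized dual only needs the Gram matrix of kernel evaluations. The sketch below assumes a Gaussian (RBF) kernel with a made-up bandwidth parameter `gamma`; the kernel choice, the data and the value of α are illustrative and not prescribed by the notes.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||u - v||^2); gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def gram_matrix(X, kernel):
    """K[i][j] = k(x_i, x_j)."""
    return [[kernel(a, b) for b in X] for a in X]


def kernel_dual_objective(alpha, y, K):
    """g(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij"""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
y = [1.0, -1.0, 1.0]
K = gram_matrix(X, rbf_kernel)

alpha = [0.3, 0.5, 0.2]  # chosen so that sum_i y_i alpha_i = 0
g_value = kernel_dual_objective(alpha, y, K)
```

Because the quadratic form built from a valid kernel is never negative, `g_value` cannot exceed the sum of the αᵢ, which reflects the concavity of the dual.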
subject to
$$
\begin{aligned}
\rho &\ge 0 \\
\xi_i &\ge 0 \\
y_i w^\top x_i &\ge \rho - \xi_i,
\end{aligned}
$$
where we see that we now optimize the margin width ρ. Thus, rather than
choosing C, we now choose ν; the meaning of the latter will become clear shortly.
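The Lagrangian is
$$
\frac{1}{2} \|w\|_H^2 + \frac{1}{n} \sum_{i=1}^{n} \xi_i - \nu \rho + \sum_{i=1}^{n} \alpha_i \left( \rho - y_i w^\top x_i - \xi_i \right) + \sum_{i=1}^{n} \beta_i (-\xi_i) + \gamma (-\rho)
$$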
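for $\alpha_i \ge 0$, $\beta_i \ge 0$, and $\gamma \ge 0$. Differentiating wrt each of the primal variables $w$, $\xi$, $\rho$, and setting to zero, we get
$$
w = \sum_{i=1}^{n} \alpha_i y_i x_i
$$
$$
\alpha_i + \beta_i = \frac{1}{n} \tag{4.12}
$$
$$
\nu = \sum_{i=1}^{n} \alpha_i - \gamma \tag{4.13}
$$
$$
0 \le \alpha_i \le n^{-1}.
$$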
so
$$
\frac{|N(\alpha)|}{n} \le \nu,
$$
and $\nu$ is an upper bound on the fraction of non-margin SVs.
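2. Case of $\xi_i = 0$. Then $\alpha_i < n^{-1}$. Denote by $M(\alpha)$ the set of points for which $n^{-1} > \alpha_i > 0$. Then from (4.14),
$$
\nu = \sum_{i=1}^{n} \alpha_i = \sum_{i \in N(\alpha)} \frac{1}{n} + \sum_{i \in M(\alpha)} \alpha_i \le \sum_{i \in M(\alpha) \cup N(\alpha)} \frac{1}{n},
$$
thus
$$
\nu \le \frac{|N(\alpha)| + |M(\alpha)|}{n},
$$
and $\nu$ is a lower bound on the fraction of support vectors with non-zero weight (both on the margin, and “margin errors”).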
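Both bounds on ν can be checked for a given dual vector. The sketch below assumes γ = 0, so that ν = Σᵢ αᵢ, and takes N(α) to be the points with αᵢ = 1/n and M(α) those with 0 < αᵢ < 1/n; the vector `alpha` and the function name are made up for illustration.

```python
def nu_property(alpha, tol=1e-12):
    """Return (|N|/n, nu, (|N| + |M|)/n) for a dual vector alpha.

    Assumes gamma = 0, so that nu equals the sum of the alpha_i.
    """
    n = len(alpha)
    cap = 1.0 / n
    non_margin = sum(1 for a in alpha if abs(a - cap) <= tol)  # N(alpha)
    margin = sum(1 for a in alpha if tol < a < cap - tol)      # M(alpha)
    nu = sum(alpha)
    return non_margin / n, nu, (non_margin + margin) / n


# n = 10: three points at the cap 1/n, two strictly inside, five at zero.
alpha = [0.1] * 3 + [0.05] * 2 + [0.0] * 5
lower, nu, upper = nu_property(alpha)
```

For this vector the fraction of non-margin SVs is 0.3, ν is 0.4, and the fraction of all support vectors is 0.5, so ν sits between the two as the derivation predicts.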
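Substituting into the Lagrangian, we get
$$
\begin{aligned}
&\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \frac{1}{n} \sum_{i=1}^{n} \xi_i - \rho \nu - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^{n} \alpha_i \rho - \sum_{i=1}^{n} \alpha_i \xi_i \\
&\quad - \sum_{i=1}^{n} \left( \frac{1}{n} - \alpha_i \right) \xi_i - \rho \left( \sum_{i=1}^{n} \alpha_i - \nu \right) \\
&= -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j
\end{aligned}
$$
subject to
$$
\sum_{i=1}^{n} \alpha_i \ge \nu, \qquad 0 \le \alpha_i \le \frac{1}{n}.
$$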
References
[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge
University Press, Cambridge, England, 2004.