Arthur Gretton
January 26, 2016
• Review of convex optimization
• Support vector classification. C-SV and ν-SV machines
Figure 2.1: Examples of convex and non-convex sets (taken from [1, Fig. 2.2]).
The first set is convex, the last two are not.
Figure 2.2: Convex function (taken from [1, Fig. 3.1])
The function is strictly convex if the inequality is strict for x ≠ y. See Figure
\[
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p.
\end{aligned}
\tag{2.1}
\]
We denote by $p^*$ the optimal value of (2.1), and define $D := \bigcap_{i=0}^{m} \mathrm{dom} f_i \cap \bigcap_{i=1}^{p} \mathrm{dom} h_i$,
where we require the domain$^1$ $D$ to be nonempty.
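The Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ associated with problem (2.1) is
\[
L(x, \lambda, \nu) := f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x),
\]
and has domain $\mathrm{dom}\, L := D \times \mathbb{R}^m \times \mathbb{R}^p$. The vectors $\lambda$ and $\nu$ are called Lagrange
multipliers or dual variables. The Lagrange dual function (or just “dual
function”) is written
\[
g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu).
\]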
If the Lagrangian is unbounded below in x, then the dual function takes the value −∞. The domain of g, dom(g),
is the set of values (λ, ν) such that g(λ, ν) > −∞. The dual function is a pointwise
infimum of affine$^2$ functions of (λ, ν), hence it is concave in (λ, ν) [1, p. 83].
When$^3$ $\lambda \succeq 0$, then for all $\nu$ we have
\[
g(\lambda, \nu) \le p^*. \tag{2.2}
\]
$^1$ The domain is the set on which a function is well defined. E.g. the domain of $\log x$ is $\mathbb{R}_{++}$,
the strictly positive real numbers [1, p. 639].
$^2$ A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is affine if it takes the form $f(x) = Ax + b$.
$^3$ The notation $a \succeq b$ for vectors $a, b$ means that $a_i \ge b_i$ for all $i$.
Figure 2.3: Example: Lagrangian with one inequality constraint, L(x, λ) =
f0 (x) + λf1 (x), where x here can take one of four values for ease of illustration.
The infimum of the resulting set of four affine functions is concave in λ.
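The lower-bound property (2.2) and the concavity of the dual can be checked numerically on a small example. The sketch below uses a toy problem that is an assumption for illustration and does not appear in the notes: minimize x² subject to 1 − x ≤ 0, for which p∗ = 1. A coarse grid of x values stands in for the domain D.

```python
# Toy problem (an assumption for illustration, not taken from the notes):
#   minimize x**2  subject to  1 - x <= 0,  with optimum x* = 1 and p* = 1.

def lagrangian(x, lam):
    """L(x, lam) = f0(x) + lam * f1(x) with f0(x) = x**2 and f1(x) = 1 - x."""
    return x ** 2 + lam * (1.0 - x)

# A coarse grid of x values stands in for the domain D.
xs = [i / 100.0 for i in range(-300, 301)]

def dual(lam):
    """g(lam): pointwise infimum over x of functions that are affine in lam."""
    return min(lagrangian(x, lam) for x in xs)

p_star = 1.0
lams = [i / 10.0 for i in range(0, 51)]  # lam in [0, 5]
g_vals = [dual(lam) for lam in lams]

# Weak duality (2.2): every g(lam) with lam >= 0 is a lower bound on p*.
assert all(g <= p_star + 1e-12 for g in g_vals)

# Concavity on the equally spaced grid: each value is at least the
# average of its two neighbours.
assert all(
    g_vals[i] >= 0.5 * (g_vals[i - 1] + g_vals[i + 1]) - 1e-12
    for i in range(1, len(g_vals) - 1)
)

# The tightest lower bound over the grid of lam values.
best_lam = max(lams, key=dual)
print(best_lam, dual(best_lam))
```

For this particular problem the best lower bound coincides with p∗, so the duality gap is zero.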
See Figure 2.4 for an illustration on a toy problem with a single inequality
constraint. A dual feasible pair (λ, ν) is a pair for which $\lambda \succeq 0$ and $(\lambda, \nu) \in \mathrm{dom}(g)$.
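Proof. (of eq. (2.2)) Assume $\tilde{x}$ is feasible for the optimization, i.e. $f_i(\tilde{x}) \le 0$,
$h_i(\tilde{x}) = 0$, $\tilde{x} \in D$, and that $\lambda \succeq 0$. Then
\[
\sum_{i=1}^{m} \lambda_i f_i(\tilde{x}) + \sum_{i=1}^{p} \nu_i h_i(\tilde{x}) \le 0
\]
and so
\[
\begin{aligned}
g(\lambda, \nu) &:= \inf_{x \in D} \Big( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \Big) \\
&\le f_0(\tilde{x}) + \sum_{i=1}^{m} \lambda_i f_i(\tilde{x}) + \sum_{i=1}^{p} \nu_i h_i(\tilde{x}) \\
&\le f_0(\tilde{x}).
\end{aligned}
\]
This holds for every feasible $\tilde{x}$, and therefore also for the optimal value $p^*$.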
where
\[
I_-(u) = \begin{cases} 0 & u \le 0 \\ \infty & u > 0 \end{cases}
\]
and $I_0(u)$ is the indicator of 0 (zero at $u = 0$, infinite elsewhere). This would then give an infinite penalization
when a constraint is violated. Instead of these sharp indicator functions (which
are hard to optimize), we replace the constraints with a set of soft linear constraints,
as shown in Figure 2.5. It is now clear why λ must be non-negative for the
inequality constraint: a negative λ would not yield a lower bound. Note also
that as well as being penalized for $f_i > 0$, the linear lower bounds reward us for
achieving $f_i < 0$.
\[
\begin{aligned}
\text{maximize} \quad & g(\lambda, \nu) \\
\text{subject to} \quad & \lambda \succeq 0.
\end{aligned}
\tag{2.3}
\]
Figure 2.4: Illustration of the dual function for a simple problem with one
inequality constraint (from [1, Figs. 5.1 and 5.2]). In the right hand plot, the
dashed line corresponds to the optimum p∗ of the original problem, and the
solid line corresponds to the dual as a function of λ. Note that the dual as a
function of λ is concave.
Figure 2.5: Linear lower bounds on indicator functions. Blue functions represent
linear lower bounds for different slopes λ and ν, for the inequality and equality
constraints, respectively.
We use dual feasible to describe (λ, ν) with $\lambda \succeq 0$ and g(λ, ν) > −∞. The
solutions to the dual problem are written $(\lambda^*, \nu^*)$, and are called dual optimal.
Note that (2.3) is a convex optimization problem, since the function being max-
imized is concave and the constraint set is convex. We denote by d∗ the optimal
value of the dual problem. The property of weak duality always holds:
\[ d^* \le p^*. \]
The difference p∗ − d∗ is called the optimal duality gap. If the duality gap is
zero, then strong duality holds:
\[ d^* = p^*. \]
Conditions under which strong duality holds are called constraint qualifications.
As an important case: strong duality holds if the primal problem is
convex,$^4$ i.e. of the form
\[
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& Ax = b
\end{aligned}
\tag{2.4}
\]
for convex $f_0, \ldots, f_m$, and if Slater’s condition holds: there exists some
strictly feasible point$^5$ $\tilde{x} \in \mathrm{relint}(D)$ such that
\[
f_i(\tilde{x}) < 0, \quad i = 1, \ldots, m, \qquad A\tilde{x} = b.
\]
A weaker version of Slater’s condition is sufficient for strong duality when
some of the constraint functions $f_1, \ldots, f_k$ are affine (note that the inequality
constraints on these affine functions are no longer strict):
\[
f_i(\tilde{x}) \le 0, \quad i = 1, \ldots, k, \qquad f_i(\tilde{x}) < 0, \quad i = k + 1, \ldots, m, \qquad A\tilde{x} = b.
\]
A proof of this result is given in [1, Section 5.3.2].
In other words, we recover the primal problem when the inequality constraint
holds, and get infinity otherwise. We can therefore write
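\[
p^* = \inf_{x} \sup_{\lambda \succeq 0} L(x, \lambda).
\]
We already know
\[
d^* = \sup_{\lambda \succeq 0} \inf_{x} L(x, \lambda).
\]
Weak duality therefore corresponds to the max-min inequality:
\[
\sup_{\lambda \succeq 0} \inf_{x} L(x, \lambda) \le \inf_{x} \sup_{\lambda \succeq 0} L(x, \lambda), \tag{2.5}
\]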
which holds for general functions, and not just L(x, λ). Strong duality occurs
at a saddle point, and the inequality becomes an equality.
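Since the max-min inequality holds for general functions, it can be illustrated on a finite table of values. In the sketch below, the payoff table is hypothetical: rows index choices of λ and columns index choices of x.

```python
# A hypothetical payoff table: rows are choices of lam, columns are choices of x.
payoff = [
    [3.0, 1.0, 4.0],
    [1.0, 5.0, 9.0],
    [2.0, 6.0, 5.0],
]

# sup over lam of inf over x: the lam player commits first, x responds.
max_min = max(min(row) for row in payoff)

# inf over x of sup over lam: the x player commits first, lam responds.
min_max = min(max(col) for col in zip(*payoff))

# The max-min inequality (2.5) for this table.
assert max_min <= min_max
print(max_min, min_max)
```

In this table the inequality is strict, so there is no saddle point; equality would correspond to strong duality.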
There is also a game interpretation: L(x, λ) is a sum that must be paid
by the person adjusting x to the person adjusting λ. On the right hand side of
(2.5), player 1 (x) plays first. Knowing that player 2 (λ) will maximize their return,
player 1 chooses their setting to give player 2 the worst possible options
over all λ. The max-min inequality says that whoever plays second has the advantage.
Consider now the case where the functions fi , hi are differentiable, and the
duality gap is zero. Since x∗ minimizes L(x, λ∗ , ν ∗ ), the derivative at x∗ should
be zero,
\[
\nabla f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla f_i(x^*) + \sum_{i=1}^{p} \nu_i^* \nabla h_i(x^*) = 0.
\]
We now gather the various conditions for optimality we have discussed. The
KKT conditions for the primal and dual variables (x, λ, ν) are
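\[
\begin{aligned}
f_i(x) &\le 0, \quad i = 1, \ldots, m \\
h_i(x) &= 0, \quad i = 1, \ldots, p \\
\lambda_i &\ge 0, \quad i = 1, \ldots, m \\
\lambda_i f_i(x) &= 0, \quad i = 1, \ldots, m \\
\nabla f_0(x) + \sum_{i=1}^{m} \lambda_i \nabla f_i(x) + \sum_{i=1}^{p} \nu_i \nabla h_i(x) &= 0
\end{aligned}
\]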
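The KKT conditions can be checked mechanically for a candidate primal-dual pair. The following is a minimal sketch for a one-dimensional toy problem (an assumption for illustration, with no equality constraints): minimize x² subject to 1 − x ≤ 0. The function name `kkt_residuals` is hypothetical.

```python
def kkt_residuals(x, lam):
    """KKT quantities for: minimize x**2 subject to f1(x) = 1 - x <= 0."""
    f1 = 1.0 - x
    grad_f0 = 2.0 * x   # derivative of x**2
    grad_f1 = -1.0      # derivative of 1 - x
    return {
        "primal_feasible": f1 <= 0.0,
        "dual_feasible": lam >= 0.0,
        "complementary_slackness": lam * f1,       # should equal 0
        "stationarity": grad_f0 + lam * grad_f1,   # should equal 0
    }

# Candidate optimum x = 1 with multiplier lam = 2.
res = kkt_residuals(x=1.0, lam=2.0)
print(res)

# A feasible point that is not optimal: stationarity fails.
res_bad = kkt_residuals(x=2.0, lam=0.0)
print(res_bad)
```

All four conditions hold at (x, λ) = (1, 2), whereas the feasible point x = 2 with λ = 0 violates stationarity.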
\[
J(f) = L_y(f(x_1), \ldots, f(x_n)) + \Omega\big(\|f\|_{\mathcal{H}}^2\big), \tag{3.1}
\]
$\Omega$ is non-decreasing, and $y$ is the vector of $y_i$. Examples of loss functions might be:
• Classification: $L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^{n} I_{y_i f(x_i) \le 0}$ (the number of
points for which the sign of y disagrees with that of the prediction f(x)),
• Regression: $L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^{n} (y_i - f(x_i))^2$, the sum of squared
errors (e.g. when $\Omega\big(\|f\|_{\mathcal{H}}^2\big) = \|f\|_{\mathcal{H}}^2$, we are back to the standard ridge
regression setting).
The representer theorem states that a solution to (3.1) takes the form
\[
f^* = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot).
\]
If Ω is strictly increasing, all solutions have this form.
Proof: We write as $f_s$ the projection of $f$ onto the subspace
\[
\mathrm{span}\{k(x_i, \cdot) : 1 \le i \le n\}, \tag{3.2}
\]
such that
\[
f = f_s + f_\perp,
\]
where $f_s = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$. Consider first the regularizer term. Since
\[
\|f\|_{\mathcal{H}}^2 = \|f_s\|_{\mathcal{H}}^2 + \|f_\perp\|_{\mathcal{H}}^2 \ge \|f_s\|_{\mathcal{H}}^2,
\]
and $\Omega$ is non-decreasing, we have
\[
\Omega\big(\|f\|_{\mathcal{H}}^2\big) \ge \Omega\big(\|f_s\|_{\mathcal{H}}^2\big),
\]
so this term is minimized for $f = f_s$. Next, consider the individual terms $f(x_i)$
in the loss. These satisfy
\[
f(x_i) = \langle f, k(x_i, \cdot) \rangle_{\mathcal{H}} = \langle f_s + f_\perp, k(x_i, \cdot) \rangle_{\mathcal{H}} = \langle f_s, k(x_i, \cdot) \rangle_{\mathcal{H}},
\]
so
\[
L_y(f(x_1), \ldots, f(x_n)) = L_y(f_s(x_1), \ldots, f_s(x_n)).
\]
Hence the loss L(. . .) only depends on the component of f in the subspace (3.2),
and the regularizer Ω(. . .) is minimized when $f = f_s$. If Ω is strictly
increasing, then $\|f_\perp\|_{\mathcal{H}} = 0$ is required at the minimum; otherwise this may be
one of several minima.
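As a worked instance of the theorem, consider the ridge regression setting mentioned above. The sketch below assumes an explicit regularization weight $\lambda > 0$ multiplying the squared norm, and writes $K$ for the matrix with entries $K_{ij} = k(x_i, x_j)$; both are notational assumptions made here. Substituting $f = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$ gives $f(x_i) = (K\alpha)_i$ and $\|f\|_{\mathcal{H}}^2 = \alpha^\top K \alpha$, so the problem reduces to a finite-dimensional one in $\alpha$:

```latex
\[
J(\alpha) = \|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha ,
\]
\[
\nabla_\alpha J = -2K(y - K\alpha) + 2\lambda K \alpha
               = 2K\big((K + \lambda I)\alpha - y\big) = 0 ,
\]
\[
\text{which is satisfied by} \qquad \alpha = (K + \lambda I)^{-1} y .
\]
```

The optimization over a possibly infinite-dimensional $f$ has thus become the solution of an $n \times n$ linear system.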
and negative classes): consider two points x+ and x− of opposite label, located on the margins.
The width of the margin, dm , is the difference x+ − x− projected onto the unit vector in the
direction w, or
\[
d_m = (x_+ - x_-)^\top \frac{w}{\|w\|}. \tag{4.1}
\]
Subtracting the two equations in the constraints (4.3) from each other, we get
\[
w^\top (x_+ - x_-) = 2.
\]
Substituting this into (4.1) proves the result.
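The margin computation can be verified numerically. In the sketch below, the vector w and the two margin points are hypothetical values chosen so that $w^\top x_+ + b = 1$ and $w^\top x_- + b = -1$ with b = 0.

```python
import math

def margin_width(w):
    """Width 2 / ||w|| of the margin for a hyperplane scaled so that
    w^T x + b = +1 and -1 on the two margins."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical example with b = 0 and one point on each margin.
w = [3.0, 4.0]
x_plus = [0.12, 0.16]     # w^T x_plus  = 0.36 + 0.64 = 1
x_minus = [-0.12, -0.16]  # w^T x_minus = -1

norm_w = math.sqrt(sum(wi * wi for wi in w))

# (4.1): project x_plus - x_minus onto the unit vector w / ||w||.
d_m = sum((p - m) * wi for p, m, wi in zip(x_plus, x_minus, w)) / norm_w

print(d_m, margin_width(w))
```

Both quantities agree, as the argument above predicts.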
Figure 4.1: The linearly separable case. There are many linear separating hy-
perplanes, but only one max. margin separating hyperplane.
subject to
\[
\begin{cases}
\min_{i \,:\, y_i = +1} \big( w^\top x_i + b \big) = 1, \\
\max_{i \,:\, y_i = -1} \big( w^\top x_i + b \big) = -1.
\end{cases}
\]
The resulting classifier is
\[ y = \mathrm{sign}(w^\top x + b), \]
where sign takes value +1 for a positive argument, and −1 for a negative ar-
gument (its value at zero is not important, since for non-pathological cases we
will not need to evaluate it there). We can rewrite to obtain
\[
\max_{w,b} \frac{1}{\|w\|} \quad \text{or} \quad \min_{w,b} \|w\|^2
\]
subject to
\[
y_i (w^\top x_i + b) \ge 1. \tag{4.4}
\]
where C controls the tradeoff between maximum margin and loss, and I(A) = 1
if A holds true, and 0 otherwise (the factor of 1/2 is to simplify the algebra later,
and is not important: we can adjust C accordingly). This is a combinatorial
optimization problem, which would be very expensive to solve. Instead, we
replace the indicator function with a convex upper bound θ,
\[
\min_{w,b} \Big( \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \theta\big( y_i (w^\top x_i + b) \big) \Big).
\]
We use the hinge loss, $\theta(\alpha) = \max(0, 1 - \alpha)$,
although obviously other choices are possible (e.g. a quadratic upper bound).
See Figure 4.2.
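The upper-bound property can be checked directly. The sketch below takes the step loss to be 1 when the signed margin $y_i(w^\top x_i + b)$ is at most zero (an assumption about how ties at zero are counted) and the hinge loss to be max(0, 1 − margin).

```python
def step_loss(margin):
    """Indicator of a misclassified point: 1 if margin <= 0, else 0."""
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin):
    """Hinge loss max(0, 1 - margin): convex and piecewise linear."""
    return max(0.0, 1.0 - margin)

# Signed margins y_i (w^T x_i + b) on a grid from -3 to 3.
margins = [m / 10.0 for m in range(-30, 31)]

# The hinge loss dominates the step loss everywhere on the grid.
assert all(hinge_loss(m) >= step_loss(m) for m in margins)

# Correctly classified points inside the margin are still penalized.
print(hinge_loss(0.5), step_loss(0.5))
```

Unlike the step loss, the hinge loss also penalizes correctly classified points whose margin is below 1.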
Substituting in the hinge loss, we get
\[
\min_{w,b} \Big( \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \theta\big( y_i (w^\top x_i + b) \big) \Big).
\]
Figure 4.2: The hinge loss is an upper bound on the step loss.
Figure 4.3: The nonseparable case. Note the red point which is a distance $\xi / \|w\|$
from the margin.
subject to$^7$
\[
y_i (w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
\]
(compare with (4.4)). See Figure 4.3.
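For orientation, the constraints above belong to a slack-variable form of the hinge-loss problem. The block below is a sketch that assumes the objective referred to as (4.5) has the form implied by the Lagrangian written next; it is a reconstruction, not a quotation from the notes.

```latex
\[
\min_{w, b, \xi} \Big( \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \Big)
\quad \text{subject to} \quad
y_i (w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
\]
% At the optimum each slack takes its smallest feasible value,
\[
\xi_i = \max\big(0,\; 1 - y_i (w^\top x_i + b)\big),
\]
% which is the hinge loss evaluated at the signed margin of point i.
```

Under this assumption the constrained problem and the hinge-loss problem have the same minimizers in (w, b).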
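Now let’s write the Lagrangian for this problem, and solve it.
\[
L(w, b, \xi, \alpha, \lambda) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \big( 1 - y_i (w^\top x_i + b) - \xi_i \big) + \sum_{i=1}^{n} \lambda_i (-\xi_i) \tag{4.6}
\]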
$^7$ To see this, we can write it as $\xi_i \ge 1 - y_i (w^\top x_i + b)$. Thus either $\xi_i = 0$, and
$y_i (w^\top x_i + b) \ge 1$ as before, or $\xi_i > 0$, in which case to minimize (4.5), we’d use the smallest
possible $\xi_i$ satisfying the inequality, and we’d have $\xi_i = 1 - y_i (w^\top x_i + b)$.
with dual variable constraints
αi ≥ 0, λi ≥ 0.
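Derivative wrt b:
\[
\frac{\partial L}{\partial b} = -\sum_{i} y_i \alpha_i = 0, \quad \text{i.e.} \quad \sum_{i} y_i \alpha_i = 0. \tag{4.8}
\]
Derivative wrt $\xi_i$:
\[
\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \lambda_i = 0 \quad \Longrightarrow \quad \alpha_i = C - \lambda_i. \tag{4.9}
\]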
We can replace the final constraint by noting λi ≥ 0, hence
αi ≤ C.
Before writing the dual, we look at what these conditions imply about the scalars
αi that define the solution (4.7).
Non-margin SVs: αi = C:
Remember complementary slackness:
1. We immediately have $1 - \xi_i = y_i (w^\top x_i + b)$.
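We now write the dual function, by substituting equations (4.7), (4.8), and
(4.9) into (4.6), to get
\[
\begin{aligned}
g(\alpha, \lambda) &= \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \big( 1 - y_i (w^\top x_i + b) - \xi_i \big) + \sum_{i=1}^{n} \lambda_i (-\xi_i) \\
&= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j - b \underbrace{\sum_{i=1}^{n} \alpha_i y_i}_{0} \\
&\quad + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i \xi_i - \sum_{i=1}^{n} (C - \alpha_i) \xi_i \\
&= \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j .
\end{aligned}
\]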
So far we have defined the solution for w, but not for the offset b. This is simple
to compute: for the margin SVs, we have $1 = y_i (w^\top x_i + b)$. Thus, we can
obtain b from any of these, or take an average for greater numerical stability.
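The steps from the dual back to the primal solution can be traced on a tiny example. The sketch below uses a hypothetical two-point dataset in one dimension; the function names, the value of C, and the grid search (which stands in for a proper quadratic programming solver) are all assumptions made for illustration. It takes $w = \sum_i \alpha_i y_i x_i$, the form referred to as (4.7), and averages b over the margin SVs.

```python
def dual_objective(alpha, X, y):
    """g(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(alpha)
    quad = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(X[i], X[j]))
            quad += alpha[i] * alpha[j] * y[i] * y[j] * dot
    return sum(alpha) - 0.5 * quad

def recover_w(alpha, X, y):
    """w = sum_i alpha_i y_i x_i."""
    dim = len(X[0])
    return [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(dim)]

def recover_b(alpha, X, y, w, C, tol=1e-9):
    """Average b over margin SVs (0 < alpha_i < C), where y_i (w^T x_i + b) = 1."""
    bs = []
    for a, xi, yi in zip(alpha, X, y):
        if tol < a < C - tol:
            # y_i is +1 or -1, so the margin condition gives b = y_i - w^T x_i.
            bs.append(yi - sum(wk * xk for wk, xk in zip(w, xi)))
    if not bs:
        raise ValueError("no margin support vectors found")
    return sum(bs) / len(bs)

# Hypothetical dataset: one point per class.
X = [[1.0], [-1.0]]
y = [1.0, -1.0]
C = 10.0

# The constraint sum_i y_i alpha_i = 0 forces alpha_1 = alpha_2 = a,
# so a one-dimensional grid search over a in [0, C] suffices here.
grid = [k / 100.0 for k in range(0, 1001)]
a_best = max(grid, key=lambda a: dual_objective([a, a], X, y))

alpha = [a_best, a_best]
w = recover_w(alpha, X, y)
b = recover_b(alpha, X, y, w, C)
primal_value = 0.5 * sum(wk * wk for wk in w)  # slacks are zero for this data
print(alpha, w, b, dual_objective(alpha, X, y), primal_value)
```

For this separable dataset the dual and primal values coincide, which is consistent with strong duality.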
for the RKHS H with kernel k(x, ·). When we kernelize, we use the result of the
representer theorem,
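\[
w = \sum_{i=1}^{n} \beta_i k(x_i, \cdot). \tag{4.10}
\]
Substituting (4.10) and introducing the $\xi_i$ variables, we get
\[
\min_{\beta, b} \Big( \frac{1}{2} \beta^\top K \beta + C \sum_{i=1}^{n} \xi_i \Big), \tag{4.11}
\]
where $K$ is the matrix with entries $K_{ij} = k(x_i, x_j)$.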
Thus, the primal variables w are replaced with β. The problem remains convex
since K is positive definite. With some calculation (exercise!), the dual becomes
\[
g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j),
\]
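Kernelizing the dual only changes how inner products are computed. The sketch below evaluates the kernel dual objective with a Gaussian (RBF) kernel; the choice of kernel, the bandwidth `sigma`, and the function names are assumptions for illustration, since the notes leave k unspecified here.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)); one possible choice of k."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def linear_kernel(x, z):
    """Plain inner product, which recovers the non-kernelized dual."""
    return sum(a * b for a, b in zip(x, z))

def kernel_dual_objective(alpha, X, y, kernel):
    """g(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad

# Hypothetical data and dual variables.
X = [[0.0], [1.0]]
y = [1.0, -1.0]
alpha = [1.0, 1.0]

g_rbf = kernel_dual_objective(alpha, X, y, rbf_kernel)
g_lin = kernel_dual_objective(alpha, X, y, linear_kernel)
print(g_rbf, g_lin)
```

Passing `linear_kernel` gives back the dual of the linear problem, so the same routine covers both cases.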
subject to
\[
\rho \ge 0, \qquad \xi_i \ge 0, \qquad y_i w^\top x_i \ge \rho - \xi_i,
\]
where we see that we now optimize the margin width ρ. Thus, rather than
choosing C, we now choose ν; the meaning of the latter will become clear shortly.
The Lagrangian is
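\[
\frac{1}{2}\|w\|_{\mathcal{H}}^2 + \frac{1}{n} \sum_{i=1}^{n} \xi_i - \nu\rho + \sum_{i=1}^{n} \alpha_i \big( \rho - y_i w^\top x_i - \xi_i \big) + \sum_{i=1}^{n} \beta_i (-\xi_i) + \gamma(-\rho)
\]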
for αi ≥ 0, βi ≥ 0, and γ ≥ 0. Differentiating wrt each of the primal variables
w, ξ, ρ, and setting to zero, we get
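\[
w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\]
\[
\alpha_i + \beta_i = \frac{1}{n}, \tag{4.12}
\]
\[
\nu = \sum_{i=1}^{n} \alpha_i - \gamma. \tag{4.13}
\]
Since $\beta_i \ge 0$, (4.12) gives
\[
0 \le \alpha_i \le n^{-1}.
\]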
\[
\frac{|N(\alpha)|}{n} \le \nu,
\]
and ν is an upper bound on the fraction of non-margin SVs.
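2. Case of $\xi_i = 0$. Then $\alpha_i < n^{-1}$. Denote by $M(\alpha)$ the set of points with
$n^{-1} > \alpha_i > 0$. Then from (4.14),
\[
\nu = \sum_{i=1}^{n} \alpha_i = \sum_{i \in N(\alpha)} \frac{1}{n} + \sum_{i \in M(\alpha)} \alpha_i \le \sum_{i \in M(\alpha) \cup N(\alpha)} \frac{1}{n},
\]
thus
\[
\nu \le \frac{|N(\alpha)| + |M(\alpha)|}{n},
\]
and ν is a lower bound on the fraction of support vectors with non-zero
weight (both on the margin, and “margin errors”).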
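The two bounds on ν can be checked for any admissible vector of dual variables. The sketch below assumes γ = 0 in (4.13), so that ν equals the sum of the $\alpha_i$; the vector `alpha` and the function name `nu_bounds` are hypothetical.

```python
def nu_bounds(alpha, n, tol=1e-12):
    """Return (|N| / n, (|N| + |M|) / n) for dual variables 0 <= alpha_i <= 1/n."""
    cap = 1.0 / n
    # N(alpha): non-margin SVs, with alpha_i at the upper bound 1/n.
    n_non_margin = sum(1 for a in alpha if abs(a - cap) <= tol)
    # M(alpha): margin SVs, with 0 < alpha_i < 1/n.
    n_margin = sum(1 for a in alpha if tol < a < cap - tol)
    return n_non_margin / n, (n_non_margin + n_margin) / n

# Hypothetical dual variables for n = 5 points, so 0 <= alpha_i <= 0.2.
alpha = [0.2, 0.2, 0.1, 0.0, 0.0]
nu = sum(alpha)  # assumes gamma = 0 in (4.13)

frac_non_margin, frac_support = nu_bounds(alpha, n=5)

# nu is sandwiched between the two fractions.
assert frac_non_margin <= nu <= frac_support
print(frac_non_margin, nu, frac_support)
```

Here two of five points are non-margin SVs and three of five have non-zero weight, so ν lies between 0.4 and 0.6.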
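Substituting into the Lagrangian, we get
\[
\begin{aligned}
&\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \frac{1}{n} \sum_{i=1}^{n} \xi_i - \rho\nu - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^{n} \alpha_i \rho - \sum_{i=1}^{n} \alpha_i \xi_i \\
&\quad - \sum_{i=1}^{n} \Big( \frac{1}{n} - \alpha_i \Big) \xi_i - \rho \Big( \sum_{i=1}^{n} \alpha_i - \nu \Big) \\
&= -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j
\end{aligned}
\]
subject to
\[
\sum_{i=1}^{n} \alpha_i \ge \nu, \qquad 0 \le \alpha_i \le \frac{1}{n}.
\]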
[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge
University Press, Cambridge, England, 2004.