Convex Tutorial
Vision
Christoph Schnörr
University of Mannheim, Germany
Table of Contents
1 – Introduction
2 – Literature
3 – Convex Sets
4 – Convex Functions
5 – Subgradients and Optimality
6 – Conjugate Duality
7 – Convex Optimization
8 – Non-Convex Optimization
1 – Introduction
Optimization point-of-view
• variables, domain
• objective function
• constraints
• prior knowledge
Optimization problem
Computational aspects
Convex problems can be solved reliably and efficiently.
Non-linearity does not imply that a problem is difficult, but non-convexity does in general.
Modelling aspects
Many competing models exist in computer vision ...
⇒ optimization theory helps to classify the field.
It is important to recognize problems that can be formulated as
convex optimization problems.
Optimization Tree
http://www-fp.mcs.anl.gov/otc/Guide/OptWeb/index.html
2 – Literature
[1] R.T. Rockafellar. Convex Analysis. Princeton Univ. Press, Princeton, NJ, 2nd edition, 1972.
[2] R.T. Rockafellar and R.J-B. Wets. Variational Analysis, volume 317 of
Grundlehren der math. Wissenschaften. Springer, 1998.
3 – Convex Sets
Convex subset C ⊂ Rn
Invariant operations
Minkowski sum C₁ + C₂
Scaling λC , λ ∈ ℝ
Set product C₁ × C₂
Intersection ⋂_{i∈I} Cᵢ
Hyperplane
C = {x | ⟨a, x⟩ = α}
Half-space
C = {x | ⟨a, x⟩ ≤ α}
Box
C = {x | aᵢ ≤ xᵢ ≤ bᵢ}
Ball
B(x̄, r) = {x | ‖x − x̄‖ ≤ r}
Cone K ⊂ ℝⁿ:
0 ∈ K and ∀x ∈ K , ∀λ ≥ 0 : λx ∈ K
K is convex if
K + K ⊂ K
Standard cone
K = ℝⁿ₊
Semidefinite cone
K = Sⁿ₊ = {S ∈ ℝⁿˣⁿ | S = Sᵀ , S ⪰ 0}
Examples
Standard cone ℝⁿ₊
Ax ≤ b ⇔ b − Ax ∈ ℝⁿ₊
Second-order cone Lⁿ
‖Dx − d‖ ≤ ⟨p, x⟩ − q ⇔ Ax − b ∈ Lⁿ , with A = (D ; pᵀ) , b = (d ; q)
Semidefinite cone Sⁿ₊
Learning/optimization of inner-product and kernel matrices
A = (⟨xᵢ, xⱼ⟩)_{i,j=1,…,n} , B = (k(xᵢ, xⱼ))_{i,j=1,…,n}
→ Mercer kernels
Polar cones
K* = {v | ⟨v, w⟩ ≤ 0 , ∀w ∈ K}
K closed convex:
K** = K
Important special cases:
M* = M⊥ (subspaces)
{0}* = ℝⁿ
Note (normal cone N_C(x)):
Interior points x ∈ C: N_C(x) = {0}
Exterior points x ∉ C: N_C(x) = ∅
Constraint functions
Defining C by inequalities (fᵢ , i ∈ I, are convex) and equalities (fⱼ , j ∈ J , are affine):
C = {x | fᵢ(x) ≤ 0 , i ∈ I ; fⱼ(x) = 0 , j ∈ J}
Convex combination of x₀, x₁, …, x_p ∈ ℝⁿ:
Σ_{i=0}^p λᵢxᵢ , λᵢ ≥ 0 , Σ_{i=0}^p λᵢ = 1
Convex hull:
C = conv{x₀, x₁, …, x_p}
4 – Convex Functions
Convex function f : C → ℝ relative to (the convex set) C:
f((1 − t)x₀ + tx₁) ≤ (1 − t)f(x₀) + t f(x₁) , ∀t ∈ (0, 1)
For twice-differentiable f, an equivalent criterion: ∇²f(x) ⪰ 0 , ∀x
Jensen’s inequality
f convex, C convex
For any convex combination:
f(Σ_{i=0}^p λᵢxᵢ) ≤ Σ_{i=0}^p λᵢ f(xᵢ)
Invariant operations
Addition and (positive) scalar multiplication: Σ_{i=1}^m λᵢfᵢ , λᵢ ≥ 0
inf-projection
h(x) = inf_y f(x, y)
Representation by inf-projection:
ρ(t) = inf_s { ½s² + γ|t − s| }
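The inf-projection above can be checked numerically: minimizing ½s² + γ|t − s| over s yields the Huber function. A small Python/NumPy sketch (my own check, not part of the slides; the closed form used for comparison is the standard Huber function):

```python
import numpy as np

def rho_inf_projection(t, gamma, s_grid):
    # inner objective of the inf-projection: (1/2) s^2 + gamma * |t - s|
    return (0.5 * s_grid**2 + gamma * np.abs(t - s_grid)).min()

def huber(t, gamma):
    # closed form of rho: quadratic near 0, linear growth in the tails
    return np.where(np.abs(t) <= gamma,
                    0.5 * t**2,
                    gamma * np.abs(t) - 0.5 * gamma**2)

gamma = 1.0
s_grid = np.linspace(-10.0, 10.0, 200001)   # fine grid standing in for inf_s
ts = np.linspace(-5.0, 5.0, 11)
approx = np.array([rho_inf_projection(t, gamma, s_grid) for t in ts])
exact = huber(ts, gamma)
err = np.abs(approx - exact).max()
```

The grid minimum always overestimates the true infimum slightly, so the two curves agree up to the grid resolution.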
Example (cont’d)
Note:
Additional dimensions often simplify problem representations.
Affine function
f(x) = ⟨a, x⟩ + β
Quadratic function
f(x) = ½⟨x, Ax⟩ + ⟨a, x⟩ + β , A ⪰ 0
ℓ_p-norm
‖x‖_p = (Σ_{i=1}^n |xᵢ|^p)^{1/p} , 1 ≤ p < ∞
ℓ_p-balls, p ∈ {1, 2, ∞}
Entropy
The entropy function is concave, i.e. the negative entropy is convex:
h : C ⊂ ℝⁿ → ℝ , h(p) = Σ_{i=1}^n pᵢ log pᵢ
C = ℝⁿ₊ ∩ {p | ⟨e, p⟩ = 1} (probability simplex)
Global minimum: the uniform distribution
pᵢ = 1/n , ∀i
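A quick numerical sanity check of this claim (Python/NumPy, my own toy instance, not from the slides): the negative entropy over random points of the simplex is never smaller than its value −log n at the uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

def neg_entropy(p):
    # h(p) = sum_i p_i log p_i, with the convention 0 * log 0 = 0
    q = p[p > 0]
    return float(np.sum(q * np.log(q)))

uniform = np.full(n, 1.0 / n)
h_min = neg_entropy(uniform)                    # equals -log(n)

# random points on the probability simplex
samples = rng.dirichlet(np.ones(n), size=10000)
h_vals = np.array([neg_entropy(p) for p in samples])
```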
Indicator functions
C closed convex; extended function δ_C : ℝⁿ → R̄ = [−∞, +∞]
δ_C(x) := 0 if x ∈ C , +∞ if x ∉ C
Proper functions
Effective domain: dom f = {x | f(x) < +∞} (e.g. dom δ_C = C)
f is called proper if dom f ≠ ∅ and f > −∞ everywhere
Attainment of minima
If f : Rn → R is proper, lsc and level-bounded, then inf f is finite and
argmin f is nonempty and compact.
Support functions
C ≠ ∅ closed convex
σ_C(v) := sup_{x∈C} ⟨x, v⟩
sublinear (subadditive and positively homogeneous)
Example
C = {x | ⟨x, Ax⟩ ≤ 1} , A pos. definite
σ_C(v) = √⟨v, A⁻¹v⟩
Note:
λ_min‖x‖₂² ≤ ⟨x, Ax⟩ ≤ λ_max‖x‖₂²
Then
σ_C(v) = ‖v‖_{A⁻¹,2}
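The closed form σ_C(v) = √⟨v, A⁻¹v⟩ can be verified by brute force (Python/NumPy, my own random instance): sample the boundary of the ellipsoid as x = A^{−1/2}u with ‖u‖ = 1 and take the largest inner product with v.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
# a random symmetric positive definite A (hypothetical test instance)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)

# sample the boundary {x : <x, Ax> = 1} as x = A^{-1/2} u with ||u|| = 1
w, V = np.linalg.eigh(A)
A_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
U = rng.standard_normal((200000, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)
X = U @ A_inv_sqrt                   # every row x satisfies <x, Ax> = 1

v = rng.standard_normal(n)
sigma_sampled = (X @ v).max()                     # sup over boundary samples
sigma_exact = np.sqrt(v @ np.linalg.solve(A, v))  # closed form
```

The sampled supremum can only underestimate the true value, and with this many directions the gap is well below one percent.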
Examples
Euclidean norm
The Euclidean norm is the support function of the unit ball: ‖x‖₂ = σ_{B(0,1)}(x)
Examples:
‖·‖₁° = ‖·‖_∞ , ‖·‖_∞° = ‖·‖₁
Conversely, by definition
B_X = {x | σ_{B°_X}(x) ≤ 1} = {x | ⟨x, v⟩ ≤ σ_{B_X}(v) , ∀v ∈ ℝⁿ}
Seminorms
For seminorms | · |, analogous definitions are useful:
C = {x | σ_{C°}(x) ≤ 1} , σ_{C°}(x) = |x|
C° = {v | σ_C(v) ≤ 1} , σ_C(v) = |v|°
Total variation denoising functional:
∫_Ω ½(u − g)² dx + λ TV(u)
⇒ TV(u) = σ_{C°}(u) , C° = {v | ‖v‖_G ≤ 1}
⇒ ‖v‖_G = σ_C(v) , C = {u | TV(u) ≤ 1}
5 – Subgradients and Optimality
A subgradient v of a proper convex function f (possibly non-smooth) at x is the gradient of an affine function supporting f at x:
f(y) ≥ f(x) + ⟨v, y − x⟩ , ∀y
∂f(x) denotes the set of all subgradients of f at x.
Examples
Indicator functions
∂δC (x) = NC (x)
Support functions
∂σ_C(v) = argmax_{x∈C} ⟨x, v⟩ = {x ∈ C | v ∈ N_C(x)}
Norm
∂‖x‖₂ = x/‖x‖₂ if x ≠ 0 , B(0, 1) if x = 0
Global optimality:
0 ∈ ∂f(x) ; for smooth f minimized over C: 0 ∈ ∇f(x) + N_C(x)
Distance function
d_C(x) = inf_y { ‖y − x‖ + δ_C(y) }
Projection
Π_C(x) = argmin_{y∈C} ‖y − x‖
Note:
v ∈ N_C(x) ⇔ x = Π_C(x + v)
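The projection characterization of the normal cone is easy to see in code. A Python/NumPy sketch for the box (my own example, not from the slides): a normal-cone direction is absorbed by the projection, any other direction is not.

```python
import numpy as np

def project_box(x, lo, hi):
    # Euclidean projection onto the box C = {x : lo <= x_i <= hi}
    return np.clip(x, lo, hi)

lo, hi = 0.0, 1.0
x = np.array([0.0, 0.5, 1.0])    # x in C: lower face, interior, upper face

# v in N_C(x): nonpositive on the lower face, zero inside, nonnegative on the upper face
v = np.array([-2.0, 0.0, 3.0])
x_back = project_box(x + v, lo, hi)       # reproduces x

# a direction NOT in the normal cone (points into the box at the lower face)
v_bad = np.array([1.0, 0.0, 0.0])
x_bad = project_box(x + v_bad, lo, hi)    # differs from x
```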
Application (cont’d)
Total variation denoising (Aujol, Chambolle)
inf_u { ½‖u − g‖² + λ TV(u) }
Optimality: 0 ∈ u − g + ∂σ_{λC°}(u) , λC° = {v | ‖v‖_G ≤ λ}
We already know
∂σ_{λC°}(u) = {v ∈ λC° | u ∈ N_{λC°}(v)}
u ∈ N_{λC°}(v) ⇔ v = Π_{λC°}(v + u)
Hence, with v = g − u we get v + u = g, so v = Π_{λC°}(g) and the solution is u = g − Π_{λC°}(g).
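In the discrete setting this dual projection can be computed iteratively. A minimal 1-D Python/NumPy sketch (my own discretization with forward differences and an ℓ∞ dual ball, not the slides' G-norm setup): solve the dual problem min_{‖p‖∞ ≤ λ} ½‖g − Dᵀp‖² by projected gradient and set u = g − Dᵀp.

```python
import numpy as np

def tv_denoise_1d(g, lam, iters=5000, tau=0.25):
    """Solve min_u 0.5*||u - g||^2 + lam*TV(u) for a 1-D signal via its dual
    min_{||p||_inf <= lam} 0.5*||g - D^T p||^2, with u = g - D^T p.
    D is the forward-difference operator; step tau <= 1/||D D^T|| = 1/4."""
    n = len(g)
    p = np.zeros(n - 1)
    def D(u):                       # forward differences
        return u[1:] - u[:-1]
    def Dt(p):                      # adjoint of D (negative divergence)
        u = np.zeros(n)
        u[:-1] -= p
        u[1:] += p
        return u
    for _ in range(iters):
        # projected gradient step: gradient of the dual is -D(g - D^T p)
        p = np.clip(p + tau * D(g - Dt(p)), -lam, lam)
    return g - Dt(p)

rng = np.random.default_rng(0)
g = np.concatenate([np.zeros(15), np.ones(15)]) + 0.1 * rng.standard_normal(30)
u = tv_denoise_1d(g, 0.3)

def objective(u, g, lam):
    return 0.5 * np.sum((u - g) ** 2) + lam * np.sum(np.abs(np.diff(u)))
```

For a two-sample signal the result can be checked by hand: with g = (0, 2) and λ = ½ the minimizer is u = (½, 1½).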
Example
6 – Conjugate Duality
Legendre-Fenchel Transform
One-to-one transform of the class of proper, lsc, convex functions:
f*(v) = sup_x { ⟨v, x⟩ − f(x) }
f(x) = sup_v { ⟨v, x⟩ − f*(v) }
Interpretation: ⟨v, ·⟩ − f*(v) is the largest affine minorant of f with slope v.
Subgradient rule
f proper, lsc, convex:
∂f(x) = argmax_v { ⟨v, x⟩ − f*(v) }
∂f*(v) = argmax_x { ⟨v, x⟩ − f(x) }
Example:
δ_C* = σ_C ⇒ v ∈ N_C(x) ⇔ x ∈ ∂σ_C(v)
Note:
inf_x f(x) = −sup_x {−f(x)} , thus
inf_x f(x) = −f*(0) , argmin_x f(x) = ∂f*(0)
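The identities inf f = −f*(0) and argmin f = ∂f*(0) can be verified for a strictly convex quadratic, whose conjugate is known in closed form (Python/NumPy, my own random instance):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # positive definite, so f is strictly convex
a = rng.standard_normal(n)

def f(x):
    return 0.5 * x @ A @ x + a @ x

def f_star(v):
    # conjugate of the quadratic: f*(v) = 0.5 <v - a, A^{-1}(v - a)>
    return 0.5 * (v - a) @ np.linalg.solve(A, v - a)

x_min = -np.linalg.solve(A, a)   # argmin f = grad f*(0) = -A^{-1} a
```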
θ_{Y,B} : ℝⁿ → R̄ , θ_{Y,B}(z) = sup_{y∈Y} { ⟨y, z⟩ − ½⟨y, By⟩ }
dom θ_{Y,B} = (Y^∞ ∩ N(B))* (Y^∞: horizon cone of all directions)
Special cases:
B = 0: θ_{Y,0} = σ_Y
B = 0 , Y = K a cone: θ_{K,0} = σ_K = δ_{K*}
Y = ℝ^m: θ_{Y,B} = (½⟨·, B·⟩)*
Dual problem:
max_{y∈ℝ^m} ψ(y) , ψ(y) = ⟨b, y⟩ − h*(y) − k*(Aᵀy − c)
7 – Convex Optimization
min_{x∈ℝⁿ} ⟨c, x⟩ , Ax ≥ b
corresponds to X = ℝⁿ , Y = ℝ^m₊ , and
b − Ax ≤ 0 ⇔ b − Ax ∈ Y* , δ_{Y*}(b − Ax) = σ_Y(b − Ax)
Primal-dual pair:
min_{x∈X} { ⟨c, x⟩ + θ_{Y,0}(b − Ax) } , max_{y∈Y} { ⟨b, y⟩ − θ_{X,0}(Aᵀy − c) }
Dual linear program:
max_{y∈ℝ^m} ⟨b, y⟩ , Aᵀy = c , y ≥ 0
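Strong duality for this primal-dual pair is easy to observe numerically. A Python/SciPy sketch (my own tiny instance; `linprog` works with ≤-constraints, so the ≥-system is negated):

```python
import numpy as np
from scipy.optimize import linprog

# small instance of  min <c,x>  s.t.  Ax >= b  (hypothetical data)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.0])
c = np.array([1.0, 2.0])

# primal: linprog minimizes c^T x subject to A_ub x <= b_ub, so negate Ax >= b
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)

# dual:  max <b,y>  s.t.  A^T y = c , y >= 0  (passed as minimization of -<b,y>)
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 3)
```

Both solves return the same optimal value (here 3), as strong LP duality predicts.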
Example: ℓ₁-approximation of Φα ≈ g as a linear program:
min_{r,α} ⟨e, r⟩ , subject to r + Φα ≥ g , r − Φα ≥ −g , r ≥ 0
(i.e. r ≥ |Φα − g| componentwise)
Example (cont’d)
Vector-matrix notation:
X = (x₁, …, x_m) ∈ ℝⁿˣᵐ , D = diag(y₁, …, y_m) ∈ ℝᵐˣᵐ
Thus
min_{w∈ℝⁿ, b∈ℝ} ½‖w‖² , D(Xᵀw + be) − e ≥ 0
Example (cont’d)
Corresponds to X = ℝⁿ⁺¹ , Y = ℝᵐ₊ , δ_{Y*} = σ_Y = θ_{Y,0}:
min_{w,b} { ½⟨(w; b), C(w; b)⟩ + θ_{Y,0}(e − D(Xᵀw + be)) } , C = (I 0 ; 0 0)
max_{y∈Y} { ⟨e, y⟩ − θ_{X,C}((XD ; eᵀD) y) }
θ_{X,C}((XD ; eᵀD) y) = sup_{x∈ℝⁿ, x_{n+1}∈ℝ} { ⟨(x; x_{n+1}), (XDy; eᵀDy)⟩ − ½‖x‖² }
⇒ ⟨De, y⟩ = 0 , θ_{X,C}(…) = ½‖XDy‖²
Example (cont’d)
Dual program
max_{y∈ℝᵐ} ⟨e, y⟩ − ½‖XDy‖² , ⟨De, y⟩ = 0 , y ≥ 0
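This dual (the hard-margin SVM dual) can be solved directly for a tiny dataset. A Python/SciPy sketch (my own symmetric toy data, not from the slides), recovering w = XDy from the dual solution:

```python
import numpy as np
from scipy.optimize import minimize

# toy separable data (hypothetical): columns of X are points, d their labels
X = np.array([[1.0, 0.0, -1.0, 0.0],
              [0.0, 1.0, 0.0, -1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
D = np.diag(d)
m = len(d)

def neg_dual(y):
    # negative of the dual objective <e,y> - 0.5 ||X D y||^2
    w = X @ (D @ y)
    return -(y.sum() - 0.5 * w @ w)

res = minimize(neg_dual, np.full(m, 0.25), method="SLSQP",
               constraints=[{"type": "eq", "fun": lambda y: d @ y}],
               bounds=[(0.0, None)] * m)
y = res.x
w = X @ (D @ y)     # primal weight vector recovered from the dual solution
```

By symmetry all four points are support vectors, the separator is w = (1, 1), and the dual optimal value equals the primal value ½‖w‖² = 1.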
Conic Programming
LP ⊂ SOCP ⊂ SDP
Semidefinite program (SDP): K = Sᵐ₊ , A : ℝⁿ → Sᵐ
min_{x∈ℝⁿ} ⟨c, x⟩ , Ax − B ⪰_K 0 ; max_{Y∈Sᵐ} tr(BY) , AᵀY = c , Y ⪰_K 0
LP ⊂ SOCP:
‖0‖ ≤ ⟨A_{i,•}, x⟩ − bᵢ , ∀i
Example (cont’d)
Dual program
SOCP ⊂ SDP:
(x; t) ∈ Lⁿ ⇔ (t I_{n−1}  x ; xᵀ  t) ⪰ 0
ℝⁿ ∋ xᵢ → yᵢ ∈ ℝᵐ , m < n , i = 1, …, p
such that the pairwise distances dᵢⱼ along the edges of a neighborhood graph E are preserved.
Additionally, impose Σᵢ yᵢ = 0 for centering the yᵢ.
Example (cont’d)
Put dᵢⱼ = ‖xᵢ − xⱼ‖² and express the constraints in terms of the Gram matrix K:
max_{K∈Sᵖ} tr K , K ⪰ 0
K_{i,i} − 2K_{i,j} + K_{j,j} = dᵢⱼ , ∀ij ∈ E
Σ_{i,j=1}^p K_{i,j} = 0
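A quick NumPy check (my own, not from the slides) that these constraints encode exactly what the embedding needs: for a Gram matrix K = YᵀY of centered points, K_{i,i} − 2K_{i,j} + K_{j,j} gives the squared distances, and Σ_{i,j} K_{i,j} = ‖Σᵢ yᵢ‖² = 0 is the centering.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 6, 2
Y = rng.standard_normal((m, p))
Y -= Y.mean(axis=1, keepdims=True)   # centering: sum_i y_i = 0

K = Y.T @ Y                          # Gram matrix K_ij = <y_i, y_j>

# squared distances recovered from the Gram matrix
diag = np.diag(K)
d_from_K = diag[:, None] - 2.0 * K + diag[None, :]
d_direct = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
```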
8 – Non-Convex Optimization
Solvable binary problems
Integral polyhedra
A ∈ Zᵐˣⁿ , b ∈ Zᵐ , ideal situation:
{x ∈ ℝⁿ₊ | Ax ≤ b} = conv{x ∈ Zⁿ₊ | Ax ≤ b}
...
max_{x∈ℝⁿ₊} ⟨c, x⟩ , Ax ≤ 1
Approach (overview)
as set functions J : 2^V → ℝ₊ , with V = {i ∈ V | xᵢ = 0} ∪ {i ∈ V | xᵢ = 1} = S ∪ (V \ S)
• Design J to be submodular :
Networks, cuts.
A network (D, c, s, t) is a digraph D = (V, E) with edge capacities
c : E → R+ and two specified vertices s ∈ V (source) and t ∈ V \ {s}
(sink, target).
Example. Vertices s, t are marked with black, and the capacities
c(e) , e ∈ E, are depicted as edge labels:
(figure: example network with the capacities c(e) shown as edge labels)
f (S ∪ T ) + f (S ∩ T ) ≤ f (S) + f (T ) , ∀S, T ⊆ V
Construction of network D from functional J: Kolmogorov-Zabih’04
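To make the cut computation concrete, here is a generic max-flow routine (Edmonds-Karp on a dense capacity matrix; standard algorithm, not the Kolmogorov-Zabih construction itself). By max-flow/min-cut duality its value equals the capacity of a minimum s-t cut, which is what minimizing a submodular J reduces to. Pure-stdlib Python sketch with a hypothetical tiny network:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp max-flow on a dense capacity matrix capacity[u][v].
    The returned value equals the minimum s-t cut capacity."""
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:          # no augmenting path left: flow is maximal
            return flow
        # bottleneck capacity along the path, then augment
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

# hypothetical 4-node network: s=0, two inner nodes, t=3
cap = [[0, 3, 2, 0],
       [0, 0, 1, 2],
       [0, 0, 0, 3],
       [0, 0, 0, 0]]
```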
Saliency measure
E(p) = −Σ_{i,j} w(i, j) pᵢpⱼ + λ (Σᵢ pᵢ)² , p ∈ {0, 1}ⁿ
Substituting x = 2p − e ∈ {−1, +1}ⁿ:
J(x) = ¼⟨x, (λeeᵀ − W)x⟩ + ½⟨e, (λnI − W)x⟩
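The substitution x = 2p − e only shifts the objective by a constant, which can be verified by exhaustive enumeration over all binary p (Python/NumPy; the constant c0 = (λn² − ⟨e, We⟩)/4 is my own bookkeeping, not on the slide):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n = 5
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                    # symmetric weights w(i, j)
lam = 0.7
e = np.ones(n)

def E(p):
    return -p @ W @ p + lam * p.sum() ** 2

def J(x):
    return 0.25 * x @ ((lam * np.outer(e, e) - W) @ x) \
         + 0.5 * e @ ((lam * n * np.eye(n) - W) @ x)

# E(p) and J(2p - e) should differ only by the constant (lam*n^2 - <e, W e>)/4
c0 = 0.25 * (lam * n**2 - e @ W @ e)
diffs = [E(np.array(bits, float)) - J(2 * np.array(bits, float) - e)
         for bits in product([0, 1], repeat=n)]
spread = max(abs(d - c0) for d in diffs)
```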
Lifting problem variables into matrix space:
J(x) = ⟨x, Qx⟩ + 2⟨b, x⟩ = ⟨(x; 1), (Q b ; bᵀ 0)(x; 1)⟩ = ⟨x̃, Lx̃⟩ = tr(Lx̃x̃ᵀ)
(figure: the feasible set of (a, b, c) ∈ [−1, 1]³ with X = (1 a b ; a 1 c ; b c 1) ⪰ 0)
Auxiliary variables
Non-convex local data term and convex non-local regularization: step-size, convergence, ...?
Example (coherent particle matching in fluids)
Approach
Represent the non-convex function f(x) through auxiliary variables y:
f(x) = inf_y (1/λ) φ(x, y)
φ(x, y) = ½‖x‖² − ⟨x, y⟩ + h**(y)
f(x) as inf-projection of the biconvex (≠ convex!) function λ⁻¹φ(x, y)
Example
DC-Programming
Example: Robust linear classification with feature selection
inf_{w∈ℝⁿ, b∈ℝ} (1 − λ) Σ_{i=1}^n (1 − yᵢ(⟨w, xᵢ⟩ + b))₊ + λ‖w‖₀
‖w‖₀ := |{i | wᵢ ≠ 0}| “counts” features. Concave approximation:
Σᵢ (1 − e^{−αvᵢ}) , |wᵢ| ≤ vᵢ
DC-program:
min_{w,b,ξ,v} (1 − λ)⟨e, ξ⟩ + λ⟨e, e − exp(−αv)⟩
ξᵢ ≥ 1 − yᵢ(⟨w, xᵢ⟩ + b) , ∀i
ξ ≥ 0 , −v ≤ w ≤ v
Concave penalty (zero at binary x, positive for fractional x):
−(μ/2)⟨x, x − e⟩
DC-program:
inf_{x∈[0,1]ⁿ} { ½‖Ax − b‖² + λ⟨x, Lx⟩ − (μ/2)⟨x, x − e⟩ }
Convex program (successive linearization of the concave part, y^k ∈ ∂h(x^k)):
x^{k+1} ∈ argmin_x { g(x) − (h(x^k) + ⟨y^k, x − x^k⟩) } = ∂g*(y^k)
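The DC iteration decreases the objective monotonically whenever each convex subproblem is solved at least as well as the warm start. A Python/NumPy sketch of this scheme on a small random instance (my own data; L is a simple PSD matrix standing in for the regularizer, and the subproblem is solved by projected gradient):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
A = rng.standard_normal((12, n))
b = rng.standard_normal(12)
L = np.eye(n)                    # simple PSD stand-in for the regularizer L
lam, mu = 0.1, 0.5
e = np.ones(n)
Q = A.T @ A + 2 * lam * L        # Hessian of the convex part g

def F(x):
    # full non-convex objective g(x) - h(x) on the box [0,1]^n
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * x @ L @ x \
         - 0.5 * mu * x @ (x - e)

def dca_step(x, iters=2000):
    # linearize h at x (grad h = mu*x - (mu/2)e), then solve the convex
    # subproblem min_{[0,1]^n} g(z) - <grad_h, z> by projected gradient
    grad_h = mu * x - 0.5 * mu * e
    step = 1.0 / np.linalg.norm(Q, 2)
    z = x.copy()                 # warm start keeps the DC iteration monotone
    for _ in range(iters):
        z = np.clip(z - step * (Q @ z - A.T @ b - grad_h), 0.0, 1.0)
    return z

x = 0.5 * e
vals = [F(x)]
for _ in range(15):
    x = dca_step(x)
    vals.append(F(x))
```

The recorded objective values never increase, and the iterates stay in the box throughout.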
Example
Tomographic reconstruction of a binary 256³ volume from 5 projections.