15 Optimization Script
Marc Toussaint
Contents
1 Introduction 3
Types of optimization problems (1:3)
3 Constrained Optimization 16
Constrained optimization (3:1) Log barrier method (3:6) Central path (3:9) Squared penalty
method (3:12) Augmented Lagrangian method (3:14) Lagrangian: definition (3:21) La-
grangian: relation to KKT (3:24) Karush-Kuhn-Tucker (KKT) conditions (3:25) Lagrangian:
saddle point view (3:27) Lagrange dual problem (3:29) Log barrier as approximate KKT
(3:33) Primal-dual interior-point Newton method (3:36) Phase I optimization (3:40) Trust re-
gion (3:41)
4 Convex Optimization 26
Function types: convex, quasi-convex, uni-modal (4:1) Linear program (LP) (4:6) Quadratic
program (QP) (4:6) LP in standard form (4:7) Simplex method (4:11) LP-relaxations of integer
programs (4:15) Sequential quadratic programming (4:23)
7 Exercises 49
7.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.5 Exercise 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.6 Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.7 Exercise 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.8 Exercise 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.9 Exercise 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.10 Exercise 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.11 Exercise 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.12 Exercise 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.13 Exercise 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.14 Exercise 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Index 62
1 Introduction
• Which science does not use optimality principles to describe nature & artifacts?
– Physics, Chemistry, Biology, Mechanics, ...
– Operations research, scheduling, ...
– Computer Vision, Speech Recognition, Machine Learning, Robotics, ...
• Endless applications
1:1
Teaching optimization
• Standard: Convex Optimization, Numerical Optimization
• Discrete Optimization (Stefan Funke)
• Exotics: Evolutionary Algorithms, Swarm optimization, etc.
• In this lecture I try to cover the standard topics, but also include work on
stochastic search & global optimization
1:2
min_x f(x)
• “Approximate upgrade”:
– Use samples of f (x) to approximate ∇f (x) locally
– Use samples of ∇f (x) to approximate ∇2f (x) locally
1:3
1:4
Optimization in Robotics
• Trajectories:
Let xt ∈ Rn be a joint configuration and x = x1:T = (x1 , . . . , xT ) a trajectory of
length T . Find
min_x  Σ_{t=0}^T  f_t(x_{t−k:t})⊤ f_t(x_{t−k:t})    (1)
s.t.  ∀t :  g_t(x_t) ≤ 0 ,  h_t(x_t) = 0
• Control:
s.t.  u = M q̈ + h + J_g⊤ λ    (3)
      J_φ q̈ = c    (4)
      λ = λ*    (5)
      J_g q̈ = b    (6)
1:5
Planned Outline
• Unconstrained Optimization: Gradient- and 2nd order methods
– stepsize & direction, plain gradient descent, steepest descent, line search &
trust region methods, conjugate gradient
– Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
• Constrained Optimization
– log barrier, squared penalties, augmented Lagrangian
– Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• Special convex cases
– Linear Programming, (sequential) Quadratic Programming
– Simplex algorithm
– Relaxation of integer linear programs
• Global Optimization
– infinite bandits, probabilistic modelling, exploration vs. exploitation, GP-UCB
• Stochastic search
– Blackbox optimization (0th order methods), MCMC, downhill simplex
1:7
Books
(this course will not go to the full depth in math of Boyd et al.)
1:8
Books
1:9
Organisation
• Webpage:
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/15-Optimization/
– Slides, Exercises & Software (C++)
– Links to books and other resources
• Admin things, please first ask:
Carola Stahl, Carola.Stahl@ipvs.uni-stuttgart.de, Raum 2.217
Gradient descent
• Objective function: f : Rn → R
Gradient vector:  ∇f(x) = [∂/∂x f(x)]⊤ ∈ Rⁿ
• Problem:
min_x f(x)
2:1
A. Stepsize
B. Descent direction
2:2
Stepsize
• Making steps proportional to ∇f(x)?
– small gradient → small step?
– large gradient → large step?
min_{α≥0} f(x + αd)
%−α describes the stepsize decrement in case of a rejected step
2:6
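For concreteness, here is a minimal Python sketch (not part of the original slides) of plain gradient descent with backtracking line search; the parameter names rho_ls and rho_minus correspond to %ls and %−α, and the quadratic test function is an arbitrary choice of mine.

import numpy as np

def gradient_descent_backtracking(f, grad, x, alpha=1.0, rho_ls=0.01,
                                  rho_minus=0.5, theta=1e-6, max_iter=1000):
    # plain gradient descent with backtracking line search (sufficient-decrease test)
    for _ in range(max_iter):
        g = grad(x)
        d = -g                                   # descent direction
        while f(x + alpha * d) > f(x) + rho_ls * alpha * g.dot(d):
            alpha *= rho_minus                   # stepsize decrement on rejected step
        x_new = x + alpha * d
        if np.linalg.norm(x_new - x) < theta:    # stopping criterion |dx| < theta
            return x_new
        x = x_new
        alpha /= rho_minus                       # tentatively increase the stepsize again
    return x

# usage on a simple quadratic f(x) = x^T C x
C = np.diag([1.0, 10.0])
print(gradient_descent_backtracking(lambda x: x @ C @ x, lambda x: 2 * C @ x,
                                    np.array([1.0, 1.0])))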
Wolfe Conditions
f(y) ≤ f(x) − 1/(2M) |∇f(x)|²
f(y) − f_min ≤ f(x) − f_min − 1/(2M) |∇f(x)|²
            ≤ f(x) − f_min − (2m/2M) (f(x) − f_min)
            ≤ [1 − m/M] (f(x) − f_min)
→ each step is contracting at least by 1 − m/M < 1
2:8
f(y) ≤ f(x) − α|∇f(x)|² + (M α²/2) |∇f(x)|²
     ≤ f(x) − (α/2) |∇f(x)|²
     ≤ f(x) − %ls α |∇f(x)|²
As backtracking terminates for any α ≤ 1/M, a step α ≥ %−α/M is chosen, such that
f(y) ≤ f(x) − (%ls %−α/M) |∇f(x)|²
f(y) − f_min ≤ f(x) − f_min − (%ls %−α/M) |∇f(x)|²
            ≤ f(x) − f_min − (2m %ls %−α/M) (f(x) − f_min)
            ≤ [1 − 2m %ls %−α/M] (f(x) − f_min)
→ each step is contracting at least by 1 − 2m %ls %−α/M < 1
2:9
B. Descent Direction
2:10
Is it really?
The steepest descent direction is the one where, when I make a step of length 1,
I get the largest decrease of f in its linear approximation.
2:11
Steepest Descent Direction
• But the norm ‖δ‖² = δ⊤Aδ depends on the metric A!
Newton Direction
• Assume we have access to the symmetric Hessian
∇²f(x) = ( ∂²f/∂x₁∂x₁   ∂²f/∂x₁∂x₂   ···   ∂²f/∂x₁∂xₙ
           ∂²f/∂x₂∂x₁        ⋱                  ⋮
               ⋮                                ⋮
           ∂²f/∂xₙ∂x₁      ···        ···   ∂²f/∂xₙ∂xₙ )   ∈ R^{n×n}
Newton method
• For finding roots (zero points) of f(x):
x ← x − f(x)/f′(x)
• For finding optima of f(x) in 1D:
x ← x − f′(x)/f″(x)
• For x ∈ Rⁿ:
x ← x − ∇²f(x)⁻¹ ∇f(x)
2:15
Why 2nd order information is better
• Better direction:
(figure: comparison of plain gradient, conjugate gradient, and 2nd-order step directions)
• Better stepsize:
– a full step jumps directly to the minimum of the local squared approx.
– often this is already a good heuristic
– additional stepsize reduction and damping are straightforward
2:16
• Notes:
– Line 3 computes the Newton step d = −∇²f(x)⁻¹∇f(x);
use the special Lapack routine dposv to solve Ax = b (using a Cholesky decomposition)
– λ is called damping, related to trust region methods; it makes the parabola
steeper around the current x
for λ → ∞: d becomes collinear with −∇f(x) but |d| → 0
2:17
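A minimal sketch of such a damped Newton step in Python, using SciPy's Cholesky solver as the analogue of the dposv routine mentioned above; the damping value and test function are arbitrary choices of mine.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def damped_newton_step(grad, hess, x, lam=1e-2):
    # solve (H + lam*I) d = -g via a Cholesky decomposition
    g, H = grad(x), hess(x)
    d = cho_solve(cho_factor(H + lam * np.eye(len(x))), -g)
    return x + d

# usage on f(x) = x^T C x
C = np.diag([1.0, 10.0])
print(damped_newton_step(lambda x: 2 * C @ x, lambda x: 2 * C, np.array([1.0, 1.0])))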
Demo
2:18
Gauss-Newton method
• Consider a sum-of-squares problem:
min_x f(x)    where    f(x) = φ(x)⊤φ(x) = Σ_i φ_i(x)²
and we can evaluate φ(x), ∇φ(x) for any x ∈ Rⁿ
• φ(x) ∈ Rd is a vector; each entry contributes a squared cost term to f (x)
• ∇φ(x) is the Jacobian (d × n-matrix)
∇φ(x) = ( ∂φ₁/∂x₁   ∂φ₁/∂x₂   ···   ∂φ₁/∂xₙ
          ∂φ₂/∂x₁       ⋱               ⋮
             ⋮                          ⋮
          ∂φ_d/∂x₁     ···      ···   ∂φ_d/∂xₙ )   ∈ R^{d×n}
Gauss-Newton method
• The gradient and Hessian of f (x) become
f(x) = φ(x)⊤φ(x)
∇f(x) = 2∇φ(x)⊤φ(x)
∇²f(x) = 2∇φ(x)⊤∇φ(x) + 2φ(x)⊤∇²φ(x)
• The Gauss-Newton method is the Newton method for f(x) = φ(x)⊤φ(x), approximating ∇²φ(x) ≈ 0
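A minimal Python sketch of the resulting Gauss-Newton iteration; the tiny regularizer added to ∇φ⊤∇φ and the Rosenbrock-style residuals are my own choices, not part of the script.

import numpy as np

def gauss_newton(phi, jac, x, iters=20):
    # Newton steps with the approximations  grad = 2 J^T phi,  Hess ~ 2 J^T J
    for _ in range(iters):
        r, J = phi(x), jac(x)
        x = x - np.linalg.solve(J.T @ J + 1e-8 * np.eye(len(x)), J.T @ r)
    return x

# usage: phi(x) = (x1 - 1, 10*(x2 - x1^2)), so f(x) is a Rosenbrock-like sum of squares
phi = lambda x: np.array([x[0] - 1.0, 10.0 * (x[1] - x[0] ** 2)])
jac = lambda x: np.array([[1.0, 0.0], [-20.0 * x[0], 10.0]])
print(gauss_newton(phi, jac, np.array([-1.0, 1.0])))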
Quasi-Newton methods
2:22
Quasi-Newton methods
• Yes: We can approximate ∇²f(x) from the data {(x_i, ∇f(x_i))}_{i=1}^k of previous iterations
2:23
Basic example
• We’ve seen already two data points (x1 , ∇f (x1 )) and (x2 , ∇f (x2 ))
How can we estimate ∇2f (x)?
• In 1D:
∇²f(x) ≈ (∇f(x₂) − ∇f(x₁)) / (x₂ − x₁)
• In Rⁿ, with δ = x₂ − x₁ and y = ∇f(x₂) − ∇f(x₁), we require
∇²f(x) δ = y      or      δ = ∇²f(x)⁻¹ y
and use the rank-1 estimates
∇²f(x) = (y y⊤)/(y⊤δ)      ∇²f(x)⁻¹ = (δ δ⊤)/(δ⊤y)
Convince yourself that the last line solves the desired relations
[Left: how to update ∇²f(x). Right: how to update directly ∇²f(x)⁻¹.]
2:24
BFGS
• Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:
• Notes:
– The blue term is the H -1 -update as on the previous slide
– The red term “deletes” previous H -1 -components
2:25
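For reference, a small Python sketch of the standard BFGS update of the inverse Hessian approximation, written from the textbook formula rather than copied from the slides; the rank-1 term δδ⊤/(δ⊤y) and the V factors correspond roughly to the blue and red terms mentioned above.

import numpy as np

def bfgs_inverse_update(Hinv, delta, y):
    # standard BFGS update of the inverse Hessian approximation
    # delta = x_new - x_old,  y = grad_new - grad_old
    rho = 1.0 / (y @ delta)
    I = np.eye(len(delta))
    V = I - rho * np.outer(y, delta)
    return V.T @ Hinv @ V + rho * np.outer(delta, delta)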
Quasi-Newton methods
• BFGS is the most popular of all Quasi-Newton methods
Others exist, which differ in the exact H -1 -update
• L-BFGS (limited memory BFGS) is a version which does not require explicitly storing H⁻¹
but instead stores the previous data {(x_i, ∇f(x_i))}_{i=1}^k and manages to
compute d = −H⁻¹∇f(x) directly from this data
• Some thought:
In principle, there are alternative ways to estimate H⁻¹ from the data {(x_i, f(x_i), ∇f(x_i))}_{i=1}^k,
e.g. using Gaussian Process regression with derivative observations
– Not only the derivatives but also the values f(x_i) should give information on
H(x) for non-quadratic functions
– Should one weight ‘local’ data stronger than ‘far away’?
(GP covariance function)
2:26
Conjugate Gradient
• The “Conjugate Gradient Method” is a method for solving (large, or sparse) linear
eqn. systems Ax + b = 0, without inverting or decomposing A. The steps will be
“A-orthogonal” (=conjugate).
We mention its extension for optimizing nonlinear functions f (x)
• A key insight:
– at x_k we computed g′ = ∇f(x_k)
– assume we made an exact line-search step to x_{k+1}
– at x_{k+1} we computed g = ∇f(x_{k+1})
Conjugate Gradient
Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize descent direction d = g = −∇f(x)
2: repeat
3:   α ← argmin_α f(x + αd)   // line search
4:   x ← x + αd
5:   g′ ← g ,  g ← −∇f(x)   // store and compute grad
6:   β ← max{ g⊤(g − g′) / (g′⊤g′) , 0 }
7:   d ← g + βd   // conjugate descent direction
8: until |∆x| < θ
• Notes:
– β > 0: The new descent direction always adds a bit of the old direction!
– This essentially provides 2nd order information
– The equation for β is by Polak-Ribière: On a quadratic function f(x) = x⊤Ax + b⊤x this
leads to conjugate search directions, d′⊤Ad = 0.
– Line search can be replaced by the 1st and 2nd Wolfe conditions with %ls2 < 1/2
2:29
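The pseudocode above translates almost line by line into Python; the following sketch uses SciPy's scalar minimizer as an (approximate) exact line search and is only meant to illustrate the Polak-Ribière update.

import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad, x, theta=1e-8, max_iter=200):
    g = -grad(x)
    d = g.copy()
    for _ in range(max_iter):
        alpha = minimize_scalar(lambda a: f(x + a * d)).x    # line search
        x_new = x + alpha * d
        g_old, g = g, -grad(x_new)
        beta = max(g @ (g - g_old) / (g_old @ g_old), 0.0)   # Polak-Ribiere
        d = g + beta * d                                     # conjugate direction
        if np.linalg.norm(x_new - x) < theta:
            return x_new
        x = x_new
    return x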
Conjugate Gradient
• For quadratic functions CG converges in n iterations. But each iteration does line
search
2:30
• Convergence rate (order p, rate r):   lim_k |x_{k+1} − x*| / |x_k − x*|^p = r
Rprop
2:34
Rprop
“Resilient Back Propagation” (outdated name from NN times...)
2:35
Rprop
• Rprop is a bit crazy:
– stepsize adaptation in each dimension separately
– it not only ignores |∇f| but also its exact direction:
step directions may differ by up to (but less than) 90° from ∇f
– Often works very robustly
– Guarantees? See work by Ch. Igel
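A simplified Rprop sketch in Python (per-dimension stepsizes, only gradient signs are used); the factors 1.2 and 0.5 are common defaults but otherwise an assumption of mine, and the classical variant additionally suppresses the update in a dimension right after a sign change.

import numpy as np

def rprop(grad, x, alpha=0.1, eta_plus=1.2, eta_minus=0.5, iters=200):
    step = np.full(x.shape, alpha)          # one stepsize per dimension
    g_old = np.zeros_like(x)
    for _ in range(iters):
        g = grad(x)
        flipped = g * g_old < 0
        step[flipped] *= eta_minus          # gradient sign changed: shrink step
        step[~flipped] *= eta_plus          # same sign: grow step
        x = x - step * np.sign(g)           # move along the sign of the gradient only
        g_old = g
    return x

# usage on f(x) = x^T C x
C = np.diag([1.0, 10.0])
print(rprop(lambda x: 2 * C @ x, np.array([1.0, 1.0])))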
Appendix
2:37
Stopping Criteria
• Standard references (Boyd) define stopping criteria based on the “change” in f (x),
e.g. |∆f (x)| < θ or |∇f (x)| < θ.
• Throughout I will define stopping criteria based on the change in x, e.g. |∆x| < θ!
In my experience with certain applications this is more meaningful, and invariant
of the scaling of f . But this is application dependent.
2:38
Evaluating optimization costs
• Standard references (Boyd) assume line search is cheap and measure optimiza-
tion costs as the number of iterations (counting 1 per line search).
• Throughout I will assume that every evaluation of f(x) or (f(x), ∇f(x)) or (f(x), ∇f(x), ∇²f(x))
is approx. equally expensive, as is the case in certain applications.
2:39
3 Constrained Optimization
General definition, log barriers, central path, squared penalties, augmented Lagrangian
(equalities & inequalities), the Lagrangian, force balance view & KKT conditions, saddle
point view, dual problem, min-max max-min duality, modified KKT & log barriers, Phase
I
Constrained Optimization
• General constrained optimization problem:
Let x ∈ Rⁿ, f : Rⁿ → R, g : Rⁿ → Rᵐ, h : Rⁿ → Rˡ; find
min_x f(x)   s.t.   g(x) ≤ 0 ,  h(x) = 0
In this lecture I'll mostly focus on inequality constraints g; equality constraints are
analogous/easier
• Applications
– Find an optimal, non-colliding trajectory in robotics
– Optimize the shape of a turbine blade, s.t. it must not break
– Optimize the train schedule, s.t. consistency/possibility
3:1
General approaches
• Try to somehow transform the constrained problem into an unconstrained one (or a sequence of unconstrained problems)
General approaches
• Penalty & Barriers
– Associate an (adaptive) penalty cost with violation of the constraint
– Associate an additional “force compensating the gradient into the constraint” (augmented
Lagrangian)
– Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
we address
min_x f(x) − µ Σ_i log(−g_i(x))
3:6
Log barrier
• Eventually we want to have a very small µ—but choosing small µ makes the barrier
very non-smooth, which might be bad for gradient and 2nd order methods
3:7
Note: See Boyd & Vandenberghe for alternative stopping criteria based on f precision (du-
ality gap) and better choice of initial µ (which is called t there).
3:8
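A minimal sketch of the resulting log barrier loop in Python; the inner solver (a derivative-free SciPy call), the µ schedule, and the toy problem are my own choices and only illustrate the structure of the method.

import numpy as np
from scipy.optimize import minimize

def log_barrier_method(f, g_list, x, mu=1.0, mu_factor=0.5, outer_iters=20):
    # repeatedly minimize f(x) - mu * sum_i log(-g_i(x)) for decreasing mu
    def barrier(z, mu):
        gs = np.array([g(z) for g in g_list])
        if np.any(gs >= 0):
            return np.inf                    # outside the feasible interior
        return f(z) - mu * np.sum(np.log(-gs))
    for _ in range(outer_iters):
        x = minimize(lambda z: barrier(z, mu), x, method='Nelder-Mead').x
        mu *= mu_factor
    return x

# usage: min x1+x2  s.t.  x1^2 + x2^2 - 1 <= 0, starting strictly feasible at the origin
print(log_barrier_method(lambda x: x[0] + x[1],
                         [lambda x: x[0]**2 + x[1]**2 - 1.0],
                         np.array([0.0, 0.0])))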
Central Path
• Every µ defines a different optimal x∗ (µ)
x*(µ) = argmin_x f(x) − µ Σ_i log(−g_i(x))
• Each point on the path can be understood as the optimal compromise of mini-
mizing f (x) and a repelling force of the constraints. (Which corresponds to dual
variables λ∗ (µ).)
3:9
We will revisit the log barrier method later, once we have introduced the Lagrangian...
3:10
we address
min_x f(x) + µ Σ_{i=1}^m [g_i(x) > 0] g_i(x)²
3:12
• A better idea would be to add an out-pushing gradient/force −∇gi (x) for every
constraint gi (x) > 0 that is violated
Ideally, the out-pushing gradient mixes with −∇f (x) exactly such that the result
becomes tangential to the constraint!
Augmented Lagrangian
(We can introduce this in a self-contained manner, without yet defining the “Lagrangian”)
3:14
Augmented Lagrangian (equality constraint)
• We first consider an equality constraint before addressing inequalities
• Instead of
min_x f(x)   s.t.   h(x) = 0
we address
min_x f(x) + µ Σ_{i=1}^m h_i(x)² + Σ_{i=1}^m λ_i h_i(x)    (7)
• Note:
– The gradient ∇h_i(x) is always orthogonal to the constraint
– By tuning λ_i we can induce a “virtual gradient” λ_i ∇h_i(x)
– The term µ Σ_{i=1}^m h_i(x)² penalizes as before
• At the minimum of the augmented objective,
x′ = argmin_x f(x) + µ Σ_{i=1}^m h_i(x)² + Σ_{i=1}^m λ_i h_i(x)
⇒ 0 = ∇f(x′) + µ Σ_{i=1}^m 2 h_i(x′) ∇h_i(x′) + Σ_{i=1}^m λ_i ∇h_i(x′)
• We then update the λ's such that they alone generate the gradient that the penalty previously generated:
Σ_i λ_i^new ∇h_i(x′) = µ Σ_i 2 h_i(x′) ∇h_i(x′) + Σ_i λ_i^old ∇h_i(x′)
which is fulfilled for
λ_i^new = λ_i^old + 2µ h_i(x′)
5: optionally, µ ← %+µ µ
6: until |∆x| < θ and ∀i : |h_i(x)| < ε
3:16
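The λ update above is easy to try out; here is a hedged Python sketch for the equality-constrained case, using a generic SciPy call for the inner minimization (the script's own implementation may differ). The example reproduces the small problem worked out analytically later in this section: expect x ≈ (1/2, 1/2) and λ ≈ −1.

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian_eq(f, h, x, mu=1.0, theta=1e-6, iters=50):
    # minimize f + mu*sum h_i^2 + sum lam_i h_i, then update lam_i += 2*mu*h_i
    lam = np.zeros(len(h(x)))
    for _ in range(iters):
        x_new = minimize(lambda z: f(z) + mu * np.sum(h(z)**2) + lam @ h(z), x).x
        lam = lam + 2 * mu * h(x_new)
        if np.linalg.norm(x_new - x) < theta and np.all(np.abs(h(x_new)) < 1e-6):
            return x_new, lam
        x = x_new
    return x, lam

# usage: min ||x||^2  s.t.  x1 + x2 - 1 = 0   (expect x = (0.5, 0.5), lambda = -1)
print(augmented_lagrangian_eq(lambda x: x @ x,
                              lambda x: np.array([x[0] + x[1] - 1.0]),
                              np.array([2.0, 0.0])))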
we address
min_x f(x) + µ Σ_{i=1}^m [g_i(x) ≥ 0 ∨ λ_i > 0] g_i(x)² + Σ_{i=1}^m λ_i g_i(x)

Input: initial x ∈ Rⁿ, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε, parameters (defaults: %+µ = 1, µ₀ = 1)
Output: x
1: initialize µ = µ₀, λ_i = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σ_i [g_i(x) ≥ 0 ∨ λ_i > 0] g_i(x)² + Σ_i λ_i g_i(x)
4:   ∀i : λ_i ← max(λ_i + 2µ g_i(x), 0)
5:   optionally, µ ← %+µ µ
6: until |∆x| < θ and ∀i : g_i(x) < ε
3:18
• See also:
M. Toussaint: A Novel Augmented Lagrangian Approach for Inequalities and Convergent
Any-Time Non-Central Updates. e-Print arXiv:1412.4329, 2014.
3:19
The Lagrangian
3:20
The Lagrangian
• Given a constrained problem
min_x f(x)   s.t.   g(x) ≤ 0
the Lagrangian is defined as
L(x, λ) = f(x) + Σ_{i=1}^m λ_i g_i(x)
• The Lagrangian implies a dual problem, which is sometimes easier to solve than
the primal
3:22
Example: Some calculus using the Lagrangian
• For x ∈ R², what is
min_x x⊤x   s.t.   x₁ + x₂ = 1  ?
• Solution:
L(x, λ) = x⊤x + λ(x₁ + x₂ − 1)
0 = ∇_x L(x, λ) = 2x + λ(1, 1)⊤   ⇒   x₁ = x₂ = −λ/2
0 = ∇_λ L(x, λ) = x₁ + x₂ − 1   ⇒   λ = −1 ,  x₁ = x₂ = 1/2
3:23
• At the optimum there must be a balance between the cost gradient −∇f (x) and
the gradient of the active constraints −∇gi (x)
3:24
∇f(x) + Σ_{i=1}^m λ_i ∇g_i(x) = 0    (“stationarity”)
The last condition says that λi > 0 only for active constraints.
These are the Karush-Kuhn-Tucker conditions (KKT, neglecting equality con-
straints)
3:25
∇f(x) + Σ_{i=1}^m λ_i ∇g_i(x) = 0    ⇔    ∇_x L(x, λ) = 0
• In that sense, the Lagrangian can be viewed as the “energy function” that gener-
ates (for good choice of λ) the right balance between cost and constraint gradients
• This is exactly as in the augmented Lagrangian approach, where however we have an addi-
tional (“augmented”) squared penalty that is used to tune the λi
3:26
• Note:
This implies either (λi = 0 ∧ gi (x) < 0) or gi (x) = 0, which is exactly equivalent
to the complementarity and primal feasibility conditions
• Again, optima (x∗ , λ∗ ) are saddle points where
minx L enforces stationarity and
maxλ≥0 L enforces complementarity and primal feasibility
∃x : ∀i : gi (x) < 0
we address
min_x f(x) − µ Σ_i log(−g_i(x))
or equivalently
∇f(x) + Σ_i λ_i ∇g_i(x) = 0 ,    λ_i g_i(x) = −µ
• We think of the KKT conditions as an equation system r(x, λ) = 0, and can use
the Newton method for solving it:
∇r · (∆x, ∆λ) = −r
This leads to primal-dual algorithms that adapt x and λ concurrently. Roughly, this
uses the curvature ∇2 f to estimate the right λ to push out of the constraint.
3:36
∇f(x) + Σ_{i=1}^m λ_i ∇g_i(x) = 0    (“force balance”)
∀i : g_i(x) ≤ 0    (primal feasibility)
∀i : λ_i ≥ 0    (dual feasibility)
∀i : λ_i g_i(x) = −µ    (complementarity)

r(x, λ) = 0 ,    r(x, λ) := ( ∇f(x) + ∇g(x)⊤λ ;  −diag(λ) g(x) − µ 1_m )

∇r(x, λ) = ( ∇²f(x) + Σ_i λ_i ∇²g_i(x)    ∇g(x)⊤
             −diag(λ) ∇g(x)               −diag(g(x)) )
3:37
• The above formulation allows for a duality gap µ; choose µ = 0 or consult Boyd
how to update on the fly (sec 11.7.3)
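As a sketch of a single primal-dual Newton step in Python: the code below builds r and a simplified ∇r in which the top-left block is only ∇²f(x), i.e. the constraints' curvature terms are dropped (exact for linear constraints, an approximation otherwise); all names are my own.

import numpy as np

def primal_dual_step(grad_f, hess_f, g, jac_g, x, lam, mu=0.0):
    # one Newton step on r(x,lam) = (grad f + Jg^T lam, -diag(lam) g - mu*1)
    gf, Hf = grad_f(x), hess_f(x)
    gv, Jg = g(x), jac_g(x)
    n = len(x)
    r = np.concatenate([gf + Jg.T @ lam, -lam * gv - mu])
    K = np.block([[Hf,                  Jg.T],
                  [-np.diag(lam) @ Jg,  -np.diag(gv)]])   # simplified Jacobian of r
    delta = np.linalg.solve(K, -r)
    return x + delta[:n], lam + delta[n:]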
or
min_{(x,s) ∈ R^{n+m}}  Σ_{i=1}^m s_i   s.t.   ∀i : g_i(x) ≤ s_i ,  s_i ≥ 0
3:40
Trust Region
• Instead of adapting the stepsize along a fixed direction, an alternative is to adapt
the trust region
• Roughly, while f(x + δ) > f(x) + %ls ∇f(x)⊤δ:
– Reduce trust region radius β
– try δ = argminδ:|δ|<β f (x + δ) using a local quadratic model of f (x + δ)
Function types
• A function is defined convex iff for all x, y and a ∈ [0, 1]:
f(a x + (1−a) y) ≤ a f(x) + (1−a) f(y)
• [Subjective!] I call a function unimodal iff it has only 1 local minimum, which is the
global minimum
Note: in dimensions n > 1 quasiconvexity is stronger than unimodality
• A general non-linear function is unconstrained and can have multiple local minima
4:1
Local optimization
• So far I avoided making explicit assumptions about problem convexity: To empha-
size that all methods we considered – except for Newton – are applicable also on
non-convex problems.
• The methods we considered are local optimization methods, which can be defined
as
– a method that adapts the solution locally
– a method that is guaranteed to converge to a local minimum only
Convex problems
• Convexity is a strong assumption
• Roughly:
“global optimization = finding local optima + multiple convex problems”
4:4
Convex problems
• A constrained optimization problem
min_x f(x)   s.t.   g(x) ≤ 0 ,  h(x) = 0
is called convex iff f is convex, each g_i is convex, and h is linear
• Alternative definition:
f convex and feasible region is a convex set
4:5
LP in standard form
min_x c⊤x   s.t.   x ≥ 0 ,  Ax = b
(For comparison, a quadratic program (QP) has the general form
min_x ½ x⊤Qx + c⊤x   s.t.   Gx ≤ h ,  Ax = b
where Q is positive definite.)
• To bring a general LP with inequalities Gx ≤ h into standard form: express x = x⁺ − x⁻ with x⁺, x⁻ ≥ 0 and introduce slack variables ξ ≥ 0 for the inequalities:
min_{x⁺,x⁻,ξ} c⊤(x⁺ − x⁻)   s.t.   G(x⁺ − x⁻) + ξ = h ,  A(x⁺ − x⁻) = b ,  x⁺, x⁻, ξ ≥ 0
• Now this conforms to the standard form (replacing (x⁺, x⁻, ξ) ≡ x, etc.)
4:7
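Standard-form LPs of this kind can be handed directly to an off-the-shelf solver; a small SciPy example (the data are arbitrary, and method='highs' assumes a reasonably recent SciPy version):

import numpy as np
from scipy.optimize import linprog

# min c^T x  s.t.  A x = b,  x >= 0
c = np.array([1.0, 2.0, 3.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method='highs')
print(res.x, res.fun)   # expect x = (1, 0, 0) with objective 1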
Example LPs
Linear Programming
– Algorithms
– Application: LP relaxation of discrete problems
4:9
Algorithms for Linear Programming
(The emphasis in the notion of interior point methods is to distinguish from con-
straint walking methods.)
Simplex Algorithm
Georg Dantzig (1947)
Note: Not to be confused with the Nelder–Mead method (downhill simplex method)
4:11
Simplex Algorithm
• The Simplex Algorithm walks along the edges of the polytope, at every corner
choosing the edge that decreases c>x most
• This either terminates at a corner, or leads to an unconstrained edge (−∞ opti-
mum)
Simplex Algorithm
• The simplex algorithm is often efficient, but in the worst case exponential in n and m.
• Interior point methods (log barrier) and, more recently again, augmented La-
grangian methods have become somewhat more popular than the simplex algo-
rithm
4:13
• Examples:
– Travelling Salesman: min_{x_ij} Σ_ij c_ij x_ij with x_ij ∈ {0, 1} and constraints ∀j : Σ_i x_ij = 1 , ∀i : Σ_j x_ij = 1 , etc.
• Instead of the integer program we solve the relaxation
min_x c⊤x   s.t.   Ax = b ,  x ∈ [0, 1]
• Clearly, the relaxed solution will be a lower bound on the integer solution (some-
times also called “outer bound” because [0, 1] ⊃ {0, 1})
4:17
Example: MAP inference in MRFs
• Finding maxx f (x) of a MRF is then equivalent to
max_{b_i(x), b_ij(x,y)}   Σ_{(ij)∈E} Σ_{x,y} b_ij(x,y) f_ij(x,y) + Σ_i Σ_x b_i(x) f_i(x)
such that
b_i(x), b_ij(x,y) ∈ {0, 1} ,    Σ_x b_i(x) = 1 ,    Σ_y b_ij(x,y) = b_i(x)
This set of feasible b's is called the marginal polytope (because it describes the
space of “probability distributions” that are marginally consistent (but not necessarily
globally normalized!))
4:18
• If the solution of the LP-relaxation turns out to be integer, we've solved the originally
NP-hard problem!
If not, the relaxed problem can be discretized to be a good initialization for discrete
optimization
• For binary attractive MRFs (a common case) the solution will always be integer
4:19
Quadratic Programming
4:20
Quadratic Programming
min_x ½ x⊤Qx + c⊤x   s.t.   Gx ≤ h ,  Ax = b
• Efficient Algorithms:
– Interior point (log barrier)
– Augmented Lagrangian
– Penalty
• Dual:
4:22
• Sequential quadratic programming addresses a general non-linear program min_x f(x) s.t. g(x) ≤ 0,
where we can evaluate f(x), ∇f(x), ∇²f(x) and g(x), ∇g(x), ∇²g(x) for any x ∈ Rⁿ
→ Newton method
• In the unconstrained case, the standard step direction δ solves (∇²f(x) + λI) δ = −∇f(x)
• In the constrained case, a natural step direction δ can be found by solving the local
QP-approximation to the problem
This is an optimization problem over δ and only requires the evaluation of f(x), ∇f(x), ∇²f(x), g(x), ∇g(x)
once.
4:23
5 Global & Bayesian Optimization
Multi-armed bandits, exploration vs. exploitation, navigation through belief space, up-
per confidence bound (UCB), global optimization = infinite bandits, Gaussian Pro-
cesses, probability of improvement, expected improvement, UCB
Global Optimization
• Is there an optimal way to optimize (in the Blackbox case)?
• Is there a way to find the global optimum instead of only local?
5:1
Outline
• Play a game
• Multi-armed bandits
– Belief state & belief planning
– Upper Confidence Bound (UCB)
• Standard heuristics:
– Upper Confidence Bound (GP-UCB)
– Maximal Probability of Improvement (MPI)
– Expected Improvement (EI)
5:2
Bandits
5:3
Bandits
• Let a_t ∈ {1, .., n} be the choice of machine at time t
Let y_t ∈ R be the outcome with mean ⟨y_{a_t}⟩
• Problem: find a policy that maximizes, e.g., ⟨Σ_{t=1}^T y_t⟩
or
max ⟨y_T⟩
or other objectives like discounted infinite horizon max ⟨Σ_{t=1}^∞ γ^t y_t⟩
5:5
Exploration, Exploitation
• “Two effects” of choosing a machine:
– You collect more data about the machine → knowledge
– You collect reward
• For example
– Exploration: Choose the next action at to min hH(bt )i
– Exploitation: Choose the next action at to max hyt i
5:6
• The knowledge about the machines can be represented
– as the full history h_t, or
– as the belief
b_t(θ) = P(θ | h_t)
where θ are the unknown parameters θ = (θ₁, .., θ_n) of all machines
5:7
(graphical model: unknown parameters θ of the machines, with actions a₁, a₂, a₃ and outcomes y₁, y₂, y₃)
or as Belief MDP
(graphical model: beliefs b₀ → b₁ → b₂ → b₃, with actions a_t and outcomes y_t)
P(b′ | y, a, b) = 1 if b′ = b′_{[b,a,y]}, 0 otherwise ,    P(y | a, b) = ∫_{θ_a} b(θ_a) P(y | θ_a)
• The Belief MDP describes a different process: the interaction between the information avail-
able to the agent (bt or ht ) and its actions, where the agent uses his current belief to antici-
pate outcomes, P (y|a, b).
• The belief (or history ht ) is all the information the agent has avaiable; P (y|a, b) the “best”
possible anticipation of observations. If it acts optimally in the Belief MDP, it acts optimally in
the original problem.
Optimality in the Belief MDP ⇒ optimality in the original problem
5:8
Optimal policies via Belief Planning
• The Belief MDP (as above):
P(b′ | y, a, b) = 1 if b′ = b′_{[b,a,y]}, 0 otherwise ,    P(y | a, b) = ∫_{θ_a} b(θ_a) P(y | θ_a)
5:9
Optimal policies
• The value function assigns a value (maximal achievable return) to a state of knowl-
edge
• The optimal policy is greedy w.r.t. the value function (in the sense of the maxat
above)
• Computationally heavy: bt is a probability distribution, Vt a function over probability
distributions
• The term ∫_{y_t} P(y_t | a_t, b_{t-1}) [ y_t + V_t(b_{t-1}[a_t, y_t]) ] is related to the Gittins Index: it can be computed
for each bandit separately.
5:10
Example exercise
• Consider 3 binary bandits for T = 10.
– The belief is 3 Beta distributions Beta(pi |α + ai , β + bi ) → 6 integers
– T = 10 → each integer ≤ 10
– V_t(b_t) is a function over {0, .., 10}⁶
• Given a prior α = β = 1,
a) compute the optimal value function and policy for the final reward and the aver-
age reward problems,
b) compare with the UCB policy.
5:11
See Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi & Fischer, Machine learn-
ing, 2002.
5:12
UCB algorithms
• UCB algorithms determine a confidence interval such that
ŷ_i − σ_i < ⟨y_i⟩ < ŷ_i + σ_i
with high probability.
UCB then chooses the machine with the highest upper bound ŷ_i + σ_i of this confidence interval
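A minimal Python sketch of a UCB policy in the style of Auer et al. (UCB1); the confidence width sqrt(2 ln t / n_a) follows that paper, while the Bernoulli test machines are an arbitrary choice of mine.

import numpy as np

def ucb1(draw, n_arms, T):
    # play each arm once, then always pick the arm with the highest upper confidence bound
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for a in range(n_arms):
        means[a], counts[a] = draw(a), 1
    for t in range(n_arms, T):
        a = int(np.argmax(means + np.sqrt(2 * np.log(t + 1) / counts)))
        y = draw(a)
        counts[a] += 1
        means[a] += (y - means[a]) / counts[a]
    return means, counts

# usage: three Bernoulli machines with unknown success probabilities
p = [0.2, 0.5, 0.8]
rng = np.random.default_rng(0)
print(ucb1(lambda a: float(rng.random() < p[a]), 3, 1000))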
Conclusions
• The bandit problem is an archetype for
– Sequential decision making
– Decisions that influence knowledge as well as rewards/states
– Exploration/exploitation
• The same aspects are inherent also in global optimization, active learning & RL
• Greedy Heuristics (UCB) are computationally much more efficient and guarantee
bounded regret
5:14
Further reading
• ICML 2011 Tutorial Introduction to Bandits: Algorithms and Theory, Jean-Yves
Audibert, Rémi Munos
• Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi & Fis-
cher, Machine learning, 2002.
• On the Gittins Index for Multiarmed Bandits, Richard Weber, Annals of Applied
Probability, 1992.
Optimal Value function is submodular.
5:15
Global Optimization
5:16
Global Optimization
• Let x ∈ Rⁿ, f : Rⁿ → R, find
min_x f(x)
or, after a budget of T queries, e.g.
min ⟨f(x_T)⟩
5:18
Gaussian Processes as belief
• The unknown “world property” is the function θ = f
• Given a Gaussian Process prior GP(f | µ, C) over f and a history of queries and outcomes D_t,
the belief is
b_t(f) = P(f | D_t) = GP(f | D_t, µ, C)
Mean(f(x)) = f̂(x) = κ(x)(K + σ²I)⁻¹y    (response surface)
Var(f(x)) = σ̂(x) = k(x, x) − κ(x)(K + σ²Iₙ)⁻¹κ(x)    (confidence interval)
• Side notes:
– Don’t forget that Var(y ∗ |x∗ , D) = σ 2 + Var(f (x∗ )|D)
– We can also handle discrete-valued functions f using GP classification
5:19
5:20
Conclusions
• Optimization as a problem of
– Computation of the belief
– Belief planning
Heuristics
5:23
1-step heuristics based on GPs
• Maximize Probability of Improvement (MPI):
x_t = argmax_x ∫_{−∞}^{y*} N(y | f̂(x), σ̂(x)) dy
• Maximize Expected Improvement (EI):
x_t = argmax_x ∫_{−∞}^{y*} N(y | f̂(x), σ̂(x)) (y* − y) dy
• Maximize UCB:
x_t = argmax_x f̂(x) + β_t σ̂(x)
(Often, β_t = 1 is chosen. UCB theory allows for better choices. See the Srinivas et al. citation below.)
5:24
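A rough GP-UCB sketch in Python using scikit-learn's Gaussian Process regression; for simplicity the acquisition function is maximized over a random candidate set rather than with a proper inner optimizer, and the kernel, β and test function are arbitrary choices of mine.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb(f, lo, hi, T=30, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(2, 1))                 # two random initial queries
    y = np.array([f(x) for x in X])
    for _ in range(T):
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3)).fit(X, y)
        cand = rng.uniform(lo, hi, size=(200, 1))
        mean, std = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mean + beta * std)]      # maximize f_hat + beta*sigma_hat
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmax(y)], np.max(y)

# usage: maximize a 1D test function on [0, 1]
print(gp_ucb(lambda x: float(np.sin(10 * x[0]) * x[0]), 0.0, 1.0))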
• We put a lot of effort into carefully selecting just the next query point
5:25
From: Information-theoretic regret bounds for gaussian process optimization in the bandit setting Srinivas,
Krause, Kakade & Seeger, Information Theory, 2012.
5:26
5:27
Further reading
• Classically, such methods are known as Kriging
Entropy Search
slides by Philipp Hennig
P. Hennig & C. Schuler: Entropy Search for Information-Efficient Global Optimiza-
tion, JMLR 13 (2012).
5:30
“Blackbox Optimization”
• We use the term to denote the problem: Let x ∈ Rn , f : Rn → R, find
min f (x)
x
• Bayesian/Global Optimization
– Methods for arbitrary (smooth) blackbox functions that do not get stuck in local
optima.
– Very interesting domain – close analogies to (active) Machine Learning, ban-
dits, POMDPs, optimal decision making/planning, optimal experimental de-
sign
6:2
Outline
• Basic downhill running
– Greedy local search, stochastic local search, simulated annealing
– Iterated local search, variable neighborhood search, Tabu search
– Coordinate & pattern search, Nelder-Mead downhill simplex
Simulated Annealing
• Simulated Annealing is a Markov chain Monte Carlo (MCMC) method.
– Must read!: An Introduction to MCMC for Machine Learning
– These are iterative methods to sample from a distribution, in our case
p(x) ∝ e^{−f(x)/T}
• For a fixed temperature T, one can prove that the set of accepted points is distributed
as p(x) (but non-i.i.d.!). The acceptance probability
A(y | x) = min{ 1 ,  e^{(f(x)−f(y))/T} q(x|y)/q(y|x) }
compares f(y) and f(x), but also the reversibility of q(y|x)
• When cooling the temperature, samples focus at the extrema. Guaranteed to
sample all extrema eventually
• Of high theoretical relevance, less of practical
6:8
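A small Python sketch of simulated annealing with a symmetric Gaussian proposal (so the q-ratio in the acceptance probability cancels); the cooling schedule, proposal width and Rastrigin test function are my own choices.

import numpy as np

def simulated_annealing(f, x, T=1.0, cooling=0.99, sigma=0.1, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    fx = f(x)
    x_best, f_best = x.copy(), fx
    for _ in range(iters):
        y = x + sigma * rng.standard_normal(len(x))              # symmetric proposal
        fy = f(y)
        if fy < fx or rng.random() < np.exp(-(fy - fx) / T):     # Metropolis acceptance
            x, fx = y, fy
            if fx < f_best:
                x_best, f_best = x.copy(), fx
        T *= cooling
    return x_best, f_best

# usage on the Rastrigin function
rastrigin = lambda x: 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))
print(simulated_annealing(rastrigin, np.array([3.0, -2.0])))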
Simulated Annealing
6:9
• Random restarts:
1: repeat
2: Sample x ∼ q(x)
3: x ← GreedySearch(x) or StochasticSearch(x)
4: If f (x) < f (x∗ ) then x∗ ← x
5: until run out of budget
Very briefly...
• Variable Neighborhood Search:
– Switch the neighborhood function in different phases
– Similar to Iterated Local Search
• Tabu Search:
– Maintain a tabu list of points (or point features) which may not be visited again
– The list has a fixed finite size: FILO
– Intensification and diversification heuristics make it more global
6:13
Coordinate Search
Input: Initial x ∈ Rn
1: repeat
2: for i = 1, .., n do
3: α∗ = argminα f (x + αei ) // Line Search
4: x ← x + α∗ ei
5: end for
6: until x converges
6:14
Pattern Search
– In each iteration k, have a (new) set of search directions Dk = {dki } and test
steps of length αk in these directions
– In each iteration, adapt the search directions Dk and step length αk
Details: See Nocedal et al.
6:15
6:16
• Typical parameters: α = 1, γ = 2, % = −1/2, σ = 1/2
6:17
6:20
• Update heuristic:
– Given D = {(x_i, f(x_i))}_{i=1}^λ, select the µ best: D′ = bestOf_µ(D)
– Compute the new mean x̂ from D′
• The downhill methods of the previous section did not store any information other
than the current x. (Exception: Tabu search, Nelder-Mead)
• Categories of EAs:
– Evolution Strategies: x ∈ Rn , often Gaussian pθ (x)
– Genetic Algorithms: x ∈ {0, 1}n , crossover & mutation define pθ (x)
– Genetic Programming: x are programs/trees, crossover & mutation
– Estimation of Distribution Algorithms: θ directly defines pθ (x)
6:23
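To make the "sample, select the µ best, refit the distribution" loop concrete, here is a tiny (µ, λ) evolution strategy sketch in Python; the fixed step-size decay is a crude stand-in for the covariance and step-size adaptation that CMA does properly, and all parameter values are my own choices.

import numpy as np

def simple_es(f, x_mean, sigma=0.5, lam=20, mu=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        X = x_mean + sigma * rng.standard_normal((lam, len(x_mean)))   # sample population
        fs = np.array([f(x) for x in X])
        elite = X[np.argsort(fs)[:mu]]                                 # select the mu best
        x_mean = elite.mean(axis=0)                                    # refit the mean
        sigma *= 0.95                                                  # naive step-size decay
    return x_mean

rosenbrock = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
print(simple_es(rosenbrock, np.array([-1.0, 2.0])))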
6:24
CMA references
Hansen, N. (2006), ”The CMA evolution strategy: a comparing review”
Hansen et al.: Evaluating the CMA Evolution Strategy on Multimodal Test Func-
tions, PPSN 2004.
Model-based optimization
following Nocedal et al., “Derivative-free optimization”
6:30
Model-based optimization
• The previous stochastic search methods are heuristics to update θ
Why not store the previous data directly?
Model-based optimization
1: Initialize D with at least ½(n + 1)(n + 2) data points
2: repeat
3: Compute a regression fˆ(x) = φ2 (x)>β on D
4: Compute x+ = argminx fˆ(x) s.t. |x − x̂| < α
5: Compute the improvement ratio % = [f(x̂) − f(x⁺)] / [f̂(x̂) − f̂(x⁺)]
6: if % > then
7: Increase the stepsize α
8: Accept x̂ ← x+
9: Add to data, D ← D ∪ {(x+ , f (x+ ))}
10: else
11: if det(D) is too small then // Data improvement
12: Compute x+ = argmaxx det(D ∪ {x}) s.t. |x − x̂| < α
13: Add to data, D ← D ∪ {(x+ , f (x+ ))}
14: else
15: Decrease the stepsize α
16: end if
17: end if
18: Prune the data, e.g., remove argmax_{x∈D} det(D \ {x})
19: until x converges
• Variant: Initialize with only n + 1 data points and fit a linear model as long as |D| < ½(n + 1)(n + 2) = dim(φ₂(x))
6:32
Model-based optimization
• Optimal parameters (with data matrix X ∈ Rn×dim(β) )
β̂_ls = (X⊤X)⁻¹ X⊤y
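In code, this regression step amounts to one least-squares solve; a 2D sketch with the quadratic features φ₂(x) = [1, x₁, x₂, x₁², x₂², x₁x₂] (the noisy test data are made up):

import numpy as np

def fit_quadratic_model(X, y):
    # least-squares fit of f_hat(x) = phi2(x) @ beta
    Phi = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1],
                           X[:, 0]**2, X[:, 1]**2, X[:, 0] * X[:, 1]])
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]**2 + 0.01 * rng.standard_normal(30)
print(fit_quadratic_model(X, y).round(2))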
Conclusions
• We covered
– “downhill running”
– Two flavors of methods that exploit the recent data:
– stochastic search (& EAs), maintaining θ that defines pθ (x)
– model-based opt., maintaining local data D that defines fˆ(x)
• These methods can be very efficient, but somehow the problem formalization is
unsatisfactory:
– What would be optimal optimization?
– What exactly is the information that we can gain from data about the opti-
mum?
– If the optimization algorithm were an “AI agent”, with selecting points as its actions
and seeing f(x) as its observations, what would be its optimal decision-making
strategy?
– And what about global blackbox optimization?
6:35
7 Exercises
7.1 Exercise 1
Read sections 1.1, 1.3 & 1.4 of Boyd & Vandenberghe “Convex Optimization”. This is
for you to get an impression of the book. Learn in particular about their categories of
convex and non-linear optimization problems.
For C = I (identity matrix) these would be fairly simple to optimize. The C matrix
changes the conditioning (“skewedness of the Hessian”) of these functions to make
them a bit more interesting. We assume that C is a diagonal matrix with entries
C(i, i) = c^{(i−1)/(n−1)}. We choose a conditioning¹ c = 10.
c) Implement these functions and display them for c = 10 over x ∈ [−1, 1]². You can
use any language, but we recommend Python, Octave/Matlab, or C++ (iff you are experienced
with numerics in C++). Plotting is often a quite laboring part of coding... For
plotting a function over the 2D input one evaluates the function on a grid of points, e.g.
in Python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

X0, X1 = np.meshgrid(np.linspace(-1, 1, 20), np.linspace(-1, 1, 20))
Y = X0**2 + X1**2   # evaluate your function on the grid here
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X0, X1, Y)
plt.show()
Or in Octave:
[X0,X1] = meshgrid(linspace(-1,1,20),linspace(-1,1,20));
Y = X0.**2 + X1.**2;
mesh(X0,X1,Y);
save ’datafile’ Y -ascii
Or you can store the grid data in a file and use gnuplot, e.g.:
splot [-1:1][-1:1] ’datafile’ matrix us ($1/10-1):($2/10-1):3 with lines
d) Implement a simple fixed stepsize gradient descent, iterating xk+1 = xk − α∇f (xk ),
with start point x0 = (1, 1), c = 10 and heuristically chosen α.
e) If you use Python or Octave, use an off-the-shelf optimization routine (ideally IPopt).
In Python, scipy.optimize is a standard go-to solution for general optimization
problems.
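For example (a sketch only, using the conditioned quadratic from above; note the 0-based index in the exponent):

import numpy as np
from scipy.optimize import minimize

n, c = 2, 10.0
C = np.diag([c**(i / (n - 1)) for i in range(n)])   # C(i,i) = c^((i-1)/(n-1)) with 1-based i
fsq = lambda x: x @ C @ x
res = minimize(fsq, x0=np.array([1.0, 1.0]), method='BFGS')
print(res.x, res.nfev)   # minimizer and number of function evaluations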
¹ The word “conditioning” generally denotes the ratio of the largest and smallest
eigenvalue of the Hessian.
7.2 Exercise 2
7.2.1 Quadratics
Take the quadratic function fsq = x>Cx with diagonal matrix C and entries C(i, i) = λi .
a) Which 3 fundamental shapes does a 2-dimensional quadratic take? Plot the surface
of fsq for various values of λ1 , λ2 (big/small, positive/negative/zero). Could you predict
these shapes before plotting them?
b) For which values of λ1 , λ2 does minx fsq (x) not have a solution? For which does it
have infinite solutions? For which does it have exactly 1 solution? Find out empirically
first, if you have to, then analytically.
7.2.2 Backtracking
with diagonal matrix C and entries C(i, i) = c^{(i−1)/(n−1)}. We choose a conditioning c = 10.
• number of inner (line search) loops over the number of outer (gradient descent)
loops.
b) Test also the alternative in step 3. Further, how does the performance change with
%ls (the backtracking stop criterion)?
7.3 Exercise 3
7.3.1 Misc
a) How do you have to choose the “damping” λ depending on ∇2f (x) in line 3 of the
Newton method (slide 02-18) to ensure that the d is always well defined (i.e., finite)?
desired relation δ = H -1 y, where δ and y are defined as on slide 02-23. Are there other
choices of H -1 that fulfill the relation? Which?
7.3.2 Gauss-Newton
b) Optimize the function using your optimization library of choice (If you can, use a
BFGS implementation.)
7.4 Exercise 4
c
In a previous exercise we defined the “hole function” fhole (x), where we now assume a
conditioning c = 4.
b) Implement the Squared Penalty Method. (In the inner loop you may choose any
method, including simple gradient methods.) Choose as a start point x = (1/2, 1/2). Plot
its optimization path and report on the number of total function/gradient evaluations
needed.
d) Implement the Log Barrier Method and test as in b) and c). Compare the func-
tion/gradient evaluations needed.
7.5 Exercise 6
Slide 03:38 describes the primal-dual Newton method. Implement it to solve the same
constrained problem we considered in the last exercise.
a) d = −∇r(x, λ)-1 r(x, λ) defines the search direction. Ideally one can make a step
with factor α = 1 in this direction. However, line search needs to ensure (i) dual
feasibility λ > 0, (ii) primal feasibility g(x) ≤ 0, and (iii) sufficient decrease (the Wolfe
condition). Line search decreases the step factor α to ensure these conditions (in this
order), where the Wolfe condition here reads
a) Use the method you implemented above to find a feasible initialization (Phase I). Do
this by solving the n + 1-dimensional problem
For some very small ε. Initialize this with the infeasible point (1, 1) ∈ R².
b) Once you’ve found a feasible point, use the standard log barrier method to find the
solution to the original problem (14). Start with µ = 1, and decrease it by µ ← µ/2 in
each iteration. In each iteration also report λ_i := −µ/g_i(x) for i = 1, 2.
7.6 Exercise 5
Assume that if we minimize (14) we end up at a solution x̄ for which each h_i(x̄) is
reasonably small, but not exactly zero. Prove, in the context of the Augmented Lagrangian
method, that setting λ_i = 2µ h_i(x̄) will, if we assume that the gradients ∇f(x)
and ∇h(x) are (locally) constant, ensure that the minimum of (15) fulfills the constraints
h(x) = 0.
Tip: Think intuitively. Think about how the gradient that arises from the penalty in (14) is
now generated via the λ_i.
7.6.2 Lagrangian and dual function
with variable x ∈ R.
a) Derive the optimal solution x∗ and the optimal value p∗ = f (x∗ ) by hand.
b) Write down the Lagrangian L(x, λ). Plot (using gnuplot or so) L(x, λ) over x for
various values of λ ≥ 0. Verify the lower bound property minx L(x, λ) ≤ p∗ , where p∗
is the optimum value of the primal problem.
c) Derive the dual function l(λ) = minx L(x, λ) and plot it (for λ ≥ 0). Derive the dual
optimal solution λ∗ = argmaxλ l(λ). Is maxλ l(λ) = p∗ (strong duality)?
Take last week’s programming exercise on Squared Penalty and “augment” it so that it
becomes the Augmented Lagrangian method. Compare the function/gradient evalua-
tions between the simple Squared Penalty method and the Augmented method.
7.7 Exercise 6
We have previously defined the “hole function” as f_hole^c(x) = 1 − exp(−x⊤Cx), where
C is an n × n diagonal matrix with C_ii = c^{(i−1)/(n−1)}. Assume conditioning c = 10 and
use the Lagrangian Method of Multipliers to solve on paper the following constrained
optimization problem in 2D:
min_x f_hole^c(x)   s.t.   h(x) = 0    (16)
Near the very end, you won’t be able to proceed until you have special values for v. Go
as far as you can without the need for these values.
Slide 03:38 describes the primal-dual Newton method. Implement it to solve the same
constrained problem we considered in the last exercise.
a) d = −∇r(x, λ)-1 r(x, λ) defines the search direction. Ideally one can make a step
with factor α = 1 in this direction. However, line search needs to ensure (i) dual
feasibility λ > 0, (ii) primal feasibility g(x) ≤ 0, and (iii) sufficient decrease (the Wolfe
condition). Line search decreases the step factor α to ensure these conditions (in this
order), where the Wolfe condition here reads
7.8 Exercise 7
These exercises focus on the first type, which is just as important as the second, as it
enables the use of a wider range of solvers. Exercises from Boyd et al http://www.
stanford.edu/˜boyd/cvxbook/bv_cvxbook.pdf:
Solve Exercise 4.12 (pdf page 207) from Boyd & Vandenberghe, Convex Optimization.
Solve Exercise 4.16 (pdf page 208) from Boyd & Vandenberghe, Convex Optimization.
Derive an explicit equation for the primal-dual Newton update of (x, λ) (slide 03:38) in
the case of Quadratic Programming. Use the special method for solving block matrix
linear equations using the Schur complements (Wikipedia “Schur complement”).
7.9 Exercise 7
a) Test CMA with a standard parameter setting on a log-variant of the Rosenbrock function
(see Wikipedia). My implementation of this function in C++ is:
double LogRosenbrock(const arr& x) {
  double f = 0.;
  for(uint i=1; i<x.N; i++)
    f += sqr(x(i) - sqr(x(i-1)))
         + .01*sqr(1 - x(i-1));
  f = log(1. + f);
  return f;
}
where sqr computes the square of a double.
Test CMA for the n = 2 and n = 10 dimensional Rosenbrock function. Initialize around
the start point (1, 10) and (1, 10, .., 10) ∈ R10 with standard deviation 0.1. You might
require up to 1000 iterations.
CMA should have no problem in optimizing this function – but as it always samples a
whole population of size λ, the number of evaluations is rather large. Plot f (xbest ) for
the best point found so far versus the total number of function evaluations.
b) Implement Twiddle Search (slide 05:15) and test it on the same function under same
conditions. Also plot f (xbest ) versus the total number of function evaluations and com-
pare to the CMA results.
7.10 Exercise 8
A few more exercises on standard techniques to convert problems into linear programs:
Solve Exercise 4.11 (pdf page 207) from Boyd & Vandenberghe, Convex Optimization.
You’re at the market and you find n offers, each represented by a set of items Ai and
the respective price ci . Your goal is to buy at least one of each item for as little as
possible.
Formulate as an ILP and then define a relaxation. If possible, come up with an inter-
pretation for the relaxed problem.
There are n facilities with which to satisfy the needs of m clients. The cost for opening
facility j is fj , and the cost for servicing client i through facility j is cij . You have to find
an optimal way to open facilities and to associate clients to facilities.
Formulate as an ILP and then define a relaxation. If possible, come up with an inter-
pretation for the relaxed problem.
You’re a taxicab driver in hyper-space (Rd ) and have to service n clients. Each client i
has an known initial position ci ∈ Rd and a destination di ∈ Rd . You start out at position
p0 ∈ Rd and have to service all the clients while minimizing fuel use, which is propor-
tional to covered distance. Hyper-space is funny, so the geometry is not Euclidean and
distances are Manhattan distances.
Formulate as an ILP and then define a relaxation. If possible, come up with an inter-
pretation for the relaxed problem.
7.10.5 Programming
Use the primal-dual interior point Newton method you programmed in the previous
exercises to solve the relaxed facility location for n facilities and m clients (n and m
small enough that you can find the solution by hand, so about n ∈ {3, ..., 5} and m ∈
{5, 10}).
Sample the positions of facilities and clients uniformly in [−10, 10]2 . Also sample the
cost of opening a facility randomly in [1, 10]. Set the cost for servicing client i with
facility j as the Euclidean distance between the two. (Important: keep track of what
seed you are using for your RNG.)
Compare the solution you found by hand to the relaxed solution, and to the relaxed
solution after rounding it to the nearest integral solution. Try to find a seed for which
the rounded solution is relatively good, and one for which the rounded solution is pretty
bad.
7.11 Exercise 8
Find an implementation of Gaussian Processes for your language of choice (e.g. Python:
scikit-learn, or Sheffield/GPy; Octave/Matlab: gpml) and implement UCB. Test your implementation
with different hyperparameters (find the best combination of kernel and
its parameters in the GP) on the following 2D global optimization problems:
On slide 5:18 it is speculated that one could consider a constrained blackbox optimiza-
tion problem as well. How could one approach this in the UCB manner?
7.12 Exercise 10
Use the above methods on the Rosenbrock and the Rastrigin functions.
7.12.2 Neighborhoods
7.13 Exercise 10
Use the above methods on the Rosenbrock and the Rastrigin functions.
7.13.2 Neighborhoods
7.14 Exercise 11
Visualize the current estimated model at each step of the optimization procedure.
Hints:
• Try out both a quadratic model φ₂(x) = [1, x₁, x₂, x₁², x₂², x₁x₂] and a linear one
φ₁(x) = [1, x₁, x₂].
• Initialize the model sampling .5(n + 1)(n + 2) (in the case of quadratic model)
or n + 1 (in the case of linear model) points around the starting position.
• For any given set of datapoints D, compute β = (X⊤X)⁻¹X⊤y, where X contains
(row-wise) the data points (either φ₁ or φ₂) in D, and y are the respective
function evaluations.
Broadly speaking, the No Free Lunch Theorems state that an algorithm can be said
to outperform another one only if certain assumptions are made about the problem
which is being solved itself. In other words, algorithms perform on average exactly the
same if no restriction or assumption is made on the type of problem itself. Algorithms
outperform each other only w.r.t. specific classes of problems.
2a) Read the publication “No Free Lunch Theorems for Optimization” by Wolpert and
Macready and get a better feel for what the statements are about.
2b, 2c, 2d) You are given an optimization problem where the search space is a set X
with size 100, and the cost space Y is the set of integers {1, . . . , 100}. Come up with
three different algorithms, and three different assumptions about the problem-space
such that each algorithm outperforms the others in one of the assumptions.
Try to be creative, or you will all come up with the same “obvious” answers.
8 Bullet points to help learning
This is a summary list of core topics in the lecture and intended as a guide for prepara-
tion for the exam. Test yourself also on the bullet points in the table of contents. Going
through all exercises is equally important.
• Steepest descent
– Is the gradient the steepest direction?
– Covariance (= invariance under linear transformations) of the steepest descent
direction
• 2nd-order information
– 2nd order information can improve direction & stepsize
– Hessian needs to be pos-def (↔ f (x) is convex) or modified/approximated as
pos-def (Gauss-Newton, damping)
• Newton method
– Definition
– Adaptive stepsize & damping
• Gauss-Newton
– f (x) is a sum of squared cost terms
– The approx. Hessian 2∇φ(x)⊤∇φ(x) is always semi-pos-def!
• Quasi-Newton
– Accumulate gradient information to approximate a Hessian
– BFGS, understand the term δδ⊤/(δ⊤y)
• Conjugate gradient
– New direction d′ should be “orthogonal” to the previous d, but relative to the
local quadratic shape, d′⊤Ad = 0 (= d′ and d are conjugate)
– On quadratic functions CG converges in n iterations
• Rprop
– Seems awfully hacky
– Every coordinate is treated separately. No invariance under rotations/transformations
– Change in gradient sign → reduce stepsize; else increase
– Works surprisingly well and robust in practice
• Convergence
– With perfect line search, the extreme (finite & positive!) eigenvalues of the
Hessian ensure convergence
– The Wolfe conditions (acceptance criterion for backtracking line search) en-
sure a “significant” decrease in f (x), which also leads to convergence
• Trust region
– Alternative to stepsize adaptation and backtracking
• Augmented Lagrangian
– Definition
– Role of the squared penalty: “measure” how strong f pushes into the con-
straint
– Role of the Lagrangian term: generate counter force
– Understand that the λ update generates the “desired gradient”
• The Lagrangian
– Definition
– Using the Lagrangian to solve constrained problems on paper (set both ∇_x L(x, λ) = 0 and ∇_λ L(x, λ) = 0)
– “Balance of gradients” and the first KKT condition
– Understand in detail the full KKT conditions
– Optima are necessarily saddle points of the Lagrangian
– minx L ↔ first KKT ↔ balance of gradients
– maxλ L ↔ complementarity KKT ↔ constraints
• Lagrange dual problem
– primal problem: min_x max_{λ≥0} L(x, λ)
– dual problem: max_{λ≥0} min_x L(x, λ)
– Definition of Lagrange dual
– Lower bound and strong duality
• Phase I optimization
– Nice trick to find feasible initialization
• Definitions
– Convex, quasi-convex, uni-modal functions
– Convex optimization problem
• Linear Programming
– General and standard form definition
– Converting into standard form
– LPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian
or primal-dual methods
– Simplex Algorithm is classical alternative; walks on the constraint edges in-
stead of the interior
• Application of LP:
– Very important application of LPs: LP-relaxations of integer linear programs
• Quadratic Programming
– Definition
– QPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian
or primal-dual methods
– Sequential QP solves general (non-quadratic) problems by defining a local QP
for the step direction followed by a line search in that direction
• Overview
– Basic downhill running: mostly ignore the collected data
– Use the data to shape search: stochastic search, EAs, model-based search
– Bayesian (global) optimization
• Model-based Optimization
– Precursor of Bayesian Optimization
– Core: smart ways to keep data D healthy
• Global optimization
– Global optimization = infinite bandits
– Locally correlated bandits → Gaussian Process beliefs
– Maximum Probability of Improvement
– Expected Improvement
– GP-UCB
• Potential pitfalls
– Choice of prior belief (e.g. kernel of the GP) is crucial
– Pure variance-based sampling for radially symmetric kernel ≈ grid sampling
Index
Augmented Lagrangian method (3:14)
Backtracking (2:5)
Bandits (5:4)
Belief planning (5:8)
Blackbox optimization: definition (6:1)
Phase I optimization (3:40)
Plain gradient descent (2:1)
Primal-dual interior-point Newton method (3:36)
Quadratic program (QP) (4:6)
Quasi-Newton methods (2:23)