04 Nonlinear Systems and Optimization
Ronny Bergmann
Univariate Optimization m = n = 1
Notation: f : R → R
Find a point x ∗ ∈ R such that f (x ∗ ) is minimal
Multivariate Optimisation n ∈ N, m = 1
Notation: F : Rn → R
Find a point x ∗ ∈ Rn such that F (x ∗ ) is minimal
Strict Local vs. Global Minimizer
Gradient
∂F/∂d (x) = lim_{α→0} (F(x + αd) − F(x))/α = ∇F(x)^T d
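As a quick numerical illustration of this identity (not part of the slides), one can compare the difference quotient with ∇F(x)^T d for a small α; the function F, the point x, and the direction d below are arbitrary choices.

```python
import numpy as np

# Illustrative choice: F(x) = x_1^2 + 3 x_1 x_2 with gradient (2 x_1 + 3 x_2, 3 x_1)
F = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
gradF = lambda x: np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([0.5, -1.0])
alpha = 1e-6

difference_quotient = (F(x + alpha * d) - F(x)) / alpha   # (F(x + αd) − F(x)) / α
directional_derivative = gradF(x) @ d                     # ∇F(x)^T d
print(difference_quotient, directional_derivative)        # both ≈ 1.0
```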
Jacobian and Hessian Matrix
For a function F ∈ C 1 (D), D ⊆ Rn , i. e. F : Rn → Rn
or F = (F1 , . . . , Fn )T with Fi ∈ C 1 (D; R), we call the matrix
            ⎛ ∂F_1/∂x_1(x)  · · ·  ∂F_1/∂x_n(x) ⎞
J_F(x) =    ⎜       ⋮          ⋱         ⋮      ⎟ ∈ R^{n×n},   x ∈ R^n,
            ⎝ ∂F_n/∂x_1(x)  · · ·  ∂F_n/∂x_n(x) ⎠
the Jacobian of F .
⇒ ith row is the gradient ∇Fi of ith component function Fi of F
B(x; R) = {y ∈ Rn | ∥y − x∥ < R}
Taylor Expansion
Let F ∈ C^1(R^n; R). Then we have the zeroth order Taylor expansion, i.e. for x, p ∈ R^n

F(x + p) = F(x) + ∇F(x + tp)^T p   for some t ∈ (0, 1).

If moreover F ∈ C^2(R^n; R), the first order expansion reads

F(x + p) = F(x) + ∇F(x)^T p + ½ p^T ∇²F(x + tp) p   for some t ∈ (0, 1).
First Order Necessary (Optimality) Conditions
Theorem (First Order Necessary Conditions)
Let x* be a local minimizer of F and F ∈ C^1(B(x*; R); R) for a suitable R > 0.
Then ∇F(x*) = 0.
Proof.
Blackboard/Note
Stationary Points
A point x ∈ R^n with ∇F(x) = 0 is called a stationary point of F.
Second Order Necessary and Sufficient Conditions
(without proofs here)
Necessary: If x* is a local minimizer of F ∈ C^2, then ∇F(x*) = 0 and ∇²F(x*) is positive semidefinite.
Sufficient: If ∇F(x*) = 0 and ∇²F(x*) is positive definite, then x* is a strict local minimizer.
General iterative Scheme
Contractive Mapping
Convergence (Fixed Point Theorem, QSS Theorem 7.2)
x^(k+1) = G(x^(k))
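A minimal sketch of this fixed point iteration; the map G below (a contraction near its fixed point) is just an illustrative choice.

```python
import numpy as np

def fixed_point_iteration(G, x0, tol=1e-10, maxiter=100):
    """Iterate x^(k+1) = G(x^(k)) until successive iterates are close."""
    x = x0
    for k in range(maxiter):
        x_new = G(x)
        if np.linalg.norm(np.atleast_1d(x_new - x)) < tol:
            return x_new, k + 1
        x = x_new
    return x, maxiter

# G(x) = cos(x) is contractive near its fixed point x* ≈ 0.739
x_star, iterations = fixed_point_iteration(np.cos, 1.0)
print(x_star, iterations)
```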
Proof of the Fixed Point Theorem
Rate of Convergence
in QSS only implicitly defined for p = 2 in Theorem 7.1, 1-dim. case: Def. 6.1
Examples
Examples: An Outlook
Model idea:
Find a matrix (an image) x
▶ “close to” y
▶ with “similar” neighboring pixels
▶ “close to” y:  ½∥x − y∥₂²
▶ I set of pixel indices
▶ N_i neighbors of pixel i
▶ “similar” neighbors:  Σ_{i∈I} Σ_{j∈N_i} |x_i − x_j|
▶ tradeoff/weight λ > 0:
  F(x) = ½∥x − y∥₂² + λ Σ_{i∈I} Σ_{j∈N_i} |x_i − x_j|
▶ Minimize!
[Figures: measured with noise / a (simple) reconstruction]
(Obs! high-dimensional & nonsmooth)
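A sketch of evaluating this F on a small image; the neighborhood choice (right and lower neighbor per pixel, each difference counted once) and the data y are illustrative assumptions, since the slides do not fix N_i.

```python
import numpy as np

def F(x, y, lam):
    """F(x) = 1/2 ||x - y||_2^2 + lam * sum_i sum_{j in N_i} |x_i - x_j|,
    with N_i taken as the right and lower neighbor of pixel i (an assumption)."""
    fidelity = 0.5 * np.sum((x - y) ** 2)
    tv = np.sum(np.abs(x[:, 1:] - x[:, :-1])) + np.sum(np.abs(x[1:, :] - x[:-1, :]))
    return fidelity + lam * tv

rng = np.random.default_rng(0)
y = rng.standard_normal((8, 8))      # "measured" data, here just noise
print(F(y, y, lam=0.1))              # at x = y the fidelity term vanishes, only the neighbor term remains
```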
A good direction d in the iterative scheme
Find the best descent direction!
Newton’s Method
Alternative Idea: Use more/other “Information”
Instead of “just” using the gradient information, we could use higher
order information:
In Optimization: the Hessian matrix ∇²F (or in 1D: the second derivative f″)
Algorithm 1: Bisection.
Given a, b ∈ R such that f (a)f (b) < 0 ⇒ ∃α ∈ (a, b) : f (α) = 0
Idea: “Divide and Conquer”:
1. set a0 = a, b0 = b, and k = 0
2. Repeat for k = 0, . . .
2.1 compute c_k = (a_k + b_k)/2
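A runnable version of the bisection idea; the remaining steps (keeping the subinterval that still contains the sign change) are the standard completion, not spelled out on the slide.

```python
def bisection(f, a, b, tol=1e-10, maxiter=100):
    """Find a root of f in (a, b), assuming f(a) * f(b) < 0."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(maxiter):
        c = (a + b) / 2                    # c_k = (a_k + b_k) / 2
        if f(c) == 0 or (b - a) / 2 < tol:
            return c
        if f(a) * f(c) < 0:                # sign change in (a, c): keep the left half
            b = c
        else:                              # otherwise keep the right half
            a = c
    return (a + b) / 2

print(bisection(lambda x: x**2 - 2, 0.0, 2.0))   # ≈ sqrt(2)
```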
Use function evaluations, but more clever!
x^(k+1) = x^(k) − f(x^(k)) · (x^(k) − x^(k−1)) / (f(x^(k)) − f(x^(k−1)))
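A sketch of this secant iteration; the test function and the two starting values are illustrative.

```python
def secant(f, x0, x1, tol=1e-12, maxiter=50):
    """x^(k+1) = x^(k) - f(x^(k)) * (x^(k) - x^(k-1)) / (f(x^(k)) - f(x^(k-1)))."""
    for _ in range(maxiter):
        f0, f1 = f(x0), f(x1)
        if f1 - f0 == 0:
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

print(secant(lambda x: x**2 - 2, 1.0, 2.0))   # ≈ sqrt(2)
```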
Best Idea: Take the Tangent instead of the Secant.
At our current point x^(k): tangent (or model) equation
m_k(x) = f(x^(k)) + f′(x^(k)) (x − x^(k))
Setting m_k(x) = 0 and solving for x as the new iterate yields (for f′(x^(k)) ≠ 0) the Newton iteration
x^(k+1) = x^(k) − f(x^(k)) / f′(x^(k))
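The corresponding Newton iteration as a short sketch; f, f′, and the starting point are illustrative.

```python
def newton_1d(f, df, x0, tol=1e-12, maxiter=50):
    """x^(k+1) = x^(k) - f(x^(k)) / f'(x^(k))."""
    x = x0
    for _ in range(maxiter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

print(newton_1d(lambda x: x**2 - 2, lambda x: 2 * x, 1.5))   # ≈ sqrt(2)
```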
Multivariate View I: Newton in Optimisation
To minimize F : R^n → R, the Newton step d^(k) is determined by

∇²F(x^(k)) d^(k) = −∇F(x^(k))  ⇔  d^(k) = −(∇²F(x^(k)))^{−1} ∇F(x^(k))
For the nonlinear system F(x) = 0 with F : R^n → R^n, the linear model at x^(k) is
M_k(d) = F(x^(k)) + J_F(x^(k)) d
Again:
We set the model M k to zero, solve for the new direction d (k) = d , and
obtain
Solve  J_F(x^(k)) d^(k) = −F(x^(k))
Step   x^(k+1) = x^(k) + d^(k)
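A sketch of these two steps with a dense linear solver; the test system (a point on the unit circle with equal coordinates) is an illustrative choice.

```python
import numpy as np

def newton_system(F, JF, x0, tol=1e-12, maxiter=50):
    """Newton's method for F(x) = 0: solve J_F(x^(k)) d = -F(x^(k)), then x^(k+1) = x^(k) + d."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        d = np.linalg.solve(JF(x), -F(x))
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

# Illustrative system: F(x) = (x_1^2 + x_2^2 - 1, x_1 - x_2)
F = lambda x: np.array([x[0]**2 + x[1]**2 - 1, x[0] - x[1]])
JF = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])
print(newton_system(F, JF, [1.0, 0.5]))   # ≈ (0.7071, 0.7071)
```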
Then there exists r > 0 such that for any x^(0) ∈ B(x*; r) the Newton iteration is uniquely defined and converges to x* quadratically, i.e.
∥x^(k+1) − x*∥ ≤ C ∥x^(k) − x*∥²  for some constant C > 0.
Reasons for Modified Newton
Challenges
Variant I: Cyclic Updating
Evaluate Jacobian once J = JF (x (k) ) and use this matrix for the
iterations k, k + 1, . . . , k + p.
But this is less accurate, since we no longer solve the linear system with the current Jacobian (only at iteration k itself do we).
Variant II: Inexact Solution of the Linear System
Variant III: Replace JF by a difference approximation
Let j be a column index and h_j^(k) > 0 be a step size.
Compute the jth column by a finite difference approximation

(J_F(x^(k)))_j ≈ (J_h^(k))_j = (F(x^(k) + h_j^(k) e_j) − F(x^(k))) / h_j^(k)

▶ One can get linear convergence if 0 < |h_j^(k)| < h for some h.
▶ One can get/keep quadratic convergence if |h_j^(k)| ≤ C ∥x^(k) − x*∥ (or |h_j^(k)| ≤ c ∥F(x^(k))∥).
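A column-by-column forward-difference Jacobian as a sketch of Variant III; the step size choice h_j = sqrt(machine eps) · max(1, |x_j|) is a common heuristic, not prescribed by the slides.

```python
import numpy as np

def fd_jacobian(F, x):
    """Approximate J_F(x) column by column via (F(x + h_j e_j) - F(x)) / h_j."""
    x = np.asarray(x, dtype=float)
    Fx = F(x)
    J = np.zeros((Fx.size, x.size))
    for j in range(x.size):
        h_j = np.sqrt(np.finfo(float).eps) * max(1.0, abs(x[j]))   # heuristic step size (assumption)
        e_j = np.zeros(x.size)
        e_j[j] = 1.0
        J[:, j] = (F(x + h_j * e_j) - Fx) / h_j
    return J

F = lambda x: np.array([x[0]**2 + x[1]**2 - 1, x[0] - x[1]])
print(fd_jacobian(F, np.array([1.0, 0.5])))   # ≈ [[2, 1], [1, -1]], the exact Jacobian
```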
Changes due to the 3 Variants
This means
B_k d^(k) = −F(x^(k))
where Bk is
▶ “cheap to compute”
▶ approximates the Jacobian “well enough”
▶ is updated every iteration to improve approximation property.
Broyden’s Method – Derivation
The Multivariate Secant Equation
Broyden’s Method – Matrix Update
Given s (k) = x (k+1) − x (k) and y (k) = F (x (k+1) ) − F (x (k) ) as well as Bk we
compute
B_{k+1} = B_k + (y^(k) − B_k s^(k)) (s^(k))^T / ((s^(k))^T s^(k))

Then B_{k+1} fulfills the secant equation B_{k+1} s^(k) = y^(k).
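A sketch of Broyden's method with this update; starting with B_0 = J_F(x^(0)) is one common choice (the identity is another), and the test system is illustrative.

```python
import numpy as np

def broyden(F, x0, B0, tol=1e-10, maxiter=100):
    """Solve B_k d = -F(x^(k)), step x^(k+1) = x^(k) + d, then rank-one update of B_k."""
    x = np.asarray(x0, dtype=float)
    B = np.array(B0, dtype=float)
    Fx = F(x)
    for _ in range(maxiter):
        d = np.linalg.solve(B, -Fx)
        x_new = x + d
        F_new = F(x_new)
        s, y = x_new - x, F_new - Fx
        B = B + np.outer(y - B @ s, s) / (s @ s)   # B_{k+1} = B_k + (y - B_k s) s^T / (s^T s)
        x, Fx = x_new, F_new
        if np.linalg.norm(Fx) < tol:
            break
    return x

F = lambda x: np.array([x[0]**2 + x[1]**2 - 1, x[0] - x[1]])
B0 = np.array([[2.0, 1.0], [1.0, -1.0]])          # here: the exact Jacobian at x^(0)
print(broyden(F, [1.0, 0.5], B0))                 # ≈ (0.7071, 0.7071)
```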
The Minimiser of ∥B − Bk ∥F
Broyden’s Method
Convergence of Broyden’s Method
Under the same assumptions as for the convergence of Newton’s Method, and if additionally there exist ε, γ > 0 such that
▶ ∥x^(0) − x*∥ ≤ ε
▶ ∥B_0 − J_F(x*)∥ ≤ γ,
then the iterates converge to x* superlinearly, i.e.
∥x^(k+1) − x*∥ ≤ c_k ∥x^(k) − x*∥  with c_k → 0 as k → ∞.
(Unconstrained) Optimization
(Back to) Optimization
Direct Search Methods
Advantage.
They do not require the gradient ∇F but just evaluations of F .
Examples are Nelder-Mead or Hooke and Jeeves.
Disadvantage.
They might be very (very, very, ...) slow.
While we do not consider them here, what are two good reasons to
have/use these methods?
(Multivariate) Mean Value Theorem
Let
1. D be a convex domain in R^n
2. F ∈ C 1 (D; R)
3. α ∈ R
4. x, x + d ∈ D
Then there exists a 0 < θ < 1 such that with y = x + θαd we have
F (x + αd ) − F (x) = α∇F (y )T d .
Descent Methods
Idea: Take a descent direction d ∈ Rn , i.e. with d T ∇F (x) < 0
and “walk into this direction”.
This is also called a second order method, since it uses first and second
order derivatives
Beyond Broyden
Outlook: Limited Memory BFGS
Application: Optimisation over a 4K image (n ≈ 9 million pixels)
▶ F : R^{9·10^7} → R, hence a gradient ∇F : R^{9·10^7} → R^{9·10^7}
▶ Hessian: ∇²F : R^{9·10^7} → R^{9·10^7 × 9·10^7} (8.1 · 10^15 entries)
Features of BFGS.
There is a recursive formula for B_{k+1}^{−1} using B_k^{−1}, s^(k), y^(k)
⇒ a recursive formula for computing B_{k+1}^{−1} x^(k)
Solution.
Start with B_0 = I
Store only K ≪ N previous s^(k−K+1), . . . , s^(k) and y^(k−K+1), . . . , y^(k).
⇒ Limited Memory BFGS by applying the recursive formula K times (sketched below)
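A sketch of the two-loop recursion that realizes this: it applies the limited-memory inverse Hessian approximation to a gradient using only the K stored pairs. The initial scaling H_0 = ((s^(k))^T y^(k) / (y^(k))^T y^(k)) · I is a common choice, not fixed by the slides.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: return the quasi-Newton direction -H_k ∇F(x^(k)),
    where H_k approximates the inverse Hessian from the stored pairs
    s^(i) = x^(i+1) - x^(i) and y^(i) = ∇F(x^(i+1)) - ∇F(x^(i)) (oldest first)."""
    q = grad.copy()
    alphas, rhos = [], []
    for s, y in reversed(list(zip(s_list, y_list))):       # newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    if s_list:                                              # initial scaling H_0 = gamma * I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), alpha, rho in zip(zip(s_list, y_list), reversed(alphas), reversed(rhos)):
        beta = rho * (y @ q)                                # oldest to newest
        q += (alpha - beta) * s
    return -q

g = np.array([1.0, -2.0])
print(lbfgs_direction(g, [], []))   # with no stored pairs this is just -∇F, i.e. a gradient step
```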
Algorithm 2: Gradient (or Steepest) Descent
Given some x^(0) ∈ R^n and step sizes α_k (maybe just α_k = 1), iterate
x^(k+1) = x^(k) − α_k ∇F(x^(k)),   k = 0, 1, . . .
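A minimal sketch with a constant step size; the quadratic test function and α = 0.1 are illustrative (line searches for α_k follow below).

```python
import numpy as np

def gradient_descent(gradF, x0, alpha=0.1, tol=1e-8, maxiter=1000):
    """x^(k+1) = x^(k) - alpha * ∇F(x^(k)) with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        g = gradF(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

gradF = lambda x: np.array([1.0, 10.0]) * x    # gradient of F(x) = 1/2 (x_1^2 + 10 x_2^2)
print(gradient_descent(gradF, [1.0, 1.0]))     # ≈ (0, 0), the minimizer
```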
What is the best step size αk ?
Suppose we are at x^(k) and have chosen a descent direction d^(k).
Algorithm 3: Nonlinear Conjugate Gradient
The Gradient descent sometimes tends to “zig-zag” (depending on αk ).
Avoid by orthogonalizing directions d (k) (w.r.t. some scalar product).
For k = 0, 1, . . .
But even more: There exists a closed form for the “best” (most
decrease) step size.
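A sketch of a nonlinear CG iteration; the Fletcher–Reeves choice β_k = ∥∇F(x^(k+1))∥² / ∥∇F(x^(k))∥², the Armijo-type backtracking, and the restart safeguard are illustrative choices, since the slides do not fix them.

```python
import numpy as np

def nonlinear_cg_fr(F, gradF, x0, tol=1e-8, maxiter=200):
    """Nonlinear CG with Fletcher-Reeves update and simple backtracking (both assumptions)."""
    x = np.asarray(x0, dtype=float)
    g = gradF(x)
    d = -g
    for _ in range(maxiter):
        if np.linalg.norm(g) < tol:
            break
        if g @ d >= 0:                    # safeguard: restart with steepest descent
            d = -g
        alpha = 1.0
        while F(x + alpha * d) > F(x) + 1e-4 * alpha * (g @ d):   # Armijo-type backtracking
            alpha *= 0.5
        x = x + alpha * d
        g_new = gradF(x)
        beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves
        d = -g_new + beta * d
        g = g_new
    return x

F = lambda x: 0.5 * (x[0]**2 + 10 * x[1]**2)
gradF = lambda x: np.array([x[0], 10 * x[1]])
print(nonlinear_cg_fr(F, gradF, [1.0, 1.0]))   # close to the minimizer (0, 0)
```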
▶ The step size might be too small and we do not make progress
▶ The “gain” (or descent rate) F (x (k+1) ) − F (x (k) ) might be too small
Armijo Linesearch
Let σ ∈ (0, 1) be given. The Armijo condition for a step size α reads
F(x + αd) ≤ F(x) + σα∇F(x)^T d.
In Practice.
Use a backtracking parameter c ∈ (0, 1) and a starting step size s: try
α = s, α = cs, α = c²s, . . . until the Armijo condition is first fulfilled.
Improve by starting with s as the last found step size.
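A sketch of this backtracking; the values for σ, c, and the starting step size s are illustrative.

```python
import numpy as np

def armijo_backtracking(F, gradFx, x, d, s=1.0, sigma=1e-4, c=0.5, max_halvings=60):
    """Try alpha = s, c*s, c^2*s, ... until F(x + alpha*d) <= F(x) + sigma*alpha*∇F(x)^T d."""
    Fx = F(x)
    slope = gradFx @ d                # ∇F(x)^T d, assumed < 0 (descent direction)
    alpha = s
    for _ in range(max_halvings):
        if F(x + alpha * d) <= Fx + sigma * alpha * slope:
            break
        alpha *= c
    return alpha

F = lambda x: 0.5 * (x[0]**2 + 10 * x[1]**2)
x = np.array([1.0, 1.0])
g = np.array([1.0, 10.0])             # ∇F(x)
print(armijo_backtracking(F, g, x, -g))   # first accepted step size along -∇F(x)
```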
Welldefinedness of Armijo
Theorem
Let σ, c ∈ (0, 1) and s > 0 be given.
Then for every pair (x, d) ∈ R^n × R^n with ∇F(x)^T d < 0 there exists an α_0 > 0 such that
F(x + αd) ≤ F(x) + σα∇F(x)^T d
holds for all α ∈ [0, α_0].
Proof.
(Global) Convergence of Gradient Descent with Armijo
Linesearch
Theorem
If we perform gradient descent with Armijo linesearch to determine the step size α_k, then for the sequence x^(k) it holds:
Every cluster point x* of {x^(k)} is a stationary point, i.e. ∇F(x*) = 0.
Problem: The Step size might still get too small
The Curvature Condition
Then we look for a step size α that additionally fulfills the curvature
condition
∇F (x + αd )T d ≥ β∇F (x)T d
Wolfe Conditions
For 0 < σ < β < 1 we search for a step size α that fulfills both the Armijo condition
F(x + αd) ≤ F(x) + σα∇F(x)^T d
and the curvature condition
∇F(x + αd)^T d ≥ β∇F(x)^T d.
Strong Wolfe Conditions
For 0 < σ < β < 1 we search for a step size α that fulfills
F(x + αd) ≤ F(x) + σα∇F(x)^T d   and   |∇F(x + αd)^T d| ≤ β |∇F(x)^T d|.
Welldefinedness of (Strong) Wolfe conditions
Let d be a descent direction at x and assume that F is bounded from below along the ray {x + αd | α > 0}.
Then, if 0 < σ < β < 1, there exist intervals of step sizes satisfying the Wolfe conditions and the strong Wolfe conditions.
Proof.
Other descent direction methods
Gradient descent methods gained popularity in Machine Learning, where the (cost) function
F(x) = Σ_{i=1}^N F_i(x)
has a very large N.
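This motivates stochastic gradient methods: use ∇F_i for a single randomly chosen i (or a small batch) instead of the full sum ∇F = Σ_i ∇F_i. A sketch with illustrative least-squares terms F_i and a fixed step size:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 5
A = rng.standard_normal((N, n))
b = A @ np.ones(n) + 0.01 * rng.standard_normal(N)

# F(x) = sum_i F_i(x) with F_i(x) = 1/2 (a_i^T x - b_i)^2, so ∇F_i(x) = (a_i^T x - b_i) a_i
def grad_Fi(x, i):
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(n)
alpha = 0.01                     # fixed (illustrative) step size, no line search
for k in range(20000):
    i = rng.integers(N)          # pick one term at random per step
    x -= alpha * grad_Fi(x, i)
print(x)                         # ≈ (1, 1, 1, 1, 1), close to the least-squares solution
```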
Other descent direction methods III
There is a whole zoo of further methods in Machine Learning that all try to avoid a costly line search:
▶ AdaGrad: scale (adapt) each single gradient ∇F_i(x^(k)) by the combined norm of the gradient history
  ⇒ a different step size for every component of x^(k) (see the sketch after this list).
▶ Adadelta: avoids the strictly decreasing step sizes of AdaGrad
▶ Adam: adaptive moment estimation – also keeps track of the gradients themselves
▶ AdaMax: Adam with the maximum norm
▶ Nadam: Adam combined with Nesterov momentum
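A sketch of a single AdaGrad step, to illustrate the componentwise step size; the objective, α, and ε below are illustrative choices.

```python
import numpy as np

def adagrad_step(x, grad, G_accum, alpha=0.1, eps=1e-8):
    """AdaGrad: accumulate squared gradients componentwise and scale each component's step."""
    G_accum = G_accum + grad ** 2                       # history of squared gradients
    x = x - alpha * grad / (np.sqrt(G_accum) + eps)     # a different step size per component
    return x, G_accum

gradF = lambda x: np.array([x[0], 10 * x[1]])           # gradient of 1/2 (x_1^2 + 10 x_2^2)
x, G = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    x, G = adagrad_step(x, gradF(x), G)
print(x)   # both components have moved towards the minimizer (0, 0)
```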
Stopping criteria
For the last two (or three) we can distinguish two cases
Outlook M4: Further topics in questions.
As a teaser, a few topics one might continue with (but we do not here):
1. What if the function is convex?
⇒ Convex Analysis, Duality
2. What if the function is not smooth, but has jumps?
3. What if there are constraints?
4. What if the function is defined on a Riemannian Manifold?