
Numerical Mathematics

Norwegian University of Science and Technology

# 4: Nonlinear Systems and Optimization

Ronny Bergmann

Department of Mathematical Sciences, NTNU.

September 11, 2024


Nonlinear Systems of Equations & Optimisation
Let F : Rn ⊃ D → Rm. Then we can look for:

Univariate Optimization (m = n = 1)
Notation: f : R → R
Find a point x ∗ ∈ R such that f (x ∗ ) is minimal

Multivariate Optimisation (n ∈ N, m = 1)
Notation: F : Rn → R
Find a point x ∗ ∈ Rn such that F (x ∗ ) is minimal

Solving Nonlinear Systems of Equations (usually m = n)
Find a point x ∗ such that F (x ∗ ) = 0

Let C k (D; Rm ) denote the set of k-times continuously differentiable
functions F : D → Rm on D ⊂ Rn .
We write in short C k (D) = C k (D; Rn ) for m = n.
Local vs. Global Minimizer

Let F : Rn → R be given. We call

A point x ∗ a global minimizer if

F (x ∗ ) ≤ F (x) for all x ∈ Rn

A point x ∗ a local minimizer if there exists an open neighborhood N
“around” x ∗ such that

F (x ∗ ) ≤ F (x) for all x ∈ N
Strict Local vs. Global Minimizer

Let F : Rn → R be given. We call

A point x ∗ a strict global minimizer if

F (x ∗ ) < F (x) for all x ∈ Rn

A point x ∗ a strict local minimizer if there exists an open neighborhood N
“around” x ∗ such that

F (x ∗ ) < F (x) for all x ∈ N
Gradient

For a function F ∈ C 1 (D; R), D ⊆ Rn , we denote the vector of partial
derivatives by ∇F : Rn → Rn . It is defined as

∇F (x) = ( ∂F/∂x1 (x), . . . , ∂F/∂xn (x) )T ,   x ∈ Rn ,

and called the gradient of F .

⇒ Compute directional derivatives

∂F/∂d (x) = lim α→0 ( F (x + αd ) − F (x) ) / α = ∇F (x)T d
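As a quick numerical sanity check (not from the slides), one can compare the difference quotient with ∇F (x)T d for a simple F ; the test function, point and direction below are illustrative choices.

```python
import numpy as np

def F(x):
    # illustrative test function F(x) = x1^2 + 3 x2
    return x[0]**2 + 3.0 * x[1]

def grad_F(x):
    # its gradient (2 x1, 3)^T
    return np.array([2.0 * x[0], 3.0])

x = np.array([1.0, 2.0])
d = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit direction
alpha = 1e-6
fd = (F(x + alpha * d) - F(x)) / alpha    # difference quotient
exact = grad_F(x) @ d                     # directional derivative grad F(x)^T d
# fd and exact agree up to O(alpha)
```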

Jacobian and Hessian Matrix
For a function F ∈ C 1 (D), D ⊆ Rn , i. e. F : Rn → Rn ,
or F = (F1 , . . . , Fn )T with Fi ∈ C 1 (D; R), we call the matrix

JF (x) = ( ∂Fi /∂xj (x) ) i,j=1,...,n ∈ Rn×n ,   x ∈ Rn ,

the Jacobian of F .
⇒ ith row is the gradient ∇Fi of ith component function Fi of F

Special case: for F ∈ C 2 (Rn ; R) the gradient F = ∇F is in C 1 (Rn ). We obtain
the Hessian matrix, denoted by ∇2 F (x) or HF (x), given by

∇2 F (x) = J∇F (x) = ( ∂ 2 F/(∂xi ∂xj ) (x) ) i,j=1,...,n
Open Balls & Convexity

Let x ∈ Rn and R > 0. The open ball of radius R centered at x is defined as

B(x; R) = {y ∈ Rn | ∥y − x∥ < R}

A set C ⊂ Rn is called convex if for all x, y ∈ C the connecting line


segment is contained in C .
Formally

tx + (1 − t)y = x + (1 − t)(y − x) ∈ C for all t ∈ [0, 1]

Taylor Expansion
Let F ∈ C 1 (Rn ; R). Then we have the zeroth order Taylor expansion,
i. e. for x, p ∈ Rn that

F (x + p) = F (x) + ∇F (x + tp)T p   for some t ∈ (0, 1).

If F ∈ C 2 (Rn ; R) we even get the first order Taylor expansion

F (x + p) = F (x) + ∇F (x)T p + 1/2 pT ∇2 F (x + tp)p   for some t ∈ (0, 1).

We also write this as

F (x + p) = F (x) + ∇F (x)T p + O(∥p∥2 )
First Order Necessary (Optimality) Conditions
Theorem (First Order Necessary Conditions)
Let x ∗ be a local minimizer and F ∈ C 1 (B(x ∗ ; R); R) for a suitable R > 0.
Then ∇F (x ∗ ) = 0.

Proof.
Blackboard/Note

Stationary Points

Any point x ∗ with ∇F (x ∗ ) = 0 is called a stationary point. Due to the first
order necessary conditions:

Any local minimizer must be a stationary point.

This relates minimization

arg min F (x)
x∈Rn

to solving nonlinear systems of equations: setting F (x) = ∇F (x), we
look for candidates of minimizers by solving

F (x) = 0.
Second Order Necessary and Sufficient Conditions
(without proofs here)

Theorem (Second Order Necessary Conditions)


Let x ∗ be a local minimizer of F : Rn → R and suppose that ∇2 F exists and is
continuous in some B(x ∗ ; R) for a suitable R > 0.
Then ∇F (x ∗ ) = 0 and ∇2 F (x ∗ ) is positive semidefinite.

Again: Only these are candidates for minimizers

Theorem (Second Order Sufficient Conditions)


Let F : Rn → R be in C 2 (B(x ∗ ; R); R) around x ∗ ∈ Rn for a suitable R > 0.
Let ∇F (x ∗ ) = 0 and ∇2 F (x ∗ ) be positive definite.
Then x ∗ is a strict local minimizer of F .
Summary

We can take two viewpoints here


Minimization.
To minimize a function F : Rn → R: we are looking for
a stationary point x ∗ , i. e.
∇F (x ∗ ) = 0

Solving nonlinear Systems.


To solve for a function F : Rn → Rn : we are looking for a solution x ∗ , i. e.

F (x ∗ ) = 0

Special case: Minimization for F = ∇F .

General iterative Scheme

Idea (again): reformulate F (x) = 0 (or F (x) → min)
into a fixed point equation x = G (x).
Here: in a way that we get update directions.

Let x (0) ∈ Rn be a starting point. For each k = 0, 1, 2, . . .

1. Determine an update direction d (k) ∈ Rn
2. Determine a step size αk > 0
3. Update x (k+1) = G (x (k) ) = x (k) + αk d (k)

And we have to specify a suitable stopping criterion.

Contractive Mapping

A mapping G : D ⊂ Rn → Rn is called contractive on D0 ⊂ D


if there exists α < 1 such that

∥G (x) − G (y )∥ ≤ α∥x − y ∥ for all x, y ∈ D0 .

Here ∥·∥ is a suitable vector norm.

Convergence (Fixed Point Theorem, QSS Theorem 7.2)

Let G : D ⊂ Rn → Rn be a contractive mapping on the closed set D0 ⊂ D
and let G (x) ∈ D0 for all x ∈ D0 .

Then G has a unique fixed point in D0 .

Even more: for every x (0) ∈ D0 the iteration procedure

x (k+1) = G (x (k) )

converges to this fixed point.
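The iteration procedure can be sketched in a few lines of Python; the map G (x) = cos x (a contraction near its fixed point) and the tolerance are illustrative choices, not from the lecture.

```python
import numpy as np

def fixed_point_iteration(G, x0, tol=1e-12, max_iter=1000):
    # iterate x^(k+1) = G(x^(k)) until successive iterates are close
    x = x0
    for _ in range(max_iter):
        x_new = G(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = fixed_point_iteration(np.cos, 1.0)
# x_star satisfies cos(x_star) = x_star
```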

Proof of the Fixed Point Theorem

Rate of Convergence

An iterative method is called locally convergent of order p to x ∗ ∈ Rn
if there exists a constant C (r ) for some r > 0 such that for all x (0) with
∥x (0) − x ∗ ∥ ≤ r it holds

∥x (k+1) − x ∗ ∥ ≤ C (r )∥x (k) − x ∗ ∥p

in QSS only implicitly defined for p = 2 in Theorem 7.1, 1-dim. case: Def. 6.1

Most important cases

▶ p = 2 – quadratic convergence. The number of correct digits roughly
doubles every iteration.
▶ p = 1 – linear convergence (if C (r ) < 1)
Examples

Examples: An Outlook

A (grayscale) image is just a matrix Y ∈ Rm×n , Yij ∈ [0, 1]
⇒ y = vec(Y ) ∈ Rmn .

But measurements might be noisy.

Model idea:
Find a matrix (an image) x
▶ “close to” y
▶ with “similar” neighbor pixels

[Figures: some nice image; measured with noise]
Example: An Outlook

▶ x “close to” y → use 1/2 ∥x − y ∥22
▶ I set of pixel indices
▶ Ni neighbors of pixel i
▶ “similar” neighbors: Σi∈I Σj∈Ni |xi − xj |
▶ tradeoff/weight λ > 0:

F (x) = 1/2 ∥x − y ∥22 + λ Σi∈I Σj∈Ni |xi − xj |

▶ Minimize! (Obs! highdimensional & nonsmooth)

[Figures: measured with noise; a (simple) reconstruction]
A good direction d in the iterative scheme

To find a good (scaled) direction αd : first order Taylor:

F (x + αd ) = F (x) + α∇F (x)T d + 1/2 α2 d T ∇2 F (x + td )d   for some t ∈ (0, α)

The rate of change is the coefficient of α, namely ∇F (x)T d .

We call d a descent direction if ∇F (x)T d < 0.

For any descent direction d (and F ∈ C 1 ) there exists a (small enough) α
such that
F (x + αd ) < F (x)
if we are not in a minimum already.
Find the best descent direction!

If we are just interested in the direction, let’s take ∥d ∥ = 1.

The best descent direction is

arg min d T ∇F (x) subject to ∥d ∥ = 1
d

If we denote by θ the angle between d and ∇F (x) we get

d T ∇F (x) = ∥d ∥∥∇F (x)∥ cos θ = ∥∇F (x)∥ cos θ

Minimum: cos θ = −1, i. e. θ = π, i. e. the

steepest descent direction: d = −∇F (x)/∥∇F (x)∥

Newton’s Method
Alternative Idea: Use more/other “Information”
Instead of “just” using the gradient information, we could use higher
order information:
In Optimization: the Hessian matrix ∇2 F (or in 1D: the second derivative f ′′ )

Recap: The one-dimensional case: Looking for f (x) = 0

Algorithm 1: Bisection.
Given a, b ∈ R such that f (a)f (b) < 0 ⇒ ∃α ∈ (a, b) : f (α) = 0
Idea: “Divide and Conquer”:
1. set a0 = a, b0 = b
2. Repeat for k = 0, 1, . . .
2.1 compute ck = (ak + bk )/2
2.2 if f (ak )f (ck ) < 0 set ak+1 = ak and bk+1 = ck
2.3 if f (ak )f (ck ) > 0 set ak+1 = ck and bk+1 = bk
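A minimal Python sketch of Algorithm 1; the tolerance-based stopping rule is an added assumption (the slide leaves the stopping criterion open).

```python
def bisection(f, a, b, tol=1e-12):
    # assumes a sign change f(a) f(b) < 0 on [a, b]
    assert f(a) * f(b) < 0, "f must change sign on [a, b]"
    while b - a > tol:
        c = 0.5 * (a + b)        # midpoint c_k = (a_k + b_k)/2
        if f(a) * f(c) < 0:      # root in [a_k, c_k]
            b = c
        else:                    # root in [c_k, b_k]
            a = c
    return 0.5 * (a + b)

root = bisection(lambda x: x**2 - 2.0, 0.0, 2.0)   # approximates sqrt(2)
```

Each step halves the interval, so the error decreases linearly with rate 1/2.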

Use function evaluations, but more clever!

Idea: Take the line

l(t) = 1/(b − a) ( (b − a − t)f (a) + tf (b) ),   t ∈ [0, b − a]

⇒ solving l(t) = 0 for t yields t = −(a − b)/(f (a) − f (b)) · f (a)
⇒ Choose the point c not as the mid point but as c = a + t

Algorithm 2: Secant information.
For some x (0) , x (1) compute for k = 1, 2, . . .

x (k+1) = x (k) − f (x (k) ) (x (k) − x (k−1) )/(f (x (k) ) − f (x (k−1) ))
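A sketch of Algorithm 2 in Python; the stopping rule and the iteration cap are added assumptions.

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    # x^(k+1) = x^(k) - f(x^(k)) (x^(k) - x^(k-1)) / (f(x^(k)) - f(x^(k-1)))
    for _ in range(max_iter):
        fx0, fx1 = f(x0), f(x1)
        if fx1 == fx0:               # avoid division by zero
            break
        x2 = x1 - fx1 * (x1 - x0) / (fx1 - fx0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

root = secant(lambda x: x**2 - 2.0, 1.0, 2.0)   # approximates sqrt(2)
```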

Best Idea: Take the Tangent instead of the Secant.
At our current point x (k) : tangent (or model) equation

mk (x) = f (x (k) ) + f ′ (x (k) )(x − x (k) )

(again): first order Taylor (leave out the remainder term).

If f ′ (x (k) ) = 0 : stationary point!

Otherwise: setting mk (x) = 0 and solving for x as the new iterate yields the

Newton iteration

x (k+1) = x (k) − f (x (k) )/f ′ (x (k) )
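The Newton iteration can be sketched as follows; the stopping rule and the explicit derivative argument are added assumptions.

```python
def newton_1d(f, fprime, x0, tol=1e-12, max_iter=50):
    # Newton iteration x^(k+1) = x^(k) - f(x^(k)) / f'(x^(k))
    x = x0
    for _ in range(max_iter):
        fp = fprime(x)
        if fp == 0.0:            # stationary point: stop
            break
        x_new = x - f(x) / fp
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

root = newton_1d(lambda x: x**2 - 2.0, lambda x: 2.0 * x, 1.0)
```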
Multivariate View I: Newton in Optimisation
To minimize F : Rn → R

Determine the direction d from second-order Taylor as

F (x (k) + d ) ≈ F (x (k) ) + d T ∇F (x (k) ) + 1/2 d T ∇2 F (x (k) )d =: Mk (d )

Minimizing Mk , i.e. solving ∇Mk (d ) = 0
(assuming ∇2 F (x (k) ) is positive definite)

yields

∇2 F (x (k) )d (k) = −∇F (x (k) ) ⇔ d (k) = −(∇2 F (x (k) ))−1 ∇F (x (k) )

Then set x (k+1) = x (k) + d (k)


Multivariate View II: Nonlinear Systems of Equations
To solve F (x) = 0 at iterate x (k) :
Model equation given by first-order Taylor
(the model Mk here plays the role of ∇Mk from the last slide)

Mk (d ) = F (x (k) ) + JF (x (k) )d

Again: we set the model Mk to zero, solve for the new direction
d (k) = d , and obtain
Solve JF (x (k) )d (k) = −F (x (k) )
Step x (k+1) = x (k) + d (k)
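The two steps can be sketched directly in Python; the example system (x1² + x2² = 1, x1 = x2), the starting point and the tolerance are illustrative choices.

```python
import numpy as np

def newton_system(F, JF, x0, tol=1e-12, max_iter=50):
    # solve F(x) = 0: each step solves J_F(x^(k)) d = -F(x^(k)), then updates
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.solve(JF(x), -F(x))   # linear system in every step
        x = x + d
        if np.linalg.norm(d) < tol:
            return x
    return x

F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
JF = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
sol = newton_system(F, JF, [1.0, 0.5])   # converges to (1/sqrt(2), 1/sqrt(2))
```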

Note. In every Newton step, we have to solve a linear system of


equations.
Convergence of Multivariate Newton
Theorem (QSS, Theorem 7.1)
Let F : Rn → Rn be a C 1 function in a convex open set D of Rn that contains
x ∗ with F (x ∗ ) = 0. Suppose that JF−1 (x ∗ ) exists and that there exist positive
constants R, C and L such that
▶ C = ∥JF−1 (x ∗ )∥
▶ JF is locally Lipschitz, i.e.
∥JF (x) − JF (y )∥ ≤ L∥x − y ∥ for all x, y ∈ B(x ∗ , R),
where ∥·∥ denotes a vector norm and a consistent matrix norm.

Then there exists r > 0 such that for any x (0) ∈ B(x ∗ , r ) the Newton
iteration is uniquely defined and converges to x ∗ with

∥x (k+1) − x ∗ ∥ ≤ CL∥x (k) − x ∗ ∥2

Note: The assumptions also make x (k) well defined.


Proof Convergence of Multivariate Newton

Reasons for Modified Newton

Challenges

▶ We have to evaluate JF (x (k) )


▶ We have to solve a linear system JF (x (k) )d = −F (x (k) )

This can be time consuming!

Variant I: Cyclic Updating

Evaluate the Jacobian once, J = JF (x (k) ), and use this matrix for the
iterations k, k + 1, . . . , k + p.

▶ Fewer evaluations of the Jacobian
▶ Efficiency by reusing an LU decomposition J = LU ,
i.e. solve Lc = −F (x (k+i) ), Ud = c, x (k+i+1) = x (k+i) + d
⇒ reduces the cost per step from O(n3 ) to O(n2 )

But this is less accurate, since we do not solve the original linear system
(only for iteration k we do).
Variant II: Inexact Solution of the Linear System

Motivation: The matrix is inexact already, so why solve the system exactly?

⇒ Use Jacobi, Gauss-Seidel, SOR


Or Gradient Descent (GD) or (preconditioned) Conjugate Gradient (pCG)

...to some precision.

Variant III: Replace JF by a difference approximation

Let j be a column index and hj(k) > 0 be a step size.
Compute the jth column by a finite difference approximation

(JF (x (k) ))j ≈ (Jh(k) )j = ( F (x (k) + hj(k) e j ) − F (x (k) ) ) / hj(k)

where e j is the jth canonical unit vector.

▶ One can get linear convergence if 0 < |hj(k) | < h for some h.
▶ One can get/keep quadratic convergence if
|hj(k) | ≤ C ∥x (k) − x ∗ ∥ (or |hj(k) | ≤ c∥F (x (k) )∥)
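A minimal sketch of the column-wise approximation; the fixed step h and the test function are illustrative simplifications of the per-column step sizes hj(k).

```python
import numpy as np

def jacobian_fd(F, x, h=1e-7):
    # approximate J_F(x) column by column via forward differences
    x = np.asarray(x, dtype=float)
    Fx = F(x)
    J = np.empty((Fx.size, x.size))
    for j in range(x.size):
        e_j = np.zeros(x.size)
        e_j[j] = 1.0                        # jth canonical unit vector
        J[:, j] = (F(x + h * e_j) - Fx) / h
    return J

F = lambda x: np.array([x[0]**2 + x[1], np.sin(x[1])])
J = jacobian_fd(F, [1.0, 0.0])
# exact Jacobian at (1, 0) is [[2, 1], [0, 1]]
```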

Changes due to the 3 Variants

▶ all 3 methods will affect the convergence rate


▶ but the limit point is unchanged

This means

▶ it converges slower (more iterations required), but


▶ each single iteration is (much) faster


Quasi Newton Methods


Broyden’s Method – Motivation

Evaluating the Jacobian JF (x (k) ) is expensive.
⇒ replace it with some matrix Bk and solve

Bk d (k) = −F (x (k) )

where Bk
▶ is “cheap to compute”
▶ approximates the Jacobian “well enough”
▶ is updated every iteration to improve the approximation.
Broyden’s Method – Derivation

The Multivariate Secant Equation

From JF (x (k) )(x (k+1) − x (k) ) ≈ F (x (k+1) ) − F (x (k) ) we

▶ introduce s (k) = x (k+1) − x (k)


▶ introduce y (k) = F (x (k+1) ) − F (x (k) )
▶ and assume that we already used Bk to get x (k+1)

Then we would like our next approximation Bk+1 to fulfill


the secant equation
Bk+1 s (k) = y (k) .
Problem. That is only n equations, but Bk+1 has n2 unknowns.
But. We can look for “even nicer” Bk+1 .

Broydens Method – Matrix Update
Given s (k) = x (k+1) − x (k) and y (k) = F (x (k+1) ) − F (x (k) ) as well as Bk , we
compute

Bk+1 = Bk + ( (y (k) − Bk s (k) )(s (k) )T ) / ( (s (k) )T s (k) )

Then

1. It fulfills the secant equation Bk+1 s (k) = y (k)
2. It is “the best” in the sense that it minimises ∥B − Bk ∥F among all B
that fulfil the equation from 1.
The Minimiser of ∥B − Bk ∥F

Broyden’s Method

Choose x (0) and some nonsingular B0 (e.g. B0 = JF (x (0) )).

Then we repeat for k = 0, 1, . . . ,

1. Solve Bk d (k) = −F (x (k) ) for d (k)


2. Set x (k+1) = x (k) + αk d (k) (we still do αk = 1 for now)
3. Compute s (k) = x (k+1) − x (k)
4. Compute y (k) = F (x (k+1) ) − F (x (k) )
5. Update Bk+1 = Bk + ( (y (k) − Bk s (k) )(s (k) )T ) / ( (s (k) )T s (k) )
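The five steps above can be sketched as follows; the test system and the choice B0 = JF (x (0) ) (given here as an explicit matrix) are illustrative.

```python
import numpy as np

def broyden(F, x0, B0, tol=1e-10, max_iter=100):
    # Broyden's method with alpha_k = 1 and a rank-1 update of B_k
    x = np.asarray(x0, dtype=float)
    B = np.array(B0, dtype=float)
    Fx = F(x)
    for _ in range(max_iter):
        d = np.linalg.solve(B, -Fx)               # step 1
        x_new = x + d                             # step 2 (alpha_k = 1)
        F_new = F(x_new)
        s, y = x_new - x, F_new - Fx              # steps 3 and 4
        B = B + np.outer(y - B @ s, s) / (s @ s)  # step 5
        x, Fx = x_new, F_new
        if np.linalg.norm(Fx) < tol:
            return x
    return x

F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
B0 = np.array([[2.0, 1.0], [1.0, -1.0]])   # J_F at x^(0) = (1, 0.5)
sol = broyden(F, [1.0, 0.5], B0)
```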

Convergence of Broyden’s Method
Under the same assumptions as for the convergence of Newton’s
Method, and if additionally there exist ε, γ > 0 such that

▶ ∥x (0) − x ∗ ∥ ≤ ε
▶ ∥B0 − JF (x ∗ )∥ ≤ γ

the sequence of iterates x (k) of Broyden’s Method is well-defined
and converges super-linearly to x ∗ , i. e.

∥x (k+1) − x ∗ ∥ ≤ ck ∥x (k) − x ∗ ∥

where the constants ck are such that limk→∞ ck = 0.

One can also prove (under further assumptions)
that Bk converges to JF (x ∗ ).

(Unconstrained) Optimization
(Back to) Optimization

We look for a (global) minimizer of F : Rn → R

x ∗ = arg min F (x)


x∈Rn

We already saw local/global (strict) minimisers.

This is called unconstrained optimization;
imposing further constraints on x, for example x ≥ 0, leads to constrained
optimization.
⇒ TMA4180 Optimization I (Spring 2024)
Direct Search Methods

Advantage.
They do not require the gradient ∇F but just evaluations of F .
Examples are Nelder-Mead or Hooke and Jeeves.

Disadvantage.
They might be very (, very,...) slow.

While we do not consider them here, what are two good reasons to
have/use these methods?

(Multivariate) Mean Value Theorem

Let

1. D be a convex domain in Rn
2. F ∈ C 1 (D; R)
3. α ∈ R
4. x, x + d ∈ D

Then there exists a 0 < θ < 1 such that with y = x + θαd we have

F (x + αd ) − F (x) = α∇F (y )T d .

⇒ For a descent direction: F (x + αd ) < F (x).

Descent Methods
Idea: Take a descent direction d ∈ Rn , i.e. with d T ∇F (x) < 0
and “walk into this direction”.

Start with x (0) ∈ Rn .


For k = 0, . . . , (until convergence)

1. Choose a descent direction d (k) , i.e. such that (d (k) )T ∇F (x (k) ) < 0
2. Choose a step size αk > 0 small enough that
F (x (k) + αk d (k) ) < F (x (k) )
3. and set x (k+1) = x (k) + αk d (k)

Idea for a stopping criterion: ∥∇F (x (k) )∥ is smaller than some tolerance.

But how to actually choose d and α?
Algorithm 1: Newton’s Method

Assume that F ∈ C 2 (D; R) and choose αk = 1


(set F = ∇F in the last section)

Then JF (x) = HF (x) and d (k) = −HF (x (k) )−1 ∇F (x (k) ).


From sufficient condition ⇒ locally around x ∗ : HF positive definite!

⇒ ∇F (x (k) )T d (k) = −∇F (x (k) )T HF−1 (x (k) )∇F (x (k) ) < 0.

This is also called a second order method, since it uses first and second
order derivatives

Same approaches as before with Broyden: Bk (sym.) to approximate HF .

Beyond Broyden

▶ Broyden’s method: Bk+1 fulfills the secant equation Bk+1 s (k) = y (k)
▶ but the Hessian ∇2 F (x (k) ) is even more: symmetric (& pos. definite)

The BFGS update, named after Broyden, Fletcher, Goldfarb, and Shanno:

Bk+1 = Bk − ( Bk s (k) (s (k) )T BkT ) / ( (s (k) )T Bk s (k) ) + ( y (k) (y (k) )T ) / ( (y (k) )T s (k) )

▶ the update is a symmetric matrix. ⇒ Bk+1 is symmetric (if Bk is)


▶ the update keeps positive definiteness.
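The update formula itself is a two-liner; the example s, y below are illustrative and satisfy the curvature condition (y (k) )T s (k) > 0, which is what preserves positive definiteness. The code assumes B is symmetric (so BᵀS = Bs).

```python
import numpy as np

def bfgs_update(B, s, y):
    # BFGS update: B - (B s s^T B^T)/(s^T B s) + (y y^T)/(y^T s),
    # written for symmetric B (then B^T s = B s)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

B = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([2.0, 1.0])
B_new = bfgs_update(B, s, y)
# B_new is symmetric and fulfills the secant equation B_new @ s == y
```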

Outlook: Limited Memory BFGS
Application: Optimisation on a 4K image (n ≈ 9 million pixels)

▶ F : Rn → R with n ≈ 9 · 106 , hence a gradient ∇F : Rn → Rn
▶ Hessian: ∇2 F (x) ∈ Rn×n , i.e. about 8.1 · 1013 entries

This is not feasible in memory.

Features of BFGS.
There is a recursive formula for the inverse Bk+1−1 in terms of Bk−1 , s (k) , y (k)
⇒ a recursive formula for applying Bk+1−1 to a vector

Solution.
Start with B0 = I .
Store only K ≪ n previous s (k−K +1) , . . . , s (k) and y (k−K +1) , . . . , y (k) .
⇒ Limited Memory BFGS: apply the recursive formula K times.
Algorithm 2: Gradient (or Steepest) Descent

We already learned d = −∇F (x) is the steepest descent direction.

Given some x (0) ∈ Rn and some stepsizes αk (maybe just again 1).

Then again for k = 0, 1, . . .

1. Choose d (k) = −∇F (x (k) )


2. Set x (k+1) = x (k) + αk d (k)
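As a minimal sketch with a fixed step size; the quadratic test function, step size and tolerance are illustrative choices.

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha=0.1, tol=1e-10, max_iter=100000):
    # fixed-step gradient descent: x^(k+1) = x^(k) - alpha * grad F(x^(k))
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -grad_F(x)                # steepest descent direction
        if np.linalg.norm(d) < tol:   # (approximately) stationary
            return x
        x = x + alpha * d
    return x

# illustrative quadratic F(x) = (x1 - 1)^2 + 2 x2^2 with minimizer (1, 0)
grad_F = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
x_min = gradient_descent(grad_F, [0.0, 1.0])
```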

What is the best step size αk ?
Suppose we are at x (k) and have chosen a descent direction d (k) .

What is the best step size we can choose?

arg minα∈R ϕ(α) = arg minα∈R F (x (k) + αd (k) )

In practice. Not really feasible


But. In theory, this choice has a nice feature:
Theorem
For a descent method, let x (k) , d (k) be given and αk = arg minα∈R ϕ(α).
Then
∇F (x (k+1) )T d (k) = 0

Algorithm 3: Nonlinear Conjugate Gradient
The Gradient descent sometimes tends to “zig-zag” (depending on αk ).
Avoid by orthogonalizing directions d (k) (w.r.t. some scalar product).
For k = 0, 1, . . .

1. Set d (k) = −∇F (x (k) ) + βk d (k−1)


2. Set x (k+1) = x (k) + αk d (k)

where βk has different variants, e.g.

Hestenes–Stiefel: βkHS = ∇F (x (k) )T (∇F (x (k) ) − ∇F (x (k−1) )) / ( (∇F (x (k) ) − ∇F (x (k−1) ))T d (k−1) )

Fletcher–Reeves: βkFR = ∇F (x (k) )T ∇F (x (k) ) / ( ∇F (x (k−1) )T ∇F (x (k−1) ) )

and quite a few more.
Interlude: (Linear) CG

If F is quadratic, e. g. F (x) = 1/2 x T Ax − b T x for A s.p.d.,
then the gradient ∇F (x) = Ax − b is linear (affine) in x.

Then Fletcher–Reeves yields directions d (0) , d (1) , . . .
that are conjugate (orthogonal) with respect to (v , w )A = v T Aw .
This yields the update direction as in Conjugate Gradient (Assignment 3)

But even more: There exists a closed form for the “best” (most
decrease) step size.

Question. How many steps does this method need?


Line Search Methods


Introduction
Since finding the minimizer of ϕ(α) = F (x (k) + αd (k) ) is not feasible, we
would like to have other methods that work well enough, i.e. such that

F (x (k+1) ) < F (x (k) )

Question. What are the challenges here?

▶ The step size might be too small and we do not make progress
▶ The “gain” (or descent rate) F (x (k+1) ) − F (x (k) ) might be too small

In the following: We are at point x, have found a descent direction d


and we are looking for a step size α.
Illustration for having Sufficient Decrease

Armijo Linesearch
Let σ ∈ (0, 1) be given. The Armijo condition for a step size α reads

F (x + αd ) ≤ F (x) + σα∇F (x)T d

In terms of the helping 1D function this reads

ϕ(α) ≤ ϕ(0) + σαϕ′ (0)

In Practice.
Use a backtracking parameter c ∈ (0, 1) and a starting step size s: try
α = s, α = cs, α = c 2 s, . . . until the Armijo condition is first fulfilled.
Improve by starting with s as the last found step size.
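The backtracking strategy can be sketched as follows; the parameter values σ, c, s and the quadratic test function are illustrative choices.

```python
import numpy as np

def armijo_step(F, grad_Fx, x, d, sigma=1e-4, c=0.5, s=1.0, max_backtracks=50):
    # try alpha = s, cs, c^2 s, ... until the Armijo condition holds
    Fx = F(x)
    slope = grad_Fx @ d               # phi'(0) = grad F(x)^T d, must be < 0
    alpha = s
    for _ in range(max_backtracks):
        if F(x + alpha * d) <= Fx + sigma * alpha * slope:
            return alpha              # Armijo condition fulfilled
        alpha *= c
    return alpha

F = lambda x: x[0]**2 + 4.0 * x[1]**2
x = np.array([1.0, 1.0])
g = np.array([2.0, 8.0])              # grad F at x
alpha = armijo_step(F, g, x, -g)      # step along the steepest descent direction
```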
Welldefinedness of Armijo

Theorem
Let σ, c ∈ (0, 1) and s > 0 be given.
Then for every pair (x, d ) ∈ Rn × Rn with ∇F (x)T d < 0 there exists an
α0 > 0 such that
F (x + αd ) ≤ F (x) + σα∇F (x)T d
holds for all α ∈ [0, α0 ].

Proof.

(Global) Convergence of Gradient Descent with Armijo
Linesearch
Theorem
When performing Gradient descent with Armijo Linesearch to determine the
step size αk , then for the sequence x (k) it holds:
Every cluster point x ∗ of {x (k) } is a stationary point, i. e. ∇F (x ∗ ) = 0.

One can further show:

▶ x ∗ cannot be a local maximum


▶ all cluster points have the same function value
▶ in an area with positive definite Hessian, the whole sequence
converges.

Problem: The Step size might still get too small

The Curvature Condition

or: Avoiding too small step sizes

Let β ∈ (σ, 1) be given, where σ is the constant from the Armijo


condition.

Then we look for a step size α that additionally fulfills the curvature
condition
∇F (x + αd )T d ≥ β∇F (x)T d

Wolfe Conditions

Combining both criteria, we obtain the Wolfe conditions: Let x, d be


given.

For 0 < σ < β < 1 we search for a step size α that fulfills

F (x + αd ) ≤ F (x) + σα∇F (x)T d


∇F (x + αd )T d ≥ β∇F (x)T d

But our choice α̂ might still be far away from a minimizer α∗ of ϕ.

Strong Wolfe Conditions

Combining both criteria, we obtain the strong Wolfe conditions: Let x, d


be given.

For 0 < σ < β < 1 we search for a step size α that fulfills

F (x + αd ) ≤ F (x) + σα∇F (x)T d


|∇F (x + αd )T d | ≤ β|∇F (x)T d |

We additionally disallow ϕ′ (α) to become large and positive. Since ϕ′ (0) < 0
(for small α the slope is negative), we hence stay closer to stationary
points of ϕ.

Welldefinedness of (Strong) Wolfe conditions

Let F ∈ C 1 , a point x, and a descent direction d at x be given. Assume that
F is bounded from below on the ray {x + αd : α > 0}.

Then, if 0 < σ < β < 1, there exist intervals of step sizes satisfying the
Wolfe conditions and the strong Wolfe conditions.

Proof.

Other descent direction methods
Gradient descent methods gained popularity in Machine Learning, where

the (cost) function F (x) = Σi=1,...,N Fi (x) has a very large N

⇒ step size α fixed (“learning rate”)

Algorithm 4. Stochastic GD. Choose one random index i ∈ {1, . . . , N} and set

d (k) = −∇Fi (x (k) )

Algorithm 5. Minibatch (group-stochastic) GD.
Choose a (small) random subset I ⊂ {1, . . . , N} and compute

d (k) = − Σi∈I ∇Fi (x (k) )
Other descent direction methods II

Algorithm 6. Momentum – Fix a momentum term γ and compute

d (k) = −∇F (x (k) ) + γd (k−1)

this could be seen as a simplified CG.


Idea: Like a ball rolling down a hill “keep momentum”.

Algorithm 7. Nesterov’s Accelerated Gradient

d (k) = −∇F (x (k) + γd (k−1) ) + γd (k−1)

Both can be combined with Stochastic/Minibatch.

Other descent direction methods III

There is a whole zoo of further methods in Machine Learning that all try
to avoid a costly line search:

AdaGrad scale (Adapt) each single gradient ∇Fi (x (k) ) (by the
combined norm of the gradients history)
⇒ different step size for every component of x (k) .
Adadelta avoid the strictly decreasing step size of AdaGrad
Adam adaptive motion estimation – also keep track of gradients
themselves
AdaMax Adam with maximum norm
Nadam Adam with Nesterov combined

Stopping criteria

maximal number of iterates stop if k = Kmax


small gradient stop if ∥∇F (x (k) )∥ is small
small change stop if ∥x (k−1) − x (k) ∥ is small
small change II stop if |F (x (k−1) ) − F (x (k) )| is small

For the last two (or three) we can distinguish two cases

absolute tolerance atol – ∥∇F (x (k) )∥ < ε


relative tolerance rtol – ∥∇F (x (k) )∥ < δ∥∇F (x (k−1) )∥
combined sum both right hand sides above

Outlook M4: Further topics in questions.

As a teaser a few topics that one might continue with (but we don’t)
1. What if the function is convex?
⇒ Convex Analysis, Duality
2. What if the function is not smooth, but has jumps?
3. What if there are constraints?
4. What if the function is defined on a Riemannian Manifold?
