
Numerical Mathematics

Norwegian University of Science and Technology

# 4: Nonlinear Systems and Optimization

Ronny Bergmann

Department of Mathematical Sciences, NTNU.

September 11, 2024


Nonlinear Systems of Equations & Optimisation
Let F : Rn ⊃ D → Rm. Then we can look for:

Univariate Optimization (m = n = 1)
Notation: f : R → R
Find a point x ∗ ∈ R such that f (x ∗ ) is minimal

Multivariate Optimisation (n ∈ N, m = 1)
Notation: F : Rn → R
Find a point x ∗ ∈ Rn such that F (x ∗ ) is minimal

Solving Nonlinear Systems of Equations (usually m = n)
Find a point x ∗ such that F (x ∗ ) = 0

Let C k (D; Rm ) denote the set of k-times continuously differentiable
functions F : D → Rm on D ⊂ Rn .
We write in short C k (D) = C k (D; Rn ) for m = n.
Local vs. Global Minimizer

Let F : Rn → R be given. We call

A point x ∗ a global minimizer if

F (x ∗ ) ≤ F (x) for all x ∈ Rn

A point x ∗ a local minimizer if there exists an open neighborhood N
“around” x ∗ such that

F (x ∗ ) ≤ F (x) for all x ∈ N
Strict Local vs. Global Minimizer

Let F : Rn → R be given. We call

A point x ∗ a strict global minimizer if

F (x ∗ ) < F (x) for all x ∈ Rn

A point x ∗ a strict local minimizer if there exists an open neighborhood N
“around” x ∗ such that

F (x ∗ ) < F (x) for all x ∈ N
Gradient

For a function F ∈ C 1 (D; R), D ⊆ Rn , we denote the vector of partial
derivatives by ∇F : Rn → Rn . It is defined as

∇F (x) = ( ∂F/∂x1 (x), . . . , ∂F/∂xn (x) )T ,   x ∈ Rn ,

and called the gradient of F .

⇒ Compute directional derivatives

∂F/∂d (x) = lim α→0 ( F (x + αd ) − F (x) ) / α = ∇F (x)T d
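As a quick numerical sanity check (not from the slides), one can compare the difference quotient with ∇F (x)T d for a simple F ; the test function, point and direction below are illustrative choices.

```python
import numpy as np

def F(x):
    # illustrative test function F(x) = x1^2 + 3 x2
    return x[0]**2 + 3.0 * x[1]

def grad_F(x):
    # its gradient (2 x1, 3)^T
    return np.array([2.0 * x[0], 3.0])

x = np.array([1.0, 2.0])
d = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit direction
alpha = 1e-6
fd = (F(x + alpha * d) - F(x)) / alpha    # difference quotient
exact = grad_F(x) @ d                     # directional derivative grad F(x)^T d
# fd and exact agree up to O(alpha)
```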

Jacobian and Hessian Matrix
For a function F ∈ C 1 (D), D ⊆ Rn , i. e. F : Rn → Rn ,
or F = (F1 , . . . , Fn )T with Fi ∈ C 1 (D; R), we call the matrix

JF (x) = ( ∂Fi /∂xj (x) ) i,j=1,...,n ∈ Rn×n ,   x ∈ Rn ,

the Jacobian of F .
⇒ ith row is the gradient ∇Fi of ith component function Fi of F

Special case: for F ∈ C 2 (Rn ; R) the gradient F = ∇F is in C 1 (Rn ). We obtain
the Hessian matrix, denoted by ∇2 F (x) or HF (x), given by

∇2 F (x) = J∇F (x) = ( ∂ 2 F/(∂xi ∂xj ) (x) ) i,j=1,...,n
Open Balls & Convexity

Let x ∈ Rn and R > 0. The open ball of radius R centered at x is defined as

B(x; R) = {y ∈ Rn | ∥y − x∥ < R}

A set C ⊂ Rn is called convex if for all x, y ∈ C the connecting line


segment is contained in C .
Formally

tx + (1 − t)y = x + (1 − t)(y − x) ∈ C for all t ∈ [0, 1]

Taylor Expansion
Let F ∈ C 1 (Rn ; R). Then we have the zeroth order Taylor expansion,
i. e. for x, p ∈ Rn that

F (x + p) = F (x) + ∇F (x + tp)T p   for some t ∈ (0, 1).

If F ∈ C 2 (Rn ; R) we even get the first order Taylor expansion

F (x + p) = F (x) + ∇F (x)T p + 1/2 pT ∇2 F (x + tp)p   for some t ∈ (0, 1).

We also write this as

F (x + p) = F (x) + ∇F (x)T p + O(∥p∥2 )
First Order Necessary (Optimality) Conditions
Theorem (First Order Necessary Conditions)
Let x ∗ be a local minimizer and F ∈ C 1 (B(x ∗ ; R); R) for a suitable R > 0.
Then ∇F (x ∗ ) = 0.

Proof.
Blackboard/Note

Stationary Points

Any point x ∗ with ∇F (x ∗ ) = 0 is called a stationary point. Due to the first
order necessary conditions:

Any local minimizer must be a stationary point.

This relates minimization

arg min F (x)
x∈Rn

to solving nonlinear systems of equations: setting F (x) = ∇F (x), we
look for candidates of minimizers by solving

F (x) = 0.
Second Order Necessary and Sufficient Conditions
(without proofs here)

Theorem (Second Order Necessary Conditions)


Let x ∗ be a local minimizer of F : Rn → R and suppose that ∇2 F exists and is
continuous in some B(x ∗ ; R) for a suitable R > 0.
Then ∇F (x ∗ ) = 0 and ∇2 F (x ∗ ) is positive semidefinite.

Again: Only these are candidates for minimizers

Theorem (Second Order Sufficient Conditions)


Let F : Rn → R be in C 2 (B(x ∗ ; R); R) around x ∗ ∈ Rn for a suitable R > 0.
Let ∇F (x ∗ ) = 0 and ∇2 F (x ∗ ) be positive definite.
Then x ∗ is a strict local minimizer of F .
Summary

We can take two viewpoints here


Minimization.
To minimize a function F : Rn → R: we are looking for
a stationary point x ∗ , i. e.
∇F (x ∗ ) = 0

Solving nonlinear Systems.


To solve for a function F : Rn → Rn : we are looking for a solution x ∗ , i. e.

F (x ∗ ) = 0

Special case: Minimization for F = ∇F .

General iterative Scheme

Idea (again): reformulate F (x) = 0 (or F (x) → min)
into a fixed point equation x = G (x).
Here: in a way that we get update directions.

Let x (0) ∈ Rn be a starting point. For each k = 0, 1, 2, . . .

1. Determine an update direction d (k) ∈ Rn
2. Determine a step size αk > 0
3. Update x (k+1) = G (x (k) ) = x (k) + αk d (k)

And we have to specify a suitable stopping criterion.

Contractive Mapping

A mapping G : D ⊂ Rn → Rn is called contractive on D0 ⊂ D


if there exists α < 1 such that

∥G (x) − G (y )∥ ≤ α∥x − y ∥ for all x, y ∈ D0 .

Here ∥·∥ is a suitable vector norm.

Convergence (Fixed Point Theorem, QSS Theorem 7.2)

Let G : D ⊂ Rn → Rn be a contractive mapping on the closed set D0 ⊂ D
and let G (x) ∈ D0 for all x ∈ D0 .

Then G has a unique fixed point in D0 .

Even more: for every x (0) ∈ D0 the iteration procedure

x (k+1) = G (x (k) )

converges to this fixed point.
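The iteration procedure can be sketched in a few lines of Python; the map G (x) = cos x (a contraction near its fixed point) and the tolerance are illustrative choices, not from the lecture.

```python
import numpy as np

def fixed_point_iteration(G, x0, tol=1e-12, max_iter=1000):
    # iterate x^(k+1) = G(x^(k)) until successive iterates are close
    x = x0
    for _ in range(max_iter):
        x_new = G(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = fixed_point_iteration(np.cos, 1.0)
# x_star satisfies cos(x_star) = x_star
```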

Proof of the Fixed Point Theorem

Rate of Convergence

An iterative method is called locally convergent of order p to x ∗ ∈ Rn
if there exists a constant C (r ) for some r > 0 such that for all x (0) with
∥x (0) − x ∗ ∥ ≤ r it holds

∥x (k+1) − x ∗ ∥ ≤ C (r )∥x (k) − x ∗ ∥p

in QSS only implicitly defined for p = 2 in Theorem 7.1, 1-dim. case: Def. 6.1

Most important cases

▶ p = 2 – quadratic convergence. The number of correct digits roughly
doubles every iteration.
▶ p = 1 – linear convergence (if C (r ) < 1)
Examples

Examples: An Outlook

A (grayscale) image is just a matrix Y ∈ Rm×n , Yij ∈ [0, 1]
⇒ y = vec(Y ) ∈ Rmn .

But measurements might be noisy.

Model idea:
Find a matrix (an image) x
▶ “close to” y
▶ with “similar” neighbor pixels

[Figures: some nice image; measured with noise]
Example: An Outlook

▶ x “close to” y → use 1/2 ∥x − y ∥22
▶ I set of pixel indices
▶ Ni neighbors of pixel i
▶ “similar” neighbors: Σi∈I Σj∈Ni |xi − xj |
▶ tradeoff/weight λ > 0:

F (x) = 1/2 ∥x − y ∥22 + λ Σi∈I Σj∈Ni |xi − xj |

▶ Minimize! (Obs! highdimensional & nonsmooth)

[Figures: measured with noise; a (simple) reconstruction]
A good direction d in the iterative scheme

To find a good (scaled) direction αd : first order Taylor:

F (x + αd ) = F (x) + α∇F (x)T d + 1/2 α2 d T ∇2 F (x + td )d   for some t ∈ (0, α)

The rate of change is the coefficient of α, namely ∇F (x)T d .

We call d a descent direction if ∇F (x)T d < 0.

For any descent direction d (and F ∈ C 1 ) there exists a (small enough) α
such that
F (x + αd ) < F (x)
if we are not in a minimum already.
Find the best descent direction!

If we are just interested in the direction, let’s take ∥d ∥ = 1.

The best descent direction is

arg min d T ∇F (x) subject to ∥d ∥ = 1
d

If we denote by θ the angle between d and ∇F (x) we get

d T ∇F (x) = ∥d ∥∥∇F (x)∥ cos θ = ∥∇F (x)∥ cos θ

Minimum: cos θ = −1, i. e. θ = π, i. e. the

steepest descent direction: d = −∇F (x)/∥∇F (x)∥

Newton’s Method
Alternative Idea: Use more/other “Information”
Instead of “just” using the gradient information, we could use higher
order information:
In Optimization: the Hessian matrix ∇2 F (or in 1D: the second derivative f ′′ )

Recap: The one-dimensional case: Looking for f (x) = 0

Algorithm 1: Bisection.
Given a, b ∈ R such that f (a)f (b) < 0 ⇒ ∃α ∈ (a, b) : f (α) = 0
Idea: “Divide and Conquer”:
1. set a0 = a, b0 = b
2. Repeat for k = 0, 1, . . .
2.1 compute ck = (ak + bk )/2
2.2 if f (ak )f (ck ) < 0 set ak+1 = ak and bk+1 = ck
2.3 if f (ak )f (ck ) > 0 set ak+1 = ck and bk+1 = bk
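A minimal Python sketch of Algorithm 1; the tolerance-based stopping rule is an added assumption (the slide leaves the stopping criterion open).

```python
def bisection(f, a, b, tol=1e-12):
    # assumes a sign change f(a) f(b) < 0 on [a, b]
    assert f(a) * f(b) < 0, "f must change sign on [a, b]"
    while b - a > tol:
        c = 0.5 * (a + b)        # midpoint c_k = (a_k + b_k)/2
        if f(a) * f(c) < 0:      # root in [a_k, c_k]
            b = c
        else:                    # root in [c_k, b_k]
            a = c
    return 0.5 * (a + b)

root = bisection(lambda x: x**2 - 2.0, 0.0, 2.0)   # approximates sqrt(2)
```

Each step halves the interval, so the error decreases linearly with rate 1/2.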

Use function evaluations, but more clever!

Idea: Take the line

l(t) = 1/(b − a) ( (b − a − t)f (a) + tf (b) ),   t ∈ [0, b − a]

⇒ solving l(t) = 0 for t yields t = −(a − b)/(f (a) − f (b)) · f (a)
⇒ Choose the point c not as the mid point but as c = a + t

Algorithm 2: Secant information.
For some x (0) , x (1) compute for k = 1, 2, . . .

x (k+1) = x (k) − f (x (k) ) (x (k) − x (k−1) )/(f (x (k) ) − f (x (k−1) ))
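A sketch of Algorithm 2 in Python; the stopping rule and the iteration cap are added assumptions.

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    # x^(k+1) = x^(k) - f(x^(k)) (x^(k) - x^(k-1)) / (f(x^(k)) - f(x^(k-1)))
    for _ in range(max_iter):
        fx0, fx1 = f(x0), f(x1)
        if fx1 == fx0:               # avoid division by zero
            break
        x2 = x1 - fx1 * (x1 - x0) / (fx1 - fx0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

root = secant(lambda x: x**2 - 2.0, 1.0, 2.0)   # approximates sqrt(2)
```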

Best Idea: Take the Tangent instead of the Secant.
At our current point x (k) : tangent (or model) equation

mk (x) = f (x (k) ) + f ′ (x (k) )(x − x (k) )

(again): first order Taylor (leave out the remainder term).

If f ′ (x (k) ) = 0 : stationary point!

Otherwise: setting mk (x) = 0 and solving for x as the new iterate yields the

Newton iteration

x (k+1) = x (k) − f (x (k) )/f ′ (x (k) )
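The Newton iteration can be sketched as follows; the stopping rule and the explicit derivative argument are added assumptions.

```python
def newton_1d(f, fprime, x0, tol=1e-12, max_iter=50):
    # Newton iteration x^(k+1) = x^(k) - f(x^(k)) / f'(x^(k))
    x = x0
    for _ in range(max_iter):
        fp = fprime(x)
        if fp == 0.0:            # stationary point: stop
            break
        x_new = x - f(x) / fp
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

root = newton_1d(lambda x: x**2 - 2.0, lambda x: 2.0 * x, 1.0)
```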
Multivariate View I: Newton in Optimisation
To minimize F : Rn → R

Determine the direction d from second-order Taylor as

F (x (k) + d ) ≈ F (x (k) ) + d T ∇F (x (k) ) + 1/2 d T ∇2 F (x (k) )d =: Mk (d )

Minimizing Mk , i.e. solving ∇Mk (d ) = 0
(assuming ∇2 F (x (k) ) is positive definite)

yields

∇2 F (x (k) )d (k) = −∇F (x (k) ) ⇔ d (k) = −(∇2 F (x (k) ))−1 ∇F (x (k) )

Then set x (k+1) = x (k) + d (k)


Multivariate View II: Nonlinear Systems of Equations
To solve F (x) = 0 at iterate x (k) :
Model equation given by first-order Taylor
(the model Mk here plays the role of ∇Mk from the last slide)

Mk (d ) = F (x (k) ) + JF (x (k) )d

Again: we set the model Mk to zero, solve for the new direction
d (k) = d , and obtain
Solve JF (x (k) )d (k) = −F (x (k) )
Step x (k+1) = x (k) + d (k)
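The two steps can be sketched directly in Python; the example system (x1² + x2² = 1, x1 = x2), the starting point and the tolerance are illustrative choices.

```python
import numpy as np

def newton_system(F, JF, x0, tol=1e-12, max_iter=50):
    # solve F(x) = 0: each step solves J_F(x^(k)) d = -F(x^(k)), then updates
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.solve(JF(x), -F(x))   # linear system in every step
        x = x + d
        if np.linalg.norm(d) < tol:
            return x
    return x

F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
JF = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
sol = newton_system(F, JF, [1.0, 0.5])   # converges to (1/sqrt(2), 1/sqrt(2))
```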

Note. In every Newton step, we have to solve a linear system of


equations.
Convergence of Multivariate Newton
Theorem (QSS, Theorem 7.1)
Let F : Rn → Rn be a C 1 function in a convex open set D of Rn that contains
x ∗ with F (x ∗ ) = 0. Suppose that JF−1 (x ∗ ) exists and that there exist positive
constants R, C and L such that
▶ C = ∥JF−1 (x ∗ )∥
▶ JF is locally Lipschitz, i.e.
∥JF (x) − JF (y )∥ ≤ L∥x − y ∥ for all x, y ∈ B(x ∗ , R),
where ∥·∥ denotes a vector norm and a consistent matrix norm.

Then there exists r > 0 such that for any x (0) ∈ B(x ∗ , r ) the Newton
iteration is uniquely defined and converges to x ∗ with

∥x (k+1) − x ∗ ∥ ≤ CL∥x (k) − x ∗ ∥2

Note: The assumptions also make x (k) well defined.


Proof Convergence of Multivariate Newton

Reasons for Modified Newton

Challenges

▶ We have to evaluate JF (x (k) )


▶ We have to solve a linear system JF (x (k) )d = −F (x (k) )

This can be time consuming!

Variant I: Cyclic Updating

Evaluate the Jacobian once, J = JF (x (k) ), and use this matrix for the
iterations k, k + 1, . . . , k + p.

▶ Fewer evaluations of the Jacobian
▶ Efficiency by reusing an LU decomposition J = LU ,
i.e. solve Lc = −F (x (k+i) ), Ud = c, x (k+i+1) = x (k+i) + d
⇒ reduces the cost per step from O(n3 ) to O(n2 )

But this is less accurate, since we do not solve the original linear system
(only for iteration k we do).
Variant II: Inexact Solution of the Linear System

Motivation: The matrix is inexact already, so why solve the system exactly?

⇒ Use Jacobi, Gauss-Seidel, SOR


Or Gradient Descent (GD) or (preconditioned) Conjugate Gradient (pCG)

...to some precision.

Variant III: Replace JF by a difference approximation

Let j be a column index and hj(k) > 0 be a step size.
Compute the jth column by a finite difference approximation

(JF (x (k) ))j ≈ (Jh(k) )j = ( F (x (k) + hj(k) e j ) − F (x (k) ) ) / hj(k)

where e j is the jth canonical unit vector.

▶ One can get linear convergence if 0 < |hj(k) | < h for some h.
▶ One can get/keep quadratic convergence if
|hj(k) | ≤ C ∥x (k) − x ∗ ∥ (or |hj(k) | ≤ c∥F (x (k) )∥)
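A minimal sketch of the column-wise approximation; the fixed step h and the test function are illustrative simplifications of the per-column step sizes hj(k).

```python
import numpy as np

def jacobian_fd(F, x, h=1e-7):
    # approximate J_F(x) column by column via forward differences
    x = np.asarray(x, dtype=float)
    Fx = F(x)
    J = np.empty((Fx.size, x.size))
    for j in range(x.size):
        e_j = np.zeros(x.size)
        e_j[j] = 1.0                        # jth canonical unit vector
        J[:, j] = (F(x + h * e_j) - Fx) / h
    return J

F = lambda x: np.array([x[0]**2 + x[1], np.sin(x[1])])
J = jacobian_fd(F, [1.0, 0.0])
# exact Jacobian at (1, 0) is [[2, 1], [0, 1]]
```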

Changes due to the 3 Variants

▶ all 3 methods will affect the convergence rate


▶ but the limit point is unchanged

This means

▶ it converges slower (more iterations required), but


▶ each single iteration is (much) faster


Quasi Newton Methods


Broyden’s Method – Motivation

Evaluating the Jacobian JF (x (k) ) is expensive.
⇒ replace it with some matrix Bk and solve

Bk d (k) = −F (x (k) )

where Bk
▶ is “cheap to compute”
▶ approximates the Jacobian “well enough”
▶ is updated every iteration to improve the approximation.
Broyden’s Method – Derivation

The Multivariate Secant Equation

From JF (x (k) )(x (k+1) − x (k) ) ≈ F (x (k+1) ) − F (x (k) ) we

▶ introduce s (k) = x (k+1) − x (k)


▶ introduce y (k) = F (x (k+1) ) − F (x (k) )
▶ and assume that we already used Bk to get x (k+1)

Then we would like our next approximation Bk+1 to fulfill


the secant equation
Bk+1 s (k) = y (k) .
Problem. That is only n equations, but Bk+1 has n2 unknowns.
But. We can look for “even nicer” Bk+1 .

Broydens Method – Matrix Update
Given s (k) = x (k+1) − x (k) and y (k) = F (x (k+1) ) − F (x (k) ) as well as Bk , we
compute

Bk+1 = Bk + ( (y (k) − Bk s (k) )(s (k) )T ) / ( (s (k) )T s (k) )

Then

1. It fulfills the secant equation Bk+1 s (k) = y (k)
2. It is “the best” in the sense that it minimises ∥B − Bk ∥F among all B
that fulfil the equation from 1.
The Minimiser of ∥B − Bk ∥F

Broyden’s Method

Choose x (0) and some nonsingular B0 (e.g. B0 = JF (x (0) )).

Then we repeat for k = 0, 1, . . . ,

1. Solve Bk d (k) = −F (x (k) ) for d (k)


2. Set x (k+1) = x (k) + αk d (k) (we still do αk = 1 for now)
3. Compute s (k) = x (k+1) − x (k)
4. Compute y (k) = F (x (k+1) ) − F (x (k) )
5. Update Bk+1 = Bk + ( (y (k) − Bk s (k) )(s (k) )T ) / ( (s (k) )T s (k) )
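The five steps above can be sketched as follows; the test system and the choice B0 = JF (x (0) ) (given here as an explicit matrix) are illustrative.

```python
import numpy as np

def broyden(F, x0, B0, tol=1e-10, max_iter=100):
    # Broyden's method with alpha_k = 1 and a rank-1 update of B_k
    x = np.asarray(x0, dtype=float)
    B = np.array(B0, dtype=float)
    Fx = F(x)
    for _ in range(max_iter):
        d = np.linalg.solve(B, -Fx)               # step 1
        x_new = x + d                             # step 2 (alpha_k = 1)
        F_new = F(x_new)
        s, y = x_new - x, F_new - Fx              # steps 3 and 4
        B = B + np.outer(y - B @ s, s) / (s @ s)  # step 5
        x, Fx = x_new, F_new
        if np.linalg.norm(Fx) < tol:
            return x
    return x

F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
B0 = np.array([[2.0, 1.0], [1.0, -1.0]])   # J_F at x^(0) = (1, 0.5)
sol = broyden(F, [1.0, 0.5], B0)
```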

Convergence of Broyden’s Method
Under the same assumptions as for the convergence of Newton’s
Method, and if additionally there exist ε, γ > 0 such that

▶ ∥x (0) − x ∗ ∥ ≤ ε
▶ ∥B0 − JF (x ∗ )∥ ≤ γ

the sequence of iterates x (k) of Broyden’s Method is well-defined
and converges super-linearly to x ∗ , i. e.

∥x (k+1) − x ∗ ∥ ≤ ck ∥x (k) − x ∗ ∥

where the constants ck are such that limk→∞ ck = 0.

One can also prove (under further assumptions)
that Bk converges to JF (x ∗ ).

(Unconstrained) Optimization
(Back to) Optimization

We look for a (global) minimizer of F : Rn → R

x ∗ = arg min F (x)


x∈Rn

We already saw local/global (strict) minimisers.

This is called unconstrained optimization;
imposing further constraints on x, for example x ≥ 0, leads to constrained
optimization.
⇒ TMA4180 Optimization I (Spring 2024)
Direct Search Methods

Advantage.
They do not require the gradient ∇F but just evaluations of F .
Examples are Nelder-Mead or Hooke and Jeeves.

Disadvantage.
They might be very (, very,...) slow.

While we do not consider them here, what are two good reasons to
have/use these methods?

(Multivariate) Mean Value Theorem

Let

1. D be a convex domain in Rn
2. F ∈ C 1 (D; R)
3. α ∈ R
4. x, x + d ∈ D

Then there exists a 0 < θ < 1 such that with y = x + θαd we have

F (x + αd ) − F (x) = α∇F (y )T d .

⇒ For a descent direction: F (x + αd ) < F (x).

Descent Methods
Idea: Take a descent direction d ∈ Rn , i.e. with d T ∇F (x) < 0
and “walk into this direction”.

Start with x (0) ∈ Rn .


For k = 0, . . . , (until convergence)

1. Choose a descent direction d (k) , i.e. such that (d (k) )T ∇F (x (k) ) < 0
2. Choose a step size αk > 0 small enough that
F (x (k) + αk d (k) ) < F (x (k) )
3. and set x (k+1) = x (k) + αk d (k)

Idea for a stopping criterion: ∥∇F (x (k) )∥ is smaller than some tolerance.

But how to actually choose d and α?
Algorithm 1: Newton’s Method

Assume that F ∈ C 2 (D; R) and choose αk = 1


(set F = ∇F in the last section)

Then JF (x) = HF (x) and d (k) = −HF (x (k) )−1 ∇F (x (k) ).


From sufficient condition ⇒ locally around x ∗ : HF positive definite!

⇒ ∇F (x (k) )T d (k) = −∇F (x (k) )T HF−1 (x (k) )∇F (x (k) ) < 0.

This is also called a second order method, since it uses first and second
order derivatives

Same approaches as before with Broyden: Bk (sym.) to approximate HF .

Beyond Broyden

▶ Broyden’s method: Bk+1 fulfills the secant equation Bk+1 s (k) = y (k)
▶ but the Hessian ∇2 F (x (k) ) is even more: symmetric (& pos. definite)

The BFGS update, named after Broyden, Fletcher, Goldfarb, and Shanno:

Bk+1 = Bk − ( Bk s (k) (s (k) )T BkT ) / ( (s (k) )T Bk s (k) ) + ( y (k) (y (k) )T ) / ( (y (k) )T s (k) )

▶ the update is a symmetric matrix. ⇒ Bk+1 is symmetric (if Bk is)


▶ the update keeps positive definiteness.
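The update formula itself is a two-liner; the example s, y below are illustrative and satisfy the curvature condition (y (k) )T s (k) > 0, which is what preserves positive definiteness. The code assumes B is symmetric (so BᵀS = Bs).

```python
import numpy as np

def bfgs_update(B, s, y):
    # BFGS update: B - (B s s^T B^T)/(s^T B s) + (y y^T)/(y^T s),
    # written for symmetric B (then B^T s = B s)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

B = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([2.0, 1.0])
B_new = bfgs_update(B, s, y)
# B_new is symmetric and fulfills the secant equation B_new @ s == y
```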

Outlook: Limited Memory BFGS
Application: Optimisation on a 4K image (n ≈ 9 million pixels)

▶ F : Rn → R with n ≈ 9 · 106 , hence a gradient ∇F : Rn → Rn
▶ Hessian: ∇2 F (x) ∈ Rn×n , i.e. about 8.1 · 1013 entries

This is not feasible in memory.

Features of BFGS.
There is a recursive formula for the inverse Bk+1−1 in terms of Bk−1 , s (k) , y (k)
⇒ a recursive formula for applying Bk+1−1 to a vector

Solution.
Start with B0 = I .
Store only K ≪ n previous s (k−K +1) , . . . , s (k) and y (k−K +1) , . . . , y (k) .
⇒ Limited Memory BFGS: apply the recursive formula K times.
Algorithm 2: Gradient (or Steepest) Descent

We already learned d = −∇F (x) is the steepest descent direction.

Given some x (0) ∈ Rn and some stepsizes αk (maybe just again 1).

Then again for k = 0, 1, . . .

1. Choose d (k) = −∇F (x (k) )


2. Set x (k+1) = x (k) + αk d (k)
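As a minimal sketch with a fixed step size; the quadratic test function, step size and tolerance are illustrative choices.

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha=0.1, tol=1e-10, max_iter=100000):
    # fixed-step gradient descent: x^(k+1) = x^(k) - alpha * grad F(x^(k))
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -grad_F(x)                # steepest descent direction
        if np.linalg.norm(d) < tol:   # (approximately) stationary
            return x
        x = x + alpha * d
    return x

# illustrative quadratic F(x) = (x1 - 1)^2 + 2 x2^2 with minimizer (1, 0)
grad_F = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
x_min = gradient_descent(grad_F, [0.0, 1.0])
```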

What is the best step size αk ?
Suppose we are at x (k) and have chosen a descent direction d (k) .

What is the best step size we can choose?

arg minα∈R ϕ(α) = arg minα∈R F (x (k) + αd (k) )

In practice. Not really feasible


But. In theory, this choice has a nice feature:
Theorem
For a descent method, let x (k) , d (k) be given and αk = arg minα∈R ϕ(α).
Then
∇F (x (k+1) )T d (k) = 0

Algorithm 3: Nonlinear Conjugate Gradient
The Gradient descent sometimes tends to “zig-zag” (depending on αk ).
Avoid by orthogonalizing directions d (k) (w.r.t. some scalar product).
For k = 0, 1, . . .

1. Set d (k) = −∇F (x (k) ) + βk d (k−1)


2. Set x (k+1) = x (k) + αk d (k)

where βk has different variants, e.g.

Hestenes–Stiefel: βkHS = ∇F (x (k) )T (∇F (x (k) ) − ∇F (x (k−1) )) / ( (∇F (x (k) ) − ∇F (x (k−1) ))T d (k−1) )

Fletcher–Reeves: βkFR = ∇F (x (k) )T ∇F (x (k) ) / ( ∇F (x (k−1) )T ∇F (x (k−1) ) )

and quite a few more.
Interlude: (Linear) CG

If F is quadratic, e. g. F (x) = 1/2 x T Ax − b T x for A s.p.d.,
then the gradient ∇F (x) = Ax − b is linear (affine) in x.

Then Fletcher–Reeves yields directions d (0) , d (1) , . . .
that are conjugate (orthogonal) with respect to (v , w )A = v T Aw .
This yields the update direction as in Conjugate Gradient (Assignment 3)

But even more: There exists a closed form for the “best” (most
decrease) step size.

Question. How many steps does this method need?


Line Search Methods


Introduction
Since finding the minimizer of ϕ(α) = F (x (k) + αd (k) ) is not feasible, we
would like to have other methods that work well enough, i.e. such that

F (x (k+1) ) < F (x (k) )

Question. What are the challenges here?

▶ The step size might be too small and we do not make progress
▶ The “gain” (or descent rate) F (x (k+1) ) − F (x (k) ) might be too small

In the following: We are at point x, have found a descent direction d


and we are looking for a step size α.
Illustration for having Sufficient Decrease

Armijo Linesearch
Let σ ∈ (0, 1) be given. The Armijo condition for a step size α reads

F (x + αd ) ≤ F (x) + σα∇F (x)T d

In terms of the helping 1D function this reads

ϕ(α) ≤ ϕ(0) + σαϕ′ (0)

In Practice.
Use a backtracking parameter c ∈ (0, 1) and a starting step size s: try
α = s, α = cs, α = c 2 s, . . . until the Armijo condition is first fulfilled.
Improve by starting with s as the last found step size.
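The backtracking strategy can be sketched as follows; the parameter values σ, c, s and the quadratic test function are illustrative choices.

```python
import numpy as np

def armijo_step(F, grad_Fx, x, d, sigma=1e-4, c=0.5, s=1.0, max_backtracks=50):
    # try alpha = s, cs, c^2 s, ... until the Armijo condition holds
    Fx = F(x)
    slope = grad_Fx @ d               # phi'(0) = grad F(x)^T d, must be < 0
    alpha = s
    for _ in range(max_backtracks):
        if F(x + alpha * d) <= Fx + sigma * alpha * slope:
            return alpha              # Armijo condition fulfilled
        alpha *= c
    return alpha

F = lambda x: x[0]**2 + 4.0 * x[1]**2
x = np.array([1.0, 1.0])
g = np.array([2.0, 8.0])              # grad F at x
alpha = armijo_step(F, g, x, -g)      # step along the steepest descent direction
```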
Welldefinedness of Armijo

Theorem
Let σ, c ∈ (0, 1) and s > 0 be given.
Then for every pair (x, d ) ∈ Rn × Rn with ∇F (x)T d < 0 there exists an
α0 > 0 such that
F (x + αd ) ≤ F (x) + σα∇F (x)T d
holds for all α ∈ [0, α0 ].

Proof.

(Global) Convergence of Gradient Descent with Armijo
Linesearch
Theorem
When performing Gradient descent with Armijo Linesearch to determine the
step size αk , then for the sequence x (k) it holds:
Every cluster point x ∗ of {x (k) } is a stationary point, i. e. ∇F (x ∗ ) = 0.

One can further show:

▶ x ∗ cannot be a local maximum


▶ all cluster points have the same function value
▶ in an area with positive definite Hessian, the whole sequence
converges.

Problem: The Step size might still get too small

The Curvature Condition

or: Avoiding too small step sizes

Let β ∈ (σ, 1) be given, where σ is the constant from the Armijo


condition.

Then we look for a step size α that additionally fulfills the curvature
condition
∇F (x + αd )T d ≥ β∇F (x)T d

Wolfe Conditions

Combining both criteria, we obtain the Wolfe conditions: Let x, d be


given.

For 0 < σ < β < 1 we search for a step size α that fulfills

F (x + αd ) ≤ F (x) + σα∇F (x)T d


∇F (x + αd )T d ≥ β∇F (x)T d

But our choice α̂ might still be far away from a minimizer α∗ of ϕ.

Strong Wolfe Conditions

Combining both criteria, we obtain the strong Wolfe conditions: Let x, d


be given.

For 0 < σ < β < 1 we search for a step size α that fulfills

F (x + αd ) ≤ F (x) + σα∇F (x)T d


|∇F (x + αd )T d | ≤ β|∇F (x)T d |

We additionally disallow ϕ′ (α) to become large and positive. Since ϕ′ (0) < 0
(for small α the slope is negative), we hence stay closer to stationary
points of ϕ.

Welldefinedness of (Strong) Wolfe conditions

Let F ∈ C 1 , a point x, and a descent direction d at x be given. Assume that
F is bounded from below on the ray {x + αd : α > 0}.

Then, if 0 < σ < β < 1, there exist intervals of step sizes satisfying the
Wolfe conditions and the strong Wolfe conditions.

Proof.

Other descent direction methods
Gradient descent methods gained popularity in Machine Learning, where

the (cost) function F (x) = Σi=1,...,N Fi (x) has a very large N

⇒ step size α fixed (“learning rate”)

Algorithm 4. Stochastic GD. Choose one random index i ∈ {1, . . . , N} and set

d (k) = −∇Fi (x (k) )

Algorithm 5. Minibatch (group-stochastic) GD.
Choose a (small) random subset I ⊂ {1, . . . , N} and compute

d (k) = − Σi∈I ∇Fi (x (k) )
Other descent direction methods II

Algorithm 6. Momentum – Fix a momentum term γ and compute

d (k) = −∇F (x (k) ) + γd (k−1)

this could be seen as a simplified CG.


Idea: Like a ball rolling down a hill “keep momentum”.

Algorithm 7. Nesterov’s Accelerated Gradient

d (k) = −∇F (x (k) + γd (k−1) ) + γd (k−1)

Both can be combined with Stochastic/Minibatch.

Other descent direction methods III

There is a whole zoo of further methods in Machine Learning that all try
to avoid a costly line search:

AdaGrad scale (Adapt) each single gradient ∇Fi (x (k) ) (by the
combined norm of the gradients history)
⇒ different step size for every component of x (k) .
Adadelta avoid the strictly decreasing step size of AdaGrad
Adam adaptive motion estimation – also keep track of gradients
themselves
AdaMax Adam with maximum norm
Nadam Adam with Nesterov combined

Stopping criteria

maximal number of iterates stop if k = Kmax


small gradient stop if ∥∇F (x (k) )∥ is small
small change stop if ∥x (k−1) − x (k) ∥ is small
small change II stop if |F (x (k−1) ) − F (x (k) )| is small

For the last two (or three) we can distinguish two cases

absolute tolerance atol – ∥∇F (x (k) )∥ < ε


relative tolerance rtol – ∥∇F (x (k) )∥ < δ∥∇F (x (k−1) )∥
combined sum both right hand sides above

Outlook M4: Further topics in questions.

As a teaser a few topics that one might continue with (but we don’t)
1. What if the function is convex?
⇒ Convex Analysis, Duality
2. What if the function is not smooth, but has jumps?
3. What if there are constraints?
4. What if the function is defined on a Riemannian Manifold?
