
4.2 Gradient-Based Optimization



Gradient-based Optimization
Sargur N. Srihari
srihari@cedar.buffalo.edu

This is part of lecture slides on Deep Learning:


http://www.cedar.buffalo.edu/~srihari/CSE676

Topics
• Numerical Computation
• Gradient-based Optimization
– Stationary points, Local minima
– Second Derivative
– Convex Optimization
– Lagrangian


Gradient-Based Optimization
• Most ML algorithms involve optimization
• Minimize/maximize a function f (x) by altering x
– Usually stated as a minimization
– Maximization is accomplished by minimizing −f(x)
• f (x) referred to as objective function or criterion
– In minimization, f(x) is also referred to as the loss function, cost function, or error function
– Example: linear least squares, f(x) = ½‖Ax − b‖²

– Denote optimum value by x*=argmin f (x)


Calculus in Optimization
• Suppose we have a function y = f(x), where x and y are real numbers
– Derivative of function denoted: f’(x) or as dy/dx
• Derivative f’(x) gives the slope of f (x) at point x
• It specifies how to scale a small change in input to obtain
a corresponding change in the output:
f (x + ε) ≈ f (x) + ε f’ (x)
– It tells us how to make a small change in the input to obtain a small improvement in y
– We know that f(x − ε sign(f’(x))) is less than f(x) for small ε. Thus we can reduce f(x) by moving x in small steps with the opposite sign of the derivative
• This technique is called gradient descent (Cauchy 1847)
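As a minimal sketch (an illustration added here, not from the slides), 1-D gradient descent on the assumed toy function f(x) = x², whose derivative is f’(x) = 2x:

```python
# Minimal 1-D gradient descent sketch on f(x) = x**2 (assumed toy function).
# Each step moves x opposite to the sign (and magnitude) of the derivative,
# scaled by a small learning rate epsilon.

def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x = 3.0            # arbitrary starting point
epsilon = 0.1      # small, fixed learning rate
for _ in range(100):
    x = x - epsilon * f_prime(x)   # move against the derivative

print(x, f(x))     # x is now very close to the minimizer x* = 0
```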

Gradient Descent Illustrated


• For x>0, f(x) increases with x and f’(x)>0
• For x<0, f(x) decreases with x and f’(x)<0
• Use f’(x) to follow function downhill
• Reduce f(x) by going in direction opposite sign of
derivative f’(x)


Stationary points, Local Optima


• When f’(x)=0 derivative provides no
information about direction of move
• Points where f’(x)=0 are known as stationary
or critical points
– Local minimum/maximum: a point where f(x) is lower/higher than at all neighboring points
– Saddle Points: neither maxima nor minima


Presence of multiple minima


• Optimization algorithms may fail to find the global minimum
• We generally accept such solutions, settling for a value of f that is low but not necessarily minimal


Minimizing with multiple inputs


• We often minimize functions with multiple inputs: f: Rⁿ → R
• For minimization to make sense there
must still be only one (scalar) output


Functions with multiple inputs


• Need partial derivatives
• ∂f(x)/∂xᵢ measures how f changes as only variable xᵢ increases at point x


• The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector
• The gradient is the vector containing all of the partial derivatives, denoted ∇ₓf(x)

– Element i of the gradient is the partial derivative of f wrt xᵢ
– Critical points are where every element of the gradient is equal to zero

Directional Derivative
• Directional derivative in direction u (a unit
vector) is the slope of function f in direction u
– This evaluates to uᵀ∇ₓf(x)

• To minimize f, find the direction in which f decreases the fastest
– Do this using
  min_{u, uᵀu=1} uᵀ∇ₓf(x) = min_{u, uᵀu=1} ‖u‖₂ ‖∇ₓf(x)‖₂ cos θ

• where θ is angle between u and the gradient


• Substituting ‖u‖₂ = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ
• This is minimized when u points in direction opposite to
gradient
• In other words, the gradient points directly uphill, and the negative gradient points directly downhill
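A quick numerical check (an assumed example, not from the slides): among unit vectors u, the directional derivative uᵀ∇ₓf(x) is smallest when u points opposite the gradient.

```python
# Compare directional derivatives u^T grad over random unit directions with the
# direction opposite the gradient (assumed gradient vector for illustration).
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([2.0, -1.0, 0.5])              # assumed gradient at some point x

us = rng.normal(size=(10000, 3))               # random directions
us /= np.linalg.norm(us, axis=1, keepdims=True)
slopes = us @ grad                             # directional derivatives u^T grad

u_star = -grad / np.linalg.norm(grad)          # direction opposite the gradient
print(slopes.min())                            # close to -||grad||_2
print(u_star @ grad)                           # exactly -||grad||_2, the true minimum
```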

Method of Gradient Descent


• The gradient points directly uphill, and the
negative gradient points directly downhill
• Thus we can decrease f by moving in the
direction of the negative gradient
– This is known as the method of steepest
descent or gradient descent
• Steepest descent proposes a new point
  x' = x − ε∇ₓf(x)

– where ε is the learning rate, a positive scalar, often set to a small constant

Choosing ε: Line Search


• We can choose ε in several different ways
• Popular approach: set ε to a small constant
• Another approach is called line search:
• Evaluate f(x − ε∇ₓf(x)) for several values of ε and choose the one that results in the smallest objective function value
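A minimal line-search sketch (assumed objective and candidate values, added for illustration):

```python
# Try several learning rates along the negative gradient and keep the one that
# gives the lowest objective value (assumed toy objective f(x) = 1/2 ||x||^2).
import numpy as np

def f(x):
    return 0.5 * np.sum(x ** 2)

def grad_f(x):
    return x

x = np.array([3.0, -2.0])
g = grad_f(x)
candidates = [0.01, 0.1, 0.5, 1.0, 2.0]                   # candidate values of epsilon
best_eps = min(candidates, key=lambda eps: f(x - eps * g))
x_new = x - best_eps * g
print(best_eps, f(x_new))                                 # epsilon = 1.0 reaches the minimum here
```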


Ex: Gradient Descent on Least Squares


• Criterion to minimize: f(x) = ½‖Ax − b‖₂²
– Least squares regression: E_D(w) = ½ Σₙ₌₁ᴺ { tₙ − wᵀφ(xₙ) }²
• The gradient is
  ∇ₓf(x) = Aᵀ(Ax − b) = AᵀAx − Aᵀb

• Gradient descent algorithm:
  1. Set the step size ε and tolerance δ to small, positive numbers
  2. While ‖AᵀAx − Aᵀb‖₂ > δ do
       x ← x − ε(AᵀAx − Aᵀb)
  3. End while
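A sketch of this algorithm on assumed toy data (A, b, ε, and δ chosen only for illustration):

```python
# Gradient descent for f(x) = 1/2 ||Ax - b||_2^2 with gradient A^T A x - A^T b.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))                 # assumed data
b = rng.normal(size=20)

eps, delta = 0.01, 1e-6                      # step size and tolerance: small, positive
x = np.zeros(5)
grad = A.T @ A @ x - A.T @ b
while np.linalg.norm(grad) > delta:          # while ||A^T A x - A^T b||_2 > delta
    x = x - eps * grad
    grad = A.T @ A @ x - A.T @ b

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True: matches least squares
```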

Convergence of Steepest Descent


• Steepest descent converges when every
element of the gradient is zero
– In practice, very close to zero
• We may be able to avoid the iterative algorithm and jump directly to the critical point by solving the equation ∇ₓf(x) = 0 for x
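For the least-squares example above, this amounts to solving the normal equations AᵀAx = Aᵀb, sketched here on assumed toy data:

```python
# Jump directly to the critical point of f(x) = 1/2 ||Ax - b||_2^2 by solving
# grad f(x) = A^T A x - A^T b = 0 (assumes A^T A is invertible).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.linalg.norm(A.T @ A @ x_star - A.T @ b))    # ~0: the gradient vanishes
```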


Generalization to discrete spaces


• Gradient descent is limited to continuous
spaces
• Concept of repeatedly making the best
small move can be generalized to discrete
spaces
• Ascending an objective function of discrete
parameters is called hill climbing
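A minimal hill-climbing sketch (assumed discrete objective, added for illustration): repeatedly move to the best neighboring integer point until no neighbor improves.

```python
# Hill climbing on an assumed discrete objective over integer grid points.
def objective(p):
    x, y = p
    return -(x - 3) ** 2 - (y + 1) ** 2            # maximized at (3, -1)

def neighbors(p):
    x, y = p
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

p = (0, 0)                                         # arbitrary starting point
while True:
    best = max(neighbors(p), key=objective)
    if objective(best) <= objective(p):            # no improving move: stop
        break
    p = best

print(p)    # (3, -1)
```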


Beyond Gradient: Jacobian and Hessian matrices
• Sometimes we need to find all derivatives
of a function whose input and output are
both vectors
• If we have a function f: Rᵐ → Rⁿ
– Then the matrix of partial derivatives is known as the Jacobian matrix J, defined as
  Jᵢ,ⱼ = ∂f(x)ᵢ / ∂xⱼ
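A finite-difference sketch of the Jacobian (assumed test function, added for illustration):

```python
# J[i, j] approximates the partial derivative of output f(x)_i wrt input x_j.
import numpy as np

def f(x):                                   # assumed test function, f: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def jacobian(f, x, h=1e-6):
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += h
        J[:, j] = (f(xp) - fx) / h          # forward difference in coordinate j
    return J

x = np.array([1.0, 2.0])
print(jacobian(f, x))                       # rows: outputs, columns: inputs
```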


Second derivative
• Derivative of a derivative
• For a function f: Rⁿ → R, the derivative wrt xᵢ of the derivative of f wrt xⱼ is denoted ∂²f / ∂xᵢ∂xⱼ
• In a single dimension we can denote ∂²f/∂x² by f’’(x)
• Tells us how the first derivative will change
as we vary the input
• This is important as it tells us whether a gradient step will cause as much of an improvement as the gradient alone would predict

Second derivative measures curvature


• Derivative of a derivative
• Quadratic functions with different curvatures
[Figure: three quadratic functions with negative, zero, and positive curvature. The dashed line is the value of the cost function predicted by the gradient alone. With negative curvature the decrease is faster than predicted by gradient descent; with zero curvature the gradient predicts the decrease correctly; with positive curvature the decrease is slower than expected, and the function eventually actually increases.]

Hessian
• Second derivative with many dimensions
• The Hessian H(f)(x) is defined as
  H(f)(x)ᵢ,ⱼ = ∂²f(x) / ∂xᵢ∂xⱼ

• Hessian is the Jacobian of the gradient


• The Hessian matrix is symmetric, i.e., Hᵢ,ⱼ = Hⱼ,ᵢ, anywhere that the second partial derivatives are continuous
– So the Hessian matrix can be decomposed into a
set of real eigenvalues and an orthogonal basis of
eigenvectors
• Eigenvalues of H are useful to determine learning rate as
seen in next two slides

Role of eigenvalues of Hessian


• Second derivative in direction d is dTHd
– If d is an eigenvector, second derivative in that
direction is given by its eigenvalue
– For other directions, it is a weighted average of the eigenvalues (weights between 0 and 1, with eigenvectors having a smaller angle with d receiving more weight)
• Maximum eigenvalue determines maximum
second derivative and minimum eigenvalue
determines minimum second derivative
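A quick check of these facts on an assumed 2×2 Hessian (illustration only):

```python
# d^T H d equals an eigenvalue when d is the corresponding (unit) eigenvector,
# and for any unit direction d it lies between the extreme eigenvalues of H.
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                         # assumed symmetric Hessian
eigvals, eigvecs = np.linalg.eigh(H)               # ascending eigenvalues

d = eigvecs[:, 0]                                  # eigenvector direction
print(d @ H @ d, eigvals[0])                       # equal

rng = np.random.default_rng(0)
d = rng.normal(size=2)
d /= np.linalg.norm(d)                             # arbitrary unit direction
print(eigvals[0] <= d @ H @ d <= eigvals[1])       # True
```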

Learning rate from Hessian


• Taylor series of f(x) around the current point x^(0):
  f(x) ≈ f(x^(0)) + (x − x^(0))ᵀg + ½(x − x^(0))ᵀH(x − x^(0))
• where g is the gradient and H is the Hessian at x^(0)
– If we use learning rate ε, the new point x is given by x^(0) − εg. Thus we get
  f(x^(0) − εg) ≈ f(x^(0)) − εgᵀg + ½ε²gᵀHg

• There are three terms:


– original value of f,
– expected improvement due to slope, and
– correction to be applied due to curvature
• Solving for the step size that decreases this approximation the most gives
  ε* = gᵀg / (gᵀHg)
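A sketch of ε* on an assumed quadratic f(x) = ½xᵀHx − bᵀx, where the Taylor expansion is exact, so ε* exactly minimizes f along the negative gradient:

```python
# Optimal step size along -g from the quadratic Taylor expansion:
#   eps* = (g^T g) / (g^T H g)
import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 1.0]])                     # assumed Hessian
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ H @ x - b @ x

x = np.array([2.0, -3.0])
g = H @ x - b                                  # gradient at x
eps_star = (g @ g) / (g @ H @ g)

print(f(x - eps_star * g))                                   # smallest value along -g
print(f(x - 0.5 * eps_star * g), f(x - 1.5 * eps_star * g))  # both larger
```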

Second Derivative Test: Critical Points


• At a critical point, f’(x) = 0
• When f’’(x)>0 the first derivative f’(x)
increases as we move to the right and
decreases as we move left
• We conclude that x is a local minimum
• For local maximum, f’(x)=0 and f’’(x)<0
• When f’’(x)=0 test is inconclusive: x may
be a saddle point or part of a flat region

Multidimensional Second derivative test


• In multiple dimensions, we need to examine
second derivatives of all dimensions
• Eigendecomposition generalizes the test
• Test eigenvalues of Hessian to determine
whether critical point is a local maximum, local
minimum or saddle point
• When H is positive definite (all eigenvalues are
positive) the point is a local minimum
• Similarly, negative definite (all eigenvalues negative) implies a local maximum

Saddle point
• Contains both positive and negative curvature
• Function is f(x) = x₁² − x₂²

– Along axis x₁, the function curves upwards: this axis is an eigenvector of H and has a positive eigenvalue
– Along x₂, the function curves downwards; this direction is an eigenvector of H with a negative eigenvalue
– At a saddle point, the Hessian has both positive and negative eigenvalues
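Checking the eigenvalues of the Hessian of f(x) = x₁² − x₂² (a small sketch of the example above):

```python
# The Hessian of f(x) = x1^2 - x2^2 is constant; its eigenvalues have opposite
# signs, so the critical point at the origin is a saddle point.
import numpy as np

H = np.array([[ 2.0,  0.0],      # d^2 f / dx1^2 =  2
              [ 0.0, -2.0]])     # d^2 f / dx2^2 = -2
print(np.linalg.eigvalsh(H))     # [-2.  2.]: one negative, one positive -> saddle
```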

Inconclusive Second Derivative Test


• Multidimensional second derivative test can
be inconclusive just like univariate case
• The test is inconclusive when all non-zero eigenvalues have the same sign but at least one eigenvalue is zero
– since univariate second derivative test is
inconclusive in cross-section corresponding to
zero eigenvalue


Poor Condition Number


• There are different second derivatives in each
direction at a single point
• The condition number of H, λmax/λmin, measures how much they differ
– Gradient descent performs poorly when H has a poor condition number
– This is because in one direction the derivative increases rapidly, while in another direction it increases slowly
– The step size must be small to avoid overshooting the minimum, but then it is too small to make progress in other directions with less curvature

Gradient Descent without H


• Example: H with condition number 5
– The direction of most curvature has five times more curvature than the direction of least curvature
• Due to the small step size, gradient descent wastes time
• An algorithm based on the Hessian can predict that steepest descent is not promising

Newton’s method uses Hessian


• Another second derivative method
– Using the Taylor series of f(x) around the current point x^(0):
  f(x) ≈ f(x^(0)) + (x − x^(0))ᵀ∇ₓf(x^(0)) + ½(x − x^(0))ᵀH(f)(x^(0))(x − x^(0))
• Solve for the critical point of this function to give
  x* = x^(0) − H(f)(x^(0))⁻¹ ∇ₓf(x^(0))

– When f is a positive definite quadratic function, use this solution to jump directly to the minimum of the function
– When f is not quadratic, apply the solution iteratively
• Can reach the critical point much faster than gradient descent
– But useful only when the nearby critical point is a minimum
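A sketch of the iterative use of Newton’s method on an assumed smooth, non-quadratic function f(x) = (x₁ − 1)² + x₁⁴ + 2(x₂ + 3)² (illustration only):

```python
# Newton's method: repeatedly jump to the critical point of the local quadratic
# approximation, x <- x - H(x)^{-1} grad(x).
import numpy as np

def grad(x):
    return np.array([2 * (x[0] - 1) + 4 * x[0] ** 3,
                     4 * (x[1] + 3)])

def hessian(x):
    return np.array([[2 + 12 * x[0] ** 2, 0.0],
                     [0.0,                4.0]])

x = np.array([3.0, 0.0])                             # arbitrary starting point
for _ in range(10):
    x = x - np.linalg.solve(hessian(x), grad(x))     # Newton step

print(x, np.linalg.norm(grad(x)))                    # gradient norm ~0 at the minimum
```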

Summary of Gradient Methods


• First order optimization algorithms: those that
use only the gradient
• Second order optimization algorithms: use
the Hessian matrix such as Newton’s method
• Family of functions used in ML is
complicated, so optimization is more
complex than in other fields
– No guarantees
• Some guarantees are obtained by using Lipschitz continuous functions, for which
  |f(x) − f(y)| ≤ L ‖x − y‖₂
• with Lipschitz constant L

Convex Optimization
• Applicable only to convex functions, i.e., functions which are well behaved
– e.g., they lack saddle points and all local minima are global minima
• For such functions, Hessian is positive
semi-definite everywhere
• Many ML optimization problems,
particularly deep learning, cannot be
expressed as convex optimization

Constrained Optimization
• We may wish to optimize f(x) when the solution
x is constrained to lie in set S
– Such values of x are feasible solutions
• Often we want a solution that is small, such as
||x||≤1
• Simple approach: modify gradient descent
taking constraint into account (using Lagrangian
formulation)


Ex: Least squares with Lagrangian


• We wish to minimize f(x) = ½‖Ax − b‖₂²
• Subject to the constraint xᵀx ≤ 1
• We introduce the Lagrangian
  L(x, λ) = f(x) + λ(xᵀx − 1)
– And solve the problem min_x max_{λ, λ≥0} L(x, λ)

• For the unconstrained problem (no Lagrangian) the smallest-norm solution is x = A⁺b
– If this solution is not feasible, differentiate the Lagrangian wrt x to obtain AᵀAx − Aᵀb + 2λx = 0
– The solution takes the form x = (AᵀA + 2λI)⁻¹Aᵀb
– Choosing λ: continue solving the linear equation and increasing λ until x has the correct norm
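A sketch of this procedure on assumed toy data (the coarse λ increment is only for illustration):

```python
# Solve min 1/2 ||Ax - b||^2 subject to x^T x <= 1 by increasing lambda in
# x = (A^T A + 2*lambda*I)^(-1) A^T b until the norm constraint is satisfied.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 4))
b = 5.0 * rng.normal(size=10)               # scaled so the constraint is likely active

x = np.linalg.pinv(A) @ b                   # unconstrained minimum-norm solution A^+ b
lam = 0.0
while x @ x > 1.0:                          # infeasible: increase the multiplier
    lam += 0.1
    x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(4), A.T @ b)

print(lam, x @ x)                           # lambda found, with x^T x <= 1
```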

Generalized Lagrangian: KKT


• More sophisticated than Lagrangian
• The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization
• While Lagrangian allows equality
constraints, KKT allows both equality and
inequality constraints
• To define a generalized Lagrangian we
need to describe S in terms of equalities
and inequalities

Generalized Lagrangian
• The set S is described in terms of m functions g⁽ⁱ⁾ and n functions h⁽ʲ⁾ so that
  S = { x | ∀i, g⁽ⁱ⁾(x) = 0 and ∀j, h⁽ʲ⁾(x) ≤ 0 }
– The functions g are the equality constraints and the functions h are the inequality constraints
• Introduce new variables λᵢ and αⱼ for each constraint (called KKT multipliers), giving the generalized Lagrangian
  L(x, λ, α) = f(x) + Σᵢ λᵢ g⁽ⁱ⁾(x) + Σⱼ αⱼ h⁽ʲ⁾(x)

• We can now solve the constrained minimization problem using unconstrained optimization of the generalized Lagrangian

Minima in a million dimensions

"If you have a million dimensions, and you're coming down, and you come to a ridge,
even if half the dimensions are going up, the other half are going down!

So you always find a way to get out,"


You never get trapped" on a ridge, at least, not permanently.
https://www.zdnet.com/article/ai-pioneer-sejnowski-says-its-all-about-the-gradient/35
