4.2 Gradient-Based Optimization
Sargur N. Srihari
srihari@cedar.buffalo.edu
Topics
• Numerical Computation
• Gradient-based Optimization
– Stationary points, Local minima
– Second Derivative
– Convex Optimization
– Lagrangian
Gradient-Based Optimization
• Most ML algorithms involve optimization
• Minimize/maximize a function f (x) by altering x
– Usually stated as a minimization
– Maximization is accomplished by minimizing −f(x)
• f (x) is referred to as the objective function or criterion
– In minimization it is also called the loss function, cost function, or error function
– Example: linear least squares, f (x) = ½ ||Ax − b||²
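A minimal NumPy sketch of this objective and its gradient (the matrix A and vector b below are made-up illustrative data):

    import numpy as np

    def least_squares_objective(A, b, x):
        # f(x) = 1/2 * ||Ax - b||^2
        r = A @ x - b
        return 0.5 * (r @ r)

    def least_squares_gradient(A, b, x):
        # Gradient of f: A^T (Ax - b)
        return A.T @ (A @ x - b)

    # Made-up example data
    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    b = np.array([1.0, 2.0, 3.0])
    x = np.zeros(2)
    print(least_squares_objective(A, b, x))   # 7.0
    print(least_squares_gradient(A, b, x))    # [-22. -28.]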
Calculus in Optimization
• Suppose we have a function y = f (x), where x and y are real numbers
– Derivative of function denoted: f’(x) or as dy/dx
• Derivative f’(x) gives the slope of f (x) at point x
• It specifies how to scale a small change in input to obtain
a corresponding change in the output:
f (x + ε) ≈ f (x) + ε f’ (x)
– It tells us how to make a small change in the input to obtain a small improvement in y
– We know that f (x − ε sign(f′(x))) is less than f (x) for small ε. Thus we can reduce f (x) by moving x in small steps with the opposite sign of the derivative
• This technique is called gradient descent (Cauchy 1847)
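A minimal gradient descent sketch in NumPy, reusing the least-squares objective above (the step size ε and iteration count are illustrative choices, not prescribed values):

    import numpy as np

    def gradient_descent(grad_f, x0, epsilon=0.01, steps=5000):
        # Repeatedly move x in small steps with sign opposite to the gradient
        x = x0.copy()
        for _ in range(steps):
            x = x - epsilon * grad_f(x)
        return x

    # Assumed example: f(x) = 1/2 ||Ax - b||^2 with gradient A^T (Ax - b)
    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    b = np.array([1.0, 2.0, 3.0])
    grad_f = lambda x: A.T @ (A @ x - b)

    x_min = gradient_descent(grad_f, np.zeros(2))
    print(x_min)   # approaches the least-squares solution, here about [0, 0.5]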
Directional Derivative
• Directional derivative in direction u (a unit
vector) is the slope of function f in direction u
– This evaluates to uᵀ∇ₓ f (x)
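A small sketch: the directional derivative along a unit vector u is just the dot product of u with the gradient (grad_f and the data here reuse the made-up least-squares example):

    import numpy as np

    def directional_derivative(grad_f, x, u):
        # Slope of f at x in the direction of unit vector u: u^T grad f(x)
        return u @ grad_f(x)

    # Assumed example: f(x) = 1/2 ||Ax - b||^2 with gradient A^T (Ax - b)
    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    b = np.array([1.0, 1.0])
    grad_f = lambda x: A.T @ (A @ x - b)

    x = np.array([1.0, 1.0])
    u = np.array([1.0, 0.0])    # unit vector along the first axis
    print(directional_derivative(grad_f, x, u))   # 20.0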
Second derivative
• Derivative of a derivative
• For a function f: ℝⁿ → ℝ, the derivative with respect to xᵢ of the derivative of f with respect to xⱼ is denoted ∂²f / ∂xᵢ ∂xⱼ
– In a single dimension, we can denote ∂²f / ∂x² by f″(x)
• It tells us how the first derivative will change as we vary the input
• This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone
Hessian
• Second derivative with many dimensions
• H(f )(x) is defined as H(f )(x)ᵢ,ⱼ = ∂²f (x) / ∂xᵢ ∂xⱼ
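As an illustrative check (not part of the original slide), the Hessian of the earlier least-squares objective f (x) = ½ ||Ax − b||² is AᵀA, and a finite-difference approximation can be compared against it:

    import numpy as np

    def numerical_hessian(f, x, eps=1e-5):
        # Finite-difference estimate of H(f)(x)_ij = d^2 f / (dx_i dx_j)
        n = x.size
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                e_i, e_j = np.zeros(n), np.zeros(n)
                e_i[i], e_j[j] = eps, eps
                H[i, j] = (f(x + e_i + e_j) - f(x + e_i) - f(x + e_j) + f(x)) / eps**2
        return H

    # Assumed example: f(x) = 1/2 ||Ax - b||^2, whose exact Hessian is A^T A
    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    b = np.array([1.0, 1.0])
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)

    print(numerical_hessian(f, np.array([0.5, -0.5])))   # approximately [[10, 14], [14, 20]] = A^T A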
Saddle point
• Contains both positive and negative curvature
• Example function: f (x) = x₁² − x₂²
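A quick sketch: the Hessian of f (x) = x₁² − x₂² is constant, and its eigenvalues have mixed signs, which is exactly the positive-and-negative curvature that makes the origin a saddle point:

    import numpy as np

    # Hessian of f(x) = x1^2 - x2^2 (the same at every point)
    H = np.array([[2.0, 0.0],
                  [0.0, -2.0]])

    eigenvalues = np.linalg.eigvalsh(H)
    print(eigenvalues)   # [-2.  2.]: one negative, one positive => saddle point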
Convex Optimization
• Applicable only to convex functions, i.e., functions that are well-behaved:
– e.g., they lack saddle points and all of their local minima are global minima
• For such functions the Hessian is positive semi-definite everywhere (see the sketch below)
• Many ML optimization problems, particularly in deep learning, cannot be expressed as convex optimization
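As an illustrative check (using the made-up A from earlier), the least-squares objective is convex because its Hessian AᵀA has no negative eigenvalues:

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    H = A.T @ A    # Hessian of 1/2 ||Ax - b||^2, the same for every x

    eigenvalues = np.linalg.eigvalsh(H)
    print(eigenvalues)                            # all >= 0, so H is positive semi-definite
    print(bool(np.all(eigenvalues >= -1e-12)))    # True: the objective is convex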
Constrained Optimization
• We may wish to optimize f(x) when the solution
x is constrained to lie in set S
– Such values of x are feasible solutions
• Often we want a solution that is small, such as
||x||≤1
• A simple approach: modify gradient descent to take the constraint into account (e.g., using the Lagrangian formulation; a projected-gradient sketch follows below)
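One such modification (projected gradient descent rather than the Lagrangian formulation covered next) is to take an ordinary gradient step and then project back onto the feasible set; a minimal sketch for the constraint ||x|| ≤ 1, with a made-up objective:

    import numpy as np

    def projected_gradient_descent(grad_f, x0, lr=0.1, steps=100):
        # Gradient step followed by projection onto the unit ball S = {x : ||x|| <= 1}
        x = x0.copy()
        for _ in range(steps):
            x = x - lr * grad_f(x)
            norm = np.linalg.norm(x)
            if norm > 1.0:
                x = x / norm
        return x

    # Assumed example: the unconstrained minimum of 1/2 ||x - c||^2 lies outside the unit ball
    c = np.array([2.0, 0.0])
    grad_f = lambda x: x - c
    print(projected_gradient_descent(grad_f, np.zeros(2)))   # approaches [1, 0], the closest feasible point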
Generalized Lagrangian
• Set S is described in terms of m functions
g(i) and n functions h(j) so that
S = { x | ∀i, g(i)(x) = 0 and ∀j, h(j)(x) ≤ 0 }
– The functions g(i) are the equality constraints and the functions h(j) are the inequality constraints
• Introduce new variables λi and αj for each
constraint (called KKT multipliers) giving
the generalized Lagrangian
L(x, λ, α) = f (x) + Σᵢ λᵢ g(i)(x) + Σⱼ αⱼ h(j)(x)
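A minimal sketch of assembling this Lagrangian in code, on a made-up toy problem (minimize x₁² + x₂² subject to the equality constraint x₁ + x₂ − 1 = 0 and the inequality constraint −x₁ ≤ 0):

    import numpy as np

    def generalized_lagrangian(f, gs, hs, x, lam, alpha):
        # L(x, lambda, alpha) = f(x) + sum_i lambda_i g_i(x) + sum_j alpha_j h_j(x)
        return (f(x)
                + sum(lam_i * g(x) for lam_i, g in zip(lam, gs))
                + sum(a_j * h(x) for a_j, h in zip(alpha, hs)))

    # Made-up toy problem
    f = lambda x: x[0] ** 2 + x[1] ** 2
    gs = [lambda x: x[0] + x[1] - 1.0]   # equality constraint g(x) = 0
    hs = [lambda x: -x[0]]               # inequality constraint h(x) <= 0

    x = np.array([0.5, 0.5])     # constrained minimizer of this toy problem
    lam = np.array([-1.0])       # KKT multiplier for the equality constraint
    alpha = np.array([0.0])      # inequality constraint is inactive here, so alpha = 0
    print(generalized_lagrangian(f, gs, hs, x, lam, alpha))   # 0.5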
Minima in a million dimensions
"If you have a million dimensions, and you're coming down, and you come to a ridge, even if half the dimensions are going up, the other half are going down!"