
3. GRADIENT-BASED OPTIMIZATION

 Most deep learning algorithms involve optimization of some sort.


 Optimization refers to the task of either minimizing or maximizing some function f(x) by
altering x.
 We usually phrase most optimization problems in terms of minimizing f(x).
 Maximization may be accomplished via a minimization algorithm by minimizing - f(x).
 The function we want to minimize or maximize is called the objective function or
criterion.
 When we are minimizing it, we may also call it the cost function, loss function, or error
function.

For example: we might say x* = arg min f(x).

 Suppose we have a function y = f(x), where both x and y are real numbers.

 The derivative of this function is denoted as f'(x) or as dy/dx. The derivative f'(x) gives the
slope of f(x) at the point x.

 In other words, it specifies how to scale a small change in the input in order to obtain the
corresponding change in the output: f(x + ε) ≈ f(x) + ε f'(x).

 The derivative is therefore useful for minimizing a function because it tells us how to
change x in order to make a small improvement in y.
For example:

 We know that f(x − ε sign(f'(x))) is less than f(x) for small enough ε.

 We can thus reduce f (x) by moving x in small steps with opposite sign of the derivative.
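For illustration, here is a minimal Python sketch of this rule; the quadratic f, its derivative f_prime, the starting point, and the step size eps are illustrative choices, not taken from the text.

import numpy as np

def f(x):
    return (x - 3.0) ** 2        # simple quadratic with its minimum at x = 3

def f_prime(x):
    return 2.0 * (x - 3.0)       # analytic derivative of f

x = 0.0                          # starting point
eps = 0.1                        # small step size (the epsilon in the text)

for _ in range(50):
    # move x a small step with the opposite sign of the derivative
    x = x - eps * np.sign(f_prime(x))

print(x, f(x))                   # x ends up at (approximately) the minimizer x = 3

Because each step has the opposite sign of the derivative, x creeps toward the minimizer and stops changing once f'(x) = 0.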

 When f'(x) = 0, the derivative provides no information about which direction to move.

 Points where f'(x) = 0 are known as critical points or stationary points.

 A local minimum is a point where f (x) is lower than at all neighboring points, so it is no
longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point
where f (x) is higher than at all neighboring points.
 The directional derivative in direction u (a unit vector) is the slope of the function f in
direction u.

 In other words, the directional derivative is the derivative of the function f(x + αu) with
respect to α, evaluated at α = 0.

 Using the chain rule, we can see that ∂/∂α f(x + αu) evaluates to u^T ∇_x f(x) when α = 0.

 To minimize f, we would like to find the direction in which f decreases the fastest. We can
do this using the directional derivative:

min_{u, u^T u = 1} u^T ∇_x f(x) = min_{u, u^T u = 1} ||u||_2 ||∇_x f(x)||_2 cos θ,

where θ is the angle between u and the gradient.

 Substituting in ||u||_2 = 1 and ignoring factors that do not depend on u, this simplifies to
min_u cos θ. This is minimized when u points in the opposite direction as the gradient. In
other words, the gradient points directly uphill, and the negative gradient points directly
downhill. We can decrease f by moving in the direction of the negative gradient. This is
known as the method of steepest descent or gradient descent.

 Steepest descent proposes a new point

x' = x − ε ∇_x f(x),

where ε is the learning rate, a positive scalar determining the size of the step. We can
choose ε in several different ways. A popular approach is to set ε to a small constant.
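As a purely illustrative Python sketch of this update rule, the quadratic objective and the names A, b, grad_f, and eps below are assumptions made for the example, not part of the text.

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])           # symmetric positive definite matrix (illustrative)
b = np.array([1.0, -2.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x   # quadratic objective whose minimizer solves A x = b

def grad_f(x):
    return A @ x - b                 # gradient of the quadratic above

x = np.zeros(2)                      # initial point
eps = 0.1                            # learning rate: a small positive constant

for _ in range(200):
    x = x - eps * grad_f(x)          # steepest descent update x' = x - eps * grad f(x)

print(x)                             # close to the exact minimizer printed below
print(np.linalg.solve(A, b))

Here the learning rate is simply a fixed small constant, the popular choice mentioned above.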
Beyond the Gradient: Jacobian and Hessian Matrices

 Sometimes we need to find all of the partial derivatives of a function whose input and
output are both vectors.

 The matrix containing all such partial derivatives is known as a Jacobian matrix.

 Specifically, if we have a function f : R^m → R^n, then the Jacobian matrix J ∈ R^(n×m) of f
is defined such that J_{i,j} = ∂f(x)_i / ∂x_j.
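As a sketch, the Jacobian can be estimated numerically in Python; the example function f and the helper jacobian below are illustrative, and central differences are just one way to approximate the partial derivatives.

import numpy as np

def f(x):
    # example vector-valued function f : R^3 -> R^2 (illustrative)
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian(f, x, h=1e-6):
    # numerical Jacobian: J[i, j] = d f(x)_i / d x_j, by central differences
    fx = f(x)
    n, m = fx.shape[0], x.shape[0]
    J = np.zeros((n, m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0, 0.5])
print(jacobian(f, x))
# analytic Jacobian for comparison: [[x[1], x[0], 0], [0, 0, cos(x[2])]]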

 We are also sometimes interested in a derivative of a derivative. This is known as a second
derivative.

For example: for a function f : R^n → R, the derivative with respect to x_i of the derivative of f
with respect to x_j is denoted as ∂²f / ∂x_i ∂x_j.

 In a single dimension, we can denote d²f/dx² by f''(x).
 The second derivative tells us how the first derivative will change as we vary the input.

Suppose we have a quadratic function.
o If such a function has a second derivative of zero, then there is no curvature.

o It is a perfectly flat line, and its value can be predicted using only the gradient.

o If the gradient is 1, then we can make a step of size ε along the negative gradient, and the
cost function will decrease by ε.

o If the second derivative is negative, the function curves downward, so the cost function
will actually decrease by more than ε.

o Finally, if the second derivative is positive, the function curves upward, so the cost
function can decrease by less than ε.
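These three cases can be checked numerically; in the Python sketch below, the three one-dimensional functions and the step size are illustrative choices, each constructed to have derivative 1 at x = 0.

eps = 0.5   # step size along the negative gradient

flat    = lambda x: x                     # second derivative 0  (no curvature)
concave = lambda x: x - 0.5 * x ** 2      # second derivative -1 (curves downward)
convex  = lambda x: x + 0.5 * x ** 2      # second derivative +1 (curves upward)

for name, f in [("zero curvature", flat),
                ("negative curvature", concave),
                ("positive curvature", convex)]:
    decrease = f(0.0) - f(0.0 - eps)      # actual decrease after the step x -> x - eps
    print(f"{name}: decrease = {decrease:.3f}  (first-order prediction = {eps})")

The printed decreases are exactly eps, more than eps, and less than eps, respectively.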
When our function has multiple input dimensions, there are many second derivatives. These
derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix
H(f)(x) is defined such that

H(f)(x)_{i,j} = ∂²f(x) / ∂x_i ∂x_j.
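As a sketch, the Hessian of a scalar function can be estimated with finite differences in Python; the example function f and the helper hessian are illustrative, not a library routine.

import numpy as np

def f(x):
    # example scalar function of several variables (illustrative)
    return x[0] ** 2 + 3.0 * x[0] * x[1] + 2.0 * x[1] ** 2

def hessian(f, x, h=1e-4):
    # numerical Hessian: H[i, j] = d^2 f / (d x_i d x_j), by central differences
    m = x.shape[0]
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei, ej = np.zeros(m), np.zeros(m)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

print(hessian(f, np.array([1.0, -1.0])))
# analytic Hessian for comparison: [[2, 3], [3, 4]] (symmetric, as expected)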

 Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can
decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.

 The second derivative in a specific direction represented by a unit vector d is given by
d^T H d.

 When d is an eigenvector of H, the second derivative in that direction is given by the
corresponding eigenvalue.

 For other directions of d the directional second derivative is a weighted average of all of
the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle
with d receiving more weight.

 The maximum eigenvalue determines the maximum second derivative and the minimum
eigenvalue determines the minimum second derivative.
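These properties are easy to verify numerically; in the Python sketch below, the symmetric matrix H stands in for a Hessian and is an illustrative choice.

import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 3.0]])                 # symmetric matrix standing in for a Hessian

eigvals, eigvecs = np.linalg.eigh(H)       # real eigenvalues, orthonormal eigenvectors

d = eigvecs[:, 0]                          # an eigenvector of H
print(d @ H @ d, eigvals[0])               # directional second derivative equals its eigenvalue

d = np.array([1.0, 1.0]) / np.sqrt(2.0)    # some other unit direction
print(eigvals.min(), d @ H @ d, eigvals.max())   # d^T H d lies between the extreme eigenvalues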

 The (directional) second derivative tells us how well we can expect a gradient descent step
to perform.
 We can make a second-order Taylor series approximation to the function f(x) around the
current point x^(0):

f(x) ≈ f(x^(0)) + (x − x^(0))^T g + (1/2)(x − x^(0))^T H (x − x^(0)),

where g is the gradient and H is the Hessian at x^(0). If we use a learning rate ε, the new
point will be x^(0) − εg; substituting this in gives

f(x^(0) − εg) ≈ f(x^(0)) − ε g^T g + (1/2) ε² g^T H g.
 When g^T H g is zero or negative, the Taylor series approximation predicts that increasing ε
forever will decrease f forever.

 When g^T H g is positive, solving for the optimal step size that decreases the Taylor series
approximation of the function the most yields

ε* = g^T g / (g^T H g).
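A minimal numerical check of this step-size formula, assuming a simple quadratic f(x) = 0.5 x^T H x (so that g = H x); the matrix H and the point x0 are illustrative.

import numpy as np

H = np.array([[3.0, 0.0],
              [0.0, 1.0]])                 # Hessian of the illustrative quadratic
x0 = np.array([1.0, 2.0])
g = H @ x0                                 # gradient of f at x0

eps_star = (g @ g) / (g @ H @ g)           # optimal step size from the Taylor approximation
x1 = x0 - eps_star * g                     # gradient step using that step size

f = lambda x: 0.5 * x @ H @ x
print(eps_star)
print(f(x0), f(x1))                        # f decreases after the step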
Figure: A saddle point containing both positive and negative curvature.

 The function in this example is f(x) = x_1^2 − x_2^2. Along the axis corresponding to x_1, the
function curves upward.
Figure: Gradient descent fails to exploit the curvature information.

 This failure can be addressed by using information from the Hessian matrix to guide the
search. The simplest method for doing so is known as Newton’s method. Newton’s method is
based on using a second-order Taylor series expansion to approximate f(x) near some
point x^(0):

f(x) ≈ f(x^(0)) + (x − x^(0))^T ∇_x f(x^(0)) + (1/2)(x − x^(0))^T H(f)(x^(0)) (x − x^(0)).

Solving for the critical point of this approximation gives the Newton update
x* = x^(0) − H(f)(x^(0))^(−1) ∇_x f(x^(0)).
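A sketch of a single Newton step in Python, on an illustrative quadratic for which the second-order Taylor approximation is exact, so one step jumps straight to the minimum; H, b, and the starting point x0 are assumptions made for the example.

import numpy as np

H = np.array([[3.0, 0.5],
              [0.5, 1.0]])                 # Hessian of the quadratic f(x) = 0.5 x^T H x - b^T x
b = np.array([1.0, -2.0])

grad = lambda x: H @ x - b                 # gradient of f
x0 = np.array([5.0, 5.0])                  # starting point

# Newton step: jump to the critical point of the second-order Taylor approximation
x_star = x0 - np.linalg.solve(H, grad(x0))

print(x_star)
print(np.linalg.solve(H, b))               # exact minimizer: the two outputs agree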

 Optimization algorithms that use only the gradient, such as gradient descent, are called
first-order optimization algorithms.

 Optimization algorithms that also use the Hessian matrix, such as Newton’s method, are
called second-order optimization algorithms.
 In the context of deep learning, we sometimes gain some guarantees by restricting
ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous
derivatives.
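For reference, f has Lipschitz continuous values with constant L if |f(x) − f(y)| ≤ L ||x − y||_2 for all x and y, so the function cannot change faster than rate L. Below is a small Python sanity check on the illustrative example sin(x), which satisfies this with L = 1 because its derivative is bounded by 1 in absolute value.

import numpy as np

L = 1.0
rng = np.random.default_rng(0)
x = rng.uniform(-10.0, 10.0, 1000)
y = rng.uniform(-10.0, 10.0, 1000)

# check the Lipschitz bound on many random pairs (small tolerance for rounding)
assert np.all(np.abs(np.sin(x) - np.sin(y)) <= L * np.abs(x - y) + 1e-12)
print("Lipschitz bound holds on all sampled pairs")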

 Convex optimization algorithms are able to provide many more guarantees by making
stronger restrictions.

 Convex optimization algorithms are applicable only to convex functions -- functions for
which the Hessian is positive semidefinite everywhere.

 Such functions are well-behaved because they lack saddle points and all of their local
minima are necessarily global minima.

 However, most problems in deep learning are difficult to express in terms of convex
optimization. Convex optimization is used only as a subroutine of some deep learning
algorithms.

 Ideas from the analysis of convex optimization algorithms can be useful for proving the
convergence of deep learning algorithms.

 However, in general, the importance of convex optimization is greatly diminished in the
context of deep learning.
