Gradient-Based Optimization
Suppose we have a function y = f(x), where both x and y are real numbers.
The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the
slope of f(x) at the point x.
In other words, it specifies how to scale a small change in the input in order to obtain the
corresponding change in the output: f(x + ε) ≈ f(x) + εf′(x).
The derivative is therefore useful for minimizing a function because it tells us how to
change x in order to make a small improvement in y.
For example:
We know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε.
We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative.
When f′(x) = 0, the derivative provides no information about which direction to move.
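As a minimal sketch of this procedure, assuming an illustrative function f(x) = x² (not taken from the text) and its hand-coded derivative:

import numpy as np

# Illustrative function and its derivative (assumed for this sketch;
# the text does not fix a particular f).
f = lambda x: x ** 2
f_prime = lambda x: 2.0 * x

x = 3.0        # starting point
eps = 0.1      # size of each small step

for _ in range(50):
    # Move x with the opposite sign of the derivative: x ← x − ε sign(f′(x)).
    x = x - eps * np.sign(f_prime(x))

print(x, f(x))  # x ends up within eps of the minimizer at x = 0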
Points where f′(x) = 0 are known as critical points, or stationary points.
A local minimum is a point where f(x) is lower than at all neighboring points, so it is no
longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point
where f(x) is higher than at all neighboring points, so it is not possible to increase f(x) by
making infinitesimal steps. Some critical points are neither maxima nor minima; these are
known as saddle points.
The directional derivative in direction u (a unit vector) is the slope of the function f in
direction u.
In other words, the directional derivative is the derivative of the function f(x + αu) with
respect to α, evaluated at α = 0.
Using the chain rule, we can see that ∂/∂α f(x + αu) evaluates to uᵀ∇ₓf(x) when α = 0.
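A quick numerical check of this identity, using an arbitrary smooth function and a finite-difference approximation of d/dα f(x + αu); the specific function, point, and direction below are assumptions for illustration:

import numpy as np

# Arbitrary smooth function of a vector input (illustrative choice).
def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1] + x[1] ** 2

def grad_f(x):
    # Analytic gradient ∇f(x) of the function above.
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 2.0 * x[1]])

x = np.array([1.0, -2.0])
u = np.array([3.0, 4.0]) / 5.0   # unit vector defining the direction
alpha = 1e-6

# d/dα f(x + αu) at α = 0, approximated by a finite difference.
numeric = (f(x + alpha * u) - f(x)) / alpha
analytic = u @ grad_f(x)          # uᵀ∇ₓf(x)
print(numeric, analytic)          # the two values agree closely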
To minimize f, we would like to find the direction in which f decreases the fastest. We can
do this using the directional derivative:
min_{u, uᵀu = 1} uᵀ∇ₓf(x) = min_{u, uᵀu = 1} ||u||₂ ||∇ₓf(x)||₂ cos θ,
where θ is the angle between u and the gradient. Substituting in ||u||₂ = 1 and ignoring
factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u
points in the opposite direction as the gradient. In other words, the gradient points directly
uphill, and the negative gradient points directly downhill. We can decrease f by moving in
the direction of the negative gradient. This is known as the method of steepest descent, or
gradient descent. Steepest descent proposes a new point
x′ = x − ε∇ₓf(x),
where ε is the learning rate, a positive scalar determining the size of the step. We can
choose ε in several different ways. A popular approach is to set ε to a small constant.
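A minimal sketch of gradient descent with a constant learning rate; the quadratic cost and the values of A, b, and ε below are assumptions chosen for illustration:

import numpy as np

# Illustrative quadratic cost f(x) = 1/2 xᵀAx − bᵀx.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

x = np.zeros(2)
eps = 0.1                        # learning rate: a small positive constant

for _ in range(200):
    x = x - eps * grad_f(x)      # x′ = x − ε∇ₓf(x)

print(x, np.linalg.solve(A, b))  # the iterate approaches the minimizer A⁻¹b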
Beyond the Gradient: Jacobian and Hessian Matrices
Sometimes we need to find all of the partial derivatives of a function whose input and
output are both vectors. The matrix containing all such partial derivatives is known as a
Jacobian matrix. Specifically, for a function f mapping ℝᵐ to ℝⁿ, the Jacobian matrix J of f
is the n × m matrix defined such that J_{i,j} = ∂f(x)_i / ∂x_j.
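A sketch of how such a matrix can be approximated numerically; the finite-difference helper jacobian and the example function g are assumptions for illustration:

import numpy as np

def jacobian(f, x, h=1e-6):
    # Finite-difference approximation of the Jacobian of a vector-valued f at x.
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += h
        J[:, j] = (f(x_step) - y) / h   # column j holds ∂f(x)_i / ∂x_j
    return J

# Illustrative function from ℝ² to ℝ².
g = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(jacobian(g, np.array([1.0, 2.0])))
# Exact Jacobian at (1, 2): [[2, 1], [cos(1), 0]]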
We are also sometimes interested in a derivative of a derivative: the second derivative f″(x),
which tells us how the first derivative will change as we vary the input. We can think of the
second derivative as measuring curvature, which determines how much a gradient descent
step actually improves the cost (see the numerical sketch after this list).
o If the second derivative is zero, there is no curvature. It is a perfectly flat line, and its value
can be predicted using only the gradient.
o If the gradient is 1, then we can make a step of size ε along the negative gradient, and the
cost function will decrease by ε.
o If the second derivative is negative, the function curves downward, so the cost function
will actually decrease by more than ε.
o Finally, if the second derivative is positive, the function curves upward, so the cost
function can decrease by less than ε.
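A small numerical illustration of these three cases, using hand-picked one-dimensional functions that all have derivative 1 at x = 0 (the functions themselves are assumptions for illustration):

eps = 0.1   # step size along the negative gradient

# Three functions with f′(0) = 1 but different second derivatives at 0.
flat    = lambda x: x             # f″(0) = 0  (no curvature)
concave = lambda x: x - x ** 2    # f″(0) = −2 (curves downward)
convex  = lambda x: x + x ** 2    # f″(0) = +2 (curves upward)

for name, f in [("zero curvature", flat),
                ("negative curvature", concave),
                ("positive curvature", convex)]:
    decrease = f(0.0) - f(-eps)   # decrease after stepping by eps against the gradient
    print(name, decrease)
# zero curvature:     decrease equals eps exactly (0.10)
# negative curvature: decrease greater than eps   (0.11)
# positive curvature: decrease less than eps      (0.09)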
When our function has multiple input dimensions, there are many second derivatives. These
derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix
H(f)(x) is defined such that H(f)(x)_{i,j} = ∂²f(x) / (∂x_i ∂x_j).
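A sketch of this definition as a finite-difference computation; the helper hessian and the example function are illustrative assumptions:

import numpy as np

def hessian(f, x, h=1e-4):
    # Finite-difference approximation of the Hessian of a scalar-valued f at x.
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            # H[i, j] ≈ ∂²f(x) / (∂x_i ∂x_j)
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i) - f(x + e_j) + f(x)) / h ** 2
    return H

# Illustrative function f(x) = x₀² x₁ + x₁³.
f = lambda x: x[0] ** 2 * x[1] + x[1] ** 3
print(hessian(f, np.array([1.0, 2.0])))
# Exact Hessian at (1, 2): [[4, 2], [2, 12]] (note that it is symmetric)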
Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can
decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
The second derivative in a specific direction represented by a unit vector d is given by dᵀHd.
When d is an eigenvector of H, the second derivative in that direction is given by the
corresponding eigenvalue.
For other directions of d, the directional second derivative is a weighted average of all of
the eigenvalues, with weights between 0 and 1, and eigenvectors that have a smaller angle
with d receiving more weight.
The maximum eigenvalue determines the maximum second derivative and the minimum
eigenvalue determines the minimum second derivative.
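A numerical sketch of these facts for one arbitrary symmetric Hessian (the matrix H below is an assumption for illustration):

import numpy as np

# A symmetric matrix standing in for a Hessian (illustrative choice).
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Real eigenvalues and an orthogonal basis of eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(H)

# Directional second derivatives dᵀHd for many random unit vectors d.
rng = np.random.default_rng(0)
d = rng.normal(size=(1000, 2))
d /= np.linalg.norm(d, axis=1, keepdims=True)
second_derivatives = np.einsum('ni,ij,nj->n', d, H, d)

# Every dᵀHd lies between the smallest and largest eigenvalue.
print(eigenvalues.min(), second_derivatives.min())
print(eigenvalues.max(), second_derivatives.max())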
The (directional) second derivative tells us how well we can expect a gradient descent step
to perform.
We can make a second-order Taylor series approximation to the function f(x) around the
current point x⁽⁰⁾:
f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ g + ½ (x − x⁽⁰⁾)ᵀ H (x − x⁽⁰⁾),
where g is the gradient and H is the Hessian at x⁽⁰⁾. If we use a learning rate ε, then the new
point x will be given by x⁽⁰⁾ − εg. Substituting this into the approximation, we obtain
f(x⁽⁰⁾ − εg) ≈ f(x⁽⁰⁾) − ε gᵀg + ½ ε² gᵀHg.
When gᵀHg is zero or negative, the Taylor series approximation predicts that increasing ε
forever will decrease f forever.
When gᵀHg is positive, solving for the optimal step size that decreases the Taylor series
approximation of the function the most yields
ε* = gᵀg / (gᵀHg).
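A sketch of this optimal step size on an illustrative quadratic, for which the second-order Taylor approximation is exact; H, the starting point, and the small constant ε are assumptions:

import numpy as np

# Illustrative quadratic f(x) = 1/2 xᵀHx, so the approximation above is exact.
H = np.array([[3.0, 0.0],
              [0.0, 1.0]])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

x0 = np.array([1.0, 2.0])
g = grad(x0)

# Optimal step size ε* = gᵀg / (gᵀHg), valid because gᵀHg > 0 here.
eps_star = (g @ g) / (g @ H @ g)

print(f(x0), f(x0 - eps_star * g))   # decrease achieved with the optimal step
print(f(x0 - 0.01 * g))              # a small constant step decreases f by less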
Figure: A saddle point containing both positive and negative curvature. The function in this
example is f(x) = x₁² − x₂². Along the axis corresponding to x₁, the function curves upward;
along the axis corresponding to x₂, it curves downward.
Gradient descent, which uses only the gradient, does not exploit this curvature information;
we can do better by using information from the Hessian matrix to guide the search. The
simplest method for doing so is known as Newton's method. Newton's method is based on
using a second-order Taylor series expansion to approximate f(x) near some point x⁽⁰⁾:
f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ ∇ₓf(x⁽⁰⁾) + ½ (x − x⁽⁰⁾)ᵀ H(f)(x⁽⁰⁾) (x − x⁽⁰⁾).
Solving for the critical point of this approximation gives the Newton update
x* = x⁽⁰⁾ − H(f)(x⁽⁰⁾)⁻¹ ∇ₓf(x⁽⁰⁾).
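A minimal sketch of one Newton update on an illustrative quadratic cost (A and b are assumptions; for a quadratic, the constant Hessian makes a single step land on the minimum, while a non-quadratic f would require iterating the update):

import numpy as np

# Illustrative quadratic f(x) = 1/2 xᵀAx − bᵀx, whose Hessian is the constant A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])

grad    = lambda x: A @ x - b   # ∇ₓf(x)
hessian = lambda x: A           # H(f)(x)

x = np.zeros(2)
# Newton update: x ← x − H(f)(x)⁻¹ ∇ₓf(x), computed with a linear solve.
x = x - np.linalg.solve(hessian(x), grad(x))

print(x, np.linalg.solve(A, b))  # one step lands on the minimizer A⁻¹b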
Optimization algorithms that use only the gradient, such as gradient descent, are called
first-order optimization algorithms.
Optimization algorithms that also use the Hessian matrix, such as Newton's method, are
called second-order optimization algorithms.
In the context of deep learning, we sometimes gain some guarantees by restricting
ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous
derivatives. A Lipschitz continuous function is a function f whose rate of change is bounded
by a Lipschitz constant L: for all x and y, |f(x) − f(y)| ≤ L ||x − y||₂. This property lets us
quantify the assumption that a small change in the input made by an algorithm such as
gradient descent will produce only a small change in the output.
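A rough numerical check of this bound for one Lipschitz continuous function; the choice f(x) = |x| with constant L = 1 is an illustrative assumption:

import numpy as np

f, L = np.abs, 1.0   # |x| is Lipschitz continuous with constant L = 1

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)

# |f(x) − f(y)| ≤ L |x − y| should hold for every sampled pair.
print(np.all(np.abs(f(x) - f(y)) <= L * np.abs(x - y) + 1e-12))   # True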
Convex optimization algorithms are able to provide many more guarantees by making
stronger restrictions.
Convex optimization algorithms are applicable only to convex functions -- functions for
which the Hessian is positive semidefinite everywhere.
Such functions are well-behaved because they lack saddle points and all of their local
minima are necessarily global minima.
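A small sketch of that criterion, checking positive semidefiniteness through the eigenvalues of the Hessian for one convex quadratic and for the saddle-shaped function from the figure above (the matrices are written out by hand for illustration):

import numpy as np

def is_positive_semidefinite(H, tol=1e-10):
    # A real symmetric matrix is positive semidefinite iff all eigenvalues are >= 0.
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of the convex quadratic f(x) = x₁² + x₂².
H_convex = np.array([[2.0, 0.0],
                     [0.0, 2.0]])

# Hessian of f(x) = x₁² − x₂², the saddle point example from the figure.
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])

print(is_positive_semidefinite(H_convex))   # True: the function is convex
print(is_positive_semidefinite(H_saddle))   # False: the function has a saddle point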
However, most problems in deep learning are difficult to express in terms of convex
optimization. Convex optimization is used only as a subroutine of some deep learning
algorithms.
Ideas from the analysis of convex optimization algorithms can be useful for proving the
convergence of deep learning algorithms.