1 Overview
In the previous lecture we introduced the gradient descent algorithm, and mentioned that it falls
under a broader category of methods. In this lecture we describe this general approach called steepest
descent. We will explain how gradient descent is an example of this method, and also introduce
the coordinate descent algorithm, which is another example of the steepest descent method. Lastly,
we will present Newton's method, a general approach for solving systems of non-linear equations.
Newton's method can conceptually be seen as a steepest descent method, and we will show how it
can be applied to convex optimization.
2 Steepest Descent
As discussed in the previous lecture, one can view the search for a stationary point as an iterative
procedure that generates a point x^(k+1) by taking a step of a certain length t_k in a direction ∆x^(k) from
the previous point x^(k). The direction ∆x^(k) determines where we search next, and the step
size determines how far we go in that particular direction. We can write this update rule as:

x^(k+1) = x^(k) + t_k ∆x^(k)
A steepest descent algorithm is an algorithm which follows the above update rule, where
at each iteration the direction ∆x^(k) is the steepest direction we can take. That is, the algorithm
continues its search in the direction which will minimize the value of the function, given the current
point. In other words, given a particular point x, we would like to find the direction d s.t.
f(x + d) is minimized.
Finding the steepest direction. In order to find the steepest direction, we can approximate
the function via a first-order Taylor expansion:
f(x + d) ≈ f(x) + ∇f(x)^⊤ d
Finding the direction d that minimizes this approximation gives the following optimization problem¹:

min_{d : ‖d‖ = 1} ∇f(x)^⊤ d
In general, one may consider various norms for the minimization problem. As we will now see, the
interpretation of steepest descent with different norms leads to different algorithms.
¹ Recall that a direction is a vector of unit length.
2.1 Steepest descent in ℓ_2: gradient descent
In the case of the ℓ_2 norm, let us first find d^⋆ ∈ argmax_{‖d‖_2 = 1} ∇f(x)^⊤ d. From the Cauchy-Schwarz
inequality we know that ∇f(x)^⊤ d ≤ ‖∇f(x)‖‖d‖, with equality when d = λ∇f(x) for some λ ≥ 0. Since
‖d‖ = 1 this implies:

d^⋆ = ∇f(x) / ‖∇f(x)‖
Since we aim to minimize the function, we seek d̂ ∈ argmin_{‖d‖_2 = 1} ∇f(x)^⊤ d, which is simply −d^⋆.
So the iterations of steepest descent in the ℓ_2 norm are:

x^(k+1) = x^(k) − t_k ∇f(x^(k))

where t_k is some step size (think of it as some term multiplied by ‖∇f(x^(k))‖^{−1}). Notice that
this is the gradient descent algorithm.
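As a concrete illustration, here is a minimal Python sketch of this update rule. The gradient handle grad_f, the fixed step size t, and the iteration count are illustrative choices rather than part of the lecture; in practice t_k is often chosen by a line search.

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.1, num_iters=100):
    """Steepest descent in the l2 norm: x^(k+1) = x^(k) - t_k * grad f(x^(k))."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad_f(x)  # move against the gradient with (fixed) step size t
    return x

# Toy usage: minimize f(x) = ||x||^2, whose gradient is 2x; the minimizer is the origin.
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```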
2.2 Steepest descent in ℓ_1: coordinate descent

In the case of the ℓ_1 norm, we seek d̂ ∈ argmin_{‖d‖_1 = 1} ∇f(x)^⊤ d. We discussed these special cases
in lecture 5 when we proved the minimax theorem: a linear function over the ℓ_1 ball attains its optimum
at a vertex ±e_i, where e_i is the standard basis vector with 1 in index i and 0s in all other indices.
In this case the optimal solution is to find the index i ∈ [n] for which the absolute value |∂f(x)/∂x_i|
is largest, and to take the direction ±e_i, with a negative sign if ∂f(x)/∂x_i is positive and a positive
sign otherwise. So the steepest direction is the direction d which satisfies:

d = −sign(∂f(x)/∂x_i) · e_i ,   i ∈ argmax_{i∈[n]} |∂f(x)/∂x_i|
This gives us a very simple algorithm called coordinate descent: we start with an arbitrary point,
and at every stage we iterate over all the coordinates of ∇f(x), find the one with the largest absolute
value, and descend in that direction.
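A minimal Python sketch of this procedure, under the same illustrative assumptions as before (a gradient handle grad_f, a fixed step size t, and a fixed number of iterations):

```python
import numpy as np

def coordinate_descent(grad_f, x0, t=0.1, num_iters=100):
    """Steepest descent in the l1 norm: step only along the coordinate with the
    largest absolute partial derivative, against its sign."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_f(x)
        i = np.argmax(np.abs(g))    # coordinate with the largest |partial derivative|
        x[i] -= t * np.sign(g[i])   # descend along +/- e_i with step size t
    return x

# Toy usage: minimize f(x) = ||x||^2 with gradient 2x.
x_min = coordinate_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```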
3 Newton’s Method
The goal of Newton's method is to find solutions x for which h(x) = 0, for some function h : ℝ^n → ℝ^n.
The main idea is to start with an arbitrary point and iteratively move to the point at which the
tangent line of the function at the current point evaluates to zero.

[Figure: tangent lines of the function at a given point.]
Newton's method for single-dimensional functions. For simplicity, let's start with functions
h : ℝ → ℝ. The tangent line of a function h at a point x^(k) is y = h(x^(k)) + h'(x^(k))(x − x^(k));
setting it to zero gives:

h(x^(k)) + h'(x^(k))(x − x^(k)) = 0  ⟺  x = x^(k) − h(x^(k)) / h'(x^(k))
At every step k, Newton's method sets the next guess to be x^(k+1) = x^(k) − h(x^(k))/h'(x^(k)),
until it finds a point at which h is close enough to zero. We describe the algorithm below.
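One possible rendering of this iteration as code; a minimal sketch, in which the tolerance tol and the iteration cap max_iters are illustrative stopping choices not specified in the lecture:

```python
def newton_1d(h, h_prime, x0, tol=1e-10, max_iters=100):
    """Find a root of h by repeating x <- x - h(x) / h'(x)."""
    x = x0
    for _ in range(max_iters):
        if abs(h(x)) <= tol:       # h(x) is close enough to zero: stop
            break
        x = x - h(x) / h_prime(x)  # jump to the root of the tangent line at x
    return x

# Toy usage: the positive root of h(x) = x^2 - 2 is sqrt(2).
root = newton_1d(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
```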
Newton's method for multidimensional functions. More generally, Newton's method can
be applied to multidimensional functions g : ℝ^n → ℝ^n, where g(x) = (g_1(x), g_2(x), . . . , g_n(x)). In
this case, the Newton iterates satisfy:

x^(k+1) = x^(k) − J_g(x^(k))^{−1} g(x^(k))

where J_g(x^(k)) denotes the Jacobian of g at x^(k).
In our case, we are interested in minimizing a convex function f : ℝ^n → ℝ. In the previous lecture
we showed that if the function is convex then a stationary point is the minimizer. We would therefore
like to find the solution x for which ∇f(x) = 0. Note that the Jacobian of ∇f is in fact the Hessian H_f,
and the Newton iterates for solving ∇f(x) = 0 are therefore:

x^(k+1) = x^(k) − H_f(x^(k))^{−1} ∇f(x^(k))
Example. Consider the problem of minimizing f(x, y) = x^4 + 2x^2 y^2 + y^4 using Newton's method. Let's
first write the gradient and the Hessian:
∇f(x, y) = ( ∂f(x,y)/∂x , ∂f(x,y)/∂y ) = ( 4x^3 + 4xy^2 , 4x^2 y + 4y^3 )

H_f(x, y) = [ 12x^2 + 4y^2      8xy
              8xy               4x^2 + 12y^2 ]
Let's assume we start from a guess x^(0) = (a, a) for some a ∈ ℝ. An equivalent way to write the
condition for the next iteration is as the linear system H_f(x^(0))(x^(1) − x^(0)) = −∇f(x^(0)).
Thus the next point x^(1) = (x, y) is the one whose values x, y satisfy:

(16a^2)(x − a) + (8a^2)(y − a) = −8a^3
(8a^2)(x − a) + (16a^2)(y − a) = −8a^3
One can check that this system is solved by (x, y) = (2/3)(a, a). In general the k-th iterate will be:

x^(k) = (2/3)^k · (a, a)
which converges to zero exponentially fast in k.
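To sanity-check this computation numerically, here is a small Python sketch that runs the Newton iteration x^(k+1) = x^(k) − H_f(x^(k))^{−1} ∇f(x^(k)) on this example; the starting value a = 1 is an arbitrary illustrative choice:

```python
import numpy as np

def grad_f(v):
    x, y = v
    return np.array([4 * x**3 + 4 * x * y**2, 4 * x**2 * y + 4 * y**3])

def hess_f(v):
    x, y = v
    return np.array([[12 * x**2 + 4 * y**2, 8 * x * y],
                     [8 * x * y, 4 * x**2 + 12 * y**2]])

a = 1.0
v = np.array([a, a])
for k in range(5):
    v = v - np.linalg.solve(hess_f(v), grad_f(v))  # Newton step: solve H d = grad f, then subtract d
    print(k + 1, v)  # each iterate equals (2/3)^(k+1) * (a, a)
```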
It is important to note that Newton's method will not always converge. For the (non-convex)
function f(x) = x^3 − 2x + 2, for example, if we start at x^(0) = 0, the points will oscillate between
0 and 1 and will never converge to the root: x^(k) = 1 for all odd k and x^(k) = 0 for all even k.
For the function f(x) = (2/3) · |x|^{3/2} one can check that Newton's method oscillates as well.
So when can we at least guarantee that Newton's method does not oscillate?
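The oscillation in the first example is easy to check numerically; a short sketch of the root-finding iteration, starting at x^(0) = 0 as in the text:

```python
# Newton's root-finding iteration for f(x) = x^3 - 2x + 2, started at x = 0.
f = lambda x: x**3 - 2 * x + 2
f_prime = lambda x: 3 * x**2 - 2

x = 0.0
for k in range(6):
    x = x - f(x) / f_prime(x)
    print(k + 1, x)  # alternates 1.0, 0.0, 1.0, 0.0, ...
```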
Definition. A vector d ∈ ℝ^n is called a descent direction for f : ℝ^n → ℝ at a point x ∈ ℝ^n
if ∇f(x)^⊤ d < 0.
Proposition 1. For a given function f, if its Hessian H_f is positive definite, then d = −H_f(x^(k))^{−1} ∇f(x^(k))
is a descent direction at x^(k). That is, Newton's method decreases the function value at every iteration.

Proof. Since H_f is positive definite, its inverse is also positive definite. Thus, whenever ∇f(x^(k)) ≠ 0:

∇f(x^(k))^⊤ d = −∇f(x^(k))^⊤ H_f(x^(k))^{−1} ∇f(x^(k)) < 0
4 Further Reading
For further reading on steepest descent and Newton's method, see Chapter 9 of the Convex Optimization
book by Boyd and Vandenberghe.