Chapter 3: Descent Methods
Methods that use information about gradients to obtain descent in the objective function at each
iteration form the basis of all of the schemes studied in this book. We describe several methods
of this type, along with analysis of their convergence and complexity properties. This chapter can
be read as an introduction both to the gradient methods and to the fundamental tools of analysis
that are used to understand optimization algorithms.
Throughout the chapter, we consider the unconstrained minimization of a smooth convex function:
\[
\min_{x \in \mathbb{R}^n} f(x). \tag{3.1}
\]
The algorithms we consider in this chapter are suited to the case in which f and its gradient ∇f
can be evaluated—exactly, in principle—at arbitrary points x. Bearing in mind that this setup
may not hold for many data analysis problems, we focus on those fundamental algorithms that can
be extended to more general situations, for example:
• Minimization of smooth functions over simple constraint sets, such as bounds on the compo-
nents of x;
• Functions for which $f$ or $\nabla f$ cannot be evaluated exactly without a complete sweep through the data set, but for which unbiased estimates of $\nabla f$ can be obtained easily.
Extensions to the fundamental methods of this chapter, which allow us to handle these more general cases, will be considered in subsequent chapters.
Line-search methods proceed by identifying, at each point $x$, a direction $d$ such that $f$ decreases as we move from $x$ in the direction $d$. This notion is formalized by the following definition.

Definition 3.1. $d$ is a descent direction for $f$ at $x$ if $f(x + td) < f(x)$ for all $t > 0$ sufficiently small.
A simple sufficient condition is the following.

Lemma 3.2. Suppose that $f$ is continuously differentiable in a neighborhood of $x$. If $\nabla f(x)^T d < 0$, then $d$ is a descent direction for $f$ at $x$.

Proof. We use Taylor's theorem (Theorem 2.1). By continuity of $\nabla f$, we can identify $\bar t > 0$ such that $\nabla f(x + td)^T d < 0$ for all $t \in [0, \bar t]$. Thus from (2.3), we have for any $t \in (0, \bar t]$ that
\[
f(x + td) = f(x) + t \nabla f(x + \gamma t d)^T d < f(x) \quad \text{for some } \gamma \in (0, 1),
\]
so $d$ is indeed a descent direction.

Among all directions $d$ with $\|d\| = 1$, the most negative value of the directional derivative $d^T \nabla f(x)$ is achieved by the normalized negative gradient:
\[
\inf_{\|d\| = 1} d^T \nabla f(x) = -\|\nabla f(x)\|, \quad \text{achieved when } d = -\frac{\nabla f(x)}{\|\nabla f(x)\|}.
\]
For this reason, we refer to $-\nabla f(x)$ as the direction of steepest descent.
Since this direction always provides a descent direction (whenever $\nabla f(x) \neq 0$), perhaps the simplest method for optimization of a smooth function has the iterations
\[
x^{k+1} = x^k - \alpha_k \nabla f(x^k)
\]
for some steplength $\alpha_k > 0$. At each iteration, we are guaranteed that some positive step $\alpha$ decreases the function value, unless $\nabla f(x^k) = 0$. When $\nabla f(x^k) = 0$, we have found a point that satisfies a necessary condition for local optimality; moreover, if $f$ is convex, we have computed a global minimizer of $f$. This algorithm is called the gradient method or the method of steepest descent. In the next section, we will analyze how many iterations are required to find points where the gradient nearly vanishes.
To estimate the amount of decrease in $f$ obtained at each iterate of this method, we use Taylor's theorem. By setting $p = \alpha d$ in (2.2), we obtain
\begin{align*}
f(x + \alpha d) &= f(x) + \alpha \nabla f(x)^T d + \alpha \int_0^1 [\nabla f(x + \gamma \alpha d) - \nabla f(x)]^T d \, d\gamma \\
&\le f(x) + \alpha \nabla f(x)^T d + \alpha \int_0^1 \|\nabla f(x + \gamma \alpha d) - \nabla f(x)\| \, \|d\| \, d\gamma \\
&\le f(x) + \alpha \nabla f(x)^T d + \alpha^2 \frac{L}{2} \|d\|^2, \tag{3.4}
\end{align*}
where we used (2.7) for the last line. For $x = x^k$ and $d = -\nabla f(x^k)$, the value of $\alpha$ that minimizes the expression on the right-hand side is $\alpha = 1/L$. By substituting these values, we obtain
\[
f(x^{k+1}) = f\!\left(x^k - \tfrac{1}{L} \nabla f(x^k)\right) \le f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2. \tag{3.5}
\]
This expression is one of the foundational inequalities in the analysis of optimization methods. It relates the amount of decrease we can obtain in the function $f$ to two critical quantities: the norm of the gradient $\nabla f(x^k)$ at the current iterate, and the Lipschitz constant $L$ of the gradient. Depending on the other assumptions about $f$, we can derive a variety of different convergence rates from this basic inequality, as we now show.
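To make the method concrete, here is a minimal sketch (ours, not from the text; the least-squares instance and all names are illustrative assumptions) of the gradient method with the fixed steplength $\alpha = 1/L$ analyzed in (3.5):

```python
import numpy as np

def gradient_method(grad_f, x0, L, num_iters):
    """Steepest descent with constant steplength 1/L, where L is the
    Lipschitz constant of grad_f; each step realizes inequality (3.5)."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - (1.0 / L) * grad_f(x)
    return x

# Illustrative instance: f(x) = 0.5 * ||A x - b||^2, for which
# grad f(x) = A^T (A x - b) and L = ||A||_2^2 (largest eigenvalue of A^T A).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.norm(A, 2) ** 2
x = gradient_method(lambda x: A.T @ (A @ x - b), np.zeros(5), L, 500)
```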
Theorem 3.3. Suppose that $f$ is convex and $L$-smooth, and that (3.1) has a solution $x^*$. Then the steepest-descent method with stepsize $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^{\infty}$ that satisfies
\[
f(x^T) - f^* \le \frac{L}{2T} \|x^0 - x^*\|^2, \quad T = 1, 2, \dots. \tag{3.8}
\]
Proof. By convexity of $f$, we have $f(x^*) \ge f(x^k) + \nabla f(x^k)^T (x^* - x^k)$, so by substituting into the key inequality (3.5), we obtain for $k = 0, 1, 2, \dots$ that
\begin{align*}
f(x^{k+1}) &\le f(x^*) + \nabla f(x^k)^T (x^k - x^*) - \frac{1}{2L} \|\nabla f(x^k)\|^2 \\
&= f(x^*) + \frac{L}{2} \left( \|x^k - x^*\|^2 - \left\| x^k - x^* - \tfrac{1}{L} \nabla f(x^k) \right\|^2 \right) \\
&= f(x^*) + \frac{L}{2} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right).
\end{align*}
By summing over $k = 0, 1, 2, \dots, T-1$ and telescoping, we have
\begin{align*}
\sum_{k=0}^{T-1} (f(x^{k+1}) - f^*) &\le \frac{L}{2} \sum_{k=0}^{T-1} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right) \\
&= \frac{L}{2} \left( \|x^0 - x^*\|^2 - \|x^T - x^*\|^2 \right) \\
&\le \frac{L}{2} \|x^0 - x^*\|^2.
\end{align*}
Since (3.5) guarantees that the sequence $\{f(x^k)\}$ is nonincreasing, $f(x^T) - f(x^*)$ is bounded above by the average of the gaps at the last $T$ iterates:
\[
f(x^T) - f(x^*) \le \frac{1}{T} \sum_{k=0}^{T-1} (f(x^{k+1}) - f^*) \le \frac{L}{2T} \|x^0 - x^*\|^2,
\]
as required.
The simplest strongly convex function is the squared Euclidean norm $\|x\|^2$. Any convex function can be perturbed to form a strongly convex function by adding any small multiple of the squared Euclidean norm. In fact, if $f$ is any $L$-smooth function, then
\[
f(x) + \frac{\mu}{2} \|x\|^2
\]
is strongly convex for $\mu$ large enough. Verifying this fact is an interesting exercise.
As another canonical example, note that a quadratic function $f(x) = \frac{1}{2} x^T Q x$ with symmetric $Q$ is strongly convex if and only if the smallest eigenvalue of $Q$ is strictly positive. We saw in Theorem 2.8 that a strongly convex $f$ has a unique minimizer, which we denote by $x^*$.
Strongly convex functions are in essence the "easiest" functions to optimize by first-order methods. First, the norm of the gradient provides useful information about how far away we are from optimality. Suppose we minimize both sides of the inequality (3.9) with respect to $z$. The minimum on the left-hand side is clearly attained at $z = x^*$, while on the right-hand side it is attained at $z = x - \nabla f(x)/m$. By plugging these optimal values into (3.9), we obtain
\begin{align*}
f(x^*) &\ge f(x) - \frac{1}{m} \nabla f(x)^T \nabla f(x) + \frac{m}{2} \left\| \frac{1}{m} \nabla f(x) \right\|^2 \\
&= f(x) - \frac{1}{2m} \|\nabla f(x)\|^2.
\end{align*}
By rearrangement, we obtain
\[
\|\nabla f(x)\|^2 \ge 2m \, [f(x) - f(x^*)]. \tag{3.10}
\]
If $\|\nabla f(x)\| < \delta$, then
\[
f(x) - f(x^*) \le \frac{\|\nabla f(x)\|^2}{2m} \le \frac{\delta^2}{2m}.
\]
Thus, when the gradient is small, we are close to having found a point with minimal function value.
We can even derive a stronger result about the distance from $x$ to the optimal point $x^*$. Using (3.9) and the Cauchy–Schwarz inequality, we have
\begin{align*}
f(x^*) &\ge f(x) + \nabla f(x)^T (x^* - x) + \frac{m}{2} \|x - x^*\|^2 \\
&\ge f(x) - \|\nabla f(x)\| \, \|x^* - x\| + \frac{m}{2} \|x - x^*\|^2.
\end{align*}
Since $f(x^*) \le f(x)$, rearranging terms proves that
\[
\|x - x^*\| \le \frac{2}{m} \|\nabla f(x)\|. \tag{3.11}
\]
This says that we can estimate the distance to the minimizer purely in terms of the norm of the gradient.
We summarize this discussion in the following lemma.

Lemma 3.4. Let $f$ be a strongly convex function with modulus $m$. Then we have
\begin{align}
f(x) - f(x^*) &\le \frac{\|\nabla f(x)\|^2}{2m}, \tag{3.12} \\
\|x - x^*\| &\le \frac{2}{m} \|\nabla f(x)\|. \tag{3.13}
\end{align}
We can now proceed to analyze the convergence of gradient descent on strongly convex functions. By substituting (3.12) into our basic inequality (3.5), we obtain
\[
f(x^{k+1}) = f\!\left(x^k - \tfrac{1}{L} \nabla f(x^k)\right) \le f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2 \le f(x^k) - \frac{m}{L} (f(x^k) - f^*).
\]
Subtracting $f^*$ from both sides gives $f(x^{k+1}) - f^* \le (1 - m/L)(f(x^k) - f^*)$, and applying this bound recursively yields the linear rate
\[
f(x^k) - f^* \le \left( 1 - \frac{m}{L} \right)^k (f(x^0) - f^*). \tag{3.15}
\]
For the general convex case, we have from (3.8) that $f(x^k) - f^* \le \epsilon$ for all $k$ satisfying
\[
k \ge \frac{L \|x^0 - x^*\|^2}{2\epsilon}. \tag{3.16}
\]
For the strongly convex case, we have from (3.15) that $f(x^k) - f^* \le \epsilon$ for all $k$ satisfying
\[
k \ge \frac{L}{m} \log\!\left( \frac{f(x^0) - f^*}{\epsilon} \right). \tag{3.17}
\]
Note that in all three cases, we can get bounds in terms of the initial distance to optimality $\|x^0 - x^*\|$ rather than in terms of the initial optimality gap $f(x^0) - f^*$ by using the inequality
\[
f(x^0) - f^* \le \frac{L}{2} \|x^0 - x^*\|^2.
\]
The linear rate (3.17) depends only logarithmically on $\epsilon$, whereas the sublinear rates depend on $1/\epsilon$ or $1/\epsilon^2$. When $\epsilon$ is small (for example, $\epsilon = 10^{-6}$), the linear rate would appear to be dramatically faster, and indeed this is usually the case. The only exception would be when $m$ is extremely small, so that $L/m$ is of the same order as $1/\epsilon$. The problem is extremely ill conditioned in this case, and there is little difference between the linear rate (3.17) and the sublinear rate (3.16).
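To make the comparison concrete, consider a hypothetical instance (our numbers, chosen only for illustration) with $L = 10^2$, $m = 10^{-2}$, $\|x^0 - x^*\| = 1$, and $\epsilon = 10^{-6}$. The sublinear bound (3.16) requires $k \ge 10^2/(2 \times 10^{-6}) = 5 \times 10^7$ iterations, while, using $f(x^0) - f^* \le \frac{L}{2}\|x^0 - x^*\|^2 = 50$, the linear bound (3.17) requires only $k \ge 10^4 \log(50/10^{-6}) \approx 1.8 \times 10^5$ iterations, smaller by more than two orders of magnitude.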
All of these bounds depend on knowledge of the curvature parameter L. What happens when
we don’t know L? Even when we do know it, is the steplength αk ≡ 1/L good? We have reason to
suspect not, since the inequality (3.5) on which it is based uses the conservative global upper bound
L on curvature. (A sharper bound could be obtained in terms of the curvature in the neighborhood
of the current iterate xk .) In the remainder of this chapter, we expand our view to more general
choices of search directions and stepsizes.
The analyses above used properties of the function $f$ itself that are independent of the algorithm: smoothness, convexity, and strong convexity. For a general descent method
\[
x^{k+1} = x^k + \alpha_k d^k \tag{3.18}
\]
whose steps achieve a decrease of the form
\[
f(x^{k+1}) \le f(x^k) - C \|\nabla f(x^k)\|^2 \quad \text{for some } C > 0, \tag{3.19}
\]
we can provide similar analyses based on the property (3.19).
What can we say about the sequence of iterates {xk } generated by such a scheme? We state
an elementary theorem.
Theorem 3.5. Suppose that f is bounded below, with Lipschitz continuous gradient. Then all ac-
cumulation points x̄ of the sequence {xk } generated by a scheme that satisfies (3.19) are stationary,
that is, ∇f (x̄) = 0. If in addition f is convex, each such x̄ is a solution of (3.1).
Proof. Note first from (3.19) that
\[
f(x^k) - f(x^{k+1}) \ge C \|\nabla f(x^k)\|^2,
\]
and since $\{f(x^k)\}$ is a decreasing sequence that is bounded below, it follows that $\lim_{k \to \infty} f(x^k) - f(x^{k+1}) = 0$, and hence $\lim_{k \to \infty} \|\nabla f(x^k)\| = 0$. If $\bar x$ is an accumulation point, there is a subsequence $\mathcal{S}$ such that $\lim_{k \in \mathcal{S}, k \to \infty} x^k = \bar x$. By continuity of $\nabla f$, we have $\nabla f(\bar x) = \lim_{k \in \mathcal{S}, k \to \infty} \nabla f(x^k) = 0$, as required. If $f$ is convex, the condition $\nabla f(\bar x) = 0$ is also sufficient for $\bar x$ to be a solution of (3.1).
It is possible for the sequence $\{x^k\}$ to be unbounded and have no accumulation points. For example, some descent methods applied to the scalar function $f(x) = e^{-x}$ will generate iterates that diverge to $\infty$. (This function is convex and bounded below but does not attain its minimum value.)
We can prove other results about rates of convergence of algorithms (3.18) satisfying (3.19), using almost identical proofs to those of Section 3.2. For example, for the case in which $f$ is bounded below by some quantity $\bar f$, we can show using the techniques of Section 3.2.1 that
\[
\min_{0 \le k \le T-1} \|\nabla f(x^k)\| \le \sqrt{\frac{f(x^0) - \bar f}{C T}}.
\]
For the case in which $f$ is strongly convex with modulus $m$ (and unique solution $x^*$), we can combine (3.12) with (3.19) to deduce that
\[
f(x^{k+1}) - f(x^*) \le f(x^k) - f(x^*) - C \|\nabla f(x^k)\|^2 \le (1 - 2mC) \, [f(x^k) - f(x^*)],
\]
so the optimality gap again decreases at a linear rate. For the case in which $f$ is convex (but not necessarily strongly convex) and the iterates satisfy $\|x^k - x^*\| \le R_0$ for all $k$, we obtain a sublinear rate:
\[
f(x^T) - f^* \le \frac{R_0^2}{C T}, \quad T = 1, 2, \dots. \tag{3.21}
\]
Proof. Defining $\Delta_k := f(x^k) - f(x^*)$, we have by convexity and the Cauchy–Schwarz inequality that
\[
\Delta_k \le \nabla f(x^k)^T (x^k - x^*) \le \|\nabla f(x^k)\| \, \|x^k - x^*\| \le R_0 \|\nabla f(x^k)\|,
\]
so that $\|\nabla f(x^k)\|^2 \ge \Delta_k^2 / R_0^2$. Substituting into (3.19), subtracting $f(x^*)$ from both sides, and using the definition of $\Delta_k$, we obtain
\[
\Delta_{k+1} \le \Delta_k - \frac{C}{R_0^2} \Delta_k^2 = \Delta_k \left( 1 - \frac{C}{R_0^2} \Delta_k \right). \tag{3.22}
\]
By inverting both sides, we obtain
\[
\frac{1}{\Delta_{k+1}} \ge \frac{1}{\Delta_k} \cdot \frac{1}{1 - \frac{C}{R_0^2} \Delta_k}.
\]
Since $\Delta_{k+1} \ge 0$, we have from (3.22) that $\frac{C}{R_0^2} \Delta_k \in [0, 1]$, so using the fact that $\frac{1}{1 - \epsilon} \ge 1 + \epsilon$ for all $\epsilon \in [0, 1)$, we obtain
\[
\frac{1}{\Delta_{k+1}} \ge \frac{1}{\Delta_k} \left( 1 + \frac{C}{R_0^2} \Delta_k \right) = \frac{1}{\Delta_k} + \frac{C}{R_0^2}.
\]
By applying this formula recursively, we have for any $T \ge 1$ that
\[
\frac{1}{\Delta_T} \ge \frac{1}{\Delta_0} + \frac{T C}{R_0^2} \ge \frac{T C}{R_0^2},
\]
and we obtain the result by taking the inverse of both sides of this bound and using $\Delta_T = f(x^T) - f(x^*)$.
We now consider general search directions $d^k$ that satisfy the following two conditions, for some positive constants $\bar\epsilon$, $\gamma_1$, and $\gamma_2$:
\begin{align}
0 < \bar\epsilon &\le \frac{-(d^k)^T \nabla f(x^k)}{\|\nabla f(x^k)\| \, \|d^k\|}, \tag{3.23a} \\
0 < \gamma_1 &\le \frac{\|d^k\|}{\|\nabla f(x^k)\|} \le \gamma_2. \tag{3.23b}
\end{align}
Condition (3.23a) says that the angle between $-\nabla f(x^k)$ and $d^k$ is acute, and bounded away from $\pi/2$, while condition (3.23b) ensures that $d^k$ and $\nabla f(x^k)$ are not too much different in length. (If $x^k$ is a stationary point, we have $\nabla f(x^k) = 0$, so our algorithm will set $d^k = 0$ and terminate.)
For the "obvious" choice of search direction, the negative gradient $d^k = -\nabla f(x^k)$, the conditions (3.23) hold trivially, with $\bar\epsilon = \gamma_1 = \gamma_2 = 1$.
We can use Taylor's theorem to bound the change in $f$ when we move along $d^k$ from the current iterate $x^k$. By setting $x = x^k$ and $p = \alpha d^k$ in (2.2), we obtain
\begin{align*}
f(x^k + \alpha d^k) &\le f(x^k) + \alpha \nabla f(x^k)^T d^k + \alpha^2 \frac{L}{2} \|d^k\|^2 \\
&\le f(x^k) - \alpha \bar\epsilon \|\nabla f(x^k)\| \, \|d^k\| + \alpha^2 \frac{L}{2} \|d^k\|^2 \\
&\le f(x^k) - \alpha \|\nabla f(x^k)\| \, \|d^k\| \left( \bar\epsilon - \frac{\alpha L \gamma_2}{2} \right), \tag{3.24}
\end{align*}
where we used (2.7) for the first line and (3.23) for the others. It is clear from this expression that for all values of $\alpha$ sufficiently small (to be precise, for $\alpha \in (0, 2\bar\epsilon/(L\gamma_2))$), we have $f(x^{k+1}) < f(x^k)$, unless of course $x^k$ is a stationary point.
In deriving the bound (3.24), we did not require convexity of f , only Lipschitz continuity of the
gradient ∇f . The same is true for most of the analysis in this section. Convexity is used only in
proving rates of convergence to a solution x∗ , in Sections 3.3 and 3.2. (Even there, we could relax
the convexity assumption to obtain results about convergence to stationary points.)
We mention a few possible choices of $d^k$ apart from the negative gradient direction $-\nabla f(x^k)$.
• The transformed negative gradient direction $d^k = -S^k \nabla f(x^k)$, where $S^k$ is a symmetric positive definite matrix with eigenvalues in the range $[\gamma_1, \gamma_2]$, where $\gamma_1$ and $\gamma_2$ are positive quantities as in (3.23). The second condition in (3.23) holds by the definition of $S^k$, and the first condition holds with $\bar\epsilon = \gamma_1/\gamma_2$, since
\[
-(d^k)^T \nabla f(x^k) = \nabla f(x^k)^T S^k \nabla f(x^k) \ge \gamma_1 \|\nabla f(x^k)\|^2 \ge (\gamma_1/\gamma_2) \|\nabla f(x^k)\| \, \|d^k\|.
\]
Newton's method, which chooses $S^k = (\nabla^2 f(x^k))^{-1}$, would satisfy this condition provided the true Hessian has eigenvalues uniformly bounded in the range $[1/\gamma_2, 1/\gamma_1]$ for all $x^k$.
• The Gauss–Southwell variant of coordinate descent chooses $d^k = -[\nabla f(x^k)]_{i_k} e_{i_k}$, where $i_k = \arg\max_{i=1,2,\dots,n} |[\nabla f(x^k)]_i|$ and $e_i$ denotes the $i$th coordinate vector. (We leave it as an exercise to show that the conditions (3.23) are satisfied for this choice of $d^k$.) There does not seem to be an obvious reason to use this search direction: since it is defined in terms of the full gradient $\nabla f(x^k)$, why not use $d^k = -\nabla f(x^k)$ instead? The answer (as we discuss further in Chapter 6) is that for some important kinds of functions $f$, the gradient $\nabla f(x^k)$ can be updated efficiently to obtain $\nabla f(x^{k+1})$ provided that $x^k$ and $x^{k+1}$ differ in only a single coordinate. These cost savings make coordinate descent methods competitive with, and often faster than, full-gradient methods.
• Some algorithms make randomized choices of $d^k$ for which the conditions (3.23) hold in the sense of expectation, rather than deterministically. In one variant of stochastic coordinate descent (see the sketch after this list), we set $d^k = -[\nabla f(x^k)]_{i_k} e_{i_k}$, for $i_k$ chosen uniformly at random from $\{1, 2, \dots, n\}$ at each $k$. Taking expectations over $i_k$, we have
\[
\mathbb{E}_{i_k} \left[ (-d^k)^T \nabla f(x^k) \right] = \frac{1}{n} \sum_{i=1}^n [\nabla f(x^k)]_i^2 = \frac{1}{n} \|\nabla f(x^k)\|^2 \ge \frac{1}{n} \|\nabla f(x^k)\| \, \|d^k\|,
\]
where the last inequality follows from $\|d^k\| \le \|\nabla f(x^k)\|$, so the first condition in (3.23) holds in an expected sense. We also have $\mathbb{E}(\|d^k\|^2) = \frac{1}{n} \|\nabla f(x^k)\|^2$, so the norms of $d^k$ and $\nabla f(x^k)$ are similar to within a scale factor, and the second condition in (3.23) also holds in an expected sense. Rigorous analysis of these methods is presented in Chapter 6.
• Another important class of randomized schemes is the stochastic gradient methods discussed in Chapter 5. In place of an exact gradient $\nabla f(x^k)$, these methods typically have access to a vector $g(x^k, \xi_k)$, where $\xi_k$ is a random variable, such that $\mathbb{E}_{\xi_k} g(x^k, \xi_k) = \nabla f(x^k)$. That is, $g(x^k, \xi_k)$ is an unbiased (but often very noisy) estimate of the true gradient $\nabla f(x^k)$. Again, if we set $d^k = -g(x^k, \xi_k)$, the conditions (3.23) hold in an expected sense, though the bound $\mathbb{E}(\|d^k\|) \le \gamma_2 \|\nabla f(x^k)\|$ requires additional conditions on the distribution of $g(x^k, \xi_k)$ as a function of $\xi_k$. Further analysis of stochastic gradient methods appears in Chapter 5.
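The following sketch (our illustration; the function names are our own, and recomputing the full gradient at every step is a simplifying assumption, since practical implementations maintain the gradient incrementally as described in Chapter 6) implements both the Gauss–Southwell and the randomized coordinate choices of $d^k$:

```python
import numpy as np

def coordinate_descent(grad_f, x0, L, num_iters, rule="random", seed=0):
    """Coordinate descent with d^k = -[grad f(x^k)]_{i_k} e_{i_k}.
    rule="gauss_southwell": i_k maximizes |[grad f(x^k)]_i|;
    rule="random": i_k uniform over coordinates ((3.23) in expectation)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(num_iters):
        g = grad_f(x)                      # full gradient, for simplicity
        if rule == "gauss_southwell":
            i = int(np.argmax(np.abs(g)))  # largest-magnitude component
        else:
            i = int(rng.integers(x.size))  # uniformly random coordinate
        x[i] -= (1.0 / L) * g[i]           # move only along coordinate i_k
    return x
```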
Constant Stepsize. As we have seen in Section 3.2, constant stepsizes can yield rapid conver-
gence rates. The main drawback of the constant stepsize method is that one needs some prior
information to properly choose the stepsize.
The first approach to choosing a constant stepsize (one commonly used in machine learning,
where the step length is often known as the “learning rate”) is trial and error. Extensive experience
in applying gradient (or stochastic gradient) algorithms to a particular class of problems may reveal
that a particular stepsize is reliable and reasonably efficient. Typically, a reasonable heuristic is to
pick α as large as possible such that the algorithm doesn’t diverge. In some sense, this approach
is estimating the Lipschitz constant of the gradient of f by trial and error. Slightly enhanced
variants are also possible, for example, αk may be held constant for many successive iterations
then decreased periodically. Since such schemes are highly application- and problem-dependent,
we cannot say much more about them here.
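A crude automation of this trial-and-error process might look as follows (a hypothetical sketch of ours, not a scheme from the text): start with an optimistic stepsize and halve it whenever a short trial run fails to decrease the objective.

```python
import numpy as np

def tune_stepsize(f, grad_f, x0, alpha0=1.0, max_halvings=30, trial_iters=50):
    """Trial-and-error stepsize selection: accept the first alpha for which
    a short run of gradient descent does not increase f (no divergence)."""
    alpha = alpha0
    for _ in range(max_halvings):
        x = x0.copy()
        for _ in range(trial_iters):
            x = x - alpha * grad_f(x)
        if np.isfinite(f(x)) and f(x) <= f(x0):
            return alpha          # no blow-up observed; keep this stepsize
        alpha *= 0.5              # diverged or increased f: try a shorter step
    return alpha
```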
A second approach is to base the choice of $\alpha_k$ on knowledge of the global properties of the function $f$, for example, on the Lipschitz constant $L$ for the gradient (see (2.7)) or the modulus of convexity $\mu$ (see (2.17)). We call such variants "short-step" methods. Given the expression (3.24) above, for example, and supposing we have estimates of all the quantities $\gamma_1$, $\gamma_2$, and $L$ that appear therein, we could choose $\alpha$ to maximize the guaranteed decrease. Setting $\alpha = \bar\epsilon/(L\gamma_2)$, we obtain from (3.24) and (3.23) that
\[
f(x^{k+1}) \le f(x^k) - \frac{\bar\epsilon^2}{2L\gamma_2} \|\nabla f(x^k)\| \, \|d^k\| \le f(x^k) - \frac{\bar\epsilon^2 \gamma_1}{2L\gamma_2} \|\nabla f(x^k)\|^2. \tag{3.26}
\]
Thus, the amount of decrease in $f$ at iteration $k$ is at least a positive multiple of the squared gradient norm $\|\nabla f(x^k)\|^2$.
Exact Line Search. Once we have chosen a descent direction, we can minimize the function restricted to that direction. That is, we can perform a one-dimensional line search along direction $d^k$ to find an approximate solution of the problem
\[
\min_{\alpha > 0} f(x^k + \alpha d^k).
\]
This technique requires evaluation of $f(x^k + \alpha d^k)$ (and possibly also its derivative with respect to $\alpha$, namely $(d^k)^T \nabla f(x^k + \alpha d^k)$) economically, for arbitrary positive values of $\alpha$. There are many cases where these line searches can be computed at low cost. For example, if $f$ is a multivariate polynomial, the line search amounts to minimizing a univariate polynomial. Such a minimization can be performed by finding the roots of the derivative of that polynomial, and then testing each root to find the minimizer. In other settings, such as the coordinate descent methods of Chapter 6, it is possible to
evaluate $f(x^k + \alpha d^k)$ cheaply for certain $f$, provided that $d^k$ is a coordinate direction. Convergence analysis for exact line search methods tracks that for the short-step methods above: since the exact minimizer of $f(x^k + \alpha d^k)$ achieves at least as much reduction in $f$ as the choice $\alpha = \bar\epsilon/(L\gamma_2)$ used to derive the estimate (3.26), it is clear that (3.26) also holds for exact line searches.
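For instance, for a convex quadratic $f(x) = \frac{1}{2} x^T Q x - b^T x$ (an illustrative case of ours), the line search along $d^k$ is a univariate quadratic minimization with a closed-form solution:

```python
import numpy as np

def exact_step_quadratic(Q, b, x, d):
    """Exact line search for f(x) = 0.5 x^T Q x - b^T x along direction d.
    phi(alpha) = f(x + alpha d) is quadratic in alpha; setting phi'(alpha) = 0
    gives alpha = -grad_f(x)^T d / (d^T Q d), positive for descent directions."""
    g = Q @ x - b                       # gradient of f at x
    return -(g @ d) / (d @ (Q @ d))
```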
Approximate Line Search. In full generality, exact line searches are expensive and unnecessary. Better empirical performance is often achieved by approximate line search. There was a great deal of research in the 1970s and 1980s on finding conditions that should be satisfied by approximate line searches so as to guarantee good convergence properties, and on identifying line-search procedures that find such approximate solutions economically. (By "economically," we mean that an average of three or fewer evaluations of $f$ are required.) One popular pair of conditions that the approximate minimizer $\alpha = \alpha_k$ is required to satisfy, called the weak Wolfe conditions, is as follows:
\begin{align}
f(x^k + \alpha d^k) &\le f(x^k) + c_1 \alpha \nabla f(x^k)^T d^k, \tag{3.28a} \\
\nabla f(x^k + \alpha d^k)^T d^k &\ge c_2 \nabla f(x^k)^T d^k. \tag{3.28b}
\end{align}
Here, c1 and c2 are constants that satisfy 0 < c1 < c2 < 1. The condition (3.28a) is often known
as the “sufficient decrease condition,” because it ensures that the actual amount of decrease in f
is at least a multiple c1 of the amount suggested by the first-order Taylor expansion. The second
condition (3.28b), which we call the “gradient condition,” ensures that αk is not too short; it ensures
that we move far enough along dk that the directional derivative of f along dk is substantially less
negative than its value at α = 0, or is zero or positive. These conditions are illustrated in Figure 3.1.
It can be shown that there exist values of $\alpha_k$ that satisfy both weak Wolfe conditions simultaneously. To show that these conditions imply a reduction in $f$ that is related to $\|\nabla f(x^k)\|^2$ (as in (3.26)), we argue as follows. First, from condition (3.28b) and the Lipschitz property for $\nabla f$, we have
\[
-(1 - c_2) \nabla f(x^k)^T d^k \le [\nabla f(x^k + \alpha_k d^k) - \nabla f(x^k)]^T d^k \le L \alpha_k \|d^k\|^2,
\]
and thus
\[
\alpha_k \ge -\frac{(1 - c_2)}{L} \, \frac{\nabla f(x^k)^T d^k}{\|d^k\|^2}.
\]
Substituting into (3.28a), and using the first condition in (3.23), then yields
\[
f(x^{k+1}) \le f(x^k) - \frac{c_1 (1 - c_2)}{L} \, \frac{(\nabla f(x^k)^T d^k)^2}{\|d^k\|^2} \le f(x^k) - \frac{c_1 (1 - c_2) \bar\epsilon^2}{L} \|\nabla f(x^k)\|^2.
\]
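A simple bracketing scheme that finds such an $\alpha_k$ (a sketch of ours in the spirit of the extrapolation-bisection search referenced in the exercises, not Algorithm 3.1 itself) doubles the trial step while (3.28b) fails and bisects while (3.28a) fails:

```python
import numpy as np

def weak_wolfe_search(f, grad_f, x, d, c1=1e-4, c2=0.9, max_iters=50):
    """Find a steplength satisfying the weak Wolfe conditions (3.28) by
    extrapolation (doubling) and bisection on a bracket [lo, hi].
    Assumes d is a descent direction, so grad_f(x) @ d < 0."""
    lo, hi, alpha = 0.0, np.inf, 1.0
    f0, g0d = f(x), grad_f(x) @ d
    for _ in range(max_iters):
        if f(x + alpha * d) > f0 + c1 * alpha * g0d:
            hi = alpha                    # (3.28a) fails: step too long
            alpha = 0.5 * (lo + hi)
        elif grad_f(x + alpha * d) @ d < c2 * g0d:
            lo = alpha                    # (3.28b) fails: step too short
            alpha = 2.0 * alpha if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha                  # both conditions (3.28) hold
    return alpha
```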
Backtracking Line Search. Another popular approach to determining an appropriate value for $\alpha_k$ is known as "backtracking." It is widely used in situations where evaluation of $f$ is economical and practical, while evaluation of the gradient $\nabla f$ is more difficult. It is easy to implement (no estimate of the Lipschitz constant $L$ is required, for example) and still results in reasonably fast convergence.

Figure 3.1: The weak Wolfe conditions: the curve $f(x + \alpha d)$ is compared against the sufficient-decrease line $f(x) + c_1 \alpha \nabla f(x)^T d$; the conditions are satisfied when both the gradient condition (3.28b) and the sufficient decrease condition (3.28a) hold.
In its simplest variant, we first try a value $\bar\alpha > 0$ as the initial guess of the steplength, and choose a constant $\beta \in (0, 1)$. The steplength $\alpha_k$ is set to the first value in the sequence $\bar\alpha, \beta\bar\alpha, \beta^2\bar\alpha, \beta^3\bar\alpha, \dots$ for which the sufficient decrease condition (3.28a) is satisfied. Note that backtracking does not require
a condition like (3.28b) to be checked. The purpose of such a condition is to ensure that αk is not
too short, but this is not a concern in backtracking, because we know that αk is either the fixed
value ᾱ, or is within a factor β of a step length that is too long.
Under the assumptions above, we can again show that the decrease in $f$ at iteration $k$ is a positive multiple of $\|\nabla f(x^k)\|^2$. When no backtracking is necessary, that is, $\alpha_k = \bar\alpha$, we have from (3.28a) and (3.23) that
\[
f(x^{k+1}) \le f(x^k) + c_1 \bar\alpha \nabla f(x^k)^T d^k \le f(x^k) - c_1 \bar\alpha \bar\epsilon \gamma_1 \|\nabla f(x^k)\|^2. \tag{3.29}
\]
When backtracking is needed, we have from the fact that the test (3.28a) is not satisfied for the previously tried value $\alpha = \beta^{-1} \alpha_k$ that
\[
f(x^k + \beta^{-1} \alpha_k d^k) > f(x^k) + c_1 \beta^{-1} \alpha_k \nabla f(x^k)^T d^k,
\]
from which a lower bound on $\alpha_k$, and hence a decrease estimate similar to (3.29), can be derived.
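In code, the basic backtracking loop is only a few lines (a sketch under our own naming; it checks only the sufficient decrease condition (3.28a), as described above):

```python
def backtracking(f, grad_f, x, d, alpha_bar=1.0, beta=0.5, c1=1e-4):
    """Backtracking line search: try alpha_bar, beta*alpha_bar,
    beta^2*alpha_bar, ... and return the first value satisfying (3.28a)."""
    f0, g0d = f(x), grad_f(x) @ d
    alpha = alpha_bar
    while f(x + alpha * d) > f0 + c1 * alpha * g0d:
        alpha *= beta    # (3.28a) failed: shorten the step by factor beta
    return alpha
```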
(see Theorem 2.4). Our method makes a further smoothness assumption on $f$. In addition to Lipschitz continuity of the gradient $\nabla f$, we assume Lipschitz continuity of the Hessian $\nabla^2 f$. That is, we assume that there is a constant $M$ such that
\[
\|\nabla^2 f(x) - \nabla^2 f(y)\| \le M \|x - y\| \quad \text{for all } x, y.
\]
By extending Taylor's theorem (Theorem 2.1) to a third-order term, and using the definition of $M$, we obtain the following cubic upper bound on $f$:
\[
f(x + p) \le f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x) p + \frac{1}{6} M \|p\|^3. \tag{3.33}
\]
As in Section 3.2, we make an additional assumption that $f$ is bounded below by $\bar f$.
We describe an elementary algorithm that makes use of the expansion (3.33) as well as the steepest-descent theory of Section 3.2. Our algorithm aims to identify a point that approximately satisfies the second-order necessary conditions (3.31), that is,
\[
\|\nabla f(x)\| \le \epsilon_g \quad \text{and} \quad \lambda_{\min}(\nabla^2 f(x)) \ge -\epsilon_H. \tag{3.34}
\]
At iteration $k$, the algorithm proceeds as follows:
(i) If $\|\nabla f(x^k)\| > \epsilon_g$, take the steepest-descent step $x^{k+1} = x^k - \frac{1}{L} \nabla f(x^k)$.
(ii) Otherwise, define $\lambda_k$ to be the minimum eigenvalue of $\nabla^2 f(x^k)$, that is, $\lambda_k := \lambda_{\min}(\nabla^2 f(x^k))$. If $\lambda_k < -\epsilon_H$, choose $p^k$ to be the eigenvector corresponding to this most negative eigenvalue of $\nabla^2 f(x^k)$, with the size and sign of $p^k$ chosen such that $\|p^k\| = 1$ and $(p^k)^T \nabla f(x^k) \le 0$, and set
\[
x^{k+1} = x^k + \alpha_k p^k, \quad \text{where } \alpha_k = \frac{2 |\lambda_k|}{M}. \tag{3.35}
\]
(iii) If neither of these conditions holds, then $x^k$ satisfies the necessary conditions (3.34), so it is an approximate second-order-necessary point, and we terminate.
For the steepest-descent step (i), we have from (3.5) that
\[
f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2 \le f(x^k) - \frac{\epsilon_g^2}{2L}. \tag{3.36}
\]
For a step of type (ii), we have from (3.33) that
\begin{align}
f(x^{k+1}) &\le f(x^k) + \alpha_k \nabla f(x^k)^T p^k + \frac{1}{2} \alpha_k^2 (p^k)^T \nabla^2 f(x^k) p^k + \frac{1}{6} M \alpha_k^3 \|p^k\|^3 \nonumber \\
&\le f(x^k) - \frac{1}{2} \left( \frac{2|\lambda_k|}{M} \right)^2 |\lambda_k| + \frac{1}{6} M \left( \frac{2|\lambda_k|}{M} \right)^3 \nonumber \\
&= f(x^k) - \frac{2}{3} \frac{|\lambda_k|^3}{M^2} \tag{3.37} \\
&\le f(x^k) - \frac{2}{3} \frac{\epsilon_H^3}{M^2}. \tag{3.38}
\end{align}
By aggregating (3.36) and (3.38), we have that at each $x^k$ for which the condition (3.34) does not hold, we attain a decrease in the objective of at least
\[
\min \left( \frac{\epsilon_g^2}{2L}, \; \frac{2}{3} \frac{\epsilon_H^3}{M^2} \right).
\]
Using the lower bound $\bar f$ on the objective $f$, we see that the number of iterations $K$ required to meet the condition (3.34) must satisfy
\[
K \min \left( \frac{\epsilon_g^2}{2L}, \; \frac{2}{3} \frac{\epsilon_H^3}{M^2} \right) \le f(x^0) - \bar f,
\]
that is,
\[
K \le \left( f(x^0) - \bar f \right) \max \left( \frac{2L}{\epsilon_g^2}, \; \frac{3M^2}{2\epsilon_H^3} \right).
\]
Note that the maximum number of iterations required to identify a point for which just the approximate stationarity condition $\|\nabla f(x^k)\| \le \epsilon_g$ holds is at most $2L \epsilon_g^{-2} (f(x^0) - \bar f)$. (We can just omit the second-order part of the algorithm.)
the second-order part of the algorithm.) Note too that it is easy to devise approximate versions of
this algorithm with similar complexity. For example, the negative curvature direction pk in step (ii)
above can be replaced by an approximation to the direction of most negative curvature, obtained
by the Lanczos iteration with random initialization.
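A dense-linear-algebra sketch of the two-phase scheme just described (ours; a practical implementation would use Lanczos rather than the full eigendecomposition, as just noted):

```python
import numpy as np

def approx_second_order_point(grad_f, hess_f, x0, L, M, eps_g, eps_H,
                              max_iters=1000):
    """Take steepest-descent steps (i) while the gradient is large and
    negative-curvature steps (ii) with steplength (3.35) otherwise,
    stopping when the approximate conditions (3.34) hold."""
    x = x0.copy()
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) > eps_g:
            x = x - (1.0 / L) * g            # step (i): steepest descent
            continue
        lam, V = np.linalg.eigh(hess_f(x))   # eigenvalues in ascending order
        if lam[0] < -eps_H:
            p = V[:, 0]                      # unit eigenvector, most negative curvature
            if g @ p > 0:
                p = -p                       # enforce (p^k)^T grad f(x^k) <= 0
            x = x + (2.0 * abs(lam[0]) / M) * p   # step (ii), steplength (3.35)
        else:
            return x                         # (3.34) holds at x
    return x
```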
Exercises

1. Linear Rates. Let $\{x_k\}$ be a sequence satisfying $x_{k+1} \le (1 - \beta) x_k$ for $0 < \beta < 1$, and $x_0 \le C$. Prove that $x_k \le \epsilon$ for all
\[
k \ge \beta^{-1} \log \frac{C}{\epsilon}.
\]
2. Verify that if $f$ is twice continuously differentiable with the Hessian satisfying $mI \preceq \nabla^2 f(x)$ for all $x \in \operatorname{dom}(f)$, then the strong convexity condition (2.17) is satisfied.
3. Show, as a corollary of Theorem 3.5, that if the sequence $\{x^k\}$ described in this theorem is bounded and if $f$ is strictly convex, then $\lim_{k \to \infty} x^k = x^*$.
4. How much of the analysis of Sections 3.2, 3.4, 3.5, and 3.3 applies to smooth nonconvex
functions? Specifically, state an analog of Theorem 3.5 that is true when the assumption of
convexity of f is dropped.
5. How is the analysis of Section 3.2 affected if we take an even shorter constant steplength than
1/L, that is, α ∈ (0, 1/L)? Show that we can still attain a “1/k” sublinear convergence rate
for {f (xk )}, but that the rate involves a constant that depends on the choice of α.
6. Find positive values of $\bar\epsilon$, $\gamma_1$, and $\gamma_2$ such that the Gauss–Southwell choice $d^k = -[\nabla f(x^k)]_{i_k} e_{i_k}$, where $i_k = \arg\max_{i=1,2,\dots,n} |[\nabla f(x^k)]_i|$, satisfies conditions (3.23).
7. Let $f$ be $m$-strongly convex with $L$-Lipschitz continuous gradient, and let $x^\star$ denote its minimizer.
(a) Show that $q(x) := f(x) - \frac{m}{2} \|x\|^2$ is convex with $(L - m)$-Lipschitz gradients.
(b) Use part (a) to prove that
\[
\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{mL}{m+L} \|x - y\|^2 + \frac{1}{m+L} \|\nabla f(x) - \nabla f(y)\|^2
\]
for all $x$ and $y$.
(c) Use part (b) and the fact that $\nabla f(x^\star) = 0$ to show that the $k$th iterate of the gradient method applied to $f$ with stepsize $\frac{2}{m+L}$ satisfies
\[
\|x^k - x^\star\| \le \left( \frac{\kappa - 1}{\kappa + 1} \right)^k \|x^0 - x^\star\|,
\]
where $\kappa = L/m$.
8. Let $f$ be convex and $L$-smooth with a minimizer $x^\star$ satisfying $\|x^0 - x^\star\| \le R$, and consider the regularized function $f^{(\epsilon)}(x) := f(x) + \frac{\epsilon}{2R^2} \|x - x^0\|^2$.
(a) Let $x^{(\epsilon)}_\star$ denote an optimal solution of $f^{(\epsilon)}$. Is $x^{(\epsilon)}_\star$ unique?
(b) Prove that $f(z) - f(x^\star) \le f^{(\epsilon)}(z) - f^{(\epsilon)}(x^{(\epsilon)}_\star) + \frac{\epsilon}{2}$.
(c) Prove that for an appropriately chosen stepsize, the gradient method applied to $f^{(\epsilon)}$ will find a solution $z$ such that
\[
f^{(\epsilon)}(z) - f^{(\epsilon)}(x^{(\epsilon)}_\star) \le \frac{\epsilon}{2}
\]
in at most
\[
\frac{2R^2 L}{\epsilon} \log\left( \frac{8R^2 L}{\epsilon} \right)
\]
iterations. Find a constant stepsize that yields such a convergence rate.
9. Consider the least-squares problem
\[
\text{minimize} \;\; \ell(x) := \frac{1}{n} \|Ax - b\|^2. \tag{3.39}
\]
(b) If you run the gradient method on (3.39) starting at $x^0 = 0$, how many iterations are required to find a solution with $\frac{1}{n} \|Ax - b\|^2 \le \epsilon$?
(c) Consider the regularized problem
\[
\text{minimize} \;\; \ell_\mu(x) := \frac{1}{n} \|Ax - b\|^2 + \mu \|x\|^2, \tag{3.40}
\]
where $\mu$ is some positive scalar. Let $x^{(\mu)}$ denote the minimizer of (3.40). Compute a closed-form formula for $x^{(\mu)}$.
(d) If you run the gradient method on (3.40) starting at $x^0 = 0$, how many iterations are required to find a solution with $\ell_\mu(x) - \ell_\mu(x^{(\mu)}) \le \epsilon$?
(e) Suppose $\hat x$ satisfies $\ell_\mu(\hat x) - \ell_\mu(x^{(\mu)}) \le \epsilon$. Come up with as tight an upper bound as you can on the quantity $\frac{1}{n} \|A\hat x - b\|^2$.
10. Modify the Extrapolation-Bisection Line Search (Algorithm 3.1) so that it terminates at a point satisfying the strong Wolfe conditions, which are
\begin{align*}
f(x^k + \alpha d^k) &\le f(x^k) + c_1 \alpha \nabla f(x^k)^T d^k, \\
|\nabla f(x^k + \alpha d^k)^T d^k| &\le c_2 |\nabla f(x^k)^T d^k|,
\end{align*}
where $c_1$ and $c_2$ are constants that satisfy $0 < c_1 < c_2 < 1$. (The difference with the weak Wolfe conditions (3.41) is that the directional derivative $\nabla f(x^k + \alpha d^k)^T d^k$ is not only bounded below by $-c_2 |\nabla f(x^k)^T d^k|$ but also bounded above by $c_2 |\nabla f(x^k)^T d^k|$; that is, it cannot be too positive.) (Hint: You should test separately for the two ways in which (3.41b) is violated, that is, $\nabla f(x^k + \alpha d^k)^T d^k < -c_2 |\nabla f(x^k)^T d^k|$ and $\nabla f(x^k + \alpha d^k)^T d^k > c_2 |\nabla f(x^k)^T d^k|$. Different adjustments of $L$, $\alpha$, and $U$ are required in these two cases.)