Foundations
In this chapter we outline the foundations of the algorithms and theory discussed in later chapters.
These foundations include a review of Taylor’s theorem and its consequences, which form the basis
of much of smooth nonlinear optimization. We also provide a concise review of elements of convex
analysis that will be used throughout the book.
• x∗ ∈ D is a strict local minimizer if it is a local minimizer and, in addition, f(x) > f(x∗) for
all x ∈ N with x ≠ x∗.
where Ω ⊂ D ⊂ R^n is a closed set, we modify the terminology slightly to use the word “solution”
rather than “minimizer.” That is, we have the following definitions.
One of the immediate challenges is to provide a simple means of determining whether a particular
point is a local or global solution. To do so, we first review a powerful tool familiar from
calculus: Taylor’s theorem. As we will see, Taylor’s theorem is the most important theorem in all
of continuous optimization. In the next section, we temporarily turn away from optimization and
review Taylor’s theorem in the setting of multivariable calculus. This review lets us derive some
fundamental lemmas that are the core tools for algorithm analysis.
f(x + p) = f(x) + ∇f(x + γp)ᵀp
= f(x) + ∇f(x)ᵀp + (∇f(x + γp) − ∇f(x))ᵀp
= f(x) + ∇f(x)ᵀp + O(‖∇f(x + γp) − ∇f(x)‖ ‖p‖)
= f(x) + ∇f(x)ᵀp + o(‖p‖),
where the last step follows from continuity: ∇f(x + γp) − ∇f(x) → 0 as p → 0, for all γ ∈ (0, 1).
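To see this estimate in action, the following short Python sketch (ours, for illustration; the particular function f, point x, and direction d are arbitrary choices) checks numerically that the first-order Taylor remainder f(x + p) − f(x) − ∇f(x)ᵀp shrinks faster than ‖p‖:

```python
import numpy as np

# A particular smooth test function and its gradient (illustrative choice).
f = lambda x: np.log(1.0 + x[0] ** 2) + np.sin(x[1])
grad_f = lambda x: np.array([2.0 * x[0] / (1.0 + x[0] ** 2), np.cos(x[1])])

x = np.array([0.7, -0.3])
d = np.array([1.0, 2.0]) / np.sqrt(5.0)   # fixed unit-norm direction

# The first-order Taylor remainder should be o(||p||): the ratio below tends to 0.
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    p = t * d
    remainder = f(x + p) - f(x) - grad_f(x) @ p
    print(f"||p|| = {t:.0e},  |remainder| / ||p|| = {abs(remainder) / t:.2e}")
```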
As we will see throughout this text, a crucial quantity in optimization is the Lipschitz constant
L for the gradient of f, which is defined to satisfy
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x and y.
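As a rough numerical illustration (ours, not from the text), the sketch below estimates L for a convex quadratic by sampling the ratio ‖∇f(x) − ∇f(y)‖/‖x − y‖ at random pairs of points; for f(x) = (1/2)xᵀAx the tightest such constant is the largest eigenvalue of A. The matrix A here is an arbitrary random choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative convex quadratic f(x) = 0.5 * x^T A x with A symmetric PSD;
# its gradient is A x, and the smallest valid Lipschitz constant L for the
# gradient is the largest eigenvalue of A.
B = rng.standard_normal((5, 5))
A = B.T @ B
grad_f = lambda x: A @ x

# Every ratio ||grad f(x) - grad f(y)|| / ||x - y|| is a lower bound on L.
ratios = []
for _ in range(10000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    ratios.append(np.linalg.norm(grad_f(x) - grad_f(y)) / np.linalg.norm(x - y))

print("sampled lower bound on L:", max(ratios))
print("largest eigenvalue of A: ", np.linalg.eigvalsh(A).max())
```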
Condition (a) in Theorem 2.4 is called the first-order necessary condition, because it involves the
first-order derivatives of f . For obvious reasons, condition (b) is called the second-order necessary
condition.
We additionally have the following second-order sufficient condition.
Theorem 2.5 (Sufficient Conditions for Smooth Unconstrained Optimization). Suppose that f is
twice continuously differentiable and that for some x∗, we have ∇f(x∗) = 0 and ∇²f(x∗) is positive
definite. Then x∗ is a strict local minimizer of (2.10).
Proof. We use formula (2.5) from Taylor’s theorem. Choose a radius ρ > 0 small enough that the
eigenvalues of ∇²f(x∗ + γp) are bounded below by some positive number ε, for all
p ∈ R^n with ‖p‖ ≤ ρ and all γ ∈ (0, 1). (Because ∇²f is positive definite at x∗ and continuous,
and because the eigenvalues of a matrix are continuous functions of its elements, it is
possible to choose ρ > 0 and ε > 0 with these properties.) By setting x = x∗ in (2.5), we have
f(x∗ + p) = f(x∗) + ∇f(x∗)ᵀp + (1/2) pᵀ∇²f(x∗ + γp)p ≥ f(x∗) + (ε/2)‖p‖², for all p with ‖p‖ ≤ ρ.
Thus, by setting N = {x∗ + p : ‖p‖ < ρ}, we have found a neighborhood of x∗ such that f(x) > f(x∗)
for all x ∈ N with x ≠ x∗, so x∗ satisfies the definition of a strict local minimizer.
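The conditions of Theorem 2.5 are easy to check numerically for a specific function. The sketch below (an illustration of ours; the quadratic f and its stationary point are hypothetical choices) verifies that the gradient vanishes and that the Hessian eigenvalues are strictly positive at a candidate point:

```python
import numpy as np

# Illustrative smooth function with an isolated minimizer (arbitrary choice):
#   f(x) = (x1 - 1)^2 + 2*(x2 + 0.5)^2 + x1*x2
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 0.5) ** 2 + x[0] * x[1]
grad_f = lambda x: np.array([2 * (x[0] - 1) + x[1], 4 * (x[1] + 0.5) + x[0]])
hess_f = lambda x: np.array([[2.0, 1.0], [1.0, 4.0]])  # constant for this quadratic

# Candidate stationary point, obtained by solving grad f(x) = 0 by hand.
x_star = np.array([10.0 / 7.0, -6.0 / 7.0])

print("gradient at x*:", grad_f(x_star))                            # approximately (0, 0)
print("Hessian eigenvalues:", np.linalg.eigvalsh(hess_f(x_star)))   # all strictly positive
# Both conditions of Theorem 2.5 hold, so x* is a strict local minimizer.
```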
The sufficient condition of Theorem 2.5 guarantees only a locally optimal solution. We now
turn to a class of functions for which we can provide necessary and sufficient conditions for
optimality using only information from low-order derivatives.
For every pair of points x, y ∈ Ω, the line segment between x and y is also contained in
Ω. The convex sets that we consider in this book are usually closed.
The defining property of a convex function is the following inequality:
f((1 − α)x + αy) ≤ (1 − α)f(x) + αf(y), for all x, y ∈ R^n and all α ∈ [0, 1]. (2.14)
The line segment connecting (x, f(x)) and (y, f(y)) lies on or above the graph of the function f.
In other words, the epigraph of f, defined as
epi f := {(x, t) ∈ R^{n+1} : t ≥ f(x)},
is a convex set.
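The inequality (2.14) can be spot-checked numerically for a given function. The sketch below (ours, for illustration; the particular convex function is an arbitrary choice) samples random pairs of points and mixing weights and records the largest observed violation, which should never be positive:

```python
import numpy as np

rng = np.random.default_rng(1)

# An illustrative convex function on R^2: f(x) = log(exp(x1) + exp(x2)) + ||x||^2.
f = lambda x: np.logaddexp(x[0], x[1]) + x @ x

# Spot-check (2.14): f((1-a)x + a y) <= (1-a) f(x) + a f(y).
worst_gap = -np.inf
for _ in range(10000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    a = rng.uniform()
    gap = f((1 - a) * x + a * y) - ((1 - a) * f(x) + a * f(y))
    worst_gap = max(worst_gap, gap)

print("largest observed violation of (2.14):", worst_gap)  # <= 0, up to roundoff
```

Such a check cannot prove convexity, of course, but a single positive gap would disprove it.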
The concepts of “minimizer” and “solution” are simpler when the objective function and the
constraint set are convex than in the general case. In particular, the distinction between “local”
and “global” solutions disappears, as we now show.
Theorem 2.6. Suppose that in (2.1), the function f is convex and the set Ω is closed and convex.
We have the following.
(a) Any local solution of (2.1) is also a global solution.
(b) The set of global solutions of (2.1) is convex.
Proof. For (a), suppose for contradiction that x∗ ∈ Ω is a local solution but not a global solution,
so that there exists a point x̄ ∈ Ω with f(x̄) < f(x∗). Then by convexity we have for any α ∈ (0, 1)
that
f(x∗ + α(x̄ − x∗)) ≤ (1 − α)f(x∗) + αf(x̄) < f(x∗).
But for any neighborhood N of x∗, we have x∗ + α(x̄ − x∗) ∈ N ∩ Ω for sufficiently small α > 0,
and f(x∗ + α(x̄ − x∗)) < f(x∗), contradicting the definition of a local solution.
For (b), we simply apply the definition of convexity for both sets and functions. Given any
two global solutions x∗ and x̄, we have f(x̄) = f(x∗), so for any α ∈ [0, 1] we have
f(x∗ + α(x̄ − x∗)) ≤ (1 − α)f(x∗) + αf(x̄) = f(x∗).
We also have f(x∗ + α(x̄ − x∗)) ≥ f(x∗), since x∗ + α(x̄ − x∗) ∈ Ω (by convexity of Ω) and x∗ is a global minimizer.
It follows from these two inequalities that f(x∗ + α(x̄ − x∗)) = f(x∗), so that x∗ + α(x̄ − x∗) is also
a global minimizer.
By applying Taylor’s theorem (in particular, (2.6)) to the left-hand side of the definition of
convexity (2.14), we obtain
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x), for all x and y. (2.16)
Theorem 2.7. Suppose that f is continuously differentiable and convex. If ∇f(x∗) = 0, then
x∗ is a global minimizer of (2.10).
Proof. The result follows immediately from condition (2.16) with x = x∗. Using this inequality
together with ∇f(x∗) = 0, we have for any y that
f(y) ≥ f(x∗) + ∇f(x∗)ᵀ(y − x∗) = f(x∗),
so x∗ is a global minimizer of (2.10).
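To illustrate Theorem 2.7 (a sketch of ours, not from the text): for a convex least-squares objective, a point where the gradient vanishes, found here by solving the normal equations, is a global minimizer, so no randomly sampled trial point should achieve a smaller objective value. The data A and b are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Convex least-squares objective f(x) = 0.5 * ||A x - b||^2 (illustrative data).
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: A.T @ (A @ x - b)

# Solve grad f(x) = A^T A x - A^T b = 0 directly (the normal equations).
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print("||grad f(x*)|| =", np.linalg.norm(grad_f(x_star)))   # approximately 0

# By Theorem 2.7, x* must be a global minimizer: no trial point does better.
trials = rng.standard_normal((10000, 4))
print("min over random trials minus f(x*):",
      min(f(x) for x in trials) - f(x_star))                # nonnegative
```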
Theorem 2.8. Suppose that f is continuously differentiable and strongly convex. If ∇f(x∗) = 0,
then x∗ is the unique global minimizer of f.
This approximation of convex f by quadratic functions is one of the most central themes in
continuous optimization. Note that when f is strongly convex and twice continuously differentiable
with minimizer x∗, (2.5) implies the following:
f(x) − f(x∗) = (1/2)(x − x∗)ᵀ∇²f(x∗)(x − x∗) + o(‖x − x∗‖²). (2.19)
Thus, f behaves like a strongly convex quadratic function in a neighborhood of x∗ . It follows
that we can learn a lot about local convergence properties of algorithms just by studying convex
quadratic functions, and we use quadratic functions as a guide for both intuition and algorithmic
derivation throughout.
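The following small experiment (ours, with an arbitrary strongly convex test function) illustrates (2.19): the ratio of f(x) − f(x∗) to the quadratic model (1/2)(x − x∗)ᵀ∇²f(x∗)(x − x∗) approaches 1 as x approaches x∗.

```python
import numpy as np

# Strongly convex, non-quadratic illustration: f(x) = 2*cosh(x1) + x2^2,
# minimized at x* = 0 with Hessian(x*) = diag(2, 2).
f = lambda x: 2.0 * np.cosh(x[0]) + x[1] ** 2
x_star = np.zeros(2)
H_star = np.diag([2.0, 2.0])

d = np.array([0.6, 0.8])   # unit-norm direction
# Per (2.19), the ratio (f(x) - f(x*)) / (0.5 (x - x*)^T H (x - x*)) tends to 1 as x -> x*.
for t in [1.0, 0.3, 0.1, 0.03, 0.01]:
    x = x_star + t * d
    quad = 0.5 * (x - x_star) @ H_star @ (x - x_star)
    ratio = (f(x) - f(x_star)) / quad
    print(f"||x - x*|| = {t:.2f},  (f(x) - f(x*)) / quadratic model = {ratio:.4f}")
```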
Just as we could characterize the Lipschitz constant of the gradient in terms of the eigenvalues
of the Hessian, the strong convexity parameter provides a lower bound on the eigenvalues of the
Hessian when f is twice continuously differentiable. That is, we have the following result.
Lemma 2.9. Suppose that f is twice continuously differentiable on R^n. Then f has modulus of
convexity m if and only if ∇²f(x) ⪰ mI for all x.
Proof. For any x, u ∈ R^n and α > 0, we have from Taylor’s theorem that
f(x + αu) = f(x) + α∇f(x)ᵀu + (α²/2) uᵀ∇²f(x + tαu)u, for some t ∈ (0, 1).
From the strong convexity property, we have
f(x + αu) ≥ f(x) + α∇f(x)ᵀu + (m/2)α²‖u‖².
Comparing these two expressions and canceling the common terms, we obtain
uᵀ∇²f(x + tαu)u ≥ m‖u‖².
Letting α ↓ 0 and using continuity of ∇²f, we conclude that uᵀ∇²f(x)u ≥ m‖u‖² for all u, that is,
∇²f(x) ⪰ mI. Conversely, if ∇²f(x) ⪰ mI for all x, substituting this bound into the Taylor
expansion above yields the strong convexity inequality.
Corollary 2.10. Suppose that the conditions of Lemma 2.3 hold, and in addition that f is convex.
Then 0 ⪯ ∇²f(x) ⪯ LI if and only if f is L-smooth.
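For a convex quadratic, Lemma 2.9 and Corollary 2.10 can be read off from the eigenvalues of the (constant) Hessian, as in the sketch below (ours, with an arbitrary positive definite matrix): the smallest eigenvalue gives the modulus of convexity m and the largest gives the Lipschitz constant L of the gradient.

```python
import numpy as np

rng = np.random.default_rng(3)

# Convex quadratic f(x) = 0.5 * x^T A x + b^T x with A symmetric positive definite
# (illustrative data).  Its Hessian is the constant matrix A, so by Lemma 2.9 and
# Corollary 2.10 the modulus of convexity m and the Lipschitz constant L of the
# gradient are the extreme eigenvalues of A.
B = rng.standard_normal((6, 6))
A = B.T @ B + 0.5 * np.eye(6)

eigs = np.linalg.eigvalsh(A)
m, L = eigs.min(), eigs.max()
print(f"modulus of convexity m = {m:.3f},  gradient Lipschitz constant L = {L:.3f}")
print(f"condition number L/m  = {L / m:.3f}")
```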
Notation
We list key notational conventions that are used in the rest of the book.
• Given two sequences of nonnegative scalars {ηk } and {ζk }, with ζk → ∞, we write ηk = O(ζk )
if there exists a constant M such that ηk ≤ M ζk for all k sufficiently large. The same definition
holds if ζk → 0.
Exercises
1. Prove that the effective domain of a convex function is a convex set.
2. Prove that epi f is a convex subset of R^{n+1} for any convex function f.
3. Show rigorously how (2.18) is derived from (2.17) when f is continuously differentiable.