Optimization
The foundation of engineering is the ability to use math and physics to design and
optimize complex systems. The advent of computers has made this possible on an
unprecedented scale. This chapter provides a brief introduction to mathematical
optimization theory.
f ′ (x)(h) = T h = J(x) h.
112 CHAPTER 5. OPTIMIZATION
one needs the linear structure to compute differences, and the norm topology to
define limits. Completeness guarantees that limits exist under mild conditions.
\[
  \delta f(x; h) \triangleq \lim_{t \to 0} \frac{f(x + th) - f(x)}{t},
\]
where the limit is with respect to the implied mapping from t ∈ R to Y .
Lemma 5.1.2. Let Y = (R, | · |) and suppose that δf (x; h) exists and is negative
for some f , x, and h. Then, there exists t0 > 0 such that f (x + th) < f (x) for all
t ∈ (0, t0 ).
Proof. The δf(x; h) limit implies that, for any ϵ > 0, there is a t0 > 0 such that
\[
  \frac{f(x + th) - f(x)}{t} \le \delta f(x; h) + \epsilon
\]
for all t ∈ (0, t0). If δf(x; h) < 0, then one can choose ϵ = −(1/2) δf(x; h) to see that
the RHS is negative for all t ∈ (0, t0). The stated result follows.
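As a small numerical illustration of Lemma 5.1.2, the sketch below uses a hypothetical functional f(x) = x1^2 + 3 x2^2 and the direction h = −∇f(x). Neither is taken from the text, and the helper names (f, grad_f, step, gateaux) are assumptions made for this example. The differential is estimated by a forward difference, so the value is only approximate.

```python
def f(x):
    """Hypothetical test functional f(x) = x1^2 + 3*x2^2 (assumed for illustration)."""
    return x[0] ** 2 + 3.0 * x[1] ** 2


def grad_f(x):
    """Gradient of the hypothetical functional above."""
    return [2.0 * x[0], 6.0 * x[1]]


def step(x, h, t):
    """Return the point x + t*h."""
    return [xi + t * hi for xi, hi in zip(x, h)]


def gateaux(func, x, h, t=1e-6):
    """Forward-difference estimate of the Gateaux differential of func at x along h."""
    return (func(step(x, h, t)) - func(x)) / t


x = [1.0, 1.0]
h = [-g for g in grad_f(x)]          # direction of steepest descent
d = gateaux(f, x, h)                 # should be close to -<grad f, grad f>

# Lemma 5.1.2 predicts f(x + t*h) < f(x) for all sufficiently small t > 0.
decreases = [f(step(x, h, t)) < f(x) for t in (1e-1, 1e-2, 1e-3, 1e-4)]
print(d, decreases)
```

The lemma only guarantees a decrease for t in some interval (0, t0); the step sizes tested here happen to fall inside that interval for this particular functional.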
Example 5.1.3. For the standard Banach space X = Y = R^2, let f(x) = (x1 x2, x1 +
x2^2). Then, for x = (1, 1), h = (1, 2), we have
\[
  \delta f(x; h) = \frac{d}{dt} \Big( (1 + t)(1 + 2t),\; (1 + t) + (1 + 2t)^2 \Big) \Big|_{t=0} = (3, 5).
\]
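The computation in Example 5.1.3 can be checked numerically. The sketch below estimates the Gâteaux differential by a central difference and compares it with the Jacobian-vector product J(x) h. The Jacobian entries are derived by hand for this example, and the function names are assumptions, not part of the text.

```python
def f(x):
    """The map from Example 5.1.3: f(x) = (x1*x2, x1 + x2^2)."""
    x1, x2 = x
    return (x1 * x2, x1 + x2 ** 2)


def gateaux(func, x, h, t=1e-6):
    """Central-difference estimate of the Gateaux differential of func at x along h."""
    xp = [xi + t * hi for xi, hi in zip(x, h)]
    xm = [xi - t * hi for xi, hi in zip(x, h)]
    return tuple((a - b) / (2.0 * t) for a, b in zip(func(xp), func(xm)))


def jacobian(x):
    """Hand-derived Jacobian of f: rows are the gradients of each component."""
    x1, x2 = x
    return [[x2, x1], [1.0, 2.0 * x2]]


def matvec(J, h):
    """Matrix-vector product J h."""
    return tuple(sum(Jij * hj for Jij, hj in zip(row, h)) for row in J)


x, h = (1.0, 1.0), (1.0, 2.0)
numeric = gateaux(f, x, h)
exact = matvec(jacobian(x), h)
print(numeric, exact)
```

Both results should agree with the value (3, 5) from the example, up to finite-difference rounding.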
\[
  \lim_{h \to 0} \frac{\| f(x + h) - f(x) - T(h) \|_Y}{\|h\|_X} = 0, \tag{5.1}
\]
where the limit is with respect to the implied Banach space mapping X → R. In
this case, the Fréchet derivative at x equals T and is denoted by f ′ (x) in general.
It is worth noting that the orientation of the gradient vector (i.e., row versus column
vector) is sometimes defined differently. This is because derivatives can be
understood as linear transforms and either orientation can be used to define the
correct linear transform.
Problem 5.1.10. In the setting of the previous example, show that, if ∇f(x) ≠ 0,
then f(x − δ∇f(x)) < f(x) for some δ > 0.
Thus, the Gâteaux differential exists and satisfies δf (x; h) = T (h) = f ′ (x)(h).
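Proof. For the stated derivatives, the errors in the implied linear approximations are
\[
\begin{aligned}
  \phi(v) &= f(x + v) - f(x) - f'(x)(v), \\
  \psi(u) &= g(y + u) - g(y) - g'(y)(u), \\
  \rho(h) &= g(f(x + h)) - g(f(x)) - \big( g'(y) \circ f'(x) \big)(h),
\end{aligned}
\]
where y = f(x).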
From the assumptions of differentiability, we know that the first two approximations
become tight for small perturbations. In other words,
\[
  \lim_{v \to 0} \frac{\|\phi(v)\|_Y}{\|v\|_X} = 0, \qquad
  \lim_{u \to 0} \frac{\|\psi(u)\|_Z}{\|u\|_Y} = 0.
\]
5.1. DERIVATIVES IN BANACH SPACES 115
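\[
  g(f(x + h)) - g(f(x)) = g\big( f(x) + f'(x)(h) + \phi(h) \big) - g(y).
\]
\[
\begin{aligned}
  \rho(h) &= g\big( f(x) + f'(x)(h) + \phi(h) \big) - g(y) - \big( g'(y) \circ f'(x) \big)(h) \\
          &= \psi\big( f'(x)(h) + \phi(h) \big) + g'(y)\big( f'(x)(h) + \phi(h) \big) - \big( g'(y) \circ f'(x) \big)(h) \\
          &= \psi\big( f'(x)(h) + \phi(h) \big) + g'(y)(\phi(h)).
\end{aligned}
\]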
\[
  M = \frac{\sup_{s \in [0,1]} \big\| \delta f\big( (1 - s) x_1 + s x_2 ;\, h \big) \big\|}{\|x_2 - x_1\|} .
\]
Suppose that ∥f (x2 ) − f (x1 )∥ > M ∥x2 − x1 ∥. Then, there is an ϵ > 0 such that
one or both of the following conditions must hold:
Repeating indefinitely and choosing a satisfying subinterval at each step, one gets a
sequence wn of midpoints that converges to x = (1−s)x1 +sx2 for some s ∈ [0, 1].
Since the Gâteaux differential δf (x; h) exists by assumption, it follows that
This contradicts the definition of M and, thus, ∥f (x2 )−f (x1 )∥ ≤ M ∥x2 −x1 ∥.
Proof. Assume ∥f ′ (x)∥ ≤ L for all x in a convex set A ⊆ X. Then, for any
x1 , x2 ∈ A, let h = x2 − x1 and notice that Theorem 5.1.11 implies that
\[
  \| \delta f(x_1 + sh; h) \| = \| f'(x_1 + sh)(h) \| \le \| f'(x_1 + sh) \| \, \|h\|,
\]
for all s ∈ [0, 1]. Applying Theorem 5.1.13, we see that ∥f (x2 )−f (x1 )∥ ≤ M ∥x2 −
x1 ∥ with M ≤ ∥f ′ (x)∥ ≤ L. This completes the proof.
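The bound in the proof above can be sanity-checked in one dimension. The sketch below uses the hypothetical choice f(x) = sin(x), whose derivative satisfies |f′(x)| = |cos(x)| ≤ 1, so the Lipschitz constant L = 1 should bound every difference quotient. The function, grid, and variable names are assumptions made for this illustration.

```python
import math

L = 1.0  # bound on |f'(x)| = |cos(x)| for the hypothetical choice f(x) = sin(x)

# Grid of test points on [-3, 3] with spacing 0.25.
points = [-3.0 + 0.25 * k for k in range(25)]

# Largest observed ratio |f(b) - f(a)| / |b - a| over all distinct pairs.
worst_ratio = max(
    abs(math.sin(b) - math.sin(a)) / abs(b - a)
    for a in points
    for b in points
    if a != b
)
print(worst_ratio)
```

A finite grid cannot prove the bound; it only shows that no tested pair violates it.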
Lemma 5.1.15. Let f : X → R map the Hilbert space X to the real numbers. If
∇f (x) exists and satisfies ∥∇f (y) − ∇f (x)∥ ≤ L∥y − x∥, then
\[
  f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \le \frac{1}{2} L \|y - x\|^2 .
\]
5.2. UNCONSTRAINED OPTIMIZATION 117
Proof. Let h = y − x and ϕ(t) = f (x + th). Then, ϕ′ (t) = ⟨h, ∇f (x + th)⟩ and
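\[
\begin{aligned}
  f(y) - f(x) - \langle h, \nabla f(x) \rangle
    &= \int_0^1 \big( \phi'(t) - \phi'(0) \big) \, dt \\
    &= \int_0^1 \langle h, \nabla f(x + th) - \nabla f(x) \rangle \, dt \\
    &\le \int_0^1 \|h\| \, \| \nabla f(x + th) - \nabla f(x) \| \, dt \\
    &\le \int_0^1 \|h\| \, L \, \|th\| \, dt \\
    &= \frac{1}{2} L \|h\|^2 .
\end{aligned}
\]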
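To see the inequality of Lemma 5.1.15 in action, the sketch below tests it on a hypothetical quadratic f(x) = (1/2) x^T Q x on R^2. For this choice ∇f(x) = Qx, and the gradient is Lipschitz with constant L equal to the largest eigenvalue of Q, which is 2.5 for the matrix assumed here. The matrix, sample points, and helper names are all assumptions for illustration.

```python
import random

Q = [[2.0, -0.5], [-0.5, 2.0]]   # hypothetical symmetric matrix, eigenvalues 1.5 and 2.5
L = 2.5                          # Lipschitz constant of the gradient x -> Q x


def f(x):
    """Quadratic f(x) = 0.5 * x^T Q x."""
    return 0.5 * (Q[0][0] * x[0] ** 2 + 2.0 * Q[0][1] * x[0] * x[1] + Q[1][1] * x[1] ** 2)


def grad_f(x):
    """Gradient of the quadratic: Q x."""
    return [Q[0][0] * x[0] + Q[0][1] * x[1], Q[1][0] * x[0] + Q[1][1] * x[1]]


def descent_gap(x, y):
    """Return (lhs, rhs) of f(y) - f(x) - <y - x, grad f(x)> <= 0.5 * L * ||y - x||^2."""
    h = [yi - xi for xi, yi in zip(x, y)]
    inner = sum(hi * gi for hi, gi in zip(h, grad_f(x)))
    lhs = f(y) - f(x) - inner
    rhs = 0.5 * L * sum(hi * hi for hi in h)
    return lhs, rhs


rng = random.Random(0)           # fixed seed so the run is repeatable
results = []
for _ in range(1000):
    x = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
    y = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
    results.append(descent_gap(x, y))

holds = all(lhs <= rhs + 1e-9 for lhs, rhs in results)
print(holds)
```

For a quadratic the left-hand side equals (1/2) h^T Q h exactly, so the bound is tight only when h is aligned with the eigenvector of the largest eigenvalue.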
Linear functionals (i.e., functionals that are linear) are used to define many
important concepts in abstract vector spaces. For unconstrained optimization,
however, linear functionals are not interesting because they are either zero or they
achieve all values in F.
Definition 5.2.2. Let (X, ∥ · ∥) be a normed vector space. Then, a real functional
f : X → R achieves a local minimum value at x0 ∈ X if there is an ϵ > 0 such
that, for all x ∈ X satisfying ∥x − x0 ∥ < ϵ, we have f (x) ≥ f (x0 ). If the bound
holds for all x ∈ X, then the local minimum is also a global minimum value.
Proof. First, we apply Lemma 5.1.2 with the x0 and h for which δf(x0; h) < 0. This
gives a t0 > 0 such that f(x0 + th) < f(x0) for all t ∈ (0, t0). Thus, there can be
no ϵ > 0 satisfying the definition of a local minimum value in Definition 5.2.2.
Definition 5.3.2. A Banach space X is called strictly convex if the unit ball, given
by {x ∈ X| ∥x∥ ≤ 1}, is a strictly convex set. An equivalent condition is that
equality in the triangle inequality (i.e., ∥x + y∥ = ∥x∥ + ∥y∥) for non-zero vectors
implies that x = sy for some s ∈ F .
Example 5.3.3. Let (X, ∥· ∥) be a normed vector space. Then, the norm ∥ ·∥ : X →
R is a convex functional on X. Proving this is a good introductory exercise.
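Before attempting the exercise in Example 5.3.3, it can help to see the convexity inequality hold numerically. The sketch below samples random points in R^3 and checks ∥λx + (1 − λ)y∥ ≤ λ∥x∥ + (1 − λ)∥y∥ for the 1-, 2-, and ∞-norms. The choice of space, norms, and helper names is an assumption for illustration, and a random test is of course not a proof.

```python
import random


def norm(x, p):
    """p-norm on R^n for p = 1, 2, or float('inf')."""
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def convexity_gap(x, y, lam, p):
    """lam*||x|| + (1-lam)*||y|| - ||lam*x + (1-lam)*y||; nonnegative if the norm is convex."""
    z = [lam * a + (1.0 - lam) * b for a, b in zip(x, y)]
    return lam * norm(x, p) + (1.0 - lam) * norm(y, p) - norm(z, p)


rng = random.Random(0)           # fixed seed so the run is repeatable
gaps = []
for _ in range(1000):
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    y = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    lam = rng.random()
    for p in (1, 2, float("inf")):
        gaps.append(convexity_gap(x, y, lam, p))

min_gap = min(gaps)
print(min_gap)
```

The actual proof needs only the triangle inequality and absolute homogeneity of the norm.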
with equality iff x = y. Thus, the square of the induced norm ∥ · ∥2 is a strictly
convex functional on X.
Proof. Let x0 ∈ A be a point where the functional achieves a local minimum value.
Proving by contradiction, we suppose that there is another point x1 ∈ A such that
f(x1) < f(x0). From the definition of a local minimum value, we find an ϵ > 0
such that f(x) ≥ f(x0) for all x ∈ A satisfying ∥x − x0∥ < ϵ. Choosing λ ∈ (0, 1)
with λ < ϵ/∥x0 − x1∥ and x = (1 − λ)x0 + λx1 implies that ∥x − x0∥ < ϵ while the
convexity of f implies that
\[
  f(x) \le (1 - \lambda) f(x_0) + \lambda f(x_1) < f(x_0).
\]
This contradicts the definition of a local minimum value and implies that f(x0) is
a global minimum value on A. If f is strictly convex and f(x1) = f(x0), then we
suppose that x0 ≠ x1. In this case, strict convexity implies that
\[
  f\big( \tfrac{1}{2} x_0 + \tfrac{1}{2} x_1 \big) < \tfrac{1}{2} f(x_0) + \tfrac{1}{2} f(x_1) = f(x_0).
\]
This contradicts the fact that f(x0) is a global minimum value on A and implies
that x0 = x1, so the minimizer is unique.
Theorem 5.3.6. Let (X, ∥·∥) be a normed vector space and f : X → R be a convex
functional on a convex set A ⊆ X. If f is Gâteaux differentiable at x0 ∈ A, then
for all λ ∈ (0, 1). Also, if f is strictly convex, then (5.2) is strict for x ≠ x0. Thus,
\[
  f(x) \ge f(x_0) + \frac{f(x_0 + \lambda(x - x_0)) - f(x_0)}{\lambda}
\]
and taking the limit as λ ↓ 0 completes the proof for a convex functional.
For the case where f is strictly convex, we first apply the convex result to see
where the second step holds because δf (x; h) is linear in h. This gives
\[
  \delta f(x_0; x - x_0) \le \frac{f(x_0 + \lambda(x - x_0)) - f(x_0)}{\lambda} < f(x) - f(x_0),
\]
where the second inequality holds because (5.2) is a strict inequality for x ̸= x0 .
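The first-order inequality f(x) ≥ f(x0) + δf(x0; x − x0) for a convex functional can be checked on a simple example. The sketch below assumes the convex function f(x) = exp(x1) + x2^2 on R^2, for which the Gâteaux differential along h is the inner product of h with the gradient. The function and the helper names are hypothetical choices for this illustration.

```python
import math
import random


def f(x):
    """Hypothetical convex function f(x) = exp(x1) + x2^2."""
    return math.exp(x[0]) + x[1] ** 2


def grad_f(x):
    """Gradient of the hypothetical function above."""
    return [math.exp(x[0]), 2.0 * x[1]]


def first_order_gap(x0, x):
    """f(x) - f(x0) - delta f(x0; x - x0); nonnegative when f is convex."""
    d = [a - b for a, b in zip(x, x0)]
    differential = sum(g * di for g, di in zip(grad_f(x0), d))
    return f(x) - f(x0) - differential


rng = random.Random(0)           # fixed seed so the run is repeatable
gaps = []
for _ in range(1000):
    x0 = [rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)]
    x = [rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)]
    gaps.append(first_order_gap(x0, x))

min_gap = min(gaps)
print(min_gap)
```

Geometrically, the gap is the vertical distance between the graph of f and its tangent plane at x0, which is why it cannot be negative for a convex function.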
\[
\begin{array}{ll}
  \text{minimize}   & f_0(x) \\
  \text{subject to} & f_i(x) \le 0, \quad i = 1, 2, \ldots, m \\
                    & h_j(x) = 0, \quad j = 1, 2, \ldots, p \\
                    & x \in D.
\end{array}
\]
\[
  p^* = \inf_{x \in F} f_0(x).
\]
not feasible.
Evaluating the objective at any feasible point automatically gives an upper bound
on p∗ because
p∗ ≤ f0 (x) ∀x ∈ F.
The optimization of a linear function with arbitrary affine equality and inequality
constraints is called a linear program. Linear programs (LPs) have many equivalent
forms and any linear program can be transformed into any standard form.
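\[
\begin{array}{ll}
  \text{minimize}   & c^T x \\
  \text{subject to} & Ax = b \\
                    & x \succeq 0
\end{array}
\qquad\qquad
\begin{array}{ll}
  \text{minimize}   & c^T x \\
  \text{subject to} & Ax \succeq b \\
                    & x \succeq 0.
\end{array}
\]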
where λi is the Lagrange multiplier associated with the i-th inequality constraint
and νj is the Lagrange multiplier associated with the j-th equality constraint.
Definition 5.4.5. A point x∗ is called locally optimal if there is an ϵ0 > 0 such that,
for all ϵ < ϵ0 , it holds that f0 (x) ≥ f0 (x∗ ) for all x ∈ F satisfying ∥x − x∗ ∥ < ϵ.
The i-th inequality constraint is active at x∗ if fi (x∗ ) = 0. Otherwise, it is inactive.
Let A = {i ∈ {1, . . . , m} | fi (x∗ ) = 0} be the set of active constraints at x∗ .
Figure 5.1: A contour plot of the function f0(x1, x2) = (x1 − 1)^2 + (x2 − 1)^2 − x1 x2/2
whose minimum occurs at (4/3, 4/3) (i.e., the center of the blue ellipse). The red
line indicates the inequality constraint f1(x1, x2) = 1.85 + (x1 − 2.25)^2/2 − x2 ≤ 0.
The picture shows that the constrained minimum occurs where the objective contour
line is tangent to the active constraint line.
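The tangency described in the caption of Figure 5.1 can be reproduced numerically. The sketch below first confirms that the unconstrained minimizer (4/3, 4/3) violates the constraint, so the constraint must be active at the constrained minimum. It then minimizes f0 along the boundary x2 = 1.85 + (x1 − 2.25)^2/2 by bisection on the derivative and checks that ∇f0 + λ∇f1 = 0 there for some λ > 0. The bracket [0, 3], the bisection approach, and all helper names are assumptions made for this example.

```python
def grad_f0(x1, x2):
    """Gradient of f0(x1, x2) = (x1-1)^2 + (x2-1)^2 - x1*x2/2."""
    return (2.0 * (x1 - 1.0) - x2 / 2.0, 2.0 * (x2 - 1.0) - x1 / 2.0)


def f1(x1, x2):
    """Inequality constraint function; feasible points satisfy f1 <= 0."""
    return 1.85 + (x1 - 2.25) ** 2 / 2.0 - x2


def grad_f1(x1, x2):
    """Gradient of the constraint function f1."""
    return (x1 - 2.25, -1.0)


def boundary(x1):
    """x2 on the constraint boundary f1 = 0 for a given x1."""
    return 1.85 + (x1 - 2.25) ** 2 / 2.0


def dphi(x1):
    """Derivative of f0 along the boundary curve (chain rule)."""
    g = grad_f0(x1, boundary(x1))
    return g[0] + g[1] * (x1 - 2.25)


# The unconstrained minimizer is stationary for f0 but infeasible.
xu = (4.0 / 3.0, 4.0 / 3.0)
unconstrained_grad = grad_f0(*xu)
violation = f1(*xu)

# Bisection on dphi over an assumed bracket where it changes sign.
lo, hi = 0.0, 3.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if dphi(mid) < 0.0:
        lo = mid
    else:
        hi = mid
x1 = 0.5 * (lo + hi)
x2 = boundary(x1)

# Solve the second stationarity component for the multiplier, then test the first.
g0, g1 = grad_f0(x1, x2), grad_f1(x1, x2)
lam = g0[1]                      # from g0[1] + lam * (-1) = 0
residual = g0[0] + lam * g1[0]
print((x1, x2), lam, residual)
```

A positive multiplier with zero residual means the two gradients point in opposite directions, which is the algebraic form of the contour line being tangent to the constraint line.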
The geometric picture implied by Theorem 5.4.7 is that of a game where one
would like to decrease the objective f0(x∗) by choosing y such that ∇f0(x∗)^T y < 0
but there are constraints on the set of allowable y’s. Let H = span({∇hj(x∗)}) be
the subspace of directions that violate the equality constraints at x∗. Similarly,
the cone of directions that violate the active inequality constraints is given by
\[
  F = \Big\{ \sum_{i \in A} \lambda_i \nabla f_i(x^*) \;\Big|\; \lambda_i \ge 0, \; i \in A \Big\} .
\]
Thus, one can only pick directions y that are orthogonal to all vectors in H and also
have a non-positive inner product with all vectors in F.
Let the matrix P define the orthogonal projection of R^n onto H^⊥. Using this,
−P ∇f0 (x∗ ) ∈ P F
or “the projection, onto H ⊥ , of the descent direction lies in the projection, onto H ⊥ ,
of the cone of directions that violate the inequality constraints”. The reason for this
is that we can absorb the ∇hj terms into the ∇fi terms by defining
\[
  f^{(i)} = \nabla f_i(x^*) + \sum_{j=1}^{p} \nu_{j,i} \nabla h_j(x^*) = P \nabla f_i(x^*)
\]
If −P∇f0(x∗) ∉ PF, then we project −P∇f0(x∗) onto PF to get a non-zero
residual y. The resulting vector gives a direction where the objective function
decreases linearly in t and the constraint violations are o(t). The challenge in making
this proof precise is that, unless the equality constraints are affine, they may not be
exactly satisfied for t > 0. In standard proofs of this result, this difficulty is
overcome by using the implicit function theorem to construct an x(t) that starts in the
direction of y but is perturbed slightly to remain feasible.
Proof. For simplicity, we prove only the case where hj(x) = aj^T x − b is affine.
First, we define
\[
  y(\lambda, \nu) = -\nabla f_0(x^*) - \sum_{i=1}^{m} \lambda_i \nabla f_i(x^*)
    - \sum_{j=1}^{p} \nu_j \underbrace{\nabla h_j(x^*)}_{a_j} .
\]
The vector y(λ, ν) can be seen as the residual of the descent direction for the
objective function after the constraint gradients have been used to cancel some parts.
Next, we let ν∗(λ) = arg min_{ν∈R^p} ∥y(λ, ν)∥ and apply the best approximation
theorem (for the standard inner product space) to see that
where the matrix P defines an orthogonal projection onto H^⊥ and H = span({aj}).
This ensures that each hj(x∗ + ty(λ, ν∗(λ))) = 0 for all λ ∈ R^m and t ∈ R.
5.4. CONSTRAINED OPTIMIZATION 125
Continuing, we let S = {λ ∈ R^m | λ ≥ 0, λi = 0, i ∉ A} and compute
This optimization uses the gradients of the active constraints to cancel as much of
the residual descent direction as possible. Thus, y∗ ≠ 0 implies there is a descent
direction that does not violate the constraints. Looking at the formulas for y(λ, ν)
and y(λ, ν∗(λ)), we can also interpret y∗ as the error vector for the projection of
v = −P∇f0(x∗) onto the convex set PF, which is the closed convex cone of
perturbations that preserve the equality constraints but locally violate the inequality
constraints. Then, u∗ = v − y∗ equals the projection itself and we observe that
Theorem 4.6.7 implies (u − u∗)^T (v − u∗) ≤ 0 for all u ∈ PF. Since 0 ∈ PF, we
can choose u = 0 to see that (u∗)^T (v − u∗) ≥ 0. Using this, we can write
maximize g(λ, ν)
subject to λ ≥ 0
Proof. The Lagrangian dual function is concave because it is the pointwise infimum
of affine functions
= αg(λ, ν) + (1 − α)g(λ′ , ν ′ ).
Thus, it follows from Theorem 5.3.5 that g has a unique maximum value d∗ which
can be upper bounded by
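\[
\begin{aligned}
  g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu)
    &\overset{(a)}{\le} \inf_{x \in F} L(x, \lambda, \nu) \\
    &\overset{(b)}{=} p^* + \sum_{i=1}^{m} \lambda_i f_i(x) \overset{(c)}{\le} p^*,
\end{aligned}
\]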
where (a) is implied by F ⊆ D, (b) follows from hj (x) = 0 for x ∈ F, and (c)
holds by combining fi (x) ≤ 0 for x ∈ F and λi ≥ 0.
The Lagrangian dual function can be −∞ for a wide range of (λ, ν). In this
case, it makes sense to eliminate these points by defining the implicit constraint set
The points (λ, ν) ∈ C are called dual feasible and it follows that
Definition 5.4.10. If d∗ = p∗ , then one says strong duality holds for the problem.
Theorem 5.4.11. Let x∗ be a primal optimal point and (λ∗ , ν ∗ ) be a dual optimal
point. If strong duality holds, x∗ ∈ D◦ , and all fi and hj functions are differentiable
at x∗ , then we get the KKT conditions of complementary slackness, λ∗i fi (x∗ ) = 0
for i = 1, . . . , m, and stationarity (5.3).
d∗ = g(λ∗ , ν ∗ ) ≤ f0 (x∗ ) = p∗ .
Since x∗ is feasible, combining d∗ = p∗ with the proof of weak duality shows that
and λ∗i = 0 if the i-th inequality constraint is inactive (i.e., fi (x∗ ) < 0). Thus,
we also observe that the complementary slackness condition λ∗i fi (x∗ ) = 0 holds for
i = 1, . . . , m. Since x∗ ∈ D◦ , it follows that x∗ is a locally optimal point of
L(x, λ∗ , ν ∗ ). Thus, x∗ must be a stationary point of L(x, λ∗ , ν ∗ ) and taking the
x-derivative gives (5.3).
Example 5.4.12. For the first LP in Definition 5.4.3, the Lagrangian is given by
\[
  L(x, \lambda, \nu) = c^T x + \nu^T (b - Ax) - \lambda^T x,
\]
where the λ term is negative because the constraint is x ⪰ 0. Thus, the Lagrangian
dual function is given by
\[
  g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) =
  \begin{cases}
    b^T \nu  & \text{if } c - A^T \nu - \lambda = 0 \\
    -\infty  & \text{otherwise.}
  \end{cases}
\]
Solving the implicit constraint and using the fact that λ ⪰ 0, one gets the dual LP
problem
\[
\begin{array}{ll}
  \text{maximize}   & b^T \nu \\
  \text{subject to} & A^T \nu \preceq c.
\end{array}
\]
Strong duality for linear programs says that, if the original LP has an optimal
solution (i.e., it is neither unbounded nor infeasible), then the dual LP has an optimal
solution of the same value.
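A tiny instance makes the primal-dual relationship concrete. The sketch below assumes the hypothetical data c = (1, 2), A = [1 1], b = 1, so the primal feasible set is the segment x = (s, 1 − s) with s ∈ [0, 1] and the dual constraint A^T ν ⪯ c reduces to ν ≤ 1 and ν ≤ 2. Both problems are solved by brute force on a grid, which is only reasonable because the instance is so small; the data and the grid resolution are assumptions for illustration.

```python
# Hypothetical LP data: minimize c^T x subject to A x = b, x >= 0.
c = [1.0, 2.0]
A = [[1.0, 1.0]]
b = [1.0]

# Primal: every feasible x has the form (s, 1 - s) with 0 <= s <= 1.
grid = [k / 100.0 for k in range(101)]
primal_points = [(s, 1.0 - s) for s in grid]
primal_values = [c[0] * x[0] + c[1] * x[1] for x in primal_points]
p_star = min(primal_values)

# Dual: maximize b^T nu subject to A^T nu <= c (componentwise).
nus = [k / 100.0 for k in range(-300, 301)]
dual_values = [
    b[0] * nu
    for nu in nus
    if all(A[0][i] * nu <= c[i] for i in range(len(c)))
]
d_star = max(dual_values)

# Weak duality: every dual feasible value lower-bounds every primal feasible value.
weak_duality = all(d <= p + 1e-12 for d in dual_values for p in primal_values)
print(p_star, d_star, weak_duality)
```

For this instance the primal optimum is attained at x = (1, 0) and the dual optimum at ν = 1, so the two optimal values coincide as strong duality predicts.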
Applying Theorem 5.3.5 to this setup shows that a convex standard-form optimization
problem has a unique minimum value. Also, if the function f0 is strictly
convex, then the minimum value is achieved uniquely. There are a number of stronger
conditions that also imply strong duality for convex optimization problems. Slater’s
condition is stated below as a theorem and its proof can be found in [?, Sec. 5.3.2].
Example 5.4.16. For a channel with colored noise, the input distribution that
maximizes the achievable information rate can be found by solving the convex
optimization problem, known as water-filling, given by
\[
\begin{array}{ll}
  \text{minimize}   & -\sum_{i=1}^{n} \log(x_i + \alpha_i) \\
  \text{subject to} & \sum_{i=1}^{n} x_i = P \\
                    & x \succeq 0.
\end{array}
\]
Choosing xi = P/n for i = 1, . . . , n gives a point that satisfies Slater’s condition, so
strong duality holds for this problem.
Example 5.4.17. For the water-filling problem, the Lagrangian can be written as
\[
  L(x, \lambda, \nu) = -\sum_{i=1}^{n} \log(x_i + \alpha_i) - \sum_{i=1}^{n} \lambda_i x_i
    + \nu \Big( -P + \sum_{i=1}^{n} x_i \Big)
\]
and the Lagrangian dual is given by g(λ, ν) = inf_{x∈R^n} L(x, λ, ν).
If λi < 0, then the Lagrangian tends to −∞ as xi → −∞. Thus, the system
is implicitly constrained to have λi ≥ 0. The first-order optimality conditions, for
i = 1, 2, . . . , n, are given by
\[
  -\frac{1}{x_i + \alpha_i} - \lambda_i + \nu = 0.
\]
Solving this for xi shows that xi is increasing in λi (for λi ≥ 0) and this implies
that g(λ, ν) is decreasing in λi (for λi ≥ 0 and xi ≥ 0).
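Thus, the expression max_{λ≥0} g(λ, ν) is given by choosing the smallest
non-negative λi’s for which xi ≥ 0. This implies that
\[
  (x_i, \lambda_i) =
  \begin{cases}
    \big( \frac{1}{\nu} - \alpha_i, \; 0 \big) & \text{if } \nu < \frac{1}{\alpha_i} \\
    \big( 0, \; \nu - \frac{1}{\alpha_i} \big) & \text{if } \nu \ge \frac{1}{\alpha_i}.
  \end{cases}
\]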
By strong duality, the optimal value of the dual problem equals the optimal value
of the original problem. Finally, the problem can be easily solved for a range of P
values by sweeping through a range of ν values and computing P in terms of ν.
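The closed-form pair (xi, λi) above translates directly into code. The sketch below computes the allocation for a given ν and, instead of sweeping ν as the text suggests, inverts the map ν ↦ P by bisection, which works because the total power is non-increasing in ν. The noise levels αi = (1, 2, 4), the power P = 3, the bisection bracket, and the function names are all assumptions for illustration, and the code requires every αi to be positive.

```python
def allocation(nu, alphas):
    """Water-filling powers x_i = max(1/nu - alpha_i, 0) for a given multiplier nu."""
    return [max(1.0 / nu - a, 0.0) for a in alphas]


def water_fill(alphas, P, iters=200):
    """Find nu such that the allocated powers sum to P, using bisection on nu."""
    lo = 1.0 / (max(alphas) + P)     # total power is at least P at this nu
    hi = 1.0 / min(alphas)           # total power is zero at this nu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(allocation(mid, alphas)) > P:
            lo = mid                 # too much power: raise nu (lower the water level)
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    x = allocation(nu, alphas)
    lam = [max(nu - 1.0 / a, 0.0) for a in alphas]   # multipliers for x >= 0
    return x, lam, nu


alphas = [1.0, 2.0, 4.0]             # hypothetical noise levels
P = 3.0                              # hypothetical total power
x, lam, nu = water_fill(alphas, P)
print(x, lam, nu)
```

In this instance the water level 1/ν settles at 3, so the two quietest channels receive power 2 and 1 while the noisiest channel gets none and has a positive multiplier, consistent with complementary slackness.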