
Chapter 5

Optimization

The foundation of engineering is the ability to use math and physics to design and
optimize complex systems. The advent of computers has made this possible on an
unprecedented scale. This chapter provides a brief introduction to mathematical
optimization theory.

5.1 Derivatives in Banach Spaces


In this chapter, we assume that readers are familiar with derivatives as defined in
undergraduate multivariable calculus. To gain insight, we first recall the standard
interpretation of the derivative as a local linear approximation of a function. For a
function f : Rn → Rm , this interpretation gives

f (x + h) = f (x) + J(x) h + higher order terms,

where J(x) ∈ Rm×n is the Jacobian matrix of f at x.


Instead of interpreting a multivariate derivative as a matrix, we will view the
derivative f ′ (x) as a linear transform T from the domain to codomain. This trans-
form maps the input perturbation h to a local approximation of the output pertur-
bation. Since both are finite dimensional in our example, the linear transform T is
represented by the Jacobian matrix and we have

f ′ (x)(h) = T h = J(x) h.
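To make the linear-transform view concrete, the following numerical sketch (assuming NumPy is available; the function, point, and perturbation are illustrative choices, not from the text) compares a finite-difference estimate of the output perturbation against the Jacobian-vector product J(x)h.

```python
import numpy as np

def f(x):
    # Illustrative map f : R^2 -> R^2.
    return np.array([np.sin(x[0]) * x[1], np.exp(x[0]) + x[1] ** 2])

def jacobian(x):
    # Analytic Jacobian J(x) of f at x.
    return np.array([[np.cos(x[0]) * x[1], np.sin(x[0])],
                     [np.exp(x[0]), 2.0 * x[1]]])

x = np.array([0.5, 1.2])
h = np.array([0.3, -0.2])
eps = 1e-6

# Finite-difference estimate of the output perturbation per unit step.
fd = (f(x + eps * h) - f(x)) / eps
jvp = jacobian(x) @ h          # the linear transform applied to h
err = float(np.linalg.norm(fd - jvp))
print(err)
```

The discrepancy is of order eps, reflecting the higher-order terms in the expansion.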

Mathematically, such definitions require the structure of a Banach space because


one needs the linear structure to compute differences, and the norm topology to
define limits. Completeness guarantees that limits exist under mild conditions.

Definition 5.1.1. Let f : X → Y be a mapping from a vector space X over R to a


Banach space (Y, ∥ · ∥). Then, if it exists, the Gâteaux differential (or directional
derivative) of f at x in direction h is given by

δf(x; h) ≜ lim_{t→0} [f(x + th) − f(x)] / t,
where the limit is with respect to the implied mapping from t ∈ R to Y .

When this directional derivative exists, we can write the approximation

f (x + th) ≈ f (x) + tδf (x; h).

In fact, we can get a tighter characterization that is especially meaningful in the


context of optimization.

Lemma 5.1.2. Let Y = (R, | · |) and suppose that δf (x; h) exists and is negative
for some f , x, and h. Then, there exists t0 > 0 such that f (x + th) < f (x) for all
t ∈ (0, t0 ).

Proof. The δf (x; h) limit implies that, for any ϵ > 0, there is a t0 > 0 such that

f (x + th) − f (x) ≤ (δf (x; h) + ϵ) t

for all t ∈ (0, t0). If δf(x; h) < 0, then one can choose ϵ = −(1/2)δf(x; h) to see that
the RHS is negative for all t ∈ (0, t0 ). The stated result follows.

Example 5.1.3. For the standard Banach space X = Y = R2, let f(x) = (x1 x2, x1 + x2^2). Then, for x = (1, 1) and h = (1, 2), we have

δf(x; h) = d/dt ((1 + t)(1 + 2t), (1 + t) + (1 + 2t)^2) |_{t=0} = (3, 5).
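A quick numerical check of this example (a sketch assuming NumPy is available) approximates the Gâteaux differential by a difference quotient with a small t:

```python
import numpy as np

def f(x):
    # f(x) = (x1*x2, x1 + x2^2) from Example 5.1.3.
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

x = np.array([1.0, 1.0])
h = np.array([1.0, 2.0])
t = 1e-7

# Difference quotient approximating delta f(x; h).
delta = (f(x + t * h) - f(x)) / t
print(delta)  # approximately (3, 5)
```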

Problem 5.1.4. Suppose X = Y = L1([0, 1]) is the Banach space of Lebesgue absolutely integrable functions mapping [0, 1] to R and f(x) = ∥x∥ = ∫₀¹ |x(s)| ds is the norm of x. Assuming the set {s ∈ [0, 1] | x(s) = 0} has measure 0, show that

δf(x; h) ≜ lim_{t→0} ∫₀¹ (1/t)(|x(s) + th(s)| − |x(s)|) ds = ∫₀¹ sgn(x(s)) h(s) ds.
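The claimed formula can also be checked numerically by discretizing [0, 1] (a sketch assuming NumPy is available; the particular x(s) and h(s) are illustrative choices whose zero set has measure zero):

```python
import numpy as np

# Uniform grid on [0, 1] and Riemann-sum quadrature.
s = np.linspace(0.0, 1.0, 20001)
ds = s[1] - s[0]
x = np.sin(3.0 * s) - 0.4      # changes sign; zero set is finite
h = np.cos(2.0 * s)

def norm1(u):
    # Discrete approximation of the L1 norm on [0, 1].
    return float(np.sum(np.abs(u)) * ds)

t = 1e-6
fd = (norm1(x + t * h) - norm1(x)) / t        # difference quotient
pred = float(np.sum(np.sign(x) * h) * ds)     # claimed formula
err = abs(fd - pred)
print(err)
```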

Definition 5.1.5. Let f : X → Y be a mapping from a vector space X over R to


a Banach space (Y, ∥ · ∥). Then, f is Gâteaux differentiable at x if the Gâteaux
differential δf (x; h) exists for all h ∈ X and is a linear function of h. If, in addition,
X is a Banach space, then δf (x; h) must be a continuous linear function of h.

Remark 5.1.6. For simplicity, our treatment of Gâteaux derivatives assumes X is


a vector space over R but similar results are possible over C as well.

Definition 5.1.7. Let f : X → Y be a mapping from a Banach space (X, ∥ · ∥X ) to


a Banach space (Y, ∥ · ∥Y ). Then, f is Fréchet differentiable at x if there is a linear
transformation T : X → Y with ∥T ∥ < ∞ that satisfies

lim_{h→0} ∥f(x + h) − f(x) − T(h)∥_Y / ∥h∥_X = 0,    (5.1)
where the limit is with respect to the implied Banach space mapping X → R. In
this case, the Fréchet derivative at x equals T and is denoted by f ′ (x) in general.

Example 5.1.8. A function f : Rn → Rm with f = (f1 , f2 , . . . , fm )T is (Fréchet)


differentiable at x0 if the mapping J from Rn to the Jacobian matrix,
        ⎡ ∂f1/∂x1(x)  ∂f1/∂x2(x)  · · ·  ∂f1/∂xn(x) ⎤
        ⎢ ∂f2/∂x1(x)  ∂f2/∂x2(x)  · · ·  ∂f2/∂xn(x) ⎥
J(x) ≜  ⎢     ⋮            ⋮        ⋱        ⋮      ⎥ ,
        ⎣ ∂fm/∂x1(x)  ∂fm/∂x2(x)  · · ·  ∂fm/∂xn(x) ⎦
exists and is continuous in x at x = x0 . A necessary and sufficient condition for


this is that each partial derivative is continuous in x at x = x0 .
If m = 1, then the Jacobian is closely related to the gradient of the function

∇f(x) ≜ f′(x)^H = [ ∂f/∂x1(x)  ∂f/∂x2(x)  · · ·  ∂f/∂xn(x) ]^H.

It is worth noting that the orientation of the gradient vector (i.e., row versus column
vector) is sometimes defined differently. This is because derivatives can be under-
stood as linear transforms and either orientation can be used to define the correct
linear transform.
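For a scalar-valued example, the gradient can be checked against one-sided partial-derivative estimates (a sketch assuming NumPy is available; the functional and evaluation point are illustrative choices):

```python
import numpy as np

def f(x):
    # Illustrative real functional on R^2.
    return float(x[0] ** 2 + 3.0 * x[0] * x[1])

def grad(x):
    # Analytic gradient (the transposed 1 x n Jacobian).
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

x = np.array([1.5, -0.5])
eps = 1e-6

# One-sided estimates of each partial derivative.
fd = np.array([(f(x + eps * e) - f(x)) / eps for e in np.eye(2)])
err = float(np.linalg.norm(fd - grad(x)))
print(err)
```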

Example 5.1.9. Let X be a Hilbert space over R and f : X → R be a real func-


tional. If the Fréchet derivative f ′ (x) exists, then it is a continuous linear functional

on X. Thus, the Riesz representation theorem guarantees that there is a vector


u ∈ X such that f ′ (x)(h) = ⟨h, u⟩ for all h ∈ X. This vector is called the gradient
∇f (x) and it follows that

f ′ (x)(h) = ⟨h, ∇f (x)⟩ for all h ∈ X.

Problem 5.1.10. In the setting of the previous example, show that, if ∇f(x) ≠ 0, then f(x − δ∇f(x)) < f(x) for some δ > 0.

Theorem 5.1.11. Let f : X → Y be a mapping from a Banach space (X, ∥ · ∥X )


to a Banach space (Y, ∥ · ∥Y ). If f is Fréchet differentiable at x with derivative f ′ ,
then f is Gâteaux differentiable at x with Gâteaux differential δf (x; h) = f ′ (x)(h).

Proof. For h = 0, the statement is trivial. For h ̸= 0, we first observe that th → 0


as t → 0. Letting T = f ′ (x), we can combine this with (5.1) to see that
0 = lim_{t→0} ∥f(x + th) − f(x) − T(th)∥_Y / ∥th∥_X
  = lim_{t→0} ∥ (f(x + th) − f(x)) / (t∥h∥_X) − tT(h) / (t∥h∥_X) ∥_Y
  = (1/∥h∥_X) lim_{t→0} ∥ (f(x + th) − f(x)) / t − T(h) ∥_Y.

Thus, the Gâteaux differential exists and satisfies δf (x; h) = T (h) = f ′ (x)(h).

Theorem 5.1.12. Let X, Y, Z be Banach spaces and let f : X → Y and g : Y → Z


be functions. If f is Fréchet differentiable at x and g is Fréchet differentiable at
y = f (x), then (g ◦ f )(x) = g(f (x)) is Fréchet differentiable at x with derivative
g ′ (f (x)) ◦ f ′ (x).

Proof. For the stated derivatives, the errors in the implied linear approximations are

ϕ(v) = f (x + v) − f (x) − f ′ (x)(v)


ψ(u) = g(y + u) − g(y) − g ′ (y)(u)

ρ(h) = g(f (x + h)) − g(f (x)) − g ′ (y) ◦ f ′ (x) (h).

From the assumptions of differentiability, we know that the first two approximations
become tight for small perturbations. In other words,
lim_{v→0} ∥ϕ(v)∥_Y / ∥v∥_X = 0,    lim_{u→0} ∥ψ(u)∥_Z / ∥u∥_Y = 0.

Next, we observe that the definition of ϕ implies


g(f(x + h)) − g(f(x)) = g(f(x) + f′(x)(h) + ϕ(h)) − g(y).

Combining this with the definition of ρ shows that

ρ(h) = g(f(x) + f′(x)(h) + ϕ(h)) − g(y) − (g′(y) ∘ f′(x))(h)
     = ψ(f′(x)(h) + ϕ(h)) + g′(y)(f′(x)(h) + ϕ(h)) − (g′(y) ∘ f′(x))(h)
     = ψ(f′(x)(h) + ϕ(h)) + g′(y)(ϕ(h)).

We take this opportunity to note that ∥g′(f(x)) ∘ f′(x)∥ ≤ ∥g′(f(x))∥∥f′(x)∥ < ∞ because ∥f′(x)∥ < ∞ and ∥g′(f(x))∥ < ∞. Since lim_{h→0} ∥ϕ(h)∥_Y / ∥h∥_X = 0, there is a t > 0 such that ∥ϕ(h)∥_Y ≤ ∥f′(x)∥∥h∥_X if ∥h∥_X < t. Under the same condition, it follows that 2∥f′(x)∥∥h∥_X ≥ ∥f′(x)∥∥h∥_X + ∥ϕ(h)∥_Y. Using this, we can write

∥ρ(h)∥_Z / ∥h∥_X = ∥ψ(f′(x)(h) + ϕ(h)) + g′(y)(ϕ(h))∥_Z / ∥h∥_X
  ≤ 2∥f′(x)∥ · ∥ψ(f′(x)(h) + ϕ(h))∥_Z / (2∥f′(x)∥∥h∥_X) + ∥g′(y)(ϕ(h))∥_Z / ∥h∥_X
  ≤ 2∥f′(x)∥ · ∥ψ(f′(x)(h) + ϕ(h))∥_Z / (∥f′(x)∥∥h∥_X + ∥ϕ(h)∥_Y) + ∥g′(y)∥∥ϕ(h)∥_Y / ∥h∥_X
  ≤ 2∥f′(x)∥ · ∥ψ(f′(x)(h) + ϕ(h))∥_Z / ∥f′(x)(h) + ϕ(h)∥_Y + ∥g′(y)∥∥ϕ(h)∥_Y / ∥h∥_X.

Since (f ′ (x)(h) + ϕ(h)) → 0 as h → 0, it follows that the limit of the RHS, as


h → 0, also exists and equals 0. Thus, limh→0 ∥ρ(h)∥Z /∥h∥X = 0 and the Fréchet
derivative of g(f (x)) exists and satisfies the chain rule.

Theorem 5.1.13. Let X, Y be Banach spaces and f : X → Y be a function. For


x1, x2 ∈ X, let h = x2 − x1 and assume the Gâteaux differential δf((1 − s)x1 + sx2; h) exists for all s ∈ [0, 1]. Then, ∥f(x2) − f(x1)∥ ≤ M∥x2 − x1∥, where

M = sup_{s∈[0,1]} ∥δf((1 − s)x1 + sx2; h)∥ / ∥x2 − x1∥.

Proof. For w1 = (1/2)(x1 + x2), observe that

∥f(x2) − f(x1)∥ / ∥x2 − x1∥ = ∥f(x2) − f(w1) + f(w1) − f(x1)∥ / ∥x2 − x1∥
  ≤ (∥f(x2) − f(w1)∥ + ∥f(w1) − f(x1)∥) / ∥x2 − x1∥
  = ∥f(x2) − f(w1)∥ / (2∥x2 − w1∥) + ∥f(w1) − f(x1)∥ / (2∥w1 − x1∥).

Suppose that ∥f (x2 ) − f (x1 )∥ > M ∥x2 − x1 ∥. Then, there is an ϵ > 0 such that
one or both of the following conditions must hold:

∥f(x2) − f(w1)∥ / ∥x2 − w1∥ ≥ M + ϵ    or    ∥f(w1) − f(x1)∥ / ∥w1 − x1∥ ≥ M + ϵ.

Repeating indefinitely and choosing a satisfying subinterval at each step, one gets a
sequence wn of midpoints that converges to x = (1−s)x1 +sx2 for some s ∈ [0, 1].
Since the Gâteaux differential δf (x; h) exists by assumption, it follows that

M + ϵ ≤ ∥f(wn) − f(x)∥ / ∥wn − x∥ = ∥f(x ± 2^(−n) h) − f(x)∥ / (2^(−n) ∥x2 − x1∥) → ∥δf(x; h)∥ / ∥x2 − x1∥.

This contradicts the definition of M and, thus, ∥f (x2 )−f (x1 )∥ ≤ M ∥x2 −x1 ∥.

Lemma 5.1.14. Let X, Y be Banach spaces and f : X → Y be a function. If the


Fréchet derivative f ′ (x) exists and satisfies ∥f ′ (x)∥ ≤ L for all x in a convex set
A ⊆ X, then f is Lipschitz continuous on A with Lipschitz constant L.

Proof. Assume ∥f′(x)∥ ≤ L for all x in a convex set A ⊆ X. Then, for any x1, x2 ∈ A, let h = x2 − x1 and notice that Theorem 5.1.11 implies that

∥δf(x1 + sh; h)∥ = ∥f′(x1 + sh)(h)∥ ≤ ∥f′(x1 + sh)∥∥h∥,

for all s ∈ [0, 1]. Applying Theorem 5.1.13, we see that ∥f(x2) − f(x1)∥ ≤ M∥x2 − x1∥ with M ≤ sup_{s∈[0,1]} ∥f′(x1 + sh)∥ ≤ L. This completes the proof.

Lemma 5.1.15. Let f : X → R map the Hilbert space X to the real numbers. If
∇f (x) exists and satisfies ∥∇f (y) − ∇f (x)∥ ≤ L∥y − x∥, then

f(y) − f(x) − ⟨y − x, ∇f(x)⟩ ≤ (1/2) L ∥y − x∥^2.

Proof. Let h = y − x and ϕ(t) = f(x + th). Then, ϕ′(t) = ⟨h, ∇f(x + th)⟩ and

f(y) − f(x) − ⟨h, ∇f(x)⟩ = ∫₀¹ (ϕ′(t) − ϕ′(0)) dt
  = ∫₀¹ ⟨h, ∇f(x + th) − ∇f(x)⟩ dt
  ≤ ∫₀¹ ∥h∥ ∥∇f(x + th) − ∇f(x)∥ dt
  ≤ ∫₀¹ ∥h∥ L∥th∥ dt
  = (1/2) L∥h∥^2.
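For a quadratic functional f(x) = (1/2)x^T Q x on R^n, the gradient Qx is Lipschitz with constant L = ∥Q∥ (the spectral norm), and the bound of Lemma 5.1.15 can be spot-checked numerically (a sketch assuming NumPy is available; the matrix and points are randomly generated illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A.T @ A                        # symmetric PSD, so f is convex
L = float(np.linalg.norm(Q, 2))    # Lipschitz constant of the gradient Qx

def f(x):
    # Quadratic functional f(x) = 0.5 * x^T Q x with gradient Qx.
    return float(0.5 * x @ Q @ x)

ok = True
for _ in range(100):
    x = rng.standard_normal(4)
    y = rng.standard_normal(4)
    lhs = f(y) - f(x) - float((y - x) @ (Q @ x))
    rhs = 0.5 * L * float(np.linalg.norm(y - x)) ** 2
    ok = ok and (lhs <= rhs + 1e-9)
print(ok)
```

For this f, the left-hand side equals (1/2)(y − x)^T Q (y − x), so the inequality holds with L = ∥Q∥ exactly.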

5.2 Unconstrained Optimization


Functions mapping elements of a vector space (over F ) down to the scalar field F
play a very special role in the analysis of vector spaces.

Definition 5.2.1. Let V be a vector space over F . Then, a functional on V is a


function f : V → F that maps V to F .

Linear functionals (i.e., functionals that are linear) are used to define many im-
portant concepts in abstract vector spaces. For unconstrained optimization, how-
ever, linear functionals are not interesting because they are either zero or they
achieve all values in F .

Definition 5.2.2. Let (X, ∥ · ∥) be a normed vector space. Then, a real functional
f : X → R achieves a local minimum value at x0 ∈ X if there is an ϵ > 0 such
that, for all x ∈ X satisfying ∥x − x0 ∥ < ϵ, we have f (x) ≥ f (x0 ). If the bound
holds for all x ∈ X, then the local minimum is also a global minimum value.

Theorem 5.2.3. Let (X, ∥ · ∥) be a normed vector space and f : X → R be a real


functional. If δf(x0; h) exists and is negative for some h ∈ X, then f does not achieve a local minimum value at x0.

Proof. First, we apply Lemma 5.1.2 with x = x0 and an h for which δf(x0; h) < 0. This gives a t0 > 0 such that f(x0 + th) < f(x0) for all t ∈ (0, t0). Thus, there can be no ϵ > 0 satisfying the definition of a local minimum value in Definition 5.2.2.
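The descent guarantee behind this argument can be illustrated numerically: moving along a direction with negative Gâteaux differential decreases the functional for all small enough step sizes. A sketch assuming NumPy is available, with an illustrative smooth functional:

```python
import numpy as np

def f(x):
    # Illustrative smooth functional on R^2 with minimum at (1, -2).
    return float((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)

x0 = np.array([0.0, 0.0])
g = np.array([2.0 * (x0[0] - 1.0), 2.0 * (x0[1] + 2.0)])  # gradient at x0
h = -g          # direction with delta f(x0; h) = -||g||^2 < 0

# Lemma 5.1.2 predicts f(x0 + t*h) < f(x0) for all small enough t > 0.
descends = all(f(x0 + t * h) < f(x0) for t in (1e-1, 1e-2, 1e-3))
print(descends)  # True
```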

5.3 Convex Functionals


Convexity is a particularly nice property of spaces and functionals that leads to
well-defined minimum values.

Definition 5.3.1. Let V be a vector space, A ⊆ V be a convex set, and f : V → R


be a functional. Then, a functional f is called convex on A if, for all a1 , a2 ∈ A
and λ ∈ (0, 1), we have

f (λa1 + (1 − λ)a2 ) ≤ λf (a1 ) + (1 − λ)f (a2 ).

The functional is strictly convex if equality occurs only when a1 = a2. A functional f is called (strictly) concave if −f is (strictly) convex.

Definition 5.3.2. A Banach space X is called strictly convex if the unit ball, given
by {x ∈ X| ∥x∥ ≤ 1}, is a strictly convex set. An equivalent condition is that
equality in the triangle inequality (i.e., ∥x + y∥ = ∥x∥ + ∥y∥) for non-zero vectors
implies that x = sy for some scalar s > 0.

Example 5.3.3. Let (X, ∥· ∥) be a normed vector space. Then, the norm ∥ ·∥ : X →
R is a convex functional on X. Proving this is a good introductory exercise.

Example 5.3.4. Let X be an inner-product space. For x, y ∈ X and λ ∈ (0, 1),

∥λx + (1 − λ)y∥^2 = λ^2∥x∥^2 + 2λ(1 − λ)Re⟨x, y⟩ + (1 − λ)^2∥y∥^2
  = λ∥x∥^2 + (1 − λ)∥y∥^2 − λ(1 − λ)(∥x∥^2 + ∥y∥^2 − 2Re⟨x, y⟩)
  = λ∥x∥^2 + (1 − λ)∥y∥^2 − λ(1 − λ)∥x − y∥^2
  ≤ λ∥x∥^2 + (1 − λ)∥y∥^2,

with equality iff x = y. Thus, the square of the induced norm ∥ · ∥^2 is a strictly convex functional on X.

Theorem 5.3.5. Let (X, ∥ · ∥) be a normed vector space, A ⊆ X be a convex set,


and f : X → R be a convex functional on A. Then, any local minimum value of
f on A is a global minimum value on A. If the functional is strictly convex on A
and achieves a local minimum value on A, then there is a unique point x0 ∈ A that
achieves the global minimum value on A.

Proof. Let x0 ∈ A be a point where the functional achieves a local minimum value. Arguing by contradiction, suppose that there is another point x1 ∈ A such that f(x1) < f(x0). From the definition of a local minimum value, there is an ϵ > 0 such that f(x) ≥ f(x0) for all x ∈ A satisfying ∥x − x0∥ < ϵ. Choosing λ < ϵ/∥x0 − x1∥ in (0, 1) and x = (1 − λ)x0 + λx1 implies that ∥x − x0∥ < ϵ while the
convexity of f implies that

f (x) = f ((1 − λ)x0 + λx1 ) ≤ (1 − λ)f (x0 ) + λf (x1 ) < f (x0 ).

This contradicts the definition of a local minimum value and implies that f (x0 ) is
a global minimum value on A. If f is strictly convex, suppose that f(x1) = f(x0) for some x1 ≠ x0. In this case, strict convexity implies that

f ((1 − λ)x0 + λx1 ) < (1 − λ)f (x0 ) + λf (x1 ) = f (x0 ).

This contradicts the fact that f (x0 ) is a global minimum value on A and implies
that x0 = x1 is unique.

Theorem 5.3.6. Let (X, ∥·∥) be a normed vector space and f : X → R be a convex
functional on a convex set A ⊆ X. If f is Gâteaux differentiable at x0 ∈ A, then

f (x) ≥ f (x0 ) + δf (x0 ; x − x0 )

for all x ∈ A. If f is strictly convex then the inequality is strict for x ̸= x0 .

Proof. By the convexity of A and f , we have x0 + λ(x − x0 ) ∈ A and

f (x0 + λ(x − x0 )) ≤ f (x0 ) + λ (f (x) − f (x0 )) (5.2)

for all λ ∈ (0, 1). Also, if f is strictly convex, then (5.2) is strict for x ≠ x0. Thus,

f(x) ≥ f(x0) + (f(x0 + λ(x − x0)) − f(x0)) / λ

and taking the limit as λ ↓ 0 completes the proof for a convex functional.
For the case where f is strictly convex, we first apply the convex result to see

f (x0 + λ(x − x0 )) ≥ f (x0 ) + δf (x0 ; λ(x − x0 )) = f (x0 ) + λδf (x0 ; x − x0 ),

where the second step holds because δf (x; h) is linear in h. This gives
δf(x0; x − x0) ≤ (f(x0 + λ(x − x0)) − f(x0)) / λ < f(x) − f(x0),
where the second inequality holds because (5.2) is a strict inequality for x ̸= x0 .
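The inequality of Theorem 5.3.6 can be spot-checked for the strictly convex functional ∥x∥^2 of Example 5.3.4, whose Gâteaux differential is δf(x0; h) = 2⟨x0, h⟩ (a sketch assuming NumPy is available; the points are random illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Strictly convex functional ||x||^2 with delta f(x0; h) = 2<x0, h>.
    return float(x @ x)

x0 = rng.standard_normal(3)
ok = True
for _ in range(100):
    x = rng.standard_normal(3)
    dfi = float(2.0 * x0 @ (x - x0))       # Gateaux differential at x0
    ok = ok and (f(x) >= f(x0) + dfi - 1e-12)
print(ok)
```

Here the gap f(x) − f(x0) − δf(x0; x − x0) equals ∥x − x0∥^2, which is strictly positive for x ≠ x0.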

Corollary 5.3.7. Let (X, ∥ · ∥) be a normed vector space and f : X → R be a


convex functional on a convex set A ⊆ X. If f is Gâteaux differentiable at x0 ∈ A
and δf (x0 ; x − x0 ) = 0 for all x ∈ A, then

f(x0) = min_{x∈A} f(x).

If f is strictly convex, x0 is the unique minimizer over A.

5.4 Constrained Optimization


Lagrangian optimization is an indispensable tool in engineering and physics that al-
lows one to solve constrained non-linear optimization problems. For convex prob-
lems, there are now efficient algorithms that can handle thousands of variables and
constraints. In some cases, there are also analytical techniques that allow one to
derive tight bounds on optimum value. These approaches have become so common
that convex Lagrangian optimization problems are now taught as a fundamental part
of the graduate engineering curriculum. For simplicity, we focus on the case where
the domain D is a subset of the finite-dimensional real space Rn .
Constrained non-linear optimization problems over D ⊆ Rn can be put into the
following standard form. Let fi : D → R and hj : D → R be real functionals on
D for i = 0, 1, . . . , m and j = 1, 2, . . . , p. Then, the standard form is

minimize f0 (x)
subject to fi (x) ≤ 0, i = 1, 2, . . . , m
hj (x) = 0, j = 1, 2, . . . , p
x ∈ D.

The function f0 is called the objective function while the functions f1 , . . . , fm


are called inequality constraints and the functions h1 , . . . , hp are called equality
constraints.

Definition 5.4.1. A vector x ∈ D is feasible if it satisfies the constraints. Let


F = {x ∈ D | fi (x) ≤ 0, i = 1, 2, . . . , m , hj (x) = 0, j = 1, . . . , p} be the set of
feasible vectors. Then, the problem is feasible if F ̸= ∅.

Definition 5.4.2. The optimal value is

p∗ = inf_{x∈F} f0(x).

By convention, p∗ is allowed to take infinite values and p∗ = ∞ if the problem is not feasible.

Evaluating the objective function at any feasible point automatically gives an upper bound because

p∗ ≤ f0(x) ∀x ∈ F.
The optimization of a linear function with arbitrary affine equality and inequal-
ity constraints is called a linear program. Linear programs (LPs) have many equiv-
alent forms and any linear program can be transformed into any standard form.

Definition 5.4.3. Two standard minimization forms of an LP are given by:

minimize   cT x                    minimize   cT x
subject to Ax = b,                 subject to Ax ⪰ b,
           x ⪰ 0;                             x ⪰ 0.

5.4.1 The Lagrangian


The Lagrangian is used to transform constrained optimization problems into uncon-
strained optimization problems. One can think of it as introducing a cost λi ≥ 0
associated with violating the i-th inequality constraint and a variable νj used to
enforce the j-th equality constraint.

Definition 5.4.4. The Lagrangian L : D × Rm × Rp → R associated with opti-


mization problem is
L(x, λ, ν) = f0(x) + Σ_{i=1}^m λi fi(x) + Σ_{j=1}^p νj hj(x),

where λi is the Lagrange multiplier associated with the i-th inequality constraint
and νj is the Lagrange multiplier associated with the j-th equality constraint.

Definition 5.4.5. A point x∗ is called locally optimal if there is an ϵ0 > 0 such that,
for all ϵ < ϵ0 , it holds that f0 (x) ≥ f0 (x∗ ) for all x ∈ F satisfying ∥x − x∗ ∥ < ϵ.
The i-th inequality constraint is active at x∗ if fi (x∗ ) = 0. Otherwise, it is inactive.
Let A = {i ∈ {1, . . . , m} | fi (x∗ ) = 0} be the set of active constraints at x∗ .

Definition 5.4.6 (Mangasarian-Fromovitz). A standard constrained optimization


problem satisfies the MF constraint qualification at x∗ if the functions fi and hj
are all continuously differentiable at x∗ and there exists a vector w ∈ Rn sat-
isfying ∇fi (x∗ )T w < 0 for i ∈ A and ∇hj (x∗ )T w = 0 for j = 1, . . . , p. If
the constraints hj are not all affine, then one additionally needs that the vectors
∇h1 (x∗ ), . . . , ∇hp (x∗ ) ∈ Rn form a linearly independent set.

Theorem 5.4.7 (Karush-Kuhn-Tucker). If x∗ is a constrained local optimum that


satisfies the MF constraint qualification, then there exist λ∗ ≥ 0 and ν ∗ such that
∇f0(x∗) + Σ_{i∈A} λ∗i ∇fi(x∗) + Σ_{j=1}^p ν∗j ∇hj(x∗) = 0.    (5.3)

This theorem provides a necessary condition for a point x∗ to be locally optimal


for a constrained optimization problem. Before considering its proof, it is useful to
discuss the geometric picture underlying its contrapositive statement: if (5.3) cannot be satisfied by any λ∗ ≥ 0 and ν∗ ∈ Rp, then x∗ is not locally optimal.
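On a one-dimensional toy problem the stationarity condition is easy to verify by hand: minimize f0(x) = (x − 2)^2 subject to f1(x) = x − 1 ≤ 0. The constrained optimum is x∗ = 1 with the constraint active, and the sketch below (plain Python arithmetic; the problem is an illustrative choice, not from the text) recovers the multiplier:

```python
# Toy problem: minimize (x - 2)^2 subject to x - 1 <= 0; optimum at x* = 1.
x_star = 1.0
grad_f0 = 2.0 * (x_star - 2.0)   # gradient of the objective at x*: -2
grad_f1 = 1.0                    # gradient of the active constraint

# Stationarity (5.3): grad_f0 + lam * grad_f1 = 0 with lam >= 0.
lam = -grad_f0 / grad_f1
residual = grad_f0 + lam * grad_f1
print(lam, residual)  # 2.0 0.0
```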
Now, consider what happens if we evaluate the function at x(t) = x∗ + ty for
some direction y and a sufficiently small t > 0. For any continuously differentiable
function f , the definition of the derivative implies that

f (x(t)) = f (x∗ ) + t∇f (x∗ )T y + o(t),

where o(t)/t → 0 as t → 0. If the problem is unconstrained (e.g., m = p = 0),


then ∇f0 (x∗ ) must be 0. This is because the negative gradient −∇f0 (x∗ ) gives
the direction of steepest descent for the objective function and one is guaranteed
to reduce the function by choosing y = −∇f0 (x∗ ) (e.g., see Lemma 5.1.2). If
there are constraints, however, then x(t) may be infeasible. For the j-th equality
constraint, the definition of the derivative implies that, for sufficiently small t, x(t)
will be infeasible if |∇hj (x∗ )T y| > 0. Thus, we certainly need ∇hj (x∗ )T y = 0 for
all j.
If the i-th inequality constraint is active (i.e., fi (x∗ ) = 0), then the definition
of the derivative implies that, for sufficiently small t, x(t) will be infeasible if
∇fi (x∗ )T y > 0. Thus, we certainly need ∇fi (x∗ )T y ≤ 0 for all i ∈ A. If the
constraint is inactive (i.e., fi (x∗ ) < 0), then due to continuity it will remain satis-
fied for sufficiently small t.

Figure 5.1: A contour plot of the function f0(x1, x2) = (x1 − 1)^2 + (x2 − 1)^2 − x1 x2 / 2 whose minimum occurs at (4/3, 4/3) (i.e., the center of the blue ellipse). The red line indicates the inequality constraint f1(x1, x2) = 1.85 + (x1 − 2.25)^2 / 2 − x2 ≤ 0. The picture shows that the constrained minimum occurs where the objective contour line is tangent to the active constraint line.

The geometric picture implied by Theorem 5.4.7 is that of a game where one
would like to decrease the objective f0 (x∗ ) by choosing y such that ∇f0 (x∗ )T y < 0
but there are constraints on the set of allowable y’s. Let H = span({∇hj (x∗ )}) be
the subspace of directions that violate the equality constraints at x∗. Similarly, the cone of directions that violate the active inequality constraints is given by

F = { Σ_{i∈A} λi ∇fi(x∗) | λi ≥ 0, i ∈ A }.

Thus, one can only pick directions y that are orthogonal to all vectors in H and also
have a non-positive inner product with all vectors in F .
Let the matrix P define the orthogonal projection of Rn onto H ⊥ . Using this,

we can translate the equation (5.3) into the statement

−P ∇f0 (x∗ ) ∈ P F

or “the projection, onto H ⊥ , of the descent direction lies in the projection, onto H ⊥ ,
of the cone of directions that violate the inequality constraints”. The reason for this
is that we can absorb the ∇hj terms into the ∇fi terms by defining
f^(i) = ∇fi(x∗) + Σ_{j=1}^p νj,i ∇hj(x∗) = P ∇fi(x∗)

so that f^(i) ∈ H⊥ for i = 0, 1, . . . , m. Then, the cone PF is defined by

PF = { Σ_{i∈A} λi f^(i) | λi ≥ 0, i ∈ A }.

If −P∇f0(x∗) ∉ PF, then we project −P∇f0(x∗) onto PF to get a non-zero residual y. The resulting vector gives a direction where the objective function decreases linearly in t and the constraint violations are o(t). The challenge in making
this proof precise is that, unless the equality constraints are affine, they may not be
exactly satisfied for t > 0. In standard proofs of this result, this difficulty is over-
come by using the implicit function theorem to construct an x(t) that starts in the
direction of y but is perturbed slightly to remain feasible.

Proof. For simplicity, we prove only the case where hj(x) = aj^T x − bj is affine. First, we define

y(λ, ν) = −∇f0(x∗) − Σ_{i=1}^m λi ∇fi(x∗) − Σ_{j=1}^p νj ∇hj(x∗),

where ∇hj(x∗) = aj.

The vector y(λ, ν) can be seen as the residual of the descent direction for the ob-
jective function after the constraint gradients have been used to cancel some parts.
Next, we let ν ∗ (λ) = arg minν∈Rp ∥y(λ, ν)∥ and apply the best approximation the-
orem (for the standard inner product space) to see that

y(λ, ν ∗ (λ)) = P y(λ, 0),

where the matrix P defines an orthogonal projection onto H ⊥ and H = span({aj }).
This ensures that each hj (x∗ + ty(λ, ν ∗ (λ))) = 0 for all λ ∈ Rm and t ∈ R.

Continuing, we let S = {λ ∈ Rm | λ ≥ 0, λi = 0 for i ∉ A} and compute

y∗ = arg min_{λ∈S} ∥y(λ, ν∗(λ))∥.

This optimization uses the gradients of the active constraints to cancel as much of
the residual descent direction as possible. Thus, y ∗ ̸= 0 implies there is a descent
direction that does not violate the constraints. Looking at the formulas for y(λ, ν)
and y(λ, ν ∗ (λ)), we can also interpret y ∗ as the error vector for the projection of
v = −P ∇f0 (x∗ ) onto the convex set P F , which is the closed convex cone of
perturbations that preserve the equality constraints but locally violate the inequality
constraints. Then, u∗ = v − y ∗ equals the projection itself and we observe that
Theorem 4.6.7 implies (u − u∗ )T (v − u∗ ) ≤ 0 for all u ∈ P F . Since 0 ∈ P F , we
can choose u = 0 to see that (u∗ )T (v − u∗ ) ≥ 0. Using this, we can write

−(P∇f0(x∗))T y∗ = vT(v − u∗)
  = (v − u∗)T(v − u∗) + (u∗)T(v − u∗)
  ≥ ∥y∗∥^2.

If (5.3) cannot be satisfied by some λ ∈ S and ν ∈ Rp , then y ∗ ̸= 0 and ∥y ∗ ∥ > 0.


Thus, a perturbation in the y ∗ direction will decrease the value of the objective
function while essentially preserving feasibility.
But, the y∗ direction is only guaranteed to preserve feasibility to first order (i.e., (P∇fi(x∗))T y∗ ≤ 0 for i = 1, . . . , m). To fix this, one first augments y∗ with a small amount of some vector w satisfying (P∇fi(x∗))T w < 0 for all i = 1, . . . , m. Such a vector is guaranteed by the constraint qualification. Then, we can choose some δ > 0 such that (P∇f0(x∗))T(y∗ + δw) ≤ −(1/2)∥y∗∥^2. With this modification, the definition of the derivative implies that, for sufficiently small t, x(t) = x∗ + t(y∗ + δw) will be a feasible vector satisfying f0(x(t)) < f0(x∗).

5.4.2 Lagrangian Duality


Definition 5.4.8. The Lagrangian dual function is defined to be

g(λ, ν) ≜ inf_{x∈D} L(x, λ, ν).

Lemma 5.4.9. The Lagrangian dual problem

maximize g(λ, ν)
subject to λ ≥ 0

has a unique maximum value d∗ ≤ p∗ . This property is known as weak duality.

Proof. The Lagrangian dual function is concave because it is the pointwise infimum
of affine functions

g(αλ + (1 − α)λ′, αν + (1 − α)ν′) = inf_{x∈D} L(x, αλ + (1 − α)λ′, αν + (1 − α)ν′)
  = inf_{x∈D} [αL(x, λ, ν) + (1 − α)L(x, λ′, ν′)]
  ≥ inf_{x∈D} αL(x, λ, ν) + inf_{x′∈D} (1 − α)L(x′, λ′, ν′)
  = αg(λ, ν) + (1 − α)g(λ′, ν′).

Thus, it follows from Theorem 5.3.5 that g has a unique maximum value d∗ which
can be upper bounded by
g(λ, ν) = inf_{x∈D} L(x, λ, ν) ≤(a) inf_{x∈F} L(x, λ, ν) =(b) inf_{x∈F} [f0(x) + Σ_{i=1}^m λi fi(x)] ≤(c) p∗,

where (a) is implied by F ⊆ D, (b) follows from hj (x) = 0 for x ∈ F, and (c)
holds by combining fi (x) ≤ 0 for x ∈ F and λi ≥ 0.
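Weak duality can be seen concretely on the toy problem of minimizing x^2 subject to 1 − x ≤ 0, where p∗ = 1 at x = 1 and the dual function works out to g(λ) = λ − λ^2/4 (a sketch assuming NumPy is available; the problem is an illustrative choice, not from the text):

```python
import numpy as np

p_star = 1.0   # minimize x^2 subject to 1 - x <= 0 gives x* = 1

def g(lam):
    # Dual function: inf_x [x^2 + lam*(1 - x)], attained at x = lam/2.
    return lam - lam ** 2 / 4.0

lams = np.linspace(0.0, 10.0, 1001)
weak = bool(np.all(g(lams) <= p_star + 1e-12))   # g(lam) <= p* everywhere
d_star = float(np.max(g(lams)))                  # maximized at lam = 2
print(weak, d_star)
```

For this problem the dual maximum equals p∗, so strong duality also holds.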

The Lagrangian dual function can be −∞ for a wide range of (λ, ν). In this
case, it makes sense to eliminate these points by defining the implicit constraint set

C ≜ {(λ, ν) ∈ Rm × Rp |λ ⪰ 0, g(λ, ν) > −∞} .

The points (λ, ν) ∈ C are called dual feasible and it follows that

d∗ = sup_{(λ,ν)∈C} g(λ, ν).

By convention, d∗ = −∞ if the dual problem is not feasible (i.e., C = ∅).



Definition 5.4.10. If d∗ = p∗ , then one says strong duality holds for the problem.

Theorem 5.4.11. Let x∗ be a primal optimal point and (λ∗ , ν ∗ ) be a dual optimal
point. If strong duality holds, x∗ ∈ D◦ , and all fi and hj functions are differentiable
at x∗ , then we get the KKT conditions of complementary slackness, λ∗i fi (x∗ ) = 0
for i = 1, . . . , m, and stationarity (5.3).

Proof. By weak duality, we have

d∗ = g(λ∗ , ν ∗ ) ≤ f0 (x∗ ) = p∗ .

Since x∗ is feasible, combining d∗ = p∗ with the proof of weak duality shows that

inf L(x, λ∗ , ν ∗ ) = L(x∗ , λ∗ , ν ∗ )


x∈D

and λ∗i = 0 if the i-th inequality constraint is inactive (i.e., fi (x∗ ) < 0). Thus,
we also observe that complementary slackness condition λ∗i fi (x∗ ) = 0 holds for
i = 1, . . . , m. Since x∗ ∈ D◦ , it follows that x∗ is a locally optimal point of
L(x, λ∗ , ν ∗ ). Thus, x∗ must be a stationary point of L(x, λ∗ , ν ∗ ) and taking the
x-derivative gives (5.3).

Example 5.4.12. For the first LP in Definition 5.4.3, the Lagrangian is given by

L(x, λ, ν) = cT x + ν T (b − Ax) − λT x,

where the λ term is negative because the constraint is x ⪰ 0. Thus, the Lagrangian
dual function is given by

g(λ, ν) = inf_{x∈D} L(x, λ, ν) = bT ν if c − AT ν − λ = 0, and −∞ otherwise.

Solving the implicit constraint and using the fact that λ ⪰ 0, one gets the dual LP
problem

maximize bT ν
subject to AT ν ⪯ c.

Strong duality for linear programs says that, if the original LP has an optimal so-
lution (i.e., it is neither unbounded nor infeasible), then the dual LP has an optimal
solution of the same value.
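The duality claim can be checked by hand on a tiny instance of the first standard form (a sketch assuming NumPy is available; the data are illustrative choices): with c = (1, 2), A = [1 1], and b = 1, the primal feasible set is the segment between the vertices (1, 0) and (0, 1), and the dual reduces to maximizing ν subject to ν ≤ ci.

```python
import numpy as np

# Primal: minimize c^T x subject to x1 + x2 = 1, x >= 0.
c = np.array([1.0, 2.0])
vertices = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
p_star = min(float(c @ v) for v in vertices)   # LP optimum is at a vertex

# Dual: maximize b^T nu = nu subject to A^T nu <= c, i.e. nu <= min_i c_i.
d_star = float(np.min(c))
print(p_star, d_star)  # equal, as strong duality predicts
```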

5.4.3 Convex Optimization


Definition 5.4.13. An optimization problem in standard form is called convex if all
fi functions are convex, all the hj functions are affine (i.e., hj (x) = aTj x − bj ), and
D = Rn .

Problem 5.4.14. For a convex standard-form optimization problem (i.e., satisfying


Definition 5.4.13), show that the feasible set is a convex set.

Applying Theorem 5.3.5 to this setup shows that a convex standard-form optimization problem has a unique minimum value. Also, if the function f0 is strictly convex, then the minimum value is achieved uniquely. There are a number of stronger
conditions that also imply strong duality for convex optimization problems. Slater’s
condition is stated below as a theorem and its proof can be found in [?, Sec. 5.3.2].

Theorem 5.4.15 (Slater’s Condition). If a convex optimization problem has a point


x0 where fi (x0 ) < 0 for i = 1, . . . , m and hj (x0 ) = 0 for j = 1, . . . , p, then the MF
constraint qualification and strong duality both hold for the problem. In addition,
if all fi functions are differentiable, then the KKT conditions are necessary and
sufficient for optimality.

Example 5.4.16. For a channel with colored noise, the input distribution that max-
imizes the achievable information rate can be found by solving the convex optimiza-
tion problem, known as water-filling, given by
minimize   −Σ_{i=1}^n log(xi + αi)
subject to Σ_{i=1}^n xi = P,
           x ⪰ 0.

Choosing xi = P/n for i = 1, . . . , n gives a point that satisfies Slater's condition, so
strong duality holds for this problem.

Example 5.4.17. For the water-filling problem, the Lagrangian can be written as
L(x, λ, ν) = −Σ_{i=1}^n log(xi + αi) − Σ_{i=1}^n λi xi + ν(−P + Σ_{i=1}^n xi)

and the Lagrangian dual is given by g(λ, ν) = inf_{x∈Rn} L(x, λ, ν).
Recall that the dual problem constrains λ ⪰ 0, so we restrict attention to λi ≥ 0. The first-order optimality conditions, for i = 1, 2, . . . , n, are given by
−1/(xi + αi) − λi + ν = 0.
Solving this for xi shows that xi is increasing in λi (for λi ≥ 0) and this implies
that g(λ, ν) is decreasing in λi (for λi ≥ 0 and xi ≥ 0).
Thus, the expression max_{λ≥0} g(λ, ν) is given by choosing the smallest non-negative λi's for which xi ≥ 0. This implies that

(xi, λi) = (1/ν − αi, 0)      if ν < 1/αi,
(xi, λi) = (0, ν − 1/αi)      if ν ≥ 1/αi.

From this, the value of ν can be determined by solving

Σ_{i=1}^n xi = Σ_{i=1}^n max(0, 1/ν − αi) = P.

By strong duality, the optimal value of the dual problem equals the optimal value
of the original problem. Finally, the problem can be easily solved for a range of P
values by sweeping through a range of ν values and computing P in terms of ν.
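The sweep described above can be implemented directly; the sketch below (assuming NumPy is available) instead bisects on the water level w = 1/ν, using the fact that Σ_i max(0, w − αi) is non-decreasing in w. The helper name and test values are illustrative.

```python
import numpy as np

def water_fill(alpha, P, tol=1e-12):
    # Solve x_i = max(0, w - alpha_i) with the water level w = 1/nu
    # chosen by bisection so that sum_i x_i = P.
    alpha = np.asarray(alpha, dtype=float)
    lo, hi = float(alpha.min()), float(alpha.max()) + P
    while hi - lo > tol:
        w = 0.5 * (lo + hi)
        if np.sum(np.maximum(0.0, w - alpha)) < P:
            lo = w
        else:
            hi = w
    w = 0.5 * (lo + hi)
    return np.maximum(0.0, w - alpha)

alpha = np.array([0.5, 1.0, 2.0])
x = water_fill(alpha, P=1.0)
print(x, float(x.sum()))  # water level 1.25: x = (0.75, 0.25, 0)
```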
