Basic Mathematical Tools: Optimization Theory
1 INTRODUCTION
During the lectures we need some basic topics and concepts from mathematical analysis. This
material is actually not so difficult, if you happen to have seen it before. If this is the first time,
experience has shown that even if it looks simple and obvious, it is necessary to spend some time
digesting it.
Nevertheless, the note should be read in a somewhat relaxed manner. Not all details are included, nor are all
proofs written out in detail. After all, this is not a course in mathematical analysis.
Among the central topics are the Taylor Formula in n dimensions, the general optimization setting,
and above all, basic properties of convex sets and convex functions. A very short review of
matrix norms and Hilbert space has also been included. The big optimization theorem in Hilbert
space is the Projection Theorem. Its significance in modern technology and signal processing can
hardly be over-emphasized, although it is often disguised under other fancy names.
The final result in the note is the Implicit Function Theorem, which ensures the existence of
solutions of implicit equations.
The abbreviation N&W refers to the textbook, J. Nocedal and S. Wright: Numerical Optimization,
Springer. Note that page numbers in the first and second editions are different.
We shall follow this convention and simply write ∇f(x) p for the product of the (row) gradient ∇f(x) with p. There are, however, some
situations where the direction d defined by the gradient is needed, and then d = ∇f'. In the
lectures we use ' for transposing vectors and matrices.
A set V ⊆ R^n is open if all points in V may be surrounded by a ball in R^n belonging to V: For all
x_0 ∈ V, there is an r > 0 such that
    {x ; ‖x − x_0‖ < r} ⊆ V. (3)
(This notation means "the collection of all x-s such that ‖x − x_0‖ < r".)
It is convenient also to say that a set V is open in Ω if there is an open set W ⊆ R^n such
that V = W ∩ Ω (the mathematical term for this is a relatively open set). Let Ω = [0, 1] ⊂ R.
The set [0, 1/2) is not open in R (why?). However, as a subset of Ω, [0, 1/2) ⊆ [0, 1], it is open in
Ω, since [0, 1/2) = (−1/2, 1/2) ∩ [0, 1] (think about this for a while!).
A neighborhood N of a point x is simply an open set containing x.
The supremum of a set S of real numbers,
    sup S, (4)
is the smallest number that is equal to or larger than all members of the set.
It is a very fundamental property of real numbers that the supremum always exists, although it
may be infinite. If there is a member x_0 ∈ S such that
    x_0 = sup S, (5)
we say that x_0 is the maximum of S and write
    x_0 = max S. (6)
Consider, for example, the set S = [0, 1). In this case, there is no maximum element in S. However, sup S = 1, since no number less than
1 fits the definition. Nevertheless, 1 is not a maximum, since it is not a member of the set. This
is the rule:
A supremum always exists, but may be +∞. If a maximum exists, it is equal to the supremum.
For example, sup [0, 1) = 1, while max [0, 1) does not exist; on the other hand, sup [0, 1] = max [0, 1] = 1.
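As a small numerical illustration (not part of the note), one can look at the set S = {1 − 1/n ; n = 1, 2, …}, here truncated to finitely many elements: its supremum is 1, but no member equals 1, so a maximum does not exist.

```python
# Illustration: S = {1 - 1/n} has sup S = 1, but no maximum,
# since 1 is an upper bound that is never attained by a member of S.
S = [1 - 1/n for n in range(1, 10001)]

largest_member = max(S)          # the largest element actually in S
print(largest_member)            # close to, but strictly below, 1
print(all(s < 1 for s in S))     # True: 1 bounds S from above
```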
2.2 Convergence of Sequences
A sequence {x_n} converges to the limit a,
    lim_{n→∞} x_n = a, (12)
if the elements eventually stay arbitrarily close to a. This definition is a bit tricky, but the content is: if you pick an ε > 0, I can always find an n_ε such that |x_n − a| < ε for all n ≥ n_ε.
A sequence is monotonically increasing if
    x_1 ≤ x_2 ≤ x_3 ≤ ⋯ (13)
(it may diverge to +∞). Thus, a monotonically increasing sequence that is bounded above is
always convergent (You should try to prove this by applying the definition of sup and the definition
of a Cauchy sequence!).
Similar results also apply for monotonically decreasing sequences.
A set S in R^n is bounded if
    sup_{x∈S} ‖x‖ < ∞. (15)
Recall that a sequence {x_n} is a Cauchy sequence if ‖x_n − x_m‖ → 0 when n, m → ∞.
It is easy to see, by noting that every component of the vectors is a sequence of real numbers,
that all Cauchy sequences in R^n converge.
A set C in R^n is closed if all Cauchy sequences that can be formed from elements in C converge to
elements in C. This may be a bit difficult to grasp: Can you see why the interval [0, 1] is closed,
while (0, 1) or (0, 1] are not? What about [0, 1)? Thus, a set is closed if it already contains all
the limits of its Cauchy sequences. By adding these limits to an arbitrary set C, we close it, and
write C̄ for the closure of C. For example, the closure of (0, 1) is [0, 1].
Consider a bounded sequence S = {x_n}_{n=1}^∞ in R, and assume for simplicity that all elements lie in [0, 1].
Split the interval [0, 1] into half, say [0, 1/2) and [1/2, 1]. Select one of these intervals containing
infinitely many elements from S, and pick one x_{n_1} ∈ S from the same interval. Repeat the
operation by halving this interval and selecting another element x_{n_2}. Continue the same way. On
step k, the interval I_k will have length 2^{−k}, and all later elements x_{n_k}, x_{n_{k+1}}, x_{n_{k+2}}, … will be
members of I_k. This makes the sub-sequence {x_{n_k}}_{k=1}^∞ ⊆ S into a Cauchy sequence (why?), and
hence it converges. A similar argument works for a sequence in R^n.
A closed set with the property that all bounded sequences have convergent subsequences is called
compact (this is a mathematical term, not really related to the everyday meaning of the word).
By an easy adaptation of the argument above, we have now proved that all bounded and closed
sets in R^n are compact.
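The halving argument can be mimicked numerically. The sketch below is an illustration only: since a computer holds a finite sequence, "infinitely many elements" is replaced by "the more numerous half". Successive selections then lie in nested intervals of length 2^(−k), so they form a Cauchy sequence.

```python
import math

x = [abs(math.sin(n)) for n in range(1, 5001)]   # a bounded sequence in [0, 1]

lo, hi = 0.0, 1.0
candidates = list(range(len(x)))                 # indices still under consideration
picks = []
for k in range(20):
    mid = (lo + hi) / 2
    left = [i for i in candidates if lo <= x[i] < mid]
    right = [i for i in candidates if mid <= x[i] <= hi]
    # keep the half containing most elements (stand-in for "infinitely many")
    if len(left) >= len(right):
        candidates, hi = left, mid
    else:
        candidates, lo = right, mid
    picks.append(x[candidates[0]])               # pick one element from that half

# consecutive picks lie in a common interval of length 2**-19 or shorter
print(abs(picks[-1] - picks[-2]) <= 2 ** -19)    # True
```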
Of course, as long as the set above is bounded, {x_{n_k}}_{k=1}^∞ will be convergent, but the limit may
not belong to the set, unless it is closed.
If you know the Hilbert space l² (see below), consisting of all infinite-dimensional vectors
x = {ξ_1, ξ_2, …} such that ‖x‖² = Σ_{i=1}^∞ |ξ_i|² < ∞, you will probably also know that the unit
ball, B = {x ; ‖x‖ ≤ 1}, is bounded (obvious) and closed (not so obvious). All unit vectors {e_i}_{i=1}^∞
in an orthogonal basis will belong to B. However, ‖e_i − e_j‖² = ‖e_i‖² + ‖e_j‖² = 2 whenever i ≠ j.
We have no convergent subsequences in this case, and B is not compact! This rather surprising
example occurs because l² has infinite dimension.
It is convenient to write that the size of f(x) is of the order of g(x) when x → a in the short
form
    f(x) = O(g(x)), x → a. (19)
Mathematically, this means that there exist two finite numbers, m and M, such that
    m g(x) ≤ |f(x)| ≤ M g(x) (20)
for x close to a, where we assume that a lower bound m g(x), not very much smaller than M g(x), can be found. For example,
    log(1 + x) − x = O(x²)
when x → 0.
The other symbol, o(·), is slightly more precise: We say that f(x) = o(g(x)) when x → a if
    lim_{x→a} f(x)/g(x) = 0. (22)
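Both symbols can be checked numerically for the example above. Assuming the standard expansion log(1 + x) = x − x²/2 + ⋯, the ratio (log(1 + x) − x)/x² stays bounded (it tends to −1/2), while the ratio to x itself tends to 0:

```python
import math

for x in [1e-1, 1e-2, 1e-3, 1e-4]:
    d = math.log1p(x) - x          # log(1 + x) - x, computed accurately
    print(x, d / x**2, d / x)      # ratio to x^2 bounded (~ -1/2); ratio to x -> 0
```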
2.5 The Taylor Formula
You should all be familiar with the Taylor series of a function g of one variable,
    g(t) = g(t_0) + g'(t_0)(t − t_0) + g''(t_0)/2! · (t − t_0)² + g'''(t_0)/3! · (t − t_0)³ + ⋯. (23)
The Taylor Formula is not a series, but a quite useful finite identity. In essence, the Taylor
Formula gives an expression for the error between the function and its Taylor series truncated
after a finite number of terms.
We shall not dwell on the derivation of the formula, which follows by successive partial integrations of the expression
    g(t) = g(0) + ∫₀ᵗ g'(s) ds, (24)
and the Integral Mean Value Theorem,
    ∫₀ᵗ f(s) φ(s) ds = f(θt) ∫₀ᵗ φ(s) ds,   φ ≥ 0, f continuous, θ ∈ (0, 1).
The formulae below state for simplicity the results around t = 0, but any point is equally good.
The simplest and very useful form of the Taylor Formula is also known as the Secant Formula:
If the derivative g' exists for all values between 0 and t, there is a θ ∈ (0, 1) such that
    g(t) = g(0) + g'(θt) t. (25)
This is an identity. However, since we do not know the value of θ, which in general depends on t,
we cannot use it for computing g(t)! Nevertheless, the argument θt is at least somewhere in the
open interval between 0 and t.
If g' is continuous at t = 0, we may write
    g(t) = g(0) + g'(θt) t
         = g(0) + g'(0) t + (g'(θt) − g'(0)) t (26)
         = g(0) + g'(0) t + o(t).
Similarly, if g'' exists between 0 and t,
    g(t) = g(0) + g'(0) t + g''(θt) t²/2! (27)
(Try to prove this using the Integral Mean Value Theorem and assuming that g'' is continuous!
Be sure to use t − s inside the integral of ds).
Hence, if g'' is bounded,
    g(t) = g(0) + g'(0) t + O(t²). (28)
The general form of the Taylor Formula, around 0 and with sufficiently smooth functions, reads
    g(t) = Σ_{j=0}^N g^{(j)}(0)/j! · t^j + R_N(t), (29)
    R_N(t) = ∫₀ᵗ g^{(N+1)}(s)/N! · (t − s)^N ds = g^{(N+1)}(θt)/(N+1)! · t^{N+1},   θ ∈ (0, 1). (30)
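A quick numerical check of (29)–(30) (a sketch, not part of the note), using g(t) = eᵗ, for which g^{(j)}(0) = 1 and (30) gives the remainder bound |R_N(t)| ≤ eᵗ t^{N+1}/(N+1)! for t > 0:

```python
import math

def taylor_poly(t, N):
    """Partial sum of the Taylor series of exp around 0, as in Eqn. (29)."""
    return sum(t**j / math.factorial(j) for j in range(N + 1))

N = 3
for t in [0.5, 0.25, 0.125]:
    R = math.exp(t) - taylor_poly(t, N)                       # remainder R_N(t)
    bound = math.exp(t) * t**(N + 1) / math.factorial(N + 1)  # from Eqn. (30)
    print(t, R, R <= bound)                                   # bound holds
```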
2.6 The n-dimensional Taylor Formula
The n-dimensional Taylor formula will be quite important to us, and the derivation is based on
the one-dimensional formula above.
Let f : R^n → R, and assume that ∇f exists around x = 0. Let us write g(s) = f(sx). Then
    g'(θ) = Σ_{i=1}^n ∂f/∂x_i (θx) · d(s x_i)/ds = Σ_{i=1}^n ∂f/∂x_i (θx) · x_i = ∇f(θx) x, (31)
and we obtain
    f(x) = g(1)
         = g(0) + g'(θ) · 1 (32)
         = f(0) + ∇f(θx) x,   θ ∈ (0, 1),
which is the n-dimensional analogue of the Secant Formula. Note that the point θx is somewhere
on the line segment between 0 and x, and that the same θ applies to all components of x (but
again, θ is an unknown function of x).
As above, if ∇f is continuous at x = 0,
    f(x) = f(0) + ∇f(0) x + o(‖x‖). (33)
Contrary to what is stated in the first edition of N&W (and numerous other non-mathematical
textbooks!), it is not sufficient that all partial derivatives exist at x_0 (Think about this for a
while: The components of ∇f contain only partial derivatives of f along the coordinate axes.
Find a function on R² where ∇f(0) = 0 but which, nevertheless, is not differentiable at x = 0.
E.g., consider the function defined as sin 2φ in polar coordinates).
The next term of the n-dimensional Taylor Formula is derived similarly:
    g''(θ) = d/ds Σ_{i=1}^n ∂f(sx)/∂x_i · x_i |_{s=θ} = Σ_{i,j=1}^n ∂²f(θx)/(∂x_i ∂x_j) · x_j x_i = x' H(θx) x, (37)
where H(y) = {∂²f(y)/(∂x_i ∂x_j)} is the Hessian matrix of f.
Yes, Optimization Theory sometimes uses the unfortunate notation ∇²f(x) for the Hessian, which is not the
familiar Laplacian used in Physics and PDE theory!
From the above, the second order Taylor formula may now be written
    f(x) = f(0) + ∇f(0) x + ½ x' ∇²f(θx) x,   θ ∈ (0, 1). (39)
Higher order terms get increasingly more complicated and are seldom used.
By truncating the n-dimensional Taylor series after the second term, we end up with what is
called a quadratic function, or a quadratic form,
    q(x) = a + b'x + ½ x'Ax. (40)
By considering quadratic functions we may analyze many important algorithms in optimization
theory analytically, and one very important case occurs if A is positive definite. The function q is
then convex (see below) and min q(x) is obtained for the unique vector
    x* = −A⁻¹ b. (41)
We shall, from time to time, use the notation "A > 0" to mean that the matrix A is positive
definite (NB! This does not mean that all a_ij > 0!). Similarly, "A ≥ 0" means that A is positive
semidefinite.
Positive definite matrices lead to what is called matrix (or skew) norms on R^n. The matrix norms
are important in the analysis of the Steepest Descent Method, and above all, in the derivation of
the Conjugate Gradient Method.
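A small numerical sketch of Eqn. (41), with matrix and data chosen here for illustration: for positive definite A, the gradient b + Ax vanishes at x* = −A⁻¹b, and q is smaller there than at nearby points.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])               # symmetric positive definite
b = np.array([1.0, -2.0])
a = 0.5

def q(x):
    return a + b @ x + 0.5 * x @ A @ x   # the quadratic form, Eqn. (40)

x_star = -np.linalg.solve(A, b)          # x* = -A^{-1} b, Eqn. (41)
print(b + A @ x_star)                    # gradient vanishes: ~ [0, 0]
print(q(x_star) < q(x_star + np.array([0.1, -0.2])))  # True: x* is the minimum
```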
Assume that A is a symmetric positive definite n × n matrix with eigenvalues
    0 < λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n, (42)
and a corresponding set of orthogonal and normalized eigenvectors {e_i}_{i=1}^n. Any vector x ∈ R^n
may be expanded into a series of the form
    x = Σ_{i=1}^n α_i e_i, (43)
and hence,
    Ax = Σ_{i=1}^n α_i A e_i = Σ_{i=1}^n α_i λ_i e_i, (44)
and
    x'Ax = Σ_{i=1}^n λ_i α_i². (45)
The A-norm is defined
    ‖x‖_A = (x'Ax)^{1/2}. (46)
Since
    λ_1 ‖x‖² = λ_1 Σ_{i=1}^n α_i² ≤ x'Ax ≤ λ_n Σ_{i=1}^n α_i² = λ_n ‖x‖², (47)
we observe that
    λ_1^{1/2} ‖x‖ ≤ ‖x‖_A ≤ λ_n^{1/2} ‖x‖, (48)
and the norms ‖x‖ = ‖x‖₂ and ‖x‖_A are equivalent (as are any pair of norms in R^n). The
verifications of the norm properties are left for the reader:
    (i) x = 0 ⟺ ‖x‖_A = 0;
    (ii) ‖αx‖_A = |α| ‖x‖_A; (49)
    (iii) ‖x + y‖_A ≤ ‖x‖_A + ‖y‖_A.
In fact, R^n even becomes a Hilbert space in this setting if we define a corresponding inner product
⟨·, ·⟩_A as
    ⟨y, x⟩_A = y'Ax. (50)
It is customary to say that x and y are A-conjugate (or A-orthogonal) if ⟨y, x⟩_A = 0.
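The eigenvalue bounds (47)–(48) are easy to test numerically; the sketch below, with an arbitrarily chosen positive definite A, verifies the norm equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)              # symmetric positive definite
lam = np.linalg.eigvalsh(A)              # eigenvalues 0 < l_1 <= ... <= l_n

x = rng.standard_normal(4)
norm_A = np.sqrt(x @ A @ x)              # ||x||_A, Eqn. (46)
norm_2 = np.linalg.norm(x)

# Eqn. (48): l_1^(1/2) ||x|| <= ||x||_A <= l_n^(1/2) ||x||
print(np.sqrt(lam[0]) * norm_2 <= norm_A <= np.sqrt(lam[-1]) * norm_2)  # True
```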
A Hilbert space H is a linear space, for our applications consisting of vectors or functions.
In case you have never heard about a Hilbert space, use what you know about R^n.
It is first of all a linear space, so that if x, y ∈ H and α, β ∈ R, also αx + βy has a meaning and
is an element of H (We will not need complex spaces).
Furthermore, it has a scalar product ⟨·, ·⟩ with its usual properties.
However, the really big theorem in Hilbert spaces related to optimization theory is the Projection
Theorem:
The Projection Theorem: If H_0 is a closed subspace of H and x ∈ H, then min_{y∈H_0} ‖x − y‖
is obtained for a unique vector y_0 ∈ H_0, where the error x − y_0 is orthogonal to H_0:
    ⟨x − y_0, y⟩ = 0 for all y ∈ H_0;
y_0 is the best approximation to x in H_0.
The theorem is often stated by saying that any vector in H may be written in a unique way as
    x = y_0 + e, (53)
where y_0 ∈ H_0 and e is orthogonal to H_0. If {e_i} is an orthonormal basis for H_0, the expansion coefficients of y_0 are
    α_i = ⟨x, e_i⟩,   i = 1, 2, … (54)
If you ever need some Hilbert space theory, the above will probably cover it.
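In R^n the Projection Theorem is just least squares. The sketch below (with a randomly chosen subspace) computes y_0 from the coefficients in Eqn. (54) with respect to an orthonormal basis, and checks that the error is orthogonal to H_0 and that y_0 beats any other candidate:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # orthonormal basis for H_0
x = rng.standard_normal(5)

y0 = Q @ (Q.T @ x)                   # coefficients alpha_i = <x, e_i>, Eqn. (54)
e = x - y0
print(np.allclose(Q.T @ e, 0))       # True: the error is orthogonal to H_0

y_other = Q @ rng.standard_normal(2)              # some other element of H_0
print(np.linalg.norm(x - y0) <= np.linalg.norm(x - y_other))  # True
```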
A local minimum x* is a strict (or isolated) local minimum if there is a neighborhood N such that
f(x*) < f(x) for all x ∈ N, x ≠ x*,
for a solution x* of (58). If there is only one minimum, which is then both global and strict, we
say it is unique.
As we saw for some trivial cases above, a minimum does not necessarily exist. So what about a
criterion for existence? The following result is fundamental:
Assume that f is a continuous function defined on a closed and bounded set Ω ⊆ R^n. Then there
exists x* ∈ Ω such that
    f(x*) = min_{x∈Ω} f(x). (60)
This theorem, which states that the minimum (and not only an infimum) really exists, is the most
basic existence theorem for minima that we have. A parallel version exists for maxima.
Because of this result, we always prefer that the domain we are taking the minimum or maximum
over is bounded and closed (Later in the text, when we consider a domain Ω, think of it as closed).
Let us look at the proof. We first establish that f is bounded below over Ω, that is, inf_{x∈Ω} f(x) is
finite. Assume the opposite. Then there are x_n ∈ Ω such that f(x_n) < −n, n = 1, 2, 3, … Hence
lim_{n→∞} f(x_n) = −∞. At the same time, since Ω is bounded and closed, there are convergent
subsequences, say lim_{k→∞} x_{n_k} = x_0 ∈ Ω. But lim_{k→∞} f(x_{n_k}) = −∞ ≠ f(x_0), thus contradicting
that f is continuous, and hence finite, at x_0.
Since f is bounded below, we know that there is an a ∈ R such that
    a = inf_{x∈Ω} f(x). (61)
Since a is the largest number that is less than or equal to f(x) for all x ∈ Ω, we also know that for
any n, there must be an x_n ∈ Ω such that
    f(x_n) < a + 1/n (62)
(think about it!).
We thus obtain, as above, a sequence {x_n} that has a convergent subsequence {x_{n_k}}_{k=1}^∞ with limit x_0 ∈ Ω.
Figure 1: (a) Feasible directions in the interior and on the boundary of Ω. (b) Feasible directions when
Ω (the circle itself, not the disc!) does not contain any line segment.
Hence, by continuity,
    f(x_0) = lim_{k→∞} f(x_{n_k}) = a. (66)
But this means that
    f(x_0) = a = inf_{x∈Ω} f(x) = min_{x∈Ω} f(x), (67)
and the proof is complete.
A feasible direction d at x ∈ Ω is a direction for which there is a curve x(t) ∈ Ω, t ≥ 0, with x(0) = x and
    d/‖d‖ = lim_{t→0+} (x(t) − x)/‖x(t) − x‖. (70)
3.3 First and Second Order Conditions for Minima
Figure 2: For a convex set, all straight line segments connecting two points are contained in the set.
(Left: convex; right: not convex.)
4 BASIC CONVEXITY
Convexity is one of the most important concepts in optimization. Although the results here are
all quite simple and obvious, they are nevertheless very powerful.
Decide which of the following sets are convex:
    the space R²;
    {(x, y) ∈ R² ; x² + 2y² ≤ 2};
    {(x, y) ∈ R² ; x² − 2y² ≤ 2};
    {x ∈ R^n ; Ax ≤ b}, b ∈ R^m and A ∈ R^{m×n}.
Theorem 1: If Ω_1, …, Ω_N ⊆ R^n are convex sets, then
    Ω_1 ∩ ⋯ ∩ Ω_N = ∩_{i=1}^N Ω_i (76)
is convex.
Proof: Choose two points x, y ∈ ∩_{i=1}^N Ω_i. Then θx + (1 − θ)y ∈ Ω_i for i = 1, …, N, that is,
θx + (1 − θ)y ∈ ∩_{i=1}^N Ω_i.
Thus, intersections of convex sets are convex!
Consider the graph of f and the connecting line segment from (x_1, f(x_1)) to (x_2, f(x_2)),
consisting of the following points in R^{n+1}:
    (θx_1 + (1 − θ)x_2, θf(x_1) + (1 − θ)f(x_2)),   θ ∈ (0, 1).
The function is convex if all such line segments lie on or above the graph,
    f(θx_1 + (1 − θ)x_2) ≤ θf(x_1) + (1 − θ)f(x_2). (77)
Note that a linear function, say
    f(x) = b'x + a, (78)
is convex according to this definition, since in that particular case, Eqn. 77 will always be an
equality.
When the inequality in Eqn. 77 is strict, that is, we have "<" instead of "≤", then we say that
the function is strictly convex. A linear function is convex, but not strictly convex.
Note that a convex function may not be continuous: Let Ω = [0, 1) and f be the function
    f(x) = { 1, x = 0,
           { 0, x > 0. (79)
Show that f is convex. This example is a bit strange, and we shall only consider continuous
convex functions in the following.
Proposition 1: If f and g are convex, and α, β ≥ 0, then αf + βg is convex (on the common
convex domain where both f and g are defined).
Idea of proof: Show that αf + βg satisfies the definition in Eqn. 77.
What is the conclusion in Proposition 1 if at least one of the functions is strictly convex and α,
β > 0? Can Proposition 1 be generalized?
Proposition 2: If f is convex, then the set
    Γ_c = {x ; f(x) ≤ c} (80)
is convex.
Figure 3: Simple examples of graphs of convex and strictly convex functions (should be used only as
mental images!).
Proof: If f(x_1) ≤ c and f(x_2) ≤ c, then
    f(θx_1 + (1 − θ)x_2) ≤ θf(x_1) + (1 − θ)f(x_2) ≤ θc + (1 − θ)c = c. (81)
This proposition has an important corollary for sets defined by several inequalities:
Corollary 1: Assume that the functions f_1, f_2, …, f_m are convex. Then the set
    {x ; f_i(x) ≤ c_i, i = 1, …, m} (82)
is convex.
Try to show that the maximum of a collection of convex functions, g(x) = max_i {f_i(x)}, is also
convex.
We recall that differentiable functions have tangent planes,
    T_{x_0}(x) = f(x_0) + ∇f(x_0)(x − x_0), (83)
and
    f(x) − T_{x_0}(x) = o(‖x − x_0‖). (84)
Proposition 3: A differentiable function on the convex set Ω is convex if and only if its graph
lies above its tangent planes.
Proof: Let us start by assuming that f is convex and x_0 ∈ Ω. Then, for all x ∈ Ω and θ ∈ (0, 1),
    (f(x_0 + θ(x − x_0)) − f(x_0))/θ ≤ f(x) − f(x_0), (85)
and letting θ → 0+, the left hand side tends to ∇f(x_0)(x − x_0). Thus,
    f(x) ≥ f(x_0) + ∇f(x_0)(x − x_0) = T_{x_0}(x). (86)
Figure 4: A useful mental image of a convex function: Connecting line segments above, and tangent
planes below the graph!
For the opposite, assume that the graph of f lies above its tangent planes. Consider two arbitrary
points x_1 and x_2 in Ω and a point x_θ on the line segment between them, x_θ = θx_1 + (1 − θ)x_2.
Then
    f(x_1) ≥ f(x_θ) + ∇f(x_θ)(x_1 − x_θ),
    f(x_2) ≥ f(x_θ) + ∇f(x_θ)(x_2 − x_θ). (87)
Multiply the first inequality by θ and the second by (1 − θ) and show that this implies that
    θ f(x_1) + (1 − θ) f(x_2) ≥ f(x_θ). (88)
The following proposition assumes that the second order derivatives of f, that is, the Hessian
∇²f, exist in Ω. We leave out the proof, which is not difficult:
Proposition 4: A smooth function f defined on a convex set Ω is convex if and only if ∇²f is
positive semi-definite in Ω. Moreover, f will be strictly convex if ∇²f is positive definite.
The opposite of convex is concave. The definition should be obvious. Most functions occurring in
practice are either convex or concave locally, but not on their whole domain of definition.
All results above have counterparts for concave functions.
The results about minimization of convex functions defined on convex sets are simple, but very
powerful:
Theorem 2: Let f be a convex function defined on the convex set Ω. If f has minima in Ω,
these are global minima, and the set of minima,
    M = {x ∈ Ω ; f(x) = min_{y∈Ω} f(y)}, (89)
is convex.
Note 1: Let Ω = R and f(x) = eˣ. In this case the convex function f(x) defined on the convex
set R has no minima.
Note 2: Note that the set M itself is convex: All minima are collected at one place. There are no isolated
local minima here and there!
Proof: Assume that x_0 is a minimum which is not a global minimum. We then know there is
a y ∈ Ω where f(y) < f(x_0). The line segment going from (x_0, f(x_0)) to (y, f(y)) is therefore
sloping downward. However, because f is convex,
    f((1 − θ)x_0 + θy) ≤ (1 − θ)f(x_0) + θf(y) < f(x_0) (90)
for all θ ∈ (0, 1]. Hence, x_0 cannot be a local minimum without being a global minimum!
Assume that f(x_0) = c. Then
    M = {y ; f(y) = c} (91)
      = {y ; f(y) ≤ c},
which is convex by Proposition 2.
Corollary 1: Assume that f is a convex function on the convex set Ω and assume that the
directional derivatives exist at x_0. Then x_0 belongs to the set of global minima of f(x) in Ω if
and only if f'(x_0; d) ≥ 0 for all feasible directions d.
Proof: We already know that f'(x_0; d) would be nonnegative if x_0 is a (global) minimum, so
assume that x_0 is not a global minimum. Then f(y) < f(x_0) for some y ∈ Ω, and d = y − x_0 is
a feasible direction (why?). But this implies that
    f'(x_0; d) ≤ f(x_0 + d) − f(x_0) = f(y) − f(x_0) < 0. (92)
Corollary 2: Assume that f is a differentiable convex function on the convex set Ω and that
∇f(x_0) = 0. Then x_0 belongs to the set of global minima of f(x) in Ω.
Proof: Here f'(x_0; d) = ∇f(x_0) d = 0 (which is larger than or equal to 0!).
Note that if f is convex on the convex set Ω, and f'(x; y − x) exists for all x, y ∈ Ω, then
inequality (92) may be written
    f(y) ≥ f(x) + f'(x; y − x),   x, y ∈ Ω. (93)
Life is easy when the functions are convex, and one usually puts quite some effort either into
formulating the problem so that it is convex, or into trying to prove that this holds for the problem at hand!
Jensen's Inequality is a classic result in mathematical analysis where convexity plays an essential
role. The inequality may be extended to a double-inequality which is equally simple to derive.
Figure 5: Think of the points as mass-particles and determine their center of gravity!
The inequality is among the few statements in mathematics where the proof is easier to remember
than the result itself!
Let φ be a convex function, φ : R → R. We first consider the discrete case where λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n,
and {w_i}_{i=1}^n are positive numbers. Jensen's double-inequality then goes as follows:
    φ(λ̄) ≤ φ̄(λ) ≤ (1 − θ̄) φ(λ_1) + θ̄ φ(λ_n),
where
    λ̄ = Σ_{i=1}^n w_i λ_i / Σ_{i=1}^n w_i,
    φ̄(λ) = Σ_{i=1}^n w_i φ(λ_i) / Σ_{i=1}^n w_i, (94)
    θ̄ = (λ̄ − λ_1)/(λ_n − λ_1).
The name "Jensen's double inequality" is not very common, but suitable since there are two
(non-trivial) inequalities involved.
The proof may be read directly out of Fig. 5, thinking in pure mechanical terms: The center
of gravity for the n mass points at {(λ_i, φ(λ_i))}_{i=1}^n with weights {w_i}_{i=1}^n is located at (λ̄, φ̄(λ)).
Because of the convexity of φ, the ordinate φ̄(λ) has to be somewhere between φ(λ̄) and l(λ̄),
that is, the point corresponding to λ̄ on the line segment joining (λ_1, φ(λ_1)) and (λ_n, φ(λ_n)).
That is all!
It is the left part of the double inequality that traditionally is called Jensen’s Inequality.
Also try to write the inequality in the case where w is a positive function of λ, and derive the
following inequality for a real stochastic variable X:
    φ(EX) ≤ E φ(X). (95)
(Hint: The mass density is now the probability density w(λ) for the variable, and recall that
EX = ∫_{−∞}^∞ λ w(λ) dλ).
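The double inequality (94) is easy to test numerically. A sketch for the convex function φ(λ) = λ², with randomly chosen points and positive weights:

```python
import random

random.seed(0)
phi = lambda l: l * l                      # a convex function

lam = sorted(random.uniform(1.0, 5.0) for _ in range(6))
w = [random.uniform(0.1, 2.0) for _ in range(6)]

W = sum(w)
lam_bar = sum(wi * li for wi, li in zip(w, lam)) / W          # weighted mean
phi_bar = sum(wi * phi(li) for wi, li in zip(w, lam)) / W     # mean of phi-values
theta = (lam_bar - lam[0]) / (lam[-1] - lam[0])
upper = (1 - theta) * phi(lam[0]) + theta * phi(lam[-1])      # chord value

print(phi(lam_bar) <= phi_bar <= upper)    # True: the double inequality holds
```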
A lot of inequalities are derived from the left hand side of Jensen's double-inequality. However,
the Kantorovitch Inequality, discussed next, is an exception, since it is based on the right hand
part of the inequality.
The Kantorovitch Inequality: Let A be a symmetric positive definite matrix with eigenvalues 0 < λ_1 ≤ ⋯ ≤ λ_n. Then, for all x ≠ 0,
    ‖x‖²_A ‖x‖²_{A⁻¹} / ‖x‖⁴ ≤ (λ_1 + λ_n)² / (4 λ_1 λ_n). (96)
Pn
Since the inequality is invariant with respect to the norm of x, we shall assume that x = i=1 i ei ,
and set wi = i2 so that
Xn
wi = kxk2 = 1: (97)
i=1
Since we are on the positive real axis, the function φ(λ) = 1/λ is convex, and
    ‖x‖²_A = x'Ax = Σ_{i=1}^n λ_i w_i = λ̄,
    ‖x‖²_{A⁻¹} = x'A⁻¹x = Σ_{i=1}^n (1/λ_i) w_i = φ̄(λ). (98)
By Jensen's double inequality,
    ‖x‖²_A ‖x‖²_{A⁻¹} = λ̄ φ̄(λ) ≤ λ̄ ((1 − θ̄)/λ_1 + θ̄/λ_n) = λ̄ (λ_1 + λ_n − λ̄)/(λ_1 λ_n). (99)
The right hand side is a second order polynomial in λ̄ with a maximum value
    (λ_1 + λ_n)² / (4 λ_1 λ_n), (100)
which proves the inequality.
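A numerical spot-check of the Kantorovitch Inequality (96), with a randomly chosen positive definite A and an arbitrary x (a sketch, not part of the note):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)                 # symmetric positive definite
lam = np.linalg.eigvalsh(A)
l1, ln = lam[0], lam[-1]

x = rng.standard_normal(5)
lhs = (x @ A @ x) * (x @ np.linalg.solve(A, x)) / np.linalg.norm(x) ** 4
rhs = (l1 + ln) ** 2 / (4 * l1 * ln)
print(lhs <= rhs)                           # True
```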
As an application of the Kantorovitch Inequality, consider the Steepest Descent Method applied to the quadratic function
    f(x) = b'x + ½ x'Ax,   A > 0.
We know that the gradient direction g = (∇f)' in this case is equal to b + Ax, and the Hessian
∇²f is equal to A. The problem has a unique solution given by b + Ax = 0, that is, x* = −A⁻¹b.
At a certain point x_k, the steepest descent is along the direction −g_k = −(b + Ax_k). We therefore
have to solve the one-dimensional sub-problem
    min_α f(x_k − α g_k), (102)
whose solution α_k gives the next iterate
    x_{k+1} = x_k − α_k g_k. (103)
At the minimum, the new gradient is orthogonal to the search direction,
    ∇f(x_{k+1}) g_k = 0, (104)
or g'_{k+1} g_k = 0. This gives us the equation
    (b + A(x_k − α_k g_k))' g_k = (g_k − α_k A g_k)' g_k = 0, (105)
or
    α_k = g'_k g_k / g'_k A g_k = ‖g_k‖² / ‖g_k‖²_A. (106)
The algorithm, which at the same time is an iterative method for the system Ax = −b, goes as
follows:
    Given x_1 and g_1 = b + Ax_1.
    for k = 1 until convergence do
        α_k = g'_k g_k / g'_k (A g_k)
        x_{k+1} = x_k − α_k g_k
        g_{k+1} = g_k − α_k (A g_k)
    end
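The loop above translates directly into code. A sketch (matrix and data chosen here for illustration), which also checks the result against the exact solution x* = −A⁻¹b:

```python
import numpy as np

def steepest_descent(A, b, x, tol=1e-12, maxit=10000):
    """Steepest descent for f(x) = b'x + x'Ax/2, following the loop above."""
    g = b + A @ x
    for _ in range(maxit):
        if np.linalg.norm(g) < tol:
            break
        Ag = A @ g
        alpha = (g @ g) / (g @ Ag)     # exact line search, Eqn. (106)
        x = x - alpha * g
        g = g - alpha * Ag             # updated gradient
    return x

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = steepest_descent(A, b, np.zeros(2))
print(np.allclose(x, -np.linalg.solve(A, b)))   # True
```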
In order to get an estimate of the error on step k, we note that
    A⁻¹ g_k = A⁻¹ (b + Ax_k) = −x* + x_k. (107)
Hence,
    ‖x_k − x*‖²_A = (A⁻¹ g_k)' A (A⁻¹ g_k) = ‖g_k‖²_{A⁻¹}, (108)
and
    ‖x_{k+1} − x*‖²_A / ‖x_k − x*‖²_A = ‖g_{k+1}‖²_{A⁻¹} / ‖g_k‖²_{A⁻¹}. (109)
Let us look at ‖g_{k+1}‖²_{A⁻¹} on the right hand side:
    ‖g_{k+1}‖²_{A⁻¹} = g'_{k+1} A⁻¹ (g_k − α_k (A g_k))
                   = g'_{k+1} A⁻¹ g_k − α_k g'_{k+1} g_k
                   = g'_{k+1} A⁻¹ g_k (110)
                   = (g_k − α_k (A g_k))' A⁻¹ g_k
                   = g'_k A⁻¹ g_k − (g'_k g_k)² / g'_k (A g_k)
                   = ‖g_k‖²_{A⁻¹} − ‖g_k‖⁴ / ‖g_k‖²_A.
Thus,
    ‖g_{k+1}‖²_{A⁻¹} / ‖g_k‖²_{A⁻¹} = 1 − ‖g_k‖⁴ / (‖g_k‖²_{A⁻¹} ‖g_k‖²_A)
                                  ≤ 1 − 4 λ_1 λ_n / (λ_1 + λ_n)² (111)
                                  = ((λ_n − λ_1)/(λ_1 + λ_n))² = ((κ − 1)/(κ + 1))²,
where the Kantorovitch Inequality was applied for the inequality in the middle. We recognize
κ = λ_n/λ_1 as the condition number of the Hessian A.
If the condition number of the Hessian is large, the convergence of the steepest descent method
may be very slow!
Consider the equation for the unit circle,
    x² + y² − 1 = 0. (112)
Given a general equation h(x, y) = 0, it is natural to ask whether it is possible to write this as
y = f(x). For Eqn. 112, it works well locally around a solution (x_0, y_0), except for the points
(−1, 0) and (1, 0). In more difficult situations it may not be so obvious, and then the Implicit
Function Theorem is valuable.
The Implicit Function Theorem tells us that if we have an equation h(x, y) = 0 and a solution
(x_0, y_0), h(x_0, y_0) = 0, then there exists (if the conditions of the theorem are valid) a neighborhood
N around x_0 such that we may write
    y = f(x),
    h(x, f(x)) = 0, for all x ∈ N. (113)
The theorem guarantees that f exists, but does not solve the equation for us, and does not say
in a simple way how large N is.
Consider the implicit function equation
    x² − y² = 0 (114)
to see that we only find solutions in a neighborhood of a known solution, and that we, in this
particular case, will have problems at the origin.
We are going to present a somewhat simplified version of the theorem which, however, is general
enough to show the essentials.
Let
    h(x, y) = 0 (115)
be an equation involving the m-dimensional vector y and the n-dimensional vector x. Assume
that h is m-dimensional, such that there is hope that a solution with respect to y exists. We thus
have m nonlinear scalar equations for the m unknown components of y.
Assume we know at least one solution (x_0, y_0) of Eqn. 115, and by moving the origin to (x_0, y_0),
we may assume that this solution is the origin, h(0, 0) = 0. Let the matrix B be the Jacobian of
h with respect to y at (0, 0):
    B = ∂h/∂y (0) = {∂h_i/∂y_j (0)}. (116)
The Implicit Function Theorem may then be stated as follows:
Assume that h is a differentiable function with continuous derivatives both in x and y. If the
matrix B is non-singular, there is a neighborhood N around x = 0 where we can write y = f(x)
for a differentiable function f such that
    h(x, f(x)) = 0 for all x ∈ N. (117)
The theorem is not unreasonable: Consider the Taylor expansion of h around (0, 0),
    h(x, y) = Ax + By + higher order terms. (118)
The matrix A is the Jacobian of h with respect to x, and B is the matrix above. To the first
order, we thus have to solve the equation
    Ax + By = 0, (119)
which has the solution y = −B⁻¹Ax when B is non-singular.
The full proof of the Implicit Function Theorem is technical, and it is perfectly OK to stop the
reading here!
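To see the construction in the proof below in action: for the circle equation (112) near the known solution (0, 1), shift the origin there and iterate the mapping T(y) = −B⁻¹Ax − B⁻¹ψ(x, y) that the proof builds, where ψ is the higher-order part of h. This scalar sketch is illustrative only (numbers chosen here):

```python
# Circle equation shifted to the known solution (0, 1):
# H(u, v) = u^2 + (1 + v)^2 - 1, with Jacobians A = dH/du(0,0) = 0
# and B = dH/dv(0,0) = 2.
def H(u, v):
    return u * u + (1 + v) ** 2 - 1

A, B = 0.0, 2.0

def T(v, u):
    psi = H(u, v) - A * u - B * v     # the higher-order part of H
    return -(A / B) * u - psi / B

u = 0.1                               # a point near the known solution
v = 0.0
for _ in range(50):                   # fixed-point iteration: v -> T(v)
    v = T(v, u)

print(abs(H(u, v)) < 1e-12)           # True: v = f(u) solves H(u, v) = 0
```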
For the brave, we start by stating Taylor's Formula to first order for a vector valued function
y = g(x), x ∈ R^n, y ∈ R^m:
    g(x) = g(x_0) + ∇g(x_*)(x − x_0),
    x_* = Θ x_0 + (I − Θ) x. (121)
Note that since g has m components, ∇g(x_*) is an m × n matrix (the Jacobian); row i is the gradient ∇g_i evaluated at its own intermediate point x_{*i} = θ_i x_0 + (1 − θ_i) x:
    ∇g(x_*) = [ ∇g_1(x_{*1}) ;
                ∇g_2(x_{*2}) ;
                ⋮
                ∇g_m(x_{*m}) ], (122)
and Θ is a diagonal matrix, Θ = diag{θ_1, …, θ_m}. We shall assume that all gradients are continuous as
well, and hence
    g(x) = g(x_0) + ∇g(x_0)(x − x_0) + (∇g(x_*) − ∇g(x_0))(x − x_0)
         = g(x_0) + ∇g(x_0)(x − x_0) + a(x, x_0)(x − x_0), (123)
where a(x, x_0) → 0 when x → x_0.
Put
    ψ(x, y) = h(x, y) − Ax − By, (124)
where, as above, A = ∂h/∂x (0) and B = ∂h/∂y (0). From Taylor's Formula,
    ψ(x, y) = a(x, y) x + b(x, y) y, (125)
where both a and b tend to 0 when x, y → 0. Thus, for any positive ε, there are neighborhoods
    B(x, r_x) = {x ; ‖x‖ < r_x},
    B(y, r_y) = {y ; ‖y‖ < r_y}, (126)
such that
    (i) ‖ψ(x, y)‖ ≤ ε‖x‖ + ε‖y‖,   x ∈ B(x, r_x), y ∈ B(y, r_y);
    (ii) ‖ψ(x_1, y_1) − ψ(x_2, y_2)‖ ≤ ε‖x_1 − x_2‖ + ε‖y_1 − y_2‖, (127)
         x_1, x_2 ∈ B(x, r_x), y_1, y_2 ∈ B(y, r_y).
We now define the non-linear mapping y → T(y) as
    T(y) = −B⁻¹Ax − B⁻¹ψ(x, y), (128)
and will show that this mapping is a contraction on B(y, r_y) for all x ∈ B(x, r_x). This is the core
of the proof.
Choose ε so small that ε + ‖B⁻¹‖ε < 1. Then, find r_x and r_y such that (i) and (ii) hold, and
also ensure that r_x is so small that
    r_x < ε r_y / (‖B⁻¹A‖ + ‖B⁻¹‖ε). (129)
Then, for x ∈ B(x, r_x) and y ∈ B(y, r_y),
    ‖T(y)‖ ≤ ‖B⁻¹A‖ ‖x‖ + ‖B⁻¹‖ (ε‖x‖ + ε‖y‖) < ε r_y + ‖B⁻¹‖ε r_y < r_y. (130)
Thus T(B(y, r_y)) ⊆ B(y, r_y). Moreover,
    ‖T(y_1) − T(y_2)‖ = ‖B⁻¹ (ψ(x, y_1) − ψ(x, y_2))‖
                      ≤ ε ‖B⁻¹‖ ‖y_1 − y_2‖ (131)
                      < (1 − ε) ‖y_1 − y_2‖.
The mapping T is therefore a contraction on B(y, r_y) and has a unique fixed point y_0 = T(y_0) there, that is,
    y_0 = −B⁻¹Ax − B⁻¹ψ(x, y_0), (132)
or
    Ax + By_0 + ψ(x, y_0) = h(x, y_0) = 0 (133)
for all x ∈ B(x, r_x)!
This proves the existence of the function x → f(x) = y_0 in the theorem for all x ∈ B(x, r_x).
The continuity is simple:
    y_2 − y_1 = −B⁻¹A (x_2 − x_1) − B⁻¹ (ψ(x_2, y_2) − ψ(x_1, y_1)), (134)
giving
    ‖y_2 − y_1‖ ≤ ‖B⁻¹A‖ ‖x_2 − x_1‖ + ‖B⁻¹‖ (ε ‖x_2 − x_1‖ + ε ‖y_2 − y_1‖), (135)
and hence
    ‖y_2 − y_1‖ ≤ (‖B⁻¹A‖ + ‖B⁻¹‖ε) / (1 − ‖B⁻¹‖ε) · ‖x_2 − x_1‖.
Differentiability of f in the origin follows from the definition and (ii) above. Proof of the differentiability in other neighboring locations follows simply by moving the origin there and repeating the
proof.
Luenberger gives a more complete and precise version of the theorem. The smoothness of f
depends on the smoothness of h.
A final word: Remember the theorem by recalling the equation
    Ax + By = 0. (136)
6 REFERENCES
Luenberger, D. G.: Linear and Nonlinear Programming, 2nd ed., Addison-Wesley, 1984.