Math Skript
Mario Ullrich
Institut für Analysis
Version of
October 19, 2022
Mathematics for
Artificial Intelligence 1–3
JOHANNES KEPLER
UNIVERSITÄT LINZ
Altenbergerstraße 69
4040 Linz, Österreich
www.jku.at
Preface
These lecture notes belong to the lecture of the same name, and have been produced starting in the
winter semester 2019. A big part of the work was done by Julian Hofstadler and Corinna
Perchtold, who were employed as undergraduate assistants in this period to prepare these
notes and the corresponding slides.
This is the first time that this lecture has been held at the JKU Linz and therefore, these notes
are far from being perfect. However, many parts are taken (or merged) from the repeatedly
used lecture notes of Prof. Aicke Hinrichs (“Analysis für Lehramt”, JKU, 2018), Prof. Andreas
Neubauer (“Mathematics for Chemistry”, JKU, 2017) and myself (“Klassische Harmonische
Analysis”, JKU, 2017). I thank both colleagues for the permission to do so.
Mario Ullrich
(mario.ullrich@jku.at)
October 2021
A set M is a collection of distinct 'objects', which we call the elements of M . This rather intuitive
description of a set was first given by Georg Cantor (1845–1918). We use the following notation:
x belongs to M, write x ∈ M,
or
x is not in M, write x ∉ M.
(Illustration: two diagrams, one with x inside the set M , i.e. x ∈ M , and one with x outside M , i.e. x ∉ M .)
Some very important (and partly well-known) sets of numbers together with their ’symbol’ are:
(Note that we write “:=” instead of “=” if the equation is meant as a definition.)
All these sets will be precisely introduced and discussed later in this chapter. First, let us see
that there are multiple ways to define sets. The easiest way would be to list all its elements, as
for:
• A := {0, 1, 2}
However, if a set contains infinitely many elements we cannot list all of them.
In this case we use dots if it is clear what is contained in the set. For example, we might write
• G := {2, 4, 6, . . . } and
• U := {1, 3, 5, . . . }
for the even and odd natural numbers. However, this may lead to difficulties of interpretation,
as such a description is not unique. For example, we may define the set of natural numbers by
N := {1, 2, 3, . . . } = {1, 2, 3, 4, 5, 6, 7, . . . }.
A more precise way is to describe a set by a property of its elements, as in
• P := {n ∈ N : n is prime} and
• G := {n ∈ N : n is an even number}.
A special but important set is the empty set, short ∅, which does not contain any element,
i.e. ∅ = {}.
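These ways of defining sets can be mirrored in Python, whose set type and comprehensions closely follow the notation above. A small illustrative sketch (the truncation bound 20 is our arbitrary choice, since Python sets must be finite):

```python
A = {0, 1, 2}                                  # listing all elements
G = {n for n in range(1, 20) if n % 2 == 0}    # even naturals up to 19
U = {n for n in range(1, 20) if n % 2 == 1}    # odd naturals up to 19

def is_prime(n):
    """Check primality by trial division."""
    return n > 1 and all(n % d != 0 for d in range(2, int(n**0.5) + 1))

P = {n for n in range(1, 20) if is_prime(n)}   # primes up to 19
empty = set()                                  # the empty set ∅

print(sorted(G))   # [2, 4, 6, ..., 18]
print(sorted(P))   # [2, 3, 5, 7, 11, 13, 17, 19]
```

Note that a set comprehension `{n for n in ... if ...}` is exactly the set-builder notation {n ∈ N : ...} above.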
Two sets M and N can also be related to each other. If for arbitrary m ∈ M we also have
m ∈ N then we say M is a subset of N , and we write M ⊂ N or M ⊆ N . In this case, we also
call N a superset of M . If we have a look at the sets defined above, we have e.g. G ⊂ N and
U ⊂ N. Note that for any set M we have the obvious relations M ⊂ M and ∅ ⊂ M .
Sets M and N are called equal if they contain the same elements, i.e. M ⊂ N and N ⊂ M .
For example we have
A = {0, 1, 2} = {0, 0, 0, 1, 1, 1, 2} =: Ã.
To verify this equation we have to show that A ⊂ à and à ⊂ A and start by showing A ⊂ Ã.
Obviously 0 ∈ Ã, 1 ∈ Ã and 2 ∈ Ã, hence by definition we have A ⊂ Ã. The other way around,
i.e. Ã ⊂ A, is left as an exercise. Note that multiplicities are irrelevant in sets.
If we have M ⊂ N and M ≠ N , then we say that M is a proper or strict subset of N and
write M ⊊ N . For example N ⊊ N0 ⊊ Z.
Remark 1.1. Some authors prefer to use “⊆” instead of “⊂” to indicate that equality is not
excluded. (And we also do so sometimes.) The same authors may use “⊂” instead of “⊊” for
proper subsets. So, one should be careful when using different literature.
Sets may contain other sets. An important example is the power set P(M ) for a set M , which
is the set of all possible subsets of M , i.e.,
P(M ) := {A : A ⊂ M }.
Consider once more the set A = {0, 1, 2}, then its power set is given by:
P(A) = {∅, {0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}}.
Note that we always have M ∈ P(M ) and ∅ ∈ P(M ). (Important: The statement M ⊂ P(M )
is usually false! M contains elements, and P(M ) contains sets of elements.)
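For small finite sets the power set can be computed explicitly; one possible Python sketch (using `frozenset`, since Python sets cannot contain ordinary mutable sets):

```python
from itertools import chain, combinations

def power_set(M):
    """All subsets of M, returned as frozensets."""
    M = list(M)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(M, k) for k in range(len(M) + 1))}

A = {0, 1, 2}
PA = power_set(A)

print(len(PA))              # 8 subsets, i.e. 2 to the power |A|
print(frozenset() in PA)    # True: ∅ ∈ P(A)
print(frozenset(A) in PA)   # True: A ∈ P(A)
```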
We can also create new sets from given sets, say M and N , by using set operations:
The union M ∪ N contains all elements which belong to M or to N (or to both).
The intersection M ∩ N consists of all elements which are in both sets M and N .
The difference (or relative complement) of M and N , written as M \N , is the set of all elements
of M which are not contained in N .
If we only work with subsets M ⊂ Ω for a fixed set Ω, then we call Ω the underlying set or
the universal set. In this case, we work with the notation M c = Ω \ M for the complement
of M (in Ω).
(Venn diagrams illustrating M ∪ N , M ∩ N , M \ N and M c .)
The illustrations above are called Venn-diagrams (John Venn, 1834-1923) and are a good tool
when working with sets.
However, we often need precise definitions of these set operations in mathematical language.
M ∪ N := {x : x ∈ M or x ∈ N },
M ∩ N := {x : x ∈ M and x ∈ N },
M \ N := {x ∈ M : x ∉ N },
M c := {x ∈ Ω : x ∉ M }.
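These four operations map directly onto Python's set operators; a minimal illustration (Omega, M and N are arbitrary example sets of our choosing):

```python
Omega = set(range(10))   # the underlying set Ω = {0, ..., 9}
M = {1, 2, 3, 4}
N = {3, 4, 5, 6}

print(M | N)       # union:        {1, 2, 3, 4, 5, 6}
print(M & N)       # intersection: {3, 4}
print(M - N)       # difference:   {1, 2}
print(Omega - M)   # complement of M in Ω
```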
Elements of sets are not ordered since {a, b} = {b, a}. Nevertheless, it is often important to
order the objects under consideration.
Definition 1.3 (Tuples and Cartesian product). Let A and B be sets, and a ∈ A and
b ∈ B arbitrary elements.
• The expression (a, b), which is sensitive to order, is called a tuple or an ordered pair.
• Two tuples (a, b) and (a′ , b′ ) are equal if and only if a = a′ and b = b′ .
• The Cartesian product of A and B is the set of all such tuples, i.e., A × B := {(a, b) : a ∈ A, b ∈ B}.
Note that for A = {x, y} and B = {1, 2, 3} we have
A × B = {(x, 1), (x, 2), (x, 3), (y, 1), (y, 2), (y, 3)}
but
B × A = {(1, x), (1, y), (2, x), (2, y), (3, x), (3, y)}.
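The Cartesian product, and the fact that A × B and B × A differ, can be checked with `itertools.product` (an illustrative sketch):

```python
from itertools import product

A = {'x', 'y'}
B = {1, 2, 3}

AxB = set(product(A, B))   # A × B: tuples (a, b)
BxA = set(product(B, A))   # B × A: tuples (b, a)

print(len(AxB))      # 6 = |A| * |B|
print(AxB == BxA)    # False: order inside the tuples matters
```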
Let us finally fix some other mathematical language to make writing mathematical statements
more ’elegant’. To this end, we start with the universal and the existential quantifier, which
form the basis of many expressions in mathematical language. First, a proposition P is an
expression which can either be true or false, like 1 = 1 or 0 = 1. If we have that P (m) is a
proposition for all elements m of a set M , then we say that P (·) is a predicate for M .
Definition 1.5. Let P (·) be a predicate for M , i.e., P (m) is a proposition for every m ∈ M .
The universal quantifier builds a proposition
∀ m ∈ M : P (m),
which is true if and only if for all m ∈ M the proposition P (m) is true.
The existential quantifier builds a proposition
∃ m ∈ M : P (m),
which is true if and only if there exists at least one m ∈ M such that P (m) is true.
The uniqueness quantifier builds a proposition
∃! m ∈ M : P (m),
which is true if and only if there exists exactly one m ∈ M such that P (m) is true.
Example 1.6. Consider the set M = {0, 1, 2} and set P (m) = (m > 1). Inserting all the
elements of M into P (·) we get (0 > 1), (1 > 1) and (2 > 1). Clearly, only the last statement
is true. Hence, (∀m ∈ M : P (m)) = (∀m ∈ M : m > 1) is a wrong proposition, while ∃m ∈
M : m > 1 is true. We even have that ∃!m ∈ M : m > 1 is true.
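For finite sets the three quantifiers correspond to `all`, `any` and a count of exactly one in Python; a sketch using the data of Example 1.6:

```python
M = {0, 1, 2}
P = lambda m: m > 1          # the predicate P(m) = (m > 1)

forall = all(P(m) for m in M)              # ∀m ∈ M : P(m)
exists = any(P(m) for m in M)              # ∃m ∈ M : P(m)
exists_unique = sum(P(m) for m in M) == 1  # ∃!m ∈ M : P(m)

print(forall, exists, exists_unique)   # False True True
```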
Example 1.7. With these quantifiers we can also give a more mathematical (or ’elegant’)
definition of a subset. We have that M ⊂ N if and only if
∀x ∈ M : x ∈ N.
Moreover, we have M ⊊ N if and only if M ⊂ N and ∃x ∈ N : x ∉ M .
As you might have already noticed, we will often need the terms “if” or “if and only if”, and
therefore we define mathematical symbols for them. Let A and B be two propositions. Then,
“A =⇒ B” means “if A is true, then B is true”, and “A ⇐⇒ B” means “A is true if and
only if B is true”.
Roughly speaking, relations shall describe connections between two objects. Here, we give a
formal description and important properties. We then introduce functions, which are used to
“map” every element of a set to something different, and discuss special relations that are used
to compare, group or order elements of a given set.
Definition 1.8. A relation R between two sets M and N is a subset of the Cartesian
product of M and N , i.e. R ⊂ M × N .
To make things clearer we have a look at the upcoming illustration, which depicts every element
of R as a line. As you can see it is possible that x ∈ M is “connected” to some y ∈ N , which we
denote by (x, y) ∈ R. However, this does not have to be the case for every x ∈ M , and, different
elements of M may be mapped to the same y ∈ N . Moreover, x ∈ M can be mapped to more
than one element in N .
(Illustration: a relation R, with each pair (x, y) ∈ R depicted as a line from x ∈ M to y ∈ N .)
Example 1.9. Let M = {Anna, Philipp, Kevin, Julia} and N = {Corinna, Jakob, Anja}. Now
we define a relation R ⊆ M × N , where we have (x, y) ∈ R if and only if the first letter of x
equals the first letter of y. Clearly we have R = {(Anna, Anja), (Julia, Jakob)} ⊊ M × N .
Now we head to a very important type of relation, i.e., functions, which assign to each element
of M exactly one element of N . For a function f : M → N , the image of a set S ⊆ M and
the preimage of a set T ⊆ N are defined by
f (S) := {f (x) : x ∈ S} ⊆ N,
f −1 (T ) := {x ∈ M : f (x) ∈ T } ⊆ M.
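For finite sets, image and preimage can be computed directly from these definitions; a small sketch (the helper names `image` and `preimage` are ours):

```python
def image(f, S):
    """f(S) = {f(x) : x in S}"""
    return {f(x) for x in S}

def preimage(f, M, T):
    """f^{-1}(T) = {x in M : f(x) in T}; the domain M must be given."""
    return {x for x in M if f(x) in T}

M = {-2, -1, 0, 1, 2}
f = lambda x: x * x

print(image(f, M))           # {0, 1, 4}
print(preimage(f, M, {4}))   # {-2, 2}: a preimage need not be a single point
```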
To show the connection between functions and relations, we define the following.
Gf := {(x, f (x)) : x ∈ M } ⊂ M × N.
Note that the graph of a function is a relation. In this sense, all functions induce a relation, but
not vice versa.
We can visualize real-valued functions by plotting their graphs in a usual coordinate system (in
R2 ). For f (x) = x2 and f (x) = x + 1 this is demonstrated in the next illustration.
(Plot of the graphs of f (x) = x2 and f (x) = x + 1.)
In what follows, we will define several important properties of relations. We will always demon-
strate afterwards what this means for functions.
A relation R ⊂ M × N is called injective if and only if
∀(x1 , y1 ), (x2 , y2 ) ∈ R : x1 ≠ x2 ⇒ y1 ≠ y2 ,
which is equivalent to
∀(x1 , y1 ), (x2 , y2 ) ∈ R : y1 = y2 ⇒ x1 = x2 .
It is called surjective if and only if
∀y ∈ N ∃x ∈ M : (x, y) ∈ R,
and functional if and only if
∀x ∈ M, y1 , y2 ∈ N : (x, y1 ), (x, y2 ) ∈ R ⇒ y1 = y2 .
Note that the graph of a function is a functional relation, and vice versa. We can therefore
rephrase the above definitions for functions. If f : M → N is a function we say:
• f is injective if and only if f (x1 ) = f (x2 ) implies x1 = x2 ,
• f is surjective if and only if for every y ∈ N there is an x ∈ M with f (x) = y,
• f is bijective if and only if f is both injective and surjective.
Injective, surjective and bijective functions are also called injections, surjections and bijections,
respectively.
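For functions on finite sets these properties can be tested exhaustively; a sketch with helper names of our own:

```python
def is_injective(f, M):
    """Different inputs give different outputs."""
    values = [f(x) for x in M]
    return len(values) == len(set(values))

def is_surjective(f, M, N):
    """Every y in N is hit by some x in M."""
    return {f(x) for x in M} == set(N)

M = {-1, 0, 1}
sq = lambda x: x * x

print(is_injective(sq, M))                # False: sq(-1) == sq(1)
print(is_surjective(sq, M, {0, 1}))       # True
print(is_injective(lambda x: x + 1, M))   # True
```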
We will now define two functions that can in principle be defined on arbitrary sets M . First, we
define the identity function IdM : M → M which maps each element to itself, i.e., x ↦ x.
Second, if f : M → N and g : N → M are functions such that
∀x ∈ M : g(f (x)) = x
and
∀y ∈ N : f (g(y)) = y,
then f and g are inverses of each other.
In this case we write f −1 := g and g −1 := f and call f (or g) invertible.
Note that we used the notation f −1 already for the preimage, see Definition 1.10. There, the
input was a set, and the preimage was defined for any function f . For an invertible function
f : M → N we have, by definition, f −1 ({f (x)}) = {x} and f ({f −1 (y)}) = {y} for any x ∈ M and
y ∈ N . In particular, the preimage of any one-element subset of N has also exactly one element.
So, this notation makes sense, if we identify f −1 (y) with the unique element in f −1 ({y}).
Figure 8: The function f (x) = x2 and its inverse f −1 (x) = √x.
The following theorem provides us with a tool to check if a function has an inverse or not.
However, finding a closed formula is often much harder or even impossible. This is one of
the main reasons for using numerical software and approximations. We will see several
examples later during the course.
Theorem. Let f : M → N be a function. Then: f is invertible ⇐⇒ f is bijective.
Proof. We want to prove an equivalence, and therefore have to prove both directions.
First we are going to prove “⇒”:
Since f is invertible, there exists f −1 : N → M such that
f −1 (f (x)) = x ∀x ∈ M
and
f (f −1 (y)) = y ∀y ∈ N.
Recall that bijective means surjective and injective. We show that f is surjective, i.e.,
∀y ∈ N ∃x ∈ M : f (x) = y.
Indeed, for given y ∈ N set x := f −1 (y); then f (x) = f (f −1 (y)) = y. For injectivity, assume
that f (x1 ) = f (x2 ) for some x1 , x2 ∈ M . Then
x1 = f −1 (f (x1 )) = f −1 (f (x2 )) = x2 .
Invertible (or bijective) functions may be used to formally define the cardinality of a set.
For this, note that the existence of a bijective function f : M → N means that there is a one-
to-one correspondence between M and N . In particular, both sets must have the same
cardinality (aka. size). This means that a set M has cardinality n ∈ N, i.e., |M | = n, if and only
if there is a bijective function f : {1, . . . , n} → M . (Clearly, this is equivalent to the existence
of a bijective/invertible function g : M → {1, . . . , n}.)
Even more, for two finite sets A, B we have that
This notion also allows us (to some extent) to characterize the cardinality of an infinite set.
Definition 1.17. Let A be a set and n ∈ N. If the elements of A can be labeled by the
numbers {1, . . . , n}, then we say that A has cardinality n, and we write
|A| = n or #A = n.
Note that countability is the precise definition of the “simple” property that the elements of A
can be enumerated by the natural numbers {1, 2, 3, . . . }.
(Diagram: functions f : X → Y and g : Y → Z, together with their composition g ◦ f : X → Z.)
Example 1.20. Consider the function h(x) = sin(x2 ). If we set f (x) = x2 and g(y) = sin(y)
we get (g ◦ f )(x) = g(f (x)) = sin(x2 ) = h(x). However, since (f ◦ g)(x) = f (g(x)) = (sin(x))2 ,
it is important to mind that, in general, we have (f ◦ g) ≠ (g ◦ f ).
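Example 1.20 can be replayed in code; the following sketch (with a small `compose` helper of our own) confirms that f ◦ g and g ◦ f differ:

```python
import math

def compose(g, f):
    """(g ∘ f)(x) = g(f(x))"""
    return lambda x: g(f(x))

f = lambda x: x * x   # f(x) = x^2
g = math.sin          # g(y) = sin(y)

h1 = compose(g, f)    # sin(x^2)
h2 = compose(f, g)    # (sin x)^2

x = 2.0
print(h1(x) == math.sin(x * x))   # True
print(h1(x) == h2(x))             # False: composition order matters
```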
Assume now that f : X → Y and g : Y → Z are invertible. Then
∀y ∈ Y ∃!x ∈ X : f −1 (y) = x
and
∀z ∈ Z ∃!y ∈ Y : g −1 (z) = y.
Furthermore
∀x ∈ X ∃!y ∈ Y : f (x) = y
and
∀y ∈ Y ∃!z ∈ Z : g(y) = z.
This yields
(f −1 ◦ g −1 )(z) = f −1 (g −1 (z)) = f −1 (y) = x
and
(g ◦ f )(x) = g(f (x)) = g(y) = z.
Thus
(g ◦ f )[(f −1 ◦ g −1 )(z)] = (g ◦ f )(x) = z
and
(f −1 ◦ g −1 )((g ◦ f )(x)) = (f −1 ◦ g −1 )(z) = x,
which shows that f −1 ◦ g −1 is the inverse of g ◦ f , and vice versa.
Finally, let us discuss some relations that are used, instead of as mappings, to compare, group
or order elements of a set M . For this we consider relations R ⊂ M 2 = M × M . Such relations
usually have nothing to do with functions, but are still essential. Again, let us give names
to some of their important characteristics.
• antisymmetric if and only if
∀x, y ∈ M : (x, y) ∈ R and (y, x) ∈ R =⇒ x = y,
• total if and only if
∀x, y ∈ M : (x, y) ∈ R or (y, x) ∈ R.
Some relations which have some of these properties are especially important. For example, the
usual order relation “≤” on R is
• reflexive: x ≤ x,
• antisymmetric: x ≤ y and y ≤ x =⇒ x = y,
• transitive: x ≤ y and y ≤ z =⇒ x ≤ z,
• total: x ≤ y or y ≤ x.
Example 1.24. The usual equality relation “=” in R is reflexive, symmetric, antisymmetric
and transitive, but not total. It is therefore an equivalence relation and a partial order. As an
exercise, show that “=” is the only reflexive relation that is symmetric and antisymmetric.
Example 1.25. Define the “strictly less” relation “<” by a < b if and only if a ≤ b and
a ≠ b. Determine all characteristics of “<”, as well as “≥”, and “>”. (Relations with the same
characteristics as “<” are called strict partial orders.)
Example 1.26. Let us also mention the not-equal relation in R, which is defined by a ≠ b if
and only if a < b or b < a. Which characteristics does it have?
Example 1.27. Show that the divisibility relation R ⊂ N2 , with (a, b) ∈ R :⇔ a|b, i.e., a divides
b, is a partial order.
Example 1.28. One can define a partial order on a set of sets by using the subset relation ⊂.
For example, for M := {∅, {1}, {2}, {1, 2}}, we have that, e.g., ∅ ⊂ {1} ⊂ {1, 2}. It is easy to
see that ⊂ is reflexive, antisymmetric and transitive. Hence, ⊂ is a partial order on M . However,
since {1} ⊄ {2} and {2} ⊄ {1}, it is not a linear order on M .
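For a finite set M , the properties above can be checked by brute force. The sketch below (helper names are ours) does this for the divisibility relation of Example 1.27, restricted to {1, . . . , 10}:

```python
def reflexive(R, M):
    return all((x, x) in R for x in M)

def antisymmetric(R, M):
    return all(not ((x, y) in R and (y, x) in R and x != y)
               for x in M for y in M)

def transitive(R, M):
    return all((x, z) in R
               for x in M for y in M for z in M
               if (x, y) in R and (y, z) in R)

def total(R, M):
    return all((x, y) in R or (y, x) in R for x in M for y in M)

M = range(1, 11)
divides = {(a, b) for a in M for b in M if b % a == 0}   # (a, b) in R iff a | b

# Divisibility is a partial order, but not total:
print(reflexive(divides, M), antisymmetric(divides, M),
      transitive(divides, M), total(divides, M))   # True True True False
```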
The commonly known natural numbers or rational numbers (fractions) are not sufficient for a
rigorous foundation of mathematical analysis. The historical development shows that for issues
concerning analysis, the rational numbers have to be extended to the real numbers. Maybe, you
already know a lot about the real numbers. However, you probably do not know that all their
properties follow from a few basic ones. So, let us introduce them from scratch.
We begin with the set of natural numbers, i.e.,
N := {1, 2, 3, . . . }.
However, we have seen that such a definition is not precise enough, and we want to define this
set solely by its properties.
These properties are called the Peano axioms, presented first by Giuseppe Peano (1858–1932)
around 1889. The only thing we need to assume is that we know what the number “1” is,
and what it means to add “1” to a number. (This is very reasonable, when we think about
“counting” in real life.)
We then define the natural numbers by the (unique) set N such that
1. 1 ∈ N,
2. n ∈ N =⇒ n + 1 ∈ N,
3. ∀n, m ∈ N : n = m ⇐⇒ n + 1 = m + 1, and
4. ∀n ∈ N : n + 1 6= 1.
In words, 1 is a natural number, for every natural number n, its “successor” n + 1 is also a
natural number, two natural numbers are equal iff their successors are equal, and no natural
number has the successor 1. Although these axioms seem to be obvious or redundant, depending
on the point of view, they are really the only “axioms” we need to assume to build up most of
modern mathematics. (A detailed discussion goes far beyond scope here.)
Next, the set of natural numbers with zero (or non-negative integers) is defined by
N0 := {0} ∪ N.
The sets N and N0 are closed under addition and multiplication, i.e., for all n, m ∈ N we have
n + m ∈ N and m · n ∈ N. However, we already get in trouble when we try to work with
subtraction, since 21 − 42 = −21 ∉ N.
The set of integer numbers is also closed under subtraction and is defined as
Z := {0, −1, 1, −2, 2, . . . } = {· · · − 2, −1, 0, 1, 2, . . . } = N0 ∪ (−N),
where −N := {−n : n ∈ N}. Clearly, for a, b ∈ Z we have a + b ∈ Z, a − b ∈ Z and a · b ∈ Z.
However, division is still a problem if we use integer numbers only.
The set of all fractions of two integers is the set of rational numbers, which is denoted as
Q := {p/q : p, q ∈ Z, q ≠ 0},
where we call p the numerator and q the denominator.
We call two rational numbers k/n and l/m equal if and only if km = ln. Further, an integer k ∈ Z
can be identified with the fraction k/1 ∈ Q. Consequently, the inclusions N ⊊ Z ⊊ Q are true.
One main reason to introduce “new” sets of numbers is that we want to solve equations
and, in particular, we want to know if a solution exists in a given set.
For example, if we want to solve the equation a · x + b = 0, where a, b ∈ Q are fixed constants
and a ≠ 0, it is easy to see that
x = −b/a ∈ Q
solves the equation.
However, what about the simple equation x2 − 2 = 0? Is there some x ∈ Q such that x2 = 2?
The following discussion will show that this is not possible.
Proof. Proof by contradiction, i.e., we assume ∃x ∈ Q : x2 = 2 and show that this yields a
wrong result. Since (−x)2 = x2 we may assume that x > 0 and that x = m/n, where m, n ∈ N.
Furthermore, we may assume that at least one of the numbers n or m is odd; otherwise we could
cancel by 2. This yields the equation
x2 = m2 /n2 = 2,
which is equivalent to
m2 = 2 · n2 .
Hence m2 is an even number and consequently (previous lemma) m is an even number and we
can write m = 2 · k for some k ∈ N. So we have
m2 = 4 · k 2 = 2 · n2 ,
leading to
2 · k 2 = n2 .
Applying the previous lemma once again, we get that n also has to be an even number,
which contradicts the assumption that at least one of m and n is odd.
Thus the equation x2 − 2 = 0 is not solvable in Q, but as we all know from school there is a
number √2 such that (√2)2 = 2.
Making the so-called number line complete, we finally get to the set of real numbers. These
numbers can have infinitely many decimals, i.e.,
R := {z + r : z ∈ Z, r = a1 /10 + a2 /100 + . . . , where a1 , a2 , · · · ∈ {0, 1, . . . , 9}}.
One may think of R as the set of all points on the number line, i.e., without holes. Note that
the rational numbers, written as decimal numbers, either have a finite number of digits or the
sequence of digits is periodic. This means, that some points on the line are missing. These
correspond to numbers which have a non-periodic infinite number of decimals, in other terms
which cannot be written as fractions, such as various roots, like √2, and the numbers π and e.
These numbers, i.e. the elements of R \ Q, are called irrational numbers. We will later see that the set R is
(assumed to be) in some sense complete. Such a precise definition was only given in the 19th
century, probably by Karl Weierstraß (1815–1897).
Although, we will not prove that here, let us note that the set of rational numbers is
countable, whereas the set of real numbers are uncountable. (The proofs can easily be
found in the literature and are quite interesting, but they go beyond the scope here.)
In summary, the following set relations hold:
N ⊊ N0 ⊊ Z ⊊ Q ⊊ R.
Note that the following calculation rules are valid for any real numbers:
We call these properties axioms because, actually, we somehow assume them to be true. (Or
how would you prove these statements?) Many of the calculations that follow in the upcoming
chapters could also be done with other fields. We do not go into details here.
Example 1.31. Another important and well-known field is the finite field Z2 := {0, 1} with
the addition 0 + 0 := 0, 1 + 0 := 1, 0 + 1 := 1, 1 + 1 := 0, and the multiplication 0 · 0 := 0,
1 · 0 := 0, 0 · 1 := 0, 1 · 1 := 1. (Note that we formally need to define what we mean by addition
and multiplication.) These are the numbers and operations that computers work with. Verify
yourself that Z2 is a field.
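The addition and multiplication tables of Z2 are exactly arithmetic modulo 2, which makes the verification easy to sketch:

```python
# Z2 arithmetic: the tables from the text coincide with arithmetic mod 2
Z2 = {0, 1}
add = lambda a, b: (a + b) % 2
mul = lambda a, b: (a * b) % 2

print(add(1, 1))   # 0, as defined in the text
print(mul(1, 1))   # 1
# closure: every result stays inside Z2
print(all(add(a, b) in Z2 and mul(a, b) in Z2 for a in Z2 for b in Z2))   # True
```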
First, recall the following calculation rules for inequalities. Let a, b, c, d ∈ R, then
• a < b =⇒ a ± c < b ± c
• a < b =⇒ −a > −b
• 0 < a < b =⇒ 0 < 1/b < 1/a
We now want to specify the “least” and the “greatest” element of a set. To define this
formally, we first need the definition of a bounded set (in R). A set A ⊂ R is called bounded
from above if
∃C ∈ R ∀a ∈ A : a ≤ C,
and bounded from below if
∃c ∈ R ∀a ∈ A : c ≤ a.
Let us discuss this with the following basic subsets of R, i.e., intervals. For a, b ∈ R with a ≤ b we set
[a, b] := {x ∈ R : a ≤ x ≤ b},
(a, b) := {x ∈ R : a < x < b},
and define the half-open intervals [a, b) and (a, b] accordingly.
Let a, b ∈ R such that a < b. Then for the closed interval [a, b] we have that a, a − 1, a − 42 are
lower bounds and b, b + 42 are upper bounds, and the same is true for the corresponding (half)
open interval. So, an upper bound is not unique.
To fix a specific upper/lower bound, let us first define the minimum and maximum of a set.
Example 1.36. The set of the natural numbers N ⊂ R is bounded from below, with min N = 1.
However, maxima and minima do not have to exist: While b is the least upper bound of both
[a, b] and (a, b), we have that b ∈ [a, b], but b ∈
/ (a, b). Hence, the set (a, b) does not have a
maximum (or minimum), but still we would like to work with the “best possible” bounds for
such a set, which are clearly a and b in this example.
For this we define the infimum and the supremum as the greatest lower bound and the least
upper bound, respectively. These objects will be very important in the upcoming analysis.
Clearly, if inf A ∈ A, then inf A = min A, and if sup A ∈ A, then sup A = max A. In words, if
the infimum (or supremum) of a set A is contained in A, then A has a minimum (or maximum)
which has the same value.
Moreover, infimum and supremum are uniquely determined. To see this, assume that
there are two suprema T1 and T2 of A. Since sup A = T1 , T1 is an upper bound of A, i.e.,
a ≤ T1 for all a ∈ A. In addition, since sup A = T2 , we obtain by the second defining property
above that T2 ≤ x for every upper bound x of A. Setting x = T1 , we have T2 ≤ T1 . In the
same way, we get T1 ≤ T2 , and hence T1 = T2 .
min[a, b] = inf[a, b] = inf[a, b) = inf(a, b] = inf(a, b) = a
and
max[a, b] = sup[a, b] = sup[a, b) = sup(a, b] = sup(a, b) = b.
However, min(a, b), min(a, b], max(a, b) and max[a, b) do not exist.
Example 1.39. Let A = {x2 : x ∈ (−1, 1)}, then inf A = min A = 0 and sup A = 1. Verify
yourself!
However, it is not clear at all if every set has an infimum and supremum. For example, if
we would try the same with Q instead of R, i.e., we look for a least upper bound T ∈ Q for a
set A ⊂ Q, this would not be true. Consider e.g. the set A = {x ∈ Q : x2 ≤ 2}. If we consider
A as a subset of R, then its supremum is √2 ∈ R. But as a subset of Q it has no supremum,
since √2 ∉ Q. The reason is that the rational numbers have “gaps”.
The real numbers R were precisely defined to have no such “gaps”. However, note that this is
actually an assumption and we formalize this by the next axiom of this lecture.
Let us state an equivalent definition of supremum and infimum for bounded sets in R.
Although it looks more complicated at first sight, this formulation is sometimes very helpful.
Let A ⊂ R be bounded from below. Then, T = inf A if and only if
(i) ∀a ∈ A : T ≤ a, and
(ii) ∀ε > 0 ∃a ∈ A : a < T + ε,
i.e., T is a lower bound, and no larger number is one.
Remark 1.40.(*) Let us add that all the definitions above (bounded, inf/sup) also make sense,
if we replace the “usual” ≤-relation on R by another partial order on some set M .
For example, consider the subset relation ⊂ on M := {∅, {1}, {2}, {1, 2}}, see also Example 1.28.
With A := {{1}, {2}}, we have sup A = {1, 2} ∈ M . (Verify precisely that this is the least upper
bound of A with respect to ⊂.) Note that A does not have a maximum. All other suprema are
easy to calculate. In particular, every subset of M has a supremum in M , i.e., M with ⊂ is
complete. (What about the same relation on M 0 := {∅, {1}, {2}}?)
Remark 1.41. (*) We call a set Ω with a partial order complete, if every A ⊂ Ω that is
bounded from above has sup A ∈ Ω. Thus, very formally, the real numbers R are assumed to be
a complete field with a linear order. Note that this would not be true for N and Z, as they do
not fulfill the field axioms, and also not for Q because, again, {x ∈ Q : x2 ≤ 2} has no supremum in Q.
Let us finally discuss the Archimedean property. It is based on the fact that the set of natural
numbers is unbounded. Even though this property seems unimpressive, it was of significant
importance for real analysis.
Proof. a) Let us recall that the natural numbers N are defined solely by the properties 1 ∈ N,
and that n ∈ N implies that n + 1 ∈ N.
We first prove by contradiction that N is unbounded. For this, we assume that N is bounded.
Thus, the supremum x = sup N exists by the completeness axiom. As x is the smallest upper
bound of N, x − 1 is no upper bound of N, so there exists a n ∈ N with x − 1 < n. By addition
with 1 this also implies x < n + 1. But n + 1 ∈ N follows from n ∈ N, and therefore x is no upper
bound of N, which is a contradiction to the original assumption. So, N must be unbounded.
In order to show b) and c) one continues as follows:
N has no upper bound ⇐⇒ ∀x ∈ R ∃n ∈ N : n > x
⇐⇒ ∀x > 0 ∃n ∈ N : 1/n < 1/x (set ε := 1/x)
⇐⇒ ∀ε > 0 ∃n ∈ N : 1/n < ε.
c) It is sufficient to check the case x, y > 0 (Why?). Then we are looking for m, n ∈ N such
that nx ≤ m ≤ ny. With part a) above, we get an n ∈ N with n(y − x) ≥ 1. Now take the
smallest natural number m ≥ nx (which clearly exists). Then m ≥ nx and m − 1 < nx. The
last condition implies that m < nx + 1 ≤ ny, so we get nx ≤ m ≤ ny.
Based on this, we easily calculate certain infima and suprema of (discrete) sets.
Example 1.43. Let A = {1/n : n ∈ N}, then inf A = 0 and sup A = max A = 1.
To prove that, first note that 0 < 1/n ≤ 1 for all n ∈ N. So, 0 is a lower bound, and 1 is an upper
bound. Since 1 ∈ A (for n = 1), we therefore obtain that sup A = max A = 1. For the infimum
we need to show that there is no larger lower bound. For this, let ε > 0 be arbitrary. By the
Archimedean property there exists an n ∈ N with 1/n < ε. Hence, ε is not a lower bound. As this
works for all ε > 0, we conclude that 0 is the largest lower bound of A, and hence inf A = 0.
Example 1.44. Let A = {1/(n2 − n − 3) : n ∈ N}, then inf A = min A = −1 and
sup A = max A = 1/3.
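Reading the set in Example 1.44 as {1/(n2 − n − 3) : n ∈ N} (our reconstruction of the displayed formula), the claimed minimum and maximum can be checked numerically:

```python
# The minimum -1 is attained at n = 2, the maximum 1/3 at n = 3;
# for large n the terms tend to 0 from above.
values = [1 / (n * n - n - 3) for n in range(1, 1000)]

print(min(values))   # -1.0
print(max(values))   # 0.333... = 1/3
```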
In the following we are dealing with mathematical induction which helps us to prove state-
ments or define objects for all n ∈ N.
Theorem 1.45 (Mathematical induction). A predicate P (n) is true for all n ∈ N if and
only if the following two steps hold:
1. P (1) is true, and
2. for every n ∈ N: if P (n) is true, then P (n + 1) is true.
Note that, by the Peano axioms, we reach every natural number this way.
The concept of mathematical induction can be used for the definition of sequences of objects,
which is sometimes quite helpful. If, for instance, G(n) is a quantity that has to be defined for
all n ∈ N, then it is sufficient to define G(1) and, for all n ∈ N, G(n + 1) in terms of G(n).
One example are the formal definitions of the sum and product symbols.
Definition 1.46. Let ak ∈ R for k ∈ N. Then we define sum and product as follows:
∑_{k=1}^{1} ak := a1 ,   ∑_{k=1}^{n+1} ak := an+1 + ∑_{k=1}^{n} ak ,
∏_{k=1}^{1} ak := a1 ,   ∏_{k=1}^{n+1} ak := an+1 · ∏_{k=1}^{n} ak .
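These recursive definitions translate literally into recursive functions; a sketch (the 1-indexed sequence a_k is stored as a Python list, so a_k is `a[k-1]`):

```python
def rec_sum(a, n):
    """Sum a_1 + ... + a_n via the recursive definition."""
    if n == 1:
        return a[0]
    return a[n - 1] + rec_sum(a, n - 1)

def rec_prod(a, n):
    """Product a_1 * ... * a_n via the recursive definition."""
    if n == 1:
        return a[0]
    return a[n - 1] * rec_prod(a, n - 1)

a = [1, 2, 3, 4]
print(rec_sum(a, 4))    # 10
print(rec_prod(a, 4))   # 24
```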
Let us see how to prove by induction. As this is an often used and very structural proof, we
will use some short notation to highlight the corresponding parts. According to Theorem 1.45,
one needs to show that the statement is true for the first element, which we call the induction
basis, denoted by (IB). Then, by assuming the induction hypothesis (IH), that the assertion is
true for n, we prove that it is also true for n + 1, which we call the induction step (IS).
Let us discuss two examples to demonstrate this type of proof.
The first one is (at least without proof) known to many from school. The young Carl Friedrich
Gauß (1777–1855) knew it already by the age of nine.
Example 1.47. We prove the formula ∑_{k=1}^{n} k = n(n + 1)/2 (Gauss's summation
formula, "Gaußsche Summenformel"):
IB (n = 1): ∑_{k=1}^{1} k = 1 = 1(1 + 1)/2 is true.
IH: ∑_{k=1}^{n} k = n(n + 1)/2.
IS (n → n + 1):
∑_{k=1}^{n+1} k = ∑_{k=1}^{n} k + (n + 1) =(IH) n(n + 1)/2 + 2(n + 1)/2 = (n + 1)(n + 2)/2.
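The formula is easy to cross-check against the explicit sum for many values of n:

```python
def gauss(n):
    """Closed form n(n+1)/2 for the sum 1 + 2 + ... + n."""
    return n * (n + 1) // 2

# compare with the explicit sum for n = 1, ..., 199
print(all(sum(range(1, n + 1)) == gauss(n) for n in range(1, 200)))   # True
print(gauss(100))   # 5050, the sum the young Gauss is said to have computed
```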
Theorem 1.48 (Bernoulli's inequality). For all x ∈ R with x ≥ −1 and all n ∈ N we have
(1 + x)n ≥ 1 + nx.
Proof. Let x ∈ R such that x ≥ −1. We prove the statement for all n ∈ N by induction:
IB (n = 1): (1 + x) ≥ 1 + x is clearly true.
IH: (1 + x)n ≥ 1 + nx.
IS (n → n + 1): Since 1 + x ≥ 0, we get
(1 + x)n+1 = (1 + x)n (1 + x) ≥(IH) (1 + nx)(1 + x) = 1 + (n + 1)x + nx2 ≥ 1 + (n + 1)x.
This is what had to be shown.
Example 1.49. Try to prove Bernoulli's inequality with > instead of ≥. Can we use the
same assumption on x in this case?
Let us now turn to some combinatorial quantities. These quantities are heavily used when it
comes to discrete mathematics or elementary probability theory, as they represent the number
of permutations or subsets of certain size.
The factorial of n ∈ N is defined by n! := 1 · 2 · · · n. In addition, we set 0! := 1.
The binomial coefficient, written here as (n k) (we say “n choose k”), for n, k ∈ N0 with n ≥ k
is defined by
(n k) := (n · (n − 1) · (n − 2) · · · (n − k + 2) · (n − k + 1)) / k! = n! / (k! (n − k)!).
Clearly, we have (n 0) = 1, (n 1) = n and (n k) = (n n−k).
The factorial n! is the number of permutations (orderings) of a set of n elements: if you pick
your first element, you have n possibilities for your choice; for your second choice you have
only n − 1 options left, and so on.
The binomial coefficient (n k) is read as n choose k (or “n over k”; in German “n über k”). It
represents the number of ways to choose k unordered outcomes from a set of n possibilities, also
known as a combination. E.g., (n 2) is the number of two-element subsets of {1, . . . , n}. Clearly,
there are n(n − 1) ordered pairs (i, j) ∈ {1, . . . , n}2 with i ≠ j. As the ordering is irrelevant for
sets, we have n(n − 1)/2 two-element subsets, which coincides with (n 2).
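Both formulas for the binomial coefficient, and the subset-counting interpretation, can be checked with the standard library (`math.comb` and `itertools.combinations`):

```python
from itertools import combinations
from math import comb, factorial

n, k = 5, 2
print(comb(n, k))                                          # 10
print(factorial(n) // (factorial(k) * factorial(n - k)))   # 10, same value
# (n choose 2) counts the two-element subsets of {1, ..., n}:
print(len(list(combinations(range(1, n + 1), k))))         # 10
print(comb(n, k) == comb(n, n - k))                        # True, the symmetry
```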
Lemma 1.51 (Pascal's rule). For n, k ∈ N0 with n > k we have
(n k) + (n k+1) = (n+1 k+1).
Proof.
(n k) + (n k+1) = n(n − 1) · · · (n − k + 1)/k! + n(n − 1) · · · (n − k)/(k + 1)!
= n(n − 1) · · · (n − k + 1)/(k + 1)! · [(k + 1) + (n − k)]
= n(n − 1) · · · (n − k + 1)/(k + 1)! · (n + 1)
= (n + 1)n(n − 1) · · · (n − k + 1)/(k + 1)!
= (n+1 k+1).
The “simple” explanation of this equality is that subsets of {1, . . . , n + 1} with k + 1 elements
can be split into those that contain the number n + 1 and those that don't contain it. To find
the number of all sets that contain it, we need to count all k-element subsets of {1, . . . , n}, and
there are (n k) of them. The number of all sets that don't contain it is the same as the number
of all (k + 1)-element subsets of {1, . . . , n}, which is (n k+1). So the total number is their sum.
Based on this, we can prove one of the most famous theorems in mathematics.
Theorem 1.52 (Binomial theorem). For x, y ∈ R and n ∈ N we have
(x + y)n = ∑_{k=0}^{n} (n k) x^k y^{n−k} .
Proof. If we expand (x + y)n we get a sum of 2n expressions of the form z1 z2 · · · zn , where for all
i = 1, 2, . . . , n we either have zi = x or zi = y. Counting all summands where x occurs exactly k
times (and y therefore n − k times) gives us (n k). Hence there are exactly (n k) summands of the
form x^k y^{n−k} , which proves the theorem.
Example 1.53. As a good exercise try to prove the binomial theorem by induction using
Lemma 1.51.
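Both the binomial theorem and Pascal's rule (Lemma 1.51) can be spot-checked numerically:

```python
from math import comb

def binomial_expansion(x, y, n):
    """Sum of comb(n, k) * x^k * y^(n-k) for k = 0, ..., n."""
    return sum(comb(n, k) * x**k * y**(n - k) for k in range(n + 1))

x, y, n = 3, 2, 7
print(binomial_expansion(x, y, n) == (x + y) ** n)   # True
# Pascal's rule: (n k) + (n k+1) = (n+1 k+1)
print(comb(10, 4) + comb(10, 5) == comb(11, 5))      # True
```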
Example 1.55. It is very important to know the following special cases of the binomial theorem.
For x, y ∈ R we have
(x + y)2 = x2 + 2xy + y 2
and
(x + y)3 = x3 + 3x2 y + 3xy 2 + y 3 .
(Verify these formulas!)
The binomial theorem can also be used to (easily) improve upon Bernoulli's inequality: for
x ≥ 0 and n, k ∈ N with n ≥ k we have
(1 + x)n ≥ 1 + (n k) x^k .
This is a direct consequence of the binomial theorem, see Theorem 1.52 with y = 1: one
just leaves out some of the non-negative terms. Note that this is Bernoulli's inequality for k = 1.
However, note that we need x ≥ 0 here. (It is false, e.g., for x = −1 and n ≥ k = 2.)
In this section we want to briefly discuss some essential types of functions, which need to be
introduced for the sake of completeness. We start with the absolute value of a number, since
this function and its generalizations are heavily used throughout this lecture and beyond.
The absolute value of x ∈ R is defined by
\[
|x| := \begin{cases} x, & \text{if } x \ge 0, \\ -x, & \text{if } x < 0. \end{cases}
\]
Figure: The graph of y = |x|.
Note that this function maps R onto R≥0 := [0, ∞), i.e., the domain and range of | · | are R and
R≥0 , respectively. As, e.g., | − 1| = |1|, | · | is not injective on R, and hence also not bijective.
(Try to prove | − x| = |x| formally!)
1. |x| ≥ 0
2. |x| = 0 ⇐⇒ x = 0
3. |x| ≥ x
4. |x · y| = |x| · |y|
5. |x| ≤ z ⇐⇒ −z ≤ x ≤ z
Proof. We prove the lemma by using a case distinction. In fact, this is a good illustration of how to work with absolute values in general.
We only prove the first statement; the others work similarly and are left as an exercise.
Case 1: x ≥ 0. Then we have |x| = x ≥ 0.
Case 2: x < 0. Then we have |x| = −x > 0.
Absolute values appear very regularly, and it is essential to know how to work with them.
Mostly, we are interested in the set of all numbers, say x, that satisfy a certain inequality.
So for a given inequality, say |x − 1| < 2, we want to find all x ∈ R that satisfy it. The set of all
these x is called the solution set of the inequality, and is often denoted by L.
Example 1.58. We look for the solution set L of the following inequality (in R):
2|x + 3| − 4|x − 1| ≥ 8x − 2.
We have to distinguish three regions: x ≤ −3, −3 < x ≤ 1 and 1 < x, since the expressions in the absolute values change their signs at these points.
Case 1 (x ≤ −3): 2|x+3| − 4|x−1| = 2(−x−3) − 4(−x+1) = −2x − 6 + 4x − 4 = 2x − 10 ≥ 8x − 2 ⇐⇒ x ≤ −4/3.
We therefore obtain: L₁ := {x ∈ R : x ≤ −3, x ≤ −4/3} = (−∞, −3].
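Completing Cases 2 and 3 in the same way yields L₂ = (−3, 1] and L₃ = (1, 6/5], so the full solution set is L = (−∞, 6/5]. A quick sanity check in Python (`satisfies` is our helper; sampling cannot replace the case distinction, but it catches algebra slips — exact rationals avoid rounding issues at the boundary):

```python
from fractions import Fraction

def satisfies(x):
    """Check whether x satisfies 2|x+3| - 4|x-1| >= 8x - 2."""
    return 2 * abs(x + 3) - 4 * abs(x - 1) >= 8 * x - 2

# Samples from L1 = (-inf, -3] must satisfy the inequality:
assert all(satisfies(x) for x in [-100, -10, Fraction(-7, 2), -3])
# The boundary 6/5 of the full solution set is included, points beyond are not:
assert satisfies(Fraction(6, 5)) and not satisfies(Fraction(13, 10))
```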
The following inequality, which is probably the most well known one, is at the same time the
most important (also in more general scenarios). You will not use any other inequality more
often than this one.
|x + y| ≤ |x| + |y|.
|a + b| − |b| ≤ |a|.
For a = x − y and b = −x we get
|y| − |x| ≤ |x − y|,
and by exchanging the roles of x and y also |x| − |y| ≤ |x − y|. The result follows from ||x| − |y|| = max{|x| − |y|, |y| − |x|}.
We now turn to some other elementary functions. Note that it is not necessary to understand
all of these definitions completely yet. We just state them here as a reference for later.
a) Very simple functions are affine linear functions, which are defined as
f: R→R
x 7→ ax + b
with a, b ∈ R. If a = 0, f is called a constant function. If a ≠ 0 and b = 0, we call f a linear function. Note that linear functions satisfy f (x + y) = f (x) + f (y).
(Is the same true for affine linear functions?)
Moreover, they are special cases of polynomial functions, which are defined by
\[
p\colon \mathbb{R} \to \mathbb{R}, \qquad x \mapsto \sum_{i=0}^{n} a_i x^i,
\]
with a₀, . . . , aₙ ∈ R.
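Such polynomial functions can be evaluated efficiently with Horner's scheme. A small Python sketch (`poly_eval` is a name we introduce here):

```python
def poly_eval(coeffs, x):
    """Evaluate p(x) = a_0 + a_1*x + ... + a_n*x**n by Horner's scheme;
    coeffs = [a_0, a_1, ..., a_n]."""
    result = 0
    for a in reversed(coeffs):
        result = result * x + a
    return result

# p(x) = 1 + 2x + 3x^2 at x = 2: 1 + 4 + 12 = 17
assert poly_eval([1, 2, 3], 2) == 17
# An affine linear function f(x) = a*x + b is the special case coeffs = [b, a]:
a, b = 5, 7
assert poly_eval([b, a], 3) == a * 3 + b
```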
b) Power functions are defined as
f : R+ → R, x ↦ x^a,
for some fixed a ∈ R and R+ := [0, ∞), see Figure 11. (In short: f(x) = x^a for x > 0.)
Figure 11: The graph of f(x) = x^a for a = 8, 2, 1, 1/2, 1/8 (left) and a = −1/8, −1/2, −1, −2, −8 (right).
Let us say some words about how these expressions are precisely defined. Since we know how to multiply two or more numbers, we may clearly define the power functions with integer exponents, i.e., x^n with n ∈ Z, even for all x ∈ R \ {0}, as in the polynomials above. (We only need to exclude x = 0 for negative exponents, because we cannot divide by 0.) Moreover, since we know how a root of a positive number is defined, we can also define, e.g., √2 = 2^{1/2} or 3^{1/8}. (3^{1/8} is just the number z > 0 with z^8 = 3.) With this, we can define x^{n/m} = (x^n)^{1/m} for every x > 0, n ∈ Z and m ∈ N. Hence, we know how to define x^q for every x > 0 and q ∈ Q.
But what about 4^{√2} or π^π? (Think a bit how you would calculate/define these numbers!)
In this case, i.e., if we consider x^a for some a ∈ R \ Q, the most natural way is to define them by using supremum or infimum. Taking the monotonicity into account (see Figure 11), we define for x > 0 the powers
\[
x^a := \begin{cases} \sup\{x^q : q \in \mathbb{Q},\ q < a\}, & \text{if } x \ge 1, \\ \inf\{x^q : q \in \mathbb{Q},\ q > a\}, & \text{if } 0 < x < 1. \end{cases} \tag{1.1}
\]
Note that we know how to compute the expressions inside sup/inf, and that sup/inf always exist as the sets are bounded, due to the Completeness axiom.
With this, we get the well-known calculation rules for arbitrary x, y > 0 and a, b ∈ R:
\[
x^{-a} = \frac{1}{x^a}, \qquad x^a \cdot x^b = x^{a+b}, \qquad \frac{x^a}{x^b} = x^{a-b},
\]
\[
(x^a)^b = x^{a\cdot b}, \qquad (x \cdot y)^a = x^a \cdot y^a, \qquad \left(\frac{x}{y}\right)^a = \frac{x^a}{y^a}.
\]
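Definition (1.1) is constructive: for x ≥ 1 we can approximate x^a from below by rational exponents q < a. A Python sketch of this idea (the helper name `power_via_sup` is ours; floating-point numbers stand in for exact rationals), followed by a numerical check of two of the calculation rules:

```python
from fractions import Fraction

def power_via_sup(x, a, digits=20):
    """Approximate x**a for x >= 1 along definition (1.1):
    x**a = sup{x**q : q rational, q < a}, using decimal truncations q <= a."""
    assert x >= 1
    best = 0.0
    for k in range(1, digits + 1):
        q = Fraction(int(a * 10**k), 10**k)   # rational lower bound of a
        best = max(best, x ** float(q))
    return best

sqrt2 = 2 ** 0.5
assert abs(power_via_sup(3.0, sqrt2) - 3.0 ** sqrt2) < 1e-9
# Two of the calculation rules, checked numerically:
x, a, b = 2.5, 1.3, -0.7
assert abs(x**a * x**b - x**(a + b)) < 1e-12
assert abs((x**a)**b - x**(a * b)) < 1e-12
```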
However, note that there is –and there will be– no natural and satisfying way to define
arbitrary powers of a negative number, like (−2)π . We do not go into detail here.
c) Other well-known functions are the exponential functions, which characterise rapid
growth or decay. Exponential functions are defined by
f : R → R+ , x 7→ ax ,
38
where the base a > 0 is a fixed parameter. That is, in contrast to power functions, the variable x ∈ R is the exponent.
Figure 12: The graph of f(x) = a^x for a > 1 and a < 1.
Note that ax is defined for all x ∈ R. For a precise definition, we use again supremum and
infimum, see (1.1). That is, for a ≥ 1 we set ax := sup{aq : q ∈ Q, q < x}. Together with
ax = ( a1 )−x for 0 < a < 1, this defines ax for all a > 0 and x ∈ R.
Among all exponential functions, one is particularly important. This is when we choose the base a = e ≈ 2.7182818284, where e is Euler's number. This special (irrational) number e may be defined by different means:
\[
e = \sup\left\{\left(1 + \frac{1}{n}\right)^n : n \in \mathbb{N}\right\} = \lim_{n\to\infty} \left(1 + \frac{1}{n}\right)^n = \sum_{k=0}^{\infty} \frac{1}{k!}.
\]
(Do not panic yet! We discuss the meaning of these expressions later.)
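Both characterisations of e can already be explored numerically. A small Python sketch comparing the (slow) limit with the (fast) series against `math.e`:

```python
from math import e, factorial

# e as the limit of (1 + 1/n)^n: the sequence increases towards e, but slowly.
seq = [(1 + 1 / n) ** n for n in (1, 10, 100, 10_000, 1_000_000)]
assert all(s < t for s, t in zip(seq, seq[1:]))   # monotonically increasing
assert abs(seq[-1] - e) < 1e-5                    # only ~6 digits at n = 10^6

# e as the series over 1/k!: 15 terms already give ~12 correct digits.
series = sum(1 / factorial(k) for k in range(15))
assert abs(series - e) < 1e-11
```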
d) The exponential functions often appear together with the logarithmic functions, which are their inverses:
f : R+ → R, x ↦ log_b(x).
Read: logarithm of x to base b. Here, we usually assume that b > 1, but in principle one may also consider b ∈ (0, 1).
Figure 13: The graph of the function f(x) = log_b(x) for b = 2, e and 10.
Figure 14 shows the ’geometric definition’ of these functions in the unit circle.
The variable x ∈ R corresponds to the angle that is enclosed between the horizontal axis and the line to the point (cos(x), sin(x)). It can therefore be given in degrees (deg) in the range from 0° to 360°. However, it is more convenient in science to interpret x as the arc
length of the part of the unit circle that is contained between the lines. The appropriate
unit is called radians (rad). Since the length of the unit circle is given by 2π, radians can be obtained from the degrees by
x = (deg/180) · π.
The functions sin x and cos x are defined for all x ∈ R, see Figure 15.
Figure 15: The graphs of sin x and cos x.
Using some elementary geometry on the above illustration we obtain the calculation rules:
 ϕ      |  0     π/6    π/4    π/3    π/2    2π/3    3π/4    5π/6    π
 sin(ϕ) |  √0/2  √1/2   √2/2   √3/2   √4/2   √3/2    √2/2    √1/2    √0/2
 cos(ϕ) |  √4/2  √3/2   √2/2   √1/2   √0/2   −√1/2   −√2/2   −√3/2   −√4/2
The corresponding values in (π, 2π) can be obtained from sin(x + π) = − sin(x) and
cos(x + π) = − cos(x).
Moreover, we have the very important trigonometric identity
sin2 x + cos2 x = 1,
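Both the table values and this identity can be verified numerically. A Python sketch using the standard `math` module:

```python
from math import sin, cos, sqrt, pi, isclose

# The table values follow the pattern sin(phi) = sqrt(k)/2, cos(phi) = sqrt(4-k)/2
# for k = 0, 1, 2, 3, 4 at phi = 0, pi/6, pi/4, pi/3, pi/2:
for k, phi in enumerate([0, pi / 6, pi / 4, pi / 3, pi / 2]):
    assert isclose(sin(phi), sqrt(k) / 2, abs_tol=1e-12)
    assert isclose(cos(phi), sqrt(4 - k) / 2, abs_tol=1e-12)

# The trigonometric identity sin^2 x + cos^2 x = 1 at arbitrary points:
for x in (-3.7, 0.1, 2.0, 123.456):
    assert isclose(sin(x) ** 2 + cos(x) ** 2, 1.0)
```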
(It is not necessary to memorize these formulas, but it is important to know where to find them.)
Figure 16: The graphs of tan x and cot x.
Additionally, we can define the inverse trigonometric functions. For this, first note that sin and tan are increasing on [−π/2, π/2] and (−π/2, π/2), respectively, and cos is decreasing on [0, π], see Figures 15 and 16. In particular, all three functions are bijective on these intervals (i.e., every value of sin(x), cos(x) and tan(x) is achieved by exactly one x in the interval), and therefore have inverse functions, see Theorem 1.16.
Due to their importance, the inverse trigonometric functions have their own names. Namely, the arcsine arcsin := sin⁻¹ : [−1, 1] → [−π/2, π/2], the arccosine arccos := cos⁻¹ : [−1, 1] → [0, π], and the arctangent arctan := tan⁻¹ : R → (−π/2, π/2). These functions are defined by
y = arcsin x :⇐⇒ x = sin y and y ∈ [−π/2, π/2],
y = arccos x :⇐⇒ x = cos y and y ∈ [0, π],
y = arctan x :⇐⇒ x = tan y and y ∈ (−π/2, π/2).
Figure 17: The graphs of arcsin x, arccos x and arctan x.
At a certain point in mathematics we are not able to continue with real numbers anymore. This
point is reached, once we want to find a solution of the equation x2 + 1 = 0. If we introduce a
new element
i := √−1,
then the equation has the two solutions ±i. This extension leads us to the (field of) complex
numbers.
C := {z = x + iy : x, y ∈ R}.
For a complex number z = x + iy we call x the real part of z and y the imaginary part.
We write Re z = x and Im z = y.
For a complex number z = x + iy, its complex conjugate is defined by
z̄ := x − iy.
Moreover, for z = x + iy and w = u + iv we have:
• z + w = (x + u) + i(y + v)
Example 1.64. Recall that i is just a symbol for √−1. For example, we have
√−4 = √((−1) · 4) = √4 · √−1 = 2i.
Formally, one can identify the complex numbers C with a tuple (x, y) of real numbers. Each
complex number z = x + iy can, therefore, be illustrated as a point in the plane, which is called
the complex plane. The coordinate axes are called real and imaginary axis.
(Exercise: Think about where to draw the complex conjugate z in the complex plane.)
We define the absolute value of a complex number z = x + iy to be the length of the straight
line from (0, 0) to (x, y).
Several calculation rules for the absolute value are just the same as for the absolute value on R.
(That’s why we use the same name.)
1. |z| ≥ 0
2. |z| = 0 ⇔ z = 0
3. |z| ≥ |Re z|
4. |z| ≥ |Im z|
5. |zw| = |z||w|
|z + w| ≤ |z| + |w|,
with equality if and only if zw̄ ≥ 0. (Note that zw̄ ≥ 0 means, in particular, that zw̄ ∈ R.)
Moreover, we have
|z| − |w| ≤ |z − w|.
|x + y| − |y| ≤ |x|.
|z| − |w| ≤ |z + w|
Let us give a summary about the representation and visualization of complex numbers in the
complex plane:
Instead of using the cartesian coordinates (x, y) for a complex number z = x + iy, one can also
switch to the so called polar coordinates (r, ϕ). The following relations hold between them
(see Figures 15 and 18):
r := |z| := √(x² + y²) = √((Re z)² + (Im z)²) = √(z z̄),
x := |z| cos ϕ,
y := |z| sin ϕ.
Here, we use that every point on the circle {(x, y) : x² + y² = 1} can be written as (cos ϕ, sin ϕ) for some ϕ ∈ R. Actually, one can find such a ϕ ∈ [0, 2π), and this must satisfy tan ϕ = y/x (if x ≠ 0).
Therefore, we can write z in the trigonometric or polar form:
We call r = |z| the radius, and ϕ the argument of the complex number z.
The following question arises: under which condition is the representation (r, ϕ) unique?
Answer: it is unique if |z| ≠ 0 and we assume ϕ ∈ [0, 2π) (or any other interval of length 2π).
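In Python, the standard `cmath` module performs exactly this conversion between cartesian and polar coordinates (a small sketch):

```python
import cmath
from math import cos, sin, isclose

z = 3 + 4j
r, phi = cmath.polar(z)      # r = |z|, phi = the argument (here in (-pi, pi])
assert isclose(r, 5.0)       # sqrt(3**2 + 4**2)
assert isclose(z.real, r * cos(phi)) and isclose(z.imag, r * sin(phi))
# cmath.rect converts back from polar to cartesian coordinates:
assert isclose(abs(cmath.rect(r, phi) - z), 0.0, abs_tol=1e-12)
```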
An easy and useful way of writing complex numbers can be obtained by using Euler's formula
e^{iϕ} = cos(ϕ) + i sin(ϕ).
Note that e^{iϕ} can at the moment only be understood as a symbol for the right-hand side above, and it is already useful as such. However, we will see later that this is really an equation, i.e., we will also define e^z for z ∈ C and show the above.
To work with this formula, it is essential to mind the following important values:
 ϕ      |  0     π/6    π/4    π/3    π/2    2π/3    3π/4    5π/6    π
 sin(ϕ) |  √0/2  √1/2   √2/2   √3/2   √4/2   √3/2    √2/2    √1/2    √0/2
 cos(ϕ) |  √4/2  √3/2   √2/2   √1/2   √0/2   −√1/2   −√2/2   −√3/2   −√4/2
Using these values of the trigonometric functions at the points π/2, π, 3π/2, 2π we obtain
e^{iπ/2} = i,   e^{iπ} = −1,   e^{i3π/2} = −i,   e^{i2π} = e^{i0} = 1.
Additionally, we obtain from these values, together with the periodicity of the trigonometric
functions (or standard calculation rules for exponentials), that for k ∈ Z,
\[
e^{ik\pi} = \begin{cases} 1, & \text{if } k \text{ is even}, \\ -1, & \text{if } k \text{ is odd}. \end{cases}
\]
(Verify that!)
Using Euler's formula we can write every complex number (in its polar form) as
z = re^{iϕ}.
This representation is particularly useful when it comes to multiplication and powers of complex numbers. Given two complex numbers
z = re^{iϕ} and w = se^{iψ},
we obtain
zw = re^{iϕ} · se^{iψ} = rs · e^{i(ϕ+ψ)}.
From this formula (with z = w) we obtain by induction de Moivre's formula for powers z^n of z = r(cos ϕ + i sin ϕ) = re^{iϕ} with n ∈ N:
z^n = r^n (cos(nϕ) + i sin(nϕ)) = r^n e^{inϕ}.
With Euler's formula one may also write this in a more compact way:
(1 + i)^42 = (√2 e^{πi/4})^42 = 2^{21} e^{42πi/4} = 2^{21} e^{10πi} e^{πi/2} = 2^{21} i.
Caution: The above consideration only holds for integer exponents. (Why also for negative
integers?) More general exponents need more care and will not be discussed here.
Moreover, the polar form is also very useful for theoretical purposes.
Example 1.69. Show that |z + w| = |z| + |w| for z, w ∈ C if and only if z and w have the same
argument. (Hint: Consider Lemma 1.67 together with the polar form of z and w.)
It is an important result in mathematics, that complex numbers are really all we need to solve
polynomial equations. In fact, the Fundamental theorem of algebra (in one of its variants) even
gives the precise answer for the number of solutions of polynomial equations. A proof of this
result is beyond the scope of this course.
Note that it is a hard problem in general to find the zeros of a given polynomial of high degree.
For such tasks, one usually employs numerical software which, in most cases, can only output
approximations of the zeros.
with d being called the dimension. (C^d denotes the d-fold Cartesian product of C.)
Note that d-dimensional vectors v ∈ C^d consist of the d components v₁, . . . , v_d ∈ C.
We define the addition and scalar multiplication of vectors component-wise. That is, for two vectors u = (u_i)_{i=1}^d, v = (v_i)_{i=1}^d ∈ C^d and a number λ ∈ C, we define
u + v := (u₁ + v₁, u₂ + v₂, . . . , u_d + v_d) and λ · v := (λv₁, λv₂, . . . , λv_d).
Note that it is important for vector addition that the vectors have the same dimension. If the
dimensions of the two vectors do not agree, then their sum is not defined.
(The term “scalar” is used to distinguish this multiplication from the others discussed below.)
Remark 1.72. Here we use the field of complex numbers C to ’build’ complex vectors v ∈ Cd .
Note that we can easily also consider only real vectors v ∈ Rd , as real vectors are special
complex vectors. However, since all definitions here work directly in the complex case, and this
will be needed later, we define it in the more general context and comment on necessary changes
when needed. Moreover, note that we could define vectors, and the corresponding operations,
in a much more general context, as long as the operations for the components are well-defined.
This will be discussed much later.
In analogy to the real and complex numbers above, we want to define (and use) a quantity that
allows for measuring ’how large’ a vector is.
Let us begin with the most important choice for such a quantity:
Moreover, for u = (u_i)_{i=1}^d, v = (v_i)_{i=1}^d ∈ C^d we define the inner product (or dot product) of u and v by
\[
\langle u, v \rangle := \sum_{i=1}^{d} u_i \overline{v_i}.
\]
Remark 1.74. We use the subscript ’2’ here, because we will study also other norms later.
Note that some authors use the notation |·| also for the Euclidean norm, which shows its role as
generalization of the absolute value.
From now on, we use x, y, z also as symbols for vectors for notational convenience.
The inner product is (formally) a mapping ⟨·, ·⟩ : C^d × C^d → C, and we have the following properties for all x, y, z ∈ C^d and λ ∈ C:
and
⟨x, λy + µz⟩ = λ̄⟨x, y⟩ + µ̄⟨x, z⟩
for all x, y, z ∈ C^d and λ, µ ∈ C. (Verify this!) Note that we need to take the complex conjugate when we take a scalar out of the 'second input'. One says, the inner product is sesquilinear.
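A minimal Python sketch of these definitions (the helper names `inner` and `norm2` are ours), checking sesquilinearity on a small example with exact integer components:

```python
def inner(u, v):
    """<u, v> = sum of u_i * conj(v_i), as in the definition above."""
    assert len(u) == len(v)
    return sum(a * b.conjugate() for a, b in zip(u, v))

def norm2(v):
    """Euclidean norm ||v||_2 = sqrt(<v, v>)."""
    return inner(v, v).real ** 0.5

u = [1 + 2j, 3 - 1j]
v = [2 + 0j, 1 + 1j]
lam = 2 - 3j
# Linear in the first argument, conjugate-linear in the second (sesquilinear):
assert inner([lam * a for a in u], v) == lam * inner(u, v)
assert inner(u, [lam * b for b in v]) == lam.conjugate() * inner(u, v)
# <v, v> is real and non-negative, and its square root is the norm:
assert norm2([3, 4]) == 5.0
```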
The first means that the norm of v is zero if and only if all components of v are zero (which is
called definiteness), and the second property is called homogeneity. Both properties were also
clear for the absolute value of a real/complex number, which is also covered by the above in the
case d = 1. However, there is a third property that is essential for the upcoming considerations.
Namely, the triangle inequality, see also Theorem 1.59 and Lemma 1.66.
Moreover, the equality kx + yk2 = kxk2 + kyk2 holds if and only if y = α · x for some α ≥ 0.
Note that in the case that d = 2 or d = 3 the Euclidean norm coincides with the usual intuition
of the ’length’ between 0 and the point x, i.e., k · k2 is the length of the ’direct way’ between 0
and x. This makes the triangle inequality appear to be an obvious statement. However, when
it comes to d > 3 this might not be that clear, and therefore we present a proof below.
For this, we first need another inequality, which is also of independent interest. The Cauchy-
Schwarz (CS) inequality gives a relation between the inner product and the Euclidean norm.
Moreover, we have the equality |hx, yi| = kxk2 kyk2 if and only if y = c · x for some c ∈ C.
Remark 1.77. The choice of the Euclidean norm for the following analysis is quite arbitrary,
especially when it comes to non-geometrical applications, and there are sometimes other natural
choices. We will discuss some other possibilities soon. However, although the triangle inequality
is an essential property also for ’other norms’, the (sometimes very helpful) Cauchy-Schwarz
inequality is special to the Euclidean norm. Therefore, we usually work with this norm.
Proof. Let us prove the second statement. The first follows by squaring both sides.
First, if either x = 0 or y = 0, then the statement is clearly true. Otherwise, we use the inequality ab ≤ (a² + b²)/2, which holds for any real numbers a, b, and follows from (a − b)² ≥ 0. (Verify this!)
We define
a_i := |x_i| / ‖x‖₂ and b_i := |y_i| / ‖y‖₂.
Observe that, by definition,
Σ_{i=1}^d a_i² = Σ_{i=1}^d b_i² = 1,
(Note that the first inequality was just the triangle inequality for sums of numbers.)
To obtain equality, both inequalities used above need to be equalities. Equality in the first is equivalent to x_i ȳ_i having the same argument for all i, see Example 1.69. Equality in the second is equivalent to a_i = b_i for all i (Check that!), which means |x_i| = r|y_i| for all i and some r > 0. Since we fixed argument and absolute value, we obtain equality if and only if x = c · y for some c ∈ C.
The CS inequality can now be used to prove the triangle inequality for k·k2 .
To obtain equality, note that Lemma 1.76 shows that we need y = c · x for some c ∈ C to obtain
equality in the second inequality. Using this, we need Re (c) = |c| to achieve equality in the first
inequality above. This holds iff c ≥ 0.
We finally introduce some particular important vectors which will appear very often in the
upcoming considerations. These are the unit vectors e1 , . . . , ed ∈ Rd , where ek is the vector
which is zero except for the k-th entry, which is one. That is,
e₁ = (1, 0, . . . , 0)ᵀ, e₂ = (0, 1, . . . , 0)ᵀ, . . . , e_d = (0, 0, . . . , 1)ᵀ.
Note that they all have norm 1, independent of which norm above one chooses.
One important property of the unit vectors is that they can be used to represent arbitrary vectors. For this, note that λ · e_k with λ ∈ R is the vector with λ in the k-th entry, and zero elsewhere. It is therefore easy to see that
\[
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} = \sum_{k=1}^{d} x_k \cdot e_k \in \mathbb{R}^d.
\]
One might guess that working with a representation as given on the right can be quite useful.
Remark 1.78. Although we presented such a representation only with respect to the unit
vectors e1 , . . . , ed , we will see later that this works also with other vectors. A set of vectors that
can be used to represent arbitrary vectors in a unique way as above is called a basis of Rd . In
this context, the set {e1 , . . . , ed } is called the standard basis of Rd , and the numbers xk are
called coordinates of x with respect to {e1 , . . . , ed }. It is important to note again that the
xk ∈ R are just numbers, and only the ek ∈ Rd are vectors (i.e., elements from Rd ). We will
discuss this (and the related concept of a vector space) later in more detail.
a11 · x1 + a12 · x2 = b1 ,
a21 · x1 + a22 · x2 = b2
are fulfilled for some given a₁₁, a₁₂, a₂₁, a₂₂, b₁, b₂ ∈ R. (Note that we use subscripts a_{kℓ} in our notation to keep track of 'where' a coefficient appears in the system of equations.)
As before, we see that the first equation is equivalent to
x₁ = (b₁ − a₁₂ · x₂) / a₁₁, if a₁₁ ≠ 0.
This can be put into the second equation, which then only depends on the unknown x2 , and
can therefore (potentially) be solved as in the one dimensional case. In this way, we may find
a unique solution for x2 which then implies a solution for x1 by the equation above. By this
procedure, we can find solutions to linear equations, also for larger systems.
However, it is not clear if there really is a (unique) solution for x1 and x2 for every choice of
a11 , a12 , a21 , a22 , b1 , b2 ∈ R. There might also be no or infinitely many solutions (as for a single
equation with a = 0).
For demonstration, let us discuss a specific example:
Consider the system of linear equations
2x1 + x2 = 1,
6x1 + 3x2 = 2.
The first equation is equivalent to x2 = 1 − 2x1 . If we put this into the second equation, we
obtain 6x1 + 3 − 6x1 = 2 which is never fulfilled. That is, there is no solution to this system of
equations. However, if we change the system to
2x1 + x2 = 1,
6x1 + 3x2 = 3,
then the first equation is still equivalent to x2 = 1 − 2x1 . But, after putting this into the second
equation, we have 6x1 + 3 − 6x1 = 3 which holds for every x1 ∈ R. Therefore, the above system
is solved by all (x1 , x2 ) ∈ R2 with 2x1 + x2 = 1, e.g., for (x1 , x2 ) = (0, 1) or (x1 , x2 ) = (1, −1).
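The substitution argument for both systems can be replayed in a few lines of Python (`second_eq_residual` is a name we introduce; it measures by how much the second equation is violated):

```python
def second_eq_residual(x1, rhs):
    """Residual of 6*x1 + 3*x2 = rhs after substituting x2 = 1 - 2*x1
    from the first equation."""
    x2 = 1 - 2 * x1
    return 6 * x1 + 3 * x2 - rhs

# rhs = 2: the residual is 1 for every x1, so there is no solution.
assert all(second_eq_residual(x1, 2) == 1 for x1 in (-5, 0, 3, 17))
# rhs = 3: the residual vanishes for every x1, so there are infinitely many.
assert all(second_eq_residual(x1, 3) == 0 for x1 in (-5, 0, 3, 17))
```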
This shows that such systems of equations might be quite sensitive to small changes of the
parameters (and this was just a two dimensional example). It is therefore desirable to have
criteria for a given system of equations to be (uniquely) solvable that can be checked more
easily and before we start trying to calculate a solution.
Moreover, this procedure is useful only for ’small’ systems of equations, say with at most 3
unknowns. It is rather impractical for larger systems, and there are faster methods to solve
such systems by hand (and with a computer). This is particularly important since modern
applications are usually ’high dimensional’.
The most convenient way to formally work with linear systems is to introduce matrices, which
is just a way of writing numbers in an array similarly to vectors. For example, we will write
the above system of equations by Ax = b, where the matrix A ∈ R^{2×2} (meaning that it is a 2 × 2 array of real numbers) and the vector b ∈ R² are given by
\[
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \qquad \text{and} \qquad b = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}.
\]
We discuss shortly how the operation Ax, i.e., matrix-vector multiplication, is precisely defined,
and that a system of equations may then be written by Ax = b. This now looks almost like the
one dimensional equation, and one might like to just “divide by A if it is not zero” to obtain
a solution. Unfortunately, that’s not (always) so easy, and we have to be careful with what it
means that "A ≠ 0" or to divide by a matrix.
However, we will see that matrices are the ’right’ tool to work with such problems, and we
will introduce operations and calculation rules for matrices, which will enable us to transform
systems of equations to others which might be easier to solve. By this, we introduce techniques
to solve also large systems of equations in a straight-forward way.
2.1 Matrices
In this case we use the notation A ∈ Rm×n , and call m and n the dimensions of A.
An m × 1 matrix is called a column vector, a 1 × n matrix is called a row vector, and if m = n, then the matrix is called quadratic, or a square matrix.
Remark 2.2. We mostly consider only matrices of real numbers here. The case of complex
matrices, i.e., aij ∈ C, can be treated analogously, and we write A ∈ Cm×n . We will comment
on differences of the complex case if needed.
Remark 2.3. The index notation (a_{ij})_{i,j=1}^{m,n} just means that we consider i = 1, . . . , m and j = 1, . . . , n. Some authors use (a_{ij})_{i=1,j=1}^{m,n} or even (a_{ij})_{i=1,...,m, j=1,...,n} for the same.
We now turn to basic operations of matrices. The first two, namely scalar multiplication and
matrix addition, are very easy (and familiar from the corresponding operations for vectors).
These operations are component-wise operations, meaning that they are performed in each
entry of the matrices individually.
The matrix addition of two m × n-matrices A = (a_{ij})_{i,j=1}^{m,n} and B = (b_{ij})_{i,j=1}^{m,n} is defined by component-wise addition, i.e.,
\[
A + B = (a_{ij} + b_{ij})_{i,j=1}^{m,n} = \begin{pmatrix} a_{11}+b_{11} & \cdots & a_{1n}+b_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1}+b_{m1} & \cdots & a_{mn}+b_{mn} \end{pmatrix}.
\]
Note that it is important here that the matrices have the same dimension. If the dimensions of
the two matrices do not agree, then their sum is not defined.
Similarly, the scalar multiplication of λ ∈ R and A is defined component-wise, i.e.,
\[
\lambda \cdot A = (\lambda a_{ij})_{i,j=1}^{m,n} = \begin{pmatrix} \lambda a_{11} & \cdots & \lambda a_{1n} \\ \vdots & \ddots & \vdots \\ \lambda a_{m1} & \cdots & \lambda a_{mn} \end{pmatrix}.
\]
(The term “scalar” is used to distinguish this from the matrix product discussed below.)
The third basic operation is the matrix product, i.e., the product of two matrices. In contrast
to the last operations, the product of two matrices is not defined component-wise:
Given an m × n-matrix A = (a_{ij})_{i,j=1}^{m,n} and an n × p-matrix B = (b_{ij})_{i,j=1}^{n,p}, we define the m × p-matrix C = (c_{ij})_{i,j=1}^{m,p} with
\[
c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj},
\]
as the product of A and B, i.e., C = AB. This definition shows that the ij-th entry of the
product C, i.e., cij , depends only on the i-th row of A and the j-th column of B. One may mind
this rule by “cij is the product of the i-th row of A with the j-th column of B”.
To make this more precise, note that a matrix A ∈ R^{m×n} has m rows of length n, or n columns of length m. So, let us assume that a matrix A has the rows a₁, a₂, . . . , a_m ∈ R^{1×n} and columns c₁, c₂, . . . , c_n ∈ R^{m×1} = R^m. We use the notation
\[
A = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix} = \begin{pmatrix} c_1, c_2, \ldots, c_n \end{pmatrix}. \tag{2.1}
\]
Note that ak and ck are not numbers, but vectors. However, the notation is consistent in the
sense that a matrix can be seen as a row vector consisting of column vectors, and vice versa.
Remark 2.4. It does not matter if we put commas in A = c1 , c2 , . . . , cn or not. Both notations
are used, and there is actually no room for misunderstanding once we specified clearly what the
entries “ck ” are.
That is, the ij-th entry of AB is the inner product of the i-th row of A with the j-th column of
B, as stated above.
(Note that all inner dimensions for the involved matrix(-vector)-products agree.)
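The rule "c_ij is the product of the i-th row of A with the j-th column of B" translates directly into code. A plain-Python sketch (`matmul` is our name; matrices are given as lists of rows, and the values of A and B below are our own example, not the one from the text):

```python
def matmul(A, B):
    """C = AB with c_ij = sum_k a_ik * b_kj; matrices as lists of rows.
    A must be m x n and B must be n x p (inner dimensions agree)."""
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[1, 2, 3],
     [6, 5, 4]]       # 2 x 3
B = [[1, 6],
     [2, 5],
     [3, 4]]          # 3 x 2
assert matmul(A, B) == [[14, 28], [28, 77]]                        # 2 x 2
# BA is defined as well, but it is 3 x 3 -- so AB != BA in general:
assert matmul(B, A) == [[37, 32, 27], [32, 29, 26], [27, 26, 25]]
```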
Note that in this case, also the matrix BA is defined. However, the product BA is a 2 × 2-matrix. Namely,
\[
BA = \begin{pmatrix} 28 & 91 \\ 8 & 48 \end{pmatrix},
\]
and there is no obvious relation between AB and BA.
Note that the matrix product is only defined if the inner dimensions agree. That is, if we want to multiply an m × p-matrix A ∈ R^{m×p} and a q × n-matrix B ∈ R^{q×n}, then we need that p = q. Otherwise, the product is not defined. Note that this implies that we might define the product AB of two matrices A and B, while the "reverse" product BA is not defined. Consider, e.g., the case A ∈ R^{m×n} and B ∈ R^{n×n} with m ≠ n.
The following rules for calculation, which may remind one of the respective rules for numbers, follow easily from the definition:
However, even if A, B ∈ R^{n×n}, i.e., A and B are quadratic so that both AB and BA are defined, we do not have in general that AB = BA. That is, matrix multiplication is not commutative.
Moreover, there are identity elements for these operations, i.e., there are matrices such that addition/multiplication with them does not change the other matrix. This corresponds to 0 and 1 for real and complex numbers.
For this, let us introduce the following special matrices.
For m, n ∈ N, we define the zero matrix 0_{mn} ∈ R^{m×n} as the matrix whose entries are all zero, and the identity matrix I_n := (δ_{ij})_{i,j=1}^n ∈ R^{n×n}, where δ_{ij} := 1 if i = j and δ_{ij} := 0 otherwise (the Kronecker delta).
Note that the identity is a square matrix, and we may write I := In if the dimension is clear.
Let us show formally that the identity matrix is an identity for matrix multiplication, see the
field axioms (Axiom 1). However, note that we need a different dimension of the identities if
the ’other matrix’ is not quadratic.
Example 2.6. Let In ∈ Rn×n and Im ∈ Rm×m be the identity matrices of the given dimensions,
and let A ∈ Rm×n be an arbitrary m × n-matrix. Let us check that
Im · A = A · In = A.
For this, we compute the ij-th entry of I_m · A, which we call (I_m A)_{ij}. By definition,
\[
(I_m A)_{ij} = \sum_{k=1}^{m} \delta_{ik} a_{kj},
\]
where the Kronecker delta δ_{ik} is zero if k ≠ i. Thus the sum reduces to only one term, which is δ_{ii} a_{ij} = a_{ij}. This yields that the ij-th entry of the matrix product I_m · A is a_{ij}, i.e., I_m · A = A.
A similar calculation yields that A · In = A.
Note that for a quadratic matrix A ∈ Rn×n , we have
In · A = A · In = A.
Recall that the unit vectors e_k = (δ_{ik})_{i=1}^n ∈ R^n, k = 1, . . . , n, are the (column) vectors that contain exactly one 1 and all other entries are zero. Let us also define the row unit vectors e_kᵀ ∈ R^{1×n}, i.e., e₁ᵀ := (1, 0, . . . , 0), e₂ᵀ := (0, 1, . . . , 0) and so on. With them, we can write the identity as
\[
I_n = \begin{pmatrix} e_1, e_2, \ldots, e_n \end{pmatrix} = \begin{pmatrix} e_1^T \\ e_2^T \\ \vdots \\ e_n^T \end{pmatrix},
\]
i.e., e_k is the k-th column, and e_kᵀ is the k-th row of I_n.
With the above considerations and In · A = A · In = A, we see that the unit vectors can be used
to “extract” the rows and columns from a matrix. That is, given a matrix A ∈ Rn×n of the
form (2.1), we obtain that
i.e., Ae_k gives the k-th column, and e_kᵀA gives the k-th row of A. (Verify this!) The same can be done for rectangular matrices A ∈ R^{m×n}, but one needs to consider unit vectors of different lengths.
The last concept related to matrix multiplication that we will need is the inverse of a matrix.
That is, for a given matrix A ∈ Rn×n , the inverse matrix, if it exists, is a matrix A−1 ∈ Rn×n
such that
AA−1 = A−1 A = In .
If an inverse exists, then we call a matrix invertible or regular, see Section 2.6.
Some matrices are clearly invertible, like the identity with I_n⁻¹ = I_n. Others are clearly not invertible, like the zero matrix, because 0 cannot be multiplied by any matrix to become "non-zero". But, in general, it is not easy to see whether a matrix is invertible or not. For example, the 2 × 2 matrix with rows (1, 2) and (3, 4) is invertible, but the one with rows (1, 2) and (3, 6) is not. We will discuss a way to verify if a matrix is invertible, and how to compute an inverse, in Section 2.6.
However, let us already add here, that even if we know that a matrix is invertible, it is usually
difficult (also computationally) to compute an inverse.
We will come back to this issue later, and present some ways for computing the inverse, at least
for ’small’ matrices. This will be the ultimate tool to solve (certain) systems of linear equations.
But we will first discuss some more direct, but less powerful ways to calculate solutions.
We finally discuss the transpose of a matrix. Since the dimensions of a matrix are important, it
makes a huge difference if a matrix is m × n or n × m, and it is quite useful to have a compact
notation to somehow ’switch’ the rows and columns of a matrix. That is, for a given m × n
matrix A = (aij ), we define its transpose AT as the n × m matrix whose rows are the columns
of A. To be more precise, the ij-th component of Aᵀ is a_{ji}, i.e.,
\[
A^T = (a_{ji})_{j,i=1}^{n,m} = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{pmatrix}.
\]
And
\[
\left( \begin{pmatrix} 1 & 6 \\ 2 & 5 \\ 3 & 4 \end{pmatrix}^T \right)^T = \begin{pmatrix} 1 & 2 & 3 \\ 6 & 5 & 4 \end{pmatrix}^T = \begin{pmatrix} 1 & 6 \\ 2 & 5 \\ 3 & 4 \end{pmatrix}.
\]
In the above example we saw that (AT )T = A, and this obviously holds in general. (The ij-th
component of (AT )T is the ji-th component of AT , which is aij .)
There is one calculation rule related to the transpose, that is sometimes also very useful for
computing the product of matrices. We state this in the following lemma.
(AB)T = B T AT .
In particular, this lemma shows that (AAT )T = AAT for every A ∈ Rn×n .
Proof. First of all, note that B T ∈ Rn×p and AT ∈ Rp×m . Therefore, the inner dimensions of
B T and AT agree, and their product is defined.
Now, let us write (C)_{ij} for the ij-th entry of a matrix C. By definition, we have that
\[
(AB)_{ij} = \sum_{k=1}^{p} (A)_{ik} (B)_{kj},
\]
and
\[
(B^T A^T)_{ij} = \sum_{k=1}^{p} (B^T)_{ik} (A^T)_{kj} = \sum_{k=1}^{p} (B)_{ki} (A)_{jk} = \sum_{k=1}^{p} (A)_{jk} (B)_{ki} = (AB)_{ji},
\]
which is precisely the ij-th entry of (AB)ᵀ.
Note that symmetric matrices must be square, and we will see later that symmetric matrices have several important properties. The easiest examples are the identity and the (square) zero matrix. Another important class of symmetric matrices are diagonal matrices.
The transpose is often used when working with vectors. Therefore, it is essential to understand
especially this case.
Consider a column vector x ∈ R^n = R^{n×1}; then its transpose x^T ∈ R^{1×n} is a row vector. Now, since the inner dimensions in both cases agree, we can define x^T x and x x^T. However, we obtain that x^T x ∈ R = R^{1×1} is a number, while x x^T ∈ R^{n×n} is a square matrix.
The above examples can clearly be generalized to the case of two different vectors x, y ∈ R^n. That is, we can define the number x^T y = \sum_{i=1}^{n} x_i y_i, which is just the inner product of x and y. In particular, the Euclidean norm of a vector x can be written as \|x\|_2 = \sqrt{x^T x}.
Moreover, we can define the matrix xy T ∈ Rn×n . One may even define such matrices based on
vectors of different dimensions. As these matrices appear rather often in theory and applications,
they have been given names.
Definition 2.13. Let A ∈ R^{m×n}. If there exist two vectors x ∈ R^m and y ∈ R^n such that

A = x y^T,

then we call A a rank-one matrix. These matrices play an important role in the work with high dimensional data.
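The distinction between the number x^T y and the matrix x y^T is easy to see in code. A short NumPy sketch (an aside, using `matrix_rank` as a black box):

```python
import numpy as np

x = np.array([0, 1, 2])     # vector x in R^3
y = np.array([2, 3, 4, 5])  # vector y in R^4

inner = x @ x               # x^T x: a single number
norm = np.sqrt(inner)       # Euclidean norm ||x||_2
outer = np.outer(x, y)      # x y^T: a 3x4 rank-one matrix

assert inner == 5
assert np.isclose(norm, np.linalg.norm(x))
assert np.linalg.matrix_rank(outer) == 1
```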
Remark 2.14. If we consider complex-valued matrices A = (aij ) ∈ Cm×n , then all the
definitions above still make sense. However, instead of a transpose AT , we would speak of the
adjoint matrix A^* = (\overline{a_{ji}}), i.e., the conjugate transpose. (That is, we take additionally the
complex conjugate in each component.) Clearly, the adjoint equals the transpose if the matrix
is real-valued. If A = A∗ , we call A a self-adjoint (or hermitian) matrix.
Remark 2.15. The above calculation rules may be compared to the field axioms for real num-
bers, see Axiom 1. Note that many properties are also fulfilled for matrices. However, matrix
multiplication is not commutative, i.e., we do not have AB = BA in general, which is why the set of matrices (of the same dimensions) is not a field. (One may see that it is a group or even a ring, but we do not discuss these types of algebraic structures here.)
Let us now consider systems of linear equations, which are the most frequently occurring type
of multivariate (i.e., depending on more than one variable) problems to solve, although they
appear to be the easiest. One reason is that many (even “non-linear”) numerical problems can
be rewritten as, or approximated by, a (very large) system of linear equations. And although
such systems are usually solved by a computer, it is up to the user to transfer the problem
under consideration to a well-defined linear system. It is therefore indispensable to have a solid
understanding of these basic problems.
Throughout this section, m, n ∈ N are fixed parameters, where m denotes the number of equations and n the number of unknowns. The system of equations we want to solve here will be of the following form.
If we recall that the matrix-vector product of the matrix of coefficients A ∈ Rm×n and the vector
of variables x ∈ Rn is defined by
Ax = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \\ a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \\ \vdots \\ a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n \end{pmatrix} \in R^m,
we see that the system of linear equations, given in Definition 2.16, can be written in short by
Ax = b.
Whether (and how many) solutions exist can be verified by analyzing the matrix of coefficients in more detail. Before we come to this, let us introduce some more notation and discuss some examples.
Definition 2.17. Given a linear system Ax = b with coefficient matrix A and RHS b, we denote the set of solutions by

L(A, b) := \{ x \in R^n : Ax = b \}.
Example 2.18. Let us consider the examples from the beginning of Section 2. That is, we want
to solve the system
2x1 + x2 = 1,
6x1 + 3x2 = 3,
and we have already seen that this system is solved by any (x1 , x2 ) ∈ R2 with 2x1 + x2 = 1.
Putting this into our notation, we have that A = \begin{pmatrix} 2 & 1 \\ 6 & 3 \end{pmatrix} and b = \begin{pmatrix} 1 \\ 3 \end{pmatrix}, and with this

L(A, b) = L\left( \begin{pmatrix} 2 & 1 \\ 6 & 3 \end{pmatrix}, \begin{pmatrix} 1 \\ 3 \end{pmatrix} \right) = \{ (x_1, x_2) : 2x_1 + x_2 = 1 \}.

Recall that changing the RHS to b = \begin{pmatrix} 1 \\ 2 \end{pmatrix} leads to a system without solution, i.e.,

L(A, b) = L\left( \begin{pmatrix} 2 & 1 \\ 6 & 3 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix} \right) = \emptyset.
This example is somehow special because the existence of a solution depends on the RHS.
Although this is not a rare case, we will see that there are conditions for a linear system to
be uniquely solvable for any RHS b. This is of particular interest in applications, where the
RHS b usually represents some kind of measurements or requirements, which we may not choose
ourselves, and we want to find a solution x to specify some parameters.
Before we discuss a systematic way to solve large linear systems, let us see some more examples.
By substituting x2 = x1 −6 in the first equation, we see that a solution must satisfy 8x1 −30 = 42,
and we obtain that the price of a pizza is x1 = 9. From x2 = x1 − 6 we then see that x2 = 3.
Therefore,

L\left( \begin{pmatrix} 3 & 5 \\ 1 & -1 \end{pmatrix}, \begin{pmatrix} 42 \\ 6 \end{pmatrix} \right) = \left\{ \begin{pmatrix} 9 \\ 3 \end{pmatrix} \right\}.
Note that L(A, b) is a set and therefore, we need to write L(A, b) = {x}, and not L(A, b) = x,
if x is the only solution.
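This small system can also be handed to a numerical solver. The following NumPy sketch (an aside; the systematic solution method comes later in this section) confirms the unique solution:

```python
import numpy as np

# the system 3*x1 + 5*x2 = 42, x1 - x2 = 6 from the example above
A = np.array([[3.0, 5.0],
              [1.0, -1.0]])
b = np.array([42.0, 6.0])

x = np.linalg.solve(A, b)           # solves Ax = b for a square, full-rank A
assert np.allclose(x, [9.0, 3.0])   # the unique solution {(9, 3)}
```

Note that `np.linalg.solve` raises an error for a singular matrix such as the one in Example 2.18, since in that case there is no unique solution.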
Remark 2.21. In the above example, we wrote L(A, b) = {(x1 , . . . , xn )}. Note that this is a
bit inaccurate, because (x1 , . . . , xn ) seems to be a row vector, but we have the convention that
elements from Rn , so in particular the solution x ∈ Rn , are column vectors. More precisely,
we should have written L(A, b) = {(x1 , . . . , xn )T }. However, as it is obvious that the vector is
supposed to be a column vector, we usually omit the transpose, for simplicity. (Note that the
matrix product Ax would not be defined if x is a row vector.) In the same way, we might define
x = (x1 , . . . , xn ) ∈ Rn and assume, unless stated otherwise, that x is a column vector.
The next examples are given with solutions, for your own practice.
Example 2.22. Consider the linear system
x1 + x4 = 0,
− 4x2 + 16x4 = 0,
2x3 − 6x4 = 0.
Then, the set of solutions is given by
L\left( \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & -4 & 0 & 16 \\ 0 & 0 & 2 & -6 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \right) = \{ (-\lambda, 4\lambda, 3\lambda, \lambda) : \lambda \in R \}.
We now discuss one special case of equations, which will also lead to a first answer to the question whether a linear system can have infinitely many solutions. A linear system of the form

Ax = 0,

i.e., where the RHS is the zero vector 0 := 0_{m,1}, is called a homogeneous system (of equations). To a given linear system Ax = b, we call Ax = 0 the corresponding homogeneous system.
It is rather easy to see that for any matrix A we have A · 0 = 0 (again 0 is a vector here). Just
have a look at the definition of the matrix-vector product. This implies that every homogeneous
linear system has at least the solution x = 0, and this solution is called the trivial solution.
Written mathematically, we have that L(A, 0) ⊃ {0} for any matrix A.
In some cases, a homogeneous system has more than the trivial solution, see Example 2.22. However, if the trivial solution is also the only solution of a homogeneous system, then the following lemma shows that the solution to a linear system Ax = b, if it exists, is unique.

Lemma 2.25. Let A ∈ R^{m×n} and b ∈ R^m. Then the following holds:
1. If x is a solution of Ax = b and y is a solution of the corresponding homogeneous system Ay = 0, then x + y is also a solution of Ax = b.
2. If the homogeneous system Ay = 0 has a solution y ≠ 0, and there is at least one solution to Ax = b, then Ax = b has infinitely many solutions.
3. If Ay = 0 has only the trivial solution y = 0, and there is at least one solution x to Ax = b, then this solution x is unique.
Proof. The first statement follows directly from the linearity of matrix multiplication. We obtain
A(x + y) = Ax + Ay = b + 0 = b.
For the second statement, note that if there is a vector y ≠ 0 satisfying Ay = 0, then also the vector λ · y satisfies A(λy) = λAy = 0 for every λ ∈ R, and is therefore also a solution. Hence, there are
automatically infinitely many solutions to the homogeneous system. Together with the first
part, we see that Ax = b has infinitely many solutions.
For the third part, assume that the only solution to Ay = 0 is y = 0. If there were two solutions x and z to the linear system, i.e., Ax = Az = b, then their difference would satisfy
A(x − z) = Ax − Az = b − b = 0.
In other words, x − z is a solution to the homogeneous system. As we assumed that the only
solution to the homogeneous system is zero, we obtain that x − z = 0 or x = z. This shows
that, if there are two solutions to Ax = b, then they are equal, which shows the uniqueness of
the solution.
Remark 2.26. Analogously one can define linear systems also in the complex case, i.e., co-
efficients and RHS might be complex. Then, we are clearly interested in complex solutions.
However, note that every complex equation can be written as two "real" equations by considering the real and imaginary parts separately. Therefore, every complex linear system can be written as a larger real linear system. We therefore only consider the real case.
Now we make an important observation which allows us to derive an algorithm for solving
linear systems by manipulating matrices. We can do the following operations to a linear system
without changing the set of solutions:
1) Interchanging any two equations, i.e., changing the order of the equations.
2) Multiplying an equation by a non-zero number λ ∈ R \ {0}.
3) Adding a multiple of one equation to another equation.
(Think for a second why these operations do not change the set of solutions.)
Since every system of linear equations can be written with the help of a matrix, it is clearly
of interest how the above operations change the corresponding matrix of coefficients of a linear
system. We will see that they indeed allow for successive modifications that lead to “much
simpler” matrices, i.e., matrices in echelon form. From such a matrix, we will be able to basically
see if a corresponding linear system is (uniquely) solvable or not.
Let us start by discussing how the above operations to a linear system Ax = b affect the
corresponding matrix A. However, note already now that these operations also change the
RHS b of a linear system, and this is essential. We will come back to this shortly, but for now
we only consider the corresponding matrix of coefficients.
In view of the operations from above that can be used to change a linear system Ax = b without changing the set of solutions, we see that the matrix A is changed in the following way:
1) Interchanging any two rows of A.
2) Multiplying a row of A by a non-zero number.
3) Adding a multiple of one row of A to another row.
These are the so-called (elementary) row operations.
That is, there are as many 0's as possible at the left of every row, and the rows are ordered such that the number of leading zero entries increases from top to bottom. In particular, the leading coefficient
of a row, i.e., the first non-zero entry of a row, is strictly to the right of the leading coefficient
of the row above. Before we discuss some examples, let us state the general definition.
Definition 2.27. We say that a matrix C = (c_{ij}) ∈ R^{m×n} of the form

C = \begin{pmatrix}
0 & \cdots & 0 & c_{1j_1} & * & \cdots & & * \\
 & & & 0 & c_{2j_2} & * & \cdots & * \\
 & & & & & \ddots & & \vdots \\
 & & & & & 0 & c_{kj_k} & * \\
 & & & & & & & 0 \\
\end{pmatrix},

where ∗ stands for an arbitrary entry, is in row echelon form (ger. 'Treppenform'). That is, there exist numbers k ≤ m and 1 ≤ j_1 < · · · < j_k ≤ n such that for all 1 ≤ i ≤ k:
• c_{ij_i} ≠ 0,
• c_{ij} = 0 for all j < j_i, i.e. c_{ij_i} is the first non-zero element in the i-th row, and
• c_{\ell j_i} = 0 for all \ell > i, i.e. c_{ij_i} is the last non-zero element in the j_i-th column.
If, additionally,
• c_{ij_i} = 1 for all 1 ≤ i ≤ k, and
• c_{\ell j_i} = 0 for all \ell < i, i.e. c_{ij_i} = 1 is the only non-zero element in the j_i-th column,
then we say that C is in reduced row echelon form.
We do not prove the following statement here formally, but note that it is the basis of the
considerations below.
Theorem 2.28. Every matrix can be transformed to (reduced) row echelon form by per-
forming row operations. Moreover, the reduced row echelon form of a matrix is unique.
In contrast, a given matrix A can be transformed by row operations into different matrices in (non-reduced) row echelon form. (For example, multiplying any row by 2 leads to another row echelon form.) But even then, all row echelon forms of a matrix have the same "k", i.e., the same number of non-zero rows. This number is called the rank of A, written rank(A). That's why the rank is a characteristic of a matrix, which turns out to be essential. Note that the definition implies that rank(A) ≤ min{m, n} for any A ∈ R^{m×n}, and one may say that the rank is the number of independent rows in the matrix. (We will discuss later what this precisely means.)
Let us see some examples, before we show that computing the reduced row echelon form of a
matrix is actually solving the corresponding system of linear equations.
Example 2.30 (Diagonal matrices). Diagonal matrices with all diagonal entries non-zero are
already in echelon form. By dividing each row by the diagonal entry, we obtain the reduced
row echelon form, which equals the identity. In particular, all diagonal matrices with non-zero
diagonal entries have the same reduced row echelon form.
Example 2.31 (Rearranging rows). In general, we may easily bring every matrix into echelon form that only contains at most one non-zero entry per row. We just need to rearrange the rows. For example, a row echelon form of

\begin{pmatrix} 0 & 0 & 0 & -1 \\ 4 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \end{pmatrix} \quad \text{is} \quad \begin{pmatrix} 4 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 \end{pmatrix}.

(In the definition above, we have (j_1, j_2, j_3) = (1, 2, 4), and c_{11} = 4, c_{22} = 3 and c_{34} = −1.) The rank is 3, and the reduced row echelon form is

\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
Example 2.32 (Removing duplicate rows). Addition of (a multiple of) a row to another is a
row operation. In particular, we can subtract a row from another. For example, by subtracting
the first row from the second, and twice the first row from the third, we see that the reduced
row echelon form of the matrix
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 2 & 2 & 2 \end{pmatrix} \quad \text{is} \quad \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.
The above example shows that a rank-one matrix indeed has rank one (provided x and y are non-zero). Given x ∈ R^n and y ∈ R^m, we consider the rank-one matrix x y^T ∈ R^{n×m}. By definition of the matrix product, we see that every column of x y^T is a multiple of x, and every row is a multiple of y^T. In particular, the k-th row of x y^T is x_k · y^T. Since all rows are multiples of each other, we can proceed as in Example 2.32 to obtain rank(x y^T) = 1.
Example 2.33. For example, let x = \begin{pmatrix} 0 \\ 1 \\ 2 \end{pmatrix} ∈ R^3 = R^{3×1} and y^T = (2, 3, 4, 5) ∈ R^{1×4}. We obtain

x y^T = \begin{pmatrix} 0 \\ 1 \\ 2 \end{pmatrix} (2, 3, 4, 5) = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 10 \end{pmatrix} \in R^{3×4}.

Subtracting twice the second row from the third, and interchanging rows, leads to the row echelon form

\begin{pmatrix} 2 & 3 & 4 & 5 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.

This shows rank(x y^T) = 1.
Example 2.34. Consider the matrix

\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}.

We bring this matrix into reduced row echelon form by using only row operations. Systematically, we want to create zeros in the lower left entries. So, we start by subtracting 4 times the first row from the second, which leads to a zero in the first entry of the second row. We indicate this operation by "II − 4I". Afterward, we subtract 7 times the first row from the third ("III − 7I"), which leads to a zero in the first entry of the last row, and so on.
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
\xrightarrow{II - 4I}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 7 & 8 & 9 \end{pmatrix}
\xrightarrow{III - 7I}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -12 \end{pmatrix}
\xrightarrow{III - 2II}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \end{pmatrix}
\xrightarrow{-\frac{1}{3} II}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{pmatrix}
\xrightarrow{I - 2II}
\begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{pmatrix}
This is the reduced row echelon form of the matrix. Moreover, we can see that the rank is 2.
Note that there are several ways of indicating which row operation is performed. For example,
one might be more precise and write “II → II − 4I” to state that the operation is performed
only in the second (II) row. We decided to use this notation with the convention that only the
row that appears first will be changed.
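The systematic procedure just described is easy to turn into a small program. The following sketch is a hypothetical helper (not from the lecture) that applies exactly these row operations in NumPy; it additionally picks the largest available pivot in each column, which is harmless here and improves numerical stability:

```python
import numpy as np

def rref(A, tol=1e-12):
    """Reduced row echelon form via the row operations described above:
    interchanging rows, scaling a row, and adding multiples of rows."""
    C = np.array(A, dtype=float)
    m, n = C.shape
    row = 0
    for col in range(n):
        # find a pivot in this column, at or below position `row`
        pivot = row + np.argmax(np.abs(C[row:, col]))
        if abs(C[pivot, col]) < tol:
            continue                       # no pivot in this column
        C[[row, pivot]] = C[[pivot, row]]  # interchange rows
        C[row] /= C[row, col]              # scale the pivot entry to 1
        for r in range(m):                 # clear the rest of the column
            if r != row:
                C[r] -= C[r, col] * C[row]
        row += 1
        if row == m:
            break
    return C

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
R = rref(A)
expected = np.array([[1, 0, -1],   # the reduced row echelon form
                     [0, 1, 2],    # computed by hand in Example 2.34
                     [0, 0, 0]])
assert np.allclose(R, expected)
assert np.linalg.matrix_rank(A) == 2  # two non-zero rows: rank 2
```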
Example 2.35. We use "I ↔ II" to indicate that we interchanged the first and the second row. (Note that afterwards, the first row is denoted by II and vice versa.)

\begin{pmatrix} 0 & 8 & 0 \\ 3 & 6 & 0 \\ 6 & 0 & 1 \\ 6 & 15 & 0 \end{pmatrix}
\xrightarrow{I \leftrightarrow II}
\begin{pmatrix} 3 & 6 & 0 \\ 0 & 8 & 0 \\ 6 & 0 & 1 \\ 6 & 15 & 0 \end{pmatrix}
\xrightarrow{III - 2I}
\begin{pmatrix} 3 & 6 & 0 \\ 0 & 8 & 0 \\ 0 & -12 & 1 \\ 6 & 15 & 0 \end{pmatrix}
\xrightarrow{IV - 2I}
\begin{pmatrix} 3 & 6 & 0 \\ 0 & 8 & 0 \\ 0 & -12 & 1 \\ 0 & 3 & 0 \end{pmatrix}
\xrightarrow{IV - \frac{3}{8} II}
\begin{pmatrix} 3 & 6 & 0 \\ 0 & 8 & 0 \\ 0 & -12 & 1 \\ 0 & 0 & 0 \end{pmatrix}
\xrightarrow{III + \frac{3}{2} II}
\begin{pmatrix} 3 & 6 & 0 \\ 0 & 8 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}
\xrightarrow{\frac{1}{3} I}
\begin{pmatrix} 1 & 2 & 0 \\ 0 & 8 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}
\xrightarrow{\frac{1}{8} II}
\begin{pmatrix} 1 & 2 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}
\xrightarrow{I - 2II}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}

This matrix has therefore rank 3. (We have (j_1, j_2, j_3) = (1, 2, 3) and c_{11} = c_{22} = c_{33} = 1.)
As you see, it can be rather time-consuming to compute the reduced row echelon form even for rather small matrices. However, it is a very straightforward method, meaning that it is always obvious what to do next, and miscalculation is basically the only source of errors.
Recall that the reduced row echelon form of a matrix is unique, but the order of the calculations to find it is not. To make computations generally easier and faster, there are two rules of thumb: interchange rows whenever this brings a row closer to its final position, and if a row contains only one non-zero entry, put all other entries in the corresponding column to zero immediately. The second "shortcut" can clearly be performed by adding multiples of the row with the unique non-zero entry to all other rows. Let's consider an example.
Example 2.36. We consider the matrix
2 6 5 0 4
4 17 8 0 16
12 42 8 14 0 .
0 0 13 0 0
28 0 3 0 0
We see that in the fourth row there is only one non-zero entry, which is in the third column.
Hence we can reduce the third column immediately and get
2 6 0 0 4
4 17 0 0 16
12 42 0 14 0 .
0 0 1 0 0
28 0 0 0 0
Note that we also divided the 4th row by 13 to make things easier.
Next we see that the fifth row is almost in the form we wish for the first row. So, we interchange them and divide the new first row by 28. Again, we can put all other entries in the first column to zero:
1 0 0 0 0
0 17 0 0 16
0 42 0 14 0 .
0 0 1 0 0
0 6 0 0 4
If we now subtract 4-times the last row from the second, we see that the new second row reads
(0, −7, 0, 0, 0). Dividing by -7, and putting all other entries in the second column to zero, yields
1 0 0 0 0
0 1 0 0 0
0 0 0 14 0 .
0 0 1 0 0
0 0 0 0 4
Dividing the third row by 14 and the fifth row by 4, and interchanging III and IV, finally yields the reduced row echelon form
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0 .
0 0 0 1 0
0 0 0 0 1
We see that the rank is 5, and j` = ` for ` = 1, . . . , 5, i.e., (j1 , j2 , j3 , j4 , j5 ) = (1, 2, 3, 4, 5).
Moreover, c`` = 1 for ` = 1, . . . , 5.
Now we discuss how we can solve linear systems by calculating (reduced) row echelon forms
of matrices. We consider the linear system Ax = b with corresponding matrix A = (aij ) ∈ Rm×n
and RHS b = (b1 , . . . , bm ), and define the augmented matrix (A|b), i.e., we consider the array
(A|b) := \left(\begin{array}{cccc|c} a_{11} & a_{12} & \cdots & a_{1n} & b_1 \\ a_{21} & a_{22} & \cdots & a_{2n} & b_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} & b_m \end{array}\right).
This means that we add b as a new column. We already discussed that by interchanging, multiplying and adding rows we do not change the set of solutions. So, if some (C|b′) is obtained from (A|b) only by row operations, then

L(A, b) = L(C, b′).

Now assume that the augmented matrix (A|b) is transformed into an augmented matrix (C|b′), where C is in row echelon form. (Here, we consider the vector b as the last column of the matrix, and therefore have to respect it while performing row operations. But we only want to bring A into row echelon form.) From this augmented matrix we can just "see" the solutions of the corresponding linear system. This way of computing solutions is called Gaussian elimination.
Example 2.38. Let us again consider Example 2.20, which is given by Ax = b with A = \begin{pmatrix} 1 & 2 & 6 \\ 2 & 5 & 0 \\ 0 & 1 & 0 \end{pmatrix} and b = \begin{pmatrix} 0 \\ -1 \\ 3 \end{pmatrix}. We bring the augmented matrix into reduced row echelon form:

\left(\begin{array}{ccc|c} 1 & 2 & 6 & 0 \\ 2 & 5 & 0 & -1 \\ 0 & 1 & 0 & 3 \end{array}\right)
\xrightarrow{II - 2I}
\left(\begin{array}{ccc|c} 1 & 2 & 6 & 0 \\ 0 & 1 & -12 & -1 \\ 0 & 1 & 0 & 3 \end{array}\right)
\xrightarrow{\dots}
\left(\begin{array}{ccc|c} 1 & 0 & 6 & -6 \\ 0 & 1 & 0 & 3 \\ 0 & 0 & -12 & -4 \end{array}\right)
\xrightarrow{\dots}
\left(\begin{array}{ccc|c} 1 & 0 & 0 & -8 \\ 0 & 1 & 0 & 3 \\ 0 & 0 & 1 & 1/3 \end{array}\right).
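The last column of the reduced augmented matrix contains the solution x = (−8, 3, 1/3). As a quick check (a NumPy aside, not part of the notes):

```python
import numpy as np

A = np.array([[1.0, 2.0, 6.0],
              [2.0, 5.0, 0.0],
              [0.0, 1.0, 0.0]])
b = np.array([0.0, -1.0, 3.0])

x = np.linalg.solve(A, b)
assert np.allclose(x, [-8.0, 3.0, 1.0 / 3.0])  # last column of the RREF above
assert np.allclose(A @ x, b)                   # and indeed solves Ax = b
```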
For the general procedure, assume that A has rank k, i.e., rank(A) = k. We obtain that the augmented matrix (C|b′) in row echelon form, which we obtain from (A|b), looks like

(C|b′) = \left(\begin{array}{ccccccc|c}
0 \cdots 0 & c_{1j_1} & * & * & \cdots & * & * & b'_1 \\
 & 0 & c_{2j_2} & * & \cdots & * & * & b'_2 \\
 & & 0 & c_{3j_3} & \cdots & * & * & b'_3 \\
 & & & & \ddots & & \vdots & \vdots \\
 & & & & 0 & c_{kj_k} & * & b'_k \\
 & & & & & & 0 & b'_{k+1} \\
 & & & & & & \vdots & \vdots \\
 & & & & & & 0 & b'_m
\end{array}\right),

i.e., the last m − k rows of C contain only zeros. For \ell > k, the \ell-th row therefore corresponds to the equation

0 \cdot x_1 + 0 \cdot x_2 + \cdots + 0 \cdot x_n = b'_\ell.

Since the left hand side is equal to zero for every x, we obtain a contradiction if one of these b'_\ell's is not equal to zero. We therefore obtain the rule: the linear system has a solution if and only if b'_{k+1} = \cdots = b'_m = 0.

In the other case, i.e., b'_{k+1} = \cdots = b'_m = 0, the system will always have a solution, which we compute "from the bottom to the top":
We first consider the last non-zero equation, i.e., the k-th equation, which reads

\sum_{\ell=j_k}^{n} c_{k\ell}\, x_\ell = b'_k.

Solving for x_{j_k}, we obtain

x_{j_k} = \frac{1}{c_{kj_k}} \left( b'_k - \sum_{\ell=j_k+1}^{n} c_{k\ell}\, x_\ell \right).

(In particular, if j_k = n, then x_n = x_{j_k} = b'_k / c_{kj_k}.) All possible solutions have to fulfill this identity, otherwise the k-th equation could not be true. Therefore, for any given choice of x_{j_k+1}, . . . , x_n, we have to choose this unique x_{j_k}. However, the x_\ell with \ell = j_k + 1, . . . , n can be chosen freely.
Now assume that we have already fixed the values for x_n, . . . , x_{j_k}. We turn to the next equation, which is the (k − 1)-th:

\sum_{\ell=j_{k-1}}^{n} c_{k-1,\ell}\, x_\ell = b'_{k-1}.

By the same principle we can give a formula for x_{j_{k-1}} depending only on the x_\ell with \ell \geq j_{k-1} + 1.
Since we have only fixed x_n, . . . , x_{j_k}, we can again choose x_\ell with \ell = j_{k-1} + 1, . . . , j_k − 1 freely. (Note that there is no free choice if j_k = j_{k-1} + 1.) So, after this step, we have computed the components x_n, . . . , x_{j_k}, . . . , x_{j_{k-1}} of a solution (x_1, . . . , x_n) of Ax = b, i.e., the last components.
If we continue this process, we finally see that there are precisely k = rank(A) such equalities that "fix" the value of the unknowns x_{j_1}, . . . , x_{j_k}. These equalities are

x_{j_i} = \frac{1}{c_{ij_i}} \left( b'_i - \sum_{\ell=j_i+1}^{n} c_{i\ell}\, x_\ell \right)

for i = 1, . . . , k. But the remaining variables, i.e., the x_\ell with \ell \in \{1, 2, . . . , n\} \setminus \{j_1, j_2, . . . , j_k\}, can all be chosen freely. Hence, when we write down the solution, there are n − k free parameters. This is usually phrased as "a linear system has n − k degrees of freedom".
Example 2.39. Consider the linear system Ax = b with

A = \begin{pmatrix} 2 & 1 & -2 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{pmatrix}

and b = (5, 1, 0)^T. To compute a solution x = (x_1, x_2, x_3), we consider the augmented matrix

(A|b) = \left(\begin{array}{ccc|c} 2 & 1 & -2 & 5 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 0 & 0 \end{array}\right).
Since this matrix is already in row echelon form, we do not need to perform row operations.
Specifically, we have the row echelon form with (j_1, j_2) = (1, 2). By Gaussian elimination (with C = A and b′ = b), we can choose the variables with index in {1, 2, 3} \ {j_1, j_2} = {3} freely, i.e., we already know that x_3 is a free parameter. Using the formulas established above, we obtain
("from the bottom to the top")

x_2 = \frac{1}{a_{22}} \left( b_2 - \sum_{\ell=3}^{3} c_{2\ell}\, x_\ell \right) = \frac{1}{1}\,(1 - 2x_3) = 1 - 2x_3

and

x_1 = \frac{1}{a_{11}} \left( b_1 - \sum_{\ell=2}^{3} c_{1\ell}\, x_\ell \right) = \frac{1}{2}\,\bigl(5 - (1 - 2x_3) + 2x_3\bigr) = 2 + 2x_3.
For notational convenience, we choose λ as a name for the free parameter, and write
L\left( \begin{pmatrix} 2 & 1 & -2 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{pmatrix}, \begin{pmatrix} 5 \\ 1 \\ 0 \end{pmatrix} \right) = \{ (2 + 2\lambda,\, 1 - 2\lambda,\, \lambda) \in R^3 : \lambda \in R \}.
Note that we have infinitely many solutions. For example, x = (2, 1, 0) (for λ = 0) or x =
(0, 3, −1) (for λ = −1).
However, if we consider, e.g., the RHS b′ = (0, 0, 1), then there is no solution, since the row echelon form

(A|b′) = \left(\begin{array}{ccc|c} 2 & 1 & -2 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right)

shows the contradiction 0 = 1 in the last equation.
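Both claims of this example can be checked numerically. The following NumPy sketch verifies the one-parameter solution family and, for the unsolvable RHS, compares the ranks of A and of the augmented matrix (using `matrix_rank` as a black box):

```python
import numpy as np

A = np.array([[2.0, 1.0, -2.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 0.0]])
b = np.array([5.0, 1.0, 0.0])

# every member of the one-parameter family solves Ax = b
for lam in (-1.0, 0.0, 2.5):
    x = np.array([2 + 2 * lam, 1 - 2 * lam, lam])
    assert np.allclose(A @ x, b)

# for b' = (0, 0, 1) the ranks of A and (A|b') differ: no solution exists
b2 = np.array([0.0, 0.0, 1.0])
assert np.linalg.matrix_rank(A) == 2
assert np.linalg.matrix_rank(np.column_stack([A, b2])) == 3
```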
Example 2.40. Note that some formulas get easier when we transfer a linear system into
reduced row echelon form. If we consider the example from above, we see that it can be
transferred to
(A|b) = \left(\begin{array}{ccc|c} 2 & 1 & -2 & 5 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 0 & 0 \end{array}\right) \longrightarrow \left(\begin{array}{ccc|c} 1 & 0 & -2 & 2 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 0 & 0 \end{array}\right) =: (C|b′).
Again, the second row is seen to be equivalent to x2 = 1 − 2x3 . However, the first row shows
directly that x1 = 2 + 2x3 , and there is no need to plug the already obtained formula for x2 in,
as above. That's another advantage of the reduced row echelon form. We obtain again that L(A, b) = \{(2 + 2\lambda,\, 1 - 2\lambda,\, \lambda) \in R^3 : \lambda \in R\}.
Example 2.42. A larger example, which we only present with a short solution, is the linear system given by

(A|b) = \left(\begin{array}{cccc|c} 9 & -6 & -12 & 30 & -6 \\ 5 & -10 & -20 & 0 & 20 \\ -5 & 2 & 4 & 10 & -10 \\ -2 & -4 & -8 & 40 & -16 \end{array}\right).

This has the reduced row echelon form

\left(\begin{array}{cccc|c} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & 0 & -2 \\ 0 & 0 & 0 & 1 & -3/5 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right).
We see that x_3 is a free parameter. From the third equation we get that x_4 = −3/5. We deduce from the second equation that x_2 = −2 − 2x_3, and the first equation yields that x_1 = 0. So we obtain the set of solutions

L(A, b) = \left\{ (0,\, -2 - 2\lambda,\, \lambda,\, -3/5)^T : \lambda \in R \right\}.
We collect these findings in the following lemma; they are particularly meaningful if m = n, i.e., if A is a square matrix. This is the case if there are as many equations as unknowns.

Lemma 2.43. Let A ∈ R^{m×n}.
1. If rank(A) < m, then the linear system Ax = b has no solution for certain b ∈ R^m.
2. If rank(A) < n, then the homogeneous system Ax = 0 has infinitely many solutions.
Hence, if rank(A) < min{m, n}, then the linear system Ax = b has either no or infinitely many solutions, depending on b ∈ R^m.
Moreover, if A ∈ R^{n×n} is a square matrix, i.e., m = n, with rank(A) = n, then the linear system Ax = b has a unique solution for any b ∈ R^n.
Proof. The lemma follows from the considerations above, together with the fact that the homogeneous system Ax = 0 always has at least the solution x = 0, see also Lemma 2.25.
It is worth noting that a row echelon form of a square matrix is always an upper triangular
matrix. That is, the row echelon form is of the form
\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ & \ddots & \vdots \\ 0 & & a_{nn} \end{pmatrix},
i.e., all entries below the diagonal are zero. The matrix has full rank if and only if all diagonal entries are non-zero. Conversely, every upper triangular matrix A ∈ R^{n×n} with non-zero diagonal elements is already in row echelon form.
Matrices with full rank play an important role, in particular, since there is a unique solution to
a corresponding linear system independent of the RHS b. However, the technique above only
seems to work for a fixed RHS b.
In the following sections we will discuss the inverse of a matrix A ∈ Rn×n , if it exists, which will
lead to a formula for a solution of Ax = b for any b. However, let us add that it is not usual
to compute the inverse of a large matrix on a computer, since it is computationally quite hard.
Large linear systems are usually solved with (variants of) Gaussian elimination.
We now introduce the determinant of a matrix, which will be a frequently appearing quantity,
and is a good tool to decide if a matrix is invertible or not. Moreover, it is needed to introduce
Cramer’s rule for solving linear systems, and to give an explicit formula for the inverse matrix.
Definition 2.46. Let A ∈ R^{n×n} be a square matrix, and denote the rows of A by a_1, . . . , a_n ∈ R^{1×n}. Moreover, let λ, µ ∈ R, and w ∈ R^{1×n}. We define the determinant of A by the unique mapping det : R^{n×n} → R such that:
1. det is linear in each row, i.e., if a_i = λ a'_i + µ w for some row index i, then det(A) = λ · det(A') + µ · det(A''), where A' and A'' are obtained from A by replacing the i-th row a_i by a'_i and by w, respectively.
2. If there exist i ≠ j such that a_i = a_j, i.e., two equal rows, then det(A) = 0.
3. The determinant of the identity matrix equals one, i.e., det(I_n) = 1.
The definition directly shows the connection to linear systems. However, let us show explicitly how the determinant changes under row operations.

Lemma 2.47. Let A ∈ R^{n×n} and λ ∈ R. Then:
1. Multiplying a single row of A by λ multiplies the determinant by λ. In particular, det(λA) = λ^n det(A).
2. Interchanging two rows of A changes the sign of the determinant.
3. Adding a multiple of one row to another row does not change the determinant.
Proof. The first point follows immediately from Definition 2.46(1) with µ = 0. Moreover, note
that λA is the matrix with every row multiplied by λ. Since there are n rows, we have to apply
this rule for each row one by one and obtain det(λA) = λn det(A).
For the second point we assume that B is the matrix which is obtained by interchanging the i-th and the j-th row of the matrix A. (The case of columns uses again the transpose.) Recall from Definition 2.46(2) that, if a matrix contains a row more than once, then its determinant is zero.
It is apparent that the same formula also holds for triangular matrices.
Lemma 2.49. Let A ∈ R^{n×n} be an upper triangular matrix, which is a matrix of the form

A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ & \ddots & \vdots \\ 0 & & a_{nn} \end{pmatrix},

i.e., all entries below the diagonal are zero. Then,

\det A = \prod_{i=1}^{n} a_{ii}.

This formula also holds for lower triangular matrices, which are zero above the diagonal.
Proof. Let us first assume that rank(A) < n. Then at least one diagonal entry a_{ii} must be zero, so the product of the diagonal entries vanishes. Moreover, in this case we can produce a zero row in A by using only row operations, see the definition of the rank, and therefore the determinant is also zero.
If rank(A) = n, then we know that A is already in row echelon form and that, in particular,
aii 6= 0 for all i = 1, . . . , n. Therefore, by adding multiples of rows to other rows (“from bottom to
top”), we can transform the matrix to a diagonal matrix without changing its diagonal elements
and its determinant. We obtain det A = det(diag(a11 , a22 , . . . , ann )) which proves the result.
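This product formula is easy to confirm numerically; a small NumPy check (an aside, not part of the notes):

```python
import numpy as np

# an upper triangular matrix: the determinant is the product of the diagonal
A = np.array([[2.0, 7.0, 1.0],
              [0.0, -3.0, 5.0],
              [0.0, 0.0, 4.0]])

assert np.isclose(np.linalg.det(A), 2.0 * (-3.0) * 4.0)  # product = -24

# with a zero on the diagonal, the rank drops and the determinant vanishes
A[1, 1] = 0.0
assert np.isclose(np.linalg.det(A), 0.0)
```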
The determinant is of special interest because it can be used to characterize whether a linear system is uniquely solvable, see Lemma 2.43, i.e., whether the matrix has full rank. We will soon see that this is also equivalent to the corresponding matrix being regular (aka. invertible).
Theorem 2.50. For every A ∈ R^{n×n}, we have

\det A \neq 0 \iff \operatorname{rank} A = n.
Proof. Note that every matrix can be brought into row echelon form by just adding repeatedly
rows to other rows, and this does not change the determinant. Since the row echelon form
of a square matrix is always an upper triangular matrix, we see that the matrix has full rank,
i.e., rank(A) = n, if and only if all diagonal entries of the row echelon form are not zero. By
Lemma 2.49 this is equivalent to det(A) 6= 0.
Example 2.51. Consider the matrix A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}. We know that adding a multiple of one row to another does not change the determinant. Therefore,

\det \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \det \begin{pmatrix} 1 & 2 \\ 0 & -2 \end{pmatrix} = \det \begin{pmatrix} 1 & 0 \\ 0 & -2 \end{pmatrix} = -2.
Repeating these computations for arbitrary matrices, we obtain rather easy formulas to remember. (Verify yourself!) For a 2 × 2 matrix, one obtains

\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc.

Using this formula again to calculate the determinant of the above example, we see

\det \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = 1 \cdot 4 - 2 \cdot 3 = -2.
For a 3 × 3 matrix A = (a_{ij}), one obtains

\det A = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31} - a_{12}a_{21}a_{33} - a_{11}a_{23}a_{32}.

This formula is usually called the Rule of Sarrus, and can be easily memorized by noting that one has to multiply only entries on certain diagonals, and add them according to the orientation of these diagonals. (Think for a second which entries are multiplied, and how they are summed up.) Again we want to give an example:

\det \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} = 45 + 84 + 96 - 105 - 72 - 48 = 0.
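The Rule of Sarrus translates line by line into code. A plain Python sketch (a hypothetical helper, checked against NumPy's determinant):

```python
import numpy as np

def det3_sarrus(A):
    """Rule of Sarrus for a 3x3 matrix: add the three 'downward' diagonal
    products and subtract the three 'upward' ones."""
    a = A
    return (a[0][0]*a[1][1]*a[2][2] + a[0][1]*a[1][2]*a[2][0] + a[0][2]*a[1][0]*a[2][1]
            - a[0][2]*a[1][1]*a[2][0] - a[0][0]*a[1][2]*a[2][1] - a[0][1]*a[1][0]*a[2][2])

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert det3_sarrus(A) == 0                       # 45 + 84 + 96 - 105 - 48 - 72
assert np.isclose(np.linalg.det(np.array(A)), 0.0)
```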
Let us present two more calculation rules, without proof. The first shows that transposing does not change the determinant: for every A ∈ R^{n×n},

\det(A^T) = \det(A).
Note that this shows that Lemma 2.47 also holds if we replace "row" by "column". That is, the calculation rules for the determinant also hold for column operations.
The next rule of calculation shows that the determinant of the product of two matrices is the product of the respective determinants. Recall that the determinant is only defined for square matrices. Therefore, both matrices need to have the same dimensions.

Lemma 2.56. For all A, B ∈ R^{n×n}, we have

\det(AB) = \det(A) \cdot \det(B).

The proofs of the last two lemmas are not hard, but quite long, and therefore we omit them here. Note that the last lemma is mostly of theoretical value as, in general, we do not know if a matrix is a product of easier matrices.
Example 2.57. Consider the matrix A = \begin{pmatrix} 7 & 10 \\ 15 & 22 \end{pmatrix}. To compute its determinant, we are lucky to know that

\begin{pmatrix} 7 & 10 \\ 15 & 22 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},

i.e., A = B^2 = B \cdot B with B := \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}. Since \det(B) = -2, we obtain \det(A) = \det(B)^2 = 4.
Example 2.58. Another important case of Lemma 2.56 is that one of the determinants vanishes,
i.e., det(A) = 0 or det(B) = 0. In this case we already know that det(AB) = 0, without actually
computing the product AB. For example, since \det \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = 0, we already know that

\det\left( \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \right) = 0 \cdot \det \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = 0.
Now that we know that the determinant "behaves well" with respect to transposition and multiplication, one might guess that a similar relation also holds for addition. However, and unfortunately, there is no similar formula for the determinant of the sum of matrices, as the following simple example shows. Consider the matrices A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} and B = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, such that A + B = I_2. We obtain (e.g., by the formula as given in Example 2.53) that \det(A) = \det(B) = 0, but \det(A + B) = \det(I_2) = 1. This shows that, in general, we cannot extrapolate from the determinant of A and B to the determinant of their sum. (Clearly, there might be exceptions.)
Let us finally introduce the Laplace expansion for the determinant, which is also called co-
factor expansion or expansion along a row/column. This formula allows to compute the
determinant of large matrices recursively by computing the determinant of smaller matrices.
Let us first introduce a bit of new notation: for A ∈ R^{n×n}, let M_{ij} denote the determinant of the (n−1) × (n−1) matrix that is obtained from A by deleting the i-th row and the j-th column. These numbers are called the minors of A.
Theorem 2.61 (Laplace expansion). Let A = (a_{ij}) ∈ R^{n×n}. Then, we can compute the determinant of A by expansion along the i-th row, i.e., for any fixed i ∈ {1, . . . , n},

\det(A) = \sum_{j=1}^{n} (-1)^{i+j}\, a_{ij}\, M_{ij},

or by expansion along the j-th column, i.e., for any fixed j ∈ {1, . . . , n},

\det(A) = \sum_{i=1}^{n} (-1)^{i+j}\, a_{ij}\, M_{ij}.
As a proof of this result would require a more detailed analysis, we leave it out.
Although this result may look complicated at first sight, it is actually rather simple to apply,
and can lead to very fast computations if the matrix under consideration contains many zeros.
Example 2.62. Consider again the matrix A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}. We want to compute the determinant of A using expansion along the first row. By Theorem 2.61 (with i = 1) we see that
\det \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \sum_{j=1}^{2} (-1)^{1+j} a_{1j} M_{1j} = (-1)^{1+1} a_{11} M_{11} + (-1)^{1+2} a_{12} M_{12} = 1 \cdot M_{11} - 2 \cdot M_{12}.
Using M11 = det(4) = 4 and M12 = det(3) = 3, we obtain det(A) = 4 − 2 · 3 = −2, as required.
Example 2.63. Consider the matrix

A = \begin{pmatrix} 17 & * & * & 6 \\ 4 & 7 & 14 & 0 \\ 0 & 13 & 0 & 0 \\ 0 & 3 & 2 & 0 \end{pmatrix},

where ∗ stands for entries whose values will turn out to be irrelevant. To compute its determinant we look for a row or column with preferably only one non-zero entry. This makes the Laplace expansion particularly useful, because most of the terms in the sum vanish.
We choose the fourth column, i.e., we take the Laplace expansion with j = 4. (Clearly, there
are also other good choices.) We obtain
\det(A) = \sum_{i=1}^{4} (-1)^{i+4}\, a_{i4}\, M_{i4}.
To compute M_{14} we have to compute the determinant of the matrix that is obtained by deleting the first row and the last column of A. That is,

M_{14} = \det \begin{pmatrix} 4 & 7 & 14 \\ 0 & 13 & 0 \\ 0 & 3 & 2 \end{pmatrix} = 104.

The last computation could be done directly with the Rule of Sarrus, or by using again Laplace expansion along the second row to see that M_{14} = 13 \cdot \det \begin{pmatrix} 4 & 14 \\ 0 & 2 \end{pmatrix} = 13 \cdot 8 = 104. Since a_{14} = 6 is the only non-zero entry of the fourth column, we finally obtain \det(A) = (-1)^{1+4} \cdot 6 \cdot 104 = -624.
The examples above show that one may compute determinants very fast by using Laplace
expansion. Moreover, it is interesting that some of the entries (like the 17 in the upper left
corner) were not even needed in the computation.
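The recursive structure of the Laplace expansion is easy to turn into code. The following is a minimal Python sketch (the names `det` and `minor` are our own, not from the lecture); it expands along the first row and skips zero entries, exactly as in the examples above:

```python
def minor(A, i, j):
    """The matrix obtained from A by deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]

def det(A):
    """Determinant via Laplace expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    # sum over j of (-1)^{1+j} * a_{1j} * M_{1j}; zero entries contribute nothing
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j))
               for j in range(len(A)) if A[0][j] != 0)

print(det([[1, 2], [3, 4]]))                      # -2, as in Example 2.62
print(det([[4, 7, 14], [0, 13, 0], [0, 3, 2]]))   # 104, the minor M14 from above
```

Note that this recursion needs up to $n!$ multiplications, so it is only practical for small matrices or for matrices with many zeros.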
Now, all the terms appearing in the last system of equations can be written as determinants.
Recall that $\det(A) = a_{11}a_{22} - a_{12}a_{21}$. We obtain that the above system can be written as
$$\det(A)\,x_1 = \det\begin{pmatrix} b_1 & a_{12} \\ b_2 & a_{22} \end{pmatrix}, \qquad \det(A)\,x_2 = \det\begin{pmatrix} a_{11} & b_1 \\ a_{21} & b_2 \end{pmatrix}.$$
This shows that, whenever $\det(A)\neq 0$, we can just divide by it to obtain the precise values of $x_1$ and $x_2$, i.e.,
$$x_1 = \frac{1}{\det A}\det\begin{pmatrix} b_1 & a_{12} \\ b_2 & a_{22} \end{pmatrix} \qquad\text{and}\qquad x_2 = \frac{1}{\det A}\det\begin{pmatrix} a_{11} & b_1 \\ a_{21} & b_2 \end{pmatrix}.$$
Note that, to obtain the k-th entry of the (unique) solution x, we just need to replace the k-th
column of A by the RHS b and compute the corresponding determinant. After dividing by
det(A) we are done.
We will now see that the computations in the last example also work for more than two equations,
i.e., in the case n > 2. This is called Cramer’s rule.
$$x_k = \frac{\det(A_k)}{\det(A)},$$
where $A_k$ is given by
$$A_k := \begin{pmatrix}
a_{1,1} & \dots & a_{1,k-1} & b_1 & a_{1,k+1} & \dots & a_{1,n} \\
a_{2,1} & \dots & a_{2,k-1} & b_2 & a_{2,k+1} & \dots & a_{2,n} \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
a_{n,1} & \dots & a_{n,k-1} & b_n & a_{n,k+1} & \dots & a_{n,n}
\end{pmatrix}.$$
Proof. From Theorem 2.50, we know that $\operatorname{rank} A = n \iff \det A \neq 0$ and so, that there exists a
unique solution to the linear system Ax = b, see Lemma 2.43. Recall that the vectors ek ∈ Rn
with 1 ≤ k ≤ n are the unit vectors, and that x = (x1 , x2 , . . . , xn )T is the column vector
representing the solution. We now define the matrices
$$X_k = \begin{pmatrix} e_1 & e_2 & \dots & e_{k-1} & x & e_{k+1} & \dots & e_n \end{pmatrix}.$$
By computing the row echelon form of Xk we see that det(Xk ) = xk for all k = 1, . . . , n. If we
denote the columns of A by ck , i.e., A = (c1 , . . . , cn ), and recall that Aek = ck , we obtain
$$A\cdot X_k = \begin{pmatrix} Ae_1 & Ae_2 & \dots & Ae_{k-1} & Ax & Ae_{k+1} & \dots & Ae_n \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & \dots & c_{k-1} & Ax & c_{k+1} & \dots & c_n \end{pmatrix}.$$
Since Ax = b, we see that A · Xk = Ak with Ak given in the theorem. Using Lemma 2.56, we
see that
det(Ak ) = det(A · Xk ) = det(A) · det(Xk ) = xk · det A,
which proves the result.
We see that $\det A = 1$, hence Cramer's rule, see Theorem 2.64, implies that
$$x_1 = \det\begin{pmatrix} 7 & 3 \\ 16 & 7 \end{pmatrix} = 49 - 48 = 1 \qquad\text{and}\qquad x_2 = \det\begin{pmatrix} 1 & 7 \\ 2 & 16 \end{pmatrix} = 16 - 14 = 2.$$
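Cramer's rule is straightforward to implement. Below is a small Python sketch (the helper names are our own; the system matrix $A = \begin{pmatrix}1 & 3\\ 2 & 7\end{pmatrix}$ and RHS $b = (7, 16)^T$ are read off from the determinants displayed above):

```python
def det(A):
    # determinant via Laplace expansion along the first row
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det([r[:j] + r[j + 1:] for r in A[1:]])
               for j in range(len(A)))

def cramer(A, b):
    """Solve Ax = b via Cramer's rule; assumes det(A) != 0."""
    d = det(A)
    # A_k is A with its k-th column replaced by the RHS b
    return [det([row[:k] + [b[i]] + row[k + 1:] for i, row in enumerate(A)]) / d
            for k in range(len(A))]

print(cramer([[1, 3], [2, 7]], [7, 16]))   # [1.0, 2.0]
```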
for which we have $\det A = -280$. Then, due to Cramer's rule, we compute the following solution, denoted by $x = (x_1, x_2, x_3)^T$, for the RHS $b = (1,1,1)^T$:
$$x_1 = \frac{-1}{280}\det\begin{pmatrix} 1 & 1 & 3 \\ 1 & 0 & 11 \\ 1 & 5 & 0 \end{pmatrix} = \frac{29}{280}, \qquad x_2 = \frac{-1}{280}\det\begin{pmatrix} 8 & 1 & 3 \\ 7 & 1 & 11 \\ 5 & 1 & 0 \end{pmatrix} = \frac{27}{280}, \qquad x_3 = \frac{-1}{280}\det\begin{pmatrix} 8 & 1 & 1 \\ 7 & 0 & 1 \\ 5 & 5 & 1 \end{pmatrix} = \frac{7}{280}.$$
We now combine the findings of the last sections to give an explicit formula for the inverse
matrix, if it exists. This is particularly useful to compute solutions of a linear system Ax = b if
the RHS b is a priori not known. Moreover, the inverse is handy when we want to work with a
(unique) solution theoretically.
For completeness, let us repeat the definition.
Definition 2.67. Let $A\in\mathbb{R}^{n\times n}$ and assume that there exists some $A'\in\mathbb{R}^{n\times n}$ with the property that
$$A\cdot A' = A'\cdot A = I_n.$$
Then, we say that $A$ is invertible or regular, and we write $A^{-1} := A'$ to denote the inverse.
Note that a matrix must be a square matrix for both matrix products above to be defined. That's why we define the inverse only for $A\in\mathbb{R}^{n\times n}$.
1. We consider matrices as a kind of number and, for a matrix $A$, we look for another matrix that is the inverse element of $A$ with respect to matrix multiplication, see the field axioms (Axiom 1).
Ax = b ⇐⇒ x = In x = A−1 Ax = A−1 b.
$$\det A \neq 0 \iff \operatorname{rank} A = n$$
for a given matrix A ∈ Rn×n . We now combine that with Lemma 2.43 to show that A is bijective,
and hence invertible, in this case.
Let us state that as a lemma.
$$\det A \neq 0 \iff A \text{ is invertible,}$$
Proof. For the equivalence, note that rank(A) = n if and only if the linear system Ax = b has a
unique solution for all b ∈ Rn , see Lemma 2.43. In other words, the mapping A : Rn → Rn (i.e.,
A maps vectors to vectors) is injective (“For every b ∈ Rn there is at most one x with Ax = b.”)
and surjective (“For every b ∈ Rn there is some x with Ax = b.”). Hence, A is bijective, and
therefore invertible (aka. regular), see Theorem 1.16.
From Lemma 2.56 we know that $\det(AB) = \det(A)\cdot\det(B)$.
Example 2.69. Note that, if A is regular, then A−1 exists and is also regular. Hence, the
inverse of the inverse exists, and fulfills (A−1 )−1 = A. (Verify yourself!)
Note that the inverse of the product of matrices is the product of the inverses, but
we have to change the order (as for the transpose).
(A · B)−1 = B −1 · A−1 .
Proof. First note that $\det(AB) = \det(A)\det(B) \neq 0$ due to Lemma 2.56 and Lemma 2.68, which shows that $AB$ is regular. If we note that $(AB)(B^{-1}A^{-1}) = A(BB^{-1})A^{-1} = AA^{-1} = I_n$, we see that $B^{-1}A^{-1}$ is the inverse of $AB$.
For the computation of the inverse A−1 , denote the columns of A−1 by c1 , . . . , cn ∈ Rn , i.e.,
$$A^{-1} = \begin{pmatrix} c_1 & c_2 & \dots & c_n \end{pmatrix}.$$
We then have A−1 ek = ck , where ek is the k-th unit vector. (We used already earlier that
matrix-vector multiplication with a unit vector gives a column of the matrix.) Using the above
equivalence, with x = ck and b = ek , we see that
A−1 ek = ck ⇐⇒ Ack = ek .
That is, we can calculate ck , i.e., the k-th column of A−1 , by solving the linear system Ax = ek .
We can now use Cramer’s rule, together with the Laplace expansion, to compute the inverse
of $A$. Recall that the cofactor matrix of $A$ is defined by $C = (C_{ij})\in\mathbb{R}^{n\times n}$, where $C_{ij} = (-1)^{i+j}M_{ij}$ and $M_{ij}$ is the $(i,j)$-minor, i.e., the determinant of the matrix that is obtained by deleting the $i$-th row and the $j$-th column, see Definition 2.60.
The following theorem shows that one can compute the inverse of a matrix as the transpose of
its cofactor matrix divided by the determinant.
Theorem 2.71. Let $A\in\mathbb{R}^{n\times n}$ with $\det(A)\neq 0$, and let $C = (C_{ij})\in\mathbb{R}^{n\times n}$ be the cofactor
matrix of A. Then,
$$A^{-1} = \frac{1}{\det A}\,C^T, \qquad\text{i.e.,}\qquad (A^{-1})_{ij} = \frac{C_{ji}}{\det A},$$
where (A−1 )ij denotes the ij-th entry of A−1 .
Proof. Fix some j = 1, . . . , n. The discussion above shows that the j-th column of A−1 can be
computed by solving the linear system Ax = ej . Cramer’s rule, see Theorem 2.64, yields that
the i-th entry of the solution x = (x1 , . . . , xn ) to this linear system is given by
$$x_i = \frac{\det(A_i)}{\det(A)},$$
where
$$A_i = \begin{pmatrix}
a_{1,1} & \dots & a_{1,i-1} & 0 & a_{1,i+1} & \dots & a_{1,n} \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
a_{j-1,1} & \dots & a_{j-1,i-1} & 0 & a_{j-1,i+1} & \dots & a_{j-1,n} \\
a_{j,1} & \dots & a_{j,i-1} & 1 & a_{j,i+1} & \dots & a_{j,n} \\
a_{j+1,1} & \dots & a_{j+1,i-1} & 0 & a_{j+1,i+1} & \dots & a_{j+1,n} \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
a_{n,1} & \dots & a_{n,i-1} & 0 & a_{n,i+1} & \dots & a_{n,n}
\end{pmatrix}.$$
We just replaced the i-th column of A by ej . Now we use Laplace expansion, see Theorem 2.61,
with expansion along the i-th column. (Note that in the statement of the Laplace expansion we
used the j-th column. Therefore, we need to be careful with the indices.) We see that the only
non-zero entry in the $i$-th column of $A_i$ is the 1 in the $j$-th row. We obtain (for fixed $i$) $\det(A_i) = (-1)^{i+j}\cdot 1\cdot M_{ji} = C_{ji}$,
i.e., the determinant of Ai is the (j, i)-cofactor of A. (Note the reversed indices.) This finally
shows that
$$(A^{-1})_{ij} = x_i = \frac{C_{ji}}{\det A}.$$
2. Compute the cofactor matrix $C = (C_{ij})_{i,j=1}^{n}$ with $C_{ij} = (-1)^{i+j}M_{ij}$.
For example, we obtain M21 by deleting the second row and the first column of A, and compute
the determinant M21 = det(2) = 2.
To compute the cofactor matrix C, we have to multiply each entry Mij by (−1)i+j , i.e., we
multiply with −1 if i + j is odd, and leave all other entries unchanged. We obtain
$$C = \begin{pmatrix} 4 & -3 \\ -2 & 1 \end{pmatrix}.$$
The inverse matrix can now be used to calculate the unique solution to $Ax = b$ for any RHS $b$. We obtain
$$x = A^{-1}b = \begin{pmatrix} -2 & 1 \\ 3/2 & -1/2 \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} b_2 - 2b_1 \\ \frac{3b_1 - b_2}{2} \end{pmatrix}.$$
For example, the solution to $Ax = \binom{3}{1}$, i.e., we have $b = \binom{3}{1}$, is given by $x = \binom{1 - 2\cdot 3}{\frac{3\cdot 3 - 1}{2}} = \binom{-5}{4}$.
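The cofactor formula of Theorem 2.71 can be written out directly in code. Here is a minimal Python sketch (the function names are our own) that reproduces the inverse computed above:

```python
def det(A):
    # determinant via Laplace expansion along the first row
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det([r[:j] + r[j + 1:] for r in A[1:]])
               for j in range(len(A)))

def inverse(A):
    """A^{-1} = C^T / det(A) with C the cofactor matrix (Theorem 2.71)."""
    n, d = len(A), det(A)
    # minors M_ij: delete row i and column j, then take the determinant
    M = [[det([r[:j] + r[j + 1:] for k, r in enumerate(A) if k != i])
          for j in range(n)] for i in range(n)]
    C = [[(-1) ** (i + j) * M[i][j] for j in range(n)] for i in range(n)]
    return [[C[j][i] / d for j in range(n)] for i in range(n)]  # transpose and divide

print(inverse([[1, 2], [3, 4]]))   # [[-2.0, 1.0], [1.5, -0.5]]
```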
by using Cramer’s rule. First, we have that det A = 1. The matrix of minors is found to be
−7 4 4
M = −2 1 1 ,
−2 0 1
91
where, e.g., M11 = det( 11 81 ) = −7 and M32 = det( 14 28 ) = 0. We obtain the cofactor matrix
−7 −4 4
C = 2 1 −1 ,
−2 0 1
Finally, we want to discuss the Gauss-Jordan algorithm to compute the inverse. This method is very similar to the Gaussian elimination, and is sometimes handy, at least for small matrices. (I do not suggest using this method, as it is prone to error, but others think differently, and so I state it for completeness.)
For this recall that we can apply the Gaussian elimination to more vectors at once, see Re-
mark 2.45, which can be used to solve the linear system for different RHS’s simultaneously.
From the discussion above, we know that we actually need to solve the linear systems Ax = ej
for all j = 1, . . . , n to obtain all columns of A−1 . Hence, we can compute all columns of A−1 at
once by computing the reduced row echelon form of
$$(A\,|\,I_n) = \left(\begin{array}{ccc|ccc} a_{11} & \dots & a_{1n} & 1 & & 0 \\ \vdots & & \vdots & & \ddots & \\ a_{n1} & \dots & a_{nn} & 0 & & 1 \end{array}\right).$$
If A is regular, i.e., rank A = n, we know that the reduced row echelon form of A is the identity
matrix. Thus, by using Gaussian elimination, we are able to compute
$$(A\,|\,I) \longrightarrow (I\,|\,A^{-1}).$$
We apply the Gauss-Jordan algorithm, i.e., we transform the following augmented matrix into its reduced row echelon form:
$$\left(\begin{array}{cc|cc} 1 & 2 & 1 & 0 \\ 3 & 4 & 0 & 1 \end{array}\right)
\xrightarrow{II - 3I}
\left(\begin{array}{cc|cc} 1 & 2 & 1 & 0 \\ 0 & -2 & -3 & 1 \end{array}\right)
\xrightarrow{I + II}
\left(\begin{array}{cc|cc} 1 & 0 & -2 & 1 \\ 0 & -2 & -3 & 1 \end{array}\right)
\xrightarrow{-\frac{1}{2}\,II}
\left(\begin{array}{cc|cc} 1 & 0 & -2 & 1 \\ 0 & 1 & 3/2 & -1/2 \end{array}\right).$$
The result clearly agrees with the one from Example 2.72.
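The row operations can also be automated. The following is a minimal Python sketch of the Gauss-Jordan algorithm (the function name is our own; exact fractions are used to avoid rounding):

```python
from fractions import Fraction

def gauss_jordan_inverse(A):
    """Row-reduce (A | I) to (I | A^{-1}); assumes A is regular."""
    n = len(A)
    # build the augmented matrix (A | I_n) with exact arithmetic
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)  # non-zero pivot
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]              # normalize the pivot row
        for r in range(n):                            # clear the column elsewhere
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]                     # right half is A^{-1}

inv = gauss_jordan_inverse([[1, 2], [3, 4]])
print(inv)   # [[-2, 1], [3/2, -1/2]] as Fractions
```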
to see
$$A^{-1} = \begin{pmatrix} -7 & 2 & -2 \\ -4 & 1 & 0 \\ 4 & -1 & 1 \end{pmatrix},$$
which is the same result as in Example 2.73.
If the index set is clear, we may just write (an ) for (an )n∈I .
Although one considers a sequence as a mapping defined on the index set $I\subset\mathbb{Z}$, the notation expresses that we are dealing with a list of elements in a particular order. Thus, we clearly distinguish between the sequence $(a_n)$ and its range $\{a_n : n\in I\}$.
Note that two sequences (an )n∈I and (bn )n∈I are equal if and only if
∀n ∈ I : an = bn ,
in this case we write (an )n∈I = (bn )n∈I .
In the special cases M = R or M = C we say that (an )n∈I is a real-valued or complex-valued
sequence, respectively. We will focus on real-valued sequences in this lecture. However, most
statements also hold for complex-valued sequences. We comment on the differences when needed.
To define a sequence, the most common way is to use an explicit formula, for instance
$$a_n = 2^n \qquad\text{or}\qquad b_n = 1 + \frac{1}{n},$$
or by a recursion, i.e., we give one (or more) starting value(s) and a rule how to calculate a
new term from previous terms. For the above examples we could write
a1 = 2, an+1 = 2an
and
$$b_1 = 2, \qquad b_{n+1} = b_n\cdot\frac{n(n+2)}{(n+1)^2}.$$
(Verify these formulas!)
Example 3.2 (Fibonacci sequence). One of the most famous sequences, which appears in several
areas of natural science, is the so called Fibonacci sequence. Here, the recursion depends on
more than just the last value. The sequence $(F_n)_{n\in\mathbb{N}}$ is defined by
$$F_1 := F_2 := 1 \qquad\text{and}\qquad F_{n+1} := F_n + F_{n-1} \quad\text{for } n\ge 2.$$
The first values of this sequence are 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, . . .
It is an interesting phenomenon that the quotients $F_{n+1}/F_n$ converge to the golden ratio $\frac{1+\sqrt{5}}{2}$ (see e.g. Wikipedia for its importance). We will see later how to prove such statements.
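The convergence of the quotients can at least be observed numerically; a short Python sketch (the helper name is our own):

```python
def fib_ratios(n):
    """The first n quotients F_{k+1}/F_k of the Fibonacci sequence."""
    a, b = 1, 1              # F_1 = F_2 = 1
    out = []
    for _ in range(n):
        out.append(b / a)
        a, b = b, a + b      # recursion F_{k+1} = F_k + F_{k-1}
    return out

golden = (1 + 5 ** 0.5) / 2  # the golden ratio (1 + sqrt(5))/2
print(fib_ratios(10))        # oscillates around the golden ratio
print(abs(fib_ratios(30)[-1] - golden))   # already very small
```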
Example 3.3 (Infinite sums aka. series). If we want to add infinitely many numbers, say all the numbers $a_n$, $n\in\mathbb{N}$, then we can consider the new sequence $s_n = \sum_{k=1}^{n} a_k$, which can be given recursively by $s_1 = a_1$ and $s_{n+1} = s_n + a_{n+1}$.

Figure 19: convergence of the sequence $1 + \frac{(-1)^n}{n}$
It is mostly clear from the context if we consider real or complex neighborhoods. Note that,
for M = R and a ∈ R the ε-neighborhood Uε (a) is just the open interval (a − ε, a + ε). In the
complex case, i.e., M = C and a ∈ C, Uε (a) is the disc of radius ε around a in the complex
plane.
Remark 3.5. Note already now that this definition is quite flexible if we switch to more complex situations. That is, we can define neighborhoods whenever we have a measure for the 'distance' on the set M. We therefore use this notation to get used to it.
or, equivalently,
∀ε > 0 ∃n0 ∈ N ∀n ≥ n0 : an ∈ Uε (a),
or, equivalently,
$$\forall\varepsilon>0\ \exists n_0\in\mathbb{N}:\ (a_n)_{n=n_0}^{\infty}\subset U_\varepsilon(a).$$
(an )n∈N is called convergent, or we say that the limit of (an )n∈N exists, if there exists
some a ∈ C such that an → a, otherwise (an )n∈N is called divergent.
The statement
∀ε > 0 ∃n0 ∈ N ∀n ≥ n0 : |an − a| < ε
can be equivalently phrased as:
For all $\varepsilon>0$ we have $|a_n - a| < \varepsilon$ for almost all $n$, i.e., for all but finitely many $n$.
For the second wording, note that there must be a largest of the finitely many exceptions (i.e.,
the n for which |an − a| ≥ ε). One may choose n0 just one larger than this number.
Remark 3.7. Note that the limit does not depend on the first terms of a sequence. So, in
particular, if (bn ) is a sequence with bn = an for almost all n, then an → a ⇐⇒ bn → a.
Example 3.9. Consider the sequence (an )n∈N with an = (−1)n . This sequence is divergent. For
a proof we assume the opposite, i.e., that (an ) converges to some a ∈ R. Now, by the definition
of convergence, we have that there exists some large enough n0 such that an ∈ U1/2 (a) for all
n ≥ n0 . (Note that the 1/2 is arbitrary here. Every ε < 1 would work.) However, we always
have $|a_{n+1} - a_n| > 1$, and therefore, if $a_n\in U_{1/2}(a)$, we have $a_{n+1}\notin U_{1/2}(a)$. In particular, there cannot be an $n_0$ such that $a_n\in U_{1/2}(a)$ for all $n\ge n_0$: a contradiction. Hence, $(a_n)$ cannot be
a convergent sequence.
We now turn to some special properties of sequences or, as one might say, we just give names
to sequences with special properties. Afterwards we analyse the relation of these properties.
Definition 3.10 (Null sequence). Let $(a_n)_{n\in\mathbb{N}}$ be a real sequence such that
$$\lim_{n\to\infty} a_n = 0.$$
Then we call $(a_n)_{n\in\mathbb{N}}$ a null sequence.
Example 3.11. The sequences ( n1 )n∈N and (2−n )n∈N are null sequences.
Definition 3.13. Let (an )n∈N ⊂ C be a sequence. We call the sequence bounded if
∃R > 0 ∀n ∈ N : |an | ≤ R.
Moreover, if $(a_n)_{n\in\mathbb{N}}\subset\mathbb{R}$, we call the sequence bounded from above if
$$\exists C\in\mathbb{R}\ \forall n\in\mathbb{N}:\ a_n\le C,$$
and bounded from below if
$$\exists c\in\mathbb{R}\ \forall n\in\mathbb{N}:\ a_n\ge c.$$
In other words a sequence is bounded (from above/below), if and only if its range is a bounded
set (from above/below).
Theorem 3.15. Let (an )n∈N be a convergent sequence. Then (an )n∈N is bounded.
Proof. Fix $\varepsilon_0 = 1$. Then we can find an $N$ such that $\forall n\ge N: |a_n - a| < 1$. The triangle inequality
yields
|an | ≤ |an − a| + |a| < 1 + |a|,
for all n ≥ N . For the remaining elements {a1 , . . . aN −1 } we simply take the maximum, c1 =
max{|a1 |, |a2 |, . . . , |aN −1 |}. The maximum of a finite amount of real numbers always exists.
Hence
∀n ∈ N : |an | ≤ max{c1 , |a| + 1}.
Example 3.16. The sequence given by bn = (−1)n is bounded by 1 but not convergent. So the
other direction of the above theorem does not hold in general.
Definition 3.17. Let $(a_n)_{n\in\mathbb{N}}\subset\mathbb{R}$. The sequence $(a_n)_{n\in\mathbb{N}}$ tends to $\infty$ ($=+\infty$) if
$$\forall R>0\ \exists n_0\in\mathbb{N}\ \forall n\ge n_0:\ a_n > R.$$
We now study how to determine the limit of (complicated) sequences. This always follows the same procedure: either we already know the limit of the sequence under consideration, or we have to split the sequence into easier parts that can be handled, or split again.
To do this effectively, one needs a sufficiently large collection of known limits, and we will present the most important ones below. Together with some rules of calculation, this will allow us to compute the limits of quite complicated sequences.
Let us start with a lemma that shows how to verify that a sequence is a null sequence by
comparison with another null sequence. The proof is left to the reader.
Lemma 3.19. If $(a_n)_{n\in\mathbb{N}}\subset\mathbb{C}$ is a null sequence and $(b_n)_{n\in\mathbb{N}}\subset\mathbb{C}$ is a sequence with $|b_n|\le C\,|a_n|$ for some $C>0$ and all $n\in\mathbb{N}$, then $(b_n)_{n\in\mathbb{N}}$ is also a null sequence.
From this lemma we directly see that the sequences $\left(\frac{C}{n^c}\right)_{n\in\mathbb{N}}$ for fixed $c, C > 0$ are null sequences.
Let us now consider other important “building blocks”, i.e., limits that may be considered known
from now on, together with the corresponding proofs.
Let $z\in\mathbb{C}$ with $|z| < 1$. Then,
$$\lim_{n\to\infty} z^n = 0.$$
Proof. For $z = 0$ the result is clear. For $z\neq 0$ we set $x > 0$ such that $|z| = \frac{1}{1+x}$. With the Bernoulli inequality or the binomial formula we get $(1+x)^n \ge 1 + nx$ and obtain
$$|z^n| = \frac{1}{(1+x)^n} \le \frac{1}{1+nx} \le \frac{1}{x}\cdot\frac{1}{n}.$$
Since ( n1 ) is a null sequence, we get from Lemma 3.19 that z n is a null sequence as well.
As the above sequence is not bounded for |z| > 1 (Check yourself!), we obtain from Theorem 3.15
that it cannot be convergent, i.e., (z n ) is divergent if |z| > 1.
In the case |z| = 1, we cannot say in general if the sequence is convergent or not: Although we
have the constant (and therefore clearly convergent sequence) for z = 1, the sequence (z n ) does
not converge for other z, like z = −1 or z = eiπ/2 . We leave out the details here.
The next example shows what happens if we replace the $n$-th power by an $n$-th root: for every fixed $a > 0$ we have
$$\lim_{n\to\infty}\sqrt[n]{a} = 1.$$
Proof. Let us first consider $a\ge 1$. We will show that $x_n := \sqrt[n]{a} - 1$ satisfies $x_n\to 0$, which proves the statement. Since $a\ge 1$ we have $x_n\ge 0$. By Bernoulli's inequality $(1+x)^n\ge 1+nx$, which holds for $x\ge -1$, we obtain
$$a = \left(\sqrt[n]{a}\right)^n = (1+x_n)^n \ge 1 + n x_n.$$
This implies
$$|x_n| = x_n \le \frac{a-1}{n}.$$
Since $(\frac{1}{n})$ is a null sequence we get that $(x_n)$ is also a null sequence.
For $a < 1$, let $b = 1/a > 1$ and consider $x_n = \sqrt[n]{b} - 1$. From the above we know that $(x_n)$ is a (non-negative) null sequence. Moreover, we have $\sqrt[n]{a}\le 1$ and therefore
$$\left|\sqrt[n]{a} - 1\right| = 1 - \sqrt[n]{a} = \sqrt[n]{a}\left(\sqrt[n]{b} - 1\right) \le 1\cdot x_n.$$
The next important limit shows that the constant a in the example above may even be replaced
by an unbounded sequence.
Example 3.22.
$$\lim_{n\to\infty}\sqrt[n]{n} = 1.$$
Proof. Let $x_n := \sqrt[n]{n} - 1$, so we have to show that $x_n\to 0$. From Lemma 1.56 with $k = 2$ we obtain
$$n = (1+x_n)^n \ge 1 + \binom{n}{2}x_n^2 = 1 + \frac{n(n-1)}{2}\,x_n^2.$$
This implies
$$|x_n| = x_n \le \sqrt{\frac{2}{n}}.$$
Since $(\frac{1}{\sqrt{n}})$ is a null sequence we get that $(x_n)$ is a null sequence, which proves $\sqrt[n]{n}\to 1$.
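Both root limits, together with the explicit bounds from the two proofs, can be checked numerically (a sketch; the constant a = 2024 is an arbitrary choice):

```python
import math

a = 2024.0
for n in (10, 100, 1000, 10 ** 6):
    xn = a ** (1 / n) - 1      # x_n = a^{1/n} - 1, bounded by (a - 1)/n
    yn = n ** (1 / n) - 1      # y_n = n^{1/n} - 1, bounded by sqrt(2/n)
    assert 0 <= xn <= (a - 1) / n
    assert 0 <= yn <= math.sqrt(2 / n)
print(a ** (1 / 10 ** 6), (10 ** 6) ** (1 / 10 ** 6))  # both very close to 1
```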
We state one last example before we turn to general rules for the calculation with limits. This
example could be phrased as “exponential growth is faster than polynomial growth”,
and is one of the basic arguments when dealing with limits.
Example 3.23. Let z ∈ C with |z| > 1 and k ∈ Z. Then,
$$\lim_{n\to\infty}\frac{n^k}{z^n} = 0.$$
Proof. The proof of this limit combines all the ideas from the preceding examples.
First, note that the limit is already clear from the above if $k\le 0$. (Why?)
For $k > 0$, set $x := |z| - 1$ with $x > 0$, and assume that $n > 2k$. (This is possible, since we are only interested in large $n$.) From Lemma 1.56 we obtain
$$|z|^n = (1+x)^n > \binom{n}{k+1}x^{k+1} = \frac{n\cdot(n-1)\cdots(n-k)}{(k+1)!}\,x^{k+1}.$$
From $n > 2k$, we obtain that $n - k > n/2$ (or, more generally, $n - k + \ell > n/2$ for all $\ell\in\{0,\dots,k\}$). Therefore,
$$|z|^n = (1+x)^n > \frac{n\cdot(n-1)\cdots(n-k)}{(k+1)!}\,x^{k+1} > \frac{(n/2)^{k+1}}{(k+1)!}\,x^{k+1}.$$
It follows that
$$\left|\frac{n^k}{z^n}\right| = \frac{n^k}{|z|^n} < \frac{2^{k+1}(k+1)!}{x^{k+1}}\cdot\frac{n^k}{n^{k+1}} =: K\cdot\frac{1}{n},$$
where $K$ is a constant (i.e., does not depend on $n$). As $(\frac{1}{n})$ is a null sequence, we get that $(\frac{n^k}{z^n})$ is also a null sequence.
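The statement that exponential growth beats polynomial growth is easy to watch numerically, e.g. for k = 5 and z = 1.1 (arbitrary choices):

```python
# n^k / z^n first grows, but eventually dies off geometrically
k, z = 5, 1.1
terms = [n ** k / z ** n for n in (10, 100, 500, 1000)]
print(terms)   # the last entries are extremely close to 0
```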
The following calculation rules for convergent sequences and their limits will be very useful if
we want to compute the limits of more complicated sequences.
Theorem 3.24. Let $(a_n)_{n\in\mathbb{N}}$, $(b_n)_{n\in\mathbb{N}}$ be convergent sequences, and let $\lambda\in\mathbb{C}$. Moreover, let $a := \lim_{n\to\infty} a_n$ and $b := \lim_{n\to\infty} b_n$. Then, we have
(i) $\lim_{n\to\infty}(a_n + b_n) = a + b$,
(ii) $\lim_{n\to\infty}(\lambda\cdot a_n) = \lambda\cdot a$,
(iii) $\lim_{n\to\infty}(a_n\cdot b_n) = a\cdot b$,
(iv) $\lim_{n\to\infty}\frac{a_n}{b_n} = \frac{a}{b}$, provided that $b\neq 0$ (and $b_n\neq 0$ for all $n$).
Proof. For the first statement we need to show that $\forall\varepsilon>0\ \exists n_0\in\mathbb{N}\ \forall n\ge n_0:\ |a_n + b_n - (a+b)|\le\varepsilon$. Therefore let $\varepsilon>0$ be arbitrary but fixed. Using the triangle inequality we get that
$$|a_n + b_n - (a+b)| \le |a_n - a| + |b_n - b|.$$
Since $(a_n)_{n=1}^{\infty}$ and $(b_n)_{n=1}^{\infty}$ are convergent sequences we can find $N\in\mathbb{N}$ such that for all $n\ge N$:
$$|a_n - a| \le \frac{\varepsilon}{2} \quad\text{and}\quad |b_n - b| \le \frac{\varepsilon}{2},$$
so that $|a_n + b_n - (a+b)| \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon$ for all $n\ge N$. Since this holds for arbitrary $\varepsilon$ we get the result.
For the second statement, we use that $|\lambda a_n - \lambda a| = |\lambda|\cdot|a_n - a|$. As $(a_n)_{n=1}^{\infty}$ is convergent we get the result.
For the third statement, note that
$$|a_n b_n - ab| \le |a_n b_n - a b_n| + |a b_n - ab| = |b_n|\cdot|a_n - a| + |a|\cdot|b_n - b|.$$
Since $(b_n)_{n=1}^{\infty}$ is convergent it is bounded, so $\exists C\in\mathbb{R}\ \forall n\in\mathbb{N}: |b_n|\le C$. Hence
$$|a_n b_n - ab| \le C\,|a_n - a| + |a|\cdot|b_n - b|.$$
Therefore, the desired statement follows from the convergence of $(a_n)$ and $(b_n)$.
For the last statement, we only need to prove that $\lim_{n\to\infty}\frac{1}{b_n} = \frac{1}{b}$ if $b\neq 0$. The more general statement then follows together with part (iii).
Let us assume w.l.o.g. (i.e., without loss of generality) that $b > 0$. (Otherwise, consider the sequence $(-b_n)$.) We obtain that $\exists N_0\in\mathbb{N}\ \forall n\ge N_0:\ b_n > \frac{b}{2}$. (Why?) Hence,
$$\left|\frac{1}{b_n} - \frac{1}{b}\right| = \frac{|b - b_n|}{b_n\cdot b} < \frac{2}{b^2}\,|b - b_n|,$$
and the claim follows from the convergence of $(b_n)$ to $b$.
Remark 3.25. For a complex-valued sequence (zn )n∈N , convergence to the complex number
z = x + iy ∈ C, i.e., zn → z, is equivalent to the convergence of the real and imaginary parts of
zn to x and y, respectively.
The first direction, i.e., that
$$\operatorname{Re} z_n \to x \quad\text{and}\quad \operatorname{Im} z_n \to y$$
implies $z_n\to x+iy$, follows from Theorem 3.24(i).
For the reverse statement, choose ε > 0 and n0 such that |zn − z| < ε for n ≥ n0 holds. Then,
for n ≥ n0 it holds that
| Re zn − Re z| = | Re(zn − z)| ≤ |zn − z| < ε.
As this holds for all ε > 0, this proves the convergence of the real part of zn to Re z = x. The
convergence of the imaginary part can be proven analogously. Consequently, we can say that
zn → z ⇐⇒ Re zn → Re z and Im zn → Im z.
With the help of calculation rules and the knowledge about limits we can compute more sophis-
ticated limits.
Example 3.26.
$$\lim_{n\to\infty}\frac{3 - 7\sqrt{n} + 42n}{n} = 3\cdot\lim_{n\to\infty}\frac{1}{n} - 7\lim_{n\to\infty}\frac{1}{\sqrt{n}} + \lim_{n\to\infty}42 = 0 - 0 + 42 = 42.$$
Example 3.27. For every fixed $k\in\mathbb{Z}$ we have $\lim_{n\to\infty}\sqrt[n]{n^k} = 1$.
Proof. For $k = 0$ this is obvious (and for $k = 1$ we have shown that already in Example 3.22). Moreover, we have for arbitrary $k\in\mathbb{Z}$ that
$$\lim_{n\to\infty}\sqrt[n]{n^{k+1}} = \lim_{n\to\infty}\sqrt[n]{n^k}\,\sqrt[n]{n} = \lim_{n\to\infty}\sqrt[n]{n^k}\cdot\lim_{n\to\infty}\sqrt[n]{n} = \lim_{n\to\infty}\sqrt[n]{n^k}\cdot 1 = \lim_{n\to\infty}\sqrt[n]{n^k},$$
as well as
$$\lim_{n\to\infty}\sqrt[n]{n^{k-1}} = \lim_{n\to\infty}\sqrt[n]{n^k}\cdot\lim_{n\to\infty}\frac{1}{\sqrt[n]{n}} = \lim_{n\to\infty}\sqrt[n]{n^k}\cdot\frac{1}{\lim_{n\to\infty}\sqrt[n]{n}} = \lim_{n\to\infty}\sqrt[n]{n^k},$$
where we used Theorem 3.24(iv) for the second equality. By induction on $k$ in both directions (with induction basis $k = 0$), we get that the limit is the same for all $k$, and therefore equals 1.
The next result gives another tool for the calculation of difficult limits. This one is helpful
when the sequence under consideration can be bounded from above and below by sequences
that converge to the same limit.
Theorem 3.28 (Sandwich rule). Let (an )n∈N and (cn )n∈N be convergent real-valued se-
quences and let (bn )n∈N be a sequence such that
an ≤ bn ≤ cn for all n ∈ N.
If, additionally,
$$a := \lim_{n\to\infty} a_n = \lim_{n\to\infty} c_n,$$
then $(b_n)_{n\in\mathbb{N}}$ is convergent with $\lim_{n\to\infty} b_n = a$.
Proof. Let $\varepsilon > 0$ be arbitrary. Since $(a_n)_{n\in\mathbb{N}}$ and $(c_n)_{n\in\mathbb{N}}$ both converge to $a$, there is some $n_0$ such that $b_n - a\le c_n - a\le\varepsilon$ and $a - b_n\le a - a_n\le\varepsilon$ for all $n\ge n_0$. Thus, $|b_n - a|\le\varepsilon$ for $n\ge n_0$. As this holds for all $\varepsilon$, we obtain $\lim_{n\to\infty} b_n = a$.
Remark 3.29. Note that, as we consider limits, the assumption an ≤ bn ≤ cn in the sandwich
rule is not essential for the first terms and may be replaced by ∃N ∈ N ∀n ≥ N : an ≤ bn ≤ cn .
In other cases, we may not even know the precise values of the terms of a sequence, since the explicit or recursive formula for them is too complicated. Also in such cases we can possibly apply the sandwich rule to obtain the limit.
Example 3.31. Consider the sequence $b_n = \frac{(1+\sin(n))^n}{n\,2^n}$. Since $|\sin(x)|\le 1$ for all $x\in\mathbb{R}$, we have
$$0 \le \frac{(1+\sin(n))^n}{n\,2^n} \le \frac{2^n}{n\,2^n} = \frac{1}{n} \quad\text{for all } n\in\mathbb{N}.$$
Using the sandwich rule and that the sequences on both sides are null sequences, we obtain $b_n\to 0$.
(Note that we did not even need the precise values of $\sin(n)$.)
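The squeeze $0\le b_n\le\frac{1}{n}$ can also be confirmed numerically; in the sketch below we rewrite $b_n$ as $\left(\frac{1+\sin n}{2}\right)^n\frac{1}{n}$ to avoid huge intermediate powers:

```python
import math

for n in (1, 10, 100, 1000):
    bn = ((1 + math.sin(n)) / 2) ** n / n   # equals (1 + sin n)^n / (n 2^n)
    assert 0 <= bn <= 1 / n
print("0 <= b_n <= 1/n holds for the sampled n")
```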
We end this section with calculation rules for definitely divergent sequences:
• $a_n\to\infty,\ b_n\to\infty \implies a_n + b_n\to\infty$ and $a_n\cdot b_n\to\infty$
• $a_n\to\infty,\ b_n\to b \implies a_n + b_n\to\infty$
• $a_n\to\infty,\ \alpha\in\mathbb{R} \implies \frac{\alpha}{a_n}\to 0$
• $a_n\to\infty,\ \alpha>0 \implies \alpha\cdot a_n\to\infty$
• $a_n\to\infty,\ \alpha<0 \implies \alpha\cdot a_n\to-\infty$
(Verify yourself!)
If $a_n\to\infty$ and $b_n\to\infty$, no general rule can be given for $(a_n - b_n)$ and $\left(\frac{a_n}{b_n}\right)$. Similarly, the limit of $(a_n b_n)$ for $a_n\to 0$ and $b_n\to\infty$ needs more care: these limits do not have to exist, nor be definitely divergent; consider e.g. $a_n = (-1)^n/n$ and $b_n = n$.
We will come back to the computation of such limits later.
We saw that it is essential to verify that sequences are convergent before applying the rules above. Here, we show that a large class of sequences, namely all monotone and bounded sequences, is convergent. This is an essential insight.
Since we want to assume that the terms of a sequence are (monotonically) ordered, we need an
order, and therefore only work with real-valued sequences here.
A sequence $(a_n)_{n\in\mathbb{N}}\subset\mathbb{R}$ is called non-decreasing if
$$\forall n\in\mathbb{N}:\ a_{n+1}\ge a_n,$$
and non-increasing if
$$\forall n\in\mathbb{N}:\ a_{n+1}\le a_n.$$
Note that a sequence that is non-decreasing and non-increasing at the same time, must be a
constant sequence.
Many of the sequences, that we have discussed so far, were strictly monotone. This holds clearly
for the sequences (n)n∈N , ( n1 )n∈N or more general (nk )n∈N for k ∈ Z \ {0}, as well as the sequence
(an )n∈N for a ∈ (0, 1) or a > 1. However, for some sequences this is not so easy to see.
Example 3.33. Let us have a look at the sequence given by $a_n := \frac{1}{n^2-n+1}$. This is a decreasing sequence, since
$$a_{n+1} = \frac{1}{(n+1)^2-(n+1)+1} = \frac{1}{n^2+2n+1-n-1+1} = \frac{1}{n^2+n+1} < \frac{1}{n^2-n+1} = a_n.$$
In some cases, a helpful trick is to consider the quotients of consecutive terms of a sequence with positive terms and show that they are bounded (from above or below) by one. That is, e.g., such a sequence is non-decreasing if
$$\frac{a_{n+1}}{a_n}\ge 1 \quad\text{for all } n\in\mathbb{N},$$
and non-increasing if
$$\frac{a_{n+1}}{a_n}\le 1 \quad\text{for all } n\in\mathbb{N}.$$
Example 3.34. One interesting example, that we will study in more detail soon, is the sequence given by
$$a_n := \left(1+\frac{1}{n}\right)^n = \left(\frac{n+1}{n}\right)^n.$$
We consider quotients of successive terms and observe that
$$\frac{a_{n+1}}{a_n} = \frac{\left(\frac{n+2}{n+1}\right)^{n+1}}{\left(\frac{n+1}{n}\right)^n} = \left(\frac{(n+2)\,n}{(n+1)^2}\right)^{n+1}\frac{n+1}{n} = \left(1-\frac{1}{(n+1)^2}\right)^{n+1}\frac{n+1}{n}.$$
The Bernoulli inequality $(1+x)^{n+1}\ge 1+(n+1)x$, with $x = -\frac{1}{(n+1)^2}\ge -1$, yields
$$\frac{a_{n+1}}{a_n} \ge \left(1-\frac{1}{n+1}\right)\frac{n+1}{n} = 1.$$
Hence (an ) is a non-decreasing sequence. (If we note that Bernoulli’s inequality is a strict
inequality for $x > -1$ with $x\neq 0$, we even obtain that $(a_n)$ is increasing.)
The following result shows that monotonicity is a very helpful property as we only have to know
if a sequence is bounded to verify whether it is convergent or not. Note that boundedness of a
sequence is usually much easier to show.
Theorem 3.35 (Monotonicity principle). Let $(a_n)_{n\in\mathbb{N}}\subset\mathbb{R}$ be a monotone sequence. Then, $(a_n)_{n\in\mathbb{N}}$ is convergent if and only if it is bounded.
By the completeness axiom, supremum and infimum exist for every bounded subset of R.
Proof. We know from Theorem 3.15 that convergent sequences are bounded, which proves the
first direction of the statement.
For the second, let us consider the case where (an )n∈N is non-decreasing. The other case, where
$(a_n)_{n\in\mathbb{N}}$ is non-increasing, can be treated in the same way (replacing, in particular, sup by inf).
We now assume that (an )n∈N is bounded, and prove that it converges to a = sup{an : n ∈ N}.
Since (an )n∈N is bounded, we get that the range of (an )n∈N is a bounded set, which implies that
a = sup{an : n ∈ N} exists. Let now ε > 0 be arbitrary but fixed. Since a is the supremum
of the range of (an )n∈N we get (by definition) that a − ε is no upper bound for the sequence
$(a_n)_{n\in\mathbb{N}}$. Thus, there exists an $n_0$ with $a_{n_0} > a - \varepsilon$. Since $(a_n)_{n\in\mathbb{N}}$ is non-decreasing, the same
then holds for n ≥ n0 . That is, ∃n0 ∈ N ∀n ≥ n0 : a − ε < an . In addition, we clearly have
an ≤ a < a + ε. Hence, ∃n0 ∈ N ∀n ≥ n0 : |an − a| < ε. As this holds for all ε > 0, we obtain
that (an )n∈N converges to a.
The statement for unbounded sequences can be proven similarly, and is left for the reader.
With this theorem we see that there are convergent sequences where we do not have to know
the limit to verify that it exists. In some cases, we may even define numbers just as limits of
specific sequences, because we do not have another (explicit) description. One typical example
is Euler’s number:
Example 3.36 (Euler number). Consider sequences (an )n∈N , (bn )n∈N which are defined by
$$a_n := \left(1+\frac{1}{n}\right)^n = \left(\frac{n+1}{n}\right)^n \qquad\text{and}\qquad b_n := \left(1+\frac{1}{n}\right)^{n+1} = \left(\frac{n+1}{n}\right)^{n+1}.$$
Clearly an ≤ bn . Note that we have considered the sequence (an ) already in Example 3.34 and
showed that it is non-decreasing (which also implies that an ≥ a1 = 2 for all n ∈ N). It remains
to bound (an ) from above to show that it is convergent. For this, we show that (bn ) is bounded
from above. Together with an ≤ bn this implies also the boundedness of (an ). In fact, we show
that (bn ) is non-increasing, which implies that bn ≤ b1 = 4 for all n ∈ N. Again we compute
quotients of consecutive terms and obtain
$$\frac{b_n}{b_{n+1}} = \frac{\left(\frac{n+1}{n}\right)^{n+1}}{\left(\frac{n+2}{n+1}\right)^{n+2}} = \left(\frac{(n+1)^2}{n(n+2)}\right)^{n+2}\frac{n}{n+1} = \left(1+\frac{1}{n(n+2)}\right)^{n+2}\frac{n}{n+1} \ge \left(1+\frac{1}{n}\right)\frac{n}{n+1} = 1,$$
where we used Bernoulli's inequality $(1+x)^{n+2}\ge 1+(n+2)x$, with $x = \frac{1}{n(n+2)}\ge -1$.
From this we have $\frac{b_{n+1}}{b_n}\le 1$ and therefore that $(b_n)$ is non-increasing. (In fact, it is decreasing.)
All in all,
$$2 = a_1 \le a_2 \le \dots \le a_n \le \dots \le b_n \le \dots \le b_2 \le b_1 = 4.$$
This shows that the limit of (an )n∈N exists and equals sup(an ), and we define this limit to be
Euler’s number. Moreover, the limit of (bn ) also exists, equals inf(bn ), and we show that this is
also equal to $e$. For this consider
$$b_n - a_n = \left(1+\frac{1}{n}\right)^n\left(\left(1+\frac{1}{n}\right) - 1\right) = \frac{a_n}{n} \le \frac{4}{n}.$$
This implies $a_n\le b_n\le a_n + \frac{4}{n}$ and the sandwich rule yields $\lim a_n = \lim b_n$. (One may also use
that, with cn = (n + 1)/n, we have bn = an · cn . Then, cn → 1 yields the result.)
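The monotone squeeze of this example is easily observed numerically (a sketch):

```python
import math

def a(n): return (1 + 1 / n) ** n          # non-decreasing, a_n <= e
def b(n): return (1 + 1 / n) ** (n + 1)    # non-increasing, b_n >= e

for n in range(1, 1000):
    assert a(n) <= a(n + 1) <= math.e <= b(n + 1) <= b(n)
print(a(10 ** 6), math.e, b(10 ** 6))      # the two bounds close in on e
```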
This example shows, in particular, that it may happen that a sequence is converging but we do
not know the precise value of its limit. In some cases, however, we are able to find the limit (or
at least some possible candidates) by alternative considerations. For example, if the sequence
(an )n∈N is convergent, we have
$$\lim_{n\to\infty} a_n = \lim_{n\to\infty} a_{n+1} = \lim_{n\to\infty} a_{n+2} = \dots,$$
which may just be seen as ignoring the first terms of a sequence. Such equations are particularly
which may just be seen as ignoring the first terms of a sequence. Such equations are particularly
useful for recursively defined sequences.
Example 3.37. Let x > 0. We consider the following recursively defined sequence:
$$a_1 > 0 \text{ arbitrary} \qquad\text{and}\qquad a_{n+1} := \frac{1}{2}\left(a_n + \frac{x}{a_n}\right) \quad\text{for all } n\in\mathbb{N}.$$
It is obvious that (an ) is a positive sequence, i.e., an > 0 for all n ∈ N. We now show that (an )
is decreasing for n ≥ 2, which implies, by the monotonicity principle, that (an ) is convergent.
(Note that, since we are only interested in limits, it is always ok to prove things only for large n.)
We obtain
$$a_{n+1} = \frac{1}{2}\left(a_n + \frac{x}{a_n}\right) \le a_n \iff a_n + \frac{x}{a_n} \le 2a_n \iff x \le a_n^2$$
for all $n\in\mathbb{N}$. This is equivalent to $a_n\ge\sqrt{x}$, because the $a_n$ are positive. We now show that for all $n\ge 2$ it holds that $a_n\ge\sqrt{x}$, or equivalently: $a_{n+1}\ge\sqrt{x}$ for all $n\in\mathbb{N}$:
$$a_{n+1}\ge\sqrt{x} \iff \frac{1}{2}\left(a_n+\frac{x}{a_n}\right)\ge\sqrt{x} \iff a_n^2 + x \ge 2\sqrt{x}\,a_n \iff a_n^2 - 2\sqrt{x}\,a_n + x \ge 0 \iff \left(a_n-\sqrt{x}\right)^2 \ge 0.$$
Since the last statement is clearly true, we finally obtain that (an )n∈N is decreasing for n ≥ 2,
and therefore convergent.
Moreover, we can determine the limit by exploiting the fact that the limits of (an )n∈N and
$(a_{n+1})_{n\in\mathbb{N}}$ are the same. Let $a := \lim(a_n) = \lim(a_{n+1})$. Then,
$$a = \lim_{n\to\infty} a_{n+1} = \lim_{n\to\infty}\frac{1}{2}\left(a_n + \frac{x}{a_n}\right) = \frac{1}{2}\left(\lim_{n\to\infty}a_n + \frac{x}{\lim_{n\to\infty}a_n}\right) = \frac{1}{2}\left(a + \frac{x}{a}\right).$$
With the same calculations as above we see that this equation can only be fulfilled if
$$a = \frac{1}{2}\left(a + \frac{x}{a}\right) \iff a^2 = x \iff a = \pm\sqrt{x}.$$
Since $(a_n)$ is non-negative, $-\sqrt{x}$ cannot be the limit of the sequence and therefore $\sqrt{x}$ is the only possibility, i.e.,
$$\lim_{n\to\infty} a_n = \sqrt{x}.$$
(Note that the limit would be $-\sqrt{x}$ if the 'starting value' $a_1$ were negative. Check yourself!)
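This recursion (known as Heron's method, a special case of Newton's method) converges very fast in practice; a minimal Python sketch:

```python
import math

def heron_sqrt(x, a1=1.0, steps=8):
    """Iterate a_{n+1} = (a_n + x/a_n)/2; converges to sqrt(x) for a1 > 0."""
    a = a1
    for _ in range(steps):
        a = 0.5 * (a + x / a)
    return a

print(heron_sqrt(2.0), math.sqrt(2))   # agree to machine precision for x = 2
```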
The concepts of the last sections deal with sequences that converge or, in other words, concen-
trate around a single point. In some cases, however, also divergent sequences have only some
points of interest for very large n. An obvious example is ((−1)n )n∈N .
In this section, we want to formalize the idea of sequences having more than one limit, i.e., points
where the sequence accumulates for large n. We will show the (to some extent surprising) fact
that every bounded sequence has such an accumulation point. We will also specify special
convergent subsequences, which leads to the so-called limit superior and limit inferior of a
sequence.
Example 3.39. Consider the sequence given by $b_n = (-1)^n(1-\frac{1}{n})$. This is not convergent, as it "jumps" between 'close to 1' and 'close to -1'. However, if we take the sequence of even natural numbers, i.e., $(n_1, n_2, n_3, \dots) = (2, 4, 6, \dots)$, then the terms of the subsequence $(b_{n_k})_{k\in\mathbb{N}} = (b_{2n})_{n\in\mathbb{N}} = (b_2, b_4, b_6, \dots)$ are of the form $b_{n_k} = 1-\frac{1}{2k}$. Hence, $(b_{n_k})_{k\in\mathbb{N}}$ is a convergent sequence, and hence, a convergent subsequence of $(b_n)_{n\in\mathbb{N}}$.
or
$$\forall\varepsilon>0\ \forall n_0\in\mathbb{N}\ \exists n\ge n_0:\ a_n\in U_\varepsilon(a),$$
or
$$\forall\varepsilon>0:\ \#\{n\in\mathbb{N}:\ a_n\in U_\varepsilon(a)\} = \infty.$$
Example 3.41. Considering the example from above, i.e., $b_n = (-1)^n(1-\frac{1}{n})$, we see that $b_{2n}\to 1$. Hence, 1 is an accumulation point of $(b_n)$. Moreover, $b_{2n+1} = -(1-\frac{1}{2n+1})\to -1$ also defines a convergent subsequence, and $-1$ is therefore also an accumulation point. It is not hard to see that there is no other possible limit of a subsequence.
Next we want to show that each bounded sequence has at least one convergent subsequence. In
particular, we show (in the proof) that every sequence contains either a non-increasing or a non-
decreasing subsequence (or both), which together with the boundedness implies its convergence,
see Theorem 3.35. Note that every subsequence of a bounded sequence is also bounded.
This result bears the names of Bolzano and Weierstrass and is an important technical tool for
proofs in many areas of analysis.
Proof. We present a proof only in the real case. The complex case works similarly.
We call m ∈ N a peak of (an)∞n=1 if for all n > m we have an < am. If there exist infinitely many
peaks of (an)∞n=1, denoted by n1, n2, n3, . . . , then the sequence (an1, an2, an3, . . . ) is decreasing
and bounded. Hence it is convergent, as we know from before.
If there are only finitely many peaks, then we choose n1 bigger than the largest peak, or, if there
are no peaks, we choose n1 = 1. In both cases n1 is not a peak, hence there exists n2 > n1
such that an2 ≥ an1. Furthermore, n2 is not a peak, which implies that there exists n3 > n2
such that an3 ≥ an2. Repeating this process, we end up with a non-decreasing subsequence
of (an)∞n=1, which is also bounded. Therefore this subsequence converges.
We can also give another proof of this statement, which has a somewhat more geometric flavour.
Alternative proof. Every sequence (an )n∈N has infinitely many (not necessarily different) terms.
If infinitely many are equal, we are done, because a list of these terms is a convergent subse-
quence.
If not, assume w.l.o.g. that 0 ≤ an ≤ 1 for all n. Every other bounded sequence can be treated
the same way. Now, split the interval [0, 1] into the halves [0, 1/2] and [1/2, 1]. As there are infinitely
many points in [0, 1], at least one of the halves must also contain infinitely many points. Now,
split up this one, and so on. With this procedure we come arbitrarily close to a point a, whose
neighborhoods –by construction– all contain infinitely many points. This point is therefore an
accumulation point. (We just note here that a is defined as the intersection of infinitely many
nested intervals. It follows from the sandwich rule that this intersection is not empty.)
Remark 3.43. The Bolzano-Weierstrass theorem is also true for sequences in much more general
(e.g. multidimensional) situations.
Example 3.44. Note that every bounded sequence has at least one convergent subsequence but
not every sequence with a convergent subsequence is bounded. E.g., consider the sequence (an )
with a2n = n and a2n−1 = 0. Clearly, (a1 , a3 , a5 , . . . ) is a null sequence, but there is no upper
bound for this sequence.
Inspired by the proof of the Bolzano-Weierstrass theorem, we will define two special accumulation
points of a sequence, i.e., the smallest and the largest accumulation point. They can be seen
as the limiting bounds on the sequence, i.e., every limit of a convergent subsequence must lie
between them.
Note that (inf_{k≥n} ak)n∈N is a non-decreasing sequence, since we take the infimum over a smaller
set if n increases. Hence, together with Theorem 3.35, we can alternatively define the limes
inferior by
    lim inf_{n→∞} an = sup_{n∈N} inf_{k≥n} ak.
Since the infimum and supremum of arbitrary bounded sets exist, we obtain that the limit
inferior and limit superior also exist for arbitrary bounded sequences. Moreover, if the sequence
is unbounded, then the corresponding 'inner' infimum or supremum (or both) are infinite.
So, if we allow limes inferior and limes superior to have also the values −∞ and ∞, then they
exist for arbitrary sequences. That is, to every real-valued sequence (an ), we may assign
the unique values lim inf(an ) ∈ R ∪ {−∞, ∞} and lim sup(an ) ∈ R ∪ {−∞, ∞}.
Example 3.49. Consider the sequence (an )n∈N with an = 2n + (−2)n . The terms of (an ) equal
either 2n+1 (for even n) or 0 (for odd n). Therefore, lim inf(an ) = 0 and lim sup(an ) = ∞.
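The 'inner' infimum and supremum can be approximated on a finite window of terms. A rough sketch (not part of the notes; the truncation at N terms is an arbitrary choice):

```python
# Approximate lim inf and lim sup of a_n = (-1)^n (1 + 1/n) via the tail
# infimum b_n = inf_{k>=n} a_k and tail supremum c_n = sup_{k>=n} a_k,
# computed over finitely many terms only.
N = 5000
a = [(-1) ** n * (1 + 1 / n) for n in range(1, N + 1)]

def tail_inf_sup(seq, n):
    tail = seq[n:]
    return min(tail), max(tail)

b_n, c_n = tail_inf_sup(a, 1000)
print(b_n, c_n)  # close to lim inf = -1 and lim sup = 1
```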
The limit inferior and limit superior are accumulation points of (an), if the sequence is
bounded. We show this for the limit inferior b := lim inf(an).
With bn := inf_{k≥n} ak we have b = lim(bn). (Note that (bn) is in general no subsequence of (an).)
Now let n1 ∈ N be such that b1 > an1 − 1/2. Such an n1 exists by the definition of the infimum.
Next, let n2 > n1 be such that bn1+1 > an2 − 1/2², and so on. That is, we obtain an increasing
sequence (nk)k∈N of natural numbers with bnk+1 > ank+1 − 1/2ᵏ. In addition, we have by definition
that bnk+1 ≤ ank+1. We obtain that |bnk+1 − ank+1| < 1/2ᵏ, which implies that (bnk+1 − ank+1)k∈N is
a null sequence. Hence, limk→∞ bnk+1 = limk→∞ ank+1. Since all subsequences of (bn) converge
to the same limit, we obtain limk→∞ ank+1 = lim(bn) = b, i.e., b is an accumulation point.
Moreover, lim inf(an ) and lim sup(an ) are the smallest and largest accumulation point of
(an ), respectively. To see this, let a ∈ R be any accumulation point of (an ), i.e., there exists a
subsequence (ank )k∈N with ank → a. Then we have the bounds
lim inf an ≤ a ≤ lim sup an .
n→∞ n→∞
Proof. If a > lim sup(an), then there is some ε > 0 and infinitely many terms of (an) that are
larger than lim sup(an) + ε. Hence, sup_{k≥n} ak ≥ lim sup(an) + ε for all n: a contradiction to the
definition of the limit superior. The case a < lim inf(an) is treated analogously.
Although all accumulation points of (an) are bounded from below and above by lim inf(an)
and lim sup(an), respectively, this clearly does not need to hold for all (large enough) terms of
the sequence. It may even happen that all elements of a sequence lie outside of the interval
[lim inf(an), lim sup(an)]. Consider, e.g., an = (−1)ⁿ(1 + 1/n) ∉ [−1, 1] with lim inf(an) = −1 and
lim sup(an) = 1.
Moreover, if we are (only) interested in the limiting behavior of the sequence (an ), then the
trivial bound
inf{an : n ∈ N} ≤ aN ≤ sup{an : n ∈ N},
which holds for all N ∈ N, is useless in general.
We can use the limit inferior and superior to bound the elements of a sequence for large N:
As lim inf and lim sup are the smallest and largest accumulation point of a sequence, respectively,
we obtain that all elements of (an) with large enough n are at least close to the interval
[lim inf(an), lim sup(an)]. That is, for all ε > 0 there is some n0 ∈ N such that
    lim inf(an) − ε ≤ aN ≤ lim sup(an) + ε
for all N ≥ n0. (Verify this!)
We finally show that the limit inferior and limit superior are indeed generalizations of the concept
of a limit. Clearly, it is more generally applicable, as lim inf and lim sup are well-defined for
arbitrary sequences. The next result shows that, for convergent sequences, all these values are
just the same. This follows from the considerations above, and the fact that every subsequence
of a convergent sequence converges to the same limit.
Lemma 3.50. A sequence (an)n∈N ⊂ R is convergent (or definitely divergent) if and only if
    lim inf_{n→∞} an = lim sup_{n→∞} an.
In this case,
    lim_{n→∞} an = lim inf_{n→∞} an = lim sup_{n→∞} an.
This means that a sequence is convergent if and only if the sequence is bounded and
has exactly one accumulation point.
Remark 3.51 (Complex sequences). Note that the lim inf and lim sup cannot be defined for
complex-valued sequences, because we do not have an order on C, and hence no supremum or
infimum. However, it is still true that a complex-valued sequence is convergent if and only if it
is bounded and has exactly one accumulation point. For a proof, we may consider lim inf and
lim sup of the real and imaginary parts separately.
Example 3.52. Consider the sequence (an )n∈N , which is a list of all rational numbers in [0, 1].
We already showed that the rational numbers are dense in R. Thus every x ∈ [0, 1] is an
accumulation point of (an )n∈N . In other words, the set of accumulation points is uncountable
and therefore in some sense “larger” than the set of the sequence elements.
In this section we introduce the Cauchy criterion for proving convergence of a sequence. This
is, similarly to the monotonicity principle (Theorem 3.35), an important tool to verify that a
sequence is convergent, without knowing its limit. The Cauchy criterion will be the dominant
technique for proofs of convergence when it comes to higher mathematics, including sequences
of more general objects.
The central object is a Cauchy sequence.
That is, the terms of a Cauchy sequence are pairwise close to each other for large n.
Compare this definition with the definition of convergence in order to gain better understanding.
Example 3.54. The sequence (an)n∈N with an = 1/n satisfies |an − am| = 1/m − 1/n < ε for
n ≥ m > 1/ε =: n0. Hence, (an) is a Cauchy sequence.
Remark 3.55. For a complex-valued sequence (zn ) with zn = xn +iyn and (xn )n∈N , (yn )n∈N ⊂ R
it holds:
(zn ) is a Cauchy sequence ⇐⇒ (xn ) and (yn ) are Cauchy sequences.
This follows from
    max{|xn − xm|, |yn − ym|} ≤ √((xn − xm)² + (yn − ym)²) = |zn − zm|.
We now prove the important property that every convergent sequence is a Cauchy sequence,
and vice versa.
|an − am | ≤ |an − a| + |a − am | ≤ ε.
Since this holds for all ε > 0, we get that (an ) is a Cauchy sequence.
We now show the other direction "(an)n∈N is a Cauchy sequence =⇒ (an)n∈N is convergent":
First, we choose ε = 1 in the definition of a Cauchy sequence, and obtain some n0 such that
|an − am| < 1 for all m, n ≥ n0. Moreover, the triangle inequality implies |an| ≤ |an0| + 1 for all
n ≥ n0, and hence, with C := max{|a1|, . . . , |an0−1|, |an0| + 1},
    ∀n ∈ N : |an| ≤ C,
which makes (an )n∈N a bounded sequence. The Bolzano-Weierstrass theorem implies that
(an )n∈N has at least one accumulation point a ∈ C. Thus there exists a subsequence (ank )k∈N
(of (an )n∈N ) such that
lim ank = a.
k→∞
Let ε > 0. By the convergence of (ank ) we obtain that |a − ank | < 2ε for large enough k, i.e., for
large enough nk . Moreover, since (an ) is a Cauchy sequence, |an − ank | < 2ε for n and nk large
enough. (Formally, this shows that, for all ε, there exist n0 , k0 such that for all n ≥ n0 , k ≥ k0
we have |an − a| < ε. But since the statement does not depend on k, we just omit this part.)
That is, for arbitrary ε > 0 and n large enough, we have
    |an − a| < ε,
i.e., an → a.
Remark 3.57. One might show, for every convergent sequence discussed so far, that it is a
Cauchy sequence. The proof would follow directly the first part of the proof above, and one
would not learn much from these computations. (You might still try it on your own!) However,
verifying that a sequence is a Cauchy sequence is often much easier, as we will see later on.
Example 3.58. Note that it is not enough that neighboring elements of a sequence become
arbitrarily close. For example, the sequence (an)n∈N with an = √n, which is clearly not convergent,
satisfies
    |an+K − an| = √(n+K) − √n = (√(n+K) − √n)(√(n+K) + √n) / (√(n+K) + √n)
                = K / (√(n+K) + √n) < K / (2√n) → 0
for every fixed K ∈ N. Hence, 'terms at fixed distance' become arbitrarily close together.
However, we have, e.g., |a4n − an| = √n → ∞. So, there is no n0 such that |am − an| < 1 for all
m, n ≥ n0.
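The two effects can be seen numerically (a minimal sketch, not part of the notes):

```python
# a_n = sqrt(n): terms at fixed distance K get arbitrarily close,
# but |a_{4n} - a_n| = sqrt(n) grows without bound, so (a_n) is not Cauchy.
import math

def a(n):
    return math.sqrt(n)

print(a(10**6 + 1) - a(10**6))  # tiny (roughly 1/2000)
print(a(4 * 10**6) - a(10**6))  # sqrt(10**6) = 1000
```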
3.6 Series
If the sequence of partial sums (sn) converges to some s ∈ C, i.e., sn → s, then we say that
the series converges or that the sequence (an)n∈N is summable, call s the sum of the
series, and write
    ∑_{k=1}^∞ ak := lim_{n→∞} ∑_{k=1}^n ak = lim_{n→∞} sn = s.
If sn → ±∞ we also write ∑_{k=1}^∞ ak = ±∞, and say that the series is definitely divergent.
Otherwise we call the series ∑_{k=1}^∞ ak divergent and the sequence (an)n∈N not summable.
Note that series is just another word for an infinite sum of elements of a sequence. Moreover,
the notation ∑_{k=1}^∞ ak should be understood as a formal symbol for the limit: It might be a
number or ±∞, but it might also not exist (as a number).
The definition above states that a series converges if and only if the sequence (sn)∞n=1 of partial
sums converges. This implies that we can use the results from the previous section to analyze
series. Moreover, we will see that there are even more tools for working with series. But first
let us see some examples that will be essential for the upcoming considerations.
The first example is one of the most well-known and used infinite sums. Although it is usually
considered only for real q ∈ (−1, 1), we see that it also holds for complex bases.
Example 3.60 (Geometric series). Let q ∈ C with |q| < 1. Then we have that
    ∑_{k=0}^∞ qᵏ = 1/(1−q)  and  ∑_{k=1}^∞ qᵏ = q/(1−q).
More generally, the partial sums satisfy
    ∑_{k=k0}^n qᵏ = (q^{k0} − q^{n+1})/(1 − q),
which holds for all n ≥ k0 and q ≠ 1. (Note that the last formula does not contain a limit.)
Proof. We only prove the result for k0 = 0, and leave the rest for the reader.
Let sn := ∑_{k=0}^n qᵏ and consider the equation
    (1 − q)sn = ∑_{k=0}^n qᵏ − q·∑_{k=0}^n qᵏ = ∑_{k=0}^n qᵏ − ∑_{k=0}^n q^{k+1} = ∑_{k=0}^n qᵏ − ∑_{k=1}^{n+1} qᵏ = q⁰ − q^{n+1}
              = 1 − q^{n+1}.
In the next-to-last equality we used the fact that the terms qᵏ for k ∈ {1, 2, . . . , n}
appear in both sums (and the second sum is subtracted from the first). Thus, the only terms
that remain are q⁰ and q^{n+1} (and the second gets a minus in front). Such arguments, i.e., that
many (or all) terms of a series cancel each other out, are called telescoping tricks and sums
of this form are called telescoping sums. We will come back to this kind of series later.
From the above equation we obtain
    sn = (1 − q^{n+1})/(1 − q)
for all q ≠ 1. Using this representation of the partial sums we can compute its limit easily. Note
that 1/(1−q) is a constant factor and (qⁿ)n∈N with |q| < 1 is a null sequence, see Example 3.20.
Hence
    ∑_{k=0}^∞ qᵏ = lim_{n→∞} sn = 1/(1−q) · lim_{n→∞} (1 − q^{n+1}) = 1/(1−q) · (1 − lim_{n→∞} q^{n+1}) = 1/(1−q).
Example 3.61. If we set q = 1/2 we get, e.g., that
    ∑_{n=0}^∞ 1/2ⁿ = 2  and  ∑_{n=1}^∞ 1/2ⁿ = 1.
Remark 3.62. Note that the telescoping trick from above also works for |q| > 1 (but not
if q = 1). It then follows from the explicit formula sn = (q^{n+1} − 1)/(q − 1), where we just multiplied
numerator and denominator by −1, that (sn) is unbounded, and therefore divergent. If q > 1
(and, in particular, a real number) we see that (sn) tends to infinity, while the limits simply do
not exist for q ≤ −1. In general, the case |q| = 1 needs more care. We will see in Example 3.72
that ∑_{k=0}^∞ qᵏ is divergent for every |q| ≥ 1, and definitely divergent only for q ≥ 1.
Now we will see that not all convergent sequences lead to convergent series. Moreover,
the following is the most important prototype of a divergent series, as it is "almost summable".
We will see later what this means.
Example 3.64 (Harmonic series). Consider the sequence given by an = 1/n. Then, the
corresponding series satisfies
    ∑_{n=1}^∞ 1/n = ∞,
i.e., it is definitely divergent. This series is called harmonic series.
However, we will see later that ∑ n^{−α} is convergent if α > 1.
Proof. We show that the sequence of partial sums is bounded from below by a divergent sequence.
For this, we successively group the terms in the partial sums, and bound each group from below
by the number of its elements times its smallest member:
    s_{2ⁿ} = 1 + 1/2 + (1/3 + 1/4) + (1/5 + 1/6 + 1/7 + 1/8) + · · · + (1/(2ⁿ⁻¹+1) + 1/(2ⁿ⁻¹+2) + · · · + 1/2ⁿ)
          ≥ 1 + 1/2 + 2 · 1/4 + 4 · 1/8 + · · · + 2ⁿ⁻¹ · 1/2ⁿ
          = 1 + 1/2 + 1/2 + · · · + 1/2
          = 1 + n/2.
This means that, with N such that 2ⁿ ≤ N < 2ⁿ⁺¹ ⇐⇒ n ≤ log₂(N) < n + 1, we obtain
sN ≥ s_{2ⁿ} ≥ 1 + n/2 ≥ (1 + log₂(N))/2. We obtain sN → ∞.
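The lower bound from the proof can be observed numerically (a sketch, not part of the notes):

```python
# Partial sums of the harmonic series: s_{2^n} >= 1 + n/2, so they are unbounded.
def harmonic(N):
    return sum(1 / k for k in range(1, N + 1))

for n in [1, 5, 10, 20]:
    N = 2 ** n
    print(N, harmonic(N), 1 + n / 2)  # partial sum versus the lower bound
```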
Example 3.65. We want to discuss the telescoping trick once more. This is sometimes a
very powerful tool to obtain the precise value of apparently complicated series. Let us therefore
prove
    ∑_{k=1}^∞ 1/(k(k+1)) = 1.
It is not clear yet that the series on the left hand side converges, and its precise value is also far
from obvious. But one might notice that the terms can be expanded to
    1/(k(k+1)) = (k+1)/(k(k+1)) − k/(k(k+1)) = 1/k − 1/(k+1).
Example 3.66. Consider an = (6n+9)/(n²(n+3)²) = 1/n² − 1/(n+3)² =: bn − bn+3. Hence,
    ∑_{k=1}^n ak = ∑_{k=1}^n (bk − bk+3) = ∑_{k=1}^n bk − ∑_{k=4}^{n+3} bk = b1 + b2 + b3 − bn+1 − bn+2 − bn+3.
Since bn → 0, we obtain ∑_{k=1}^∞ ak = 1 + 1/4 + 1/9 = 49/36.
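A numerical check of this telescoping value (a sketch, not part of the notes):

```python
# Partial sums of a_n = (6n+9)/(n^2 (n+3)^2) approach b_1 + b_2 + b_3 = 49/36.
def a(n):
    return (6 * n + 9) / (n ** 2 * (n + 3) ** 2)

s = sum(a(n) for n in range(1, 10**5))
print(s, 49 / 36)  # nearly equal; the remainder is of order 3/n^2
```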
However, note that it is rare that we can compute the sum precisely. Already for the
example ∑_{k=1}^∞ 1/k², which is very similar to the one above, we need some higher mathematics to
find its precise value π²/6 ≈ 1.645. For many other sums, there is just no closed expression.
Still, we might be interested if the sum exists. For this, we can deduce several techniques from
our knowledge about sequences. Let us start with some calculation rules.
Theorem 3.67. Let ∑_{k=1}^∞ ak and ∑_{k=1}^∞ bk be convergent series, and let c ∈ C. Then we
have
    ∑_{k=1}^∞ (ak + bk) = ∑_{k=1}^∞ ak + ∑_{k=1}^∞ bk
and
    ∑_{k=1}^∞ c·ak = c·∑_{k=1}^∞ ak.
Since both series are convergent, we obtain the results from Theorem 3.24.
We now consider two results on the convergence of series that follow directly from the results
of the last section. In fact, they are just reformulations of the monotonicity principle (Theo-
rem 3.35) and the Cauchy criterion (Theorem 3.56).
The first is concerned with series with non-negative terms and bounded partial sums.
Theorem 3.68. Let (an)∞n=1 ⊂ R be a non-negative sequence, i.e., ak ≥ 0 for all k ∈ N.
Then, the sequence of partial sums sn := ∑_{k=1}^n ak is bounded, i.e.,
    ∃C ∈ R ∀n ∈ N : sn ≤ C,
if and only if the series ∑_{k=1}^∞ ak converges.
Proof. Since ak ≥ 0, we obtain that (sn ) is a non-decreasing, and therefore monotone, sequence.
The result follows from the monotonicity principle (Theorem 3.35).
This theorem is already enough to show that the aforementioned alternative representation of
Euler’s number (defined in Section 2) as an infinite sum converges.
Example 3.69 (Euler's number). Consider the series given by the sequence an = 1/n!, starting
at 0 in this case. The partial sums are given by
    sn = ∑_{k=0}^n 1/k!.
Moreover, by the binomial theorem,
    (1 + 1/n)ⁿ = ∑_{k=0}^n (n choose k) 1/nᵏ = ∑_{k=0}^n 1/k! · n!/((n−k)! nᵏ) ≤ ∑_{k=0}^n 1/k! = sn,
where the last step follows from n!/(n−k)! = (n − k + 1) · (n − k + 2) · · · (n − 1) · n ≤ nᵏ.
Hence,
    e := lim_{n→∞} (1 + 1/n)ⁿ ≤ ∑_{k=0}^∞ 1/k!.
On the other hand, for fixed m ∈ N and n ≥ m, we obtain, by similar arguments as above, that
    lim_{n→∞} (1 + 1/n)ⁿ = lim_{n→∞} ∑_{k=0}^n (n choose k) 1/nᵏ ≥ lim_{n→∞} ∑_{k=0}^m (n choose k) 1/nᵏ
                        = ∑_{k=0}^m 1/k! · lim_{n→∞} n!/((n−k)! nᵏ) = ∑_{k=0}^m 1/k!,
The next result is just the Cauchy criterion applied to partial sums.
Theorem 3.70 (Cauchy criterion). Let ∑_{k=1}^∞ ak be a series. Then we have that ∑_{k=1}^∞ ak
is convergent if and only if
    ∀ε > 0 ∃n0 ∈ N ∀m > n ≥ n0 : |∑_{k=n+1}^m ak| < ε.
In other words, the series ∑_{k=1}^∞ ak is convergent if and only if the sequence of partial sums
is a Cauchy sequence.
Proof. Since
    sm − sn = ∑_{k=1}^m ak − ∑_{k=1}^n ak = ∑_{k=n+1}^m ak,
we see that the condition in the theorem is equivalent to (sn) being a Cauchy sequence. This is
equivalent to (sn) being convergent, see Theorem 3.56.
This theorem immediately leads to the following simple criterion. In many cases, this is already
enough to show that a series is divergent.
Proof. We use the Cauchy criterion for series and set m = n + 1. Then we get
Example 3.72. From this, we finally obtain that ∑_{k=0}^∞ qᵏ can only be convergent for |q| < 1.
Remark 3.73. We have already seen that there are null sequences which do not give a convergent
series, e.g. (1/n)∞n=1. So the above corollary gives a necessary but not a sufficient condition
for a series to be convergent.
Remark 3.74. (*) Indeed, the representation of e via a series can be generalized to obtain the
known exponential function. This leads to eˣ = ∑_{k=0}^∞ xᵏ/k!. (We will prove this much later!)
We now discuss several criteria to prove that a series is convergent. These convergence tests
mostly do not lead to the precise sum of a series. However, they are quite generally applicable.
The first tests that will be discussed are based on another form of convergence of series, which
will turn out to be a stronger criterion.
Note that ∑ |ak| < ∞ is really the same as ∑ |ak| being convergent. This follows from
Theorem 3.68 and the fact that |ak| ≥ 0.
Moreover, note that for non-negative sequences (an )n∈N , i.e., ak ≥ 0 for all k, absolute summa-
bility and summability are just the same.
For example, the series of absolute values of the alternating harmonic series is the harmonic
series, which is divergent. However, we will see later that the alternating harmonic series is
convergent.
The next result shows that absolute convergence is indeed a stronger criterion.
Proof. We use the Cauchy criterion and the triangle inequality to prove this result. Let ∑_{k=1}^∞ ak
be an absolutely convergent series and ε > 0. By the Cauchy criterion there exists some n0 ∈ N
such that for all m > n ≥ n0 we have
    ∑_{k=n}^m |ak| < ε.
The triangle inequality yields
    |∑_{k=n}^m ak| ≤ ∑_{k=n}^m |ak| < ε.
Thus the Cauchy criterion implies that ∑_{k=1}^∞ ak is convergent.
We will now discuss several criteria, called convergence tests, that can be used to verify whether a
series is convergent or not. However, note that these tests are sometimes inconclusive, i.e., we do
not get a definite answer by applying them, and one needs to apply other techniques.
The first test is based on the comparison with another series that is known to be convergent or
divergent. This (quite obvious) test is used in nearly every application, and clearly relies on the
numerous examples we are discussing. In particular, some of the other convergence tests are
based on a simple application of this one.
Theorem 3.79. Let ∑_{k=1}^∞ ak and ∑_{k=1}^∞ bk be series.
(i) If |ak| ≤ bk for all k and ∑_{k=1}^∞ bk is convergent, then ∑_{k=1}^∞ ak is absolutely convergent.
(ii) If ak ≥ bk ≥ 0 for all k and ∑_{k=1}^∞ bk is divergent, then ∑_{k=1}^∞ ak is divergent.
Proof. Since ∑ bk is an upper bound for the partial sums of ∑ |ak|, we have that ∑ |ak| is
convergent by Theorem 3.68, and therefore finite. This implies that ∑ ak is absolutely convergent.
The second point follows from a similar argument. We use that the partial sums of ∑ bk are
divergent (which is by Theorem 3.68 the same as ∑ bk = ∞, since bk ≥ 0), and that the partial
sums of ∑ ak are just larger. This shows that ∑ ak is also unbounded, and hence divergent.
Let us consider the series ∑_{k=1}^∞ k^{−c} for c > 0, with polynomially decaying terms.
Example 3.80. We have
    ∑_{k=1}^∞ 1/k² = ∑_{k=1}^∞ 1/(k(k+1)) · (k+1)/k ≤ 2 ∑_{k=1}^∞ 1/(k(k+1)) = 2 < ∞,
and hence that ∑_{k=1}^∞ 1/k² is (absolutely) convergent. Similarly, ∑_{k=1}^∞ 1/√k ≥ ∑_{k=1}^∞ 1/k = ∞.
(In fact, we will see soon that the series is actually convergent for all c > 1.)
Example 3.82. However, ∑_{k=1}^∞ (k³+4k²−3)/(k⁴−k+1) = ∞, since
    ∑_{k=1}^∞ (k³+4k²−3)/(k⁴−k+1) ≥ ∑_{k=1}^∞ k³/k⁴ = ∑_{k=1}^∞ 1/k = ∞.
We now turn to convergence tests that can be applied to the terms of a series, and we do not
need precise knowledge about the partial sums. As the proof shows, these tests just follow from
a comparison of the series under consideration with a geometric series, see Example 3.60.
Theorem 3.83 (root test). Let ∑_{k=1}^∞ ak be a series.
(i) If there exists q ∈ [0, 1) such that
    ᵏ√|ak| ≤ q for almost all k,
then ∑ ak is absolutely convergent.
(ii) Conversely, if
    ᵏ√|ak| ≥ 1 for infinitely many k,
then ∑ ak is divergent.
Proof. By assumption, there is some k0 such that |ak| ≤ qᵏ for k ≥ k0. Hence, we get that
    ∑_{k=1}^∞ |ak| = ∑_{k=1}^{k0−1} |ak| + ∑_{k=k0}^∞ |ak| ≤ ∑_{k=1}^{k0−1} |ak| + ∑_{k=k0}^∞ qᵏ ≤ ∑_{k=1}^{k0−1} |ak| + ∑_{k=0}^∞ qᵏ,
where the first inequality comes from Theorem 3.79. Now, ∑_{k=1}^{k0−1} |ak| is a finite sum and ∑_{k=0}^∞ qᵏ
is a geometric series. As both are finite for q ∈ [0, 1), we get that ∑_{k=1}^∞ ak converges absolutely.
For part (ii), just note that (an) fails to converge to 0 under the given assumption.
The condition of the root test can be written equivalently with the help of limits:
For this, consider the limit superior
    a := lim sup_{k→∞} ᵏ√|ak|.
Then, the series ∑ ak is
(i) absolutely convergent, if a < 1,
(ii) divergent, if a > 1,
(iii) and we do not gain any information from the root test, if a = 1.
Theorem 3.86 (ratio test). Let ∑_{k=1}^∞ ak be a series.
(i) If there exists q ∈ [0, 1) such that
    ak ≠ 0 and |ak+1/ak| ≤ q for almost all k,
then ∑ ak is absolutely convergent.
(ii) Conversely, if
    ak ≠ 0 and |ak+1/ak| ≥ 1 for almost all k,
then ∑ ak is divergent.
Proof. As in the proof of the root test we can split the series in ∑_{k=1}^∞ ak = ∑_{k=1}^{k0−1} ak + ∑_{k=k0}^∞ ak,
where ∑_{k=1}^{k0−1} ak is a finite sum and does not change the convergence of the series. We assume
w.l.o.g. that k0 = 1. By induction we have that |ak| ≤ q^{k−1}|a1| for all k ∈ N. An index shift
implies
    ∑_{k=1}^∞ |ak| ≤ ∑_{k=1}^∞ q^{k−1}|a1| = ∑_{k=0}^∞ qᵏ|a1| = |a1| ∑_{k=0}^∞ qᵏ.
Since q ∈ [0, 1) we get that ∑_{k=1}^∞ ak is absolutely convergent from Theorem 3.79.
For part (ii) we follow similar steps and obtain |ak| ≥ |ak0| > 0 for all k ≥ k0 and k0 large enough.
Hence, (ak) is not a null sequence, and consequently ∑ ak is not convergent.
The condition of the ratio test can be written equivalently with the help of limits:
For this, assume that the limit
    a := lim_{k→∞} |ak+1/ak|
exists. Then, the series ∑ ak is
(i) absolutely convergent, if a < 1,
(ii) divergent, if a > 1,
(iii) and we do not gain any information from the ratio test, if a = 1.
Remark 3.87. One may prove that the ratio test is a bit more special than the root test in the
following sense:
Assume that A := lim_{k→∞} |ak+1/ak| exists; then also B := lim_{k→∞} ᵏ√|ak| exists and A = B. That
is, whenever we successfully applied the ratio test, we may have also applied the root test to
come to the same conclusion. (We will not prove this here.) However, the ratio test is sometimes
much easier to apply.
Example 3.88. We show that the series ∑_{k=1}^∞ k^{k/2}/k! is absolutely convergent.
For this, note that
    |ak+1/ak| = (k+1)^{(k+1)/2}/(k+1)! · k!/k^{k/2} = (1 + 1/k)^{k/2}/√(k+1) ≤ 2/√(k+1) → 0,
since (1 + 1/k)^{k/2} ≤ √e < 2. The ratio test implies the absolute convergence.
Example 3.89. The root and ratio test have their limitations, e.g., for series whose terms are
only polynomially decaying, i.e., ∑ k^{−c} for some c > 0. Since lim_{k→∞} ᵏ√(k^{−c}) = 1 independent of c,
we cannot distinguish between different c with the root test, although the series is convergent
for some c, and divergent for others, see Example 3.80.
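This inconclusiveness is easy to observe numerically (a sketch, not part of the notes):

```python
# Root-test values (|a_k|)^(1/k) for a_k = k^(-c): they tend to 1 for every c > 0,
# although sum k^(-c) converges for c > 1 and diverges for c <= 1.
def root_test_value(c, k):
    return (k ** -c) ** (1 / k)

for c in [0.5, 1.0, 2.0]:
    print(c, root_test_value(c, 10**6))  # all close to 1
```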
The convergence test we want to discuss now is only applicable if the terms of the series are a
non-negative and monotone null sequence. However, this test is very powerful in this case.
Theorem 3.90 (Cauchy's condensation test). Let ∑_{k=1}^∞ ak be a series with 0 ≤ ak+1 ≤ ak
for all k. Then,
    ∑_{k=1}^∞ ak is convergent ⇐⇒ ∑_{k=1}^∞ 2ᵏ a_{2ᵏ} is convergent.
Proof. We will bound the series ∑ ak from above and below. This will imply the result by
Theorem 3.79. For this, we group the terms of the non-increasing sequence (ak) into "blocks"
with indices {2ᵏ, 2ᵏ+1, . . . , 2ᵏ⁺¹−1}, and just bound all elements in such a block by the smallest
and the largest one, respectively.
To be precise, note that every natural number can be written uniquely as 2ᵏ + ℓ for some k ∈ N
and some ℓ ∈ {0, 1, . . . , 2ᵏ − 1} (Check that!), which shows
    ∑_{k=1}^∞ ak = ∑_{k=1}^∞ ∑_{ℓ=0}^{2ᵏ−1} a_{2ᵏ+ℓ}.
Moreover, by simply bounding by the maximum or minimum and the number of elements, we obtain
    2ᵏ a_{2ᵏ⁺¹} ≤ ∑_{ℓ=0}^{2ᵏ−1} a_{2ᵏ+ℓ} ≤ 2ᵏ a_{2ᵏ}
We are finally in the position to study the series ∑_{k=1}^∞ k^{−c} for c > 0.
Proof. By the condensation test we see that this series is convergent if and only if
    ∑_{k=1}^∞ 2ᵏ (2ᵏ)^{−c} = ∑_{k=1}^∞ (2^{1−c})ᵏ < ∞.
Now note that this is just a geometric series (with q = 2^{1−c}) and we have 2^{1−c} < 1 if and only
if c > 1, which proves the result.
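The condensed series can be inspected directly (a sketch, not part of the notes):

```python
# For a_n = n^(-c) the condensed series sum 2^k a_{2^k} is the geometric series
# with q = 2^(1-c): bounded partial sums for c > 1, divergence for c <= 1.
def condensed_partial(c, K):
    return sum(2 ** k * (2 ** k) ** (-c) for k in range(1, K + 1))

print(condensed_partial(2.0, 50))  # q = 1/2: partial sums approach 1
print(condensed_partial(1.0, 50))  # q = 1: partial sums grow like K
```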
The last convergence test we want to discuss is again only applicable for special series. Again,
the terms of the series are based on a monotone null sequence. However, we consider their
alternating sum and show that this is always a convergent series.
Theorem 3.92 (Leibniz criterion). Let (ak)k∈N be monotone with ak → 0. Then,
    ∑_{k=1}^∞ (−1)ᵏ ak is convergent.
Proof. Assume w.l.o.g. that (ak ) is non-increasing, and therefore non-negative. (Why?)
Since the sequence (ak ) is non-increasing, we have that ak − ak+1 ≥ 0 for all k. Thus
s2n+2 = s2n + (−1)2n+1 a2n+1 + (−1)2n+2 a2n+2 = s2n − (a2n+1 − a2n+2 ) ≤ s2n .
This means the sequence (s2n )n∈N is non-increasing. The same argument implies that the
sequence (s2n−1 )n∈N is non-decreasing. Furthermore s2n − s2n−1 = a2n ≥ 0 implies that s2n−1 ≤
s2n . This yields that for all n ∈ N
s1 ≤ s3 ≤ · · · ≤ s2n−1 ≤ s2n ≤ · · · ≤ s4 ≤ s2 .
So we have two monotone and bounded sequences, (s2n) and (s2n−1), which are therefore
convergent. We still need to show that their limits are the same. (Otherwise we would have only
proven that the series has two accumulation points.) But we clearly have s2n − s2n−1 = a2n → 0,
so both limits coincide.
This theorem shows that alternating series are somewhat easier to handle than their non-alternating
versions. The following example will demonstrate this.
Example 3.93. The alternating harmonic series ∑ (−1)ᵏ/k converges by the Leibniz criterion,
since ak = 1/k is a decreasing null sequence. Later, we will even prove that
    ∑_{k=1}^∞ (−1)^{k+1}/k = ln(2) ≈ 0.693,
where ln(x) is the natural logarithm. Recall that the 'normal' harmonic series does not converge.
Example 3.94. Since ak = 1/√(k+4) is a decreasing null sequence, we have that ∑ (−1)ᵏ ak is
convergent. However, note that bk = (−1)ᵏ/√k is also a null sequence, but not monotone, and
    ∑ (−1)ᵏ bk = ∑ 1/√k = ∞.
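A numerical look at the alternating harmonic series (a sketch, not part of the notes; the value ln(2) is only claimed later in the text):

```python
# Partial sums of sum (-1)^(k+1)/k approach ln(2) = 0.693...
import math

def alt_harmonic(N):
    return sum((-1) ** (k + 1) / k for k in range(1, N + 1))

print(alt_harmonic(10**5), math.log(2))
```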
We finally consider series that contain a free parameter or, in other words, describe a function
whenever they are convergent.
Definition 3.95 (Power series). Let (ak)∞k=0 ⊂ C be a sequence and let c ∈ C.
Then, for z ∈ C, we define the (formal) power series as
    f(z) := ∑_{k=0}^∞ ak (z − c)ᵏ.
We call the ak the coefficients and c the center of the power series.
We call
    R := R(f) := 1 / lim sup_{k→∞} ᵏ√|ak|
the radius of convergence of the power series f.
(We set R = ∞ if lim sup_{k→∞} ᵏ√|ak| = 0, and R = 0 if lim sup_{k→∞} ᵏ√|ak| = ∞.)
We call Df = {y ∈ C : |y − c| < R(f)} the disc of convergence of f.
However, as the name radius of convergence already indicates, the number R(f ) plays a crucial
role for deciding if a series converges.
Theorem 3.96 (Radius of convergence). Let (ak)∞k=0 ⊂ C be a sequence, c ∈ C, and let f
be the corresponding formal power series with radius of convergence R := R(f).
Then, the power series f(z) = ∑_{k=0}^∞ ak (z − c)ᵏ is
(i) absolutely convergent for every z ∈ C with |z − c| < R,
(ii) divergent for every z ∈ C with |z − c| > R.
Example 3.97. Let us consider the power series f(z) = ∑_{k=1}^∞ zᵏ/k, i.e., ∑ ak (z − c)ᵏ with
ak = 1/k and c = 0. We have
    ᵏ√|ak| = 1/ᵏ√k → 1.
So, the radius of convergence is R(f) = 1, and we obtain that ∑ zᵏ/k is absolutely convergent if
|z| < 1.
In some cases, however, it is easier to use the ratio test to verify convergence of a power series.
Luckily, we have that the radius of convergence can also be given by the corresponding limit, if
it exists.
Lemma 3.98. The radius of convergence of a power series f(z) := ∑_{k=0}^∞ ak (z − c)ᵏ satisfies
    R(f) = lim_{k→∞} |ak/ak+1|,
if this limit exists.
We obtain by the above ratio test that the radius of convergence is ∞, and hence, that this
power series converges for all z ∈ C.
We will see later that this series describes the exponential function, i.e., e^z = ∑_{k=0}^∞ zᵏ/k!.
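The ratio formula can be sketched numerically; the helper below is hypothetical (not from the notes) and only evaluates a single ratio |a_K/a_{K+1}| as a stand-in for the limit:

```python
# Estimating the radius of convergence via a single ratio |a_K / a_{K+1}|.
import math

def ratio_radius_estimate(a, K):
    return abs(a(K) / a(K + 1))

exp_coeff = lambda k: 1 / math.factorial(k)   # coefficients of the exponential series
print(ratio_radius_estimate(exp_coeff, 50))   # = K + 1 = 51: grows without bound, so R = infinity

geom_coeff = lambda k: 1.0                    # coefficients of sum z^k, radius 1
print(ratio_radius_estimate(geom_coeff, 50))  # = 1
```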
Each convergent power series defines a function on its disc of convergence, and it would be
interesting to find explicit expressions for these functions, at least for some power series. We
already know the most important example:
Figure 20: The graph of the functions 1/(1−x), 1/(1−x²), 8/(1+x²), 6/(1+4x²).
Many series can be brought into the form of a geometric series, and we can therefore also find
explicit expressions for them. However, note that this works only where the series converges.
Example 3.102. Consider the power series f(x) := ∑_{k=0}^∞ x²ᵏ for x ∈ R.
Using the last example, this can be written as f(x) = ∑_{k=0}^∞ (x²)ᵏ = 1/(1−x²) for all x ∈ (−1, 1).
Analogously, we find ∑_{k=0}^∞ 8(−1)ᵏ x²ᵏ = 8 ∑_{k=0}^∞ (−x²)ᵏ = 8/(1+x²) for x ∈ (−1, 1), or
    ∑_{k=0}^∞ 6(−4)ᵏ x²ᵏ = 6 ∑_{k=0}^∞ (−4x²)ᵏ = 6/(1+4x²),
which only holds for x ∈ (−1/2, 1/2). Note that it is also not easy to "see" from the graph of the
explicit expression where it can be written as a power series, see Figure 20.
Sometimes it is also helpful to write an explicit function as a series, i.e., to find a series expansion.
Example 3.103. Assume we want to write f(x) = 1/x² as a power series. Then, we can denote
y := 1 − x² such that 1 − y = x². Since we can write ∑_{k=0}^∞ yᵏ = 1/(1−y) for all |y| < 1, we see that
    f(x) = 1/x² = 1/(1−y) = ∑_{k=0}^∞ yᵏ = ∑_{k=0}^∞ (1 − x²)ᵏ
for all x ∈ R with |1 − x²| < 1, i.e., x ∈ (−√2, √2) \ {0}.
However, rewriting this as a power series ∑_{k=0}^∞ ak xᵏ for some (ak) ⊂ R would be a rather hard
computation. We will see later how to do that in a systematic way using derivatives.
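A numerical check of this expansion at a sample point (a sketch, not part of the notes; x = 0.8 is an arbitrary choice in the region of convergence):

```python
# Partial sums of sum (1 - x^2)^k approximate 1/x^2 for |1 - x^2| < 1.
def expansion(x, K):
    y = 1 - x * x
    return sum(y ** k for k in range(K + 1))

x = 0.8
print(expansion(x, 100), 1 / x**2)  # nearly equal
```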
Still, we can use the above techniques to find out where more complicated series converge.
Note that, in general, there is no way to give an explicit form for series like the ones
above.
Note that a function is continuous iff an expression of the form lim f (xn ) does not depend on
the specific sequence (xn ), but only on its limit x0 := lim(xn ) ∈ D.
Roughly speaking, we can interchange the limit with the function, if it is continuous.
Here, it is important that x0 ∈ D since otherwise f (x0 ) may not be defined. Later we will also
consider limits of functions in the other case.
Example 4.3. The prototype of a discontinuous function is the Heaviside function, which is
defined by
H(x) = 0 if x < 0, and H(x) = 1 if x ≥ 0.
If we now consider the sequences xn = 1/n and −xn = −1/n, we get that
lim_{n→∞} H(xn) = 1 ≠ 0 = lim_{n→∞} H(−xn),
and hence that H is not continuous at 0. However, H is continuous at every other point. (This
is because H is then constant in a neighborhood around this point, and constant functions are
continuous.) Furthermore, the Heaviside function is not an ’exotic’ example of a discontinuous
function; in fact, this function plays an important role in physics.
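The discontinuity at 0 can be illustrated with a few lines of Python (a sketch, not part of the notes):

```python
def H(x):
    """Heaviside function: 0 for x < 0, 1 for x >= 0."""
    return 0.0 if x < 0 else 1.0

# Both sequences 1/n and -1/n converge to 0, but the function values differ:
right = [H(1 / n) for n in range(1, 6)]    # along x_n = 1/n
left = [H(-1 / n) for n in range(1, 6)]    # along -x_n = -1/n
assert all(v == 1.0 for v in right)
assert all(v == 0.0 for v in left)
# Hence lim H(x_n) = 1 != 0 = lim H(-x_n): H is not continuous at 0.
```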
Example 4.4. The example of the Heaviside function can be extended to ’jump functions’.
To this end, let I = [a, b] be a closed interval, where a < b. If there exists some t ∈ I such that
a ≠ t and b ≠ t, then functions of the form
f(x) = c1 if x < t, and f(x) = c2 if x ≥ t,
with c1 ≠ c2, are continuous at every point except t.
Often one can find continuity stated in the following equivalent form, which is called the ε-δ-criterion.
In words: Given x0 ∈ D. For all (fixed) ε > 0 there exists δ > 0 such that for all x ∈ D with
|x − x0 | < δ we have that |f (x) − f (x0 )| < ε.
The condition in the above theorem was the first precise definition of a continuous function and
is one of the most essential (and frightening) mathematical statements. It may be stated as “a
small change in x0 only allows a small change in f (x0 )”. The precise definition is ultimately due
to Karl Weierstraß (1815–1897), who is often cited as the “father of modern analysis”.
The following figure gives a visualization of the criterion.
Now, for n ∈ N, let δn = 1/n. Thus we can find xn ∈ D such that |xn − x0| < 1/n and |f(xn) − f(x0)| ≥
ε0. So we have found a sequence xn → x0 with f(xn) ↛ f(x0), which contradicts the continuity
of f. Hence, the ε-δ-criterion must be satisfied.
For the other direction assume the ε-δ-criterion holds and let ε > 0 and (xn ) be an arbitrary
sequence such that xn → x0 . By our assumption we can find δ > 0 such that for all x ∈ D we
have |x − x0 | < δ =⇒ |f (x) − f (x0 )| < ε. Since the sequence (xn ) converges to x0 we have that
there exists n0 ∈ N such that
Example 4.6. The identity, i.e. f(x) := x, is continuous on R. We want to prove this
statement in two ways, first by using the definition and then by using the ε-δ-criterion.
Proof by using the definition. Let x0 ∈ R and (xn) ⊂ R be such that xn → x0. Clearly,
lim_{n→∞} f(xn) = lim_{n→∞} xn = x0 = f(x0).
This yields that f is continuous at x0. As this holds for all x0 ∈ R, we have that f is continuous.
Proof by the ε-δ-criterion. Let ε > 0 and x0 ∈ R, and choose δ := ε. Then, for all x with
|x − x0| < δ, we have |f(x) − f(x0)| = |x − x0| < ε.
Thus, by the ε-δ-criterion, f is continuous at x0. As this holds for all x0 ∈ R, we have that f is
continuous.
Example 4.7. Let us consider the quadratic function f(x) = x² on R. For every
convergent sequence (xn) ⊂ R with xn → x, we have lim f(xn) = lim xn² = (lim xn)² = x² =
f(x), and hence f is continuous on R. We also give a proof of this fact using the ε-δ-criterion
to show the difference to the above example.
Proof by ε-δ-criterion. Let ε > 0 and x0 ∈ R. We need to find δ > 0 such that for all x with
|x − x0 | < δ we have that |f (x) − f (x0 )| = |x2 − x20 | < ε. Note that δ may depend on ε and on
x0 but not on x. (This may be justified by the order of the quantifiers in the definition.)
We use the triangle inequality to obtain
|x² − x0²| = |(x − x0)(x + x0)| = |x − x0| · |x + x0| ≤ |x − x0| · (|x − x0| + 2|x0|).
If |x − x0| < δ ≤ min{1, ε/(1 + 2|x0|)}, then |x − x0| + 2|x0| < 1 + 2|x0|, and hence
|x² − x0²| < δ · (1 + 2|x0|) ≤ ε. Therefore, we can set δ := min{1, ε/(1 + 2|x0|)} and
obtain
|x − x0| < δ ⟹ |x² − x0²| < ε.
Note that δ > 0 for all ε > 0, which implies that f is continuous at x0 . As this also holds for all
x0 ∈ R, we obtain that f is continuous.
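One can also test the choice δ = min{1, ε/(1 + 2|x0|)} numerically by sampling points with |x − x0| < δ (an illustrative Python sketch, not part of the notes):

```python
# Verify |x^2 - x0^2| < eps for sampled x with |x - x0| < delta.
def delta_for(eps, x0):
    """The delta chosen in the proof above."""
    return min(1.0, eps / (1 + 2 * abs(x0)))

for x0 in (-3.0, 0.0, 5.0):
    for eps in (1.0, 0.1, 0.01):
        d = delta_for(eps, x0)
        for i in range(-99, 100):
            x = x0 + d * i / 100   # then |x - x0| <= 0.99 * d < d
            assert abs(x * x - x0 * x0) < eps
```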
Remark 4.8 (*). Note that in the above example, δ depends on ε and x0, and it is not hard
to see that this dependence is necessary here. It is also no problem that δ → 0 as ε → 0 and/or
x0 → ±∞, since we only need δ > 0 for each fixed ε and x0.
Example 4.9. The root function f(x) = √x on [0, ∞) is continuous.
Proof. Let ε > 0 and x, y ∈ [0, ∞). The binomial theorem implies
|√x − √y|² = x + y − 2√(xy) ≤ x + y − 2 min{x, y} = |x − y|.
This shows that |√x − √y| ≤ |x − y|^{1/2}. If we now choose δ = ε², we see that |√x − √y| < ε for
all x, y with |x − y| < δ. This shows that f is continuous on [0, ∞).
Example 4.10. Let us also show that the exponential function f(x) = a^x for fixed a ∈ (0, ∞)
is a continuous function on R.
Proof. Let (xn) be convergent with xn → x0. Then, for every k ∈ N there is an nk ∈ N such that
|xn − x0| ≤ 1/k
for all n ≥ nk. Using the fact that a^{1/k} = ᵏ√a → 1, as k → ∞, we obtain that a^{xn − x0} → 1, as
n → ∞. This yields
a^{xn} = a^{x0} · a^{xn − x0} → a^{x0} = f(x0),
i.e., f is continuous at x0.
Example 4.11. The trigonometric functions sin and cos are continuous on R. This can
be seen from their ’graphical definition’, or also from the inequality | sin(x) − sin(y)| ≤ |x − y|.
We do not give a formal proof here, as there will be a very simple one later.
Next we want to establish some calculation rules for continuous functions. These rules allow us to
prove continuity of complicated functions by proving that their (hopefully easier) ’building blocks’
are continuous.
Proof. The theorem follows immediately from Theorem 3.24 about the calculation rules for
limits. We only show that f + g is continuous at x0 if f and g are continuous at x0. The
remaining cases can be shown in the same way. Assume that (xn) is a sequence in D converging
to x0. By the calculation rules for limits and the continuity of f and g, we obtain
lim (f + g)(xn) = lim f(xn) + lim g(xn) = f(x0) + g(x0) = (f + g)(x0).
Proof. We have already seen that constant functions and the identity are continuous on R.
Applying the above theorem several times, we obtain that also x^k, and therefore c_k x^k, are
continuous on R. Adding up these terms and applying the theorem again, we get the result.
By Theorem 4.65, we see that p is uniformly continuous on every closed interval D ⊂ R. If D
is a (half-)open interval, say D = (a, b), then note that p is uniformly continuous on [a, b], and
therefore uniformly continuous on every subset, e.g., on D.
Proof. We already know that p and q are continuous on D from the last example. Furthermore,
from the above theorem, p/q is continuous at x0 whenever q(x0) ≠ 0. Since q has no zeros in D,
we get the result.
Example 4.16. Another consequence of the above theorem is the continuity of tan x at all
x ∈ R with cos x ≠ 0, i.e. x ≠ π/2 + kπ for all k ∈ Z. This follows from the representation
tan x = sin x / cos x and the continuity of sin and cos. Analogously, cot x is continuous at all x with x ≠ kπ.
Proof. Consider an arbitrary sequence (xn) in D converging to x0. Setting yn = f(xn) and using
the continuity of f and g, we obtain
lim_{n→∞} g(yn) = g(lim_{n→∞} yn) = g(lim_{n→∞} f(xn)) = g(f(lim_{n→∞} xn)) = g(f(x0)) = (g ◦ f)(x0).
Example 4.19. Let f, g : D → R be continuous functions (on D). Then min{f, g} and
max{f, g} are also continuous. This follows from the identities
min{f, g} = (f + g − |f − g|) / 2
and
max{f, g} = (f + g + |f − g|) / 2.
(This may be shown by case distinction.) The right hand sides are continuous by the above
theorem.
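The two identities can be verified by a brute-force check over sample values (Python sketch, not from the notes):

```python
import itertools

# Check min/max identities on a grid of sample pairs.
for a, b in itertools.product([-2.0, -0.5, 0.0, 1.5, 3.0], repeat=2):
    assert (a + b - abs(a - b)) / 2 == min(a, b)
    assert (a + b + abs(a - b)) / 2 == max(a, b)
```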
We now show that also the inverse of a continuous function on intervals is continuous.
Recall that the inverse only exists for bijective functions.
Note that the interval might be open, closed, bounded or unbounded, but it is important that
it is a “connected” set.
Proof. Let y ∈ D be arbitrary and let (yn ) be a sequence converging to y. We want to show
that f −1 (yn ) converges to f −1 (y). For this, we define xn := f −1 (yn ) and x := f −1 (y), thus
yn = f (xn ) and y = f (x). Clearly, the sequence (xn ) is contained in the bounded set [a, b].
Hence, the Bolzano-Weierstrass theorem (Theorem 3.42) yields that there exists a convergent
subsequence (xnk), and we set z := lim_{k→∞} xnk. Using the continuity of f and the fact that
z ∈ [a, b], we obtain
lim ynk = lim f (xnk ) = f (z).
k→∞ k→∞
Since yn → y also implies limk→∞ ynk = y = f (x), we get that f (z) = f (x). Since f is bijective,
this implies z = x. This gives limk→∞ xnk = x. As this holds for all convergent subsequences
of (xn ), we see that (xn ) has exactly one accumulation point, and is therefore convergent, i.e.,
xn → x. All in all we have shown that
f −1 (yn ) = xn → x = f −1 (y),
which concludes the proof.
This theorem gives an easy argument for the continuity of several known functions.
Example 4.21. We obtain that the root functions f(x) = ᵏ√x are continuous on [0, ∞) for
arbitrary k ∈ N. Just note that f is the inverse function of f⁻¹(x) = x^k on [0, ∞), which is
bijective, and continuous because it is a polynomial.
In the same way, we obtain that f(x) = 1/ᵏ√x = x^{−1/k} is continuous on (0, ∞) for every k ∈ N.
Example 4.22. We also obtain that logarithmic functions f (x) = logb (x) are continuous on
R+ := (0, ∞) for every b > 1.
Again, just note that f is the inverse function of the exponential function f −1 (x) = bx on
R, which is bijective, and continuous, see Example 4.10. (Note that f : R+ → R, and hence
f −1 : R → R+ .)
Note that we allow also ±∞ as accumulation points. Clearly, they are accumulation points if D
is not bounded (from below/above).
Moreover, note that accumulation points of a set D may not be contained in D, and that not
all points of D are accumulation points.
Example 4.24. The set of all accumulation points of (a, b) (or [a, b]) with a, b ∈ R is the closed
interval [a, b].
The set of all accumulation points of (a, ∞) is [a, ∞) ∪ {∞}.
Example 4.25. The sets N and Z do not have any real accumulation points. However, note that
N has the accumulation point ∞, and Z has ±∞ as accumulation points.
Example 4.26. Consider M = {1/n : n ∈ N}. Then, 0 is an accumulation point of M, but 0 is
not in M. Moreover, 0 is the only accumulation point, since there is no non-constant sequence
in M converging to, e.g., 1/42. This example shows that M and the set of its accumulation points
can even be disjoint.
lim f (xn ) = y.
n→∞
It is important to note that the existence of the limit limx→x0 f (x) does not depend on the value
f (x0 ). However, if f is continuous at x0 ∈ D and x0 is an accumulation point, then this limit
must still be equal to f (x0 ).
However, note that limx→0 f(x) does not exist. For this, consider xn := 1/n and yn = −1/n,
which satisfy lim(xn ) = lim(yn ) = 0, but lim f (xn ) = ∞ and lim f (yn ) = −∞.
Example 4.32. Let f : R \ {0} → R be defined by f(x) = 1/x². Since f is continuous at all
x ∈ R \ {0}, we have, e.g., limx→1 f(x) = 1 or limx→−2 f(x) = 1/4. Moreover, we have
Example 4.33 (Euler number). Let us consider an interesting limit that we have considered
already for sequences, see Example 3.36. Recall that Euler’s number was defined by
e = lim_{n→∞} (1 + 1/n)^n = sup_{n∈N} (1 + 1/n)^n.
It is not hard to see that one obtains the same limit for every xn → ∞ in place of xn = n, i.e.,
e = lim_{y→∞} (1 + 1/y)^y = sup{(1 + 1/y)^y : y > 0}.
(Verify this precisely!)
With this, we can find a useful representation for the powers e^x for x ∈ R. First note that by
continuity of the function f(y) := y^x on (0, ∞), we obtain
e^x = lim_{y→∞} ((1 + 1/y)^y)^x = lim_{y→∞} (1 + 1/y)^{xy}.
For x > 0 we use that for every sequence (yn) we have yn → ∞ ⟺ zn := x · yn → ∞. Hence,
e^x = lim_{y→∞} (1 + 1/y)^{xy} = lim_{z→∞} (1 + x/z)^z = lim_{n→∞} (1 + x/n)^n.
(We used the substitution z = x · y ⟺ y = z/x. The last equality is just taking the special
sequence zn = n.) Using e^{−x} = 1/e^x, we can prove the same equation for x < 0. (Verify this!)
We can now follow exactly the same lines as in Example 3.69, using the binomial theorem, to
obtain the representations
e^x = lim_{n→∞} (1 + x/n)^n = ∑_{k=0}^∞ x^k / k!,
which is valid for all x ∈ R.
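Both representations of e^x can be compared numerically (an illustrative Python sketch, not part of the notes; tolerances are ad hoc):

```python
import math

x = 1.5
# (1 + x/n)^n approaches e^x as n grows (slowly, with error of order 1/n)
seq = [(1 + x / n) ** n for n in (10, 100, 10_000, 1_000_000)]
assert abs(seq[-1] - math.exp(x)) < 1e-4

# The partial sums of sum_{k>=0} x^k / k! converge much faster
s = sum(x ** k / math.factorial(k) for k in range(30))
assert abs(s - math.exp(x)) < 1e-12
```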
Based on what we know about limits and continuous functions in general, we again obtain the
following rules of calculation.
(iii) if B ≠ 0, then lim_{x→x0} f(x)/g(x) = A/B.
Example 4.36. Let f : R \ {0} → R with f(x) = sin(1/x). Then, we can write f = h ◦ g with
h(x) = sin(x) and g(x) = 1/x. Since h is continuous on R and g is continuous at every x0 ≠ 0
with y0 := lim_{x→x0} g(x) = g(x0) = 1/x0, we obtain lim_{x→x0} f(x) = h(y0) = sin(1/x0) for every x0 ≠ 0.
So, e.g., lim_{x→2} f(x) = sin(1/2) or lim_{x→1/π} f(x) = sin(π) = 0. Moreover, we have (for x0 = ±∞)
that lim_{x→∞} f(x) = lim_{x→−∞} f(x) = sin(0) = 0.
However, the limit lim_{x→0} f(x) does not exist. To see this, define the sequence xn = 1/(π(n + 1/2)),
and note that xn → 0, but f(xn) = (−1)^n, which is not convergent, see Figure 23.
Figure 23: The graph of f(x) = sin(1/x) on [−2, 2]
Similarly, consider f(x) = x² · sin(1/x) on R \ {0}. We have
−x² ≤ x² · sin(1/x) ≤ x²
for all x ≠ 0, since |sin(y)| ≤ 1 for all y ∈ R. Hence, lim_{n→∞} |f(xn)| ≤ lim_{n→∞} xn² = 0 for every
null sequence (xn). This implies that lim_{x→0} f(x) = 0, see Figure 24.
Figure 24: The graph of f(x) = x² sin(1/x), enclosed between −x² and x²
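The sandwich bound −x² ≤ x² sin(1/x) ≤ x² is easy to confirm numerically (Python sketch, not from the notes):

```python
import math

def f(x):
    """f(x) = x^2 * sin(1/x) for x != 0."""
    return x * x * math.sin(1 / x)

# |f(x)| <= x^2, so f(x_n) -> 0 along every null sequence (x_n)
for n in (10, 100, 1000):
    x = 1 / n
    assert abs(f(x)) <= x * x
    assert abs(f(-x)) <= x * x
assert abs(f(1e-6)) < 1e-12
```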
The calculations from the last example should remind you of the application of the sandwich
rule for limits of sequences, Theorem 3.28. In fact, we have a very similar rule for limits of
functions, if two enclosing functions have the same limit.
Most often, we will apply the sandwich rule in the following form.
Another thing that can be seen from the above example is that, sometimes, a function is not
well-defined at a single point (or even on a set), but we can compute the limit at this point. In
such a case, we may “extend the domain” of the function using these limits. That is, given
a function f : D → R and some accumulation point x0 ∉ D of D (i.e., f is not defined at x0)
such that y := lim_{x→x0} f(x) exists, we can define the function g : D ∪ {x0} → R with
g(x) := f(x) if x ≠ x0, and g(x) := y if x = x0.
As an example, consider f : R \ {1} → R with
f(x) = (x² − 1)/(x − 1).
Then, f is not well defined at x0 = 1 (since dividing by zero is not allowed). However,
lim_{x→1} (x² − 1)/(x − 1) = lim_{x→1} (x + 1)(x − 1)/(x − 1) = lim_{x→1} (x + 1) = 2.
(Note that (x − 1)/(x − 1) = 1 can only be used for x ≠ 1, which holds inside the limit.)
Hence, the function g : R → R with
g(x) = (x² − 1)/(x − 1) if x ≠ 1, and g(x) = 2 if x = 1,
is continuous on R.
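Numerically, the removable discontinuity looks as follows (illustrative Python, not part of the notes):

```python
# (x^2 - 1)/(x - 1) = x + 1 for x != 1, so the values approach 2 near x = 1.
def f(x):
    return (x * x - 1) / (x - 1)   # undefined at x = 1 itself

for n in (10, 1000, 100_000):
    h = 1 / n
    assert abs(f(1 + h) - 2) <= h + 1e-12   # f(1 + h) = 2 + h
    assert abs(f(1 - h) - 2) <= h + 1e-12   # f(1 - h) = 2 - h
```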
However, note that limx→0 H(x) would exist, if we only allow positive or negative sequences,
respectively. This motivates the definition of one-sided limits of functions.
f (xn ) → y.
Note that the assumption that x0 is an accumulation point of D+ just means that there are
points in D that are to the right and arbitrarily close to x0 , i.e., there is a sequence in D+
converging to x0 . That x0 is an accumulation point of D− means the same ’to the left’.
Example 4.44. Let H be the Heaviside function as above. Then, H is not continuous at 0,
but the one-sided limits exist and we have
lim_{x↘0} H(x) = 1 and lim_{x↗0} H(x) = 0.
Since these limits are different, lim_{x→0} H(x) does not exist.
Figure 25: f(x) = 1/(1 − x²) on D = R \ {±1}
Due to
f(x) = (x − 2)/(x² − 4) = (x − 2)/((x − 2)(x + 2)) = 1/(x + 2)
for x ≠ 2, we obtain
lim_{x→2} f(x) = lim_{x→2} 1/(x + 2) = 1/4.
Hence, f is not continuous at x = 2. Moreover, we have lim_{x↗−2} f(x) = −∞ and lim_{x↘−2} f(x) = ∞.
So, it is also discontinuous at x = −2.
Figure 26: f(x) = (x − 2)/(x² − 4)
We see that one-sided limits can exist although the limit does not exist. In this case, the one-
sided limits are still helpful to verify if the (two-sided) limit exists, or even if the function is
continuous. This is again an example of a mathematical concept that is just introduced to split
a task into two easier ones.
We now prove that lim_{x→x0} f(x) exists if and only if both one-sided limits exist and are equal:
lim_{x→x0} f(x) exists ⟺ lim_{x↘x0} f(x) and lim_{x↗x0} f(x) exist and are equal.
In particular, all three limits coincide in this case.
Proof. First we assume that lim_{x→x0} f(x) exists. Clearly, the right- and left-hand limits are special
cases and therefore exist, and lim_{x↗x0} f(x) = lim_{x↘x0} f(x).
For the other direction we assume that the right- and left-hand limits at x0 exist and
lim_{x↗x0} f(x) = lim_{x↘x0} f(x) =: y. Let (xn) be an arbitrary sequence such that xn → x0 and
xn ≠ x0. We can split (xn) into two subsequences (yk+) and (yk−), where yk+ > x0 and yk− < x0
for all k ∈ N. By our assumption we get that
lim_{k→∞} f(yk+) = lim_{k→∞} f(yk−) = y.
This implies that for arbitrary ε > 0 we can find some k0 ∈ N such that |f(yk+) − y| ≤ ε and
|f(yk−) − y| ≤ ε for all k ≥ k0. Consequently, there is some n0 ∈ N such that |f(xn) − y| ≤ ε
for all n ≥ n0, i.e., lim_{x→x0} f(x) = y.
Let us finally consider another interesting example, which is one of the most important limits
related to trigonometric functions. We consider the function si : R → R with
si(x) := sin(x)/x if x ≠ 0, and si(x) := 1 if x = 0.
This function is called the sinus cardinalis and is clearly well defined for all x 6= 0. We will
prove now that si is a continuous function on R.
Example 4.48. It remains to show that
lim_{x→0} sin(x)/x = 1.
Proof. As si is even, i.e. si(−x) = si(x), it is sufficient to show that lim_{x↘0} sin(x)/x = 1. Now, recall
the definition of sin, cos and tan on the unit circle, i.e., the circle with radius 1 and center at
the origin, see the following figure.
We see that the area of the enclosed circular sector with angle x ∈ [0, π/2), which equals x/2, is
larger than the area of the triangle with legs sin x, cos x, but smaller than the area of the triangle
with legs tan x, 1. Using the corresponding area formulas we obtain
(1/2) sin(x) cos(x) ≤ (1/2) x ≤ (1/2) tan(x).
This is equivalent to
sin(x) cos(x) ≤ x ≤ tan(x) = sin(x)/cos(x),
and therefore to
cos(x) ≤ sin(x)/x ≤ 1/cos(x).
(Really try to prove this inequality from the one before!)
We know that cos x is continuous and lim_{x→0} cos(x) = cos(0) = 1. Using the sandwich rule we
obtain
1 ≤ lim_{x→0} sin(x)/x ≤ 1.
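The two-sided bound cos x ≤ sin(x)/x ≤ 1/cos x, and the resulting limit, can be checked numerically (Python sketch, not part of the notes):

```python
import math

# cos x <= sin(x)/x <= 1/cos x for x in (0, pi/2)
for x in (0.5, 0.1, 0.01, 1e-4):
    ratio = math.sin(x) / x
    assert math.cos(x) <= ratio <= 1 / math.cos(x)

# and the ratio indeed tends to 1
assert abs(math.sin(1e-6) / 1e-6 - 1) < 1e-9
```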
We now discuss two important properties of continuous functions. Both are (or at least look)
obvious for ’easy’ functions. However, we show that they hold under the weak assumption that
the function is continuous and defined on a closed interval.
The first will be the Intermediate value theorem, which states that a continuous function on a
closed interval attains all values between the function values at the endpoints of the interval.
Theorem 4.49 (Intermediate value theorem). Let I = [a, b] be a closed interval and
f : I → R be a continuous function. Then, for every y ∈ R with
min{f(a), f(b)} ≤ y ≤ max{f(a), f(b)},
there exists some x ∈ I with f(x) = y.
As this holds for every y ∈ J := [min{f(a), f(b)}, max{f(a), f(b)}], the theorem states
that the image of I under f, i.e. f(I), at least contains this interval, i.e. f(I) ⊃ J, if I is a
closed interval, see Figure 27.
Proof. The case f (a) = f (b) is obvious. Now assume w.l.o.g. that f (a) < f (b). The case
f (a) > f (b) can be proven in the same way by replacing f by −f .
Now let y ∈ [f(a), f(b)]. We will define a sequence (xn) with xn → x ∈ I and f(x) = y.
First, let a1 = a, b1 = b and x1 = (a1 + b1)/2, i.e., x1 is the midpoint between a and b.
If f (x1 ) ≥ y, then we set [a2 , b2 ] to be the ’left half’ of [a1 , b1 ]. If on the other hand f (x1 ) < y,
we set [a2 , b2 ] to be the ’right half’ of [a1 , b1 ]. In both cases we get that [a2 , b2 ] ⊂ [a1 , b1 ] and
f (a2 ) ≤ y ≤ f (b2 ). We iterate this process.
Let’s make that more formal: For n ∈ N and given an, bn such that f(an) ≤ y ≤ f(bn), we
define
xn := (an + bn)/2,
and, if f(xn) ≥ y,
an+1 := an, bn+1 := xn,
while otherwise
an+1 := xn, bn+1 := bn.
With this, we get two sequences (an ) and (bn ) with [an , bn ] ⊂ [a, b] and f (an ) ≤ y ≤ f (bn ) for
all n ∈ N, and
a = a1 ≤ a2 ≤ · · · ≤ an ≤ · · · ≤ bn ≤ · · · ≤ b2 ≤ b1 = b.
This yields that (an ), (bn ) are monotone and bounded sequences, which are therefore convergent.
Moreover, we have bn − an = (b − a)/2^{n−1}, because we always halve the interval, and get that
lim_{n→∞} an = lim_{n→∞} bn =: x.
n→∞ n→∞
We clearly have x ∈ [a, b]. (Why?) Using f(an) ≤ y ≤ f(bn) for all n and the continuity of f, we obtain
f(x) = lim_{n→∞} f(an) ≤ y ≤ lim_{n→∞} f(bn) = f(x).
Hence
f(x) = y.
An important special case is the following corollary, which is sometimes called Bolzano’s theorem.
Example 4.51. The intermediate value theorem yields an alternative argument for the existence
of arbitrary positive roots: For a > 0 and n ∈ N, let f : R≥0 → R be given by f(x) = x^n − a. As a
polynomial, f is continuous. Furthermore, we have f(0) = −a < 0 and f(1 + a) = (1 + a)^n − a > 0.
The intermediate value theorem then yields some x ∈ [0, 1 + a] with f(x) = 0, i.e. x^n = a or
x = ⁿ√a, respectively.
Example 4.52. The intermediate value theorem also yields the existence of real zeros
of polynomials of odd degree. To this end, let n ∈ N be odd and p : R → R a polynomial of
degree n given by
p(x) = an x^n + an−1 x^{n−1} + . . . + a1 x + a0
with an ≠ 0. We assume w.l.o.g. that an > 0. Clearly we have lim_{x→∞} p(x) = +∞ and
lim_{x→−∞} p(x) = −∞. Consequently, there exist a, b ∈ R with a < b and p(a) < 0, p(b) > 0.
The intermediate value theorem then yields an x ∈ (a, b) with p(x) = 0.
Example 4.53. The next application refers to fixed points, i.e., points that are not changed
by a function. Let f : [0, 1] → [0, 1] be continuous. Then, there exists an x ∈ [0, 1] with f(x) = x.
For the proof we look at the continuous function g(x) = f(x) − x. We have g(0) = f(0) − 0 ≥ 0
and g(1) = f(1) − 1 ≤ 0, and the intermediate value theorem then yields an x ∈ [0, 1] with
g(x) = 0, so we have f(x) = x.
We now want to discuss the extreme value theorem, which states that a continuous function on
a closed interval attains also its minimal and maximal value.
Let us start with the definition of minimal and maximal points of a function.
We use the term extremum if we do not specify if it is a minimum or maximum. Moreover, we
call them global extrema, as we will later discuss also a local variant of this concept.
Note that the minimum/maximum (value) of a function, if it exists, is unique. However, there
might still be more than one minimum/maximum point.
Theorem 4.56 (Extreme value theorem). Let I = [a, b] ⊂ R be a closed interval and
f : I → R be a continuous function. Then there exist xmin, xmax ∈ I such that
f(xmin) ≤ f(x) ≤ f(xmax) for all x ∈ I.
This theorem shows that the infimum inf x∈I f (x) and supremum supx∈I f (x) are attained at
some points in I. In fact, infimum and supremum are actually minimum and maximum.
Proof. We only show that f attains its maximal value, the other case can be treated similarly.
Recall that f (I) = {y ∈ R : f (x) = y for some x ∈ I}. Since I is non-empty, f (I) is non-empty
and therefore, S := sup f (I) exists (the case S = ∞ is still allowed here). By the properties
of suprema there exists a sequence (yn ) in f (I) such that yn → S. Furthermore there exists
a sequence (xn ) in I such that f (xn ) = yn . This sequence is bounded, since all xn ∈ I, i.e.,
a ≤ xn ≤ b. By the Bolzano-Weierstrass theorem (Theorem 3.42) there exists a convergent
subsequence (xnk ) which converges to some x0 ∈ R. Now, by the sandwich rule (Theorem 3.28)
and a ≤ xnk ≤ b, we also obtain that a ≤ x0 ≤ b, i.e., x0 ∈ I. The definition of continuity yields
f(x0) = lim_{k→∞} f(xnk) = lim_{k→∞} ynk = S,
which proves the claim with xmax := x0 (and implies S < ∞).
Remark 4.57. Note that it is important that we have a closed interval in the extreme
value theorem. For open intervals the statement is not true in general. For instance, if we
consider f (x) = x (we already know that this is a continuous function) and I = (a, b) = (0, 1),
then
sup_{x∈I} f(x) = 1 and inf_{x∈I} f(x) = 0,
but these values are not attained at any point of I = (0, 1).
Example 4.58. Another example that shows that the extreme value theorem does not hold in
general for open intervals is the function f (x) = x1 on (0, 1). This function is continuous, but
unbounded and therefore does not have a maximum.
• The extreme value theorem shows that m := inf x∈I f (x) and M := supx∈I f (x) are
attained at some points xmin , xmax ∈ I.
• The intermediate value theorem (applied to the interval [xmin , xmax ]) implies that all
intermediate values are attained.
• In short: f(I) = [min_{x∈I} f(x), max_{x∈I} f(x)].
Figure 29: All values between the extreme values f (xmin ) and f (xmax ) are attained
In words, continuous functions on closed intervals are bounded, i.e., supx∈I |f (x)| < ∞.
We finally want to discuss briefly how one can actually find a point x∗ ∈ [a, b] such that f(x∗) = y
for y with f(a) ≤ y ≤ f(b). (The intermediate value theorem implies that such a point exists.) For the
sake of simplicity we only discuss the case y = 0. (One might consider g(x) = f(x) − y
otherwise.) The proof of the intermediate value theorem leads directly to a (practical) algorithm
for the approximation of x∗ with f (x∗ ) = 0 that is called the bisection method.
Let f : [a, b] → R be continuous with f(a) < 0 < f(b). The bisection method is inductively
defined as follows: Set a1 := a and b1 := b. For k = 1, 2, . . ., compute the midpoint
xk := (ak + bk)/2. If f(xk) > 0, set ak+1 := ak and bk+1 := xk; if f(xk) < 0, set
ak+1 := xk and bk+1 := bk.
We always have ak , bk ∈ [a, b] and f (ak ) < 0 < f (bk ), i.e., there is always a zero x∗ in the interval
[ak , bk ], and we have
|xk − x∗| ≤ (bk − ak)/2 = 2^{−k} (b − a).
The iteration is stopped if (by chance) f(xk) = 0 holds for some k, or if the upper bound is
smaller than a prescribed ε > 0. The point xk is then used as an “ε-approximation” of the zero x∗.
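The method above can be sketched in a few lines of Python (an illustrative implementation under the stated assumptions f(a) < 0 < f(b); not part of the original notes):

```python
def bisect(f, a, b, eps):
    """Bisection method for a continuous f with f(a) < 0 < f(b).
    Returns a point x with |x - x*| <= eps for some zero x* of f."""
    assert f(a) < 0 < f(b)
    while True:
        x = (a + b) / 2
        fx = f(x)
        # the zero lies in [a, b], so the midpoint has error <= (b - a)/2
        if fx == 0 or (b - a) / 2 <= eps:
            return x
        if fx > 0:
            b = x   # zero lies in [a, x]
        else:
            a = x   # zero lies in [x, b]

# the setting of Example 4.60: f(x) = x^2 - 2 on [0, 2], eps = 0.05
root = bisect(lambda x: x * x - 2, 0.0, 2.0, 0.05)
assert abs(root - 2 ** 0.5) <= 0.05
```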
Example 4.60. The positive zero of the function f(x) = x² − 2 is clearly x∗ = √2. We start
the bisection method with a = 0 and b = 2 and want to achieve an error |xk − x∗| that is smaller
than ε = 0.05.
The requirements of the bisection method are fulfilled, since f is continuous and f (0) < 0 < f (2).
To satisfy the prescribed error bound ε > 0, we need that
2^{−k} (2 − 0) < ε ⟺ k > log₂(1/ε) + 1.
For ε = 0.05, we can choose k = 6. In fact, for k = 6 the bound shows that the error will be at
most 2^{−5} = 1/32 = 0.03125. Let us have a look at the first iterations:
k   ak     bk     xk     f(xk)
1   0      2      1      −1
2   1      2      3/2    1/4
3   1      3/2    5/4    −7/16
4   5/4    3/2    11/8   −7/64
5   11/8   3/2    23/16  17/256
6   11/8   23/16  45/32  −23/1024
The actual error of x6 = 45/32 ≈ 1.40625 to the exact zero x∗ = √2 ≈ 1.414214 is, however, only
about 0.007964.
Let us introduce two stronger forms of continuity of functions that will be useful later. Note
that both are defined for the whole domain, and not at single points.
By the above considerations, we immediately see that constant functions and the identity f (x) :=
x are uniformly continuous (Check yourself!), and we will show in Theorem 4.70 that every
uniformly continuous function is continuous. To see that uniform continuity is indeed a stronger
condition, consider the following example.
Example 4.62. Let f(x) := x² on R, which is continuous by Example 4.7. Moreover, consider
the sequences defined by xn = n + 1/n and yn = n, which clearly satisfy |xn − yn| = 1/n → 0.
However, we have
|f(xn) − f(yn)| = |(n + 1/n)² − n²| = |n² + 2 + 1/n² − n²| = 2 + 1/n² → 2.
As discussed in Example 4.7, the difference in proving continuity of, e.g., the functions x and
x2 , was that we had to choose δ depending on x0 in the latter case. We will now see that
uniformly continuous functions are precisely those, where we may find a δ independent of x0 in
an “ε-δ-proof” of continuity. (Clearly, such a δ still depends on ε in general.)
In words: For all (fixed) ε > 0 there exists δ > 0 such that for all x, y ∈ D with |x − y| < δ
we have that |f (x) − f (y)| < ε.
From this equivalent definition of uniform continuity, one may already see that it is indeed
a stronger condition than continuity. For this, note the additional “for all” quantifier in the
statement. For a better understanding, try to write the definition of “f is continuous on D (i.e.,
for all x0 ∈ D)” solely with quantifiers, and spot the difference.
Proof. First we prove that the ε-δ-criterion for uniform continuity implies uniform continuity.
To this end, assume the criterion holds, and let (xn), (yn) be arbitrary sequences in D with
|xn − yn| → 0. Fix ε0 > 0 and choose δ0 > 0 according to the criterion. Since |xn − yn| → 0,
there is some n0 ∈ N with
∀n ≥ n0 : |xn − yn| < δ0.
Hence
∀n ≥ n0 : |f(xn) − f(yn)| < ε0.
Since ε0 was arbitrary, we obtain that lim_{n→∞} |f(xn) − f(yn)| = 0. Since also (xn) and (yn)
were arbitrary, we get that f is uniformly continuous on D from the definition.
The other direction is proved by contradiction. Therefore, we assume that the ε-δ-criterion
fails, i.e., there exists some ε0 > 0 such that for every δ > 0 there are x, y ∈ D with
|x − y| < δ but |f(x) − f(y)| ≥ ε0, and we want to show that this implies that f is not
uniformly continuous, i.e., that there exist sequences (xn), (yn) in D such that |xn − yn| → 0
and |f(xn) − f(yn)| ↛ 0. For this, let δm := 1/m for all m ∈ N. From the assumption
(with δ = δm), we can find xm, ym ∈ D such that |xm − ym| < 1/m and |f(xm) − f(ym)| ≥ ε0.
Thus, we found sequences (xm), (ym) such that
∀m ∈ N : |xm − ym| < 1/m and |f(xm) − f(ym)| ≥ ε0,
i.e., |xn − yn| → 0 and |f(xn) − f(yn)| ↛ 0. This finishes the proof.
Example 4.64. Try to prove yourself, using the ε-δ-criterion, that the absolute value function,
i.e. f (x) = |x|, is uniformly continuous on R.
Uniform continuity is an essential tool for many of the following considerations, in the same
way as absolute convergence was essential for series. But uniform continuity is sometimes not
so easy to show. The following theorem shows that, however, if a continuous function is defined
on a closed interval, then it is automatically uniformly continuous.
Together with Theorem 4.70 (see below), this shows that continuity and uniform continuity are
just the same for functions defined on closed intervals.
Proof. We prove the result by contradiction, so we assume that there exist sequences (xn ) and
(yn ) such that
|xn − yn | → 0 and ∀n ∈ N : |f (xn ) − f (yn )| ≥ ε0 ,
for some ε0 > 0, i.e. that f is not uniformly continuous. Since [a, b] is bounded, and (xn ) ⊂ [a, b],
we have that (xn ) is bounded, and therefore, that there exists a convergent subsequence of (xn ),
say (xnk) with xnk → x0. This is a consequence of the Bolzano-Weierstrass theorem. Using
the triangle inequality we see
|x0 − ynk| ≤ |x0 − xnk| + |xnk − ynk| → 0,
so (ynk) also converges to x0. The continuity of f and | · | yields
lim_{k→∞} |f(xnk) − f(ynk)| = |f(lim_{k→∞} xnk) − f(lim_{k→∞} ynk)| = |f(x0) − f(x0)| = 0,
which contradicts |f(xnk) − f(ynk)| ≥ ε0.
The last type of continuity we want to discuss is Lipschitz continuity. This is the strongest,
but also the easiest to verify, of the concepts we consider. Luckily, it is enough to deal with this
type for most practical applications.
Example 4.68. Again, the constant function f (x) := c and the linear function f (x) := x,
are Lipschitz continuous. In both cases we can choose the Lipschitz constant L = 1. (For the
constant function, the Lipschitz constant may be chosen arbitrarily small.)
Example 4.69. It is also not hard to show that f (x) := x2 on D is Lipschitz continuous on
arbitrary bounded D ⊂ R. Check yourself!
Proof. First we show that Lipschitz continuous functions are uniformly continuous. For arbitrary
ε > 0 we set δ = ε/L, where L is the Lipschitz constant of f. For all x, y ∈ D with |x − y| < δ,
we obtain
|f(x) − f(y)| ≤ L |x − y| < L δ = ε.
As this holds for all ε > 0, the ε-δ-criterion for uniform continuity implies that f is uniformly
continuous.
Now assume that f is uniformly continuous on D and let x0 ∈ D. For every ε > 0, the uniform
continuity yields some δ > 0 such that |f(x) − f(x0)| < ε for all x ∈ D with |x − x0| < δ.
Hence f is continuous on D by the ε-δ-criterion.
This theorem shows that Lipschitz continuity implies the two other forms of continuity we have
just discussed. However, one may still ask if some of these concepts are actually the same. We
have already seen that the function f (x) = x2 is not uniformly continuous on R, which shows
that uniform continuity is indeed stronger than continuity. The following example shows that a
function may be continuous on a closed interval (and therefore uniformly continuous), but not
Lipschitz continuous. That is, we do not have the reverse implications in Theorem 4.70, i.e.
Lipschitz continuous ⇍ uniformly continuous ⇍ continuous.
Example 4.71. Let f : [0, 1] → R with x ↦ √x. Then f is uniformly continuous but not
Lipschitz continuous.
Figure 30: The function f(x) = √x on [0, 1]
Proof. First we show that f is uniformly continuous on [0, 1]. Again, we use the ε-δ-criterion
for the purpose of demonstration. Therefore, let ε > 0 and x, y ∈ [0, 1]. The binomial theorem
implies
|√x − √y|² = x + y − 2√(xy) ≤ x + y − 2 min{x, y} = |x − y|.
This shows that |√x − √y| ≤ |x − y|^{1/2}. If we now choose δ = ε², we see that |√x − √y| < ε for
all x, y with |x − y| < δ. This shows that f is uniformly continuous on [0, 1].
We now show that, however, f is not Lipschitz continuous. Multiplying and dividing by |√x + √y|
(and the binomial theorem) yield
|√x − √y| = |x − y| / (√x + √y).
If f would be Lipschitz continuous, then there would exist some L > 0 such that
√ √ |x − y|
| x − y| = √ √ < L|x − y|
| x + y|
1√
for all x, y ∈ [0, 1]. However, this would mean L > |√x+ y|
for all x, y ∈ [0, 1], which is clearly
not true, as the right hand side can be made arbitrary large by choosing x and y small enough.
Hence f cannot be Lipschitz continuous.
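The two halves of this example can be spot-checked numerically; a minimal sketch (the sample points and the choice δ = ε² are taken from the proof above, everything else is illustrative):

```python
import math

def sqrt_slope(x):
    """Secant slope (sqrt(x) - sqrt(0)) / (x - 0) = 1 / sqrt(x) for x > 0."""
    return math.sqrt(x) / x

# The secant slopes at 0 grow without bound, so no Lipschitz constant L
# can satisfy |sqrt(x) - sqrt(y)| <= L |x - y| on all of [0, 1].
slopes = [sqrt_slope(x) for x in (1e-2, 1e-4, 1e-6)]

# Uniform continuity: delta = eps**2 works for every pair x, y in [0, 1],
# since |sqrt(x) - sqrt(y)| <= |x - y| ** 0.5.
def delta_for(eps):
    return eps ** 2
```

The growing slopes illustrate why a single Lipschitz constant cannot exist, while δ depends on ε only, not on the points.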
Example 4.72. Let us finally mention, without formal proof, that the trigonometric func-
tions sin and cos are Lipschitz continuous on R with Lipschitz constant L = 1. This can
be seen from their ’graphical definition’. See also Figure 31 which shows that all function values
lie in the colored cone. The trigonometric functions are therefore also uniformly continuous.
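The Lipschitz bound |sin(a) − sin(b)| ≤ 1 · |a − b| stated in this example can be spot-checked on random points; the sample range and count are arbitrary:

```python
import math
import random

random.seed(0)
# Spot-check the Lipschitz bound |sin(a) - sin(b)| <= |a - b|
# (which follows from |cos| <= 1) on random pairs of points.
pairs = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(1000)]
lipschitz_ok = all(abs(math.sin(a) - math.sin(b)) <= abs(a - b) for a, b in pairs)
```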
5 Differential calculus
In this chapter we want to introduce and study derivatives of real-valued functions, which give us
a better understanding of how small local changes of the input affect the output of a function.
For functions defined on the real line, as we will assume throughout this chapter, one may think
about the slope (German: Anstieg) of the tangent line attached to a point of the graph of the
function. As with continuity, this is a good intuition and, when well understood, makes
many of the upcoming results obvious for ’easy’ functions. However, we again need a precise
definition of a derivative to handle cases where a visualization is not possible or helpful.
Prominent applications of the differential calculus are an alternative (and more precise) defini-
tion of minima and maxima of a function; moreover, the derivative provides us with information
about whether a function is increasing or decreasing at a given point. However, there is much more
information ’hidden’ in the values of the higher-order derivatives at a point. In fact, under
certain assumptions on the function, a function can be given approximately in a neighborhood
of the point just by knowing some of these values. This will be formalized by means of the
Taylor polynomial. Finally, we present the very useful rule of l’Hospital, which is a powerful tool
to compute complicated limits.
Let us begin with a precise definition of the derivative of a function, which is just a precise
notion (using limits of functions) of the slope of the tangent line at a point.
From now on we mostly consider functions that are defined on an open interval I. This is
because the endpoints of an interval sometimes need more care. (Moreover, the derivative is
clearly not defined at isolated points, if the domain of definition contains some.) We comment
on the differences when needed.
The expression
[f(x0 + h) − f(x0)] / h
is called difference quotient. Geometrically this is the slope of the secant through the points
(x0, f(x0)) and (x0 + h, f(x0 + h)). Hence, if f′(x0) exists, it is the slope of the tangent to the
function at the point x0.
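The difference quotient can be evaluated directly; a minimal sketch, with f(x) = x² and x0 = 3 as illustrative choices:

```python
def diff_quotient(f, x0, h):
    """Slope of the secant through (x0, f(x0)) and (x0 + h, f(x0 + h))."""
    return (f(x0 + h) - f(x0)) / h

# For f(x) = x**2 the secant slopes equal 6 + h and approach f'(3) = 6
# as h -> 0.
slopes = [diff_quotient(lambda x: x * x, 3.0, h) for h in (1.0, 0.1, 0.001)]
```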
Obviously, constant functions, i.e., f(x) = c for c ∈ R, have the derivative f′ = 0. Let us discuss
some more examples which will serve as building blocks for more complicated functions.
Example 5.2. Let f(x) = x^n, with n ∈ N. Then f is differentiable on R with f′(x) = n x^{n−1}.
Example 5.3. Let f(x) = 1/x^n, n ∈ N. Then f is differentiable on R \ {0} with f′(x) = −n/x^{n+1}.
Together with the last example, we have (1)′ = 0 and (x^k)′ = k x^{k−1} for all k ∈ Z \ {0}. We will
see that this formula in fact holds for all k ∈ R \ {0}.
Proof. We want to compute f′(x) for x ≠ 0. Therefore we use the binomial theorem to obtain
(1/h) · [1/(x + h)^n − 1/x^n] = (1/h) · [x^n − (x + h)^n] / [x^n (x + h)^n]
= (1/h) · [x^n − ∑_{k=0}^{n} C(n,k) x^{n−k} h^k] / [x^n (x + h)^n]
= (1/h) · [−n x^{n−1} h − ∑_{k=2}^{n} C(n,k) x^{n−k} h^k] / [x^n (x + h)^n]
= [−n x^{n−1} − ∑_{k=2}^{n} C(n,k) x^{n−k} h^{k−1}] / [x^n (x + h)^n],
where C(n,k) denotes the binomial coefficient. Observe that the denominator converges to x^{2n} for h → 0. To take the limit in the denomi-
nator, we have to assume x ≠ 0. In the numerator, all terms in the latter sum go to zero for
h → 0. Therefore,
f′(x) = lim_{h→0} [−n x^{n−1} − ∑_{k=2}^{n} C(n,k) x^{n−k} h^{k−1}] / [x^n (x + h)^n] = −n x^{n−1} / x^{2n} = −n x^{−n−1}.
Example 5.4. Let f(x) = sin(x). Then f is differentiable on R and it holds f′(x) = cos(x),
i.e.,
sin′(x) = cos(x) for all x ∈ R.
Using sin(x)/x → 1 as x → 0 (see Example 4.48) and the continuity of cos, we obtain the result as
h → 0.
Example 5.5. In the same way, one can compute that cos is differentiable on R with cos′(x) = −sin(x),
where we use
cos(x) − cos(y) = −2 sin((x + y)/2) sin((x − y)/2).
Proof. First, recall that Bernoulli’s inequality (Theorem 1.48) states that (1 + x)^n ≥ 1 + nx for
all x ≥ −1 and n ∈ N. Using this, one can show that (e^h − 1)/h → 1 as h → 0, and therefore
e^x · (e^h − 1)/h → e^x.
Remark 5.8. Note that calculating a derivative is in fact the same as calculating limits of a
function. Therefore, to show that a function is not differentiable at a point x0 , it is sufficient to
find two sequences (hn) and (h̃n) such that hn → 0 and h̃n → 0, but
lim_{n→∞} [f(x0 + hn) − f(x0)]/hn ≠ lim_{n→∞} [f(x0 + h̃n) − f(x0)]/h̃n.
Proof. Set hn = 1/n and h̃n = −1/n. It follows that
This shows that a continuous function is not necessarily differentiable, however, the reverse
statement holds.
Proof. By assumption,
lim_{x→x0} [f(x) − f(x0)] / (x − x0) = f′(x0).
By the calculation rules for limits we get
lim_{x→x0} (f(x) − f(x0)) = lim_{x→x0} [f(x) − f(x0)] / (x − x0) · (x − x0) = f′(x0) · lim_{x→x0} (x − x0) = 0.
As in the previous sections we want to establish some rules of calculation for differentiable func-
tions that will be the main tools to derive the derivatives of complicated functions. In particular,
we will establish rules for calculating the derivative of products, quotients and compositions of
functions, as well as of the inverse.
If one is not already completely confident with these rules of calculation, it is highly recom-
mended to practice and calculate derivatives of increasingly complicated functions. Note that
with the rules that follow, it should be no problem to calculate the derivative of sin(x^42) · e^{cos(√x)}
step-by-step, although it might be time-consuming.
Proof. Both follow easily from the calculation rules for limits.
Example 5.12. We already know that (x^n)′ = n x^{n−1} for all n ∈ N. By linearity, all polynomials
are differentiable on R and
(cn x^n + c_{n−1} x^{n−1} + … + c1 x + c0)′ = cn n x^{n−1} + c_{n−1} (n − 1) x^{n−2} + … + c2 · 2x + c1.
Theorem 5.13 (Product rule). Let f, g be differentiable at x0; then (fg)′(x0) exists and
(fg)′(x0) = f′(x0) g(x0) + f(x0) g′(x0).
In short, (fg)′ = f′g + g′f.
Proof. We compute
(fg)′(x) = lim_{h→0} [f(x + h) g(x + h) − f(x) g(x)] / h
= lim_{h→0} f(x + h) · [g(x + h) − g(x)]/h + lim_{h→0} g(x) · [f(x + h) − f(x)]/h
= f(x) g′(x) + f′(x) g(x),
where we also use that f is continuous at x (Theorem 5.10).
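The product rule can be spot-checked with a central difference quotient; the functions, the point x = 0.7 and the step size are illustrative choices:

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central difference quotient; the step size h is an ad-hoc choice."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Product rule check for f = sin, g = exp at x = 0.7:
# (f g)'(x) = f'(x) g(x) + f(x) g'(x) = (cos x + sin x) e^x.
x = 0.7
lhs = numeric_derivative(lambda t: math.sin(t) * math.exp(t), x)
rhs = (math.cos(x) + math.sin(x)) * math.exp(x)
```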
In short, (g ∘ f)′ = (g′ ∘ f) · f′.
Theorem 5.15 (Quotient rule). Let f, g be differentiable at x0, and assume that g(x0) ≠ 0.
Then f/g is differentiable at x0 and
(f/g)′(x0) = [f′(x0) g(x0) − f(x0) g′(x0)] / g(x0)².
In short, (f/g)′ = (f′g − fg′)/g².
Proof. We want to use the product rule for the functions f and 1/g. But first we compute (1/g)′.
Recall that (1/x)′ = −1/x² for x ≠ 0, and that 1/g can be written as h ∘ g, where h(y) = 1/y. The chain
rule yields
(1/g)′(x0) = −g′(x0) / g(x0)².
Using the product rule, we obtain
(f/g)′(x0) = f′(x0)/g(x0) − f(x0) g′(x0)/g(x0)² = [f′(x0) g(x0) − f(x0) g′(x0)] / g(x0)².
These rules allow us to compute complicated derivatives by computing several easy derivatives.
Example 5.16. We want to compute the derivative of the tangent function tan : (−π/2, π/2) → R.
By the quotient rule, and sin²x + cos²x = 1, we obtain
tan′(x) = [sin′(x) cos(x) − sin(x) cos′(x)] / cos²(x) = 1/cos²(x) = 1 + tan²(x).
Example 5.17. The calculation rules for derivatives and the already proven fact that trigono-
metric functions and (algebraic) polynomials are differentiable imply that trigonometric poly-
nomials are differentiable on R.
There is also a formula for differentiating inverse functions. Note that the assumption that
a continuous function f is strictly monotone on an interval I is just equivalent to f mapping I
bijectively onto another interval J ⊂ R.
Proof. By Theorem 4.20, we know that f^{-1} is continuous (in y0). Take an arbitrary sequence
(yn) ⊂ f(I) with yn → y0 and yn ≠ y0, and define xn := f^{-1}(yn). We obtain
xn = f^{-1}(yn) → f^{-1}(y0) = x0,
as well as xn ≠ x0. Therefore,
lim_{n→∞} [f^{-1}(yn) − f^{-1}(y0)] / (yn − y0) = lim_{n→∞} (xn − x0) / (f(xn) − f(x0)) = lim_{n→∞} 1 / [(f(xn) − f(x0))/(xn − x0)] = 1/f′(x0).
Example 5.19. We compute the derivative of the natural logarithm ln(y), y > 0. This is the
inverse of the function f(x) = e^x = y with derivative f′(x) = e^x, see Example 5.7. Thus
d/dy ln(y) = 1/f′(x) = 1/e^x = 1/y.
Example 5.20. We use the derivative of ln(x) and the chain rule to prove that
d/dx x^a = a x^{a−1} for arbitrary a ∈ R and x > 0.
For this, write x^a = e^{a ln(x)}. With f(x) = a ln(x) and g(x) = e^x, we obtain f′(x) = a/x and
g′(x) = e^x, which yields
d/dx x^a = d/dx (g ∘ f)(x) = g′(f(x)) · f′(x) = e^{a ln(x)} · a/x = a x^{a−1}.
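This derivation can be mirrored in code; a sketch that compares the chain-rule formula with a difference quotient (the point, the exponent and the step size are arbitrary choices):

```python
import math

def power(x, a):
    """x**a written as exp(a ln x), as in the example above (x > 0)."""
    return math.exp(a * math.log(x))

def power_derivative(x, a):
    """Chain rule: g'(f(x)) f'(x) = exp(a ln x) * a / x = a * x**(a - 1)."""
    return math.exp(a * math.log(x)) * a / x

x, a, h = 2.0, 0.5, 1e-6
numeric = (power(x + h, a) - power(x - h, a)) / (2 * h)
```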
d/dy arctan(y) = 1/f′(x) = 1/(1 + tan²(x)) = 1/(1 + y²),
where we use Example 5.16 and, again, set y := f(x).
Example 5.22. One can also use the theorem to calculate the derivatives of the inverses of the
trigonometric functions, see Section 1.7, to obtain
arcsin′(y) = 1/sin′(x) = 1/cos(x) = 1/√(1 − sin²(x)) = 1/√(1 − y²) with y = sin(x),
and
arccos′(y) = 1/cos′(x) = −1/sin(x) = −1/√(1 − cos²(x)) = −1/√(1 − y²) with y = cos(x),
for all y ∈ (−1, 1). (Note that we used sin²x + cos²x = 1.)
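The arcsin formula can be spot-checked against a difference quotient; the point y = 0.3 and the step size are arbitrary:

```python
import math

def arcsin_derivative(y):
    """arcsin'(y) = 1 / sqrt(1 - y**2), valid for y in (-1, 1)."""
    return 1.0 / math.sqrt(1.0 - y * y)

y, h = 0.3, 1e-6
numeric = (math.asin(y + h) - math.asin(y - h)) / (2 * h)
```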
In this section we discuss the most important application for derivatives. That is, the calculation
and classification of extreme points, i.e., points where a function attains a (local) maximum or
minimum. Let us first restate the definition of global extrema, see Definition 4.54.
Again, note that the extreme values of a function, if they exist, are unique, but there might be
several extreme points. (Think about f(x) = x² on [−1, 1] or (−1, 1), see Example 4.55.)
As we want to discuss the connection of extreme points to the derivative of a function, which is
a local property, we also need a local notion of extreme points.
and a strict local minimum if f (x) > f (x0 ) for all x ∈ D ∩ (x0 − ε, x0 + ε) \ {x0 }.
Analogously, we say f has a local maximum at x0 ∈ D if there exists ε > 0 such that
f(x) ≤ f(x0) for all x ∈ D ∩ (x0 − ε, x0 + ε),
and a strict local maximum if f(x) < f(x0) for all x ∈ D ∩ (x0 − ε, x0 + ε) \ {x0}.
The point x0 is called local maximum/minimum point, or local extreme point.
It follows immediately that a global extreme point has to be a local extreme point.
However, as we have seen in the last example, there might already be several global extrema
and clearly even more local extrema, see Figure 34.
Figure 34: Graph of x · sin(x) · e^{−x²/100}
We will now discuss how to find (and classify) local extrema by using derivatives. One possible
way of finding a global extremum of a function f : [a, b] → R then is:
1. Find all local extreme points x ∈ (a, b) and calculate the corresponding values f(x).
2. Calculate the boundary values f(a) and f(b).
3. Compare all these values; the largest (smallest) of them is the global maximum (minimum).
If the function is defined on an open (or unbounded) interval (a, b) with −∞ ≤ a < b ≤ ∞, then
calculating f (a) and f (b) in Step 2 should be replaced by calculating the boundary values
limx→a f (x) and limx→b f (x). Clearly, if they lead to the maximal/minimal values, then there
exists no global maximum/minimum point, as it is not attained in the domain.
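This search procedure (stationary points plus boundary values) can be sketched numerically; a minimal sketch, assuming a differentiable f with known derivative df, and a crude grid scan for sign changes of df in place of an exact solver — all names and parameters are illustrative:

```python
def find_global_min(f, df, a, b, n=10_000):
    """Collect stationary points (zeros or sign changes of df) and the
    endpoints, then compare the function values."""
    xs = [a + (b - a) * i / n for i in range(n + 1)]
    candidates = [a, b]
    for u, v in zip(xs, xs[1:]):
        if df(u) == 0:
            candidates.append(u)
        elif df(u) * df(v) < 0:      # sign change: stationary point nearby
            candidates.append((u + v) / 2)
    return min(candidates, key=f)

# f(x) = 3x^2 - 6x + 5 on [-1, 3] has its global minimum at x = 1.
x_min = find_global_min(lambda x: 3 * x * x - 6 * x + 5,
                        lambda x: 6 * x - 6, -1.0, 3.0)
```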
We now turn to step one, i.e., finding all local extrema. For this, note that the figure above
shows that the slope of the tangent line attached to an extremum is zero, i.e., the tangent line
is horizontal. This means that the derivative at such a point is zero. We now show that this is
a necessary condition if the function is differentiable, which means that it is enough to ’check’
all points satisfying this condition if you want to find a local extremum.
Theorem 5.25 (Necessary condition for an extreme point). Let I = (a, b) and f : I → R.
If x0 ∈ I is a local extreme point of f and f is differentiable at x0, then
f′(x0) = 0.
Proof of Theorem 5.25. Let f have a local minimum at x0 and let ε > 0 be as required in the
definition. We now have for x ∈ I with x0 < x < x0 + ε, which implies x − x0 > 0, that
[f(x) − f(x0)] / (x − x0) ≥ 0.
So, lim_{x↘x0} [f(x) − f(x0)]/(x − x0) ≥ 0. On the other hand, for x0 − ε < x < x0 we obtain
[f(x) − f(x0)] / (x − x0) ≤ 0,
so that lim_{x↗x0} [f(x) − f(x0)]/(x − x0) ≤ 0. Since f is differentiable at x0, both one-sided
limits equal f′(x0), and hence f′(x0) = 0.
In many cases, it is already enough to determine all stationary points in order to find all (global)
extreme points of a function. But from the function and derivative value at a point alone, we
cannot decide whether a stationary point is indeed an extreme point and, if it is, whether we face
a local maximum or a local minimum. This is in particular a problem if we are not able to draw
the function under consideration. However, there is a sufficient condition for having an extremum
that involves higher-order derivatives.
Let f : (a, b) → R be differentiable and let f′ be also differentiable (at x). Then we say that f
is twice differentiable (at x) and write
f″(x) := d²/dx² f(x) = d/dx f′(x).
This procedure can be repeated as long as derivatives exist and so we can define the n-th
derivative of f inductively by
f^{(n)}(x) := dⁿ/dxⁿ f(x) = d/dx f^{(n−1)}(x).
In the special case of n = 2 or n = 3 we write f″(x) or f‴(x), respectively. If the n-th derivative
of f at a point exists, then we say that f is n-times differentiable at this point. If the n-th
derivative of f (at x0) exists and is a continuous function (at this point), then we say that f is
n-times continuously differentiable (at x0).
Example 5.27. We now consider f(x) := x^a for arbitrary a ∈ R, see Example 5.20.
If a ∈ N0, we are back in the situation of the last example. That is, the formula for the higher-
order derivatives leads to f^{(k)} ≡ 0 (i.e., f^{(k)}(x) = 0 for all x ∈ I) for all k > a.
This does not hold if a is negative or not a natural number; in this case,
d^k/dx^k x^a = a(a − 1) ⋯ (a − k + 1) x^{a−k} for arbitrary a ∈ R \ N0, x > 0 and k ∈ N.
Example 5.28. Note that all differentiable functions that we discussed so far were also contin-
uously differentiable, i.e., they possess a continuous derivative (on the whole domain). In fact,
it is not easy to find a differentiable function that is not continuously differentiable.
The classical example of such a function is
f(x) := x² sin(1/x) if x ≠ 0, and f(0) := 0.
First of all, by using the bound |f(x)| ≤ |x|², we obtain that f is continuous. By the definition
of the derivative at x0 = 0, we obtain that f′(0) = lim_{h→0} h sin(1/h) = 0, where we again use
the boundedness of sin. For x0 ≠ 0 we use the calculation rules for derivatives, and we obtain
f′(x) = 2x sin(1/x) − cos(1/x) for x ≠ 0, and f′(0) = 0.
Since cos(1/x) has no limit as x → 0, the derivative f′ is not continuous at 0.
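Both claims about this example can be illustrated numerically; a sketch with arbitrary sample points:

```python
import math

def f(x):
    return x * x * math.sin(1.0 / x) if x != 0 else 0.0

# Difference quotients at 0 equal h*sin(1/h), which tends to 0: f'(0) = 0.
quotients = [abs(f(h) / h) for h in (1e-2, 1e-4, 1e-6)]

# But f'(x) = 2x sin(1/x) - cos(1/x) keeps oscillating near 0:
def fprime(x):
    return 2 * x * math.sin(1.0 / x) - math.cos(1.0 / x)

# At x = 1/(k*pi) the cosine term is (-1)**k, so f' jumps between ~ -1 and ~ 1.
samples = [fprime(1.0 / (k * math.pi)) for k in (1000, 1001)]
```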
We now turn to a sufficient condition that allows us to decide whether a stationary point (i.e.,
f′(x0) = 0) is a local extreme point and, moreover, whether it is a local maximum or a local minimum.
Theorem 5.29 (Second derivative test). Let I = (a, b), x0 ∈ I and f : I → R be twice
continuously differentiable at x0, i.e., f″ exists and is continuous at x0.
Moreover, assume that x0 is a stationary point of f, i.e., f′(x0) = 0. Then,
f″(x0) > 0 ⟹ f has a strict local minimum at x0,
and
f″(x0) < 0 ⟹ f has a strict local maximum at x0.
Although we could prove this theorem here using a rather lengthy argument, we will present a
very short one at the end of the following section, see the proof of Theorem 5.29.
Note that a precise characterization of an extremum would also involve higher-order derivatives
and is rather complicated. As the above is usually enough, we do not state the characterization
here. However, note that we will learn later that the knowledge of all derivatives f (k) (x0 ),
k ∈ N0 , at a given point x0 , if they exist, allows us to reconstruct the function exactly in a
neighborhood around x0 . That is, we can obtain all ’local information’ of a function from its
higher-order derivative values, if the function can be differentiated infinitely often.
Example 5.30. The function f (x) = 3x2 − 6x + 5 (on R) satisfies f 0 (x) = 6x − 6 and f 00 (x) = 6,
see Figure 37. It therefore has a unique critical point at x0 = 1 which is a local minimum. As
it is the only extreme point, and limx→±∞ f (x) = ∞, we obtain that f is not bounded from
above, i.e., f does not have a maximum, and f has a global minimum at x0 = 1 with minimum
value f (x0 ) = 2.
Example 5.31. If f″(x0) = 0, then we cannot say whether x0 is an extreme point and, if it is,
which kind. To see this, consider, e.g., f(x) = x³, g(x) = x⁴ and h(x) = −x⁴ on R, and x0 = 0.
Then f′(0) = f″(0) = g′(0) = g″(0) = h′(0) = h″(0) = 0, but f has no extremum at 0, g has a
minimum at 0 and h has a maximum at 0.
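The case distinction of the second derivative test fits in a small helper; a sketch (the function name and return strings are made up for illustration):

```python
def classify_stationary_point(second_derivative):
    """Second derivative test at a stationary point (f'(x0) = 0)."""
    if second_derivative > 0:
        return "strict local minimum"
    if second_derivative < 0:
        return "strict local maximum"
    return "inconclusive"   # e.g. x**3, x**4, -x**4 at 0

# f(x) = 3x^2 - 6x + 5: f'(1) = 0 and f''(1) = 6 > 0.
result = classify_stationary_point(6.0)
```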
In this section we discuss the mean value theorem, a theorem that guarantees that the
derivative of a function attains a certain value determined by the function values at the end-
points of the interval. In the same way as the other ’existence theorems’, which were ultimately
all due to the Bolzano–Weierstrass theorem 3.42, this will be an important tool in the proofs of
the upcoming theorems.
In particular, it will imply l’Hospital’s rule for calculating certain limits, which may be seen as
a (fast) way of calculating expressions of the form ’0/0’ or ’∞/∞’.
Now we are ready to prove the first mean value theorem, namely the Theorem of Rolle. This
may be seen as a special case of the upcoming theorem, but it already contains the main idea.
Theorem 5.32 (Rolle). Let f : [a, b] → R be continuous on [a, b] and differentiable on (a, b). Fur-
thermore, assume that f(a) = f(b). Then there exists some ξ ∈ (a, b) such that f′(ξ) = 0.
Proof. For constant f the statement is obvious. So we may assume that f is not constant. Since
f is continuous on [a, b] we know from the extreme value theorem (Theorem 4.56) that it has a
172
maximum and a minimum, which are not equal. Moreover, since f (a) = f (b), one of the (global)
extreme points has to be in (a, b), i.e., not at the boundary points. By Theorem 5.25, this point,
say ξ ∈ (a, b), satisfies f′(ξ) = 0.
Geometrically the above theorem states that the graph of f has at least one point where the
tangent is horizontal. Again, if you have a drawing of a function, this result seems to be a trivial
statement. However, it will also be necessary in more complex situations, and we will see that
it is important to respect all the assumptions.
Example 5.33. It is important that the function is differentiable in the whole interval.
To see this, consider the absolute value function f(x) = |x| on [−1, 1], which is continuous and
satisfies f(−1) = f(1) = 1. However, it is not differentiable at x0 = 0, and at all other points it
satisfies f′(x) = 1 or f′(x) = −1. So there is no ξ ∈ (−1, 1) with f′(ξ) = 0.
We now use the Theorem of Rolle to prove the mean value theorem.
Theorem 5.34 (Mean value theorem). Let f : [a, b] → R be continuous on [a, b] and differ-
entiable on (a, b). Then there exists some ξ ∈ (a, b) such that
f′(ξ) = [f(b) − f(a)] / (b − a).
Proof. Define
h(x) := f(x) − [f(b) − f(a)]/(b − a) · (x − a).
We see that h is continuous and differentiable as f is, and satisfies h(a) = h(b) = f(a). Therefore,
Rolle’s theorem implies that there is some ξ ∈ (a, b) with h′(ξ) = 0. Since
h′(x) = f′(x) − [f(b) − f(a)] / (b − a),
we obtain the result for x = ξ.
Note that the mean value theorem gives information about the derivative of a function, even if
we only know the function values at the boundary points. For example, suppose f : [0, 1] → R
with f(0) = 0 and f(1) = 5. The mean value theorem states that, if f is differentiable on (0, 1),
then its derivative must attain the value 5, i.e., there exists ξ ∈ (0, 1) with f′(ξ) = 5. Note that
f(x) = 5x is the only linear function with these function values, and it satisfies f′ ≡ 5.
The theorem states that every function with the same boundary values also has a point
with this slope. (Recall that the intermediate value theorem, Theorem 4.49, implies, e.g., that
there is a ξ ∈ (0, 1) with f(ξ) = 2.)
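Such a point ξ can also be located numerically; a sketch using bisection, assuming df − slope changes sign on (a, b), which holds for the illustrative choice f(x) = x² on [0, 2]:

```python
def mvt_point(f, df, a, b, tol=1e-10):
    """Bisect for a point xi with df(xi) = (f(b) - f(a)) / (b - a)."""
    slope = (f(b) - f(a)) / (b - a)
    g = lambda x: df(x) - slope
    lo, hi = a, b
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# f(x) = x^2 on [0, 2]: the secant slope is 2, so xi = 1.
xi = mvt_point(lambda x: x * x, lambda x: 2 * x, 0.0, 2.0)
```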
The following theorem is a slight generalization of the mean value theorem which is the first
step in proving l’Hospital’s rule, and will be also of interest later.
Theorem 5.35 (General mean value theorem). Let f, g : [a, b] → R be continuous on [a, b]
and differentiable on (a, b). Then there exists some ξ ∈ (a, b) such that
[f(b) − f(a)] g′(ξ) = [g(b) − g(a)] f′(ξ).
In particular, if g′ ≠ 0 on (a, b), then there exists some ξ ∈ (a, b) such that
f′(ξ)/g′(ξ) = [f(b) − f(a)] / [g(b) − g(a)].
Proof. We first consider the case that g(a) = g(b). By Rolle’s theorem we know that there exists
some ξ ∈ (a, b) with the property that g′(ξ) = 0. So we only have to show the first identity, as
g′ ≠ 0 on (a, b) is clearly not true. For this ξ we have
[f(b) − f(a)] g′(ξ) = 0 = [g(b) − g(a)] f′(ξ),
which was to be shown. For the other case, i.e. g(a) ≠ g(b), we define the function
h(x) = f(x) − [f(b) − f(a)]/[g(b) − g(a)] · (g(x) − g(a)).
Since f and g are continuous on [a, b] and differentiable on (a, b), we have that h is continuous
on [a, b] and differentiable on (a, b). Moreover, it is easy to compute that
h(a) = h(b).
An application of Rolle’s theorem yields that there exists some ξ ∈ (a, b) such that
0 = h′(ξ) = f′(ξ) − [f(b) − f(a)]/[g(b) − g(a)] · g′(ξ).
Regrouping this equation leads to
[f(b) − f(a)] g′(ξ) = [g(b) − g(a)] f′(ξ),
which proves the first point of the theorem. For the second point we assume additionally that
g′ ≠ 0 on (a, b), i.e., g′(x) ≠ 0 for all x ∈ (a, b). Thus, we can divide the last equation
by g′(ξ) and by g(b) − g(a), and obtain
f′(ξ)/g′(ξ) = [f(b) − f(a)] / [g(b) − g(a)].
Consider, for example, the limits
lim_{x→4} (x² − 16)/(x − 4) or lim_{x→∞} (4x² − 5x)/(1 − 3x²).
If we plug in x = 4 in the first limit we get ’0/0’; a similar case appears if we ’plug in
∞’ in the second limit, as we would get ’∞/(−∞)’ (recall that, if x tends to ∞, then a polynomial
behaves like its largest power).
We now introduce l’Hospital’s rule, which is a method to compute such limits, if they exist.
Theorem 5.36 (l’Hospital). Let I = (a, b) and x0 ∈ [a, b]. Let f, g : I \ {x0} → R be differen-
tiable on I \ {x0} with g′ ≠ 0. Furthermore assume that either lim_{x→x0} f(x) = lim_{x→x0} g(x) =
0 or lim_{x→x0} f(x) = ±∞, lim_{x→x0} g(x) = ±∞ holds. Then we have
lim_{x→x0} f(x)/g(x) = lim_{x→x0} f′(x)/g′(x),
provided the limit on the right-hand side exists.
Proof. We only prove the case lim_{x→x0} f(x) = lim_{x→x0} g(x) = 0; otherwise replace f by 1/f
and g by 1/g. First observe that we can continuously extend f, g to x0 with f(x0) = g(x0) = 0.
By the general mean value theorem, for any x ∈ I with x ≠ x0 there exists ξ ∈ (x0, x)
satisfying
f′(ξ)/g′(ξ) = [f(x) − f(x0)] / [g(x) − g(x0)] = f(x)/g(x).
If x → x0, it follows that ξ → x0, and since lim_{ξ→x0} f′(ξ)/g′(ξ) exists (this was our assumption) we
obtain
lim_{x→x0} f(x)/g(x) = lim_{x→x0} f′(x)/g′(x).
Remark 5.37. This rule of calculation was published in 1696 by Guillaume François Antoine,
Marquis de L’Hospital (1661–1704), in the very first textbook on differential calculus. However,
it was actually proven in 1694 by the famous mathematician Johann Bernoulli (1667–1748).
As a first application of l’Hospital’s rule we prove that the second derivative test is valid.
Proof of Theorem 5.29. For this, let x0 ∈ I be such that f′(x0) = 0 and f″(x0) > 0. Since f′(x0)
exists, we have from Theorem 5.10 that f is continuous at x0. This implies that lim_{x→x0}(f(x) −
f(x0)) = 0. We obtain, by l’Hospital’s rule, f′(x0) = 0 and the definition of the second derivative,
that
lim_{x→x0} [f(x) − f(x0)] / (x − x0)² = lim_{x→x0} f′(x) / (2(x − x0)) = (1/2) lim_{x→x0} [f′(x) − f′(x0)] / (x − x0) = f″(x0)/2 > 0.
This shows that the limit on the left-hand side exists and is positive. Therefore, the function
inside the limit must be positive in a neighborhood of x0. In detail: there are some ε, δ > 0
Let us see some more examples where we can use this rule.
Example 5.38.
lim_{x→0} sin(x)/x = lim_{x→0} cos(x)/1 = 1.
Example 5.39.
lim_{x→1} (x³ − 1)/(x − 1) = lim_{x→1} 3x²/1 = 3.
This rule can also be used several times to calculate a limit as the following examples will show.
Example 5.40.
lim_{x→0} (1 − cos x)/x² = lim_{x→0} sin(x)/(2x) = lim_{x→0} cos(x)/2 = 1/2.
Example 5.41.
lim_{x↘0} x^x = exp(lim_{x↘0} x ln x) = exp(lim_{x↘0} ln(x)/x^{−1}) = exp(lim_{x↘0} x^{−1}/(−x^{−2})) = e^0 = 1.
Example 5.42.
lim_{x→0} (1 − cos(x/2)) / (1 − cos x) = lim_{x→0} [(1/2) sin(x/2)] / sin(x) = lim_{x→0} [(1/4) cos(x/2)] / cos(x) = 1/4.
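The last two limits can be spot-checked numerically; the evaluation points are arbitrary small numbers:

```python
import math

# x**x -> 1 as x -> 0+ (Example 5.41); the values increase towards 1.
xx = [x ** x for x in (1e-2, 1e-4, 1e-6)]

# (1 - cos(x/2)) / (1 - cos x) -> 1/4 as x -> 0 (Example 5.42).
def ratio(x):
    return (1 - math.cos(x / 2)) / (1 - math.cos(x))

r = ratio(1e-3)
```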
A common application of the mean value theorem is the characterization of the monotonicity of
a function, which is based on its derivative.
or (strictly) decreasing if f′(x) ≤ 0 (f′(x) < 0, respectively) for all x ∈ I.
Proof. First assume f′ ≥ 0 on I and let x1 < x2 in I. By the mean value theorem there exists
ξ ∈ (x1, x2) with
[f(x2) − f(x1)] / (x2 − x1) = f′(ξ) ≥ 0.
This implies f(x2) ≥ f(x1) for arbitrary points x1, x2 ∈ I with x1 < x2, so f is non-decreasing.
Now assume that f is non-decreasing. That is, f(x2) ≥ f(x1) for all x2 > x1. This also implies
that f(x2) ≤ f(x1) for all x2 < x1. In any case, we get
[f(x2) − f(x1)] / (x2 − x1) ≥ 0,
and letting x2 → x1 yields f′(x1) ≥ 0.
Remark 5.45. Note that ’f is increasing’ does not imply f′ > 0 in general. As an
example consider f(x) = x³, which is an increasing function, but f′(0) = 0.
Example 5.46. The function f(x) = e^x is increasing on R, since f′(x) = e^x > 0 for all x ∈ R. This
implies that f has no stationary points.
Example 5.47. The function f(x) = −ln(x) is decreasing on R⁺, since f′(x) = −1/x < 0 for all
x > 0. Again, this implies that there are no stationary points.
The second derivative helps us to determine the shape of a function, once plotted in a coordinate
system. If every line segment between two points of the graph lies above the graph, we call the
function convex; if it lies below the graph, we call the function concave.
Remark 5.49. From the definition we immediately obtain that f is concave if and only if −f is
convex.
Example 5.50. Have a look at the graphs of f(x) = x² and g(x) = ln(x), see Figure 39. Clearly,
f is strictly convex and g is strictly concave on R⁺. (Calculate the second derivatives of
both functions. What do you see?) The remark above states that −f is strictly concave.

Figure 39: Graphs of f(x) = x² and g(x) = ln(x)
Remark 5.51 (*). Convexity is a very useful property for certain optimization problems.
For a twice differentiable function it suffices to check the sign of the second derivative.
Proof. For now we only prove the case of convexity; concavity can be treated analogously.
Assume f″(x) ≥ 0 for all x ∈ I = (a, b). Moreover, let x0, x1 ∈ I such that a < x0 < x1 < b.
Then, for any λ ∈ (0, 1), we let x := (1 − λ)x0 + λx1 ∈ (x0, x1). By the mean value theorem we
can find ξ0 ∈ (x0, x) and ξ1 ∈ (x, x1) such that
[f(x) − f(x0)] / (x − x0) = f′(ξ0) and [f(x1) − f(x)] / (x1 − x) = f′(ξ1).
Since f″ ≥ 0, we get that f′ has to be non-decreasing. Hence f′(ξ0) ≤ f′(ξ1) and we obtain
[f(x) − f(x0)] / [λ(x1 − x0)] = [f(x) − f(x0)] / (x − x0) ≤ [f(x1) − f(x)] / (x1 − x) = [f(x1) − f(x)] / [(1 − λ)(x1 − x0)].
The identities λ(x1 − x0) = x − x0 and (1 − λ)(x1 − x0) = x1 − x follow from the definition of x.
Using the above inequality and regrouping the terms we obtain
f((1 − λ)x0 + λx1) ≤ (1 − λ)f(x0) + λf(x1),
which is the definition of convexity.
On the other hand, assume that f is convex. Using the inequality in the definition of convex
functions and the above calculations we see that
[f(x) − f(x0)] / (x − x0) ≤ [f(x1) − f(x0)] / (x1 − x0) ≤ [f(x1) − f(x)] / (x1 − x)
for arbitrary x0 < x < x1. Letting x → x0 and x → x1 we obtain
f′(x0) ≤ [f(x1) − f(x0)] / (x1 − x0) ≤ f′(x1).
Thus f′ is non-decreasing and therefore f″ ≥ 0.
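The defining inequality of convexity can be spot-checked on a grid; a sketch with illustrative interval, grid size and rounding tolerance (only midpoints λ = 1/2 are tested):

```python
def is_midpoint_convex_on_grid(f, a, b, n=200):
    """Check f((u+v)/2) <= (f(u) + f(v)) / 2 on an equispaced grid in [a, b]."""
    xs = [a + (b - a) * i / n for i in range(n + 1)]
    return all(f((u + v) / 2) <= (f(u) + f(v)) / 2 + 1e-12
               for u in xs for v in xs)

convex_ok = is_midpoint_convex_on_grid(lambda x: x * x, -2.0, 2.0)     # convex
concave_fails = is_midpoint_convex_on_grid(lambda x: -x * x, -2.0, 2.0)  # concave
```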
In the last subsections, we have seen that some (local or global) properties of a function may be
characterized by its derivatives. Now we will show that, under certain assumptions, a function
can be characterized exactly (in a neighborhood) just by knowing all higher-order derivatives of
the function at one point. This shows that the very local behavior of a function can be used to
determine it exactly everywhere.
Let us start with a result that shows that we can approximate a function in a neighborhood
of a point by a polynomial. This is Taylor’s theorem.
Recall that the n-th derivative of f : (a, b) → R (at x) is defined inductively by
f^{(n)}(x) := dⁿ/dxⁿ f(x) = d/dx f^{(n−1)}(x),
if it exists.
We call
Tn(x) := ∑_{k=0}^{n} f^{(k)}(x0)/k! · (x − x0)^k
the Taylor polynomial of f of order n (at x0).
The term
Rn(x) := f(x) − Tn(x) = f^{(n+1)}(ξ)/(n + 1)! · (x − x0)^{n+1}
is called the remainder of the Taylor polynomial (in Lagrange form).
Note that ξ above depends on x. (This may already be clear since it lies between x and x0,
i.e., ξ ∈ (x, x0) or ξ ∈ (x0, x).) That’s why some authors write ξ = ξ_x to make this explicit.
In particular, to obtain the equality above, one needs to find a specific ξ for every x, which is
very impractical. However, this formula is helpful when we want to prove that the error of the
Taylor polynomial, which is the difference of f(x) and Tn(x), is ’small’ for all x ∈ (a, b): we
just have to bound f^{(n+1)}(ξ) for all possible ξ ∈ (a, b).
Proof. Let x ∈ (a, b) be arbitrary and, w.l.o.g., assume x > x0. Define, for t ∈ [x0, x], the
function
g(t) := f(x) − ∑_{k=0}^{n} f^{(k)}(t)/k! · (x − t)^k − m/(n + 1)! · (x − t)^{n+1},
where we choose m such that g(x0) = 0.
Clearly, g(x) = 0. So, together with g(x0) = 0, Rolle’s theorem yields the existence of some
ξ ∈ (x0, x) such that g′(ξ) = 0. We compute the first derivative of g (in t) and obtain, using the
Remark 5.54. It is easy to check that Tn^{(k)}(x0) = f^{(k)}(x0) for all k = 1, . . . , n.
The equation in Example 5.55 is sometimes useful to rewrite polynomials, in particular if one
is interested in properties around a specific point.
Example 5.56. Let p(x) = x³ − 2x² + 3 and consider x0 = −1. Since p has the derivatives
p′(x) = 3x² − 4x, p″(x) = 6x − 4 and p‴(x) = 6, we obtain p(−1) = 0, p′(−1) = 7, p″(−1) = −10
and p‴(−1) = 6. Example 5.55 implies
p(x) = 7(x + 1) − 5(x + 1)² + (x + 1)³.
In the same way we may also expand more general functions around a point if we know
their derivatives, but note that there is an additional error term, which is in general not easy to
determine exactly. Under additional assumptions, however, one may give practical bounds.
Corollary 5.57. In the setting of Theorem 5.53, assume additionally that f : (a, b) → R
satisfies |f^{(n+1)}(x)| ≤ M for some M < ∞ and all x ∈ (a, b). Then
|f(x) − Tn(x)| ≤ M (b − a)^{n+1} / (n + 1)! for all x ∈ (a, b).
Proof. The bound follows directly from Theorem 5.53, by noting that |x − x0| ≤ b − a for all
x, x0 ∈ (a, b).
Although this gives a useful uniform bound on the error, it is clearly useless, if we consider
functions that are defined on R. Let us discuss an example.
Example 5.58. Consider the function f(x) = e^x on I = (−1, 1). Assume you want to approxi-
mate f by a polynomial of preferably small degree, and you allow an error of at most ε = 1/50.
We know, see Example 5.7, that f^{(k)}(x) = e^x for all k ∈ N, and therefore also f^{(k)}(0) = 1.
Setting x0 = 0, we obtain that
Tn(x) = ∑_{k=0}^{n} x^k / k!.
Moreover, note that |f^{(k)}(x)| ≤ e for all x ∈ (−1, 1) and k ∈ N. It follows from Corollary 5.57
that
|e^x − Tn(x)| ≤ e · 2^{n+1} / (n + 1)!.
One may check the first values of n to see that n = 7 is enough. That is, we obtain
|e^x − (1 + x + x²/2 + x³/6 + x⁴/24 + x⁵/120 + x⁶/720 + x⁷/5040)| ≤ 0.0173 ≤ 1/50
for all x ∈ (−1, 1). (Try to visualize this with some computer algebra software!)
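The claimed error bound for n = 7 can be verified numerically; a sketch (the grid resolution is an arbitrary choice):

```python
import math

def taylor_exp(x, n):
    """Taylor polynomial of exp around 0: sum of x**k / k! for k = 0..n."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

# Maximal error of T_7 on a grid in [-1, 1]; the theoretical bound
# e * 2**8 / 8! is about 0.0173, the actual error is far smaller.
grid = [-1 + 2 * i / 1000 for i in range(1001)]
max_err = max(abs(math.exp(x) - taylor_exp(x, 7)) for x in grid)
```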
One should notice that the upper bound on the error in the last example goes to zero (very fast)
with n. This is not only the case for the function e^x, but for all infinitely often differentiable
functions that satisfy the assumption of Corollary 5.57 for all n ∈ N with the same M, i.e., if
sup{ |f^{(k)}(x)| : k ∈ N, x ∈ (a, b) } ≤ M.
In this case, we obtain that lim_{n→∞} |f(x) − Tn(x)| = 0, since we always have (b − a)^n/n! → 0 for
a, b ∈ R, and therefore
f(x) = lim_{n→∞} Tn(x) = ∑_{k=0}^{∞} f^{(k)}(x0)/k! · (x − x0)^k
for all x0 ∈ (a, b), and we call the right hand side the Taylor series of f (around x0 ).
Note that Taylor series are a special case of the more general power series, see Section 3.8. In
particular, Taylor’s theorem leads to the earlier announced and constructive way to write rather
general functions as power series
f(x) := ∑_{k=0}^{∞} a_k (x − x0)^k
for some (real- or complex-valued) sequence (a_k)_{k∈N0} and x0 ∈ R. See Section 3.8 for some
general results on the convergence of the right hand side. In particular, we obtain convergence
if |x − x0| < R, where R is the radius of convergence
R = lim inf_{k→∞} |a_k|^{−1/k},
see Theorem 3.96. For Taylor series, we consider a_k = f^{(k)}(x0)/k!.
Remark 5.60. Note that convergence of the series ∑ a_k (x − x0)^k does not mean that this sum
equals f(x). For this, we need a bound on the values of the derivatives at all ξ between x0 and
x, see Theorem 5.53. One may get in trouble otherwise: consider, for example, the function
f(x) = e^{|x|} and its Taylor series at x0 = 1. At this point f(x) equals e^x, and so the Taylor
series of f at x0 = 1 coincides with that of e^x. This series converges for all x ∈ R, but is not
equal to f for x < 0. (Note that f is not differentiable at 0.)
We now bring our conditions into a form that is more useful for deciding whether a function can be written
as a Taylor series everywhere. This also allows for an analysis of functions defined on $\mathbb{R}$. Note,
however, that in many cases the above equality does not hold in the whole domain of definition,
but only in a neighborhood of the point $x_0$.
Remark 5.62. Note that the latter condition in the theorem is clearly fulfilled if |f (n) (x)| ≤ c·C n
for every fixed x ∈ I, where c, C ≥ 1 may depend on x, but not on n.
Proof. For the first statement, note that $x \in U_r(x_0)$ if and only if $|x - x_0| < r$. We can therefore
bound the remainder from Theorem 5.53 by $|R_{n-1}(x)| \le \frac{|f^{(n)}(\xi)|}{n!}\, r^n$. Taking into account that $\xi$
is between $x_0$ and $x$, and therefore also $\xi \in U_r(x_0)$, we see that the assumption of the theorem
implies $|R_{n-1}(x)| \to 0$.
For the second part, let x ∈ I be fixed. For x = x0 the statement is obvious. So, w.l.o.g.,
we assume x > x0 . Then, since f is infinitely often differentiable on I, and [x0 , x] is strictly
contained in I, we get that f (n+1) (ξ) (i.e., the derivative of f (n) at ξ) exists for all ξ ∈ [x0 , x].
This implies that $f^{(n)}$, and therefore $|f^{(n)}|$, is continuous on $[x_0, x]$, see Theorem 5.10. From
Theorem 4.56 we know that a continuous function on a closed interval attains its maximum, say
at $\xi^* \in [x_0, x]$, i.e., $\sup_{\xi\in[x_0,x]} |f^{(n)}(\xi)| = |f^{(n)}(\xi^*)|$.
Using our assumption $\lim_{n\to\infty} \frac{\sqrt[n]{|f^{(n)}(\xi)|}}{n} = 0$ for all $\xi \in I$, we obtain, in particular, that
\[ \frac{\sqrt[n]{|f^{(n)}(\xi^*)|}}{n} < \frac{1}{5|x - x_0|} \iff |f^{(n)}(\xi^*)| < \frac{n^n}{5^n |x - x_0|^n} \]
for all large enough $n$. Moreover, we have that $n^n \le 4^n n!$ for all $n \in \mathbb{N}$. This can be proven
inductively, by using $(1 + \frac{1}{n})^n \le 4$, see Example 3.36. With $R_n$ from Theorem 5.53, we obtain
\[ |R_{n-1}(x)| \le \frac{|f^{(n)}(\xi^*)|}{n!}\, |x - x_0|^n \le \frac{n^n}{n!\, 5^n} \le \Bigl(\frac{4}{5}\Bigr)^n \longrightarrow 0, \]
where the second inequality only holds for large enough $n$. Since $x \in I$ was arbitrary, we get
\[ f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k \quad \text{for all } x \in I. \]
Example 5.63. Let $f(x) = e^x$ on $\mathbb{R}$. As already discussed above, we have $|f^{(n)}(x)| = e^x$, which
is independent of $n$. We can therefore apply the second part of Theorem 5.61 with $I = \mathbb{R}$ and
$x_0 = 0$, and get
\[ e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} \]
for all $x \in \mathbb{R}$.
We now consider the trigonometric functions. It may come as a surprise that one can
approximate these periodic (wave) functions up to an arbitrary precision by polynomials on the
whole real line. However, a visualization of the first Taylor polynomials is very instructive for
understanding what's going on.
In the same way, one can calculate the following Taylor series, and prove their convergence:
\[ \sin x = \sum_{k=0}^{\infty} (-1)^k \frac{x^{2k+1}}{(2k+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - + \ldots \]
\[ \cosh x = \sum_{k=0}^{\infty} \frac{x^{2k}}{(2k)!} = 1 + \frac{x^2}{2!} + \frac{x^4}{4!} + \ldots \]
\[ \sinh x = \sum_{k=0}^{\infty} \frac{x^{2k+1}}{(2k+1)!} = x + \frac{x^3}{3!} + \frac{x^5}{5!} + \ldots \]
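These expansions are easy to test numerically. The following sketch (illustrative plain Python, not part of the original notes) compares a truncated sine series with the library function:

```python
import math

def sin_series(x, terms):
    # partial sum of the Taylor series of sin around x0 = 0
    return sum((-1)**k * x**(2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))

for x in [0.5, 2.0, 10.0]:
    print(x, sin_series(x, 40), math.sin(x))  # the two columns agree closely
```

Even for $x = 10$, far from the expansion point, enough terms reproduce $\sin(10)$ to high accuracy, which illustrates the convergence on the whole real line.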
Remark 5.65. In the proof of Theorem 5.61 we have used the bound $n^n \le 4^n n!$ for all $n \in \mathbb{N}$.
Later we will prove Stirling's formula, which states that, in fact,
\[ \sqrt{2\pi n}\, \Bigl(\frac{n}{e}\Bigr)^n \le n! \le \Bigl(1 + \frac{1}{11n}\Bigr) \sqrt{2\pi n}\, \Bigl(\frac{n}{e}\Bigr)^n \]
for all $n \in \mathbb{N}$. In particular, it implies $n^n \le e^n \cdot n!$ for all $n$, and $\frac{\sqrt[n]{n!}}{n} \to \frac{1}{e}$.
Example 5.66 (Taylor series of ln(x)). Let us now consider the natural logarithm $f(x) = \ln(x)$
on $\mathbb{R}_+ = (0, \infty)$, which is an example showing that a Taylor series may converge only for $x$
in a neighborhood of $x_0$, i.e. we cannot apply the second part of Theorem 5.61.
For all $x \in \mathbb{R}_+$ we obtain $f'(x) = \frac{1}{x} = x^{-1}$, and therefore
\[ f^{(k)}(x) = \frac{(-1)^{k-1} (k-1)!}{x^k} \]
for all $k \in \mathbb{N}$ and $x \in \mathbb{R}_+$, see Example 5.3. With this, the Taylor polynomial of $f$ around
$x_0 \in \mathbb{R}_+$ is given by
\[ T_n(x) = \ln(x_0) + \sum_{k=1}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k = \ln(x_0) + \sum_{k=1}^{n} \frac{(-1)^{k-1}}{k} \cdot \frac{(x - x_0)^k}{x_0^k} = \ln(x_0) + \sum_{k=1}^{n} \frac{(-1)^{k-1}}{k} \Bigl(\frac{x}{x_0} - 1\Bigr)^k. \]
Clearly, these sums converge absolutely as $n \to \infty$ if $|\frac{x}{x_0} - 1| < 1$ (use e.g. the root test), which
holds if and only if $x \in (0, 2x_0)$. Moreover, for $x = 2x_0$, we have $T_n(x) = \ln(x_0) + \sum_{k=1}^{n} \frac{(-1)^{k-1}}{k}$,
where the sum is a partial sum of the alternating harmonic series and, therefore, convergent, see Example 3.93. For all other $x$,
i.e. $x > 2x_0$, the Taylor series is clearly not convergent, since the terms of the sum are not a null
sequence. Hence,
\[ \ln(x) = \ln(x_0) - \sum_{k=1}^{\infty} \frac{1}{k} \Bigl(1 - \frac{x}{x_0}\Bigr)^k \]
holds if and only if $0 < x \le 2x_0$. Choosing $x_0 = 1$ we obtain the typical series expansion of
$\ln(x)$ at $x_0 = 1$:
\[ \ln(x) = -\sum_{k=1}^{\infty} \frac{(1-x)^k}{k} \]
for $x \in (0, 2]$. In particular, $\ln(2) = \sum_{k=1}^{\infty} \frac{(-1)^{k-1}}{k}$.
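As a quick numerical illustration (a sketch in plain Python, not part of the original notes), one can watch how fast the expansion converges inside its interval of convergence and how slowly at the boundary point $x = 2x_0$:

```python
import math

def ln_series(x, x0, terms):
    # truncated Taylor series of ln around x0 (valid for 0 < x <= 2*x0)
    return math.log(x0) - sum((1 - x / x0)**k / k for k in range(1, terms + 1))

# inside (0, 2*x0) the error decays geometrically ...
print(ln_series(1.5, 1.0, 50), math.log(1.5))
# ... while at x = 2*x0 (the alternating harmonic series) it decays only like 1/n
print(ln_series(2.0, 1.0, 10**5), math.log(2.0))
```

Fifty terms give essentially full precision at $x = 1.5$, while even $10^5$ terms of the alternating harmonic series leave a visible error at $x = 2$.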
Remark 5.67. Note that one may write a function $f(x)$ as its Taylor series at different points $x_0$.
This can be used to evaluate, or give a short notation for, certain infinite sums. For example, if
we consider the Taylor series of $\ln(x)$ and $\cos(x)$ we see, e.g., that
\[ \sum_{k=1}^{\infty} \frac{1}{k} \Bigl(1 - \frac{1}{e}\Bigr)^k = \ln(e) = 1 \]
(use $x_0 = e$ and $x = 1$), or
\[ \sum_{k=0}^{\infty} \frac{(-\pi)^k}{(2k)!} = \cos\sqrt{\pi}. \]
Remark 5.68 (Complex arguments). Note that the series in the above expansions of $e^x$, $\sin$,
$\cos$ etc. are absolutely convergent for all $x \in \mathbb{R}$. This means that the series also make
sense if we allow $x$ to be a complex number, i.e., $x \in \mathbb{C}$. This is the natural way of extending
real-valued functions to the complex case.
Example 5.69. Use Taylor series to prove that
\[ e^{ix} = \cos x + i \sin x \]
for all $x \in \mathbb{R}$, where $i := \sqrt{-1}$.
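One way to make this exercise plausible (a numerical sketch, assuming Python's built-in complex arithmetic; not a proof) is to evaluate the exponential series at a purely imaginary argument:

```python
import cmath
import math

def exp_series(z, terms):
    # partial sum of sum_k z^k / k!, works for complex arguments too
    total, term = 0 + 0j, 1 + 0j
    for k in range(terms):
        total += term
        term *= z / (k + 1)
    return total

x = 1.3
lhs = exp_series(1j * x, 40)              # e^{ix} via the series
rhs = complex(math.cos(x), math.sin(x))   # cos x + i sin x
print(lhs, rhs, cmath.exp(1j * x))        # all three agree up to rounding
```

The real and imaginary parts of the partial sums are exactly the (truncated) cosine and sine series, which is the core of the actual proof.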
In the last section about differentiability we want to study Newton's method, which is a commonly
used method for calculating zeros of a function. Before we prove the convergence
of this method we want to discuss the main idea in detail. We are interested in solving
\[ f(x) = 0. \]
Now we consider $f \in C^2(I)$, i.e., twice continuously differentiable functions, such that there
exists some $\alpha \in I$ with $f(\alpha) = 0$. Moreover, we assume that $f'(x) \ne 0$ for all $x \in I$. This
ensures that the Newton iterate $x_{k+1} := x_k - \frac{f(x_k)}{f'(x_k)}$ is always well-defined.
Now we use Taylor's theorem and get for arbitrary $x \in I$ that
\[ 0 = f(\alpha) = f(x) + f'(x)(\alpha - x) + \frac{f''(\xi)}{2} (\alpha - x)^2, \]
for some $\xi \in (x, \alpha)$ (or $(\alpha, x)$). We can divide this equation by $f'(x) \ne 0$ and obtain
\[ 0 = \frac{f(x)}{f'(x)} + (\alpha - x) + \frac{1}{2}\, \frac{f''(\xi)}{f'(x)}\, (\alpha - x)^2. \]
Setting $x = x_n$ and rearranging yields
\[ x_{n+1} - \alpha = x_n - \frac{f(x_n)}{f'(x_n)} - \alpha = \frac{1}{2}\, \frac{f''(\xi)}{f'(x_n)}\, (\alpha - x_n)^2. \]
Thus
\[ |x_{n+1} - \alpha| = \frac{1}{2} \Bigl| \frac{f''(\xi)}{f'(x_n)} \Bigr| \cdot |x_n - \alpha|^2. \]
So we see that if $\frac{f''(\xi)}{f'(x_n)}$ does not behave too badly, and we choose $x_0$ not too far from $\alpha$, the
error should decrease quadratically. We will now explain how to choose $x_0$ in order to guarantee
convergence.
For this, define
\[ m_1 := \sup_{x\in I} |f''(x)| \quad \text{and} \quad m_2 := \inf_{x\in I} |f'(x)|. \]
Theorem 5.70 (Local convergence of Newton's method). Let $I = (a,b)$ and $f \in C^2(I)$ with
$f(\alpha) = 0$ for some $\alpha \in I$. Moreover, assume there is some $\delta > 0$ such that $U_\delta(\alpha) \subset I$ and $\frac{m_1}{2 m_2}\,\delta \le \frac{1}{2}$.
Then, for all $x_0 \in U_\delta(\alpha)$, Newton's method converges, and we have $|x_n - \alpha| \le 2^{-n} |x_0 - \alpha|$.
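The quadratic error decay can be observed directly. The following sketch (illustrative Python; the test function $f(x) = x^2 - 2$ and the starting point are my choices, not from the notes) runs the iteration $x_{n+1} = x_n - f(x_n)/f'(x_n)$ for the zero $\alpha = \sqrt{2}$:

```python
import math

def newton(f, fprime, x0, steps):
    # Newton iteration x_{n+1} = x_n - f(x_n) / f'(x_n)
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

f = lambda t: t * t - 2      # zero at alpha = sqrt(2)
fprime = lambda t: 2 * t     # nonzero near alpha

for steps in range(1, 5):
    x = newton(f, fprime, 1.5, steps)
    print(steps, abs(x - math.sqrt(2)))  # errors shrink roughly quadratically
```

The printed errors roughly square at each step, so the number of correct digits doubles, which is much faster than the guaranteed halving $|x_n - \alpha| \le 2^{-n}|x_0 - \alpha|$ from the theorem.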
The other part of the theorem is concerned with the integral of a function over an interval:
Clearly, we have to define precisely what this means. In contrast to derivatives, which were
closely connected to the slope at a point, local extrema, and other local properties of a function,
the integral is a global quantity. However, we will see later that both concepts are very much
related. In particular, if f : [a, b] → R is continuous, then there exists an antiderivative F of f
and
\[ \int_a^b f(x)\,dx = F(b) - F(a). \]
This is (the second part of) the fundamental theorem of calculus.
In this chapter we will first discuss antiderivatives of functions. Then we introduce a basic
definition of an integral, and show the fundamental theorem of calculus, which gives a rather
easy way of computing integrals (or areas). Later, we will see the limitations of this (too naive)
approach, and turn to the more powerful Lebesgue integral. For this we also need to discuss
some basic measure theory, i.e., we discuss what we mean by the area of a set.
Remark 6.1 (History). The first systematic ideas of an area (or integral) go back to 500 BC,
when people tried to compute the area of simple areas, like land plots. Already in the 3rd
century BC, Archimedes (c. 287 – c. 212 BC) used these ideas to find an approximation of the
area of a circle with radius one, and thereby established the inequality $3 + \frac{10}{71} < \pi < 3 + \frac{10}{70}$.
Only in the 19th century was mathematics brought to a more formal level, which allowed for
very precise statements. The first “correct” definition of an integral was given by Augustin-Louis
Cauchy (1789–1857) in 1823. This was extended (or improved) by several mathematicians. The
most famous approaches are the Riemann integral, introduced in 1854 by Bernhard Riemann
(1826–1866) and by now the classical approach for introducing an integral to students, and the
Lebesgue integral, introduced in 1902 by Henri Léon Lebesgue (1875–1941), which is the one
actually used in research. Here, we only briefly comment on the Riemann integral, and focus
on the definition of the more powerful Lebesgue integral.
6.1 Antiderivatives
The theory of the previous chapters was mostly done for functions defined on (open) intervals,
which was enough to present the most important results. However, all the definitions (and many
of the results) would also make sense for functions defined on the union of open intervals, by
considering the function separately on each interval, or even on more general domains. But note
that for some theorems, like the mean value theorem (Theorem 5.34), it was essential that the
domain was an interval.
In order to come closer to more formal (or general) statements of theorems, we will present the
following for a larger class of domains. This will also be necessary since we want to define the
integral also over general domains.
Recall that a set $\Omega \subseteq \mathbb{R}$ is called open if
\[ \forall x \in \Omega\ \exists \varepsilon > 0 : U_\varepsilon(x) \subset \Omega, \]
and closed if its complement $\Omega^c$ is open.
Clearly, open intervals (a, b) and Ω = R are open. Also sets of the form (a, ∞), and their unions,
are open. Therefore, also R \ {0} = (−∞, 0) ∪ (0, ∞) is open. Similarly, we obtain that closed
intervals [a, b] are closed since ([a, b])c = R \ [a, b] = (−∞, a) ∪ (b, ∞) is open. Therefore, also
sets {a}, that contain only one element, are closed.
Remark 6.3. By this notation we can present the results in more generality, and without
specifying a specific form of the domain. Note that open sets Ω are exactly those sets, where
we can define the derivative of a function at every x ∈ Ω. (We had problems with the boundary
points!) However, if we consider in the following an open (or closed) set Ω, you may just think
about an open (or closed) interval, or unions of them.
This definition seems easy to handle, since all of us have already computed many derivatives. However,
when we compute the derivative of a differentiable function, we always end up with a (unique)
function, and we have a nice point-wise criterion for deciding whether a function is differentiable.
This is now different since we want to find a function F , but we only have information about
its derivative F 0 = f . This is not enough to end up with a unique antiderivative F . To see this,
note that knowing the slope in each point does not give any information about the function
values at all. This is because a function with the same derivative might be at any “height”. Let
us write this down mathematically. For any functions $f$ and $F$, and $c \in \mathbb{R}$, we have that
\[ F \text{ is an antiderivative of } f \iff F + c \text{ is an antiderivative of } f. \]
For this, we only used that the derivative of a constant function equals zero. In particular, if a
function has an antiderivative, then it has infinitely many.
Remark 6.5. Note that the notation $\int f(x)\,dx$, which shall denote a function, might be confusing,
because it does not allow for the direct use of function values. E.g., we would never write
$\int f(x)\,dx\,(2)$ for $F(2)$ or so. Moreover, the correct meaning of $F = \int f(x)\,dx$ is just “$F$ is an
antiderivative of $f$”, which is not an actual equality; rather, the derivatives of both sides have to
coincide everywhere. However, this notation is useful when we want to talk about (properties
of) the antiderivative as a function, since we do not have to reserve/waste a new letter.
Let us start with the easy example of the exponential function ex , which does not change under
differentiation.
Example 6.6. For the exponential function, we know that $(e^x)' = e^x$, and therefore that
$F(x) = e^x$ is one possible antiderivative, i.e.,
\[ \int e^x\,dx = e^x. \]
However, if one asks for all antiderivatives of $e^x$, then we have to take $F(x) = e^x + c$ for arbitrary
$c \in \mathbb{R}$, i.e.,
\[ \int e^x\,dx = e^x + c. \]
In most applications, it is enough to know just one of the antiderivatives, and therefore we
mostly omit the constant $c$. However, keep in mind that an antiderivative is not unique.
Moreover, note that the two equations above combined clearly do not imply that $e^x = e^x + c$ for
every $x$. The equal signs should be interpreted as “the derivative of the right hand side is $e^x$”.
Example 6.7. Now consider $f(x) = \frac{1}{x}$ on $\Omega = \mathbb{R} \setminus \{0\}$, and we show that
\[ \int \frac{dx}{x} = \int \frac{1}{x}\,dx = \ln|x|. \]
First of all, we know from the last chapter that $(\ln(x))' = \frac{1}{x}$ for all $x > 0$. Hence, $F(x) = \ln(x)$
for $x > 0$. But $\ln(x)$ is not defined for $x < 0$ and therefore, it is not obvious how to choose $F(x)$
such that $F'(x) = \frac{1}{x}$. But it is easy to verify that, for $x < 0$, the function $\ln|x| = \ln(-x)$ is well
defined and $(\ln(-x))' = \frac{1}{-x} \cdot (-1) = \frac{1}{x}$. This proves the claim.
This is already an example showing that it might be hard to find the antiderivative of a
given function, but easy to verify that a function is an antiderivative.
(Hint: Always double-check your antiderivative by calculating its derivative!)
Next we provide a list of antiderivatives which we will use from now on. All of them follow by
differentiating the right hand side. (Do this again as an exercise!)
ax
Z
ax dx = , a > 0, a 6= 1
ln a
xa+1
Z
xa dx = , a 6= −1
a+1
dx
Z
= ln |x|
Z x
cos x dx = sin x
Z
sin x dx = − cos x
dx
Z
= tan x
cos2 x
dx
Z
= − cot x
sin2 x
dx
Z
= arctan x
1 + x2
dx
Z
√ = arcsin x
1 − x2
dx
Z
√ = arcosh x
x2 − 1
dx
Z
√ = arsinh x
x2 + 1
R
All the antiderivatives f dx above exist on the whole domain where f is defined.
However, not all functions have a antiderivative, as the following example shows.
Example 6.8. If we consider the function f : R → R with f (x) = 1 for x ≥ 0, and f (x) = 0
for x < 0, i.e., the Heaviside function, then it is not hard to see that an antiderivative must be
constant, say F (x) = c, for x < 0, and linear, say F (x) = x + b for x > 0. Otherwise we would
not get F 0 (x) = f (x) for x 6= 0. It remains to consider x = 0. Since F has to be differentiable,
it has to be continuous, and we obtain that b = c. However, it is easy to check that such a
function F cannot be differentiable at 0. (F has a kink.)
As for derivatives we will now present some rules that are useful when we want to find the
antiderivative of a complicated function, which is composed of some elementary functions, like
the ones given above.
However, and unfortunately, although we were able to determine the derivative for nearly any
combination of ’easy’ functions, it is much harder to find an antiderivative. In fact, a very
common strategy is to guess an antiderivative, and then to verify it by calculating its derivative.
For this, one clearly needs to be well-practiced in calculating derivatives. Moreover, it is
sometimes just impossible to determine a closed formula for the antiderivative, even for ’easy
looking’ functions like $e^{-x^2}$. We will discuss $\int e^{-x^2}\,dx$ and similar functions later in more detail.
The first calculation rule for antiderivatives, which directly follows from the corresponding rules
for derivatives, is the linearity.
Lemma 6.9 (Linearity). Let $F = \int f(x)\,dx$ and $G = \int g(x)\,dx$. Then, for all $\alpha, \beta \in \mathbb{R}$,
\[ \alpha F + \beta G = \alpha \int f(x)\,dx + \beta \int g(x)\,dx = \int \bigl(\alpha f(x) + \beta g(x)\bigr)\,dx. \]
Proof. We only have to verify that the derivative of the function on the left equals the one in
the integral on the right. By Theorem 5.11, we obtain
\[ (\alpha F + \beta G)' = \alpha F' + \beta G' = \alpha f + \beta g, \]
since $F' = f$ and $G' = g$.
For example,
\[ \int (x^3 + x^2)\,dx = \int x^3\,dx + \int x^2\,dx = \frac{x^4}{4} + \frac{x^3}{3}. \]
In particular, all antiderivatives of $x^3 + x^2$ are of the form $F(x) = \frac{x^4}{4} + \frac{x^3}{3} + c$ for some $c \in \mathbb{R}$.
\[ \ldots = \frac{x^2}{2} + \frac{4 x^{5/2}}{5} + \frac{x^3}{3}. \]
In some cases, one may need some modifications of the integrand to bring it to the right form.
Example 6.12.
\[ \int \frac{x^2 - x^4}{1 - x^4}\,dx = \int \frac{1 - x^4 - (1 - x^2)}{1 - x^4}\,dx = \int \Bigl(1 - \frac{1 - x^2}{(1-x^2)(1+x^2)}\Bigr)\,dx = \int 1\,dx - \int \frac{1}{1+x^2}\,dx = x - \arctan x. \]
In the same way as we used linearity of differentiation above, we can also utilize the other
calculation rules from Section 5.1 to deduce rules for antiderivatives.
Recall that the product rule states that
\[ (fg)' = f'g + fg' \]
whenever the derivatives exist, see Theorem 5.13. Since this equality shows that $fg$ is an
antiderivative of $f'g + fg'$, we obtain the rule of integration by parts (or partial integration),
\[ \int f'(x)\,g(x)\,dx = f(x)\,g(x) - \int f(x)\,g'(x)\,dx, \]
by rearranging the terms.
Let us see an example that shows how this rule is usually applied.
Example 6.14. Assume we want to compute $F = \int \ln(x)\,dx$, i.e., an antiderivative of $\ln(x)$.
Since we know the derivative of $\ln$, i.e. $(\ln(x))' = \frac{1}{x}$, we may choose $g(x) = \ln(x)$ above.
Additionally, we choose $f(x) = x$, which satisfies $f'(x) = 1$. We obtain
\[ \int \ln(x)\,dx = \int 1 \cdot \ln(x)\,dx = \int f'g\,dx = fg - \int g'f\,dx = x\ln(x) - \int x \cdot \frac{1}{x}\,dx = x\ln(x) - x = x(\ln(x) - 1). \]
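The earlier hint (“double-check your antiderivative by differentiating”) can even be automated: a symmetric difference quotient is a cheap numerical stand-in for the derivative. A sketch (plain Python, not from the notes; step size and tolerances chosen ad hoc):

```python
import math

def num_derivative(F, x, h=1e-6):
    # symmetric difference quotient, approximates F'(x) up to O(h^2)
    return (F(x + h) - F(x - h)) / (2 * h)

F = lambda x: x * (math.log(x) - 1)  # candidate antiderivative from above
f = lambda x: math.log(x)            # the integrand

for x in [0.5, 1.0, 2.0, 10.0]:
    print(x, num_derivative(F, x), f(x))  # the last two columns agree
```

This does not replace the symbolic check, but it catches sign errors and wrong constants in seconds.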
The same ’trick’ is very useful for integrating products of polynomials with sine, cosine or
exponential functions. However, sometimes one has to integrate several times to obtain an explicit
formula for the antiderivative.
Example 6.15. If we want to calculate $\int x \sin x\,dx$, we set $f'(x) = \sin x$ and $g(x) = x$, which
yields $f(x) = -\cos x$ and $g'(x) = 1$, and therefore
\[ \int x \sin x\,dx = -x\cos x + \int \cos x\,dx = -x\cos x + \sin x. \]
Example 6.16. If we want to calculate $\int x^2 e^x\,dx$, we set $g(x) = x^2$ and $f'(x) = e^x$, which
yields
\[ \int x^2 e^x\,dx = x^2 e^x - 2\int x\,e^x\,dx. \]
This still involves the integral $\int x\,e^x\,dx$, which we calculate by setting $g(x) = x$ and $f'(x) = e^x$,
and therefore
\[ \int x^2 e^x\,dx = x^2 e^x - 2x e^x + 2e^x. \]
All the examples above involve polynomials (or more precisely, monomials), which vanish after
enough differentiation. This is not the case if we consider the product of a trigonometric and an
exponential function. In such cases, it may happen that we reach the same integral again after
some steps of partial integration. This can be used to calculate the antiderivative.
We set $f'(x) = \cos(x)$ and $g(x) = 2^x$, which yields $f(x) = \sin(x)$ and $g'(x) = \ln(2) \cdot 2^x$.
Therefore,
\[ \int \cos(x)\,2^x\,dx = \sin(x)\,2^x - \ln(2) \int \sin(x)\,2^x\,dx. \]
A second partial integration, now with $f'(x) = \sin(x)$ and $g(x) = 2^x$, gives
\[ \int \sin(x)\,2^x\,dx = -\cos(x)\,2^x + \ln(2) \int \cos(x)\,2^x\,dx, \]
so the same indefinite integral appears on both sides. Rearranging, and dividing by $(1 + \ln(2)^2)$,
leads to
\[ \int \cos(x)\,2^x\,dx = \frac{\sin(x) + \ln(2)\cos(x)}{1 + \ln(2)^2}\, 2^x. \]
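Since this computation involved two partial integrations and a rearrangement, a numerical spot check is reassuring. The sketch below (illustrative Python, not from the notes) differentiates the claimed antiderivative with a difference quotient:

```python
import math

def deriv(F, x, h=1e-6):
    # symmetric difference quotient approximating F'(x)
    return (F(x + h) - F(x - h)) / (2 * h)

ln2 = math.log(2)
F = lambda x: (math.sin(x) + ln2 * math.cos(x)) / (1 + ln2**2) * 2**x
f = lambda x: math.cos(x) * 2**x

for x in [-1.0, 0.0, 1.0, 3.0]:
    print(x, deriv(F, x), f(x))  # both columns agree up to discretization error
```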
Example 6.18. Show that $\int \sin(x)\,e^x\,dx = e^x\,\frac{\sin(x) - \cos(x)}{2}$.
The next rule we want to employ is the chain rule, see Theorem 5.14. For this, recall that for
two differentiable functions $F$ and $g$, we have
\[ (F \circ g)'(x) = \frac{d}{dx}\, F(g(x)) = F'(g(x)) \cdot g'(x). \]
If $F$ is an antiderivative of $f$, then this shows that $F \circ g$ is an antiderivative of $(f \circ g) \cdot g'$. This
is called the substitution rule.
Lemma 6.19 (Substitution rule). Let $F = \int f(x)\,dx$ and $g$ be a differentiable function. Then,
\[ F(g(x)) = \int f(g(x))\,g'(x)\,dx. \]
(This may also be done by partial integration, but would take ages.)
We see that the difficult part is the “$x^7 + 1$” in the cosine. Let's write $g(x) = x^7 + 1$. Then, we
have $g'(x) = 7x^6$. If we now write $f(x) = \cos(x)$, we obtain
\[ \int x^6 \cos(x^7 + 1)\,dx = \frac{1}{7} \int g'(x)\,f(g(x))\,dx. \]
The integral on the right hand side equals $F(g(x))$ by the substitution rule, where $F(x) = \sin(x)$
is the antiderivative of $f$. We obtain
\[ \int x^6 \cos(x^7 + 1)\,dx = \frac{1}{7}\, F(g(x)) = \frac{\sin(x^7 + 1)}{7}. \]
The substitution rule is sometimes easier to handle if we introduce the substitution $t = g(x)$.
If we then use the very informal reasoning that
\[ \frac{dt}{dx} = \frac{d}{dx}\, t = \frac{d}{dx}\, g(x) = g'(x), \]
and that this “implies” $dt = g'(x)\,dx$ ($\iff dx = \frac{dt}{g'(x)}$), then we can write
\[ \int f(g(x))\,g'(x)\,dx = \int f(t)\,dt = F(t). \]
Note that $\int f(t)\,dt$ is now a function in $t$, and we have to replace $t = g(x)$ at the end. Although
the above arguments were rather informal, we know that this formula is correct from
the substitution rule.
(Again, one could expand the brackets and integrate all the easy terms, which is rather lengthy.)
With the substitution $t = x^3 - 5$, we have $\frac{dt}{dx} = 3x^2$, and therefore $x^2\,dx = \frac{dt}{3}$. We obtain
\[ \int (x^3 - 5)^6 \cdot x^2\,dx = \int t^6\, \frac{dt}{3} = \frac{t^7}{21}. \]
If we finally substitute $t = x^3 - 5$, we obtain
\[ \int (x^3 - 5)^6 \cdot x^2\,dx = \frac{(x^3 - 5)^7}{21}. \]
Example 6.22. There are also quite tricky examples, which can be solved by substitution.
Consider for example the function $f(x) = \sqrt{1 - x^2}$ for $x \in [0,1]$, and we want to compute its
antiderivative $\int \sqrt{1 - x^2}\,dx$. Here, we use the substitution the other way around and define
$x = \sin(t)$ (which is equivalent to $t = \arcsin(x)$) for $t \in [0, \frac{\pi}{2}]$. We obtain $\frac{dx}{dt} = \cos(t)$, and therefore
\[ \int \sqrt{1 - x^2}\,dx = \int \sqrt{1 - \sin^2(t)} \cdot \cos(t)\,dt. \]
Since $\cos^2(t) + \sin^2(t) = 1$, and $\sqrt{\cos^2(t)} = |\cos(t)| = \cos(t)$ for $t \in [0, \frac{\pi}{2}]$, we get
\[ \int \sqrt{1 - x^2}\,dx = \int \cos^2(t)\,dt \quad \text{for } x = \sin(t). \]
From this equation we get $\int \cos^2(t)\,dt = \frac{1}{2}\bigl(\sin(t)\cos(t) + t\bigr)$. Putting this in the equation above,
re-substituting $x = \sin(t)$ (or $t = \arcsin(x)$), and noting that $\cos(\arcsin(x)) = \sqrt{1 - x^2}$ (Try to
prove this!), we obtain
\[ \int \sqrt{1 - x^2}\,dx = \frac{1}{2}\Bigl( x\sqrt{1 - x^2} + \arcsin(x) \Bigr). \]
Note that, although the left hand side looks elementary, its antiderivative could involve functions
that one would not expect there, e.g., arcsin. We will see later how this integral is related to
the area of a circle.
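The remark about the circle can be made concrete: on $[0,1]$ the graph of $\sqrt{1-x^2}$ bounds a quarter of the unit disc, so the antiderivative should yield the area $\pi/4$. A sketch (plain Python, not from the notes; the Riemann-sum cross-check anticipates the definition of the integral given later in this chapter):

```python
import math

# the antiderivative derived above
F = lambda x: 0.5 * (x * math.sqrt(1 - x * x) + math.asin(x))

area = F(1.0) - F(0.0)
print(area, math.pi / 4)  # both ~ 0.785398...

# cross-check with a crude left-endpoint Riemann sum
n = 10**5
riemann = sum(math.sqrt(1 - (k / n)**2) for k in range(n)) / n
print(riemann)
```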
There is one particularly useful rule that follows directly from the substitution rule. One may
even deduce it directly from the chain rule by noting that, for a differentiable function $f$, we
have
\[ \bigl(\ln|f(x)|\bigr)' = \frac{f'(x)}{f(x)}, \]
whenever $f(x) \ne 0$. We obtain the following corollary.
\[ \int \frac{f'(x)}{f(x)}\,dx = \ln|f(x)|. \]
To see that this result follows from the substitution rule, consider the substitution $t = f(x)$.
Example 6.24. The last formula is always useful when one wants to find the antiderivative of
a rational function such that the numerator is the derivative of the denominator.
An easy application is
\[ \int \frac{x}{1+x^2}\,dx = \frac{1}{2} \int \frac{2x}{1+x^2}\,dx = \frac{1}{2}\ln(1 + x^2), \]
where we set $f(x) = 1 + x^2$, which gives $f'(x) = 2x$, and omit the absolute value, because $f > 0$.
Another application is
\[ \int \tan(x)\,dx = \int \frac{\sin(x)}{\cos(x)}\,dx = -\ln|\cos(x)|. \]
For this, let $f(x) = \cos(x)$, which implies that $f'(x) = -\sin(x)$.
Remark 6.25. Let us finally recall again that an antiderivative is not unique: it is just a
function whose derivative satisfies a certain equation. However, it can be unique if we know in advance
that it satisfies additional conditions. In particular, one function value is enough. For example,
if we are looking for an antiderivative $F$ of $e^x$ such that $F(0) = 0$, then we obtain $F(x) = e^x - 1$.
(Verify yourself!)
This is a special case of so-called initial value problems (or boundary value problems):
For given $f : I \to \mathbb{R}$, $x_0 \in I$ and $y_0 \in \mathbb{R}$, we want to find $F$ such that $F' = f$ and $F(x_0) = y_0$.
One may think about the following example: Let $F(t)$ be the position of a train (or car, or
particle) at time $t$ moving on the real line. Then, $f(t) = F'(t)$ is the velocity. If we assume
that only $f$ is given, i.e., we know only the velocity at every point in time, then we can clearly
say something about the overall distance traveled, say, between time $t = 0$ and $t = 1$. However, we
have no chance to say where the train actually is at time $t = 1$ (or any other time) if we do
not know where the train departed.
As noted in the beginning of this chapter, antiderivatives are very much connected to the integration
of a function. That is, for a given function $f : \mathbb{R} \to [0, \infty)$, and a given interval $[a,b]$,
the integral $\int_a^b f(x)\,dx$ is the area between the $x$-axis and the graph of $f$, i.e., the area of the set
$\{(x,y) \in \mathbb{R}^2 : a < x < b,\ 0 < y < f(x)\}$, see Figure 41.
Figure 41: The area of the shaded region is denoted by $\int_a^b f(x)\,dx$
Clearly (and as we all hopefully already got used to), we have to define precisely what this
means. In particular, we have to clarify which functions are integrable, and which functions (or
sets) do not allow for the definition of a meaningful area.
Remark 6.26. It might come as a surprise that there are sets such that it is not possible to assign
them an area (or a volume in higher dimensions). This corresponds to the existence of functions
such that we just cannot say how big the area below the graph is. However, we were already
in a similar situation when we discussed in Chapter 5 that not all functions are differentiable
(consider, e.g., $f(x) = |x|$ at $x = 0$). And, similarly to there, the answer to “Is $f$ integrable?”
very much depends on the precise definitions. As we discussed in Remark 6.1, there were several
attempts to obtain “the most general” or “the most useful” definition of an integral. (Presently,
we consider the Lebesgue integral, which we will introduce later, the “best”. But who knows...)
However, it was a great advance in mathematics to understand that there cannot be a universal
definition that assigns an integral to any function. And this insight has a rather fascinating
reformulation, if we talk about volumes in three dimensions:
The Banach–Tarski paradox states that we can split a ball into five disjoint pieces and, without
deforming any of them, put them back together in a different way to obtain two identical
copies of the same ball. In particular, if we were able to assign a volume to each of the
(impossible to draw) pieces, then this construction would show that the volume of the ball equals
twice the volume of the same ball. An obvious contradiction. A proof of this statement (which
appeared first in 1924) goes far beyond the scope of this lecture.
It is obvious, why this is called a paradox: The same is clearly not possible in our real (physical)
world, as we cannot just double a ball. This is one of the most prominent examples of a very
counterintuitive mathematical result and shows that we have to be careful with our definitions.
We now introduce one possibility of defining an integral. Actually, this is probably the most
simple and straightforward definition, which is therefore restricted to “easy” functions. Later,
when we need more powerful mathematical tools, we will have to give a more involved definition.
However, note that for “easy” functions (that one can draw) all these
different definitions should give the same result.
Let us consider a continuous function f : Ω → R, and a closed interval [a, b] ⊂ Ω. (Note that
f is therefore bounded on [a, b].) If we now want to calculate the area between the graph of f
and the x-axis, then we could divide the interval [a, b] into equal subintervals, and approximate
the area in this subinterval just by the area of a suitable rectangle. Clearly, there are many
reasonable choices. Figure 42, e.g., shows the (bad) approximation of the integral by using the
smallest and the largest rectangle in each interval (when we divide it only into four subintervals).
Although the resulting approximation of the integral might be quite different, this difference
gets smaller and smaller if we increase the number of subintervals, see Figure 43.
This suggests that it is actually unimportant which of the rectangles we take, and we will prove
that this is in fact the case when we consider continuous functions on a closed interval.
Therefore we may just use the rectangles whose height is given by the left endpoint of the
subinterval. Note that, in the case of monotonically increasing functions, these are the same as
the lower rectangles, see Figure 42(left).
Remark 6.27. Note that all the reasoning also makes sense for possibly negative functions.
In this case, the area between f and the x-axis is counted negatively. In particular, if the integral
of a function is zero, then this only means that the area above the x-axis equals the area below.
Note that in the special case $[a,b] = [0,1]$, this partition has the simple form $\bigcup_{k=0}^{n-1} \bigl[\frac{k}{n}, \frac{k+1}{n}\bigr]$.
For illustration purposes, let us stick to the case $[a,b] = [0,1]$. To approximate the integral
of a function $f : [0,1] \to \mathbb{R}$, we first consider the first subinterval $[0, \frac{1}{n}]$. In this interval, we
approximate the area below the graph by the area of the rectangle $[0, \frac{1}{n}] \times [0, f(0)]$, which is
clearly $\frac{1}{n} \cdot f(0)$. See again Figure 42, where this area is just zero. We then consider
the second subinterval $[\frac{1}{n}, \frac{2}{n}]$, approximate the area by the rectangle $[\frac{1}{n}, \frac{2}{n}] \times [0, f(\frac{1}{n})]$, which is
$\frac{1}{n} f(\frac{1}{n})$, and so on. If we add up the areas of all these rectangles, and call the sum $Q_n(f)$, we
obtain
\[ Q_n(f) := \frac{1}{n} \sum_{k=0}^{n-1} f\Bigl(\frac{k}{n}\Bigr). \tag{6.1} \]
In the same way, for a general interval $[a,b]$, we consider the sums
\[ \frac{b-a}{n} \sum_{k=0}^{n-1} f\Bigl(a + \frac{k(b-a)}{n}\Bigr). \]
To assign the function $f$ its integral over $[a,b]$, it remains to show that these sums converge if
we make $n$ larger and larger. This is given in the following lemma. To keep things simple, we
only show the case $[a,b] = [0,1]$. The general case can be proven along the same lines.
Moreover, as we already discussed above, our choice of the left endpoint to determine the height
of the rectangles was somewhat arbitrary, and we have to justify that it is indeed irrelevant. For
this, we show that one might also take the smallest or the largest of these rectangles in each
subinterval, and the result would still be the same. For this, define the lower sums
\[ L_n(f) := \frac{1}{n} \sum_{k=0}^{n-1} \min\Bigl\{ f(x) : x \in \Bigl[\frac{k}{n}, \frac{k+1}{n}\Bigr] \Bigr\} \]
and the upper sums
\[ U_n(f) := \frac{1}{n} \sum_{k=0}^{n-1} \max\Bigl\{ f(x) : x \in \Bigl[\frac{k}{n}, \frac{k+1}{n}\Bigr] \Bigr\}. \]
(Note that the minima and maxima exist, since f is continuous on the closed (sub-)intervals.)
Consider again the above pictures where the lower and upper sums are depicted.
In particular, we have
\[ L_n(f) \le Q_n(f) \le U_n(f) \]
for every continuous function. (If this is not obvious to you, verify it!) So, if $L_n(f)$ and $U_n(f)$
converge to the same value, then (by the sandwich rule) $Q_n(f)$ also converges to that value.
\[ \lim_{n\to\infty} L_n(f) = \lim_{n\to\infty} U_n(f). \]
In particular, both limits exist. Therefore, also the sequence $(Q_n(f))_{n\in\mathbb{N}}$ converges.
Proof. To prove that the limits of $L_n(f)$ and $U_n(f)$ are equal, we will show that the difference
$U_n(f) - L_n(f)$ converges to zero, i.e., that for all $\varepsilon > 0$ there is some $n_0 \in \mathbb{N}$ such that
$|U_n(f) - L_n(f)| < \varepsilon$ for all $n \ge n_0$. First of all, note that $f$ is continuous on a closed interval,
and therefore uniformly continuous, see Theorem 4.65.
Let us now fix some $\varepsilon > 0$. By the uniform continuity we obtain that there is some $\delta > 0$ such
that $|x - y| < \delta$ implies $|f(x) - f(y)| < \varepsilon$. If we now take some $n_0 > \frac{1}{\delta}$, we see that $\frac{1}{n} \le \frac{1}{n_0} < \delta$
for every $n \ge n_0$. Since the interval $[\frac{k}{n}, \frac{k+1}{n}]$ has length $\frac{1}{n}$, we obtain that $|f(x) - f(y)| < \varepsilon$ for
all $x, y \in [\frac{k}{n}, \frac{k+1}{n}]$, if $n \ge n_0$. (We use that $|x - y| < \delta$ for all such $x, y$ and $n$.) In particular,
$U_n(f) - L_n(f) \le \frac{1}{n} \sum_{k=0}^{n-1} \varepsilon = \varepsilon$ for all $n \ge n_0$, which proves the claim.
By the above lemma, it does not matter which specific points we choose in the respective
intervals. We always obtain the same limit. Therefore, we can define the integral of a function
as the limit of the given average of the function values.
We call
\[ \int_a^b f(x)\,dx := \lim_{n\to\infty} \frac{b-a}{n} \sum_{k=0}^{n-1} f\Bigl(a + \frac{k(b-a)}{n}\Bigr) \]
the (definite) integral of $f$ over $[a,b]$. We call $a$ and $b$ the limits of the integral.
Note that $\int_a^b f(x)\,dx$ (if it exists) is a number, i.e., the area between the graph and the $x$-axis.
Therefore, the “$x$” is just a placeholder, and one may use any other letter. That is, e.g.,
\[ \int_a^b f(x)\,dx = \int_a^b f(t)\,dt = \int_a^b f(\xi)\,d\xi = \ldots \]
We sometimes even omit the integration variable, and write $\int_a^b f\,dx = \int_a^b f(x)\,dx$.
Although the definition of the integral looks rather simple, it is not a very practical one. The
involved limit is usually hard to determine. However, we will see in the following section that
the integral can be given in terms of the antiderivative of a function. This is also the typical way
of calculating integrals, and justifies the similarity of the notations. But always bear in mind
that antiderivatives are functions (more precisely: classes of functions) and not just a number.
From the formula $\sum_{k=0}^{n-1} k = \frac{(n-1)n}{2}$ (which is called “Gaußsche Summenformel” in German, but
doesn't seem to have a name in English) we then obtain
\[ \int_0^1 f(x)\,dx = \int_0^1 x\,dx = \lim_{n\to\infty} \frac{1}{n^2} \cdot \frac{(n-1)n}{2} = \lim_{n\to\infty} \frac{1 - \frac{1}{n}}{2} = \frac{1}{2}. \]
From the definition of the integral as a limit, we obtain that it satisfies a list of rules, like
linearity. Most of them may even already be clear from the “graphical definition”. We state
them without a proof.
There is one case of the above lemma that is particularly important. In analogy to the very
similar inequality for (finite) sums, this is also called triangle inequality.
Proof. Clearly, f ≤ |f | and −f ≤ |f |. The inequality therefore follows from |x| = max{x, −x},
Lemma 6.31(4) with λ = −1, and Lemma 6.31(6).
Let us finally state some remarks on difficulties with and variants of the above definition.
which equals the true area. (Note that Lemma 6.28 holds also for such functions.)
We will comment on such “piecewise functions” in detail in Section 6.6.
If we consider the Dirichlet function $\chi_{\mathbb{Q}}$ instead, i.e., the indicator function of the rational numbers, then, with our definition, we would obtain $\int_0^1 \chi_{\mathbb{Q}}\,dx = 1$, because all the function values we compute are at rational points. And therefore, $\int_0^1 \chi_{\mathbb{R}\setminus\mathbb{Q}}\,dx = 1 - \int_0^1 \chi_{\mathbb{Q}}\,dx = 0$.
However, this is unsatisfactory (and in contrast to our intuition), since there are more irrational
than rational numbers. One might check that Lemma 6.28 fails for this function. Hence, the
outcome of our “integration procedure” depends on the chosen function values, and we have to
be careful how we define an integral.
To solve this issue (at least partially) one must be more careful and think about a proper definition of integrable functions. One could then define the integral for a much larger class of functions.
(Still not for all functions, see Remark 6.26.) We will talk about a more powerful definition, i.e., the Lebesgue integral, later. But, as stated many times, every generalization of the above concept should lead to the same result when applied to a continuous function.
Remark 6.34 (Improper integrals). In the definitions above it was essential that we talk about
continuous (and therefore bounded) functions on a closed (and therefore bounded) interval. This
implies that all the rectangles that were used for the approximation of the integral have finite
size. Otherwise, the above definition is clearly useless. However, this can be relaxed a bit by
considering improper integrals. With this, it is also possible to define the integral of some
unbounded functions on unbounded intervals, i.e., we may determine the area of unbounded
sets. We will shortly come back to this when we know how to compute integrals fast.
Remark 6.35. The "Q" in $Q_n(f)$ in (6.1) stands for "quadrature rule". This is what such averages over function values are usually called. Quadrature rules are also used very much in practice, but then one sometimes has to think about "more clever" averages. In particular, this is important as we cannot compute the above limits in general, and therefore have to work with finite $n$, which leads to errors that we might want to minimize. However, as we only work with the limits here, Lemma 6.28 shows that such optimizations are not needed.
Remark 6.36. Even if we worked with more clever choices of points in the definition of the quadrature rule, and therefore ultimately in the definition of the integral, this would still not be enough for a "satisfactory definition". (We do not want to comment here on what this means exactly.) One of the first definitions that met these standards is the Riemann integral, see Remark 6.1. For this definition we not only have to consider arbitrary points in each subinterval, we also have to consider arbitrary subintervals (i.e., partitions) of the given interval. As this is more technical than needed here, we skip the precise definitions and basic results, which can be found in most beginners' books on calculus.
We now turn to the fundamental theorem of calculus, which provides us with an easier way of
determining integrals, and which shows the existence of antiderivatives for continuous functions.
Recall that a differentiable function F is an antiderivative of f if and only if F 0 = f . We start
with the following additional result, which is of independent interest.
Theorem 6.37 (Mean value theorem for definite integrals). Let $f, g : [a,b] \to \mathbb{R}$ be continuous functions, and assume that $g \ge 0$. Then, $\int_a^b f(x)g(x)\,dx$ exists and there exists some $\xi \in [a,b]$ such that
$$\int_a^b f(x)g(x)\,dx = f(\xi) \int_a^b g(x)\,dx.$$
In particular, we have (for $g = 1$) that
$$\frac{1}{b-a}\int_a^b f(x)\,dx = f(\xi).$$
Proof. Since $f$ is continuous, it attains its extrema on $[a,b]$. Let us denote them by $m := \min_{x\in[a,b]} f(x)$ and $M := \max_{x\in[a,b]} f(x)$. It follows that $m\,g(x) \le f(x)g(x) \le M g(x)$ for all $x \in [a,b]$, and therefore
$$m \int_a^b g(x)\,dx \le \int_a^b f(x)g(x)\,dx \le M \int_a^b g(x)\,dx,$$
see Lemma 6.31. We define $I = \int_a^b g(x)\,dx$ and obtain
$$m \cdot I \le \int_a^b f(x)g(x)\,dx \le M \cdot I.$$
If I = 0, then g = 0 and any ξ ∈ [a, b] can be used to obtain the result. Otherwise we divide by
I and get
$$m \le \frac{1}{I}\int_a^b f(x)g(x)\,dx \le M.$$
Due to the intermediate value theorem $f$ attains every value in the interval $[m, M]$ (i.e., between its extreme values), so in particular the value $\frac{1}{I}\int_a^b f(x)g(x)\,dx$.
This is all we need to formulate the main result of this section. For this, we set
$$\int_a^b f\,dx := -\int_b^a f\,dx,$$
whenever $b < a$. (Note that we have defined the left hand side only for $a < b$.)
is an antiderivative of $f$, i.e., $F' = f$.
Moreover, for any $a, b \in I$ and any antiderivative $F$ of $f$, we have
$$\int_a^b f(x)\,dx = F(b) - F(a),$$
The mean value theorem implies the existence of some $\xi_h \in [x, x+h]$ such that
$$\frac{1}{h}\int_x^{x+h} f(y)\,dy = f(\xi_h)$$
for all $x \in I$. For the second part, we first plug in $a$ and $b$ into the antiderivative that we just defined and obtain
$$F(b) - F(a) = \int_a^b f(y)\,dy - \int_a^a f(y)\,dy = \int_a^b f(y)\,dy.$$
Moreover, note that different antiderivatives differ only by a constant. Therefore, the right hand
side is the same for any antiderivative of f .
Remark 6.39. The last theorem shows that integration gives us an antiderivative of a continu-
ous function f , and one may ask for an interpretation of this. One somehow sketchy possibility,
which originates from a graphical point of view, is the following:
If we start at $F(a)$ and then add up all the changes that $F$ makes between $a$ and $b$ (if this were possible), then we arrive at $F(b)$. Now, these changes (or slopes) are precisely the values of the derivative of $F$, and summing all of them up is like integration. So, one might guess that $F(b) = F(a) + \int_a^b F'(x)\,dx$. But, since $F' = f$, this is exactly what the fundamental theorem of calculus is about.
With this very important and powerful theorem we can calculate many integrals easily. (At
least if you know many antiderivatives.) Let us consider again the example from above.
Example 6.40. Let us consider the function $f : [0,1] \to \mathbb{R}$ given by $f(x) = x$. We already showed in Example 6.30 that $\int_0^1 x\,dx = \frac{1}{2}$. This may also be shown by considering $F(x) := \frac{x^2}{2}$, which is an antiderivative of $f$. We therefore have from the fundamental theorem that
$$\int_0^1 x\,dx = \left[\frac{x^2}{2}\right]_0^1 = \frac{1^2}{2} - \frac{0^2}{2} = \frac{1}{2}.$$
We now consider some more complicated examples, which may be difficult to compute only with
the definition of the integral as a limit.
Example 6.41. We know from Example 6.10 the antiderivative $\int (x^3 + x^2)\,dx = \frac{x^4}{4} + \frac{x^3}{3}$ on $\mathbb{R}$. We therefore obtain, for example, the values
$$\int_0^1 (x^3 + x^2)\,dx = \left[\frac{x^4}{4} + \frac{x^3}{3}\right]_0^1 = \frac{1}{4} + \frac{1}{3} = \frac{7}{12},$$
or
$$\int_{-1}^1 (x^3 + x^2)\,dx = \left[\frac{x^4}{4} + \frac{x^3}{3}\right]_{-1}^1 = \frac{2}{3}.$$
(Try to find these values using the definition of the integral, or by any other means.)
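One way to check such values without the fundamental theorem is a numerical sketch (not part of the original notes; the helper `quadrature` is a name chosen for illustration): the quadrature rule from the definition should agree with $F(b) - F(a)$.

```python
# Cross-check Example 6.41: compare the left-endpoint quadrature rule with
# F(b) - F(a) for the antiderivative F(x) = x^4/4 + x^3/3 of f(x) = x^3 + x^2.
def quadrature(f, a, b, n=200_000):
    h = (b - a) / n
    return h * sum(f(a + k * h) for k in range(n))

f = lambda x: x**3 + x**2
F = lambda x: x**4 / 4 + x**3 / 3

on_01 = F(1) - F(0)     # exactly 7/12
on_m11 = F(1) - F(-1)   # exactly 2/3
```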
Example 6.42. We now want to compute definite integrals of the natural logarithm $\ln(x)$. From Example 6.14 we know that $\int \ln(x)\,dx = x(\ln(x) - 1)$ for all $x > 0$, i.e., all $x$ such that $\ln(x)$ is defined. From this, we obtain, e.g., the value
$$\int_1^e \ln(x)\,dx = \big[x(\ln(x) - 1)\big]_1^e = e(\ln(e) - 1) - (\ln(1) - 1) = 1.$$
These examples already show that, with the techniques we've just learned, we can calculate integrals (or areas) precisely without much effort. (However, it is again essential that you know many (anti)derivatives and how to obtain them.)
We will finally present (again) the rules for integration, that we already discussed in the section
about antiderivatives. Although they can be directly deduced from there, we state them again
for definite integrals to clarify their meaning.
Let us start with integration by parts, which is also called partial integration.
Proof. This follows from Lemma 6.13 together with Theorem 6.38.
Remark 6.44. Note again that we did not define the derivative at a boundary point. Hence,
whenever we assume a function to be “continuously differentiable on [a, b]”, we actually mean
that the function is “continuous on [a, b] and continuously differentiable on (a, b)”. We think
this should not lead to any confusion.
Example 6.45. If we want to calculate $\int_0^\pi x \sin x\,dx$, we set $f'(x) = \sin x$ and $g(x) = x$, which implies that $g'(x) = 1$ and $f(x) = -\cos x$, see Example 6.15. Partial integration yields
$$\int_0^\pi x \sin x\,dx = \big[(-\cos x)x\big]_0^\pi - \int_0^\pi (-\cos x)\,dx = \big[(-\cos x)x\big]_0^\pi + \big[\sin x\big]_0^\pi = -\cos(\pi)\cdot\pi + \sin\pi = \pi.$$
Example 6.46. Again, we can use this formula also more than once, see Example 6.16. However, the involved formulas for definite integrals can sometimes be much simplified, in contrast to antiderivatives. For example, if we want to calculate $\int_0^1 x^2 e^x\,dx$, we set $g(x) = x^2$ and $f'(x) = e^x$,
Corollary 6.47 (Substitution rule). Let $I = [a,b]$, $f$ be continuous and $g$ be continuously differentiable on $I$. Then,
$$\int_a^b f(g(x))\,g'(x)\,dx = \int_{g(a)}^{g(b)} f(y)\,dy.$$
Note that, usually, we have a (complicated looking) integral like the one on the left and want to transform it into an easier one, like the one on the right. We will come to some examples soon.
Proof. Although this follows directly from Lemma 6.19 together with Theorem 6.38, we present
a proof here, because this one is slightly different and one can see where the fundamental theorem
comes in. First of all, we use the chain rule, see Theorem 5.14, to calculate
When we use this rule we can use the following (formally not completely correct) strategy:
1. We want to calculate $\int_a^b f(g(x))g'(x)\,dx$, i.e., we assume that the integral is of this form for some $f, g$.
2. Set $y = g(x)$.
3. Calculate $\frac{dy}{dx} = \frac{d}{dx} g(x) = g'(x)$. (Now check again if the integral is of the above form!)
4. Regroup to $dx = \frac{1}{g'(x)}\,dy$ and plug in.
5. Since the new integration variable is $y = g(x)$, we have to replace the limits of the integral by $g(a)$ and $g(b)$.
6. Consider the new integral $\int_{g(a)}^{g(b)} f(y)\,dy$.
This is of the form above with $f(x) = \sin(x)$ and $g(x) = 2x$. Following the above procedure, we set $y = 2x$ and calculate $\frac{dy}{dx} = 2$, which implies $dx = \frac{dy}{2}$. Hence,
$$\int_0^\pi \sin(2x)\,dx = \frac{1}{2}\int_0^{2\pi} \sin(y)\,dy = -\frac{1}{2}\big[\cos(y)\big]_0^{2\pi} = \frac{1}{2}\big(\cos(0) - \cos(2\pi)\big) = 0.$$
Once one gets used to this kind of substitution, it should be no problem to calculate integrals like
$$\int_0^2 \sin(7x - 11)\,dx = \frac{-\cos(3) + \cos(-11)}{7}$$
in a short time.
However, this is clearly wrong since the integral of a positive function must be positive. But
where is the mistake? The problem is that f is not continuous on [−1, 1], which is necessary for
the fundamental theorem. In fact, the function is not even defined at x = 0.
Although this looks like a drastic example, it also shows that it is important to verify all assumptions before we apply such a theorem. Otherwise, the result might be completely senseless.
So far we discussed how to calculate integrals (or areas below graphs) whenever the function and the corresponding interval are bounded. However, one might imagine that we can also calculate integrals of functions over unbounded intervals if the function is "small enough" for large $x$. By a similar reasoning, we can integrate functions with a pole at the boundary, i.e., functions that diverge to infinity at the boundary of the interval, if the divergence is "fast enough". We discuss both cases.
Let us start with unbounded intervals. In this case, we define the integrals
$$\int_a^\infty f\,dx := \lim_{b\to\infty} \int_a^b f\,dx,$$
$$\int_{-\infty}^b f\,dx := \lim_{a\to-\infty} \int_a^b f\,dx,$$
$$\int_{-\infty}^\infty f\,dx := \int_{-\infty}^0 f\,dx + \int_0^\infty f\,dx,$$
whenever the integrals and limits on the right hand side exist. We then say that the integrals on the left converge.
From the fundamental theorem of calculus, we know how to compute the finite integrals on the right hand side. That is, if $F$ is an antiderivative of $f$ (on the corresponding interval), then $\int_a^b f\,dx = F(b) - F(a)$. To compute the above limits, it is therefore enough to compute the limits for the antiderivative. That is, if we denote
$$F(-\infty) := \lim_{a\to-\infty} F(a) \qquad\text{and}\qquad F(\infty) := \lim_{b\to\infty} F(b),$$
then we obtain
$$\int_a^\infty f\,dx = [F]_a^\infty = F(\infty) - F(a),$$
$$\int_{-\infty}^b f\,dx = [F]_{-\infty}^b = F(b) - F(-\infty),$$
$$\int_{-\infty}^\infty f\,dx = [F]_{-\infty}^\infty = F(\infty) - F(-\infty),$$
whenever the corresponding limits exist.
Let us see some examples.
Example 6.52. Consider $f(x) = \frac{1}{x^\alpha} = x^{-\alpha}$ on $[1,\infty)$. An antiderivative of $f$ is given by $F(x) = \frac{1}{1-\alpha}\, x^{1-\alpha}$ for $\alpha \ne 1$, and $F(x) = \ln(x)$ for $\alpha = 1$. Noting that $F(x)$ converges to 0 when $x \to \infty$ if $\alpha > 1$, and diverges otherwise, we obtain
$$\int_1^\infty \frac{dx}{x^\alpha} = F(\infty) - F(1) = \begin{cases} \frac{1}{\alpha-1} & \text{for } \alpha > 1, \\ \infty & \text{for } \alpha \le 1. \end{cases}$$
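The convergence of these improper integrals can be seen numerically by letting the upper limit grow (a sketch, not part of the original notes; the helper `tail` is a name chosen for illustration):

```python
# Improper integral of x^(-alpha) over [1, infinity): for alpha > 1 the finite
# integrals F(b) - F(1) approach 1/(alpha - 1) as the upper limit b grows.
def tail(alpha, b):
    F = lambda x: x ** (1.0 - alpha) / (1.0 - alpha)  # antiderivative, alpha != 1
    return F(b) - F(1.0)

val2 = tail(2.0, 1e9)   # limit is 1/(2 - 1) = 1
val3 = tail(3.0, 1e9)   # limit is 1/(3 - 1) = 1/2
```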
From this example we can deduce the following general rule for improper integrals. Note that
we had a very similar statement for series, see Lemma 3.91.
Proof. Exercise.
By the above arguments, we see that it is not necessary for the antiderivative to be defined at
the limits of the integral. (Note that a function on R is never defined at ±∞. We can only
compute limits.) This can also be used if a function is defined, and has an antiderivative, on an
open interval (a, b), but not on the boundary points.
Example 6.55. Consider for example the functions $f(x) = \frac{1}{x}$ or $f(x) = \ln(x)$ on $(0,1)$. Both functions are continuous on $(0,1)$, and therefore have an antiderivative on $(0,1)$. However, for computing the integral $\int_0^1 f\,dx$ by using the fundamental theorem directly, we would need the values $F(0)$ and $F(1)$. But, since both functions are not even defined at 0, it makes no sense to ask for a function whose derivative equals $f$ at 0, i.e., there cannot be an antiderivative at 0.
Consider a continuous function $f : (a,b) \to \mathbb{R}$, and let $F : (a,b) \to \mathbb{R}$ be one of its antiderivatives. We then define
$$F(a) := \lim_{x\searrow a} F(x) \qquad\text{and}\qquad F(b) := \lim_{x\nearrow b} F(x),$$
and therefore
$$\int_0^1 \ln x\,dx = F(1) - F(0) = -1.$$
Example 6.58. Consider $f(x) = \frac{1}{x^\alpha} = x^{-\alpha}$ on $(0,1]$. An antiderivative of $f$ is given by $F(x) = \frac{1}{1-\alpha}\, x^{1-\alpha}$ for $\alpha \ne 1$, and $F(x) = \ln(x)$ for $\alpha = 1$. Therefore, $F(1) = \frac{1}{1-\alpha}$ for $\alpha \ne 1$. Noting that $F(x)$ converges to 0 when $x \to 0$ if $\alpha < 1$, and diverges otherwise, we obtain
$$\int_0^1 \frac{dx}{x^\alpha} = F(1) - F(0) = \begin{cases} \frac{1}{1-\alpha} & \text{for } \alpha < 1, \\ \infty & \text{for } \alpha \ge 1. \end{cases}$$
In particular, we obtain $\int_0^1 \frac{dx}{\sqrt{x}} = 2$ (for $\alpha = \frac{1}{2}$) and $\int_0^1 x\,dx = \frac{1}{2}$ (for $\alpha = -1$), but $\int_0^1 \frac{dx}{x}$ does not converge.
Remark 6.59. Note that, by the Examples 6.52 and 6.58, we have that $\int_0^\infty \frac{dx}{x^\alpha} = \infty$ for all $\alpha \in \mathbb{R}$.
Example 6.60. Show that the results of Examples 6.52 and 6.58 are equivalent by using the substitution rule with $f(x) = g(x) = \frac{1}{x}$.
Let us finally comment on functions which have some desired properties only piecewise.
Definition 6.61. Let $I = [a,b]$. We say that a function $f : I \to \mathbb{R}$ is piecewise continuous if and only if there exist a finite number of points $x_1, \dots, x_m \in I$ such that
1. $f$ is continuous on each of the open subintervals determined by $x_1, \dots, x_m$, and
2. the limits $\lim_{x\nearrow x_k} f(x)$ and $\lim_{x\searrow x_k} f(x)$ exist and are finite.
A simple example of a function for which such piecewise considerations might be necessary is the indicator function of an interval $[c,d] \subset \mathbb{R}$, i.e.,
$$\chi_{[c,d]}(x) := \begin{cases} 1, & \text{if } x \in [c,d], \\ 0, & \text{if } x \notin [c,d]. \end{cases}$$
However, one might also think about other piecewise defined functions, like
$$f(x) := \begin{cases} -x^2, & \text{if } x < 0, \\ 2x^2 + 1, & \text{if } x \in [0,1], \\ x, & \text{if } x > 1. \end{cases}$$
These functions are clearly not continuous on $\mathbb{R}$. However, when restricted to the individual "pieces" of the functions, they are continuous. (Actually, both functions are infinitely often differentiable on each subinterval.) Since also the needed (one-sided) limits are finite, both functions are piecewise continuous.
Now, to compute the integral of such piecewise continuous functions, we can just split the integral into the corresponding parts and then use the respective rules for calculating integrals. That is, if $f : [a,b] \to \mathbb{R}$ is a piecewise continuous function with exceptional points $x_1, \dots, x_m$, then we use
$$\int_a^b f\,dx = \int_a^{x_1} f\,dx + \sum_{k=1}^{m-1} \int_{x_k}^{x_{k+1}} f\,dx + \int_{x_m}^b f\,dx.$$
Since $f$ is now a continuous function on each subinterval, this expression is well-defined. For example, we easily obtain
$$\int_{-\infty}^\infty \chi_{[c,d]}\,dx = \int_{-\infty}^c 0\,dx + \int_c^d 1\,dx + \int_d^\infty 0\,dx = d - c.$$
However, note that the subintervals $(x_k, x_{k+1})$ are (by Definition 6.61) open intervals. Therefore, formally, we need to treat the integrals as improper integrals. The assumption about the one-sided limits ensures that these integrals always exist. (One might argue with a continuous extension of $f$ to the closed interval $[x_k, x_{k+1}]$.)
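The splitting strategy above can be sketched numerically for the indicator example (not part of the original notes; `quadrature` is an illustrative helper name):

```python
# Integrating the indicator of [c, d] piecewise: split the domain at the jump
# points c and d, integrate each continuous piece, and sum the results.
def quadrature(f, a, b, n=100_000):
    h = (b - a) / n
    return h * sum(f(a + k * h) for k in range(n))

c, d = 2.0, 5.0
chi = lambda x: 1.0 if c <= x <= d else 0.0

# Only the middle piece contributes; the total should be close to d - c = 3.
total = quadrature(chi, -10.0, c) + quadrature(chi, c, d) + quadrature(chi, d, 10.0)
```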
Remark 6.62. The assumption about the one-sided limits could be relaxed a bit. However, since we want to say that "every piecewise continuous function is integrable", we need an assumption that excludes, e.g., $\frac{1}{x}$ from being piecewise continuous.
7 Fourier series
As discussed several times, it is the major task of natural and applied science to give an approx-
imation of reality (which is actually a complicated function). We have already used derivatives
to obtain approximations of functions by using Taylor’s theorem, see Theorem 5.53 and Corol-
lary 5.57. Although this is often very useful in theory, it has several drawbacks when it comes
to actual computations. First of all, the quality of the Taylor polynomial of a function is limited
by its smoothness, i.e., how often the function is differentiable, and the size of the domain. This
is clearly unsatisfactory, as an actual computational problem may not be “nice” in this respect.
For example, classification problems are naturally not concerned with smooth functions.
In this section we introduce and discuss Fourier series, which is a particularly famous way of writing functions as series based on certain integrals. The main idea is that functions can be written as superpositions (or sums) of wave functions, represented by certain multiples of cos and sin. That is, for many functions $f : [0,1) \to \mathbb{R}$, we can find numbers $a_k, b_k$ such that the trigonometric polynomials
$$\sum_{k=0}^n a_k \cos(2\pi k x) + b_k \sin(2\pi k x) \approx f(x)$$
are "good" approximations of $f$ for large enough $n$, where $a_k, b_k$ must depend on $f$, much like the coefficients in the Taylor polynomial. In particular, one may recover $f$ if we send $n \to \infty$.
Before we come to precise definitions, let us give some comments on the (partly recent) history
of the theory of Fourier series.
Remark 7.1 (History). Fourier series and the corresponding Fourier analysis are nowadays
of remarkable significance in a lot of applied sciences, especially in physics (acoustics, optics,
astrophysics) but also in signal processing, cryptography, oceanography and economics.
The first attempts of using trigonometric series for the approximation of functions date back
to 1740 when the mathematicians Daniel Bernoulli (1700–1782) and Jean-Baptiste le Rond
d’Alembert (1717–1783) discussed this possibility. But only when Jean Baptiste Joseph Fourier
(1768–1830) presented his famous work “Théorie analytique de la chaleur” (“The analytical the-
ory of heat”) in 1822, it became apparent how powerful these techniques are. Fourier managed
to solve the heat equation (in one dimension) by using the series, which are by now referred to
as Fourier series.
At this time it was generally thought that every continuous function can be written as such a series. However, one of the first actual convergence results is due to Peter Gustav Lejeune Dirichlet (1805–1859). In 1829 he proved that the Fourier series of a Lipschitz continuous function converges. In order to treat Fourier series, Bernhard Riemann (1826–1866) actually invented his definition of an integral (the Riemann integral) and discovered the so-called localization theorem in 1853. It took until 1876 for Paul du Bois-Reymond (1831–1889) to find a continuous function whose Fourier series diverges at some point. This was a big surprise at that time. However, in 1904 the Hungarian mathematician Leopold Fejér (1880–1959) could show that for any continuous function, the arithmetic means of the partial sums of its Fourier series converge. This means that we can recover any continuous function from its Fourier coefficients, but we need to be careful how to use them. This was a major breakthrough, and led to big advances in theoretical and applied sciences.
The final word on this problem was given only in 1966 when the Swedish mathematician Lennart
Carleson (1928–now) showed that the Fourier series of any square-integrable function converges
almost everywhere. (We will see later what this means.) This was a question posed in 1915 by
Nikolai Nikolajewitsch Lusin (1883–1950), and Carleson became world-famous for this.
As with all the theory so far, and all that will come, we need assumptions about the functions under consideration. Here, we will mostly assume that the functions $f : [0,1) \to \mathbb{R}$ are (piecewise) continuous. This is to ensure that the integrals that we use are well-defined. This assumption is not necessary for many claims, but we do not have the theoretical background to treat more general cases so far.
Moreover, as we want to write functions as sums of cosine and sine functions, which are obviously
periodic when considered as functions on the real line, it is natural to assume the same for the
functions under consideration.
Note that a periodic function is completely known if we know its function values on $[0,1)$. All other function values on $\mathbb{R}$ follow from periodicity. E.g., $f(0) = f(1)$ and $f(\frac{7}{3}) = f(\frac{1}{3})$ for periodic functions. That's why we also call functions defined on $[0,1)$ periodic functions, and mean by this its periodic extension to $\mathbb{R}$, i.e., we define $f(x) := f(\{x\})$ for $x \in \mathbb{R}\setminus[0,1)$, where $\{x\} := x - \lfloor x\rfloor \in [0,1)$ is the fractional part of $x$. See Figure 44 for the periodic extensions of the functions $f, g : [0,1) \to \mathbb{R}$ with $f(x) = x$ and $g(x) = \frac{1}{2} - |\{x\} - \frac{1}{2}|$, which are called sawtooth wave and triangle wave.
(Note that, by the periodic extension, we have f (2) = f (0) = 0 and not f (2) = 2.)
This is useful when describing properties of a function, which include the boundary points. For
example, the above functions show that, while both f and g are continuous on [0, 1), only the
triangle wave g is continuous when considered as a periodic function. We fix this notation in
the following definition.
Definition 7.3. We say that a periodic function f : [0, 1) → C has a property if and only
if its periodic extension to R, i.e., the function f : R → C with f (x) = f ({x}), has this
property.
For example, when we say that a periodic function $f$ is continuous, then this also implies that the function values at the boundary coincide, or, more precisely, that $\lim_{x\searrow 0} f(x) = \lim_{x\nearrow 1} f(x)$. The same statements hold for differentiability and so on. In particular, note that the left function in Figure 44, i.e., the sawtooth wave, is also (infinitely often) differentiable if considered as a function on $(0,1)$. However, as a periodic function it is not even continuous.
Let us come to the most important examples, which are called trigonometric monomials, in
correspondence to the (algebraic) monomials xk for k ∈ N.
Example 7.4 (Trigonometric monomials). The functions cos(2πkx) and sin(2πkx) are infinitely
often differentiable and 1-periodic functions for any k ∈ Z. Verify this yourself and make plots
of the functions for different k.
Remark 7.5 (Other periods). It is somewhat arbitrary to choose 1 as the length of a period. Another standard choice would be to consider $2\pi$-periodic functions, i.e., functions that satisfy $f(x) = f(x + 2\pi)$, like e.g. $\cos(x)$. We chose this normalization to ease the notation. More generally, one can study functions with an arbitrary period $\omega > 0$, i.e., with $f(x + \omega) = f(x)$ for all $x \in \mathbb{R}$, but note that, in this case, the function $x \mapsto f(\omega \cdot x)$ is always a 1-periodic function.
Although we considered so far mostly real-valued functions, it is very natural to study Fourier series directly for complex-valued functions. For this, note that every complex-valued function $f : \mathbb{R} \to \mathbb{C}$ can be written as $f(x) = u(x) + i \cdot v(x)$, where $u, v : \mathbb{R} \to \mathbb{R}$ are both real-valued, and $i = \sqrt{-1}$. We call such a function $f$ continuous/differentiable/etc., if the same is true for the real part $u$ and the imaginary part $v$.
An especially important class of periodic functions are the trigonometric polynomials. These are sums of the cosine and sine functions discussed in Example 7.4. Recall that (algebraic) polynomials were functions $p : \mathbb{R} \to \mathbb{R}$ of the form $p(x) = \sum_{k=0}^n a_k x^k$. We showed that, under certain assumptions, a function can be approximated quite well by using algebraic polynomials. This was Taylor's theorem 5.53.
We now want to have similarly simple “building blocks” also for periodic functions, and it
turns out that the functions cos(2πkx) and sin(2πkx) are suitable. However, as it simplifies the
notation a lot and is a very useful tool, we use Euler's formula, $e^{2\pi i k x} = \cos(2\pi k x) + i \sin(2\pi k x)$, and take these 1-periodic functions as building blocks for trigonometric polynomials. But note that we also have to consider "negative frequencies" then.
where $n \in \mathbb{N}$ and $c_{-n}, \dots, c_n \in \mathbb{C}$ are called the coefficients of the trigonometric polynomial.
Note that this very short notation for a trigonometric polynomial can clearly be written out by using cosine and sine terms. In particular, by using Euler's formula, we obtain that
$$p(x) = \sum_{k=-n}^n c_k e^{2\pi i k x} = \sum_{k=-n}^n c_k \big(\cos(2\pi k x) + i \sin(2\pi k x)\big) = \sum_{k=-n}^n c_k \cos(2\pi k x) + i \sum_{k=-n}^n c_k \sin(2\pi k x).$$
However, this representation can be simplified further by using that cosine is even, i.e., $\cos(-x) = \cos(x)$, and that sine is an odd function, i.e., $\sin(-x) = -\sin(x)$. From this we get
$$p(x) = c_0 + \sum_{k=1}^n (c_k + c_{-k}) \cos(2\pi k x) + i \sum_{k=1}^n (c_k - c_{-k}) \sin(2\pi k x).$$
(Check this in detail!) We may therefore write trigonometric polynomials, as in Definition 7.6, in the form
$$p(x) = a_0 + \sum_{k=1}^n a_k \cos(2\pi k x) + \sum_{k=1}^n b_k \sin(2\pi k x),$$
if we set $a_0 = c_0$ and, for $k \ge 1$,
$$a_k := c_k + c_{-k}, \qquad b_k := i(c_k - c_{-k}). \tag{7.1}$$
Example 7.7. The trigonometric polynomial $p(x) = e^{6\pi i x} + e^{-6\pi i x}$ can be written as $p(x) = 2\cos(6\pi x)$. More generally, every trigonometric polynomial with all $c_k \in \mathbb{R}$ such that $c_k = c_{-k}$ gives a sum of cosines with real coefficients.
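Example 7.7 can be checked at a few sample points (a numerical sketch, not part of the original notes):

```python
import cmath
import math

# Example 7.7: e^{6 pi i x} + e^{-6 pi i x} is purely real and equals
# 2*cos(6 pi x) for every x.
def p(x):
    return cmath.exp(6j * math.pi * x) + cmath.exp(-6j * math.pi * x)

samples = [0.0, 0.1, 0.37, 0.5, 0.99]
values = [p(x) for x in samples]
```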
Example 7.8. Write p(x) = sin(2πx) in the form given in Definition 7.6.
From the above relations, we can deduce quite a bit of information about a trigonometric polynomial just by looking at its coefficients. In particular, we can tell whether a trigonometric polynomial is indeed real-valued, i.e., whether $p(x) \in \mathbb{R}$ for every $x \in \mathbb{R}$.
Lemma 7.10. Let $p$ be a trigonometric polynomial in the form given in Definition 7.6. Then, $p$ is real-valued, i.e., $p : \mathbb{R} \to \mathbb{R}$, if and only if $c_{-k} = \overline{c_k}$ for all $k$.
Remark 7.11 (Function defined on a circle). Another way of looking at periodic functions
is to assume they are defined on a circle. For this note that, instead of assuming that the
function has “the same behavior” at the endpoints of the interval [0, 1), we may also assume
that the endpoints are just the same, i.e., we assume 0 = 1. We can then talk about properties
like continuity when “x goes around the circle”, and there is no distinguished point like a
boundary. Mathematically, there are many ways of modeling this. The most prominent is to consider functions defined on the complex unit circle $S^1 := \{z \in \mathbb{C} : |z| = 1\}$. This also gives a justification of the name trigonometric polynomial since, with the parametrization $z = e^{2\pi i t}$, it can be written as $z \mapsto \sum_{k=-n}^n c_k z^k$, $z \in S^1$, which looks like an algebraic polynomial.
We now turn to the approximation of periodic functions by trigonometric polynomials. For this,
we need the so-called Fourier coefficients of a function. These values are then used to build
up an approximation of the function, the Fourier series. Note that in a similar way, we used
derivative values to obtain approximations by algebraic polynomials, by using Taylor’s theorem.
the Fourier series of f . (We use this notation also if the limit does not exist.)
We say that the Fourier series equals f (pointwise) if f (x) = Sf (x) for all x ∈ [0, 1).
Note that the Fourier coefficients, and therefore the partial sums of the Fourier series, are well-defined if the involved integrals are. So, in particular, for any piecewise continuous function $f$ on $[0,1)$. (We do not even need that $f$ is continuous as a periodic function.) However, the Fourier series does not necessarily make sense (i.e., converge), and, even if it does, it need not be equal to $f$ everywhere. Before we see this with an easy example, let us state the derivative and the (indefinite) integral of our basic building blocks.
We have
$$\frac{d}{dx}\, e^{2\pi i k x} = (2\pi i k)\, e^{2\pi i k x} \qquad\text{and}\qquad \int e^{2\pi i k x}\,dx = \frac{e^{2\pi i k x}}{2\pi i k}$$
for all $x \in \mathbb{R}$. In particular, we obtain
$$\int_0^1 e^{2\pi i k x}\,dx = 0$$
for $k \ne 0$, and $\int_0^1 e^{2\pi i \cdot 0 \cdot x}\,dx = 1$.
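This key identity can be illustrated with a discretized average (a numerical sketch, not part of the original notes; `mean_exp` is a name chosen for illustration):

```python
import cmath
import math

# The average of e^{2 pi i k x} over equidistant points in [0, 1) is 0 for
# k != 0 (the summands go around the unit circle) and 1 for k = 0.
def mean_exp(k, n=10_000):
    return sum(cmath.exp(2j * math.pi * k * j / n) for j in range(n)) / n

m3 = mean_exp(3)   # numerically 0
m0 = mean_exp(0)   # exactly 1
```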
If we accept that Theorem 5.11 and Lemma 6.9, i.e., linearity of differentiation and integration,
also hold for complex-valued functions, then a proof is straightforward, and we omit it.
We first discuss the Fourier series of a trigonometric polynomial. Similarly as algebraic poly-
nomials can be represented exactly by some finite Taylor polynomial, it is probably no surprise
that some (large enough) partial sum of the Fourier series is exact for trigonometric polynomials.
We give a detailed proof here for demonstration.
$$\hat{p}(k) = \sum_{j=-N}^N c_j \int_0^1 e^{2\pi i (j-k) x}\,dx.$$
(Note that we were able to switch integral and sum only because the sum is finite.) We have
$$\int_0^1 e^{2\pi i (j-k) x}\,dx = \begin{cases} 1 & \text{if } j = k, \\ \left[\frac{1}{2\pi i (j-k)}\, e^{2\pi i (j-k) x}\right]_0^1 = 0 & \text{if } j \ne k, \end{cases}$$
such that
$$\hat{p}(k) = \sum_{j=-N}^N c_j\, \delta_{jk}.$$
In other words,
$$\hat{p}(k) = \begin{cases} c_k & \text{if } -N \le k \le N, \\ 0 & \text{otherwise.} \end{cases}$$
This shows that $S_n p = p$ for all $n \ge N$, and therefore $Sp = \lim_{n\to\infty} S_n p = p$.
We continue by computing the Fourier series of a function that is not a trigonometric polynomial.
This example will show, when we finally finish it, that Fourier series are sometimes very helpful
in computing complicated sums.
Example 7.15. Consider the periodic function $f(x) = x$ on $[0,1)$ (i.e., we consider the function $f(x) = \{x\}$ on $\mathbb{R}$). By using the above, and integration by parts, we obtain $\hat{f}(0) = \frac{1}{2}$ and, for $k \ne 0$,
$$\hat{f}(k) = \int_0^1 f(x)\, e^{-2\pi i k x}\,dx = \int_0^1 x\, e^{-2\pi i k x}\,dx = \left[\frac{x\, e^{-2\pi i k x}}{-2\pi i k}\right]_0^1 - \frac{1}{-2\pi i k}\int_0^1 e^{-2\pi i k x}\,dx = \frac{1}{-2\pi i k}.$$
We therefore obtain the Fourier series
$$Sf(x) = \frac{1}{2} + \sum_{k\ne 0} \frac{e^{2\pi i k x}}{-2\pi i k}.$$
Although the last series looks like the (divergent) harmonic series, we will see later that it is actually convergent, and equals $f(x) = x$, for any $x \in (0,1)$. However, we already see now that the series is not equal to $f$ at all $x \in [0,1)$, as $f(x) = Sf(x)$ is false for $x = 0$. To see this, note that at $x = 0$ the terms for $k$ and $-k$ in the above series cancel each other. Therefore, the series converges to $Sf(0) = \frac{1}{2}$. (Indeed, $S_n f(0) = \frac{1}{2}$ for all $n$, see also Figure 45.) However, we clearly have $f(0) = 0$. This shows that we need to be careful with the points of convergence of a Fourier series.
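The partial sums of this series can be evaluated directly (a numerical sketch, not part of the original notes; the helper `S` is a name chosen for illustration):

```python
import cmath
import math

# Partial sums S_n f of the sawtooth series from Example 7.15:
# S_n f(x) = 1/2 + sum over 0 < |k| <= n of e^{2 pi i k x} / (-2 pi i k).
def S(n, x):
    s = 0.5 + 0j
    for k in range(1, n + 1):
        s += cmath.exp(2j * math.pi * k * x) / (-2j * math.pi * k)
        s += cmath.exp(-2j * math.pi * k * x) / (2j * math.pi * k)
    return s.real

at_zero = S(50, 0.0)    # stays at 1/2, although f(0) = 0
inside = S(400, 0.3)    # approaches f(0.3) = 0.3
```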
Remark 7.16 (Non-convergence). The above example shows that continuity on $[0,1)$ is not enough for all functions to be representable by their Fourier series. The reason is that $f(x) = \{x\}$ is not continuous when considered as a periodic function. (It has jumps at $x \in \mathbb{Z}$.) And if we plot the first partial sums of the Fourier series, see Figure 45, we see that the approximation close to these jumps is also not very good. One may even prove that the maximal deviation $\max_{x\in[0,1)} |S_n f(x) - f(x)|$ does not converge to 0 (the persistent overshoot near the jump is known as the Gibbs phenomenon), which shows that the approximation can stay bad for all finite $n$. Moreover, it is possible to show that there is also a periodic and continuous function whose Fourier series diverges at a point. Both statements go far beyond this lecture.
Let us consider another Fourier series before we turn to statements about convergence.
Example 7.17. Consider the periodic function f : [0, 1) → C that is given by f(x) = (x − 1/2)². Note that f is continuous (as a periodic function). We obtain the Fourier coefficients

    f̂(k) = ∫_0^1 (x − 1/2)² e^{−2πikx} dx = ∫_{−1/2}^{1/2} t² e^{−2πik(t+1/2)} dt
         = e^{−πik} ∫_{−1/2}^{1/2} t² e^{−2πikt} dt = (−1)^k ∫_{−1/2}^{1/2} t² e^{−2πikt} dt,

which yields f̂(0) = 1/12 and, after two further integrations by parts, f̂(k) = 1/(2(πk)²) for k ≠ 0. Pairing the terms for k and −k, we obtain the Fourier series

    Sf(x) = 1/12 + Σ_{k=1}^∞ cos(2πkx)/(πk)².
Note that these sums are clearly absolutely convergent (Why?), but, so far, we do not know if
Sf equals f at any point.
However, when we have a look at the first partial sums of this Fourier series, it seems to converge very fast: already for n = 20 we see almost no difference to the original function, see Figure 46. It even looks as if the partial sums converge uniformly to f.
If we could prove that Sf equals f, i.e., that the partial sums converge pointwise to the function, then, in particular, we would have that Sf(0) = f(0). Noting that f(0) = 1/4 and that all the cosine terms in the above series equal 1 at x = 0, this would imply that

    1/12 + Σ_{k=1}^∞ 1/(πk)²  =?  Sf(0) = f(0) = 1/4,

and therefore

    Σ_{k=1}^∞ 1/k²  =?  π²/6.
This formula is correct (and by the way quite beautiful), but we did not prove it so far.
It still remains to show that the Fourier series converges to f at every point. We will do this by presenting a general rule, based on an assumption on the Fourier coefficients, under which a Fourier series converges at all points. This assumption is fulfilled, e.g., by twice continuously differentiable periodic functions.
Remark 7.18. The method above is probably the most powerful technique for computing certain infinite sums. In some cases, there is even no other (manageable) way. One may try, e.g., to compute Σ_{k=1}^∞ 1/k² by any other means.
However, we clearly cannot evaluate every series by this. One example of this kind is Σ_{k=1}^∞ 1/k³. (For practice, try to find the value of this series. But do not try too long! There is no known explicit form of this number.)
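Numerically, the difference between the two series is easy to observe: the partial sums of Σ 1/k² approach π²/6, while those of Σ 1/k³ stabilize at approximately 1.2020569, a number without a known closed form. A small sketch (function names are ours):

```python
import math

def partial(s, n):
    """Partial sum sum_{k=1}^n 1/k^s."""
    return sum(1.0 / k ** s for k in range(1, n + 1))

print(partial(2, 100000), math.pi ** 2 / 6)  # the partial sums approach pi^2/6
print(partial(3, 100000))                    # approx 1.2020569, no closed form known
```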
Remark 7.19. Using the computations from page 214 (with ck := f̂(k)), we can write the partial sums of the Fourier series solely in terms of sine and cosine functions, as indicated in the introduction. That is, we can write

    Sn f(x) = a0 + Σ_{k=1}^n ak cos(2πkx) + Σ_{k=1}^n bk sin(2πkx)

with a0 := f̂(0) and

    ak := 2 ∫_0^1 f(x) cos(2πkx) dx   and   bk := 2 ∫_0^1 f(x) sin(2πkx) dx

for k ∈ N. (Verify this using (7.1).) However, this form has almost no advantages and it is often more work to compute the ak and bk separately, instead of just the ck. That is why we usually work with the Fourier coefficients as given above.
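The identity behind this remark can be checked numerically: with ck = f̂(k) one has ak = ck + c−k and bk = i(ck − c−k). The sketch below (the helper names and the simple midpoint-rule quadrature are ours, not from the lecture) verifies this for the example function f(x) = (x − 1/2)².

```python
import cmath, math

def quad(g, n=4000):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    return sum(g((j + 0.5) / n) for j in range(n)) / n

f = lambda x: (x - 0.5) ** 2   # the example function from the text

def c(k):   # complex coefficient c_k = int_0^1 f(x) e^{-2 pi i k x} dx
    return quad(lambda x: f(x) * cmath.exp(-2j * math.pi * k * x))

def a(k):   # cosine coefficient a_k = 2 int_0^1 f(x) cos(2 pi k x) dx
    return 2 * quad(lambda x: f(x) * math.cos(2 * math.pi * k * x))

def b(k):   # sine coefficient b_k = 2 int_0^1 f(x) sin(2 pi k x) dx
    return 2 * quad(lambda x: f(x) * math.sin(2 * math.pi * k * x))

for k in (1, 2, 3):
    print(k, a(k), (c(k) + c(-k)).real, b(k), (1j * (c(k) - c(-k))).real)
```

For this f the sine coefficients vanish (f is symmetric around 1/2), while a1 agrees with the exact value 1/π² from Example 7.17.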
Remark 7.20. As stated in the beginning, it was rather arbitrary to choose the period 1. If one chooses the period 2π (which is another prominent choice), then the Fourier coefficients are usually defined by

    f̂(k) := (1/(2π)) ∫_0^{2π} f(x) e^{−ikx} dx.

This implies that also the Fourier series of functions may look different. (And they are different, because of the different domain/period.) Therefore, you should be careful when using other literature.
We now turn to a result about point-wise convergence of Fourier series, i.e., we want to know if the partial sums of the Fourier series of a function converge to this function at all points. This is clearly a desirable property, and we will show that this holds for functions whose Fourier coefficients are absolutely summable, i.e., Σ_{k∈Z} |f̂(k)| < ∞.
Definition 7.21. Let D ⊂ R and let fn, f : D → C for n ∈ N.
(i) If

    lim_{n→∞} fn(x) = f(x)   for all x ∈ D,

then we say that (fn) converges point-wise to f. We use the notation fn →pw f or "fn → f point-wise".
(ii) If

    lim_{n→∞} sup_{x∈D} |fn(x) − f(x)| = 0,

then we say that (fn) converges uniformly to f, and write "fn → f uniformly".
Example 7.22. Consider fn(x) = x^n on D = [0, 1). Then, since we know that x^n → 0 for every x ∈ [0, 1), we have that fn →pw 0. However, we have for every fixed n ∈ N that

    sup_{x∈[0,1)} |fn(x) − 0| = sup_{x∈[0,1)} x^n = 1.

Since this does not converge to 0, we obtain that fn is not uniformly convergent (to 0). Roughly speaking, the sequence x^n converges arbitrarily slowly to 0 (depending on x), and therefore not uniformly.
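This failure of uniform convergence is easy to see numerically; a minimal sketch (with our own helper name) that evaluates the supremum on a fine grid:

```python
def sup_on_grid(n, m=10000):
    """Maximum of |x^n - 0| over the grid points j/m in [0, 1)."""
    return max((j / m) ** n for j in range(m))

# The supremum stays close to 1 for every n: no uniform convergence ...
print(sup_on_grid(5), sup_on_grid(50))
# ... although for each fixed x < 1 the values x^n do tend to 0:
print(0.9 ** 200)
```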
Example 7.23. Another example of a sequence of functions, and its limit, that we discussed already is the difference quotient: For this, let f : (a, b) → R be a continuous function, and define the sequence of functions

    fn(x) = ( f(x + 1/n) − f(x) ) / (1/n).

We know that fn →pw f′, i.e., fn converges point-wise to the derivative of f, if f is differentiable. One can also show that fn converges uniformly to f′ if f is continuously differentiable. (We do not need that here.)
We do not want to go too much into detail here. In most cases, we will just talk about point-wise convergence, which should be easy to comprehend, since we have actually been working with this type of convergence for some time. However, many results hold directly for the much stronger uniform convergence, and that is why we also state it. Moreover, it is sometimes a helpful tool in proving our results. Let us briefly comment on the power of this type of convergence: In general, we do not know in advance properties of the point-wise limit of a sequence of functions. However, if fn → f uniformly, then some properties are preserved. For example, if every fn is continuous, then we obtain that also f is continuous. This is a very powerful insight!
(To see that this is false without uniform convergence, consider e.g., fn (x) = xn on [0, 1] which
converges pointwise to a discontinuous function. Which one?)
In the sequel we will need only two of the properties that are preserved under uniform conver-
gence. We state them in the following lemma.
Lemma 7.24. Let D ⊂ R and let (fn) be a sequence of continuous functions fn : D → C that converges uniformly to f : D → C. Then:
(i) f is continuous on D.
(ii) If D = [a, b], then lim_{n→∞} ∫_a^b fn(x) dx = ∫_a^b f(x) dx.
Note that the second part of this lemma actually states that we can "interchange" the limit (in f(x) = limn→∞ fn(x)) and the integral. Since the terms of the sequence are usually much simpler functions than the limit, this gives a nice way of computing integrals. Moreover, the first part enables us to show that the limit of continuous functions is continuous, even if we do not know the limit precisely. This is particularly useful when working with Fourier series, because in this case (fn = Sn f) the functions fn are obviously continuous. Hence, uniform convergence of the partial sums of a Fourier series directly implies continuity of the limit.
Proof. To prove part (i), we need to show that limm→∞ |f(xm) − f(x0)| = 0 for every x0 ∈ D and every sequence (xm)m∈N ⊂ D with xm → x0. Fix some ε > 0. By the uniform convergence of fn → f, we obtain that there is some n0 ∈ N such that |fn(y) − f(y)| < ε/2 for all n ≥ n0 and all y ∈ D. Therefore,

    |f(xm) − f(x0)| = |f(xm) − fn(xm) + fn(xm) − fn(x0) + fn(x0) − f(x0)|
                    ≤ |f(xm) − fn(xm)| + |fn(xm) − fn(x0)| + |fn(x0) − f(x0)|
                    < ε/2 + |fn(xm) − fn(x0)| + ε/2 = |fn(xm) − fn(x0)| + ε

for all n ≥ n0. Now, since all fn are continuous, the right hand side converges to ε as m → ∞. That is, we have

    limsup_{m→∞} |f(xm) − f(x0)| ≤ ε.

Since this holds for all ε > 0, this proves part (i).
For the second part, we use the triangle inequality to obtain

    | ∫_a^b f dx − ∫_a^b fn dx | ≤ ∫_a^b |f(x) − fn(x)| dx.

By uniform convergence, sup_x |f(x) − fn(x)| < ε/(b − a) for all n ≥ n0 with n0 large enough. This implies

    | ∫_a^b f dx − ∫_a^b fn dx | < (ε/(b − a)) ∫_a^b 1 dx = ε

for n ≥ n0. This implies the result.
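Part (ii) of the lemma can be illustrated numerically. The following sketch is our own construction, not from the notes: the sequence fn(x) = x + sin(nx)/n is just one convenient example of uniform convergence (sup_x |fn(x) − x| ≤ 1/n), and we compare ∫_0^1 fn with ∫_0^1 x dx = 1/2.

```python
import math

def trapz(g, a, b, m=2000):
    """Composite trapezoidal rule for the integral of g over [a, b]."""
    h = (b - a) / m
    return h * (g(a) / 2 + sum(g(a + j * h) for j in range(1, m)) + g(b) / 2)

# f_n(x) = x + sin(n x)/n converges uniformly to f(x) = x on [0, 1],
# so the integrals must converge to int_0^1 x dx = 1/2.
for n in (1, 10, 100):
    fn = lambda x, n=n: x + math.sin(n * x) / n
    print(n, trapz(fn, 0.0, 1.0))
```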
Theorem 7.25. Let f : [0, 1) → C be continuous. If Σ_{k=−∞}^∞ |f̂(k)| < ∞ holds, then Sn f converges uniformly to f. In particular, Sf(x) = f(x) for all x ∈ [0, 1).
The proof of this theorem makes use of the following lemma, which says that two distinct
functions can be distinguished by their Fourier coefficients. This is a natural statement, and
one of the fundamental bases of Fourier analysis. However, a formal proof would be rather
complicated, and we omit it here.
Lemma 7.26. Let f, g : [0, 1) → C be continuous functions such that f̂(k) = ĝ(k) for all k ∈ Z. Then,

    f(x) = g(x)   for all x ∈ [0, 1).

Remark 7.27. The statement of Lemma 7.26 is equivalent to the statement that for every continuous function f ≠ 0, there exists some k ∈ Z such that f̂(k) ≠ 0.
Proof of Theorem 7.25. To simplify notation, we set a_k := f̂(k) for k ∈ Z. According to the assumptions, the sequence of partial sums (Σ_{k=−n}^n |a_k|)_{n≥0} converges, because it is a monotone and bounded sequence. In particular, it is a Cauchy sequence. This means that for every ε > 0 there exists n0 ∈ N, such that for all n ≥ n0 we have

    Σ_{k : |k|>n} |a_k| ≤ ε.
This yields, for all n ≥ n0 and all x ∈ [0, 1),

    |Sf(x) − Sn f(x)| = | Σ_{|k|>n} a_k e^{2πikx} | ≤ Σ_{k : |k|>n} |a_k| ≤ ε,

where we use |e^{2πikx}| = 1. Since this holds for all ε > 0, we see that Sn f converges uniformly to Sf. It remains to show that Sf(x) = f(x) for all x. Since we want to use Lemma 7.26 for this, we need to show that both f and Sf are continuous, and that their Fourier coefficients coincide. First, f is continuous by assumption, and by Lemma 7.24, Sf is continuous as the uniform limit of continuous functions. (Trigonometric polynomials are always continuous.)
Let us consider the Fourier coefficients. It is easy to see that, if a sequence of functions (gn) converges uniformly to g, then, for fixed ℓ ∈ Z, gn e^{−2πiℓ·} also converges uniformly to g e^{−2πiℓ·}. (Verify this!) By Lemma 7.24(ii), this implies that

    lim_{n→∞} ∫_0^1 gn(x) e^{−2πiℓx} dx = ∫_0^1 g(x) e^{−2πiℓx} dx.
In other words, ĝn(ℓ) → ĝ(ℓ) for all ℓ ∈ Z. We now apply this to gn = Sn f and g = Sf. Moreover, recall from Example 7.14 that

    (Sn f)^(ℓ) = ∫_0^1 Sn f(x) e^{−2πiℓx} dx = a_ℓ  if |ℓ| ≤ n,  and 0 otherwise.

(This just means that the "first n" Fourier coefficients of the partial sum Sn f coincide with the corresponding Fourier coefficients of f.) Since (Sn f)^(ℓ) = a_ℓ for all n ≥ |ℓ|, we clearly obtain lim_{n→∞} (Sn f)^(ℓ) = a_ℓ for all ℓ ∈ Z. Together with the above, and the uniform convergence of Sn f to Sf, we have

    (Sf)^(k) = lim_{n→∞} (Sn f)^(k) = a_k = f̂(k)

for all k ∈ Z. This finally shows that all Fourier coefficients of Sf and f coincide. Together with their continuity, we obtain from Lemma 7.26 that Sf(x) = f(x) for all x ∈ [0, 1).
With Theorem 7.25 we can finally prove that the example from the end of Section 7.2 was correct.

Corollary 7.28. We have Σ_{k=1}^∞ 1/k² = π²/6.
Proof. As discussed above, the Fourier coefficients of the function f(x) = ({x} − 1/2)² satisfy f̂(0) = 1/12 and, for k ≠ 0,

    f̂(k) = (−1)^k ∫_{−1/2}^{1/2} t² e^{−2πikt} dt = 1/(2(πk)²).

In particular, they are absolutely summable, and Theorem 7.25 shows that Sf(x) = f(x) for every x ∈ [0, 1). In particular, Sf(0) = f(0). This implies that

    1/4 = f(0) = Sf(0) = 1/12 + Σ_{k≠0} 1/(2(πk)²) = 1/12 + Σ_{k=1}^∞ 1/(πk)².

Rearranging yields the claim.
Note that we can deduce a bit more from the proof of Theorem 7.25. In particular, we showed that, given an absolutely summable sequence of complex numbers, the trigonometric polynomials with these coefficients converge uniformly to a continuous function. This is a helpful statement when a function is given by its Fourier coefficients. (Note that the theorem shows that, under the given assumptions, a function is uniquely determined by its Fourier coefficients.) We state this in the following lemma.
Lemma 7.29. Let (a_k)_{k∈Z} ⊂ C with Σ_{k∈Z} |a_k| < ∞. Then,

    g(x) := Σ_{k∈Z} a_k e^{2πikx}

defines a continuous (periodic) function g : [0, 1) → C.
Example 7.30. Consider the sequence (ak)k∈Z with ak = 0 for k < 0, and ak = e^{−k} for k ≥ 0. From Lemma 7.29 we obtain that

    g(x) = Σ_{k=0}^∞ e^{−k} e^{2πikx}

defines a continuous periodic function. To see this, we do not even need to make any computations regarding the infinite series. It is enough that the ak are non-negative and summable.
(One could use the geometric series to obtain the explicit form g(x) = 1/(1 − e^{2πix−1}).)
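One can compare the truncated series with this closed form numerically; a short sketch (function names are ours, and the truncation length is an arbitrary choice):

```python
import cmath

def g_series(x, terms=200):
    """Truncation of g(x) = sum_{k>=0} e^{-k} e^{2 pi i k x}."""
    return sum(cmath.exp(-k) * cmath.exp(2j * cmath.pi * k * x)
               for k in range(terms))

def g_closed(x):
    """Closed form via the geometric series: 1 / (1 - e^{2 pi i x - 1})."""
    return 1 / (1 - cmath.exp(2j * cmath.pi * x - 1))

for x in (0.0, 0.25, 0.7):
    print(x, abs(g_series(x) - g_closed(x)))  # differences are tiny
```

Since the geometric ratio has modulus e^{-1} < 1, the truncation error after 200 terms is negligible.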
Although Theorem 7.25 is useful and gives a simple criterion for a Fourier series to converge, it is
still not satisfactory as it gives a property of the Fourier coefficients as a sufficient condition. In
some cases, one would prefer to check only a condition of the function itself, like differentiability.
The next result, which follows almost immediately from Theorem 7.25, shows that the Fourier
series of twice differentiable functions converges uniformly. However, note that this would not
be helpful for proving the result in Corollary 7.28, because this function is not differentiable
at 0, see Figure 46. In this respect, Theorem 7.25 is more general.
Theorem 7.31. Let f : [0, 1) → C be periodic and twice continuously differentiable. Then, Sn f converges uniformly to f; in particular, Sf(x) = f(x) for all x ∈ [0, 1).
For the proof of this result it is enough to show that the Fourier coefficients of a twice differen-
tiable function are absolutely summable. As this is of independent interest, we state it in the
following lemma in a more general form.
Lemma 7.32. Let s ∈ N and f : [0, 1) → C be periodic and s-times continuously differentiable. Then, we have

    |f̂(k)| ≤ M / |2πk|^s   for all k ≠ 0,

where

    M := max_{x∈[0,1)} |f^{(s)}(x)|.
where we have used the periodicity of f (and of e^{−2πikt}) in the last equation. If we repeat the integration by parts an additional (s − 1) times, we get

    f̂(k) = (1/(2πik)^s) ∫_0^1 f^{(s)}(x) e^{−2πikx} dx.

Taking absolute values, and using |e^{−2πikx}| = 1, yields the claim.
Proof of Theorem 7.31. According to the assumptions, f″ is continuous, so that there exists M < ∞ with

    |f″(x)| ≤ M   for all x ∈ [0, 1).

From Lemma 7.32 we have |f̂(k)| ≤ |2πk|^{−2} M ≤ |k|^{−2} M for all k ≠ 0, and consequently

    Σ_{k=−∞}^∞ |f̂(k)| ≤ |f̂(0)| + 2M Σ_{k=1}^∞ 1/k² < ∞.

The statement now follows from Theorem 7.25.
Remark 7.33. Let us stress again that the assumption of the last theorem is that the function f under consideration is continuously differentiable as a periodic function. This means that also its derivative f′ (and derivatives of higher order) have to be periodic functions and, in particular, have to satisfy f′(0) = f′(1). (It is a typical mistake to forget this property.) For example, the periodic function f : [0, 1) → R with f(x) = (x − 1/2)² is continuous, since lim_{x↘0} f(x) = lim_{x↗1} f(x) = 1/4. However, its derivative satisfies f′(x) = 2x − 1 for x ∈ (0, 1), which implies lim_{x↘0} f′(x) = −1 ≠ 1 = lim_{x↗1} f′(x). Therefore, f is not continuously differentiable as a periodic function.
Theorem 7.25 and Theorem 7.31 only provide the qualitative statement that the Fourier series converges. However, for applications of the theory in actual computations, it is necessary to also have quantitative results. That is, we want to know how large we have to choose n such that the error is small when we approximate f by Sn f. Fortunately, such error bounds can be obtained if we are a bit more careful in bounding the corresponding infinite sums.
For this we state the following lemma, which is a useful tool for bounding (infinite) sums by
certain integrals. Note that this can also be used as a convergence test for series, similarly to
the ones given in Section 3.7, to verify if a sequence of numbers is summable. But this time,
this may also lead to reasonable bounds on the sum of the series.
Lemma 7.34. Let h : [0, ∞) → [0, ∞) be a continuous and non-increasing function. Then, for all m, n ∈ N0 with m < n, we have

    Σ_{k=m+1}^n h(k) ≤ ∫_m^n h(x) dx ≤ Σ_{k=m}^{n−1} h(k).

The same holds with n = ∞, i.e., with the infinite sums and the integral over [m, ∞).
Proof. First, we split the integral into several integrals over intervals of length 1 to obtain

    ∫_m^n h(x) dx = Σ_{k=m}^{n−1} ∫_k^{k+1} h(x) dx.

For each k, the mean value theorem (Theorem 6.37) yields that there is some ξk ∈ [k, k + 1] such that ∫_k^{k+1} h(x) dx = h(ξk). Since h is non-increasing, we have h(k + 1) ≤ h(ξk) ≤ h(k), and therefore

    Σ_{k=m}^{n−1} h(k + 1) ≤ ∫_m^n h(x) dx ≤ Σ_{k=m}^{n−1} h(k).

An index shift implies the first statement. The second follows by taking the limit n → ∞.
Example 7.35. The most prominent application of the last lemma is to bound the tails of the series Σ_k k^{−s} for s > 1. We obtain for every n ∈ N that

    Σ_{k=n+1}^∞ k^{−s} ≤ ∫_n^∞ x^{−s} dx = (1/(s − 1)) n^{−s+1}.

(Verify this!)
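The bound is easy to test numerically for s = 2, where it reads Σ_{k>n} k^{−2} ≤ 1/n. A sketch (a finite cutoff replaces the infinite tail, so the computed value slightly underestimates the true one; names are ours):

```python
def tail(n, s, cutoff=10 ** 6):
    """Numerical tail sum_{k=n+1}^{cutoff} k^{-s} (slightly below the true tail)."""
    return sum(1.0 / k ** s for k in range(n + 1, cutoff + 1))

# The bound from the text for s = 2 is n^{-1}/(2 - 1) = 1/n.
for n in (1, 5, 50):
    print(n, tail(n, 2), 1 / n)  # the tail obeys the bound in each case
```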
With these bounds at hand, we are able to give explicit quantitative bounds on the error of
Fourier series. Recall that we have presented a similar bound already for Taylor polynomials in
Corollary 5.57. However, with this bound we were not able to give good bounds for functions
that are not very smooth. In contrast, the next bound shows that we can give arbitrarily good
approximations of a function, as long as it is twice differentiable. (Of course, we need to compute
enough Fourier coefficients for this. We do not discuss here how this could be done.)
Corollary 7.36. Let f : [0, 1) → C be continuous such that for some s > 1 and B < ∞ we have

    |f̂(k)| ≤ B / |k|^s

for all k ≠ 0. Then,

    |f(x) − Sn f(x)| ≤ (2B/(s − 1)) · 1/n^{s−1}

for all x ∈ [0, 1) and n ∈ N.
Let us finally again discuss the example from the beginning, see Example 7.17, and see how
good an approximation by a partial sum of the Fourier series would be.
Example 7.37. We consider the periodic function f : [0, 1) → C that is given by f(x) = (x − 1/2)². We already showed in Example 7.17 that its Fourier coefficients equal

    f̂(k) = 1/(2(πk)²)

for k ≠ 0. Therefore, they satisfy the bound of Corollary 7.36 with s = 2 and B = 1/(2π²). We therefore obtain the bound

    |f(x) − Sn f(x)| ≤ 1/(π² n)

for all n ∈ N. One might check that n = 11 suffices to obtain |f(x) − S11 f(x)| < 1/100 (or n = 102 for |f(x) − S102 f(x)| < 1/1000). This finally justifies that an approximation of this function can already be good for rather small n, see Figure 46.
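This error bound can be checked numerically, using the cosine form Sn f(x) = 1/12 + Σ_{k=1}^n cos(2πkx)/(πk)² of the partial sums (which follows from the coefficients above by pairing k and −k). A sketch with our own names:

```python
import math

def S_n(x, n):
    """Partial sum S_n f(x) = 1/12 + sum_{k=1}^n cos(2 pi k x)/(pi k)^2
    of f(x) = (x - 1/2)^2, using the coefficients from the text."""
    return 1 / 12 + sum(math.cos(2 * math.pi * k * x) / (math.pi * k) ** 2
                        for k in range(1, n + 1))

f = lambda x: (x - 0.5) ** 2
err = max(abs(f(j / 1000) - S_n(j / 1000, 11)) for j in range(1000))
print(err, 1 / (math.pi ** 2 * 11))  # the observed error obeys the bound
```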
Example 7.38. Bounds like those in the last example can also be useful when we want to approximate certain series by finite sums. Consider again the same example, but only at x = 0. At this point the Fourier series reads Sf(0) = 1/12 + Σ_{k=1}^∞ 1/(πk)², and we have f(0) = 1/4, see Example 7.17. Therefore,

    |f(0) − Sn f(0)| = | 1/6 − Σ_{k=1}^n 1/(πk)² | ≤ 1/(π² n),

which implies

    | π²/6 − Σ_{k=1}^n 1/k² | ≤ 1/n.

We can therefore give a rather good approximation of π²/6, with an error of at most 1/n, by just computing the sum of the first n terms of the series. Note that this can actually be done by hand, and it was done like this before calculators were invented. In those times, such error bounds were essential.
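A quick numerical check of this bound (our sketch, not from the notes):

```python
import math

# |pi^2/6 - sum_{k=1}^n 1/k^2| <= 1/n, as derived above.
for n in (10, 100, 1000):
    s = sum(1.0 / k ** 2 for k in range(1, n + 1))
    print(n, s, abs(math.pi ** 2 / 6 - s))  # the error is at most 1/n
```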
We now briefly comment on a final result regarding the convergence of Fourier series. This is Dirichlet's theorem, which states that it is actually enough to consider only local properties of a function to obtain pointwise convergence of the Fourier series at a point. Note that, in contrast to this, all the previous results required some global knowledge about the function: either through its Fourier coefficients, or because we assumed the function to be differentiable everywhere. The following theorem, which is only a special case of Dirichlet's theorem, states that the latter assumption is already enough at a single point.
Theorem 7.39. Let f : [0, 1) → C be periodic and integrable. If f is differentiable at a point x0 ∈ [0, 1), then lim_{n→∞} Sn f(x0) = f(x0).
We do not prove this statement here. Actually, a proof of this theorem requires quite a lot of
prerequisites, and it would fill several lectures to tackle every single detail.
Let us see an example that shows how useful this theorem is.
Example 7.40. Consider again the periodic function f(x) = x on [0, 1), see Example 7.15 and Figure 45. We have shown already that f̂(0) = 1/2 and f̂(k) = 1/(−2πik) for k ≠ 0, which yields the Fourier series

    Sf(x) = 1/2 + Σ_{k≠0} e^{2πikx}/(−2πik) = 1/2 − Σ_{k=1}^∞ (1/(πk)) sin(2πkx).

Since f is differentiable at every x ∈ (0, 1), the above theorem shows that Sn f(x) → f(x) = x at all these points.

If we consider instead the averaged partial sums

    σn f(x) = (1/n) Σ_{m=0}^{n−1} Sm f(x),
which are called Cesàro means of the partial sums, then we might prove (with a lot of effort) that σn f → f uniformly for every continuous function. This is called Fejér's theorem. This result, and its more advanced variants, are heavily used in everyday applications (like JPEG or MP3), in particular due to their nice convergence guarantees. This shows that one should be careful how to build up an approximation based on the given information.
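For the sawtooth from Example 7.15, the difference between the two approximations is easy to observe numerically: the partial sums overshoot near the jump, while the Cesàro means stay within [0, 1]. A sketch (names are ours, and the behaviour is only checked empirically on a grid):

```python
import math

def S(m, x):
    """Partial sum S_m f(x) = 1/2 - sum_{k=1}^m sin(2 pi k x)/(pi k) (sawtooth)."""
    return 0.5 - sum(math.sin(2 * math.pi * k * x) / (math.pi * k)
                     for k in range(1, m + 1))

def sigma(n, x):
    """Cesaro mean (1/n) * sum_{m=0}^{n-1} S_m f(x)."""
    return sum(S(m, x) for m in range(n)) / n

grid = [j / 800 for j in range(800)]
print(max(S(50, x) for x in grid))      # > 1: the partial sums overshoot the jump
print(max(sigma(50, x) for x in grid))  # <= 1: the means stay inside [0, 1]
print(min(sigma(50, x) for x in grid))  # >= 0
```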
8 Multivariate Calculus
In this chapter we initiate the study of functions that depend on more than one variable, which
are called multivariate functions. In analogy to the case of real-valued functions of one variable,
we will introduce some concepts, like continuity or differentiability, which will then lead to
results on, e.g., extreme values of multivariate functions. Note that the study and computation
of minima and maxima of functions that depend on many parameters is one of the main subjects
of optimization, and therefore particularly important in AI applications.
The most general type of multivariate functions are vector-valued functions, which have the form
V : Rd → Rm
for some d, m ∈ N. That is, V maps a vector of length d to a vector of length m. We already
studied the case m = d = 1 in detail, and such functions will be called univariate. A special case of vector-valued functions are vector fields, which have d = m, and which have the nice interpretation of attaching a vector to each point in space.
The easiest vector-valued functions are given by matrix-vector multiplication with a matrix
A ∈ Rm×d . That is, we define V : Rd → Rm by V (x) = Ax ∈ Rm for x ∈ Rd . Such functions are
called linear functions. (Recall that univariate linear functions are just multiplication with a
scalar.) However, we also need to discuss non-linear functions, for which the components of the
output may result from arbitrary operations with the input, and not only linear combinations.
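As a quick illustration of such a linear function (a minimal sketch with a hypothetical matrix A, using plain lists instead of any linear-algebra library):

```python
# A linear vector-valued function V : R^3 -> R^2, V(x) = A x, with d = 3, m = 2.
A = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]

def V(x):
    """Matrix-vector product A x, computed row by row."""
    return [sum(row[i] * x[i] for i in range(len(x))) for row in A]

print(V([1.0, 1.0, 1.0]))  # [3.0, 2.0]
```

Linearity means V(u + w) = V(u) + V(w) and V(λu) = λV(u), which is easy to verify for this sketch.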
For this, we start by considering the special case m = 1, i.e., we consider multivariate real-
valued functions
f : Rd → R
with d ∈ N. (We use “d” for dimension.) Although this might seem to be a very special case, we
will see later that general vector-valued functions can be handled rather easily (component-wise)
once we have the right tools for real-valued functions.
Let us start with an example, and some comments on the visualization of functions.
Figure 47: A plot of the graph and the contour plot of f (x1 , x2 ) = x21 + x22 − x1 x2 in [0, 1]2 .
Remark 8.2. Note that, as usual, vectors x ∈ Rd are considered as column vectors when it
comes to matrix-vector multiplication. However, when considered as input of some function we
use the notation x = (x1 , . . . , xd ) and f (x) = f (x1 , . . . , xd ). This should not lead to confusion.
8.1 Sequences in Rd
Since we want to investigate continuity and other properties of multivariate functions, we have
to mimic the concepts that we introduced in the univariate setting. In particular, we need the
concept of a limit of a (convergent) sequence. Recall that in the univariate case, we often used that, for a sequence (an)n∈N ⊂ R, the convergence lim an = a is equivalent to lim |an − a| = 0, and we want to do the same here. Therefore, we actually only need to replace the absolute value
by another quantity that allows for measuring ’how large’ a vector is. This can then be used to
measure ’how close’ two vectors are.
Recall that we already introduced in Section 1.9 the Euclidean norm and the corresponding inner product

    ‖x‖_2 = √⟨x, x⟩ = √( Σ_{i=1}^d |x_i|² ),   where   ⟨x, y⟩ = Σ_{i=1}^d x_i y_i,

for all x, y ∈ R^d, see Theorem 1.75. Moreover, the Cauchy-Schwarz inequality (Lemma 1.76) states that

    |⟨x, y⟩| ≤ ‖x‖_2 ‖y‖_2

for all x, y ∈ R^d, and we have equality |⟨x, y⟩| = ‖x‖_2 ‖y‖_2 if and only if y = c · x for some c ∈ R.
We are now ready to treat limits of sequences of vectors. But note that we have to be careful with the indices: In the following, we assume that (xk)k∈N ⊂ R^d is a sequence of vectors, i.e., xk ∈ R^d for every k ∈ N. Now, these vectors have d entries, which will be denoted by

    xk = (x_{k,1}, x_{k,2}, . . . , x_{k,d}).
Definition 8.3 (Convergence). Let (xk)k∈N ⊂ R^d be a sequence of vectors. If there exists some y = (y1, . . . , yd) ∈ R^d such that

    lim_{k→∞} x_{k,i} = y_i   for all i = 1, . . . , d,

then we say that (xk) converges to y, and we write xk → y.
That is, a sequence (xk) converges if and only if every component (x_{k,i}), i = 1, . . . , d, of the sequence converges.
Using our knowledge about real sequences, and the above lemma, we see that xk → x with

    x = (0, −1, 1)

for the sequence xk with components (k − 2)/(k² + 1), cos(1/k + π) and (e^k + k²)/e^k, since 0, −1 and 1 are exactly the limits of these three components.
With this in mind, we can compute limits of vectors just by computing d usual limits. However,
it is sometimes handy to have a characterization of convergence that is based on the Euclidean
norm. For illustration, let us see the following easy example.
Consider, e.g., the sequence given by xk = (1/k, 1/k) ∈ R². This sequence clearly converges to 0 ∈ R². If we now look at the norms of the differences xk − 0, we see that

    ‖xk − 0‖_2 = √( (1/k − 0)² + (1/k − 0)² ) = √( 1/k² + 1/k² ) = √2 / k.

Thus ‖xk − 0‖_2 → 0 as k → ∞.
Having a closer look to the above example, or the definition of the norm in general, we see that a
sequence (xk ) converges to a vector y if and only if the norms kxk − yk converge to 0 (a number).
We state this in the following lemma. The proof is an easy exercise.
    xk → y ⟺ xk − y → 0 ⟺ ‖xk − y‖_2 → 0.
By this equivalence, we can see the similarity to the univariate situation: We had that a sequence
of numbers (an ) converges to a if and only if the absolute values |an − a| converge to zero. This
similar appearance will be helpful, as many proofs of results on continuity and differentiation
that follow will just look very similar to the proofs from the univariate setting.
Let us also comment on other norms that one can consider for vectors. We will mostly use
the Euclidean norm for the considerations below, but it is sometimes useful (or necessary) to
consider another ’distance’ between points. However, note that the Cauchy-Schwarz inequality
is special to the Euclidean norm.
Let us briefly state what we mean by a norm, and collect the properties that are shared by all these quantities. (We discuss this in more detail later.) A function ‖·‖ : R^d → [0, ∞) is called a norm if it satisfies:
1) ‖x‖ = 0 ⟺ x = 0 (definiteness),
2) ‖λx‖ = |λ| · ‖x‖ for all λ ∈ R and x ∈ R^d (homogeneity),
3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ R^d (triangle inequality).
As shown above, the Euclidean norm ‖·‖_2, which is also called the 2-norm, is a norm in this sense.
There are two other important norms. The first is the 1-norm, which is sometimes called the Manhattan norm, given by

    ‖x‖_1 = Σ_{i=1}^d |x_i|

for x = (x1, . . . , xd) ∈ R^d. The other one is the maximum norm, or ∞-norm, given by

    ‖x‖_∞ = max{ |x1|, |x2|, . . . , |xd| }.
It is easy to see that both fulfill the first requirement for being a norm, the definiteness. For the homogeneity, note that

    ‖λx‖_1 = Σ_{i=1}^d |λx_i| = |λ| Σ_{i=1}^d |x_i| = |λ| · ‖x‖_1

and

    ‖λx‖_∞ = max{ |λx1|, . . . , |λxd| } = |λ| max{ |x1|, . . . , |xd| } = |λ| ‖x‖_∞.
It is important to note that differences between these norms cannot be too large, as the following
lemma shows.
Lemma 8.8. For all x ∈ R^d we have
1) ‖x‖_∞ ≤ ‖x‖_2 ≤ √d · ‖x‖_∞,
2) (1/√d) · ‖x‖_1 ≤ ‖x‖_2 ≤ √d · ‖x‖_1,
3) ‖x‖_∞ ≤ ‖x‖_1 ≤ d · ‖x‖_∞.
Note that, in contrast to what may be suggested by the name, the maximum norm is the smallest of these norms. However, as all these norms only differ by a multiplicative constant (that only depends on the dimension), we obtain that for a sequence (xk) we have

    xk → x ⟺ ‖xk − x‖_p → 0,

where p stands for 1, 2 or ∞. Therefore, when talking about convergence, it does not matter which norm we use.
Let us finally prove the last lemma.
Proof. Let k be such that |x_k| = max{ |x1|, |x2|, . . . , |xd| }. Then

    |x_k| ≤ √( Σ_{i=1}^d |x_i|² ) ≤ √( Σ_{i=1}^d |x_k|² ) = |x_k| · √( Σ_{i=1}^d 1 ) = √d · |x_k|.

(Note the indices.) Thus the first point follows. For the third point we calculate

    |x_k| ≤ Σ_{i=1}^d |x_i| ≤ Σ_{i=1}^d |x_k| = d · |x_k|.

The upper bound in the second point follows by combining points 1) and 3) as

    ‖x‖_2 ≤ √d · ‖x‖_∞ ≤ √d · ‖x‖_1.

For the lower bound we use the Cauchy-Schwarz inequality, see Lemma 1.76, to obtain

    ‖x‖_1 = Σ_{i=1}^d |x_i| = Σ_{i=1}^d |x_i| · 1 ≤ ‖x‖_2 · √( Σ_{i=1}^d 1² ) = √d · ‖x‖_2.

Dividing by √d gives the lower bound on ‖x‖_2.
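All inequalities of the lemma can also be stress-tested numerically on random vectors; a sketch (with our own helper names, and a tiny tolerance added for floating-point rounding):

```python
import math, random

def n1(x):   return sum(abs(t) for t in x)        # 1-norm
def n2(x):   return math.sqrt(sum(t * t for t in x))  # 2-norm
def ninf(x): return max(abs(t) for t in x)        # max norm

random.seed(0)
d = 7
for _ in range(1000):
    x = [random.uniform(-1.0, 1.0) for _ in range(d)]
    assert ninf(x) <= n2(x) <= math.sqrt(d) * ninf(x) + 1e-12
    assert n1(x) / math.sqrt(d) <= n2(x) + 1e-12
    assert n2(x) <= math.sqrt(d) * n1(x) + 1e-12
    assert ninf(x) <= n1(x) <= d * ninf(x) + 1e-12
print("all inequalities hold on the sampled vectors")
```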
The definition of continuity in the multivariate case is the same as in the one dimensional case.
That is, we require that we can interchange the limit with the function.
Again, to have a shorter notation, we use the concept of a limit of a function. Let us recall what
we mean by this, see Definition 4.27.
Note that, for technical reasons, we have to exclude the limit from the sequences, i.e., we only
consider sequences (xk ) with xk → x0 and xk 6= x0 to define the above limit. If no such sequence
exists for x0 , then we call x0 an isolated point of Ω, and the limit is not defined. However, as
discussed in Remark 4.29, a function is always continuous at isolated points, and we therefore
see that a function is continuous on U ⊂ Ω if and only if lim_{x→x0} f(x) = f(x0) for all non-isolated points x0 ∈ U.

Let us now consider some particularly simple examples, namely the functions
• f (x1 , x2 ) = x1 + x2 ,
• g(x1 , x2 ) = x1 · x2 , and
• h(x1, x2) = x1/x2.
(Note that h : R × (R \ {0}) → R means that the second input is not allowed to be zero.)
These functions are particularly simple as they are just a sum/product/quotient of functions
depending on one variable. From our knowledge about real sequences, we can easily check that
these functions are continuous. For this, let us assume that (xk )k∈N is an arbitrary sequence
converging to some x0 = (x0,1 , x0,2 ) ∈ R2 . We use the notation xk = (xk,1 , xk,2 ). Using that
(xk,1 , xk,2 ) → (x0,1 , x0,2 ) if and only if xk,1 → x0,1 and xk,2 → x0,2 , we obtain
    lim_{k→∞} f(xk) = lim_{k→∞} (x_{k,1} + x_{k,2}) = lim_{k→∞} x_{k,1} + lim_{k→∞} x_{k,2} = x_{0,1} + x_{0,2} = f(x0).
Since this holds for arbitrary sequences converging to arbitrary x0 ∈ R2 , we see that f is
continuous. The same calculations can be done for g and h.
Lemma 8.11. Let f, g : R^d → R be continuous. Then f + g and f · g are continuous, and f /g is continuous at every point where g ≠ 0. Moreover, if D ⊂ R with f(R^d) ⊂ D, and h : D → R is continuous, then the composition h ◦ f : R^d → R is continuous.
Proof. The proof is exactly the same as the proofs of Theorem 4.12 and Theorem 4.17.
One can consider the above lemma also for vector-valued functions f, g : R^d → R^m for any m ≥ 2. Clearly, by considering all components separately, we see that the lemma also holds for f + g. However, products and quotients do not make sense for vector-valued functions.
Additionally, we can again consider the composition of functions. Recall that g ◦ f (x) := g(f (x)). For
vector-valued functions this makes sense only if the dimensions of the corresponding functions agree.
That is, if f ’outputs’ a vector from Rp , then g must ’accept’ such vectors as ’input’. However, in this
case one can show analogously to the univariate case that the composition of continuous functions
is again a continuous function.
The most important example for now is with p = m = 1 and h : R → R with h(y) = |y|. This implies
that the absolute value |f | is continuous for every real-valued continuous function f : Rd → R.
With these results, one can easily prove that certain functions are continuous.
Example 8.13. First, we consider the norms discussed above, which are clearly also real-valued functions on R^d. For notational convenience, let us write f1(x) := ‖x‖_1, f2(x) := ‖x‖_2 and f∞(x) := ‖x‖_∞ for x ∈ R^d, or in short fp := ‖·‖_p for p ∈ {1, 2, ∞}.
All these functions are compositions of the functions gi(x1, . . . , xd) := |xi|, which are clearly continuous (and actually univariate). f1 and f∞ are their sum and maximum, respectively, and are therefore continuous. (One might prove by induction that the maximum of d numbers is continuous.) For f2, note that also the gi² and hence Σ_{i=1}^d gi² are continuous on R^d. Continuity of f2 then follows by the last part of Lemma 8.11 with D = R_+ and h(t) := √t.
This function consists only of continuous functions, and the denominator is bounded away from zero,
which implies that f is continuous.
Another important class of continuous functions are the multivariate polynomials. These are functions of the form

    p(x) = Σ c_k x1^{k1} x2^{k2} · · · xd^{kd},
where the sum is over all k = (k1, . . . , kd) ∈ N0^d with ‖k‖_1 = k1 + · · · + kd ≤ r ∈ N0, and the ck ∈ R are the coefficients of the polynomial. We call r the degree of this polynomial (at least if ck ≠ 0 for at least one k with ‖k‖_1 = r). One may even write such polynomials in a shorter way by introducing the so-called multi-index notation. That is, for x ∈ R^d and k ∈ N0^d, we define

    x^k := Π_{i=1}^d x_i^{k_i} = x1^{k1} x2^{k2} · · · xd^{kd}.

With this, we can write multivariate polynomials in the familiar form p(x) = Σ_{‖k‖_1 ≤ r} c_k x^k.
A simple example for d = 2 is given by

    p(x1, x2) = 2 x1³ x2 + x1² x2² + 3 x2 − 42.

This polynomial is in the above form with c(3,1) = 2, c(2,2) = 1, c(0,1) = 3, c(0,0) = −42 and all other ck equal to zero. The degree of this polynomial is therefore 4.
By Lemma 8.11, multivariate polynomials are clearly continuous.
However, in general it is not so easy to prove continuity of multivariate functions. The reason is that
there are 'too many' sequences converging to a point. (While there were only left- and right-sided limits in the univariate case, there are infinitely many 'directions' of approach for multivariate functions.) Let us see
the following example.
Example 8.16. We consider the function f : R² → R defined by

    f(x1, x2) = x1 x2 / (x1² + x2²)   if (x1, x2) ≠ 0,   and   f(0, 0) = 0.
This function is clearly continuous at any x0 ≠ 0. If we want to prove continuity at 0, i.e., that limx→0 f(x) = 0, we need to consider all null sequences. In a first try, we may consider the sequences given by yk = (1/k, 0) and zk = (0, 1/k), which correspond to the limits in coordinate direction. Since f(yk) = f(zk) = 0 for all k (because one of the components is always zero) we see that f(yk) → 0 and f(zk) → 0, which might indicate that the function is continuous. However, if we consider the sequence tk = (1/k, 1/k) (i.e., the limit 'from the diagonal'), then we see that

    f(tk) = (1/k²) / (1/k² + 1/k²) = 1/2.

This implies that the limit limx→0 f(x) does not exist, i.e., f is not continuous at 0.
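The behaviour along different directions is easy to reproduce numerically; a sketch (the function is the quotient x1x2/(x1² + x2²) from this example, with our own naming):

```python
def f(x1, x2):
    """The example function: 0 at the origin, x1*x2/(x1^2 + x2^2) elsewhere."""
    return 0.0 if (x1, x2) == (0.0, 0.0) else x1 * x2 / (x1 ** 2 + x2 ** 2)

# Along the coordinate axes the values tend to 0 ...
print([f(1 / k, 0.0) for k in (1, 10, 100)])
# ... but along the diagonal they are constantly 1/2, so no limit at 0 exists:
print([f(1 / k, 1 / k) for k in (1, 10, 100)])
```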
This shows that proving continuity of a multivariate function is sometimes an issue, and one needs some intuition to give a detailed proof. In most cases, one even has to try several approaches, as there is no general rule for this. However, in some cases a function depends on its input only through the norm of the input, and for such functions one might easily obtain continuity.
Example 8.17. We consider the function defined by

    f(x1, x2) = (1 − cos(|x1| + |x2|)) / √(x1² + x2²)   if x ≠ 0,
    f(x1, x2) = 0                                       if x = 0.
Again, the function is continuous at any x0 ≠ 0. For x0 = 0, note that Lemma 8.8 implies that √(x1² + x2²) = ‖(x1, x2)‖2 ≥ (1/√2) ‖(x1, x2)‖1, and therefore, with x = (x1, x2) ≠ 0,

    |f(x)| = |1 − cos ‖x‖1| / ‖x‖2 ≤ √2 · |1 − cos ‖x‖1| / ‖x‖1.
If we now take into account that x → 0 if and only if ‖x‖1 → 0 (and use the substitution t := ‖x‖1), we obtain that

    lim_{x→0} |f(x)| ≤ √2 · lim_{t→0} |1 − cos(t)| / t = 0.
(Here, we used l’Hospital’s rule.) This implies limx→0 f (x) = 0, and thus that f is continuous.
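As a numerical sanity check (our own sketch, not part of the argument), one can evaluate f close to 0 along several directions and observe that the values become small:

```python
import math

# f from Example 8.17; its values near 0 are bounded by sqrt(2)*|1-cos(||x||_1)|/||x||_1.
def f(x1, x2):
    if (x1, x2) == (0.0, 0.0):
        return 0.0
    return (1.0 - math.cos(abs(x1) + abs(x2))) / math.hypot(x1, x2)

# Points with ||x||_2 = 1e-4 in a few arbitrary directions.
values = [abs(f(1e-4 * math.cos(t), 1e-4 * math.sin(t)))
          for t in (0.0, 0.7, 1.3, 2.9)]
```

Since 1 − cos(s) behaves like s²/2 near 0, all these values are of order 1e-4, consistent with f(x) → 0.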
We want to finish this subsection with the ε-δ-criterion for multivariate functions, as this will again be
needed in some of the upcoming proofs. Note that this is again very similar to the univariate criterion,
see Theorem 4.63.
Theorem 8.18. Let f : Ω → R and p ∈ {1, 2, ∞}. Then, f is continuous at x0 ∈ Ω if and only if for any ε > 0 there exists a δ > 0 such that for all x ∈ Ω with ‖x − x0‖p < δ, we have |f(x) − f(x0)| < ε. In a formula,

    ∀ε > 0 ∃δ > 0 ∀x ∈ Ω : ‖x − x0‖p < δ =⇒ |f(x) − f(x0)| < ε.
The proof is almost the same as for the one-dimensional case, but we still want to give it here. Moreover, note that by Lemma 8.8 the criteria for the different p ∈ {1, 2, ∞} are equivalent. Therefore, it is enough to prove the statement for p = 2 (or any other of them).
Proof. First we show that the ε-δ-criterion holds if f is continuous. To this end, we assume the opposite, i.e., that there exists ε0 > 0 such that for any δ > 0 there is some y ∈ Ω with ‖y − x0‖ < δ but |f(y) − f(x0)| > ε0, and show that this contradicts the continuity of f. In particular, by assumption, we may choose δ = 1/n and find some yn ∈ Ω with the property

    ‖yn − x0‖ ≤ 1/n   and   |f(yn) − f(x0)| > ε0.
This implies that yn → x0, but f(yn) ↛ f(x0). So, f is not continuous at x0, which is a contradiction.
Next we assume that the ε-δ-criterion holds and show that in this case f is continuous. For this, fix some ε > 0. By definition, if a sequence (xk) converges to x0, then for all k large enough it holds that ‖xk − x0‖ < δ, where δ > 0 is as given by the ε-δ-criterion. Therefore |f(xk) − f(x0)| < ε. As this holds
for all ε > 0 and all sequences (xk) converging to x0, we obtain f(xk) → f(x0), so f is continuous at x0.
Note that, in the same way as in the univariate case, there are some technical problems related to boundary
points of the domain of a function. Therefore, we consider at first only open sets in Rd , which are the
replacement of open intervals on the real line.
Recall that the ε-neighborhood of a point x ∈ Rd is given by Uε(x) = {y ∈ Rd : ‖x − y‖ < ε}.
Since this topic involves a lot of indices, we use the following notation, which will make everything a bit shorter. If not indicated otherwise, ‖·‖ always denotes the Euclidean norm, i.e., ‖x‖ = ‖x‖2, and G ⊂ Rd denotes an open subset of Rd. Moreover, we want to remind you that the i-th unit vector, written ei, was defined to have a 1 in the i-th coordinate and zeros elsewhere.
Finally, let us note that, due to the different indices needed, there might be some confusion between the coordinates of a vector and the elements of a sequence, as already explained above. In what follows we always write (or at least try to)

    x = (x1, . . . , xd) ∈ Rd,

such that xi ∈ R, i = 1, . . . , d, is a number, namely the i-th coordinate of x, and we use (again) x0 ∈ Rd for a specific point.
Sometimes, other notations for the partial derivatives ∂f/∂xi are used, like Di f or Dei f or ∂xi f, or just δi f or fi. Therefore, one needs to be careful when using other literature.
Example 8.21. Let us have a look at the function

    f(x1, x2) = x1 · x2 − e^{−x1},

considered on G = R². To calculate ∂f/∂x1(x) using the definition of partial derivatives, we have to compute the limit h → 0 of

    (f(x + he1) − f(x))/h = (f(x1 + h, x2) − f(x1, x2))/h
        = ((x1 + h) · x2 − e^{−(x1+h)} − (x1 · x2 − e^{−x1}))/h
        = ((x1 + h − x1) · x2 − (e^{−x1−h} − e^{−x1}))/h
        = (h · x2)/h − e^{−x1}(e^{−h} − 1)/h,

where x = (x1, x2). This implies, by recalling known limits, that

    ∂f/∂x1(x) = x2 + e^{−x1}.
Note that we only send h → 0 above, and that x1 and x2 are fixed.
In the same way, we obtain

    ∂f/∂x2(x) = lim_{h→0} (f(x1, x2 + h) − f(x1, x2))/h = lim_{h→0} (x1 · (x2 + h − x2) − (e^{−x1} − e^{−x1}))/h = x1.
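Partial derivatives can also be checked numerically with difference quotients. The following sketch (the helper names are ours) compares a central difference quotient with the formulas just derived:

```python
import math

# f(x1, x2) = x1*x2 - exp(-x1), with df/dx1 = x2 + exp(-x1) and df/dx2 = x1.
def f(x1, x2):
    return x1 * x2 - math.exp(-x1)

def partial(f, x, i, h=1e-6):
    """Central difference quotient approximating df/dx_i at the point x."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(*xp) - f(*xm)) / (2 * h)

x = (0.5, -1.2)
d1 = partial(f, x, 0)   # should be close to x2 + exp(-x1)
d2 = partial(f, x, 1)   # should be close to x1
```

At x = (0.5, −1.2) this gives d1 ≈ −1.2 + e^{−0.5} and d2 ≈ 0.5, matching the exact partial derivatives.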
A detailed inspection of the above example shows that there is a simpler way to compute partial derivatives than just plugging in the definition. In fact, note that partial differentiation w.r.t. x1 does not 'touch' x2. Therefore, we may just consider x2 as a (fixed) constant and differentiate the univariate function depending only on x1. To be precise, consider the expression

    lim_{h→0} (f(x + hei) − f(x))/h = lim_{h→0} (f(x1, . . . , xi + h, . . . , xd) − f(x1, . . . , xi, . . . , xd))/h,

where x = (x1, x2, . . . , xd) is a fixed point. In this expression, the 'inputs' x1, . . . , xi−1, xi+1, . . . , xd are 'untouched', allowing us to treat them as fixed. So by defining the univariate function

    g(t) := f(x1, . . . , xi−1, t, xi+1, . . . , xd),

we see that

    g′(xi) = ∂f/∂xi(x).
Thus we can compute partial derivatives by calculating one-dimensional derivatives, which allows us to use all our knowledge from the previous chapter. Let us see how this procedure helps us if we want to compute partial derivatives.
As a first example, consider again f(x1, x2) = x1 · x2 − e^{−x1}. Treating x2 as a constant, we immediately get ∂f/∂x1(x) = x2 + e^{−x1}, and treating x1 as a constant gives ∂f/∂x2(x) = x1, as before.
Example 8.23. Sometimes we also have to use some (one-dimensional) calculation rules to compute partial derivatives. Let us have a look at

    f(x1, x2) = sin(x1³ + x2).
Example 8.24. We define G = Rd \ {0} (note that this is an open set) and compute the partial derivatives of f(x) = ‖x‖2. Since

    f(x) = (x1² + x2² + · · · + xd²)^{1/2},

the chain rule implies that

    ∂f/∂xi(x) = (1/2) · (x1² + x2² + · · · + xd²)^{−1/2} · 2xi = xi/‖x‖2.
Example 8.25. Consider again the function given by f(x) = x1x2/(x1² + x2²) for x ≠ 0 and f(0) := 0. Using the differentiation rules for one-dimensional functions, we easily see that f is partially differentiable on R² \ {0}. Furthermore, we observe that

    (f(hei) − f(0))/h = 0/h = 0

for any h ∈ R \ {0}. Thus ∂f/∂x1(0) and ∂f/∂x2(0) also exist, making f partially differentiable on all of R². However, f is not continuous at 0, as we will show by using the sequence xk = (1/k, 1/k), which converges to 0. But

    f(xk) = (1/k²)/(2/k²) = 1/2.

This implies that f(xk) cannot converge to 0 = f(0).
Writing all partial derivatives of a function in a (row) vector, we obtain a compact notation.
Remark 8.27. Some authors prefer to write the gradient as a column vector.
f(x1, x2) = sin(x1³ + x2). We already computed the partial derivatives and saw that they exist for any (x1, x2) ∈ R². Thus

    ∇f(x) = (3x1² · cos(x1³ + x2), cos(x1³ + x2)).
There is a result for gradients which can be considered as a generalization of the product rule.

Theorem 8.29 (Product rule). Let f, g : G → R be partially differentiable functions. Then we have

    ∇(f g)(x) = ∇f(x) · g(x) + f(x) · ∇g(x).

Proof. By the definition of the gradient it is sufficient to prove the statement for each coordinate. We use the product rule for one-dimensional functions to compute

    ∂(f g)/∂xi(x) = ∂f/∂xi(x) · g(x) + ∂g/∂xi(x) · f(x).
For example, consider

    f(x1, x2) = x1 · x2,

for which we clearly have ∇f(x) = (x2, x1). However, if we write f = g · h with g(x1, x2) = x1 and h(x1, x2) = x2, we see that ∇g(x) = (1, 0) and ∇h(x) = (0, 1) for all x = (x1, x2) ∈ R². With the product rule, we obtain
Note that in these computations, it is somehow obvious that everything is done at the (fixed) point x.
Therefore, the ’(x)’ is unnecessary and we write in short
    ∇f = ∇(g h) = ∇g · h + ∇h · g = (h, 0) + (0, g) = (h, g) = (x2, x1).
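The product rule can also be verified numerically; the snippet below (a rough sketch with our own finite-difference helper) checks ∇(gh) = ∇g · h + ∇h · g for the example above:

```python
# Finite-difference gradient (a rough numerical sketch, not exact).
def grad(f, x, h=1e-6):
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

g_ = lambda x: x[0]          # g(x1, x2) = x1
h_ = lambda x: x[1]          # h(x1, x2) = x2
f_ = lambda x: g_(x) * h_(x) # f = g * h, so grad f(x) = (x2, x1)

x = [2.0, 3.0]
lhs = grad(f_, x)                           # grad(g*h)
rhs = [gg * h_(x) + hh * g_(x)              # grad g * h + grad h * g
       for gg, hh in zip(grad(g_, x), grad(h_, x))]
```

At x = (2, 3), both sides agree with (x2, x1) = (3, 2) up to the finite-difference error.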
Example 8.25 shows that the existence of partial derivatives does not imply continuity of a function, although continuity was a necessary condition for differentiability of univariate functions. This, in particular, shows why we want to find a somehow 'better' generalization of one-dimensional differentiability. For this, recall that a function f : R → R is differentiable at x0 ∈ R if and only if the limit

    f′(x0) = lim_{h→0} (f(x0 + h) − f(x0))/h = lim_{x→x0} (f(x) − f(x0))/(x − x0)

exists.
One way to interpret this is that T(x) := f(x0) + (x − x0)f′(x0) is the tangent line to f at x0. The tangent is by definition an affine function, i.e., a linear function plus a constant. Therefore, one can say that the tangent line to a function f at x0 is given by T(x) = f(x0) + D(x − x0), where D : R → R is a linear function. Note that D(x − x0) = 0 for x = x0 (since D is linear), and therefore T(x0) = f(x0).
We will now use this to define differentiability in the multidimensional case. Recall that a linear mapping D : Rd → R is characterized by the properties D(x + y) = D(x) + D(y) and D(λx) = λD(x) for all x, y ∈ Rd and λ ∈ R, and can always be described by D(x) = Σ_{i=1}^d ai xi = aᵀx = ⟨a, x⟩ for some a ∈ Rd.
This definition does not appear very handy, but we will see shortly its relation to the gradient.
Remark 8.32. There is again some other commonly used notation for the derivative dfx . For example,
f 0 (x), Dx f , or (even without the point x) Df or df . So, again be careful when using other literature.
In the same way as for univariate functions, i.e., d = 1, the derivative dfx can be used to define the
tangent plane T to a function f at the point x ∈ Rd by T (y) := f (x) + dfx (y − x), which is the best
approximation by an affine function at x. We come back to this and give a handy formula for the
tangent plane using partial derivatives.
This example was particularly simple, because the error of the linear approximation, i.e., the function r,
did not depend on x. However, this may clearly happen, as the next example shows.
Example 8.34. Let f : R2 → R be given by
    f(x) = x1² · x2

with x = (x1, x2). We see that

    f(x + y) = (x1 + y1)²(x2 + y2)
             = (x1² + 2x1y1 + y1²)(x2 + y2)
             = f(x) + 2x1x2y1 + x1²y2 + x2y1² + 2x1y1y2 + y1²y2.
(Again, we collected the terms that do not depend on y, then the linear terms, then the rest.) We see that r(y) = x2y1² + 2x1y1y2 + y1²y2 (which depends on x) satisfies

    lim_{y→0} r(y)/‖y‖ = lim_{y→0} ( x2 · y1²/‖y‖ + 2x1 · y1y2/‖y‖ + y1²y2/‖y‖ ) = 0.

(Verify this yourself using y1y2 = max{y1, y2} · min{y1, y2} and ‖y‖ ≥ ‖y‖∞.)
Therefore, the linear function

    dfx(y) = 2x1x2y1 + x1²y2

is the (total) derivative of f at x.
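One can observe the decay of r(y)/‖y‖ numerically (our own sketch, for a fixed but arbitrary point x):

```python
import math

# Remainder r(y) = x2*y1^2 + 2*x1*y1*y2 + y1^2*y2 from Example 8.34,
# evaluated at a fixed point x = (x1, x2).
x1, x2 = 1.5, -0.5

def r_over_norm(y1, y2):
    r = x2 * y1**2 + 2 * x1 * y1 * y2 + y1**2 * y2
    return r / math.hypot(y1, y2)

# Approach 0 along the diagonal y = (t, t) with shrinking t.
ratios = [abs(r_over_norm(t, t)) for t in (1e-2, 1e-4, 1e-6)]
```

The ratios shrink roughly linearly in t, illustrating r(y)/‖y‖ → 0 as y → 0.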
Let us now show that differentiability is indeed a stronger condition than the existence of partial deriva-
tives, and implies continuity.
Proof. Since f is differentiable at x, we know that dfx and r exist such that

    f(x + y) = f(x) + dfx(y) + r(y),

where dfx is linear and lim_{y→0} r(y)/‖y‖ = 0. Since dfx is linear, and therefore continuous, we have that lim_{y→0} dfx(y) = dfx(0) = 0. For the second limit we use lim_{y→0} |r(y)| = lim_{y→0} (|r(y)|/‖y‖) · ‖y‖ = 0. All in all, we obtain lim_{y→0} f(x + y) = f(x), i.e., f is continuous at x.
To show that all partial derivatives exist, and how they are related to the total derivative, first observe that, due to linearity, dfx can be written as dfx(y) = Σ_{i=1}^d ai yi for some a1, . . . , ad ∈ R. Using again the representation f(x + y) = f(x) + Σ_{i=1}^d ai yi + r(y) for the specific choices y = h ei (where ei denotes the i-th unit vector) with h → 0, h ∈ R, we obtain

    (f(x + h ei) − f(x))/h = (h ai + r(h ei))/h = ai + r(h ei)/h.

Since lim_{h→0} r(h ei)/h = lim_{y→0} r(y)/‖y‖ = 0, we see that ∂f/∂xi(x) exists and

    ∂f/∂xi(x) = lim_{h→0} (f(x + h ei) − f(x))/h = ai,

proving the theorem.
Example 8.36. If we have a look at Example 8.33, we saw that f, which was given by

    f(x) = ‖x‖²,

was differentiable for any x ∈ R³. Moreover, we calculated that dfx(y) = 2⟨x, y⟩ = 2 Σ_{i=1}^3 xi yi. The components of this linear function are given by 2(x1, x2, x3), which is exactly the gradient of f.
are contained in G. Again, ei denotes the i-th unit vector. The mean value theorem of differential calculus, see Theorem 5.34, then implies that there exists some ξk ∈ [0, 1] such that

    f(z^(k)) − f(z^(k−1)) = ∂f/∂xk( z^(k−1) + ξk yk ek ) · yk.

(Make sure you understand why we can apply the one-dimensional mean value theorem here!)
We set ηk = z^(k−1) + ξk yk ek and use a telescoping trick to see

    f(x + y) − f(x) = Σ_{k=0}^{d−1} ( f(z^(k+1)) − f(z^(k)) ) = Σ_{k=1}^{d} ∂f/∂xk(ηk) · yk.
Now we define

    ak = ∂f/∂xk(x)   and   r(y) = Σ_{k=1}^d ( ∂f/∂xk(ηk) − ak ) · yk,

such that

    f(x + y) − f(x) = Σ_{k=1}^d ak · yk + r(y) = ⟨∇f(x), y⟩ + r(y).
Due to the continuity of all partial derivatives, and since ηk → x for y → 0, it follows that lim_{y→0} ∂f/∂xk(ηk) = ak. Moreover, an application of the Cauchy-Schwarz inequality yields

    |r(y)| ≤ ‖( ∂f/∂x1(η1) − a1, . . . , ∂f/∂xd(ηd) − ad )‖ · ‖y‖,

where we set η = (η1, . . . , ηd) and a = (a1, . . . , ad). Putting everything together, we obtain that

    lim_{y→0} r(y)/‖y‖ = 0,

which shows that f is differentiable at x with dfx(y) = ⟨∇f(x), y⟩.
Remark 8.38. The opposite of the above theorem does not hold, i.e., there exist differentiable functions which are not continuously partially differentiable. One example that shows this is the function f(x) = ‖x‖² sin(1/‖x‖), continuously extended to x = 0 by setting f(0) = 0, which is differentiable at 0, but whose partial derivatives are not continuous. We omit the details.
The theorem which we just showed also allows us to make notation a bit easier, i.e., shorter. In particular, we now know that, if a function is continuously partially differentiable, then it is differentiable and we can interpret the gradient as its derivative. In this case the gradient is also continuous, since all its components are continuous. So from here on we call a function continuously differentiable if it is continuously partially differentiable. Nevertheless, it is important to be very precise if we do not have continuous partial derivatives.
Let us see some more examples.
Example 8.39. Consider a quadratic form

    f(x) = xᵀC x = Σ_{i=1}^d Σ_{j=1}^d cij xi xj

for a matrix C = (cij) ∈ R^{d×d} and x = (x1, . . . , xd) ∈ Rd. Functions of this form are of particular interest for quadratic optimization problems.
Linearity and the product rule imply that the k-th component of ∇f is given by

    ∂f/∂xk(x) = Σ_{i=1}^d Σ_{j=1}^d cij · ∂(xi xj)/∂xk = Σ_{i=1}^d Σ_{j=1}^d cij ( xi · ∂xj/∂xk + xj · ∂xi/∂xk )
              = Σ_{i=1}^d Σ_{j=1}^d cij ( xi · δjk + xj · δik ) = Σ_{i=1}^d cik xi + Σ_{j=1}^d ckj xj = ( (C + Cᵀ)x )_k,

where δjk denotes the Kronecker delta (δjk = 1 if j = k and δjk = 0 otherwise), and (y)k denotes the k-th entry of the (row/column) vector y. Therefore,

    ∇f(x) = xᵀ(C + Cᵀ) = ( (C + Cᵀ)x )ᵀ.
Clearly, all partial derivatives (i.e., entries of ∇f ) are linear and therefore continuous functions. By
Theorem 8.37 this implies that f is (totally) differentiable.
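This formula is easy to test numerically, even for a non-symmetric C (our own sketch):

```python
# Check grad f(x) = (C + C^T) x for f(x) = x^T C x, with a small non-symmetric C.
C = [[1.0, 2.0, 0.0],
     [0.0, 3.0, -1.0],
     [4.0, 0.0, 0.5]]

def f(x):
    return sum(C[i][j] * x[i] * x[j] for i in range(3) for j in range(3))

def grad_fd(x, h=1e-6):
    """Central-difference gradient of f (exact for quadratics, up to rounding)."""
    g = []
    for i in range(3):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

x = [1.0, -2.0, 0.5]
exact = [sum((C[i][j] + C[j][i]) * x[j] for j in range(3)) for i in range(3)]
numeric = grad_fd(x)
```

Both vectors agree up to rounding, illustrating that the matrix C + Cᵀ, not C itself, appears in the gradient.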
Consider now the function f(x) = e^{−‖x‖²}, x ∈ Rd. Using the chain rule, it follows that

    ∂f/∂xk(x) = −2xk · e^{−‖x‖²}.
Since all partial derivatives, and thus the gradient, are continuous, we see that f is differentiable for any
x ∈ Rd. Moreover,

    dfx(y) = ⟨∇f(x), y⟩ = −2e^{−‖x‖²} Σ_{k=1}^d xk yk = −2e^{−‖x‖²} ⟨x, y⟩.
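A quick numerical check of this gradient (our own sketch):

```python
import math

# f(x) = exp(-||x||^2) with grad f(x) = -2*exp(-||x||^2) * x.
def f(x):
    return math.exp(-sum(xi * xi for xi in x))

def grad(f, x, h=1e-6):
    """Central-difference gradient."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

x = [0.3, -0.7, 1.1]
numeric = grad(f, x)
exact = [-2.0 * f(x) * xi for xi in x]
```

The two gradients agree up to the finite-difference error.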
We finally discuss how to use multidimensional derivatives to describe the slope (or increase) of a function in a fixed direction. Note that in the multidimensional setting we need to decide in which direction we want to measure the slope. Just imagine you go for a walk on a mountain: there might be a different slope in each direction, and it might be of interest in which direction the largest increase or decrease occurs. It will turn out that the gradient actually points in the direction of largest increase/decrease.
For a direction v ∈ Rd with ‖v‖ = 1, the directional derivative of f at x in direction v is defined by

    Dv f(x) = lim_{h→0} (f(x + hv) − f(x))/h,

if the limit exists.
Remark 8.42. Again, there are many different notations for Dv f(x), like ∂f/∂v(x) and ∇v f(x).
Remark 8.43. If we choose v = ei , we see that Dei f = Di f is the i-th partial derivative.
Remark 8.44. The intuition might be that Dv f(x) is the height change if we make 'one step' of length 1 on the tangent plane, starting at x in direction v.
As before, we can give a formula for directional derivatives using the gradient.
Proof. As for the proof of Theorem 8.35 we use a special choice of y in the definition of the (total)
derivative to obtain the result.
By Definition 8.31 with y = h · v and h ∈ R small enough, we see that

    (f(x + hv) − f(x))/h = (dfx(hv) + r(hv))/h = dfx(v) + r(hv)/h,

where we used that dfx(hv) = h · dfx(v) since dfx is linear. Moreover, lim_{h→0} r(hv)/h = lim_{y→0} r(y)/‖y‖ = 0 implies that Dv f(x) = dfx(v). The rest of the statement follows from Theorem 8.35.
Example 8.47. As in Example 8.39, consider f(x) = xᵀC x, where C is a symmetric matrix. The gradient of f is

    ∇f(x) = 2Cx,

which implies that for any v with ‖v‖ = 1 the directional derivative is given by

    Dv f(x) = ⟨∇f(x), v⟩ = 2⟨Cx, v⟩.
Interestingly, it turns out that the gradient points in the direction of the largest slope. That is, v = ∇f(x)/‖∇f(x)‖ is the direction for which |Dv f(x)| is maximized.
Proof. Since Dv f(x) = ⟨∇f(x), v⟩, see Theorem 8.45, we obtain from the Cauchy-Schwarz inequality (Lemma 1.76) that

    |Dv f(x)| = |⟨∇f(x), v⟩| ≤ ‖∇f(x)‖ · ‖v‖ = ‖∇f(x)‖,

since ‖v‖ = 1. Moreover, we have equality if and only if v = c · ∇f(x), and this c must be ±1/‖∇f(x)‖ since, again, we require ‖v‖ = 1. Now, for c = 1/‖∇f(x)‖ we obtain Dv f(x) = ‖∇f(x)‖, and for c = −1/‖∇f(x)‖ we obtain Dv f(x) = −‖∇f(x)‖.
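Numerically, one can compare the directional derivative in the gradient direction with a few other unit directions (our own sketch; the vector `grad_f` is an arbitrary stand-in for ∇f(x)):

```python
import math

# Pretend grad f(x) = (3, -4), so ||grad f(x)|| = 5.
grad_f = [3.0, -4.0]
norm = math.hypot(*grad_f)

def D(v):
    """Directional derivative D_v f(x) = <grad f(x), v> for a unit vector v."""
    return sum(g * vi for g, vi in zip(grad_f, v))

best = D([g / norm for g in grad_f])                      # direction of the gradient
others = [D([math.cos(t), math.sin(t)])                   # some other unit directions
          for t in (0.0, 1.0, 2.0, 4.0, 5.5)]
```

The gradient direction attains the value ‖∇f(x)‖ = 5, and no other unit direction exceeds it, as Cauchy-Schwarz predicts.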
Clearly, and as in the univariate case, we sometimes want to differentiate a function more than once, which leads to the theory of higher-order partial derivatives. Again, as in the univariate case, this can be done by iterating the differentiation procedure. However, since there is more than one coordinate, it seems that we need to be careful about the order in which we calculate the derivatives. Luckily, this is not the case if the functions under consideration are 'nice enough', in which case we have

    ∂/∂xi ( ∂f/∂xj ) = ∂/∂xj ( ∂f/∂xi ).

That is, interchanging the order of differentiation does not change the partial derivative. In particular, we will see that this is true if all involved (second-order) partial derivatives are continuous.
    ∂²f/∂xj∂xi (x) := ∂/∂xj ( ∂f/∂xi ) (x)
(Make sure that you understand the difference between these definitions.)
Example 8.50. Consider the function f(x) = x1²x2² + x1 − x2 with x = (x1, x2). We observe that

    ∂f/∂x1 = 2x1x2² + 1   and   ∂f/∂x2 = 2x1²x2 − 1.

Differentiating ∂f/∂x1 once more w.r.t. x1 and x2, we see that

    ∂²f/∂x1² = ∂/∂x1 ( ∂f/∂x1 ) = ∂/∂x1 (2x1x2² + 1) = 2x2²

and

    ∂²f/∂x2∂x1 = ∂/∂x2 (2x1x2² + 1) = 4x1x2.

Analogously, we compute

    ∂²f/∂x1∂x2 = 4x1x2   and   ∂²f/∂x2² = 2x1².
This example shows that we need a systematic way to compute and write down second-order partial derivatives, especially for large d. Since we have to differentiate w.r.t. xi and xj for i, j ∈ {1, 2, . . . , d}, we can use a matrix to collect all these functions.
Example 8.52. The Hessian of the function from Example 8.50 above, which was given by f(x) = x1²x2² + x1 − x2 with x = (x1, x2), is

    Hf(x) = ( 2x2²    4x1x2
              4x1x2   2x1²  ).
In this example, the Hessian is a symmetric matrix, which raises the question for which functions we have ∂²f/∂xi∂xj = ∂²f/∂xj∂xi. We will see soon that this is guaranteed under rather weak assumptions.
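Second-order partial derivatives, and the symmetry of the mixed ones, can be checked with nested difference quotients (our own sketch for the function from Example 8.50):

```python
# f(x) = x1^2*x2^2 + x1 - x2, with mixed partials d^2f/dx1dx2 = d^2f/dx2dx1 = 4*x1*x2.
def f(x1, x2):
    return x1**2 * x2**2 + x1 - x2

def d1(x1, x2, h=1e-4):      # central difference for df/dx1
    return (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)

def d2(x1, x2, h=1e-4):      # central difference for df/dx2
    return (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)

def d12(x1, x2, h=1e-4):     # d/dx2 of df/dx1
    return (d1(x1, x2 + h) - d1(x1, x2 - h)) / (2 * h)

def d21(x1, x2, h=1e-4):     # d/dx1 of df/dx2
    return (d2(x1 + h, x2) - d2(x1 - h, x2)) / (2 * h)

H12 = d12(1.0, 2.0)   # analytically, 4*x1*x2 = 8 at (1, 2)
H21 = d21(1.0, 2.0)
```

Both orders of differentiation give (numerically) the same value, in line with the symmetry result below.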
However, we first show that the Hessian is related to second-order directional derivatives, i.e., to differentiating twice in given directions. That is, for two directions u, v ∈ S^{d−1}, we compute the directional derivative w.r.t. u of the directional derivative Dv f. This is similar to Theorem 8.45, where we showed that the gradient is connected to the (first-order) directional derivatives.
Proof. Since f is twice differentiable, all first-order partial derivatives are (totally) differentiable, and hence Dv f(x) = Σ_{i=1}^d ∂f/∂xi(x) · vi is differentiable, because it is a sum of differentiable functions. This implies that we can use the gradient of Dv f to compute the directional derivatives of Dv f, see Theorem 8.45. That is,

    Du(Dv f)(x) = Σ_{j=1}^d ∂(Dv f)/∂xj (x) · uj = Σ_{j=1}^d Σ_{i=1}^d ∂²f/∂xj∂xi (x) · vi uj = uᵀ Hf(x) v.
The next theorem, due to Hermann Schwarz (1843–1921) in 1873 and with a rather long history of earlier incomplete proof attempts, shows that the Hessian is symmetric (i.e., one can interchange partial derivatives) whenever f is twice continuously differentiable:

    ∂²f/∂xi∂xj (x) = ∂²f/∂xj∂xi (x).

In particular, this shows that the Hessian of f is a symmetric matrix, i.e., Hf(x) = (Hf(x))ᵀ.
Proof. First note that we can prove the statement individually for every pair (i, j) ∈ {1, . . . , d}², treating the other components as constants. Therefore, it is sufficient to prove the statement for the case d = 2, i = 1, and j = 2. This means we consider a function f : R² → R. Additionally, we assume w.l.o.g. that x = 0 ∈ G, which will save a lot of notation. (If x ≠ 0, then we consider the function g(·) = f(· + x).) We will use the notation f1 and f2 to denote the partial derivatives w.r.t. x1 and x2, respectively.
Observe that by definition of partial derivatives, we have
Now we fix small enough h, k ≠ 0 such that (k, h) ∈ G, and have a look at the univariate function g(t) := f(t, h) − f(t, 0). An application of the mean value theorem of differential calculus, see Theorem 5.34, yields that there exists some ξ ∈ [0, k] such that

    g(k) − g(0) = k · g′(ξ) = k · ( f1(ξ, h) − f1(ξ, 0) ).

(Here we use that g is continuous on [0, k] and differentiable on (0, k).)
Another application of the mean value theorem shows that there exists some η ∈ [0, h] such that

    f1(ξ, h) − f1(ξ, 0) = h · ∂f1/∂x2(ξ, η) = h · ∂²f/∂x2∂x1(ξ, η).

Plugging in for g(k) − g(0), we see that

    f(k, h) − f(k, 0) − f(0, h) + f(0, 0) = kh · ∂²f/∂x2∂x1(ξ, η).

Dividing by kh and letting k, h → 0 (so that also ξ, η → 0), the continuity of the second-order partial derivatives implies that

    lim_{(k,h)→0} ( f(k, h) − f(k, 0) − f(0, h) + f(0, 0) )/(kh) = ∂²f/∂x2∂x1(0, 0).

Starting instead with the function t ↦ f(k, t) − f(0, t), the same arguments show that this limit also equals ∂²f/∂x1∂x2(0, 0), which proves the claim.
Remark 8.55. The continuity of the second-order partial derivatives is necessary for the above result to hold for all such functions. If all second-order partial derivatives exist but are not necessarily continuous, then there are examples where the Hessian fails to be symmetric at some points x.
Note that, since all second-order partial derivatives are continuous (which should be known already before their computation), we do not need to compute ∂²f/∂x1∂x2 and ∂²f/∂x2∂x1 separately. They are just equal, so computing one of them is enough to write down the Hessian.
    D_{i_{k+1}} · · · D_{i_2} D_{i_1} f(x) := ∂/∂x_{i_{k+1}} ( · · · ∂/∂x_{i_2} ( ∂f/∂x_{i_1} ) · · · )(x)

for ij ∈ {1, . . . , d} with j = 1, . . . , k + 1.
If all k-th order partial derivatives are totally differentiable (at x), then we call f (k + 1)-times
differentiable (at x).
If all (k + 1)-st order partial derivatives are continuous (at x), then we call f (k + 1)-times contin-
uously differentiable (at x).
Theorem 8.58. Let G ⊂ Rd and f : G → R be k-times continuously differentiable. Then for any i1, i2, . . . , ik ∈ {1, 2, . . . , d} and any permutation σ of {1, . . . , k} we have

    D_{i_k} · · · D_{i_1} f = D_{i_{σ(k)}} · · · D_{i_{σ(1)}} f.
Proof. This follows by inductively applying the Schwarz theorem, see Theorem 8.54.
8.4 Extrema
As in the one dimensional case, one can use differential calculus to study optimization problems, i.e., to
find extrema. Although some of the techniques used here are (or at least look) more complicated than
the corresponding parts of Section 5, the overall strategy is the same:
1. We use the derivative to find candidates for (local) extrema, i.e., stationary points;
2. if possible, we use the second derivatives to check whether a candidate is a (local) maximum or minimum;
3. finally, we consider the boundary of the considered domain separately.
As we learned in Section 5 for the univariate case, stationary points are the candidates for local extrema, if they lie in the domain of the function. But also functions without stationary points may have extreme points. (Consider the easy example of a linear function on a closed interval.) For this reason, we always needed to compute the function values at the boundary points and/or consider the limits towards ±∞ to verify whether a function has global/local extrema.
Unfortunately, all these objects are more difficult to handle than in the univariate case. First of all, there are several (partial) derivatives at a given point, in contrast to the univariate case, and so we need to generalize the previous concepts. We will see that, here, the gradient plays the role of the derivative and the Hessian matrix substitutes the second derivative.
An additional difficulty comes from multivariate domains. While the boundary of a bounded interval, which was the typical domain for univariate functions, consists only of two points, the boundary in the multivariate setting is more complex and needs more care. We will see an example in a moment.
However, let us first recall the definition of global extrema from Definition 4.54.
Note that this is word for word the same definition; it works for arbitrary domains.
Also the definition of local extrema comes only with minimal modifications, see Definition 5.24.
We say f has a local minimum at x0 ∈ Ω if there exists ε > 0 such that f(x) ≥ f(x0) for all x ∈ Uε(x0) ∩ Ω, and a strict local minimum if f(x) > f(x0) for all x ∈ Uε(x0) ∩ Ω \ {x0}. Analogously, we say f has a local maximum at x0 ∈ Ω if there exists ε > 0 such that f(x) ≤ f(x0) for all x ∈ Uε(x0) ∩ Ω, and a strict local maximum if f(x) < f(x0) for all x ∈ Uε(x0) ∩ Ω \ {x0}.
The point x0 ∈ Ω is called local maximum/minimum point, or local extreme point.
We use Ω as a domain here, to make clear that this set does not need to be an open set. (Note that we
assumed that G is always open.)
Before we turn to the actual computation of extrema and extreme values, let us first consider the question whether a function has an extremum at all. For this, recall from the extreme value theorem (Theorem 4.56) that every continuous univariate function f : [a, b] → R, defined on a closed interval, attains its minimum and maximum.
Theorem 8.61 (Extreme value theorem). Let C ⊂ Rd be a closed and bounded set and f : C → R be a continuous function. Then there exist xmin, xmax ∈ C such that

    f(xmin) ≤ f(x) ≤ f(xmax)   for all x ∈ C.

In other words, continuous functions attain their extreme values on closed and bounded sets.
Example 8.62. Consider the function f(x) = e^{−(x1² + 2x2²)} on the closed unit disk Ω = {x ∈ R² : ‖x‖ ≤ 1}. From what we know about exponentials, we see that f(x) ≤ 1 for all x ∈ R², and that f(x) = 1 if and only if x = (0, 0). Therefore, f(0) > f(x) for every x ≠ 0, which implies that x0 = (0, 0) is a strict local maximum as well as the global maximum of f.
When looking for a minimum, we first note that f(x1, x2) converges to zero when x1 and/or x2 tend to infinity, but f > 0 everywhere. Therefore, if we considered f on all of R², then f would not have a minimum. (One would say its infimum is 0.) However, we consider extreme values on Ω and, since we have |x1| ≤ 1 and |x2| ≤ 1 for all x = (x1, x2) ∈ Ω, it is obvious that

    f(x) = e^{−(x1² + 2x2²)} ≥ e^{−(1+2)} = e^{−3} ≥ 0.049   for all x ∈ Ω.
Moreover, due to monotonicity, we see that the minimum lies on the boundary {x : ‖x‖ = 1}. It can already be seen (e.g., from a contour plot) that f decreases fastest in the x2-direction, and therefore that the global/local minima are at x0 = (0, ±1) with f(x0) = e^{−2}. However, it remains to verify this in a systematic way.
This example shows that even in simple examples, it might be non-trivial to determine all extrema when
we are working on bounded sets. Therefore, we first consider here functions that are defined on
the whole Rd , i.e., we consider extrema of functions f : Rd → R. In the context of optimization, this is
sometimes called free (or unconstrained) optimization. We will come back to extrema on subsets Ω ⊂ Rd ,
which corresponds to constrained optimization, afterwards.
Remark 8.63. It might be interesting to note here that the set Rd is open and closed at the same time, see Definition 8.19. Openness is clear, and closedness follows from the fact that its complement, the empty set ∅, is open by definition. (As there is no x ∈ ∅, the requirement of being open is trivially true.) Therefore, all the results from the previous subsections, which were stated for open sets, are valid for G = Rd.
Before we turn to the generalization of the concepts from Section 5, let us discuss another simple exam-
ple that shows that there might be infinitely many (local) extrema, but none of them is a strict local
extremum.
Consider the function f(x1, x2) = sin(x1 + x2). We already know that sin(t), t ∈ R, is maximal whenever t = π/2 + 2kπ and minimal whenever t = 3π/2 + 2kπ, where k ∈ Z. So all possible maxima x0 = (x1, x2) have to satisfy x1 + x2 = π/2 + 2kπ for some k ∈ Z, and for such a point we have f(x0) = 1. This implies that f(x) ≤ f(x0) for all x ∈ R², i.e., such x0 are (local) maxima. (Recall that sin(t) ∈ [−1, 1] for all t ∈ R.)
However, f does not have any strict local maxima. To see this, let x0 = (x1, x2) be a local maximum and y = (x1 + δ, x2 − δ) with δ > 0 be another point. We obtain

    f(y) = sin(x1 + δ + x2 − δ) = sin(x1 + x2) = f(x0).

Hence, y is also a maximum and, since y can be arbitrarily close to x0 for δ small enough, we obtain that x0 cannot be a strict local maximum. The same arguments work for the minima of f.
Theorem 8.65 (Necessary condition for an extreme point). Let G ⊂ Rd be open and f : G → R be
partially differentiable. If x0 ∈ G is a local extremum, then
∇f (x0 ) = 0.
This means that, if ∇f(x) ≠ 0 for some point x, then this point cannot be an extremum.
Proof. We show that the statement is true if x0 is a local maximum. The same arguments can be used
if x0 is a local minimum. For this, we consider each entry of the gradient separately and use the known
results from the univariate setting. Fix i ∈ {1, . . . , d} and let U = Uε (x0 ) be a neighborhood of x0 such
that f (x0 ) ≥ f (x) for all x ∈ U .
Since x0 + tei ∈ U for t ∈ (−ε, ε), we have by assumption that the function g : (−ε, ε) → R,

    g(t) := f(x0 + tei),

exists and is differentiable. (We use that the i-th partial derivative of f exists.) Moreover, again by x0 + tei ∈ U, we see that

    g(0) = f(x0) ≥ f(x0 + tei) = g(t)

for any t ∈ (−ε, ε). Thus, g has a local maximum at 0, implying that

    g′(0) = 0,

see Theorem 5.25. Since g′(0) = ∂f/∂xi(x0), and the above holds for all 1 ≤ i ≤ d, the result follows.
Let us see how we can use this result to determine local extrema.
Example 8.66. We consider the function f(x) = e^{−x1²−2x2²}, x = (x1, x2), from Example 8.62. Computing the gradient, we see that

    ∇f(x) = ( −2x1 e^{−x1²−2x2²}, −4x2 e^{−x1²−2x2²} ) = −2f(x) · (x1, 2x2).

Since f(x) = e^{−x1²−2x2²} ≠ 0 for any x ∈ R², we see that

    ∇f(x) = 0 ⟺ x = 0.
Hence, the only possible local extremum of f is at x0 = 0. Although we already know for this function
that x0 = 0 is a maximum, we still need a systematic way for general functions to verify if a stationary
point is indeed a maximum or a minimum.
We already mentioned that Theorem 8.65 is only a necessary condition for x0 to be an extremum. Unfortunately, it is not a sufficient condition, as the next example shows. Consider the function

    f(x1, x2) = x1² · x2.

To check for extrema, we compute

    ∇f(x) = (2x1x2, x1²),
implying that ∇f(x) = 0 ⟺ x1 = 0. Note that x2 can be chosen arbitrarily. Therefore, all points in the set {(0, x2) : x2 ∈ R} are stationary points, and they are the only candidates for local extrema. However, we have f(0, x2) = 0 for every x2 and therefore, as indicated by the plot, none of these points is a global extremum, as the function attains both smaller and larger values than 0. Regarding local extrema, note that f(x1, x2) ≥ 0 whenever x2 ≥ 0 and f(x1, x2) ≤ 0 whenever x2 ≤ 0. Therefore, if x2 > 0, then there is a neighborhood Uε(x0) around x0 = (0, x2) such that f(x) ≥ 0 = f(x0) in Uε(x0). (One can choose
ε = x2.) This shows that x0 is a local minimum. In the same way, we obtain that every x0 = (0, x2) with x2 < 0 is a local maximum. It remains to check the point x0 = (0, 0). For this, note that, for every ε > 0, we have f(ε, ε) > 0 and f(ε, −ε) < 0. Therefore, every neighborhood of (0, 0) contains points with smaller and larger function values, which shows that (0, 0) is not a (local) extremum, although ∇f(0, 0) = 0.
We now turn to a method, based on derivatives, to verify if a function has a maximum or a minimum,
or no extremum at all. This method is, similarly to the univariate case, called the second partial
derivative test, see Theorem 5.29. Recall that in the univariate case we used positivity/negativity of
the second derivative to decide if a stationary point is a minimum/maximum. However, since the second
derivative of a multivariate function is represented by a matrix, i.e., the Hessian matrix, we first need a
notion of positivity of a matrix.
We call a symmetric matrix A ∈ R^{d×d}
• positive definite if vᵀAv > 0 for all v ∈ Rd \ {0},
• positive semi-definite if vᵀAv ≥ 0 for all v ∈ Rd,
• negative definite if vᵀAv < 0 for all v ∈ Rd \ {0},
• negative semi-definite if vᵀAv ≤ 0 for all v ∈ Rd.
If A is neither positive nor negative semi-definite, we call it indefinite.
Remark 8.70. One time-saving argument to verify that a matrix is indefinite is that it has entries with different signs on the diagonal. To see this, one just needs to consider the unit vectors ei. Using that ei^T A ei = Aii, i.e., the i-th diagonal entry of A, we see that different signs on the diagonal of A imply that there are i, j such that ei^T A ei < 0 and ej^T A ej > 0, which makes the matrix indefinite. However, the converse is not true: a matrix with all diagonal entries of the same sign is not necessarily definite.
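The diagonal shortcut from Remark 8.70 can be sketched in a few lines. This is an illustrative sketch only (Python with NumPy is an assumption, not part of these notes); the function name `indefinite_by_diagonal` is hypothetical.

```python
import numpy as np

def indefinite_by_diagonal(A):
    """Sufficient (but not necessary) test for indefiniteness:
    mixed signs on the diagonal, since e_i^T A e_i = A[i, i]."""
    d = np.diag(A)
    return bool((d > 0).any() and (d < 0).any())

# e_0^T A e_0 = 1 > 0 and e_1^T A e_1 = -4 < 0, so A is indefinite.
A = np.array([[1.0, 3.0], [3.0, -4.0]])
print(indefinite_by_diagonal(A))  # True
```

Note that the test returning `False` proves nothing: it is only a sufficient criterion.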
Example 8.71. Consider the matrix A = ( 1  3 ; 3  4 ). We see that

    v^T A v = (v1, v2) · (v1 + 3v2, 3v1 + 4v2)^T = v1^2 + 4v2^2 + 6v1v2.

This expression is positive, e.g., for v = (1, 0)^T, but negative, e.g., for v = (−2, 1)^T. Hence, A is indefinite.
Example 8.72. Consider the matrix A = ( 2  1 ; 1  1 ). We see that

    v^T A v = (v1, v2) · (2v1 + v2, v1 + v2)^T = 2v1^2 + v2^2 + 2v1v2.

This expression is positive for all v ∈ R^2 \ {0}, since 2v1^2 + v2^2 + 2v1v2 = v1^2 + (v1 + v2)^2, which vanishes only for v1 = v2 = 0. Hence, the matrix is positive definite. However, such a direct verification is often not obvious, and we therefore introduce a more systematic method.
Determining whether a matrix is positive/negative definite can be quite time-consuming and is in most cases not straightforward. We therefore present an easy method based on the determinant of a matrix, see Section 2.4. This is also the fastest method available, at least for small matrices. Note that this method does not allow for determining semi-definiteness or indefiniteness.
Lemma 8.73 (Sylvester's criterion). Let A = (aij)_{i,j=1}^d ∈ R^{d×d} be a symmetric matrix, and let A_k = (aij)_{i,j=1}^k ∈ R^{k×k} be the (upper left) submatrices of A. Then, A is ...

• positive definite if and only if det(A_k) > 0 for all k = 1, . . . , d.
• negative definite if and only if det(A_k) > 0 for even k and det(A_k) < 0 for odd k.
A proof of this result is out of reach with our present knowledge and can be found in the literature.
Consider the matrices A = ( 1  3 ; 3  4 ) from Example 8.71 and B = ( 0  1 ; 1  0 ). Since det(A1) = 1, det(A2) = det(A) = −5 and det(B1) = 0, we observe from Sylvester's criterion that A and B both cannot be positive or negative definite.
Indeed, we know from Example 8.71 that A is indefinite. For B, we see that v^T B v = (v1, v2) · (v2, v1)^T = 2v1v2 is positive for v = (1, 1)^T, and negative for v = (−1, 1)^T, proving that B is indefinite.
First, note that A has entries with different signs on the diagonal, and is therefore indefinite, see Remark 8.70. (As det(A1) = 1 and det(A2) = 0, Sylvester's criterion is inconclusive here.)
For B, we observe that det(B1) = −1 < 0, det(B2) = 1 > 0 and det(B3) = det(B) = −1 < 0. By Sylvester's criterion, B is therefore negative-definite.
(A proof without Sylvester's criterion would be a mess!)
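Sylvester's criterion (Lemma 8.73) is easy to mechanize. The following sketch (assuming Python with NumPy, which these notes do not use themselves) computes the leading principal minors and falls back to the eigenvalues for the cases the criterion cannot decide; the function name `classify_definiteness` is hypothetical.

```python
import numpy as np

def classify_definiteness(A):
    """Classify a symmetric matrix via Sylvester's criterion
    (leading principal minors); eigenvalues decide the remaining cases,
    which the criterion itself cannot."""
    d = A.shape[0]
    minors = [np.linalg.det(A[:k, :k]) for k in range(1, d + 1)]
    if all(m > 0 for m in minors):
        return "positive definite"
    if all((m < 0) if k % 2 == 1 else (m > 0)
           for k, m in enumerate(minors, start=1)):
        return "negative definite"
    # Sylvester's criterion is inconclusive here; inspect the spectrum.
    eig = np.linalg.eigvalsh(A)
    if (eig > 0).any() and (eig < 0).any():
        return "indefinite"
    return "semi-definite"

print(classify_definiteness(np.array([[2.0, 1.0], [1.0, 1.0]])))  # positive definite
print(classify_definiteness(np.array([[1.0, 3.0], [3.0, 4.0]])))  # indefinite
```

The two printed cases match Examples 8.71 and 8.72 above.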
Based on the notion of definiteness, we can finally give a sufficient condition for being an extremum.
Theorem 8.77 (Second (partial) derivative test). Let f : G → R be twice continuously differentiable
and let x0 ∈ G such that ∇f (x0 ) = 0. Then, we have
1) Hf (x0 ) is positive-definite =⇒ x0 is a strict local minimum
2) Hf (x0 ) is negative-definite =⇒ x0 is a strict local maximum
3) Hf (x0 ) is indefinite =⇒ x0 is not an extremum of f
In all other cases (i.e., semi-definite but not definite), we do not gain information from the second
derivative test.
Remark 8.78. We actually show below that points x0 ∈ G with ∇f(x0) = 0 and Hf(x0) indefinite are points such that for every ε > 0 there exist x, y ∈ Uε(x0) with f(x) > f(x0) and f(y) < f(x0). That is, every neighborhood contains strictly smaller and strictly larger function values. Such points, which are clearly no extrema, are usually called saddle points.
Proof of Theorem 8.77. We consider the first case, i.e., that Hf (x0 ) is positive-definite. First, note that
there is some α > 0 such that v T Hf (x0 )v ≥ α for all v ∈ Sd−1 , i.e., all v with v T v = 1. (This can be
proven by using that v T Hf (x0 )v must attain its minimum on Sd−1 .) Now note that, by continuity of all
second-order partial derivatives, there is some ε > 0, such that
    | ∂^2 f/∂xi∂xj (x) − ∂^2 f/∂xi∂xj (x0) | < α/(2d)

for all x ∈ Uε(x0) and all i, j = 1, . . . , d. That is, ∂^2 f/∂xi∂xj (x) is 'close' to ∂^2 f/∂xi∂xj (x0) in a small neighborhood around x0. From this, we obtain that

    | v^T ( Hf(x) − Hf(x0) ) v | = | Σ_{i=1}^d Σ_{j=1}^d ( ∂^2 f/∂xi∂xj (x) − ∂^2 f/∂xi∂xj (x0) ) · vi vj |
        ≤ Σ_{i=1}^d Σ_{j=1}^d | ∂^2 f/∂xi∂xj (x) − ∂^2 f/∂xi∂xj (x0) | · |vi| |vj|
        < α/(2d) · ( Σ_{i=1}^d |vi| ) · ( Σ_{j=1}^d |vj| )
        ≤ α/2 · ||v||^2 = α/2,

see Lemma 8.8. This implies

    v^T Hf(x) v = v^T Hf(x0) v + v^T ( Hf(x) − Hf(x0) ) v
        ≥ v^T Hf(x0) v − | v^T ( Hf(x) − Hf(x0) ) v |
        > α − α/2 = α/2 > 0

for all x ∈ Uε(x0) and v ∈ S^{d−1}. That is, Hf(x) is also positive-definite in a neighborhood of x0.
We now fix some v ∈ Sd−1 and consider the univariate function g(t) := f (x0 + tv) − f (x0 ). We see that
g(0) = 0 and g 0 (0) = Dv f (x0 ) = h∇f (x0 ), vi = 0, by assumption. Taylor’s theorem (Theorem 5.53) now
shows that
    g(t) = g(0) + g'(0) · t + g''(ξ)/2 · t^2 = g''(ξ)/2 · t^2
for some ξ ∈ (0, t). From Theorem 8.53, we obtain that g 00 (ξ) = Dv2 f (x0 + ξv) = v T Hf (x0 + ξv)v. In
particular, positive-definiteness implies that g 00 (ξ) > 0 for all ξ ∈ [0, ε), independent of v.
Since t ∈ (0, ε) implies ξ ∈ (0, ε), we have g(t) > 0, i.e., f (x0 + tv) > f (x0 ), for all t ∈ (0, ε), independent
of v. In other words, f (x) > f (x0 ) for all x ∈ Uε (x0 ) \ {x0 }, which proves the claim.
The case of Hf (x0 ) negative-definite follows from considering −f instead. (Note that H−f (x0 ) is positive-
definite then.)
Finally, if Hf(x0) is indefinite, then there exist some u, v ∈ S^{d−1} such that

    v^T Hf(x0) v > 0   and   u^T Hf(x0) u < 0,

and, by the continuity argument from above, these inequalities remain valid for Hf(x) with x in a neighborhood of x0. Thus, the (univariate) function g(t) = f(x0 + tv) − f(x0) is positive, and
the function h(t) = f (x0 + tu) − f (x0 ) is negative, for every small enough t > 0. In other words, every
neighborhood of x0 contains points with smaller and larger function value, respectively, which shows that
f cannot have an extremum at x0 .
seen by Sylvester’s criterion (Lemma 8.73). Hence, Hf (0, 0) is positive definite, and x0 is a strict local
minimum.
Example 8.80. The previous example can be generalized by considering functions of the form
f (x) = xT Cx,
where C is a symmetric matrix, see also Example 8.39. Moreover, we already showed there that ∇f (x) =
2Cx, which implies that all possible extrema have to satisfy Cx = 0. Hence, we always have that x0 = 0
is a stationary point and that f (0) = 0. Moreover, we obtain that Hf (x) = 2C for any x, i.e., Hf (x) is
independent of x.
If C is positive-definite, then C is an invertible matrix (We prove that later.), which implies that 0 is
the only stationary point. Moreover, by positive-definiteness of the Hessian Hf = 2C, we obtain from
Theorem 8.77 that x0 = 0 is a strict local minimum. (This could also be seen from xT Cx > 0 for every
x 6= 0.) If C is negative-definite it follows analogously that 0 is a strict local maximum and that there
are no other extrema. If C is indefinite and invertible there are no extrema at all.
Finally, if C is 'only' positive/negative semi-definite or, more generally, not invertible, then we cannot say anything just by using the second derivative test.
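The identities ∇f(x) = 2Cx and Hf(x) = 2C from Example 8.80 can be checked numerically. The following sketch (Python/NumPy is an assumption of this illustration; `num_grad` is a hypothetical helper) compares the analytic gradient of f(x) = x^T C x with a finite-difference approximation.

```python
import numpy as np

C = np.array([[2.0, 1.0], [1.0, 1.0]])   # symmetric (and positive definite)
f = lambda x: x @ C @ x                   # quadratic form f(x) = x^T C x

def num_grad(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([0.7, -1.3])
# For symmetric C the gradient is exactly 2*C*x:
print(np.allclose(num_grad(f, x), 2 * C @ x, atol=1e-5))  # True
```

With a positive-definite C, the only stationary point x = 0 is then a strict local (indeed global) minimum, in line with the discussion above.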
In the special case d = 2, we can employ Sylvester’s criterion to obtain a very useful formulation of the
second derivative test.
Corollary 8.81 (Second derivative test for d = 2). Let G ⊂ R2 , f : G → R be twice continuously
differentiable and let x0 ∈ G such that ∇f (x0 ) = 0.
Moreover, let H := Hf (x0 ) be the Hessian of f at x0 with upper left entry H11 . Then, we have
1) det(H) > 0 and H11 > 0 =⇒ x0 is a strict local minimum
2) det(H) > 0 and H11 < 0 =⇒ x0 is a strict local maximum
3) det(H) < 0 =⇒ x0 is not an extremum of f
If det(H) = 0, we do not gain information from the second derivative test.
Again, we omit the proof. (The third point cannot be proven here.)
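Corollary 8.81 translates directly into a small decision procedure. This is an illustrative sketch (Python/NumPy assumed; the function name is hypothetical), not part of the notes.

```python
import numpy as np

def second_derivative_test_2d(H):
    """Classify a stationary point of a function of two variables
    from its symmetric 2x2 Hessian H (Corollary 8.81)."""
    detH, H11 = np.linalg.det(H), H[0, 0]
    if detH > 0 and H11 > 0:
        return "strict local minimum"
    if detH > 0 and H11 < 0:
        return "strict local maximum"
    if detH < 0:
        return "no extremum (saddle point)"
    return "inconclusive"

# f(x1, x2) = x1^2 + x2^2 has Hessian 2*I at its stationary point (0, 0):
print(second_derivative_test_2d(np.array([[2.0, 0.0], [0.0, 2.0]])))  # strict local minimum
```

The `det(H) = 0` branch returns "inconclusive", mirroring the statement of the corollary.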
Example 8.82. We verify that the function f(x) = e^{−(x1^2 + 2x2^2)} with x = (x1, x2) from Example 8.62 has a (global) maximum at x0 = (0, 0). We already know from Example 8.66 that ∇f(x) = −2f(x) · (x1, 2x2), and that this implies that x0 = (0, 0) is the only stationary point of f. Computing the Hessian matrix of f we obtain

    Hf(x) = f(x) · ( 4x1^2 − 2   8x1x2 ; 8x1x2   16x2^2 − 4 ).

In particular, Hf(0, 0) = ( −2  0 ; 0  −4 ), which is negative-definite, so x0 = (0, 0) is a strict local maximum by Theorem 8.77. Since f(x) = e^{−(x1^2 + 2x2^2)} ≤ 1 = f(0, 0) for all x, it is even the global maximum.
As another example, consider again the function f(x1, x2) = x1^2 · x2 from Example 8.67. Since ∇f(x) = (2x1x2, x1^2), we saw that all points in the set {(0, x2) : x2 ∈ R} are stationary points, and we also verified which of them are extrema.
Moreover, we easily obtain the Hessian

    Hf(x1, x2) = ( 2x2  2x1 ; 2x1  0 ),

and therefore Hf(0, x2) = ( 2x2  0 ; 0  0 ). This matrix has det(Hf(0, x2)) = 0 for every x2 ∈ R. Therefore, the second derivative test does not lead to an answer whether some of the stationary points are extrema or not.
Next, consider the function f(x1, x2) = sin(x1) cos(x2). The gradient of f is

    ∇f(x) = ( cos(x1) cos(x2), − sin(x1) sin(x2) )

and the Hessian, which we already computed in Example 8.56, is

    Hf(x) = (−1) · ( sin(x1) cos(x2)   cos(x1) sin(x2) ; cos(x1) sin(x2)   sin(x1) cos(x2) ).
Let us compute all stationary points of f, i.e., all x = (x1, x2) such that ∇f(x) = 0. Since sin(t) and cos(t) cannot be zero at the same time (i.e., for the same t), we obtain that ∇f(x) = 0 if either

    cos x1 = 0 and sin x2 = 0,   or   sin x1 = 0 and cos x2 = 0.

We start with the first case, i.e., cos x1 = 0 and sin x2 = 0, which implies that

    x1 = k1π + π/2 = (2k1 + 1)π/2   and   x2 = k2π,

for some k1, k2 ∈ Z. (For example, x1 = π/2 and x2 = 0.)
Plugging this into the Hessian, we see that for such an x = (x1, x2) we have

    Hf(x) = (−1) · ( (−1)^{k1+k2}  0 ; 0  (−1)^{k1+k2} ).

(Make sure that you understand the basic properties of cos/sin that lead to this.)
This shows that Hf(x) = ( −1  0 ; 0  −1 ) if k1 + k2 is even, and that Hf(x) = ( 1  0 ; 0  1 ) if k1 + k2 is odd. Therefore,

    x0 = ( k1π + π/2, k2π )  is a strict local  maximum, if k1 + k2 is even,  and a strict local  minimum, if k1 + k2 is odd.

(Verify that the corresponding matrices are pos./neg. definite!)
For example, the point (25π/2, 42π) is a strict local maximum (k1 = 12, k2 = 42).
If we consider the second case, i.e., sin x1 = 0 and cos x2 = 0, we need that

    x1 = k1π   and   x2 = k2π + π/2 = (2k2 + 1)π/2,

for some k1, k2 ∈ Z. (For example, x1 = 0 and x2 = π/2.) Note that the contour plot above already shows,
that there cannot be an extremum at such points, because every point on these ’lines’ has points with
smaller and larger function value around it.
However, to prove this, plug these (x1, x2) into the Hessian, to obtain

    Hf(x) = (−1) · ( 0  (−1)^{k1+k2} ; (−1)^{k1+k2}  0 ).

This shows that either Hf(x) = ( 0  1 ; 1  0 ) or Hf(x) = ( 0  −1 ; −1  0 ), but both matrices have an eigenvalue 1 and an eigenvalue −1, making Hf(x) indefinite. Therefore, all points x0 = ( k1π, k2π + π/2 ) with k1, k2 ∈ Z are no extrema. (They are actually saddle points, see Remark 8.78.)
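The classification above can be sanity-checked numerically. Consistent with the gradient computed above, take f(x1, x2) = sin(x1) · cos(x2); the sketch below (Python/NumPy assumed, not part of the notes) samples a neighborhood of the stationary point with k1 = 12, k2 = 42 and confirms it is a local maximum.

```python
import numpy as np

f = lambda x1, x2: np.sin(x1) * np.cos(x2)

# Stationary point with k1 = 12, k2 = 42 (k1 + k2 even): a strict local maximum.
x0 = np.array([25 * np.pi / 2, 42 * np.pi])
print(np.isclose(f(*x0), 1.0))  # True: sin(25*pi/2) = cos(42*pi) = 1

# Every value in a small sampled neighborhood should be <= f(x0).
rng = np.random.default_rng(0)
pts = x0 + 0.1 * rng.uniform(-1, 1, size=(1000, 2))
print(np.all(f(pts[:, 0], pts[:, 1]) <= f(*x0) + 1e-12))  # True
```

Sampling of course only supports, and cannot replace, the second derivative test.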
The results above allow to find extrema of functions which are defined on open sets, like G = Rd , and
to verify whether these extrema are minima or maxima. However, in many applications we are interested in
extrema which are contained in some given (closed) set Ω ⊂ G. For this, we have to consider the boundary
of the set separately. Recall from the univariate case that a function defined on a closed interval can
have a minimum/maximum at the boundary points. For example, the function f (t) = 2t on [1, 2] has a
minimum at 1 and a maximum at 2, but the derivative of f is nowhere zero. These boundary points are
easy to check, but for multivariate functions the boundary is more complex.
In what follows, we consider only functions defined on G = R^d and we want to find their extrema in sets that are given by

    Ω = {x ∈ R^d : g(x) ≤ c}

for some function g : R^d → R and c ∈ R. For the sake of simplicity, we assume here that the function g
is a continuously differentiable function. From this we obtain, in particular, that the interior
Ωo := {x ∈ Rd : g(x) < c}
of the set Ω is an open set. We can therefore find all extrema of a function inside Ωo by the techniques
from above. It remains to consider the boundary of Ω, i.e.,

    ∂Ω := {x ∈ R^d : g(x) = c}.
Extrema of a function on the boundary ∂Ω, which are defined in the same way as in Definition 8.60 with
Ω replaced by ∂Ω, are called extrema subject to the constraint g(x) = c.
Before we come to a more systematic way of finding the extrema of a function, let us note that the equation g(x) = g(x1, . . . , xd) = c sometimes leads to an 'easy' restriction of just one coordinate, which we may simply plug into our original problem to find the extrema on ∂Ω. That is, g(x1, . . . , xd) = c can sometimes be written as xd = h(x1, . . . , xd−1) for some function h : R^{d−1} → R. In this case, finding an extremum of f on ∂Ω is just the same as finding an extremum of

    F(x1, . . . , xd−1) := f( x1, . . . , xd−1, h(x1, . . . , xd−1) )

on R^{d−1}. That is, the restriction just reduces the dimension by one. Let us see an example.
Example 8.85. Consider the function f(x1, x2) = x1^2 + x2^2 on the set Ω = {x ∈ R^2 : g(x) ≤ −2}, where g(x1, x2) := −x1 − x2. To find the extrema of f in Ω we first compute the stationary points of f on R^2, i.e., all points such that ∇f vanishes. We already saw that x = (0, 0) is the only stationary point of f. However, (0, 0) is not contained in Ω, because it satisfies g(0, 0) = 0 ≰ −2, and therefore cannot be an extremum of f in Ω. Hence, there are no (local) extrema in Ω°.
To treat the boundary, i.e., ∂Ω = {x ∈ R^2 : g(x) = −2}, note that

    g(x1, x2) = −2 ⟺ x2 = 2 − x1.

Therefore, every point (x1, x2) with g(x1, x2) = −2 is of the form (x1, 2 − x1). To find an extremum of f is therefore the same as finding an extremum of the (univariate) function

    F(x1) := f(x1, 2 − x1) = x1^2 + (2 − x1)^2 = 2x1^2 − 4x1 + 4.

We see that this function has a minimum at x1 = 1. Since x2 = 2 − x1, we obtain that f has a minimum
subject to the constraint x1 + x2 = 2 at the point (1, 1). By geometric reasoning, we see that (1, 1) is a
global minimum of f in Ω, and that f has no maximum. (Make a plot!)
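The substitution idea can be sketched numerically. Assuming (consistent with Example 8.88 below) f(x1, x2) = x1^2 + x2^2 subject to x1 + x2 = 2, we substitute x2 = 2 − x1 and minimize the resulting univariate function on a grid. Python/NumPy is an assumption of this illustration.

```python
import numpy as np

# Substituting x2 = 2 - x1 into f(x1, x2) = x1^2 + x2^2 gives the
# univariate function F(x1) = x1^2 + (2 - x1)^2; F'(x1) = 4*x1 - 4 = 0
# yields x1 = 1, i.e., the constrained minimizer (1, 1).
F = lambda x1: x1**2 + (2 - x1)**2

grid = np.linspace(-3, 5, 10001)
x1_min = grid[np.argmin(F(grid))]
print(x1_min)  # approximately 1.0
```

The grid search is only a crude check of the closed-form computation; the exact minimizer comes from the derivative.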
The last example shows that it is sometimes easy to incorporate the constraint and then find the extrema
on the boundary. However, this is clearly not always the case, and we need a way to compute extrema
subject to a constraint g(x) = c systematically. This is done by the method of Lagrange multipliers.
For this, we define the Lagrange function

    L(x, λ) := f(x) + λ · ( g(x) − c ),
where the number λ ∈ R is called the Lagrange multiplier. Note that the function L : Rd+1 → R now
depends on d + 1 variables, namely the d ’original’ variables and λ.
It turns out that finding extrema subject to constraints is just the same as finding the extrema of the
Lagrange function L. For this we need to compute the gradient, i.e., all partial derivatives, of L and find
points where it is zero. Note that the gradient of L(x, λ) is given by

    ∇L(x, λ) = ( ∂L/∂x1, . . . , ∂L/∂xd, ∂L/∂λ )(x, λ)

and that the partial derivative w.r.t. λ is just

    ∂L(x, λ)/∂λ = g(x) − c.
Therefore, setting this partial derivative to zero is equivalent to g(x) = c, which is precisely the constraint.
This partial derivative is therefore not of much interest and we mostly need only the first entries of the
gradient, which we denote by

    ∇x L := ( ∂L/∂x1, . . . , ∂L/∂xd ),

and, by the definition of the Lagrange function, ∇x L(x, λ) = ∇f(x) + λ∇g(x). The central result (Theorem 8.86) states: if x0 is a local extremum of f subject to the constraint g(x) = c and ∇g(x0) ≠ 0, then there is some λ ∈ R such that

    ∇f(x0) = −λ∇g(x0).
The equation from this theorem means that the gradients of f and g at the point x0 are parallel, i.e., they point in the same or in opposite directions. The additional constant λ is necessary, because the gradients might be of different length.
A formal proof of Theorem 8.86 is out of reach at the moment, as it is based on the (rather involved)
implicit function theorem that we don’t discuss here. However, we can give a geometric explanation of
this necessary condition, see Figure 54.
Figure 54: Contour plot of a function and a constraint with their gradients.
Note that the gradient of a function f at x0 is always perpendicular to the ’surface’ {x : f (x) = f (x0 )},
i.e., the level set at height f (x0 ). Therefore, if f has an extremum, say a maximum, subject to g(x) = c
at x0 , then the level set {x : f (x) = f (x0 )} ’touches’ the set {x : g(x) = c} in a single point, implying
that the gradients of f and g at x0 are parallel. If this were not fulfilled, then one could 'wander' along {x : g(x) = c} and increase the function value, which contradicts that x0 is a maximum.
Theorem 8.86 implies that for each extremum at x0 subject to a constraint g(x) = c, with ∇g(x0) ≠ 0,
there is some λ such that ∇f (x0 ) + λ∇g(x0 ) = 0. That is, if x0 is a (local) extremum subject to g(x) = c,
then there is some λ such that
∇L(x0 , λ) = 0.
It is therefore necessary to find all stationary points of L. Moreover, the above theorem makes no state-
ment about points with ∇g(x) = 0, and they have to be considered separately.
In summary, to find all extrema of f subject to a constraint g(x) = c we need to consider all critical
points of L, which are
1. all points x0 with ∇g(x0 ) = 0 and g(x0 ) = c, and
2. all points x0 with ∇g(x0 ) 6= 0 and ∇L(x0 , λ) = 0 for some λ.
Note that, as in the unconstrained optimization, not each of these points is an extremum, and one needs
a different reasoning to verify if they are (local) minima or maxima.
Remark 8.87. There is also a variant of the second derivative test for constrained optimization. This
method is based on the so-called bordered Hessian matrix, which is the Hessian of L. Unfortunately, it is
not as simple as before (by verifying positive/negative-definiteness) to determine the type of an extremum
subject to constraints. We do not discuss the details here.
Example 8.88. We consider again the function from Example 8.85. That is, we want to find the extrema
of f (x1 , x2 ) = x21 + x22 subject to the constraint g(x1 , x2 ) := x1 + x2 = 2, see Figure 55. (For sake of
notation, we use a different g here than in Example 8.85.)
First, we check if the constraint has a vanishing gradient by computing

    ∇g(x1, x2) = (1, 1).

Clearly, this is never zero, and therefore all critical points are points x = (x1, x2) where the gradient of the Lagrange function

    L(x, λ) = x1^2 + x2^2 + λ · (x1 + x2 − 2)

vanishes. We compute

    ∇x L(x, λ) = ( 2x1 + λ, 2x2 + λ )

and

    ∂L(x, λ)/∂λ = x1 + x2 − 2.
Setting ∇x L(x, λ) = 0 we see that the equation is solved by x1 = x2 = − λ2 .
We finally need to find λ such that this point satisfies the constraint g(x) = x1 + x2 = 2. Clearly, this
leads to λ = −2, and therefore to the unique critical point (1, 1), as already shown in Example 8.85. This
point is clearly a minimum, see Figure 55.
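Since the stationarity conditions of this example (2x1 + λ = 0, 2x2 + λ = 0, x1 + x2 = 2, as implied by x1 = x2 = −λ/2 above) are linear, they can be solved as one linear system. A sketch, assuming Python with NumPy:

```python
import numpy as np

# Stationarity of L(x, lam) = x1^2 + x2^2 + lam*(x1 + x2 - 2) gives the
# linear system  2*x1 + lam = 0,  2*x2 + lam = 0,  x1 + x2 = 2.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 2.0])
x1, x2, lam = np.linalg.solve(A, b)
print(x1, x2, lam)  # 1.0 1.0 -2.0
```

This reproduces the unique critical point (1, 1) with λ = −2 found above.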
Recall from Example 8.67 that all points from {(0, x2 ) : x2 ∈ R} are stationary points of f and that all of
them except (0, 0) are local extrema. However, all of these points satisfy f (x) = 0 and are therefore clearly
no global extrema. As f is unbounded it actually does not have global extrema (without constraints).
We now consider the bounded and closed domain Ω := {x : x21 + x22 ≤ 3}, see Figure 56. The function
clearly has global minima and maxima in Ω, and they are not in the interior Ωo , since all stationary
points have function value 0. (However, they are still local extrema in Ω.) Therefore, the global extrema
are on the boundary ∂Ω = {x : x21 + x22 = 3}, and to find them, we need to find the extrema of f subject
to g(x) = x21 + x22 = 3.
To do so we compute

    ∇f(x) = (2x1x2, x1^2)

and, for g(x) = x1^2 + x2^2,

    ∇g(x) = (2x1, 2x2).
First of all, ∇g(x) = 0 if and only if x = (0, 0), but x = (0, 0) does not satisfy g(x) = 3, i.e., x ∉ ∂Ω, and can therefore be ignored. Now we consider the Lagrange function

    L(x, λ) = x1^2 · x2 + λ · (x1^2 + x2^2 − 3).
First note that λ = 0 corresponds to the (local) extrema of f without constraints, because ∇x L(x, 0) = ∇f(x). Therefore, solving the equation ∇x L(x, 0) = 0 gives the set of solutions {(0, x2) : x2 ∈ R}, and the only points of this set on ∂Ω are P1 = (0, √3) and P2 = (0, −√3). As in Example 8.67, we see that P1 is a local minimum and P2 is a local maximum with f(P1) = f(P2) = 0, see also Figure 56.
Let us consider the general equation ∇x L(x, λ) = 0, i.e., the system of equations

    2x1x2 + 2λx1 = 0,
    x1^2 + 2λx2 = 0.
We see that the first equation reads 2x1(x2 + λ) = 0, which is satisfied if either x1 = 0 or x2 + λ = 0. Since x1 = 0 only leads to local extrema (see above), we consider the second case, which implies x2 = −λ. Putting this into the second equation gives x1^2 − 2λ^2 = 0, i.e., x1^2 = 2λ^2 and therefore x1 = ±√2 · |λ|.
It remains to find suitable λ by inserting into the constraint: x1^2 + x2^2 = 2λ^2 + (−λ)^2 = 3λ^2 = 3. This gives the solutions λ1 = 1 and λ2 = −1. For λ = 1 we obtain the points P3 = (√2, −1) and P4 = (−√2, −1), and for λ = −1 we obtain P5 = (√2, 1) and P6 = (−√2, 1).
Finally, by computing the function values f (P3 ) = f (P4 ) = −2 and f (P5 ) = f (P6 ) = 2, we see that f
has global minima in Ω = {x : x21 + x22 ≤ 3} at P3 and P4 , and global maxima in Ω at P5 and P6 .
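The six candidate points can be verified mechanically: each must lie on the boundary and satisfy ∇f(x0) = −λ∇g(x0). A sketch, assuming Python with NumPy (not part of the notes):

```python
import numpy as np

# f(x) = x1^2 * x2 on the circle g(x) = x1^2 + x2^2 = 3: verify the
# candidate points and compare their function values.
f = lambda x: x[0]**2 * x[1]
grad_f = lambda x: np.array([2 * x[0] * x[1], x[0]**2])
grad_g = lambda x: np.array([2 * x[0], 2 * x[1]])

candidates = [((np.sqrt(2), -1.0), 1.0),   # P3 with lam = 1
              ((-np.sqrt(2), -1.0), 1.0),  # P4 with lam = 1
              ((np.sqrt(2), 1.0), -1.0),   # P5 with lam = -1
              ((-np.sqrt(2), 1.0), -1.0)]  # P6 with lam = -1

for p, lam in candidates:
    x = np.array(p)
    assert np.isclose(x @ x, 3.0)                    # on the boundary
    assert np.allclose(grad_f(x), -lam * grad_g(x))  # Lagrange condition
    print(p, "f =", round(f(x), 10))
```

The printed values confirm f(P3) = f(P4) = −2 (global minima in Ω) and f(P5) = f(P6) = 2 (global maxima in Ω).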
Example 8.90. Let us finally consider again our initial example from Example 8.62, i.e.,
    f(x) := e^{−(x1^2 + 2x2^2)}

with x = (x1, x2) on the set Ω := {x : ||x|| ≤ 1} = {(x1, x2) : x1^2 + x2^2 ≤ 1}.
We already know that f has a global maximum at (0, 0) with f (0) = 1, see Example 8.82, and no other
stationary point. It therefore has a global minimum at the boundary ∂Ω = {x : kxk = 1}. To find the
minima, first note that ∇g only vanishes at (0, 0), which is not in ∂Ω, and can be ignored. Now consider
the Lagrange function
    L(x, λ) = e^{−(x1^2 + 2x2^2)} + λ(x1^2 + x2^2 − 1)
such that

    ∇x L(x, λ) = ( −2x1 f(x) + 2λx1, −4x2 f(x) + 2λx2 )
               = ( 2x1 (λ − f(x)), 2x2 (λ − 2f(x)) ).
So, ∇L(x, λ) = 0 holds if either x1 = 0 and x2 = ±1, or x2 = 0 and x1 = ±1. (Verify why there are no
other possibilities!) Since f (0, ±1) = e−2 and f (±1, 0) = e−1 , we see that f has a global minimum in Ω
at x0 = (0, ±1).
where for all 1 ≤ i ≤ m we have fi : R^d → R. This implies that for each component of f, i.e., for each fi, we are able to use all results we proved so far. We will see that it is often enough to study all components, i.e., we can reduce our questions to problems which only involve real-valued functions. Therefore it makes sense to use the following definition.
Example 8.93. Another type of interesting vector-valued functions are representations of curves, which are mappings h : R → R^m, i.e., they map a number to a point in R^m. One prominent example is the helix

    h(t) = ( r cos(2πt), r sin(2πt), t )

with radius r > 0, see Figure 58. This function is clearly continuous, since all components are continuous.
The analysis of vector-valued functions goes along the same lines as for real-valued (multivariate) functions. However, as there are now m different components that we need to keep track of, we need some more definitions.
First, we discuss the partial derivatives for vector-valued functions. The only difference between real- and vector-valued functions is that we need to consider all partial derivatives of every component fi of f, and we use again a matrix to collect them. (Recall that the partial derivatives of a real-valued function were collected in the gradient, i.e., a 1 × d matrix, aka a vector.) Nevertheless, the computation of partial derivatives is not harder in this setting, as we will immediately see from the next definition.
We see that the i-th row of the Jacobian is given by the gradient of fi . We will illustrate this by the
following example.
Example 8.95. Let us compute the Jacobian of

    f(x1, x2, x3) = ( x1x2, e^{x3} ),

which is a function from R^3 to R^2 with f1(x) = x1x2 and f2(x) = e^{x3}, where x = (x1, x2, x3). We see that

    ∇f1(x) = (x2, x1, 0)   and   ∇f2(x) = (0, 0, e^{x3}).

Therefore we see that the Jacobian is given by

    J(x) = ( x2  x1  0 ; 0  0  e^{x3} ).
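The Jacobian of Example 8.95 can be checked against a finite-difference approximation. This is an illustrative sketch (Python/NumPy assumed; `num_jacobian` is a hypothetical helper).

```python
import numpy as np

# f(x1, x2, x3) = (x1*x2, exp(x3)) with analytic Jacobian
# J(x) = [[x2, x1, 0], [0, 0, exp(x3)]].
f = lambda x: np.array([x[0] * x[1], np.exp(x[2])])
J = lambda x: np.array([[x[1], x[0], 0.0],
                        [0.0, 0.0, np.exp(x[2])]])

def num_jacobian(F, x, eps=1e-6):
    """Column j is the central difference quotient in direction e_j."""
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((F(x + e) - F(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([1.5, -2.0, 0.5])
print(np.allclose(num_jacobian(f, x), J(x), atol=1e-6))  # True
```

Each row of the numerical Jacobian approximates the gradient of one component fi, matching the definition above.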
In fact, we already saw some vector fields and their Jacobian, although a bit hidden.
Example 8.96. Let f : G → R be a twice-partially differentiable function. (Note that f is not vector-
valued!) The gradient is then a mapping from G to Rd , as for every x ∈ G ⊂ Rd we have ∇f (x) ∈ Rd ,
i.e. gradients are always vector fields if they exist.
Now, since every component of the gradient is partially differentiable by assumption, we can compute
the Jacobian of ∇f (x), and we see that it is actually given by the Hessian of f , i.e.
J∇f (x) = Hf (x).
Note that under the additional assumption that f is twice-continuously differentiable, we even have that
the Jacobian of ∇f is symmetric. (Why is this the case?)
Similar to multivariate real functions it is not sufficient to only consider partial derivatives. Let us adapt
the definitions to this vector-valued case, which is quite straightforward.
where r satisfies

    lim_{y→0} r(y)/||y|| = 0,

then we call f differentiable at x. We call D = df_x the (total) derivative of f at x.
Remark 8.98. Recall that a linear mapping from R^d to R^m is always described by a matrix. Moreover, r(y) ∈ R^m is a vector, and lim_{y→0} r(y)/||y|| = 0 if and only if lim_{y→0} ||r(y)||/||y|| = 0.
We see that f is differentiable in the above sense if and only if all components fi : R^d → R, 1 ≤ i ≤ m, are differentiable. This allows to use all results for real functions when considering vector-valued ones. In particular, we are able to generalize the results which were used to connect partial derivatives and differentiability, see Section 8.3.2. All the proofs are basically the same, with the additional observation that the Jacobian of a vector-valued function can be written as

    Jf(x) = ( ∇f1(x) ; ∇f2(x) ; . . . ; ∇fm(x) ),

i.e., the i-th row of Jf(x) is the gradient of fi.
Let us summarize this in the following theorem, which we state without proof.
Moreover, if f : G → Rm is a mapping such that all partial derivatives of all components are contin-
uous at x ∈ G, then f is also differentiable at x.
By the above theorem, we see that many computations related to the derivative of a vector-valued function
can be performed by computations of the (real-valued) component-functions fi . For example, we easily
see that d(f + g)x = dfx + dgx for all f, g : Rd → Rm that are differentiable at x, by using the statements
for each component individually.
In addition, using that the derivative is given by the Jacobi matrix, we obtain a particularly useful formula for the derivative of the composition of functions. In fact, the derivative (i.e., Jacobi matrix) of the composition g ∘ f is given by the matrix product of the Jacobi matrices of f and g (at appropriate points). As in the univariate case, it is called the chain rule.
Proof. To keep track of all the variables we use the following notation throughout this proof: A = Jf(x) ∈ R^{p×d}, B = Jg(y) ∈ R^{m×p}, and

    f(x + ξ) = f(x) + Aξ + r(ξ),
    g(y + η) = g(y) + Bη + s(η).
    (g ∘ f)(x) = ( x2^4 ; x1^2 x2^2 ),

and computing the Jacobian of this function directly, we obtain exactly the same matrix.
Consider the function

    h(x) = ( e^{x1−x2} ; cos(sin(x1)) ; x3^2 + sin(x1) ).

This function can be written as g ∘ f, where

    f(x) = ( x1 − x2 ; x3^2 ; sin x1 )   and   g(y) = ( e^{y1} ; cos y3 ; y2 + y3 )

with x = (x1, x2, x3) and y = (y1, y2, y3). (There are many different choices for f and g. Try to find some!)
If we compute the corresponding Jacobian matrices

    Jg(y) = ( e^{y1}  0  0 ; 0  0  −sin y3 ; 0  1  1 )   and   Jf(x) = ( 1  −1  0 ; 0  0  2x3 ; cos x1  0  0 ),

we obtain

    Jg∘f(x) = Jg(f(x)) · Jf(x) = ( e^{x1−x2}  −e^{x1−x2}  0 ; −sin(sin(x1)) cos(x1)  0  0 ; cos x1  0  2x3 ).
Of course we would also be able to compute this Jacobian directly, but this example suggests a quite useful method. We want to compute the Jacobian of a complicated function, in this case h. To do so, we rewrite it as the composition of two easier functions f, g and use the chain rule to compute Jg∘f. This may lead to more computations and a matrix multiplication, but we only have to compute very easy derivatives if we make a clever choice of f, g. Thus the problem may become easier in some cases, as many derivatives can be written down immediately.
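The chain rule Jg∘f(x) = Jg(f(x)) · Jf(x) for the example above can be verified numerically. A sketch, assuming Python with NumPy (`num_jacobian` is a hypothetical finite-difference helper):

```python
import numpy as np

# h = g o f with h(x) = (e^{x1-x2}, cos(sin x1), x3^2 + sin x1).
f = lambda x: np.array([x[0] - x[1], x[2]**2, np.sin(x[0])])
g = lambda y: np.array([np.exp(y[0]), np.cos(y[2]), y[1] + y[2]])
h = lambda x: g(f(x))

Jf = lambda x: np.array([[1.0, -1.0, 0.0],
                         [0.0, 0.0, 2 * x[2]],
                         [np.cos(x[0]), 0.0, 0.0]])
Jg = lambda y: np.array([[np.exp(y[0]), 0.0, 0.0],
                         [0.0, 0.0, -np.sin(y[2])],
                         [0.0, 1.0, 1.0]])

def num_jacobian(F, x, eps=1e-6):
    """Central finite-difference Jacobian, one column per direction."""
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((F(x + e) - F(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([0.3, 1.1, -0.7])
chain = Jg(f(x)) @ Jf(x)   # chain rule: Jacobian of the composition
print(np.allclose(chain, num_jacobian(h, x), atol=1e-6))  # True
```

The matrix product of the two 'easy' Jacobians agrees with the directly approximated Jacobian of h.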
For a multi-index α = (α1, . . . , αd) ∈ N0^d we write

    |α| = α1 + α2 + · · · + αd,
    α! = α1! · α2! · · · αd!,

and

    D^α f := D1^{α1} D2^{α2} · · · Dd^{αd} f,

i.e., we differentiate α1 times w.r.t. x1, α2 times w.r.t. x2, and so on. Since f is |α|-times continuously differentiable, it follows from Theorem 8.58 that we can change the order of differentiation. Moreover, we will use the following useful notation:

    x^α = x1^{α1} x2^{α2} · · · xd^{αd}.
    g^{(k)}(t) = d^k g/dt^k (t) = Σ_{|α|=k} (k!/α!) · D^α f(x + ty) · y^α.
Remark 8.105. This formula for the derivatives of g makes sense also at the endpoints of the interval [0, 1], although we usually avoided considering derivatives at the endpoints of the domain of a function.
For this note that, since G is open, and x, y ∈ G, we know that also small neighborhoods around x and
y are contained in G. In particular, there is some ε > 0 such that x + ty ∈ G for all t ∈ (−ε, 1 + ε). If
we now define the function g : (−ε, 1 + ε) → R, we can consider its derivatives also at t = 0 and t = 1.
If k = 1 this is just an application of the definition of the total derivative. For this, write x̄ := x + ty and note that we can write f(x + (t + h)y) = f(x̄ + hy) = f(x̄) + dx̄f(hy) + r(hy), and so

    g'(t) = lim_{h→0} ( g(t + h) − g(t) ) / h = lim_{h→0} ( f(x + (t + h)y) − f(x + ty) ) / h
          = d_{x+ty}f(y) = Σ_{i=1}^d Di f(x + ty) · yi,
see Definition 8.31 and Theorem 8.35 (with x replaced by x + ty). Now we assume that the statement is true for k − 1, i.e.,

    d^{k−1}g/dt^{k−1} (t) = Σ_{i1,...,i_{k−1}=1}^d D_{i_{k−1}} · · · D_{i1} f(x + ty) · y_{i1} · · · y_{i_{k−1}}.
This function is still differentiable, since D_{i_{k−1}} · · · D_{i1} f is so for every choice of i1, . . . , i_{k−1}. By the same computation as above, with f replaced by D_{i_{k−1}} · · · D_{i1} f, we see

    g^{(k)}(t) = d/dt [ d^{k−1}g/dt^{k−1} (t) ]
        = d/dt Σ_{i1,...,i_{k−1}=1}^d D_{i_{k−1}} · · · D_{i1} f(x + ty) · y_{i1} · · · y_{i_{k−1}}
        = Σ_{i1,...,i_{k−1}=1}^d y_{i1} · · · y_{i_{k−1}} · d/dt D_{i_{k−1}} · · · D_{i1} f(x + ty)
        = Σ_{i1,...,i_{k−1}=1}^d y_{i1} · · · y_{i_{k−1}} · Σ_{j=1}^d D_j D_{i_{k−1}} · · · D_{i1} f(x + ty) · y_j
        = Σ_{i1,...,i_{k−1}=1}^d Σ_{j=1}^d D_j D_{i_{k−1}} · · · D_{i1} f(x + ty) · y_{i1} · · · y_{i_{k−1}} · y_j.

Using the index i_k instead of j, the first part of the proof follows.
Second part: Now we have to show that

    Σ_{i1,...,ik=1}^d D_{ik} · · · D_{i1} f(x + ty) · y_{i1} · · · y_{ik} = Σ_{|α|=k} (k!/α!) · D^α f(x + ty) · y^α.
For this, we need to count the number of different partial derivatives. Let α ∈ N0^d with |α| = k. By Schwarz' theorem (Theorem 8.58) we can interchange all these derivatives, i.e., if the index j appears αj times among i1, . . . , ik (for each 1 ≤ j ≤ d), then

    D_{i1} D_{i2} · · · D_{ik} f(x + ty) · y_{i1} · · · y_{ik} = D1^{α1} D2^{α2} · · · Dd^{αd} f(x + ty) · y1^{α1} y2^{α2} · · · yd^{αd},

using the multi-index notation from above. Moreover, the number of tuples (i1, . . . , ik) such that j appears exactly αj times for 1 ≤ j ≤ d is k!/(α1! α2! · · · αd!). This proves the desired formula.
With the help of this result we are able to show the multivariate version of Taylor’s theorem.
Proof. Let g(t) = f(x + ty), t ∈ [0, 1], which is (n+1)-times continuously differentiable, see Lemma 8.104. Hence we are able to apply the univariate Taylor's theorem, see Theorem 5.53, implying

    g(1) = Σ_{k=0}^n g^{(k)}(0)/k! + g^{(n+1)}(θ)/(n + 1)!
for some θ ∈ (0, 1). (Here, we use the Taylor polynomial of order n at x0 = 0.)
Lemma 8.104 implies that for 0 ≤ k ≤ n

    g^{(k)}(0)/k! = Σ_{|α|=k} (1/α!) · D^α f(x) · y^α

and

    g^{(n+1)}(θ)/(n + 1)! = Σ_{|α|=n+1} (1/α!) · D^α f(x + θy) · y^α.
A simple reformulation of the above result, using y = x − x0 for some fixed x0, gives a statement similar to the univariate Taylor's theorem 5.53:

    f(x) = Tn(x) + Σ_{|α|=n+1} ( D^α f(ξ)/α! ) · (x − x0)^α

for some ξ between x0 and x, i.e., ξ = x0 + θ(x − x0) for some θ ∈ (0, 1). We call

    Tn(x) := Σ_{|α|≤n} ( D^α f(x0)/α! ) · (x − x0)^α

the Taylor polynomial of f of order n at x0.
Proof. Note that since x ∈ U there exists some y ∈ Rd with x = x0 + y and such that for any t ∈ [0, 1] we
have that x0 + ty ∈ U . An application of Theorem 8.106, with x replaced by x0 and y = x − x0 , yields
the result.
We now turn to error bounds in this approximation. Note that, in contrast to the explicit formula for
the remainder of Tn above, where we needed f to be (n + 1)-times continuously differentiable, we only
need that f is n-times continuously differentiable to obtain error bounds.
    |f(x) − Tn(x)| ≤ (2M · d^n / n!) · ||x − x0||^n.
Note that this upper bound might be very large (i.e., bad) for large d and small n, in which case we only
have reasonable bounds for x very close to x0 .
Proof. First of all, writing f as its order n − 1 Taylor polynomial, and regrouping, we obtain

    f(x) = Σ_{|α|≤n−1} ( D^α f(x0)/α! ) (x − x0)^α + Σ_{|α|=n} ( D^α f(ξ)/α! ) (x − x0)^α
         = Σ_{|α|≤n} ( D^α f(x0)/α! ) (x − x0)^α + Σ_{|α|=n} ( ( D^α f(ξ) − D^α f(x0) )/α! ) (x − x0)^α
         = Tn(x) + Σ_{|α|=n} ( ( D^α f(ξ) − D^α f(x0) )/α! ) (x − x0)^α.
That is, f can also be written by its Taylor polynomial of order n, but with a different error/remainder. To bound all the terms separately, we need that

    |(x − x0)^α| = Π_{i=1}^d |xi − x0,i|^{αi} ≤ Π_{i=1}^d ||x − x0||^{αi} = ||x − x0||^n
for all α ∈ N0^d with |α| = n. In addition, we need a special case of the multinomial theorem, i.e.,

    Σ_{α ∈ N0^d : |α|=n} n!/α! = d^n,
which follows from combinatorial arguments. (We omit details here.) We finally obtain
X α α
D f (ξ) − D f (x 0 ) α
|f (x) − Tn (x)| =
(x − x0 )
α!
|α|=n
n
n d
n
X 1
≤ 2M kx − x0 k ≤ 2M kx − x0 k .
α! n!
|α|=n
Example 8.109 (Second-order approximation). We have a look at a twice continuously differentiable function f : G → R, G ⊂ R^d, and assume that x0 = 0 and that all second-order partial derivatives of f are bounded on G. We want to compute the Taylor representation of f of order 2, which is formally given by
\[
f(x) \;=\; \sum_{|\alpha|\le 2} \frac{D^\alpha f(0)}{\alpha!}\, x^\alpha \;+\; R_2(x).
\]
First of all, the only term with |α| = 0 is f(0). For the terms with |α| = 1, any such α contains exactly one non-zero entry, which equals 1. Thus, all α we have to consider are given by the unit vectors, and α! = 1. Therefore we see that
\[
T_2(x) \;=\; f(0) \;+\; \sum_{i=1}^{d} D_i f(0)\, x_i \;+\; \sum_{|\alpha|=2} \frac{D^\alpha f(0)}{\alpha!}\, x^\alpha.
\]
For |α| = 2 we see that any such vector can be obtained as e_i + e_j , where 1 ≤ i ≤ j ≤ d. If i = j then α! = (2e_i)! = 2, and if i < j then α! = (e_i + e_j)! = 1. This shows that
\[
\sum_{|\alpha|=2} \frac{D^\alpha f(0)}{\alpha!}\, x^\alpha
\;=\; \frac12 \sum_{i=1}^{d} D_i^2 f(0)\, x_i^2 \;+\; \sum_{i<j} D_i D_j f(0)\, x_i x_j
\;=\; \frac12 \sum_{i=1}^{d} D_i^2 f(0)\, x_i^2 \;+\; \frac12 \sum_{i\ne j} D_i D_j f(0)\, x_i x_j,
\]
since all second-order partial derivatives are continuous, and can therefore be interchanged. It follows that
\[
T_2(x) \;=\; f(0) \;+\; \nabla f(0)\cdot x \;+\; \frac12\, x^T H_f(0)\, x,
\]
where we write ∇f(0) · x for ⟨∇f(0), x⟩. (This makes sense since the gradient is a row vector.) In general, i.e., with x0 ≠ 0, we have
\[
T_2(x) \;=\; f(x_0) \;+\; \nabla f(x_0)\cdot(x-x_0) \;+\; \frac12\,(x-x_0)^T H_f(x_0)\,(x-x_0).
\]
Note that the error (for x0 = 0) can be estimated by
\[
|f(x) - T_2(x)| \;\le\; M \cdot d^2 \cdot \|x\|^2,
\]
where M is a bound on the second-order partial derivatives of f on G.
Example 8.110. We now want to use this general formula to calculate the second-order Taylor approximation of
\[
f(x_1, x_2, x_3) \;=\; x_1 x_2 + e^{x_3}
\]
at x0 = (2, 1, 0). It is easy to compute that
\[
\nabla f(x) = \big(x_2,\; x_1,\; e^{x_3}\big)
\qquad\text{and}\qquad
H_f(x) = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & e^{x_3} \end{pmatrix}.
\]
Thus, f(x0) = 3,
\[
\nabla f(x_0) = \big(1,\; 2,\; 1\big)
\qquad\text{and}\qquad
H_f(x_0) = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]
Moreover, note that all second-order partial derivatives are bounded by max{1, e^{x_3}}. In particular, for the special choice
\[
G = \{x\in\mathbb{R}^3 : \|x\| < r\},
\]
where r > 0, we have M := sup_{x∈G} |D^α f(x)| ≤ e^r for all α ∈ N_0^3 with |α| = 2. Therefore, we have the error bound
\[
|f(x) - T_2(x)| \;\le\; 9 e^r \cdot \|x - x_0\|^2
\]
according to Corollary 8.108. We see that the error of the approximation is smaller than ε > 0 if \( \|x - x_0\| < \sqrt{\tfrac{\varepsilon}{9}\, e^{-r}} \).
We already saw that being able to use higher-order derivatives leads to a better approximation. This suggests, as in the one-dimensional case, trying to write certain functions as the limit of their Taylor polynomials, i.e., as the Taylor series
\[
T f(x) \;:=\; \sum_{\alpha\in\mathbb{N}_0^d} \frac{D^\alpha f(x_0)}{\alpha!}\,(x-x_0)^\alpha.
\]
A necessary condition is that one must be able to compute arbitrary derivatives of the function.
Note that, a priori, we do not know if T f (x) converges, and even then, we do not know if T f (x) = f (x)
for some x ∈ G. In the same way as in the univariate case, we introduce some criteria that imply the
convergence, at least in a neighborhood of x0 .
For x ∈ U_r(x_0) we have
\[
|f(x) - T_n(x)| \;\le\; \frac{2 M_n\, n^d}{n!}\,\|x - x_0\|^n,
\]
where \( M_n := \max_{\alpha\in\mathbb{N}_0^d\colon |\alpha|=n}\, \sup_{\xi\in U_r(x_0)} |D^\alpha f(\xi)| \). Regrouping leads to
\[
|f(x) - T_n(x)| \;\le\; \frac{r^n M_n}{n!} \cdot 2 \left(\frac{\|x-x_0\|}{r}\right)^{n} n^d.
\]
The first term tends to zero by assumption. For the remaining factors note that x ∈ U_r(x_0) implies \(\frac{\|x-x_0\|}{r} < 1\), and that \(\lim_{n\to\infty} q^n n^d = 0\) for any d ∈ N and 0 < q < 1. Therefore, |f(x) − Tn(x)| → 0 as n → ∞, and we obtain
\[
T f(x) \;=\; \lim_{n\to\infty} T_n(x) \;=\; f(x).
\]
Example 8.113. Let f(x1, x2) = e^{x1+x2}. We want to compute the Taylor series of this function at x0 = 0. The partial derivatives of f are
\[
\frac{\partial f(x)}{\partial x_1} \;=\; e^{x_1+x_2} \;=\; \frac{\partial f(x)}{\partial x_2}.
\]
Thus, all partial derivatives are given by f itself, which implies D^α f(x0) = f(x0) = 1 for all α ∈ N_0^d. We obtain the Taylor series
\[
T f(x) \;=\; \sum_{\alpha\in\mathbb{N}_0^2} \frac{D^\alpha f(0)}{\alpha!}\, x^\alpha \;=\; \sum_{\alpha\in\mathbb{N}_0^2} \frac{x^\alpha}{\alpha!}.
\]
To see that this series converges, let U_r = {x ∈ R^2 : ‖x‖ < r}, for some r > 0. We see that \(|D^\alpha f(x)| \le e^{\sqrt{2}\,r}\) for all x ∈ U_r. (Why?)
If we observe that
\[
\lim_{n\to\infty} \frac{r^n}{n!} \cdot \max_{\alpha\in\mathbb{N}_0^d\colon|\alpha|=n}\; \sup_{\xi\in U_r} |D^\alpha f(\xi)|
\;\le\; \lim_{n\to\infty} \frac{r^n\, e^{\sqrt{2}\,r}}{n!} \;=\; 0,
\]
since n! grows faster than any exponential, we obtain from Theorem 8.112 that T f(x) = f(x) for any x ∈ U_r. Since r was arbitrary, the Taylor series of f converges to f at every point x ∈ R^2.
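The convergence in Example 8.113 can be watched numerically. The sketch below (our own Python helper `taylor_sum`, not part of the notes) sums x^α/α! over all α with |α| ≤ n and compares with e^{x1+x2}:

```python
import math

def taylor_sum(x1, x2, n):
    """Sum of x^alpha / alpha! over all alpha = (a1, a2) with a1 + a2 <= n."""
    return sum(x1**a1 * x2**a2 / (math.factorial(a1) * math.factorial(a2))
               for a1 in range(n + 1) for a2 in range(n + 1 - a1))

x1, x2 = 0.7, -0.3
exact = math.exp(x1 + x2)
for n in (2, 5, 10):
    print(n, abs(taylor_sum(x1, x2, n) - exact))  # errors shrink rapidly with n
```

By the multinomial theorem the truncated sum equals \(\sum_{k\le n}(x_1+x_2)^k/k!\), so the error decays like \(r^n/n!\), exactly as in the convergence argument above.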
For such continuous functions defined on rectangles we can follow exactly the same lines as in Section 6.3
and define the integral as the limit of an average of function values.
However, since the possible domains are much more complicated in the multivariate case, we have to be a
bit more precise here, and actually need the precise definition of a Riemann integral. Let us only illustrate
the two-dimensional case. As for univariate integrals, we split the rectangle R = [a, b] × [c, d] into smaller
parts. These are obtained by partitioning each of the intervals [a, b] and [c, d] into smaller intervals, and
consider their Cartesian products. For this, assume we have a partition a = s0 < s1 < · · · < sn = b of
[a, b] and a partition c = t0 < t1 < · · · < tn = d of [c, d]. The n² Cartesian products of these univariate intervals, say R1, . . . , R_{n²}, are all of the form Rk = [si, si+1] × [tj, tj+1] for some i, j = 0, . . . , n − 1, with area |Rk| = (si+1 − si)(tj+1 − tj). If we now bound the values of a function on each of these rectangles by its smallest or largest value, we obtain lower and upper sums as in Section 6.3, which are lower and upper bounds for the value of the integral, if it exists, independent of the chosen partition. Here, we take in addition those partitions that lead to the largest and smallest value, respectively. That is, we consider the lower and upper sums
\[
L_n(f) \;:=\; \sup\, \sum_{k=1}^{n^2} |R_k|\, \min_{x\in R_k} f(x)
\qquad\text{and}\qquad
U_n(f) \;:=\; \inf\, \sum_{k=1}^{n^2} |R_k|\, \max_{x\in R_k} f(x),
\]
where the sup / inf are taken over all partitions as described above. It is not hard to see that Ln(f) ≤ Un(f) for arbitrary functions f. Moreover, Ln(f) is monotonically increasing with n, and Un(f) is decreasing, which implies that both sequences converge. So, if their limits are the same, i.e., lim_{n→∞} Ln(f) = lim_{n→∞} Un(f), then we define the integral of f by this common limit.
(The generalization to higher dimensions is straightforward.)
Definition 8.114 (Riemann-integrable functions). Let \(R = \prod_{i=1}^{d} [a_i, b_i]\) be a box and f : R → R be a bounded function. Then, if
\[
\lim_{n\to\infty} L_n(f) \;=\; \lim_{n\to\infty} U_n(f),
\]
we call f a (Riemann-)integrable function and define the integral of f over R by this common limit, i.e.,
\[
\int_R f(x)\,dx \;:=\; \int_R f(x_1,\dots,x_d)\; d(x_1,\dots,x_d) \;:=\; \lim_{n\to\infty} L_n(f).
\]
This definition is quite impractical, as the involved limits, suprema and infima are hard to determine. We will shortly discuss how to evaluate integrals more easily. However, for continuous functions defined on a box (or rectangle), the definition can be much simplified:
We can, again, define the value of the integral by the limit of cubature rules applied to f, see Remark 6.35. Let us assume R = [a, b] × [c, d] and define the sums
\[
Q_n(f) \;:=\; \frac{(b-a)(d-c)}{n^2} \sum_{i,j=1}^{n} f\Big(a + \frac{i}{n}(b-a),\; c + \frac{j}{n}(d-c)\Big), \tag{8.1}
\]
see also (6.1). In the special case R = [0, 1]², we have \(Q_n(f) = \frac{1}{n^2}\sum_{i,j=1}^{n} f\big(\frac{i}{n}, \frac{j}{n}\big)\).
The following lemma shows that these sums (aka. averages) converge to the integral for continuous f . We
state it directly for higher dimensions. The modifications of the above definitions (and the proof sketch
below) are straightforward.
Lemma 8.115. Let \(R = \prod_{i=1}^{d} [a_i, b_i]\) and f : R → R be continuous. Then, f is integrable and
\[
\int_R f(x)\,dx \;=\; \lim_{n\to\infty} Q_n(f).
\]
Sketch of proof. We use the same ideas as in the univariate case. Moreover, we only prove the case d = 2 with R = [0, 1]² here. If we use the equidistant partition R1, . . . , R_{n²} of R, i.e., all Rk are of the form \([\frac{i}{n}, \frac{i+1}{n}] \times [\frac{j}{n}, \frac{j+1}{n}]\) for some i, j = 0, . . . , n − 1, and denote the corresponding lower and upper sums by L*_n(f) and U*_n(f), we obtain
\[
L_n^*(f) \;\le\; L_n(f) \;\le\; U_n(f) \;\le\; U_n^*(f).
\]
(Recall that Ln involves the supremum over all partitions and is therefore larger; similarly for Un.)
Hence, f is integrable if lim_{n→∞} L*_n(f) = lim_{n→∞} U*_n(f). For this, let ℓ_k = min{f(x) : x ∈ R_k} and u_k = max{f(x) : x ∈ R_k}. We obtain, for fixed ε > 0, that
\[
|u_k - \ell_k| \;<\; \varepsilon
\]
for all k = 1, . . . , n², if n is large enough. (As in Lemma 6.28, we use here that f on a bounded and closed set, here the rectangle, is not only continuous but even uniformly continuous.) This yields that
\[
|L_n^*(f) - U_n^*(f)| \;<\; \frac{(b-a)(d-c)}{n^2} \sum_{k=1}^{n^2} \varepsilon \;=\; (b-a)(d-c)\,\varepsilon
\]
for large enough n. Since this holds for all ε > 0, we obtain that the limits are equal.
To see that this limit equals lim_{n→∞} Qn(f), note that, by the choice of the partition, we have L*_n(f) ≤ Qn(f) ≤ U*_n(f). The sandwich rule implies the result.
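The cubature sum (8.1) is easy to implement directly. The following sketch (plain Python; the helper name `Q_n` is ours) shows the convergence claimed in Lemma 8.115 for f(x, y) = xy on [0, 1]², whose integral is 1/4:

```python
def Q_n(f, n, a=0.0, b=1.0, c=0.0, d=1.0):
    """The cubature sum (8.1): an average of f over an n-by-n grid in [a,b] x [c,d]."""
    s = sum(f(a + i * (b - a) / n, c + j * (d - c) / n)
            for i in range(1, n + 1) for j in range(1, n + 1))
    return (b - a) * (d - c) * s / n**2

f = lambda x, y: x * y          # integral over [0,1]^2 equals 1/4
for n in (10, 100, 400):
    print(n, Q_n(f, n))          # tends to 0.25 as n grows
```

For this f one can even compute \(Q_n(f) = \big(\frac{n+1}{2n}\big)^2\) by hand, so the error decays like 1/n.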
Although the above gives us a valid definition, it is not very handy when we want to compute integrals. Unfortunately, there is no equivalent of the antiderivative for multivariate functions, which was, together with the fundamental theorem of calculus (Theorem 6.38), the most powerful technique to evaluate integrals. Luckily, we are again in a situation that allows us to reduce everything to the evaluation of one-dimensional integrals. This is (a special case of) Fubini's theorem, which is probably the most important result related to multiple integrals.
Theorem 8.116 (Fubini). Let \(R = \prod_{i=1}^{d} [a_i, b_i]\) be a box and f : R → R be continuous. Then,
\[
\int_R f(x)\,dx \;=\; \int_{a_1}^{b_1}\!\left(\int_{a_2}^{b_2}\!\left(\cdots \int_{a_d}^{b_d} f(x_1,\dots,x_d)\,dx_d \cdots\right) dx_2\right) dx_1.
\]
We usually omit the brackets as the order of integral signs and the dxi ’s should leave no room for
confusion. However, it is always beneficial to use brackets when things are not clear.
We do not discuss a proof here, because we come back to this important result in the next chapter, where we prove that it holds even more generally.
Fubini's theorem now allows us to use all the calculation rules for univariate integrals to obtain corresponding results also for multiple integrals. In particular, in view of Lemma 6.31, we obtain the linearity of the integral, i.e., for any continuous functions f, g : R^d → R and λ, μ ∈ R we have that
\[
\int_R \big(\lambda f(x) + \mu g(x)\big)\,dx \;=\; \lambda \int_R f(x)\,dx \;+\; \mu \int_R g(x)\,dx,
\]
where R is some box. (As an exercise in formalism, verify this calculation yourself by using Fubini's theorem and linearity for one-dimensional integrals.) Moreover, we obtain that a non-negative continuous function is the zero function if and only if its integral is zero, and, as in Corollary 6.32, that we have the triangle inequality
\[
\left|\int_R f(x)\,dx\right| \;\le\; \int_R |f(x)|\,dx
\]
also for multiple integrals. Let us see some examples.
Example 8.117. We start with a continuous function of the form f(x1, x2) = g(x1)h(x2) on a rectangle R = [a, b] × [c, d]; such functions are called product functions. They are nice to handle with Fubini's theorem, since we have
\[
\begin{aligned}
\int_R f(x)\,dx
&= \int_a^b\!\int_c^d f(x_1,x_2)\,dx_2\,dx_1
= \int_a^b\!\int_c^d g(x_1)h(x_2)\,dx_2\,dx_1 \\
&= \int_a^b g(x_1)\left(\int_c^d h(x_2)\,dx_2\right) dx_1
= \left(\int_a^b g(x_1)\,dx_1\right)\left(\int_c^d h(x_2)\,dx_2\right).
\end{aligned}
\]
That is, the integral of a product function is the product of the integrals.
To see a specific example, let f(x1, x2) = cos(x1) sin(x2) and R = [0, π/2] × [0, π]. We want to calculate
\[
\int_R f(x)\,dx \;=\; \int_0^{\pi/2}\!\int_0^{\pi} \cos(x_1)\sin(x_2)\,dx_2\,dx_1.
\]
Since cos(x1) does not depend on x2, we are allowed to move it outside the inner integral. So,
\[
\int_0^{\pi/2}\!\int_0^{\pi} \cos(x_1)\sin(x_2)\,dx_2\,dx_1
\;=\; \left(\int_0^{\pi/2} \cos(x_1)\,dx_1\right)\left(\int_0^{\pi} \sin(x_2)\,dx_2\right).
\]
These univariate integrals are easily evaluated: \(\int_0^{\pi/2}\cos t\,dt = 1\) and \(\int_0^{\pi}\sin t\,dt = 2\), implying
\[
\int_R f(x)\,dx \;=\; 2.
\]
Example 8.118. Let us consider the multivariate polynomial p(x1, x2) = x1²x2³ + x1 + x2 + 3 on R = [0, 1]² = [0, 1] × [0, 1]. We compute
\[
\begin{aligned}
\int_{[0,1]^2} p(x)\,dx
&= \int_0^1\!\int_0^1 \big(x_1^2 x_2^3 + x_1 + x_2 + 3\big)\,dx_1\,dx_2 \\
&= \int_0^1 \left( x_2^3 \int_0^1 x_1^2\,dx_1 + \int_0^1 x_1\,dx_1 + x_2 \int_0^1 1\,dx_1 + \int_0^1 3\,dx_1 \right) dx_2 \\
&= \int_0^1 \left( \frac{x_2^3}{3} + \frac12 + x_2 + 3 \right) dx_2
= \int_0^1 \left( \frac{x_2^3}{3} + x_2 + \frac72 \right) dx_2
= \frac{49}{12}.
\end{aligned}
\]
Again, every calculation in this example can be reduced to one-dimensional integration theory.
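The same reduction to one-dimensional integrals can be mirrored in code. The sketch below (our own Python helper `iterated_integral`, an iterated midpoint rule in the spirit of Fubini's theorem: inner integral in x1, outer in x2) reproduces the value 49/12:

```python
def iterated_integral(f, n=200):
    """Iterated midpoint rule over [0,1]^2: inner integral in x1, outer in x2."""
    h = 1.0 / n
    total = 0.0
    for j in range(n):
        x2 = (j + 0.5) * h
        inner = sum(f((i + 0.5) * h, x2) for i in range(n)) * h  # integral in x1
        total += inner * h                                        # integral in x2
    return total

p = lambda x1, x2: x1**2 * x2**3 + x1 + x2 + 3
print(iterated_integral(p), 49 / 12)   # both close to 4.0833...
```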
Example 8.119. Clearly, Fubini's theorem is also very helpful for more complicated multivariate functions. Again, we can reduce everything to univariate integrals, but one should be careful with the different variables appearing, and sometimes one should find the correct order of integration to get a simple solution.
Consider, e.g., the function f(x1, x2) = x1 cos(x1 x2) on R = [0, π]². We obtain
\[
\begin{aligned}
\int_{[0,\pi]^2} f(x)\,dx
&= \int_0^\pi\!\int_0^\pi x_1 \cos(x_1 x_2)\,dx_2\,dx_1
= \int_0^\pi x_1 \left[\frac{\sin(x_1 x_2)}{x_1}\right]_{x_2=0}^{x_2=\pi} dx_1 \\
&= \int_0^\pi \left( x_1\,\frac{\sin(\pi x_1)}{x_1} - 0 \right) dx_1
= \int_0^\pi \sin(\pi x_1)\,dx_1
= \frac{1 - \cos(\pi^2)}{\pi}.
\end{aligned}
\]
Here, we computed first the integral w.r.t. x2 because the corresponding integrand 'looked simpler'. In the same way we could have calculated the integral by starting with x1. Then, we would need to work with the antiderivative (w.r.t. t) of t cos(t x2), which would result in a more complicated calculation. However, by Fubini's theorem, the result would clearly be the same.
Clearly, it is not always the case that one has to integrate w.r.t. a box. However, general domains in
higher dimensions are even harder to tackle than already in the univariate case, and it is just not possible
to define integrals of arbitrary functions over arbitrary domains. We therefore introduce the following
rather general class of domains, which are those sets that can be assigned a volume (or area for d = 2)
by our definition of an integral (which is by now only defined over boxes). Recall that the indicator
function χ_A : R^d → R of a set A ⊂ R^d is defined by
\[
\chi_A(x) \;:=\;
\begin{cases}
1, & x \in A, \\
0, & x \notin A.
\end{cases}
\]
Definition 8.120 (Jordan-measurable set). Let A ⊂ R^d be a bounded domain, i.e., there is a box R ⊂ R^d with A ⊂ R. We call A Jordan-measurable if χ_A is Riemann-integrable. In this case, we define the volume of A by
\[
\mathrm{vol}_d(A) \;:=\; \int_R \chi_A(x)\,dx.
\]
That is, we have now also defined integrals over more general domains. However, as this kind of generalization is the topic of the next chapter, we do not go into detail here and only discuss a specific class of domains, which is anyhow quite usual in applications: so-called normal domains, where each variable is restricted to an interval whose endpoints are continuous functions of the preceding variables. For d = 2, a normal domain has the form
\[
A \;=\; \Big\{ (x_1,x_2) \in \mathbb{R}^2 : x_1 \in [a,b],\; x_2 \in \big[\varphi(x_1), \psi(x_1)\big] \Big\}
\]
with continuous functions ϕ ≤ ψ.
Before we discuss easier examples, let us see how this definition has to be understood in higher dimensions. It just means that the variables of the domain are restricted in a sequential way, meaning that the k-th variable is contained in an interval that is bounded by some continuous functions of the first k − 1 variables. For example, for d = 3, a normal domain has the form
\[
\Big\{ (x,y,z) \in \mathbb{R}^3 : x \in [a,b],\; y \in \big[\varphi_1(x), \psi_1(x)\big],\; z \in \big[\varphi_2(x,y), \psi_2(x,y)\big] \Big\}.
\]
We now want to compute integrals over normal domains. Let us start with the case d = 2. If we combine
Fubini’s theorem (on a box that contains A) with the fact that we can restrict the boundaries of the
integral, if the function is zero ’outside’, we obtain that normal domains are Jordan-measurable
and that the integral can be written in an easier form.
For a normal domain A = {x ∈ R² : x1 ∈ [a, b], x2 ∈ [ϕ(x1), ψ(x1)]}, where ϕ ≤ ψ and both are continuous functions, the integral of an integrable function f : A → R equals
\[
\int_A f(x)\,dx \;=\; \int_a^b \int_{\varphi(x_1)}^{\psi(x_1)} f(x_1, x_2)\,dx_2\,dx_1.
\]
For a normal domain A ⊂ R³ in three dimensions as above, we obtain that the integral of an integrable function f : A → R can be computed by
\[
\int_A f(x)\,dx \;=\; \int_a^b \int_{\varphi_1(x)}^{\psi_1(x)} \int_{\varphi_2(x,y)}^{\psi_2(x,y)} f(x,y,z)\,dz\,dy\,dx,
\]
and corresponding formulas hold for the volume and in higher dimensions.
Consider, for example, the set A = {x ∈ R² : x1 ≥ a, x1² + x2² ≤ 1} for some a ∈ [−1, 1], which describes a circular segment of the circle with radius 1 (centred in the origin), see Figure 60. The second condition can be rewritten to a condition on x2 depending only on x1, which leads to
\[
A \;=\; \Big\{ x \in \mathbb{R}^2 : x_1 \in [a, 1],\; x_2 \in \big[-\sqrt{1 - x_1^2},\, \sqrt{1 - x_1^2}\,\big] \Big\}.
\]
Therefore,
\[
\int_A 1\,dx \;=\; \int_a^1 2\sqrt{1 - x_1^2}\,dx_1 \;=\; \frac{\pi}{2} - a\sqrt{1 - a^2} - \arcsin(a).
\]
In particular, for a = 0, we obtain that the area of a half-circle is
\[
\mathrm{vol}_2\big(\{x \in \mathbb{R}^2 : x_1 \in [0,1],\; x_1^2 + x_2^2 \le 1\}\big) \;=\; 2\int_0^1 \sqrt{1 - t^2}\,dt \;=\; \frac{\pi}{2},
\]
and the area of the full circle, which we obtain for a = −1, is
\[
\mathrm{vol}_2\big(\{x \in \mathbb{R}^2 : x_1^2 + x_2^2 \le 1\}\big) \;=\; 2\int_{-1}^1 \sqrt{1 - t^2}\,dt \;=\; \pi.
\]
Example 8.124. In a similar way, we can compute the volume of the 3-dimensional unit ball
\[
A \;:=\; \big\{ (x,y,z) \in \mathbb{R}^3 : x^2 + y^2 + z^2 \le 1 \big\}.
\]
Calculating not only the volume of a set but also the integral of a given function works exactly along the same lines.
Example 8.125. We want to integrate the function f(x1, x2) = x1²x2² over the triangle with corners (0, 0), (1, 0) and (1, 1), see Figure 61. This set can be modeled as the normal domain A = {x ∈ R² : x1 ∈ [0, 1], x2 ∈ [0, x1]}.
Using Fubini, we compute
\[
\int_A f(x)\,dx
\;=\; \int_0^1\!\int_0^{x_1} x_1^2 x_2^2\,dx_2\,dx_1
\;=\; \int_0^1 x_1^2 \cdot \frac{x_1^3}{3}\,dx_1
\;=\; \frac13 \int_0^1 x_1^5\,dx_1
\;=\; \frac{1}{18}.
\]
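Integration over a normal domain can likewise be mimicked numerically, with the inner integration limits depending on the outer variable. The sketch below (our own Python helper `integral_normal_domain`, an iterated midpoint rule under the assumptions of this example) reproduces the value 1/18:

```python
def integral_normal_domain(f, a, b, phi, psi, n=300):
    """Iterated midpoint rule over {x1 in [a,b], x2 in [phi(x1), psi(x1)]}."""
    h1 = (b - a) / n
    total = 0.0
    for i in range(n):
        x1 = a + (i + 0.5) * h1
        lo, hi = phi(x1), psi(x1)          # x2-limits depend on x1
        h2 = (hi - lo) / n
        total += h1 * h2 * sum(f(x1, lo + (j + 0.5) * h2) for j in range(n))
    return total

val = integral_normal_domain(lambda x1, x2: x1**2 * x2**2,
                             0.0, 1.0, lambda t: 0.0, lambda t: t)
print(val, 1 / 18)   # both close to 0.05555...
```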
This approach allows us to compute many integrals. However, for this we need to bring the domain into the correct form as a normal domain, and then we need to compute successively all the univariate integrals, which can be very time-consuming.
Such heavy computations can sometimes be avoided by bringing the integral under consideration into a more familiar form. That is, we use some change of variables to transform an integral into another integral we already know. As in the univariate case, this is done by integration by substitution, which is one of the most powerful ways to compute (difficult) integrals.
Theorem 8.126 (Substitution rule). Let G ⊂ R^d be open and A ⊂ G be a bounded and Jordan-measurable set. Moreover, let Φ : G → R^d be a continuously differentiable and injective function such that either det J_Φ(u) > 0 for all u ∈ G or det J_Φ(u) < 0 for all u ∈ G, where J_Φ(u) is the Jacobi matrix of Φ at u ∈ G.
Then, Φ(A) is also Jordan-measurable and, for any bounded and continuous function f : Φ(A) → R, we have
\[
\int_{\Phi(A)} f(x)\,dx \;=\; \int_A f(\Phi(u)) \cdot |\det J_\Phi(u)|\,du.
\]
The proof is quite involved and we omit it here. Note that we need the additional assumption that f is bounded, because we do not know in general that continuous functions on Jordan-measurable domains are integrable. (One might think of the univariate example f(t) = 1/t on (0, 1].)
Figure 62: The annulus \(\{x \in \mathbb{R}^2 : r \le \sqrt{x_1^2 + x_2^2} \le R\}\)

In the same way, we can use this formula to calculate the integral of functions f : R² → R over such annuli. For example, if we consider \(f(x) := \frac{1}{\|x\|} = (x_1^2 + x_2^2)^{-1/2}\) on the set \(A = \{x \in \mathbb{R}^2 : R_1 \le \sqrt{x_1^2 + x_2^2} \le R_2\}\) with 0 < R1 < R2 < ∞, where it is clearly continuous, we obtain
\[
\int_A f(x)\,dx
\;=\; \int_0^{2\pi}\!\int_{R_1}^{R_2} \frac{r}{\sqrt{r^2\cos^2\theta + r^2\sin^2\theta}}\,dr\,d\theta
\;=\; \int_0^{2\pi}\!\int_{R_1}^{R_2} 1\,dr\,d\theta
\;=\; 2\pi (R_2 - R_1).
\]
Clearly, one can also consider circular sectors of such annuli, i.e., sets of the form Φ(A) with
\[
A \;=\; \{(r, \theta) : R_1 \le r \le R_2,\; \theta_1 \le \theta \le \theta_2\}
\]
for some 0 < R1 < R2 and 0 ≤ θ1 < θ2 < 2π. Then, Theorem 8.126 shows that
\[
\int_{\Phi(A)} f(x)\,dx \;=\; \int_{\theta_1}^{\theta_2}\!\int_{R_1}^{R_2} f(r\cos\theta, r\sin\theta)\cdot r\,dr\,d\theta.
\]
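The polar-coordinate formula translates directly into code. The sketch below (our own Python helper `polar_integral`, a midpoint rule in the (r, θ)-variables) reproduces the value 2π(R2 − R1) for f(x) = 1/‖x‖ on the annulus with R1 = 1, R2 = 2:

```python
import math

def polar_integral(f, R1, R2, t1, t2, n=400):
    """Substitution x = (r cos t, r sin t): integrate f(r cos t, r sin t) * r."""
    hr, ht = (R2 - R1) / n, (t2 - t1) / n
    total = 0.0
    for i in range(n):
        r = R1 + (i + 0.5) * hr
        for j in range(n):
            t = t1 + (j + 0.5) * ht
            total += f(r * math.cos(t), r * math.sin(t)) * r  # Jacobian factor r
    return total * hr * ht

f = lambda x, y: 1.0 / math.hypot(x, y)
print(polar_integral(f, 1.0, 2.0, 0.0, 2 * math.pi), 2 * math.pi)  # both ~ 6.283
```

After the substitution the integrand is identically 1, which is exactly why the exact value 2π drops out so easily.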
Example 8.128 (Linear mappings). Another important class of transformations Φ : R^d → R^d are linear mappings. Recall that they are given by d × d-matrices, i.e., Φ(x) = T x for some T ∈ R^{d×d}. For such a mapping, it is easy to compute that J_Φ(x) = T for every x ∈ R^d.
Therefore, if det(T) ≠ 0, then Φ is injective and det(J_Φ(x)) = det(T) is either positive for all x or negative for all x ∈ R^d. We can therefore use Theorem 8.126 to obtain
\[
\int_{T(A)} f(x)\,dx \;=\; |\det(T)| \int_A f(Tu)\,du.
\]
We see that
\[
T(A) \;=\; \big\{ (u_1, u_2) \in \mathbb{R}^2 : 0 \le u_1 \le 1,\; 2 \le u_2 \le 4 \big\} \;=\; [0,1] \times [2,4].
\]
Example 8.129. Let us compute the integral of f(x, y) = xy over the set
\[
B \;:=\; \Big\{ (x, y) \in \mathbb{R}_+^2 : 1 \le \frac{y}{x} \le 2,\; 1 \le xy \le 2 \Big\}.
\]
Note that this function is well-defined and continuous due to the second condition.
To compute this integral we consider the substitution u := y/x and v := xy, because the domain is just A := {(u, v) : 1 ≤ u, v ≤ 2} = [1, 2]² under this substitution. To apply Theorem 8.126 we need to find the mapping Φ with Φ(u, v) = (x, y) such that Φ(A) = B.
For this, note that u = y/x is equivalent to y = xu. Plugging this into v = xy we obtain v = x²u, i.e., \(x = \sqrt{v/u}\), and therefore \(y = \sqrt{uv}\). Therefore, the desired mapping is given by
\[
\Phi(u, v) \;=\; \Big( \sqrt{\tfrac{v}{u}},\; \sqrt{uv} \Big) \qquad\text{on } A = [1, 2]^2.
\]
(This mapping is continuously differentiable in an open set around A.) By Theorem 8.126 we obtain
\[
\int_B f(x, y)\,d(x, y)
\;=\; \int_A f\Big(\sqrt{\tfrac{v}{u}},\, \sqrt{uv}\Big) \cdot |\det J_\Phi(u, v)|\,d(u, v)
\;=\; \int_1^2\!\int_1^2 v \cdot |\det J_\Phi(u, v)|\,du\,dv.
\]
It remains to compute the Jacobi matrix and its determinant, which are