Lecture Notes
Lecture Notes
Oliver Roche-Newton
Abstract
Lecture notes for the Mathematics for AI 1 course at JKU (Winter Semester 2024/25).
The notes are based on Mario Ullrich’s combined notes for Mathematics for AI 1-3. I am
grateful to Jan-Michael Holzinger, Severin Bergsmann, Sara Plavsic, Tereza Votýpková
and Monica Vlad for their help in constructing the notes.
Contents
1
3 Sequences and Series 112
3.1 Convergence of sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.2 Calculation rules for limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.3 Monotone sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.4 Subsequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.5 Cauchy criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.6 Introduction to series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.7 Calculation rules and basic properties of series . . . . . . . . . . . . . . . . . 139
3.8 Convergence tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2
1 Sets, Numbers and Functions
In this section, we will introduce some of the most fundamental objects of mathematics. In
particular, we will focus on numbers, sets (usually sets of numbers), and relations between
them. We will see some simple proofs and learn about some important proof techniques, par-
ticularly proof by induction and proof by contradiction. Finally, we treat complex numbers,
which are necessary to give solutions to arbitrary polynomial equations.
The content of this section will form the basis for the mathematics we learn throughout this
course and also the upcoming courses Mathematics for AI 2 and 3, and so it is essential to
have a solid understanding of the concepts we introduce here.
1.1 Sets
A set M is a collection of different ‘objects’ which we call elements of M . We use the following
notation:
x belongs to M, we write x ∈ M
or
x does not belong to M, we write x ∈
/ M.
Some particularly important sets are assigned names and symbols.
All these sets will be precisely defined and discussed later in this chapter. First, let us see
that there are multiple ways to define sets. The easiest way would be to list all its elements,
for instance:
However, if we have sets containing an infinite amount of elements, we cannot name all of
them. In this case we use dots if it is clear what is contained in the set. We did this already
for the sets N and Z. Some more examples are the even and odd natural numbers:
E = {2, 4, . . . }, O = {1, 3, . . . }.
However, this may lead to difficulties of interpretation as this is not guaranteed to give a
unique description. For example, we may define the set of even numbers, as above, via
E = {2, 4, . . . },
3
and the set of all powers of 2 as
G = {2, 4, . . . }.
Since these sets do not differ until the third element, they appear identical in this notation.
For a good definition of an infinite set, it is therefore formally necessary to precisely specify
the properties of its elements. For instance, the set of even numbers can be written as
E = {2k : k ∈ N},
Some authors prefer to use “⊆” for a generic subset instead of “⊂‘”, in order to make it
more explicit that equality is not excluded. The same authors typically use the notation “⊂”
instead of “⊊” for proper subsets. So, one should be mindful of this variation in notation
when using different literature.
The elements of sets can also be sets! An important example is the power set P(M ) for a
given set M , which consists of all subsets of M . That is,
P(M ) := {A : A ⊂ M }.
For example, consider once more the set A = {0, 1, 2}. The power set of A is
P(A) = {∅, {0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, A}.
4
The intersection M ∩ N consists of all elements which are in both M and N .
The set difference of M and N , written as M \ N , is the set of all elements of M which are
not contained in N .
If we only work with subsets M ⊂ Ω for a fixed set Ω, then we call Ω the underlying
set or the ground set. In this case, the complement of M (in Ω), denoted M c , is the set
Mc = Ω \ M.
Figure 1: Venn-diagrams
The illustrations above are called Venn-diagrams. Such pictures can be a helpful tool for
thinking about sets and how they interact with each other.
Example. Let Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and let A, B ⊂ Ω be the sets
A = {2, 3, 4}, B = {3, 5, 7, 9}.
Then
A ∪ B = {2, 3, 4, 5, 7, 9}
A ∩ B = {3}
B \ A = {5, 7, 9}
Ac = {1, 5, 6, 7, 8, 9, 10}
B c = {1, 2, 4, 6, 8, 10}.
5
All of the sets we have seen so far have elements that can be listed, even if in some cases the
list is infinite. However, this is not the case with all sets. Another important family of sets
to consider are intervals.
Definition 1.1. Let a, b ∈ R such that a ≤ b. Then we define the closed interval between
a and b to be the set
[a, b] := {x ∈ R such that a ≤ x ≤ b}.
The half open intervals between a and b are the sets
and
(a, b] := {x ∈ R such that a < x ≤ b}.
The open interval between a and b is the set
Moreover, we write
and
(−∞, a] := {x ∈ R such that x ≤ a}, (−∞, a) := {x ∈ R such that x < a}.
Elements of sets are not ordered, and so the sets {a, b} and {b, a} are the same. Nevertheless,
it is often important to order the objects under consideration, and for this purpose we have
the Cartesian product.
Definition 1.2. Let A and B be sets, and let a ∈ A and b ∈ B be arbitrary elements.
• The expression (a, b), which is sensitive to order, is called an ordered pair.
• Two tuples (a, b) and (a′ , b′ ) are equal if and only if a = a′ and b = b′ .
A × B := {(a, b) : a ∈ A, b ∈ B}
• In general (a, b) ̸= (b, a). In fact, the second part of the definition implies that (a, b) =
(b, a) if and only if a = b.
• An ordered pair (a, b) and a set {a, b} are completely different objects.
• If we consider more than two sets, say A1 , A2 , . . . , Ad for some d ∈ N, then we can also
define the (d-fold) Cartesian product
6
• For the Cartesian product
| ×A×
A {z· · · × A}
d times ,
we use the shorthand Ad .
√
Example - Let A = {π, 2} and B = {1, 2, 3}. Then
√ √ √
A × B = {(π, 1), (π, 2), (π, 3), ( 2, 1), ( 2, 2), ( 2, 3)}
and √ √ √
B × A = {(1, π), (1, 2), (2, π), (2, 2), (3, π), (3, 2)}.
7
1.2 Propositional logic
Next, we will introduce some notation from logic that can be helpful for making precise math-
ematical statements. We begin with the universal and existential quantifiers. These are
essentially just abbreviations.
We also sometimes use the colon symbol “:” to denote “such that”. With these notational
conventions, we can also give a more concise description of some of the sets and concepts we
discussed in the previous subsection. Here are some examples:
• The statement
∃x ∈ N : x is even
simply says that there is at least one even natural number. This is a true statement,
obviously!
• We can also use this notation to make false statements. For instance,
∀x ∈ N, x is even.
∀x ∈ M, x ∈ N.
∃x ∈ N : x ∈
/ M.
As you might have already noticed, we will often need the terms “if” or “if and only if”, and
therefore we define a mathematical symbol for them. In order to do this properly, we need
to learn some basic formal logic.
Definition 1.3. A proposition is a statement which is either true or false.
We can use connectives to build more complicated propositions depending on two proposi-
tions A and B.
Definition 1.4. Let A and B be propositions, then
Examples - Let A be the proposition “3 is a prime number” and let B be the proposition
“22 = 5”. So, A is true and B is false.
8
• ¬A is the statement “3 is not a prime”. This is false.
The last of these propositions highlights a difference between the meaning of “or” in mathe-
matics and in common language. The proposition A ∨ B is true if both A and B are true also.
In common language, when we say something like “please buy me some apples or oranges”,
we are expecting only one of the kinds of fruit to arrive (this is the exclusive or). The common
language equivalent of the symbol ∨ is “and/or”.
We sometimes use truth tables to see relations between truthfulness of propositions and
their component parts. Here is an example of a truth table for the negation, conjunction,
and disjunction.
A B ¬A A∧B A∨B
T T F T T
T F F F T
F T T F T
F F T F F
Using the notation for the conjunction, we see that the proposition A ⇐⇒ B is the same as
(A =⇒ B) ∧ (B =⇒ A). We also sometimes use “iff” as an abbreviation for “if and only
if”.
Examples - Here are some examples of true mathematical statements which involve impli-
cations.
• A ⊂ N =⇒ A ⊂ Z.
9
For the three statements above, it seems to be intuitively clear that they are true. We said
above that the implication A ⇒ B means that “if A is true then B is true”. However, what if
A is not true? This situation can be rather confusing. Consider the following two statements:
Although it is not so intuitively obvious, both of these statements are in fact also true. In
fact, if A is false, then the statement A ⇒ B is true, regardless of what B says! This can be
confusing, since it can be used to build mathematical statements which are logically true but
appear to be nonsensical, like these two above.
Truth tables may be helpful for clearing up potential confusion surrounding this issue. The
truth table showing both directions of implication is given below.
Here are some remarks about what this truth table shows us.
• We can see from the truth table above that the only way that the statement A =⇒ B
is false is if A is true and B is false.
• The final column shows that the statement A ⇐⇒ B is true when A and B have the
same truth value. If two propositions A and B have the same truth value then we say
that they are logically equivalent, and write A ≡ B.
We can build longer truth tables and use them to compare other logical statements. Let us
compare the statement A =⇒ B with ¬B =⇒ ¬A.
A B ¬A ¬B A⇒B ¬B ⇒ ¬A
T T F F T T
T F F T F F
F T T F T T
F F T T T T
We see that the truth values of the statements A =⇒ B and ¬B =⇒ ¬A are in fact
identical. This means that the statements are logically equivalent; if one is true then the
other is true, and if one is false then the other is false. Another way of writing this is
(A =⇒ B) ⇐⇒ (¬B =⇒ ¬A).
10
one another, particularly as it can help us to transfer a statement that is difficult to work
with into an equivalent statement that may be easier to understand.
We use implications in mathematics to build a chain of logic, using existing statements to
prove new ones. When we do this, we are using the transitivity of implication. This is the
statement that, for propositions A, B and C,
Finally, with all of this notation, we may write certain definitions or statements without using
any words, but rather by exclusively using mathematical symbols. For example,
M ⊂ N ⇐⇒ (∀x ∈ M, x ∈ N ) ⇐⇒ (x ∈ M ⇒ x ∈ N ).
The “sentence” above shows three different symbolic descriptions for the statement that M
is a subset of N .
However, just because we can use these symbolic shorthand descriptions, it does not mean
that we always should! Sometimes people forget that words themselves are a very valuable
tool for describing mathematical ideas.
11
1.3 Relations and functions
Roughly speaking, relations shall describe connections between two objects. Here, we give a
formal description and important properties. We then introduce functions; a special kind of
relation which describe connections from one set to another. We also discuss special relations
that are used to compare, group or order elements of a given set.
Definition 1.6. A relation R between two sets M and N is a subset of the Cartesian product
of M and N , i.e. R ⊂ M × N .
To make things clearer, see the upcoming illustration, which depicts every element of R as a
“connection” between an element of M and an element of N . As you can see it is possible
that x ∈ M is connected to some y ∈ N , which we denote by (x, y) ∈ R. However, this does
not have to be the case for every x ∈ M , and different elements of M may be connected to
the same y ∈ N . Moreover, x ∈ M can be connected to more than one element in N , or can
even be connected to none of the elements of N .
R = {(2, 4), (2, 5), (2, 6), (2, 7), (3, 4), (3, 5), (3, 6), (3, 7), (5, 4), (5, 5), (5, 6), (5, 7), (4, 5), (4, 7)}.
Now we head to a very important type of relation, namely functions, which assign to each
element of M exactly one element of N .
Definition 1.7. Let M and N be non-empty sets. We call f : M → N a function from M
to N , if each x ∈ M is assigned exactly one element f (x) ∈ N .
M is called the domain of f and N is the codomain of f .
f (x) = x2
12
is a function. It satisfies the key property that every element x in the domain is assigned a
value f (x).
Similarly, the function g : R → R defined by
g(x) = x + 1
is a function.
f (S) := {f (x) : x ∈ S} ⊂ N,
f −1 (T ) := {x : f (x) ∈ T } ⊂ M.
Examples - Consider again the functions f (x) = x2 and g(x) = x + 1 introduced above. Let
S ⊂ R be the closed interval S = [1, 3]. Then
The next definition shows the connection between relations and functions.
Gf := {(x, f (x)) : x ∈ M } ⊂ M × N.
Note that the graph of a function is a relation. In this sense, all functions induce a relation,
but not vice versa.
We can visualize a function with domain and codomain in R by plotting its graph in R2 . For
f (x) = x2 and g(x) = x + 1 this is demonstrated in the next illustration (Figure 3).
In what follows, we will define several important properties of relations. We will always
demonstrate afterwards what this means for functions.
13
Figure 3: The graph of x2 and x + 1
∀ (x1 , y1 ), (x2 , y2 ) ∈ R, x1 ̸= x2 ⇒ y1 ̸= y2 .
This is equivalent to
∀(x1 , y1 ), (x2 , y2 ) ∈ R, y1 = y2 ⇒ x1 = x2 .
∀y ∈ N, ∃x ∈ M : (x, y) ∈ R.
∀x ∈ M, ∃ y ∈ N : (x, y) ∈ R
and
∀x ∈ M, y1 , y2 ∈ N, ((x, y1 ), (x, y2 ) ∈ R) ⇒ y1 = y2 .
The two requirements for a relation to be functional can be written more succinctly as
follows:
∀x ∈ M, ∃! y ∈ N : (x, y) ∈ R.
Note that the graph of a function is a functional relation, and vice versa. We can therefore
rephrase the above definitions for functions.
14
• We say f is injective if and only if
∀y ∈ N, ∃x ∈ M : f (x) = y.
∀y ∈ N, ∃!x ∈ M : f (x) = y.
15
Figure 7: bijective function
We will now define two simple functions that can be defined on an arbitrary set M . First,
we define the identity function IdM : M → M which maps each element to itself, i.e.,
IdM (x) = x.
A constant function is a function which takes the same value for every element. Let M and
N be arbitrary non-empty sets and let c ∈ N be fixed. The function f : M → N , which is
defined by
f (x) = c,
for all x ∈ M , is called a constant function.
We now ask ourselves the following question: given a function f : M → N , can we find
another function which reverses the effect of f . This leads us to the concept of an inverse
function.
Definition 1.12. Let f : M → N and g : N → M be functions with the properties
∀ x ∈ M, g(f (x)) = x
and
∀ y ∈ N, f (g(y)) = y.
Then f is the inverse of g and g is the inverse of f . In this case we write f −1 := g and
g −1 := f and call f and g invertible.
16
Note that we already used the notation f −1 in the context of the preimage of a set under a
function f . This similarity in notation is intentional, as the following exercise indicates.
Exercise - Let f : M → N be an invertible function with inverse f −1 : N → M and S ⊂ N .
Prove that
{f −1 (y) : y ∈ S} = {x ∈ M : f (x) ∈ S}. (3)
Let us revisit Definition 1.8. The left hand side of (3) is the image of S under f −1 , which
is denoted f −1 (S). The right hand side is the preimage of S under f , which is also denoted
f −1 (S). So, the exercise shows that these two notational conventions do not clash with one
another.
Example Let R+ denote the set of all non-negative real numbers (i.e. R+ = [0, ∞)) and let
f : R+ → R+ and g : R+ → R+ be defined by
√
f (x) = x2 , and g(x) = x.
√
For fixed x > 0, we have g(f (x)) = x2 = x. On the other hand, for all y > 0 we have
√
f (g(y)) = ( y)2 = y. Therefore, f and g are inverses of each other.
The following theorem provides us with a tool to check if a function has an inverse or not.
f is invertible ⇐⇒ f is bijective.
This is the first theorem we have seen in this course, and it will be the first of many. We are
trying to build a rigorous mathematical foundation in this course, and so we will (with a few
exceptions) give a proof of every result we use during the course. The skill of understanding
and writing mathematical proofs is rather specialised, and is likely to be unfamiliar to many
students of this course. We will devote special attention to proof techniques in a section that
will be shortly forthcoming, and will include the proof of Theorem 1.13.
Invertible (or bijective) functions may be used to formally define the cardinality of a set.
17
As well as finite sets, bijections also allow us (to some extent) to characterise the cardinality
of an infinite set.
• If there exists n ∈ N such that |A| = n, then we call A finite, or a finite set.
Note that countability is the precise definition of the “simple” property that the elements of
A can be enumerated by the natural numbers {1, 2, 3, . . . }, which is a fancy way to say, the
elements of A can be counted.
Examples - Define a function f : N → Z such that, for all n ∈ N,
This is a bijection. Indeed, f (1) = 0, f (2) = 1, f (3) = −1, f (4) = 2, f (5) = −2, and so on.
We can see from this pattern that every element of Z is mapped to by exactly one element
of N. By Definition 1.15, it follows that Z is a countable set.
Note the following strange feature of this example. Intuitively, it seems that the set Z is
“bigger” than N. Indeed, N is a proper subset of Z, and appears to contain half as many
elements. Still, with the notion of size (i.e. cardinality) given by Definition 1.15, the two sets
have the same size; they are both countably infinite.
Exercise - Show that the set E = {2, 4, 6, . . . } is countably infinite.
Exercise - Let M and N be finite sets. Show that |M × N | = |M ||N |.
We can also consider the composition of functions. Let X, Y, Z be non-empty sets and let
f : X → Y and g : Y → Z be functions. We then define a function (g ◦ f ) : X → Z by first
applying f and then applying g. That is, we define (g ◦ f )(x) = g(f (x)).
Exercise - Check that g ◦ f is indeed a function.
Note that it is important that the codomain of f matches the domain of g in order for the
definition of g ◦ f to make sense.
Example - If f : M → N is an invertible function and f −1 : N → M is its inverse, then, for
all x ∈ M and y ∈ N , we have
18
We get
(g ◦ f )(x) = g(f (x)) = sin(x2 ) = h(x).
However, if we reverse the order of composition and consider the function f ◦ g, we see that
In particular the functions f ◦ g and g ◦ f are not the same. Indeed, in a typical case we have
f ◦ g ̸= g ◦ f , and it can even be the case that only one of the compositions is defined.
The next theorem shows how to calculate the inverse of a composition of two functions.
Some relations which have certain combinations of these properties are especially important.
19
Example - Consider the relation L ⊂ R2 , where we define
(x, y) ∈ L ⇔ x ≤ y.
This is the well-known smaller or equal relation in R and we just write x ≤ y for (x, y) ∈ L
in the following.
L = {(x, x) : x ∈ R} ⊂ R2 .
Then L is reflexive, symmetric, antisymmetric and transitive, but not total. It is therefore
an equivalence relation and a partial order.
Exercise - Show that the relation L from the previous example is the only relation that is
symmetric and antisymmetric.
Example - Define the “strictly less” relation “<” by
• L is not reflexive, since, for example, 1 < 1 does not hold. (Of course, we know that
x < x is never true, but in order to show that reflexivity does not hold, we only need
to find a single instance of a counterexample.)
• L is not total. Once again, this is because, for example, 1 < 1 does not hold.
20
This is an instance of a more general situation that occurs in several mathematical proofs.
Suppose that P (x) is a statement that depends on some real number x, and we want to verify
that
x ∈ A ⇒ P (x) is true. (4)
In the particular case when A = ∅, the implication (4) is certainly true, because the left hand
side is certainly false.
Exercise - Let m and n be integers. We say that m divides n, and write m|n if there exists
k ∈ Z such that
mk = n.
Show that the divisibility relation
R := {(a, b) ∈ N × N : a|b}
is a partial order.
Example - One can define a partial order on a set of sets by using the subset relation ⊂.
For example, for
M := {∅, {1}, {2}, {1, 2}},
we define the relation
L = {(A, B) ∈ M × M : A ⊂ B}.
It follows from the basic properties of inclusion that the relation L is reflexive, antisymmetric
and transitive. Hence it is a partial order on M . However, since {1} ̸⊂ {2} and {2} ̸⊂ {1},
it is not total, and so not a linear order on M .
For the remainder of this subsection, we will pay special attention to equivalence relations.
Equivalence relations have the particularly useful and interesting property that they can be
used to partition the elements of a set into related chunks. These partitioning sets are called
equivalence classes.
Definition 1.19. Let R ⊂ X × X be an equivalence relation. For x ∈ X, we define the
equivalence class of x by
When it is clear which relation we are referring to (which it usually is in practice) we abbre-
viate [x]R to [x].
Each element y ∈ [x] is called a representative of the equivalence class [x]. Note that, by
reflexivity (x, x) ∈ R, and hence x ∈ [x]. So an equivalence class is never the empty set and
x is a representative of [x].
Example - Consider again the equivalence relation
L = {(x, x) : x ∈ R} ⊂ R2 .
Then, for each x ∈ R, we have [x] = {x}, and so the equivalence classes for this relation are
all singleton sets (i.e. sets with exactly one element).
Example - Fix an arbitrary m ∈ N. We say that two integers a, b ∈ Z are congruent
modulo m, if
m|(a − b)
21
and we write
a ≡ b mod m.
Define
R = {(a, b) ∈ Z : a and b are congruent modulo m}.
Then R is an equivalence relation. Indeed, let us check that the necessary three properties
hold.
• Symmetry: Let a, b ∈ N such that (a, b) ∈ R. That is, m|a − b. In other words, there
is some k ∈ Z such that
a − b = mk.
But then
b − a = m(−k),
which means that m divides b − a and thus (b, a) ∈ R.
• Transitivity: Let a, b, c ∈ Z such that (a, b), (b, c) ∈ R. By the definition of divisibility,
there exist integers k1 , k2 such that
Therefore,
a − c = (a − b) + (b − c) = mk1 + mk2 = m(k1 + k2 ).
This means that m divides a − c, and thus (a, c) ∈ R.
Note that, in both of the examples above, the equivalence classes do not overlap with one
another, and every element of the underlying set belongs to exactly one equivalence class. This
is not a coincidence; the following theorem shows that equivalence classes always partition
the underlying set in such a way.
can be rewritten as
[y] ̸= [x] ⇐⇒ [x] ∩ [y] = ∅.
Therefore, we indeed have that two distinct equivalence classes must be disjoint.
22
1.4 Real numbers
The natural starting point when thinking about numbers is the set N = {1, 2, 3, . . . }. These
are the first numbers we learn about as small children, and it feels right that they are called
the natural numbers. However, this alone is not enough to solve all of the equations that we
encounter using simple mathematical operations like addition, subtraction, multiplication,
and division. And so, we gradually extend our universe of numbers in order to be able to
deal with more of these simple mathematical questions. Continuing in this way, we soon
reach the set of real numbers.
Let’s give a little more detail about this vague idea described in the paragraph above. The set
N is closed under addition and multiplication. That is, for all n, m ∈ N we have m + n ∈ N
and m · n ∈ N. However, we already get in trouble when we try to work with subtraction,
since, for example 21 − 42 = −21 ∈ / N.
The set Z of integers is closed under addition and multiplication, but is additionally closed
under subtraction. That is, for any a, b ∈ Z we have a + b, ab, a − b ∈ Z. However, division
is still a problem if we use integer numbers only. For instance, 1 and −3 are both in Z, but
1
the ratio −3 is not.
We therefore define the set of all rational numbers
na o
Q := : a, b ∈ Z, b ̸= 0 ,
b
where we call a the numerator and b the denominator. It is possible that the same rational
number can have two different representations. For instance, 21 and 24 are the same number.
We call two rational numbers nk and ml equal if and only if km = ln with m, n ̸= 0.
It is sometimes convenient to exclude 0 from Q in order to avoid the problem of potentially
dividing by zero, and we use the notation Q∗ := Q \ {0}. We now have a set which is closed
under division. Indeed, if we take two elements ab and dc in Q∗ (with a, b, c, d ∈ Z \ {0}), then
the ratio
a/b ad
=
c/d bc
is an element of Q∗ .
k
Note also that any integer k ∈ Z can be represented as an element of Q by writing 1 = k.
Consequently, we can expand the earlier chain of inclusions (1) to
N ⊊ N0 ⊊ Z ⊊ Q. (5)
The real reason that we have introduced “new” sets of numbers above is that we wanted to
solve certain equations and, in particular, we want to know if a solution exists in a given set.
For example, if we want to solve the equation a · x + b = 0, where a, b ∈ Q are fixed constants
and a ̸= 0, we see that
−b
x= ∈Q
a
and so we can find a solution to the equation in Q.
However, what about the simple equation x2 − 2 = 0? Is there some x ∈ Q such that x2 = 2?
The following theorem will show that this is not possible.
23
Theorem 1.21. There is no x ∈ Q such that x2 = 2.
We will come back and prove this theorem in the next subsection, using the method of proof
by contradiction.
Theorem 1.21 shows that the equation x2 − 2 = 0 is not solvable in Q. But we would like
there to √be a solution √
somewhere! Furthermore, we all know from school that there is a
number 2 such that ( 2)2 = 2.
Making the number line complete, we finally get to the set of real numbers. Using the
familiar decimal expansion, we can define the set of real numbers to be
n a1 a2 o
R := b + + + · · · : b ∈ Z , a1 , a2 · · · ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} . (6)
10 100
One may think of R as the set of all points on the number line, i.e., the number line without
any holes. Note that there are infinitely many terms in the sum in (6) defining an element of
R. It is possible that, after a certain point, all of the ai are equal to zero. For instance, an
integer n can be expressed in this form, with all of the ai equal to zero. More generally, the
rational numbers, written using the decimal expansion as in (6), either have a finite number
of digits or the sequence of digits is periodic. This means that, if we consider only the rational
numbers, some points on the line are missing. These correspond to numbers which have √a
non-periodic infinite number of decimals, and cannot be written as fractions. As well as 2,
some other prominent examples of such numbers are π and e. These elements of R \ Q are
called irrational numbers.
We can extend the hierarchy described in (5) to include R as follows:
N ⊊ Z ⊊ Q ⊊ R.
It turns out that the set of rational numbers is countable, whereas the set of real numbers is
uncountable. So, there are “more” irrational numbers than there are rationals. The task of
proving this statement will be considered in a future exercise sheet.
The set of real numbers satisfies several convenient arithmetic and algebraic properties. These
properties turn out to be useful and interesting in a much more general setting, which is why
we give the following definition.
Definition 1.22. Let F be a set with operations of addition (formally, addition is a function
+ : F × F → F) and multiplication (formally, multiplication is a function · : F × F → F)
defined over F which satisfy the following properties.
• Identity elements: there exists two distinct elements 0, 1 ∈ F such that, for all x ∈ F
1. x + 0 = 0 + x = x and
2. x · 1 = 1 · x = x.
24
• Inverse elements:
All of these properties hold for the set of reals, as we can see by using elementary rules for
manipulating algebraic equations, and so R is a field.
Exercise - With the usual operations of addition and multiplication, is Q a field? How about
Z?
Exercise - Suppose that F is a field. Show that, for all x, y, z ∈ F,
(x + y) · z = (x · z) + (y · z).
25
1.5 An introduction to proof
The concept of rigorous proof is absolutely vital in mathematics. Unlike other branches of
the sciences, we cannot be satisfied with evidence that something is very likely to be true,
or true beyond reasonable doubt, as one might accept in a court of law. Rather, we need to
know that something is true (or false) with absolutely certainty.
For instance, the Goldbach Conjecture states that every even integer greater than 2 can be
written as a sum of two prime numbers. We can get some intuition for this conjecture by
writing down some examples:
4 = 2 + 2,
6 = 3 + 3,
8=3+5
10 = 5 + 5
..
.
100 = 3 + 97.
This list provides some initial evidence that the conjecture is correct. In fact, a whole load
more evidence exists; the conjecture has been verified, with the help of computers, for all
even integers n up to 4 · 1018 . That is a lot of cases checked!
Still, mathematicians continue to seek a proof of the Goldbach conjecture that guarantees
that the statement is valid for all even n, with absolute certainty. A great deal of effort has
been invested in this pursuit, including efforts from many great mathematicians spanning
several generations, and a million dollar prize was even offered at one point.
In this subsection, we will discuss some basic techniques that allow us to prove some of the
statements that were given earlier in Section 1, and in doing so develop some tools that will
allow us to prove many interesting statements during the remainder of the course. We will
focus on three particular techniques: proof by definition, proof by contradiction, and proof
by induction.
We begin with proof by definition. These are often the easiest kinds of proofs, and the
technique essentially consists of stringing together some basic properties that follow from
definitions. This is another advertisement for the importance of knowing all of the definitions
in this course!
We begin with a fairly simple proof that uses only the definitions involved. The following
lemma (a lemma is like a small theorem, √ and usually is used to prove a more important
result) will be helpful in proving that 2 is irrational. We make use of the obvious definition
of an even integer; n ∈ Z is even iff there exists k ∈ Z such that n = 2k.
Lemma 1.23. Let n ∈ N. Then
n is even ⇐⇒ n2 is even.
Proof. This is a two-way implication, and so we need to prove both directions. We begin by
proving the “⇒” direction. Suppose that n is even. Then, by the definition of evenness, we
can write n = 2k for some k ∈ N. Therefore,
n2 = (2k)2 = 4k 2 = 2(2k 2 ).
26
Now we prove the “⇐” direction. In fact we will prove the equivalent contrapositive statement
n is odd =⇒ n2 is odd.
For the next instance of these methods, let us restate and then prove Theorem 1.13.
Theorem 1.24. Let f : M → N be a function. Then,
f is invertible ⇐⇒ f is bijective.
Proof. We are trying to prove that an “ ⇐⇒ ” holds, and so we need to prove two implications.
We begin with the “⇒” direction. We assume that f is invertible, and we need to show that
it is a bijection.
We begin by using the definition of invertibility. Since f is invertible, there exists f −1 : N →
M such that
f −1 (f (x)) = x, ∀ x ∈ M, (7)
and
f (f −1 (y)) = y, ∀ y ∈ N. (8)
To show that f is bijective, we need to show that it is surjective and injective. We can take
care to write down the definitions that tell us exactly what we need to show. We begin with
surjectivity. We need to show that,
∀ y ∈ N, ∃ x ∈ M : f (x) = y.
∀ x1 , x2 ∈ M, x1 ̸= x2 =⇒ f (x1 ) ̸= f (x2 ).
It is easier to work with the logically equivalent contrapositive of this statement. That is, we
will show that
∀ x1 , x2 ∈ M, f (x1 ) = f (x2 ) =⇒ x1 = x2 .
Let f (x1 ) = f (x2 ) for some x1 , x2 ∈ M . Since f (x1 ) ∈ N and f (x2 ) ∈ N we may apply the
function f −1 to f (x1 ) = f (x2 ) and get
x1 = f −1 (f (x1 )) = f −1 (f (x2 )) = x2 ,
as required.
We now move to the “⇐” direction. We assume that f is bijective and need to prove that f
is invertible. Since f is bijective we have
∀ y ∈ N, ∃! x ∈ M : f (x) = y. (9)
27
In words, for each y ∈ N we can find a unique x ∈ M such that f (x) = y. Now we define
g : N → M such that it maps each y ∈ N to this unique x ∈ M , i.e. g(y) = x, where x is
given by (9). This shows that, for each y ∈ N , we have f (g(y)) = y. Indeed,
f (g(y)) = f (x) = y.
g(f (x)) = x.
But this is an immediate consequence of the way that the function g has been defined.
Therefore, f is invertible with f −1 = g.
Exercises
• Prove that composition of functions is associative. That is, for any functions
f : M → N, g : N → O, and h : O → P,
prove that, ∀ x ∈ M ,
(h ◦ (g ◦ f ))(x) = ((h ◦ g) ◦ f )(x).
• Show that the inverse function is unique. That is, let f : M → N be an invertible
function and suppose that g1 , g2 : N → M are inverses of f . Prove that g1 = g2 .
Next, we prove Theorem 1.16, which relates inverses and compositions. We restate and prove
the result now.
Theorem 1.25. Let f : X → Y and g : Y → Z be invertible functions. Then g ◦ f is
invertible and
(g ◦ f )−1 = f −1 ◦ g −1 .
Proof. We will directly check that f −1 ◦ g −1 is the inverse of g ◦ f . Recalling the definition
of the inverse of a function, there are two things we need to check:
and
(g ◦ f )((f −1 ◦ g −1 )(z)) = z ∀ z ∈ Z. (12)
Let us first focus on the goal of checking that (11) holds. This can be rewritten as
By the associativity of function composition (see the previous exercise) applied twice,
28
By (10), it follows that
We continue to collect some proofs of statements that were given earlier in the course. The
proof of the next statement (stated earlier as Theorem 1.20) also uses only the relevant
definitions.
Theorem 1.26. Let R ⊂ X × X be an equivalence relation and x, y ∈ X. Then,
Once we have completed this cycle of implication, we can see that all three of the statements
imply one another by moving around the cycle.
Proof that (x, y) ∈ R =⇒ [x] = [y]. Suppose that (x, y) ∈ R. We will first show that
[x] ⊂ [y]. Let z ∈ [x]. By definition (x, z) ∈ R. By the symmetry of R, we also know that
(y, x) ∈ R. Then, by the transitivity of R, we have
[y] = [x].
29
Proof that [x] = [y] =⇒ [x] ∩ [y] ̸= ∅. This is clear, we know that [x] is not the empty set
(since x ∈ [x]) and so [x] ∩ [y] is not the empty set (for instance x ∈ [x] ∩ [y].)
Proof that [x] ∩ [y] ̸= ∅ =⇒ (x, y) ∈ R. If [x] ∩ [y] ̸= ∅ then there is some z ∈ [x] ∩ [y]. By
definition (x, z) ∈ R and (y, z) ∈ R. By symmetry, (z, y) ∈ R. Then, by transitivity,
We now move on to proof by contradiction. The basic idea is as follows. Suppose that we
want to prove that a given statement A is true. We assume the opposite, which is that A is
not true, and look to see what consequences of this assumption can be logically derived. If
we can derive a statement B which we know to be false, then there must be something wrong
with our initial assumption that A is false. We therefore conclude that A is true. Actually,
what we are doing with this approach is proving the equivalent contrapositive statement
(recall the definition from Section 1.2), and some students may find this to be an easier way
to think about this approach.
We
√ first give one of the most famous proofs by contradiction in mathematics; the proof that
2 is irrational.
Proof. We assume for a contradiction that there is some x ∈ Q such that x2 = 2. We write
m
x= (14)
n
for some integers m and n. Moreover, we may assume that at least one of m and n is odd.
Indeed, if we have m = 2k and n = 2l then we can rewrite x as x = kl with k < m. We
repeat this process until one of the parts of the fraction is odd. The process must terminate
eventually, as the value of the integer k gets strictly smaller each time, and this cannot happen
infinitely often.
Since x2 = 2, (14) gives
m2
2 = x2 = .
n2
This rearranges to give
m2 = 2n2 . (15)
It therefore follows from Lemma 1.23 that m is even, and so we can write m = 2k for some
k ∈ Z. We can plug this back into (15) to get 2n2 = m2 = (2k)2 = 4k 2 , and thus
n2 = 2k 2 .
Therefore, n2 is even, and it again follows from Lemma 1.23 that n is even.
We have reached a contradiction, since we stated earlier that at least one of m and n must be
odd, but we have also shown that they must both be even. Therefore, our original assumption
that x2 = 2 for some x ∈ Q must be false. The proof is complete.
30
We will see another famous example of a proof by contradiction a little later in this section,
but now we move on to proof by induction. Proof by induction is a very structured
proof technique that is particularly useful for proving that some proposition P (n) is true for
all integers n. The basic principle of mathematical induction is explained in the following
statement.
Theorem 1.28. A predicate P (n) is true for all n ∈ N if the following two steps hold:
Proof. We know that P (1) is true, by the first assumption. We can then iteratively apply
(16) to obtain the chain of implications
P (1) is true =⇒ P (2) is true =⇒ P (3) is true =⇒ . . . ,
which completes the proof.
We call the statement that P (1) is true the induction basis or base case. The step
P (n) =⇒ P (n + 1) is called the induction step, and the assumption that P (n) is true is
called the induction hypothesis.
Let us discuss two examples to demonstrate this type of proof. The first one is (at least
without proof) known to many from school. The young Carl Friedrich Gauss (1777-1855)
knew it already by the age of nine.
Theorem 1.29. For all n ∈ N,
n
X n(n + 1)
k= .
2
k=1
Proof. We first need to check that the statement is valid in the base case when n = 1. This
is true, since
1
X 1(1 + 1)
k=1= .
2
k=1
Now, we need to show that, if the statement is true for n, it is also true for n + 1. That is,
we assume the induction hypothesis
n
X n(n + 1)
k= ,
2
k=1
and we need to use this to show that
n+1
X (n + 1)(n + 2)
k= .
2
k=1
Indeed,
n+1 n
!
X X n(n + 1) n(n + 1) 2n + 2 (n + 1)(n + 2)
k= k +n+1= +n+1= + = .
2 2 2 2
k=1 k=1
We used the induction hypothesis in the second equality.
31
Another important instance of a proof by induction is Bernoulli’s inequality, which will
be essential later.
Theorem 1.30. Fix a real number x ≥ −1 and let n ∈ N. Then
Proof. We regard x ≥ −1 as fixed, and seek to prove that (17) holds for all n ∈ N by
induction. We first check the base case with n = 1. Then (17) becomes
(1 + x)1 ≥ 1 + 1x.
The two sides of this inequality are the same, and so this is true. Thus the base case is valid.
We now assume the induction hypothesis
(1 + x)n ≥ 1 + nx (18)
(1 + x)n+1 ≥ 1 + (n + 1)x.
Indeed,
(18)
(1 + x)n+1 = (1 + x)n (1 + x) ≥ (1 + nx)(1 + x) = 1 + (n + 1)x + nx2 ≥ 1 + (n + 1)x. (19)
Where have we used the assumption that x ≥ −1 in this proof? It is not so easy to see that
we have used it if we do not pay proper attention. We have in fact used this assumption in
the first inequality of (19), since we need 1 + x ≥ 0 for the logic of this step to make sense.
It is sometimes helpful to use the following interpretation of proof by induction, where we
strengthen our induction hypothesis to assume that the statement we want to prove is valid
for all smaller values. This is called the method of proof by strong induction.
Theorem 1.31. Let k ≥ 1 be an integer. A predicate P (n) is true for all n ∈ N such that
n ≥ k if the following two steps hold
Proof. We know that P (k) is true, by the first assumption. Similarly to the proof of Theorem
1.29, we can then repeatedly apply the second assumption to prove that P (n) holds for all
n ≥ k. Indeed,
32
Theorem 1.31 looks slightly more complicated than Theorem 1.29, and one of the reasons
for this is that we have made a generalisation that allows us to consider a base case which is
not k = 1. However, the two methods are essentially the same, and we can also take a larger
base case for the usual proof by induction.
We will use strong induction to prove that every integer can be written as a product of primes.
This is one part of the Fundamental Theorem of Arithmetic. As the name suggests, this is
an extremely important fact in number theory, which tells us that prime numbers form the
building blocks for our number system.
Theorem 1.32. Let n ≥ 2 be an integer. Then there is some i ∈ N and prime numbers
p1 , . . . , pi such that
n = p1 · p2 · · · pi . (20)
Note that this statement does not require that the primes p1 , . . . , pi are distinct. For instance,
to verify that the theorem holds for n = 4, we can observe that 4 = 2·2, and this decomposition
satisfies the required form (20).
Proof. We will use proof by strong induction. For the base case n = 2, we know that n is a
prime, and so it is already in the required form (20) (with i = 1).
Now we assume the induction hypothesis that every 2 ≤ i ≤ n can be written as a product
of primes, and we seek to show that n + 1 can be written as a product of primes. If n + 1 is
a prime number then it automatically has the required form of (20). If n + 1 is not prime
then there exist two integers 2 ≤ a, b < n + 1 such that
n + 1 = a · b. (21)
a = p1 · · · pi , and b = q1 · · · qj ,
n + 1 = a · b = p1 · · · pi · qi · · · qj .
After relabelling the primes qi suitably, this gives the required form (20), and the proof is
complete.
We conclude this section by using Theorem 1.32 as part of another ancient and famous proof
by contradiction.
Theorem 1.33. The set P of all prime numbers is infinite.
P = {p1 , p2 , . . . , pn }
for some integer n. In this notation, pi denotes the ith prime number. So, p1 = 2, p2 = 3,
p5 = 11, and pn is the largest prime. Then consider the number
m := p1 · p2 · · · pn + 1.
33
Since m is larger than all of the elements of P, it is not a prime. But we know from Theorem
1.32 that m can be written as a product of primes. In particular, it follows that, for some
1 ≤ i ≤ n, pi divides m. But pi ≥ 2, and so
m 1
= p1 · · · pi−1 · pi+1 · · · pn + ,
pi pi
which is not an integer. This is a contradiction (since pi both divides and does not divide m)
and so the proof is complete.
34
1.6 Bounded sets, infimum and supremum
In this subsection, we want to understand the “smallest” and the “largest” element of a set,
where possible. To define this formally, we first need the definition of a bounded set (in R).
Definition 1.34. Let A ⊂ R.
∃ c ∈ R : ∀ a ∈ A, a ≤ c.
∃ c ∈ R : ∀ a ∈ A, c ≤ a.
• We say A is bounded if and only if A is bounded from above and bounded from below.
Example - Let a, b ∈ R such that a < b. Then for the closed interval I = [a, b] we have that
a, a − 1 and a − 42 are lower bounds for I and b and b + 42 are upper bounds, and the same
is true for the corresponding open interval (a, b). So, upper and lower bounds are not unique
in these cases (and in general).
To fix a specific upper/lower bound, let us first define the minimum and maximum of a set.
Definition 1.35. Let A ⊂ R be a non-empty set and t ∈ R. Then, t is called a minimal
element or minimum of A, denoted by min A := t , if and only if
• t ∈ A.
• t ∈ A.
35
Definition 1.36. Let A ⊂ R be non-empty and t ∈ R. Then, t is called the greatest lower
bound or infimum of A, denoted by inf A := t , if and only if
t is called the least upper bound or supremum of A, denoted by sup A := t , if and only if
If A is not bounded from above we set sup A := ∞. If A is not bounded from below then we
put inf A = −∞.
For the empty set, we define
If inf A ∈ A, then inf A = min A, and if sup A ∈ A, then sup A = max A. In words, if the
infimum (or supremum) of a set A is contained in A, then A has a minimum (or maximum)
which has the same value.
Moreover, the infimum and supremum are uniquely determined. To see this for the supremum,
assume that there are two suprema t1 and t2 for A. Since sup A = t1 , we have A ≤ t1 . In
addition, since sup A = t2 , we obtain by the second defining property above that x ≥ A =⇒
x ≥ t2 . Setting x = t1 , we have t1 ≥ t2 . But we can also make this argument in reverse to
show that t1 ≤ t2 . Hence t1 = t2 .
Example - Let a, b ∈ R with a < b. Then,
and
max[a, b] = max(a, b] = sup[a, b] = sup[a, b) = sup(a, b] = sup(a, b) = b.
However, min(a, b), min(a, b], max(a, b) and max[a, b) do not exist.
Example - Let A = {x2 : x ∈ (−1, 1)}. Then A = [0, 1). Hence
Let us state an equivalent definition of the supremum and infimum for bounded sets in R.
Although it looks more complicated at first sight, this formulation is sometimes very helpful.
Moreover, thinking in this way will be a helpful preparation for some of the real analysis
we consider later, and particularly for understanding important definitions of convergence,
continuity, and much more.
Definition 1.37. Let A ⊂ R be bounded from below. Then, t = inf A if and only if
36
• ∀ ϵ > 0, ∃ a ∈ A : a < t + ϵ (i.e. t comes arbitrarily close to A).
Exercise - Prove that every non-empty set A ⊂ R that is bounded from below has a greatest
lower bound. (Hint: Use the completeness axiom and a previous exercise.)
Note that the completeness axiom is false if we consider Q instead of R. Consider for instance
the set √
A = {x ∈ Q : x ≤ 2}.
This set is bounded, and we can find a number c ∈ Q such that c ≥ A (for instance, c = 32
works). However, we cannot find a least upper bound among all such rational c. Suppose
√
that we try to do√this. We are looking for a rational number c which is larger than 2 but
also as close to 2 as possible. But,√no matter which C we choose, we can always find a
smaller one which is still larger than 2.
This idea is considered more formally in the next theorem, where we state the Archimedean
property. This is based on the fact that the set of natural numbers is unbounded. Even
though this property seems rather obvious and unimpressive, it was of significant importance
for real analysis, and will be implicitly used repeatedly throughout the Mathematics for AI
study.
Theorem 1.39. The following assertions hold:
∀x ∈ R, ∃ n ∈ N : n > x.
37
The last of these three points is interesting; it tells us that, no matter how close the two real
numbers x and y we consider are, we can always find a rational number in between them.
Another way to think of this is as follows: no matter where you are on the real number line,
you are always arbitrarily close to a rational number.
Proof. We first prove the first point using proof by contradiction. For this, we assume that
N is bounded. Thus, the supremum x = sup N exists by the completeness axiom. As x is the
least upper bound of N, x − 1 is not an upper bound for N, so there exists some n ∈ N with
x − 1 < n.
x < n + 1.
But n + 1 ∈ N (since n ∈ N) and therefore x is not an upper bound for N. This contradicts
the fact that x = sup N. So, N must be unbounded.
Now we prove the second point of the theorem. Let ϵ > 0 be arbitrary. We can apply the
Archimedean property with x = 1ϵ , which implies that there exists n ∈ N with
1
n> .
ϵ
Rearranging this inequality, we obtain
1
< ϵ,
n
as required.
We now prove the third bullet-point. We will only check the case x, y > 0, since the other
possible cases can be checked similarly. Then we are looking for m, n ∈ N such that nx <
1
m < ny. By the Archimedean property, there exists n ∈ N with n > y−x , which rearranges
to give
ny > 1 + nx. (22)
Now take the smallest natural number m such that m > nx (such an m certainly exists).
Then m > nx and m − 1 ≤ nx. The latter inequality combined with (22) implies that
m ≤ nx + 1 < ny.
Based on this, we can easily calculate certain infima and suprema of (discrete) sets.
Example Let A = { n1 : n ∈ N}. Then
To prove (23), first note that 0 < n1 ≤ 1 for all n ∈ N. So, 0 is a lower bound for A, and 1 is
an upper bound. Since 1 ∈ A (for n = 1), we therefore obtain that max A = 1.
38
For the infimum we need to show the there is no larger lower bound. For this, let ϵ > 0 be
arbitrary, and suppose for a contradiction that ϵ is a lower bound for A. By the Archimedean
property there exists an n ∈ N with n1 < ϵ. Therefore, we have found some a = n1 ∈ A such
that a < ϵ. This contradicts the assumption that ϵ is a lower bound, and completes the proof.
Exercise - Let
1
A= : n ∈ N .
n2 − n − 3
Calculate inf A, sup A, min A and max A, if these quantities exist.
39
1.7 Some basic combinatorial objects, identities and inequalities
We begin this section by defining some important combinatorial quantities. These quantities
are heavily used when it comes to discrete mathematics or elementary probability theory, as
they represent the number of permutations or subsets of certain size.
In addition, we define 0! = 1.
For any n, k ∈ N0 with n ≥ k, the binomial coefficient nk (we say “n choose k”) is defined
by the formula
n n! n · (n − 1) · · · (n − k + 1)
:= = .
k k!(n − k)! k · (k − 1) · · · 1
Observe that
n n n n
= 1, = n, and = .
0 1 k n−k
The factorial n! is the number of permutations (orderings) of a set of n objects. If you pick
your first arbitrary element, you have n possibilities for your choice. When it comes to your
second decision you have just n − 1 options left, and so on.
The binomial coefficient nk represents the number
of ways to choose k (unordered) outcomes
from a set of n possibilities. For example, n3 is the number of three-element subsets of
{1, . . . , n} and 42 is the number of football matches that are required in a world cup group
The “simple” explanation of this equality is, that subsets of the subsets {1, . . . , n + 1} with
k + 1 elements can be split into those that contain the number n + 1 and those that don’t
contain it. To find the number of sets that contain n + 1, we need to count all k-element
subsets of {1, . . . , n}, and there are nk of them. The number of all sets that don’t contain
n
n + 1, is the same as the number of all (k + 1)-element subsets of {1, . . . , n}, which is k+1 .
So the total number is their sum.
Here is a more formal and formulaic proof of Lemma 1.41.
40
Proof.
n n n! n!
+ = +
k k+1 k!(n − k)! (k + 1)!(n − (k + 1))!
n! n!
= +
k!(n − k)! (k + 1)!(n − k − 1)!
n!(k + 1) n!(n − k)
= +
(k + 1)!(n − k)! (k + 1)!(n − k)!
n!((k + 1) + (n − k))
=
(k + 1)!(n − k)!
n!(1 + n)
=
(k + 1)!(n − k)!
(n + 1)!
=
(k + 1)!(n − k)!
(n + 1)! n+1
= = .
(k + 1)!((n + 1) − (k + 1))! k+1
Based on this, we can prove another famous and highly applicable theorem; the Binomial
Theorem.
Proof. Consider expanding the product (x + y)n (i.e. multiplying out all of the brackets).
Every term we obtain must take the form xk y n−k for some 0 ≤ k ≤ n. This is simply because
the powers must add up to n (each product consists of exactly n parts).
It remains to consider how many times the term xk y n−k appears. This will indeed be nk ,
since a term xk y n−k appears if we choose exactly k x from the total of n choices.
Alternatively, we can prove the Binomial Theorem by induction, making use of Lemma 1.41.
The proof by induction is a bit longer, but some may prefer this more formal approach.
Example - If we set x = y = 1 in Theorem 1.42 we get
n
n
X n
2 = .
k
k=0
We can also justify this identity by thinking in terms of sets. Both sides of this identity count
the number of subsets of {1, 2, . . . , n} (or, moreover, any set of n elements). To see that the
number of such sets is 2n , simply observe that for each of the n elements, we have 2 choices
in our construction of a set A ⊂ {1, . . . , n}; we either include the element in A, or we do not.
41
On the other hand, we can also count the number ofsubsets of {1, . . . , n} by counting the
number of subsets with a fixed size k, which gives nk , and then summing over all possible
values of k.
Example - Setting x = −1 and y = 1 in Theorem 1.42 gives the identity
n
X n
0= (−1)k .
k
k=0
The binomial theorem can also be used to (easily) improve upon Bernoulli’s inequality (see
Theorem 1.30.
For the inequality above, we simply take only the k = 0 and k = m terms from the sum. The
other terms are all non-negative from the assumption that x ≥ 0.
If we apply this theorem with m = 1, we recover the lower bound from Bernoulli’s inequality.
Note however that this Corollary 1.43 is only valid for x ≥ 0, and it is only guaranteed to
give a proper improvement when x ≥ 1. On the other hand, Bernoulli’s inequality is valid
for all x ≥ −1.
42
1.8 Some important functions
In this subsection we want to briefly discuss some particularly important functions. We start
with the absolute value of a number, since this function and its generalisations are heavily
used throughout the Mathematics for AI courses.
The absolute value of x ∈ R is defined by:
(
x if x ≥ 0
|x| = . (24)
−x if x < 0
The absolute value is a function from R to R+ (recall that R+ denotes the set of all non-
negative real numbers, i.e. R+ = [0, ∞)). We write | · | : R → R+ . The graph of the function
is shown below.
Figure 10: The graph of the absolute value function f (x) = |x|
We list some important properties of the absolute value function in the following lemma.
Lemma 1.44. The absolute value function satisfies the following properties.
43
Proof. The first 3 points follow immediately from the definition of the absolute value.
To prove point 4, we split into 4 cases.
Case 1 - Suppose that x, y ≥ 0. Then xy ≥ 0, and so |xy| = xy. Also, |x| = x and |y| = y.
It follows that |x||y| = xy = |xy|.
Case 2 - Suppose that x, y < 0. Then xy > 0, and so |xy| = xy. Also, |x| = −x and
|y| = −y. It follows that |x||y| = (−x)(−y) = xy = |xy|.
Case 3 - Suppose that x ≥ 0 and y < 0. Then xy ≤ 0, and so |xy| = −xy. Also, |x| = x and
|y| = −y. It follows that |x||y| = x(−y) = −xy = |xy|.
Case 4 - Suppose that x < 0 and y ≥ 0. Then xy ≤ 0, and so |xy| = −xy. Also, |x| = −x
and |y| = y. It follows that |x||y| = (−x)y = −xy = |xy|.
The proof of points 5 and 6 is left as an exercise.
The case distinction that we have made in the proof of point 4 in Lemma 1.44 gives a good
illustration of how to deal with absolute values in general. We will use a similar approach
again now, as we try to understand inequalities involving absolute values.
Example - Consider the inequality |x − 1| < 2. We would like to determine all values of x
which satisfy this inequality. Define the set
L := {x ∈ R : |x − 1| < 2}.
There are two situations that need to be considered separately, according to whether or not
x − 1 is positive. We split L into two parts. Write L = L1 ∪ L2 where
The sets L1 and L2 are disjoint (i.e. the intersection L1 ∩ L2 is equal to the empty set). It
remains to find a simplified expression for the set L1 and L2 , and thus to determine precisely
what L looks like.
Case 1 - Consider x < 1. In this range |x − 1| = −(x − 1) = 1 − x, and so the inequality
becomes
1 − x < 2,
which rearranges to x > −1. So,
Case 2 - Consider the range x ≥ 1. Then |x−1| = x−1 and the inequality under consideration
simplifies to the form
x − 1 < 2,
which is equivalent to x < 3. Therefore,
44
|x − 1| 3
-2 -1 1 2 3 4
Figure 11: A graphical representation of the inequality and its solution set
We remark here that we also could have reached the conclusion that L = (−1, 3) by applying
point 6 from Lemma 1.44. However, for more complicated inequalities involving absolute
values Lemma 1.44 is not sufficient, and we really do need to use such a case distinction. The
following is such an example.
Example - Consider the inequality
2(−x − 3) − 4(−x + 1) ≥ 8x − 2.
2(x + 3) − 4(−x + 1) ≥ 8x − 2.
45
Case 3 - Suppose that x > 1. Then the inequality (25) becomes
2(x + 3) − 4(x − 1) ≥ 8x − 2.
16
2|x + 3| − 4|x − 1| 8
8x − 2
-4 -3 -2 -1 1 2
-8
-16
Figure 12: A graphical representation of the inequality (25) and its solution set
The following is probably the most well known inequality involving absolute values. We will
see this inequality reappearing in more general scenarios throughout the Mathematics for AI
material. This is called the triangle inequality for the absolute value function.
|x + y| ≤ |x| + |y|.
Proof. We already know from Lemma 1.44 that |z| ≥ z and |z| ≥ −z for all z ∈ R. We
therefore have
|x| + |y| ≥ x + y
and
|x| + |y| ≥ −x − y = −(x + y).
Since |x + y| is equal to either x + y or −(x + y), we always have |x + y| ≤ |x| + |y|, as
required.
46
The triangle inequality (Theorem 1.45) also implies that
|x − z| = |x − y + y − z| ≤ |x − y| + |y − z|, (26)
for all x, y, z ∈ R. Hence, we can consider the absolute value as a “distance” between numbers.
The triangle inequality in the form (26) can be interpreted as saying that “the shortest route
from x to z is to go directly, rather than stopping at another number y”.
Moreover, we obtain the following (less obvious) inequality, which will also be useful later.
Corollary 1.46. Let x, y ∈ R. Then
| − y| − | − x| ≤ |x − y|,
which is equivalent to
−(|x| − |y|) ≤ |x − y|.
Since |x| − |y| is equal to either |x| − |y| or −(|x| − |y|), it follows that we always have the
intended inequality
|x| − |y| ≤ |x − y|.
We now turn to some other elementary functions. An affine linear function is a function
f : R → R of the form
f (x) = ax + b
with a, b ∈ R. If a = 0, then f is called a constant function. If a ̸= 0 and b = 0 we call f
a linear function. Note that linear functions satisfy f (x + y) = f (x) + f (y). (Is the same
true for affine linear functions?) Moreover, they are special cases of polynomial functions,
which are defined to be functions p : R → R of the form
n
X
p(x) = ai xi ,
i=0
with a0 , . . . an ∈ R. For a non-zero polynomial, i.e. a polynomial where at least one of the
coeffecients ai is not equal to zero, we may assume that an ̸= 0. The value n is the degree
of the polynomial p. The degree of the zero polynomial (i.e. the function f : R → R such
that f (x) = 0 for all x ∈ R) is not defined.
Examples - Consider the functions
√
f (x) = 2x − 3, g(x) = 4x3 + 5x, h(x) = π, j(x) = 2x.
47
All 4 functions are polynomials. f , h and j are also affine linear functions. j is also a linear
function. h is also a constant function.
Closely related to polynomials are power functions. These are functions f : R+ → R of
the form f (x) = xa , for some (fixed) a ∈ R. In the case when a is a natural number, such a
power function is a special kind of polynomial called a monomial.
The restriction of the domain to the positive reals is necessary in order for these functions
to be well defined in general. For instance, if we consider the power function f (x) = x1/2 ,
this is the function which takes the positive square root of x. In order for this to be a real
number, we may only consider x ≥ 0.
The next image illustrates the behaviour of the power function for some different choices of a.
Notice that there are some crucial points where the behaviour changes. The function grows
faster as a increases, and we see a different shape of curve for a > 1, a = 1 and 0 < a < 1.
Moreover, the function looks very different when a is negative; instead of growing with x, the
function is now decreasing.
The most natural power functions to consider are those for which a = m/n is a rational
number, in which case we can build a power function f (x) = xa as a composition of familiar
√
functions g(x) = xm and h(x) = n x. One can also consider power functions of the form
f (x) = xa where a is an irrational number. We avoid giving a full formal definition, since it
is rather technical. Interested students may wish to consult Mario Ullrich’s notes from the
previous Mathematics for AI 1 course (see page 37 of the notes). Roughly speaking, we can
approximate f (x) = xa by finding a rational number b which is very close to a and calculating
xb .
Another well-known class of functions are the exponential functions, which characterise
rapid growth or decay. Exponential functions are defined to be functions f : R → R+ of the
form
f (x) = ax ,
48
where the base a > 0 is a fixed real number. That is, in contrast to power functions, the
variable x ∈ R is in the exponent.
Among all exponential functions, one is particularly important. This is when we choose the
basis a = e ≈ 2.7182818284, where e is Euler’s number. This special (irrational) number e
is one of the most important constants in mathematics. It connects many different concepts
and appears in some beautiful formulae. It may be defined by different means. Indeed,
∞
1 n 1 n X 1
e = sup 1+ : n ∈ N = lim 1 + = .
n n→∞ n k!
k=0
Don’t worry too much about these complicated looking expressions for now. We will come
back to them much later.
The exponential functions often appear together with the logarithmic functions, which
are defined as their inverses. Let b > 0 be a real number (we usually restrict our attention
to the case b > 1. The function logb : R+ → R is defined to be the inverse of the exponential
function g(x) = bx . Recalling the properties of inverse functions, this means that
logb (bx ) = x, ∀x ∈ R
and
blogb (x) = x, ∀ x ∈ R+ . (28)
We can use (28) to give a wordy definition of the function logb . “The value of logb (x) is
defined to be the number y which satisfies by = x.”
49
Figure 15: The graph of the function f (x) = logb (x) for different b.
Once again, an important special case occurs when b = e (i.e. Euler’s number). We use the
shorthand ln(x) = loge (x).
Using familiar calculation rules for exponentiation we obtain the following calculation rules
for logarithms. For all a, b, x, y > 0:
logb xy = y logb x
logb (xy) = logb x + logb y
x
logb = logb x − logb y
y
loga x
logb x = .
loga b
Trigonometric functions play a crucial role in analysis and geometry. The most impor-
tant are sin x, cos x and tan x. We can give a geometric definition of these function using the
unit circle, as the following image illustrates.
50
Figure 16: Illustration of trigonometric functions.
The variable x ∈ R corresponds to the (anticlockwise) angle that is enclosed between the
horizontal axis and the line to the point (cos x, sin x). We will usually use the unit of radians
to refer to an angle.
Example - Consider the point p = √12 , √12 on the unit circle. We can calculate (using high
school trigonometry) that the angle between the horizontal axis and the line to the point p
is π4 . Therefore, we have
1 1 π π
√ ,√ = cos , sin .
2 2 4 4
For the point √12 , − √12 , the (anticlockwise) angle is 2π − π4 = 7π 4 , by symmetry, and so
1 1 7π 7π
√ , −√ = cos , sin .
2 2 4 4
51
Figure 17: Illustration of the example above.
The functions sin x and cos x are defined for all x ∈ R, see the illustration below.
The sine and cosine functions are periodic. In particular, we always have
sin(x + 2π) = sin x, and cos(x + 2π) = cos x.
52
This means that we have infinitely many choices for representing a point on the unit circle
using sine and cosine functions. For instance, recalling an earlier example, we have
π
1 1 π 9π 9π
√ ,√ = cos , sin = cos , sin
2 2 4 4 4 4
17π 17π
= cos , sin
4 4
= ...
However, we will typically use the simplest possible representation of a point on the unit
circle, with the angle in the interval [0, 2π). In particular, every point p on the unit
circle can be expressed uniquely in the form p = (cos x, sin x) for some x ∈ [0, 2π).
By the periodicity of sine and cosine, as well as observing that the graph of one function can
be obtained by shifting the other, we obtain the following simple calculation rules.
sin(−x) = − sin x
π
sin x + = cos x
2
sin(x + π) = − sin x
sin(x + 2π) = sin x
cos(−x) = cos x
π
cos x + = − sin x
2
cos(x + π) = − cos x
cos(x + 2π) = cos x.
Some important values of the sine and cosine functions are listed in the following table.
π π π π 2π 3π 5π
ϕ 0 6 4 π
√3 2 √3 4 6
1 √1 3 3 √1 1
sin ϕ 0 1 0
√2 2 2 2 2 2√
3 √1 1 3
cos ϕ 1 2 2 2 0 − 12 − √12 − 2 −1
We can use the results from the table above along with the fact that sin(x + π) = − sin(x)
and cos(x + π) = − cos(x) to deduce some of the corresponding values in the interval (π, 2π).
In fact, we can deduce most of these trigonometric values by using just a few assumptions,
as the following exercise shows.
Exercise - You are given the information that
π 1 π 1 π √3
sin = , sin = √ , and sin =
6 2 4 2 3 2
and the following facts about the sine and cosine functions:
sin(−x) = − sin x
π
sin x + = cos x
2
π
cos x + = − sin x.
2
53
Using only these facts, prove that
π √
3
cos = ,
6 2
π 1
cos = √ ,
4 2
π 1
cos = ,
3 2
√
2π 3
sin = ,
3 2
3π 1
sin = √ ,
4 2
5π 1
sin = .
6 2
We also recall some important trigonometric identities. Most importantly of all, for all
x ∈ R,
sin2 x + cos2 x = 1.
We will also need the trigonometric addition formulae.
defined at these values. So, tan is a function with domain R \ { π2 + kπ : k ∈ Z} and codomain
R.
54
Figure 19: Graphical representation of tan.
Additionally, we can define the inverse trigonometric functions. Let’s first consider the inverse
of the cosine function. We know from our earlier work (see Theorem 1.24) that a function
is invertible if and only if it is a bijection. We need to take a little care, since the cosine
function is not a bijection (it is not injective, for instance cos 0 = cos(2π) = cos(4π) = . . . ).
However, if we restrict the domain of the cosine function suitably, we can make it into a
bijection. In particular, one may see from the graph of the cosine function that the function
restricted to the domain [0, π] and codomain [−1, 1] is a bijection. We can therefore define
the arccosine function
arccos : [−1, 1] → [0, π].
The function is defined by
y = arccos x ⇐⇒ x = cos y and y ∈ [0, π].
The arccosine function is just a special name for the inverse of the cosine function. We also
use the intuitive notation cos−1 for the same function.
By making a similar restriction of the relevant domains, we define the arcsine function
h π πi
arcsin : [−1, 1] → − , ,
2 2
and the arctangent function π π
arctan : R → − , ,
2 2
via
h π πi
y = arcsin x ⇐⇒ x = sin y and y ∈ − , ,
2π 2π
y = arctan x ⇐⇒ x = tan y and y ∈ − , .
2 2
55
1.9 Complex numbers
In section 1.4, we discussed how we extend our number system gradually from the set of
natural numbers to the set of real numbers in order to have the tools to solve certain equations.
However, at a certain point in mathematics, we cannot answer all of the questions that arise
using only real numbers. This point is reached when we want to find a solution of the equation
x2 = −1, (29)
C := {x + iy : x, y ∈ R}.
z̄ := x − iy.
Note that the imaginary part of a complex number is in fact a real number!
The representation of the term z = x + iy is called the canonical representation. There
are alternative ways to express a generic complex numbers, as we shall see later.
Let’s first discuss how to do some simple calculations with complex numbers. We will always
follow the same rules that we are familiar with for making calculations with real numbers.
Let z = x + iy and w = u + iv be complex numbers (where x, y, u and v are real numbers).
Addition is straightforward, as we collect the real and imaginary parts together:
For multiplication, we follow familiar rules for expanding multiplication with brackets, sim-
plifying whenever possible by making use of the fact that i2 = −1. For example,
zw = (x + iy)(u + iv) = xu + i(yu + xv) + i2 yv = xu + i(yu + xv) − yv = (xu − yv) + i(yu + xv).
(30)
A useful consequence of (30) is that z z̄ is always a non-negative real number. Indeed, let
z = x + iy be a complex number. Then
z z̄ = (x + iy)(x − iy) = x2 + y 2 .
56
We stated above that C is a field. Indeed the operations of addition and multiplication are
associative, commutative, distributive. The additive and multiplicative identity elements are
the same as they are for the real numbers. The additive inverse of z = x + iy is naturally
−z = −x − iy. It remains to verify the existence of a multiplicative inverse for z ̸= 0. We
simply define
1
z −1 = .
x + iy
We should check that this really is a complex number (i.e. we should check that it can
be expressed in the canonical form). For this, we can make use of complex conjugate to
transform the denominator into a real number. Indeed,
1 1 x − iy x − iy x y
z −1 = = = 2 2
= 2 2
−i 2 .
x + iy x + iy x − iy x +y x +y x + y2
Hence, C with these operations is also a field.
Example - Consider z = 4 + 2i. Then z is a complex number with Re(z) = 4, Im(z) = 2,
z̄ = 4 − 2i and
4 2 1 1
z −1 = 2 2
−i 2 2
= −i .
4 +2 4 +2 5 10
z1 + z2 = z1 + z2 and z1 z2 = z 1 z 2 .
We can even calculate the square root of a complex number, and this will always be another
complex number!
√
Example - We would like to calculate i. Given the sentence before this example, let us
assume that this is another complex number z = x + iy (with x, y ∈ R). We have z 2 = i,
which means that
i = (x + iy)2 = x2 − y 2 + i(2xy). (31)
If we compare the real and imaginary parts of both sides of (31), we arrive at a system of
two equations:
x2 − y 2 = 0,
2xy = 1.
57
We seek to solve this for x and y. The first equation tells us that either x = y or x = −y.
However, if x = −y then we have a contradiction with the second equation (since the left
hand side is not positive), and so it must be the case that x = y.
The second equation then becomes 2x2 = 1, and so either x = √1 or x = − √12 . It follows
2
that
1 2
1
√ + i√ =i
2 2
and
1 2
1
−√ − i√ =i
2 2
That is,
√
1 1
i = ± √ + i√ . (32)
2 2
Let’s emphasise again, that even though we have introduced a new number i to our universe,
we are still using all of the same rules for calculations that we know from our mathematical
education, including all of the basic rules that we have been repeating since early school days.
The complex numbers are just an extension of what we already know, we should not think of
them as being something completely new. For instance, if we set all of the imaginary parts to
be equal to zero, all we have done so far in this section is to make some trivial observations
about real numbers.
One can identify an element of the set C with a tuple (x, y) of real numbers. Each complex
number z = x + iy can therefore be illustrated as a point in the plane, which is called the
complex plane. The coordinate axes are the called real and imaginary axis.
Exercise - Think about where to draw the complex conjugate z̄. What about −z?
We define the absolute value of a complex number z = x + iy to be the length of the straight
line from (0, 0) to (x, y). Here is a formulaic version of this definition.
58
Definition 1.48. Let z = x + iy ∈ C. We define the absolute value of z to be
√ p
|z| := z z̄ = x2 + y 2 .
In particular, note that |z| is always a real number, since z z̄ is always a positive real number.
We have used the same notation here as we used for the absolute value of a real number
in Section 1.8. This is very deliberate! The definition of the absolute value for a complex
number is an extension of the previous definition over R. Indeed, if z is a real number (i.e.
the imaginary part is equal to 0), then according to Definition 1.48,
p √
|z| = x2 + y 2 = z 2 .
This is equal z if z ≥ 0 and is equal to −z otherwise, which matches the definition given in
(24).
Several calculation rules for the absolute value are just the same as for the absolute value
over R. The next lemma is an analogue (and extension) of Lemma 1.44.
Lemma 1.49. The absolute value function satisfies the following properties.
2. |z| = 0 ⇐⇒ z = 0.
Proof. 1. This follows immediately from the definition of the absolute value, since we take
the positive square root.
59
as required. For the “⇒” direction, suppose that Re(z) = |z|. That is,
p
x2 + y 2 = x.
p
Then pit must be the case that y 2 = 0 (otherwise we would have x2 + y 2 > x. Also,
since x2 + y 2 ≥ 0, it must be the case that x ≥ 0, as required.
The triangle inequality is also valid for the absolute value function, and it is also a very
important and useful result.
We also have
We also have
|z| − |w| ≤ |z − w|. (36)
Proof. We will prove (34) and (35) simultaneously. First observe that
Here we have used the fact that z + w = z̄ + w̄, which was an exercise on page 50. It then
follows that
|z + w|2 = z z̄ + (z w̄ + z̄w) + ww̄ = |z|2 + (z w̄ + z̄w) + |w|2 . (37)
We can calculate directly that
z w̄ + z̄w = 2Re(z w̄). (38)
60
Indeed, for any a ∈ C, we have
a + a = 2Re(a). (39)
Now apply (39) with a = zw. Since zw = zw, this proves (38).
Combining (38) with (37) and then applying points 3, 5 and 6 from Lemma 1.49, we have
This proves (34). Note also that there is only one inequality in this sequence, which was
an application of Lemma 1.49, point 3. If we apply the additional information from Lemma
1.49, point 3, we see that this inequality is an equality if and only if z w̄ ∈ R and z w̄ ≥ 0.
This proves (35).
We now turn to the proof of (36). After relabelling the variables and rearranging the expres-
sion, it follows from (34) that
|a + b| − |b| ≤ |a| (40)
for all a, b ∈ C. Applying (40) with a = z − w and b = w, we get
Once again, the triangle inequality (inequality 34) also implies that
|a − b| = |a − c + c − b| ≤ |a − c| + |c − b|, (43)
for all a, b, c ∈ C.
The triangle inequality in the form |a − b| ≤ |a − c| + |c − b| can be interpreted geometrically
as saying that the fastest route from a to b is to go directly, rather than travelling via the
point c. This interpretation also gives some justification for why we call this the triangle
inequality.
61
Figure 21: The Triangle Inequality. A visualization of equation 43.
The following table summarises the representation and visualisation of complex numbers in
the complex plane.
C The plane R2
z = x + iy point with coordinates (x, y)
Re(z) x-coordinate of the corresponding point
Im(z) y-coordinate of the corresponding point
The set of real numbers points on the x-axis
The set of imaginary numbers points on the y-axis
−z z after rotation by 180 degrees centred at the origin
z̄ z reflected in the x-axis
|z| the distance between z and the origin
|z1 − z2 | the distance between z1 and z2
z1 + z2 vector addition of the corresponding points
|z1 + z2 | ≤ |z1 | + |z2 | the triangle inequality
Instead of using the Cartesian coordinates (x, y) for a complex number z = x+iy, one can also
switch to polar coordinates (r, ϕ). We can identify a complex number by its “direction” (or
angle) from the origin, and its distance from the origin. Once we know these two properties
of a complex number, we know exactly what the number is.
For example, suppose that we are told that a complex number z has absolute value |z| = 2
and makes an angle of π4 (i.e. 45 degrees) anticlockwise from the real axis (i.e. the x-axis).
We can use some simple trigonometry to calculate the canonical form z = x + iy. Indeed, we
have
y x
sin(π/4) = , cos(π/4) = ,
2 2
62
√
2
and since sin(π/4) = cos(π/4) = 2 , it follows that
√ √ !
√ √ 2 2
z= 2+ 2i = 2 +i .
2 2
The choice of the factorisation may seem a little strange here, but there is a reason for this,
as should become clear soon.
Recall from Section 1.8 that every point on the unit circle {(x, y) : x2 + y 2 = 1} can be
written uniquely as (x, y) = (cos ϕ, sin ϕ) for some ϕ ∈ [0, 2π). Putting this fact into the
context of the complex plane C, it follows that every complex number z such that |z| = 1
can be expressed uniquely as
z = cos ϕ + i sin ϕ, (44)
for some ϕ ∈ [0, 2π). This ϕ tells us the direction of a “unit” complex number. For a generic
complex number z ̸= 0 (i.e. not necessarily with |z| = 1), we introduce a dilate r which tells
us the distance between z and the origin (i.e. the value of |z|).
To summarise, we can uniquely express an arbitrary nonzero complex number z in the form
for some r ∈ R+ and ϕ ∈ [0, 2π). The values r and ϕ have a geometric meaning; r is
the distance of z from the origin (the absolute value or magnitude of z) and ϕ is the
(anticlockwise) angle determined by the real axis and z. We call ϕ the argument of z.
An easy, useful and compact way of writing complex numbers with the form of (44) can be
obtained by using Euler’s formula, which states that, for all ϕ ∈ R,
63
Note that eiϕ can at the moment only be understood as a symbol for the right hand side above,
and it is already useful as such. However, we will see later that this is really an equation,
i.e., we will also define ez for any z ∈ C in a way that is consistent with the case when z is
real, and show that Euler’s formula is true for this definition of a complex exponential.
Examples - By using Euler’s formula and some familiar trignometric values, we can imme-
diately see that
π
π π
ei 2 = cos + i sin = i,
2 2
eiπ = cos(π) + i sin(π) = −1,
i 3π 3π 3π
e 2 = cos + i sin = −i,
2 2
ei2π = cos(2π) + i sin(2π) = 1.
Additionally, we obtain from these values, together with the periodicity of the trigonometric
functions (or standard calculation rules for exponentials), that for k ∈ Z,
(
1 if k is even
eikπ = .
−1 if k is odd
Using Euler’s formula we can write every complex number (in its polar form) as
z = reiϕ . (46)
The polar representation in this form is particularly useful when it comes to multiplication
and powers of complex numbers. Given two complex numbers z1 = r1 eiϕ1 and z2 = r2 eiϕ2 ,
we obtain
z1 · z2 = (r1 eiϕ1 )(r2 eiϕ2 ) = r1 r2 ei(ϕ1 +ϕ2 ) .
This formula is often easier to deal with and more intuitive than the formula for multiplication
of complex numbers in canonical form given in (30).
From this formula (with z = w) and induction we obtain de Moivre’s formula for powers
z n of z = r(cos ϕ + i sin ϕ) = reiϕ with n ∈ N:
You should verify this in detail! By de Moivre’s formula and the periodicity of the sine and
cosine functions, we get
√ π π π π
z 42 = ( 2)42 cos 42 + i sin 42 = 221 cos + i sin
4 4 2 2
= 221 i.
64
Exercise - Suppose that z, w ∈ C \ {0}. Show that |z + w| = |z| + |w| if and only if z and w
have the same argument. (Hint: use Theorem 1.50.)
It is an important result in mathematics, that complex numbers are really all we need to solve
polynomial equations. In fact, the Fundamental theorem of algebra (in one of its variants)
even gives the precise answer for the number of solutions of polynomial equations, the number
of solutions to a polynomial degree d polynomial equation with complex coefficients is exactly
d. We do not discuss this in detail here, but illustrate with an example.
Example - We seek to find all solutions x ∈ C to the equation.
x2 + (1 − i)x − i = 0.
This is a quadratic equation, or in other words, a polynomial of degree 2. So, there should be
in general be two solutions x ∈ C. We can find them by using the quadratic formula, which
implies that
p √ √ √
i − 1 ± (1 − i)2 + 4i i − 1 ± 2i i−1± 2 i
x= = = .
2 2 2
Recalling from (32) that
√
1 1
i= √ + i√ ,
2 2
we conclude that the solutions to our equation are
√
i − 1 + 2 √12 + i √12
x= =i
2
and √ 1
i−1+ 2 − √2 − i √12
x= = −1.
2
65
1.10 Vectors and norms
For the final section of this introductory chapter, we will discuss some important concepts
related to vectors. Let d ∈ N. A vector is a tuple
v1
v2
v = (vi )di=1 = . ∈ Cd .
..
vd
The elements vi are complex numbers. The dimension of v is d. If all of the vi are equal to
zero then we call v the zero vector, which is denoted as 0.
We define the addition and scalar multiplication of vectors component-wise. That is,
for two vectors
u1 v1
u2 v2
u = . and v = . ,
.
. ..
ud vd
and a number λ ∈ C, we define
u1 v1 u1 + v1 λu1
u2 v2 u2 + v2 λu2
u+v= . + . = . and λu = . .
.. .. .. ..
ud vd ud + vd λud
Note that it is important for vector addition that the vectors have the same dimension. If the
dimensions of the two vectors do not agree, then their sum is not defined. The term “scalar”
is used to distinguish this multiplication from the other notions of multiplication with vectors
to be discussed later.
Here we use the field of complex numbers C to build complex vectors v ∈ Cd . Note that
we can easily also consider only real vectors v ∈ Rd , as real vectors are a special case of
complex vectors. However, since all definitions here work directly in the complex case, and
this will be needed later, we define it in the more general context and comment on necessary
changes when needed. Moreover, note that we could define vectors, and the corresponding
operations, in a much more general context, as long as the operations for the components are
well-defined. This will be discussed much later when we come to vector spaces.
For real and complex numbers, we defined the absolute value to give a notion of how “large”
a number is. We will now do something similar for vectors.
Definition 1.51. Let v = (vi )di=1 ∈ Cd . Then, we define the Euclidean norm or length of
v by the formula v
u d
uX
∥v∥ := t
2 |v |2 .
i
i=1
66
Moreover, for two vectors u = (ui )di=1 and v = (vi )di=1 ∈ Cd we define the inner product or
dot product of u and v by the formula
d
X
⟨u, v⟩ := ui v¯i .
i=1
and
d
X
⟨u, v⟩ := ui vi .
i=1
First of all, let us note that the notion of the Euclidean norm really does generalise the
concept of the absolute value of a complex number discussed earlier. To see this, consider the
case of a one-dimensional vector v ∈ C1 . This “vector” is in reality just a complex number
v1 = x + iy, with some x, y ∈ R. Then applying Definition 1.51, we see that
v
u d
uX p
∥v∥2 := t |vi |2 = |v1 |2 = |v1 |.
i=1
We use the subscript 2 here, because we will study also other norms later in the Mathematics
for AI syllabus. Note that some authors use the notation ∥ · ∥ for the Euclidean norm (i.e.
they remove the subscript), which emphasises its role as generalization of the absolute value
of a real or complex number.
The inner product is formally a mapping ⟨·, ·⟩ : Cd × Cd → C.
Exercise - Prove that the inner product satisfies the following properties.
and
⟨λu, v⟩ = λ⟨u, v⟩. (48)
⟨u, v⟩ = ⟨v, u⟩
67
We can use the previous three properties of the inner product to prove the following.
2. By conjugate symmetry
We then apply part 1 of this lemma, and some basic properties of the complex conjugate
(see the exercise on page 51) to obtain
Using conjugate symmetry again and the fact that (z̄) = z, (50) implies that
This is very similar to the identity 47 from the previous exercise. An analogue of (48) is the
identity
⟨u, λv⟩ = λ⟨u, v⟩.
This is obtained from part 2 of Lemma 1.52 by fixing µ = 0.
Exercise - Prove that, for any λ ∈ C and v ∈ Cd , we have
∥λv∥2 = |λ|∥v∥2 .
We have already seen that the Euclidean norm generalises many features of the absolute
value function to the higher dimensional setting. The next result, the triangle inequality,
is another important step in this direction.
68
Note that in the case that d = 2 or d = 3 the Euclidean norm coincides with the usual
intuition of the distance between the origin and the point v, i.e. ∥v∥ is the length of the
‘direct path’ between the origin and v. This makes the triangle inequality appear to be an
obvious statement. However, in higher dimensions (particularly when d > 3) this is not so
clear. We present a proof below.
First we need another inequality, the Cauchy-Schwarz inequality, which is one of the
most important inequalities in mathematics. It continues to be applied extremely frequently
in active research across almost all areas of mathematics, and many great and celebrated
results essentially come down to skillful manipulations of the bound.
Proof. Write
u1 v1
u2 v2
u = . and v = . .
.
. ..
ud vd
Case 1: Suppose that either u = 0 or v = 0. Then both sides of the inequality are equal to
zero, and thus the result is valid.
Case 2: Suppose that u, v ̸= 0. We will make use of the fact that the inequality
a2 + b2
≥ ab (51)
2
holds for all a, b ∈ R. The inequality (51) can be proved by rearranging the inequality
(a − b)2 ≥ 0.
We define real numbers
|ui | |vi |
ai = , and bi = .
∥u∥2 ∥v∥2
Observe that
d d d
X X |ui |2 1 X
a2i = 2 = ∥u∥2 |ui |2 = 1.
i=1 i=1
∥u∥ 2 2 i=1
Pd 2
Similarly, i=1 bi = 1. It therefore follows that
d
X a2 + b2
i i
= 1. (52)
2
i=1
We then use the triangle inequality for C (inequality 34 of Theorem 1.50), along with (51)
69
and (52), to conclude that
d
X d
X d
X
|⟨u, v⟩| = ui v¯i ≤ |ui v̄i | = |ui ||vi |
i=1 i=1 i=1
Xd
= ai ∥u∥2 bi ∥v∥2
i=1
d
X
= ∥u∥2 ∥v∥2 ai bi
i=1
d
X a2 + b2
i i
≤ ∥u∥2 ∥v∥2
2
i=1
= ∥u∥2 ∥v∥2 .
By analysing the argument a little more closely, and paying attention to when the inequalities
in the proof are really identites, we obtain the following stronger statement, which gives us
additional information about when the Cauchy-Schwarz inequality is tight.
Theorem 1.55 (Cauchy-Schwarz inequality). For all u, v ∈ Cd ,
Moreover, we have |⟨u, v⟩| = ∥u∥2 ∥v∥2 if and only if u = λv for some λ ∈ C.
Proof of Theorem 1.53. By the properties of the inner product established in Lemma 1.52
and the exercise before it, it follows that for all u, v ∈ Cd ,
∥u + v∥22 = ⟨u + v, u + v⟩
= ⟨u, u⟩ + ⟨u, v⟩ + ⟨v, u⟩ + ⟨v, v⟩
= ⟨u, u⟩ + ⟨u, v⟩ + ⟨u, v⟩ + ⟨v, v⟩
= ∥u∥22 + ⟨u, v⟩ + ⟨u, v⟩ + ∥v∥22
= ∥u∥22 + 2Re (⟨u, v⟩) + ∥v∥22 .
We now use Lemma 1.49 (part 3) and the Cauchy-Schwarz Inequality to conclude that
70
We finally introduce some particularly important vectors which will appear very often in the
upcoming considerations. These are the standard basis vectors e1 , . . . , ed ∈ Cd , where ek
is the vector which is zero everywhere except for the kth entry, which is 1. That is,
1 0 0
0 1 0
e1 = . , e2 = . , . . . , ed = . .
.. .. ..
0 0 1
Moreover, the representation of the vector v in this way is unique. Such a method for
representing elements of Cd uniquely turns out to be very useful and important. A set with
such a property is called a basis for Cd , and the set {e1 , . . . , ed } is called the standard
basis. These concepts will be discussed in greater detail later, when we come to study vector
spaces.
71
2 Matrices and systems of linear equations
In this chapter we want to solve systems of equations. That is, we want to find possible
values for some variables that fulfill a certain collection of equations. This is one of the most
important disciplines in applied mathematics and numerical applications. In particular, we
will focus on systems of linear equations.
Systems of linear equations are the most frequently occurring type of multivariate problems
to solve, and they are also the easiest. Many (even “non-linear”) numerical problems can be
rewritten as, or approximated by, a system of linear equations. Such systems can be rather
large and are usually solved by a computer, but it is up to the user to transfer the problem
under consideration to a well-defined linear system. It is therefore indispensable to have a
solid understanding of these basic problems.
For one free variable, linear equations are easy to solve. If we want to solve the equation
ax = b for some (fixed) a, b ∈ R, then the unique solution is given by x = ab if a ̸= 0. However,
if a = 0 and b ̸= 0 then this equation cannot be solved. That is, there is no x satisfying the
equation 0 · x = b. Meanwhile, for a = b = 0 the equation ax = b is fulfilled for every x ∈ R.
Although this situation was straightforward to solve, we still see that there are some subtleties
and different cases to consider. In particular, a single linear equation can have a unique
solution, can have no solutions, or can have infinitely many solutions, depending on the
properties of the equation. It turns out that systems of linear equations also satisfy the same
trichotomy.
As we increase the number of variables and equations, things become more complicated. Let
us illustrate this with some examples involving two equations and two variables.
Example - Find all solutions (x1 , x2 ) ∈ R2 to the system of linear equations
2x1 + x2 = 1
6x1 + 2x2 = 2.
We can use the first equation to eliminate one of the variables and reduce this to one equation
with one variable. The first equation gives x2 = 1 − 2x1 . Plugging this into the second
equation, we obtain 6x1 + 2(1 − 2x1 ) = 2, which simplifies to 2x1 = 0. So, it must be the
case that x1 = 0. Plugging this back into the first equation, it follows that x2 = 1, and so
the unique solution is (x1 , x2 ) = (0, 1).
If we change the system of equations slightly, the set of solutions can change significantly.
Example - Find all solutions (x1 , x2 ) ∈ R2 to the system of linear equations
2x1 + x2 = 1
6x1 + 3x2 = 2.
Let us try to use the same approach as we used for the previous example. The first equation
again gives x2 = 1−2x1 . Plugging this into the second equation, we obtain 6x1 +3(1−2x1 ) =
2, which simplifies to 3 = 2. It seems that we have reached a nonsensical contradiction!
Indeed, there are no solutions to this system, as the two equations are incompatible. Another
way to see this is by multiplying both sides of the first equation by 3. We arrive at the
72
equivalent system
6x1 + 3x2 = 3
6x1 + 3x2 = 2.
2x1 + x2 = 1
6x1 + 3x2 = 3.
Plugging x2 = 1 − 2x1 into the second equation, we obtain 6x1 + 3(1 − 2x1 ) = 3, which
simplifies to 3 = 3. This is satisfied for all x1 ∈ R.
What is happening in this example is that the two equations are equivalent. This can again
be seen by multiplying both sides of the first equation by 3. Therefore, the solutions (x1 , x2 )
to this system are the same as the solutions to the equation 2x1 + x2 = 1. We can choose
any x1 ∈ R and then choose the corresponding x2 to satisfy the equation, and so there are
infinitely many equations. For instance, (0, 1) and (1, −1) are solutions.
These three examples show that such systems of equations might be quite sensitive to small
changes of the parameters (and this was just a two dimensional example). It is therefore
desirable to have criteria for a given system of equations to be (uniquely) solvable that can
be checked more easily before we start trying to calculate a solution. Moreover, this procedure
of eliminating variables becomes less practical for larger systems, and so we would like to
develop a more systematic approach that can deal with larger systems efficiently.
The key objects that will be used for developing such an approach are matrices.
2.1 Matrices
In this case we use the notation A ∈ Rm×n , and call m and n the dimensions of A. An
m × 1 matrix is called a column vector. A 1 × n matrix is called a row vector. If m = n,
then A is a square matrix.
73
in the ith row and jth column is aij . Another notation that is sometimes convenient is that
(A)ij is used to denote the entry of A in the ith row and jth column. So, for the matrix A
defined above, we have (A)ij = aij .
The case of complex matrices, i.e., aij ∈ C, can be treated analogously, and we write A ∈
Cm×n . We can even consider matrices whose entries come from an arbitrary field F. However,
in order to give a solid grounding for understanding this new material, we will consider only
matrices of real numbers in this course.
We now turn to basic operations of matrices. The first two, namely scalar multiplication
and matrix addition, are simple (and familiar from the corresponding operations for vectors).
These operations are carried out component-wise, meaning that they are performed in each
entry of the matrices individually.
Let A, B ∈ Rm×n . Write
a11 a12 ... a1n
a21 a22 ... a2n
A= .
.. .. ..
.. . . .
am1 am2 . . . amn
and
b11 b12 ... b1n
b21 b22 ... b2n
B= . .. .
.. ..
.. . . .
bm1 bm2 . . . bmn
Then A + B is the m × n matrix
a11 + b11 a12 + b12 ... a1n + b1n
a21 + b21 a22 + b22 ... a2n + b2n
A+B = .
.. .. .. ..
. . . .
am1 + bm1 am2 + bm2 . . . amn + bmn
Note that it is important here that the matrices have the same dimension. If the dimensions
of the two matrices do not agree, then their sum is not defined.
The second operation is scalar multiplication, which is the operation which multiplies
every entry of the matrix by a fixed scalar. Let λ ∈ R and
a11 a12 . . . a1n
a21 a22 . . . a2n
A= . .. .
.. ..
.. . . .
am1 am2 . . . amn
74
The term “scalar” is used to distinguish this from the matrix product. This is the third
essential operation on matrices that we consider. The definition of the product of matrices
is a little more complicated, as we do not define this product component-wise.
Let A ∈ Rm×p be an m × p matrix and let B ∈ Rp×n be a p × n matrix. Write
A = (aij )m,p
i,j=1 , B = (bij )p,n
i,j=1 .
The product of A and B is the matrix C = AB ∈ Rm×n such that C = (cij )m,n
i,j=1 and
p
X
cij = aik bkj .
k=1
Pp
In other words, the entry of AB in the ith row and jth column is k=1 aik bkj .
For this definition to make sense, it is crucial that the dimensions of A and B are correct. In
particular, the number of columns of A must be equal to the number of rows in B
(we may sometimes say that the inner dimensions of A and B match).
A helpful way to think about computing the matrix product may be the following: to calculate
the ij entry of the product AB, move along the ith row of A and down the jth column of B.
Example - Let A ∈ R3×2 and B ∈ R2×2 be the matrices
1 6
7 9
A = 2 5 , B = .
8 0
3 4
There are 2 columns in A and 2 rows in B. These numbers are the same, and so the matrix
AB is defined. In particular AB is a 3 × 2 matrix. Write
c11 c12
AB = c21 c22 .
c31 c32
We will fill in the blanks one entry at a time to write C out explicitly. To calculate c11 , we
consider the first row of A and first column of B.
1 6
7 9
A= 2 5 , B=
.
8 0
3 4
Then
c11 = 1 · 7 + 6 · 8 = 55.
To calculate c12 , we consider the first row of A and second column of B.
1 6
7 9
A= 2 5 , B=
.
8 0
3 4
Therefore,
c12 = 1 · 9 + 6 · 0 = 9.
75
We can repeat this process for each entry. We eventually obtain
55 9
AB = 54 18 .
53 27
On the other hand, the reverse product BA is not defined, because the inner dimensions do
not match.
One may observe from the example above that the process of computing an entry in the
product of two matrices is similar to that of computing an inner product of two vectors. We
formalise this observation below.
Let A ∈ Rm×p and B ∈ Rp×n . The matrix A has m rows, each of which can be viewed as
row vectors. We write
a1
a2
A = . ,
..
am
where ai is the row vector ai = (ai1 , ai2 , . . . , aip ). Similarly, we can write
B = b1 b2 . . . bn .
We can also define the matrix-vector product of a matrix A ∈ Rm×n and a vector x ∈ Rn .
These kind of products will be particularly useful when it comes down to solving systems of
linear equations later! Write
a11 a12 . . . a1n a1
a21 a22 . . . a2n a2
A= . .. = ..
. .. . .
. . . . .
am1 am2 . . . amn am
and
x1
x2
x = . .
..
xn
76
Then Ax ∈ Rm is a vector, defined as
a11 x1 + a12 x2 + · · · + a1n xn ⟨a1 , x⟩
a21 x1 + a22 x2 + · · · + a2n xn ⟨a2 , x⟩
Ax := = .. . (53)
..
. .
am1 x1 + am2 x2 + · · · + amn xn ⟨am , x⟩
In the previous chapter, we considered vectors x ∈ Rn for some n ∈ N. Such a vector may be
considered as a matrix in Rn×1 . If we consider a vector to be a matrix in this way, we can
consider the product of a matrix A ∈ Rm×n with the matrix x ∈ Rn×1 using the definition
of matrix product given above. Then the matrix product Ax is exactly the same as the
definition of Ax given above in (53). In other words, the matrix vector product is just a
special case of the matrix product we have already defined.
Example - Let
1 6
3
A= 2 5
and x = .
4
3 4
Then
1·3+6·4 27
Ax = 2 · 3 + 5 · 4 = 26 .
3·3+4·4 25
As with matrix multiplication, we need to take care to ensure that the matrix-vector product
we are considering has the correct dimensions to be defined. For instance, if we instead set
1 6 3
A = 2 5 and x = 4 ,
3 4 0
λ(A + B) = λA + λB.
A(λB) = λAB.
A(B + C) = AB + AC.
Since matrix-vector multiplication is just a special case of matrix multiplication, the following
two facts follow immediately from the previous exercise.
77
• For all A ∈ Rm×n and x, y ∈ Rn ,
A(x + y) = Ax + Ay
A(λx) = λAx.
However, matrix multiplication is not commutative. Even in the case when both
matrices AB and BA are defined and have the same dimensions, it is usually the case that
AB ̸= BA. You can check this yourself by choosing two “random” matrices A and B, with
the correct dimensions to ensure that both AB and BA are defined (what condition does this
impose on the dimensions?), and computing both AB and BA.
There exist identity elements for the operations of matrix addition and multiplication.
Let 0mn denote the matrix in Rm×n with every entry being equal to zero. Then, for any
A ∈ Rm×n we have
A + 0mn = 0mn + A = A.
The multiplicative identity is of more use, and also practical interest. For n ∈ N, we define
the identity matrix In ∈ Rn×n to be the matrix
1 0 0 ... 0
0 1 0 . . . 0
In := 0 0 1 . . . 0 .
.. .. .. . . .
. . . . ..
0 0 0 ... 1
In words, In is the matrix with 1 at every diagonal entry and 0 everywhere else.
Note that the identity is a square matrix, and we may write I := In if the dimension is clear.
Exercise - Let A ∈ Rm×n . Prove that
AIn = A
and
Im A = A.
Next, we discuss the transpose of a matrix. Since the dimensions of a matrix are important,
it makes a huge difference if the dimensions of a matrix are m × n or n × m, and it is quite
useful to have a compact notation to switch the rows and columns of a matrix. That is, for a
given m × n matrix A = (aij )m,n T
i,j=1 , we define its transpose, denoted A as the n × m matrix
whose rows are the columns of A. To be more precise, if
a11 a12 . . . a1n
a21 a22 . . . a2n
A= .
.. .. ..
.. . . .
am1 am2 . . . amn
78
then
a11 a21 ... am1
a12 a22 ... am2
AT = . .. .
.. ..
.. . . .
a1n a2n . . . amn
Then
T 1 2 3
A = .
6 5 4
The transpose notation is also convenient for distinguishing column vectors and row vectors.
Recall that the standard basis unit vectors ek ∈ Rn = Rn×1 are the (column) vectors that
contain exactly one 1 (in the kth position) and all other entries are zero. The row unit vectors
are defined via transposition as eTk ∈ R1×n . That is,
With the above considerations and the fact that In A = AIn = A, we see that the unit vectors
can be used to “extract” the rows and columns from a matrix. For instance, given a matrix
A ∈ Rm×n of the form
a11 a12 . . . a1n a1
a21 a22 . . . a2n a2
A= . .. = .. ,
.. ..
.. . . . .
am1 am2 . . . amn am
eTk A = ak .
Similarly, the product Aek (with ek ∈ Rn×1 ) can be used to extract the kth column from A.
There is one calculation rule related to the transpose, that is sometimes also very useful for
computing the product of matrices. We state this in the following lemma.
(AB)T = B T AT .
79
Proof. Firstly, we note that B T AT is well-defined, and is and n×m matrix, which means that
the dimensions of (AB)T and B T AT are the same. Indeed, B T ∈ Rn×p and AT ∈ Rp×m , and
so the inner dimensions of B T and AT agree. The outer dimensions confirm that B T AT ∈
Rn×m .
To prove that (AB)T = B T AT , we need to show that each corresponding pair of entries of
the two matrices are the same. Recall that, for an arbitrary matrix M , the notation (M )ij
is used for the entry of M in the ith row and jth column. We need to show that
Comparing the previous two equations, we see that we have proved (54).
Some matrices do not change under transposition. A matrix A ∈ Rn×n such that AT = A is
called symmetric.
Note that symmetric matrices must be square, and we will see later that symmetric matrices
have several important properties.
An obvious but important example of a symmetric matrix is the identity matrix. More
generally, diagonal matrices are always symmetric.
The last concept related to matrix multiplication that we will need is the inverse of a matrix.
Definition 2.4. Let A ∈ Rn×n . The inverse of A, if it exists, is a matrix A−1 ∈ Rn×n such
that
AA−1 = A−1 A = In .
If an inverse exists, then we call a matrix invertible.
80
Note that we only considered square matrices in the above definition. Why?
Example - The matrix In is invertible and In−1 = In .
Example - Let
1 2
A=
3 4
We can verify by direct calculation that A is invertible and
−1 −2 1
A = 3
2 − 12
is not invertible. We will discuss a way to verify if a matrix is invertible, and how to compute
an inverse later in this chapter. However, let us already add here that even if we know that
a matrix is invertible, it is usually difficult to compute its inverse. We will come back to
this issue later, and present some ways for computing the inverse, at least for small matrices.
This inverse will be the ultimate tool to solve certain systems of linear equations. But we
will first discuss some more direct, but less powerful ways to calculate solutions.
Recall the field axioms that we stated in Section 1.4. Many of these axioms are also satisfied
for matrices. Let us restrict to the case of matrices in Rn×n for some n ∈ N. We have a notion
of addition and multiplication which satisfy the properties of associativity and distributivity.
There exist additive and multiplicative inverses. Addition is commutative, and every matrix
has an additive inverse.
On the other hand, multiplication of matrices is not commutative, and so Rn×n is not a field.
Furthermore, not all matrices in Rn×n have a multiplicative inverse.
81
2.2 Systems of linear equations
Throughout this section, there will be the parameters m, n ∈ N, where n is the number
of unknown variables and m is the number of equations that must be fulfilled. The
system of equations we want to solve will be of the following form.
Definition 2.5. Let m, n ∈ N and for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, let aij ∈ R and bi ∈ R.
A system of linear equations is given by
The vector
b1
b2
b= .
..
bm
is called the right hand side (RHS) of the system.
If there exist numbers x1 , . . . , xn ∈ R that fulfill all the equations, then we call the tuple
(x1 , ..., xn ) a solution to the linear system.
If there is no solution, then we call the linear system inconsistent.
Let
a11 a12 ... a1n
a21 a22 ... a2n
A= .
.. .. ..
.. . . .
am1 am2 . . . amn
and
x1
x2
x = . .
..
xm
Recall from the previous section that the matrix-vector product Ax is defined as
a11 x1 + a12 x2 + · · · + a1n xn
a21 x1 + a22 x2 + · · · + a2n xn
Ax = .
..
.
am1 x1 + am2 x2 + · · · + amn xn
82
Therefore, the linear system from the previous definition can also be written in shorter form
as
Ax = b.
L(A, b) = {x ∈ Rn : Ax = b}.
Examples - Let us revisit some examples we considered at the beginning of this chapter.
The system
2x1 + x2 = 1
6x1 + 2x2 = 2
has the unique solution (x1 , x2 ) = (0, 1). This is the same thing as saying that the system
2 1 x1 1
=
6 2 x2 2
Note that L(A, b) is a set and therefore, we need to write L(A, b) = {x} (rather than
L(A, b) = x if x is the only solution.
Example - We also considered the system of equations
2x1 + x2 = 1
6x1 + 3x2 = 3.
Since the two equations are identical, the solutions to this system are simply the solutions to
the equation 2x1 + x2 = 1. This can be rewritten as x2 = 1 − 2x1 .
We can treat x1 ∈ R as a free variable, and conclude that any point of the form (λ, 1 − 2λ)
such that λ ∈ R is solution to our system of equations. In summary, we have shown that
2 1 1 λ
L , = :λ∈R .
6 3 3 1 − 2λ
83
2x1 + x2 + 3x3 = 1
6x1 + 3x2 = 3 (55)
4x1 = 8.
The set of solutions to this system is the same as the set of all (x1 , x2 , x3 ) ∈ R3 such that
2 1 3 x1 1
6 3 0 x2 = 3
4 0 0 x3 8
The form that this system takes makes it quite simple to solve. The last equation immediately
gives x1 = 2. Plugging this into the second equation, we have 12 + 3x2 = 3, and so x2 = −3.
Plugging these values of x1 and x2 into the first equation gives 4 − 3 + 3x3 = 1, and so x3 = 0.
We conclude that
2 1 3 1 2
L 6 3 0 , 3
= −3 .
4 0 0 8 0
It would be nice if all systems of linear equations could be solved as easily as (55). While this
is not quite the case, it is true that every system of linear equations can be reduced to make
it a little easier to consider. This is essentially what we will be doing in the next section,
when we learn about Gaussian elimination and row reduction.
We now discuss one special case of equations, for which the right-hand side of the system is
made up only of zeroes.
Definition 2.7. Let A ∈ Rm×n . A linear system of the form
Ax = 0
For a given matrix A ∈ Rm×n and b ∈ Rm , it turns out that we can learn a lot about
the set of solutions L(A, b) by considering the set of solutions L(A, 0) to the corresponding
homogenous system. This is the content of the next two lemmas.
Lemma 2.8. Let A ∈ Rm×n and b ∈ Rm . Suppose that there exist y, z ∈ Rn with z ̸= 0 such
that Ay = b, and Az = 0. Then there exist infinitely many solutions x ∈ Rn to the system
Ax = b.
Since z ̸= 0, it follows that all of the vectors y + λz are distinct. As there are infinitely many
choices for λ ∈ R, it follows that Ax = b has infinitely many solutions.
84
Lemma 2.9. Let A ∈ Rm×n and b ∈ Rm . Suppose that the homogeneous system Ax = 0 has
only the trivial solution x = 0. Then there exists at most one solution to the system Ax = b.
85
2.3 Gaussian elimination
Now we make an important observation which allows us to derive an algorithm for solving
linear systems by manipulating matrices. We can perform the following operations to a linear
system without changing the set of solutions:
• Interchanging any two equations, i.e., changing the order of the equations.
Take a moment to consider why these three changes do not alter the solution set. The third
of these points is a little more difficult than the others, and will be considered in more detail
in a forthcoming exercise sheet.
Since every system of linear equations can be written with the help of a matrix, it is natural
to consider how the above operations change the corresponding matrix of coefficients of a
linear system. We will see that they indeed allow for successive modifications that lead to
much simpler matrices, i.e., matrices in echelon form and reduced echelon form. From such a
matrix, we will be able to basically see if a corresponding linear system is (uniquely) solvable
or not.
Let us start by discussing how the above operations to a linear system Ax = b affect the
corresponding matrix A. However, note already now that these operations also change the
RHS b of a linear system, and this is essential. We will come back to this shortly, but for
now we only consider the corresponding matrix of coefficients.
In view of the operations from above that can be used to change a linear system Ax = b
without changing the set of solutions, we see that the matrix A is changed in the following
way:
For obvious reasons, these operations are called row operations, or sometimes elementary row
operations. Two matrices A and B are said to be equivalent if A can be obtained from B
by performing row operations. Note that this definition is symmetric, since one can perform
“inverse” row operations to then get from A back to B.
The goal now is to use these operations to simplify the given matrix into echelon form. In
particular, we look to create as many zeroes in the matrix as possible, and for these zeroes
to appear in a certain structured manner. We are ready to give some proper definitions.
Definition 2.10. Let C ∈ Rm×n and let 1 ≤ i ≤ m. The leading coefficient of the ith row
of C is the first non-zero entry in the row.
86
Definition 2.11. Let C ∈ Rm×n be a matrix which has at least one non-zero entry and let
cij denote the entry of C in the ith row and jth column. Let ciji denote the leading coefficient
of the ith row of C, if such a leading coefficient exists.
C is in row echelon form if both of the following conditions hold.
• There is some 1 ≤ k ≤ n such that the leading coefficient ciji exists if and only if
1 ≤ i ≤ k. In other words, all zero rows appear at the bottom of the matrix.
• For all 1 ≤ i < i′ ≤ k, ji < ji′ . In other words, the leading coefficients move strictly to
the right as we move down the rows of C.
The matrix with every entry being 0 is also in row echelon form.
Note that this definition also implies that the entries below a leading coefficient in the same
column are all equal to 0.
Definition 2.12. Let C ∈ Rm×n and let cij denote the entry of C in the ith row and jth
column. Let ciji denote the leading coefficient of the ith row of C, if such a leading coefficient
exists.
C is in reduced row echelon form if it is in row echelon form and it also satisfies the
following two conditions.
• For all 1 ≤ i ≤ k, ciji = 1. In other words, all leading coefficients are equal to 1.
• For all 1 ≤ i ≤ k, and 1 ≤ i′ < i, we have ci′ ji = 0. In other words, the entries above a
leading coefficient in the same column are all equal to 0.
The matrix with every entry being 0 is also in reduced row echelon form.
87
This matrix is not in row echelon form. It does not satisfy the required condition that all of
the zero rows are at the bottom of the matrix. However, if we reverse the order of the rows,
swapping the second row with the fourth one, we obtain the matrix C, which is in reduced
row echelon form.
Another matrix which is not in row echelon form is
1 2 0
0 1 1
F = 0 1
.
0
0 0 0
This is because the leading coefficient of the the third row is not to the right of the leading
coefficient of the row above.
Example - Let’s see an example of how reduced row echelon form matrices correspond to
linear systems that can be easily solved. We use the reduced row echelon form matrix
1 2 0
0 0 1
C= 0 0 0
0 0 0
Recalling the notation introduced in the previous section, we want to understand the set
2
1
LC, 0
x1 + 2x2 = 2
x3 = 1
0=0
0 = 1.
Since the last equation is never valid, there are no solutions to this system, and therefore
2
1
L C, 0 = ∅.
88
Let’s see what happens when we slightly modify the RHS of this system and consider solutions
x ∈ R3 to the equation
2
1
Cx =
0 .
0
Writing this as a system of linear equations, this becomes
x1 + 2x2 = 2
x3 = 1
0=0
0 = 0.
The last two equations are always satisfied, and can thus be disregarded. So, we need to
solve the system. Writing this as a system of linear equations, this becomes
x1 + 2x2 = 2
x3 = 1.
We can treat x2 as a free variables. This means that we allow x2 to range over all possible
values of R, and write the other variables in terms of the free variables (if necessary). We
conclude that
2
1 2 − 2x2
L C, =
x2 : x2 ∈ R . (56)
0
1
0
Note that there is some choice in how to choose the free variables and therefore in how we
express the final form of the solution set. In this case, we could have instead chosen x1 as
the free variable and expressed x2 in terms of x1 . We could then conclude that
2
1 x1
L C, = 1 − 1 x1 : x1 ∈ R .
0 2
1
0
These two expressions may appear different, but they are just two different descriptions of
the same set.
When we are looking to express the solution set for a system given in reduced row echelon
form, a convenient method is the following: we can set the free variables to correspond to
the columns which do not contain a leading coefficient. This is what we did when writing
down the solution set in the form of (56). As we have seen above, there are other ways to
choose the free variables, but using the columns without leading coefficients is guaranteed to
produce a fairly tidy looking expression for the solution set.
We do not prove the following statement here formally, but note that it is the basis of the
considerations below.
Theorem 2.13. Every matrix can be transformed to reduced row echelon form by performing
row operations. Moreover, the reduced row echelon form of a matrix is unique.
89
In contrast, a given matrix A can be transformed by row operations into different matrices
in (non-reduced) row echelon form. For example, multiplying any row of a row echelon form
matrix by 2 leads to another row echelon form matrix. But even then, all row echelon forms
of a matrix have the same number of non-zero rows.
Exercise - Prove that any two equivalent matrices in row echelon form have the same number
of non-zero rows.
This allows us to state the following definition.
Definition 2.14. Let A ∈ Rm×n be arbitrary. We define the rank of A, denoted by rank(A),
as the number of non-zero rows in a row echelon form matrix C that is equivalent to A.
Examples - Let us revisit the six matrices A, B, C, D, E and F from a previous example.
Since A, B, C and D are in row echelon form, we can immediately see their ranks by counting
the number of non-zero rows. Note that
Although E is not in row echelon form, we can easily transform it into row echelon form with
row operations, namely by switching the second and fourth row. We obtain the equivalent
matrix
1 2 0
0 0 1
E′ = 0 0 0 .
0 0 0
Therefore, rank(E) = 2. We can transform F into row echelon form by subtracting the
second row from the third. We obtain the equivalent row echelon form matrix
1 2 0
0 1 1
F′ = 0 0 −1 .
0 0 0
Therefore, rank(F ) = 3.
Let us see some more involved examples of how we can use row operations to transform a
matrix into row echelon form and reduced row echelon form.
Example - Let us consider the matrix
1 2 3
A = 4 5 6 .
7 8 9
Our first task is to reduce the matrix to row echelon form. To do this, we need to create
zeroes underneath the leading entries. We can do this by subtracting four times the first row
from the second. We indicate this procedure as follows.
90
1 2 3 −−−−−−−−−→ 1 2 3
4 5 6 R2 = R2 − 4R1 0 −3 −6
7 8 9 7 8 9
Note that the “R2 = R2 − 4R1 ” appearing above is not an actual mathematical equation, but
rather a piece of notation for telling us how the row operation has been carried out. There
are many slight variants of this notation that appear throughout the literature, so please be
aware of this when reading other sources.
Similarly, we obtain a zero in the third entry of the first column by subtracting 7 times the
first row.
1 2 3 −−−−−−−−−→ 1 2 3
0 −3 −6 R3 = R3 − 7R1 0 −3 −6
7 8 9 0 −6 −12
We are finished with the first leading entry, but we also need to have zeroes beneath the
second leading entry. The row operation which achieves this is R3 = R3 − 2R2 . With this,
we obtain the matrix
1 2 3
B = 0 −3 −6 .
0 0 0
This matrix is in row echelon form. To transform this into reduced row echelon form, we
dilate the second row to ensure that all leading coefficients are equal to 1 (we perform the
operation R2 = − 13 R2 ).
After that, we still need to create zeroes above all of the leading coefficients. This means that
the entry in the first row and second column needs to be zero. The operation R1 = R1 − 2R2
achieves this. We summarise these two steps via the following notation.
1 2 3 −−−−−−−−−→ 1 2 3
0 −3 −6 R2 = − 1 R2 0 1 2
3
0 0 0 0 0 0
−−−−−−−−−→ 1 0 −1
R1 = R1 − 2R2 0 1 2
0 0 0
We have finally obtained a matrix in reduced row echelon form. We can therefore say that
the reduced row echelon form of A is the matrix
1 0 −1
C = 0 1 2 .
0 0 0
The process that we have used above to transform a given matrix into (reduced) row echelon
form is called Gaussian elimination.
91
There are many ways to reduce a matrix to row echelon form and reduced row echelon form
via row operations, and the choices mainly come down to choosing an order to perform
the operations. We can often use intuition to spot certain shortcuts that will simplify the
process. We can also state a formal algorithm for doing this, which is essentially implicit in
the examples outlined above.
Step 1 - Begin with the leftmost nonzero column. If necessary, use row interchange so that
this column’s first entry is nonzero.
Step 2 (optional) - Dilate so that the first entry in this column is equal to 1 (this is not
essential for reducing to echelon form, but it usually makes the calculations easier)
Step 3 - Use row replacement operations to create zeros in all positions except for the first
entry of the column.
Step 4 - Ignore the first row of the matrix, and apply steps 1,2 and 3 to the submatrix that
remains. Repeat the process until the matrix is in echelon form (this process will certainly
terminate, since any matrix with exactly one row is in echelon form).
At this point we have a matrix in echelon form, but we can extend the algorithm to produce
a matrix in reduced echelon form.
Step 5 - Beginning with the rightmost leading entry, use row replacement operations to
create zeros above the leading entry. Do this for all of the leading terms, progressing to the
left and up.
Step 6 - Use row dilation so that all of the leading entries are changed to 1 (this will not be
necessary if we performed step 2 every time).
The order of steps 5 and 6 can be reversed. It is often more practical to perform step 6 before
step 5.
Now we discuss how we can solve linear systems by calculating (reduced) row echelon forms of
matrices. We consider the linear system Ax = b with corresponding matrix A = (aij ) ∈ Rm×n
and RHS
b1
b2
b= .
..
bm
Define the augmented matrix (A|b) to be the matrix A with an additional row b added.
That is,
a11 a12 . . . a1n b1
a21 a22 . . . a2n b2
(A|b) = . .. .
.. .. ..
.. . . . .
am1 am2 . . . amn bm
As we outlined at the beginning of this section, applying row operations to a system of linear
equations does not change the solution set. Therefore, if the augmented matrix (C|b′ ) is
obtained from (A|b) using only row operations, then
L(A, b) = L(C, b′ ).
92
Let’s see how to perform Gaussian elimination with augmented matrices to solve linear sys-
tems in practice.
Example - Consider the linear system Ax = b where x ∈ R2 is a (vector) variable,
3 5 42
A= and b = .
1 −1 6
Next, we reduce the augmented matrix to reduced row echelon form using row operations.
This means that we perform row operations to transform the left side of the augmented
matrix to reduced row echelon, and in the process record the changes to b on the right side
of the augmented matrix. We obtain
3 5 42 −−−−−−−−−→ 1 −1 6
R1 ↔ R2
1 −1 6 3 5 42
−−−−−−−−−→ 1 −1 6
R2 = R2 − 3R1
0 8 24
−−−−−−−−−→
1 1 −1 6
R2 = R2
8 0 1 3
−−−−−−−−−→ 1 0 9
R1 = R1 + R2 .
0 1 3
It therefore follows that the set of solutions to Ax = b is equal to the set of solutions to the
system
x1 + 0x2 = 9
0x1 + x2 = 3.
93
Next, we reduce the augmented matrix to reduced row echelon form using row operations.
2 1 −2 5 −−−−−−−−−→ 2 1 −2 5
0 3 6 3 R3 = R3 − R1 0 3 6 3
2 0 −4 4 0 −1 −2 −1
−−−−−−−−−→ 2 1 −2 5
R3 = −R3 0 3 6 3
0 1 2 1
−−−−−−−−−→ 2 1 −2 5
R2 ↔ R3 0 1 2 1
0 3 6 3
−−−−−−−−−→ 2 1 −2 5
R3 = R3 − 3R2 0 1 2 1
0 0 0 0
−−−−−−−−−→ 2 0 −4 4
R1 = R1 − R2 0 1 2 1
0 0 0 0
−−−−−−−−−→ 1 0 −2 2
1 0
R1 = R1 1 2 1 .
2
0 0 0 0
The left side is of the augmented matrix is now in reduced row echelon form, and so we are
ready to finalise the solution. The system is equivalent to
x1 − 2x3 = 2
x2 + 2x3 = 1.
This system has a free variable, which we set as x3 . We rewrite this system as
x1 = 2 + 2x3
x2 = 1 − 2x3 .
The next result highlights the convenient simplicity of the reduced row echelon form of a
square matrix with full rank.
Lemma 2.15. Let A ∈ Rn×n be a square matrix and let C ∈ Rn×n be its reduced row echelon
form. Then
rank(A) = n ⇐⇒ C = In .
94
Proof. (⇐) We prove the contrapositive form. Suppose that rank(A) ̸= n. Then C contains
at least one zero row, and thus C ̸= In , as required.
(⇒) Suppose that rank(A) = n. Then C is a square matrix in reduced echelon form with
at least one entry in each row. It must then be the case that the leading coefficients are the
diagonal entries of C. Since C is in reduced echelon form, these leading coefficients are all 1.
Also, since C is in reduced echelon form, all of the other entries in the same column as the
leading coefficients must be zero. It follows that C = In .
This leads to the following nice characterisation of square matrices with full rank.
Theorem 2.16. If A ∈ Rn×n is a square matrix with rank(A) = n, then the linear system
Ax = b has a unique solution for any b ∈ Rn .
Proof. Let C be the reduced row echelon form of the matrix A. By Lemma 2.15, C = In .
The set of solutions to Ax = b is the same as the set of solutions to Cx = b′ , where b′ is
some fixed vector in Rn . However, since C = In , the system Cx = b′ has the unique solution
x = b′ .
Let A ∈ Rm×n and let C be the reduced echelon form matrix equivalent to A. Note that the
linear system Ax = b is inconsistent, i.e., has no solutions, if and only if there is a zero row
in C (from the augmented matrix (C|b′ )) and the corresponding entry of b′ is not equal to
zero.
Since there are exactly m rows, and k = rank(A) of them are non-zero, we see that a linear
system Ax = b is solvable (independent from the RHS b) if rank(A) = m.
We can see from the discussion above and the previous few examples, as well as Theorem
2.16, that the rank of a matrix A is very influential in determining whether a system Ax = b
has no solutions, a unique solution, or infinitely many solutions. We summarise some more
features of the relationship between rank and the number of solutions in the next theorem.
1. If rank(A) < m, then there is some b ∈ Rn such that the linear system Ax = b has no
solutions.
2. If rank(A) < n, then the homogeneous system Ax = 0 has infinitely many solutions.
3. If rank(A) < n, then the system Ax = b either has no solutions or has infinitely many
solutions.
Proof. 1. Let C be the reduced row echelon form matrix equivalent to A. Since rank(A) <
m, the last row of C is a zero row. Let b′ ∈ Rm be any vector with a non-zero entry
b′m ̸= 0 in the final position. Observe that that the system Cx = b′ has no solutions.
Since A and C are equivalent, we can perform row operations on the augmented matrix
(C|b′ ) to transform it into the equivalent matrix (A|b) (here b is a vector in Rm which
is obtained by performing the row operations which transform C into A). The solutions
to the system Ax = b are exactly the same as those for the system Cx = b′ . Therefore,
there are no solutions to Ax = b.
95
2. Let C be the reduced echelon form matrix equivalent to A. If the rank of A is strictly
less than the number of columns, then there must exist a column in C which does
not contain a leading coefficient. The corresponding variable can be treated as a free
variable.
3. Suppose that Ax = b has at least one solution. By part 2 of this theorem, Ax = 0 has
a non-trivial solution (i.e. a solution x ̸= 0). Lemma 2.8 then implies that Ax = b has
infinitely many solutions.
Proof. The “⇒” direction is Theorem 2.16. The other direction follows from Theorem 2.17.
The final key idea of this section to introduce is that row operations are equivalent to multi-
plication by certain “elementary” matrices. This interpretation of the row operations will be
useful as we seek to build a basic theory of matrices.
An elementary matrix is a matrix which can be obtained from the identity matrix by a single
row operation. There are three kinds of elementary matrices
1. A matrix corresponding to row interchange, for example performing the operation R2 ↔ R1
gives
0 1 0
E1 = 1 0 0 .
0 0 1
2. A matrix corresponding to row dilation, for example performing the operation R2 = 3R2
gives
1 0 0
E2 = 0 3 0 .
0 0 1
Furthermore, every row operation that we perform can be restated as matrix multiplication
by an elementary matrix. For example, consider the matrix
1 0 −2
A = −3 1 4 .
2 −3 4
96
We would typically start the process of reducing this to echelon form by performing the
operation R2 = R2 + 3R1 . We obtain
1 0 −2
0 1 −2 .
2 −3 4
Indeed
1 0 0 1 0 −2 1 0 −2
3 1 0 −3 1 4 = 0 1 −2 .
0 0 1 2 −3 4 2 −3 4
The same is true for all elementary row operations. This hints at the following result.
Lemma 2.19. Let A and B be m × n matrices and suppose that A and B are equivalent.
Then, there exists a sequence of elementary matrices E1 , E2 , . . . Ek such that
B = Ek Ek−1 . . . E1 A.
Note that, since elementary matrices are derived from the identity matrix via a row opera-
tions, it follows that they are always square. A useful fact about elementary matrices is that
they are always invertible, and their inverses are also elementary matrices.
97
2.4 Matrices as linear transformations
We can use our notion of matrix multiplication to view matrices as a kind of function, or
transformation.
Let A ∈ Rm×n , we say that TA : Rn → Rm is the matrix transformation of A. This function
is defined by
TA (x) = Ax.
Example - Let
1 −3
A= 3 5 .
−1 7
Then, for an arbitrary vector
x1
x= ∈ R2
x2
we have
1 −3 x1 − 3x2
x
TA (x) = 3 5 1 = 3x1 + 5x2 .
x2
−1 7 −x1 + 7x2
Note that TA (x) is only defined if x ∈ R2 .
An important class of functions is the set of linear transformations.
Definition 2.21. A function T : Rn → Rm is a linear transformation if
Proof. We proved in an earlier exercise that A(x + y) = Ax + Ay and A(λx) = λAx. This
immediately implies the theorem.
The next result shows that there is a direct correspondence between matrices and linear
transformations.
Theorem 2.23. Let T : Rn → Rm be a linear transformation. Then there exists a unique
matrix A such that T = TA . In fact, A is the m × n matrix
Theorem 2.23 can then be used to prove the following result, which will be useful later in
this chapter.
Theorem 2.24. Let A ∈ Rn×n be a square matrix. Then
TA is invertible ⇐⇒ A is invertible.
98
Proof. Throughout this proof, we use Id to denote the identity function with domain Rn .
(⇐) Suppose that A is invertible. Then AA−1 = In = A−1 A. Therefore,
TA ◦ TA−1 = TAA−1 = TIn = Id,
and similarly
TA−1 ◦ TA = TA−1 A = TIn = Id.
(⇒) Suppose that TA is invertible, and so there is some function T : Rn → Rn such that
T ◦ TA = Id = TA ◦ T .
Claim. T is a linear transformation.
First we will show that the claim implies the theorem, and then we will prove the claim
Claim ⇒ Theorem 2.24 - Since T is a linear transformation, Theorem 2.23 implies that
T = TB for some matrix B ∈ Rn×n . Therefore
Id = TA ◦ T = TA ◦ TB = TAB , and Id = T ◦ TA = TB ◦ TA = TBA .
Therefore, for all x ∈ Rn , we have
AB(x) = x and BA(x) = x.
It follows that AB = In = BA, and so A is invertible with A−1 = B.
It remains to prove the claim.
99
2.5 Determinants
We now introduce the determinant of a matrix. This is a particularly useful tool for deter-
mining if a matrix is invertible or not.
Given a matrix A, let Aij denote the matrix with row i and column j of A removed. For
example, let
3 2 1
A = 6 7 8 .
1 −1 −1
Then
6 7 3 1
A13 = , A22 = .
1 −1 1 −1
Definition 2.25. For a 1 × 1 matrix A whose only entry is a, we say that the determinant
of A is a, and write det(A) = a.
Suppose that n ≥ 2 and A ∈ Rn×n . The determinant of A, denoted det(A) is
n
X
(−1)1+j a1j det(A1j ).
j=1
We sometimes omit the brackets and simply write det A. Note that the determinant is only
defined for square matrices.
We remark here that there are many equivalent definitions of the determinant that may be
found elsewhere in the literature.
Example - For 2 × 2 matrices, the definition above gives a simple description of the deter-
minant. Let
a11 a12
A= .
a21 a22
Then, according to the definition above, we have
3 2 1
A = 6 7 8 .
1 −1 −1
100
Using the definition of the determinant above, this can be written as the sum of three smaller
determinants. We obtain
7 8 6 8 6 7
det A = 3 · det − 2 · det + 1 · det
−1 −1 1 −1 1 −1
= 3 · [7 · (−1) − 8 · (−1)] − 2 · [6 · (−1) − 8 · 1] + 1 · [6 · (−1) − 7 · 1]
= 3 · 1 − 2 · (−14) + 1 · (−13) = 18.
Another common notation for the determinant of a matrix is to use a pair of vertical lines
instead of the usual round brackets for the matrix border (resembling the notation for the
absolute value function). So, we may also write
3 2 1
6 7 8 = 18.
1 −1 −1
The definition of the determinant is given by expanding along the first row of the matrix.
In fact, there is much more flexibility, and we can also expand along any row or column to
calculate the determinant.
Theorem 2.26. Let A be an n × n matrix, n ≥ 2. Then, for any 1 ≤ i ≤ n
n
X
det A = (−1)i+j aij det Aij
j=1
We skip the proof of this result because we are running out of time. This result is very
useful, in both theory and practice. In particular, it can provide a convenient shortcut for
calculating determinants by choosing a row or column that contains many zeroes.
Example - Compute the determinant of
1 2 100
√
3 5 2 .
0 0 1
To make things easier, we use the third row. The presence of many zeros in this row makes
our calculations quicker.
det A = 0 · det A31 − 0 · det A32 + 1 · det A33 = 1 · det A33 = 1 · (1 · 5 − 2 · 3)) = −1.
101
If we used the original definition of the determinant, it would require many computations to
calculate det A. However, the many zeroes appearing in the first column can be used to make
things easier. We obtain
2 −5 7 3
0 1 5 0
det A = 3 · det
0 2 4 −1
0 0 −2 0
1 5 0
= 3 · 2 · det 2 4 −1
0 −2 0
1 5
= 3 · 2 · (−1) · (−1). det
0 −2
= 3 · 2 · (−2) = −12.
Definition 2.27. An m × n matrix is said to be upper triangular if all of the entries below
the main diagonal are zero. That is, A is upper triangular if aij = 0 for all i > j. A matrix
is lower triangular if aij = 0 for all i < j. A matrix is triangular if it is either upper
triangular or lower triangular.
Lemma 2.28. Let A = (aij ) be an n × n triangular matrix. Then det A = a11 · a22 · · · ann .
bij = λaij , ∀ 1 ≤ j ≤ n
and
bkj = akj , ∀1 ≤ k, j ≤ n such that k ̸= i.
102
In particular, it follows that Aij = Bij for all 1 ≤ j ≤ n. The determinant det B can
be calculated by expanding along the ith row. We obtain
n
X
det B = (−1)i+j bij det Bij
j=1
n
X
= (−1)i+j λaij det Bij
j=1
Xn
= (−1)i+j λaij det Aij
j=1
Xn
=λ (−1)i+j aij det Aij
j=1
= λ det A.
2. Proof by induction on n. The base case n = 2 can be verified directly. Suppose that
a b
A=
c d
and
c d
B= .
a b
Then
det A = ad − bc = −(cb − da) = − det B.
Now let n ≥ 3 and suppose that the result holds for dimension (n−1)×(n−1) matrices.
Choose a row of B that is the same as the corresponding row of A. Since there are at
least 3 rows, and only two rows change, such an unchanged row is guaranteed to exist.
The determinant of B can be calculated by expanding along this row (let us say that
the ith row is the same in B and A). We obtain
n
X n
X
det B = (−1)i+j bij det Bij = (−1)i+j aij det Bij
j=1 j=1
Now observe that the matrix Bij ∈ R(n−1)×(n−1) is obtained from the matrix Aij by
interchanging two rows. Therefore, by the induction hypothesis, det Bij = − det Aij .
Finally,
n
X n
X
det B = (−1)i+j aij det Bij = − (−1)i+j aij det Aij = − det A.
j=1 j=1
3. A similar proof by induction argument to the one used for part 2 of this lemma can be
used here. This is left as an exercise.
Observe that the previous result immediately implies the following handy consequence.
103
Corollary 2.30. Let A ∈ Rn×n and suppose that B can be obtained from A by row operations.
Then
det A = 0 ⇐⇒ det B = 0.
Lemma 2.29 can be useful for calculating determinants. The idea is simply to reduce a given
matrix to row echelon form (which is a triangular matrix), and to then use Lemma 2.28 to
quickly calculate the determinant of the echelon form matrix. We need to keep track of the
row operations we carry out, and include this factor in our calculation.
Example - Consider the matrix
1 −4 2
−2 8 −9 .
−1 7 0
By Lemma 2.29,
1 −4 2 1 −4 2 1 −4 2
det −2 8 −9 = det 0 0 −5 = − det 0 3 2 = (−1) · 1 · 3 · (−5) = 15.
−1 7 0 0 3 2 0 0 −5
In the first step we only performed the operation of adding a multiple of one row to another
row (R2 = R2 +2R1 and R3 = R3 +R1 ), so we did not need to introduce a multiplicative factor.
In the second step, we switched the sign because we applied a row interchange operation.
The determinant is of special interest, because it can be used to characterise if a matrix
is invertible and if a linear system is uniquely solvable. Recalling Theorem 2.16, the latter
property is equivalent to the corresponding matrix having full rank.
det A ̸= 0 ⇐⇒ rank(A) = n.
Proof. By Theorem 2.13, A can be transformed into reduced row echelon form C by perform-
ing row operations.
(⇒) We will prove the contrapositive form. Suppose that rank(A) ̸= n. Then, by definition,
C has a zero row, and in particular, one of the diagonal entries of C is equal to zero. On
the other hand, C is an upper triangular matrix, and so Lemma 2.28 implies that det C = 0.
Since C is obtained from A via row operations, Corollary 2.30 then implies that det A = 0.
(⇐) Suppose that rank(A) = n. Then by Lemma 2.15, C = In . Thus det C = 1, and
Corollary 2.30 then implies that det A ̸= 0.
A is invertible ⇐⇒ rank(A) = n.
104
Matrix multiplication can also be considered as a function on vectors. Indeed, recall that we
defined the linear transformation TA : Rn → Rn by
TA (x) = Ax.
The right hand side of the equivalence (57) is equivalent to the function TA being bijective
(for all b ∈ Rn , there exists exactly one x ∈ Rn such that TA (x) = b). Therefore, by Theorem
1.13 and Theorem 2.24, we conclude that
Combining the previous two lemmas, we have the following important theorem relating de-
terminants and invertibility.
Theorem 2.33. Let A ∈ Rn×n . Then
A is invertible ⇐⇒ det A ̸= 0.
Moreover, we can combine several of the statements we have recently proved into one big
statement which shows that several important properties of square matrices are equivalent.
Theorem 2.34. Let A ∈ Rn×n . Then the following statements are equivalent.
1. A is invertible.
2. det A ̸= 0.
3. rank(A) = n.
4. The linear system Ax = b has a unique solution for any b ∈ Rn .
5. A is equivalent to In .
105
is invertible.
The next statement says that the determinant of the product of two matrices is equal to the
product of the determinants.
We will first prove this theorem in the special case when one of the matrices is elementary.
Lemma 2.36. Let B ∈ Rn×n and let E ∈ Rn×n be an elementary matrix. Then
Proof. There are three cases to consider, corresponding to the three row operations that E
may represent.
Proof of Theorem 2.35. Case 1 - Suppose that either det(A) = 0 or det(B) = 0. We consider
the case when det(A) = 0 in detail only, as the case when det(B) = 0 can be handled similarly.
It follows from Lemma 2.31 that rank(A) < n. Part 1 of Theorem 2.17 then implies that
there is some vector b ∈ Rn such that Ax = b has no solutions.
Suppose for a contradiction that det(AB) ̸= 0. Then, by Theorem 2.34, there exists y ∈ Rn
such that (AB)y = b, where b is the same vector as in the previous paragraph. This is a
contradiction, since if we write By = x, we see that
106
Since A can be reduced to In by a sequence of row operations, this process can be reversed
so that In is transformed to A by a sequence of row operations. It follows that there is a
sequence of elementary matrices E1 , . . . , Ek such that
A = Ej . . . E1 In = Ej . . . E1 .
Similarly,
B = Fk . . . F1 ,
where F1 , . . . Fk are elementary matrices.
Repeated applications of Lemma 2.36 imply that
det(A) = det(Ej . . . E1 )
= det(Ej ) det(Ej−1 . . . E1 )
..
.
= det(Ej ) · · · det(E1 ). (58)
Similarly,
det(B) = det(Fk ) · · · det(F1 ). (59)
Finally,
det(AB) = det(Ej . . . E1 Fk . . . F1 )
= det(Ej ) det(Ej−1 . . . E1 Fk . . . F1 )
..
.
= det(Ej ) · · · det(E1 ) · det(Fk ) · · · det(F1 )
= det(A) · det(B).
Example
Consider the matrix
15 10 24
A = 15 22 12 .
12 4 33
It looks like it might be tricky to compute det A, at least without a calculator. We are,
fortunately, given the information that
1 2 4 1 2 4
A = 3 4 0 3 4 0 .
2 0 5 2 0 5
107
It remains to compute the easier determinant, which we do here by expanding along the final
row:
1 2 4
2 4 1 2
3 4 0 =2 +5 = 2 · (−16) + 5 · (−2) = −42.
4 0 3 4
2 0 5
Therefore,
det A = (−42)2 = 1764.
We can also use Theorem 2.35, combined with Theorem 2.34, to quickly determine whether
a product of two matrices is invertible.
Proof.
AB is invertible ⇐⇒ det(AB) ̸= 0
⇐⇒ (det A) · (det B) ̸= 0
⇐⇒ det A ̸= 0 and det B ̸= 0
⇐⇒ A and B are both invertible.
Now that we know that the determinant “behaves well” with respect to transposition and
multiplication, one might guess that a similar relation also holds for addition. However, this
is not the case, and there is no similar formula for the determinant of the sum of
matrices as the following simple example shows. Example Consider the matrices
1 0 0 0
A= and B= ,
0 0 0 1
and observe that A + B = I2 . Since A and B both contain a zero row, it follows that
det A = 0 = det B.
Moreover, this example also shows that an analogue of Theorem 2.37, with sums instead of
products, is not true.
108
2.6 Inverse matrices
We now take a closer look at inverse matrices. One of the main motivations for working with
the inverse matrix is that it can be used to solve linear systems in a straightforward way.
Indeed, since the inverse matrix A−1 satisfies AA−1 = A−1 A = In , we have that
Ax = b ⇐⇒ x = A−1 b. (60)
In the previous section, we saw how the determinant can be used to quickly determine whether
or not a matrix is invertible. However, so far, we do not have a method for calculating what
the inverse is. Developing such a method will be the focus of this section. There are different
methods for calculating the inverse of a matrix, and we choose a method which is an extension
of the Gaussian elimination techniques we learned earlier in this chapter.
Suppose that we have a matrix A ∈ Rn×n that we know is invertible. Let ci denote the ith
column of A−1 , so
A−1 = (c1 . . . cn ).
Recall that we discussed earlier (see the discussion before Lemma 2.2) that unit vectors can
be used to extract columns from matrices. In particular,
ci = A−1 ei .
Then, following the logic of (60) (in other words, left-multplying both sides of the above
equation by A), we have
Aci = ei .
This shows that calculating ci is equivalent to solving the linear system Ax = ei . In section
2.3, we learnt how to do this by considering the augmented matrix (A|ei ), and using Gaussian
elimination to reduce this to (In |x). The resulting right hand side of the resulting augmented
matrix is a solution to the system Ax = b.
To use this method to calculate the inverse A−1 , we must repeat this procedure for each
column of A. However, we can apply the Gaussian elimination procedure to several vectors
at once! Hence, we can compute all columns of A−1 at once by computing the reduced row
echelon form of the augmented matrix
a11 . . . a1n 1 0
.. .. ..
. .
. .
an1 . . . ann 0 1
If A is invertible then the reduced echelon form of A is the identity matrix In (see Theorem
2.34. Thus, by using Gaussian elimination, we are able to compute
(A|In ) → (I|A−1 ).
109
has the form
0 a′11 . . . a′1n
1
.. .. .. .
. . .
0 1 a′n1 . . . a′nn
The matrix on the right hand side of the second augmented matrix above is the inverse of A.
That is,
A−1 = (a′ij )ni,j=1 .
110
It therefore follows from Theorem 2.38 that
−7 2 −2
A−1 = −4 1 0 .
4 −1 1
111
3 Sequences and Series
This chapter is dedicated to formalising the idea of the limiting processes. It forms one of
the central ideas of mathematical analysis and defines the basis for essential concepts like
continuity, differentiability, integration etc.
For example, consider the following infinite sum of powers of 2:
∞
1 1 1 X
1 + + + + ··· = 2−k .
2 4 8
k=0
As we add more and more terms (i.e. as k gets larger and larger), we get closer and closer
to 2, and we can get as close as we want (i.e. arbitrarily close) by adding enough terms. It
is intuitive to suggest that
X∞
2−k = 2.
k=0
We will develop a framework to consider such infinite sums more formally in this chapter.
We begin with the definition of a sequence.
Definition 3.1. Let M ̸= ∅ be an arbitrary set, and let I ⊂ Z be an infinite set. A sequence
in M is a mapping a : I → M . With the notation an := a(n), we can write the sequence as
(an )n∈I .
The range of a sequence (an )n∈I is the set {an : n ∈ I}. The domain I of a sequence is
called the index set of the sequence.
In most cases, we consider I = N or I = {K, K + 1, . . . } for some K ∈ Z. In the latter case,
we write the sequence as (an )n∈I = (an )∞
n=K . If the index set is clear, we may just write (an )
for (an )n∈I .
an = 2n (62)
or
1
bn = 1 + .
n
We can also define a sequence by recursion. This means that we give one (or more) starting
value(s) and a rule for how to calculate a new term using previous terms. For example, we
can set a1 = 2 and define ai = 2ai−1 for all i ≥ 2. This is another description of the sequence
(62).
Example - Consider the sequence (an )n∈N given by an = n1 . We can also represent this
sequence by listing its elements, that is this sequence is the same thing as the list
1 1
1, , , . . . .
2 3
112
We can immediately observe that the terms of this sequence get very close to 0 as n gets
large, but never reach zero. We say that the sequence converges to 0. We will give a formal
definition of what this means in the next section.
Example - One of the most famous sequences, which appears in several areas of natural
science, is the Fibonacci sequence. Here, the recursion depends on the previous two
values. The sequence (Fn )n∈N is defined by
The first values of this sequence are 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, . . . . It is an
√ interesting
1+ 5
phenomenon that the quotients Fn+1 /Fn converge to the golden ratio 2 . (See e.g.
Wikipedia for background on the importance of this constant).
Pn
Example - Given a sequence an , we can consider the new sequence sn = k=1 ak . We
would like to know if sn approaches a certain number when n goes to infinity. These special
sequences are called series, and we come back to this later in the chapter.
The concept of convergence is central to mathematical analysis. Intuitively, it states that the
terms of the sequence (an )n∈N approach a limit with growing index n.
an
Note that, for M = R, the ϵ-neighbourhood Uϵ (a) is just the open interval (a − ϵ, a + ϵ). For
M = C, Uϵ (a) is the disc of radius ϵ around centred at a in the complex plane C.
113
One can also consider neighbourhoods in a much more general situation. All we need is a
notion of distance. For instance, one may define the ϵ-neighbourhood of a point x ∈ Rd . We
may consider such generalisations later in the course.
We now come to the formal definition of convergence to a limit a.
Definition 3.3. Let (an )n∈N be a complex-valued sequence and a ∈ C. We say that the
sequence (an )n∈N converges to a if and only if
An equivalent definition is
∀ ϵ > 0, ∃ n0 ∈ N : n ≥ n0 =⇒ an ∈ Uϵ (a).
a = lim an
n→∞
This definition may appear intimidating to some readers on first viewing. Let’s try to draw
a picture to illustrate the meaning of the definition.
an
a+ϵ
a
a−ϵ
n0 n
What this definition says is that, no matter how tiny we choose ϵ to be, the sequence will
always eventually stay within a distance ϵ of the limit a.
Note that the limit does not depend on the first terms of a sequence. In particular, we can
always disregard finitely many elements of a sequence when considering its limiting behaviour.
For instance, if (an )n∈N and (bn )n∈N are two sequences such that an = bn holds for all but
finitely many values of n, and limn→∞ an = a, then limn→∞ bn = a.
Let us consider some examples.
114
Example - Consider again the sequence (an )n∈N with an = n1 . For all ϵ > 0 we can find
some n0 ∈ N such that n10 < ϵ. This is the Archimedean property (in particular, see Theorem
1.39). Since n1 ≤ n10 for n ≥ n0 , we obtain
1 1
|an − 0| = ≤ <ϵ
n n0
for all n ≥ n0 . Therefore, an → 0.
Example - Consider the sequence (an )n∈N with an = (−1)n , i.e. the sequence which alter-
nates between 1 and −1. This sequence is divergent. For a proof we assume the opposite,
i.e., that (an ) converges to some a ∈ R. Now, by the definition of convergence, we have that
there exists some n0 such that
|an+1 − an | = 2.
Definition 3.4. Let (an )n∈N be a real sequence such that limn→∞ an = 0. Then we call
(an )n∈N a null sequence.
1
and (2−n )n∈N are null sequences.
Example - The sequences n n∈N
Exercise - Show that, for any c > 0 the sequence n1c n∈N is a null sequence.
We now consider the notion of boundedness. This is very similar to the notion of boundedness
for sets which we considered in Section 1.6.
Definition 3.5. Let (an )n∈N be a complex valued sequence. We call the sequence bounded
(by C) if and only if
∃ C > 0 : ∀ n ∈ N, |an | ≤ C.
A real-valued sequence (an )n∈N is bounded from above if and only if
∃ C ∈ R : ∀ n ∈ N, an ≤ C,
∃ C ∈ R : ∀ n ∈ N, an ≥ C.
115
and
42
n n∈N
(eniθ )n∈N .
Theorem 3.6. Let (an )n∈N be a convergent sequence. Then (an )n∈N is bounded.
Proof. Let a = limn→∞ an . By the definition of convergence (with ϵ = 1), there exists n0
such that, for all n ≥ n0 ,
|a − an | < 1.
It follows from the triangle inequality that
Note that the sequence (an )n∈N given by an = (−1)n is bounded by 1 but not convergent.
Therefore, the opposite implication for the theorem above does not hold in general.
Example - Consider the sequence (an )n∈N with an = log2 n. This sequence is unbounded,
and it therefore follows from Theorem 3.6 that the sequence is divergent.
In the previous example, we have seen a sequence that appears to “converge to infinity”. We
now introduce the terminology of definite divergence of a real-valued sequence in order to
give a formal description for this kind of situation.
116
Definition 3.7. Let (an )n∈N be a real-valued sequence. The sequence (an )n∈N tends to ∞ if
and only if
∀ C ∈ R, ∃ n0 ∈ N : ∀ n ≥ n0 , an ≥ C.
In this case, we write limn→∞ an = ∞ or an → ∞ and call ∞ the improper limit of
(an )n∈N .
The sequence (an )n∈N tends to −∞ if and only if
∀ C ∈ R, ∃ n0 ∈ N : ∀ n ≥ n0 , an ≤ C.
Note that definitely divergent sequences are necessarily unbounded. Moreover, we do not
have such a concept for complex-valued sequences, as we do not have an order on C.
117
3.2 Calculation rules for limits
We now study how to determine the limit of more complicated sequences. This always follows
the same procedure; either we already know the limit of the sequence under consideration,
or one has to split up the sequence into easier parts that can be handled, or split again.
The following result helps us to reduce the task of determining a limit to several smaller and
hopefully easier limits.
Theorem 3.8. Let (an )n∈N and (bn )n∈N be convergent complex-valued sequences and let
λ ∈ C. Let a = limn→∞ an and b = limn→∞ bn . Then, we have
(ii) limn→∞ (λ · an ) = λ · a,
Proof. (i) Let ϵ > 0 be arbitrary. By the definition of convergence, there exist m0 , n0 ∈ N
such that
ϵ
n ≥ m0 =⇒ |an − a| <
2
and
ϵ
n ≥ n0 =⇒ |bn − b| < .
2
In particular, for all n ≥ max{m0 , n0 }, we have |an − a|, |bn − b| < 2ϵ . Then, by the
triangle inequality
ϵ ϵ
|(an + bn ) − (a + b)| = |(an − a) + (bn − b)| ≤ |an − a| + |bn − b| < + =ϵ
2 2
holds for all n ≥ max{m0 , n0 }.
(ii) Let ϵ > 0 be arbitrary. By the definition of convergence, there exists n0 ∈ N such that
ϵ
n ≥ n0 =⇒ |an − a| < .
|λ|
(iii) Since (bn ) is a convergent sequence, it follows from Theorem 3.6 that there is some
C > 0 such that
|bn | ≤ C
holds for all n ∈ N.
118
Let ϵ > 0 be arbitrary. By the definition of convergence, there exist m0 , n0 ∈ N such
that
ϵ
n ≥ m0 =⇒ |an − a| <
2C
and
ϵ
n ≥ n0 =⇒ |bn − b| < .
2|a|
ϵ ϵ
In particular, for all n ≥ max{m0 , n0 }, we have both |an − a| < 2C and |bn − b| < 2|a| .
Then, by the triangle inequality,
|an bn − ab| = |an bn − abn + abn − ab| ≤ |an bn − abn | + |abn − ab|
= |bn ||an − a| + |a||bn − b|
≤ C|an − a| + |a||bn − b|
ϵ ϵ
<C + |a| = ϵ.
2C 2|a|
Example - We can use the previous theorem to calculate the limit of the sequence (an )n∈N
given by an = 1 + πn . Write an = bn + πcn with
1
bn = 1, ∀n ∈ N and cn = .
n
Note that both of the sequences (bn ) and (cn ) are convergent, with limits 1 and 0 respectively.
By applying the first and second points of Theorem 3.8, it follows that
119
Example - Consider the sequence (an )n∈N given by
3n2 + 4n + 100
an = √ .
7n2 + 13n + n
We can use Theorem 3.8 to calculate the value of limn→∞ an . Divide the numerator and
denominator by n2 to express an as
4 100
3+ n + n2
an = 13 1 .
7+ n + n3/2
Let
4 100 13 1
bn := 3 + + 2 , cn = 7 + + 3/2 .
n n n n
It follows from (parts (i) and (ii) of) Theorem 3.8 that the sequences (bn )n∈N and (cn )n∈N
are convergent with
lim bn = 3, lim cn = 7.
n→∞ n→∞
bn 3
lim an = lim = .
n→∞ n→∞ cn 7
In order to make more use of Theorem 3.8, we need to build up a bigger collection of simple
sequences for which we know that they are convergent and what they converge to. Let us
start with a lemma that shows how to verify that a sequence is a null sequence by comparison
with another null sequence.
Lemma 3.9. Let (an )n∈N be a complex-valued null sequence and let c, C > 0 be arbitrary
positive real numbers. Suppose that the sequence (bn )n∈N satisfies
|bn | ≤ C|an |c
for all but finitely many n ∈ N. Then (bn )n∈N is a null sequence.
bn = 10000|an |0.00001
1
where an = n and (an )n∈N is a null sequence.
Let us now consider other important “building blocks”, i.e., limits that may be considered
known from now on, together with the corresponding proofs. The first example is concerned
with powers of small complex bases.
120
Lemma 3.10. Let z ∈ C with |z| < 1. Then
lim z n = 0.
n→∞
Proof. Let
1
x := − 1.
|z|
Observe that x is a positive real number. Bernoulli’s Inequality (or even the Binomial The-
orem, see Corollary 1.43) implies that (1 + x)n ≥ 1 + nx holds for all n ∈ N. Therefore,
n
n n 1 1 1 1 1
|z | = |z| = = n
≤ < = · an ,
1+x (1 + x) 1 + nx nx x
where an = n1 . Note that (an )n∈N is a null sequence. An application of Lemma 3.9 (with
c = 1 and C = x1 > 0) implies that (z n )n∈N is also a null sequence.
Exercise - Let z ∈ C with |z| > 1. Show that the sequence (z n )n∈N is divergent. For what
z ∈ C with |z| = 1 is the sequence (z n )n∈N convergent?
The next limit will be a useful building block for establishing the convergence of other se-
quences later.
Lemma 3.11. √
n
lim n = 1.
n→∞
Proof. Define √
n
xn = n − 1.
We will show that (xn )n∈N is a null sequence. It then follows from Theorem 3.8(i) that
√
lim n n = lim (xn + 1) = 0 + 1 = 1.
n→∞ n→∞
To show that (xn ) is a null sequence, we use Corollary 1.43 with m = 2 to deduce that
n n 2 n(n − 1) 2
n = (1 + xn ) ≥ 1 + xn = 1 + xn .
2 2
We will now use Lemma 3.11 in combination with Theorem 3.8 to give a strengthened version
of the previous result in which we allow the argument of the root to grow even faster.
121
Proof. The proof is by induction on k. The base case k = 1 was established in Lemma 3.11.
Now, suppose that the result holds for k. We need to prove that
lim an = 1.
n→∞
where √
n
an = nk+1 .
√ √
n
Let bn = n n and cn = nk , and observe that an = bn · cn . Also, by Lemma 3.11 and the
induction hypothesis respectively, we have both
lim bn = 1
n→∞
and
lim cn = 1.
n→∞
The following result illustrates the fact that exponential growth is faster than polynomial
growth.
Lemma 3.13. Let z ∈ C with |z| > 1 and let k ∈ N be fixed. Then
nk
lim = 0.
n→∞ z n
Proof. Our plan for this proof is as follows: we will show that there is some constant C such
that
nk 1
n
≤C (65)
z n
k
holds for all but finitely many n ∈ N. It then follows from Lemma 3.9 that ( nz n ) is a null
sequence. It remains to prove (65) for almost all n ∈ N.
Set x := |z| − 1 > 0 and suppose that n ≥ 2k which is equivalent to n − k ≥ n2 . This
assumption on n is ok for us, since we are disregarding only finitely many values of n. Apply
Corollary 1.43 with this x to obtain
n n n
|z| = (1 + x) ≥ 1 + xk+1
k+1
n(n − 1) · · · (n − k) k+1
=1+ x
(k + 1)!
(n/2)k+1 k+1
≥ x
(k + 1)!
nk+1
= xk+1 .
(k + 1)!2k+1
122
It therefore follows that, for all n ≥ 2k,
nk nk k (k + 1)!2
k+1 1
= ≤ n · = · C,
zn |z|n xk+1 nk+1 n
k+1
where C = (k+1)!2
xk+1
is an absolute constant (i.e. it is independent of n). This proves (65),
and completes the proof of the lemma.
The next result gives a very helpful tool for the calculation of difficult limits. This one is
helpful when the sequence under consideration can be bounded from above and below by
sequences that converge to the same limit.
Theorem 3.14. (Sandwich Rule) Let (an )n∈N and (cn )n∈N be convergent real-valued se-
quences. Suppose that (bn )n∈N is a real-valued sequence and that there exists n0 ∈ N such
that
an ≤ bn ≤ cn ∀n ≥ n0 .
Suppose also that
lim an = L = lim cn .
n→∞ n→∞
Then (bn )n∈N is convergent and
lim bn = L.
n→∞
Our first application of the Sandwich Rule is to prove a variant of Lemma 3.11.
Lemma 3.15. Let a be a positive real number. Then
√
lim n a = 1.
n→∞
Proof. The lemma holds for trivial reasons if we set a = 1. For a > 1, observe that, for all
n≥a √ √
1 ≤ n a ≤ n n.
√
We have bounded the sequence ( n a)n∈N from above and below by two sequences which both
converge to 1 (by Lemma 3.11). It therefore follows from the Sandwich Rule that
√
lim n a = 1.
n→∞
123
Example - We give another application of the Sandwich Rule in this example. Let x, y > 0
be fixed real numbers and consider the sequence (bn )n∈N where
√
bn = n xn + y n .
We may assume (w.l.o.g.) that x ≥ y and seek to show that limn→∞ bn = x. Note that
√ √ √
n
√
n
√ √
n
x = n xn ≤ n xn + y n ≤ 2xn = 2 n xn = 2 · x.
√
Apply√the Sandwich Rule with an = x for all n and cn = n 2 · x. We know from Lemma 3.15
that n 2 → 1, and it then follows from Theorem 3.8(ii) that cn → x. The Sandwich Rule
then implies that
lim bn = x.
n→∞
(1 + sin n)n
lim = 0.
n→∞ n2n
This is a good illustration of the use of the Sandwich Rule, since this sequence is rather
complicated and it is not easy to observe a pattern by writing out the first few terms of the
sequence.
Observe that
(1 + sin n)n (1 + 1)n 1
0≤ n
≤ = .
n2 n2n n
1
Apply the Sandwich Rule with an = 0 and cn = n. Both of these sequences converge to 0,
n)n
and thus limn→∞ (1+sin
n2n = 0 also.
Exercise - Let k ∈ N be fixed. Use the Sandwich Rule to prove that
n!
lim = 1.
n→∞ (n − k)!nk
We conclude this section by giving an analogue of Theorem 3.8 for definitely divergent se-
quences.
Theorem 3.16. Let (an )n∈N and (bn )n∈N be real-valued sequences and let b, λ ∈ R. Then
124
Proof. This is left as an exercise.
In general, one needs to take a little more care with these kinds of rules for definitely divergent
sequences, and we do not always have a convenient rule to apply. For instance, there is no
easy rule for determining the limit of the sequence (an bn )n∈N for the case when an → ∞ and
bn → 0. If we set an = n and bn = 2−n then Lemma 3.13 informs us that an bn → 0. However,
if we set an = n2 and bn = n1 , then an bn = n → ∞.
125
3.3 Monotone sequences
∀n ∈ N, an+1 ≥ an ,
∀n ∈ N, an+1 ≤ an .
Note that, since the definition of monotonicity requires a notion of order, we do not have an
analogue of the definition above for complex-valued sequences.
Examples - Many of the sequences that we have discussed so far are monotone. For example
1
(and thus also strictly monotone). Any sequence of the form (nk )n∈N
n n∈N is decreasing
with k > 0 is increasing. If k < 0 then the sequence is decreasing, and if k = 0 then
the sequence is constant (and so both non-increasing and non-decreasing). The sequence
((−1)n )n∈N is not monotone.
For some sequences, a little more work is required to determine whether or not they are
monotone. In some cases, a helpful trick is to consider the quotients of consecutive terms
of a sequence and show that they are bounded (from above or below) by one. This works
because a sequence is increasing (for example) if and only if
an+1
>1
an
holds for all n ∈ N.
1 n n+1 n
an := 1 + =
n n
is non-decreasing.
126
Proof. We consider quotients of successive terms. We will show
an+1
≥1
an
holds for all n ∈ N. Observe that
n+1
n+2
(n + 2)n n+1 n + 1
an+1 n+1
= n+1 n
= ·
(n + 1)2
an n
n
n+1
(n + 1)2 − 1
n+1
= 2
·
(n + 1) n
n+1
1 n+1
= 1− 2
· .
(n + 1) n
1
An application of Bernoulli’s Inequality (Thm. 1.30) with x = − (n+1)2 ≥ −1 yields
n+1
an+1 1 n+1 1 n+1
= 1− · ≥ 1 − (n + 1) = 1.
an (n + 1)2 n (n + 1)2 n
In fact, if we were a little more careful in this proof, we could use a strict version of Bernoulli’s
Inequality that holds for x > −1 to prove that the given sequence is decreasing (and hence
strictly monotone).
The following result, whose statement and proof is similar to Lemma 3.18, will be used later.
Lemma 3.19. The sequence (bn )n∈N given by
1 n+1 n + 1 n+1
bn := 1 + =
n n
is non-increasing.
The following result shows that monotonicity is a very helpful property as we only need to
check if a monotone sequence is bounded in order to know whether it is convergent or not.
Note that boundedness of a sequence is usually much easier to show.
Theorem 3.20 (Monotonicity Principle). (i) If (an )n∈N is a non-decreasing sequence which
is bounded above, then
lim an = sup{an : n ∈ N} ∈ R.
n→∞
lim an = inf{an : n ∈ N} ∈ R.
n→∞
127
We use the notation sup(an ) as a shorthand for sup{an : n ∈ N}, and similarly inf(an ) =
inf{an : n ∈ N}.
Proof. (i) Suppose that (an )n∈N is non-decreasing sequence which is bounded above. This
means that the set {an : n ∈ N} is bounded above. It follows from the Completeness
Axiom (Axiom 1.38) that there exists t ∈ R such that
t = sup{an : n ∈ N}.
Now let ϵ > 0 be arbitrary. It follows from the definition of the supremum that t − ϵ is
not an upper bound for the set {an : n ∈ N}. In particular, there exists n0 ∈ N such
that an0 > t − ϵ.
However, since (an )n∈N is non-decreasing, it follows that for all n ≥ n0
(ii) The proof is very similar to part (i), and is left as an exercise.
(iii) We know from Theorem 3.6 that convergent sequences are bounded, which proves the
first direction of the implication (66). In order to prove (66), we need to prove the
reverse implication. This follows from the previous two parts of this theorem.
Indeed, suppose that (an )n∈N is a bounded sequence which is monotone. By definition
of monotonicity, the sequence is either non-decreasing or non-increasing. In the first of
these cases, we can use part (i) of this theorem to show that (an )n∈N is convergent and
limn→∞ an = t = sup(an ). Similarly, if (an )n∈N is non-increasing then the result follows
from part (ii).
Exercise - Let (an )n∈N be a non-decreasing real-valued sequence which is unbounded. Prove
that
lim an = ∞.
n→∞
With Theorem 3.20 we see that there are convergent sequences where we do not have to know
the limit to verify that it exists. In some cases, we may even define numbers just as limits
of specific sequences, because we do not have another (explicit) description. One typical
example is Euler’s number.
1 n n+1 n
an := 1 + =
n n
is convergent.
Proof. We have already proven in Lemma 3.18 that this sequence is monotone. To prove this
lemma, we will show that (an )n∈N is bounded. We are then done, by Theorem 3.20.
128
Since (an )n∈N is non-decreasing, it follows from Lemma 3.18 that, for all n ∈ N,
an ≥ a1 = 2.
It remains to establish an upper bound for an . For this, we recall the related sequence
(bn )n∈N , given by
1 n+1 n + 1 n+1
bn := 1 + = .
n n
Observe that, for all n ∈ N, an ≤ bn . We also know, from Lemma 3.19, that (bn )n∈N is
non-increasing, which implies that, for all n ∈ N,
an ≤ bn ≤ b1 = 4.
We have found both an upper and lower bound for an , which means that the sequence is
indeed bounded and the proof is complete.
If we take a little more care with the application of Theorem 3.20, we see that this shows
that the limit of the sequence given by
1 n n+1 n
an := 1 + =
n n
exists and equals sup{an : n ∈ N}. We define this limit to be Euler’s number, denoted e.
That is,
1 n 1 n
e := lim 1 + = sup 1 + .
n→∞ n n
129
3.4 Subsequences
The concepts of the last sections deal with sequences that converge or, in other words, con-
centrate around a single point. In some cases, however, divergent sequences may also have
some points of interest for very large n. An obvious example is ((−1)n )n∈N , which appears
to converge towards two different points. Now, we want to formalise the idea of sequences
having more than one limit.
an = (−1)n .
Two notable subsequences are given by taking the odd and even terms of the sequences. That
is, we can consider (n1 , n2 , . . . ) = (1, 3, . . . ) and (n1 , n2 , . . . ) = (2, 4, . . . ). These subsequences
are convergent (in fact, they are constant) with limit −1 and 1 respectively.
Exercise - Suppose that (an )n∈N is a sequence such that limn→ an = a. Show that any
subsequence (ank )k∈N satisfies
lim ank = a.
k→∞
lim ank = a.
k→∞
The accumulation points of the sequence ((−1)n )n∈N are −1 and 1. We can also consider
some more complicated examples.
Example - Consider the sequence defined by
(
1 if n is not prime
an = 1 .
n if n is prime
The accumulation points of this sequence are 0 and 1. The subsequence (an )n∈P , where P
denotes the set of all primes, converges to 0 and the subsequence (an )n∈P
/ is constant, always
taking value 1.
Next we want to show that each bounded sequence has at least one convergent subsequence.
This result bears the names of Bolzano and Weierstrass and is an important technical tool
for proofs in many areas of analysis.
Theorem 3.24 (Bolzano-Weierstrass Theorem). Let (an )n∈N be a bounded real-valued se-
quence. Then (an )n∈N has at least one convergent subsequence.
130
an
n
m
Case 1 - Suppose that there are finitely many peaks m1 < m2 < · · · < ml . Set n1 = ml + 1.
In particular, n1 is not a peak, and so there exists an integer n2 > n1 such that an2 ≥ an1 .
Also, n2 is not a peak, and so there exists n3 > n2 such that an3 ≥ an2 ≥ an1 . We can
continue this process to obtain a (infinite) non-decreasing sequence (an1 , an2 , . . . ) such that
an1 ≤ an2 ≤ an3 ≤ an4 . . . . This subsequence is also bounded, because of the assumption
that (an )n∈N is bounded. It therefore follows from the Monotonicity Principle (Theorem 3.20)
that it is convergent.
Case 2 - Suppose that there are no peaks. The proof is the same as that of Case 1, expect
that we set n1 = 1.
Case 3 - Suppose that there are infinitely many peaks m1 < m2 < . . . . Then the sequence
(am1 , am2 , . . . ) is decreasing. It is also bounded, because of the assumption that (an )n∈N is
bounded. It therefore follows again from the Monotonicity Principle (Theorem 3.20) that
this subsequence is convergent.
Example - We can use the Bolzano-Weierstrass Theorem to show that certain sequences
contain convergent subsequences even in cases when the convergent subsequences are rather
difficult to see. For instance, consider the sequence (an )n∈N given by
n · cos(3n2 − 5)
an = .
n+1
It is not easy to see a pattern in this sequence, with the values of an jumping around fairly
randomly, somewhere in the range (−1, 1). However, it is not difficult to check that the
sequence (an )n∈N is bounded, and so the Bolzano-Weierstrass Theorem tells us that there
exists a convergent subsequence.
Note that the converse of the Bolzano-Weierstrass Theorem does not hold, i.e. not every
sequence with a convergent subsequence is bounded. One may consider, for example, the
sequence (an )n∈N given by (
n if n is even
an = .
0 if n is odd
131
We finish this section with an extreme example, highlighting the potentially strange behaviour
of accumulation points.
Example Consider the sequence (an )n∈N , which is a list of all rational numbers in the interval
(0, 1). One may define this sequence more formally by constructing a bijection between
f : N → Q ∩ (0, 1) (we did this back in Chapter 1) and then setting an = f (n).
For every real number x ∈ (0, 1) it is possible to define an infinite sequence of rational numbers
which converge to x. Such a sequence can be constructed using part 3 of Theorem 1.39 (and
I encourage you to formally define such a sequence).
In other words, the set of accumulation points for this sequence is the whole interval (0, 1).
This is an uncountable and therefore somewhat “larger” than the actual range of the sequence!
132
3.5 Cauchy criterion
In this section we introduce Cauchy sequences and the Cauchy criterion for establishing the
convergence of a sequence. The Cauchy criterion is, similarly to the Monotonicity Principle
(Theorem 3.20), an important tool to verify that a sequence is convergent without knowing
its limit and it will be used several times throughout the remainder of the course.
You should compare this definition with the definition of convergence in order to gain better
understanding.
We now take a look at two familiar sequences and examine whether or not they are Cauchy.
Example - The sequence (an )n∈N given by
1
an =
n
is a Cauchy sequence. To see this, observe that, by the triangle inequality,
1 1 1 1 1 1
|an − am | = − ≤ + = + .
n m n m n m
It therefore follows that, for all ϵ > 0 and for all m, n > 2ϵ , we have |an − am | < ϵ.
Example - The sequence (an )n∈N given by
an = (−1)n
is a not Cauchy sequence. Suppose for a contradiction that it is Cauchy. Then, for all ϵ > 0,
there is some n0 ∈ N such that |an − am | < ϵ holds for all m, n ≥ n0 . But, if we set ϵ = 1, we
obtain a contradiction, since no matter which value we choose for n0 , we have
These two examples suggest that the property of being Cauchy may be similar to that of
being convergent (since we already know that the first sequence (1/n) is convergent, and that
((−1)n ) is not). Indeed, this intuition is correct, as the following important result shows.
Theorem 3.26 (Cauchy criterion). Let (an )n∈N be a real-valued sequence. Then
Before proving the Cauchy criterion, we prove a lemma that will be used in the proof.
Lemma 3.27. Let (an )n∈N be a complex-valued sequence which is Cauchy. Then (an )n∈N is
bounded.
133
Proof. Apply the definition of a Cauchy sequence with ϵ = 1. It follows that there is some
n0 ∈ N such that, for all m, n ≥ n0 ,
|an − am | < 1.
This gives a bound for all n ≥ n0 . It then follows that, for all n ∈ N, |an | ≤ C, where
Suppose that (an )n∈N is convergent with an → a. Let ϵ > 0 be arbitrary. By the definition
of convergence, there exists n0 ∈ N such that for all m, n ≥ n0 ,
ϵ
|am − a|, |an − a| < .
2
It follows from the triangle inequality that, for all m, n ≥ n0 ,
ϵ ϵ
|am − an | = |(am − a) + (a − an )| ≤ |am − a| + |a − an | < + = ϵ.
2 2
Next, we consider the opposite implication. Suppose that (an )n∈N is Cauchy. Lemma 3.27
implies that (an )n∈N is bounded. By the Bolzano-Weierstrass Theorem (Thm. 3.24), (an )n∈N
has at least one convergent subsequence.
Let (ank )n∈N be a subsequence of (an )n∈N such that
lim ank = a.
k→∞
Let ϵ > 0 be arbitrary. Since (ank )n∈N tends to a, it follows from the definition of convergence
that there is some k0 ∈ N such that, for all k ≥ k0 ,
ϵ
|ank − a| < .
2
Also, since (an )n∈N is Cauchy, it follows that there is some n0 such that, for all m, n ≥ n0 ,
ϵ
|an − am | < .
2
Let k ∈ N be any integer such that both k ≥ k0 and nk ≥ n0 hold. Then, by the triangle
inequality, we have that for all n ≥ n0
ϵ ϵ
|an − a| = |(an − ank ) + (ank − a)| ≤ |an − ank | + |ank − a| < + = ϵ.
2 2
134
The Cauchy criterion is particularly useful as a shortcut for proving that certain sequences
are convergent, since it is generally easier to verify that a sequence is Cauchy than it is to
show that it is convergent. In particular, we do not need to know what the limit of the
sequence is when checking that it is Cauchy.
Example - Let (an )n∈N be a recursively defined sequence with a1 = 1 and
(
an + 21n if n is prime
an+1 = .
an − 21n if n is not prime
If we write down the first few terms of this sequence, it is not immediately obvious that
the sequence is convergent, and it is even more tricky to identify a possible limit. However,
without knowing anything about the possible limit of the sequence, we can show that is it
Cauchy, as follows.
First, observe that, for all n ≥ 2,
1
|an+1 − an | = .
2n
Now let ϵ > 0 and suppose that m and n are both larger than n0 . We need make sure that
n0 is sufficiently large for the forthcoming proof, and we will specify the choice of n0 to make
the argument work.
Without loss of generality, we may assume that m > n ≥ n0 . By the triangle inequality,
|am − an | < ϵ.
2
Formally, we may define n0 := ⌈log2 ϵ ⌉.
In summary, we have shown that (an )n∈N is a Cauchy sequence, and thus by the Cauchy
criterion, it follows that the sequence is indeed convergent.
135
3.6 Introduction to series
In this section, we use sequences to build series. A series is really just a special kind of
sequence (sn )n∈N given by
X n
sn = ak ,
k=1
where (an )n∈N is another sequence. The sum of all terms of the sequence (an )n∈N , i.e. the
limit of the sequence (sn )n∈N , is one of the main motivations for considering limits at all, and
some interesting phenomena appear when it comes to the question of whether such limits
exist.
We begin with a more formal definition.
Definition 3.28. Let (an )n∈N be a complex-valued sequence and
n
X
sn = ak .
k=1
If the sequence (sn )n∈N converges with limn→∞ sn = s then we say that the series converges.
We call s the sum of the series, and write
∞
X
ak = s.
k=1
136
Lemma 3.29 (Geometric series). Let q ∈ C with |q| < 1. Then we have that
∞ ∞
X 1 X q
qk = , and qk = . (67)
1−q 1−q
k=0 k=1
Moreover, we have
n
X 1 − q n+1
qk = . (68)
1−q
k=0
Pn k
Proof. We will first prove (68), and then use it to prove (67). Let sn := k=0 q and consider
the equation
n
X
(1 − q)sn = (1 − q) qk
k=0
= (1 − q)(1 + q + q 2 + · · · + q n )
= (1 + q + q 2 + · · · + q n ) − (q + q 2 + q 3 + · · · + q n+1 )
= 1 − q n+1 .
In the last step, we have simply observed that all but the extreme terms in the two brackets
cancel out with one another. This proves (68).
To prove the first part of (67), we make use of some facts about convergence of sequences
that we established earlier in this chapter (namely Theorem 3.8 and Lemma 3.10) to see that
∞
X 1 − q n+1 1 1
q k = lim sn = lim = · lim (1 − q n+1 ) = .
n→∞ n→∞ 1 − q 1−q n→∞ 1−q
k=0
The second sum from (67) follows immediately from the first.
In the proof above, we considered two long sums and observed that almost all of the terms
cancelled out. Such arguments are called telescoping tricks and sums of this form are
called telescoping sums. This kind of trick will be used more throughout this chapter.
1
Example - If we set q = 2 in Lemma 3.29, it follows that
∞ ∞
X 1 X 1
= 2, and = 1.
2k 2k
k=0 k=1
The next example shows that not all convergent sequences give rise to convergent
series. This is a very important example of a divergent series.
Lemma 3.30 (Harmonic series). Consider the sequence (an )n∈N given by an = n1 . Then, the
corresponding series satisfies
∞
X 1
= ∞.
k
k=1
137
Proof. We need to show that the sequence of partial sums sn = nk=1 k1 tends to infinity. We
P
group the terms of the partial sum s2n and manipulate the sums as follows:
1 1 1 1 1 1 1 1 1 1
s2n = 1 + + + + + + + + ··· + + + · · · +
2 3 4 5 6 7 8 2n−1 + 1 2n−1 + 2 2n
1 1 1 1
≥ 1 + + 2 · + 4 · + · · · + 2n−1 · n
2 4 8 2
1 1 1 1
= 1 + + + ··· +
2 2 2 2
n
=1+ .
2
Since sn is an increasing sequence, it follows that, for all C ∈ R, there exists n0 = 2⌈2C⌉ such
that, for all n ≥ n0 we have
⌈2C⌉
sn ≥ sn0 = s2⌈2C⌉ ≥ 1 + > C.
2
Recalling the definition of the improper limit, we see that sn → ∞.
P∞ 1
The series
P∞ k=1 n = ∞ is called the harmonic series. By contrast with Lemma 3.30, the
series k=1 n−α converges for any α > 1, and so the harmonic series is something of a critical
example at which there is a change of behaviour.
Example - In this example, we discuss how the aforementioned telescoping trick can sometimes be a powerful tool for obtaining the precise value of apparently complicated series. We will prove that

∑_{k=1}^{∞} 1/(k(k + 1)) = 1.

We do not even know that this series is convergent yet. However, we first make the helpful observation that

1/(k(k + 1)) = 1/k − 1/(k + 1).   (69)

It therefore follows that

s_n = ∑_{k=1}^{n} 1/(k(k + 1)) = ∑_{k=1}^{n} (1/k − 1/(k + 1)) = ∑_{k=1}^{n} 1/k − ∑_{k=2}^{n+1} 1/k = 1 − 1/(n + 1).

Since 1/(n + 1) → 0, we conclude that s_n → 1, which proves the claim.
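A short Python check of the telescoping identity (a sketch using only the formula just derived):

def telescoping_partial_sum(n):
    # s_n = sum_{k=1}^{n} 1/(k*(k+1))
    return sum(1.0 / (k * (k + 1)) for k in range(1, n + 1))

for n in (1, 10, 100, 1000):
    print(n, telescoping_partial_sum(n), 1 - 1 / (n + 1))
# The two columns agree, confirming s_n = 1 - 1/(n+1), which tends to 1.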
3.7 Calculation rules and basic properties of series
In this section, we will use some of the theory that we have built for sequences to derive some basic properties of series. We begin with an analogue of Theorem 3.8.
Theorem 3.31. Let ∑_{k=1}^{∞} a_k and ∑_{k=1}^{∞} b_k be convergent series and let c ∈ C. Then

∑_{k=1}^{∞} (a_k + b_k) = ∑_{k=1}^{∞} a_k + ∑_{k=1}^{∞} b_k

and

∑_{k=1}^{∞} c · a_k = c · ∑_{k=1}^{∞} a_k.
The next result gives a useful application of the Monotonicity Principle (Theorem 3.20).
Theorem 3.32. Let (a_n)_{n∈N} be a real-valued sequence such that a_n ≥ 0 for all n ∈ N. Let s_n = ∑_{k=1}^{n} a_k. Then

(s_n)_{n∈N} is bounded  ⇐⇒  ∑_{k=1}^{∞} a_k converges.

Proof. Since a_k ≥ 0 for all k ∈ N, it follows that the sequence (s_n)_{n∈N} is non-decreasing. In particular, this sequence is monotone. It therefore follows from Theorem 3.20(iii) that (s_n)_{n∈N} converges if and only if it is bounded, which is exactly the claimed equivalence.
Example - For every k ∈ N we have k! ≥ 2^{k−1}, and therefore

∑_{k=1}^{n} 1/k! ≤ ∑_{k=1}^{n} 2^{1−k} ≤ 2, ∀n ∈ N.

By Theorem 3.32, the series ∑_{k=1}^{∞} 1/k! converges. Since ∑_{k=1}^{∞} 1/k! converges, it immediately follows that

∑_{k=0}^{∞} 1/k! = 1 + ∑_{k=1}^{∞} 1/k!

and so ∑_{k=0}^{∞} 1/k! also converges. Moreover, one can indeed show that

∑_{k=0}^{∞} 1/k! = e = lim_{n→∞} (1 + 1/n)^n.   (70)

We omit the proof of (70) here, but it will be considered on a forthcoming exercise sheet.
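Although the proof of (70) is postponed, the convergence is easy to observe numerically. The Python sketch below compares the partial sums of the series with math.e and with the sequence (1 + 1/n)^n:

import math

def exp_series_partial_sum(n):
    # s_n = sum_{k=0}^{n} 1/k!
    return sum(1.0 / math.factorial(k) for k in range(n + 1))

for n in (1, 5, 10, 15):
    print(n, exp_series_partial_sum(n), (1 + 1 / n) ** n)
print(math.e)  # the common limit, approximately 2.718281828
# The series reaches this value after only a handful of terms;
# (1 + 1/n)^n converges far more slowly.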
Next, we will apply the Cauchy criterion to the sequence of partial sums to obtain the
following result.
Theorem 3.33. Let (a_n)_{n∈N} be a complex-valued sequence. Then the series ∑_{k=1}^{∞} a_k converges if and only if

∀ϵ > 0, ∃ n0 ∈ N : ∀m > n > n0,  |∑_{k=n+1}^{m} a_k| < ϵ.   (71)
Proof. By definition, the series ∑_{k=1}^{∞} a_k is convergent if and only if the sequence of partial sums s_n = ∑_{k=1}^{n} a_k converges. By the Cauchy criterion, this sequence converges if and only if it is Cauchy.
But what does it mean for the sequence of partial sums to be Cauchy? By definition, this means that

∀ϵ > 0, ∃ n0 ∈ N : ∀m > n > n0, |s_m − s_n| < ϵ.   (72)

Observe that s_m − s_n = ∑_{k=1}^{m} a_k − ∑_{k=1}^{n} a_k = ∑_{k=n+1}^{m} a_k. Therefore, the statement (72) is equivalent to the statement (71), which completes the proof.
This theorem immediately leads to the following simple criterion. In many cases, this is
already enough to show that a series is divergent.
Corollary 3.34. Let (a_n)_{n∈N} be a sequence and suppose that the series ∑_{k=1}^{∞} a_k converges. Then

lim_{n→∞} a_n = 0.
In other words, a series can only be convergent if the corresponding sequence is a null sequence.
Proof. Suppose that the series ∑_{k=1}^{∞} a_k converges. Then it follows from Theorem 3.33 (setting m = n + 1 in (71)) that, for every ϵ > 0, there exists n0 ∈ N such that |a_{n+1}| < ϵ for all n > n0. This is precisely the statement that lim_{n→∞} a_n = 0.
Example - We can immediately use the previous corollary to see that, for any divergent sequence, or any convergent sequence which is not a null sequence, the corresponding series is not convergent. In particular, the series

∑_{k=1}^{∞} (−1)^k

is divergent.
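The failure of convergence can be seen directly: the partial sums oscillate forever between −1 and 0. A minimal Python illustration:

from itertools import accumulate

# Partial sums of sum_{k=1}^{n} (-1)**k for n = 1, ..., 10.
terms = [(-1) ** k for k in range(1, 11)]
print(list(accumulate(terms)))  # [-1, 0, -1, 0, ...] -- the sequence never settles down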
An important remark is that the converse of Corollary 3.34 does not hold. For instance, as we have shown in Lemma 3.30, the harmonic series is divergent with

∑_{k=1}^{∞} 1/k = ∞,

even though the corresponding sequence (1/n)_{n∈N} is a null sequence.
Definition 3.35. Let (a_n)_{n∈N} be a complex-valued sequence with the property that there exists C ∈ R such that

∑_{k=1}^{n} |a_k| ≤ C, ∀n ∈ N.

Then we say that the series ∑_{k=1}^{∞} a_k is absolutely convergent. By Theorem 3.32, this is equivalent to requiring that the series ∑_{k=1}^{∞} |a_k| converges.
Lemma 3.36. Let q ∈ C with |q| < 1. Then the series ∑_{k=1}^{∞} q^k is absolutely convergent.

Proof. We have

∑_{k=1}^{∞} |q^k| = ∑_{k=1}^{∞} |q|^k = |q|/(1 − |q|),

where we have just applied Lemma 3.29 again, with |q| in the role of q.
The series

∑_{k=1}^{∞} (−1)^k/k

is called the alternating harmonic series. This series is not absolutely convergent, since

∑_{k=1}^{∞} |(−1)^k/k| = ∑_{k=1}^{∞} 1/k

is the harmonic series, which is divergent. However, it turns out that the alternating harmonic series is convergent.
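Numerically, the convergence is visible, although slow. The limit is known to be −log 2 (for the signs chosen above); this value is not proved in these notes, and the Python sketch below is only an empirical check:

import math

def alt_harmonic_partial_sum(n):
    # s_n = sum_{k=1}^{n} (-1)**k / k
    return sum((-1) ** k / k for k in range(1, n + 1))

for n in (10, 100, 10000):
    print(n, alt_harmonic_partial_sum(n))
print(-math.log(2))  # the known limit, approximately -0.6931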
The next result shows that absolute convergence is indeed a stronger condition than “mere”
convergence.
Theorem 3.37. Every absolutely convergent series is convergent.

Proof. Let ∑_{k=1}^{∞} a_k be an absolutely convergent series and let ϵ > 0 be arbitrary. By Theorem 3.33 (applied to the convergent series ∑_{k=1}^{∞} |a_k|) and the triangle inequality, there exists n0 ∈ N such that, for all m > n ≥ n0,

|∑_{k=n+1}^{m} a_k| ≤ ∑_{k=n+1}^{m} |a_k| = |∑_{k=n+1}^{m} |a_k|| < ϵ.

It then follows from a second application of Theorem 3.33 that ∑_{k=1}^{∞} a_k is convergent.
3.8 Convergence tests
We will now discuss several criteria, called convergence tests, that can be used to verify whether a series is convergent or not. However, note that these tests are sometimes inconclusive, i.e., we do not always get a definite answer by applying them, and one may need to apply other techniques.
3.8.1 Comparison Test
The first result shows that we can obtain information about the convergence of a series by comparing it to a series that is known to be convergent (or not).
Theorem 3.38 (Comparison test). Let (a_k)_{k∈N} and (b_k)_{k∈N} be sequences.
1. If ∑_{k=1}^{∞} b_k is absolutely convergent and |a_k| ≤ |b_k| holds for all but finitely many k ∈ N, then ∑_{k=1}^{∞} a_k is also absolutely convergent.
2. If a_k ≥ b_k ≥ 0 holds for all k ∈ N and ∑_{k=1}^{∞} b_k is divergent, then ∑_{k=1}^{∞} a_k is also divergent.
Proof. 1. It follows from the hypothesis that there exists k0 ∈ N such that

|a_k| ≤ |b_k|, ∀ k ≥ k0.

Since ∑_{k=1}^{∞} b_k converges absolutely, it follows that the sequence of partial sums s_n = ∑_{k=1}^{n} |a_k| is bounded. Indeed, for any n ≥ k0, we have

∑_{k=1}^{n} |a_k| = ∑_{k=1}^{k0−1} |a_k| + ∑_{k=k0}^{n} |a_k|
                 ≤ ∑_{k=1}^{k0−1} |a_k| + ∑_{k=k0}^{n} |b_k|
                 ≤ ∑_{k=1}^{k0−1} |a_k| + ∑_{k=k0}^{∞} |b_k| ∈ R.

It then follows from Theorem 3.32 that ∑_{k=1}^{∞} |a_k| converges.
2. Let

s_n = ∑_{k=1}^{n} a_k,   t_n = ∑_{k=1}^{n} b_k

denote the sequences of partial sums of the two series. Since b_k ≥ 0 for all k and ∑_{k=1}^{∞} b_k diverges, it follows from Theorem 3.32 that the sequence (t_n) is not bounded. But also, s_n ≥ t_n for all n ∈ N, and so it must also be the case that the sequence (s_n) is not bounded. It then follows from the exercise after the proof of Theorem 3.20 that s_n → ∞.
Example - We will now use Theorem 3.38 to prove that the series ∑_{k=1}^{∞} 1/k^2 is (absolutely) convergent. For any k ∈ N, we have k + 1 ≤ 2k and thus

1/k^2 = (k + 1)/k · 1/(k(k + 1)) ≤ 2 · 1/(k(k + 1)),

that is,

1/k^2 ≤ 2/(k(k + 1)).

Moreover, by Theorem 3.31 and the telescoping example above,

∑_{k=1}^{∞} 2/(k(k + 1)) = 2 ∑_{k=1}^{∞} 1/(k(k + 1)) = 2.

Since the corresponding sequence consists of non-negative real numbers, it follows that ∑_{k=1}^{∞} 2/(k(k + 1)) is also absolutely convergent. Now apply the first part of Theorem 3.38 with a_k = 1/k^2 and b_k = 2/(k(k + 1)). It follows that the series ∑_{k=1}^{∞} 1/k^2 is absolutely convergent.
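As a numerical aside, the partial sums of ∑ 1/k^2 indeed stay below the bound 2 obtained above. The precise value of the sum is famously π^2/6 (the Basel problem), a fact we do not prove in these notes; the Python sketch below is only an empirical check:

import math

def partial_sum_inverse_squares(n):
    # s_n = sum_{k=1}^{n} 1/k**2
    return sum(1.0 / k**2 for k in range(1, n + 1))

for n in (10, 1000, 100000):
    print(n, partial_sum_inverse_squares(n))
print(math.pi**2 / 6)  # approximately 1.6449; every partial sum stays below 2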
3.8.2 Root Test
Theorem 3.39 (Root test). Let (a_k)_{k∈N} be a sequence.
1. If there exists some real number c < 1 such that

|a_k|^{1/k} ≤ c

holds for all but finitely many k ∈ N, then ∑_{k=1}^{∞} a_k is absolutely convergent.
2. Conversely, if

|a_k|^{1/k} ≥ 1

holds for infinitely many k ∈ N, then ∑_{k=1}^{∞} a_k is divergent.

Proof. 1. The series ∑_{k=1}^{∞} c^k is absolutely convergent (see Lemma 3.36). Also, by the hypothesis of the theorem, |a_k| ≤ c^k = |c^k| holds for all but finitely many k ∈ N. Therefore, by part 1 of Theorem 3.38, ∑_{k=1}^{∞} a_k is absolutely convergent.
2. It follows from the condition that |a_k| ≥ 1 for infinitely many k ∈ N. In particular, the sequence (a_k)_{k∈N} is not a null sequence. It then follows from Corollary 3.34 that the series ∑_{k=1}^{∞} a_k does not converge.
The root test is usually most helpful when the kth term of the corresponding sequence involves a kth power.
Example - Consider the series

∑_{k=1}^{∞} sin(k) · k^100 / 2^{k/2}.

This looks like a complicated series, and if we try to calculate the first few terms of the corresponding sequence, they are rather large! However, we can use the comparison test and the root test to prove that the series is convergent.
Note that, for all k ∈ N,

|sin(k) · k^100 / 2^{k/2}| ≤ k^100 / 2^{k/2}.
Therefore, by part 1 of Theorem 3.38, it will be sufficient to prove that the series

∑_{k=1}^{∞} k^100 / 2^{k/2}

is (absolutely) convergent. For this, we use the root test. Observe that

(k^100 / 2^{k/2})^{1/k} = (k^{1/k})^100 / 2^{1/2}.

Since k^{1/k} → 1 (as established earlier in this chapter), the left hand side converges to 1/√2 < 1. In particular, there is some c < 1 (say c = 0.9) such that (k^100 / 2^{k/2})^{1/k} ≤ c holds for all but finitely many k ∈ N, and the claim follows from part 1 of Theorem 3.39.
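The kth roots can also be inspected numerically. Because the terms themselves overflow floating point arithmetic for large k, the Python sketch below computes the roots via logarithms; they start out enormous but eventually drop below 1, approaching 1/√2 ≈ 0.707:

import math

# k-th root of k**100 / 2**(k/2), computed as exp((100*log(k) - (k/2)*log(2)) / k)
for k in (1, 10, 100, 1000, 5000, 100000):
    kth_root = math.exp((100 * math.log(k) - (k / 2) * math.log(2)) / k)
    print(k, kth_root)
print(2 ** -0.5)  # the limit 1/sqrt(2), approximately 0.7071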
Example - On the other hand, consider the series

∑_{k=1}^{∞} k^{k/4} / 3^{2+3k}.

We will use part 2 of the root test to show that this series is divergent. Since 3^{2+3k} = 9 · 27^k, we have

(k^{k/4} / 3^{2+3k})^{1/k} = k^{1/4} / (9^{1/k} · 27).

Recall from Lemma 3.15 that lim_{k→∞} 9^{1/k} = 1. Therefore, there is some k0 such that, for all k ≥ k0, we have 9^{1/k} ≤ 2. So, for all k ≥ k0,

(k^{k/4} / 3^{2+3k})^{1/k} ≥ k^{1/4} / 54.

We see that, for all k sufficiently large, the right hand side of the inequality above is at least 1. Indeed, for all k ≥ max{k0, 54^4}, we have

(k^{k/4} / 3^{2+3k})^{1/k} ≥ k^{1/4} / 54 ≥ 1.

It therefore follows from part 2 of Theorem 3.39 that the series is divergent.
The following corollary gives us a helpful repackaging of the root test.
Corollary 3.40. Let (a_k)_{k∈N} be a sequence such that the limit lim_{k→∞} |a_k|^{1/k} exists.
1. If

lim_{k→∞} |a_k|^{1/k} < 1,

then the series ∑_{k=1}^{∞} a_k is absolutely convergent.
2. If

lim_{k→∞} |a_k|^{1/k} > 1,

then the series ∑_{k=1}^{∞} a_k is divergent.
Proof. 1. Since

lim_{k→∞} |a_k|^{1/k} < 1,

it follows that there is some c < 1 such that |a_k|^{1/k} ≤ c holds for all k sufficiently large. It then follows from part 1 of Theorem 3.39 that the series ∑_{k=1}^{∞} a_k is absolutely convergent.
2. If

lim_{k→∞} |a_k|^{1/k} > 1,

then it follows that |a_k|^{1/k} > 1 holds for all k sufficiently large. Part 2 of Theorem 3.39 then implies that the series ∑_{k=1}^{∞} a_k is divergent.
3.8.3 Ratio Test
Theorem 3.41 (Ratio test). Let (a_k)_{k∈N} be a sequence.
1. If there exists some real number c < 1 such that, for all but finitely many k ∈ N,

a_k ≠ 0,  and  |a_{k+1}/a_k| ≤ c,

then ∑_{k=1}^{∞} a_k is absolutely convergent.
2. If, for all but finitely many k ∈ N,

a_k ≠ 0,  and  |a_{k+1}/a_k| ≥ 1,

then ∑_{k=1}^{∞} a_k is divergent.

Proof. 1. By the hypothesis, there exists k0 ∈ N such that |a_{k+1}| ≤ c|a_k| for all k ≥ k0. Iterating this inequality gives, for all m ≥ 0,

|a_{k0+m}| ≤ c^m |a_{k0}|.

Let

b_n := c^n · |a_{k0}|/c^{k0},

so that |a_k| ≤ b_k for all k ≥ k0. The series ∑_{k=1}^{∞} b_k is absolutely convergent; this follows from Theorem 3.31 and Lemma 3.36. Therefore, by part 1 of Theorem 3.38, ∑_{k=1}^{∞} a_k is absolutely convergent.
2. In this case there exists k0 ∈ N such that |a_{k+1}| ≥ |a_k| for all k ≥ k0. Also, for all k ≥ k0,

|a_k| ≥ |a_{k0}| > 0.

In particular, (a_k)_{k∈N} is not a null sequence, and so ∑_{k=1}^{∞} a_k is divergent by Corollary 3.34.
Corollary 3.42. Let (a_k)_{k∈N} be a sequence.
1. If

lim_{k→∞} |a_{k+1}/a_k| < 1,

then the series ∑_{k=1}^{∞} a_k is absolutely convergent.
2. If

lim_{k→∞} |a_{k+1}/a_k| > 1,

then the series ∑_{k=1}^{∞} a_k is divergent.
Example - We can use the ratio test to prove that the series ∑_{k=1}^{∞} a_k with

a_k = k^3/k!

is convergent. Indeed, for all k ∈ N,

|a_{k+1}/a_k| = (k + 1)^3/(k + 1)! · k!/k^3 = 1/(k + 1) · ((k + 1)/k)^3 ≤ 8/(k + 1).

The last inequality is just an application of the fact that (k + 1)/k ≤ 2 holds for all k ∈ N. It therefore follows that, for all k ≥ 15, we have

|a_{k+1}/a_k| ≤ 1/2.

The ratio test implies that the series ∑_{k=1}^{∞} k^3/k! is convergent.
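The rapid decay that the ratio test detects is easy to observe; here is a small Python sketch computing the ratios and a partial sum (the printed partial sums stabilise quickly, at roughly 13.59):

import math

def a(k):
    # a_k = k**3 / k!
    return k**3 / math.factorial(k)

for k in (1, 5, 10, 15):
    print(k, a(k + 1) / a(k))  # the ratios tend to 0 (certainly below 1/2 for k >= 15)

print(sum(a(k) for k in range(1, 31)))  # the partial sums settle down very quickly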
Example - As was hinted at earlier, the ratio test does not always provide a definite answer regarding the convergence or divergence of a series. To see this, let's try and apply the ratio test to the series ∑_{n=1}^{∞} 1/n (which we already know is divergent, see Lemma 3.30). Observe that

|a_{k+1}/a_k| = k/(k + 1).

Since the sequence k/(k + 1) tends to 1, it follows that neither of the two conditions in Theorem 3.41 holds, and we do not get any information about the series ∑_{n=1}^{∞} 1/n by this method.
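Numerically, one sees why the test is silent here; the ratios creep up towards 1 from below, while the partial sums nevertheless drift off to infinity:

# Ratios a_{k+1}/a_k = k/(k+1) for the harmonic series approach 1 from below...
for k in (1, 10, 100, 1000):
    print(k, k / (k + 1))
# ...so neither condition of the ratio test ever applies, even though the
# series diverges (Lemma 3.30).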