Math Data
in Data Science
These lecture notes were written in the Summer Semester 2022, when the authors
gave the class “Mathematische Grundlagen der Datenanalyse” (Mathematical Foundations of Data Analysis) at the University of Osnabrück. The notes are intended for a course at an advanced Bachelor level. A main
theme of these notes is studying the geometry of problems in data science.
Contents
1 The Basics
  1.1 Linear Algebra
  1.2 Probability Theory
2 Network Analysis
  2.1 Graphs and the Laplace Matrix
  2.2 The Spectrum of a Graph
  2.3 Markov Processes in Networks
  2.4 Centrality Measures
3 Machine Learning
  3.1 Data, Models, and Learning
  3.2 Nonlinear Regression and Neural Networks
  3.3 Support Vector Machines
  3.4 Principal Component Analysis
Bibliography
1 The Basics
Many mathematical methods in data analysis rely on linear algebra and probability. In
the first two lectures we will recall basic concepts from these fields.
Let A ∈ Rm×n be a matrix with columns a1 , . . . , an ∈ Rm ; we write
A = [a1 , . . . , an ] .
For x = (x1 , . . . , xn )T ∈ Rn ,
Ax = x1 a1 + · · · + xn an
is a linear combination of the columns of A. Other interpretations of A are
1. a list of n vectors in Rm
2. a list of m vectors in Rn
3. a linear map Rn → Rm given by x 7→ Ax
4. a linear map Rm → Rn given by y 7→ AT y
5. a bilinear map Rn × Rm → R given by (x, y) 7→ yT Ax.
All of these viewpoints are best understood by considering four subspaces (two sub-
spaces of Rn and two of Rm ).
Definition 1.1 (Four Subspaces). Let A ∈ Rm×n . The image and kernel of A and AT are
1. Im(A) := {Ax | x ∈ Rn } ⊆ Rm ,
2. Im(AT ) := {AT y | y ∈ Rm } ⊆ Rn ,
3. ker(A) := {x ∈ Rn | Ax = 0} ⊆ Rn ,
4. ker(AT ) := {y ∈ Rm | AT y = 0} ⊆ Rm
We give the R-vector spaces Rn and Rm the structure of a Euclidean space by defining
the positive definite form
⟨a, b⟩ := aT b.
This form is called the Euclidean inner product. Notice that we have
AT y = (⟨a1 , y⟩, . . . , ⟨an , y⟩)T .
Lemma 1.3. Let U,V ⊆ Rn be two subspaces. Then, we have U = V ⊥ , if and only
if U ⊥ V and U ⊕V = Rn .
Figure 1.1: The meaning of the inner product between u and v is illustrated in this picture: let t ∈ R such
that v = tu + u′ , where u′ is orthogonal to u. Then, ⟨u, v⟩ = uT (tu + u′ ) = t uT u = t⟨u, u⟩. In particular, if
⟨u, u⟩ = ⟨v, v⟩ = 1, then t = ⟨u, v⟩ is the cosine of the angle between u and v.
dim(Im(AT )) + dim(ker(AT )) = m.
Therefore
dim(Im(A)) + dim(ker(AT )) = m.
Moreover, for y ∈ ker(AT ) and Ax ∈ Im(A) we have ⟨Ax, y⟩ = xT AT y = 0, so that Im(A) ⊥ ker(AT ).
Thus Im(A) ⊕ ker(AT ) = Rm . The statement of (1) follows now using Lemma 1.3. The
proof of (2) follows similarly.
We now want to understand the solution of the system of linear equations Ax = b in the
context of Theorem 1.4. Namely, let b ∈ Im(A) and let r = dim(Im(A)) = dim(Im(AT )).
First, we observe that Ax = b has a solution x ∈ Rn , if and only if b ∈ Im(A). Suppose
that x is such a solution. This situation is depicted in Fig. 1.2. From Theorem 1.4 we
Figure 1.2: The situation when b ∈ Im(A): in this case, Ax = b has a unique solution x ∈ Im(AT ) and the
solution space for Ax = b is x + ker(A).
know that Im(AT ) ⊕ ker(A) = Rn . So, there exist uniquely determined x0 ∈ Im(AT ) and
x1 ∈ ker(A) with x = x0 + x1 , and we have Ax0 = Ax0 + Ax1 = Ax = b.
Proof. (2.⇒1.) This direction follows because e ∈ ker(AT ). This also shows that b0 is
uniquely determined.
Figure 1.3: Visualization of the proof of Lemma 1.5: b0 minimizes the distance from b to Im(A).
So now it suffices to prove (2). Let A = [a1 , . . . , an ]. Since b0 ∈ Im(A), we can write b0 = Ax0
for some x0 . Define the map φ(x) = Ax − b and write its components as φ = [φ1 , . . . , φm ]T .
Then, we minimize the scalar function ∥φ(x)∥ by setting its derivative equal to zero. Namely, we want to compute when the gradient
d/dx ∥φ(x)∥ = [ ∂/∂x1 ∥φ(x)∥ , . . . , ∂/∂xn ∥φ(x)∥ ] ∈ Rn
is equal to zero. We compute
∂/∂xi ∥φ(x)∥ = (1/(2∥φ(x)∥)) ∑_{j=1}^m 2 φj(x) · ∂φj/∂xi (x) = (1/∥φ(x)∥) aiT (Ax − b).
If ∥φ(x0 )∥ = 0, then b0 = b and we are done. Otherwise, x0 must satisfy
AT (Ax0 − b) = 0. This implies AT Ax0 = AT b, and so AT b0 = AT b.
Lemma 1.5 implies that the map which projects b to the point on Im(A) minimizing
the distance to b in the Euclidean norm is linear: call it ΠIm(A) . Furthermore, recall that
A restricted to Im(AT ) is a linear isomorphism, hence invertible. Consequently, we have
a well-defined linear map (A|Im(AT ) )−1 ◦ ΠIm(A) : Rm → Rn , shown in Fig. 1.4. The matrix
representation of this linear map is called the pseudoinverse of A.
Definition 1.6. Let A ∈ Rm×n . The pseudoinverse A† ∈ Rn×m is the matrix such that
A† b = x,
where x ∈ Im(AT ) is the unique point with Ax = ΠIm(A) (b), for every b ∈ Rm .
Figure 1.4: The pseudoinverse A† ∈ Rn×m of A ∈ Rm×n first orthogonally projects b ∈ Rm to b0 ∈ Im(A)
and then maps b0 to the unique point x ∈ Im(AT ) with Ax = b0 .
Let us first notice two properties of the pseudoinverse, which follow immediately
from the definition.
In the case when A ∈ Rm×n has full rank, which means that r(A) = min{m, n}, the
pseudoinverse has the following properties.
AT AA† b = AT b.
A† b = (AT A)−1 AT b.
This shows A† = (AT A)−1 AT and it also shows A† A = (AT A)−1 AT A = 1n . For the
second part, see Exercise 1.3.
In closing of this lecture we want to discuss an important choice of bases for Im(A)
and Im(AT ). For this we first prove a fundamental result about eigenvectors and eigen-
values of symmetric matrices, the spectral theorem. Recall that a matrix A ∈ Rn×n is
called symmetric, if A = AT .
Theorem 1.9 (The spectral theorem). Let A ∈ Rn×n be symmetric. Then, A has only real
eigenvalues and there is a basis {v1 , . . . , vn } of eigenvectors of A such that ⟨vi , v j ⟩ = δi, j
(such a basis is called an orthonormal basis).
Proof. For the proof we extend the Euclidean inner product to Cn by setting
⟨a, b⟩ := āT b for a, b ∈ Cn , where ā denotes the componentwise complex conjugate of a.
Let fA (t) = det(A −t1n ) be the characteristic polynomial of A. It has at least one zero
in C, which shows that A has at least one (possibly complex) eigenvalue.
If the only eigenvalue of A is zero, we can take an orthonormal basis of ker A to prove
the statement. Otherwise, let λ ∈ C be an eigenvalue of A and v ∈ Cn be a corresponding
eigenvector with ⟨v, v⟩ = 1. Then, we have λ = ⟨v, Av⟩ = ⟨Av, v⟩ = λ̄ , hence λ ∈ R. This
shows that we can take v ∈ Rn , and moreover that A has only real eigenvalues.
Let now (v, λ ) ∈ Rn × R be an eigenpair of A with ⟨v, v⟩ = 1 and λ ̸= 0. We consider
the subspace U := (Rv)⊥ . For all w ∈ U we have
Let us now come back to the case where A ∈ Rm×n . While A is not necessarily sym-
metric, the matrix AT A ∈ Rn×n is symmetric. By Theorem 1.9, AT A has only real eigen-
values and an orthonormal basis of eigenvectors {v1 , . . . , vn }. Let λi be the eigenvalue
corresponding to vi ; i.e., AT Avi = λi vi for 1 ≤ i ≤ n. We have λi = λi ⟨vi , vi ⟩ = ⟨vi , AT Avi ⟩ = ⟨Avi , Avi ⟩ ≥ 0;
the matrix AT A is thus positive semidefinite. We can assume that λ1 ≥ · · · ≥ λr > 0 and
λr+1 = · · · = λn = 0 for r = r(A) being the rank of A. Setting σi := √λi and ui := Avi /σi for 1 ≤ i ≤ r,
as well as U := [u1 , . . . , ur ], V := [v1 , . . . , vr ] and Σ := diag(σ1 , . . . , σr ), we then have
A = UΣV T .
Theorem 1.10. Let A ∈ Rm×n and r = r(A). Then, there exist matrices U ∈ Rm×r and
V ∈ Rn×r with U T U = V T V = 1r and uniquely determined numbers σ1 , . . . , σr > 0 such
that
A = UΣV T , Σ = diag(σ1 , . . . , σr ).
We have Im(A) = Im(U) and Im(AT ) = Im(V ). If the σi are pairwise distinct and
ordered σ1 > . . . > σr > 0, the matrices U and V are uniquely determined up to the
signs of their columns.
Proof. Existence of the SVD and Im(A) = Im(U) and Im(AT ) = Im(V ) follow from the
discussion above. We have to show uniqueness of singular values, and in the case when
the singular values are pairwise distinct uniqueness of U and V (up to sign). Suppose
A = UΣV T = Ũ Σ̃Ṽ T
are two SVDs of A with Σ = diag(σ1 , . . . , σr ) and Σ̃ = diag(σ̃1 , . . . , σ̃r ). Then, we have AAT = UΣ2U T = Ũ Σ̃2Ũ T .
Using that r = r(A) = r(AAT ) we conclude that both σ1 , . . . , σr and σ̃1 , . . . , σ̃r are the
nonzero eigenvalues of AAT . Since eigenvalues are unique, σi = σ̃i , i = 1, . . . , r. There-
fore, the singular values are uniquely determined.
Let us now assume that the σi are pairwise distinct. Then, since every σi is positive,
also the σi2 are pairwise distinct for i = 1, . . . , r. This means that the nonzero eigenvalues
of AAT are all simple, which implies that the unit eigenvector of σi2 is unique up to sign, hence
ui = ±ũi . Repeating the same argument for AT A shows that the columns of V and Ṽ
also coincide up to sign.
An alternative definition of the SVD is A = USV T for U ∈ Rm×k and V ∈ Rn×k with
k = min{m, n} and U T U = V T V = 1k , and S = diag(σ1 , . . . , σr , 0, . . . , 0). This is some-
times called the non-compact SVD, while the decomposition in Theorem 1.10 is called
compact SVD. The difference between the two is that the compact SVD involves or-
thonormal bases of Im(A) and Im(AT ), while the non-compact SVD appends orthonor-
mal vectors from ker(AT ) and ker(A). The way one should think about the SVD (com-
pact or non-compact) is that it provides particular orthonormal bases that reveal essential
information about the matrix A.
The final result of this lecture is the connection between SVD and pseudoinverse.
Lemma 1.11. Let A ∈ Rm×n and A = UΣV T be the SVD of A as in Theorem 1.10. Then,
A† = V Σ−1U T .
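The identity in Lemma 1.11 is easy to check numerically. The following is a minimal Julia sketch (not part of the original notes; it only assumes the LinearAlgebra standard library): it forms V Σ−1U T from the compact SVD of a small example matrix and compares it to the built-in pinv.

using LinearAlgebra

# Sketch: check A† = V Σ⁻¹ Uᵀ (Lemma 1.11) on a small example matrix.
A = [1.0 2.0 3.0;
     4.0 5.0 6.0]                      # a 2×3 matrix of rank r = 2

F = svd(A)                             # thin SVD: F.U, F.S, F.V
r = rank(A)
U, σ, V = F.U[:, 1:r], F.S[1:r], F.V[:, 1:r]

A_dagger = V * Diagonal(1 ./ σ) * U'   # V Σ⁻¹ Uᵀ

@show norm(A_dagger - pinv(A))         # ≈ 0 up to rounding errors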
Exercise 1.4. Let A ∈ Rn×n be symmetric. By Theorem 1.9, there exists an orthonormal
basis {v1 , . . . , vn } of eigenvectors of A. Let λ1 ≥ · · · ≥ λn be the corresponding eigenval-
ues. For every 1 ≤ i ≤ n set Ui := span{v1 , . . . , vi } and Vi := span{vi+1 , . . . , vn } = Ui⊥ ,
and V0 = Rn . Show that
λi = max_{u∈Vi−1 \{0}} ⟨u, Au⟩/⟨u, u⟩ = min_{u∈Ui \{0}} ⟨u, Au⟩/⟨u, u⟩.
(a) Compute by hand a singular value decomposition UΣV T and the pseudoinverse A†
of A.
(b) Now try to do the same using the LinearAlgebra library in Julia [BEKS17]
(or any other numerical linear algebra implementation). Do you get what you
expected? What happens if you compare the pseudoinverse obtained via the com-
mand pinv to the one obtained by taking V Σ−1U T ?
Definition 1.12. Let Ω be a nonempty set and A ⊂ 2Ω be a subset of the power set
of Ω. We call A a σ -algebra, if it satisfies the following properties
1. Ω ∈ A ;
2. if A ∈ A , then Ω \ A ∈ A ;
3. if An ∈ A for all n ∈ N, then ⋃_{n∈N} An ∈ A .
Every set A ∈ A is called an event, Ω is called the space of events, and P(A) is the
probability of A. The map P is called a (probability) distribution.
The restriction that A is a σ -algebra is crucial: without this assumption a probability
might not even exist. However, if Ω is discrete or even finite we can always take A = 2Ω
as σ -algebra. In the case Ω = R we have the Borel σ -algebra. This is the smallest σ -
algebra (by inclusion) that contains every interval in R.
Definition 1.14. Let A be the Borel σ -algebra in R. We call a function g : R → R
measurable, if for all A ∈ A we have g−1 (A) ∈ A .
Example 1.15. Let Ω = {0, 1} and A = {∅, {0}, {1}, Ω} = 2Ω . Suppose P({1}) = p.
Then, we have
P({0}) = P(Ω) − P({1}) = 1 − p.
This probability distribution is called Bernoulli distribution with parameter p. It mod-
els the probability of an experiment with two outcomes.
Often Ω is complicated, but at the same time we don’t need to know all information
about events in Ω, just some particular pieces of information. This motivates the
definition of random variables.
Definition 1.16. A random variable X is a map X : (Ω′ , A ′ , P′ ) → (Ω, A , P) between
probability spaces, such that for all events A ∈ A it holds that X −1 (A) ∈ A ′ and P(A) = P′ (X −1 (A)).
P(A | B) = P(A).
We call two continuous (resp. discrete) real random variables X and Y independent, if
P(X ∈ A, Y ∈ B) = P(X ∈ A) · P(Y ∈ B) for all events A, B ∈ A . We say that a sequence of continuous (resp. discrete) real random
variables (Xn )n∈N are independent and identically distributed (abbreviated as i.i.d.),
if the Xn are pairwise independent and all have the same probability distribution.
Again recall that all random variables have a probability distribution, but not all random
variables have densities.
The interpretation of a probability density is that f (x) measures the “infinitesimal
probability” of x ∈ R. We will often denote the probability density by PX (x) := f (x)
or P(x) := f (x). The only time this becomes confusing is when we have the singleton
A = {x0 }, in which case the probability of any single event occurring for a continuous
random variable is always zero:
P({x0 }) = ∫_{x0} P(x) dx = ∫_{x0} f (x) dx = 0,
To see that the right-hand side in Definition 1.23 is indeed a density observe that
PX (x) = ∫_{Rm} P(X,Y ) (x, y) dy,   (1.2.1)
Theorem 1.24 (Bayes’ theorem for densities). Let (X,Y ) ∈ Rn × Rm be a random vari-
able with a probability density P(X,Y ) and x ∈ Rn and y ∈ Rm . Then:
PY |X=x (y) = PX|Y =y (x) · PY (y)/PX (x).
Proof. This follows immediately from Definition 1.23.
Definition 1.25. Let X ∈ {x1 , x2 , . . .} be a discrete real random variable. The expected
value of X is
E X := ∑_{i=1}^∞ xi · P(X = xi ).
If X ∈ R is a continuous real random variable with a density P its expected value is
E X := ∫_R x · P(x) dx.
Lemma 1.26 (Linearity of the expected value). Let X and Y be two real random vari-
ables with finite expected values: E X, EY < ∞. Then, for all a, b ∈ R we have
E(aX + bY ) = a E X + b EY.
Proof. See, e.g., [Ash70, Section 3.3]. See also Exercise 1.7.
The continuous case requires some ideas from measure theory, which we skip here. We
refer to [Ash70, Section 3, Theorem 2] for a proof.
Lemma 1.27 implies the following expressions for covariance of random variables
(X,Y ) ∈ R2 with joint density P:
Cov(X,Y ) = ∫_{R^2} xy P(x, y) d(x, y) − E X · E Y.
Lemma 1.28. Let X ∈ Rn and Y ∈ Rm be random variables, and suppose that Y has a
density PY (y) and that (X | Y ) has a density PX|Y =y (x). Then,
EX X = EY EX|Y =y X.
Proof. By Eq. (1.2.1), the density of X is given by PX (x) = ∫_{Rm} PX|Y =y (x) PY (y) dy. This
implies
EX X = ∫_{Rn} x · PX (x) dx
     = ∫_{Rn} ∫_{Rm} x · PX|Y =y (x) · PY (y) dy dx
     = ∫_{Rm} ( ∫_{Rn} x · PX|Y =y (x) dx ) · PY (y) dy
     = EY EX|Y =y X.
Example 1.29. The following list of random variables describes important distributions.
1. Bernoulli distribution: X ∈ {0, 1} and P(X = 1) = p.
We write X ∼ Ber(p).
2. Binomial distribution: X ∈ {0, . . . , n} and P(X = k) = (n choose k) p^k (1 − p)^{n−k} .
Φ(x | µ, Σ) = (1/√((2π)^n det(Σ))) · exp(−½ (x − µ)T Σ−1 (x − µ))   (1.2.2)
EX = µ and Var(X) = σ 2 .
Cov(Yi ,Y j ) = Σi, j
for all 1 ≤ i, j ≤ n.
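As a small illustration of Eq. (1.2.2) (a sketch added here, not from the original notes; the helper gaussian_density and the values of µ, Σ and x are arbitrary), the density of a multivariate Gaussian can be evaluated directly in Julia:

using LinearAlgebra

# Sketch: evaluate Φ(x | µ, Σ) from Eq. (1.2.2) directly.
function gaussian_density(x, μ, Σ)
    n = length(x)
    d = x .- μ
    return exp(-0.5 * dot(d, Σ \ d)) / sqrt((2π)^n * det(Σ))
end

μ = [0.0, 1.0]
Σ = [2.0 0.5;
     0.5 1.0]
x = [0.3, 0.7]
@show gaussian_density(x, μ, Σ)
# For comparison one could use the Distributions.jl package:
# using Distributions; pdf(MvNormal(μ, Σ), x)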
Lemma 1.31. Let X ∼ N(µ, Σ) and Y ∼ N(ν, S) be independent Gaussian random vari-
ables in Rn , and let A, B ∈ Rm×n and b ∈ Rm . Then, AX + b ∼ N(Aµ + b, AΣAT ) and AX + BY ∼ N(Aµ + Bν, AΣAT + BSBT ).
Exercise 1.7. Prove Lemma 1.26 (for the continuous case you can assume that (X,Y )
has a joint density). Hint: Use Lemma 1.27 for g(X,Y ) = X + Y and that for every
random variable E |X| < ∞ if and only if E X < ∞ (see [Ash70, Eq. (3.1.7)]).
Exercise 1.10. Prove Lemma 1.31. Hint: Prove first the case B = 0 by computing the
density of AX + b. Then, use that
AX + BY = [A B] (X; Y),
where (X; Y) ∈ R2n denotes the vector obtained by stacking X and Y.
Exercise 1.11. The element Caesium-137 has a half-life of about 30.17 years. In other
words, a single atom of Caesium-137 has a 50 percent chance of surviving after 30.17
years, a 25 percent chance of surviving after 60.34 years, and so on.
(a) Determine the probability that a single atom of Caesium-137 decays (i.e., does
not survive) after a single day. How would you model the random variable X that
takes the value 1 when the atom decays and 0 otherwise?
(b) Using Julia, simulate 1000 times the behaviour of a collection C of 10^6 Caesium-
137 atoms in a single day. How would you model the following random variable?
P(Z = k) = (λ^k / k!) e^{−λ} .
Plot the Poisson distribution with λ = 10^6 · p, where p is the probability computed
in part (a).
(d) Compare the empirical distribution in part (b) to the theoretical distribution in (c).
Some Julia packages that might be useful: Distributions, StatsPlots.
2 Network Analysis
After the preliminaries we will now start the first chapter on mathematical methods in
data science. Our first goal is to analyze structures of networks using spectral methods.
We will mostly follow the book by Chung [Chu97], and the lecture notes by Guruswami
and Kannan [GK12], and by Sauerwald and Sun [SS11]. For more context we also
recommend [Chu10].
E ⊆ {{i, j} : i, j ∈ V, i ≠ j}.
The adjacency matrix A(G) = (Auv )u,v∈V , with Auv = 1 if {u, v} ∈ E and Auv = 0 otherwise, can be understood as a data structure for a graph.
Given a vertex v ∈ V , the degree of v is the number of vertices adjacent to v, denoted deg(v).
In the following, we will only consider graphs G = (V, E) that have no isolated ver-
tices; i.e., we assume
deg(v) > 0 for all v ∈ V.
Isolated vertices do not contribute to the network structure we want to analyze, which is
why we want to ignore them. Detecting isolated vertices from the adjacency matrix A(G)
is straightforward, so we can remove the corresponding columns and rows from A(G) immediately.
Remark 2.2. The notation {i, j} is used to denote an unordered set, so in particular
{i, j} = { j, i} which means we will be working with simple and undirected graphs.
Example 2.3. Consider G = (V, E) for V = {1, 2, 3} and E = {{1, 2}, {1, 3}}: vertex 1 is adjacent to both 2 and 3. Its adjacency matrix is
A(G) = [ 0 1 1 ; 1 0 0 ; 1 0 0 ].
Definition 2.6. A graph G = (V, E) is said to be bipartite if the vertex set V can be
subdivided into two disjoint subsets V1 and V2 so that every edge in E has an endpoint
in V1 and the other in V2 . If moreover every possible edge between V1 and V2 is present,
G is a complete bipartite graph (note that there are several possible complete bipartite
graphs on the same vertex set V ).
P = (v0 , v1 , . . . , vD ),
such that {vi−1 , vi } ∈ E for all 1 ≤ i ≤ D. In this case, we say that P is a walk from v0
to vD of length D. The edges of P are E(P) := {{vi−1 , vi } | 1 ≤ i ≤ D}.
Lemma 2.8. Let A be the adjacency matrix of a graph G = (V, E), and let v, w ∈ V .
Then the number of walks from v to w of length k is given by (Ak )v,w .
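As a quick illustration (a sketch, not from the original notes), Lemma 2.8 can be checked in Julia by taking powers of the adjacency matrix of the graph from Example 2.3:

# Sketch: count walks via powers of the adjacency matrix (Lemma 2.8).
A = [0 1 1;
     1 0 0;
     1 0 0]          # adjacency matrix of V = {1, 2, 3}, E = {{1, 2}, {1, 3}}

k = 3
Ak = A^k
println("number of walks of length $k from 1 to 2: ", Ak[1, 2])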
Next, we introduce the Laplace matrix or Laplacian of a graph G. We will see that
its eigenvalues provide essential information about the network structure of G.
where
ℓij = 1 if i = j,   ℓij = −1/√(deg(i) deg(j)) if i ≠ j and {i, j} ∈ E,   and ℓij = 0 otherwise.
Example 2.10. Consider the graph from Example 2.3 with G = (V, E) and V = {1, 2, 3}
and E = {{1, 2}, {1, 3}}:
Then
L(G) = [ 1  −1/√2  −1/√2 ;  −1/√2  1  0 ;  −1/√2  0  1 ].
In the following, we fix a graph G = (V, E) and denote A := A(G) and L := L(G).
Definition 2.11. We define the diagonal matrix T = (tuv ) ∈ R^{|V|×|V|} with tuv = deg(u) if u = v and tuv = 0 otherwise.
Remark 2.12. Another common definition of the Laplacian of a graph is 𝓛 := T − A,
where T is as in Definition 2.11 and A is the adjacency matrix of G. In fact, we have
L = T −1/2 𝓛 T −1/2 (as shown in the next lemma). Compared to 𝓛, our Laplacian L is also
called the normalized Laplacian. In our lecture we follow the definition in [Chu97]
using L. In [Chu97, Section 1.2] Chung discusses that preferring L over 𝓛 can be
helpful in the context of stochastic processes – a topic that we will cover later in our
lectures.
Lemma 2.13. The following holds: L = T −1/2 (T − A) T −1/2 .
Proof. For u ∈ V = {1, . . . , n} let eu = (0, . . . , 0, 1, 0, . . . , 0)T be the u-th standard basis
vector. We compute for u, v ∈ V , and using the fact that T is symmetric so T = T T ,
F (V ) := { f : V → R}
for v ∈ V .
Lemma 2.14. The map L induced by the Laplacian of a graph G = (V, E) is given by
L f (u) = (1/√deg(u)) ∑_{v∈V : {u,v}∈E} ( f (u)/√deg(u) − f (v)/√deg(v) ).
Moreover,
∑_{v∈V} Auv g(v) = ∑_{v∈V : {u,v}∈E} g(v) = ∑_{v∈V : {u,v}∈E} f (v)/√deg(v).
This shows
L f (u) = f (u) − ∑_{v∈V : {u,v}∈E} f (v)/√(deg(u) deg(v)).   (2.1.2)
We can write deg(u) = ∑_{v∈V : {u,v}∈E} 1. Thus, multiplying and dividing by deg(u), we can
write
f (u) = deg(u) · f (u)/deg(u) = ∑_{v∈V : {u,v}∈E} f (u)/deg(u).   (2.1.3)
Combining the two equations gives
L f (u) = f (u) − ∑_{v∈V : {u,v}∈E} f (v)/√(deg(u) deg(v))   (by Eq. (2.1.2))
       = ∑_{v∈V : {u,v}∈E} ( f (u)/deg(u) − f (v)/√(deg(u) deg(v)) )   (by Eq. (2.1.3))
       = (1/√deg(u)) ∑_{v∈V : {u,v}∈E} ( f (u)/√deg(u) − f (v)/√deg(v) ).
The Laplace matrix is real and symmetric, L = LT . Thus, by the spectral theorem
(Theorem 1.9), all eigenvalues of L are real. We order them as
λ0 ≤ · · · ≤ λ|V|−1
and set λG := λ1 .
Example 2.16. The Laplace matrix from Example 2.10 has spectrum 0, 1, 2.
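The spectrum in Example 2.16 can be verified numerically. The following Julia sketch (not from the original notes) builds L = 1 − T −1/2 A T −1/2 for the graph of Example 2.10 and computes its eigenvalues:

using LinearAlgebra

# Sketch: normalized Laplacian of the graph from Example 2.10 and its spectrum.
A = [0 1 1;
     1 0 0;
     1 0 0]
deg = vec(sum(A, dims = 2))               # degrees (2, 1, 1)
Tinvsqrt = Diagonal(1 ./ sqrt.(deg))
L = Symmetric(I - Tinvsqrt * A * Tinvsqrt)

@show eigvals(L)                          # ≈ [0.0, 1.0, 2.0], as in Example 2.16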
⟨ f , g⟩ := ∑_{u∈V} f (u) g(u).
As above we set
g := T −1/2 f ,
so that
⟨ f , L f ⟩ = ∑_{u∈V} g(u) ∑_{v∈V : {u,v}∈E} (g(u) − g(v)).
We order the sum on the right as follows:
⟨ f , L f ⟩ = (1/2) ∑_{u∈V} g(u) ∑_{v∈V : {u,v}∈E} (g(u) − g(v)) − (1/2) ∑_{v∈V} g(v) ∑_{u∈V : {u,v}∈E} (g(u) − g(v))
          = (1/2) ∑_{u∈V} ∑_{v∈V : {u,v}∈E} (g(u) − g(v))^2 = ∑_{{u,v}∈E} (g(u) − g(v))^2 ,
as claimed.
Theorem 2.18 shows that L defines a bilinear form ( f , g) 7→ ⟨ f , Lg⟩ that is positive
semi-definite. Consequently, the spectrum of G is always nonnegative. We give a formal
proof for this observation.
where e ∈ F (V ) is the constant one function (in the identification from Eq. (2.1.1) this
is e = (1, . . . , 1)).
Next, we give the spectra of some example graphs.
Exercise 2.1. Consider the complete graph on 6 vertices from Example 2.5. Construct
the adjacency matrix and the Laplace matrix for this graph. What are the adjacency
matrix and the Laplace matrix for a complete graph on n vertices?
Compute the (1, 1)-entry of Ak for any k ≥ 1 without computing the matrix power Ak
explicitly.
Exercise 2.4. For the graph in Example 2.5, compute the number of paths of length 3
from vertex 1 to vertex 2.
eigenvalues of its Laplacian L(G). We proved in Corollary 2.19 that these eigenvalues
are nonnegative.
The main goal of this lecture is to prove the following theorem.
0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1
Example 2.22. Before we prove this theorem, let us recall the graph from Example 2.3:
n = λ1 + · · · + λn−1 ≥ (n − 1)λG ,
so that λG ≤ n/(n − 1). In the same spirit,
n = λ1 + · · · + λn−1 ≤ (n − 1)λn−1 ,
so that λn−1 ≥ n/(n − 1). This proves the second item.
For the third item we recall from Proposition 2.20 that, if G is complete, λG = n/(n − 1).
We show that otherwise λG ≤ 1.
Recall from Eq. (2.1.5) that T 1/2 e ∈ ker L. As a consequence, λG can be written in
the following way using the Rayleigh quotient (see Exercise 1.4):
λG = min_{g∈F (V )\{0} : ⟨g, T^{1/2} e⟩=0} ⟨g, Lg⟩/⟨g, g⟩.   (2.2.1)
Assume now that G is connected: then, for every i, j ∈ G we can find a path from i
to j. It follows that g is a multiple of the constant one function e, and f is a multiple of
T 1/2 e. Consequently, 0 is a simple eigenvalue of L and λ1 > 0. Conversely, if λ1 = 0,
then there exists a nonzero function f in the kernel of L which is not a multiple of T 1/2 e.
But then there must exist vertices i, j ∈ G such that G contains no path from i to j, and
hence G is not connected.
The statement for multiple connected components follows from this and item 6.
To prove item 5. note first that, for every a, b ∈ R, one has that (a − b)2 ≤ 2(a2 + b2 ),
and equality holds if and only if b = −a.
Now, setting g = T −1/2 f and using again the expression of the Rayleigh quotient in
Theorem 2.18 and Exercise 1.4, we get that
λn−1 = max_{f ∈F (V )\{0}} ⟨ f , L f ⟩/⟨ f , f ⟩ = max_{f ∈F (V )\{0}} (1/∑_{u∈V} deg(u) g(u)^2) ∑_{{u,v}∈E} (g(u) − g(v))^2 .
Combining this with the above inequality and deg(u) = ∑v∈V :{u,v}∈E 1 yields
λn−1 ≤ (2/∑_{u∈V} deg(u) g(u)^2) ∑_{{u,v}∈E} (g(u)^2 + g(v)^2 ) = 2.
The only inequality used in the argument was (g(u) − g(v))2 ≤ 2(g(u)2 + g(v)2 ).
Therefore, λn−1 = 2 if and only if there is a function g ∈ F (V ) \ {0} with g(u) = −g(v)
for all {u, v} ∈ E. If G has a bipartite component H = (V ′ , E ′ ), we can write V ′ = V1′ ⊔V2′
and choose
g(u) = 1 if u ∈ V1′ ,   g(u) = −1 if u ∈ V2′ ,   and g(u) = 0 if u ∈ V \V ′ .
Then V ′ = W1 ∪ W2 and, for every edge {u, v} ∈ E ′ , the endpoints u and v must lie in
different Wi ’s. Therefore, H is bipartite.
Finally, for the last item we denote the connected components of G by G1 , . . . , Gk .
Let us write Gi = (Vi , Ei ), so that V = ⋃_{i=1}^k Vi . We can reenumerate the vertices to
have Vi = {ni−1 +1, . . . , ni } with 0 = n0 < n1 < · · · < nk = n. Let also Li be the Laplacian
of Gi . Then, the Laplace matrix of G is a block diagonal matrix:
L(G) = diag(L1 , . . . , Lk ).
This shows that the eigenvalues of L(G) are given by the eigenvalues of the Li .
Figure 2.1: The network from Example 2.25. One can see 7 connected components. The 5 vertices with
highest degree are labelled; the labels are the hashtags the vertices correspond to (datascience, MachineLearning, DataScience, BigData, AI).
Theorem 2.21 shows how we can obtain information about the structure of a graph
by computing its spectrum. However, often networks are almost disconnected or almost
bipartite rather than having exactly this property. Such a scenario is also reflected in the
spectrum. We need another definition for formulating results in this direction.
Figure 2.2: This graph shows the big component in Fig. 2.1, called G.
Before we prove it, let us see how Proposition 2.24 is related to Theorem 2.18. The
graph G is bipartite with components G1 and G2 , if and only if vol(G1 ) = vol(G2 ) = 0
and ε = 1. In this case, the bound in Proposition 2.24 becomes λG ≤ 2 ≤ λn−1 , similar
to Theorem 2.21 5. Furthermore, the components G1 and G2 are disconnected, if and
only if ε = 0, in which case we have λG = 0.
Then,
λG ≤ ⟨ f , L f ⟩/⟨ f , f ⟩ = ε |E| · vol(G)/(m1 m2 ) ≤ λn−1 .
corresponding hashtags appear together in at least one tweet. This gives the graph on
n = 171 vertices that can be seen in Fig. 2.1. We have labelled the 5 vertices with highest
degrees in this graph with their corresponding hashtag.
The graph in Fig. 2.1 has 7 connected components. We consider the big component
and call the underlying graph G. Fig. 2.2 shows G and Fig. 2.3 shows the spectrum of G.
Figure 2.3: Spectrum of the graph from Fig. 2.2. The x-axis represents the index of the λi and the y-
axis their numerical value. Instead of plotting discrete points we have plotted a piecewise linear curve
connecting the discrete values λ0 , . . . , λn−1 .
Fig. 2.3 shows that λG is small. Following Proposition 2.24 we therefore anticipate1
that G = (V, E) has two components G1 = (V1 , E1 ) and G2 = (V2 , E2 ) such that there
are few edges between them. The proof of Proposition 2.24 motivates the following
clustering method: let f be the eigenfunction of λG . Then, we take as clusters the sets V1 and V2 of vertices on which f is positive, respectively negative.
These two clusters are shown in Fig. 2.4. The vertices in V1 are shown in red and the
vertices in V2 are blue. The two clusters indeed seem to give two relatively separated
components of G.
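The clustering heuristic just described is easy to prototype. The following Julia sketch is not code from the notes (spectral_bipartition is a hypothetical helper, and the barbell-type example graph is chosen here only for illustration): it splits the vertices according to the sign of an eigenvector for λG.

using LinearAlgebra

# Sketch: bipartition by the sign of an eigenvector for λ_G.
function spectral_bipartition(A::AbstractMatrix)
    deg = vec(sum(A, dims = 2))
    Tinvsqrt = Diagonal(1 ./ sqrt.(deg))
    L = Symmetric(I - Tinvsqrt * A * Tinvsqrt)   # normalized Laplacian
    f = eigen(L).vectors[:, 2]                   # eigenvector for λ_1 = λ_G
    return findall(>(0), f), findall(<=(0), f)   # V1, V2
end

# Two triangles {1,2,3} and {4,5,6} joined by the edge {3,4}:
A = [0 1 1 0 0 0;
     1 0 1 0 0 0;
     1 1 0 1 0 0;
     0 0 1 0 1 1;
     0 0 0 1 0 1;
     0 0 0 1 1 0]
@show spectral_bipartition(A)   # the two triangles should end up in different clusters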
We can use the same approach for analyzing the largest eigenvalue λn−1 . Here, how-
ever, two clusters are not enough, because the eigenfunction g of λn−1 defines three
clusters: one cluster, where g is positive, one, where g is negative, and a third cluster,
1 Strictly
speaking, in Proposition 2.24 we only showed an upper bound for λG in terms of the number
of connected edges ε, so that λG could be small while ε is big.
where g is zero. We show these three components in Fig. 2.5, where we have plotted the
components corresponding to where g is zero in yellow. We see that the yellow vertices
form the major part of G and that the blue and red vertices give a bipartite subgraph.
The analysis of three components is the topic of Exercise 2.7.
Figure 2.4: We have partitioned the vertices of the graph in Fig. 2.2 into two classes of vertices corre-
sponding to whether the eigenfunction of λG is positive or negative.
The next two results underline that λG should be understood as a measure of connec-
tivity for G.
Proposition 2.26. Let G = (V, E) be a connected graph and let diam(G) be the diameter
of G; i.e., diam(G) is the maximal length of all shortest paths between vertices u, v ∈ V .
Then,
λG ≥ 1/(diam(G) · vol(G)).
Proof. Let f ∈ F (V ) be an eigenvector of λG with ⟨ f , T 1/2 e⟩ = 0. Such an eigenvector
exists by Eq. (2.1.5) (and since L(G) is symmetric). Write g := T −1/2 f . Then,
Figure 2.5: We have partitioned the vertices of the graph in Fig. 2.2 into three classes of vertices corre-
sponding to whether the eigenfunction of λn−1 is positive, negative or zero. The class corresponding to
zero is yellow.
Let v0 ∈ V with |g(v0 )| = maxv∈V |g(v)|. Eq. (2.2.2) implies that there exists u0 ∈ V
with g(u0 )g(v0 ) < 0 (i.e., they have opposite sign). If P is a shortest path from u0 to v0
of length D > 0, then
1/(D · vol(G)) ≥ 1/(diam(G) · vol(G)).
We show that λG ≥ (D · vol(G))−1 . Using Theorem 2.18 we have
λG = ⟨ f , L f ⟩/⟨ f , f ⟩ = (1/∑_{u∈V} deg(u) g(u)^2) ∑_{{u,v}∈E} (g(u) − g(v))^2
   ≥ (1/(vol(G) g(v0 )^2)) ∑_{{u,v}∈E(P)} (g(u) − g(v))^2 .
It follows that
λG ≥ (1/(D · vol(G))) · (g(v0 ) − g(u0 ))^2 / g(v0 )^2 ,
and we have (g(v0 ) − g(u0 ))^2 ≥ g(v0 )^2 , because g(u0 )g(v0 ) < 0.
The final result of this lecture is a classical result in combinatorics known as Kirch-
hoff’s theorem. From Theorem 2.21 we know that, if G is connected, L(G) has rank
n − 1. Therefore, all (n − 1) × (n − 1) submatrices of L(G) are invertible, and so have
nonzero determinant. The next theorem shows that this determinant counts spanning
trees in G. Recall that a tree is a graph that has no cycles, and that a spanning tree τ is
a tree subgraph of G = (V, E) whose vertex set is V .
It follows from Theorem 2.18 that ⟨ST f , ST f ⟩ = ⟨ f , L f ⟩, which shows that L = SST .
Therefore, if Su denotes the matrix that is obtained from S ∈ R|E|×(|V |−1) by removing
the u-th row, we have Lu = Su SuT . As before, we let n = |V |. The Cauchy-Binet formula
[HJ92, Sec. 0.8.7] implies
Exercise 2.6. It follows from Theorem 2.21 that G has exactly k connected components,
if and only if dim ker L(G) = k. Can you determine the k components from ker L?
Exercise 2.7. State and prove a version of Proposition 2.24 involving three pairwise
distinct components of a graph. Prove bounds for λG and λn−1 for a tripartite graph.
Hint: Consider a function that is positive on the first component, negative on the second, and zero on the third component.
Exercise 2.8. Use Twitter.jl to generate a network, and analyze it using the spectral
methods from this lecture.
Exercise 2.10. Let G1 , G2 , G3 be the following graphs on vertex set {1, 2, 3, 4}:
[Drawings of the three graphs G1 , G2 , G3 on the vertex set {1, 2, 3, 4} omitted.]
(a) List all spanning trees for G1 , G2 and G3 . (You can either draw them or list their
edges.) How many do you get?
(b) Compute the number of spanning trees for G1 , G2 and G3 using Theorem 2.27.
A Markov process (or random walk) on G = (V, E) with V = {1, . . . , n} is a sequence of random variables
X0 , X1 , . . . ∈ V,
called the steps of the walk, such that for all i ≥ 1 we have:
1. P(Xi = u | Xi−1 = v, Xi−2 = vi−2 , . . . , X0 = v0 ) = P(Xi = u | Xi−1 = v);
2. P(Xi = u | Xi−1 = v) > 0 only if {u, v} ∈ E or u = v (i.e., remaining at the current
vertex is allowed);
3. P(Xi = u | Xi−1 = v) does not depend on i.
The first item means that the probability law of the i-th step only depends on the
position of the (i − 1)-th step, but is independent of what happened before. The third
item means that the probability law of a step does not depend on the number of steps
that have passed. The second item means that we can only progress along edges in the
graph or stand still. In the following, we will denote
P = (puv ) ∈ Rn×n
with puv = P(u | v). The puv are called transition probabilities.
Example 2.31. Consider G = (V, E) for V = {1, 2, 3} and E = {{1, 2}, {1, 3}}:
Suppose that we start at the vertex 3. We have P(2 | 3) = 0, because there is no edge
between 2 and 3. We can either move to 1 or stand still. This means that, if p := P(1 | 3),
then P(3 | 3) = 1 − p. Similarly, starting at 2 we can’t move to 3. The transition matrix
of any Markov process on G therefore has the form:
r q p
P= s 1−q 0
1−r−s 0 1− p
with 0 ≤ p, q, r, s and p, q, r + s ≤ 1.
Let us make a simple but important observation, implicitly used in Example 2.31.
Lemma 2.32. Let e = (1, . . . , 1) ∈ Rn and let P = (puv ) ∈ Rn×n be the transition matrix
of a Markov process on G. Then, e is an eigenvector of PT with eigenvalue 1:
PT e = e.
Definition 2.33. We call X a uniform Markov process, if its transition probabilities are
puv = P(u | v) = 1/deg(v) if u ≠ v and {u, v} ∈ E,   and puv = 0 else.
Let us denote by F+ (V ) := { f ∈ F (V ) | f (u) ≥ 0 for all u ∈ V } the nonnegative functions on V .
Lemma 2.34. Let X be a Markov process on G with transition matrix P. Let also i ≥ 0
and fi := (P(Xi = 1), . . . , P(Xi = n))T ∈ F+ (V ) be the probability distribution of the
i-th step of X. Then,
fi+k = Pk fi
for all k ≥ 0.
Lemma 2.34 should be interpreted as follows: (Pk )uv is the probability that, if the
Markov process starts at v, it reaches u after k steps.
Example 2.35. In Example 2.31 we take the transition matrix of the uniform process:
P = [ 0  1  1 ;  1/2  0  0 ;  1/2  0  0 ].
This means that, if we start at 3, the next vertex is almost surely 1. Then, from 1 with
probability 1/2 we either go back to 3 or go to 2. We also have
P^2 = [ 1  0  0 ;  0  1/2  1/2 ;  0  1/2  1/2 ].
The last column of P^2 shows that, starting from 3, after 2 steps we either are at 2 or 3,
both with probability 1/2.
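A small Julia sketch (not from the original notes) reproduces these computations; it uses the column-stochastic convention puv = P(u | v) from above. Note that this particular chain is periodic, since the underlying graph is bipartite (compare Exercise 2.12), so Pk f0 does not converge, but an eigenvector of P with eigenvalue 1 still exists.

using LinearAlgebra

# Sketch: matrix powers and the eigenvalue-1 eigenvector for Example 2.35.
P = [0.0 1.0 1.0;
     0.5 0.0 0.0;
     0.5 0.0 0.0]

f0 = [0.0, 0.0, 1.0]         # start at vertex 3
@show P * f0                 # distribution after one step
@show P^2 * f0               # distribution after two steps

E = eigen(P)
i = argmin(abs.(E.values .- 1))
pstat = real.(E.vectors[:, i])
pstat ./= sum(pstat)
@show pstat                  # here (1/2, 1/4, 1/4), proportional to the degrees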
Definition 2.36. Let X be a Markov process on G with transition matrix P = (puv ).
1. We call a probability distribution π : V → R stationary (with respect to X), if
Pπ = π.
2. We call a probability distribution f : V → R reversible (with respect to X), if
f (v) puv = f (u) pvu
for all u, v ∈ V .
We will spend a great part of the remainder of this lecture to give conditions under
which stationary distributions exist and are unique. In the literature, a stochastic process
that has a stationary distribution is also called an ergodic process. First, we give a
sufficient condition for π being a stationary distribution of a Markov process X.
Proposition 2.37. Let X be a Markov process on G and π : V → R be a probability
distribution reversible with respect to X. Then, π is a stationary distribution of X.
Proof. Let P denote the transition matrix of X. For all u ∈ V we have
1. X is called aperiodic, if gcd{t ≥ 1 | (Pt )uu > 0} = 1 for all u ∈ V .
2. X is called irreducible, if for all u, v ∈ V there exists k ∈ N with (Pk )uv > 0.
Recall from Lemma 2.34 that (Pk )uv is the probability of being at u after k steps
given that we have started from v. Therefore, if on G there exists an irreducible Markov
process, G must be connected. On the other hand, if G is connected, this does not imply
that any Markov process on G is irreducible: on any graph we can always define the
process which does not move with probability one.
We now come to the main theorem of this lecture.
We need two auxiliary results for the proof of Theorem 2.40. We prove them first,
and then we prove Theorem 2.40 towards the end of this section.
then Bk has kλ k−1 on the off-diagonal. Since by Eq. (2.3.2) the entries of M must be
bounded, there can be such a Jordan block only if |λ | < 1. This implies that 1 is a simple
eigenvalue.
Let now λ ∈ C, λ ̸= 1, be another eigenvalue of AT . Let x = (x1 , . . . , xn )T ∈ Cn be
a corresponding eigenvector. Then, we have λ xi = a1i x1 + · · · + ani xn for all 1 ≤ i ≤ n.
We have 1 = a1i + · · · + ani , because AT e = e. Since the a ji are positive, this implies
that λ xi is a convex combination of the entries of x. Let us denote by C ⊂ C the convex
hull of the xi . Since x must be linearly independent of e, there must be indices i and j
with xi ≠ x j . Therefore, C is not a single point. Moreover, since the a ji are all strictly
positive, λ xi lies in the relative interior of C. From this we get
|λ xi | < max_{1≤ j≤n} |x j |,
The second result we need is a lemma from commutative algebra. For this, we recall
that a semigroup in N is a subset S ⊂ N, such that for all s, r ∈ S also r + s ∈ S (see the
textbook [RGS09] for more details).
Lemma 2.42. Let S ⊂ N be a semigroup and suppose that gcd(S) = 1 (i.e., the greatest
common divisor of all elements in S is 1). Then, N \ S is finite.
There exists an N ∈ N such that ℓa + ky > 0 for all ℓ > N and 0 ≤ k < ab. Therefore,
there exists an M ∈ N such that z ∈ S′ for all z > M.
Proof of Theorem 2.40. The proof is based on the fact that under the hypothesis of the
theorem there exists M ∈ N with (Pm )uv > 0 for all m ≥ M and u, v ∈ V . We show this
at the end of this proof. Let us first see how the statement of the theorem follows.
We know that PT e = e from Lemma 2.32, and so (Pm )T e = e for all m ≥ 1. Therefore,
if (Pm )uv > 0 for all u, v ∈ V , Proposition 2.41 implies that Pm has 1 as simple eigenvalue
and that all other eigenvalues λ of Pm satisfy |λ | < 1. Since µ is an eigenvalue of P, if
and only if λ = µ m is an eigenvalue of Pm , we see that P has 1 as simple eigenvalue and
that all other eigenvalues of P have absolute value strictly less than 1. It follows that P
has a unique right-eigenvector π ∈ F (V ) with eigenvalue 1. This is not yet enough to
prove that π is a stationary distribution. For this, we have to show π ∈ F+ (V ) (i.e., π
is nonnegative). We do this next.
Since limk→∞ µ k = 0 for all eigenvalues µ ̸= 1 of P, we see that Pk converges to a
matrix P∗ of rank one. Let us write P∗ = xyT . Since Pk π = π and (Pk )T e = e for every k,
we also have P∗ π = π and (P∗ )T e = e. This shows that P∗ = π eT , and therefore
lim_{k→∞} Pk f = P∗ f = π eT f = π,
because eT f = ⟨e, f ⟩ = 1.
It remains to show the claim above. For this we pick u ∈ V and define Su := {t ≥ 1 | (Pt )uu > 0}.
Recall from Lemma 2.34 that (Pt )uu is the probability of having a walk with t steps that
starts and ends at u. If α, β ∈ Su , then we must have α + β ∈ Su , since we can simply
join two walks starting and ending at u. This shows that Su is a semigroup. Since X is
aperiodic, N \ Su is finite by Lemma 2.42. So, there exists Mu ∈ N with (Pm )uu > 0 for
all m ≥ Mu . We set
M′ := max_{u∈V} Mu .
Let now u, v ∈ V , u ≠ v, and define for t ≥ 1:
ℓtuv := P(Xt = u and Xi ≠ u for i < t | X0 = v).   (2.3.5)
This is the probability of arriving at u for the first time after t steps, when we have
started at v. For every k ≥ 1 we have
(Pk )uv = ∑_{t=1}^k ℓtuv · (Pk−t )uu .
Since X is irreducible, there exists r ∈ N such that (Pr )uv > 0. Therefore, ℓ1uv , . . . , ℓruv
can’t all be equal to zero, so that ∑_{t=1}^r ℓtuv > 0. We set
M := M′ + r.
For m ≥ M and 1 ≤ t ≤ r we then have m − t ≥ M′ , so that (Pm−t )uu > 0 and hence (Pm )uv > 0, which proves the claim.
We have shown in Theorem 2.40 that aperiodic and irreducible Markov chains on a
graph have a unique stationary distribution. Let us now turn our point of view upside
down and start with a probability distribution π : V → R on a connected graph G = (V, E).
Can we find a Markov process whose stationary distribution is π? The answer is yes!
And we can use this process to sample from π by simply starting at a vertex v ∈ V and
following the random walk along the graph. This procedure is known as the Metropolis–
Hastings algorithm and it is based on the next theorem.
Then, P is the transition matrix of an aperiodic and irreducible Markov process with
unique stationary distribution π.
V̂ := {F : V → {1, . . . , k} | F is admissible},
Ê := {{F, G} ⊂ V̂ | F and G differ in exactly one vertex u}.
Then, the graph Ĝ := (V̂ , Ê) is connected (see, e.g., [FV07]) and the number of all admissible k-colorings
is |V̂ |. Using Theorem 2.43 we can sample from the probability distribution π : V̂ → R with
π(u) = 1/|V̂ | for all u ∈ V̂ .
Lastly, one thing that we did not discuss in this lecture is the number of steps it
takes until an aperiodic and irreducible Markov process X is close to its
stationary distribution π. One way to measure convergence is using the total variation
distance d(t) := maxu,v∈V |P(Xt = u | X0 = v) − π(u)|. For a given δ > 0 the minimal t
such that d(t) < δ is called mixing time of the process. We refer to [Chu97, Chapter 1]
for more details. In practice, however, it often suffices to take a fixed number of steps.
Exercise 2.12. Let G be a graph and X be the uniform Markov process on G. Show:
1. X is aperiodic, if and only if G is not bipartite.
2. X is irreducible, if and only if G is connected.
Exercise 2.13. Prove that the following algorithm implements the Markov process from
Example 2.44. Suppose that at the i-th step we have the admissible coloring Fi . Then
we do the following:
1. Choose (u, c) ∼ Unif(V × {1, . . . , k}).
2. For all v ∈ V \ {u} set Fi+1 (v) = Fi (v), and Fi+1 (u) = c.
3. If Fi+1 is not an admissible coloring, go back to 1. Otherwise, return Fi+1 .
Exercise 2.14. Implement the algorithm from Exercise 2.13 for the graph
[drawing of the graph omitted]
using 4 colors.
a Page-Rank of G.
PcR = cR ,
where P is the transition matrix of the uniform Markov process X on G (see Defini-
tion 2.33). It follows from Exercise 2.11 that X is aperiodic and irreducible, and then
Theorem 2.40 implies that X has a unique stationary distribution; i.e., a unique solution
cR ∈ F+ (V ) with PcR = cR and ⟨e, cR ⟩ = 1.
Remark 2.47. We can use the transition matrix of any irreducible and aperiodic Markov
process to define a corresponding Page-Rank.
Example 2.48. This example is based on the Graphs lecture in the Data Science course
by Huda Nassar2 .
We compute Page-Rank for the airport dataset from the VegaDatasets.jl3 package.
From this dataset we compute a graph G with n = 305 vertices and 5668 edges. The
vertices represent airports in the US, and there is an edge between two airports if there
is a flight from one of the two airports to the other. The graph is shown in Fig. 2.6.
2 https://github.com/JuliaAcademy/DataScience
3 https://github.com/queryverse/VegaDatasets.jl
Figure 2.6: The network from Example 2.48. The blue vertices represent 305 airports in the US. There is
an edge between two airports if there is a flight from one to the other. The vertex on the top left is Adak
Airport.
For computing Page-Rank we set up the transition matrix P of the uniform Markov
process on G and then compute an eigendecomposition of P. We also approximate
Page-Rank by using the third item in Theorem 2.40. For 1 ≤ i ≤ N we sample a random
airport v0 and then we start a random walk at v0 using the transition probabilities in P.
After m steps we record the locations vi . We approximate Page-Rank by the empirical
distribution function
f (u) = (1/N) · |{i | u = vi }|.
The result of an experiment with N = 10^4 and m = 20 is shown in Fig. 2.8 and Fig. 2.7.
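The random-walk approximation described above can be prototyped in a few lines. The following Julia sketch is not the code used for Fig. 2.7 and Fig. 2.8; approx_pagerank is a hypothetical helper, and the graph is a small synthetic example (connected and not bipartite, so the uniform process is irreducible and aperiodic by Exercise 2.12, and one can check via Proposition 2.37 that its stationary distribution is deg(v)/vol(G)).

# Sketch: approximate Page-Rank by recording the endpoints of N random walks of length m.
function approx_pagerank(A::AbstractMatrix, N::Int, m::Int)
    n = size(A, 1)
    counts = zeros(Int, n)
    for _ in 1:N
        v = rand(1:n)                               # random start vertex
        for _ in 1:m
            v = rand(findall(!iszero, A[:, v]))     # uniform step to a neighbour
        end
        counts[v] += 1
    end
    return counts ./ N                              # empirical distribution f
end

# Triangle {1,2,3} with an extra vertex 4 attached to vertex 1:
A = [0 1 1 1;
     1 0 1 0;
     1 1 0 0;
     1 0 0 0]
@show approx_pagerank(A, 10_000, 20)    # ≈ deg(v)/vol(G) = (3/8, 1/4, 1/4, 1/8)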
Next, we introduce several other commonly used centrality measures.
cD (u) := deg(u).
Figure 2.7: Page-Rank (blue) and approximated Page-Rank (brown) for the network from Fig. 2.6.
4. Let σx,y be the number of shortest paths from x to y, and σx,y (u) the number of
shortest paths from x to y passing through u. The betweenness centrality of u is
cB (u) := ∑_{x,y∈V \{u}, x≠y} σx,y (u)/σx,y .
Figure 2.8: Page-Rank (blue) and approximated Page-Rank (brown) for the network from Fig. 2.6. The
size of the circles corresponds to the respective centrality measures.
5. The Markov centrality of u is
cM (u) := 1/∑_{v∈V \{u}} E τ(u, v),
where τ(u, v) is the minimal t such that a uniform Markov process X starting at v
arrives at u for the first time after t steps; i.e., E τ(u, v) = ∑_{t=0}^∞ t · ℓtu,v , where
ℓtu,v = P(Xt = u and Xi ≠ u for i < t | X0 = v) is as in Eq. (2.3.5).
Consider again the graph from Example 2.3: we compute the centrality measures from Definition 2.49 for its three vertices. In all
cases 1 will have the largest measure. This can be interpreted as 1 taking the most
important role in the network.
The degree centralities of the vertices are cD (2) = cD (3) = 1 and cD (1) = 2. The
closeness centralities are
cC (1) = 1/(dist(1, 2) + dist(1, 3)) = 1/(1 + 1) = 1/2
and
cC (2) = 1/(dist(1, 2) + dist(2, 3)) = 1/(1 + 2) = 1/3.
Similarly, cC (3) = 1/3. For harmonic centrality we have
cH (1) = 1/dist(1, 2) + 1/dist(1, 3) = 1 + 1 = 2
and
cH (2) = 1/dist(1, 2) + 1/dist(2, 3) = 1 + 1/2 = 3/2,
and cH (3) = 3/2 due to symmetry. The betweenness centralities are
cB (1) = σ2,3 (1)/σ2,3 = 1/1 = 1
and cB (2) = cB (3) = 0, as there are no shortest paths from 1 to 3 passing through 2,
and no shortest paths from 1 to 2 passing through 3. Finally, we compute the Markov
centralities of the three vertices. First, we compute cM (1):
cM (1) = 1/(E τ(1, 2) + E τ(1, 3)).
Starting from either 2 or 3 the next vertex must always be 1, which means that we have
E τ(1, 2) = E τ(1, 3) = 1. Consequently, cM (1) = 1/2. Next, we compute cM (2). We have
cM (2) = 1/(E τ(2, 1) + E τ(2, 3)).
Starting at 1, the process arrives at 2 for the first time after 2k + 1 steps with probability 1/2^{k+1} , so that
E τ(2, 1) = ∑_{k=0}^∞ (2k + 1)/2^{k+1} = 3.
Furthermore, for moving from 3 to 2 we always need an even number of steps and we
can argue as above to get
E τ(2, 3) = ∑_{k=0}^∞ 2k/2^k = 4.
This shows cM (2) = 1/(3 + 4) = 1/7. Due to symmetry, also cM (3) = 1/7.
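The expected hitting times computed above can be checked by simulation. The following Julia sketch is not from the notes (hitting_time is a hypothetical helper); it estimates E τ(u, v) for the uniform Markov process by averaging over many simulated walks.

# Sketch: Monte Carlo estimate of the expected hitting time E τ(u, v).
function hitting_time(A, u, v; trials = 10_000)
    total = 0
    for _ in 1:trials
        w, t = v, 0
        while w != u
            w = rand(findall(!iszero, A[:, w]))   # uniform step to a neighbour
            t += 1
        end
        total += t
    end
    return total / trials
end

A = [0 1 1; 1 0 0; 1 0 0]      # the graph from Example 2.3
@show hitting_time(A, 2, 1)    # ≈ 3
@show hitting_time(A, 2, 3)    # ≈ 4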
Exercise 2.15. How would you define Page-Rank for a directed graph? Analyze the
data from Example 2.48 using your ideas.
Exercise 2.16. Compute the centrality measures from this section for the following
graph:
[drawing of a graph on the vertices 1, 2, 3, 4 omitted]
3 Machine Learning
This chapter is based on Part II of the textbook by Deisenroth, Faisal, and Ong [DFO20].
What is machine learning? Machine learning is a subarea of data analysis. The
fundamental motivation is to develop algorithms that automatically extract information
from datasets. Here we mean “automatic” in the sense that we want to find general
methods which can be used in special situations so that we don’t need to create an
algorithm for each individual scenario.
The automation of an algorithm happens through the analysis of training data. It is
also said that one “learns” from the data. The data is understood as numeric vectors.
In real-world data we always have to assume the presence of noise, so that modeling
f (x) = y exactly is not realistic. Using instead approximations f (x) ≈ y allows for more flexibility. The
exact meaning of ≈ depends on the problem, and is usually measured with a loss func-
tion. We will come back to this in Definition 3.5 below.
Definition 3.3. Variables which have a continuous domain of values are called continu-
ous variables. Variables with a discrete domain of values are called discrete variables
or categorical variables.
In Eq. (3.1.1) it is usually not helpful to consider the class of all functions. The
selected class of functions we use is called a model. The choice of model depends on the
context, and the model can be constructed either by an analyst or automatically.
In the following, we will always consider statistical models with a density that we
denote by Pθ (y | x). A key assumption for statistical models is that the n pairs in the
training data are chosen independently.
The goal of machine learning is to determine a parameter θ that accurately describes
the data in the model. We also say that we learn the parameter θ . Ideally, we have
chosen a model and learned a parameter that predicts well on unseen data. For this
reason, we split the data into training data and test data. The training data is used
to learn the parameter. The role of the test data is to simulate unseen data. We assess
the quality of prediction of our model by evaluating a quality function on the test data.
This last step is called validation.
Thus, at the core approaching a machine learning problem consists of four steps.
Algorithm 3.1: Core steps of solving a machine learning problem.
1 Select a model;
2 Split the data into training and test data;
3 Learn parameters;
4 Validation.
Let us first consider the validation step. A common choice of quality function
that can be used for validation is the empirical risk.
Definition 3.5. Given data (x1 , y1 ), . . . , (xn , yn ) and the model fθ : RD → RN (resp. Pθ ),
let ℓ : RN × RN → R be a function called the loss function. Then the empirical risk
depending on ℓ is
R(θ ) := (1/n) ∑_{i=1}^n ℓ(yi , fθ (xi )),   resp.   R(θ ) := (1/n) ∑_{i=1}^n E_{ŷi ∼Pθ (y|xi )} ℓ(yi , ŷi ).
When there is a discrepancy in the value of the quality function between training and
test data, we speak of overfitting. Overfitting means that the model fits the training data
well, but does not accurately predict the test data.
The simplest way for model selection is to split the data by randomly assigning the
data points to either training or test data, learn a parameter θ , and validate θ on the test
data using a quality function Q(θ ). The model for which Q is minimized is chosen.
A more sophisticated way is cross-validation. Here, we randomly split the data into k
parts D1 ∪ · · · ∪ Dk with Di ∩ D j = ∅ for i ≠ j. Then, for every 1 ≤ i ≤ k we learn a param-
eter θi using ⋃_{j≠i} D j as training data and return Q := (1/k) ∑_{i=1}^k R(θi ), where R(θi ) is the
risk (or any other quality function) for the test data Di . As before, we choose the model
for which Q is minimized. Therefore, cross-validation is a method for model selection,
but not for parameter learning (although the process for cross-validation involves the
computation of parameters).
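The following Julia sketch (not from the original notes) outlines k-fold cross-validation as described above; learn and risk are hypothetical placeholders for the learning step and the quality function.

using Random, Statistics

# Sketch: k-fold cross-validation, returning Q = (1/k) Σ R(θ_i).
function cross_validate(learn, risk, X, Y; k = 5)
    n = size(X, 1)
    folds = collect(Iterators.partition(shuffle(1:n), ceil(Int, n / k)))
    scores = Float64[]
    for test_idx in folds
        train_idx = setdiff(1:n, test_idx)
        θ = learn(X[train_idx, :], Y[train_idx])            # learn on the other parts
        push!(scores, risk(θ, X[test_idx, :], Y[test_idx])) # validate on the held-out part
    end
    return mean(scores)
end

# Possible usage with least squares (see Theorem 3.8 below):
#   learn(X, Y)   = pinv([ones(size(X, 1)) X]) * Y
#   risk(θ, X, Y) = mean(abs2, [ones(size(X, 1)) X] * θ .- Y)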
For splitting the data into training and test data we can proceed as before and ran-
domly assign training or test labels. Usually between 50%–80% of the data is used for
training. The random choice in this step, however, should be independent of the random
choices in the model selection step.
After a model is chosen and data is prepared, we learn parameters. In the determin-
istic model we utilize a technique called Empirical Risk Minimization (ERM). This
means computing a parameter θ ∗ which minimizes the empirical risk R(θ ) on the train-
ing data, namely θ ∗ ∈ argminθ R(θ ). For the statistical model we can use Maximum-
Likelihood Estimation (MLE) or Maximum a-Posteriori Estimation (MAP). They
correspond to maximizing the following functions.
Definition 3.6. Let (x1 , y1 ), . . . , (xn , yn ) be data points and Pθ be a statistical model.
1. The likelihood function is the probability of observing the response variables
given the input data: L(θ ) := ∏_{i=1}^n Pθ (yi | xi ). The log-likelihood function is
l(θ ) := ∑_{i=1}^n log Pθ (yi | xi ).
2. Let us denote X := [x1 . . . xn ]T ∈ Rn×D and Y := [y1 . . . yn ]T ∈ Rn×N ; i.e.,
X and Y have input data and response variables as their rows. Suppose that the
parameter θ is a random variable, and that (θ | X,Y ) has a probability density P.
The posteriori function is α(θ ) := P(θ | X,Y ).
The motivation for the definition of l is that in order to maximize L, we can also
maximize l. The latter is often simpler.
In particular, in maximum a-posteriori estimation we model the parameter θ as ran-
dom. Taking the point of view of Bayesian probability allows us to use prior informa-
tion for modelling the random variable θ . The choice of probability distribution for
θ is therefore called the prior distribution. For instance, θ could be a normal distri-
bution around a mean value that we have observed often. Going one step further, we
can also keep θ random, so that our model for θ allows fluctuations. The response
variable then has a conditional distribution (y | x, θ ), and we can use Eq. (1.2.1) to get
P(y | x, X,Y ) = ∫_{θ∈R^P} Pθ (y | x) · P(θ | X,Y ) dθ . These two approaches are summarized
under the name Bayesian machine learning.
Finally, let us briefly come back to model selection. Suppose that we have to choose
among models M1 , . . . , Mr . In Bayesian machine learning we can place a prior P(M)
on the choice of model. For instance, we could define P(Mi ) = 1/r for 1 ≤ i ≤ r. This
would correspond to choosing a model uniformly at random. Then, we have the posteriori
function P(M | X,Y ), which we use for maximum a-posteriori estimation. It follows
from Bayes’ theorem for densities (Theorem 1.24) that
P(Mi | X,Y ) = P(X,Y | Mi ) · P(Mi )/P(X,Y ),
and P(X,Y | Mi ) is the probability of having the data given the model, also called the
evidence of the model. It is the marginal density, where the random variable θ has been
integrated out: P(X,Y | Mi ) = ∫_{θ∈R^P} P(X,Y | θ ) · P(θ | Mi ) dθ , by Eq. (1.2.1). Here,
P(θ | Mi ) is the prior distribution of θ for the model Mi and P(X,Y | θ ) the joint density
of (X,Y ) given θ .
Definition 3.7. The linear model fθ : RD → R is given by the following function de-
pending on the parameter θ = (a, b) ∈ R × RD :
fθ (x) = a + xT b.
The empirical risk R(θ ) (see Definition 3.5) for the quadratic loss is also called mean
squared error (MSE). Alternatively, one also considers the root mean squared error
(RMSE) defined by RMSE(θ ) := √R(θ ).
We again consider training data (x1 , y1 ), . . . , (xn , yn ) ∈ RD × R. Recall that we denote
X := [x1 . . . xn ]T ∈ Rn×D and Y := [y1 . . . yn ]T ∈ Rn .   (3.2.1)
Figure 3.1: The picture shows 75% of the data in the dataset cars from the RDatasets package, and the
result of a linear regression for this data.
Theorem 3.8. Given the setting from Definition 3.7, let Y ∈ Rn be as in Eq. (3.2.1) and
Ω be the feature matrix. Recall that r(Ω) denotes the rank of the matrix Ω.
1. When r(Ω) < D + 1, then argminθ R(θ ) has infinitely many solutions.
2. When r(Ω) = D + 1, then argminθ R(θ ) has a unique solution
θ ∗ := Ω†Y,
Proof. We can write the empirical risk as
R(θ ) = (1/n) ∑_{i=1}^n ( fθ (xi ) − yi )^2 = (1/n) ∑_{i=1}^n (a + xiT b − yi )^2 = (1/n) (Ωθ −Y )T (Ωθ −Y ).
Example 3.9. The RDatasets1 package provides the data set cars that features the variables
x = speed and y = dist. The dataset is shown in Fig. 3.1. We expect a linear relation
of the form y = a + bx to hold. We estimate a and b using Theorem 3.8 and a subset of
75% of the data for training. The result is shown in Fig. 3.1.
1 https://github.com/JuliaStats/RDatasets.jl
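The estimate θ ∗ = Ω†Y from Theorem 3.8 is one line of Julia. The following sketch uses synthetic data rather than the cars dataset (the true parameters a = 2, b = 3 are chosen arbitrarily here):

using LinearAlgebra

# Sketch: least-squares parameters via the pseudoinverse of the feature matrix.
n = 50
x = 10 .* rand(n)
y = 2.0 .+ 3.0 .* x .+ randn(n)   # y = a + b·x + noise

Ω = hcat(ones(n), x)              # feature matrix Ω from Eq. (3.2.2)
θ = pinv(Ω) * y                   # θ* = Ω† Y
@show θ                           # ≈ (2, 3)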
The special case of Definition 3.10 given by the case A = 1D+1 is called Tikhonov
regularization. The corresponding regression problem is called ridge regression. The
parameter λ is here not considered a parameter of the model in the sense of step 3. of Al-
gorithm 3.1, but a parameter which is chosen in advance or tuned later. Such parameters
are called hyperparameters.
Theorem 3.11. Let Ω ∈ Rn×(D+1) be the feature matrix from Eq. (3.2.2) and Y ∈ Rn
be as in Eq. (3.2.1). Given the setting from Definition 3.10 and Eq. (3.2.3), for almost
every λ there is a unique solution θ ∗ = argminθ R(θ ) given by
θ ∗ = (ΩT Ω + nλ AT A)−1 ΩT Y.
Notice that Theorem 3.11 shows that the regularized linear model not only helps
against overfitting, but also in the case that Ω does not have rank D+1. This is especially
useful in the case when there is too little data with n ≤ D or when the data lives in a
lower dimensional subspace.
Proof. Setting the gradient of the regularized empirical risk to zero gives
∇R(θ ) = (2/n) ΩT (Ωθ −Y ) + 2λ AT Aθ = 0,
which implies
(ΩT Ω + nλ AT A)θ = ΩT Y.
For almost every λ the matrix ΩT Ω + nλ AT A is invertible, and then
θ = (ΩT Ω + nλ AT A)−1 ΩT Y,
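For comparison, here is a minimal Julia sketch of the ridge estimator from Theorem 3.11 with A = 1D+1 (Tikhonov regularization); the data is synthetic and the hyperparameter λ is chosen arbitrarily. This is an illustration, not the notes’ own code.

using LinearAlgebra

# Sketch: ridge regression θ* = (ΩᵀΩ + nλAᵀA)⁻¹ΩᵀY with A = identity.
n = 50
x = 10 .* rand(n)
y = 2.0 .+ 3.0 .* x .+ randn(n)
Ω = hcat(ones(n), x)
λ = 0.1

θ_ridge = (Ω' * Ω + n * λ * I) \ (Ω' * y)
@show θ_ridge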
Pθ (y | x) = Φ(y | fθ (x), σ 2 ),
where Φ is the density of the normal distribution as in Eq. (1.2.2). This means that we
have y = fθ (x) + ε for ε ∼ N(0, σ 2 ). Note that through this statistical model we model
(measurement) errors in the data. The variance σ 2 is not considered a parameter in this
model, but it is a hyperparameter. We will consider a model that has σ 2 as a parameter
in Proposition 3.14 below.
We now again consider linear regression for θ = (a, b)T ∈ R × RD . That is fθ (x) =
a + xT b.
Theorem 3.12. Let Ω ∈ Rn×(D+1) be the feature matrix as defined in Eq. (3.2.2). Let
Y ∈ Rn be as in Eq. (3.2.1). The maximum likelihood estimation in the above model for
linear regression
1. is not unique if r(Ω) < D + 1,
2. is uniquely determined by θML = Ω†Y when r(Ω) = D + 1.
Note that Theorem 3.12 shows that the risk minimization in the deterministic model
and in the statistical model give the same answer.
The first term does not involve θ . As in the proof of Theorem 3.11, we proved that
argminθ ∥Ωθ −Y ∥2 has a unique solution exactly when r(Ω) = D + 1, and then the
solution is θML = Ω†Y .
Pθ (y | x) = Φ(y | θ T φ (x), σ 2 ).
For every such model we have a corresponding feature matrix, which we denote by
Ω = [φ (x1 ) . . . φ (xn )]T ∈ Rn×P .   (3.2.4)
This notation is not in conflict with Eq. (3.2.2): in Eq. (3.2.2) we see the special case of
Eq. (3.2.4) for linear regression. We note the following.
• For D = 1 and φ (x) = (1, x, x^2 , . . . , x^{P−1} )T we obtain polynomial regression.
Theorem 3.13. Let Ω be the feature matrix in Eq. (3.2.4) and Y ∈ Rn be as in Eq. (3.2.1).
The maximum likelihood estimator in nonlinear regression is
1. not unique when r(Ω) < P.
2. uniquely determined by θML = Ω†Y when r(Ω) = P.
In Theorem 3.12 and Theorem 3.13 the variance σ² is assumed to be given. Alternatively,
we can also model it as a parameter.
Similar to linear regression, the MLE tends towards overfitting. Our solution in the
deterministic model was to introduce a penalty term λ‖Aθ‖². In the Bayesian ap-
proach θ itself is random, and instead of a penalty term we prescribe a distribution for θ,
called a prior. This then leads to the maximum a-posteriori estimator from Definition 3.6.
The next theorem computes the MAP for the statistical model (y | x) ∼ N(θ^T φ(x), σ²)
using a Gaussian prior N(µ, Σ). The theorem shows that in this case the MAP can be
understood as the statistical analogue of regularization.
Theorem 3.15. Let µ ∈ R^P and Σ ∈ R^{P×P} be positive definite. Given the statistical model above,
for the prior θ ∼ N(µ, Σ) we have the MAP
θ_MAP = (Ω^T Ω + σ² Σ^{−1})^{−1} (Ω^T Y + σ² Σ^{−1} µ),
where Ω is the feature matrix, Y ∈ R^n is as in Eq. (3.2.1), and under the assumption that
Ω^T Ω + σ² Σ^{−1} is invertible.
Proof. Let α(θ ) = P(θ | X,Y ) be defined as in Definition 3.6. By Bayes’ Theorem for
densities (Theorem 1.24),
P(θ | X, Y) = P(θ) · P(Y | X, θ) / P(Y | X).
Recall that
P(θ) = Φ(θ | µ, Σ) = (1 / √((2π)^P det Σ)) · exp(−(1/2) (θ − µ)^T Σ^{−1} (θ − µ)).    (3.2.5)
Furthermore, as the n pairs in the training data are assumed to be independent, we have
P(Y | X, θ) = ∏_{i=1}^n Φ(y_i | θ^T φ(x_i), σ²) = (1 / √((2πσ²)^n)) · exp(−(1/(2σ²)) ‖Y − Ωθ‖²).    (3.2.6)
Example 3.16. We compute the ML and MAP estimator using Theorem 3.12 and The-
orem 3.15 for the data in Example 3.9. A sample from these statistical models is shown
in Fig. 3.2. The sample points are connected by lines to plot a piecewise linear function.
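The following is a Julia sketch of the MAP estimator from Theorem 3.15; the prior parameters µ, Σ and the variance σ² are hyperparameters that we simply assume to be given here, and the usage line is only a hypothetical illustration.

```julia
using LinearAlgebra

# θ_MAP = (ΩᵀΩ + σ²Σ⁻¹)⁻¹ (ΩᵀY + σ²Σ⁻¹µ), cf. Theorem 3.15; assumes
# ΩᵀΩ + σ²Σ⁻¹ is invertible.
function map_estimator(Ω, Y, µ, Σ, σ2)
    return (Ω'Ω + σ2 * inv(Σ)) \ (Ω'Y + σ2 * (Σ \ µ))
end

# Hypothetical usage for the linear model with a centered Gaussian prior:
# θ_map = map_estimator(Ω, y, zeros(2), 100.0 * Matrix(I, 2, 2), 15.0)
```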
In the discussion after Definition 3.6 we observed that instead of finding θ deter-
ministically through MLE or MAP, we can also compute the distribution of θ given
the training data. In the context of regression this approach is called Bayesian regres-
sion. We will assume as before that θ ∼ N(µ, Σ) and (y | x, θ ) ∼ N(φ (x)T θ , σ 2 ). The
goal of Bayesian regression is to sample from the posterior distribution with density
α(θ) = P(θ | X, Y). In this concrete case we can explicitly compute the posterior dis-
tribution.
Figure 3.2: The picture shows ML and MAP for the data in Fig. 3.1. For the plot we have sampled points
and connected them by lines to illustrate the statistical model. The red points show the ML estimator, the
green points the MAP estimator.
In fact, Theorem 3.17 implies Theorem 3.15, because the density of a Gaussian is
maximized at the expected value; i.e., m = θMAP .
For the validation step we compute the marginal distribution of the response variable
y ∈ R given an input variable x ∈ RD and the training data (x1 , y1 ), . . . , (xn , yn ) ∈ RD ×R.
Proposition 3.18. Given training data (x₁, y₁), . . . , (xₙ, yₙ) ∈ R^D × R and an additional
data point (x, y) ∈ R^D × R, the distribution of (y | x, X, Y) is
(y | x, X, Y) ∼ N(φ(x)^T m, φ(x)^T S φ(x) + σ²),
where S and m are as in Theorem 3.17, and X, Y are as in Eq. (3.2.1).
Proof. See Exercise 3.4.
The discussion in this section centered around models with (non-)linear functions
RD → R. Given a nonlinear function φ : RD → RP , a straightforward generalization of
this setting is to multivariate models
f_θ : R^D → R^N,  x ↦ (φ(x)^T θ^{(1)}, . . . , φ(x)^T θ^{(N)})^T,   θ = [θ^{(1)} ··· θ^{(N)}] ∈ R^{P×N},    (3.2.8)
and using quadratic loss ℓ(y, ŷ) = ∥y− ŷ∥2 for the deterministic model or the multivariate
Gaussian distribution Pθ (y | x) = Φ(y | fθ (x), σ 2 1N ) for the statistical model. Since the
Euclidean norm satisfies ‖y − ŷ‖² = ∑_{i=1}^N (y_i − ŷ_i)² for vectors y = (y₁, . . . , y_N)^T and
ŷ = (ŷ₁, . . . , ŷ_N)^T, maximization with respect to parameters can be done for each entry
of fθ : RD → RN separately. Therefore, the estimators obtained in this section can be
used for each entry of fθ : RD → RN separately.
Another generalization of linear regression models is given by neural networks. They
are obtained by iterating nonlinear regression models.
Definition 3.19. Let L > 0 and N0 , N1 , . . . , NL > 0. Let gθi : RNi−1 → RNi be nonlinear
regression models, 1 ≤ i ≤ L. Denote D := N₀ and N := N_L. We call the model
fθ = (gθL ◦ · · · ◦ gθ1 ) : RD → RN
a neural network of depth L. The function gθi is called the i-th layer of the network.
In this definition, the linear and nonlinear regression models above are neural net-
works of depth 1. For depth larger than 1, however, we have no analytic expression for
MLE or MAP. Instead, we have to use methods from optimization for computing them.
Neural networks often encompass combinations of linear functions with so-called
activation functions. This means that in Definition 3.19 the i-th layer is given by a
function of the form gθi = σi ( fθi (x)), where fθi is a multivariate nonlinear regression
model as in Eq. (3.2.8) and σ_i is a nonlinear activation function. For instance, activation
functions for z ∈ R^k are the ReLU function σ(z) = (max(0, z_i))_{1≤i≤k}, the sigmoid
function σ(z) = (1 / (1 + exp(−z_i)))_{1≤i≤k}, or the softmax function
σ(z) = ( exp(z_i) / ∑_{j=1}^k exp(z_j) )_{1≤i≤k}.    (3.2.9)
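The activation functions above can be written in a few lines of Julia; this is a plain sketch, independent of any deep learning library.

```julia
# Componentwise activation functions for z ∈ Rᵏ, matching the definitions above.
relu(z) = max.(0, z)
sigmoid(z) = 1 ./ (1 .+ exp.(-z))

# Softmax as in Eq. (3.2.9); subtracting maximum(z) leaves the value unchanged
# but avoids overflow of exp for large entries.
function softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end
```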
Example 3.20. We use a neural network of depth 2 for the data in Example 3.9. The
two layers are σ ∘ f₁ and σ ∘ f₂, where σ is the ReLU activation function and
f₁ : R → R² and f₂ : R² → R are linear. Unlike in the previous examples we don't
have a closed form for the optimal parameters. Instead, we minimize the empirical risk
using optimization methods. The result of the computation is shown in Fig. 3.3. We
can see from the figure that the neural network computed an estimator that is piecewise
linear. This indicates that there could be a hidden latent variable describing the data
with two different linear models depending on whether speed is small or large.
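The following Julia sketch illustrates the setup of Example 3.20: a depth-2 network with ReLU activation whose empirical risk is minimized by plain gradient descent. For simplicity the gradient is approximated by finite differences; in practice one would use automatic differentiation (for instance Flux.jl or Zygote.jl), and the step size and iteration count below are arbitrary choices.

```julia
relu(z) = max.(0, z)

# Depth-2 network R → R² → R as in Example 3.20, with the 7 parameters packed
# into one vector θ = (W1, b1, W2, b2).
predict(θ, x) = θ[5:6]' * relu(θ[1:2] .* x .+ θ[3:4]) + θ[7]

risk(θ, xs, ys) = sum((predict(θ, x) - y)^2 for (x, y) in zip(xs, ys)) / length(xs)

# Plain gradient descent with a finite-difference gradient.
function train(xs, ys; steps = 20_000, η = 1e-3, h = 1e-6)
    θ = 0.1 * randn(7)
    for _ in 1:steps
        g = similar(θ)
        for j in eachindex(θ)
            θh = copy(θ); θh[j] += h
            g[j] = (risk(θh, xs, ys) - risk(θ, xs, ys)) / h
        end
        θ -= η * g
    end
    return θ
end

# Hypothetical usage with the x, y vectors from the sketch after Example 3.9:
# θ_hat = train(x, y)
```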
Figure 3.3: The picture shows a function describing the data in Fig. 3.1 that was computed using a neural
network.
Exercise 3.1. Prove Theorem 3.13. Hint: Adapt the proof of Theorem 3.12.
Exercise 3.2. Reformulate and prove Theorem 3.8 for linear models RD → RN and the
quadratic loss ℓ(y, ŷ) = ∥y − ŷ∥2 .
Exercise 3.4. Prove Proposition 3.18. Hint: By Eq. (1.2.1) the marginal density is
given by P(y | x, X, Y) = ∫_{R^P} P(y | x, θ) · P(θ | X, Y) dθ.
Exercise 3.5. From the RDatasets package in Julia load the pressure data set. This
data set contains the variables temperature and pressure, which give the values of
pressure of mercury depending on temperature. The Antoine Equation is a simple
model for this dependency:
log(pressure) = a − b / temperature.
Set up and solve a regression problem to estimate a and b.
3.3 Support Vector Machines

We discuss SVMs in the context of binary classification. In this setting the la-
bels y₁, . . . , yₙ are elements of {1, −1} and we want to find a model that fits the data
(x₁, y₁), . . . , (xₙ, yₙ) ∈ R^D × {−1, 1}. We will work here with finding a deterministic
model as in Definition 3.4 given by
f_θ : R^D → {−1, 1}.
Example 3.21. Suppose that our input data is contained in the plane R2 . Consider the
line H = {x = (x1 , x2 ) ∈ R2 | x1 + x2 − 2 = 0} shown in blue below. Then H splits the
plane in the following two regions.
[Figure: the line H (in blue) splits the plane into a region labeled “Positive”, containing the data points marked +, and a region labeled “Negative”, containing the data points marked −.]
We classify data in R2 by assigning them a plus or a minus sign. The data points on the
upper right of the plane are labeled with a plus, because for them x1 + x2 − 2 > 0. The
other data points are labeled with a minus sign, because here x1 + x2 − 2 < 0.
We now aim to find a suitable way of computing parameters for the model Eq. (3.3.1).
To do so, we first make two small observations.
Lemma 3.22. Let x ∈ RD and θ = (a, b) ∈ R × RD . For y ∈ {−1, 1}, we have y = fθ (x)
if and only if y(a + ⟨b, x⟩) > 0.
[Figure 3.4: the point x₀, the nearest point x on the hyperplane, and the normal direction b.]
Proof. In the case that f_θ(x) = −1, we have a + ⟨b, x⟩ < 0 and y = −1, which implies
y(a + ⟨b, x⟩) > 0. Similarly, when f_θ(x) = 1, then a + ⟨b, x⟩ > 0 and y = 1, so that in this
case we also have y(a + ⟨b, x⟩) > 0.
We can see from Example 3.21 that a hyperplane H which separates the data need not
be unique. SVMs select the hyperplane which maximizes the distance to the
data. The following lemma helps us to formulate the right optimization problem.
Lemma 3.23. Let x0 ∈ RD and H = {x ∈ RD | a + ⟨b, x⟩ = 0} for ⟨b, b⟩ = 1. The
Euclidean distance from x0 to H is y(a + ⟨b, x0 ⟩) where y = fθ (x0 ).
Proof. The geometrical setup of this proof is depicted in Fig. 3.4. Let x ∈ H be the point
which minimizes the distance to x₀. So we can write x₀ = x + εrb, where r = ‖x − x₀‖ ≥ 0
and ε ∈ {−1, 1}. Since x ∈ H we compute
a + ⟨b, x₀⟩ = a + ⟨b, x⟩ + εr⟨b, b⟩ = εr.
Note that if ε = −1, then y = −1 and therefore y(a + ⟨b, x₀⟩) = yεr = r. Similarly, if ε = 1,
then y = 1 and therefore y(a + ⟨b, x₀⟩) = yεr = r.
In order to compute the hyperplane that maximizes the distance to the data, we can
now write it in terms of solving the following optimization problem:
max_{θ=(a,b)} r   s.t.  y_k(a + ⟨b, x_k⟩) ≥ r for k = 1, . . . , n, and ⟨b, b⟩ = 1.    (3.3.2)
The optimization problem of Eq. (3.3.2) is often solved via an equivalent problem obtained by
normalizing the value of r. Namely, if a′ = a/r and b′ = b/r, then the constraints now ask
for y_k(a′ + ⟨b′, x_k⟩) ≥ 1. In this case ‖b′‖ = 1/r, so we can either maximize 1/‖b′‖ or
minimize ‖b′‖. This leads to the following parameter finding problem.
Definition 3.24. The Hard Margin SVM is given by the optimization problem
min_{θ=(a,b)} ‖b‖²   s.t.  y_k(a + ⟨b, x_k⟩) ≥ 1,  k = 1, . . . , n.
An issue with Hard Margin SVM when working with noisy data is that it does not
allow for outliers. To compensate for this, we will introduce a slack variable ξk .
Definition 3.25. The Soft Margin SVM is given by the optimization problem
min_{a,b,ξ} ‖b‖² + C ∑_{k=1}^n ξ_k   s.t.  y_k(a + ⟨b, x_k⟩) ≥ 1 − ξ_k,  ξ_k ≥ 0,  k = 1, . . . , n.
Here, the regularization parameter C is not taken as a parameter of the model but as a
hyperparameter.
The SVMs in Definition 3.24 and Definition 3.25 both fall in the category of Primal
SVMs. Another formulation is the Dual SVM which we now work toward defining.
We first define the Lagrange function for Soft Margin SVM:
L(a, b, ξ, α, β) = ‖b‖² + C ∑_{k=1}^n ξ_k − ∑_{k=1}^n α_k (y_k(a + ⟨b, x_k⟩) − (1 − ξ_k)) − ∑_{k=1}^n β_k ξ_k,    (3.3.3)
where αk , βk ≥ 0. The KKT conditions say that the optimum occurs when
∂L/∂a = 0,   ∂L/∂b = 0,   ∂L/∂ξ = 0.
which gives
L = ⟨b, b⟩ + C⟨ξ, e⟩ − 2⟨u, v⟩ − ⟨α + β, ξ⟩ + ⟨α, e⟩.
Therefore
0 = ∂L/∂b = 2b − 2 ∑_{k=1}^n u_k x_k   ⇒   b = ∑_{k=1}^n u_k x_k,    (3.3.4)
0 = ∂L/∂a = −2⟨u, e⟩   ⇒   0 = ∑_{k=1}^n u_k,
0 = ∂L/∂ξ = Ce − (α + β)   ⇒   Ce = α + β   ⇒   α_k ≤ C.
Remark 3.26. The equation ∂L/∂b = 0 is the meaning behind the name Support Vector
Machine. Namely, the x_k with u_k ≠ 0, or equivalently α_k ≠ 0, are the support of the
vector b.
Remark 3.28. Why a maximum? The solution for the optimization problem for the
Primal SVM is given by
min_{a,b,ξ} max_{α,β} L(a, b, ξ, α, β).    (3.3.5)
When the Dual SVM problem has been solved, we can use the following result to
compute the direction b of the hyperplane and the offset a.
Proposition 3.29. Let α be an optimal solution from the Dual SVM. Then, we have
optimal values for the Soft Margin SVM from Definition 3.25 by setting
1. b∗ := ∑nk=1 uk xk , where uk = yk αk ;
2. a∗ is the median value of yk − ⟨b∗ , xk ⟩ for all k with αk ̸= 0.
Proof. The formula for b∗ follows from Eq. (3.3.4). Using that α + β = Ce at the optimal
value, the Lagrange function from Eq. (3.3.3) becomes
L = ‖b‖² − ∑_{k=1}^n α_k (y_k(a + ⟨b, x_k⟩) − 1).
yk (a + ⟨b, xk ⟩) − 1 ≤ 0 ⇒ αk = 0. (3.3.6)
This implies
L = ‖b‖² − ∑_{k: α_k>0} α_k (y_k(a + ⟨b, x_k⟩) − 1) = ‖b‖² − C ∑_{k: α_k>0} (y_k(a + ⟨b, x_k⟩) − 1),
as the term in the middle is maximized for setting all αk = C. Now, by Eq. (3.3.6)
the summands on the right are all nonnegative. Using that |yk | = 1 for all k we get
that a∗ is a point on the real line that minimizes ∑k: αk >0 |yk − a − ⟨b∗ , xk ⟩|. The median
of yk − ⟨b∗ , xk ⟩ for αk ̸= 0 is a minimizer (see Exercise 3.7).
Example 3.30. There is no hyperplane separating the following 4 points into 2 classes.
Even using Soft-Margin SVM the 2 classes can’t be well separated.
[Figure: four points at the corners of a square with alternating labels +1 and −1 (an XOR-type configuration), the sign patterns ++, +−, −+, −− of the corresponding quadrants, and the same four labeled points again.]
For solution 2 above the idea is to combine an SVM with a feature map
φ : RD → RM
(as we did in nonlinear Regression), and compute an SVM for the modified data points
(φ (x1 ), y1 ), . . . , (φ (xn ), yn ). In this situation the Dual SVM is well suited. In particular,
in the formulation of Definition 3.27 only the inner product between input variables
occurs. So it is enough to know the value of the inner products ⟨φ (xk ), φ (xℓ )⟩. These
are parametrized through positive semi-definite matrices.
Lemma 3.31. Let G ∈ Rn×n . Then G is positive semi-definite if and only if there exists
M and vectors z1 , . . . , zn ∈ RM with G = (⟨zi , z j ⟩)ni, j=1 .
Proof. Let Z ∈ RM×n be the matrix whose columns are the zi . If G = Z T Z, for every
w ∈ Rn \ {0} we then have wT Gw = (Zw)T Zw ≥ 0. On the other hand, if G is positive
semi-definite, we can find a Cholesky-decomposition G = Z T Z.
In the context of SVMs, the positive semi-definite matrices are also called kernels. The
function
κ(x, y) := ⟨φ(x), φ(y)⟩
is called the kernel map. By Proposition 3.29 we have the optimal value b∗ = (1/2) ∑_{i=1}^n u_i φ(x_i).
Let us define
ψ(x) := (1/2) ∑_{i=1}^n u_i κ(x_i, x).
Then, we have ψ(x) = ⟨b∗ , φ (x)⟩ by linearity. The optimal value for a is then the median
of yk − ψ(xk ) for αk ̸= 0. Moreover, we can evaluate the model from Eq. (3.3.1) as
fθ (x) = sgn(a + ψ(x)). All this leads to the following algorithm.
Algorithm 3.2: Binary classification by Dual SVM.
1 Input: Training data (x1 , y1 ), . . . , (xn , yn ) ∈ RD × {−1, 1}, a kernel map
κ(x, y), and a regularization parameter C.
2 Output: A function f : RD → {−1, 1} of the form f (x) = sgn(a + ⟨b, φ (x)⟩).
3 Compute the kernel matrix G = (κ(xi , x j ))1≤i, j≤n ;
4 Using C and G solve the Dual SVM problem from Definition 3.27 for α ∈ Rn ;
5 Define the function ψ(x) := 12 ∑ni=1 yi αi κ(xi , x);
6 Take a as the median of {y_k − ψ(x_k) | α_k ≠ 0};
7 Return f (x) = sgn(a + ψ(x)).
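The following Julia sketch implements steps 5–7 of Algorithm 3.2, i.e. the construction of the classifier. The dual solution α from step 4 is assumed to have been computed already by some quadratic programming solver, and the polynomial kernel below is only an example choice.

```julia
using LinearAlgebra, Statistics

# Example kernel map; any positive semi-definite kernel κ(x, y) can be used.
polykernel(x1, x2; d = 2) = (dot(x1, x2) + 1)^d

# X: vector of data points, y: labels in {−1, 1}, α: dual solution from step 4.
function dual_svm_classifier(X, y, α; κ = polykernel)
    ψ(x) = 0.5 * sum(y[i] * α[i] * κ(X[i], x) for i in eachindex(X))   # step 5
    support = findall(!iszero, α)
    a = median([y[k] - ψ(X[k]) for k in support])                      # step 6
    return x -> sign(a + ψ(x))                                         # step 7
end
```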
⟨φ(u), φ(v)⟩ = φ(u)^T φ(v) = ∑_{i_0+···+i_D = d} u_0^{i_0} u_1^{i_1} ··· u_D^{i_D} v_0^{i_0} v_1^{i_1} ··· v_D^{i_D}
Figure 3.5: Images of coats and sandals from the FashionMNIST dataset.
The goal is to obtain a function that classifies the data. The FashionMNIST dataset
contains 6000 images of coats and 6000 images of sandals. We pick a subset of n = 50
images as training data and another subset of 50 images as test data.
First, we use an SVM with parameter C = 5 and kernel map κ(x₁, x₂) = ⟨x₁, x₂⟩.
Next, we use a neural network with three layers given by R^D → R^20, R^20 → R^15 and
R^15 → R^2. The inner layers use the ReLU activation function. For the last layer
we use the softmax activation function from Eq. (3.2.9). This way, the neural network
2 https://github.com/zalandoresearch/fashion-mnist
produces a probability distribution for the two classes. We use the cross-entropy loss
function ℓ(p, q) = −p₁ log(q₁) − p₂ log(q₂) (see, e.g., [Mur13, Section 8.3.1]). The
classifier chooses the class which has the largest probability.
The results of the two classifiers evaluated on the test data are shown in Fig. 3.6.
Figure 3.6: The results of the two classifiers from Example 3.33 evaluated on the test data. The SVM-based
classifier identifies 90% of the images correctly. The neural network based classifier identifies 96% of the
images correctly.
Exercise 3.6. Consider the Hinge loss ℓ(y, ŷ) = max{0, 1−y· ŷ}. Show that Soft Margin
SVM can be understood as empirical risk minimization with respect to a regularized
Hinge loss.
Exercise 3.7. Let w1 , . . . , wn ∈ R. Prove that the median of the wi minimizes the aggre-
gated distances d(v) = ∑ni=1 |wi − v|.
Exercise 3.8. The MLDatasets.jl3 package provides the MNIST dataset4 . This dataset
contains images of handwritten digits. Load the training data for images of zeros and
ones and implement an algorithm that learns to separate these two classes. After the
learning step let your algorithm predict the labels of test data points.
3 https://github.com/JuliaML/MLDatasets.jl
4 http://yann.lecun.com/exdb/mnist/
3.4 Principal Component Analysis

Given data points x₁, . . . , xₙ ∈ R^D, with the help of PCA we model the data in a way that
reduces the number of parameters describing the data. This can have several
motivations. For instance, we could be interested in data compression, and reducing
parameters would reduce the memory needed for storing the data. Another motivation is to
interpret the parameters as geometric information, so that the goal here would be to
learn the shape of the data.
For PCA we will not be taking into account response variables y1 , . . . , yn ∈ RN . Learn-
ing without using response variables is called unsupervised learning. By contrast, the
settings from the previous sections are summarized as supervised learning.
The basic idea of PCA is to find a linear space U ⊆ R^D of dimension d ≪ D and a
vector b ∈ R^D so that x₁, . . . , xₙ lie “near” U + b.
Example 3.34. Here is a small graphic when D = 2 and d = 1. The line is represented
by U + b, and the points are the points x1 , . . . , x8 .
[Figure: eight data points scattered near a line U + b in the plane.]
For the moment, we take d as a fixed input parameter. We will discuss later how this
parameter can be chosen.
As depicted in Example 3.34, the assumption in PCA is that the data centers around
a low-dimensional linear subspace, and the goal is to determine this subspace. This
assumption, however, is not always fulfilled. Data points can also lie on a nonlinear
subspace. For this reason, we again work with a nonlinear feature map φ : RD → RM ,
and we set
zi := φ (xi ), 1 ≤ i ≤ n.
For instance, if we take the polynomial feature map from Lemma 3.32, then finding a
linear subspace for the zi means finding a low-dimensional algebraic variety describing
the data; i.e., the vanishing set of a system of polynomial equations.
Example 3.35. We consider a sample of n = 100 points from the unit circle in R2 plus
Gaussian noise. The point sample is shown in Fig. 3.7. We apply PCA combined with
the nonlinear feature map φ((x₁, x₂)) = (1, x₁, x₂, x₁x₂, x₁², x₂²) ∈ R⁶. By doing so, we
find that the data lies in a hyperplane in R⁶. The equation of this hyperplane yields
approximately the nonlinear equation x₁² + x₂² − 1 = 0.
Figure 3.7: A noisy sample of n = 100 points from the unit circle.
In the following, we assume that the data x1 , . . . , xn are independent samples from an
(unknown) random variable x ∈ RD , and we set z := φ (x), so that z1 , . . . , zn are indepen-
dent samples from z.
We also denote the expected value µ := E z ∈ R^M and the covariance matrix of
z = (z^{(1)}, . . . , z^{(M)})^T:
Σ := (Cov(z^{(i)}, z^{(j)}))_{1≤i,j≤M} ∈ R^{M×M}.
Remark 3.37. A common approach is to standardize the data. This means to replace,
for every k, the i-th entry (z_k)_i by (z_k)_i / √(s_ii), and then to use the modified data for PCA.
As mentioned above, in practice we do not have Σ available and instead use the eigen-
vectors of S for computing U.
Before we prove this theorem, let us discuss for a brief moment the choice of d. The
statement of Theorem 3.39 implies that d should be chosen, such that the data spreads
out in the direction of the eigenvectors for λ1 , . . . , λd , but not in the directions of the
eigenvectors for λd+1 , . . . , λM . One choice for such a d is to take λd > 0 and λd+1 ≈ 0.
Alternatively, we can choose d such that λd /λd+1 or λd − λd+1 is maximized.
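The following is a minimal Julia sketch of PCA based on the empirical covariance matrix S and its eigendecomposition; the eigenvalue-gap heuristic for choosing d discussed above is included. The function names and the column-wise data layout are our own conventions here.

```julia
using LinearAlgebra, Statistics

# PCA for data points given as the columns of Z ∈ R^{M×n}.
function pca(Z::AbstractMatrix, d::Int)
    zbar = mean(Z, dims = 2)                     # empirical mean
    W = Z .- zbar                                # centered data
    S = W * W' / size(Z, 2)                      # empirical covariance matrix
    λ, U = eigen(Symmetric(S))                   # eigenvalues in increasing order
    order = sortperm(λ, rev = true)
    return λ[order], U[:, order[1:d]], vec(zbar) # sorted eigenvalues, basis of U, mean
end

# One heuristic for d from the discussion above: the largest gap λ_d − λ_{d+1}.
gap_choice(λ) = argmax(λ[1:end-1] .- λ[2:end])
```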
PU = AAT ,
and so
‖P_U(z − µ)‖² = ‖AA^T(z − µ)‖²
= (AA^T(z − µ))^T (AA^T(z − µ))
= (z − µ)^T AA^T AA^T (z − µ)
= (z − µ)^T AA^T (z − µ);
the last line because A^T A = 1_d. Now, using that AA^T = ∑_{i=1}^d u_i u_i^T this gives
‖P_U(z − µ)‖² = ∑_{i=1}^d (z − µ)^T u_i u_i^T (z − µ) = ∑_{i=1}^d u_i^T (z − µ)(z − µ)^T u_i.
Using Exercise 3.9 and that the expected value is linear we get
E ‖P_U(z − µ)‖² = ∑_{i=1}^d u_i^T E[(z − µ)(z − µ)^T] u_i = ∑_{i=1}^d u_i^T Σ u_i.
We compute
∂L/∂u_j = 2Σu_j − 2u_j ℓ_{jj} − ∑_{i<j} u_j ℓ_{ij} = 0,   1 ≤ j ≤ d,    (3.4.2)
∂L/∂ℓ_{ij} = u_i^T u_j − δ_{ij} = 0,   1 ≤ i ≤ d.    (3.4.3)
From Eq. (3.4.2) for j = 1 it follows that Σu1 = ℓ11 u1 thus ℓ11 = λ1 . Similarly,
By continuing this process, we see that ∑di=1 uTi Σui = ∑di=1 λi . This expression is max-
imized by taking λ1 ≥ . . . ≥ λd ≥ λd+1 ≥ . . . ≥ λM . Finally, if λd > λd+1 , then U is
uniquely determined as the sum of the eigenspaces for λ1 , . . . , λd .
In the second approach, instead of maximizing the variance, we want to minimize the
squared distance to the data points.
∑_{i=1}^n ‖(z_i − z̄) − P_U(z_i − z̄)‖².    (3.4.4)
The next theorem shows that, although this approach is conceptually different from
maximizing the variance, we get the same minimizer as in Theorem 3.39 (when replac-
ing the covariance matrix Σ by the empirical covariance matrix S).
WW^T = nS.
W − AA^T W = ∑_{i=d+1}^M u_i u_i^T W.
Observe that
W − P_U W = W − AA^T W = [w₁ − P_U(w₁) ··· wₙ − P_U(wₙ)].
As in the proof of Theorem 3.39 we show that the u_i must be eigenvectors of S, so that
U = span{u₁, . . . , u_d} for Su_i = λ_i u_i and λ₁ ≥ ··· ≥ λ_M ≥ 0. The uniqueness statement
follows as in the proof of Theorem 3.39.
Consider a singular value decomposition W = UDV^T. Then
W^T W = V D² V^T ∈ R^{n×n}.
Since the i-th columns of U and V are related by ui = W vi /∥W vi ∥, this shows that it suf-
fices to compute an eigendecomposition of an n × n-matrix, rather than a decomposition
of an M × M-matrix. Moreover, W T W can be computed using only the kernel map.
Lemma 3.42. Let κ(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩ be the kernel map for φ, and let the kernel
matrix be G = (κ(x_k, x_ℓ))_{1≤k,ℓ≤n}. Then, for W = Ω^T − z̄e^T we have
W^T W = (1_n − (1/n) ee^T) G (1_n − (1/n) ee^T).
Proof. We get from Eq. (3.4.1) that W = Ω^T(1_n − (1/n) ee^T). This implies
W^T W = (1_n − (1/n) ee^T) ΩΩ^T (1_n − (1/n) ee^T) = (1_n − (1/n) ee^T) G (1_n − (1/n) ee^T),
since G = ΩΩ^T.
In particular, this lemma shows that W T W will always have rank at most n − 1.
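Lemma 3.42 translates into a short kernel PCA sketch in Julia: the eigenvalues of W^T W are obtained from the kernel matrix alone, without ever evaluating φ. The degree-4 polynomial kernel below mirrors Example 3.43 but is only a default choice.

```julia
using LinearAlgebra

# Eigenvalues of WᵀW computed from the kernel matrix only, cf. Lemma 3.42.
function kernel_pca_eigenvalues(X; κ = (u, v) -> (dot(u, v) + 1)^4)
    n = length(X)
    G = [κ(X[i], X[j]) for i in 1:n, j in 1:n]    # kernel matrix
    C = Matrix(I, n, n) - ones(n, n) / n          # centering matrix 1ₙ − (1/n)eeᵀ
    return sort(eigvals(Symmetric(C * G * C)), rev = true)
end
```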
Example 3.43. We load the CIFAR105 dataset. It contains images of various items. We
select n = 50 images of trucks for PCA (6 of them are shown in Fig. 3.8).
Each image is given as RGB values for 32 × 32 pixels. We therefore have data in R^D
for D = 32² · 3 = 3072. We want to apply PCA using κ(x₁, x₂) = (⟨x₁, x₂⟩ + 1)⁴ as kernel
map. By Lemma 3.32, this corresponds to the feature map φ(x) that evaluates all mono-
mials of degree at most 4 at x ∈ R^D. There are (D+4 choose 4) = 3,722,945,108,225 > 3 · 10^12
such monomials. Evaluating φ directly is not an option. Instead, we use Lemma 3.42
to compute an eigendecomposition of W^T W. This gives eigenvalues λ₁ ≥ . . . ≥ λₙ.
We compute the relative differences di := (λi − λi−1 )/λi for 2 ≤ i ≤ n. Their values are
shown in Fig. 3.9. We know that W T W always has one zero eigenvalue. This is the point
on the top left of Fig. 3.9. The second point on the top left indicates that there could be
a polynomial of degree 4 approximately vanishing on the data. For instance, the trucks
are photographed from different angles, so that the equation could describe rotational
symmetry (this would yield a polynomial equation of degree 2: g(x) = 0. If g(x) is small
compared to other eigenvalues, we expect that g(x)2 is even smaller in comparison; i.e.,
taking degree 4 polynomials gives a clearer separation of eigenvalues).
5 http://www.cs.toronto.edu/~kriz/cifar.html
Figure 3.9: The distribution of the relative differences d_i := (λ_i − λ_{i−1})/λ_i for the eigenvalues of W^T W in
Example 3.43.
Let us now assume the following statistical setting for PCA. Recall that we have data
points x1 , . . . , xn chosen independently from x, where x ∈ RD is a random variable. Let
d ≤ D. We assume that there is a Gaussian latent variable ζ ∼ N(ν, B) with ν ∈ RD
and B ∈ RD×D positive definite, such that
x = Aζ + b + ε,    (3.4.5)
where A ∈ R^{D×d}, b ∈ R^D, and ε ∼ N(0, σ²1_D) is independent noise. Then
x ∼ N(Aν + b, ABA^T + σ²1_D).
We can use Proposition 3.44 for maximum likelihood, maximum a-posteriori estima-
tion (see Definition 3.6), or for taking the Bayesian perspective where we assume priors
for the parameters (A, b) ∈ RD×d × RD . Alternatively, we can also compute the distri-
bution of ζ given a data point x. We can use this last approach to use data for updating
the distribution of ζ , and then generating synthetic data points by sampling (x | ζ ).
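Sampling from the generative model in Eq. (3.4.5) is straightforward once A, b, ν, B and σ are fixed; the following Julia sketch draws one synthetic data point (it assumes B is positive definite, as above).

```julia
using LinearAlgebra

# Draw one sample of x = Aζ + b + ε (Eq. (3.4.5)) with ζ ~ N(ν, B) and
# ε ~ N(0, σ²1_D); A, b, ν, B, σ are assumed to be given.
function sample_x(A, b, ν, B, σ)
    ζ = ν + cholesky(Symmetric(B)).L * randn(length(ν))   # ζ ~ N(ν, B)
    ε = σ * randn(length(b))                               # ε ~ N(0, σ²1_D)
    return A * ζ + b + ε
end
```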
Theorem 3.45. Suppose that we have a prior ζ ∼ N(ν, B). Then, the posterior distri-
bution of ζ given x is
(ζ | x) ∼ N(m,C)
with covariance matrix C = (σ −2 AT A + B−1 )−1 and m = C(σ −2 AT (x − b) + B−1 ν).
Example 3.46. We load n = 500 images of sneakers from the FashionMNIST dataset.
A subset of these images is shown in Fig. 3.10.
As in Example 3.33 we have data points in R784 and we apply PCA to this data using
a subspace of dimension d = 31. This results in a matrix A ∈ R^{784×31} and a vector
b ∈ R^{784}. Then, we consider the statistical model from Eq. (3.4.5) with ζ ∼ N(0, 2 · 1_d)
and ε ∼ N(0, 0.001 · 1_D). We use this model to generate synthetic images of sneakers.
A sample is shown in Fig. 3.11. In this setting, we call Eq. (3.4.5) a generative model.
Figure 3.11: Synthetic images of sneakers sampled from the statistical model in Eq. (3.4.5).
Finally, we load one additional data point x and use Theorem 3.45 to compute the pos-
terior distribution for ζ given x. Fig. 3.12 shows the image corresponding to x together
with an image generated from this model.
Figure 3.12: The left picture shows a datapoint from the FashionMNIST dataset. The picture on the
right shows a synthetic image that was sampled using the posterior distribution in Theorem 3.45 given
the image on the left.
Exercise 3.10. Take again the MNIST dataset from the MLDatasets.jl package and
load the training data for pictures of ones and zeros. Use PCA to reduce the number of
parameters representing these pictures. Then, load a point x from the test data set and
compute the posterior distribution for (ζ | x) in Theorem 3.45 using x. Use the posterior
distribution to generate synthetic data.
Exercise 3.11. Consider the function f (Σ) = Σ−1 , where Σ ∈ Rn×n is invertible.
1. Prove that f is differentiable at Σ. Hint: Formulate f as a rational function in the
entries of Σ.
2. Show that ∂f/∂Σ_{ij} = −Σ^{−1} e_i e_j^T Σ^{−1}, where e_k is the k-th standard basis vector in R^n.
Hint: Differentiate both sides of ΣΣ−1 = 1n .
Exercise 3.12. In the setting of Proposition 3.44 we have the likelihood equation
L(A, b) = ∏_{i=1}^n Φ(x_i | b, AA^T + σ²1_D).
4 Topological Data Analysis
This chapter is partially based on the lecture notes by Botnan [Bot20] and the textbook
by Dey and Wang [DW22].
The goal of topological data analysis (TDA) is to learn the topology of data sets.
For this one assumes that the data points are (possibly noisy) samples from an unknown
geometric object in R^D that we call the model.
The topology of the model gives an idea of the shape of data up to continuous de-
formations. A simple example is the unit disc D = {(x, y) : x2 + y2 ≤ 1} ⊆ R2 . The disc
can be continuously deformed into the square S = {(x, y) : max{|x|, |y|} ≤ 1} (denoted
D ≈ S). This cannot be done if we consider instead the unit disc with a hole cut out,
for example the annulus A = {(x, y) : 1/2 ≤ x² + y² ≤ 1}:
[Figure: the disc D, the square S, and the annulus A, with D ≈ S but S ≉ A.]
The geometric reason why we cannot continuously deform A into S is that A has a
hole, while S does not have a hole. In this chapter we will focus on learning the number
of holes of a model, which is one of the central aspects in TDA. We will give an explicit
definition of what it means to have an n-dimensional hole, which we can do with the
help of linear algebra, and homology.
Computing the number of n-dimensional holes can be used to classify and separate
data. For example, see the Blue Brain Project [Hes20]. In other applications, TDA
can be used to understand the relationship between data points. We will show that the
number of 0-dimensional holes is equal to the number of connected components. For
instance, in robotics the model could be the state space of a robot (the states into which
the robot can move), and then the number of connected components of the state space is relevant to
understand whether the robot can change from any state to another state.
Simplices are the building blocks of simplicial complexes. The next definition intro-
duces simplicial complexes as unions of simplices with a particular structure.
K = {{x0 }, {x1 }, {x2 }, {x3 }, {x4 }, {x1 , x2 }, {x0 , x1 }, {x0 , x2 }, {x3 , x4 }}.
Example 4.11. The Čech Complex and Vietoris–Rips Complex at level r can best be
understood by drawing discs of radius r around each data point. Consider, for instance,
a data set of four points P = {x₀, x₁, x₂, x₃} ⊂ R²:
[Figure: the four data points with discs of radius r around them, shown for three values of r.]
For any subset of data points {x_i | i ∈ I}, I ⊂ {0, 1, 2, 3}, we add the corresponding
simplex to C_r(P) if the discs around x_i, i ∈ I, intersect in a common point. By contrast,
we add the simplex to VR_r(P) if the discs intersect pairwise.
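As a sketch, the Vietoris–Rips complex up to dimension 2 can be computed directly from pairwise distances in a few lines of Julia; for larger complexes and higher dimensions one would use a dedicated package such as Ripserer.jl. The encoding of simplices as vectors of point indices is our own convention.

```julia
using LinearAlgebra

# Vietoris–Rips complex VR_r(P) up to dimension 2: a simplex is included as soon
# as all pairwise distances of its vertices are at most 2r (discs of radius r
# intersect pairwise). P is a vector of points in R^D.
function vietoris_rips(P, r)
    n = length(P)
    near(i, j) = norm(P[i] - P[j]) <= 2r
    vertices  = [[i] for i in 1:n]
    edges     = [[i, j] for i in 1:n for j in i+1:n if near(i, j)]
    triangles = [[i, j, k] for i in 1:n for j in i+1:n for k in j+1:n
                 if near(i, j) && near(i, k) && near(j, k)]
    return vcat(vertices, edges, triangles)
end
```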
Example 4.12. Figure 4.1 shows a noisy sample of 50 points on an ellipse and the
Vietoris–Rips complex at level r = 0.25 for this data. At this level the complex does not
yet contain a cycle going around the hole of the ellipse. The value of r is too small to
capture the correct topology of the ellipse.
In the next section we will then develop strategies in the spirit of homology theory
to define and count holes in simplicial complexes. This gives the following general
algorithm for computing the topology of (discrete) datasets.
Figure 4.1: A sample of 50 points from an ellipse and the Vietoris–Rips complex at level r = 0.25.
The advantage of the Vietoris–Rips complex over the Čech complex is that it is easier to com-
pute. For α ⊆ P, for the Vietoris–Rips complex we must compute (|α| choose 2) distances, whereas
the Čech complex requires checking whether a system of polynomial inequalities has
a solution. On the other hand, the Vietoris–Rips complex automatically fills in simplices.
For example, {p, q}, {p, u}, {q, u} ∈ VR_r(P) implies {p, q, u} ∈ VR_r(P). Therefore
the Vietoris–Rips complex cannot represent all simplicial complexes. The trade-off be-
tween the Čech and the Vietoris–Rips complex is quantified in Proposition 4.13.
‖x_q − z‖ ≤ (1/(d+1)) ∑_{p∈α, p≠q} ‖x_q − x_p‖ ≤ 2r.
Exercise 4.3. Let P = {(0, 0), (1, 1), (2, 1), (1, −2)} ⊂ R2 . Draw all the possible Vietoris-
Rips complexes as r ranges in (0, +∞).
4.3 Homology
We work towards defining holes in simplicial complexes. Let us first study the case of
a geometric simplicial complex K in the plane R². Once we have understood this basic case, we
will generalize the established concepts to simplicial complexes realized in higher dimensions.
Let P = {x0 , . . . , xn } be the vertices of K. In this case, a first meaningful definition
of a hole is a triangle ∆ = {xi , x j , xk }, such that the edges of ∆ are present in K, but ∆
itself is not. The next example, however, shows that this is not enough.
Example 4.14. In the following example there are 3 holes, but only 2 of them are
triangles. The hole surrounded by x5 , x8 , x9 , x10 is a quadrilateral.
We highlight the requirements on σ from Example 4.14; they all ensure that each hole has
exactly one representative, so that we can accurately count the number of holes.
Recall from the definition of a cycle that we do not allow isolated vertices. This rules
out subgraphs containing an isolated vertex. The fact that a cycle has no repeated vertices
rules out additional edges which do not contribute to the hole. Finally, the requirement
that no 3-element subset V of the cycle lies in K ensures that a cycle of length 3 does not
bound a filled triangle in the complex.
We have the following result for counting the number of holes:
Proof. We proceed by induction on the number of edges. In the base case, there are 0
edges, so the complex consists only of a single vertex, and therefore has zero holes.
Now for the induction step, suppose K is a simplicial complex for which the assumption
holds. When we add an edge to K, there are two possibilities.
1. The new edge also introduces a new vertex. In this case we do not create a circle, and
• #{holes of K} and #{triangles of K} stay the same;
• #{vertices of K} and #{edges of K} each increase by 1.
2. The new edge does not introduce a new vertex. In this case, we create a circle.
a) If the circle has 3 edges, we also add the triangle, so that
• #{holes of K} and #{vertices of K} stay the same;
• #{edges of K} and #{triangles of K} each increase by 1.
b) If we do not add a triangle, then
• #{vertices of K} and #{triangles of K} stay the same;
• #{holes of K} and #{edges of K} each increase by 1.
In all cases the claimed formula is preserved.
[Figure: T, the boundary of a tetrahedron on the vertices x₀, x₁, x₂, x₃.]
Intuitively, we could guess that T has a 2-dimensional hole in the inside because it is
enclosed by 4 triangles. The 4 triangles play the role of a two-dimensional circle: that
is, they give a 2-sphere! So the idea is to define a two-dimensional hole in a simplicial
complex as an unfilled 2-sphere. In general we want to define an n–dimensional hole as
an unfilled n-sphere.
[Figure: W, the surface of a cube on the vertices x₀, . . . , x₇, triangulated into 12 triangles.]
Notice that T and W both build a hole, however the hole in T is enclosed by 4 triangles
while the hole in W is enclosed by 6 · 2 = 12 triangles. A circle in the 1-skeleton K (1)
is easy to define because there are only two options to go around a circle: clockwise or
counterclockwise. However the holes in T and W give many more options of how to go
around a hole.
With help from linear algebra we can solve this problem (and compute holes!). The
idea is to define all of the possibilities in which one can go around a hole as vectors in a
vector space, and then model the structure of the simplicial complex using linear maps.
We then define holes as the kernel of a linear map so that the number of holes is equal
to the number of linearly independent vectors in the kernel.
To operate via linear algebra on a simplicial complex K we first associate a vector
space to the simplices in K. The concept we need is that of a free vector space.
Definition 4.17. Let S = {s₁, . . . , sₘ} be a finite set. The free vector space of S (over
F₂ = Z/2Z) is the vector space
F(S) := { ∑_{i=1}^m a_i s_i | a_i ∈ F₂ },
whose elements are formal linear combinations of the elements of S.
Notice that F(S) ≅ F₂^{|S|}, and in particular dim(F(S)) = |S|.
Definition 4.18. Let n ≥ 0 and K a simplicial complex. The vector space of n-chains
in K is defined as
Cn (K) := F(K (n) \ K (n−1) );
i.e., Cn (K) is the free vector space of the set of n-simplices in K. We let C−1 = {0} be
the trivial vector space.
The n-th boundary map is the linear map ∂_n : C_n(K) → C_{n−1}(K) defined through
∂_n({x₀, . . . , xₙ}) = ∑_{i=0}^n {x₀, . . . , xₙ} \ {x_i}.
The central property of a boundary is that a boundary does not itself have a boundary,
exactly like a circle has no boundary. This property is central in the theory of homology
and it will be useful in what follows.
By symmetry, every summand appears twice for (i, j) and ( j, i), thus the coefficients in
F2 sum to zero, which shows that (∂n−1 ◦ ∂n )(α) = 0.
Proposition 4.20 shows that Im(∂_n) ⊆ ker(∂_{n−1}) for all n. We already discussed above
that Im(∂_n) captures the boundaries of all n-simplices in K. So we want to define an n-
dimensional hole through an (n − 1)-dimensional boundary that does not lie in Im(∂_n).
How can we capture this algebraically? Let us illustrate the underlying idea with the
following example: Consider the following complex K consisting of a filled and an
empty triangle, joined at a vertex.
[Figure: two triangles on the vertices x₀, x₁, x₂ and x₂, x₃, x₄, joined at the vertex x₂; one of them is filled, the other is empty.]
Then:
In this case w and v + w describe the same hole, but v + w does an extra round around
the right triangle, so they are not equal in C1 (K). To identify these two paths, we pass to
the quotient space ker(∂1 )/Im(∂2 ). This space is well-defined as Im(∂2 ) ⊆ ker(∂1 ) by
Proposition 4.20. In the quotient space w and v + w are identified, because they differ by
the boundary v. In other words, by passing to the quotient space we merge all possible
paths around a hole into one single object, which we then interpret as the hole.
This motivates the following definition.
Definition 4.21. Let K be a simplicial complex. The n-th homology vector space is
H_n(K) := ker(∂_n) / Im(∂_{n+1}),
and β_n(K) := dim H_n(K) is called the n-th Betti number of K.
We can compute β₀ by using from Definition 4.18 that C₋₁ is the trivial vector space,
so that ker(∂₀) = C₀(K) is the free vector space on the set of vertices of K.
We interpret elements in Hn (K) as n-dimensional holes in K. The Betti number βn (K)
counts the number of n-dimensional holes. In this sense, an n-dimensional hole in K is
a hole whose boundary has dimension n: an empty triangle is a one-dimensional hole;
an empty tetrahedron is a two-dimensional hole and so on. Zero-dimensional holes are
then “empty edges”. We show in Lemma 4.23 that β0 (K) is the number of connected
components of K.
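The definitions above can be turned into a small Julia sketch for computing Betti numbers over F₂: one assembles the boundary matrices ∂_n and uses β_n = dim ker(∂_n) − rank(∂_{n+1}). This is intended for small complexes only; the encoding of K as a list of vertex lists and all function names are our own conventions here.

```julia
# A simplicial complex K is encoded as a vector of simplices, each simplex being
# a vector of vertex labels.
simplices_of_dim(K, n) = [σ for σ in K if length(σ) == n + 1]

# Matrix of ∂ₙ : Cₙ(K) → Cₙ₋₁(K) over F₂ with respect to the simplex bases.
function boundary_matrix(K, n)
    rows, cols = simplices_of_dim(K, n - 1), simplices_of_dim(K, n)
    D = zeros(Int, length(rows), length(cols))
    for (j, σ) in enumerate(cols), (i, τ) in enumerate(rows)
        D[i, j] = issubset(τ, σ) ? 1 : 0   # τ is a facet of σ iff τ ⊆ σ
    end
    return D
end

# Rank over F₂ by Gaussian elimination mod 2.
function rank_f2(D)
    A = mod.(D, 2)
    r = 0
    for j in 1:size(A, 2)
        i = findfirst(!iszero, A[r+1:end, j])
        i === nothing && continue
        i += r
        A[[r+1, i], :] = A[[i, r+1], :]             # move pivot row up
        for k in 1:size(A, 1)
            if k != r + 1 && A[k, j] == 1
                A[k, :] = mod.(A[k, :] .+ A[r+1, :], 2)   # eliminate
            end
        end
        r += 1
    end
    return r
end

# βₙ = dim ker ∂ₙ − rank ∂ₙ₊₁ = #(n-simplices) − rank ∂ₙ − rank ∂ₙ₊₁.
function betti(K, n)
    kn = length(simplices_of_dim(K, n))
    return kn - rank_f2(boundary_matrix(K, n)) - rank_f2(boundary_matrix(K, n + 1))
end

# Hypothetical usage for the complex of Example 4.22 (here we take {x0,x1,x2}
# as the filled triangle):
# K = [[0],[1],[2],[3],[4],[0,1],[0,2],[1,2],[2,3],[2,4],[3,4],[0,1,2]]
# betti(K, 0) == 1, betti(K, 1) == 1
```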
Example 4.22. Let K be the complex from above that consists of one filled and one
empty triangle joined at a vertex. Since K has one connected component and one hole,
Thus H₁(K) ≅ span({w}), which implies β₁(K) = 1. Similarly,
Im(∂1 ) = span{x0 + x1 , x1 + x2 , x2 + x3 , x3 + x4 },
ker(∂0 ) = span{x0 , x1 , x2 , x3 , x4 } = span{x0 , x0 + x1 , x1 + x2 , x2 + x3 , x3 + x4 },
Let us now show that the zeroth Betti number gives the number of connected com-
ponents; this can be interpreted as saying that zero-dimensional holes are connected components.
Recall from Theorem 4.16 that for planar complexes we could compute the number
of holes β1 (K) by computing the alternating sum of the number of simplices in the
complex. The Euler characteristic generalizes this to general simplicial complexes.
Definition 4.24. Let K be a simplicial complex, and let ki be the number of i-dimensional
simplices in K. The Euler characteristic of K is
χ(K) := ∑_{i≥0} (−1)^i k_i.
As an example, we compute the Euler characteristic of the triangle and cube from
above.
Example 4.25. Consider again the following empty tetrahedron T and the cube W (consisting only
of edges, vertices and triangles):
[Figure: the tetrahedron boundary T on the vertices x₀, . . . , x₃ and the triangulated cube surface W on the vertices x₀, . . . , x₇.]
χ(T) = 4 − 6 + 4 = 2,   χ(W) = 8 − 18 + 12 = 2.
Both complexes have one connected component, no 1-dimensional holes, and one 2-
dimensional hole: β0 (T ) = β0 (W ) = 1, β1 (T ) = β1 (W ) = 0, and β2 (T ) = β2 (W ) = 1.
Their alternating sum is β0 (T ) − β1 (T ) + β2 (T ) = 2 = χ(T ).
The fact that the two Euler characteristics in the previous example are equal is no
coincidence. It follows from the Euler–Poincaré formula, which we state next. Note
also that Theorem 4.16 can be obtained as a corollary from this formula.
= ∑_{i≥0} (−1)^i k_i = χ(K).
Exercise 4.5. Compute the homology vector spaces H0 (T ), H1 (T ) and H2 (T ) for the
tetrahedron T from Example 4.25.
Exercise 4.6. Compute H0 (W ), H1 (W ) and H2 (W ) for the cube W from Example 4.25.
Algorithm 4.2 is called Persistent Homology to highlight the underlying idea of the
algorithm: holes that persist for many choices of r are considered to be signals coming
from the data. However, there are settings (Example 4.27) in which Algorithm 4.2 does
not capture the correct geometry.
Example 4.27. Consider the setting depicted in the picture below. Our data consists
of 8 points P = {x0 , . . . , x7 } ⊂ R2 clustered in two groups of four: {x0 , x1 , x2 , x3 }
and {x4 , x5 , x6 , x7 }. We have two radii r1 < r2 and compute Ki = Cri (P) for i = 1, 2.
Each group of points is placed around what could be a hole of the underlying model.
The points in the first group are further apart from each other than the points in the
second group.
At radius r1 , the disks around the first group {x0 , x1 , x2 , x3 } do not intersect, while the
disks around {x4 , x5 , x6 , x7 } do intersect to form a hole. Hence, β1 (K1 ) = 1.
At radius r2 the disks around the group of points {x0 , x1 , x2 , x3 } intersect to build a hole.
But the hole that was present in K1 has filled up at radius r2 , so that β1 (K2 ) = 1.
We find one hole for both radii r1 and r2 . If we had chosen the radii in Algorithm 4.2,
so that there is no other radius between r1 and r2 , the algorithm would detect only one
hole, while the true geometry has two holes of different sizes.
The goal of this section is to obtain an improved version of Algorithm 4.2 that also
records the values of r for which holes appear or disappear in the simplicial complex.
This allows us to correctly handle situations like in Example 4.27. For this, we first need
to introduce the concept of a filtration.
Definition 4.28. A chain of simplicial complexes of the form
K₁ ⊆ K₂ ⊆ ··· ⊆ Kₘ
is called a filtration.
f : C_n(K) → C_n(K′),   ∑_{∆∈K: dim∆=n} a_∆ ∆ ↦ ∑_{∆∈K: dim∆=n} a_∆ f(∆).
∂′_n ∘ f = f ∘ ∂_n,
where on the left hand side ∂′_n is the boundary operator for K′ and f : C_n(K) → C_n(K′);
similarly, on the right, ∂_n is the n-th boundary operator for K and f : C_{n−1}(K) → C_{n−1}(K′).
For a simplicial complex with boundary map ∂n we denote Qn : ker(∂n ) → Hn (K) and
call it the quotient map.
Lemma 4.31. Let K ⊂ K ′ be abstract simplicial complexes with quotient maps Qn
and Q′n , respectively. Let f : K → K ′ be a continuous simplicial map. Then, there is a
well-defined linear map between the n-th homology vector spaces f∗ : Hn (K) → Hn (K ′ )
defined by
f∗ (Qn (v)) := Q′n ( f (v)), v ∈ ker(∂n ).
Proof. We have to show that Q′_n(f(v)) = Q′_n(f(w)) for v − w ∈ Im(∂_{n+1}). Since f
and Q′_n are linear, it suffices to show that Q′_n(f(v − w)) = 0. Since ker Q′_n = Im(∂′_{n+1}),
this is equivalent to f(v − w) ∈ Im(∂′_{n+1}). Let u ∈ C_{n+1}(K) be an (n + 1)-chain such that
v − w = ∂_{n+1}(u). Then f(v − w) = (f ∘ ∂_{n+1})(u) = ∂′_{n+1}(f(u)) ∈ Im(∂′_{n+1}).
For a filtration K₁ ⊆ ··· ⊆ Kₘ we have the inclusion maps
ι_{i,j} : K_i → K_j,   i < j,
that send a simplex ∆ ∈ K_i to ∆ ∈ K_j. All inclusion maps are continuous in the sense of
Definition 4.30. The image of (ιi, j )∗ : Hn (Ki ) → Hn (K j ) sees the n-dimensional holes
that are both in Ki and K j , while the kernel of (ιi, j )∗ tells us which simplices are merged
into the boundary of a higher dimensional simplex when closing a hole. We have the
following central definition.
Proof. The first term in the difference is the number of holes of K_i that are still present
in K_{j−1} but not in K_j. In other words,
Similarly,
Therefore, the difference between these two terms counts the number of holes that ap-
pear at index i and vanish at index j.
Figure 4.2: The persistence diagram for the sample in Example 4.12 using the Vietoris–Rips complex.
In topological data analysis it is standard to visualize the outputs from Algorithm 4.3
by plotting them in a persistence diagram or a barcode plot. A persistence diagram is
a two-dimensional plot where, for fixed n, one plots a point at (i, j) ∈ N² if and only
if µ_n^{i,j} > 0. Thus, the points in a persistence diagram indicate when holes appear and
vanish. Points that appear far from the diagonal R · (1, 1)^T in R² are considered signals
from the model underlying the data. In a barcode plot, for every fixed n one places a line
from i to j whenever µ_n^{i,j} > 0. Here, bars that appear longer than others are considered
as signals from the underlying model.
Figure 4.3: The barcode for the sample in Example 4.12 using the Vietoris–Rips complex.
where z_i ∈ R³ is the position of the i-th carbon atom. Fig. 4.4 shows two possible points on
this model.
1 https://github.com/mtsch/Ripserer.jl
We consider data for c = 1 that is normalized: The equations for the positions are
invariant under simultaneous translation and rotation of the zi . In our data set z1 is the
origin, z8 = (c, 0, 0) and z7 is rotated, such that its last entry is equal to zero. Thus, every
data point in M is a vector in R17 . We have 4058 data points.
A first analysis of the data shows two connected components. They correspond to
the two connected components of O(3) and encode the orientation of configurations. We
select the data on one of the two components and are left with 1966 out of the 4058 data
points. We use t-stochastic neighbor embedding² [vdMH08] to visualize the 1966
data points in R³. The result is shown in Fig. 4.5.
We use again Ripserer.jl [Ču20] to compute the persistence diagram for a subsam-
ple of 500 data points. The result is shown in Fig. 4.6. The diagram shows two weak
signals for holes of dimension one. It was argued in [MTCW10] that the cyclooctane
model actually has one 1-dimensional hole and two 2-dimensional holes. The discrep-
ancy could be due to the size of our data set: we use 500 points for persistent homology
while [MTCW10] analyzes 1,031,644 data points.
Exercise 4.8. Compute a sample of 500 random points from the multivariate stan-
dard Gaussian distribution in R2 . Compute the persistent homology of your data using
Ripserer.jl. What do you see? Can you explain the results of your computation?
Exercise 4.9. Sample points on the sphere S² ⊂ R³ and add noise. Compute the persistent homology of your sample.
2 https://github.com/lejon/TSne.jl
Figure 4.5: 1966 data points from the cyclooctane model after t-stochastic neighbor embedding.
Figure 4.6: Persistence diagram for a subsample of 500 points from the cyclooctane dataset. We see two
weak signals for one-dimensional holes.
Bibliography
[Ash70] Robert B. Ash. Basic probability theory. John Wiley & Sons, Inc., New York-
London-Sydney, 1970.
[AZ18] Martin Aigner and Günter M. Ziegler. Proofs from THE BOOK. Springer Publish-
ing Company, Incorporated, 6th edition, 2018.
[BEKS17] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral Shah. Julia: A fresh
approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
[Bot20] Magnus Bakke Botnan. Topological data analysis, 2020. Lecture Notes.
[Chu97] Fan R. K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference
Series in Mathematics. Published for the Conference Board of the Mathematical
Sciences, Washington, DC; by the American Mathematical Society, Providence,
RI. Available for download, 1997.
[Chu10] Fan Chung. Graph theory in the information age. Notices Amer. Math. Soc.,
57(6):726–732, 2010.
[DFO20] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for
Machine Learning. Cambridge University Press, 2020.
[DW22] Tamal Krishna Dey and Yusu Wang. Computational Topology for Data Analysis.
Cambridge University Press, 2022.
[FV07] Alan Frieze and Eric Vigoda. A survey on the use of Markov chains to randomly
sample colourings. Combinatorica, 01 2007.
[GK12] Venkatesan Guruswami and Ravi Kannan. Computer science theory for the infor-
mation age, 2012. Lecture Notes.
[HJ92] Richard A. Horn and Charles R. Johnson. Matrix analysis, volume 349. Cambridge
University Press, Cambridge, 1992.
[MTCW10] Shawn Martin, Aidan Thompson, Evangelos Coutsias, and Jean-Paul Watson.
Topology of cyclooctane energy landscape. The Journal of chemical physics,
132:234115, 06 2010.
[PBMW98] Larry Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank
citation ranking: Bringing order to the Web. In Proceedings of the 7th International
World Wide Web Conference, pages 161–172, Brisbane, Australia, 1998.
[SS11] Thomas Sauerwald and He Sun. Spectral graph theory, 2011. Lecture Notes.
[Str93] Gilbert Strang. The fundamental theorem of linear algebra. Amer. Math. Monthly,
100(9):848–855, 1993.
[vdMH08] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Jour-
nal of Machine Learning Research, 9:2579–2605, 2008.
[Ču20] Matija Čufar. Ripserer.jl: flexible and efficient persistent homology computation
in julia. Journal of Open Source Software, 5(54):2614, 2020.