Linear Algebra for ML (MIMUW)
Every bit of data on a computer is represented by a sequence of numbers.
A potent point of view is to interpret these sequences as vectors in Euclidean
space Rn. This brings geometric insight into data processing.
The following figure from [Cho17, p. 7] shows a data classification task
for a two-dimensional data set. A simple rotation simplifies the task: the
class of white points is now easily characterized by the property x < 0.
[Figure: two scatter plots in the (x, y)-plane — the data set before and after the rotation.]
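The rotation itself is a single matrix multiplication. A minimal numpy sketch (the angle and the sample points below are made up for illustration):

```python
import numpy as np

# Hypothetical 2-D data points, one per row.
points = np.array([[1.0, 2.0], [-0.5, 1.5], [2.0, -1.0]])

theta = np.pi / 4                      # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

rotated = points @ R.T                 # rotate every point
# After a suitable rotation a class may be separated by a test as simple as x < 0.
print(rotated)
```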
1 Linear algebra
Accompanying materials are available at https://github.com/anagorko/linalg.
1.1 Gaussian elimination
For example, the system with augmented matrix
$$\left(\begin{array}{ccc|c} 4&8&5&6\\ 1&2&2&3 \end{array}\right)$$
reduces to
$$\left(\begin{array}{ccc|c} 1&2&2&3\\ 0&0&-3&-6 \end{array}\right) \xrightarrow{w_2\cdot(-\frac13)} \left(\begin{array}{ccc|c} 1&2&2&3\\ 0&0&1&2 \end{array}\right),$$
so its solutions are
x1 = −1 − 2x2, x3 = 2 for x2 ∈ R.
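The same reduction can be reproduced with sympy (a quick sketch; only the rref call is essential):

```python
from sympy import Matrix

# Augmented matrix of the example system.
M = Matrix([[4, 8, 5, 6],
            [1, 2, 2, 3]])

rref, pivots = M.rref()
print(rref)    # Matrix([[1, 2, 0, -1], [0, 0, 1, 2]])
print(pivots)  # (0, 2): x1 and x3 are pivot variables, x2 is free
```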
There are infinitely many sequences of row operations that lead to reduced
echelon form. Nonetheless, the final matrix is unique.
a · (x1 , x2 , . . . , xn ) = (a · x1 , a · x2 , . . . , a · xn ).
(x1 , x2 , . . . , xn ) ◦ (y1 , y2 , . . . , yn ) = x1 y1 + x2 y2 + · · · + xn yn .
The norm of a vector is
$$\|(x_1, x_2, \ldots, x_n)\| = \sqrt{(x_1, x_2, \ldots, x_n) \circ (x_1, x_2, \ldots, x_n)} = \sqrt{\sum_{i=1}^{n} x_i^2}.$$
The angle ∢(x, y) between nonzero vectors is determined by
$$\cos\sphericalangle(x, y) = \frac{x \circ y}{\|x\|\,\|y\|}.$$
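A quick numerical check of these definitions (numpy; the two vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 0.0, 1.0])

dot = x @ y                         # x ∘ y
norm_x = np.sqrt(x @ x)             # same as np.linalg.norm(x)
norm_y = np.linalg.norm(y)

cos_angle = dot / (norm_x * norm_y)
print(dot, norm_x, norm_y)                 # 4.0 3.0 2.236...
print(np.degrees(np.arccos(cos_angle)))    # the angle, in degrees
```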
1.3 Linear subspaces
A subset V ⊂ Rn such that
x, y ∈ V ⇒ x + y ∈ V,
a ∈ R, x ∈ V ⇒ ax ∈ V
is a linear subspace of Rn.
Let x1 , x2 , . . . , xm ∈ Rn . A linear combination of vectors x1 , x2 , . . . , xm
with coefficients a1 , a2 , . . . , am ∈ R is a sum
a 1 x1 + a 2 x2 + · · · + a m xm .
The set
$$\operatorname{lin}(x_1, x_2, \ldots, x_m) = \Big\{\sum_{i=1}^{m} a_i x_i : a_i \in \mathbb R\Big\} \subset \mathbb R^n$$
is called the linear span of x1, x2, ..., xm; it is a linear subspace of Rn.
Vectors x1, x2, ..., xm are linearly independent if the equality
a1x1 + a2x2 + ··· + amxm = 0
implies
a1 = a2 = ··· = am = 0.
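Linear independence can be tested numerically by comparing the rank of the matrix formed by the vectors with the number of vectors (a sketch; the vectors below are arbitrary):

```python
import numpy as np

# One vector per row; the vectors are independent exactly when
# the rank equals the number of vectors.
vectors = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 1.0],
                    [1.0, 1.0, 3.0]])   # third row = first + second

rank = np.linalg.matrix_rank(vectors)
print(rank, rank == len(vectors))       # 2 False -> linearly dependent
```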
If linearly independent vectors α1, α2, ..., αm span a subspace V, they form a basis of V: every v ∈ V can be written uniquely as
v = a1α1 + a2α2 + ··· + amαm.
1.4 Coordinate system
v = a 1 α 1 + a2 α 2 + · · · + a m α m
where the vectors of A are put in the first m columns and the vector v is put in the
coefficient column.
For example, to find the coordinates of (1, 2) in the basis ((1, 1), (−1, 1)) of R2
([Hef20, Exercise 1.22a, p. 127]), we solve
$$\left(\begin{array}{cc|c} 1&-1&1\\ 1&1&2 \end{array}\right) \xrightarrow{w_2\mathrel{+}=(-1)w_1} \left(\begin{array}{cc|c} 1&-1&1\\ 0&2&1 \end{array}\right) \xrightarrow{w_2\cdot\frac12} \left(\begin{array}{cc|c} 1&-1&1\\ 0&1&\frac12 \end{array}\right) \xrightarrow{w_1\mathrel{+}=w_2} \left(\begin{array}{cc|c} 1&0&\frac32\\ 0&1&\frac12 \end{array}\right).$$
Indeed, we have
$$(1, 2) = \tfrac32\,(1, 1) + \tfrac12\,(-1, 1).$$
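Numerically, the coordinates are the solution of a linear system whose matrix has the basis vectors as columns (numpy sketch):

```python
import numpy as np

# Basis vectors as the columns of B; the coordinates a of v satisfy B @ a = v.
B = np.array([[1.0, -1.0],
              [1.0,  1.0]])
v = np.array([1.0, 2.0])

a = np.linalg.solve(B, v)
print(a)          # [1.5 0.5]  ->  (1, 2) = 3/2*(1, 1) + 1/2*(-1, 1)
print(B @ a)      # reconstructs v
```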
The curse of the standard basis. We will rigorously differentiate between ith
component xi of a vector v = (x1 , x2 , . . . , xn ) ∈ Rn and its ith coordinate
xi in the standard basis
$$[v]_{st} = \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix},$$
Theorem 1.4 ([Hef20, Theorem 3.11, p. 140]). Row and column ranks of a
matrix are equal.
1.6 Matrix algebra
Theorem 1.5 ([Hef20, Lemma 3.3, p. 136]). Elementary row and column
operations do not change the rank of a matrix.
3. If r(A) = r(A|b) is smaller than the number of unknowns, then the system has infinitely many solutions.
Matrix multiplication
The entry c_ij of the product C = AB is computed from the i-th row R_i of A and the j-th column C_j of B. We have
$$c_{ij} = \sum_{l=1}^{m} a_{il} b_{lj}.$$
For example, if
$$A = \begin{pmatrix} 2&7&1\\ 3&1&4 \end{pmatrix} \qquad\text{and}\qquad B = \begin{pmatrix} \frac12&3\\ 2&\frac14\\ 4&-2 \end{pmatrix},$$
then
$$AB = \begin{pmatrix} 19&\frac{23}{4}\\ \frac{39}{2}&\frac54 \end{pmatrix}.$$
For instance, the entry in the second row and second column comes from the second row of A and the second column of B:
$$\tfrac54 = 3\cdot 3 + 1\cdot\tfrac14 + 4\cdot(-2).$$
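The product (and the single entry above) can be checked with numpy:

```python
import numpy as np

A = np.array([[2.0, 7.0, 1.0],
              [3.0, 1.0, 4.0]])
B = np.array([[0.5,  3.0 ],
              [2.0,  0.25],
              [4.0, -2.0 ]])

C = A @ B
print(C)                # [[19.    5.75] [19.5   1.25]]
print(A[1] @ B[:, 1])   # the (2, 2) entry: 3*3 + 1*(1/4) + 4*(-2) = 1.25
```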
Inverse matrix
If
AR = I and LA = I,
then L = L(AR) = (LA)R = R, hence R = L.
If
AR = I and AS = I,
then
A(R − S) = 0
The validity of the above statement is best argued using left multiplication
by matrices of elementary row operations [Hef20, Lemma 3.20, p. 250], but it
can also be checked by hand. Observe that the pair A|I satisfies I · A = A.
The invariant guarantees that after Gaussian elimination the matrix A|I
becomes I|A−1 in the reduced echelon form.
For example, let
$$A = \begin{pmatrix} 1&6&1\\ 3&1&4\\ 2&7&1 \end{pmatrix}.$$
$$\left(\begin{array}{ccc|ccc} 1&6&1&1&0&0\\ 0&1&-\frac1{17}&\frac3{17}&-\frac1{17}&0\\ 0&0&-\frac{22}{17}&-\frac{19}{17}&-\frac5{17}&1 \end{array}\right) \xrightarrow{w_3\cdot(-\frac{17}{22})} \left(\begin{array}{ccc|ccc} 1&6&1&1&0&0\\ 0&1&-\frac1{17}&\frac3{17}&-\frac1{17}&0\\ 0&0&1&\frac{19}{22}&\frac5{22}&-\frac{17}{22} \end{array}\right)$$
$$\xrightarrow{w_1\mathrel{+}=(-6)w_2} \left(\begin{array}{ccc|ccc} 1&0&\frac{23}{17}&-\frac1{17}&\frac6{17}&0\\ 0&1&-\frac1{17}&\frac3{17}&-\frac1{17}&0\\ 0&0&1&\frac{19}{22}&\frac5{22}&-\frac{17}{22} \end{array}\right)$$
$$\xrightarrow{\substack{w_1\mathrel{+}=(-\frac{23}{17})w_3\\ w_2\mathrel{+}=\frac1{17}w_3}} \left(\begin{array}{ccc|ccc} 1&0&0&-\frac{27}{22}&\frac1{22}&\frac{23}{22}\\ 0&1&0&\frac5{22}&-\frac1{22}&-\frac1{22}\\ 0&0&1&\frac{19}{22}&\frac5{22}&-\frac{17}{22} \end{array}\right).$$
The matrix
$$\begin{pmatrix} -\frac{27}{22}&\frac1{22}&\frac{23}{22}\\ \frac5{22}&-\frac1{22}&-\frac1{22}\\ \frac{19}{22}&\frac5{22}&-\frac{17}{22} \end{pmatrix}$$
is the inverse of
$$\begin{pmatrix} 1&6&1\\ 3&1&4\\ 2&7&1 \end{pmatrix}.$$
An instance of the invariant along the way is
$$\begin{pmatrix} 1&0&0\\ \frac3{17}&-\frac1{17}&0\\ -2&0&1 \end{pmatrix} \cdot \begin{pmatrix} 1&6&1\\ 3&1&4\\ 2&7&1 \end{pmatrix} = \begin{pmatrix} 1&6&1\\ 0&1&-\frac1{17}\\ 0&-5&-1 \end{pmatrix}.$$
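A numerical check of the inverse (numpy):

```python
import numpy as np

A = np.array([[1.0, 6.0, 1.0],
              [3.0, 1.0, 4.0],
              [2.0, 7.0, 1.0]])

A_inv = np.linalg.inv(A)
print(np.round(22 * A_inv))                 # [[-27. 1. 23.] [5. -1. -1.] [19. 5. -17.]]
print(np.allclose(A @ A_inv, np.eye(3)))    # True
```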
Matrix transposition
Lemma 1.9 ([Hef20, Exercise 4.33, p. 261]). For any matrices A, B such
that the product AB is well defined, we have
(AB)T = B T AT .
1.7 Linear transformations
$$M(\varphi)^{B}_{A} \cdot [\alpha]_{A} = [\varphi(\alpha)]_{B}.$$
Coordinate change
$$M^{B}_{A} \cdot [v]_{A} = [v]_{B},$$
so M^B_A can be used to change coordinates of a vector from basis A to basis B.
We call M^B_A a change of basis matrix. As above, it can be computed
by transforming the matrix
$$\big(\,\beta_1\ \beta_2\ \cdots\ \beta_n \mid \alpha_1\ \alpha_2\ \cdots\ \alpha_n\,\big).$$
1.8 Kernel
Let φ : V → W be a linear map. We let
im φ = {φ(v) : v ∈ V }.
ker A = {v : Av = 0}.
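The kernel of a matrix can be computed numerically with scipy (a sketch on an arbitrary matrix):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

K = null_space(A)             # orthonormal basis of ker A, one vector per column
print(K.ravel())              # proportional to (1, -2, 1)
print(np.allclose(A @ K, 0))  # True: kernel vectors are mapped to 0
```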
1.9 Space decomposition
U + V = {u + v : u ∈ U, v ∈ V } .
Invariant subspaces
1.10 Orthogonality
The geometric interpretation of the dot product ◦ motivates the definition of
orthogonality of vectors x, y ∈ Rn:
x ⊥ y ⇔ x ◦ y = 0.
We write w ⊥ U when w ⊥ u for every u ∈ U and let
U⊥ = {w ∈ W : w ⊥ U}.
Orthonormal basis
If A = (α1, α2, ..., αn) is an orthonormal basis, then
$$u \circ v = [u]^T_{A} \cdot [v]_{A},$$
with the dot product on the left and matrix multiplication on the right. Indeed, write
$$u = \sum_{i=1}^{n} a_i\alpha_i \qquad\text{and}\qquad v = \sum_{i=1}^{n} b_i\alpha_i$$
with
$$[u]_{A} = \begin{pmatrix} a_1\\ a_2\\ \vdots\\ a_n \end{pmatrix} \qquad\text{and}\qquad [v]_{A} = \begin{pmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{pmatrix}.$$
We have
$$u \circ v = \Big(\sum_{i=1}^{n} a_i\alpha_i\Big) \circ \Big(\sum_{i=1}^{n} b_i\alpha_i\Big) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i b_j\,(\alpha_i \circ \alpha_j) = \sum_{i=1}^{n} a_i b_i = [u]^T_{A} \cdot [v]_{A}.$$
Proof. Let A = (α1, α2, ..., αn). We have $v = \sum_{i=1}^{n} b_i\alpha_i$, where by definition
$$[v]_{A} = \begin{pmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{pmatrix}.$$
From the orthonormality of A,
$$v \circ \alpha_i = \sum_{j=1}^{n} b_j\,(\alpha_j \circ \alpha_i) = b_i.$$
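In an orthonormal basis the coordinates are therefore just dot products, which is easy to verify numerically (numpy sketch):

```python
import numpy as np

# An orthonormal basis of R^2 and a vector v.
a1 = np.array([1.0, 1.0]) / np.sqrt(2)
a2 = np.array([-1.0, 1.0]) / np.sqrt(2)
v = np.array([3.0, 1.0])

b = np.array([v @ a1, v @ a2])                  # i-th coordinate = v ∘ alpha_i
print(b)
print(np.allclose(b[0] * a1 + b[1] * a2, v))    # True
```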
Gram-Schmidt process
Let
U = lin(α1, α2, ..., αk),
where the vectors α1, α2, ..., αk are nonzero and pairwise orthogonal. Define
$$\pi_{A}(v) = \sum_{i=1}^{k} \frac{v \circ \alpha_i}{\alpha_i \circ \alpha_i}\,\alpha_i.$$
Then
v − πA(v) ⊥ U
and for every u ∈ U,
πA(u) = u.
Indeed, for each j we have
$$(v - \pi_{A}(v)) \circ \alpha_j = v \circ \alpha_j - \sum_{i=1}^{k} \frac{v \circ \alpha_i}{\alpha_i \circ \alpha_i}\,(\alpha_i \circ \alpha_j) = 0,$$
since αi ◦ αj = 0 for i ≠ j.
α1 = β1,
α2 = β2 − π(α1)(β2),
...
Then
$$A = \Big(\frac{\alpha_1}{\|\alpha_1\|}, \frac{\alpha_2}{\|\alpha_2\|}, \ldots, \frac{\alpha_n}{\|\alpha_n\|}\Big)$$
is an orthonormal basis, and for every k
lin(α1, α2, ..., αk) = lin(β1, β2, ..., βk).
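A direct implementation of the process (numpy sketch; the input vectors are arbitrary):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthogonalize a list of vectors as in the process above (no normalization)."""
    alphas = []
    for beta in vectors:
        v = beta.astype(float)
        for a in alphas:
            v -= (v @ a) / (a @ a) * a     # subtract the projection onto a
        alphas.append(v)
    return alphas

betas = [np.array([1.0, 1.0, 0.0]),
         np.array([1.0, 0.0, 1.0]),
         np.array([0.0, 1.0, 1.0])]

alphas = gram_schmidt(betas)
# The Gram matrix of the result is diagonal: the vectors are pairwise orthogonal.
print(np.round(np.array([[a @ b for b in alphas] for a in alphas]), 10))
```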
Orthogonal matrices
A square matrix A is orthogonal if
AA^T = A^T A = I,
equivalently A^T = A^{−1}.
1.11 Determinant
Each square n × n matrix A is assigned a real number called its determinant,
denoted det(A) or |A|. If
$$A = \begin{pmatrix} a_{11}&a_{12}&\ldots&a_{1n}\\ a_{21}&a_{22}&\ldots&a_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ a_{n1}&a_{n2}&\ldots&a_{nn} \end{pmatrix},$$
then for n = 1 we have
|a11| = a11.
For n = 2 we have
$$\begin{vmatrix} a_{11}&a_{12}\\ a_{21}&a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}.$$
$$\begin{pmatrix} 1&1&0\\ 3&0&2\\ 5&2&2 \end{pmatrix} \xrightarrow{\substack{w_2\mathrel{+}=(-3)w_1\\ w_3\mathrel{+}=(-5)w_1}} \begin{pmatrix} 1&1&0\\ 0&-3&2\\ 0&-3&2 \end{pmatrix} \xrightarrow{w_2\cdot(-\frac13)} (-3)\cdot\begin{pmatrix} 1&1&0\\ 0&1&-\frac23\\ 0&-3&2 \end{pmatrix} \xrightarrow{w_3\mathrel{+}=3w_2} (-3)\cdot\begin{pmatrix} 1&1&0\\ 0&1&-\frac23\\ 0&0&0 \end{pmatrix}$$
The last matrix is upper triangular and its determinant is the product of the
diagonal entries, so the initial matrix has determinant (−3) · 1 · 1 · 0 = 0.
$$\begin{vmatrix} 1&1&0\\ 3&0&2\\ 5&2&2 \end{vmatrix} = (-1)^{2+1}\cdot 3\cdot\begin{vmatrix} 1&0\\ 2&2 \end{vmatrix} + (-1)^{2+2}\cdot 0\cdot\begin{vmatrix} 1&0\\ 5&2 \end{vmatrix} + (-1)^{2+3}\cdot 2\cdot\begin{vmatrix} 1&1\\ 5&2 \end{vmatrix} = -3(2-0) + 0 - 2(2-5) = 0.$$
In the above example we used Laplace expansion along the second row
combined with the permutation formula for 2 × 2 matrices.
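Both computations are easy to confirm numerically (numpy sketch):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [3.0, 0.0, 2.0],
              [5.0, 2.0, 2.0]])

print(np.linalg.det(A))    # 0.0 up to rounding: the rows are linearly dependent

# Laplace expansion along the second row, as in the computation above.
minor = lambda i, j: np.delete(np.delete(A, i, axis=0), j, axis=1)
det = sum((-1) ** (1 + j) * A[1, j] * np.linalg.det(minor(1, j)) for j in range(3))
print(det)                 # 0.0
```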
1.12 Eigenvectors
The following figure contains a plot of the function φ : R2 → R2 defined by the
formula φ((x, y)) = (x + 4y, 2x − y).
The vector field on the left depicts vectors φ((x, y)) attached to points (x, y) in
the plane. The vector field on the right depicts the same vectors, normalized
to equal lengths.
[Figure: the vector field v ↦ φ(v) plotted over [−4, 4] × [−4, 4] (left), and the same field with all vectors rescaled to equal length (right).]
These plots make it clear that φ acts along two axes and has a simple
algebraic nature, but this algebraic simplicity cannot be seen in the matrix of
φ in the standard basis. The two directions seen on the plots are eigenvectors
of φ.
Let φ : V → V be a linear automorphism. We say that v ∈ V is an
eigenvector for φ with eigenvalue λ ∈ R if v ̸= 0 and
φ(v) = λv.
In coordinates this reads
$$M(\varphi)^{A}_{A} \cdot [v]_{A} = \lambda[v]_{A},$$
i.e. [v]_A lies in the kernel of
$$M(\varphi)^{A}_{A} - \lambda I,$$
where I is the identity matrix. A nonzero solution exists if and only if the rank of
M(φ)^A_A − λI is not maximal, i.e. when det(M(φ)^A_A − λI) = 0.
For the eigenvalue λ = 3 we have
$$A - 3I = \begin{pmatrix} -2&4\\ 2&-4 \end{pmatrix}$$
and row reduction gives
$$\begin{pmatrix} 1&-2\\ 2&-4 \end{pmatrix} \xrightarrow{w_2\mathrel{+}=(-2)w_1} \begin{pmatrix} 1&-2\\ 0&0 \end{pmatrix},$$
so the eigenvectors for λ = 3 are the nonzero solutions of x1 − 2x2 = 0, for example (2, 1).
Indeed, we have
$$\begin{pmatrix} 1&4\\ 2&-1 \end{pmatrix} \cdot \begin{pmatrix} 2\\ 1 \end{pmatrix} = \begin{pmatrix} 6\\ 3 \end{pmatrix} = 3\cdot\begin{pmatrix} 2\\ 1 \end{pmatrix}.$$
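numpy finds the same eigenpairs:

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [2.0, -1.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)       # 3 and -3
print(eigenvectors)      # columns are eigenvectors, e.g. a multiple of (2, 1) for lambda = 3
# Check A v = lambda v for the first eigenpair (holds up to rounding).
print(A @ eigenvectors[:, 0] - eigenvalues[0] * eigenvectors[:, 0])
```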
1.13 Self-adjoint operators
A linear map φ : Rn → Rn is self-adjoint if
φ(u) ◦ v = u ◦ φ(v)
for all u, v, where ◦ denotes the dot product. For example, the orthogonal
(least-distance) projection onto a subspace is self-adjoint, while a rotation in general is not.
Proof. We have
$$(M^TAM)^T = M^TA^T(M^T)^T = M^TA^TM = M^TAM,$$
since A^T = A.
We have
$$u \circ \varphi(v) = [u]^T_{A} \cdot [\varphi(v)]_{A} = [u]^T_{A} \cdot M(\varphi)^{A}_{A} \cdot [v]_{A}$$
and
$$\varphi(u) \circ v = [\varphi(u)]^T_{A} \cdot [v]_{A} = \big(M(\varphi)^{A}_{A} \cdot [u]_{A}\big)^T \cdot [v]_{A} = [u]^T_{A} \cdot \big(M(\varphi)^{A}_{A}\big)^T \cdot [v]_{A}.$$
If M(φ)^A_A is symmetric, then the right-hand sides are equal, and from the left-hand
sides we see that φ is self-adjoint.
For a symmetric matrix A, consider the function f : Rn → R given by
f(v) = v ◦ (Av).
1.14 Spectral Theorem
The restriction of f to the unit sphere S^{n−1} ⊂ Rn attains its maximum at
some p ∈ S^{n−1} (since the sphere is compact and f is continuous). At p, every
tangent vector to S^{n−1} must be orthogonal to the gradient of f, which is 2Ap.
Hence p⊥ ⊥ Ap, so Ap ∈ (p⊥)⊥ = lin p. This means that p is an eigenvector for A.
Proof. The "only if" part follows from the observation that if B = M −1 AM
and M is orthogonal, then A and B are congruent. By Lemma 1.22, if B is
diagonal, then A has to be symmetric.
If A is symmetric, consider φ : Rn → Rn with
$$M(\varphi)^{st}_{st} = A.$$
In an orthonormal basis A of eigenvectors of φ, the matrix
$$M(\varphi)^{A}_{A} = M^{A}_{st}\,A\,M^{st}_{A}$$
is diagonal, and since M^st_A is orthogonal, A is orthogonally diagonalizable.
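For a concrete symmetric matrix (chosen arbitrarily here), numpy's eigh returns the orthogonal diagonalization directly:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # a symmetric matrix

# eigh is meant for symmetric matrices: the eigenvectors come out orthonormal.
eigenvalues, M = np.linalg.eigh(A)
D = np.diag(eigenvalues)

print(np.allclose(M.T @ M, np.eye(2)))   # True: M is orthogonal
print(np.allclose(M @ D @ M.T, A))       # True: A = M D M^T
```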
1.15 Positive semidefinite matrices
A matrix A is positive semidefinite if for each v ∈ Rn we have
(Av) ◦ v ≥ 0.
By the geometric interpretation of the dot product, this says that the angle
between Av and v does not exceed 90◦.
A matrix is positive definite if for each v ∈ Rn such that v ̸= 0 we have
(Av) ◦ v > 0.
Every positive semidefinite matrix A has a positive semidefinite square root.
Write A = M−1DM with D diagonal with nonnegative entries, let E be the diagonal
matrix with Eii = √(Dii), and set B = M−1EM.
We have
B2 = M−1E2M = A.
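The same construction in numpy (here M comes from eigh and is orthogonal, so M⁻¹ = Mᵀ; the test matrix is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite

eigenvalues, M = np.linalg.eigh(A)
B = M @ np.diag(np.sqrt(eigenvalues)) @ M.T   # square root built from the eigendecomposition

print(np.allclose(B @ B, A))      # True: B^2 = A
print(np.allclose(B, B.T))        # True: B is symmetric
```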
B = N −1 F N and C = N −1 GN,
Proof. ⇒: take
$$C = \begin{pmatrix} B\\ 0 \end{pmatrix},$$
1.17 The LU decomposition
If
A = LU,
where L is lower triangular and U is upper triangular, then a11 = l11 u11, and if a11 is 0,
then either l11 or u11 is 0. This means that the right-hand side is singular.
Hence if A is non-singular and a11 = 0, e.g.
$$A = \begin{pmatrix} 0&1\\ 1&1 \end{pmatrix},$$
then an LU factorization in the above form does not exist.
This can easily be fixed by allowing row permutations. LU decomposition
with partial pivoting (LUP) is an LU decomposition with row permutations
P A = LU,
where P is a permutation matrix. It can be computed as follows.
1. U = A, L = I, P = I.
2. For k = 1, 2, . . . , n − 1:
   1. Select i ≥ k to maximize |uik|.
   2. Swap rows i and k of U, the corresponding rows of P, and the first k − 1 entries of rows i and k of L.
   3. For j = k + 1, . . . , n: set ljk = ujk /ukk and subtract ljk times row k of U from row j of U.
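A direct numpy sketch of this procedure (checked on the example matrix used below):

```python
import numpy as np

def lup(A):
    """LU decomposition with partial pivoting: returns P, L, U with P @ A = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Select the row i >= k with the largest |u_ik| and swap it into position k.
        i = k + np.argmax(np.abs(U[k:, k]))
        U[[k, i], k:] = U[[i, k], k:]
        L[[k, i], :k] = L[[i, k], :k]
        P[[k, i], :] = P[[i, k], :]
        # Eliminate the entries below the pivot.
        for j in range(k + 1, n):
            L[j, k] = U[j, k] / U[k, k]
            U[j, k:] -= L[j, k] * U[k, k:]
    return P, L, U

A = np.array([[2.0, 1.0, 1.0, 0.0],
              [4.0, 3.0, 3.0, 1.0],
              [8.0, 7.0, 9.0, 5.0],
              [6.0, 7.0, 9.0, 8.0]])
P, L, U = lup(A)
print(np.allclose(P @ A, L @ U))   # True
```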
L k Pk A = U k .
We have
Pk = QPk−1 ,
Uk = EQUk−1 ,
and
Lk = EQLk−1 Q−1 .
Therefore
P A = LU
We have
$$[U\,|\,L] = \left(\begin{array}{cccc|cccc} 2&1&1&0&1&0&0&0\\ 4&3&3&1&0&1&0&0\\ 8&7&9&5&0&0&1&0\\ 6&7&9&8&0&0&0&1 \end{array}\right).$$
$$[U\,|\,L]:\ \left(\begin{array}{cccc|cccc} 2&1&1&0&1&0&0&0\\ 4&3&3&1&0&1&0&0\\ 8&7&9&5&0&0&1&0\\ 6&7&9&8&0&0&0&1 \end{array}\right) \xrightarrow{w_1\leftrightarrow w_3} \left(\begin{array}{cccc|cccc} 8&7&9&5&1&0&0&0\\ 4&3&3&1&0&1&0&0\\ 2&1&1&0&0&0&1&0\\ 6&7&9&8&0&0&0&1 \end{array}\right)$$
$$\to \left(\begin{array}{cccc|cccc} 8&7&9&5&1&0&0&0\\ 0&-\frac12&-\frac32&-\frac32&-\frac12&1&0&0\\ 0&-\frac34&-\frac54&-\frac54&-\frac14&0&1&0\\ 0&\frac74&\frac94&\frac{17}{4}&-\frac34&0&0&1 \end{array}\right) \xrightarrow{w_2\leftrightarrow w_4} \left(\begin{array}{cccc|cccc} 8&7&9&5&1&0&0&0\\ 0&\frac74&\frac94&\frac{17}{4}&-\frac34&1&0&0\\ 0&-\frac34&-\frac54&-\frac54&-\frac14&0&1&0\\ 0&-\frac12&-\frac32&-\frac32&-\frac12&0&0&1 \end{array}\right).$$
(When two rows are swapped, the multiplier entries already recorded in the right block travel with them, while the remaining columns of the right block stay in place.)
Eliminating the second column (w3 += (3/7)·w2 and w4 += (2/7)·w2) and swapping rows 3 and 4 to bring the larger pivot to the diagonal, we continue:
$$\left(\begin{array}{cccc|cccc} 8&7&9&5&1&0&0&0\\ 0&\frac74&\frac94&\frac{17}{4}&-\frac34&1&0&0\\ 0&0&-\frac67&-\frac27&-\frac57&\frac27&1&0\\ 0&0&-\frac27&\frac47&-\frac47&\frac37&0&1 \end{array}\right) \xrightarrow{w_4\mathrel{+}=(-\frac13)w_3} \left(\begin{array}{cccc|cccc} 8&7&9&5&1&0&0&0\\ 0&\frac74&\frac94&\frac{17}{4}&-\frac34&1&0&0\\ 0&0&-\frac67&-\frac27&-\frac57&\frac27&1&0\\ 0&0&0&\frac23&-\frac13&\frac13&-\frac13&1 \end{array}\right)$$
We have
$$L = \begin{pmatrix} 1&0&0&0\\ -\frac34&1&0&0\\ -\frac57&\frac27&1&0\\ -\frac13&\frac13&-\frac13&1 \end{pmatrix}^{-1} = \begin{pmatrix} 1&0&0&0\\ \frac34&1&0&0\\ \frac12&-\frac27&1&0\\ \frac14&-\frac37&\frac13&1 \end{pmatrix}.$$
$$PA = \begin{pmatrix} 8&7&9&5\\ 6&7&9&8\\ 4&3&3&1\\ 2&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&0&0&0\\ \frac34&1&0&0\\ \frac12&-\frac27&1&0\\ \frac14&-\frac37&\frac13&1 \end{pmatrix} \begin{pmatrix} 8&7&9&5\\ 0&\frac74&\frac94&\frac{17}{4}\\ 0&0&-\frac67&-\frac27\\ 0&0&0&\frac23 \end{pmatrix} = LU.$$
To solve a system Ax = b with the decomposition at hand, write
Ax = LU x = Ly = b,
where y = U x: first solve the triangular system Ly = b for y, then solve U x = y.
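With scipy the decomposition and the two triangular solves look like this (note that scipy's lu returns A = P L U, i.e. the permutation sits on the left):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[2.0, 1.0, 1.0, 0.0],
              [4.0, 3.0, 3.0, 1.0],
              [8.0, 7.0, 9.0, 5.0],
              [6.0, 7.0, 9.0, 8.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])   # an arbitrary right-hand side

P, L, U = lu(A)                      # A = P @ L @ U
# Solve A x = b with two triangular solves: L y = P^T b, then U x = y.
y = solve_triangular(L, P.T @ b, lower=True)
x = solve_triangular(U, y)
print(np.allclose(A @ x, b))         # True
```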
1.18 The Cholesky decomposition
A Cholesky decomposition of a matrix A is a decomposition
$$A = LL^T,$$
where L is lower triangular. It always exists for positive semidefinite matrices,
but it does not have to be unique. For example,
$$\begin{pmatrix} 0&0\\ 0&1 \end{pmatrix} = \begin{pmatrix} 0&0\\ \sin\alpha&\cos\alpha \end{pmatrix} \begin{pmatrix} 0&\sin\alpha\\ 0&\cos\alpha \end{pmatrix}$$
shows infinitely many Cholesky decompositions of the matrix $\left(\begin{smallmatrix} 0&0\\ 0&1 \end{smallmatrix}\right)$.
Theorem 1.33 is proved by induction on the size n of the matrix.
The case n = 2 is the crux of the matter. Let
$$A = \begin{pmatrix} a&b\\ b&d \end{pmatrix} \qquad\text{and}\qquad L = \begin{pmatrix} \lambda&0\\ x&\delta \end{pmatrix}.$$
We have the equation
$$LL^T = \begin{pmatrix} \lambda&0\\ x&\delta \end{pmatrix} \begin{pmatrix} \lambda&x\\ 0&\delta \end{pmatrix} = \begin{pmatrix} \lambda^2&\lambda x\\ \lambda x&x^2+\delta^2 \end{pmatrix} = \begin{pmatrix} a&b\\ b&d \end{pmatrix} = A.$$
Hence $\lambda = \sqrt{a}$,
$$x = \frac{b}{\sqrt{a}}, \qquad \delta = \sqrt{d - x^2} = \sqrt{d - \frac{b^2}{a}}.$$
Since A is positive definite, we have a > 0 and ad − b2 > 0, so d − x2 =
(ad − b2)/a > 0. Therefore the above solution is well defined. It is also the only
solution with positive λ and δ.
The above step may be upgraded to a full recursive procedure of computing
L.
If a matrix of the block form
$$A = \begin{pmatrix} QQ^T & Qx\\ (Qx)^T & d \end{pmatrix}$$
is positive semidefinite, then d ≥ ∥x∥2.
Proof. Let $u = \begin{pmatrix} (Q^T)^{-1}x\\ -1 \end{pmatrix}$. Then
$$Au = \begin{pmatrix} 0\\ \|x\|^2 - d \end{pmatrix},$$
because $QQ^T(Q^T)^{-1}x - Qx = 0$ and $(Qx)^T(Q^T)^{-1}x - d = x^Tx - d = \|x\|^2 - d$.
Consequently, $u \circ (Au) = d - \|x\|^2 \ge 0$.
Proof (Theorem 1.33). Assume the theorem holds for all matrices of size n × n.
Let
$$A = \begin{pmatrix} \tilde A & b\\ b^T & d \end{pmatrix},$$
where Ã is an n × n matrix, b is a column vector and d is a number. By the
inductive hypothesis, there exists a unique n × n lower triangular matrix Q
with positive diagonal entries such that Ã = QQT.
Let
$$L = \begin{pmatrix} Q & 0\\ x^T & \delta \end{pmatrix}.$$
The equation
$$LL^T = \begin{pmatrix} Q & 0\\ x^T & \delta \end{pmatrix} \begin{pmatrix} Q^T & x\\ 0 & \delta \end{pmatrix} = \begin{pmatrix} QQ^T & Qx\\ (Qx)^T & \|x\|^2 + \delta^2 \end{pmatrix} = A$$
is equivalent to
$$QQ^T = \tilde A, \qquad Qx = b, \qquad \|x\|^2 + \delta^2 = d.$$
Hence
$$x = Q^{-1}b \qquad\text{and}\qquad \delta = \sqrt{d - \|x\|^2}.$$
Computational example
Let
$$A = \begin{pmatrix} 4&12&-16\\ 12&37&-43\\ -16&-43&98 \end{pmatrix}.$$
The top left entry of L is √4 = 2, so
$$L = \begin{pmatrix} 2&0&0\\ \cdot&\cdot&0\\ \cdot&\cdot&\cdot \end{pmatrix}.$$
Next, we have Q = (2) and
$$x = Q^{-1}b = \tfrac12 \cdot 12 = 6 \qquad\text{and}\qquad \delta = \sqrt{37 - 6^2} = 1.$$
Hence
$$L = \begin{pmatrix} 2&0&0\\ 6&1&0\\ \cdot&\cdot&\cdot \end{pmatrix}.$$
Finally, for $Q = \begin{pmatrix} 2&0\\ 6&1 \end{pmatrix}$ we have
$$x = Q^{-1}b = \begin{pmatrix} \frac12&0\\ -3&1 \end{pmatrix} \begin{pmatrix} -16\\ -43 \end{pmatrix} = \begin{pmatrix} -8\\ 5 \end{pmatrix}$$
and
$$\delta = \sqrt{98 - \|(-8, 5)\|^2} = \sqrt{98 - 89} = 3.$$
$$L = \begin{pmatrix} 2&0&0\\ 6&1&0\\ -8&5&3 \end{pmatrix}.$$
Indeed, we have
$$\begin{pmatrix} 2&0&0\\ 6&1&0\\ -8&5&3 \end{pmatrix} \begin{pmatrix} 2&6&-8\\ 0&1&5\\ 0&0&3 \end{pmatrix} = \begin{pmatrix} 4&12&-16\\ 12&37&-43\\ -16&-43&98 \end{pmatrix}.$$
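A recursive implementation of the procedure above, checked on this matrix (numpy sketch):

```python
import numpy as np

def cholesky(A):
    """Lower triangular L with A = L @ L.T, built recursively as in the proof."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.array([[np.sqrt(A[0, 0])]])
    Q = cholesky(A[:-1, :-1])              # Cholesky factor of the top-left block
    b = A[:-1, -1]
    x = np.linalg.solve(Q, b)              # x = Q^{-1} b
    delta = np.sqrt(A[-1, -1] - x @ x)     # delta = sqrt(d - ||x||^2)
    L = np.zeros((n, n))
    L[:-1, :-1] = Q
    L[-1, :-1] = x
    L[-1, -1] = delta
    return L

A = np.array([[4.0, 12.0, -16.0],
              [12.0, 37.0, -43.0],
              [-16.0, -43.0, 98.0]])
L = cholesky(A)
print(L)                                   # [[2, 0, 0], [6, 1, 0], [-8, 5, 3]]
print(np.allclose(L @ L.T, A))             # True
```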
1.19 Singular value decomposition
A singular value decomposition of a matrix A is a factorization
$$A = U\Sigma V,$$
where U and V are orthogonal and Σ is a rectangular diagonal matrix with nonnegative entries.
Let φ : Rn → Rm be the linear map with
$$M(\varphi)^{st}_{st} = A.$$
The matrix ATA is symmetric, so it can be orthogonally diagonalized:
$$A^TA = MDM^T,$$
hence
$$M^TA^TAM = D,$$
so (AM)T(AM) = D.
Let A = (α1, α2, . . . , αn) be the sequence of column vectors of M. Since M
is orthogonal, A is an orthonormal basis of Rn. We have
$$M(\varphi)^{st}_{A} = AM,$$
so the columns of AM are the images φ(α1), . . . , φ(αn), and these images are pairwise orthogonal. If r(A) = m, we pick a basis
(β1, β2, . . . , βm)
from this sequence. If r(A) < m, we pick an orthogonal basis of the image of φ
and extend it to an orthogonal basis of Rm with arbitrary vectors. Without
loss of generality we may assume that the vectors of A (hence the columns of M) are
ordered in such a way that φ(αi) = βi, for i = 1, 2, . . . , r(A). Let
$$B = \Big\{\frac{\beta_1}{\|\beta_1\|}, \frac{\beta_2}{\|\beta_2\|}, \ldots, \frac{\beta_m}{\|\beta_m\|}\Big\}.$$
Then
$$M(\varphi)^{B}_{A} = M^{B}_{st}\,A\,M^{st}_{A} = NAM = \sqrt D,$$
where N = M^B_st and M = M^st_A, hence
$$A = U\Sigma V$$
for U = N^T = M^st_B, Σ = √D and V = M^T.
The decomposition A = UΣV can be written as a sum of rank one matrices:
$$A = U\Sigma V = U\Big(\sum_{i=1}^{r(A)} \Sigma_i\Big)V = \sum_{i=1}^{r(A)} \sigma_i u_i v_i,$$
where Σi has σi in the i-th diagonal position and zeros elsewhere, ui is the i-th column of U and vi is the i-th row of V.
Computational example
Let
$$A = \begin{pmatrix} 1&2&3\\ 4&5&6 \end{pmatrix}.$$
We have
$$A^TA = \begin{pmatrix} 17&22&27\\ 22&29&36\\ 27&36&45 \end{pmatrix}.$$
Its eigenvalues are approximately 0.597327, 90.402672 and 0, with corresponding unit eigenvectors (0.805964, 0.112382, −0.581198), (0.428667, 0.566306, 0.703946) and (0.408248, −0.816496, 0.408248). We compute
$$A \cdot \begin{pmatrix} 0.805964\\ 0.112382\\ -0.581198 \end{pmatrix} = \begin{pmatrix} -0.712867\\ 0.298575 \end{pmatrix}, \qquad A \cdot \begin{pmatrix} 0.428667\\ 0.566306\\ 0.703946 \end{pmatrix} = \begin{pmatrix} 3.673121\\ 8.769883 \end{pmatrix}.$$
We pick
β1 = (−0.712867, 0.298575), β2 = (3.673121, 8.769883).
We rearrange A so that the vectors with nonzero images come first:
A = ((0.805964, 0.112382, −0.581198), (0.428667, 0.566306, 0.703946), (0.408248, −0.816496, 0.408248)).
Hence
$$U = (M^{B}_{st})^{-1} = M^{st}_{B} = \begin{pmatrix} -0.922364 & 0.386317\\ 0.386321 & 0.922365 \end{pmatrix},$$
$$V = M^{A}_{st} = \begin{pmatrix} 0.805964 & 0.112382 & -0.581198\\ 0.428667 & 0.566306 & 0.703946\\ 0.408248 & -0.816496 & 0.408248 \end{pmatrix}$$
and
$$\Sigma = \sqrt D = \begin{pmatrix} \sqrt{0.597327} & 0 & 0\\ 0 & \sqrt{90.402672} & 0 \end{pmatrix} = \begin{pmatrix} 0.772869 & 0 & 0\\ 0 & 9.508032 & 0 \end{pmatrix}.$$
Indeed, we have
$$U\Sigma V = \begin{pmatrix} -0.922364 & 0.386317\\ 0.386321 & 0.922365 \end{pmatrix} \begin{pmatrix} 0.772869 & 0 & 0\\ 0 & 9.508032 & 0 \end{pmatrix} \begin{pmatrix} 0.805964 & 0.112382 & -0.581198\\ 0.428667 & 0.566306 & 0.703946\\ 0.408248 & -0.816496 & 0.408248 \end{pmatrix}$$
$$= \begin{pmatrix} -0.712866 & 3.673114 & 0.0\\ 0.298575 & 8.769875 & 0.0 \end{pmatrix} \begin{pmatrix} 0.805964 & 0.112382 & -0.581198\\ 0.428667 & 0.566306 & 0.703946\\ 0.408248 & -0.816496 & 0.408248 \end{pmatrix} = \begin{pmatrix} 0.999998 & 1.999993 & 2.999990\\ 3.999997 & 4.999987 & 5.999987 \end{pmatrix}$$
$$\sigma_1 u_1 v_1 + \sigma_2 u_2 v_2 = \begin{pmatrix} -0.574544 & -0.080113 & 0.414316\\ 0.240641 & 0.033554 & -0.173531 \end{pmatrix} + \begin{pmatrix} 1.574542 & 2.080106 & 2.585674\\ 3.759356 & 4.966433 & 6.173519 \end{pmatrix} = \begin{pmatrix} 0.999998 & 1.999993 & 2.999990\\ 3.999997 & 4.999987 & 5.999987 \end{pmatrix}$$
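numpy computes the same decomposition (up to signs and the ordering of the singular values, which numpy sorts in decreasing order):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

U, s, Vt = np.linalg.svd(A)          # A = U @ diag(s) @ Vt
print(s)                              # singular values: ~9.508, ~0.773

# Rank-one decomposition A = sum_i s_i * u_i * v_i.
approx = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(approx, A))         # True

# Best rank-1 approximation: keep only the largest singular value.
rank1 = s[0] * np.outer(U[:, 0], Vt[0])
print(rank1)
```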
1.20 Problems
Problem 1.20.1. Verify the formula
$$\cos\sphericalangle(x, y) = \frac{x \circ y}{\|x\|\,\|y\|}.$$
Hint. Compute the squared length of y − x both as (y − x) ◦ (y − x) and from the law of cosines.
Show that
$$\frac14\big(\|u + v\|^2 - \|u - v\|^2\big) = u \circ v$$
for any u, v ∈ Rn.
Problem 1.20.6. Pick a basis of R3 from (2, 1, 3), (1, 2, 4), (3, 0, 2), (2, −2, 2).
How many solutions does this problem have?
$$\begin{pmatrix} -2&-3&2&1\\ 0&2&3&-1\\ 0&0&1&7 \end{pmatrix}.$$
Let A = ((3, 4, 1), (2, 3, 1), (5, 1, 1)), B = ((3, 1), (2, 1)). Find $M(\varphi)^{B}_{A}$ and
$M(\varphi)^{st}_{st}$.
Problem 1.20.14. [put90] If A and B are square matrices of the same size
such that ABAB = 0, does it follow that BABA = 0?
$$\begin{pmatrix} 2&4&-2&-2\\ 1&3&1&2\\ 1&3&1&3\\ -1&2&1&2 \end{pmatrix}.$$
$$A = \begin{pmatrix} 1&2&3&4\\ 2&4&6&8\\ 3&6&9&12\\ 4&8&12&16 \end{pmatrix} \qquad\text{and}\qquad B = \begin{pmatrix} 1&2&3&4\\ 2&4&5&6\\ 3&5&6&7\\ 4&6&7&8 \end{pmatrix}$$
1 1 0
1 0 1