
1 Linear algebra
Every bit of data on a computer is represented by a sequence of numbers.
A potent point of view is to interpret these sequences as vectors in Euclidean
space Rn. This brings geometric insight into data processing.
The following figure from [Cho17, p. 7] shows a data classification task
for a two-dimensional data set. A simple rotation simplifies the task: the
class of white points is now easily characterized by the property x < 0.

[Figure: the data set in the original (x, y) coordinates (left) and the same data after the rotation (right).]

The above is a coordinate change, i.e. a linear (affine) transformation.
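A minimal NumPy sketch of such a coordinate change (the points and the angle below are made up for illustration; this is not the data from the figure):

```python
import numpy as np

# Made-up 2D points; in the figure the two classes are mixed in the original (x, y) coordinates.
points = np.array([[1.0, 1.2], [0.8, 0.9], [-1.1, -0.9], [-0.7, -1.3]])

theta = np.pi / 4  # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# A rotation is a linear coordinate change: new coordinates = R @ old coordinates.
rotated = points @ R.T
print(rotated)
```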


We’ve entered the domain of linear algebra, which is the topic of the present
chapter.
The first part of the chapter is a quick refresher of the basics. Its main
purpose is to set notation. We limit our scope to finite dimensional real vector
spaces.
The second part covers more specialized topics that are relevant to machine
learning. We study in more detail:


• the Spectral Theorem: every self-adjoint operator admits an orthonormal
basis consisting of eigenvectors (or, equivalently, every symmetric matrix
is orthogonally diagonalizable);
• the Gram matrix: every symmetric positive semidefinite matrix A
admits a decomposition A = C^T C for some rectangular matrix C;
• LU decomposition: every n × n matrix A admits (sometimes after a
permutation of rows) a decomposition A = LU, where L is a lower
triangular matrix and U is an upper triangular matrix;
• Cholesky decomposition: every symmetric positive definite matrix A
admits a unique decomposition A = LL^T, where L is a lower triangular
matrix with positive diagonal entries;
• singular value decomposition (SVD): every m × n matrix A admits
a decomposition A = UΣV, where U, V are square orthogonal matrices
and Σ is a diagonal matrix.

Examples of computations (Gaussian elimination to compute matrix inverses,
determinants and so on) were generated using the linalg SageMath
package that is available at

https://github.com/anagorko/linalg.

It was created to supplement a "Linear algebra and geometry I" course
taught during the winter semester 2020 at the University of Warsaw Mathematics
Department, and it can be used to automate and visualize steps of matrix
computations in linear algebra.

1.1 Gaussian elimination


Let us recall how Gaussian elimination works. Consider the following system
of linear equations.

  4x1 + 8x2 + 5x3 = 6
   x1 + 2x2 + 2x3 = 3

We rewrite it in matrix notation.

[ 4  8  5 | 6 ]
[ 1  2  2 | 3 ]

To avoid fractions, we start by swapping rows.

[ 4  8  5 | 6 ]  ← r2       [ 1  2  2 | 3 ]
[ 1  2  2 | 3 ]  ← r1   →   [ 4  8  5 | 6 ]

Now we go straight to an echelon form.

[ 1  2  2 | 3 ]                 [ 1  2  2 |  3 ]
[ 4  8  5 | 6 ]  +(−4) · r1  →  [ 0  0 −3 | −6 ]

[ 1  2  2 |  3 ]                [ 1  2  2 | 3 ]
[ 0  0 −3 | −6 ]  ·(−1/3)    →  [ 0  0  1 | 2 ]

To get a solution we need the reduced echelon form.

[ 1  2  2 | 3 ]  +(−2) · r2     [ 1  2  0 | −1 ]
[ 0  0  1 | 2 ]              →  [ 0  0  1 |  2 ]

After these transformations our system of linear equations is in the following form.

  x1 + 2x2 = −1
        x3 = 2

We treat x2 as a parameter and write out a parametrized set of solutions:

x1 = −1 − 2x2 , x3 = 2 for x2 ∈ R.

In other words, the solution set is

{ (−1 − 2x2, x2, 2) : x2 ∈ R }.

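The whole computation can be reproduced mechanically; a small SymPy sketch (one possible tool — the linalg SageMath package mentioned earlier is another):

```python
from sympy import Matrix

# Augmented matrix of the system 4x1 + 8x2 + 5x3 = 6, x1 + 2x2 + 2x3 = 3.
M = Matrix([[4, 8, 5, 6],
            [1, 2, 2, 3]])

rref, pivots = M.rref()   # reduced echelon form and the pivot column indices
print(rref)               # Matrix([[1, 2, 0, -1], [0, 0, 1, 2]])
print(pivots)             # (0, 2): x1 and x3 are pivot variables, x2 is a free parameter
```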

There are infinitely many sequences of row operations that lead to reduced
echelon form. Nonetheless the final matrix is unique.

Theorem 1.1 ([Hef20, Theorem 2.6, p. 59]). The reduced echelon form of a
matrix is unique.

1.2 Euclidean vector space Rn



Euclidean n-space is the set of vectors Rn = {(x1, x2, . . . , xn) : xi ∈ R} with
vector addition

(x1 , x2 , . . . , xn ) + (y1 , y2 , . . . , yn ) = (x1 + y1 , x2 + y2 , . . . , xn + yn )

and scalar multiplication (a ∈ R)

a · (x1 , x2 , . . . , xn ) = (a · x1 , a · x2 , . . . , a · xn ).

We define a dot product

(x1 , x2 , . . . , xn ) ◦ (y1 , y2 , . . . , yn ) = x1 y1 + x2 y2 + · · · + xn yn .

The norm of a vector is

∥(x1, x2, . . . , xn)∥ = sqrt((x1, x2, . . . , xn) ◦ (x1, x2, . . . , xn)) = sqrt(x1² + x2² + · · · + xn²).

It is the length of the vector and it agrees with the Pythagorean formula in Euclidean
geometry. If x = (x1, x2, . . . , xn) ∈ Rn and y = (y1, y2, . . . , yn) ∈ Rn, then
the geometric interpretation of the dot product x ◦ y is

x ◦ y = ∥x∥ ∥y∥ cos ∢(x, y).

It is a reformulation of the law of cosines in Euclidean geometry (see Problem 1.20.1).
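A short NumPy sketch of these formulas (the vectors are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 0.0, 1.0])

dot = x @ y                      # dot product x ◦ y
norm_x = np.linalg.norm(x)       # ∥x∥ = sqrt(x ◦ x)
norm_y = np.linalg.norm(y)

# geometric interpretation: x ◦ y = ∥x∥ ∥y∥ cos∢(x, y)
angle = np.arccos(dot / (norm_x * norm_y))
print(dot, norm_x, norm_y, angle)
```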


1.3 Linear subspaces

A linear subspace V of Rn is a subset of Rn that is closed under vector


addition

x, y ∈ V ⇒ x + y ∈ V

and scalar multiplication

a ∈ R, x ∈ V ⇒ ax ∈ V.

We limit our scope to linear subspaces of Rn so in the sequel we will


simply say linear space instead of linear subspace of Rn . If V , W are
linear spaces and V ⊂ W , then we say that V is a subspace of W (not just
of Rn ).
Let ai,j ∈ R with 1 ≤ i ≤ m, 1 ≤ j ≤ n. The set of solutions of a
homogeneous system of linear equations

  a1,1 x1 + a1,2 x2 + · · · + a1,n xn = 0
  a2,1 x1 + a2,2 x2 + · · · + a2,n xn = 0
  ...
  am,1 x1 + am,2 x2 + · · · + am,n xn = 0

is a linear subspace of Rn .
Let x1 , x2 , . . . , xm ∈ Rn . A linear combination of vectors x1 , x2 , . . . , xm
with coefficients a1 , a2 , . . . , am ∈ R is a sum

a 1 x1 + a 2 x2 + · · · + a m xm .

The set

lin(x1, x2, . . . , xm) = { a1 x1 + a2 x2 + · · · + am xm : ai ∈ R } ⊂ Rn

of all linear combinations of vectors x1, x2, . . . , xm is a subspace of Rn and we
call it the subspace spanned by x1, x2, . . . , xm (or the span or linear closure
of the vectors x1, x2, . . . , xm).

Theorem 1.2. Every subspace of Rn is spanned by a (finite) sequence of


vectors and is also a solution set of a (finite) homogeneous system of linear
equations.

1.4 Coordinate system


We say that vectors x1 , x2 , . . . , xm ∈ Rn are linearly independent if the
equation

a1 x1 + a 2 x2 + · · · + a m xm = 0

has exactly one solution

a1 = a2 = · · · = am = 0.

Otherwise we say that the vectors are linearly dependent. Vectors


x1 , x2 , . . . , xm are linearly independent if the zero vector 0 can be represented
as a linear combination of x1 , x2 , . . . , xm in a unique way. Vectors are linearly
dependent if and only if one of the vectors can be expressed as a linear
combination of others.
Let V be a subspace of Rn . If x1 , x2 , . . . , xm are linearly independent and
span V , then we say that x1 , x2 , . . . , xm is a basis of V .

Theorem 1.3 ([Hef20, Theorem 2.4, p. 131]). If x1 , x2 , . . . , xm and y 1 , y 2 ,


y 3 , . . . , y k are bases of a linear space V , then k = m. Every linearly inde-
pendent set of vectors in a space V may be extended to a basis of V .

The cardinality of a base of a linear space V is the dimension of V and


is denoted by dim V . It follows from Theorem 1.3 that: (1) dim V is well
defined; (2) if W is a subspace of V , then dim W ≤ dim V ; (3) bases of V are
exactly the maximal linearly independent subsets of V .

Let A = α1 , α2 , . . . , αm be a basis of V . It follows from the definition
of a basis that for each v ∈ V the equation

v = a 1 α 1 + a2 α 2 + · · · + a m α m


has exactly one solution a1, a2, . . . , am ∈ R. We call the coefficients the coordinates
of v in basis A and denote them with the column vector

[v]A = [a1, a2, . . . , am]^T.

To find the coordinates of a vector v in basis A we need to solve the equation

v = a1 α1 + a2 α2 + · · · + am αm,

which is a system of linear equations with the augmented matrix

[ α1  α2  · · ·  αm | v ],

where the vectors of A are put in the first m columns and the vector v is put
in the right-hand-side column.

For example, to find coordinates of (1, 2) in a basis ((1, 1), (−1, 1)) of R2
([Hef20, Exercise 1.22a, p. 127]), we solve

[ 1 −1 | 1 ]                [ 1 −1 | 1 ]
[ 1  1 | 2 ] +(−1) · w1  →  [ 0  2 | 1 ]

[ 1 −1 | 1 ]                [ 1 −1 | 1   ]
[ 0  2 | 1 ] ·(1/2)      →  [ 0  1 | 1/2 ]

[ 1 −1 | 1   ] +1 · w2      [ 1  0 | 3/2 ]
[ 0  1 | 1/2 ]           →  [ 0  1 | 1/2 ]

Indeed, we have


(1, 2) = (3/2) · (1, 1) + (1/2) · (−1, 1).
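Finding coordinates in a basis is just solving a linear system; a NumPy sketch of the example above:

```python
import numpy as np

# Basis vectors (1, 1) and (-1, 1) as the columns of B, and the vector v = (1, 2).
B = np.array([[1.0, -1.0],
              [1.0,  1.0]])
v = np.array([1.0, 2.0])

coords = np.linalg.solve(B, v)   # solves B @ coords = v
print(coords)                    # [1.5 0.5], i.e. (1, 2) = 3/2 (1, 1) + 1/2 (-1, 1)
```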

The curse of the standard basis. We will rigorously differentiate between the ith
component xi of a vector v = (x1, x2, . . . , xn) ∈ Rn and its ith coordinate
in the standard basis

[v]st = [x1, x2, . . . , xn]^T,

which is also xi. By st we denote the standard basis of Rn. We make an
exception in the context of matrix multiplication, when we write v or v^T
instead of [v]st or ([v]st)^T (i.e. we treat v as a column matrix), e.g. we write
Av, v^T A, u^T Av (here (·)^T denotes matrix transposition, see Section 1.6). This
will not lead to confusion as there is no other way to multiply a matrix by a
vector.
Mixing the two (components and coordinates) makes some formulas much
easier, but at the expense of the ability to understand the same formulas with
non-standard bases. We gave an example to motivate usage of non-standard
bases at the beginning of the chapter. Here let us note that existence of
a canonical base of a linear space is a rare luxury. Consider a space V
of all polynomials P of degree at most 3 that vanish at 2, i.e. such that
P (2) = 0. In other words it is a subspace of R4 defined by the equation
8x1 + 4x2 + 2x3 + x4 = 0. What base would be "standard" for V ?

1.5 Rank of a matrix


A column rank of a matrix is the dimension of a span of its columns. A
row rank of a matrix is the dimension of a span of its rows.

Theorem 1.4 ([Hef20, Theorem 3.11, p. 140]). Row and column ranks of a
matrix are equal.

We define a rank of a matrix to be either of these numbers. We denote


rank of a matrix A by r(A).


Theorem 1.5 ([Hef20, Lemma 3.3, p. 136]). Elementary row and column
operations do not change the rank of a matrix.

Theorem 1.6 (Kronecker-Capelli). Let

  a11 x1 + a12 x2 + · · · + a1n xn = b1
  a21 x1 + a22 x2 + · · · + a2n xn = b2
  ...
  am1 x1 + am2 x2 + · · · + amn xn = bm

be a system of linear equations, A = (aij) be the coefficient m × n matrix
and Ā be the augmented matrix obtained from A by adding the column of
free terms bi.

1. The system is consistent if and only if r(A) = r(Ā).

2. If r(A) = r(Ā) = n, then the system has exactly one solution.

3. If r(A) = r(Ā) < n, then the system has infinitely many solutions.

1.6 Matrix algebra

Matrix multiplication

Let A be a k × m matrix (k rows, m columns). Let B be an m × n matrix. A
k × n matrix C is the result of matrix multiplication of A by B if ci,j = Ri ◦ Cj,
where Ri is the ith row of A, Cj is the jth column of B and ◦ is the dot product.


[Schematic: the entry cij of the product is the dot product of the ith row
Ri = (ai1, . . . , aim) of A with the jth column Cj = (b1j, . . . , bmj) of B.]

We have

cij = ai1 b1j + ai2 b2j + · · · + aim bmj.

For example, if

A = [ 2  7  1 ]          [ 1    3  ]
    [ 3  1  4 ]  and B = [ 2   1/4 ]
                         [ 4   −2  ],

then A · B is a 2 × 2 matrix computed in the following way:

A · B = [ 2·1 + 7·2 + 1·4   2·3 + 7·(1/4) + 1·(−2) ]   [ 20  23/4 ]
        [ 3·1 + 1·2 + 4·4   3·3 + 1·(1/4) + 4·(−2) ] = [ 21   5/4 ].

For example, the bottom-right entry is the dot product of the second row of A
and the second column of B: 5/4 = 3 · 3 + 1 · (1/4) + 4 · (−2).
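A NumPy check of the product, using the matrices assumed in the example above:

```python
import numpy as np

A = np.array([[2, 7, 1],
              [3, 1, 4]])
B = np.array([[1, 3],
              [2, 0.25],
              [4, -2]])

C = A @ B    # C[i, j] is the dot product of row i of A with column j of B
print(C)     # [[20.    5.75]
             #  [21.    1.25]]
```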

Theorem 1.7 ([Hef20, Exercise 2.30, p. 243]).

r(A · B) ≤ min(r(A), r(B))

Inverse matrix

A matrix G is a left inverse of a matrix H if GH is the identity matrix. It is


a right inverse if HG is the identity. A matrix H with a two-sided inverse
is an invertible matrix. That two-sided inverse is denoted H −1 ([Hef20, p.
255]).
If H is a square n × n matrix, then H has an inverse if and only if
r(H) = n.

Theorem 1.8. If A is a square n × n matrix of rank n, then it has both left


and right inverses, these inverses are unique and equal.

Proof. The existence of left/right inverses of a rank n matrix will be
demonstrated below. Here we show uniqueness and equality. Assume that

AR = I and LA = I,

i.e. R is a right inverse of A and L is a left inverse of A. Then

A(R − L)A = ARA − ALA = (AR)A − A(LA) = A − A = 0.

Then

R − L = LA(R − L)AR = L0R = 0,

hence R = L.
If

AR = I and AS = I,

then

A(R − S) = 0


and from the existence of left inverse, R − S = 0 so R = S.

Let A be a square n × n matrix with r(A) = n. To find the (left) inverse
matrix A−1 of A we use the following invariant. (To find a right inverse we may
find a left inverse of A^T, since row and column ranks are equal; in the light
of the last theorem we'll stop differentiating between left and right inverses.)

If [L|R] is a pair of n × n matrices such that R · A = L and [L′|R′] is
obtained from [L|R] by an elementary row operation, then the relation
R′ · A = L′ is preserved.

The validity of the above statement is best argued using left multiplication
by matrices of elementary row operations [Hef20, Lemma 3.20, p. 250] but it
can also be checked by hand. Observe that the pair [A|I] satisfies I · A = A.
The invariant guarantees that after Gaussian elimination the matrix [A|I]
becomes [I|A−1] in the reduced echelon form.
For example, let

A = [ 1  6  1 ]
    [ 3  1  4 ]
    [ 2  7  1 ].

We run Gaussian elimination on the extended matrix.

[ 1  6  1 | 1 0 0 ]                   [ 1   6   1 |  1 0 0 ]
[ 3  1  4 | 0 1 0 ] +(−3) · w1    →   [ 0 −17   1 | −3 1 0 ]
[ 2  7  1 | 0 0 1 ] +(−2) · w1        [ 0  −5  −1 | −2 0 1 ]

[ 1   6   1 |  1 0 0 ]                [ 1   6    1   |  1     0    0 ]
[ 0 −17   1 | −3 1 0 ] ·(−1/17)   →   [ 0   1  −1/17 | 3/17 −1/17  0 ]
[ 0  −5  −1 | −2 0 1 ]                [ 0  −5   −1   | −2     0    1 ]

[ 1   6    1   |  1     0    0 ]               [ 1   6    1    |   1      0    0 ]
[ 0   1  −1/17 | 3/17 −1/17  0 ]           →   [ 0   1  −1/17  |  3/17  −1/17  0 ]
[ 0  −5   −1   | −2     0    1 ] +5 · w2       [ 0   0  −22/17 | −19/17 −5/17  1 ]

[ 1   6    1    |   1      0    0 ]                 [ 1   6    1    |  1      0      0    ]
[ 0   1  −1/17  |  3/17  −1/17  0 ]             →   [ 0   1  −1/17  | 3/17  −1/17    0    ]
[ 0   0  −22/17 | −19/17 −5/17  1 ] ·(−17/22)       [ 0   0    1    | 19/22  5/22  −17/22 ]

[ 1   6    1    |  1      0      0    ] +(−6) · w2     [ 1   0   23/17 | −1/17   6/17    0    ]
[ 0   1  −1/17  | 3/17  −1/17    0    ]            →   [ 0   1  −1/17  |  3/17  −1/17    0    ]
[ 0   0    1    | 19/22  5/22  −17/22 ]                [ 0   0    1    | 19/22   5/22  −17/22 ]

[ 1   0   23/17 | −1/17   6/17    0    ] +(−23/17) · w3     [ 1  0  0 | −27/22   1/22   23/22 ]
[ 0   1  −1/17  |  3/17  −1/17    0    ] +(1/17) · w3   →   [ 0  1  0 |   5/22  −1/22   −1/22 ]
[ 0   0    1    | 19/22   5/22  −17/22 ]                    [ 0  0  1 |  19/22   5/22  −17/22 ]

The matrix

[ −27/22   1/22   23/22 ]
[   5/22  −1/22   −1/22 ]
[  19/22   5/22  −17/22 ]

is the inverse of

[ 1  6  1 ]
[ 3  1  4 ]
[ 2  7  1 ].
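A quick NumPy check of this inverse:

```python
import numpy as np

A = np.array([[1, 6, 1],
              [3, 1, 4],
              [2, 7, 1]], dtype=float)

A_inv = np.linalg.inv(A)
print(A_inv * 22)                            # [[-27.  1.  23.], [ 5. -1. -1.], [19.  5. -17.]]
print(np.allclose(A @ A_inv, np.eye(3)))     # True
```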

The invariant in the third row of the computation, for the pair

[ 1   6    1   |  1     0    0 ]
[ 0   1  −1/17 | 3/17 −1/17  0 ]
[ 0  −5   −1   | −2     0    1 ],

is

[  1      0    0 ]   [ 1  6  1 ]   [ 1   6    1   ]
[ 3/17  −1/17  0 ] · [ 3  1  4 ] = [ 0   1  −1/17 ]
[ −2      0    1 ]   [ 2  7  1 ]   [ 0  −5   −1   ].

Matrix transposition

For any matrix A, the transpose of A, written AT , is the matrix whose


columns are the rows of A.

Lemma 1.9 ([Hef20, Exercise 4.33, p. 261]). For any matrices A, B such
that the product AB is well defined, we have

(AB)T = B T AT .

1.7 Linear transformations


A map φ from a linear space V into a linear space W is linear if it preserves
vector addition

φ(x + y) = φ(x) + φ(y) for each x, y ∈ V

and scalar multiplication

φ(ax) = aφ(x) for each a ∈ R, x ∈ V.

Matrix of a linear transformation

Let A = (α1, α2, . . . , αn) be a basis of V and B = (β1, β2, . . . , βm) be a basis
of W. We let

M(φ)^B_A = [ [φ(α1)]B  [φ(α2)]B  · · ·  [φ(αn)]B ]

be the matrix of φ in bases A, B. It is a matrix with m rows and n columns.
The ith column of the matrix contains the coordinates of φ(αi) (the image of the ith
vector from basis A under the transformation φ) in basis B.
To compute M(φ)^B_A we need to compute [φ(αi)]B for each i, which can
be done by performing full Gaussian elimination on the matrix

[ β1  β2  · · ·  βm | φ(α1)  φ(α2)  · · ·  φ(αn) ]

to get in reduced echelon form the matrix

[ I | [φ(α1)]B  [φ(α2)]B  · · ·  [φ(αn)]B ].

The matrix on the right hand side is M(φ)^B_A.

Theorem 1.10. Let φ : V → W be a linear transformation, let A be a basis
of V and let B be a basis of W. We have

M(φ)^B_A · [α]A = [φ(α)]B.

Proof. It is a direct consequence of the definition of M(φ)^B_A.

By Theorem 1.10, the matrix of a linear transformation determines the transformation
on the entire domain. Theorem 1.10 leads to the following theorem.

Theorem 1.11 ([Hef20, Theorem 2.1, p. 267]). Let φ : V → W and ψ : W →
Z. Let A, B, C be bases of V, W, Z respectively. We have

M(ψ ◦ φ)^C_A = M(ψ)^C_B · M(φ)^B_A.

Coordinate change

Let id be the identity map id : Rn → Rn, let A, B be bases of Rn and let
M^B_A = M(id)^B_A. From Theorem 1.10, we have

M^B_A · [v]A = [v]B,

so M^B_A can be used to change coordinates of a vector from basis A to basis B.
We call M^B_A a change of basis matrix. As above, it can be computed
by transforming the matrix

[ β1  β2  · · ·  βm | α1  α2  · · ·  αn ]

to reduced echelon form, in which M^B_A will appear on the right hand side,
next to the identity matrix on the left. Notice that if B is the standard basis
st, then no Gaussian elimination is needed and M^st_A is just the matrix with
the vectors from basis A in its columns.

1.8 Kernel
Let φ : V → W be a linear map. We let

ker φ = {v ∈ V : φ(v) = 0}.

We call ker φ the kernel of φ. It is a linear subspace of V .


We let

im φ = {φ(v) : v ∈ V }.

We call im φ the image of φ. It is a linear subspace of W .

Theorem 1.12. Let φ : V → W be a linear map. We have

dim V = dim im φ + dim ker φ.

For matrix A, we let

ker A = {v : Av = 0}.

We call ker A the kernel of A.


1.9 Space decomposition


Let U, V be subspaces of W . We let

U + V = {u + v : u ∈ U, v ∈ V } .

It is a subspace of W . If additionally U ∩ V = {0}, then we write U ⊕ V to


denote U + V .

Lemma 1.13. If W = U ⊕ V , then each vector w ∈ W has unique decom-


position w = u + v such that u ∈ U and v ∈ V .

If A = {α1 , α2 , . . . , αn } is a base of U and B = {β1 , β2 , . . . , βm } is a base


of V , then {α1 , α2 , . . . , αn , β1 , β2 , . . . , βm } is a base of U ⊕ V .

Invariant subspaces

Let φ : V → V be a linear transformation. A subspace W ⊂ V is invariant


under φ if φ(W ) ⊂ W .

Lemma 1.14. Let φ : W → W be a linear transformation. Suppose W = U ⊕ V,
both U and V are invariant under φ, and A = (α1, α2, . . . , αn) is a basis of W such
that (α1, α2, . . . , αk) is a basis of U and (αk+1, αk+2, . . . , αn) is a basis of V.
Then

M(φ)^A_A = [ a11 · · · a1k   0   · · ·  0  ]
           [  ·         ·    ·         ·  ]
           [ ak1 · · · akk   0   · · ·  0  ]
           [  0  · · ·  0   b11  · · · b1m ]
           [  ·         ·    ·         ·  ]
           [  0  · · ·  0   bm1  · · · bmm ]

is a block matrix with k × k and m × m blocks on the diagonal, where
m = n − k.


1.10 Orthogonality
The geometric interpretation of the dot product ◦ implies the definition of
orthogonality of vectors x, y ∈ Rn , which is defined as

x ⊥ y ⇔ x ◦ y = 0.

Let U ⊂ W be a set of vectors. For each w ∈ W we write

w⊥U

to denote that for each u ∈ U we have w ⊥ u (vector w is perpendicular to


all vectors from U ).
Let U be a subspace of W . The orthogonal complement of U in W is

U ⊥ = {w ∈ W : w ⊥ U }.

Orthonormal basis

Let A = (α1 , α2 , . . . , αn ) be a basis of a linear space V . We say that A is


orthogonal if for each i ̸= j we have αi ⊥ αj . If additionally each vector
has norm 1 (∥αi ∥ = 1 for each i), then we say that A is orthonormal.
The standard basis st in Rn is orthonormal.

Lemma 1.15. If A is an orthonormal basis of a linear space V and u, v ∈ V ,


then

u ◦ v = [u]TA · [v]A ,

with dot product on the left and matrix multiplication on the right.

Proof. Let A = (α1, α2, . . . , αn). We have

u = a1 α1 + a2 α2 + · · · + an αn and v = b1 α1 + b2 α2 + · · · + bn αn

with

[u]A = [a1, a2, . . . , an]^T and [v]A = [b1, b2, . . . , bn]^T.

We have

u ◦ v = (Σi ai αi) ◦ (Σj bj αj) = Σi Σj ai bj (αi ◦ αj) = Σi ai bi = [u]^T_A · [v]A,

since αi ◦ αj = 0 for i ≠ j and αi ◦ αi = 1.

Lemma 1.16. If A = (α1, α2, . . . , αn) is an orthonormal basis of a linear
space V and v ∈ V, then

[v]A = [v ◦ α1, v ◦ α2, . . . , v ◦ αn]^T.

Proof. We have v = b1 α1 + b2 α2 + · · · + bn αn, where by the definition

[v]A = [b1, b2, . . . , bn]^T.

From orthonormality of A,

v ◦ αi = (b1 α1 + · · · + bn αn) ◦ αi = bi.


Gram-Schmidt process

Let

U = lin(α1, α2, . . . , αk)

be a subspace of V spanned by a sequence of orthogonal vectors A =
(α1, α2, . . . , αk). For v ∈ V we let

πA(v) = Σ_{i=1}^{k} ((v ◦ αi) / (αi ◦ αi)) αi.

Lemma 1.17. If A is orthogonal, then πA : V → U is orthogonal projection


onto U , i.e. for each v ∈ V

v − πA (v) ⊥ U

and for each u ∈ U

πA (u) = u.

Proof. Since U is spanned by A it is enough to check that v − πA(v) ⊥ αj
for each j. We have

(v − πA(v)) ◦ αj = (v − Σ_{i=1}^{k} ((v ◦ αi)/(αi ◦ αi)) αi) ◦ αj
               = v ◦ αj − Σ_{i=1}^{k} ((v ◦ αi)/(αi ◦ αi)) (αi ◦ αj) = 0,

since αi ◦ αj = 0 for i ≠ j.

Let B = (β1, β2, . . . , βn) be a basis of V. The Gram-Schmidt
orthogonalization process is the following construction. Let

α1 = β1,
α2 = β2 − π(α1)(β2),
α3 = β3 − π(α1,α2)(β3),
. . .
αn = βn − π(α1,α2,...,αn−1)(βn).

Then

A = ( α1/∥α1∥, α2/∥α2∥, . . . , αn/∥αn∥ )

is an orthonormal basis of V. Moreover, for each k we have

lin(α1, α2, . . . , αk) = lin(β1, β2, . . . , βk).

It follows directly from the construction and from Lemma 1.17.
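A minimal NumPy sketch of the process described above (the classical projection-and-subtract loop, without any safeguards against linearly dependent input):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors (Gram-Schmidt)."""
    ortho = []
    for beta in vectors:
        alpha = np.array(beta, dtype=float)
        for q in ortho:
            alpha -= (alpha @ q) * q      # subtract the projection onto q (q has norm 1)
        ortho.append(alpha / np.linalg.norm(alpha))
    return np.array(ortho)

Q = gram_schmidt([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
print(np.round(Q @ Q.T, 10))   # identity matrix: the resulting rows are orthonormal
```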

Orthogonal matrices

An n × n matrix is orthogonal or orthonormal if both its rows and its


columns are orthonormal sequences of vectors. Equivalently, A is orthogonal
if

AAT = AT A = I,

where I is the identity matrix. Equivalently, A is orthogonal if

AT = A−1 .

1.11 Determinant
Each square n × n matrix A is assigned a real number called its determinant,
denoted det(A) or |A|. If

A = [ a11 a12 . . . a1n ]
    [ a21 a22 . . . a2n ]
    [  ·   ·  . . .  ·  ]
    [ an1 an2 . . . ann ],

then

| a11 a12 . . . a1n |
| a21 a22 . . . a2n |
|  ·   ·  . . .  ·  |
| an1 an2 . . . ann |

denotes the determinant of A (notice the brackets).


For n = 1 we have |a11| = a11. For n = 2 we have

| a11 a12 |
| a21 a22 | = a11 a22 − a12 a21.

In general, let Sn denote the group of permutations of the set {1, 2, . . . , n}
and for σ ∈ Sn let sgn(σ) denote the sign of the permutation σ. We have

det(A) = Σ_{σ ∈ Sn} sgn(σ) a1σ(1) a2σ(2) · · · anσ(n).

The above is called the permutation formula for determinants and it is a
concise way to define the unique function described in the following theorem.

Theorem 1.18. Determinant of square n × n matrices is the unique real


function with the following properties.

1. Adding a scalar multiple of one row to another row does not change the determinant (ri ← ri + a · rj
for i ≠ j).
2. Row swapping changes the sign of the determinant (ri ← rj , rj ← ri for
i ̸= j).


3. Row multiplication by a scalar a changes the determinant by the factor a


(ri ← a · ri ).
4. The determinant of the identity matrix is 1.

The permutation formula has n! terms and while for n = 3 it still gives a
useful formula,

| a11 a12 a13 |
| a21 a22 a23 | = a11 a22 a33 + a12 a23 a31 + a13 a21 a32 − a13 a22 a31 − a12 a21 a33 − a11 a23 a32,
| a31 a32 a33 |

it is not as effective for computing determinants of higher orders.


Usually the most effective way is Gaussian elimination. It is used in
conjuction with the formula for determinants of block matrices.

Theorem 1.19. The determinant of a block triangular matrix is equal to the product
of the determinants of its diagonal blocks.

| 1  1 0 |                 | 1  1 0 |
| 3  0 2 | +(−3) · w1   →  | 0 −3 2 |
| 5  2 2 | +(−5) · w1      | 0 −3 2 |

| 1  1 0 |                        | 1  1  0   |
| 0 −3 2 | ·(−1/3)   →   (−3) ·   | 0  1 −2/3 |
| 0 −3 2 |                        | 0 −3  2   |

       | 1  1  0   |                   | 1  1  0   |
(−3) · | 0  1 −2/3 |         →  (−3) · | 0  1 −2/3 |
       | 0 −3  2   | +3 · w2           | 0  0  0   |

The last matrix is upper triangular and its determinant is the product of the
diagonal entries, so the initial matrix has determinant (−3) · 1 · 1 · 0 = 0.


Another useful formula is the Laplace expansion (or cofactor expansion).

| 1 1 0 |
| 3 0 2 | = (−1)^{2+1} · 3 · | 1 0 | + (−1)^{2+2} · 0 · | 1 0 | + (−1)^{2+3} · 2 · | 1 1 |
| 5 2 2 |                    | 2 2 |                     | 5 2 |                    | 5 2 |

          = −3(2 − 0) + 0 − 2(2 − 5) = 0.

In the above example we show the Laplace expansion along the second row
combined with the permutation formula for 2 × 2 matrices.

Theorem 1.20. An n × n matrix A is invertible if and only if det(A) ̸= 0.

Theorem 1.21 (Cauchy). If A, B are n × n matrices, then

det(AB) = det(A) det(B).

1.12 Eigenvectors
The following figure contains a plot of function φ : R2 → R2 defined with the
formula

φ((x, y)) = (x + 4y, 2x − y),

i.e. a linear map φ such that

M(φ)^st_st = [ 1  4 ]
             [ 2 −1 ].

The vector field on the left depicts vectors φ((x, y)) attached to points (x, y) in
the plane. The vector field on the right depicts the same vectors, normalized
to equal lengths.

[Figure: the vector field of φ (left) and the same vectors normalized to equal lengths (right), plotted over [−4, 4] × [−4, 4].]

These plots make it clear that φ acts along two axes and has a simple
algebraic nature, but this algebraic simplicity cannot be seen in the matrix of
φ in the standard basis. The two directions seen on the plots are eigenvectors
of φ.
Let φ : V → V be a linear automorphism. We say that v ∈ V is an
eigenvector for φ with eigenvalue λ ∈ R if v ̸= 0 and

φ(v) = λv.

Let A be a basis of V. Clearly, φ(v) = λv if and only if

M(φ)^A_A · [v]A = λ[v]A,

so [v]A is a non-zero solution of a homogeneous system of linear equations
with matrix

M(φ)^A_A − λI,

where I is the identity matrix. Such a solution exists if and only if the rank of
M(φ)^A_A − λI is not maximal, i.e. when

p(λ) = pφ(λ) = det(M(φ)^A_A − λI)

is equal to zero. We call p(λ) the characteristic polynomial of the matrix
M(φ)^A_A or of the map φ, as it does not depend on the choice of A (from Cauchy's
theorem about determinants). Directly from the definitions, the roots of p are
the eigenvalues of φ.


Sometimes we don’t want to keep track of φ and A and prefer to talk about the
characteristic polynomial of an n × n matrix A,

p(λ) = pA(λ) = det(A − λI),

its eigenvalues and its eigenvectors.


For the matrix

A = [ 1  4 ]
    [ 2 −1 ]

the characteristic polynomial is

p(λ) = det [ 1 − λ     4    ] = λ² − 9,
           [   2    −1 − λ  ]

with eigenvalues −3 and 3. The eigenvectors are kernels of A − 3I and A + 3I.

A − 3I = [ −2  4 ]
         [  2 −4 ]

We have to determine the kernel of A − 3I.

[ −2  4 ] ·(−1/2)      [ 1 −2 ]
[  2 −4 ]           →  [ 2 −4 ]

[ 1 −2 ]                  [ 1 −2 ]
[ 2 −4 ] +(−2) · w1   →   [ 0  0 ]

We have x1 = 2x2, the solution space is {(2x2, x2) : x2 ∈ R} = lin((2, 1)) and
(2, 1) is an eigenvector that spans the eigenspace corresponding to eigenvalue
3. Indeed, we have

[ 1  4 ]   [ 2 ]   [ 6 ]       [ 2 ]
[ 2 −1 ] · [ 1 ] = [ 3 ] = 3 · [ 1 ].

Similarly, (1, −1) is an eigenvector with eigenvalue −3. If we let A =
((2, 1), (1, −1)), then

M(φ)^A_A = [ 3  0 ]
           [ 0 −3 ].

The simple geometrical nature of φ is now reflected in its algebraic representation.
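The same eigenvalues and eigenvectors can be obtained numerically; np.linalg.eig returns unit-norm eigenvectors, so they are scalar multiples of (2, 1) and (1, −1):

```python
import numpy as np

A = np.array([[1, 4],
              [2, -1]], dtype=float)

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                 # 3 and -3 (possibly in a different order)
print(eigenvectors)                # columns are eigenvectors, proportional to (2, 1) and (1, -1)
print(np.allclose(A @ eigenvectors, eigenvectors * eigenvalues))   # True
```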

1.13 Self-adjoint operators


Geometrically, a self-adjoint operator is a linear transformation V → V with
orthogonal eigenspaces that sum to the entire V . We will prove this in a
form of a Spectral Theorem in the next section; the definition and basic
properties of a self-adjoint operator are much simpler.
A linear transformation φ : V → V is self-adjoint if for each u, v ∈ V we
have

φ(u) ◦ v = u ◦ φ(v),

where ◦ denotes the dot product. For example, a projection onto a subspace
by least distance is self-adjoint, while rotation (in general) is not.

Lemma 1.22. If A is a symmetric n × n matrix, then M^T A M is symmetric,
for any n × n matrix M.

Proof. We have

(M^T A M)^T = M^T A^T (M^T)^T = M^T A M.

If A = M^T B M for some matrix M, then we say that A and B are
congruent. Therefore Lemma 1.22 states that a matrix congruent to a
symmetric matrix is also symmetric.


Lemma 1.23. A linear transformation φ : V → V is self-adjoint if and only
if the matrix M(φ)^A_A is symmetric for some orthonormal basis A of V (or, by
Lemma 1.22, any orthonormal basis of V).

Proof. Let A be an orthonormal basis of V. From Lemma 1.15, we have

u ◦ φ(v) = [u]^T_A · [φ(v)]A = [u]^T_A · M(φ)^A_A · [v]A

and

φ(u) ◦ v = [φ(u)]^T_A · [v]A = (M(φ)^A_A · [u]A)^T · [v]A = [u]^T_A · (M(φ)^A_A)^T · [v]A.

If φ is self-adjoint, then the left-hand sides are equal and from the right-hand
sides we see that M(φ)^A_A = (M(φ)^A_A)^T, so M(φ)^A_A is symmetric.
If M(φ)^A_A is symmetric, then the right-hand sides are equal and from the left-hand
sides we see that φ is self-adjoint.

Proposition 1.24. If φ : V → V is self-adjoint and W ⊂ V is an invariant


subspace under φ, then W ⊥ is invariant under φ.

Proof. Let v ∈ W ⊥ . For any w ∈ W we have v ◦ φ(w) = 0, since W is


invariant under φ. By self-adjointness, φ(v) ◦ w = 0, so (since w was any
vector in W ) φ(v) ∈ W ⊥ .

1.14 Spectral Theorem


The main insight is that every self-adjoint operator has an eigenvector. To
prove it we use some facts from calculus and the details are discussed in the
next chapter.

Lemma 1.25. Every symmetric n × n matrix A has an eigenvector.

Proof. Let f : Rn → R be a map defined by a formula

f (v) = v ◦ (Av),

where ◦ is the dot product. It is a differentiable function and its gradient


at p ∈ Rn is 2Ap (to prove it we need symmetry of A, see Problem 2.9.6).


The restriction of f to the unit sphere S n−1 ⊂ Rn attains its maximum for
some p ∈ S n−1 (since the sphere is compact and f is continuous). At p, every
tangent vector to S n−1 must be orthogonal to the gradient. Hence p⊥ ⊥ Ap,
so Ap ∈ (p⊥ )⊥ = lin p. This means that p is an eigenvector for A.

Theorem 1.26. A self-adjoint linear transformation φ : V → V admits an


orthonormal basis consisting of eigenvectors of φ.

Proof. The proof is by induction on the dimension of V . The statement is


clear in dimension 1.
Let dim V > 1. By Lemma 1.25 there exists an eigenvector v ∈ V of φ
such that ∥v∥ = 1. Let U = v⊥. By Proposition 1.24, U is an invariant
subspace for φ, so ψ = φ|U is a self-adjoint operator on U . By the inductive
assumption, there exists orthonormal basis {v2 , v3 , . . . , vn } of U consisting of
eigenvectors of ψ. Then A = {v, v2 , v3 , . . . , vn } is an orthonormal basis of V
consisting of eigenvectors of φ.

We say that an n × n matrix A is orthogonally diagonalizable if there


exists orthogonal matrix M such that M −1 AM is diagonal.
Theorem 1.27. An n × n matrix A is orthogonally diagonalizable if and
only if A is symmetric.

Proof. The "only if" part follows from the observation that if B = M −1 AM
and M is orthogonal, then A and B are congruent. By Lemma 1.22, if B is
diagonal, then A has to be symmetric.
If A is symmetric, then φ : Rn → Rn such that

M(φ)^st_st = A

is self-adjoint. By Theorem 1.26, there exists an orthonormal basis A of Rn
consisting of eigenvectors of A. Then

M(φ)^A_A = M^A_st · A · M^st_A

is diagonal and, since M^st_A is orthogonal, A is orthogonally diagonalizable.
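For symmetric matrices, NumPy's eigh computes such an orthogonal diagonalization directly; a short sketch with an arbitrary symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0]])    # an arbitrary symmetric matrix

eigenvalues, M = np.linalg.eigh(A)   # columns of M are orthonormal eigenvectors
D = np.diag(eigenvalues)

print(np.allclose(M.T @ M, np.eye(3)))   # True: M is orthogonal
print(np.allclose(M @ D @ M.T, A))       # True: A = M D M^{-1} with M^{-1} = M^T
```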

We say that two symmetric n × n matrices A, B are simultaneously


diagonalizable if there exists orthogonal matrix M such that M −1 AM and
M −1 BM are diagonal.


In other words A, B are simultaneously diagonalizable if and only if there


exists a single basis consisting of eigenvectors of both A and B.
We say that matrices A, B commute if AB = BA.

Lemma 1.28. If n × n symmetric matrices A, B are simultaneously diagon-


alizable, then they commute.

Proof. We have A = M −1 DM and B = M −1 EM with D, E diagonal. Then

AB = M −1 DEM = M −1 EDM = BA.

Theorem 1.29. If n × n symmetric matrices A, B commute, then they are


simultaneously diagonalizable.

1.15 Positive semidefinite matrices


We say that a symmetric n × n matrix is positive semidefinite if for each
v ∈ Rn we have

(Av) ◦ v ≥ 0.

From the geometric interpretation of the dot product, it says that the angle
between Av and v does not exceed 90◦ .
A matrix is positive definite if for each v ∈ Rn such that v ≠ 0 we have

(Av) ◦ v > 0.

Theorem 1.30. Let A be a symmetric n × n matrix and let Ak be the square k × k
submatrix of A spanned by the intersection of the first k columns and rows.
The matrix A is positive definite if and only if det(Ak) > 0 for each k,
if and only if all eigenvalues of A are positive.

Square roots of positive semidefinite matrices

Theorem 1.31. For a positive semidefinite n × n matrix A there exists a


unique positive semidefinite n × n matrix B such that B 2 = A.


Proof. From the Spectral Theorem (Theorem 1.27),

A = M^{-1} D M

for some orthogonal matrix M and diagonal matrix D,

D = [ λ1  0  . . .  0  ]
    [ 0  λ2  . . .  0  ]
    [ ·   ·  . . .  ·  ]
    [ 0   0  . . .  λn ].

Since A is positive semidefinite, λi ≥ 0 for each 1 ≤ i ≤ n and the matrix

E = [ √λ1   0   . . .   0  ]
    [  0   √λ2  . . .   0  ]
    [  ·    ·   . . .   ·  ]
    [  0    0   . . .  √λn ]

is well defined. Let

B = M^{-1} E M.

We have

B² = M^{-1} E² M = A.

To prove uniqueness, let us consider a positive semidefinite n × n matrix C
such that C² = A. Let p be a polynomial such that p(λi) = √λi. Since C
and A = C² commute, also C and p(A) commute. We have

p(A) = p(M^{-1} D M) = M^{-1} p(D) M = M^{-1} E M = B,

so also C and B commute.

By Theorem 1.29, there exists an orthogonal matrix N such that

B = N^{-1} F N and C = N^{-1} G N,

where F and G are diagonal matrices. Let φ1, . . . , φn be the diagonal of F and
let γ1, . . . , γn be the diagonal of G. From the equality B² = C² we have φi² = γi².
Since B and C are positive semidefinite, we have φi, γi ≥ 0, so φi = γi, i.e.
B = C. Therefore the matrix B is unique.
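The construction from the proof translates into a few lines of NumPy (a sketch, not a production routine; np.clip guards against tiny negative eigenvalues caused by round-off):

```python
import numpy as np

def psd_sqrt(A):
    """Positive semidefinite square root of a symmetric positive semidefinite matrix."""
    eigenvalues, M = np.linalg.eigh(A)                     # A = M diag(eigenvalues) M^T
    E = np.diag(np.sqrt(np.clip(eigenvalues, 0.0, None)))  # square roots of the eigenvalues
    return M @ E @ M.T                                     # B = M E M^T, so B @ B = A

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
B = psd_sqrt(A)
print(np.allclose(B @ B, A))   # True
```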

1.16 Gram matrix


The Gram matrix of a system of vectors v1, v2, . . . , vn ∈ Rm is the matrix

G = [ v1 ◦ v1  v1 ◦ v2  . . .  v1 ◦ vn ]
    [ v2 ◦ v1  v2 ◦ v2  . . .  v2 ◦ vn ]
    [    ·        ·     . . .     ·    ]
    [ vn ◦ v1  vn ◦ v2  . . .  vn ◦ vn ].

It can also be written as G = C^T C, where

C = [ v1  v2  · · ·  vn ]

is the matrix with v1, v2, . . . , vn as columns.

Every Gram matrix is positive semidefinite, as for G = C^T C we have

(Gv) ◦ v = v^T G v = v^T C^T C v = (Cv) ◦ (Cv) ≥ 0.

Theorem 1.32. Let A be a symmetric n × n matrix and let m ≥ n. The matrix
A is positive semidefinite if and only if there exists an m × n matrix C such
that A = C^T C.

Proof. ⇒: take

C = [ B ]
    [ 0 ],

where B is the square root of A from Theorem 1.31, padded with m − n zero rows.
⇐: C^T C is symmetric by Lemma 1.9. It is a Gram matrix, so it is
positive semidefinite, as discussed above.
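A small NumPy illustration (the columns of C are arbitrary example vectors):

```python
import numpy as np

C = np.array([[1.0, 0.0, 2.0],
              [1.0, 1.0, 0.0]])        # columns are the vectors v1, v2, v3 in R^2

G = C.T @ C                            # Gram matrix: G[i, j] = v_{i+1} ◦ v_{j+1}
print(G)
print(np.all(np.linalg.eigvalsh(G) >= -1e-12))   # True: G is positive semidefinite
```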

1.17 The LU decomposition


Let A be an n × n matrix. An LU-decomposition of A is a factorization
of A

A = LU

into the product of a lower triangular matrix L and an upper triangular


matrix U .
It turns out that this is not always possible. If

[ a11 a12 ]   [ l11  0  ] [ u11 u12 ]
[ a21 a22 ] = [ l21 l22 ] [  0  u22 ],

then a11 = l11 u11 and if a11 is 0, then either l11 or u11 is 0. This means that
the right hand side is singular. Hence if A is non-singular and a11 = 0, e.g.

A = [ 0 1 ]
    [ 1 1 ],

then an LU factorization in the above form does not exist.
This can be easily fixed by allowing row permutations. LU decomposition
with partial pivoting (LUP) is an LU decomposition with row
permutations

P A = LU,

where P is a permutation matrix that reorders rows of A when A is multiplied


by P from the left.
A familiar procedure that we used to compute the inverse matrix A−1
can be used to compute LUP.

1. U = A, L = I, P = I.
2. For k = 1, 2, . . . , n − 1:
   1. Select i ≥ k to maximize |uik|.
   2. Swap the kth and ith rows of U.
   3. Swap the kth and ith rows of L, but only the part up to the (k − 1)th column
      (the part under the diagonal).
   4. Swap the kth and ith rows of P.
   5. For j = k + 1, k + 2, . . . , n:
      1. ljk = ujk / ukk.
      2. Subtract from the jth row of U the kth row of U multiplied by ljk.

A Python sketch of this procedure is given below.
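A minimal NumPy sketch of the procedure above (assuming a square invertible A; it follows the pseudocode rather than a library routine):

```python
import numpy as np

def lup(A):
    """LU decomposition with partial pivoting: returns P, L, U with P @ A = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        i = k + np.argmax(np.abs(U[k:, k]))   # pivot row: largest |u_ik| with i >= k
        U[[k, i]] = U[[i, k]]                 # swap rows of U
        L[[k, i], :k] = L[[i, k], :k]         # swap rows of L, only the part under the diagonal
        P[[k, i]] = P[[i, k]]                 # swap rows of P
        for j in range(k + 1, n):
            L[j, k] = U[j, k] / U[k, k]
            U[j] -= L[j, k] * U[k]
    return P, L, U

A = np.array([[2, 1, 1, 0], [4, 3, 3, 1], [8, 7, 9, 5], [6, 7, 9, 8]], dtype=float)
P, L, U = lup(A)
print(np.allclose(P @ A, L @ U))   # True
```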

Let Uk, Lk, Pk denote the matrices U, L, P obtained in subsequent steps of
the computation, with U0 = A, L0 = I, P0 = I. We will show the following
invariant:

Lk Pk A = Uk.

Let Q be the matrix of the elementary row operation that swaps rows i and k,
i.e. if Q = (qij)_{1≤i,j≤n}, then qmm = 1 for m ≠ k, i, qki = 1, qik = 1 and all
other entries are 0.
Let E be the matrix of the Gaussian elimination step on the kth column of
U, i.e. E is the identity matrix with the kth column replaced by

[0, 0, . . . , 0, 1, −u(k+1)k/ukk, −u(k+2)k/ukk, . . . , −unk/ukk]^T.

We have

Pk = QPk−1 ,

Uk = EQUk−1 ,

and

Lk = EQLk−1 Q−1 .

The last equality is notable: we use the fact that in columns k, k + 1, . . . , n


matrix Lk−1 is still equal to the identity matrix, hence QLk−1 Q−1 swaps kth
and ith row of Lk−1 up to (k − 1)th column (under the diagonal).
We have

Lk Pk A = EQLk−1 Q−1 QPk−1 A = EQLk−1 Pk−1 A = EQUk−1 = Uk .

Therefore


Ln−1 Pn−1 A = Un−1 .

Since we ran Gaussian elimination on U to an echelon form, Un−1 is upper


triangular. It is clear from the construction that each Lk is lower triangular.
Since inverse of a lower triangular matrix is lower triangular, we obtain an
LUP decomposition of A

P A = LU

for P = Pn−1 , L = (Ln−1 )−1 , U = Un−1 .

It is best explained on an example. Take

A = [ 2 1 1 0 ]           P = [ 1 0 0 0 ]
    [ 4 3 3 1 ]    and        [ 0 1 0 0 ]
    [ 8 7 9 5 ]               [ 0 0 1 0 ]
    [ 6 7 9 8 ]               [ 0 0 0 1 ].

We have

[U|L] = [ 2 1 1 0 | 1 0 0 0 ]
        [ 4 3 3 1 | 0 1 0 0 ]
        [ 8 7 9 5 | 0 0 1 0 ]
        [ 6 7 9 8 | 0 0 0 1 ].

We start by swapping rows 1 and 3 of U, L and P, but in L only up to the diagonal.

P = [ 0 0 1 0 ]         [U|L] = [ 8 7 9 5 | 1 0 0 0 ]
    [ 0 1 0 0 ]                 [ 4 3 3 1 | 0 1 0 0 ]
    [ 1 0 0 0 ]                 [ 2 1 1 0 | 0 0 1 0 ]
    [ 0 0 0 1 ]                 [ 6 7 9 8 | 0 0 0 1 ]

Then we run Gaussian elimination on the first column (w2 += (−1/2) · w1, w3 += (−1/4) · w1, w4 += (−3/4) · w1):

[ 8   7     9     5   |   1   0  0  0 ]
[ 0  −1/2  −3/2  −3/2 | −1/2  1  0  0 ]
[ 0  −3/4  −5/4  −5/4 | −1/4  0  1  0 ]
[ 0   7/4   9/4  17/4 | −3/4  0  0  1 ]

The swap operation above is only required when we encounter 0 on the
diagonal. We swap in a row with the highest absolute value for numerical stability;
we don't want to divide by small numbers.

Next we swap rows 2 and 4 (in the L-part only below the diagonal), and the same rows of P:

[ 8   7     9     5   |   1   0  0  0 ]         P = [ 0 0 1 0 ]
[ 0   7/4   9/4  17/4 | −3/4  1  0  0 ]             [ 0 0 0 1 ]
[ 0  −3/4  −5/4  −5/4 | −1/4  0  1  0 ]             [ 1 0 0 0 ]
[ 0  −1/2  −3/2  −3/2 | −1/2  0  0  1 ]             [ 0 1 0 0 ]

Gaussian elimination on the second column (w3 += (3/7) · w2, w4 += (2/7) · w2):

[ 8   7     9     5   |   1    0   0  0 ]
[ 0   7/4   9/4  17/4 | −3/4   1   0  0 ]
[ 0   0    −2/7   4/7 | −4/7  3/7  1  0 ]
[ 0   0    −6/7  −2/7 | −5/7  2/7  0  1 ]

The last swap (rows 3 and 4, again performed on P too):

[ 8   7     9     5   |   1    0   0  0 ]         P = [ 0 0 1 0 ]
[ 0   7/4   9/4  17/4 | −3/4   1   0  0 ]             [ 0 0 0 1 ]
[ 0   0    −6/7  −2/7 | −5/7  2/7  1  0 ]             [ 0 1 0 0 ]
[ 0   0    −2/7   4/7 | −4/7  3/7  0  1 ]             [ 1 0 0 0 ]

Gaussian elimination on the last (third) column (w4 += (−1/3) · w3):

[ 8   7     9     5   |   1    0    0   0 ]
[ 0   7/4   9/4  17/4 | −3/4   1    0   0 ]
[ 0   0    −6/7  −2/7 | −5/7  2/7   1   0 ]
[ 0   0     0     2/3 | −1/3  1/3 −1/3  1 ]

We have

L = (Ln−1)^{-1} = [   1     0    0   0 ]^{-1}   [ 1     0    0   0 ]
                  [ −3/4    1    0   0 ]      = [ 3/4   1    0   0 ]
                  [ −5/7   2/7   1   0 ]        [ 1/2 −2/7   1   0 ]
                  [ −1/3   1/3 −1/3  1 ]        [ 1/4 −3/7  1/3  1 ].

We can compute L^{-1} using Gaussian elimination (a pass to echelon form is
sufficient as L is lower triangular). Alternatively, we can find the inverse using a
factorization of L into elementary row operations.

P = [ 0 0 1 0 ]          U = [ 8   7    9     5   ]
    [ 0 0 0 1 ]              [ 0  7/4  9/4  17/4  ]
    [ 0 1 0 0 ]      and     [ 0   0  −6/7  −2/7  ]
    [ 1 0 0 0 ]              [ 0   0   0     2/3  ].

We can verify that

P A = [ 0 0 1 0 ] [ 2 1 1 0 ]   [ 8 7 9 5 ]
      [ 0 0 0 1 ] [ 4 3 3 1 ]   [ 6 7 9 8 ]
      [ 0 1 0 0 ] [ 8 7 9 5 ] = [ 4 3 3 1 ]
      [ 1 0 0 0 ] [ 6 7 9 8 ]   [ 2 1 1 0 ]

    = [ 1     0    0   0 ] [ 8   7    9    5   ]
      [ 3/4   1    0   0 ] [ 0  7/4  9/4  17/4 ]
      [ 1/2 −2/7   1   0 ] [ 0   0  −6/7 −2/7  ] = LU.
      [ 1/4 −3/7  1/3  1 ] [ 0   0   0    2/3  ]
3

Back and forward substitution

As an application of LU factorization consider the following system of linear


equalities.

Ax = b.

If A = LU , then to solve the system we may first find y such that Ly = b


and next find x such that U x = y. Then

Ax = LU x = Ly = b.

If L is lower triangular, then we may solve Ly = b by forward substitution.


We can compute the value of y1 from the first row of L. Then we may
substitute y1 and compute y2 from the second row. We find the remaining
values of y with the same method.
If U is upper triangular, then we may solve U x = y by backward substi-
tution. We find value of xn from the last row of U and then, like above, we
proceed to find xn−1 , xn−2 , . . . by substitution.
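A NumPy sketch of forward and backward substitution (assuming the triangular matrices have nonzero diagonal entries):

```python
import numpy as np

def forward_substitution(L, b):
    y = np.zeros_like(b, dtype=float)
    for i in range(len(b)):                       # row i only involves y[0..i-1], already known
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_substitution(U, y):
    x = np.zeros_like(y, dtype=float)
    for i in reversed(range(len(y))):             # start from the last equation
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

L = np.array([[2.0, 0.0], [1.0, 3.0]])
U = np.array([[1.0, 2.0], [0.0, 4.0]])
b = np.array([2.0, 13.0])
x = backward_substitution(U, forward_substitution(L, b))
print(np.allclose(L @ U @ x, b))   # True
```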

1.18 The Cholesky decomposition


Theorem 1.33. Every symmetric positive definite matrix A has unique
factorization of the form

A = LLT ,

where L is a lower triangular matrix with positive diagonal entries.

The matrix L is called the lower Cholesky factor of A. Note that if


A = LLT , then A is positive semidefinite as a Gram matrix. Cholesky factors


always exist for positive semidefinite matrices, but they don't have to be
unique. For example

[ 0 0 ]   [   0      0    ] [ 0  sin α ]
[ 0 1 ] = [ sin α   cos α ] [ 0  cos α ]

shows infinitely many Cholesky decompositions of the matrix [ 0 0 ; 0 1 ].
Theorem 1.33 is proved by induction on the size n of the matrix.
The case n = 2 is the crux of the matter. Let

A = [ a b ]            L = [ λ 0 ]
    [ b d ]    and         [ x δ ].

We have the equation

L L^T = [ λ 0 ] [ λ x ]   [ λ²    λx      ]   [ a b ]
        [ x δ ] [ 0 δ ] = [ λx   x² + δ²  ] = [ b d ] = A.

Solving for λ, x, δ we have

λ = √a,   x = b/√a,   δ = √(d − x²) = √(d − b²/a).

Since A is positive definite, we have a > 0 and ad − b² > 0, so d − x² =
(ad − b²)/a > 0. Therefore the above solution is well defined. It also is the only
solution with positive λ and δ.
The above step may be upgraded to a full recursive procedure of computing
L.

Lemma 1.34. Let Q be nonsingular, x be a column vector, d be a number,
and let

A = [ QQ^T     Qx ]
    [ (Qx)^T   d  ]

be positive definite. Then ∥x∥² < d.

Proof. Let

u = [ (Q^T)^{-1} x ]
    [      −1      ].

Then

Au = [     0      ]
     [ ∥x∥² − d ]

because

QQ^T (Q^T)^{-1} x − Qx = Qx − Qx = 0,
(Qx)^T (Q^T)^{-1} x = x^T Q^T (Q^T)^{-1} x = x^T x = ∥x∥².

Consequently,

0 < (Au) ◦ u = −∥x∥² + d.

Proof (Theorem 1.33). Assume the theorem holds for all matrices of size n × n.
Let

A = [ Ã    b ]
    [ b^T  d ],

where Ã is an n × n matrix, b is a column vector and d is a number. By the
inductive hypothesis, there exists a unique n × n lower triangular matrix Q
with positive diagonal entries such that Ã = QQ^T.

Let

L = [ Q    0 ]
    [ x^T  δ ].

The equation

L L^T = [ Q    0 ] [ Q^T  x ]   [ QQ^T      Qx       ]
        [ x^T  δ ] [ 0    δ ] = [ (Qx)^T   ∥x∥² + δ² ] = A

is equivalent to

QQ^T = Ã,   Qx = b,   ∥x∥² + δ² = d.

We obtain the desired lower triangular matrix L by letting

x = Q^{-1} b   and   δ = √(d − ∥x∥²).

The square root is well-defined by Lemma 1.34.

Computational example

Let

A = [   4   12  −16 ]
    [  12   37  −43 ]
    [ −16  −43   98 ].

It is a symmetric positive definite matrix. To find its Cholesky factor we
repeat the computational steps of the above proof.
First, from the 1 × 1 matrix in the upper left corner of A we get:

L = [ 2 0 0 ]
    [ · · 0 ]
    [ · · · ].

Next, we have Q = [ 2 ] and we have

x = Q^{-1} b = [ 1/2 ] [ 12 ] = [ 6 ]

and

δ = √(37 − 6²) = 1.

Hence

L = [ 2 0 0 ]
    [ 6 1 0 ]
    [ · · · ].

Finally, for

Q = [ 2 0 ]
    [ 6 1 ]

we have

x = Q^{-1} b = [ 1/2  0 ] [ −16 ]   [ −8 ]
               [ −3   1 ] [ −43 ] = [  5 ]

and

δ = √(98 − ∥(−8, 5)∥²) = √(98 − 89) = 3.

This gives the final matrix

L = [  2  0  0 ]
    [  6  1  0 ]
    [ −8  5  3 ].

Indeed, we have

[  2  0  0 ] [ 2  6 −8 ]   [   4   12  −16 ]
[  6  1  0 ] [ 0  1  5 ] = [  12   37  −43 ]
[ −8  5  3 ] [ 0  0  3 ]   [ −16  −43   98 ].
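NumPy's cholesky returns exactly this lower triangular factor; a quick check on the matrix above:

```python
import numpy as np

A = np.array([[  4,  12, -16],
              [ 12,  37, -43],
              [-16, -43,  98]], dtype=float)

L = np.linalg.cholesky(A)          # the lower Cholesky factor
print(L)                           # [[ 2. 0. 0.], [ 6. 1. 0.], [-8. 5. 3.]]
print(np.allclose(L @ L.T, A))     # True
```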

1.19 Singular value decomposition


Let A be an m × n matrix. A singular value decomposition of A is a
factorization

A = UΣV,

where U and V are orthogonal and Σ is diagonal.

Theorem 1.35. Every matrix admits a singular value decomposition.

Proof. Consider an m × n matrix A. Let φ : Rn → Rm be a linear transformation
such that

M(φ)^st_st = A,

i.e. φ(v) = Av.


Matrix AT A is symmetric and positive semidefinite, therefore by the
spectral theorem it is orthogonally diagonalizable,

AT A = M DM T ,

where D is a diagonal matrix and M is an orthogonal matrix. We have

M T AT AM = D,

50
1.19 Singular value decomposition

so (AM )T (AM ) = D.
Let A = (α1 , α2 , . . . , αn ) be sequence of column vectors of M . Since M
is orthogonal, A is an orthonormal basis of Rn . We have

M (φ)st
A = AM,

so φ maps vectors from basis A onto a set of pairwise orthogonal vectors


(φ(α1 ), φ(α2 ), . . . , φ(αn )) (some may be equal to 0).
If r(A) = m, then we may pick an orthogonal basis

(β1, β2, . . . , βm)

from this sequence. If r(A) < m, we pick an orthogonal basis of the image of φ
and extend it to an orthogonal basis of Rm with arbitrary vectors. Without
loss of generality we may assume that the vectors of A (hence the columns of M) are
ordered in such a way that φ(αi) = βi, for i = 1, 2, . . . , r(A). Let

B = ( β1/∥β1∥, β2/∥β2∥, . . . , βm/∥βm∥ ).

It is an orthonormal basis of Rm. Let N = M^B_st. We have

M(φ)^B_A = M^B_st · A · M^st_A = N A M = Σ,

where Σ is the diagonal m × n matrix whose ith diagonal entry is ∥βi∥ = √di,
with di the ith diagonal entry of D. Hence

A = UΣV

for U = N^T = M^st_B and V = M^T.

Remark. Assume that

A = UΣV

is a singular value decomposition of an m × n matrix A. Let σi be the ith non-zero
element of the diagonal of Σ. Let Σi be the zero m × n matrix with the ith
diagonal element set to σi. We have

A = UΣV = U (Σ1 + Σ2 + · · · + Σr(A)) V = σ1 u1 v1 + σ2 u2 v2 + · · · + σr(A) ur(A) vr(A),

where vi is the ith row of V (we treat it as a 1 × n matrix) and ui is the ith
column of U (we treat it as an m × 1 matrix). Note that vi^T is an eigenvector
of A^T A corresponding to the eigenvalue σi² and ui = A vi^T / ∥A vi^T∥.
If the rank of A is low, then the above decomposition uses less data to represent
A than the m · n entries of A. We can also approximate A ≈ σ1 u1 v1 + · · · + σk uk vk by
throwing away small factors starting from σk+1 (it helps to arrange the singular
values on the diagonal in descending order).
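A NumPy sketch of the decomposition and of the low-rank approximation (np.linalg.svd returns A = U diag(s) Vh with the singular values in descending order):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

U, s, Vh = np.linalg.svd(A)              # s holds the singular values, largest first
print(s)                                 # approximately [9.508032, 0.772869]

A1 = s[0] * np.outer(U[:, 0], Vh[0])     # rank-1 approximation sigma_1 * u_1 * v_1
print(A1)                                # close to A; the error is controlled by the discarded s[1]
```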

Computational example

Let

A = [ 1 2 3 ]
    [ 4 5 6 ].

We have

A^T A = [ 17 22 27 ]
        [ 22 29 36 ]
        [ 27 36 45 ].

Eigenvalues of A^T A are 0, 0.597327 and 90.402672 (in this example we'll truncate all numbers after six decimal places).
Eigenspaces are

V0 = lin((1, −2, 1)),

V0.597327 = lin((1, 0.139438, −0.721122)),

V90.402672 = lin((1, 1.321087, 1.642175)).

After norming the vectors, we have

A = ( (0.408248, −0.816496, 0.408248),
      (0.805964, 0.112382, −0.581198),
      (0.428667, 0.566306, 0.703946) ).

We compute φ(αi) = Aαi:

A · (0.408248, −0.816496, 0.408248)^T = (0.0, 0.0)^T,
A · (0.805964, 0.112382, −0.581198)^T = (−0.712867, 0.298575)^T,
A · (0.428667, 0.566306, 0.703946)^T = (3.673121, 8.769883)^T.

We pick

β1 = (−0.712867, 0.298575),
β2 = (3.673121, 8.769883).

We rearrange A to

A = ( (0.805964, 0.112382, −0.581198),
      (0.428667, 0.566306, 0.703946),
      (0.408248, −0.816496, 0.408248) ).

After norming we get

B = ( (−0.922364, 0.386321), (0.386317, 0.922365) ).

Hence

U = (M^B_st)^{-1} = M^st_B = [ −0.922364  0.386317 ]
                             [  0.386321  0.922365 ],

V = M^A_st = [ 0.805964   0.112382  −0.581198 ]
             [ 0.428667   0.566306   0.703946 ]
             [ 0.408248  −0.816496   0.408248 ]

and

Σ = √D = [ √0.597327       0        0 ]   [ 0.772869     0      0 ]
         [     0       √90.402672   0 ] = [    0      9.508032  0 ].

Indeed, we have

UΣV = [ −0.922364  0.386317 ] [ 0.772869    0     0 ] [ 0.805964   0.112382  −0.581198 ]
      [  0.386321  0.922365 ] [    0     9.508032  0 ] [ 0.428667   0.566306   0.703946 ]
                                                       [ 0.408248  −0.816496   0.408248 ]

    = [ −0.712866  3.673114  0.0 ] [ 0.805964   0.112382  −0.581198 ]
      [  0.298575  8.769875  0.0 ] [ 0.428667   0.566306   0.703946 ]
                                   [ 0.408248  −0.816496   0.408248 ]

    = [ 0.999998  1.999993  2.999990 ]
      [ 3.999997  4.999987  5.999987 ].

We also have a decomposition

A = 0.772869 · [ −0.922364 ] [ 0.805964  0.112382  −0.581198 ]
               [  0.386321 ]

  + 9.508032 · [ 0.386317 ] [ 0.428667  0.566306  0.703946 ]
               [ 0.922365 ]

  = 0.772869 · [ −0.743392  −0.103657   0.536076 ]
               [  0.311360   0.043415  −0.224528 ]

  + 9.508032 · [ 0.165601  0.218773  0.271946 ]
               [ 0.395387  0.522340  0.649295 ]

  = [ −0.574544  −0.080113   0.414316 ] + [ 1.574542  2.080106  2.585674 ]
    [  0.240641   0.033554  −0.173531 ]   [ 3.759356  4.966433  6.173519 ]

  = [ 0.999998  1.999993  2.999990 ]
    [ 3.999997  4.999987  5.999987 ].

If we throw away the summand with the smaller coefficient 0.772869, we get an
approximation of A by a matrix of rank 1:

A ≈ 9.508032 · [ 0.386317 ] [ 0.428667  0.566306  0.703946 ] = [ 1.574542  2.080106  2.585674 ]
               [ 0.922365 ]                                    [ 3.759356  4.966433  6.173519 ].

The quality of the approximation depends on the magnitude of the singular values
that we throw away.

1.20 Problems
Problem 1.20.1. Verify that the formula

x ◦ y = ∥x∥ ∥y∥ cos ∢(x, y)

is equivalent to the law of cosines.

Hint. Compute the squared length of y − x as
(y − x) ◦ (y − x) and from the law of cosines.

Problem 1.20.2. Show the polarization identity, i.e.

(1/4) (∥u + v∥² − ∥u − v∥²) = u ◦ v,

for any u, v ∈ Rn.


Problem 1.20.3. Given two non-zero vectors u, v ∈ Rn , give the formula


for the unit vector which bisects the (smaller) angle between u and v.

Problem 1.20.4. Prove that vectors x1 , x2 , . . . , xm ∈ Rn are linearly de-


pendent if and only if one of the vectors can be expressed as a linear combin-
ation of others.

Problem 1.20.5. Find rank of the "multiplication table" matrix


 
1 2 3 4 5 6 7 8 9 10 
 
2 4 6 8 10 12 14 16 18 20 
 
 
3 6 9 12 15 18 21 24 27 30 
 
 
 
4 8 12 16 20 24 28 32 36 40 
 
 
5 10 15 20 25 30 35 40 45 50 
A=



6 12 18 24 30 36 42 48 54 60 
 
 
7 14 21 28 35 42 49 56 63 70 
 
 
8 16 24 32 40 48 56 64 72 80 
 
 
9 18 27 36 45 54 63 72 81 90 
 
 
10 20 30 40 50 60 70 80 90 100

Problem 1.20.6. Pick a basis of R3 from (2, 1, 3), (1, 2, 4), (3, 0, 2), (2, −2, 2).
How many solutions does this problem have?

Problem 1.20.7. Find a basis of the subspace of R4 described by the following
system of linear equations

  2x1 − x2 + x3 − x4 = 0
  x1 + 2x2 + x3 + 2x4 = 0
  3x1 + x2 + 2x3 + x4 = 0

Problem 1.20.8. Use back substitution to solve the system of linear equations
described by the following augmented matrix

[ −2 −3  2 |  1 ]
[  0  2  3 | −1 ]
[  0  0  1 |  7 ].

Problem 1.20.9. Find the matrix inverse and an LUP decomposition of

A = [ 1  3   2 ]
    [ 2  6  −2 ]
    [ 3  3   0 ].

Problem 1.20.10. Describe

V = lin((1, 2, 1, 3), (2, 5, 2, 7), (1, 3, 1, 4)) ⊂ R4

as solution space of a system of linear equations.

Problem 1.20.11. What are coordinates of (1, 8, 10, 10) in basis

((1, 2, 3, 1), (2, 1, 3, 3), (−1, 1, 0, −1), (0, 0, 1, 2))?

Problem 1.20.12. Let φ : R3 → R2 be a linear transformation defined by


the formula

φ((x1 , x2 , x3 )) = (x1 − x2 + 4x3 , −3x1 + 8x3 ).

Let A = ((3, 4, 1), (2, 3, 1), (5, 1, 1)), B = ((3, 1), (2, 1)). Find M(φ)^B_A and
M(φ)^st_st.

Problem 1.20.13. [Hef20, Problem 3.48c, p. 254] Find a way to multiply


two 2 × 2 matrices using only seven multiplications instead of the eight used
in the standard approach.

Problem 1.20.14. [put90] If A and B are square matrices of the same size
such that ABAB = 0, does it follow that BABA = 0?

Problem 1.20.15. Prove that the inverse of a symmetric matrix is symmet-


ric.


Problem 1.20.16. Compute determinant

2 4 −2 −2
1 3 1 2
.
1 3 1 3
−1 2 1 2

Problem 1.20.17. Find eigenvalues and eigenvectors of the matrices

[ −3  1   1 ]         [  0  1  0 ]
[  0  1   2 ]   and   [ −4  4  0 ]
[  0  2  −2 ]         [ −2  1  2 ].

Problem 1.20.18. Let α1, α2, . . . , αk be a basis of a subspace of Rn. Let

A = [ — α1 — ]
    [ — α2 — ]
    [    ·   ]
    [ — αk — ]

be a k × n matrix with the α vectors as rows. Show that Gaussian elimination
on the augmented matrix

[ A A^T | A ]

produces orthogonalized vectors in place of A, identical to the vectors obtained
via the Gram-Schmidt process.

Problem 1.20.19. Show that the Gram matrix of a system is invertible


(hence positive definite) if and only if the system of vectors is linearly inde-
pendent.

Problem 1.20.20. The Cayley-Hamilton Theorem states that a square n×n


matrix satisfies its own characteristic equation, i.e. that pA (A) = 0, where
pA is the characteristic polynomial for A.


Prove this theorem for 2 × 2 matrices.

Problem 1.20.21. Let

A = [  3/7  ·  · ]
    [ −6/7  ·  · ]
    [  2/7  ·  · ].

Fill in the blank spaces to make A orthogonal.

Problem 1.20.22. Show that if columns of a square n × n matrix A form


an orthonormal basis of Rn , then rows of A form an orthonormal basis of Rn
as well.

Problem 1.20.23. Find an orthogonal diagonalization of

A = [ 2 1 1 ]
    [ 1 2 1 ]
    [ 1 1 2 ].

Problem 1.20.24. Find a singular value decomposition of the matrix

[ 1 2 ]
[ 2 1 ]
[ 1 1 ].

Problem 1.20.25. Find singular value decompositions of

A = [ 1 2 3 4 ],   B = [ 1 ]           C = [ 1 2 ]
                       [ 2 ]    and        [ 3 4 ].
                       [ 3 ]
                       [ 4 ]

Problem 1.20.26. Let

   
1 2 3 4 1 2 3 4
   
2 4 6 8 2 4 5 6
   
A=  and B =  
3 6 9 12 3 5 6 7
   
   
4 8 12 16 4 6 7 8

be multiplication and addition tables. Find orthogonal diagonalizations of A


and B.

Problem 1.20.27. Find singular value decomposition of a Polish flag matrix


 
w w w w w
 
w w w w w
 
F = .
r r r r r
 
 
r r r r r

Hint. It is a matrix of rank 1.

Problem 1.20.28. Show that if v is an eigenvector of AT A, then Av is an


eigenvector of AAT .

Problem 1.20.29. Show that

UΣV = Σi σi ui vi,

where Σ is an m × n diagonal matrix, σi is the ith element of the diagonal of Σ, ui
is the ith column of an m × m matrix U and vi is the ith row of an n × n matrix V.

Problem 1.20.30. Find singular value decomposition of


 
3 1 4 2
 
1 4 1 4

(use decimal approximations).

Problem 1.20.31. Find singular value decomposition of

 
1 1 0
 
1 0 1

(with exact computation).

Problem 1.20.32. Check whether matrices


   
 1 2 1 1 2 2 
   
A=  2 3 3 , B =  2 8 0 
  
   
1 3 2 2 0 24

are positive definite. If possible, find their Cholesky decomposition.

Problem 1.20.33. Check whether matrices


   
1 −2 1  1 −2 1 
   
A=  −2 8 −14  , B = −2
  8 −14 

   
1 −14 28 1 −14 46

are positive definite. If possible, find their Cholesky decomposition.

Problem 1.20.34. Find Cholesky decomposition of a matrix


 
 16 −12 −12 −16
 
−12
 25 1 −4 
A= .
−12 1 17 14 
 
 
−16 −4 14 57

Problem 1.20.35. For what numbers b is the following matrix positive


definite?
 
2 −1 b 
 
A=
−1 2 −1.
 
b −1 2
