
Applied Linear Algebra

MAT 3341

Spring/Summer 2019

Alistair Savage

Department of Mathematics and Statistics

University of Ottawa

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Contents

Preface

1 Matrix algebra
  1.1 Conventions and notation
  1.2 Matrix arithmetic
  1.3 Matrices and linear transformations
  1.4 Gaussian elimination
  1.5 Matrix inverses
  1.6 LU factorization

2 Matrix norms, sensitivity, and conditioning
  2.1 Motivation
  2.2 Normed vector spaces
  2.3 Matrix norms
  2.4 Conditioning

3 Orthogonality
  3.1 Orthogonal complements and projections
  3.2 Diagonalization
  3.3 Hermitian and unitary matrices
  3.4 The spectral theorem
  3.5 Positive definite matrices
  3.6 QR factorization
  3.7 Computing eigenvalues

4 Generalized diagonalization
  4.1 Singular value decomposition
  4.2 Fundamental subspaces and principal components
  4.3 Pseudoinverses
  4.4 Jordan canonical form
  4.5 The matrix exponential

5 Quadratic forms
  5.1 Definitions
  5.2 Diagonalization of quadratic forms
  5.3 Rayleigh's principle and the min-max theorem

Index
Preface

These are notes for the course Applied Linear Algebra (MAT 3341) at the University of
Ottawa. This is a third course in linear algebra. The prerequisites are uOttawa courses
MAT 1322 and (MAT 2141 or MAT 2342).
In this course we will explore aspects of linear algebra that are of particular use in concrete
applications. For example, we will learn how to factor matrices in various ways that aid in
solving linear systems. We will also learn how one can effectively compute estimates of
eigenvalues when solving for precise ones is impractical. In addition, we will investigate the
theory of quadratic forms. The course will involve a mixture of theory and computation. It
is important to understand why our methods work (the theory) in addition to being able to
apply the methods themselves (the computation).

Acknowledgements: I would like to thank Benoit Dionne, Monica Nevins, and Mike Newman
for sharing with me their lecture notes for this course.

Alistair Savage

Course website: https://alistairsavage.ca/mat3341

Chapter 1

Matrix algebra

We begin this chapter by briefly recalling some matrix algebra that you learned in previous
courses. In particular, we review matrix arithmetic (matrix addition, scalar multiplication,
the transpose, and matrix multiplication), linear transformations, and gaussian elimination
(row reduction). Next we discuss matrix inverses. Although you have seen the concept of a
matrix inverse in previous courses, we delve into the topic in further detail. In particular,
we will investigate the concept of one-sided inverses. We then conclude the chapter with a
discussion of LU factorization, which is a very useful technique for solving linear systems.

1.1 Conventions and notation


We let Z denote the set of integers, and let N = {0, 1, 2, . . . } denote the set of natural
numbers.
In this course we will work over the field R of real numbers or the field C of complex
numbers, unless otherwise specified. To handle both cases simultaneously, we will use the
notation F to denote the field of scalars. So F = R or F = C, unless otherwise specified. We
call the elements of F scalars. We let F× = F \ {0} denote the set of nonzero scalars.
We will use uppercase roman letters to denote matrices: A, B, C, M , N , etc. We will
use the corresponding lowercase letters to denote the entries of a matrix. Thus, for instance,
aij is the (i, j)-entry of the matrix A. We will sometimes write A = [aij ] to emphasize this.
In some cases, we will separate the indices with a comma when there is some chance for
confusion, e.g. ai,i+1 versus aii+1 .
Recall that a matrix A has size m × n if it has m rows and n columns:
 
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}.
\]

We let Mm,n (F) denote the set of all m × n matrices with entries in F. We let GL(n, F)
denote the set of all invertible n × n matrices with entries in F. (Here ‘GL’ stands for general


linear group.) If a1 , . . . , an ∈ F, we define


 
\[
\operatorname{diag}(a_1, \ldots, a_n) =
\begin{bmatrix}
a_1 & 0 & \cdots & \cdots & 0 \\
0 & a_2 & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & \ddots & 0 \\
0 & \cdots & \cdots & 0 & a_n
\end{bmatrix}.
\]

We will use boldface lowercase letters a, b, x, y, etc. to denote vectors. (In class, we
will often write vectors as ~a, ~b, etc. since bold is hard to write on the blackboard.) Most of
the time, our vectors will be elements of Fn . (Although, in general, they can be elements of
any vector space.) For vectors in Fn , we denote their components with the corresponding
non-bold letter with subscripts. We will write vectors x ∈ Fn in column notation:
 
\[
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},
\qquad x_1, x_2, \ldots, x_n \in \mathbb{F}.
\]

Sometimes, to save space, we will also write this vector as

x = (x1 , x2 , . . . , xn ).

For 1 ≤ i ≤ n, we let ei denote the i-th standard basis vector of Fn . This is the vector

ei = (0, . . . , 0, 1, 0, . . . , 0),

where the 1 is in the i-th position. Then {e1 , e2 , · · · , en } is a basis for Fn . Indeed, every
x ∈ Fn can be written uniquely as the linear combination

x = x 1 e 1 + x2 e 2 + · · · + xn e n .

1.2 Matrix arithmetic


We now quickly review the basic matrix operations. Further detail on the material in this
section can be found in [Nic, §§2.1–2.3].

1.2.1 Matrix addition and scalar multiplication


We add matrices of the same size componentwise:

A + B = [aij + bij ].

If A and B are of different sizes, then the sum A + B is not defined. We define the negative
of a matrix A by
−A = [−aij ]

Then the difference of matrices of the same size is defined by

A − B = A + (−B) = [aij − bij ].

If k ∈ F is a scalar, then we define the scalar multiple

kA = [kaij ].

We denote the zero matrix by 0. This is the matrix with all entries equal to zero. Note that
there is some possibility for confusion here since we will use 0 to denote the real (or complex)
number zero, as well as the zero matrices of different sizes. The context should make it clear
which zero we mean. The context should also make clear what size of zero matrix we are
considering. For example, if A ∈ Mm,n (F) and we write A + 0, then 0 must denote the m × n
zero matrix here.
The following theorem summarizes the important properties of matrix addition and scalar
multiplication.
Proposition 1.2.1. Let A, B, and C be m × n matrices and let k, p ∈ F be scalars. Then
we have the following:
(a) A + B = B + A (commutativity)
(b) A + (B + C) = (A + B) + C (associativity)
(c) 0 + A = A (0 is an additive identity)
(d) A + (−A) = 0 (−A is the additive inverse of A)
(e) k(A + B) = kA + kB (scalar multiplication is distributive over matrix addition)
(f) (k + p)A = kA + pA (scalar multiplication is distributive over scalar addition)
(g) (kp)A = k(pA)
(h) 1A = A
Remark 1.2.2. Proposition 1.2.1 can be summarized as stating that the set Mm,n (F) is a
vector space over the field F under the operations of matrix addition and scalar multiplication.

1.2.2 Transpose
The transpose of an m × n matrix A, written AT , is the n × m matrix whose rows are the
columns of A in the same order. In other words, the (i, j)-entry of AT is the (j, i)-entry of
A. So,
if A = [aij ], then AT = [aji ].
We say the matrix A is symmetric if AT = A. Note that this implies that all symmetric
matrices are square, that is, they are of size n × n for some n.

Example 1.2.3. We have
\[
\begin{bmatrix} \pi & i & -1 \\ 5 & 7 & 3/2 \end{bmatrix}^{T}
= \begin{bmatrix} \pi & 5 \\ i & 7 \\ -1 & 3/2 \end{bmatrix}.
\]

The matrix
\[
\begin{bmatrix} 1 & -5 & 7 \\ -5 & 0 & 8 \\ 7 & 8 & 9 \end{bmatrix}
\]
is symmetric.

Proposition 1.2.4. Let A and B denote matrices of the same size, and let k ∈ F. Then we
have the following:

(a) (AT )T = A
(b) (kA)T = kAT
(c) (A + B)T = AT + B T

1.2.3 Matrix-vector multiplication


Suppose A is an m × n matrix with columns a1 , a2 , . . . , an ∈ Fm :
 
\[
A = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \end{bmatrix}.
\]

For x ∈ Fn , we define the matrix-vector product

Ax := x1 a1 + x2 a2 + · · · + xn an ∈ Fm .

Example 1.2.5. If
\[
A = \begin{bmatrix} 2 & -1 & 0 \\ 3 & 1/2 & \pi \\ -2 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix}
\quad \text{and} \quad
\mathbf{x} = \begin{bmatrix} -1 \\ 1 \\ 2 \end{bmatrix},
\]
then
\[
A\mathbf{x}
= -1 \begin{bmatrix} 2 \\ 3 \\ -2 \\ 0 \end{bmatrix}
+ 1 \begin{bmatrix} -1 \\ 1/2 \\ 1 \\ 0 \end{bmatrix}
+ 2 \begin{bmatrix} 0 \\ \pi \\ 1 \\ 0 \end{bmatrix}
= \begin{bmatrix} -3 \\ -5/2 + 2\pi \\ 5 \\ 0 \end{bmatrix}.
\]
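To see this definition in action numerically, here is a minimal Python/NumPy sketch (assuming NumPy is available; it is not part of the cited references) that recomputes the example above, once with the built-in matrix-vector product and once as the linear combination x1 a1 + x2 a2 + x3 a3 of the columns of A.

```python
import numpy as np

A = np.array([[ 2.0, -1.0, 0.0],
              [ 3.0,  0.5, np.pi],
              [-2.0,  1.0, 1.0],
              [ 0.0,  0.0, 0.0]])
x = np.array([-1.0, 1.0, 2.0])

Ax = A @ x                                              # built-in matrix-vector product
combo = sum(x[i] * A[:, i] for i in range(A.shape[1]))  # x1*a1 + x2*a2 + x3*a3

print(Ax)                      # [-3.  3.7831...  5.  0.]  (second entry is -5/2 + 2*pi)
print(np.allclose(Ax, combo))  # True
```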

1.2.4 Matrix multiplication


Suppose A ∈ Mm,n (F) and B ∈ Mn,k (F). Let bi denote the i-th column of B, so that
 
\[
B = \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_k \end{bmatrix}.
\]

We then define the matrix product AB to be the m × k matrix given by


 
\[
AB := \begin{bmatrix} A\mathbf{b}_1 & A\mathbf{b}_2 & \cdots & A\mathbf{b}_k \end{bmatrix}.
\]

Recall that the dot product of x, y ∈ Fn is defined to be

\[
\mathbf{x} \cdot \mathbf{y} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \tag{1.1}
\]

Then another way to compute the matrix product is as follows: the (i, j)-entry of AB is the dot product of the i-th row of A and the j-th column of B. In other words,
\[
C = AB \iff c_{ij} = \sum_{\ell=1}^{n} a_{i\ell} b_{\ell j}
\quad \text{for all } 1 \le i \le m,\ 1 \le j \le k.
\]

Example 1.2.6. If
\[
A = \begin{bmatrix} 2 & 0 & -1 & 1 \\ 0 & 3 & 2 & -1 \end{bmatrix}
\quad \text{and} \quad
B = \begin{bmatrix} 0 & 1 & -1 \\ 1 & 0 & 2 \\ 0 & 0 & -2 \\ 3 & 1 & 0 \end{bmatrix},
\]
then
\[
AB = \begin{bmatrix} 3 & 3 & 0 \\ 0 & -1 & 2 \end{bmatrix}.
\]
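The entrywise description of the product can also be checked directly; the following sketch (plain NumPy, not taken from the notes' references) recomputes Example 1.2.6 entry by entry.

```python
import numpy as np

A = np.array([[2.0, 0.0, -1.0,  1.0],
              [0.0, 3.0,  2.0, -1.0]])
B = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0,  2.0],
              [0.0, 0.0, -2.0],
              [3.0, 1.0,  0.0]])

C = A @ B
print(C)  # [[ 3.  3.  0.], [ 0. -1.  2.]]

# c_ij is the dot product of the i-th row of A with the j-th column of B.
C_entrywise = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
                        for i in range(A.shape[0])])
print(np.array_equal(C, C_entrywise))  # True
```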
Recall that the n × n identity matrix is the matrix
\[
I := \begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & 0 & \ddots & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & 0 \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix}.
\]
Even though there is an n × n identity matrix for each n, the size of I should be clear from
the context. For instance, if A ∈ Mm,n (F) and we write AI, then I is the n × n identity
matrix. If, on the other hand, we write IA, then I is the m × m identity matrix. In case we
need to specify the size to avoid confusion, we will write In for the n × n identity matrix.
Proposition 1.2.7 (Properties of matrix multiplication). Suppose A, B, and C are matrices
of sizes such that the indicated matrix products are defined. Furthermore, suppose a is a
scalar. Then:
(a) IA = A = AI (I is a multiplicative identity)
(b) A(BC) = (AB)C (associativity)
(c) A(B + C) = AB + AC (distributivity on the left)
(d) (B + C)A = BA + CA (distributivity on the right)
(e) a(AB) = (aA)B = A(aB)
(f) (AB)T = B T AT
Note that matrix multiplication is not commutative in general. First of all, it is possible
that the product AB is defined but BA is not. This is the case when A ∈ Mm,n (F) and B ∈ Mn,k (F) with m ≠ k. Now suppose A ∈ Mm,n (F) and B ∈ Mn,m (F). Then AB and BA are both defined, but they are different sizes when m ≠ n. However, even if m = n, so that
A and B are both square matrices, we can have AB ≠ BA. For example, if
\[
A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
\quad \text{and} \quad
B = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix},
\]
then
\[
AB = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}
\ne \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}
= BA.
\]
Of course, it is possible for AB = BA for some specific matrices (e.g. the zero or identity
matrices). In this case, we say that A and B commute. But since this does not hold in
general, we say that matrix multiplication is not commutative.

1.2.5 Block form


It is often convenient to group the entries of a matrix into submatrices, called blocks. For
instance, if
\[
A = \begin{bmatrix} 1 & 2 & -1 & 3 & 5 \\ 0 & 3 & 0 & 4 & 7 \\ 5 & -4 & 2 & 0 & 1 \end{bmatrix},
\]
then we could write
\[
A = \left[\begin{matrix} B \\ D \end{matrix}\;\; C\right],
\quad \text{where} \quad
B = \begin{bmatrix} 1 & 2 \end{bmatrix}, \quad
C = \begin{bmatrix} -1 & 3 & 5 \\ 0 & 4 & 7 \\ 2 & 0 & 1 \end{bmatrix}, \quad
D = \begin{bmatrix} 0 & 3 \\ 5 & -4 \end{bmatrix}.
\]

Similarly, if
\[
X = \begin{bmatrix} 2 & 1 \\ -1 & 0 \end{bmatrix}
\quad \text{and} \quad
A = \begin{bmatrix} 3 & -5 \\ -2 & 9 \end{bmatrix},
\]
then
\[
\begin{bmatrix} \mathbf{0} & \mathbf{e}_2 & \begin{matrix} X \\ A \end{matrix} \end{bmatrix}
= \left[\begin{array}{cc|cc}
0 & 0 & 2 & 1 \\
0 & 1 & -1 & 0 \\ \hline
0 & 0 & 3 & -5 \\
0 & 0 & -2 & 9
\end{array}\right],
\]
where we have used horizontal and vertical lines to indicate the blocks. Note that we could
infer from the sizes of X and A, that the 0 in the block matrix must be the zero vector in
F4 and that e2 must be the second standard basis vector in F4 .
Provided the sizes of the blocks match up, we can multiply matrices in block form using
the usual rules for matrix multiplication. For example,
    
\[
\begin{bmatrix} A & B \\ C & D \end{bmatrix}
\begin{bmatrix} X \\ Y \end{bmatrix}
= \begin{bmatrix} AX + BY \\ CX + DY \end{bmatrix}
\]

as long as the products AX, BY , CX, and DY are defined. That is, we need the number
of columns of A to equal the number of rows of X, etc. (See [Nic, Th. 2.3.4].) Note that,
since matrix multiplication is not commutative, the order of the multiplication of the blocks
is very important here.
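As a small numerical illustration (a sketch only, with randomly chosen blocks of compatible sizes), NumPy's np.block can be used to assemble block matrices and check the block multiplication rule above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, (2, 2)); B = rng.integers(-3, 4, (2, 3))
C = rng.integers(-3, 4, (1, 2)); D = rng.integers(-3, 4, (1, 3))
X = rng.integers(-3, 4, (2, 2)); Y = rng.integers(-3, 4, (3, 2))

M = np.block([[A, B], [C, D]])   # 3 x 5, assembled from blocks
N = np.block([[X], [Y]])         # 5 x 2

# Compare ordinary multiplication with the block formula [AX + BY; CX + DY].
blockwise = np.block([[A @ X + B @ Y], [C @ X + D @ Y]])
print(np.array_equal(M @ N, blockwise))  # True
```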
We can also compute transposes of matrices in block form. For example,
\[
\begin{bmatrix} A & B & C \end{bmatrix}^{T}
= \begin{bmatrix} A^{T} \\ B^{T} \\ C^{T} \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{T}
= \begin{bmatrix} A^{T} & C^{T} \\ B^{T} & D^{T} \end{bmatrix}.
\]

In certain circumstances, we can also compute determinants in block form. Precisely, if A and B are square matrices, then
\[
\det \begin{bmatrix} A & X \\ 0 & B \end{bmatrix} = (\det A)(\det B)
\quad \text{and} \quad
\det \begin{bmatrix} A & 0 \\ Y & B \end{bmatrix} = (\det A)(\det B). \tag{1.2}
\]

(See [Nic, Th. 3.1.5].)

Exercises.
Recommended exercises: Exercises in [Nic, §§2.1–2.3].

1.3 Matrices and linear transformations


We briefly recall the connection between matrices and linear transformations. For further
details, see [Nic, §2.6].
Recall that a map T : V → W , where V and W are F-vector spaces (e.g. V = Fn ,
W = Fm ), is called a linear transformation if

(T1) T (x + y) = T (x) + T (y) for all x, y ∈ V , and


(T2) T (ax) = aT (x) for all x ∈ V and a ∈ F.

We let L(V, W ) denote the set of all linear transformations from V to W .


Multiplication by A ∈ Mm,n (F) is a linear transformation

TA : Fn → Fm , x 7→ Ax. (1.3)

Conversely, every linear map Fn → Fm is given by multiplication by some matrix. Indeed,


suppose
T : Fn → Fm
is a linear transformation. Then define the matrix
 
\[
A = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix} \in M_{m,n}(\mathbb{F}).
\]

Proposition 1.3.1. With A and T defined as above, we have T = TA .

Proof. For x ∈ Fn , we have
\[
T\mathbf{x} = T(x_1 \mathbf{e}_1 + \cdots + x_n \mathbf{e}_n)
= x_1 T(\mathbf{e}_1) + \cdots + x_n T(\mathbf{e}_n) \quad (\text{since } T \text{ is linear})
= A\mathbf{x}.
\]

It follows from the above discussion that we have a one-to-one correspondence


Mm,n (F) → L(Fn , Fm ), A 7→ TA . (1.4)
Now suppose A ∈ Mm,n (F) and B ∈ Mn,k (F). Then, for x ∈ Fk , we have
(TA ◦ TB )(x) = TA (TB (x)) = TA (Bx) = A(Bx) = (AB)x = TAB x.
Hence
TA ◦ TB = TAB .
So, under the bijection (1.4), matrix multiplication corresponds to composition of linear
transformations. In fact, this is why we define matrix multiplication the way we do.
Recall the following definitions associated to a matrix A ∈ Mm,n (F).
• The column space of A, denoted col A, is the span of the columns of A. It is a subspace
of Fm and equal to the image of TA , denoted im TA .
• The row space of A, denoted row A, is the span of the rows of A. It is a subspace
of Fn .
• The rank of A is
rank A = dim(col A) = dim(im TA ) = dim(row A).
(The first equality is a definition, the second follows from the definition of TA , and the
third is a logical consequence that we will recall in Section 1.4.)
• The null space of A is
{x ∈ Fn : Ax = 0}.
• The nullity of A, denoted null A, is the dimension of the null space of A.
Recall also that the kernel of a linear transformation T : V → W is
ker T = {v ∈ V : T v = 0}
It follows that ker TA is equal to the null space of A, and so
null A = dim(ker TA ). (1.5)
The important Rank-Nullity Theorem (also known as the Dimension Theorem) states
that if T : V → W is a linear transformation, then
dim V = dim(ker T ) + dim(im T ). (1.6)
For a matrix A ∈ Mm,n (F), applying the Rank-Nullity Theorem to TA : Fn → Fm gives
n = null A + rank A. (1.7)
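A quick numerical sanity check of (1.7) is easy to carry out; the following sketch assumes NumPy and SciPy are available, and the sample matrix is arbitrary.

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0,  2.0, -1.0, 3.0, 5.0],
              [0.0,  3.0,  0.0, 4.0, 7.0],
              [5.0, -4.0,  2.0, 0.0, 1.0]])

n = A.shape[1]
rank = np.linalg.matrix_rank(A)
nullity = null_space(A).shape[1]   # columns of null_space(A) form a basis of the null space

print(rank, nullity, rank + nullity == n)  # 3 2 True
```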

Exercises.
Recommended exercises: Exercises in [Nic, §2.6].

1.4 Gaussian elimination


Recall that you learned in previous courses how to row reduce a matrix. This procedure is
called Gaussian elimination. We briefly review this technique here. For further details, see
[Nic, §§1.1, 1.2, 2.5].
To row reduce a matrix, you used the following operations:

Definition 1.4.1 (Elementary row operations). The following are called elementary row
operations on a matrix A with entries in F.

• Type I : Interchange two rows of A.


• Type II : Multiply any row of A by a nonzero element of F.
• Type III : Add a multiple of one row of A to another row of A.

Definition 1.4.2 (Elementary matrices). An n × n elementary matrix is a matrix obtained


by performing an elementary row operation on the identity matrix In . In particular, we
define the following elementary matrices:

• For 1 ≤ i, j ≤ n, i 6= j, we let Pi,j be the elementary matrix obtained from In by


interchanging the i-th and j-th rows.
• For 1 ≤ i ≤ n and a ∈ F× , we let Mi (a) be the elementary matrix obtained from In by
multiplying the i-th row by a.
• For 1 ≤ i, j ≤ n, i ≠ j, and a ∈ F, we let Ei,j (a) be the elementary matrix obtained
from In by adding a times row i to row j.

The type of the elementary matrix is the type of the corresponding row operation performed
on In .

Example 1.4.3. If n = 4, we have
\[
P_{1,3} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad
M_4(-2) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -2 \end{bmatrix}, \quad \text{and} \quad
E_{2,4}(3) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 3 & 0 & 1 \end{bmatrix}.
\]

Proposition 1.4.4. (a) Every elementary matrix is invertible and the inverse is an ele-
mentary matrix of the same type.
(b) Performing an elementary row operation on a matrix A is equivalent to multiplying A
on the left by the corresponding elementary matrix.

Proof. (a) We leave it as an exercise (see Exercise 1.4.1) to check that

\[
P_{i,j} P_{i,j} = I, \qquad M_i(a) M_i(a^{-1}) = I, \qquad E_{i,j}(a) E_{i,j}(-a) = I. \tag{1.8}
\]

(b) We give the proof for row operations of type III, and leave the proofs for types I and II as exercises. (See Exercise 1.4.1.) Fix 1 ≤ i, j ≤ n, with i ≠ j. Note that

• row k of Ei,j (a) is ek if k ≠ j, and

• row j of Ei,j (a) is ej + aei .

Thus, if k ≠ j, then row k of Ei,j (a)A is

ek A = row k of A,

and row j of Ei,j (a)A is

(ej + aei )A = ej A + aei A = (row j of A) + a(row i of A).

Therefore, Ei,j (a)A is the result of adding a times row i to row j.
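The following Python/NumPy sketch illustrates Proposition 1.4.4(b); the helper functions P, M and E_add are ad hoc names (using 0-based indices), not notation from the notes.

```python
import numpy as np

def P(i, j, n):
    """Elementary matrix interchanging rows i and j of I_n (type I)."""
    E = np.eye(n); E[[i, j]] = E[[j, i]]; return E

def M(i, a, n):
    """Elementary matrix multiplying row i of I_n by the nonzero scalar a (type II)."""
    E = np.eye(n); E[i, i] = a; return E

def E_add(i, j, a, n):
    """Elementary matrix adding a times row i to row j of I_n (type III)."""
    E = np.eye(n); E[j, i] = a; return E

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Left multiplication by an elementary matrix performs the row operation on A.
print(P(0, 2, 3) @ A)            # rows 1 and 3 of A interchanged
print(M(1, -2.0, 3) @ A)         # row 2 multiplied by -2
print(E_add(0, 2, -5.0, 3) @ A)  # 5 times row 1 subtracted from row 3
```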

Definition 1.4.5 (Row-echelon form). A matrix is in row-echelon form (and will be called
a row-echelon matrix ) if:

(a) all nonzero rows are above all zero rows,


(b) the first nonzero entry from the left in each nonzero row is a 1, called the leading 1 for
that row, and
(c) each leading 1 is strictly to the right of all leading 1s in rows above it.

A row-echelon matrix is in reduced row-echelon form (and will be called a reduced row-echelon
matrix ) if, in addition,

(d) each leading 1 is the only nonzero entry in its column.

Remark 1.4.6. Some references do not require the leading entry (i.e. the first nonzero entry
from the left in a nonzero row) to be 1 in row-echelon form.

Proposition 1.4.7. Every matrix A ∈ Mm,n (F) can be transformed to a row-echelon form
matrix R by performing elementary row operations. Equivalently, there exist finitely many
elementary matrices E1 , E2 , . . . , Ek such that R = E1 E2 · · · Ek A.

Proof. You saw this in previous courses, and so we will omit the proof here. In fact, there
is a precise algorithm, called the gaussian algorithm, for bringing a matrix to row-echelon
form. See [Nic, Th. 1.2.1] for details.

Proposition 1.4.8. A square matrix is invertible if and only if it is a product of elementary


matrices.

Proof. Since elementary matrices are invertible by Proposition 1.4.4(a), and a product of invertible matrices is invertible, any product of elementary matrices is invertible. Conversely, suppose A is invertible. Then it can
be row-reduced to the identity matrix I. Hence, by Proposition 1.4.7, there are elementary
matrices E1 , E2 , . . . , Ek such that I = E1 E2 · · · Ek A. Then

\[
A = E_k^{-1} \cdots E_2^{-1} E_1^{-1}.
\]

Since inverses of elementary matrices are elementary matrices by Proposition 1.4.4(a), we


are done.

Reducing a matrix all the way to reduced row-echelon form is sometimes called Gauss–
Jordan elimination.
Recall that the rank of a matrix A, denoted rank A, is the dimension of the column space
of A. Equivalently, rank A is the number of nonzero rows (which is equal to the number of
leading 1s) in any row-echelon matrix U that is row equivalent to A (i.e. that can be obtained
from A by row operations). Thus we see that rank A is also the dimension of the row space
of A, as noted earlier.
Recall that every linear system consisting of m linear equations in n variables can be
written in matrix form
Ax = b,
where A is an m × n matrix, called the coefficient matrix ,
 
\[
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
\]

is the vector of variables (or unknowns), and b is the vector of constant terms. We say that
• the linear system is overdetermined if there are more equations than unknowns (i.e. if
m > n),
• the linear system is underdetermined if there are more unknowns than equations (i.e.
if m < n),
• the linear system is square if there are the same number of unknowns as equations (i.e.
if m = n),
• an m × n matrix is tall if m > n, and
• an m × n matrix is wide if m < n.
It follows immediately that the linear system Ax = b is
• overdetermined if and only if A is tall,
• underdetermined if and only if A is wide, and
• square if and only if A is square.
Example 1.4.9. As a refresher, let’s solve the following underdetermined system of linear
equations:
−4x3 + x4 + 2x5 = 11
4x1 − 2x2 + 8x3 + x4 − 5x5 = 5
2x1 − x2 + 2x3 + x4 − 3x5 = 2
We write down the augmented matrix and row reduce:
\[
\left[\begin{array}{ccccc|c}
0 & 0 & -4 & 1 & 2 & 11 \\
4 & -2 & 8 & 1 & -5 & 5 \\
2 & -1 & 2 & 1 & -3 & 2
\end{array}\right]
\xrightarrow{R_1 \leftrightarrow R_3}
\left[\begin{array}{ccccc|c}
2 & -1 & 2 & 1 & -3 & 2 \\
4 & -2 & 8 & 1 & -5 & 5 \\
0 & 0 & -4 & 1 & 2 & 11
\end{array}\right]
\xrightarrow{R_2 - 2R_1}
\left[\begin{array}{ccccc|c}
2 & -1 & 2 & 1 & -3 & 2 \\
0 & 0 & 4 & -1 & 1 & 1 \\
0 & 0 & -4 & 1 & 2 & 11
\end{array}\right]
\xrightarrow{R_3 + R_2}
\left[\begin{array}{ccccc|c}
2 & -1 & 2 & 1 & -3 & 2 \\
0 & 0 & 4 & -1 & 1 & 1 \\
0 & 0 & 0 & 0 & 3 & 12
\end{array}\right]
\]
One can now easily solve the linear system using a technique called back substitution, which we will discuss in Section 1.6.1. However, to further illustrate the process of row reduction, let's continue with gaussian elimination:
\[
\xrightarrow{\frac{1}{3}R_3}
\left[\begin{array}{ccccc|c}
2 & -1 & 2 & 1 & -3 & 2 \\
0 & 0 & 4 & -1 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 4
\end{array}\right]
\xrightarrow[R_2 - R_3]{R_1 + 3R_3}
\left[\begin{array}{ccccc|c}
2 & -1 & 2 & 1 & 0 & 14 \\
0 & 0 & 4 & -1 & 0 & -3 \\
0 & 0 & 0 & 0 & 1 & 4
\end{array}\right]
\xrightarrow{\frac{1}{4}R_2}
\left[\begin{array}{ccccc|c}
2 & -1 & 2 & 1 & 0 & 14 \\
0 & 0 & 1 & -1/4 & 0 & -3/4 \\
0 & 0 & 0 & 0 & 1 & 4
\end{array}\right]
\xrightarrow{R_1 - 2R_2}
\left[\begin{array}{ccccc|c}
2 & -1 & 0 & 3/2 & 0 & 31/2 \\
0 & 0 & 1 & -1/4 & 0 & -3/4 \\
0 & 0 & 0 & 0 & 1 & 4
\end{array}\right]
\xrightarrow{\frac{1}{2}R_1}
\left[\begin{array}{ccccc|c}
1 & -1/2 & 0 & 3/4 & 0 & 31/4 \\
0 & 0 & 1 & -1/4 & 0 & -3/4 \\
0 & 0 & 0 & 0 & 1 & 4
\end{array}\right].
\]

The matrix is now in reduced row-echelon form. This reduced matrix corresponds to the equivalent linear system:
\[
\begin{aligned}
x_1 - \tfrac{1}{2}x_2 + \tfrac{3}{4}x_4 &= \tfrac{31}{4} \\
x_3 - \tfrac{1}{4}x_4 &= -\tfrac{3}{4} \\
x_5 &= 4
\end{aligned}
\]

The leading variables are the variables corresponding to leading 1s in the reduced row-echelon
matrix: x1 , x3 , and x5 . The non-leading variables, or free variables, are x2 and x4 . We let
the free variables be parameters:

x2 = s, x4 = t, s, t ∈ F.

Then we solve for the leading variables in terms of these parameters, giving the general solution in parametric form:
\[
\begin{aligned}
x_1 &= \tfrac{31}{4} + \tfrac{1}{2}s - \tfrac{3}{4}t \\
x_2 &= s \\
x_3 &= -\tfrac{3}{4} + \tfrac{1}{4}t \\
x_4 &= t \\
x_5 &= 4
\end{aligned}
\]
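The reduced row-echelon form computed above can be double-checked symbolically, for example with SymPy (a sketch, assuming SymPy is installed).

```python
import sympy as sp

# Augmented matrix of the system in Example 1.4.9.
M = sp.Matrix([[0,  0, -4, 1,  2, 11],
               [4, -2,  8, 1, -5,  5],
               [2, -1,  2, 1, -3,  2]])

R, pivots = M.rref()
print(R)       # rows (1, -1/2, 0, 3/4, 0, 31/4), (0, 0, 1, -1/4, 0, -3/4), (0, 0, 0, 0, 1, 4)
print(pivots)  # (0, 2, 4): the leading variables are x1, x3, x5
```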

Exercises.
1.4.1. Complete the proof of Proposition 1.4.4 by verifying the equalities in (1.8) and veri-
fying part (b) for the case of elementary matrices of types I and II.

Additional recommended exercises: [Nic, §§1.1, 1.2, 2.5].

1.5 Matrix inverses


In earlier courses you learned about invertible matrices and how to find their inverses. We
now revisit the topic of matrix inverses in more detail. In particular, we will discuss the more
general notions of one-sided inverses. In this section we follow the presentation in [BV18,
Ch. 11].

1.5.1 Left inverses


Definition 1.5.1 (Left inverse). A matrix X satisfying

XA = I

is called a left inverse of A. If such a left inverse exists, we say that A is left-invertible. Note
that if A has size m × n, then any left inverse X will have size n × m.

Examples 1.5.2. (a) If A ∈ M1,1 (F), then we can think of A simply as a scalar. In this
case a left inverse is equivalent to the inverse of the scalar. Thus, A is left-invertible if
and only if it is nonzero, and in this case it has only one left inverse.
(b) Any nonzero vector a ∈ Fn is left-invertible. Indeed, if ai ≠ 0 for some 1 ≤ i ≤ n, then
\[
\frac{1}{a_i}\, \mathbf{e}_i^T \mathbf{a} = \begin{bmatrix} 1 \end{bmatrix}.
\]
Hence (1/ai) eiT is a left inverse of a. For example, if
\[
\mathbf{a} = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \end{bmatrix},
\]
then
\[
\begin{bmatrix} 1/2 & 0 & 0 & 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 & 0 & -1 & 0 \end{bmatrix}, \quad \text{and} \quad
\begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}
\]
are all left inverses of a. In fact, a has infinitely many left inverses. See Exercise 1.5.1.

(c) The matrix
\[
A = \begin{bmatrix} 4 & 3 \\ -6 & -4 \\ -1 & -1 \end{bmatrix}
\]
has left inverses
\[
B = \frac{1}{9}\begin{bmatrix} -7 & -8 & 11 \\ 11 & 10 & -16 \end{bmatrix}
\quad \text{and} \quad
C = \frac{1}{2}\begin{bmatrix} 0 & -1 & 4 \\ 0 & 1 & -6 \end{bmatrix}.
\]
Indeed, one can check directly that BA = CA = I.

Proposition 1.5.3. If A has a left inverse, then the columns of A are linearly independent.

Proof. Suppose A ∈ Mm,n (F) has a left inverse B, and let a1 , . . . , an be the columns of A.
Suppose that
x1 a1 + · · · + xn an = 0
for some x1 , . . . , xn ∈ F. Thus, taking x = (x1 , . . . , xn ), we have

Ax = x1 a1 + · · · + xn an = 0.

Thus
x = Ix = BAx = B0 = 0.
This implies that x1 , . . . , xn = 0. Hence the columns of A are linearly independent.

We will prove the converse of Proposition 1.5.3 in Proposition 1.5.15.

Corollary 1.5.4. If A has a left inverse, then A is square or tall.

Proof. Suppose A is a wide matrix, i.e. A is m × n with m < n. Then it has n columns,
each of which is a vector in Fm . Since m < n, these columns cannot be linearly independent.
Hence A cannot have a left inverse.

So we see that only square or tall matrices can be left-invertible. Of course, not every
square or tall matrix is left-invertible (e.g. consider the zero matrix).
Now suppose we want to solve a system of linear equations

Ax = b

in the case where A has a left inverse C. If this system has a solution x, then

Cb = CAx = Ix = x.

So x = Cb is the solution of Ax = b. However, we started with the assumption that the


system has a solution. If, on the other hand, it has no solution, then x = Cb cannot satisfy
Ax = b.
This gives us a method to check if the linear system Ax = b has a solution, and to find
the solution if it exists, provided A has a left inverse C. We simply compute ACb.

• If ACb = b, then x = Cb is the unique solution of the linear system Ax = b.


• If ACb 6= b, then the linear system Ax = b has no solution.

Keep in mind that this method only works when A has a left inverse. In particular, A must
be square or tall. So this method only has a chance of working for square or overdetermined
systems.

Example 1.5.5. Consider the matrices of Example 1.5.2(c): the matrix
\[
A = \begin{bmatrix} 4 & 3 \\ -6 & -4 \\ -1 & -1 \end{bmatrix}
\]
has left inverses
\[
B = \frac{1}{9}\begin{bmatrix} -7 & -8 & 11 \\ 11 & 10 & -16 \end{bmatrix}
\quad \text{and} \quad
C = \frac{1}{2}\begin{bmatrix} 0 & -1 & 4 \\ 0 & 1 & -6 \end{bmatrix}.
\]
Suppose we want to solve the overdetermined linear system
\[
A\mathbf{x} = \begin{bmatrix} 1 \\ -2 \\ 0 \end{bmatrix}. \tag{1.9}
\]
We can use either left inverse and compute
\[
B \begin{bmatrix} 1 \\ -2 \\ 0 \end{bmatrix}
= \begin{bmatrix} 1 \\ -1 \end{bmatrix}
= C \begin{bmatrix} 1 \\ -2 \\ 0 \end{bmatrix}.
\]
Then we check to see if (1, −1) is a solution:
\[
A \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \\ -2 \\ 0 \end{bmatrix}.
\]
So (1, −1) is the unique solution to the linear system (1.9).

Example 1.5.6. Using the same matrix A from Example 1.5.5, consider the overdetermined linear system
\[
A\mathbf{x} = \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}.
\]
We compute
\[
B \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1/9 \\ 1/9 \end{bmatrix}.
\]
Thus, if the system has a solution, it must be (1/9, 1/9). However, we check that
\[
A \begin{bmatrix} 1/9 \\ 1/9 \end{bmatrix}
= \begin{bmatrix} 7/9 \\ -10/9 \\ -2/9 \end{bmatrix}
\ne \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}.
\]
Thus, the system has no solution. Of course, we could also compute using the left inverse C as above, or see that
\[
B \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}
= \begin{bmatrix} 1/9 \\ 1/9 \end{bmatrix}
\ne \begin{bmatrix} 1/2 \\ -1/2 \end{bmatrix}
= C \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}.
\]
If the system had a solution, both (1/9, 1/9) and (1/2, −1/2) would be the unique solution, which is clearly not possible.
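The method of the last two examples is easy to automate. The sketch below uses plain NumPy; the function name is ad hoc. It returns the unique solution when ACb = b and reports failure otherwise.

```python
import numpy as np

A = np.array([[ 4.0,  3.0],
              [-6.0, -4.0],
              [-1.0, -1.0]])
C = np.array([[0.0, -1.0,  4.0],
              [0.0,  1.0, -6.0]]) / 2   # a left inverse of A (Example 1.5.2(c))

def solve_with_left_inverse(A, C, b):
    """Return the unique solution of Ax = b if one exists, otherwise None."""
    x = C @ b
    return x if np.allclose(A @ x, b) else None

print(solve_with_left_inverse(A, C, np.array([1.0, -2.0, 0.0])))  # [ 1. -1.]  (Example 1.5.5)
print(solve_with_left_inverse(A, C, np.array([1.0, -1.0, 0.0])))  # None      (Example 1.5.6)
```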

1.5.2 Right inverses


Definition 1.5.7 (Right inverse). A matrix X satisfying

AX = I

is called a right inverse of A. If such a right inverse exists, we say that A is right-invertible.
Note that if A has size m × n, then any right inverse X will have size n × m.
Suppose A has a right inverse B. Then

B T AT = (AB)T = I,

and so B T is a left inverse of AT . Similarly, if A has a left inverse C, then

AT C T = (CA)T = I,

and so C T is a right inverse of AT . This allows us to translate our results about left inverses
to results about right inverses.
Proposition 1.5.8. (a) The matrix A is left-invertible if and only if AT is right invertible.
Furthermore, if C is a left inverse of A, then C T is a right inverse of AT
(b) Similarly, A is right-invertible if and only if AT is left-invertible. Furthermore, if B is
a right inverse of A, then B T is a left inverse of AT .
(c) If a matrix is right-invertible then its rows are linearly independent.
(d) If A has a right inverse, then A is square or wide.
Proof. (a) We proved this above.
(b) We proved this above. Alternatively, it follows from part (a) and the fact that
T T
(A ) = A.
(c) This follows from Proposition 1.5.3 and part (b) since the rows of A are the columns
of AT .

(d) This follows from Corollary 1.5.4 and part (b) since A is square or wide if and only
if AT is square or tall.
We can also transpose Examples 1.5.2.

Examples 1.5.9. (a) If A ∈ M1,1 (F), then a right inverse is equivalent to the inverse of the
scalar. Thus, A is right-invertible if and only if it is nonzero, and in this case it has
only one right inverse.
 
(b) Any nonzero row matrix a = [a1 · · · an] ∈ M1,n (F) is right-invertible. Indeed, if ai ≠ 0 for some 1 ≤ i ≤ n, then
\[
\frac{1}{a_i}\, \mathbf{a}\, \mathbf{e}_i = \begin{bmatrix} 1 \end{bmatrix}.
\]
Hence (1/ai) ei is a right inverse of a.

(c) The matrix
\[
A = \begin{bmatrix} 4 & -6 & -1 \\ 3 & -4 & -1 \end{bmatrix}
\]
has right inverses
\[
B = \frac{1}{9}\begin{bmatrix} -7 & 11 \\ -8 & 10 \\ 11 & -16 \end{bmatrix}
\quad \text{and} \quad
C = \frac{1}{2}\begin{bmatrix} 0 & 0 \\ -1 & 1 \\ 4 & -6 \end{bmatrix}.
\]
Now suppose we want to solve a linear system
Ax = b
in the case where A has a right inverse B. Note that
ABb = Ib = b.
Thus x = Bb is a solution to this system. Hence, the system has a solution for any b. Of
course, there can be other solutions; the solution x = Bb is just one of them.
This gives us a method to solve any linear system Ax = b in the case that A has a right
inverse. Of course, this implies that A is square or wide. So this method only has a chance
of working for square or underdetermined systems.

Example 1.5.10. Using the matrix A from Example 1.5.9(c) with right inverses B and C, the linear system
\[
A\mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\]
has solutions
\[
B \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4/9 \\ 2/9 \\ -5/9 \end{bmatrix}
\quad \text{and} \quad
C \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ -1 \end{bmatrix}.
\]
(Of course, there are more. As you learned in previous courses, any linear system with more than one solution has infinitely many solutions.) Indeed, we can find a solution of Ax = b for any b.
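Numerically, a right inverse produces a solution for every right-hand side; here is a short NumPy check of Example 1.5.10 (a sketch, not taken from the cited references).

```python
import numpy as np

A = np.array([[4.0, -6.0, -1.0],
              [3.0, -4.0, -1.0]])
B = np.array([[ -7.0,  11.0],
              [ -8.0,  10.0],
              [ 11.0, -16.0]]) / 9   # a right inverse of A (Example 1.5.9(c))

b = np.array([1.0, 1.0])
x = B @ b                      # one solution of the underdetermined system Ax = b

print(x)                       # [ 0.4444  0.2222 -0.5556] = (4/9, 2/9, -5/9)
print(np.allclose(A @ x, b))   # True
```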

1.5.3 Two-sided inverses


Definition 1.5.11 (Two-sided inverse). A matrix X satisfying

AX = I = XA

is called a two-sided inverse, or simply an inverse, of A. If such an inverse exists, we say


that A is invertible or nonsingular . A square matrix that is not invertible is called singular .
Proposition 1.5.12. A matrix A is invertible if and only if it is both left-invertible and
right-invertible. In this case, any left inverse of A is equal to any right inverse of A. Hence
the inverse of A is unique.
Proof. Suppose AX = I and Y A = I. Then

X = IX = Y AX = Y I = Y.

If A is invertible, we denote its inverse by A−1 .


Corollary 1.5.13. Invertible matrices are square.
Proof. If A is invertible, it is both left-invertible and right-invertible by Proposition 1.5.12.
Then A must be square by Corollary 1.5.4 and Proposition 1.5.8(d).
You learned about inverses in previous courses. In particular, you learned:
• If A is an invertible matrix, then the linear system Ax = b has the unique solution
x = A−1 b.
• If A is invertible, then the inverse can be computed by row-reducing an augmented
matrix:
\[
\begin{bmatrix} A \mid I \end{bmatrix} \longrightarrow \begin{bmatrix} I \mid A^{-1} \end{bmatrix}.
\]
Proposition 1.5.14. If A is a square matrix, then the following are equivalent:
(a) A is invertible.
(b) The columns of A are linearly independent.
(c) The rows of A are linearly independent.
(d) A is left-invertible.
(e) A is right-invertible.
Proof. Let A ∈ Mn,n (F). First suppose that A is left-invertible. Then, by Proposition 1.5.3,
the columns of A are linearly independent. Since there are n columns, this implies that they
form a basis of Fn . Thus any vector in Fn is a linear combination of the columns of A. In
particular, each of the standard basis vectors ei , 1 ≤ i ≤ n, can be expressed as ei = Abi for some bi ∈ Fn . Then the matrix B = [b1 b2 · · · bn] satisfies
\[
AB = \begin{bmatrix} A\mathbf{b}_1 & A\mathbf{b}_2 & \cdots & A\mathbf{b}_n \end{bmatrix}
= \begin{bmatrix} \mathbf{e}_1 & \mathbf{e}_2 & \cdots & \mathbf{e}_n \end{bmatrix} = I.
\]

Therefore B is a right inverse of A. So we have shown that



(d) =⇒ (b) =⇒ (e).


Applying this result to the transpose of A, we see that
(e) =⇒ (c) =⇒ (d).
Therefore (b) to (e) are all equivalent. Since, by Proposition 1.5.12, (a) is equivalent to ((d)
and (e)), the proof is complete.

1.5.4 The pseudoinverse


We learned in previous courses how to compute two-sided inverses of invertible matrices.
But how do we compute left and right inverses in general? We will assume that F = R here,
but similar methods work over the complex numbers. We just have to replace the transpose
by the conjugate transpose (see Definition 3.3.1).
If A ∈ Mm,n (R), then the square matrix

AT A ∈ Mn,n (R)

is called the Gram matrix of A.


Proposition 1.5.15. (a) A matrix has linearly independent columns if and only if its
Gram matrix AT A is invertible.
(b) A matrix is left-invertible if and only if its columns are linearly independent. Further-
more, if A is left-invertible, then (AT A)−1 AT is a left inverse of A.
Proof. (a) First suppose the columns of A are linearly independent. Assume that

AT Ax = 0

for some x ∈ Rn . Multiplying on the left by xT gives

0 = xT 0 = xT AT Ax = (Ax)T (Ax) = (Ax) · (Ax),

which implies that Ax = 0. (Recall that, for any vector v ∈ Rn , we have v · v = 0 if and
only if v = 0.) Because the columns of A are linearly independent, this implies that x = 0.
Since the only solution to AT Ax = 0 is x = 0, we conclude that AT A is invertible.
Now suppose the columns of A are linearly dependent. Thus, there exists a nonzero
x ∈ Rn such that Ax = 0. Multiplying on the left by AT gives

AT Ax = 0, x 6= 0.

Thus the Gram matrix AT A is singular.


(b) We already saw in Proposition 1.5.3 that if A is left-invertible, then its columns
are linearly independent. To prove the converse, suppose the columns of A are linearly
independent. Then, by (a), the matrix AT A is invertible. Then we compute

(AT A)−1 AT A = (AT A)−1 (AT A) = I.




Hence (AT A)−1 AT is a left inverse of A.



If the columns of A are linearly independent (in particular, A is square or tall), then the
particular left inverse (AT A)−1 AT described in Proposition 1.5.15 is called the pseudoinverse
of A, the generalized inverse of A, or the Moore–Penrose inverse of A, and is denoted A+ .
Recall that left inverses are not unique in general. So this is just one left inverse. However,
when A is square, we have

A+ = (AT A)−1 AT = A−1 (AT )−1 AT = A−1 I = A−1 ,

and so the pseudoinverse reduces to the ordinary inverse (which is unique). Note that this
equation does not make sense when A is not square or, more generally, when A is not
invertible.
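In practice the pseudoinverse is rarely formed via the Gram matrix (better methods appear in Sections 3.6 and 4.3), but for a matrix with linearly independent columns the formula can be checked directly; NumPy's np.linalg.pinv computes the same matrix (internally via the SVD). A minimal sketch:

```python
import numpy as np

A = np.array([[ 4.0,  3.0],
              [-6.0, -4.0],
              [-1.0, -1.0]])            # tall, with linearly independent columns

A_plus = np.linalg.inv(A.T @ A) @ A.T   # the pseudoinverse (A^T A)^{-1} A^T

print(np.allclose(A_plus, np.linalg.pinv(A)))  # True
print(np.allclose(A_plus @ A, np.eye(2)))      # True: A+ is a left inverse of A
```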
We also have a right analogue of Proposition 1.5.15.

Proposition 1.5.16. (a) The rows of A are linearly independent if and only if the matrix
AAT is invertible.
(b) A matrix is right-invertible if and only if its rows are linearly independent. Further-
more, if A is right-invertible, then AT (AAT )−1 is a right inverse of A.

Proof. Essentially, we take the transpose of the statements in Proposition 1.5.15.

(a) We have

rows of A are lin. ind. ⇐⇒ cols of AT are lin. ind.


⇐⇒ (AT )T AT = AAT is invertible (by Proposition 1.5.15(a)).

(b) We have

A is right-invertible ⇐⇒ AT is left-invertible
⇐⇒ columns of AT are lin. ind. (by Proposition 1.5.15(b))
⇐⇒ rows of A are lin. ind.

Furthermore, if A is right-invertible, then, by part (a), AAT is invertible. Then we have

AAT (AAT )−1 = I,

and so AT (AAT )−1 is a right inverse of A.

Propositions 1.5.15 and 1.5.16 give us a method to compute right and left inverses, if they
exist. Precisely, these results reduce the problem to the computation of two-sided inverses,
which you have done in previous courses. Later in the course we will develop other, more
efficient, methods for computing left- and right-inverses. (See Sections 3.6 and 4.3.)

Exercises.
1.5.1. Suppose A is a matrix with left inverses B and C. Show that, for any scalars α and
β satisfying α + β = 1, the matrix αB + βC is also a left inverse of A. It follows that if a
matrix has two different left inverses, then it has an infinite number of left inverses.

1.5.2. Let A be an m × n matrix. Suppose there exists a nonzero n × k matrix B such


that AB = 0. Show that A has no left inverse. Formulate an analogous statement for right
inverses.

1.5.3. Let A ∈ Mm,n (R) and let TA : Rn → Rm be the corresponding linear map (see (1.3)).
(a) Prove that A has a left inverse if and only if TA is injective.
(b) Prove that A has a right inverse if and only if TA is surjective.
(c) Prove that A has a two-sided inverse if and only if TA is an isomorphism.
1.5.4. Consider the matrix
\[
A = \begin{bmatrix} 1 & 1 & 1 \\ -2 & 1 & 4 \end{bmatrix}.
\]
(a) Is A left-invertible? If so, find a left inverse.
(b) Compute AAT .
(c) Is A right-invertible? If so, find a right inverse.
Additional recommended exercises from [BV18]: 11.2, 11.3, 11.5, 11.6, 11.7, 11.12, 11.13,
11.17, 11.18, 11.22.

1.6 LU factorization
In this section we discuss a certain factorization of matrices that is very useful in solving
linear systems. We follow here the presentation in [Nic, §2.7].

1.6.1 Triangular matrices


If A = [aij ] is an m × n matrix, the elements a11 , a22 , . . . form the main diagonal of A, even
if A is not square. We say A is upper triangular if every entry below and to the left of the
main diagonal is zero. For example, the matrices
 
\[
\begin{bmatrix} 5 & 2 & -5 \\ 0 & 0 & 3 \\ 0 & 0 & 8 \\ 0 & 0 & 0 \end{bmatrix}, \quad
\begin{bmatrix} 2 & 5 & -1 & 0 & 1 \\ 0 & -1 & 4 & 7 & -1 \\ 0 & 0 & 3 & 7 & 0 \end{bmatrix},
\quad \text{and} \quad
\begin{bmatrix} 0 & 1 & 2 & 3 & 0 \\ 0 & 0 & 0 & 1 & -3 \\ 0 & 0 & 2 & 7 & -4 \end{bmatrix}
\]
are all upper triangular. In addition,

every row-echelon matrix is upper triangular.


Similarly, A is lower triangular if every entry above and to the right of the main diagonal
is zero. Equivalently, A is lower triangular if and only if AT is upper triangular. We say a
matrix is triangular if it is upper or lower triangular.
If the coefficient matrix of a linear system is upper triangular, there is a particularly
efficient way to solve the system, known as back substitution, where later variables are sub-
stituted into earlier equations.

Example 1.6.1. Let’s solve the following system:


2x1 − x2 + 2x3 + x4 − 3x5 = 2
4x3 − x4 + x5 = 1
3x5 = 12
Note that the coefficient matrix is
 
\[
\begin{bmatrix} 2 & -1 & 2 & 1 & -3 \\ 0 & 0 & 4 & -1 & 1 \\ 0 & 0 & 0 & 0 & 3 \end{bmatrix},
\]
which is upper triangular. We obtained this matrix part-way through the row reduction in
Example 1.4.9. As in gaussian elimination, we let the free variables (i.e. the non-leading
variables) be parameters:
x2 = s, x4 = t.
Then we solve for x5 , x3 , and x1 , in that order. The last equation gives
\[
x_5 = \frac{12}{3} = 4.
\]
Substitution into the second-to-last equation then gives
\[
x_3 = \frac{1}{4}(1 + x_4 - x_5) = -\frac{3}{4} + \frac{1}{4}t.
\]
Then, substitution into the first equation gives
\[
x_1 = \frac{1}{2}(2 + x_2 - 2x_3 - x_4 + 3x_5)
= \frac{1}{2}\!\left(2 + s + \frac{3}{2} - \frac{1}{2}t - t + 12\right)
= \frac{31}{4} + \frac{1}{2}s - \frac{3}{4}t.
\]
Note that this is the same solution we obtained in Example 1.4.9. But the method of back
substitution involved less work!

Similarly, if the coefficient matrix of a system is lower triangular, it can be solved by


forward substitution, where earlier variables are substituted into later equations. Back sub-
stitution is more numerically efficient than gaussian elimination. (With n equations, where
n is large, gaussian elimination requires approximately n³/2 multiplications and divisions, whereas back substitution requires about n³/3.) Thus, if we want to solve a large number
of linear systems with the same coefficient matrix, it would be useful to write the coefficient
matrix in terms of lower and/or upper triangular matrices.
Suppose we have a linear system Ax = b, where we can factor A as A = LU for some
lower triangular matrix L and upper triangular matrix U . Then we can efficiently solve the
system Ax = b as follows:

(a) First solve Ly = b for y using forward substitution.


(b) Then solve U x = y for x using back substitution.
Then we have
Ax = LU x = Ly = b,
and so we have solved the system. Furthermore, every solution can be found using this
procedure: if x is a solution, take y = U x. This is an efficient way to solve the system that
can be implemented on a computer.
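For example, SciPy exposes exactly this workflow (a sketch; note that SciPy's lu returns A = P L U with a unit lower triangular L, a slightly different normalization from the one used later in these notes).

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[ 1.0, -1.0,  2.0],
              [-1.0,  3.0,  4.0],
              [ 1.0, -4.0, -3.0]])   # the invertible matrix of Example 1.6.8 below
b = np.array([1.0, 2.0, 3.0])

P, L, U = lu(A)                      # A = P L U

y = solve_triangular(L, P.T @ b, lower=True)   # forward substitution: L y = P^T b
x = solve_triangular(U, y, lower=False)        # back substitution:    U x = y

print(np.allclose(A @ x, b))  # True
```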
The following lemma will be useful in our exploration of this topic.
Lemma 1.6.2. Suppose A and B are matrices.
(a) If A and B are both lower (upper) triangular, then so is AB (assuming it is defined).
(b) If A is a square lower (upper) triangular matrix, then A is invertible if and only if every
main diagonal entry is nonzero. In this case A−1 is also lower (upper) triangular.
Proof. The proof of this lemma is left as Exercise 1.6.1

1.6.2 LU factorization
Suppose A is an m × n matrix. Then we can use row reduction to transform A to a row-
echelon matrix U , which is therefore upper triangular. As discussed in Section 1.4, this
reduction can be performed by multiplying on the left by elementary matrices:

A → E1 A → E2 E1 A → · · · → Ek Ek−1 · · · E2 E1 A = U.

It follows that
\[
A = LU, \quad \text{where } L = (E_k E_{k-1} \cdots E_2 E_1)^{-1} = E_1^{-1} E_2^{-1} \cdots E_{k-1}^{-1} E_k^{-1}.
\]

As long as we do not require that U be reduced then, except for row interchanges, none of
the above row operations involve adding a row to a row above it. Therefore, if we can avoid
row interchanges, all the Ei are lower triangular. In this case, L is lower triangular (and
invertible) by Lemma 1.6.2. Thus, we have the following result. We say that a matrix can
be lower reduced if it can be reduced to row-echelon form without using row interchanges.
Proposition 1.6.3. If A can be lower reduced to a row-echelon (hence upper triangular)
matrix U , then we have
A = LU
for some lower triangular, invertible matrix L.
Definition 1.6.4 (LU factorization). A factorization A = LU as in Proposition 1.6.3 is
called an LU factorization or LU decomposition of A.
It is possible that no LU factorization exists, when A cannot be reduced to row-echelon
form without using row interchanges. We will discuss in Section 1.6.3 how to handle this
situation. However, if an LU factorization exists, then the gaussian algorithm gives U and a
procedure for finding L.

Example 1.6.5. Let’s find an LU factorization of


 
0 1 2 −1 3
A = 0 −2 −4 4 −2 .
0 −1 −2 4 3

We first lower reduce A to row-echelon form:


    1  
0 1 2 −1 3 R2 +2R1 0 1 2 −1 3 R
2 2 0 1 2 −1 3
R +R1 R3 −3R2
A =  0 −2 −4 4 −2  −−3−−→  0 0 0 2 4 − −−−→ 0 0 0 1 2 = U.
0 −1 −2 4 3 0 0 0 3 6 0 0 0 0 0

We have highlighted the leading column at each step. In each leading column, we divide the
top row by the top (pivot) entry to create a 1 in the pivot position. Then we use the leading
one to create zeros below that entry. Then we have
 
1 0 0
A = LU, where L = −2 2 0 .
−1 3 1

The matrix L is obtained from the identity matrix I3 by replacing the bottom of the first two
columns with the highlighted columns above. Note that rank A = 2, which is the number of
highlighted columns.

The method of Example 1.6.5 works in general, provided A can be lower reduced. Note
that we did not need to calculate the elementary matrices used in our row operations.

Algorithm 1.6.6 (LU algorithm). Suppose A is an m × n matrix of rank r, and that A can
be lower reduced to a row-echelon matrix U . Then A = LU where L is a lower triangular,
invertible matrix constructed as follows:

(a) If A = 0, take L = Im and U = 0.


(b) If A 6= 0, write A1 = A and let c1 be the leading column of A1 . Use c1 and lower
reduction to create a leading one at the top of c1 and zeros below it. When this is
completed, let A2 denote the matrix consisting of rows 2 to m of the new matrix.
(c) If A2 6= 0, let c2 be the leading column of A2 and repeat Step (b) on A2 to create A3 .
(d) Continue in this way until U is reached, where all rows below the last leading 1 are zero
rows. This will happen after r steps.
(e) Create L by replacing the bottoms of the first r columns of Im with c1 , . . . , cr .

Proof. If c1 , c2 , . . . , cr are columns of lengths m, m − 1, . . . , m − r + 1, respectively, let


L(m) (c1 , c2 , . . . , cr ) denote the lower triangular m × m matrix obtained from Im by placing
c1 , c2 , . . . , cr at the bottom of the first r columns:
\[
L^{(m)}(\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_r) =
\begin{bmatrix}
\mathbf{c}_1 & 0 & \cdots & \cdots & \cdots & \cdots & 0 \\
 & \mathbf{c}_2 & \ddots & & & & \vdots \\
 & & \ddots & \ddots & & & \vdots \\
 & & & \mathbf{c}_r & \ddots & & \vdots \\
 & & & & 1 & \ddots & \vdots \\
 & & & & & \ddots & 0 \\
 & & & & & & 1
\end{bmatrix}.
\]

We prove the result by induction on n. The case where A = 0 or n = 1 is straightforward.


Suppose n > 1 and that the result holds for n − 1. Let c1 be the leading column of A. By the
assumption that A can be lower reduced, there exist lower-triangular elementary matrices
E1 , . . . , Ek such that, in block form,
 
\[
(E_k \cdots E_2 E_1) A
= \begin{bmatrix} 0 & \mathbf{e}_1 & \begin{matrix} X_1 \\ A_1 \end{matrix} \end{bmatrix},
\quad \text{where} \quad (E_k \cdots E_2 E_1)\mathbf{c}_1 = \mathbf{e}_1.
\]
(Recall that e1 = (1, 0, . . . , 0) is the standard basis element.) Define
\[
G = (E_k \cdots E_2 E_1)^{-1} = E_1^{-1} E_2^{-1} \cdots E_k^{-1}.
\]
Then we have Ge1 = c1 . By Lemma 1.6.2, G is lower triangular. In addition, each Ej , and hence each Ej^{-1}, is the result of either multiplying row 1 of Im by a nonzero scalar or adding a multiple of row 1 to another row. Thus, in block form,
\[
G = \begin{bmatrix} \mathbf{c}_1 & \begin{matrix} 0 \\ I_{m-1} \end{matrix} \end{bmatrix}.
\]
By our induction hypothesis, we have an LU factorization A1 = L1 U1 , where L1 = L^{(m−1)}(c2 , . . . , cr ) and U1 is row-echelon. Then block multiplication gives
\[
G^{-1} A
= \begin{bmatrix} 0 & \mathbf{e}_1 & \begin{matrix} X_1 \\ L_1 U_1 \end{matrix} \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & L_1 \end{bmatrix}
\begin{bmatrix} 0 & 1 & X_1 \\ 0 & 0 & U_1 \end{bmatrix}.
\]
Thus A = LU , where
\[
U = \begin{bmatrix} 0 & 1 & X_1 \\ 0 & 0 & U_1 \end{bmatrix}
\]
is row-echelon and
\[
L = \begin{bmatrix} \mathbf{c}_1 & \begin{matrix} 0 \\ I_{m-1} \end{matrix} \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & L_1 \end{bmatrix}
= \begin{bmatrix} \mathbf{c}_1 & \begin{matrix} 0 \\ L_1 \end{matrix} \end{bmatrix}
= L^{(m)}(\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_r).
\]

This completes the proof of the induction step.



LU factorization is very important in practice. It often happens that one wants to solve
a series of systems
Ax = b1 , Ax = b2 , · · · , Ax = bk
with the same coefficient matrix. It is very efficient to first solve the first system by gaus-
sian elimination, simultaneously creating an LU factorization of A. Then one can use this
factorization to solve the remaining systems quickly by forward and back substitution.

Example 1.6.7. Let’s find an LU factorization for


 
3 6 −3 0 3
−2 4 6 8 −2
A=  1 0 −2
.
−5 0 
1 2 −1 6 3
We reduce A to row-echelon form:
  13 R1  
3 6 −3 0 3 R2 +2R1 1 2 −1 0 1
R3 −R1
 −2 4 6 8 −2 R4 −R1 0
  8 4 8 0
A= −−−−→  
 1 0 −2 −5 0  0 −2 −1 −5 −1
1 2 −1 6 3 0 0 0 6 2
   
1
R
1 2 −1 0 1 1
− 3 R3
1 2 −1 0 1
8 2
R3 +2R1 0 1 1/2
 1 0 R4 −6R3  0 1 1/2 1 0 
−− −−→  − −−−→   = U.
0 0 0 −3 −1 0 0 0 1 1/3
0 0 0 6 2 0 0 0 0 0
Thus we have A = LU , with  
3 0 0 0
−2 8 0 0
L=
 1 −2 −3 0 .

1 0 6 1
Let’s do one more example, this time where the matrix A is invertible.

Example 1.6.8. Let’s find an LU factorization for


 
1 −1 2
A = −1 3 4 .
1 −4 −3
We reduce A to row-echelon form:
    1    
1 −1 2 R2 +R1 1 −1 2 R
2 2 1 −1 2 1
R3
1 −1 2
R −R1 R3 +3R2
A =  −1 3 4  −−3−−→ 0 2 6  −− −−→ 0 1 3  −4−→ 0 1 3 = U.
1 −4 −3 0 −3 −5 0 0 4 0 0 1
Then A = LU with  
1 0 0
L = −1 2 0 .
1 −3 4

1.6.3 PLU factorization


All of our examples in Section 1.6.2 worked because we started with matrices A that could
be lower reduced. However, there are matrices that have no LU factorization. Consider, for
example, the matrix
\[
A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.
\]
Suppose this matrix had an LU decomposition A = LU . Write
\[
L = \begin{bmatrix} \ell_{11} & 0 \\ \ell_{21} & \ell_{22} \end{bmatrix}
\quad \text{and} \quad
U = \begin{bmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{bmatrix}.
\]
Then we have
\[
A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
= \begin{bmatrix} \ell_{11} u_{11} & \ell_{11} u_{12} \\ \ell_{21} u_{11} & \ell_{21} u_{12} + \ell_{22} u_{22} \end{bmatrix}.
\]
In particular, ℓ11 u11 = 0, which implies that ℓ11 = 0 or u11 = 0. But this would mean that L is singular or U is singular. In either case, we would have

det(A) = det(L) det(U ) = 0,

which contradicts the fact that det(A) = −1. Therefore, A has no LU decomposition. By
Algorithm 1.6.6, this means that A cannot be lower reduced. The problem is that we need
to use a row interchange to reduce A to row-echelon form.
The following theorem tells us how to handle LU factorization in general.

Theorem 1.6.9. Suppose an m × n matrix A is row reduced to a row-echelon matrix U . Let


P1 , P2 , . . . , Ps be the elementary matrices corresponding (in order) to the row interchanges
used in this reduction, and let P = Ps · · · P2 P1 . (If no interchanges are used, take P = Im .)
Then P A has an LU factorization.

Proof. The only thing that can go wrong in the LU algorithm (Algorithm 1.6.6) is that, in
step (b), the leading column (i.e. first nonzero column) may have a zero entry at the top.
This can be remedied by an elementary row operation that swaps two rows. This corresponds
to multiplication by a permutation matrix (see Proposition 1.4.4(b)). Thus, if U is a row
echelon form of A, then we can write

Lr Pr Lr−1 Pr−1 · · · L2 P2 L1 P1 A = U, (1.10)

where U is a row-echelon matrix, each of the P1 , . . . , Pr is either an identity matrix or an


elementary matrix corresponding to a row interchange, and, for each 1 ≤ j ≤ r, Lj =
L^{(m)}(e1 , . . . , ej−1 , cj )^{-1} for some column cj of length m − j + 1. (Refer to the proof of Algorithm 1.6.6 for notation.) It is not hard to check that, for each 1 ≤ j ≤ r,
\[
L_j = L^{(m)}(\mathbf{e}_1, \ldots, \mathbf{e}_{j-1}, \mathbf{c}_j')
\]
for some column c′j of length m − j + 1.

for some column c0j of length m − j + 1.



Now, each permutation matrix can be "moved past" each lower triangular matrix to the right of it, in the sense that, if k > j, then
\[
P_k L_j = L_j' P_k,
\]
where L′j = L^{(m)}(e1 , . . . , ej−1 , c″j ) for some column c″j of length m − j + 1. See Exercise 1.6.2. Thus, from (1.10) we obtain
\[
(L_r L_{r-1}' \cdots L_2' L_1')(P_r P_{r-1} \cdots P_2 P_1) A = U,
\]
for some lower triangular matrices L′1 , L′2 , . . . , L′r−1 . Setting P = Pr Pr−1 · · · P2 P1 , this implies that P A has an LU factorization, since Lr L′r−1 · · · L′2 L′1 is lower triangular and invertible by Lemma 1.6.2.
Note that Theorem 1.6.9 generalizes Proposition 1.6.3. If A can be lower reduced, then
we can take P = Im in Theorem 1.6.9, which then states that A has an LU factorization.
A matrix that is the product of elementary matrices corresponding to row interchanges
is called a permutation matrix . (We also consider the identity matrix to be a permutation
matrix.) Every permutation matrix P is obtained from the identity matrix by permuting
the rows. Then P A is the matrix obtained from A by performing the same permutation
on the rows of A. The matrix P is a permutation matrix if and only if it has exactly one
1 in each row and column, and all other entries are zero. The elementary permutation
matrices are those matrices obtained from the identity matrix by a single row exchange.
Every permutation matrix is a product of elementary ones.

Example 1.6.10. Consider the matrix
\[
A = \begin{bmatrix} 0 & 2 & 0 & 4 \\ 1 & 0 & 1 & 3 \\ 1 & 0 & 1 & 14 \\ -2 & -1 & -1 & -10 \end{bmatrix}.
\]
Let's find a permutation matrix P such that P A has an LU factorization, and then find the factorization.

We first row reduce A:
\[
A \xrightarrow{R_1 \leftrightarrow R_2}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 1 & 0 & 1 & 14 \\ -2 & -1 & -1 & -10 \end{bmatrix}
\xrightarrow[R_4 + 2R_1]{R_3 - R_1}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 0 & 11 \\ 0 & -1 & 1 & -4 \end{bmatrix}
\xrightarrow{R_4 + \frac{1}{2}R_2}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 0 & 11 \\ 0 & 0 & 1 & -2 \end{bmatrix}
\xrightarrow{R_3 \leftrightarrow R_4}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 1 & -2 \\ 0 & 0 & 0 & 11 \end{bmatrix}.
\]
We used two row interchanges: first R1 ↔ R2 and then R3 ↔ R4. Thus, as in Theorem 1.6.9, we take
\[
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}.
\]
We now apply the LU algorithm to P A:
\[
PA = \begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ -2 & -1 & -1 & -10 \\ 1 & 0 & 1 & 14 \end{bmatrix}
\xrightarrow[R_4 - R_1]{R_3 + 2R_1}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 0 & -1 & 1 & -4 \\ 0 & 0 & 0 & 11 \end{bmatrix}
\xrightarrow[R_3 + R_2]{\frac{1}{2}R_2}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & -2 \\ 0 & 0 & 0 & 11 \end{bmatrix}
\xrightarrow{\frac{1}{11}R_4}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & -2 \\ 0 & 0 & 0 & 1 \end{bmatrix}
= U.
\]
Hence P A = LU , where
\[
L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ -2 & -1 & 1 & 0 \\ 1 & 0 & 0 & 11 \end{bmatrix}.
\]
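For comparison, SciPy computes a PLU-type factorization automatically. Its conventions differ from the hand computation above (partial pivoting chooses its own row interchanges, and L is unit lower triangular rather than U having the leading 1s), so the factors need not coincide with ours, but their product does recover A.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[ 0.0,  2.0,  0.0,   4.0],
              [ 1.0,  0.0,  1.0,   3.0],
              [ 1.0,  0.0,  1.0,  14.0],
              [-2.0, -1.0, -1.0, -10.0]])

P, L, U = lu(A)                    # A = P L U

print(np.allclose(A, P @ L @ U))   # True
print(L)                           # unit lower triangular
print(U)                           # upper triangular
```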
Theorem 1.6.9 is an important factorization theorem that applies to any matrix. If A
is any matrix, this theorem asserts that there exists a permutation matrix P and an LU
factorization P A = LU . Furthermore, it tells us how to find P , and we then know how to
find L and U .
Note that Pi = Pi^{-1} for each i (since any elementary permutation matrix is its own inverse). Thus, the matrix A can be factored as
\[
A = P^{-1} L U,
\]
where P^{-1} is a permutation matrix, L is lower triangular and invertible, and U is a row-
echelon matrix. This is called a PLU factorization or a P A = LU factorization of A.

1.6.4 Uniqueness
Theorem 1.6.9 is an existence theorem. It tells us that a PLU factorization always exists.
However, it leaves open the question of uniqueness. In general, LU factorizations (and hence
PLU factorizations) are not unique. For example,
       
\[
\begin{bmatrix} -1 & 0 \\ 4 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 & -5 \\ 0 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} -1 & -4 & 5 \\ 4 & 16 & -20 \end{bmatrix}
= \begin{bmatrix} -1 & 0 \\ 4 & 8 \end{bmatrix}
\begin{bmatrix} 1 & 4 & -5 \\ 0 & 0 & 0 \end{bmatrix}.
\]
(In fact, one can put any value in the (2, 2)-position of the 2 × 2 matrix and obtain the same
result.) The key to this non-uniqueness is the zero row in the row-echelon matrix. Note
that, if A is m × n, then the matrix U has no zero row if and only if A has rank m.
Theorem 1.6.11. Suppose A is an m × n matrix with LU factorization A = LU . If A has
rank m (that is, U has no zero row), then L and U are uniquely determined by A.
Proof. Suppose A = M V is another LU factorization of A. Thus, M is lower triangular and
invertible, and V is row-echelon. Thus we have
LU = M V, (1.11)

and we wish to show that L = M and U = V .


We have
V = M −1 LU = N U, where N = M −1 L.
Note that N is lower triangular and invertible by Lemma 1.6.2. It suffices to show that
N = I. Suppose N is m × m. We prove the result by induction on m.
First note that the first column of V is N times the first column of U . Since N is
invertible, this implies that the first column of V is zero if and only if the first column of U
is zero. Hence, by deleting zero columns if necessary, we can assume that the (1, 1)-entry is
1 in both U and V .
If m = 1, then, since U is row-echelon, we have
\[
LU = \begin{bmatrix} \ell_{11} \end{bmatrix}
\begin{bmatrix} 1 & u_{12} & \cdots & u_{1n} \end{bmatrix}
= \begin{bmatrix} \ell_{11} & \ell_{11} u_{12} & \cdots & \ell_{11} u_{1n} \end{bmatrix}
= A
= \begin{bmatrix} a_{11} & \cdots & a_{1n} \end{bmatrix}.
\]
Thus ℓ11 = a11 . Similarly, m11 = a11 . So L = [a11] = M , as desired.
Now suppose m > 1 and that the result holds for N of size (m − 1) × (m − 1). As before,
we can delete any zero columns. So we can write, in block form,
     
a 0 1 Y 1 Z
N= , U= , V = .
X N1 0 U1 0 V1

Then    
a aY 1 Z
N U = V =⇒ = .
X XY + N1 U1 0 V1
This implies that
a = 1, Y = Z, X = 0, and N1 U1 = V1 .
By the induction hypothesis, the equality N1 U1 = V1 implies N1 = I. Hence N = I, as
desired.
Recall that an m × m matrix is invertible if and only if it has rank m. Thus, we get the
following special case of Theorem 1.6.11.

Corollary 1.6.12. If an invertible matrix A has an LU factorization A = LU , then L and


U are uniquely determined by A.

Exercises.
1.6.1. Prove Lemma 1.6.2.

1.6.2 ([Nic, Ex. 2.7.11]). Recall the notation L(m) (c1 , c2 , . . . , cr ) from the proof of Algo-
rithm 1.6.6. Suppose 1 ≤ i < j < k ≤ m, and let ci be a column of length m − i + 1. Show
that there is another column c0i of length m − i + 1 such that

Pj,k L(m) (e1 , . . . , ei−1 , ci ) = L(m) (e1 , . . . , ei−1 , c0i )Pj,k .



Here Pj,k is the m × m elementary permutation matrix (see Definition 1.4.2). Hint: Recall
that P_{j,k}^{-1} = P_{j,k}. Write
\[
P_{j,k} = \begin{bmatrix} I_i & 0 \\ 0 & P_{j-i,k-i} \end{bmatrix}
\]
in block form.

Additional recommended exercises: [Nic, §2.7].


Chapter 2

Matrix norms, sensitivity, and conditioning

In this chapter we will consider the issue of how sensitive a linear system is to small changes
or errors in its coefficients. This is particularly important in applications, where these
coefficients are often the results of measurement, and thus inherently subject to some level
of error. Therefore, our goal is to develop some precise measure of how sensitive a linear
system is to such changes.

2.1 Motivation
Consider the linear systems
Ax = b and Ax0 = b0
where
\[
A = \begin{pmatrix} 1 & 1 \\ 1 & 1.00001 \end{pmatrix}, \quad
b = \begin{pmatrix} 2 \\ 2.00001 \end{pmatrix}, \quad
b' = \begin{pmatrix} 2 \\ 2.00002 \end{pmatrix}.
\]
Since det A 6= 0, the matrix A is invertible, and hence these linear systems have unique
solutions. Indeed it is not hard to see that the solutions are
   
\[
x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\quad \text{and} \quad
x' = \begin{pmatrix} 0 \\ 2 \end{pmatrix}.
\]
Note that even though the vector b0 is very close to b, the solutions to the two systems are
quite different. So the solution is very sensitive to the entries of the vector of constants b.
When the solution to the system
Ax = b
is highly sensitive to the entries of the coefficient matrix A or the vector b of constant terms,
we say that the system is ill-conditioned . Ill-conditioned systems are especially problematic
when the coefficients are obtained from experimental results (which always come associated
with some error) or when computations are carried out by computer (which can involve
round-off error).
So how do we know if a linear system is ill-conditioned? To do this, we need to discuss
vector and matrix norms.
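Before doing so, here is a small numerical sketch of the phenomenon above (not part of the original notes; it assumes NumPy is available). A tiny perturbation of $b$ produces a large change in the solution.

```python
import numpy as np

A  = np.array([[1., 1.],
               [1., 1.00001]])
b  = np.array([2., 2.00001])
bp = np.array([2., 2.00002])

x  = np.linalg.solve(A, b)   # approximately (1, 1)
xp = np.linalg.solve(A, bp)  # approximately (0, 2)

print(x, xp)
# tiny change in b, large change in x
print(np.linalg.norm(bp - b), np.linalg.norm(xp - x))
```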


Exercises.
2.1.1. Consider the following linear systems:
         
\[
\begin{pmatrix} 400 & -201 \\ -800 & 401 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} 200 \\ -200 \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} 401 & -201 \\ -800 & 401 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} 200 \\ -200 \end{pmatrix}.
\]

Solve these two linear systems (feel free to use a computer) to see how the small change
in the coefficient matrix results in a large change in the solution. So the solution is very
sensitive to the entries of the coefficient matrix.

2.2 Normed vector spaces


Recall that every complex number z ∈ C can be written in the form

a + bi, a, b ∈ R.

The complex conjugate of z is


z̄ = a − bi.
Note that z̄ = z if and only if z ∈ R. The absolute value, or magnitude, of z = a + bi ∈ C is
defined to be
\[
|z| = \sqrt{z\bar{z}} = \sqrt{a^2 + b^2}.
\]
Note that if z ∈ R, then |z| is the usual absolute value of z.

Definition 2.2.1 (Vector norm, normed vector space). A norm (or vector norm) on an
F-vector space V (e.g. V = Fn ) is a function

k · k: V → R

that satisfies the following axioms. For all u, v ∈ V and c ∈ F, we have

(N1) kvk ≥ 0,
(N2) if kvk = 0 then v = 0,
(N3) kcvk = |c| kvk, and
(N4) ku + vk ≤ kuk + kvk (the triangle inequality).

(Note that the codomain of the norm is R, regardless of whether F = R or F = C.) A vector
space equipped with a norm is called a normed vector space. Thus, a normed vector space
is a pair (V, k · k), where k · k is a norm on V . However, we will often just refer to V as a
normed vector space, leaving it implied that we have a specific norm k · k in mind.

Note that (N3) implies that


k0k = |0| k0k = 0k0k = 0.
Thus, combined with (N2), we have
kvk = 0 ⇐⇒ v = 0.
Example 2.2.2 (2-norm). Let $V = \mathbb{R}^3$. Then
\[
\|(x, y, z)\|_2 := \sqrt{x^2 + y^2 + z^2}
\]
is the usual norm, called the 2-norm or the Euclidean norm. It comes from the dot product, since $\|v\|_2 = \sqrt{v \cdot v}$. We can clearly generalize this to a norm on $V = \mathbb{R}^n$ by
\[
\|(x_1, \dots, x_n)\|_2 := \sqrt{x_1^2 + \cdots + x_n^2}.
\]
We can even define an analogous norm on the complex vector space $V = \mathbb{C}^n$ by
\[
\|(x_1, \dots, x_n)\|_2 := \sqrt{|x_1|^2 + \cdots + |x_n|^2}. \tag{2.1}
\]
(Since the definition (2.1) works for R or C, we will take it as the definition of the 2-norm
from now on.) You have verified in previous courses that k · k satisfies axioms (N1)–(N4).

Example 2.2.3 (1-norm). Let V = Fn and define


k(x1 , . . . , xn )k1 := |x1 | + · · · + |xn |.
Let’s verify that this is a norm, called the 1-norm. Since |c| ≥ 0 for all c ∈ F, we have
k(x1 , . . . , xn )k1 = |x1 | + · · · + |xn | ≥ 0
and
k(x1 , . . . , xn )k1 = 0 =⇒ |x1 | = · · · = |xn | = 0 =⇒ x1 = · · · = xn = 0.
Thus, axioms (N1) and (N2) are satisfied. To verify axiom (N3), we see that
kc(x1 , . . . , xn )k = k(cx1 , . . . , cxn )k
= |cx1 | + · · · + |cxn |
= |c| |x1 | + · · · + |c| |xn |

= |c| |x1 | + · · · + |xn |
= |c| k(x1 , . . . , xn )k1 .
Also, we have
k(x1 , . . . , xn ) + (y1 , . . . , yn )k1 = k(x1 + y1 , . . . , xn + yn )k1
= |x1 + y1 | + · · · + |xn + yn |
≤ |x1 | + |y1 | + · · · + |xn | + |yn | (by the ∆ inequality for F)
= |x1 | + · · · + |xn | + |y1 | + · · · + |yn |
= k(x1 , . . . , xn )k + k(y1 , . . . , yn )k.
So (N4) is satisfied.

In general, for any $p \in \mathbb{R}$, $p \ge 1$, one can define the $p$-norm by
\[
\|(x_1, \dots, x_n)\|_p := \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}.
\]

(It is a bit harder to show this is a norm in general. The proof uses Minkowski’s inequality.)
As p approaches ∞, this becomes the norm

k · k∞ : Fn → R, k(x1 , . . . , xn )k∞ := max{|x1 |, . . . , |xn |}, (2.2)

which is called the ∞-norm or maximum norm. See Exercise 2.2.1. In this course, we’ll
focus mainly on the cases p = 1, 2, ∞.
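As a quick sketch (not part of the original notes, assuming NumPy is available), these three norms can be computed directly:

```python
import numpy as np

v = np.array([3., -4., 12.])

print(np.linalg.norm(v, 1))       # 1-norm: |3| + |-4| + |12| = 19
print(np.linalg.norm(v, 2))       # 2-norm: sqrt(9 + 16 + 144) = 13
print(np.linalg.norm(v, np.inf))  # infinity-norm: max(3, 4, 12) = 12
```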

Remark 2.2.4. It is a theorem of analysis that all norms on Fn are equivalent in the sense
that, if k · k and k · k0 are two norms on Fn , then there is a c ∈ R, c > 0, such that

\[
\frac{1}{c}\|v\| \le \|v\|' \le c\|v\| \quad \text{for all } v \in \mathbb{F}^n.
\]
This implies that they induce the same topology on Fn . That’s beyond the scope of this
course, but it means that, in practice, we can choose whichever norm best suits our particular
application.

Exercises.
2.2.1. Show that (2.2) defines a norm on Fn .

2.2.2. Suppose that k · k is a norm on Fn .

(a) Show that ku − vk ≥ kuk − kvk for all u, v ∈ Fn .


(b) Show that, if v 6= 0, then k(1/c)vk = 1 when c = kvk.

2.2.3. Show that



\[
\|v\|_\infty \le \|v\|_2 \le \sqrt{n}\,\|v\|_\infty
\]
for all v ∈ Fn .

2.2.4. Suppose that, for p > 1, we define

kvk = |v1 |p + |v2 |p + · · · + |vn |p for v = (v1 , v2 , . . . , vn ) ∈ Fn .

Is this a norm on Fn ? If yes, prove it. If not, show that one of the axioms of a norm is
violated.

2.3 Matrix norms


We would now like to define norms of matrices. In view of the motivation in Section 2.1,
we would like the norm to somehow measure how much a matrix A can change a vector. If
multiplication by A has a large effect, then small changes in x can result in large changes in
Ax, which is a problem.

Definition 2.3.1 (Matrix norm). Let A ∈ Mm,n (F). If k · kp and k · kq are norms on Fn and
Fm respectively, we define
 
\[
\|A\|_{p,q} = \max\left\{ \frac{\|Ax\|_q}{\|x\|_p} : x \in \mathbb{F}^n,\ x \ne 0 \right\}. \tag{2.3}
\]

This is called the matrix norm, or operator norm, of A with respect to the norms k · kp and
k · kq . We also say that kAkp,q is the matrix norm associated to the norms k · kp and k · kq .

The next lemma tells us that, in order to compute the matrix norm (2.3), it is not
necessary to check the value of kAxkq /kxkp for every x 6= 0. Instead, it is enough to
consider unit vectors.

Proposition 2.3.2. In the setup of Definition 2.3.1, we have
\[
\|A\|_{p,q} = \max \left\{ \|Ax\|_q : x \in \mathbb{F}^n,\ \|x\|_p = 1 \right\}.
\]

Proof. Let
 
\[
S = \left\{ \frac{\|Ax\|_q}{\|x\|_p} : x \in \mathbb{F}^n,\ x \ne 0 \right\}
\quad \text{and} \quad
T = \left\{ \|Ax\|_q : x \in \mathbb{F}^n,\ \|x\|_p = 1 \right\}.
\]

These are both sets of nonnegative real numbers, and we wish to show that they have the
same maximum. To do this, we will show that these sets are actually the same (which is a
stronger assertion).
First note that, if kxkp = 1, then

\[
\frac{\|Ax\|_q}{\|x\|_p} = \|Ax\|_q.
\]

Thus T ⊆ S.
Now we want to show the reverse inclusion: S ⊆ T . Suppose s ∈ S. Then there is some
x ≠ 0 such that
\[
s = \frac{\|Ax\|_q}{\|x\|_p}.
\]
Define
\[
c = \frac{1}{\|x\|_p} \in \mathbb{R}.
\]
Then, by (N3),
\[
\|cx\|_p = |c|\,\|x\|_p = \frac{1}{\|x\|_p}\|x\|_p = 1.
\]

Thus we have
\[
s = \frac{\|Ax\|_q}{\|x\|_p}
= \frac{|c|\,\|Ax\|_q}{|c|\,\|x\|_p}
\overset{\text{(N3)}}{=} \frac{\|cAx\|_q}{\|cx\|_p}
= \frac{\|A(cx)\|_q}{\|cx\|_p}
= \|A(cx)\|_q \in T,
\]
since kcxkp = 1. Thus we have shown the reverse inclusion S ⊆ T .
Having shown both inclusions, it follows that S = T , and hence their maxima are equal.

Remark 2.3.3. In general, a set of real numbers may not have a maximum. (Consider the
set R itself.) Thus, it is not immediately clear that kAkp,q is indeed well defined. However,
one can use Proposition 2.3.2 to show that it is well defined. The set {x ∈ Fn : kxkp = 1} is
compact, and the function x 7→ kAxkq is continuous. It follows from a theorem in analysis
that this function attains a maximum.

When we use the same type of norm (e.g. the 1-norm, the 2-norm, or the ∞-norm) in
both the domain and the codomain, we typically use a single subscript on the matrix norm.
Thus, for instance,

kAk1 := kAk1,1 , kAk2 := kAk2,2 , kAk∞ := kAk∞,∞ .

Note that, in principle, we could choose different types of norms for the domain and codomain.
However, in practice, we usually choose the same one.

Example 2.3.4. Let’s calculate kAk1 for F = R and


 
\[
A = \begin{pmatrix} 2 & 3 \\ 1 & -5 \end{pmatrix}.
\]
We will use Proposition 2.3.2, so we need to consider the set

{(x, y) : k(x, y)k1 = 1} = {(x, y) : |x| + |y| = 1}.

This set is the union of the four blue line segments shown below:

[Figure: the set $\{(x,y) : |x| + |y| = 1\}$ is a "diamond" with vertices $(1,0)$, $(0,1)$, $(-1,0)$, $(0,-1)$.]

By Proposition 2.3.2, we have
\begin{align*}
\|A\|_1 &= \max\{\|Ax\|_1 : \|x\|_1 = 1\} \\
&= \max\left\{ \left\| A \begin{pmatrix} x \\ y \end{pmatrix} \right\|_1 : |x| + |y| = 1 \right\} \\
&= \max\left\{ \left\| \begin{pmatrix} 2x + 3y \\ x - 5y \end{pmatrix} \right\|_1 : |x| + |y| = 1 \right\} \\
&= \max\left\{ |2x + 3y| + |x - 5y| : |x| + |y| = 1 \right\} \\
&\le \max\left\{ 2|x| + 3|y| + |x| + 5|y| : |x| + |y| = 1 \right\} \\
&= \max\left\{ 3|x| + 8|y| : |x| + |y| = 1 \right\} \\
&= \max\left\{ 3|x| + 8(1 - |x|) : 0 \le |x| \le 1 \right\} \\
&= \max\left\{ 8 - 5x : 0 \le x \le 1 \right\} \\
&= 8.
\end{align*}

So we know that kAk1 = max{kAxk1 : kxk1 = 1} is at most 8. To show that it is equal to 8,


it is enough to show that 8 ∈ {kAxk1 : kxk1 = 1}. So we want to find an x ∈ R2 such that

kxk1 = 1 and kAxk1 = 8.

Indeed, if we take x = (0, 1), then kxk1 = 1, and


    
\[
Ax = \begin{pmatrix} 2 & 3 \\ 1 & -5 \end{pmatrix}
\begin{pmatrix} 0 \\ 1 \end{pmatrix}
= \begin{pmatrix} 3 \\ -5 \end{pmatrix},
\]

and so kAxk1 = |3| + | − 5| = 8. Thus kAk1 = 8.

Note that, in Example 2.3.4, the matrix norm kAk1 was precisely the 1-norm of one of its
columns (the second column, to be precise). The following result gives the general situation.

Theorem 2.3.5. Suppose $A = [a_{ij}] \in M_{m,n}(\mathbb{F})$. Then

(a) $\|A\|_1 = \max\left\{ \sum_{i=1}^m |a_{ij}| : 1 \le j \le n \right\}$, the maximum of the 1-norms of the columns of $A$;

(b) $\|A\|_\infty = \max\left\{ \sum_{j=1}^n |a_{ij}| : 1 \le i \le m \right\}$, the maximum of the 1-norms of the rows of $A$;

(c) $\|A\|_2 \le \sqrt{\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2} =: \|A\|_F$, called the Frobenius norm of $A$, which is the 2-norm of $A$, viewing it as a vector in $\mathbb{F}^{mn}$.

Proof. We will prove part (a) and leave parts (b) and (c) as Exercise 2.3.1.
Recall that aj = Aej is the j-th column of A, for 1 ≤ j ≤ n. We have kej k1 = 1 and
\[
\|Ae_j\|_1 = \sum_{i=1}^m |a_{ij}|
\]
is the 1-norm of the $j$-th column of $A$. Thus, by definition,
\[
\|A\|_1 \ge \max\left\{ \sum_{i=1}^m |a_{ij}| : 1 \le j \le n \right\}.
\]

It remains to prove the reverse inequality. If x = (x1 , . . . , xn ), we have

Ax = x1 a1 + · · · + xn an .

Thus
kAxk1 = kx1 a1 + · · · + xn an k1
≤ kx1 a1 k1 + · · · + kxn an k1 (by the triangle inequality (N4))
≤ |x1 | ka1 k1 + · · · + |xn | kan k1 . (by (N3))
Now, suppose the column of A with the maximum 1-norm is the j-th column, and suppose
kxk1 = 1. Then we have
kAxk1 ≤ |x1 | kaj k1 + · · · + |xn | kaj k1 = (|x1 | + · · · + |xn |) kaj k1 = kaj k1 .
This completes the proof.
Note that part (c) of Theorem 2.3.5 involves an inequality. In practice, the norm kAk2 is difficult to compute, while the Frobenius norm is much easier.
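As a small numerical sketch (not part of the original notes, assuming NumPy is available), these matrix norms can be checked for the matrix of Example 2.3.4:

```python
import numpy as np

A = np.array([[2., 3.],
              [1., -5.]])

print(np.linalg.norm(A, 1))       # max column 1-norm = 8
print(np.linalg.norm(A, np.inf))  # max row 1-norm = 6
print(np.linalg.norm(A, 'fro'))   # Frobenius norm = sqrt(4 + 9 + 1 + 25)
print(np.linalg.norm(A, 2))       # the 2-norm, bounded above by the Frobenius norm
```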
The following theorem summarizes the most important properties of matrix norms.
Theorem 2.3.6 (Properties of matrix norms). Suppose k · k is a family of norms on Fn ,
n ≥ 1. We also use the notation kAk for the matrix norm with respect to these vector norms.
(a) For all v ∈ Fn and A ∈ Mm,n (F), we have kAvk ≤ kAk kvk.
(b) kIk = 1.
(c) For all A ∈ Mm,n (F), we have kAk ≥ 0 and kAk = 0 if and only if A = 0.
(d) For all c ∈ F and A ∈ Mm,n (F), we have kcAk = |c| kAk.
(e) For all A, B ∈ Mm,n (F), we have kA + Bk ≤ kAk + kBk.
(f) For all A ∈ Mm,n (F) and B ∈ Mn,k (F), we have kABk ≤ kAk kBk.
(g) For all A ∈ Mn,n (F), we have kAk k ≤ kAkk for all k ≥ 1.
(h) If $A \in GL(n, \mathbb{F})$, then $\|A^{-1}\| \ge \frac{1}{\|A\|}$.
Proof. We prove parts (a) and (h) and leave the remaining parts as Exercise 2.3.3. Suppose
v ∈ Fn and A ∈ Mm,n (F). If v = 0, then
kAvk = 0 = kAk kvk.
Now suppose $v \ne 0$. Then
\[
\frac{\|Av\|}{\|v\|} \le \max\left\{ \frac{\|Ax\|}{\|x\|} : x \in \mathbb{F}^n,\ x \ne 0 \right\} = \|A\|.
\]
Multiplying both sides by $\|v\|$ then gives (a).
Now suppose $A \in GL(n, \mathbb{F})$. By the definition of $\|A\|$, we can choose $x \in \mathbb{F}^n$, $x \ne 0$, such that
\[
\|A\| = \frac{\|Ax\|}{\|x\|}.
\]
Then we have
\[
\frac{1}{\|A\|} = \frac{\|x\|}{\|Ax\|} = \frac{\|A^{-1}(Ax)\|}{\|Ax\|} \le \|A^{-1}\|
\]
by part (a) applied to $A^{-1}$.

Exercises.
2.3.1. Prove parts (b) and (c) of Theorem 2.3.5. For part (c), use the Cauchy–Schwarz
inequality
|uT v| ≤ kuk kvk, u, v ∈ Fn ,
(here we view the 1 × 1 matrix uT v as an element of F) and note that the entries of the
product Ax are of the form uT x, where u is a row of A.

2.3.2. Suppose A ∈ Mn,n (F) and that λ is an eigenvalue of A. Show that, for any choice of
vector norm on Fn , we have kAk ≥ |λ|, where kAk is the associated matrix norm of A.

2.3.3. Complete the proof of Theorem 2.3.6.

2.3.4. What is the matrix norm of a zero matrix?

2.3.5. Suppose A ∈ Mm,n (F) and that there is some fixed k ∈ R such that kAvk ≤ kkvk for
all v ∈ Fn . (Here we have fixed some arbitrary norms on Fn and Fm .) Show that kAk ≤ k.

2.3.6. For each of the following matrices, find kAk1 and kAk∞ .
 
(a) $\begin{pmatrix} -4 \\ 1 \\ 5 \end{pmatrix}$

(b) $\begin{pmatrix} -4 & 1 & 5 \end{pmatrix}$

(c) $\begin{pmatrix} -9 & 0 & 2 & 9 \end{pmatrix}$

(d) $\begin{pmatrix} -5 & 8 \\ 6 & 2 \end{pmatrix}$

(e) $\begin{pmatrix} 2i & 5 \\ 4 & -i \\ 3 & 5 \end{pmatrix}$

2.4 Conditioning
Our goal is to develop some measure of how “good” a matrix is as a coefficient matrix of
a linear system. That is, we want some measure that allows us to know whether or not a
matrix can exhibit the bad behaviour we saw in Section 2.1.

Definition 2.4.1 (Condition number). Suppose A ∈ GL(n, F) and let k · k denote a norm
on Fn as well as the associated matrix norm. The value

κ(A) = kAk kA−1 k ≥ 1

is called the condition number of the matrix A, relative to the choice of norm k · k.

Note that the condition number depends on the choice of norm. The fact that κ(A) ≥ 1
follows from Theorem 2.3.6(h).

Examples 2.4.2. (a) Consider the matrix from Section 2.1. We have
\[
A = \begin{pmatrix} 1 & 1 \\ 1 & 1.00001 \end{pmatrix}
\quad \text{and} \quad
A^{-1} = 10^5 \begin{pmatrix} 1.00001 & -1 \\ -1 & 1 \end{pmatrix}.
\]
Therefore, with respect to the 1-norm,
\[
\kappa(A) = (2.00001)(2.00001 \cdot 10^5) \ge 4 \cdot 10^5.
\]
(b) If
\[
B = \begin{pmatrix} 2 & 2 \\ 4 & 3 \end{pmatrix},
\quad \text{then} \quad
B^{-1} = \frac{-1}{2}\begin{pmatrix} 3 & -2 \\ -4 & 2 \end{pmatrix}.
\]
Thus, with respect to the 1-norm,
\[
\kappa(B) = 6 \cdot \frac{7}{2} = 21.
\]
(c) If $a \in \mathbb{R}$ and
\[
C = \begin{pmatrix} 1 & a \\ 0 & 1 \end{pmatrix}
\quad \text{and} \quad
C^{-1} = \begin{pmatrix} 1 & -a \\ 0 & 1 \end{pmatrix},
\]
and so, with respect to the 1-norm,
\[
\kappa(C) = (|a| + 1)^2.
\]

Since a is arbitrary, this example shows that the condition number can be arbitrarily large.
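As a small numerical sketch (not part of the original notes, assuming NumPy is available), condition numbers with respect to the 1-norm can be computed directly:

```python
import numpy as np

A = np.array([[1., 1.],
              [1., 1.00001]])
B = np.array([[2., 2.],
              [4., 3.]])

print(np.linalg.cond(A, p=1))  # roughly 4e5: badly conditioned
print(np.linalg.cond(B, p=1))  # 21: reasonably well-conditioned
```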

Lemma 2.4.3. If c ∈ F× , then for every invertible matrix A, we have κ(cA) = κ(A).
Proof. Note that (cA)−1 = c−1 A−1 . Thus

κ(cA) = kcAk kc−1 A−1 k = |c| |c−1 | kAk kA−1 k = |cc−1 |κ(A) = κ(A).

Example 2.4.4. If
\[
M = \begin{pmatrix} 10^{16} & 0 \\ 0 & 10^{16} \end{pmatrix},
\]
then $M = 10^{16} I$. Hence $\kappa(M) = \kappa(I) = 1$ by Lemma 2.4.3 and Theorem 2.3.6(b).

Now that we’ve defined the condition number of a matrix, what does it have to do with
the situation discussed in Section 2.1? Suppose we want to solve the system

Ax = b.

where b is determined by some experiment (thus subject to measurement error) or com-


putation (subject to rounding error). Let ∆b be some small perturbation in b, so that
b0 = b + ∆b is close to b. Then there is some new solution

Ax0 = b0 .

Let ∆x = x0 − x. We say that the error ∆b induces (via A) the error ∆x in x.


The norms of the vectors $\Delta b$ and $\Delta x$ are the absolute errors; the quotients
\[
\frac{\|\Delta b\|}{\|b\|} \quad \text{and} \quad \frac{\|\Delta x\|}{\|x\|}
\]
are the relative errors. In general, we are interested in the relative error.
Theorem 2.4.5. With the above notation,
\[
\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}.
\]
Proof. We have
\[
A(\Delta x) = A(x' - x) = Ax' - Ax = b' - b = \Delta b,
\]
and so $\Delta x = A^{-1} \Delta b$. Thus, by Theorem 2.3.6(a), we have
\[
\|\Delta x\| \le \|A^{-1}\|\,\|\Delta b\| \quad \text{and} \quad \|b\| \le \|A\|\,\|x\|.
\]
Thus
\[
\frac{\|\Delta x\|}{\|x\|} \le \frac{\|A^{-1}\|\,\|\Delta b\|}{\|b\|/\|A\|} = \kappa(A)\frac{\|\Delta b\|}{\|b\|}.
\]
In fact, there always exists a choice of b and ∆b such that we have equality in Theo-
rem 2.4.5. See Exercise 2.4.1.
Theorem 2.4.5 says that, when solving the linear system,
Ax = b,
the condition number is roughly the rate at which the solution x will change with respect
to a change in b. So if the condition number is large, then even a small error/change in b
may cause a large error/change in x. More precisely, the condition number is the maximum
ratio of the relative error in x to the relative error in b.
If κ(A) is close to 1, we say that A is well-conditioned . On the other hand, if κ(A) is
large, we say that A is ill-conditioned . Note that these terms are a bit vague; we do not
have a precise notion of how large κ(A) needs to be before we say A is ill-conditioned. This
depends somewhat on the particular situation.

Example 2.4.6. Consider the situation from Section 2.1 and Example 2.4.2(a):
         
1 1 2 0 2 1 0 0
A= , b= , b = , x= , x = .
1 1.00001 2.00001 2.00002 1 2
Thus
k∆bk1 = 10−5 , kbk1 = 4.00001, k∆xk1 = 2, kxk1 = 2.
So we have
k∆bk 10−5 k∆xk
κ(A) = (2.00001)2 · 105 · ≥1= ,
kbk 4.00001 kxk
as predicted by Theorem 2.4.5. The fact that A is ill-conditioned explains the phenomenon
we noticed in Section 2.1: that a small change in b can result in a large change in the solution
x to the system Ax = b.

What happens if there is some small change/error in the matrix A in addition to a


change/error in b? (See Exercise 2.1.1.) In general, one can show that
\[
\frac{\|\Delta x\|}{\|x\|} \le c \cdot \kappa(A) \left( \frac{\|\Delta b\|}{\|b\|} + \frac{\|\Delta A\|}{\|A\|} \right),
\]
where c = k(∆A)A−1 k or c = kA−1 (∆A)k (and we assume that c < 1). For a proof see, for
example, [ND77, Th. 6.29].
While an ill-conditioned coefficient matrix A tells us that the system Ax = b can be
ill-conditioned (see Exercise 2.4.1), it does not imply in general that this system is ill-
conditioned. See Exercise 2.4.2. Thus, κ(A) measures the worst-case scenario for a linear
system with coefficient matrix A.

Exercises.
2.4.1. Show that there always exists a choice of b and ∆b such that we have equality in
Theorem 2.4.5.

2.4.2. If s is very large, we know from Example 2.4.2(c) that the matrix
 
1 s
C=
0 1
is ill-conditioned. Show that, if b = (1, 1), then the system Cx = b satisfies
k∆xk k∆b||
≤3
kxk kbk
and is therefore well-conditioned. (Here we use the 1-norm.) On the other hand, find a
choice of b for which the system Cx = b is ill-conditioned.

2.4.3. Suppose $k \in \mathbb{F}^\times$ and find the condition number of the matrix
\[
A = \begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{k} \end{pmatrix}
\]
using either the 1-norm or the ∞-norm.

2.4.4 ([ND77, 6.5.16]). Consider the system of equations


0.89x1 + 0.53x2 = 0.36
0.47x1 + 0.28x2 = 0.19
with exact solution x1 = 1, x2 = −1.
(a) Find ∆b so that if you replace the right-hand side b by b + ∆b, the exact solution
will be x1 = 0.47, x2 = 0.11.
(b) Is the system ill-conditioned or well-conditioned?
(c) Find the condition number for the coefficient matrix of the system using the ∞-norm.
Chapter 3

Orthogonality

The notion of orthogonality is fundamental in linear algebra. You’ve encountered this con-
cept in previous courses. Here we will delve into this subject in further detail. We begin by
briefly reviewing the Gram–Schmidt algorithm, orthogonal complements, orthogonal projec-
tion, and diagonalization. We then discuss hermitian and unitary matrices, which are com-
plex analogues of symmetric and orthogonal matrices that you’ve seen before. Afterwards,
we learn about Schur decomposition and prove the important spectral and Cayley–Hamilton
theorems. We also define positive definite matrices and consider Cholesky and QR factor-
izations. We conclude with a discussion of computing/estimating eigenvalues, including the
Gershgorin circle theorem.

3.1 Orthogonal complements and projections


In this section we briefly review the concepts of orthogonality, orthogonal complements,
orthogonal projections, and the Gram–Schmidt algorithm. Since you have seen this material
in previous courses, we will move quickly and omit proofs. We follow the presentation in
[Nic, §8.1], and proofs can be found there.
Recall that $\mathbb{F}$ is either $\mathbb{R}$ or $\mathbb{C}$. If $v = (v_1, \dots, v_n) \in \mathbb{F}^n$, we define
\[
\bar{v} = (\bar{v}_1, \dots, \bar{v}_n) \in \mathbb{F}^n.
\]
Then we define the inner product on $\mathbb{F}^n$ as follows:
\[
\langle u, v \rangle := \bar{u}^T v = \bar{u}_1 v_1 + \cdots + \bar{u}_n v_n, \qquad v = (v_1, \dots, v_n),\ u = (u_1, \dots, u_n). \tag{3.1}
\]
When $\mathbb{F} = \mathbb{R}$, this is the usual dot product. (Note that, in [Nic, §8.7], the inner product is defined with the complex conjugation on the second vector.) The inner product has the following important properties: For all $u, v, w \in \mathbb{F}^n$ and $c, d \in \mathbb{F}$, we have

(IP1) $\langle u, v \rangle = \overline{\langle v, u \rangle}$,

(IP2) $\langle cu + dv, w \rangle = \bar{c}\langle u, w \rangle + \bar{d}\langle v, w \rangle$,

(IP3) $\langle u, cv + dw \rangle = c\langle u, v \rangle + d\langle u, w \rangle$,

(IP4) $\langle u, u \rangle \in \mathbb{R}$ and $\langle u, u \rangle \ge 0$,


(IP5) hu, ui = 0 if and only if u = 0.

More generally, if $V$ is a vector space over $\mathbb{F}$, then any map
\[
\langle \cdot, \cdot \rangle : V \times V \to \mathbb{F}
\]
satisfying (IP1)–(IP5) is called an inner product on $V$. In light of (IP4), for any inner product, we may define
\[
\|v\| = \sqrt{\langle v, v \rangle},
\]
and one can check that this defines a norm on $V$. For the purposes of this course, we will stick to the particular inner product (3.1). In this case $\|v\|$ is the usual 2-norm:
\[
\|v\| = \sqrt{|v_1|^2 + \cdots + |v_n|^2}.
\]

Throughout this chapter, k · k will denote the 2-norm.

Definition 3.1.1 (Orthogonal, orthonormal). We say that u, v ∈ Fn are orthogonal , and


we write u ⊥ v, if
hu, vi = 0.
A set {v1 , . . . , vm } is orthogonal if vi ⊥ vj for all i 6= j. If, in addition, we have kvi k = 1
for all i, then we say the set is orthonormal . An orthogonal basis is a basis that is also an
orthogonal set. Similarly, an orthonormal basis is a basis that is also an orthonormal set.

Proposition 3.1.2. Let U be a subspace of Fn .

(a) Every orthogonal subset {v1 , . . . , vm } in U is a subset of an orthogonal basis of U . (We


say that we can extend any orthogonal subset to an orthogonal basis.)
(b) The subspace U has an orthogonal basis.

Proof. You saw this in previous courses, so we will omit the proof here. It can be found in
[Nic, Th. 8.1.1].

Theorem 3.1.3 (Gram–Schmidt orthogonalization algorithm). If $\{v_1, \dots, v_m\}$ is any basis of a subspace $U$ of $\mathbb{F}^n$, construct $u_1, \dots, u_m \in U$ successively as follows:
\begin{align*}
u_1 &= v_1, \\
u_2 &= v_2 - \frac{\langle u_1, v_2 \rangle}{\|u_1\|^2} u_1, \\
u_3 &= v_3 - \frac{\langle u_1, v_3 \rangle}{\|u_1\|^2} u_1 - \frac{\langle u_2, v_3 \rangle}{\|u_2\|^2} u_2, \\
&\;\;\vdots \\
u_m &= v_m - \frac{\langle u_1, v_m \rangle}{\|u_1\|^2} u_1 - \frac{\langle u_2, v_m \rangle}{\|u_2\|^2} u_2 - \cdots - \frac{\langle u_{m-1}, v_m \rangle}{\|u_{m-1}\|^2} u_{m-1}.
\end{align*}

Then

(a) {u1 , . . . , um } is an orthogonal basis of U , and


(b) Span{u1 , . . . , uk } = Span{v1 , . . . , vk } for each k = 1, 2, . . . , m.
Proof. You saw this in previous courses, and so we will not repeat the proof here. See
[Nic, Th. 8.1.2]. Note that, in [Nic, Th. 8.1.2], it is assumed that F = R, in which case
hv, ui = hu, vi for all u, v ∈ Rn . If we wish to allow F = C, we have (IP1) instead. Then it
is important that we write huk , vi in the Gram–Schmidt algorithm instead of hv, uk i.

Example 3.1.4. Let’s find an orthogonal basis for the row space of
 
\[
A = \begin{pmatrix} 1 & -1 & 0 & 1 \\ 2 & -1 & -2 & 3 \\ 0 & 2 & 0 & 1 \end{pmatrix}.
\]
Let v1 , v2 , v3 denote the rows of A. One can check that these rows are linearly independent.
(Reduce A to echelon form and note that it has rank 3.) So they give a basis of the row
space. Let’s use the Gram–Schmidt algorithm to find an orthogonal basis:
\begin{align*}
u_1 &= v_1 = (1, -1, 0, 1), \\
u_2 &= v_2 - \frac{\langle u_1, v_2 \rangle}{\|u_1\|^2} u_1 = (2, -1, -2, 3) - \frac{6}{3}(1, -1, 0, 1) = (0, 1, -2, 1), \\
u_3 &= v_3 - \frac{\langle u_1, v_3 \rangle}{\|u_1\|^2} u_1 - \frac{\langle u_2, v_3 \rangle}{\|u_2\|^2} u_2 \\
&= (0, 2, 0, 1) - \frac{-1}{3}(1, -1, 0, 1) - \frac{3}{6}(0, 1, -2, 1) = \left( \tfrac{1}{3}, \tfrac{7}{6}, 1, \tfrac{5}{6} \right).
\end{align*}
It can be nice to eliminate the fractions (see Remark 3.1.5), so
{(1, −1, 0, 1), (0, 1, −2, 1), (2, 7, 6, 5)}
is an orthogonal basis for the row space of A. If we wanted an orthonormal basis, we would
divide each of these basis vectors by its norm.

Remark 3.1.5. Note that, for $c \in \mathbb{F}^\times$ and $u, v \in \mathbb{F}^n$, we have
\[
\frac{\langle cu, v \rangle}{\|cu\|^2} (cu) = \frac{\bar{c}\langle u, v \rangle}{|c|^2 \|u\|^2} (cu) = \frac{\langle u, v \rangle}{\|u\|^2} u.
\]
Therefore, in the Gram–Schmidt algorithm, replacing some ui by cui , c 6= 0, does not affect
any of the subsequent steps. This is useful in computations, since we can, for instance, clear
denominators.
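Here is a minimal NumPy sketch of the Gram–Schmidt algorithm (not part of the original notes; the function name `gram_schmidt` and the `1e-12` tolerance are my own choices):

```python
import numpy as np

def gram_schmidt(vectors):
    """Return an orthogonal basis for the span of the given (possibly complex) vectors.

    Vectors that become (numerically) zero are dropped, as in Remark 3.1.7.
    """
    basis = []
    for v in vectors:
        u = np.array(v, dtype=complex)
        for b in basis:
            # subtract the projection of v onto b, using <b, v> = conj(b)^T v
            u = u - (np.vdot(b, v) / np.vdot(b, b)) * b
        if np.linalg.norm(u) > 1e-12:
            basis.append(u)
    return basis

rows = [(1, -1, 0, 1), (2, -1, -2, 3), (0, 2, 0, 1)]
for u in gram_schmidt(rows):
    print(u.real)  # (1,-1,0,1), (0,1,-2,1), (1/3, 7/6, 1, 5/6)
```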

Orthogonal (especially orthonormal) bases are particularly nice since it is easy to write
a vector as a linear combinations of the elements of such a basis.
Proposition 3.1.6. Suppose {u1 , . . . , um } is an orthogonal basis of a subspace U of Fn .
Then, for any $v \in U$, we have
\[
v = \sum_{i=1}^m \frac{\langle u_i, v \rangle}{\|u_i\|^2} u_i. \tag{3.2}
\]

In particular, if the basis is orthonormal, then
\[
v = \sum_{i=1}^m \langle u_i, v \rangle u_i. \tag{3.3}
\]

Proof. Since $v \in U$, we can write
\[
v = \sum_{i=1}^m c_i u_i \quad \text{for some } c_1, \dots, c_m \in \mathbb{F}.
\]
Then, for $j = 1, \dots, m$, we have
\[
\langle u_j, v \rangle = \left\langle u_j, \sum_{i=1}^m c_i u_i \right\rangle
\overset{\text{(IP3)}}{=} \sum_{i=1}^m c_i \langle u_j, u_i \rangle
= c_j \langle u_j, u_j \rangle = c_j \|u_j\|^2.
\]
Thus $c_j = \frac{\langle u_j, v \rangle}{\|u_j\|^2}$, as desired.

Remark 3.1.7. What happens if you apply the Gram–Schmidt algorithm to a set of vectors
that is not linearly independent? Remember that a list of vectors v1 , . . . , vm is linearly
dependent if and only if one of the vectors, say vk , is a linear combination of the previous
ones. Then, using Proposition 3.1.6, one can see that the Gram–Schmidt algorithm will give
uk = 0. Thus, you can still apply the Gram–Schmidt algorithm to linearly dependent sets,
as long as you simply throw out any zero vectors that you obtain in the process.

Definition 3.1.8. If U is a subspace of Fn , we define the orthogonal complement of U by

U ⊥ := {v ∈ Fn : hu, vi = 0 for all u ∈ U }.

We read U ⊥ as “U -perp”.

Lemma 3.1.9. Let U be a subspace of Fn .

(a) U ⊥ is a subspace of Fn .
(b) {0}⊥ = Fn and (Fn )⊥ = {0}.
(c) If U = Span{u1 , . . . , uk }, then U ⊥ = {v ∈ Fn : hv, ui i = 0 for all i = 1, 2, . . . , k}.

Proof. You saw these properties of the orthogonal complement in previous courses, so we
will not repeat the proofs here. See [Nic, Lem. 8.1.2].

Definition 3.1.10 (Orthogonal projection). Let U be a subspace of Fn with orthogonal


basis {u1 , . . . , um }. If v ∈ Fn , then the vector

\[
\operatorname{proj}_U v = \frac{\langle u_1, v \rangle}{\|u_1\|^2} u_1 + \frac{\langle u_2, v \rangle}{\|u_2\|^2} u_2 + \cdots + \frac{\langle u_m, v \rangle}{\|u_m\|^2} u_m \tag{3.4}
\]

is called the orthogonal projection of v onto U . For the zero subspace U = {0}, we define
proj{0} x = 0.

Noting the similarity between (3.4) and (3.2), we see that


v = projU v ⇐⇒ v ∈ U.
(The forward implication follows from the fact that projU v ∈ U , while the reverse implication
follows from Proposition 3.1.6.) Also note that the k-th step of Gram–Schmidt algorithm
can be written as
uk = vk − projSpan{u1 ,...,uk−1 } vk .
Proposition 3.1.11. Suppose U is a subspace of Fn and v ∈ Fn . Define p = projU v.
(a) We have p ∈ U and v − p ∈ U ⊥ .
(b) The vector p is the vector in U closest to v in the sense that
kv − pk < kv − uk for all u ∈ U, u 6= p.
Proof. You saw this in previous courses, so we will not repeat the proofs here. See [Nic,
Th. 8.1.3].
Proposition 3.1.12. Suppose U is a subspace of Fn . Define
T : Fn → Fn , T (v) = projU v, v ∈ Fn .
(a) T is a linear operator.
(b) im T = U and ker T = U ⊥ .
(c) dim U + dim U ⊥ = n.
Proof. See [Nic, Th. 8.1.4].
If U is a subspace of Fn , then every v ∈ Fn can be written uniquely as a sum of a vector
in U and a vector in U ⊥ . Precisely, we have
v = projU v + projU ⊥ v.

Exercises.
3.1.1. Prove that the inner product defined by (3.1) satisfies conditions (IP1)–(IP5).

3.1.2. Suppose U and V are subspaces of Fn . We write U ⊥ V if


u⊥v for all u ∈ U, v ∈ V.
Show that if U ⊥ V , then projU ◦ projV = 0.

Additional recommended exercises:


• [Nic, §8.1]: All exercises
• [Nic, §8.7]: 8.7.1–8.7.4

3.2 Diagonalization
In this section we quickly review the topics of eigenvectors, eigenvalues, and diagonalization
that you saw in previous courses. For a more detailed review of this material, see [Nic, §3.3].
Throughout this section, we suppose that A ∈ Mn,n (F). Recall that if

Av = λv for some λ ∈ F, v ∈ Fn , v 6= 0, (3.5)

then we say that v is an eigenvector of A with corresponding eigenvalue λ. So λ ∈ F is an eigenvalue of A if (3.5) is satisfied for some nonzero vector v ∈ Fn .
You learned in previous courses how to find the eigenvalues of A. The eigenvalues of A
are precisely the roots of the characteristic polynomial

cA (x) := det(xI − A).

If $\mathbb{F} = \mathbb{C}$, then the characteristic polynomial will factor completely. That is, we have
\[
\det(xI - A) = (x - \lambda_1)^{m_{\lambda_1}} (x - \lambda_2)^{m_{\lambda_2}} \cdots (x - \lambda_k)^{m_{\lambda_k}},
\]
where $\lambda_1, \dots, \lambda_k$ are distinct. Then $m_{\lambda_i}$ is called the algebraic multiplicity of $\lambda_i$. It follows that $\sum_{i=1}^k m_{\lambda_i} = n$.
For an eigenvalue λ of A, the set of solutions to the equation

Ax = λx or (A − λI)x = 0

is the associated eigenspace Eλ . The eigenvectors corresponding to λ are precisely the


nonzero vectors in Eλ . The dimension dim Eλ is called the geometric multiplicity of λ. We
always have
1 ≤ dim Eλ ≤ mλ (3.6)
for every eigenvalue λ.
Recall that the matrix A is diagonalizable if there exists an invertible matrix P such that

P −1 AP

is diagonal. If D is this diagonal matrix, then we have

A = P DP −1 .

We say that matrices A, B ∈ Mn,n (C) are similar if there exists some invertible matrix P
such that A = P BP −1 . So a matrix is diagonalizable if and only if it is similar to a diagonal matrix.

Theorem 3.2.1. Suppose A ∈ Mn,n (C). The following statements are equivalent.

(a) Cn has a basis consisting of eigenvectors of A.


(b) dim Eλ = mλ for every eigenvalue λ of A.
(c) A is diagonalizable.

Corollary 3.2.2. If A ∈ Mn,n (C) has n distinct eigenvalues, then A is diagonalizable.

Proof. Suppose A has n distinct eigenvalues. Then mλ = 1 for every eigenvalue λ. It then follows from (3.6) that dim Eλ = mλ for every eigenvalue λ. Hence A is diagonalizable by
Theorem 3.2.1.

Suppose A is diagonalizable, and let v1 , . . . , vn be a basis of eigenvectors. Then the


matrix
\[
P = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}
\]
is invertible. Furthermore, if we define
\[
D = \begin{pmatrix}
\lambda_1 & 0 & \cdots & \cdots & 0 \\
0 & \lambda_2 & 0 & \cdots & 0 \\
\vdots & 0 & \ddots & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & 0 \\
0 & 0 & \cdots & 0 & \lambda_n
\end{pmatrix},
\]

where λi is the eigenvalue corresponding to the eigenvector vi , then we have A = P DP −1 .


Each eigenvalue appears on the diagonal of D a number of times equal to its (algebraic or
geometric) multiplicity.

Example 3.2.3. Consider the matrix
\[
A = \begin{pmatrix} 3 & 0 & 0 \\ 1 & 3 & 1 \\ -4 & 0 & -1 \end{pmatrix}.
\]
The characteristic polynomial is
\[
c_A(x) = \det(xI - A) =
\begin{vmatrix} x-3 & 0 & 0 \\ -1 & x-3 & -1 \\ 4 & 0 & x+1 \end{vmatrix}
= (x-3)\begin{vmatrix} x-3 & -1 \\ 0 & x+1 \end{vmatrix}
= (x-3)^2(x+1).
\]

Thus, the eigenvalues are −1 and 3, with algebraic multiplicities

m−1 = 1, m3 = 2.

For the eigenvalue 3, we compute the corresponding eigenspace $E_3$ by solving the system $(A - 3I)x = 0$:
\[
\left(\begin{array}{ccc|c} 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ -4 & 0 & -4 & 0 \end{array}\right)
\xrightarrow{\text{row reduce}}
\left(\begin{array}{ccc|c} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right).
\]
Thus
\[
E_3 = \operatorname{Span}\{(1, 0, -1), (0, 1, 0)\}.
\]
In particular, this eigenspace is 2-dimensional, with basis {(1, 0, −1), (0, 1, 0)}.

For the eigenvalue $-1$, we find $E_{-1}$ by solving $(A + I)x = 0$:
\[
\left(\begin{array}{ccc|c} 4 & 0 & 0 & 0 \\ 1 & 4 & 1 & 0 \\ -4 & 0 & 0 & 0 \end{array}\right)
\xrightarrow{\text{row reduce}}
\left(\begin{array}{ccc|c} 1 & 0 & 0 & 0 \\ 0 & 1 & 1/4 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right).
\]
Thus
\[
E_{-1} = \operatorname{Span}\{(0, 1, -4)\}.
\]
In particular, this eigenspace is 1-dimensional, with basis $\{(0, 1, -4)\}$.
Since we have $\dim E_\lambda = m_\lambda$ for each eigenvalue $\lambda$, the matrix $A$ is diagonalizable. In particular, we have $A = PDP^{-1}$, where
\[
P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ -1 & 0 & -4 \end{pmatrix}
\quad \text{and} \quad
D = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & -1 \end{pmatrix}.
\]
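As a quick numerical sketch (not part of the original notes, assuming NumPy is available), one can recover such a diagonalization with `numpy.linalg.eig`:

```python
import numpy as np

A = np.array([[ 3., 0.,  0.],
              [ 1., 3.,  1.],
              [-4., 0., -1.]])

# columns of P are eigenvectors, D holds the eigenvalues
eigvals, P = np.linalg.eig(A)
D = np.diag(eigvals)

print(eigvals)                                   # 3, 3, -1 (in some order)
print(np.allclose(A, P @ D @ np.linalg.inv(P)))  # True: A = P D P^{-1}
```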

Example 3.2.4. Consider the matrix


 
\[
B = \begin{pmatrix} 0 & 0 & 0 \\ 2 & 0 & 0 \\ 0 & -3 & 0 \end{pmatrix}.
\]
Then
\[
c_B(x) = \det(xI - B) =
\begin{vmatrix} x & 0 & 0 \\ -2 & x & 0 \\ 0 & 3 & x \end{vmatrix} = x^3.
\]
Thus $B$ has only one eigenvalue $\lambda = 0$, with algebraic multiplicity 3. To find $E_0$ we solve
\[
\begin{pmatrix} 0 & 0 & 0 \\ 2 & 0 & 0 \\ 0 & -3 & 0 \end{pmatrix} x = 0
\]
and find the solution set
\[
E_0 = \operatorname{Span}\{(0, 0, 1)\}.
\]
So dim E0 = 1 < 3 = m0 . Thus there is no basis for C3 consisting of eigenvectors. Hence B
is not diagonalizable.

Exercises.
Recommended exercises: Exercises in [Nic, §3.3].

3.3 Hermitian and unitary matrices


In this section we introduce hermitian and unitary matrices, and study some of their im-
portant properties. These matrices are complex analogues of symmetric and orthogonal
matrices, which you saw in previous courses.
Recall that real matrices (i.e. matrices with real entries) can have complex eigenvalues
that are not real. For example, consider the matrix
 
\[
A = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}.
\]
Its characteristic polynomial is
\[
c_A(x) = \det(xI - A) = x^2 + 1,
\]
which has roots $\pm i$. Then one can find the associated eigenvectors in the usual way to see that
\[
A \begin{pmatrix} 1 \\ i \end{pmatrix} = \begin{pmatrix} i \\ -1 \end{pmatrix} = i \begin{pmatrix} 1 \\ i \end{pmatrix}
\quad \text{and} \quad
A \begin{pmatrix} 1 \\ -i \end{pmatrix} = \begin{pmatrix} -i \\ -1 \end{pmatrix} = -i \begin{pmatrix} 1 \\ -i \end{pmatrix}.
\]
Thus, when considering eigenvalues, eigenvectors, and diagonalization, it makes much more
sense to work over the complex numbers.

Definition 3.3.1 (Conjugate transpose). The conjugate transpose (or hermitian conjugate)
AH of a complex matrix is defined by

\[
A^H = (\bar{A})^T = \overline{A^T}.
\]

Other common notations for AH are A∗ and A† .

Note that AH = AT when A is real. In many ways, the conjugate transpose is the “cor-
rect” complex analogue of the transpose for real matrices, in the sense that many theorems
for real matrices involving the transpose remain true for complex matrices when you replace
“transpose” by “conjugate transpose”. We can also rewrite the inner product (3.1) as

hu, vi = uH v. (3.7)

Example 3.3.2. We have
\[
\begin{pmatrix} 2-i & -3 & i \\ -5i & 7-3i & 0 \end{pmatrix}^H
=
\begin{pmatrix} 2+i & 5i \\ -3 & 7+3i \\ -i & 0 \end{pmatrix}.
\]

Proposition 3.3.3. Suppose A, B ∈ Mm,n (C) and c ∈ C.

(a) (AH )H = A.
(b) (A + B)H = AH + B H .
(c) (cA)H = c̄AH .

If A ∈ Mm,n (C) and B ∈ Mn,k (C), then

(d) (AB)H = B H AH .

Recall that a matrix is symmetric if AT = A. The natural complex generalization of this


concept is the following.

Definition 3.3.4 (Hermitian matrix). A square complex matrix A is hermitian if AH = A,


equivalently, if Ā = AT .

Example 3.3.5. The matrix
\[
\begin{pmatrix} -2 & 2-i & 5i \\ 2+i & 1 & 4 \\ -5i & 4 & 8 \end{pmatrix}
\]
is hermitian, whereas
\[
\begin{pmatrix} 3 & i \\ i & 0 \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} i & 2-i \\ 2+i & 3 \end{pmatrix}
\]
are not. Note that entries on the main diagonal of a hermitian matrix must be real.

Proposition 3.3.6. A matrix A ∈ Mn,n (C) is hermitian if and only if

hx, Ayi = hAx, yi for all x, y ∈ Cn . (3.8)

Proof. First suppose that $A = [a_{ij}]$ is hermitian. Then, for $x, y \in \mathbb{C}^n$, we have
\[
\langle x, Ay \rangle = x^H A y = x^H A^H y = (Ax)^H y = \langle Ax, y \rangle.
\]
Conversely, suppose that (3.8) holds. Then, for all $1 \le i, j \le n$, we have
\[
a_{ij} = e_i^H A e_j = \langle e_i, A e_j \rangle = \langle A e_i, e_j \rangle = (A e_i)^H e_j = e_i^H A^H e_j = \overline{a_{ji}}.
\]
Thus $A$ is hermitian.

Proposition 3.3.7. If A ∈ Mn,n (C) is hermitian, then every eigenvalue of A is real.

Proof. Suppose we have

Ax = λx, x ∈ Cn , x 6= 0, λ ∈ C.

Then

λhx, xi = hx, λxi (by (IP3))


= hx, Axi
= xH Ax (by (3.7))
= x H AH x (since A is hermitian)
= (Ax)H x
= hAx, xi (by (3.7))

= hλx, xi
= λ̄hx, xi. (by (IP2))
Thus we have
(λ − λ̄)hx, xi = 0.
Since x 6= 0, (IP5) implies that λ = λ̄. Hence λ ∈ R.
Proposition 3.3.8. If A ∈ Mn,n (C) is hermitian, then eigenvectors of A corresponding to
distinct eigenvalues are orthogonal.
Proof. Suppose
Ax = λx, Ay = µy, x, y 6= 0, λ 6= µ.
Since A is hermitian, we have λ, µ ∈ R by Proposition 3.3.7. Then we have
λhx, yi = hλx, yi (by (IP2) and λ ∈ R)
= hAx, yi
= hx, Ayi (by Proposition 3.3.6)
= hx, µyi
= µhx, yi. (by (IP3))
Thus we have
(λ − µ)hx, yi = 0.
Since λ 6= µ, it follows that hx, yi = 0.
Proposition 3.3.9. The following conditions are equivalent for a matrix U ∈ Mn,n (C).
(a) U is invertible and U −1 = U H .
(b) The rows of U are orthonormal.
(c) The columns of U are orthonormal.
Proof. The proof of this result is almost identical to the characterization of orthogonal
matrices that you saw in previous courses. For details, see [Nic, Th. 8.2.1] and [Nic, Th. 8.7.6].

Definition 3.3.10 (Unitary matrix). A square complex matrix U is unitary if U −1 = U H .


Recall that a matrix P is orthogonal if P −1 = P T . Note that a real matrix is unitary
if and only if it is orthogonal. You should think of unitary matrices as the correct complex
analogue of orthogonal matrices.

Example 3.3.11. The matrix
\[
\begin{pmatrix} i & 1-i \\ 1 & 1+i \end{pmatrix}
\]
has orthogonal columns. However, the columns are not orthonormal (and the rows are not orthogonal). So the matrix is not unitary. However, if we normalize the columns, we obtain the unitary matrix
\[
\frac{1}{2}\begin{pmatrix} \sqrt{2}\,i & 1-i \\ \sqrt{2} & 1+i \end{pmatrix}.
\]

You saw in previous courses that symmetric real matrices are always diagonalizable. We
will see in the next section that the same is true for complex hermitian matrices. Before
discussing the general theory, let’s do a simple example that illustrates some of the ideas
we’ve seen in this section.

Example 3.3.12. Consider the hermitian matrix
\[
A = \begin{pmatrix} 2 & i \\ -i & 2 \end{pmatrix}.
\]
Its characteristic polynomial is
\[
c_A(x) = \det(xI - A) =
\begin{vmatrix} x-2 & -i \\ i & x-2 \end{vmatrix}
= (x-2)^2 - 1 = x^2 - 4x + 3 = (x-1)(x-3).
\]
Thus, the eigenvalues are 1 and 3, which are real, as predicted by Proposition 3.3.7. The corresponding eigenvectors are
\[
\begin{pmatrix} 1 \\ i \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} i \\ 1 \end{pmatrix}.
\]
These are orthogonal, as predicted by Proposition 3.3.8. Each has length $\sqrt{2}$, so an orthonormal basis of eigenvectors is
\[
\left\{ \tfrac{1}{\sqrt{2}}(1, i),\ \tfrac{1}{\sqrt{2}}(i, 1) \right\}.
\]
Thus, the matrix
\[
U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & i \\ i & 1 \end{pmatrix}
\]
is unitary, and we have that
\[
U^H A U = \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}
\]
is diagonal.
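As a quick numerical sketch (not part of the original notes, assuming NumPy is available), `numpy.linalg.eigh` is designed for hermitian matrices and returns real eigenvalues together with a unitary matrix of eigenvectors:

```python
import numpy as np

A = np.array([[2., 1j],
              [-1j, 2.]])

eigvals, U = np.linalg.eigh(A)

print(eigvals)                                             # [1. 3.]
print(np.allclose(U.conj().T @ A @ U, np.diag(eigvals)))   # True
print(np.allclose(U.conj().T @ U, np.eye(2)))              # True: U is unitary
```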

Exercises.
3.3.1. Recall that, for θ ∈ R, we define the complex exponential
\[
e^{i\theta} = \cos\theta + i\sin\theta.
\]
Find necessary and sufficient conditions on $\alpha, \beta, \gamma, \theta \in \mathbb{R}$ for the matrix
\[
\begin{pmatrix} e^{i\alpha} & e^{i\beta} \\ e^{i\gamma} & e^{i\theta} \end{pmatrix}
\]
to be hermitian. Your final answer should not involve any complex exponentials or trigono-
metric functions.

3.3.2. Show that if A is a square matrix, then det(AH ) = det A.

Additional recommended exercises: Exercises 8.7.12–8.7.17 in [Nic, §8.7].

3.4 The spectral theorem


In previous courses you studied the diagonalization of symmetric matrices. In particular, you
learned that every real symmetric matrix was orthogonally diagonalizable. In this section
we will study the complex analogues of those results.

Definition 3.4.1 (Unitarily diagonalizable matrix). An matrix A ∈ Mn,n (C) is said to be


unitarily diagonalizable if there exists a unitary matrix U such that U −1 AU = U H AU is
diagonal.

One of our goals in this section is to show that every hermitian matrix is unitarily diago-
nalizable. We first prove an important theorem which has this result as an easy consequence.

Theorem 3.4.2 (Schur’s theorem). If A ∈ Mn,n (C), then there exists a unitary matrix U
such that
U H AU = T
is upper triangular. Moreover, the entries on the main diagonal of T are the eigenvalues of
A (including multiplicities).

Proof. We prove the result by induction on n. If n = 1, then A is already upper triangular,


and we are done (just take U = I). Now suppose n > 1, and that the theorem holds for
(n − 1) × (n − 1) complex matrices.
Let λ1 be an eigenvalue of A, and let y1 be a corresponding eigenvector with ky1 k = 1.
By Proposition 3.1.2, we can extend this to an orthonormal basis

{y1 , y2 , . . . , yn }

of $\mathbb{C}^n$. Then
\[
U_1 = \begin{pmatrix} y_1 & y_2 & \cdots & y_n \end{pmatrix}
\]
is a unitary matrix. In block form, we have
\[
U_1^H A U_1 = \begin{pmatrix} y_1^H \\ y_2^H \\ \vdots \\ y_n^H \end{pmatrix}
\begin{pmatrix} \lambda_1 y_1 & Ay_2 & \cdots & Ay_n \end{pmatrix}
= \begin{pmatrix} \lambda_1 & X_1 \\ 0 & A_1 \end{pmatrix}.
\]

Now, by the induction hypothesis applied to the (n − 1) × (n − 1) matrix A1 , there exists a


unitary (n − 1) × (n − 1) matrix W1 such that

W1H A1 W1 = T1

is upper triangular. Then
\[
U_2 = \begin{pmatrix} 1 & 0 \\ 0 & W_1 \end{pmatrix}
\]
is a unitary $n \times n$ matrix. If we let $U = U_1 U_2$, then
\[
U^H = (U_1 U_2)^H = U_2^H U_1^H = U_2^{-1} U_1^{-1} = (U_1 U_2)^{-1} = U^{-1},
\]
and so $U$ is unitary. Furthermore,
\[
U^H A U = U_2^H (U_1^H A U_1) U_2
= \begin{pmatrix} 1 & 0 \\ 0 & W_1^H \end{pmatrix}
\begin{pmatrix} \lambda_1 & X_1 \\ 0 & A_1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & W_1 \end{pmatrix}
= \begin{pmatrix} \lambda_1 & X_1 W_1 \\ 0 & T_1 \end{pmatrix} = T
\]
is upper triangular.
Finally, since A and T are similar matrices, they have the same eigenvalues, and these
eigenvalues are the diagonal entries of T since T is upper triangular.
By Schur’s theorem (Theorem 3.4.2), every matrix A ∈ Mn,n (C) can be written in the
form
\[
A = U T U^H = U T U^{-1} \tag{3.9}
\]
where $U$ is unitary, $T$ is upper triangular, and the diagonal entries of $T$ are the eigenvalues of $A$. The expression (3.9) is called a Schur decomposition of $A$.
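As a quick numerical sketch (not part of the original notes, assuming NumPy and SciPy are available), `scipy.linalg.schur` computes a Schur decomposition:

```python
import numpy as np
from scipy.linalg import schur

A = np.array([[ 3., 0.,  0.],
              [ 1., 3.,  1.],
              [-4., 0., -1.]])

# output='complex' gives the (complex) Schur decomposition A = U T U^H
T, U = schur(A, output='complex')

print(np.allclose(A, U @ T @ U.conj().T))  # True
print(np.diag(T))                          # eigenvalues of A on the diagonal of T
```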
Recall that the trace of a square matrix A = [aij ] is

tr A = a11 + a22 + . . . + ann .

In other words, tr A is the sum of the entries of A on the main diagonal. Similar matrices
have the same trace and determinant (see Exercises 3.4.1 and 3.4.2).
Corollary 3.4.3. Suppose A ∈ Mn,n (C), and let λ1 , . . . , λn denote the eigenvalues of A,
including multiplicities. Then

det A = λ1 λ2 · · · λn and tr A = λ1 + λ2 + · · · + λn .

Proof. Since the statements are clearly true for triangular matrices, the corollary follows from
the fact mentioned above that similar matrices have the same determinant and trace.
Schur’s theorem states that every complex square matrix can be “unitarily triangular-
ized”. However, not every complex square matrix can be unitarily diagonalized. For example,
the matrix
\[
A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}
\]
cannot be unitarily diagonalized. You can see this by finding the eigenvectors of $A$ and seeing that there is no basis of eigenvectors (there is only one eigenvalue, but its corresponding eigenspace only has dimension one).
Theorem 3.4.4 (Spectral theorem). Every hermitian matrix is unitarily diagonalizable. In
other words, if A is a hermitian matrix, then there exists a unitary matrix U such that U H AU
is diagonal.

Proof. Suppose A is a hermitian matrix. By Schur’s Theorem (Theorem 3.4.2), there exists
a unitary matrix U such that U H AU = T is upper triangular. Then we have

T H = (U H AU )H = U H AH (U H )H = U H AU = T.

Thus T is both upper and lower triangular. Hence T is diagonal.


The terminology “spectral theorem” comes from the fact that the set of distinct eigen-
values is called the spectrum of a matrix. In previous courses, you learned the following real
analogue of the spectral theorem.

Theorem 3.4.5 (Real spectral theorem, principal axes theorem). The following conditions
are equivalent for A ∈ Mn,n (R).

(a) A has an orthonormal set of n eigenvectors in Rn .


(b) A is orthogonally diagonalizable. That is, there exists a real orthogonal matrix P such
that P −1 AP = P T AP is diagonal.
(c) A is symmetric.

A set of orthonormal eigenvectors of a symmetric matrix A is called a set of principal axes for A (hence the name of Theorem 3.4.5).
Note that the principal axes theorem states that a real matrix is orthogonally diagonalizable if and only if the matrix is symmetric. However, the converse of the spectral theorem (Theorem 3.4.4) is false, as the following example shows.

Example 3.4.6. Consider the non-hermitian matrix
\[
A = \begin{pmatrix} 0 & -2 \\ 2 & 0 \end{pmatrix}.
\]
The characteristic polynomial is
\[
c_A(x) = \det(xI - A) = x^2 + 4.
\]
Thus the eigenvalues are $2i$ and $-2i$. The corresponding eigenvectors are
\[
\begin{pmatrix} -1 \\ i \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} i \\ -1 \end{pmatrix}.
\]
These vectors are orthogonal and both have length $\sqrt{2}$. Therefore
\[
U = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 & i \\ i & -1 \end{pmatrix}
\]
is a unitary matrix such that
\[
U^H A U = \begin{pmatrix} 2i & 0 \\ 0 & -2i \end{pmatrix}
\]
is diagonal.

Why does the converse of Theorem 3.4.4 fail? Why doesn’t the proof that an orthogonally
diagonalizable real matrix is symmetric carry over to the complex setting? Let’s recall the
proof in the real case. Suppose $A$ is orthogonally diagonalizable. Then there exists a real orthogonal matrix $P$ (so $P^{-1} = P^T$) and a real diagonal matrix $D$ such that $A = PDP^T$. Then we have
\[
A^T = (PDP^T)^T = P D^T P^T = P D P^T = A,
\]
where we used the fact that $D = D^T$ for a diagonal matrix. Hence $A$ is symmetric. However, suppose we assume that $A$ is unitarily diagonalizable. Then there exists a unitary matrix $U$ and a complex diagonal matrix $D$ such that $A = UDU^H$. We have
\[
A^H = (UDU^H)^H = U D^H U^H,
\]

and here we’re stuck. We won’t have DH = D unless the entries of the diagonal matrix
D are all real. So the argument fails. It turns out that we need to introduce a stronger
condition on the matrix A.
Definition 3.4.7 (Normal matrix). A matrix N ∈ Mn,n (C) is normal if N N H = N H N .
Clearly every hermitian matrix is normal. Note that the matrix $A$ in Example 3.4.6 is also normal since
\[
AA^H = \begin{pmatrix} 0 & -2 \\ 2 & 0 \end{pmatrix}\begin{pmatrix} 0 & 2 \\ -2 & 0 \end{pmatrix}
= \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix}
= \begin{pmatrix} 0 & 2 \\ -2 & 0 \end{pmatrix}\begin{pmatrix} 0 & -2 \\ 2 & 0 \end{pmatrix} = A^H A.
\]

Theorem 3.4.8. A complex square matrix is unitarily diagonalizable if and only if it is


normal.
Proof. First suppose that A ∈ Mn,n (C) is unitarily diagonalizable. So we have

U H AU = D

for some unitary matrix U and diagonal matrix D. Since diagonal matrices commute with
each other, we have DDH = DH D. Now

DDH = (U H AU )(U H AH U ) = U H AAH U

and
DH D = (U H AH U )(U H AU ) = U H AH AU.
Hence
U H (AAH )U = U H (AH A)U.
Multiplying on the left by U and on the right by U H gives AAH = AH A, as desired.
Now suppose that A ∈ Mn,n (C) is normal, so that AAH = AH A. By Schur’s theorem
(Theorem 3.4.2), we can write
U H AU = T
for some unitary matrix U and upper triangular matrix T . Then T is also normal since

T T H = (U H AU )(U H AH U ) = U H (AAH )U = U H (AH A)U = (U H AH U )(U H AU ) = T H T.



So it is enough to show that a normal n × n upper triangular matrix is diagonal. We prove


this by induction on n. The case n = 1 is clear, since all 1×1 matrices are diagonal. Suppose
n > 1 and that all normal (n − 1) × (n − 1) upper triangular matrices are diagonal. Let
$T = [t_{ij}]$ be a normal $n \times n$ upper triangular matrix. Equating the $(1,1)$-entries of $TT^H$ and $T^H T$ gives
\[
|t_{11}|^2 + |t_{12}|^2 + \cdots + |t_{1n}|^2 = |t_{11}|^2.
\]
It follows that
\[
t_{12} = t_{13} = \cdots = t_{1n} = 0.
\]
Thus, in block form, we have
\[
T = \begin{pmatrix} t_{11} & 0 \\ 0 & T_1 \end{pmatrix}.
\]
Then
\[
T^H = \begin{pmatrix} \bar{t}_{11} & 0 \\ 0 & T_1^H \end{pmatrix}
\]
and so we have
\[
\begin{pmatrix} |t_{11}|^2 & 0 \\ 0 & T_1 T_1^H \end{pmatrix} = TT^H = T^H T = \begin{pmatrix} |t_{11}|^2 & 0 \\ 0 & T_1^H T_1 \end{pmatrix}.
\]
Thus $T_1 T_1^H = T_1^H T_1$. By our induction hypothesis, this implies that $T_1$ is diagonal. Hence $T$ is diagonal, completing the proof of the induction step.
We conclude this section with a famous theorem about matrices. Recall that if A is a
square matrix, then we can form powers Ak , k ≥ 0, of A. (We define A0 = I). Thus, we can
substitute A into polynomials. For example, if
p(x) = 2x3 − 3x2 + 4x + 5 = 2x3 − 3x2 + 4x + 5x0 ,
then
p(A) = 2A3 − 3A2 + 4A + 5I.
Theorem 3.4.9 (Cayley–Hamilton Theorem). If A ∈ Mn,n (C), then cA (A) = 0. In other
words, every square matrix is a “root” of its characteristic polynomial.
Proof. Note that, for any k ≥ 0 and invertible matrix P , we have
(P −1 AP )k = P −1 Ak P.
It follows that if p(x) is any polynomial, then
p(P −1 AP ) = P −1 p(A)P.
Thus, p(A) = 0 if and only if p(P −1 AP ) = 0. Therefore, by Schur’s theorem (Theorem 3.4.2),
we may assume that A is upper triangular. Then the eigenvalues λ1 , λ2 , . . . , λn of A appear
on the main diagonal and we have
cA (x) = (x − λ1 )(x − λ2 ) · · · (x − λn ).
Therefore
cA (A) = (A − λ1 I)(A − λ2 I) · · · (A − λn I).
Each matrix A − λi I is upper triangular. Observe that:

(a) A − λ1 I has zero first column, since the first column of A is (λ1 , 0, . . . , 0).
(b) Then (A − λ1 I)(A − λ2 I) has the first two columns zero because the second column of
(A − λ2 I) is of the form (b, 0, . . . , 0) for some b ∈ C.
(c) Next (A − λ1 I)(A − λ2 I)(A − λ3 I) has the first three columns zero because the third
column of (A − λ3 I) is of the form (c, d, 0, . . . , 0) for some c, d ∈ C.
Continuing in this manner, we see that (A − λ1 I)(A − λ2 I) · · · (A − λn I) has all n columns
zero, and hence is the zero matrix.
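As a quick numerical check (not part of the original notes, assuming NumPy is available; `np.poly` returns the coefficients of the characteristic polynomial of a square matrix):

```python
import numpy as np

A = np.array([[2., 1.],
              [0., 3.]])

# characteristic polynomial of A: x^2 - (tr A) x + det A
coeffs = np.poly(A)          # [1, -5, 6]

# evaluate c_A(A) = A^2 - 5A + 6I
cA_of_A = coeffs[0] * A @ A + coeffs[1] * A + coeffs[2] * np.eye(2)
print(np.allclose(cA_of_A, np.zeros((2, 2))))  # True, as Cayley–Hamilton predicts
```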

Exercises.
3.4.1. Suppose that A and B are similar square matrices. Show that det A = det B.

3.4.2. Suppose A, B ∈ Mn,n (C).


(a) Show that tr(AB) = tr(BA).
(b) Show that if A and B are similar, then tr A = tr B.
3.4.3. (a) Suppose that N ∈ Mn,n (C) is upper triangular with diagonal entries equal to
zero. Show that, for all j = 1, 2, . . . , n, we have

N ej ∈ Span{e1 , e2 , . . . , ej−1 },

where ei is the i-th standard basis vector. (When j = 1, we interpret the set {e1 , e2 , . . . , ej−1 }
as the empty set. Recall that Span ∅ = {0}.)
(b) Again, suppose that N ∈ Mn,n (C) is upper triangular with diagonal entries equal to
zero. Show that N n = 0.
(c) Suppose A ∈ Mn,n (C) has eigenvalues λ1 , . . . , λn , with multiplicity. Show that A = P + N for some P, N ∈ Mn,n (C) satisfying N n = 0 and P = U DU H , where U is unitary and D = diag(λ1 , . . . , λn ). Hint: Use Schur’s Theorem.

Additional exercises from [Nic, §8.7]: 8.7.5–8.7.9, 8.7.18–8.7.25.

3.5 Positive definite matrices


In this section we look at symmetric matrices whose eigenvalues are all positive. These
matrices are important in applications including optimization, statistics, and geometry. We
follow the presentation in [Nic, §8.3], except that we will consider the complex case (whereas
[Nic, §8.3] works over the real numbers).
Let
R>0 = {a ∈ R : a > 0}
denote the set of positive real numbers.

Definition 3.5.1 (Positive definite). A hermitian matrix A ∈ Mn,n (C) is positive definite if
hx, Axi ∈ R>0 for all x ∈ Cn , x 6= 0.
By Proposition 3.3.7, we know that the eigenvalues of a hermitian matrix are real.
Proposition 3.5.2. A hermitian matrix is positive definite if and only if all its eigenvalues
λ are positive, that is, λ > 0.
Proof. Suppose A is a hermitian matrix. By the spectral theorem (Theorem 3.4.4), there
exists a unitary matrix U such that U H AU = D = diag(λ1 , . . . , λn ), where λ1 , . . . , λn are
the eigenvalues of $A$. For $x \in \mathbb{C}^n$, define
\[
y = U^H x = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}.
\]
Then
\begin{align*}
\langle x, Ax \rangle = x^H A x &= x^H (U D U^H) x \\
&= (U^H x)^H D U^H x = y^H D y = \lambda_1 |y_1|^2 + \lambda_2 |y_2|^2 + \cdots + \lambda_n |y_n|^2. \tag{3.10}
\end{align*}
If every $\lambda_i > 0$, then (3.10) implies that $\langle x, Ax \rangle > 0$, since some $y_j \ne 0$ (because $x \ne 0$ and $U$ is invertible). So $A$ is positive definite.
Conversely, suppose $A$ is positive definite. For $j \in \{1, 2, \dots, n\}$, let $x = U e_j \ne 0$. Then $y = e_j$, and so (3.10) gives
\[
\lambda_j = \langle x, Ax \rangle > 0.
\]
Hence all the eigenvalues of A are positive.

Remark 3.5.3. A hermitian matrix is positive semi-definite if hx, Axi ≥ 0 for all x 6= 0 in
Cn . Then one can show that a hermitian matrix is positive semi-definite if and only if all
its eigenvalues λ are nonnegative, that is λ ≥ 0. (See Exercise 3.5.1.) One can also consider
negative definite and negative semi-definite matrices and they have analogous properties.
However, we will focus here on positive definite matrices.

Example 3.5.4. Consider the matrix
\[
A = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}.
\]
Note that A is not hermitian (which is the same as symmetric, since A is real). For any
x = (x1 , x2 ), we have
xT Ax = x21 + x22 .
Thus, if x ∈ R2 , then xT Ax is always positive when x 6= 0. However, if x = (1, i), we have
xH Ax = 2 + 2i,
which is not real. However, if A ∈ Mn,n (R) is symmetric, then A is positive definite if and only if hx, Axi > 0 for all nonzero x ∈ Rn . That is, for real symmetric matrices, it is enough to check this condition for real vectors.

Corollary 3.5.5. If A is positive definite, then it is invertible and det A ∈ R>0 .


Proof. Suppose A is positive definite. Then, by Proposition 3.5.2, the eigenvalues of A are
positive real numbers. Hence, by Corollary 3.4.3, det A ∈ R>0 . In particular, since its
determinant is nonzero, A is invertible.

Example 3.5.6. Let’s show that, for any invertible matrix U ∈ Mn,n (C), the matrix A = U H U
is positive definite. Indeed, we have

AH = (U H U )H = U H U = A,

and so A is hermitian. Also, for x ∈ Cn , x 6= 0, we have

xH Ax = xH (U H U )x = (U x)H (U x) = kU xk2 > 0,

where the last equality follows from (IP5) and the fact that U x 6= 0, since x 6= 0 and U is
invertible.

In fact, we will see that the converse to Example 3.5.6 is also true. Before verifying this,
we discuss another important concept.
Definition 3.5.7 (Principal submatrices). If A ∈ Mn,n (C), let (r) A denote the r × r subma-
trix in the upper-left corner of A; that is, (r) A is the matrix obtained from A by deleting the
last n − r rows and columns. The matrices (1) A, (2) A, . . . , (n) A = A are called the principal
submatrices of A.
Example 3.5.8. If
\[
A = \begin{pmatrix} 5 & 7 & 2-i \\ 6i & 0 & -3i \\ 2+9i & -3 & 1 \end{pmatrix},
\]
then
\[
{}^{(1)}A = \begin{pmatrix} 5 \end{pmatrix}, \quad
{}^{(2)}A = \begin{pmatrix} 5 & 7 \\ 6i & 0 \end{pmatrix}, \quad
{}^{(3)}A = A.
\]
Lemma 3.5.9. If $A \in M_{n,n}(\mathbb{C})$ is positive definite, then so is each principal submatrix ${}^{(r)}A$ for $r = 1, 2, \dots, n$.
Proof. Write
\[
A = \begin{pmatrix} {}^{(r)}A & P \\ Q & R \end{pmatrix}
\]
in block form. First note that
\[
\begin{pmatrix} {}^{(r)}A & P \\ Q & R \end{pmatrix} = A = A^H = \begin{pmatrix} ({}^{(r)}A)^H & Q^H \\ P^H & R^H \end{pmatrix}.
\]
Hence $({}^{(r)}A)^H = {}^{(r)}A$, and so ${}^{(r)}A$ is hermitian.
Now let $y \in \mathbb{C}^r$, $y \ne 0$. Define
\[
x = \begin{pmatrix} y \\ 0 \end{pmatrix} \in \mathbb{C}^n.
\]

Then $x \ne 0$ and so, since $A$ is positive definite, we have
\[
\mathbb{R}_{>0} \ni x^H A x = \begin{pmatrix} y^H & 0 \end{pmatrix}
\begin{pmatrix} {}^{(r)}A & P \\ Q & R \end{pmatrix}
\begin{pmatrix} y \\ 0 \end{pmatrix}
= y^H \, {}^{(r)}A \, y.
\]
Thus ${}^{(r)}A$ is positive definite.
We can now prove a theorem that includes the converse to Example 3.5.6.
Theorem 3.5.10. The following conditions are equivalent for any hermitian A ∈ Mn,n (C).
(a) A is positive definite.

(b) det (r) A ∈ R>0 for each r = 1, 2, . . . , n.
(c) A = U H U for some upper triangular matrix U with positive real entries on the diagonal.
Furthermore, the factorization in (c) is unique, called the Cholesky factorization of A.
Proof. We already saw that (c) =⇒ (a) in Example 3.5.6. Also, (a) =⇒ (b) by Lemma 3.5.9
and Corollary 3.5.5. So it remains to show (b) =⇒ (c).
Assume that (b) is true. We prove that (c) holds by induction on $n$. If $n = 1$, then $A = \begin{pmatrix} a \end{pmatrix}$, where $a \in \mathbb{R}_{>0}$ by (b). So we can take $U = \begin{pmatrix} \sqrt{a} \end{pmatrix}$.
Now suppose n > 1 and that the result holds for matrices of size (n − 1) × (n − 1). Define

B = (n−1) A.

Then B is hermitian and satisfies (b). Hence, by Lemma 3.5.9 and our induction hypothesis,
we have
B = UHU
for some upper triangular matrix U ∈ Mn−1,n−1 (C) with positive real entries on the main
diagonal. Since $A$ is hermitian, it has block form
\[
A = \begin{pmatrix} B & p \\ p^H & b \end{pmatrix}, \qquad p \in \mathbb{C}^{n-1},\ b \in \mathbb{R}.
\]
If we define
\[
x = (U^H)^{-1} p \quad \text{and} \quad c = b - x^H x,
\]
then block multiplication gives
\[
A = \begin{pmatrix} U^H U & p \\ p^H & b \end{pmatrix}
= \begin{pmatrix} U^H & 0 \\ x^H & 1 \end{pmatrix}
\begin{pmatrix} U & x \\ 0 & c \end{pmatrix}. \tag{3.11}
\]
Taking determinants and using (1.2) gives
\[
\det A = (\det U^H)(\det U)c = c\,|\det U|^2.
\]
(Here we use Exercise 3.3.2.) Since $\det A > 0$ by (b), it follows that $c > 0$. Thus, the factorization (3.11) can be modified to give
\[
A = \begin{pmatrix} U^H & 0 \\ x^H & \sqrt{c} \end{pmatrix}
\begin{pmatrix} U & x \\ 0 & \sqrt{c} \end{pmatrix}.
\]

Since U is upper triangular with positive real entries on the main diagonal, this proves the
induction step.
It remains to prove the uniqueness assertion in the statement of the theorem. Suppose
that
A = U H U = U1H U1
are two Cholesky factorizations. Then, by Lemma 1.6.2,

D = U U1−1 = (U H )−1 U1H (3.12)

is both upper triangular (since U and U1 are) and lower triangular (since U H and U1H are).
Thus D is a diagonal matrix. It follows from (3.12) that

U = DU1 and U1 = DH U,

and so
U = DU1 = DDH U.
Since U is invertible, this implies that DDH = I. Because the diagonal entries of D are
positive real numbers (since this is true for U and U1 ), it follows that D = I. Thus U = U1 ,
as desired.

Remark 3.5.11. (a) If the real matrix A ∈ Mn,n (R) is symmetric (hence also hermitian),
then the matrix U appearing in the Cholesky factorization A = U H U also has real
entries, and so A = U T U . See [Nic, Th. 8.3.3].
(b) Positive semi-definite matrices also have Cholesky factorizations, as long as we allow the diagonal entries of U to be zero. However, the factorization is no longer unique in general.

Theorem 3.5.10 tells us that every positive definite matrix has a Cholesky factorization.
But how do we find the Cholesky factorization?

Algorithm 3.5.12 (Algorithm for the Cholesky factorization). If A is a positive definite


matrix, then the Cholesky factorization A = U H U can be found as follows:

(a) Transform A to an upper triangular matrix U1 with positive real diagonal entries using
row operations, each of which adds a multiple of a row to a lower row.
(b) Obtain U from U1 by dividing each row of U1 by the square root of the diagonal entry
in that row.

The key is that step (a) is possible for any positive definite matrix A. Let’s do an example
before proving Algorithm 3.5.12.

Example 3.5.13. Consider the hermitian matrix
\[
A = \begin{pmatrix} 2 & i & -3 \\ -i & 5 & 2i \\ -3 & -2i & 10 \end{pmatrix}.
\]
We can compute
\[
\det {}^{(1)}A = 2 > 0, \quad \det {}^{(2)}A = 9 > 0, \quad \det {}^{(3)}A = \det A = 49 > 0.
\]
Thus, by Theorem 3.5.10, $A$ is positive definite and has a unique Cholesky factorization. We carry out step (a) of Algorithm 3.5.12 as follows:
\[
A = \begin{pmatrix} 2 & i & -3 \\ -i & 5 & 2i \\ -3 & -2i & 10 \end{pmatrix}
\xrightarrow[R_3 + \frac{3}{2}R_1]{R_2 + \frac{i}{2}R_1}
\begin{pmatrix} 2 & i & -3 \\ 0 & 9/2 & i/2 \\ 0 & -i/2 & 11/2 \end{pmatrix}
\xrightarrow{R_3 + \frac{i}{9}R_2}
\begin{pmatrix} 2 & i & -3 \\ 0 & 9/2 & i/2 \\ 0 & 0 & 49/9 \end{pmatrix} = U_1.
\]
Now we carry out step (b) to obtain
\[
U = \begin{pmatrix}
\sqrt{2} & \frac{i}{\sqrt{2}} & \frac{-3}{\sqrt{2}} \\
0 & \frac{3}{\sqrt{2}} & \frac{i\sqrt{2}}{6} \\
0 & 0 & \frac{7}{3}
\end{pmatrix}.
\]

You can then check that A = U H U .
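As a quick numerical sketch (not part of the original notes, assuming NumPy is available), `numpy.linalg.cholesky` returns a lower-triangular factor $L$ with $A = LL^H$, so the matrix $U$ above corresponds to $L^H$:

```python
import numpy as np

A = np.array([[ 2,    1j, -3 ],
              [-1j,   5,   2j],
              [-3,  -2j,  10 ]], dtype=complex)

L = np.linalg.cholesky(A)
U = L.conj().T

print(np.allclose(A, U.conj().T @ U))  # True
print(np.round(U, 4))                  # upper triangular, positive real diagonal
```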

Proof of Algorithm 3.5.12. Suppose A is positive definite, and let A = U H U be the Cholesky
factorization. Let D = diag(d1 , . . . , dn ) be the common diagonal of U and U H . (So the di
are positive real numbers.) Then U H D−1 is lower unitriangular (lower triangular with ones
on the diagonal). Thus L = (U H D−1 )−1 is also lower unitriangular. Therefore we can write

L = Er · · · E2 E1 In ,

where each Ei is an elementary matrix corresponding to a row operation that adds a multiple
of one row to a lower row (we modify columns right to left). Then we have

Er · · · E2 E1 A = LA = D(U H )−1 U H U = DU
 

is upper triangular with positive real entries on the diagonal. This proves that step (a) of
the algorithm is possible.
Now consider step (b). We have already shown that we can find a lower unitriangular
matrix L1 and an invertible upper triangular matrix U1 , with positive real entries on the
diagonal, such that L1 A = U1 . (In the notation above, L1 = Er · · · E1 and U1 = DU .) Since
A is hermitian, we have

L1 U1H = L1 (L1 A)H = L1 AH LH H H


1 = L1 AL1 = U1 L1 . (3.13)

Let $D_1 = \operatorname{diag}(d_1, \dots, d_n)$ denote the diagonal matrix with the same diagonal entries as $U_1$. Then (3.13) implies that
\[
L_1 U_1^H D_1^{-1} = U_1 L_1^H D_1^{-1}.
\]
This is both upper triangular (since $U_1 L_1^H D_1^{-1}$ is) and lower unitriangular (since $L_1 U_1^H D_1^{-1}$ is), and so must equal $I_n$. Thus
\[
U_1^H D_1^{-1} = L_1^{-1}.
\]

Now let
    D_2 = diag(√d_1, . . . , √d_n),
so that D_2² = D_1. If we define U = D_2^{-1} U_1, then

    U^H U = U_1^H D_2^{-1} D_2^{-1} U_1 = U_1^H D_2^{-2} U_1 = U_1^H D_1^{-1} U_1 = L_1^{-1} U_1 = A.

Since U = D2−1 U1 is the matrix obtained from U1 by dividing each row by the square root
of its diagonal entry, this completes the proof of step (b).
Suppose we have a linear system
Ax = b,
where A is positive definite (e.g. real symmetric positive definite). Then we can find the Cholesky
decomposition A = U^H U and consider the linear system

U H U x = b.

As with the LU decomposition, we can first solve U H y = b by forward substitution, and then
solve U x = y by back substitution. For linear systems that can be put in symmetric form,
using the Cholesky decomposition is roughly twice as efficient as using the LU decomposition.
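
As a numerical illustration (a sketch only, with a small positive definite matrix invented for the example), the two-stage solve can be carried out with NumPy/SciPy:

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    A = np.array([[4.0, 2.0, 2.0],
                  [2.0, 5.0, 3.0],
                  [2.0, 3.0, 6.0]])          # a positive definite example
    b = np.array([1.0, 2.0, 3.0])

    U = cholesky(A)                          # upper triangular U with A = U^H U
    y = solve_triangular(U.conj().T, b, lower=True)   # forward substitution: U^H y = b
    x = solve_triangular(U, y)                        # back substitution:    U x = y
    print(np.allclose(A @ x, b))             # True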

Exercises.
3.5.1. Show that a hermitian matrix is positive semi-definite if and only if all its eigenvalues
λ are nonnegative, that is, λ ≥ 0.

Additional recommended exercises: [Nic, §8.3].

3.6 QR factorization
Unitary matrices are very easy to invert, since the conjugate transpose is the inverse. Thus,
it is useful to factor an arbitrary matrix as a product of a unitary matrix and a triangular
matrix (which we’ve seen are also nice in many ways). We tackle this problem in this section.
A good reference for this material is [Nic, §8.4]. (However, see Remark 3.6.2.)

Definition 3.6.1 (QR factorization). A QR factorization of A ∈ Mm,n (C), m ≥ n, is a


factorization A = QR, where Q is an m × m unitary matrix and R is an m × n upper
triangular matrix whose entries on the main diagonal are nonnegative real numbers.

Note that the bottom m − n rows of an m × n upper triangular matrix (with m ≥ n) are
zero rows. Thus, we can write a QR factorization A = QR in block form
   
R1   R1
A = QR = Q = Q1 Q2 = Q1 R1
0 0

where R1 is an n × n upper triangular matrix whose entries on the main diagonal are
nonnegative real numbers, 0 is the (m − n) × n zero matrix, Q1 is m × n, Q2 is m × (m − n),
and Q1 , Q2 both have orthonormal columns. The factorization A = Q1 R1 is called a thin
QR factorization or reduced QR factorization of A.

Remark 3.6.2. You may find different definitions of QR factorization in other references. In
particular, some references, including [Nic, §8.6], refer to the reduced QR factorization as
a QR factorization (without using the word “reduced”). In addition, [Nic, §8.6] imposes
the condition that the columns of A are linearly independent. This is not needed for the
existence result we will prove below (Theorem 3.6.3), but it will be needed for uniqueness
(Theorem 3.6.6). When consulting other references, be sure to look at which definition they
are using to avoid confusion.

Note that, given a reduced QR factorization, one can easily obtain a QR factorization
by extending the columns of Q1 to an orthonormal basis, and defining Q2 to be the matrix
whose columns are the additional vectors in this basis. It follows that, when m > n, the
QR factorization is not unique. (One can extend to an orthonormal basis in more than
one way.) However the reduced QR factorization has some chance of being unique. (See
Theorem 3.6.6.)
The power of the QR factorization comes from the fact that there are computer algorithms
that can compute it with good control over round-off error. Finding the QR factorization
involves the Gram–Schmidt algorithm.
Recall that a matrix A ∈ Mm,n (C) has linearly independent columns if and only if its
rank is n, which can only occur if A is tall or square (i.e. m ≥ n).
Theorem 3.6.3. Every tall or square matrix has a QR factorization.
Proof. We will prove the theorem under the additional assumption that A has full rank (i.e.
rank A = n), and then make some remarks about how one can modify the proof to work
without this assumption. We show that A has a reduced QR factorization, from which it
follows (as discussed above) that A has a QR factorization.
Suppose  
A = a1 a2 · · · an ∈ Mm,n (C)
with linearly independent columns a1 , a2 , . . . , an . We can use the Gram-Schmidt algorithm
to obtain an orthogonal set u1 , . . . , un spanning the column space of A. Namely, we set
u1 = a1 and
    u_k = a_k − Σ_{j=1}^{k−1} (⟨u_j, a_k⟩/‖u_j‖²) u_j    for k = 2, 3, . . . , n.        (3.14)
If we define
1
qk = uk for each k = 1, 2, . . . , n,
kuk k
then the q1 , . . . , qn are orthonormal and (3.14) becomes
then the q_1, . . . , q_n are orthonormal and (3.14) becomes

    ‖u_k‖ q_k = a_k − Σ_{j=1}^{k−1} ⟨q_j, a_k⟩ q_j.        (3.15)

Using (3.15), we can express each ak as a linear combination of the qj :

a1 = ku1 kq1 ,
a2 = hq1 , a2 iq1 + ku2 kq2 ,
a3 = hq1 , a3 iq1 + hq2 , a3 iq2 + ku3 kq3
..
.
an = hq1 , an iq1 + hq2 , an iq2 + . . . + hqn−1 , an iqn−1 + kun kqn .

Writing these equations in matrix form gives us the factorization we're looking for:

    A = [a_1 a_2 a_3 · · · a_n]
                                  [ ‖u_1‖  ⟨q_1,a_2⟩  ⟨q_1,a_3⟩  · · ·  ⟨q_1,a_n⟩ ]
                                  [   0      ‖u_2‖    ⟨q_2,a_3⟩  · · ·  ⟨q_2,a_n⟩ ]
      = [q_1 q_2 q_3 · · · q_n]   [   0        0        ‖u_3‖    · · ·  ⟨q_3,a_n⟩ ] .
                                  [   ⋮        ⋮          ⋮        ⋱        ⋮     ]
                                  [   0        0          0      · · ·    ‖u_n‖  ]

What do we do if rank A < n? In this case, the columns of A are linearly dependent,
so some of the columns are in the span of the previous ones. Suppose that ak is the first
such column. Then, as noted in Remark 3.1.7, we get uk = 0. We can fix the proof as
follows: Let v1 , . . . , vr be an orthonormal basis of (col A)⊥ . Then, if we obtain uk = 0 in
the Gram–Schmidt algorithm above, we let qk be one of the vi , using each vi exactly once.
We then continue the proof as above, and the matrix R will have a row of zeros in the k-th
row for each k that gave uk = 0 in the Gram–Schmidt algorithm.

Remark 3.6.4. Note that for a tall or square real matrix, the proof of Theorem 3.6.3 shows
us that we can find a QR factorization with Q and R real matrices.

Example 3.6.5. Let's find a QR factorization of

        [ 1  2 ]
    A = [ 0  1 ] .
        [ 1  0 ]

We first find a reduced QR factorization. We have

a1 = (1, 0, 1) and a2 = (2, 1, 0).

Thus

    u_1 = a_1,
    u_2 = a_2 − (⟨u_1, a_2⟩/‖u_1‖²) u_1 = (2, 1, 0) − (2/2)(1, 0, 1) = (1, 1, −1).
So we have
    q_1 = u_1/‖u_1‖ = (1/√2)(1, 0, 1),
    q_2 = u_2/‖u_2‖ = (1/√3)(1, 1, −1).
We define
                      [ 1/√2    1/√3 ]               [ ‖u_1‖  ⟨q_1,a_2⟩ ]   [ √2  √2 ]
    Q_1 = [q_1 q_2] = [   0     1/√3 ]   and   R_1 = [   0      ‖u_2‖   ] = [  0  √3 ] .
                      [ 1/√2   −1/√3 ]

We can then verify that A = Q1 R1 . To get a QR factorization, we need to complete


{q1 , q2 } to an orthonormal basis. So we need to find a vector orthogonal to both q1 and q2
(equivalently, to their multiples u1 and u2 ). A vector u3 = (x, y, z) is orthogonal to both
when
0 = hu1 , (x, y, z)i = x + z and 0 = hu2 , (x, y, z)i = x + y − z.
Solving this system, we see that u3 = (1, −2, −1) is orthogonal to u1 and u2 . So we define

    q_3 = u_3/‖u_3‖ = (1/√6)(1, −2, −1).
Then, setting
        [ 1/√2   1/√3    1/√6 ]            [ √2  √2 ]
    Q = [   0    1/√3   −2/√6 ]   and  R = [  0  √3 ] ,
        [ 1/√2  −1/√3   −1/√6 ]            [  0   0 ]

we have a QR factorization A = QR.
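
As a numerical aside (a sketch, not part of the example), NumPy computes reduced and complete QR factorizations directly; note that its R may differ from ours by the signs of some rows, since the library does not enforce nonnegative diagonal entries.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])

    Q1, R1 = np.linalg.qr(A, mode='reduced')   # Q1 is 3x2, R1 is 2x2
    Q, R = np.linalg.qr(A, mode='complete')    # Q is 3x3, R is 3x2
    print(np.allclose(A, Q1 @ R1), np.allclose(A, Q @ R))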

Now that we know that QR factorizations exist, what about uniqueness? As noted earlier,
here we should focus on reduced QR factorizations.

Theorem 3.6.6. Every tall or square matrix A with linearly independent columns has a
unique reduced QR factorization A = QR. Furthermore, the matrix R is invertible.

Proof. Suppose A ∈ Mm,n (C), m ≥ n, has linearly independent columns. We know from
Theorem 3.6.3 that A has a QR factorization A = QR. Furthermore, since Q is invertible,
we have
rank R = rank(QR) = rank A = n.
Since R is m × n and m ≥ n, the columns of R are also linearly independent. It follows that
the entries on the main diagonal of R must be nonzero, and hence positive (since they are
nonnegative real numbers). This also holds for the upper triangular matrix appearing in a
reduced QR factorization.
Now suppose
A = QR and A = Q1 R1
are two reduced QR factorizations of A. By the above, the entries on the main diagonal of
R and R1 are positive. Since R and R1 are square and upper triangular, this implies that

they are invertible. We wish to show that Q = Q1 and R = R1 . Label the columns of Q and
Q1 :    
Q = c1 c2 · · · cn and Q1 = d1 d2 · · · dn .
Since Q and Q1 have orthonormal columns, we have

    Q^H Q = I_n = Q_1^H Q_1.

(Note that Q and Q_1 are not unitary matrices unless they are square. If they are not
square, then QQ^H and Q_1Q_1^H need not equal the identity. Recall our discussion of one-sided inverses in
Section 1.5.) Therefore, the equality QR = Q_1R_1 implies Q_1^H Q = R_1 R^{-1}. Let

    [t_ij] = Q_1^H Q = R_1 R^{-1}.        (3.16)

Since R and R1 are upper triangular with positive real diagonal elements, we have

tii ∈ R>0 and tij = 0 for i > j.

On the other hand, the (i, j)-entry of Q_1^H Q is d_i^H c_j. So we have

    ⟨d_i, c_j⟩ = d_i^H c_j = t_ij    for all i, j.

Since Q = Q1 (R1 R−1 ), each cj is in Span{d1 , d2 , . . . , dn }. Thus, by (3.3), we have


n j
X X
cj = hdi , cj idi = tij di ,
i=1 i=1

since hdi , cj i = tij = 0 for i > j. Writing out these equations explicitly, we have:

c1 = t11 d1 ,
c2 = t12 d1 + t22 d2 ,
c3 = t13 d1 + t23 d2 + t33 d3 , (3.17)
c4 = t14 d1 + t24 d2 + t34 d3 + t44 d4 ,
..
.
The first equation gives

1 = kc1 k = kt11 d1 k = |t11 | kd1 k = t11 , (3.18)

since t11 ∈ R>0 . Thus c1 = d1 . Then we have

t12 = hd1 , c2 i = hc1 , c2 i = 0,

and so the second equation in (3.17) gives

c2 = t22 d2 .

As in (3.18), this implies that c_2 = d_2. Then t_13 = 0 and t_23 = 0 follow in the same way.
Continuing in this way, we conclude that ci = di for all i. Thus, Q = Q1 and, by (3.16),
R = R1 , as desired.

So far our results are all about tall or square matrices. What about wide matrices?

Corollary 3.6.7. Every wide or square matrix A factors as A = LP, where P is a unitary
matrix, and L is a lower triangular matrix whose entries on the diagonal are nonnegative
real numbers. It also factors as A = L1 P1 , where L1 is a square lower triangular matrix
whose entries on the main diagonal are nonnegative real numbers and P1 has orthonormal
rows. If A has linearly independent rows, then the second factorization is unique and the
diagonal entries of L1 are positive.

Proof. We apply Theorems 3.6.3 and 3.6.6 to AT .

Corollary 3.6.8. Every invertible (hence square) matrix A has unique factorizations A = QR
and A = LP, where Q and P are unitary, R is upper triangular with positive real diagonal
entries, and L is lower triangular with positive real diagonal entries.

QR factorizations have a number of applications. We learned in Proposition 1.5.15 that


a matrix has linearly independent columns if and only if it has a left inverse if and only if
AH A is invertible. Furthermore, if A is left-invertible, then (AH A)−1 AH is a left inverse of
A. (We worked over R in Section 1.5.4, but the results are true over C as long as we replace
the transpose by the conjugate transpose.) So, in this situation, it is useful to compute
(AH A)−1 . Here Theorem 3.6.6 guarantees us that we have a reduced QR factorization
A = QR with R invertible. (And we have an algorithm for finding this factorization!) Since
Q has orthonormal columns, we have QH Q = I. Thus

AH A = RH QH QR = RH R,

and so
    (A^H A)^{-1} = R^{-1} (R^{-1})^H.
Since R is upper triangular, it is easy to find its inverse. So this gives us an efficient method
of finding left inverses.
Students who took MAT 2342 learned about finding best approximations to (possibly
inconsistent) linear systems. (See [Nic, §5.6].) In particular, if Ax = b is a linear system,
then any solution z to the normal equations

(AH A)z = AH b (3.19)

is a best approximation to a solution to Ax = b in the sense that kb − Azk is the minimum


value of kb − Axk for x ∈ Cn . (You probably worked over R in MAT 2342.) As noted above,
A has linearly independent columns if and only if AH A is invertible. In this case, there is a
unique solution z to (3.19), and it is given by

z = (AH A)−1 AH b.

As noted above, (A^H A)^{-1} A^H is a left inverse to A. We saw in Section 1.5.1 that if Ax = b
has a solution, then it must be (A^H A)^{-1} A^H b. What we're saying here is that, even if there is
no solution, (A^H A)^{-1} A^H b is the best approximation to a solution.
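
For example (a sketch with a made-up overdetermined system), from a reduced QR factorization A = Q_1 R_1 the normal equations reduce to R_1 z = Q_1^H b, which is solved by back substitution:

    import numpy as np
    from scipy.linalg import solve_triangular

    A = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])
    b = np.array([1.0, 1.0, 1.0])

    Q1, R1 = np.linalg.qr(A, mode='reduced')
    z = solve_triangular(R1, Q1.conj().T @ b)    # best approximation to a solution of Ax = b
    print(np.allclose(z, np.linalg.lstsq(A, b, rcond=None)[0]))   # agrees with a least-squares solver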

Exercises.
Recommended exercises: Exercises in [Nic, §8.4]. Keep in mind the different definition of
QR factorization used in [Nic] (see [Nic, Def. 8.6]).

3.7 Computing eigenvalues


Until now, you’ve found eigenvalues by finding the roots of the characteristic polynomial.
In practice, this is almost never done. For large matrices, finding these roots is difficult.
Instead, iterative methods for estimating eigenvalues are much better. In this section, we
will explore such methods. A reference for some of this material is [Nic, §8.5].

3.7.1 The power method


Throughout this subsection, we suppose that A ∈ Mn,n (C). We will use k · k to denote the
2-norm on Cn .
An eigenvalue λ for A is called a dominant eigenvalue if λ has algebraic multiplicity one,
and

    |λ| > |µ| for all eigenvalues µ ≠ λ.

Any eigenvector corresponding to a dominant eigenvalue is called a dominant eigenvector of


A.
Suppose A is diagonalizable, and let λ1 , . . . , λn be the eigenvalues of A, with multiplicity,
and suppose that
|λ1 | ≤ · · · ≤ |λn−1 | < λn .
(We are implicitly saying that λn is real here.) In particular, λn is a dominant eigenvalue.
Fix a basis {v1 , . . . , vn } of unit eigenvectors of A, so that Avi = λi vi for each i = 1, . . . , n.
Let's start with some unit vector x_0 ∈ C^n and recursively define a sequence of vectors
x0 , x1 , x2 , . . . and positive real numbers ky1 k, ky2 k, ky3 k, . . . by
    y_k = A x_{k−1},   x_k = y_k/‖y_k‖,   k ≥ 1.        (3.20)

The power method uses the fact that, under some mild assumptions, the sequence ky1 k, ky2 k, . . .
converges to λn , and the sequence x1 , x2 , · · · converges to a corresponding eigenvector.
We should be precise about what we mean by convergence here. You learned about
convergence of sequences of real numbers in calculus. What about convergence of vectors?
We say that a sequence of vectors x1 , x2 , . . . converges to a vector v if

    lim_{k→∞} ‖x_k − v‖ = 0.

This is equivalent to the components of the vectors xk converging (in the sense of sequences
of scalars) to the components of v.

Theorem 3.7.1. With the notation and assumptions from above, suppose that the initial
vector x0 is of the form
    x_0 = Σ_{i=1}^{n} a_i v_i   with a_n ≠ 0.

Then the power method converges, that is,

    lim_{k→∞} ‖y_k‖ = λ_n   and   lim_{k→∞} x_k = (a_n/|a_n|) v_n.

In particular, the xk converge to an eigenvector of eigenvalue λn .

Proof. Note that, by definition, xk is a unit vector that is a positive real multiple of
    A^k x_0 = A^k (Σ_{i=1}^{n} a_i v_i) = Σ_{i=1}^{n} a_i A^k v_i = Σ_{i=1}^{n} a_i λ_i^k v_i.

Thus

    x_k = (Σ_{i=1}^{n} a_i λ_i^k v_i) / ‖Σ_{i=1}^{n} a_i λ_i^k v_i‖
        = (a_n v_n + Σ_{i=1}^{n−1} a_i (λ_i/λ_n)^k v_i) / ‖a_n v_n + Σ_{i=1}^{n−1} a_i (λ_i/λ_n)^k v_i‖.

(Note that we could divide by λ_n since λ_n ≠ 0. Also, since a_n ≠ 0, the norm in the
denominator is nonzero.) Since |λ_i| < λ_n for i ≠ n, we have

    (λ_i/λ_n)^k → 0   as k → ∞.

Hence, as k → ∞, we have

    x_k → a_n v_n / ‖a_n v_n‖ = (a_n/|a_n|) v_n.
Similarly,

    ‖y_{k+1}‖ = ‖A x_k‖ = ‖a_n λ_n v_n + Σ_{i=1}^{n−1} a_i λ_i (λ_i/λ_n)^k v_i‖ / ‖a_n v_n + Σ_{i=1}^{n−1} a_i (λ_i/λ_n)^k v_i‖
              = λ_n ‖a_n v_n + Σ_{i=1}^{n−1} a_i (λ_i/λ_n)^{k+1} v_i‖ / ‖a_n v_n + Σ_{i=1}^{n−1} a_i (λ_i/λ_n)^k v_i‖  →  λ_n.

Remarks 3.7.2. (a) It is crucial that the largest eigenvalue be real. On the other hand,
the assumption that it is positive can be avoided since, if it is negative, we can apply
Theorem 3.7.1 to −A.

(b) If there are several eigenvalues with maximum norm, then the sequences kyk k and xk
will not converge in general. On the other hand, if λn has multiplicity greater than one, but
is the unique eigenvalue with maximum norm, then the sequence kyk k will always converge
to λn , but the sequence xk may not converge.
(c) If you choose an initial vector x0 at random, it is very unlikely that you will choose
one with an = 0. (This would be the same as choosing a random real number and ending up
with the real number 0.) Thus, the condition in Theorem 3.7.1 that a_n ≠ 0 is not a serious
obstacle in practice.
(d) It is possible to compute the smallest eigenvalue (in norm) by applying the power
method to A−1 . This is called the inverse power method . It is computationally more involved,
since one must solve a linear system at each iteration.

Example 3.7.3. Consider the matrix

    A = [ 3  3
          5  1 ] .

We leave it as an exercise to verify that the eigenvalues of A are −2 and 6. So 6 is a dominant


eigenvalue. Let
x0 = (1, 0)
be our initial vector. Then we compute

    y_1 = Ax_0 = (3, 5),                  ‖y_1‖ ≈ 5.830951,   x_1 = y_1/‖y_1‖ ≈ (0.514496, 0.857493),
    y_2 = Ax_1 ≈ (4.115967, 3.429973),    ‖y_2‖ ≈ 5.357789,   x_2 = y_2/‖y_2‖ ≈ (0.768221, 0.640184),
    y_3 = Ax_2 ≈ (4.225215, 4.481289),    ‖y_3‖ ≈ 6.159090,   x_3 = y_3/‖y_3‖ ≈ (0.686012, 0.727589),
    y_4 = Ax_3 ≈ (4.240803, 4.157649),    ‖y_4‖ ≈ 5.938893,   x_4 = y_4/‖y_4‖ ≈ (0.714073, 0.700071),
    y_5 = Ax_4 ≈ (4.242432, 4.270436),    ‖y_5‖ ≈ 6.019539,   x_5 = y_5/‖y_5‖ ≈ (0.704777, 0.709429).

The ‖y_k‖ are converging to 6, while the x_k are converging to (1/√2)(1, 1) ≈ (0.707107, 0.707107),
which is an eigenvector of eigenvalue 6.
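
Below is a minimal sketch of the power method in NumPy, under the assumptions of Theorem 3.7.1 (dominant eigenvalue real and positive); the matrix and starting vector are those of this example.

    import numpy as np

    def power_method(A, x0, num_iter=50):
        # y_k = A x_{k-1}, x_k = y_k / ||y_k||; returns the final norm and unit vector.
        x = x0 / np.linalg.norm(x0)
        norm_y = 0.0
        for _ in range(num_iter):
            y = A @ x
            norm_y = np.linalg.norm(y)
            x = y / norm_y
        return norm_y, x

    A = np.array([[3.0, 3.0], [5.0, 1.0]])
    lam, v = power_method(A, np.array([1.0, 0.0]))
    print(lam, v)    # approximately 6 and (0.7071, 0.7071)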

Students who took MAT 2342 learned about Markov chains. Markov chains are a partic-
ular case of the power method. The matrix A is stochastic if each entry is a nonnegative real
number and the sum of the entries in each column is equal to one. A stochastic matrix is
regular if there exists a positive integer k such that all entries of Ak are strictly positive. One
can prove that every regular stochastic matrix has 1 as a dominant eigenvalue and a unique
steady state vector , which is an eigenvector of eigenvalue 1, whose entries are all nonnegative
and sum to 1. One can then use Theorem 3.7.1 to find this steady state vector. See [Nic,
§2.9] for further details.

Remark 3.7.4. Note that the power method only allows us to compute the dominant eigen-
value (or the smallest eigenvalue in norm if we use the inverse power method). What if
we want to find other eigenvalues? In this case, the power method has serious limitations.
If A is hermitian, one can first find the dominant eigenvalue λn with eigenvector vn , and
then repeat the power method with an initial vector orthogonal to vn . At each iteration, we
subtract the projection onto vn to ensure that we remain in the subspace orthogonal to vn .
However, this is quite computationally intensive, and so is not practical for computing all
eigenvalues.

3.7.2 The QR method


The QR method is the most-used algorithm to compute all the eigenvalues of a matrix.
Here we will restrict our attention to the case where A ∈ Mn,n (R) is a real matrix whose
eigenvalues have distinct norms:

0 < |λ1 | < |λ2 | < · · · < |λn−1 | < |λn |. (3.21)

(The general case is beyond the scope of this course.) These conditions on A ensure that it
is invertible and diagonalizable, with distinct real eigenvalues. We do not assume that A is
symmetric.
The QR method consists in computing a sequence of matrices A1 , A2 , . . . with A1 = A
and
Ak+1 = Rk Qk ,
where Ak = Qk Rk is the QR factorization of Ak , for k ≥ 1. Note that, since A is invertible, it
has a unique QR factorization by Corollary 3.6.8. Recall that the matrices Qk are orthogonal
(since they are real and unitary) and the matrices Rk are upper triangular.
Since
    A_{k+1} = R_k Q_k = Q_k^{-1} A_k Q_k = Q_k^T A_k Q_k,        (3.22)
the matrices A1 , A2 , . . . are all similar, and hence have the same eigenvalues.

Theorem 3.7.5. Suppose A ∈ Mn,n (R) satisfies (3.21). In addition, assume that P −1
admits an LU factorization, where P is the matrix of eigenvectors of A, that is, A =
P diag(λ1 , . . . , λn )P −1 . Then the sequence A1 , A2 , . . . produced by the QR method converges
to an upper triangular matrix whose diagonal entries are the eigenvalues of A.

Proof. We will omit the proof of this theorem. The interested student can find a proof in
[AK08, Th. 10.6.1].

Remark 3.7.6. If the matrix A is symmetric, then so are the Ak , by (3.22). Thus, the limit
of the Ak is a diagonal matrix.

Example 3.7.7. Consider the symmetric matrix

        [ 1  3  4 ]
    A = [ 3  1  2 ] .
        [ 4  2  1 ]

Using the QR method gives

          [ 7.07467     0.0       0.0    ]
    A20 = [   0.0    −3.18788     0.0    ] .
          [   0.0       0.0    −0.88679  ]

The diagonal entries here are the eigenvalues of A to within 5 × 10−6 .
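
A minimal sketch of the (unshifted) QR method in NumPy, applied to the matrix of this example, looks as follows; after 20 iterations the diagonal entries approximate the eigenvalues.

    import numpy as np

    def qr_method(A, num_iter=20):
        Ak = np.array(A, dtype=float)
        for _ in range(num_iter):
            Q, R = np.linalg.qr(Ak)    # A_k = Q_k R_k
            Ak = R @ Q                 # A_{k+1} = R_k Q_k
        return Ak

    A = [[1, 3, 4], [3, 1, 2], [4, 2, 1]]
    print(np.round(qr_method(A), 5))   # diagonal approximately 7.07467, -3.18788, -0.88679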

In practice, the convergence of the QR method can be slow when the eigenvalues are
close together. The speed can be improved in certain ways.

• Shifting: If, at stage k of the algorithm, a number sk is chosen and Ak − sk I is factored


in the form Qk Rk rather than Ak itself, then

    Q_k^{-1} A_k Q_k = Q_k^{-1} (Q_k R_k + s_k I) Q_k = R_k Q_k + s_k I,

and so we take Ak+1 = Rk Qk +sk I. If the shifts sk are carefully chosen, one can greatly
improve convergence.
• One can first bring the matrix A to upper Hessenberg form, which is a matrix that is
nearly upper triangular (one allows nonzero entries just below the diagonal), using a
technique based on Householder reduction. Then the convergence in the QR algorithm
is faster.

See [AK08, §10.6] for further details.

3.7.3 Gershgorin circle theorem


We conclude this section with a result that can be used to bound the spectrum (i.e. set of
eigenvalues) of a square matrix. We suppose in this subsection that A = [aij ] ∈ Mn,n (C).
For i = 1, 2, . . . , n, let
    R_i = Σ_{j : j≠i} |a_ij|

be the sum of the absolute values of the non-diagonal entries in the i-th row. For a ∈ C and
r ∈ R≥0 , let
D(a, r) = {z ∈ C : |z − a| ≤ r}
be the closed disc with centre a and radius r. The discs D(aii , Ri ) are called the Gershgorin
discs of A.

Theorem 3.7.8 (Gershgorin circle theorem). Every eigenvalue of A lies in at least one of
the Gershgorin discs D(aii , Ri ).

Proof. Suppose λ is an eigenvalue of A. Let x = (x1 , . . . , xn ) be a corresponding eigenvector.


Dividing x by its component with largest absolute value, we can assume that, for some
i ∈ {1, 2, . . . , n}, we have
xi = 1 and |xj | ≤ 1 for all j.

Equating the i-th entries of Ax = λx, we have

    Σ_{j=1}^{n} a_ij x_j = λ x_i = λ.

Since x_i = 1, this implies that

    Σ_{j : j≠i} a_ij x_j + a_ii = λ.

Then we have

    |λ − a_ii| = |Σ_{j≠i} a_ij x_j|
               ≤ Σ_{j≠i} |a_ij| |x_j|    (by the triangle inequality)
               ≤ Σ_{j≠i} |a_ij|          (since |x_j| ≤ 1 for all j)
               = R_i.

Corollary 3.7.9. The eigenvalues of A lie in the Gershgorin discs corresponding to the
columns of A. More precisely, each eigenvalue lies in at least one of the discs D(a_jj, Σ_{i : i≠j} |a_ij|).

Proof. Since A and AT have the same eigenvalues, we can apply Theorem 3.7.8 to AT .
One can interpret the Gershgorin circle theorem as saying that if the off-diagonal entries
of a square matrix have small norms, then the eigenvalues of the matrix are close to the
diagonal entries of the matrix.

Example 3.7.10. Consider the matrix

    A = [ 0  1
          4  0 ] .

It has characteristic polynomial x2 − 4 = (x − 2)(x + 2), and so its eigenvalues are ±2. The
Gershgorin circle theorem tells us that the eigenvalues lie in the discs D(0, 1) and D(0, 4).

[Figure: the Gershgorin discs D(0, 1) and D(0, 4) in the complex plane, with the eigenvalues ±2 marked on the real axis.]

Example 3.7.11. Consider the matrix

    A = [ 1  −2
          1  −1 ] .
It has characteristic polynomial x2 + 1, and so its eigenvalues are ±i. The Gershgorin circle
theorem tells us that the eigenvalues lie in the discs D(1, 2) and D(−1, 1).
[Figure: the Gershgorin discs D(1, 2) and D(−1, 1) in the complex plane, with the eigenvalues ±i marked on the imaginary axis.]

Note that, in Examples 3.7.10 and 3.7.11, it was not the case that each Gershgorin disc
contained one eigenvalue. There was one disc that contained no eigenvalues, and one disc
that contained two eigenvalues. In general, one has the following strengthened version of the
Gershgorin circle theorem.
Theorem 3.7.12. If the union of k Gershgorin discs is disjoint from the union of the other
n − k discs, then the former union contains exactly k eigenvalues of A, and the latter union
contains exactly n − k eigenvalues of A.
Proof. The proof of this theorem uses a continuity argument, where one starts with Gersh-
gorin discs that are points, and gradually enlarges them. For details, see [Mey00, p. 498].

Example 3.7.13. Let's use the Gershgorin circle theorem to estimate the eigenvalues of

        [ −7   0.3   0.2 ]
    A = [  5    0     2  ] .
        [  1   −1    10  ]
The Gershgorin discs are
D(−7, 0.5), D(0, 7), D(10, 2).
By Theorem 3.7.12, we know that two eigenvalues lie in the union of discs D(−7, 0.5)∪D(0, 7)
and one lies in the disc D(10, 2).
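
As a quick computational check (a sketch only), one can list the Gershgorin discs of a matrix and compare them with numerically computed eigenvalues:

    import numpy as np

    A = np.array([[-7.0, 0.3, 0.2],
                  [ 5.0, 0.0, 2.0],
                  [ 1.0, -1.0, 10.0]])

    for i in range(A.shape[0]):
        radius = np.sum(np.abs(A[i])) - np.abs(A[i, i])   # R_i: sum of off-diagonal absolute values
        print(f"disc with centre {A[i, i]} and radius {radius}")

    print(np.linalg.eigvals(A))    # the eigenvalues, for comparison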

Exercises.
3.7.1. Suppose A ∈ Mn,n (C) has eigenvalues λ1 , . . . , λr . (We only list each eigenvalue once
here, even if it has multiplicity greater than one.) Prove that the Gershgorin discs for A are
precisely the sets {λ1 }, . . . , {λr } if and only if A is diagonal.

3.7.2. Let
        [ 1.3    0.5   0.1    0.2    0.1 ]
        [ −0.2   0.7   0      0.2    0.1 ]
    A = [ 1      −2    4      0.1   −0.1 ] .
        [ 0      0.2   −0.1   2      1   ]
        [ 0.05   0     0.1    0.5    1   ]
Use the Gershgorin circle theorem to prove that A is invertible.

3.7.3. Suppose that P is a permutation matrix. (Recall that this means that each row and
column of P have one entry equal to 1 and all other entries equal to 0.)

(a) Show that there are two possibilities for the Gershgorin discs of P .
(b) Show, using different methods, that the eigenvalues of a permutation matrix all have absolute
value 1. Compare this with your results from (a).

3.7.4. Fix z ∈ C^×, and define

        [ 0   0.1  0.1 ]        [ z  0  0 ]
    A = [ 10   0    1  ] ,  C = [ 0  1  0 ] .
        [ 1    1    0  ]        [ 0  0  1 ]

(a) Calculate the matrix B = CAC −1 . Recall that A and B have the same eigenvalues.
(b) Give the Gershgorin discs for B, and find the values of z that give the strongest
conclusion for the eigenvalues.
(c) What can we conclude about the eigenvalues of A?

Additional recommended exercises: Exercises in [Nic, §8.5].


Chapter 4

Generalized diagonalization

In this and previous courses, you’ve seen the concept of diagonalization. Diagonalizing
a matrix makes it very easy to work with in many ways: you know the eigenvalues and
eigenvectors, you can easily compute powers of the matrix, etc. However, you know that not
all matrices are diagonalizable. So it is natural to ask if there is some slightly more general
result concerning a nice form in which all matrices can be written. In this chapter we will
consider two such forms: singular value decomposition and Jordan canonical form. Students
who took MAT 2342 also saw singular value decomposition a bit in that course.

4.1 Singular value decomposition


One of the most useful tools in applied linear algebra is a factorization called singular value
decomposition (SVD). A good reference for the material in this section is [Nic, §8.6.1]. Note
however, that [Nic, §8.6] works over R, whereas we will work over C.

Definition 4.1.1 (Singular value decomposition). A singular value decomposition (SVD) of


A ∈ Mm,n (C) is a factorization
A = P ΣQH ,
where P and Q are unitary and, in block form,

    Σ = [ D  0
          0  0 ]_{m×n} ,    D = diag(d_1, d_2, . . . , d_r),   d_1, d_2, . . . , d_r ∈ R_{>0}.

Note that if P = Q in Definition 4.1.1, then A is unitarily diagonalizable. So SVD is a


kind of generalization of unitary diagonalization. Our goal in this section is to prove that
every matrix has a SVD, and to describe an algorithm for finding this decomposition. Later,
we’ll discuss some applications of SVDs.
Recall that hermitian matrices are nice in many ways. For instance, their eigenvalues are
real and they are unitarily diagonalizable by the spectral theorem (Theorem 3.4.4). Note
that for any complex matrix A (not necessarily square), both AH A and AAH are hermitian.
It turns out that, to find a SVD of A, we should study these matrices.

Lemma 4.1.2. Suppose A ∈ Mm,n (C).


(a) The eigenvalues of AH A and AAH are real and nonnegative.


(b) The matrices AH A and AAH have the same set of positive eigenvalues.

Proof. (a) Suppose λ is an eigenvalue of A^H A with eigenvector v ∈ C^n, v ≠ 0. Then we


have
kAvk2 = (Av)H Av = vH AH Av = vH (λv) = λvH v = λkvk2 .
Thus λ = kAvk2 /kvk2 ∈ R≥0 . The proof for AAH is analogous, replacing A by AH .
(b) Suppose λ is a positive eigenvalue of A^H A with eigenvector v ∈ C^n, v ≠ 0. Then
Av ∈ Cm and
AAH (Av) = A (AH A)v = A(λv) = λ(Av).


Also, we have Av ≠ 0 since A^H Av = λv ≠ 0 (because λ ≠ 0 and v ≠ 0). Thus λ is an


eigenvalue of AAH . This proves that every positive eigenvalue of AH A is an eigenvalue of
AAH . For the reverse inclusion, we replace A everywhere by AH .
We now analyze the hermitian matrix A^H A, called the Gram matrix of A (see Sec-
tion 1.5.4), in more detail.

Step 1: Unitarily diagonalize AH A


Since the matrix AH A is hermitian, the spectral theorem (Theorem 3.4.4) states that we can
choose an orthonormal basis
{q1 , q2 , . . . , qn } ⊆ Cn
consisting of eigenvectors of AH A with corresponding eigenvalues λ1 , . . . , λn . By Lemma 4.1.2(a),
we have λi ∈ R≥0 for each i. We may choose the order of the q1 , q2 , . . . , qn such that

λ1 ≥ λ2 ≥ · · · ≥ λr > 0 and λi = 0 for i > r. (4.1)

(We allow the possibility that r = 0, so that λi = 0 for all i, and also the possibility that
r = n.) By Proposition 3.3.9,

    Q = [q_1 q_2 · · · q_n]   is unitary and unitarily diagonalizes A^H A.        (4.2)

Step 2: Show that rank A = r


Recall from Section 1.3 that

rank A = dim(col A) = dim(im TA ).

We wish to show that rank A = r, where r is defined in (4.1). We do this by showing that

{Aq1 , Aq2 , . . . , Aqr } is an orthogonal basis of im TA . (4.3)

First note that, for all i, j, we have

    ⟨Aq_i, Aq_j⟩ = (Aq_i)^H Aq_j = q_i^H (A^H A) q_j = q_i^H (λ_j q_j) = λ_j q_i^H q_j = λ_j ⟨q_i, q_j⟩.

Since the set {q1 , q2 , . . . , qn } is orthonormal, this implies that

hAqi , Aqj i = 0 if i 6= j and kAqi k2 = λi kqi k2 = λi for each i. (4.4)

Thus, using (4.1), we see that

{Aq1 , Aq2 , . . . , Aqr } ⊆ im TA is an orthogonal set and Aqi = 0 if i > r. (4.5)

Since {Aq1 , Aq2 , . . . , Aqr } is orthogonal, it is linearly independent. It remains to show that
it spans im TA . So we need to show that

U := Span{Aq1 , Aq2 , . . . , Aqr }

is equal to im TA . Since we already know U ⊆ im TA , we just need to show that im TA ⊆ U .


So we need to show
Ax ∈ U for all x ∈ Cn .
Suppose x ∈ Cn . Since {q1 , q2 , . . . , qn } is a basis for Cn , we can write

x = t1 q1 + t2 q2 + · · · + tn qn for some t1 , t2 , . . . , tn ∈ C.

Then, by (4.5), we have

Ax = t1 Aq1 + t2 Aq2 + · · · + tn Aqn = t1 Aq1 + · · · + tr Aqr ∈ U,

as desired.

Step 3: Some definitions


Definition 4.1.3 (Singular values of A). The real numbers
    σ_i = √λ_i = ‖Aq_i‖   (using (4.4)),   i = 1, 2, . . . , n        (4.6)

are called the singular values of the matrix A.


Note that, by (4.1), we have

σ1 ≥ σ2 ≥ · · · ≥ σr > 0 and σi = 0 if i > r.

So the number of positive singular values is equal to r = rank A.


Definition 4.1.4 (Singular matrix of A). Define

DA = diag(σ1 , . . . , σr ).

Then the m × n matrix

    Σ_A := [ D_A  0
              0   0 ]
(in block form) is called the singular matrix of A.
Remark 4.1.5. Don’t confuse the “singular matrix of A” with the concept of a “singular
matrix” (a square matrix that is not invertible).

Step 4: Find an orthonormal basis of Cm compatible with col A


Normalize the vectors Aq1 , Aq2 , . . . , Aqr by defining
    p_i = (1/‖Aq_i‖) Aq_i = (1/σ_i) Aq_i ∈ C^m   for i = 1, 2, . . . , r   (using (4.6)).        (4.7)

It follows from (4.3) that

{p1 , p2 , . . . , pr } is an orthonormal basis of im TA = col A ⊆ Cm . (4.8)

By Proposition 3.1.2, we can find pr+1 , . . . , pm so that

{p1 , p2 , . . . , pm } is an orthonormal basis of Cm . (4.9)

Step 5: Find the decomposition


By Section 4.1 and (4.2) we have two unitary matrices
   
P = p1 p2 · · · pm ∈ Mm,m (C) and Q = q1 q2 · · · qn ∈ Mn,n (C).

We also have
    σ_i p_i = ‖Aq_i‖ p_i = Aq_i   for i = 1, 2, . . . , r,   by (4.6) and (4.7).
Using this and (4.5), we have

    AQ = [Aq_1 Aq_2 · · · Aq_n] = [σ_1 p_1 · · · σ_r p_r 0 · · · 0].

Then we compute

    P Σ_A = [p_1 · · · p_r p_{r+1} · · · p_m] [ diag(σ_1, . . . , σ_r)  0
                                                0                      0 ]
          = [σ_1 p_1 · · · σ_r p_r 0 · · · 0]
          = AQ.

Since Q−1 = QH , it follows that A = P ΣA QH . Thus we have proved the following theorem.

Theorem 4.1.6. Let A ∈ Mm,n (C), and let σ1 ≥ σ2 ≥ · · · ≥ σr > 0 be the positive singular
values of A. Then r = rank A and we have a factorization

A = P ΣA QH where P and Q are unitary matrices.

In particular, every complex matrix has a SVD.



Note that the SVD is not unique. For example, if r < m, then there are infinitely
many ways to extend {p1 , . . . , pr } to an orthonormal basis {p1 , . . . , pm } of Cm . Each such
extension leads to a different matrix P in the SVD. For another example illustrating non-
uniqueness, consider A = In . Then ΣA = In , and A = P ΣA P H is a SVD of A for any unitary
n × n matrix P .

Remark 4.1.7 (Real SVD). If A ∈ Mm,n (R), then we can find a SVD where P and Q are
real (hence orthogonal) matrices. To see this, observe that our proof is valid if we replace C
by R everywhere.

Our proof of Theorem 4.1.6 gives an algorithm for finding SVDs.

Example 4.1.8. Let's find a SVD of the matrix

    A = [ 1  −2  −1
          1   2  −1 ] .

Since this matrix is real, we can write A^T instead of A^H in our SVD algorithm. We have

            [  2  0  −2 ]
    A^T A = [  0  8   0 ]    and    A A^T = [  6  −2 ]
            [ −2  0   2 ]                   [ −2   6 ] .

As expected, both of these matrices are symmetric. It’s easier to find the eigenvalues of
AAT . This has characteristic polynomial

(x − 6)2 − 4 = x2 − 12x + 32 = (x − 4)(x − 8).

Thus, the eigenvalues of AAT are λ1 = 8 and λ2 = 4, both with multiplicity one. It follows
from Lemma 4.1.2(b) that the eigenvalues of AT A are λ1 = 8, λ2 = 4, and λ3 = 0, all with
multiplicity one. So the positive singular values of A are
    σ_1 = √λ_1 = 2√2   and   σ_2 = √λ_2 = 2.

We now find an orthonormal basis of each eigenspace of A^T A. We find the eigenvectors

    q_1 = (0, 1, 0),   q_2 = (−1/√2, 0, 1/√2),   q_3 = (1/√2, 0, 1/√2).
We now compute

    p_1 = (1/σ_1) Aq_1 = (1/(2√2)) (−2, 2) = (−1/√2, 1/√2),
    p_2 = (1/σ_2) Aq_2 = (1/2) (−√2, −√2) = (−1/√2, −1/√2).
Then define

    P = [p_1 p_2] = [ −1/√2  −1/√2
                       1/√2  −1/√2 ] ,

                        [ 0  −1/√2  1/√2 ]
    Q = [q_1 q_2 q_3] = [ 1    0     0   ] ,
                        [ 0   1/√2  1/√2 ]

    Σ = [ σ_1   0   0     = [ 2√2  0  0
           0   σ_2  0 ]        0   2  0 ] .

We can then check that A = P ΣQT .

In practice, SVDs are not computed using the above method. There are sophisticated
numerical algorithms for calculating the singular values, P , Q, and the rank of an m × n
matrix to a high degree of accuracy. Such algorithms are beyond the scope of this course.
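
For instance (a sketch only), NumPy's svd returns the singular values as a vector together with P and Q^H; its choice of P and Q may differ from the one computed above by signs.

    import numpy as np

    A = np.array([[1.0, -2.0, -1.0],
                  [1.0,  2.0, -1.0]])

    P, s, Qh = np.linalg.svd(A)                # s is approximately [2.828, 2.0]
    Sigma = np.zeros(A.shape)
    Sigma[:len(s), :len(s)] = np.diag(s)
    print(np.allclose(A, P @ Sigma @ Qh))      # True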
Our algorithm gives us a way of finding one SVD. However, since SVDs are not unique,
it is natural to ask how they are related.

Lemma 4.1.9. If A = P ΣQH is any SVD for A as in Definition 4.1.1, then

(a) r = rank A, and


(b) the d1 , . . . , dr are the singular values of A in some order.

Proof. We have

    A^H A = (P Σ Q^H)^H (P Σ Q^H) = Q Σ^H P^H P Σ Q^H = Q Σ^H Σ Q^H.        (4.10)

Thus ΣH Σ and AH A are similar matrices, and so rank(AH A) = rank(ΣH Σ) = r. As we saw


in Step 2 above, rank A = rank AH A. Hence rank A = r.
It also follows from the fact that ΣH Σ and AH A are similar that they have the same
eigenvalues. Thus
{d21 , d22 , . . . , d2r } = {λ1 , λ2 , . . . , λr },

where λ_1, λ_2, . . . , λ_r are the positive eigenvalues of A^H A. So there is some permutation τ of
{1, 2, . . . , r} such that d_i² = λ_τ(i) for each i = 1, 2, . . . , r. Therefore d_i = √λ_τ(i) = σ_τ(i) for
each i by (4.6).

Exercises.
4.1.1. Show that A ∈ Mm,n (C) is the zero matrix if and only if all of its singular values are
zero.

Recommended exercises: Exercises in [Nic, §8.6]: 8.6.2–8.6.12, 8.6.16.



4.2 Fundamental subspaces and principal components


A singular value decomposition contains a lot of useful information about a matrix. We
explore this phenomenon in this section. A good reference for most of the material here is
[Nic, §8.6.2].
Definition 4.2.1 (Fundamental subspaces). The fundamental subspaces of A ∈ Mm,n (F)
are:
• row A = Span{x : x is a row of A},
• col A = Span{x : x is a column of A},
• null A = {x : Ax = 0},
• null AH = {x : AH x = 0}.
Recall Definition 3.1.8 of the orthogonal complement U ⊥ of a subspace U of Fn . We will
need a few facts about the orthogonal complement.
Lemma 4.2.2. (a) If A ∈ Mm,n (F), then

(row Ā)⊥ = null A and (col A)⊥ = null AH .

(b) If U is any subspace of Fn , then (U ⊥ )⊥ = U .


(c) Suppose {v1 , . . . , vm } is an orthonormal basis of Fm . If U = Span{v1 , . . . , vk } for
some 1 ≤ k ≤ m, then
U ⊥ = Span{vk+1 , . . . , vm }.
Proof. (a) Let a1 , . . . , am be the rows of A. Then

    x ∈ null A ⇐⇒ Ax = 0
               ⇐⇒ [a_1; . . . ; a_m] x = [a_1 x; . . . ; a_m x] = 0    (block multiplication)
⇐⇒ ai x = 0 for all i
⇐⇒ hai , xi = 0 for all i
⇐⇒ x ∈ (Span{a1 , . . . , am })⊥ = (row Ā)⊥ . (by Lemma 3.1.9(c))

Thus null A = (row Ā)⊥ . Replacing A by AH , we get

null AH = (row AT )⊥ = (col A)⊥ .

(b) You saw this in previous courses, so we will not repeat the proof here. See [Nic,
Lem. 8.6.4].

(c) We leave this as an exercise. The proof can be found in [Nic, Lem. 8.6.4].
Now we can see that any SVD for a matrix A immediately gives orthonormal bases for
the fundamental subspaces of A.
Theorem 4.2.3. Suppose A ∈ Mm,n (F). Let A = P ΣQH be a SVD for A, where
   
P = u1 · · · um ∈ Mm,m (F) and Q = v1 · · · vn ∈ Mn,n (F)

are unitary (hence orthogonal if F = R) and


 
D 0
Σ= , where D = diag(d1 , d2 , . . . , dr ), with each di ∈ R>0 .
0 0 m×n

Then
(a) r = rank A, and the positive singular values of A are d1 , d2 , . . . , dr ;
(b) the fundamental spaces are as follows:
(i) {u1 , . . . , ur } is an orthonormal basis of col A,
(ii) {ur+1 , . . . , um } is an orthonormal basis of null AH ,
(iii) {vr+1 , . . . , vn } is an orthonormal basis of null A,
(iv) {v1 , . . . , vr } is an orthonormal basis of row A.
Proof. (a) This is Lemma 4.1.9.
(b) (i) Since Q is invertible, we have col A = col(AQ) = col(P Σ). Also

    P Σ = [u_1 · · · u_m] [ diag(d_1, . . . , d_r)  0
                            0                      0 ] = [d_1 u_1 · · · d_r u_r 0 · · · 0].

Thus
col A = Span{d1 u1 , . . . , dr ur } = Span{u1 , . . . , ur }.
Since the u1 , . . . , ur are orthonormal, they are linearly independent. So {u1 , . . . , ur } is an
orthonormal basis for col A.
(ii) We have

null AH = (col A)⊥ (by Lemma 4.2.2(a))


= (Span{u1 , . . . , ur })⊥
= Span{ur+1 , . . . , um }. (by Lemma 4.2.2(c))

(iii) We first show that the proposed basis has the correct size. By the Rank-Nullity
Theorem (see (1.7)), we have

dim(null A) = n − r = dim(Span{vr+1 , . . . , vn }).

So, if we can show that


Span{vr+1 , . . . , vn } ⊆ null A, (4.11)

it will follow that Span{vr+1 , . . . , vn } = null A (since the two spaces have the same dimen-
sion), and hence that {vr+1 , . . . , vn } is a basis for null A (since it is linearly independent
because it is orthonormal).
To show the inclusion (4.11), it is enough to show that vj ∈ null A (i.e. Avj = 0) for
j > r. Define

dr+1 = · · · = dn = 0, so that ΣH Σ = diag(d21 , . . . , d2n ).

Then, for 1 ≤ j ≤ n, we have

(AH A)vj = (QΣH ΣQH )vj (by (4.10))


= (QΣH ΣQH )Qej
= Q(ΣH Σ)ej
= Q(d2j ej )
= d2j Qej
= d2j vj .

Thus, for 1 ≤ j ≤ n,

kAvj k2 = (Avj )H Avj = vjH AH Avj = vjH (d2j vj ) = d2j kvj k2 = d2j .

In particular, Avj = 0 for j > r, as desired.


(iv) First note that

(row Ā)⊥ = null A (by Lemma 4.2.2(a))


= Span{vr+1 , . . . , vn }. (by (iii))

Then
⊥
row Ā = (row Ā)⊥ (by Lemma 4.2.2(b))

= (Span{vr+1 , . . . , vn })
= Span{v1 , . . . , vr }. (by Lemma 4.2.2(c))

Taking complex conjugates, we have row A = Span{v1 , . . . , vr }.

Example 4.2.4. Suppose we want to solve the homogeneous linear system

Ax = 0 of m equations in n variables.

The set of solutions is precisely null A. If we compute a SVD A = U ΣV T for A then, in the
notation of Theorem 4.2.3, the set {vr+1 , . . . , vn } is an orthonormal basis of the solution set.
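
As a small illustration (a sketch, with an arbitrary rank-one system), the rows of Q^H returned by NumPy's svd beyond the numerical rank give an orthonormal basis of null A:

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0]])            # rank 1, so null A has dimension 2

    P, s, Qh = np.linalg.svd(A)
    tol = max(A.shape) * np.finfo(float).eps * s[0]   # a common tolerance heuristic
    r = int(np.sum(s > tol))                          # numerical rank
    null_basis = Qh[r:].conj().T                      # columns v_{r+1}, ..., v_n
    print(np.allclose(A @ null_basis, 0))             # True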

SVDs are also closely related to principal component analysis. If A = P ΣQT is a SVD
with, in the notation of Theorem 4.2.3,

σ1 = d1 > σ2 = d2 > · · · > σn = dn > 0,



then we have

    A = P Σ Q^T
      = [u_1 · · · u_m] [ diag(σ_1, . . . , σ_r)  0   [v_1 · · · v_n]^H
                          0                      0 ]
      = [σ_1 u_1 · · · σ_r u_r 0 · · · 0] [v_1^H; . . . ; v_n^H]
      = σ_1 u_1 v_1^H + · · · + σ_r u_r v_r^H.    (block multiplication)

Thus, we have written A as a sum of r rank one matrices (see Exercise 4.2.1), called the
principal components of A. For 1 ≤ t ≤ r, the matrix

At := σ1 u1 v1H + . . . + σt ut vtH

is a truncated matrix that is the closest approximation to A by a rank t matrix, in the


sense that the difference between A and At has the smallest possible Frobenius norm. This
result is known as the Eckart–Young–Mirsky Theorem. Such approximations are particularly
useful in data analysis and machine learning. There exist algorithms for finding At without
computing a full SVD.
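
As a sketch (using NumPy and a randomly generated matrix), the truncation At is formed from the first t singular triples, and its error in the Frobenius norm is the tail of the singular values:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 4))

    P, s, Qh = np.linalg.svd(A, full_matrices=False)
    t = 2
    A_t = P[:, :t] @ np.diag(s[:t]) @ Qh[:t]          # best rank-t approximation
    print(np.linalg.norm(A - A_t, 'fro'))             # equals sqrt(s[t]^2 + s[t+1]^2 + ...)
    print(np.sqrt(np.sum(s[t:] ** 2)))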

Exercises.
4.2.1. Suppose u ∈ Cn and v ∈ Cm . Show that if u and v are both nonzero, then the rank
of the matrix uvH is 1.

4.3 Pseudoinverses
In Section 1.5 we discussed left and right inverses. In particular, we discussed the pseudoin-
verse in Section 1.5.4. This was a particular left/right inverse; in general left/right inverses
are not unique. We’ll now discuss this concept in a bit more detail, seeing what property
uniquely characterizes the pseudoinverse and how we can compute it using a SVD. A good
reference for this material is [Nic, §8.6.4].

Definition 4.3.1 (Middle inverse). A middle inverse of A ∈ Mm,n (F) is a matrix B ∈


Mn,m (F) such that
ABA = A and BAB = B.

Examples 4.3.2. (a) Suppose A is left-invertible, with left inverse B. Then

ABA = AI = A and BAB = IB = B,



so B is a middle inverse of A. Conversely if C is any other middle inverse of A, then

ACA = A =⇒ BACA = BA =⇒ CA = I,

and so C is a left inverse of A. Thus, middle inverses of A are the same as left inverses.
(b) If A is right-invertible, then middle inverses of A are the same as right inverses. The
proof is analogous to the one above.
(c) It follows that if A is invertible, then middle inverses are the same as inverses.

In general, middle inverses are not unique, even for square matrices.

Example 4.3.3. If
    A = [ 1  0  0
          0  0  0 ] ,
then
        [ 1  b ]
    B = [ 0  0 ]
        [ 0  0 ]
is a middle inverse for any b.

While the middle inverse is not unique in general, it turns out that it is unique if we
require that AB and BA be hermitian.

Theorem 4.3.4 (Penrose’ Theorem). For any A ∈ Mm,n (C), there exists a unique B ∈
Mn,m (C) such that

(P1) ABA = A and BAB = B.


(P2) Both AB and BA are hermitian.

Proof. A proof can be found in [Pen55].

Definition 4.3.5 (Pseudoinverse). The pseudoinverse (or Moore–Penrose inverse) of A ∈


Mm,n (C) is the unique A+ ∈ Mn,m (C) such that A and A+ satisfy (P1) and (P2), that is:

AA+ A = A, A+ AA+ = A+ , and both AA+ and A+ A are hermitian.

If A is invertible, then A+ = A−1 , as follows from Example 4.3.2. Also, the symmetry in
the conditions (P1) and (P2) imply that A++ = A.
The following proposition shows that the terminology pseudoinverse, as used above, co-
incides with our use of this terminology in Section 1.5.4.

Proposition 4.3.6. Suppose A ∈ Mm,n (C).

(a) If rank A = m, then AAH is invertible and A+ = AH (AAH )−1 .


(b) If rank A = n, then AH A is invertible and A+ = (AH A)−1 AH .

Proof. We prove the first statement; the proof of the second is similar. If rank A = m,
then the rows of A are linearly independent and so, by Proposition 1.5.16(a) (we worked
over R there, but the same result holds over C if we replace the transpose by the conjugate
transpose), AAH is invertible. Then

A AH (AAH )−1 A = (AAH )(AAH )−1 A = IA = A




and

AH (AAH )−1 A AH (AAH )−1 = AH (AAH )−1 (AAH )(AAH )−1 = AH (AAH )−1 .
 

Furthermore, A AH (AAH )−1 = I is hermitian, and




    (A^H (AA^H)^{-1} A)^H = A^H (AA^H)^{-1} A,


so AH (AAH )−1 A is also hermitian.




It turns out that if we have a SVD for A, then it is particularly easy to compute the
pseudoinverse.

Proposition 4.3.7. Suppose A ∈ Mm,n (C) and A = P ΣQH is a SVD for A as in Defini-
tion 4.1.1, with
 
D 0
Σ= , D = diag(d1 , d2 , . . . , dr ), d1 , d2 , . . . , dr ∈ R>0 .
0 0 m×n

Then A^+ = Q Σ′ P^H, where

    Σ′ = [ D^{-1}  0
            0      0 ]_{n×m} .

Proof. It is straightforward to verify that

    Σ Σ′ Σ = Σ,   Σ′ Σ Σ′ = Σ′,   Σ Σ′ = [ I_r  0
                                           0    0 ]_{m×m} ,   Σ′ Σ = [ I_r  0
                                                                       0    0 ]_{n×n} .

In particular, Σ′ is the pseudoinverse of Σ. Now let B = Q Σ′ P^H. Then

    ABA = (P ΣQ^H)(QΣ′P^H)(P ΣQ^H) = P ΣΣ′ΣQ^H = P ΣQ^H = A.

Similarly, BAB = B. Furthermore,

    AB = P(ΣΣ′)P^H   and   BA = Q(Σ′Σ)Q^H

are both hermitian. Thus B = A+ .

Example 4.3.8. Consider the matrix

    A = [ 1  0  0
          0  0  0 ] .

We saw in Example 4.3.3 that
        [ 1  b ]
    B = [ 0  0 ]
        [ 0  0 ]
is a middle inverse for A for any choice of b. We have

    AB = [ 1  b               [ 1  0  0 ]
           0  0 ]   and  BA = [ 0  0  0 ] .
                              [ 0  0  0 ]

So BA is always symmetric, but AB is symmetric exactly when b = 0. Hence B = A+ if


and only if b = 0.
Let's compute A^+ using a SVD of A. The matrix

            [ 1  0  0 ]
    A^T A = [ 0  0  0 ]
            [ 0  0  0 ]

has eigenvalues λ1 = 1, λ2 = λ3 = 0 with orthonormal eigenvectors

q1 = (1, 0, 0), q2 = (0, 1, 0), q3 = (0, 0, 1).


 
Thus we take Q = q1 q2 q3 = I3 . In addition, A has rank 1 with singular values

σ1 = 1, σ2 = 0, σ3 = 0.

Thus
    Σ_A = [ 1  0  0
            0  0  0 ] = A
and
         [ 1  0 ]
    Σ′ = [ 0  0 ] = A^T .
         [ 0  0 ]
We then compute
    p_1 = (1/σ_1) Aq_1 = (1, 0).
We extend this to an orthonormal basis of C² by choosing
    p_2 = (0, 1).

Hence
    P = [p_1 p_2] = I_2.
Thus a SVD for A is A = P Σ_A Q^T = Σ_A = A. Therefore the pseudoinverse of A is

                                  [ 1  0 ]
    A^+ = Q Σ′_A P^T = Σ′_A   =   [ 0  0 ]   =   A^T .
                                  [ 0  0 ]
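
As a numerical check (a sketch only), NumPy's pinv computes the Moore–Penrose pseudoinverse via an SVD, and agrees with the formula of Proposition 4.3.6 when the columns are linearly independent:

    import numpy as np

    A = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
    print(np.linalg.pinv(A))                  # equals A.T, as computed above

    B = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])                # linearly independent columns
    print(np.allclose(np.linalg.pinv(B), np.linalg.inv(B.T @ B) @ B.T))   # (B^H B)^{-1} B^H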

We conclude this section with a list of properties of the pseudoinverse, many of which
parallel properties of the inverse.

Proposition 4.3.9. Suppose A ∈ Mm,n (C).

(a) A++ = A.
(b) If A is invertible, then A+ = A−1 .
(c) The pseudoinverse of a zero matrix is its transpose.
(d) (AT )+ = (A+ )T .
(e) (Ā)^+ is the entrywise complex conjugate of A^+.
(f) (AH )+ = (A+ )H .
(g) (zA)+ = z −1 A+ for z ∈ C× .
(h) If A ∈ Mm,n (R), then A+ ∈ Mn,m (R).

Proof. We’ve already proved the first two. The proof of the remaining properties is left as
Exercise 4.3.3.

Exercises.
4.3.1. Suppose that B is a middle inverse for A. Prove that B T is a middle inverse for AT .

4.3.2. A square matrix M is idempotent if M 2 = M . Show that, if B is a middle inverse for


A, then AB and BA are both idempotent matrices.

4.3.3. Complete the proof of Proposition 4.3.9.

Additional recommended exercises from [Nic, §8.6]: 8.6.1, 8.6.17.

4.4 Jordan canonical form


One of the most fundamental results in linear algebra is the classification of matrices up to
similarity. It turns out that every matrix is similar to a matrix of a special form, called
Jordan canonical form. In this section we discuss this form. We will omit some of the
technical parts of the proof, referring to [Nic, §§11.1, 11.2] for details. Throughout this
section we work over the field C of complex numbers.
By Schur’s theorem (Theorem 3.4.2), every matrix is unitarily similar to an upper trian-
gular matrix. The following theorem shows that, in fact, every matrix is similar to a special
type of upper triangular matrix.

Proposition 4.4.1 (Block triangulation). Suppose A ∈ Mn,n (C) has characteristic polyno-
mial
cA (x) = (x − λ1 )m1 (x − λ2 )m2 · · · (x − λk )mk

where λ1 , λ2 , . . . , λk are the distinct eigenvalues of A. Then there exists an invertible matrix
P such that
               [ U_1   0    0   · · ·   0  ]
               [  0   U_2   0   · · ·   0  ]
    P^{-1}AP = [  0    0   U_3  · · ·   0  ]
               [  ⋮    ⋮    ⋮    ⋱      ⋮  ]
               [  0    0    0   · · ·  U_k ]
where each U_i is an m_i × m_i upper triangular matrix with all entries on the main diagonal
equal to λ_i.

Proof. The proof proceeds by induction on n. See [Nic, Th. 11.1.1] for details.

Recall that if λ is an eigenvalue of A, then the associated eigenspace is

Eλ = null(λI − A) = {x : (λI − A)x = 0} = {x : Ax = λx}.

Definition 4.4.2 (Generalized eigenspace). If λ is an eigenvalue of the matrix A, then the


associated generalized eigenspace is

Gλ = Gλ (A) := null(λI − A)mλ ,

where mλ is the algebraic multiplicity of λ. (We use the notation Gλ (A) if we want to
emphasize the matrix.)

Note that we always have


Eλ (A) ⊆ Gλ (A).

Example 4.4.3. Consider the matrix

    A = [ 0  1
          0  0 ] ∈ M_{2,2}(C).

The only eigenvalue of A is λ = 0. The associated eigenspace is

E0 (A) = Span{(1, 0)}.

However, since A2 = 0, we have


G0 = C2 .

Recall that the geometric multiplicity of the eigenvalue λ of A is dim Eλ and we have
dim Eλ ≤ mλ . This inequality can be strict, as in Example 4.4.3. (In fact, it is strict for
some eigenvalue precisely when the matrix A is not diagonalizable.)

Lemma 4.4.4. If λ is an eigenvalue of A, then dim Gλ (A) = mλ , where mλ is the algebraic


multiplicity of λ.

Proof. Choose P as in Proposition 4.4.1 and choose an eigenvalue λi . To simplify notation,


let B = (λi I − A)mi . We have an isomorphism

    G_λ(A) = null B → null(P^{-1}BP),   x ↦ P^{-1}x        (4.12)

(Exercise 4.4.1). Thus, it suffices to show that dim null(P −1 BP ) = mi . Using the notation
of Proposition 4.4.1, we have λ = λi for some i and

    P^{-1}BP = (λ_i I − P^{-1}AP)^{m_i}

               [ λ_i I − U_1       0        · · ·       0       ]^{m_i}
             = [     0        λ_i I − U_2   · · ·       0       ]
               [     ⋮             ⋮          ⋱         ⋮       ]
               [     0             0        · · ·  λ_i I − U_k  ]

               [ (λ_i I − U_1)^{m_i}          0            · · ·           0          ]
             = [         0           (λ_i I − U_2)^{m_i}   · · ·           0          ] .
               [         ⋮                    ⋮              ⋱             ⋮          ]
               [         0                    0            · · ·  (λ_i I − U_k)^{m_i} ]

The matrix λ_i I − U_j is upper triangular with main diagonal entries equal to λ_i − λ_j; so
(λ_i I − U_j)^{m_i} is invertible when i ≠ j and is the zero matrix when i = j (see Exercise 3.4.3(b)).
Therefore m_i = dim null(P^{-1}BP), as desired.

Definition 4.4.5 (Jordan block). For n ∈ Z_{>0} and λ ∈ C, the Jordan block J(n, λ) is the
n × n matrix with λ's on the main diagonal, 1's on the diagonal directly above it, and 0's elsewhere.
By convention, we set J(1, λ) = [λ].

We have

    J(1, λ) = [λ] ,    J(2, λ) = [ λ  1
                                   0  λ ] ,

              [ λ  1  0 ]               [ λ  1  0  0 ]
    J(3, λ) = [ 0  λ  1 ] ,  J(4, λ) =  [ 0  λ  1  0 ] ,   etc.
              [ 0  0  λ ]               [ 0  0  λ  1 ]
                                        [ 0  0  0  λ ]

Our goal is to show that Proposition 4.4.1 holds with each block Ui replaced by a Jordan
block. The key is to show the result for λ = 0. We say a linear operator (or matrix) T
is nilpotent if T m = 0 for some m ≥ 1. Every eigenvalue of a nilpotent linear operator or
matrix is equal to zero (Exercise 4.4.2). The converse also holds by Proposition 4.4.1 and
Exercise 3.4.3(b).

Lemma 4.4.6. If A ∈ Mn,n (C) is nilpotent, then there exists an invertible matrix P such
that
P −1 AP = diag(J1 , J2 , . . . , Jk ),
where each Ji is a Jordan block J(m, 0) for some m.

Proof. The proof proceeds by induction on n. See [Nic, Lem. 11.2.1] for details.

Theorem 4.4.7 (Jordan canonical form). Suppose A ∈ Mn,n (C) has distinct (i.e. non-
repeated) eigenvalues λ1 , . . . , λk . For 1 ≤ i ≤ k, let mi be the algebraic multiplicity of λi .
Then there exists an invertible matrix P such that

P −1 AP = diag(J1 , . . . , Jm ), (4.13)

where each J_ℓ is a Jordan block corresponding to some eigenvalue λ_i. Furthermore, the sum
of the sizes of the Jordan blocks corresponding to λ_i is equal to m_i. The form (4.13) is called
a Jordan canonical form of A.

Proof. By Proposition 4.4.1, there exists an invertible matrix Q such that

               [ U_1   0    0   · · ·   0  ]
               [  0   U_2   0   · · ·   0  ]
    Q^{-1}AQ = [  0    0   U_3  · · ·   0  ] ,
               [  ⋮    ⋮    ⋮    ⋱      ⋮  ]
               [  0    0    0   · · ·  U_k ]

where each Ui is upper triangular with entries on the main diagonal equal to λi . Suppose
that for each Ui we can find an invertible matrix Pi as in the statement of the theorem. Then
 −1  
P1 0 0 ··· 0 P1 0 0 · · · 0
 0 P2 0
 ··· 0 
 0 P2 0 · · · 0 
 
 0 0 P3
 ··· 0 −1  0 0 P3 · · · 0 
 Q AQ  
 .. .. .. ..   .. .. .. .. 
. . . .   . . . .
0 0 0 · · · Pk 0 0 0 · · · Pk
 −1 
P1 U1 P1 0 0 ··· 0
−1

 0 P2 U2 P1 0 ··· 0 

−1
=
 0 0 P3 U3 P3 ··· 0 

 .. .. .. .. 
 . . . . 
0 0 0 · · · Pk−1 Uk Pk

would be in Jordan canonical form.


So it suffices to prove the theorem in the case that A is an upper triangular matrix
with diagonal entries all equal to some λ ∈ C. Then A − λI is strictly upper triangular
(upper triangular with zeros on the main diagonal), and hence nilpotent by Exercise 3.4.3(b).
Therefore, by Lemma 4.4.6, there exists an invertible matrix P such that

P −1 (A − λI)P = diag(J1 , . . . , J` ),

where each Ji is a Jordan block J(m, 0) for some m. Then

P −1 AP = P −1 (A − λI) + λI P = P −1 (A − λI)P + λI = diag(J1 + λI, . . . , J` + λI),




and each Ji + λI is a Jordan block J(m, λ) for some m.



Remarks 4.4.8. (a) Suppose that T : V → V is a linear operator on a finite-dimensional


vector space V . If we choose a basis B of V , then we can find the matrix of T relative
to this basis. It follows from Theorem 4.4.7 that we can choose the basis B so that
this matrix is in Jordan canonical form.
(b) The Jordan canonical form of a matrix (or linear operator) is uniquely determined up
to reordering the Jordan blocks. In other words, for each eigenvalue λ, the number and
size of the Jordan blocks corresponding to λ are uniquely determined. This is most
easily proved using more advanced techniques from the theory of modules. It follows
that two matrices are similar if and only if they have the same Jordan canonical form
(up to re-ordering of the Jordan blocks).
(c) A matrix is diagonalizable if and only if, in its Jordan canonical form, all Jordan blocks
have size 1.

Exercises.
4.4.1. Show that (4.12) is an isomorphism.

4.4.2. Show that if T : V → V is a nilpotent linear operator, then zero is the only eigenvalue
of T .

Additional exercises from [Nic, §11.2]: 11.2.1, 11.2.2.

4.5 The matrix exponential


As discussed in Section 3.4, we can substitute a matrix A ∈ Mn,n (C) into any polynomial.
In this section, we’d like to make sense of the expression eA . Recall from calculus that, for
x ∈ C, we have the power series

    e^x = 1 + x + x²/2 + x³/3! + · · · = Σ_{m=0}^{∞} x^m/m!.

(You probably only saw this for x ∈ R, but it holds for complex values of x as well.) So we
might naively try to define

    e^A = Σ_{m=0}^{∞} (1/m!) A^m.

In fact, this is the correct definition. However, we need to justify that this infinite sum
makes sense (i.e. that it converges) and figure out how to compute it. To do this, we use
the Jordan canonical form.

As a warm up, let’s suppose A = diag(a1 , . . . , an ) is diagonal. Then


    e^A = Σ_{m=0}^{∞} (1/m!) A^m = Σ_{m=0}^{∞} (1/m!) diag(a_1^m, . . . , a_n^m) = diag(e^{a_1}, . . . , e^{a_n}).        (4.14)

So exponentials of diagonal matrices exist and are easy to compute.


Now suppose A is arbitrary. By Theorem 4.4.7, there exists an invertible matrix P such
that B = P^{-1}AP is in Jordan canonical form. For each m, we have

    (1/m!) A^m = (1/m!) (P B P^{-1})^m = P ((1/m!) B^m) P^{-1}.
Summing over m, it follows that
eA = P eB P −1
provided that the infinite sum for eB converges. If A is diagonalizable, then its Jordan
canonical form B is diagonal, and we can then compute eB as in (4.14).
The above discussion shows that we can restrict our attention to matrices in Jordan
canonical form. However, since the Jordan canonical form might not be diagonal, we have
some more work to do. Note that if A is in Jordan canonical form, we have
A = diag(J1 , . . . , Jk )
for some Jordan blocks Ji . But then, for all m, we have
Am = diag(J1m , . . . , Jkm ).
Thus
eA = diag(eJ1 , . . . , eJk )
provided that the sum for each eJi converges. So it suffices to consider the case where A is
a single Jordan block.
Let’s first consider the case of a Jordan block J = J(n, 0) corresponding to eigenvalue
zero. Then we know from Exercise 3.4.3(b) that J n = 0. Hence
    e^J = Σ_{m=0}^{∞} (1/m!) J^m = Σ_{m=0}^{n−1} (1/m!) J^m.        (4.15)

Since this sum is finite, there are no convergence issues.


Now consider an arbitrary Jordan block J(n, λ), λ ∈ C. We have
    J(n, λ) = λI + J(n, 0).
Now, if A, B ∈ M_{n,n}(C) commute (i.e. AB = BA), then the usual argument shows that
eA+B = eA eB .
(It is crucial that A and B commute; see Exercise 4.5.2.) Thus
eJ(n,λ) = eλI+J(n,0) = eλI eJ(n,0) = eλ IeJ(n,0) = eλ eJ(n,0) .
So we can compute the exponential of any Jordan block. Hence, by our above discussion,
we can compute the exponential of any matrix. We sometimes write exp(A) for eA .

Example 4.5.1. Suppose
        [ 2  1   0 ]
    A = [ 0  2   0 ] .
        [ 0  0  −1 ]
We have
    exp [ 2  1    =  e² exp [ 0  1    =  e² [ 1  1
          0  2 ]              0  0 ]          0  1 ]
and
    exp[−1] = e^{-1}.
Hence
          [ e²  e²    0     ]
    e^A = [ 0   e²    0     ] .
          [ 0   0   e^{-1}  ]
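
As a numerical check (a sketch only), SciPy's expm computes the matrix exponential and reproduces the result of this example; it can also evaluate the solution e^{(t−t0)A} x_0 of an initial value problem at a given time t.

    import numpy as np
    from scipy.linalg import expm

    A = np.array([[2.0, 1.0,  0.0],
                  [0.0, 2.0,  0.0],
                  [0.0, 0.0, -1.0]])
    print(expm(A))                      # matches the matrix e^A computed above

    x0, t0, t = np.array([1.0, -2.0, 3.0]), 0.0, 1.5
    x_t = expm((t - t0) * A) @ x0       # solution of x'(t) = A x(t), x(t0) = x0, at time t
    print(x_t)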

Proposition 4.5.2. For A ∈ Mn,n (C) we have det eA = etr A .


Proof. We first prove the result for a Jordan block J = J(n, λ). We have
    det e^J = det(e^λ e^{J(n,0)}) = (e^λ)^n det e^{J(n,0)} = e^{nλ} · 1 = e^{tr J},


where we have used the fact (which follows from (4.15)) that eJ(n,0) is upper triangular
with all entries on the main diagonal equal to 1 (hence det eJ(n,0) = 1). Now suppose A is
arbitrary. Since similar matrices have the same determinant and trace (see Exercises 3.4.1
and 3.4.2), we can use Theorem 4.4.7 to assume that A is in Jordan canonical form. So

A = diag(J1 , . . . , Jk )

for some Jordan blocks Ji . Then

det eA = det diag(eJ1 , eJ2 , . . . , eJk )




= (det eJ1 )(det eJ2 ) · · · (det eJk ) (by (1.2))


tr J1 tr J2 tr Jk
=e e ···e (by the above)
= etr J1 +tr J2 +···+tr Jk (since ea+b = ea eb for a, b ∈ C)
= etr(diag(J1 ,J2 ,...,Jk ))
= etr A .

The matrix exponential can be used to solve certain initial value problems. You probably
learned in calculus that the solution to the differential equation

    x′(t) = ax(t),   x(t_0) = x_0,   a, t_0, x_0 ∈ R,

is

    x(t) = e^{(t−t_0)a} x_0.

In fact, this continues to hold more generally. If

    x(t) = (x_1(t), . . . , x_n(t)),   x′(t) = (x′_1(t), . . . , x′_n(t)),   x_0 ∈ R^n,   A ∈ M_{n,n}(C),

then the solution to the initial value problem

    x′(t) = Ax(t),   x(t_0) = x_0,

is

    x(t) = e^{(t−t_0)A} x_0.
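As a concrete illustration, the following Python sketch evaluates x(t) = e^{(t−t_0)A} x_0 at a few times. It assumes numpy and scipy are available; the matrix A and the data t_0, x_0 are made up for the example.

    # Sketch: solving x'(t) = A x(t), x(t0) = x0, via x(t) = e^{(t - t0) A} x0.
    import numpy as np
    from scipy.linalg import expm

    A = np.array([[0.0, 1.0],
                  [-1.0, 0.0]])       # example system; here x(t) = (cos t, -sin t)
    t0 = 0.0
    x0 = np.array([1.0, 0.0])

    for t in [0.0, 0.5, 1.0]:
        print(t, expm((t - t0) * A) @ x0)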
In addition to the matrix exponential, it is also possible to compute other functions of
matrices (sin A, cos A, etc.) using Taylor series for these functions. See [Usm87, Ch. 6] for
further details.

Exercises.
4.5.1. Prove that eA is invertible for any A ∈ Mn,n (C).

4.5.2. Let

    A = [0 1; 0 0]   and   B = [0 0; 1 0].

Show that e^A e^B ≠ e^{A+B}.

4.5.3. If

    A = [ 3  1  0   0   0 ]
        [ 0  3  1   0   0 ]
        [ 0  0  3   0   0 ]
        [ 0  0  0  −4   1 ]
        [ 0  0  0   0  −4 ],

compute e^A.

4.5.4. If

    A = [ 0  2  −3 ]
        [ 0  0   1 ]
        [ 0  0   0 ],

compute e^A by summing the power series.

4.5.5. Solve the initial value problem x′(t) = Ax(t), x(1) = (1, −2, 3), where A is the matrix
of Example 4.5.1.
Chapter 5

Quadratic forms

In your courses in linear algebra you have spent a great deal of time studying linear functions.
The next level of complexity involves quadratic functions. These have a large number of
applications, including in the theory of conic sections, optimization, physics, and statistics.
Matrix methods allow for a unified treatment of quadratic functions. Conversely, quadratic
functions provide useful insights into the theory of eigenvectors and eigenvalues. Good
references for the material in this chapter are [Tre, Ch. 7] and [ND77, Ch. 10].

5.1 Definitions
A quadratic form on R^n is a homogeneous polynomial of degree two in the variables x_1, . . . , x_n.
In other words, a quadratic form is a polynomial Q(x) = Q(x_1, . . . , x_n) having only terms of
degree two. So only terms that are scalar multiples of x_k^2 and x_j x_k are allowed.
Every quadratic form on R^n can be written in the form Q(x) = ⟨x, Ax⟩ = x^T Ax for some
matrix A ∈ M_{n,n}(R). In general, the matrix A is not unique. For instance, the quadratic
form

    Q(x) = 2x_1^2 − x_2^2 + 6x_1x_2

can be written as ⟨x, Ax⟩ where A is any of the matrices

    [2 6; 0 −1],   [2 0; 6 −1],   [2 3; 3 −1].

In fact, we can choose any matrix of the form

    [2 6−a; a −1],   a ∈ F.

Note that only the choice of a = 3 yields a symmetric matrix.
Lemma 5.1.1. For any quadratic form Q(x) on R^n, there is a unique symmetric matrix
A ∈ M_{n,n}(R) such that Q(x) = ⟨x, Ax⟩.

Proof. Suppose

    Q(x) = ∑_{i≤j} c_{ij} x_i x_j.


If A = [a_{ij}] is symmetric, we have

    ⟨x, Ax⟩ = ∑_{i,j=1}^n a_{ij} x_i x_j = ∑_{i=1}^n a_{ii} x_i^2 + ∑_{i<j} (a_{ij} + a_{ji}) x_i x_j = ∑_{i=1}^n a_{ii} x_i^2 + ∑_{i<j} 2a_{ij} x_i x_j.

Thus, we have Q(x) = ⟨x, Ax⟩ if and only if

    a_{ii} = c_{ii} for 1 ≤ i ≤ n,   a_{ij} = a_{ji} = (1/2) c_{ij} for 1 ≤ i < j ≤ n.
Example 5.1.2. If

    Q(x) = x_1^2 − 3x_2^2 + 4x_3^2 + 3x_1x_2 − 12x_1x_3 + 8x_2x_3,

then the corresponding symmetric matrix is

    [  1    1.5  −6 ]
    [ 1.5   −3    4 ]
    [ −6     4    4 ].
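Such computations are mechanical: a_{ii} = c_{ii} and a_{ij} = a_{ji} = c_{ij}/2. The short Python sketch below (assuming numpy is available) re-does Example 5.1.2 and checks that Q(x) = x^T Ax at a random point:

    # Sketch: the symmetric matrix of the quadratic form in Example 5.1.2.
    import numpy as np

    def Q(x):
        x1, x2, x3 = x
        return x1**2 - 3*x2**2 + 4*x3**2 + 3*x1*x2 - 12*x1*x3 + 8*x2*x3

    A = np.array([[ 1.0,  1.5, -6.0],
                  [ 1.5, -3.0,  4.0],
                  [-6.0,  4.0,  4.0]])   # a_ii = c_ii, a_ij = a_ji = c_ij / 2

    x = np.random.randn(3)
    print(np.isclose(Q(x), x @ A @ x))   # should print True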

We can also consider quadratic forms on Cn . Typically, we still want the quadratic form
to take real values. Before we consider such quadratic forms, we state an important identity.

Lemma 5.1.3 (Polarization identity). Suppose A ∈ M_{n,n}(F).

(a) If F = C, then

    ⟨x, Ay⟩ = (1/4) ∑_{α=±1,±i} α ⟨x + αy, A(x + αy)⟩.

(b) If F = R and A = A^T, then

    ⟨x, Ay⟩ = (1/4) ( ⟨x + y, A(x + y)⟩ − ⟨x − y, A(x − y)⟩ ).
Proof. We leave the proof as Exercise 5.1.1.
The importance of the polarization identity is that the values ⟨x, Ay⟩ are completely
determined by the values of the corresponding quadratic form (the case where x = y).

Lemma 5.1.4. Suppose A ∈ M_{n,n}(C). Then

    ⟨x, Ax⟩ ∈ R for all x ∈ C^n

if and only if A is hermitian.

Proof. First suppose that A is hermitian. Then, for all x ∈ C^n,

    ⟨x, Ax⟩ = ⟨x, A^H x⟩ = ⟨Ax, x⟩ = \overline{⟨x, Ax⟩},

where the last equality uses (IP1). Thus ⟨x, Ax⟩ equals its own complex conjugate, and so ⟨x, Ax⟩ ∈ R.



Now suppose that ⟨x, Ax⟩ ∈ R for all x ∈ C^n, and A = [a_{ij}]. Using the polarization
identity (Lemma 5.1.3) we have, for all x, y ∈ C^n,

    ⟨x, Ay⟩ = (1/4)( ⟨x + y, A(x + y)⟩ − ⟨x − y, A(x − y)⟩ + i⟨x + iy, A(x + iy)⟩ − i⟨x − iy, A(x − iy)⟩ )
            = (1/4)( ⟨x + y, A(x + y)⟩ − ⟨y − x, A(y − x)⟩ + i⟨y − ix, A(y − ix)⟩ − i⟨y + ix, A(y + ix)⟩ )    (by (IP2), (IP3))
            = \overline{⟨y, Ax⟩} = ⟨Ax, y⟩    (by (IP1))
            = ⟨x, A^H y⟩.

(In the third equality, we used the fact that all the inner products appearing in the expression
are real.) It follows that A = A^H (e.g. take x = e_i and y = e_j).
In light of Lemma 5.1.4, we define a quadratic form on C^n to be a function of the form

    Q(x) = ⟨x, Ax⟩ for some hermitian A ∈ M_{n,n}(C).

Thus a quadratic form on C^n is a function C^n → R.

Exercises.
5.1.1. Prove Lemma 5.1.3.

5.1.2. Consider the quadratic form

    Q(x_1, x_2, x_3) = x_1^2 − x_2^2 + 4x_3^2 + x_1x_2 − 3x_1x_3 + 5x_2x_3.

Find all matrices A ∈ M_{3,3}(R) such that Q(x) = ⟨x, Ax⟩. Which one is symmetric?

5.2 Diagonalization of quadratic forms


Suppose Q(x) = ⟨x, Ax⟩ is a quadratic form. If A is diagonal, then the quadratic form is
particularly easy to understand. For instance, if n = 2, then it is of the form

    Q(x) = a_1 x_1^2 + a_2 x_2^2.

For fixed c ∈ R, the level set consisting of all points x satisfying

    Q(x) = c

is an ellipse or a hyperbola (or parallel lines, which is a kind of degenerate hyperbola, or the
empty set) whose axes are parallel to the x-axis and/or y-axis. For example, if F = R, the
set defined by

    x_1^2 + 4x_2^2 = 4

is an ellipse with half-axes 2 and 1 (figure omitted), while the set defined by

    x_1^2 − 4x_2^2 = 4

is a hyperbola (figure omitted).

If n = 3, then the level sets are ellipsoids or hyperboloids.


Given an arbitrary quadratic form, we would like to make it diagonal by changing variables.
Suppose we introduce some new variables

    y = (y_1, y_2, . . . , y_n)^T = S^{-1} x,

where S ∈ M_{n,n}(F) is some invertible matrix, so that x = Sy. Then we have

    Q(x) = Q(Sy) = ⟨Sy, ASy⟩ = ⟨y, S^H ASy⟩.

Therefore, in the new variables y, the quadratic form has matrix S^H AS.

So we would like to find an invertible matrix S so that S^H AS is diagonal. Note that this is a
bit different from usual diagonalization, where you want S^{-1}AS to be diagonal. However, if S
is unitary, then S^H = S^{-1}. Fortunately, we know from the spectral theorem (Theorem 3.4.4)
that we can unitarily diagonalize a hermitian matrix. So we can find a unitary matrix U
such that

    U^H AU = U^{-1}AU

is diagonal. The columns u_1, u_2, . . . , u_n of U form an orthonormal basis B of F^n. Then U
is the change of coordinate matrix from this basis to the standard basis e_1, e_2, . . . , e_n. We
have

    y = U^{-1} x,

so y_1, y_2, . . . , y_n are the coordinates of the vector x in the new basis u_1, u_2, . . . , u_n.

Example 5.2.1. Let F = R and consider the quadratic form

    Q(x_1, x_2) = 5x_1^2 + 5x_2^2 − 2x_1x_2.

Let’s describe the level set given by Q(x_1, x_2) = 1. The matrix of Q is

    A = [  5  −1 ]
        [ −1   5 ].

We can orthogonally diagonalize this matrix (exercise) as

    D = [6 0; 0 4] = U^T AU,   with U = (1/√2) [1 1; −1 1].

Thus, we introduce the new variables y_1, y_2 given by

    [x_1; x_2] = U [y_1; y_2] = (1/√2) [y_1 + y_2; y_2 − y_1].

In these new variables, the quadratic form is

    Q(y_1, y_2) = 6y_1^2 + 4y_2^2.

So the level set given by Q(y_1, y_2) = 1 is the ellipse with half-axes 1/√6 and 1/2 (figure
omitted). The set {x : ⟨x, Ax⟩ = 1} is the same ellipse, but in the basis (1/√2, −1/√2), (1/√2, 1/√2).
In other words, it is the same ellipse, rotated by −π/4.
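The diagonalization in Example 5.2.1 can be reproduced numerically. The sketch below assumes numpy is available; numpy.linalg.eigh returns the eigenvalues of a symmetric (or hermitian) matrix in increasing order together with an orthonormal matrix of eigenvectors, so the eigenvalues appear here as 4, 6 (the matrix U in the example lists them in the opposite order).

    # Sketch: diagonalizing the quadratic form of Example 5.2.1.
    import numpy as np

    A = np.array([[ 5.0, -1.0],
                  [-1.0,  5.0]])

    evals, U = np.linalg.eigh(A)     # eigenvalues in increasing order
    print(evals)                     # [4. 6.]
    print(np.allclose(U.T @ A @ U, np.diag(evals)))   # should print True
    # In the variables y = U^T x (for this U), the form is 4 y_1^2 + 6 y_2^2.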

Unitary diagonalization involves computing eigenvalues and eigenvectors, and so it can
be difficult to do for large n. However, one can non-unitarily diagonalize the quadratic form
associated to A, i.e. find an invertible S (without requiring that S be unitary) such that
S^H AS is diagonal. This is much easier computationally. It can be done by completing the
square or by using row/column operations. See [Tre, §2.2] for details.
Proposition 5.2.2 (Sylvester’s law of inertia). Suppose A ∈ M_{n,n}(C) is hermitian. There
exists an invertible matrix S such that S^H AS is of the form

    diag(1, . . . , 1, −1, . . . , −1, 0, . . . , 0),

where
• the number of 1’s is equal to the number of positive eigenvalues of A (with multiplicity),
• the number of −1’s is equal to the number of negative eigenvalues of A (with multiplicity),
• the number of 0’s is equal to the multiplicity of zero as an eigenvalue of A.
Proof. We leave the proof as Exercise 5.2.3.
The sequence (1, . . . , 1, −1, . . . , −1, 0, . . . , 0) appearing in Proposition 5.2.2 is called the
signature of the hermitian matrix A.
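Since the signs of the eigenvalues determine the signature, the three counts can be read off numerically from the spectrum. A minimal Python sketch (assuming numpy is available; the matrix A and the tolerance are illustrative only):

    # Sketch: counting the 1's, -1's and 0's in Sylvester's law of inertia.
    import numpy as np

    A = np.array([[2.0,  1.0, 0.0],
                  [1.0, -3.0, 0.0],
                  [0.0,  0.0, 0.0]])     # illustrative symmetric example

    evals = np.linalg.eigvalsh(A)
    tol = 1e-10
    n_pos = int(np.sum(evals > tol))
    n_neg = int(np.sum(evals < -tol))
    n_zero = len(evals) - n_pos - n_neg
    print(n_pos, n_neg, n_zero)          # here: 1 positive, 1 negative, 1 zero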

Exercises.
5.2.1 ([Tre, Ex. 7.2.2]). For the matrix

    A = [ 2  1  1 ]
        [ 1  2  1 ]
        [ 1  1  2 ],

unitarily diagonalize the corresponding quadratic form. That is, find a diagonal matrix D
and a unitary matrix U such that D = U^H AU.

5.2.2 ([ND77, Ex. 10.2.4]). For each quadratic form Q(x) below, find new variables y in
which the form is diagonal. Then graph the level curves Q(y) = t (in the coordinates y) for
various values of t.
(a) x_1^2 + 4x_1x_2 − 2x_2^2
(b) x_1^2 + 12x_1x_2 + 4x_2^2
(c) 2x_1^2 + 4x_1x_2 + 4x_2^2
(d) −5x_1^2 − 8x_1x_2 − 5x_2^2
(e) 11x_1^2 + 2x_1x_2 + 3x_2^2
5.2.3. Prove Proposition 5.2.2.

5.3 Rayleigh’s principle and the min-max theorem


In many applications (e.g. in certain problems in statistics), one wants to maximize a
quadratic form Q(x) = ⟨x, Ax⟩ subject to the condition ‖x‖ = 1. (Recall that A is hermitian.)
This is equivalent to maximizing the Rayleigh quotient

    R_A(x) := ⟨x, Ax⟩ / ⟨x, x⟩

subject to the condition x ≠ 0. (The proof that this is equivalent is similar to the proof of
Proposition 2.3.2. We leave it as Exercise 5.3.1.)
It turns out that it is fairly easy to maximize the Rayleigh quotient when A is diagonal.
As we saw in Section 5.2 we can always unitarily diagonalize A, and hence the associated
quadratic form. Let U be a unitary matrix such that

    U^H AU = D = diag(λ_1, . . . , λ_n)

and consider the new variables

    y = U^{-1} x = U^H x, so that x = Uy.

Then

    ⟨x, Ax⟩ = x^H Ax = (Uy)^H AUy = y^H U^H AUy = y^H Dy = ⟨y, Dy⟩

and

    ⟨x, x⟩ = x^H x = (Uy)^H Uy = y^H U^H Uy = y^H y = ⟨y, y⟩.

Thus

    R_A(x) = ⟨x, Ax⟩/⟨x, x⟩ = ⟨y, Dy⟩/⟨y, y⟩ = R_D(y).    (5.1)
So we can always reduce the problem of maximizing a Rayleigh quotient to one of maximizing
a diagonalized Rayleigh quotient.
Theorem 5.3.1. Suppose A ∈ Mn,n (C) is hermitian with eigenvalues

λ1 ≤ λ2 ≤ · · · ≤ λn

and corresponding orthonormal eigenvectors v1 , v2 , . . . , vn . Then


(a) λ_1 ≤ R_A(x) ≤ λ_n for all x ≠ 0;
(b) λ_1 is the minimum value of R_A(x) for x ≠ 0, and R_A(x) = λ_1 if and only if x is an
eigenvector of eigenvalue λ_1;
(c) λ_n is the maximum value of R_A(x) for x ≠ 0, and R_A(x) = λ_n if and only if x is an
eigenvector of eigenvalue λ_n.

Proof. By (5.1) and the fact that

    y = e_i ⇔ x = Ue_i = v_i,

it suffices to prove the theorem for the diagonal matrix D = diag(λ_1, λ_2, . . . , λ_n).



(a) This follows immediately from (b) and (c).

(b) We clearly have

    R_D(e_1) = ⟨e_1, De_1⟩/⟨e_1, e_1⟩ = λ_1 ⟨e_1, e_1⟩/⟨e_1, e_1⟩ = λ_1.

Also, for any x = x_1e_1 + x_2e_2 + ··· + x_ne_n, we have

    R_D(x) = ⟨x, Dx⟩/⟨x, x⟩
           = ⟨x, D(x_1e_1 + ··· + x_ne_n)⟩/⟨x, x⟩
           = ⟨x_1e_1 + ··· + x_ne_n, λ_1x_1e_1 + ··· + λ_nx_ne_n⟩/⟨x, x⟩
           = (λ_1|x_1|^2 + ··· + λ_n|x_n|^2)/⟨x, x⟩
           = (λ_1|x_1|^2 + ··· + λ_n|x_n|^2)/(|x_1|^2 + ··· + |x_n|^2)
           ≥ λ_1,

with equality holding if and only if x_i = 0 for all i with λ_i > λ_1; that is, if and only if
Dx = λ_1x.

(c) This follows from applying (b) to −A.

Example 5.3.2. Consider the hermitian matrix

    A = [ 1   2   1 ]
        [ 2  −1   0 ]
        [ 1   0  −1 ].

We have

    R_A(1, 0, 0) = ⟨(1, 0, 0), (1, 2, 1)⟩/1 = 1   and   R_A(0, 1, 0) = ⟨(0, 1, 0), (2, −1, 0)⟩/1 = −1.

Thus, by Theorem 5.3.1, we have

    λ_1 ≤ −1 and 1 ≤ λ_3.

Additional choices of x can give us improved bounds. In fact, λ_1 = −√6, λ_2 = −1, and
λ_3 = √6. Note that R_A(0, 1, 0) = −1 is an eigenvalue, even though (0, 1, 0) is not an
eigenvector. This does not contradict Theorem 5.3.1 since −1 is neither the minimum nor
the maximum eigenvalue.
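The bounds in Example 5.3.2 are easy to confirm numerically, assuming numpy is available:

    # Sketch: Rayleigh-quotient bounds from Example 5.3.2.
    import numpy as np

    A = np.array([[1.0,  2.0,  1.0],
                  [2.0, -1.0,  0.0],
                  [1.0,  0.0, -1.0]])

    def R(x):
        return (x @ A @ x) / (x @ x)

    print(R(np.array([1.0, 0.0, 0.0])))   # 1.0: a lower bound on the largest eigenvalue
    print(R(np.array([0.0, 1.0, 0.0])))   # -1.0: an upper bound on the smallest eigenvalue
    print(np.linalg.eigvalsh(A))          # approximately [-sqrt(6), -1, sqrt(6)]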

We studied matrix norms in Section 2.3. In Theorem 2.3.5 we saw that there are easy ways
to compute the 1-norm ‖A‖_1 and the ∞-norm ‖A‖_∞. But we only obtained an inequality
for the 2-norm ‖A‖_2. We can now say something more precise.

Corollary 5.3.3 (Matrix 2-norm). For any A ∈ M_{m,n}(C), the matrix 2-norm ‖A‖_2 is equal
to σ_1, the largest singular value of A.

Proof. We have

    ‖A‖_2^2 = max{ (‖Ax‖_2/‖x‖_2)^2 : x ≠ 0 }
            = max{ ‖Ax‖_2^2/‖x‖_2^2 : x ≠ 0 }
            = max{ ⟨Ax, Ax⟩/⟨x, x⟩ : x ≠ 0 }
            = max{ ⟨x, A^H Ax⟩/⟨x, x⟩ : x ≠ 0 }
            = λ,

where λ is the largest eigenvalue of the hermitian matrix A^H A (by Theorem 5.3.1). Since
σ_1 = √λ, we are done.
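In practice this is how the 2-norm is computed: take the largest singular value. A minimal sketch, assuming numpy is available and using an arbitrary example matrix:

    # Sketch: the matrix 2-norm equals the largest singular value.
    import numpy as np

    A = np.array([[1.0, 2.0,  0.0],
                  [0.0, 3.0, -1.0]])             # arbitrary 2 x 3 example

    sigma = np.linalg.svd(A, compute_uv=False)   # singular values, largest first
    print(sigma[0])
    print(np.linalg.norm(A, 2))                  # same value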
Theorem 5.3.1 relates the maximum (resp. minimum) eigenvalue of a hermitian matrix
A to the maximum (resp. minimum) value of the Rayleigh quotient RA (x). But what if we
are interested in the intermediate eigenvalues?

Theorem 5.3.4 (Rayleigh’s principle). Suppose A ∈ M_{n,n}(C) is hermitian, with eigenvalues
λ_1 ≤ λ_2 ≤ ··· ≤ λ_n and an orthonormal set of associated eigenvectors v_1, . . . , v_n. For
1 ≤ j ≤ n, let

• S_j be the set of all x ≠ 0 that are orthogonal to v_1, v_2, . . . , v_j,
• T_j be the set of all x ≠ 0 that are orthogonal to v_n, v_{n−1}, . . . , v_{n−j+1}.

Then:

(a) R_A(x) ≥ λ_{j+1} for all x ∈ S_j, and R_A(x) = λ_{j+1} for x ∈ S_j if and only if x is an
eigenvector associated to λ_{j+1}.
(b) R_A(x) ≤ λ_{n−j} for all x ∈ T_j, and R_A(x) = λ_{n−j} for x ∈ T_j if and only if x is an
eigenvector associated to λ_{n−j}.

Proof. The proof is very similar to that of Theorem 5.3.1, so we will omit it.
Rayleigh’s principle (Theorem 5.3.4) characterizes each eigenvector and eigenvalue of an
n × n hermitian matrix in terms of an extremum (i.e. maximization or minimization) prob-
lem. However, this characterization of the eigenvectors/eigenvalues other than the largest
and smallest requires knowledge of eigenvectors other than the one being characterized. In
particular, to characterize or estimate λj for 1 < j < n, we need the eigenvectors v1 , . . . , vj−1
or vj+1 , . . . , vn .
The following result remedies the aforementioned issue by giving a characterization of
each eigenvalue that does not require knowledge of the eigenvectors or of the other eigenvalues.

Theorem 5.3.5 (Min-max theorem). Suppose A ∈ Mn,n (C) is hermitian with eigenvalues

λ1 ≤ λ2 ≤ · · · ≤ λn .

Then

    λ_k = min_{U : dim U = k} { max{ R_A(x) : x ∈ U, x ≠ 0 } }

and

    λ_k = max_{U : dim U = n−k+1} { min{ R_A(x) : x ∈ U, x ≠ 0 } },

where the first min/max in each expression is over subspaces U ⊆ C^n of the given dimension.

Proof. We prove the first assertion, since the proof of the second is similar. Since A is
hermitian, it is unitarily diagonalizable by the spectral theorem (Theorem 3.4.4). So we can
choose an orthonormal basis
u1 , u2 , . . . , un
of eigenvectors, with ui an eigenvector of eigenvalue λi .
Suppose U ⊆ Cn with dim U = k. Then

dim U + dim Span{uk , . . . , un } = k + (n − k + 1) = n + 1 > n.

Hence
    U ∩ Span{u_k, . . . , u_n} ≠ {0}.

So we can choose a nonzero vector

    v = ∑_{i=k}^n a_i u_i ∈ U,

and

    R_A(v) = ⟨v, Av⟩/⟨v, v⟩ = (∑_{i=k}^n λ_i |a_i|^2)/(∑_{i=k}^n |a_i|^2) ≥ (∑_{i=k}^n λ_k |a_i|^2)/(∑_{i=k}^n |a_i|^2) = λ_k.
Since this is true for all U , we have

    min_{U : dim U = k} { max{ R_A(x) : x ∈ U, x ≠ 0 } } ≥ λ_k.

To prove the reverse inequality, choose the particular subspace

V = Span{u1 , . . . , uk }.

Then

    max{ R_A(x) : x ∈ V, x ≠ 0 } ≤ λ_k,

since λ_k is the largest eigenvalue in V. Thus we also have

    min_{U : dim U = k} { max{ R_A(x) : x ∈ U, x ≠ 0 } } ≤ λ_k.

Note that, when k = 1 or k = n, Theorem 5.3.5 recovers Theorem 5.3.1.



Example 5.3.6. Consider the hermitian matrix

    A = [ 1   2   0 ]
        [ 2  −1   5 ]
        [ 0   5  −1 ].

Let’s use the min-max theorem to get some bounds on λ_2. So n = 3 and k = 2. Take

    V = Span{(1, 0, 0), (0, 0, 1)}.

Then dim V = 2. The nonzero elements of V are the vectors

    (x_1, 0, x_3) ≠ (0, 0, 0).

We have

    R_A(x_1, 0, x_3) = ⟨(x_1, 0, x_3), (x_1, 2x_1 + 5x_3, −x_3)⟩/(|x_1|^2 + |x_3|^2) = (|x_1|^2 − |x_3|^2)/(|x_1|^2 + |x_3|^2).

This attains its maximum value of 1 at any scalar multiple of (1, 0, 0) and its minimum value
of −1 at any scalar multiple of (0, 0, 1). Thus, by the min-max theorem (Theorem 5.3.5), we
have

    λ_2 = min_{U : dim U = 2} { max{ R_A(x) : x ∈ U, x ≠ 0 } } ≤ max{ R_A(x) : x ∈ V, x ≠ 0 } = 1

and

    λ_2 = max_{U : dim U = 2} { min{ R_A(x) : x ∈ U, x ≠ 0 } } ≥ min{ R_A(x) : x ∈ V, x ≠ 0 } = −1.

So we know that −1 ≤ λ_2 ≤ 1. In fact, λ_2 ≈ 0.69385.
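These bounds can be checked numerically as well; the following sketch assumes numpy is available:

    # Sketch: checking the bounds on lambda_2 from Example 5.3.6.
    import numpy as np

    A = np.array([[1.0,  2.0,  0.0],
                  [2.0, -1.0,  5.0],
                  [0.0,  5.0, -1.0]])

    def R(x):
        return (x @ A @ x) / (x @ x)

    print(R(np.array([1.0, 0.0, 0.0])))   # 1.0: the maximum of R on V
    print(R(np.array([0.0, 0.0, 1.0])))   # -1.0: the minimum of R on V
    print(np.linalg.eigvalsh(A)[1])       # lambda_2, approximately 0.69385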

Example 5.3.7. Consider the non-hermitian matrix

    A = [ 0  1 ]
        [ 0  0 ].

The only eigenvalue of A is zero. However, for x ∈ R^2, the Rayleigh quotient

    R_A(x) = ⟨x, Ax⟩/⟨x, x⟩ = ⟨(x_1, x_2), (x_2, 0)⟩/⟨(x_1, x_2), (x_1, x_2)⟩ = x_1x_2/(x_1^2 + x_2^2)

has maximum value 1/2, when x is a scalar multiple of (1, 1). So it is crucial that A be
hermitian in the theorems of this section.

Exercises.
5.3.1. Prove that x_0 maximizes R_A(x) subject to x ≠ 0 and yields the maximum value
M = R_A(x_0) if and only if x_1 = x_0/‖x_0‖ maximizes ⟨x, Ax⟩ subject to ‖x‖ = 1 and yields
⟨x_1, Ax_1⟩ = M. Hint: The proof is similar to that of Proposition 2.3.2.

5.3.2 ([ND77, Ex. 10.4.1]). Use the Rayleigh quotient to find lower bounds for the largest
eigenvalue and upper bounds for the smallest eigenvalue of

    A = [  0  −1   0 ]
        [ −1  −1   1 ]
        [  0   1   0 ].
5.3.3 ([ND77, Ex. 10.4.2]). An eigenvector associated with the lowest eigenvalue of the matrix
below has the form xa = (1, a, 1). Find the exact value of a by defining the function
f(a) = R_A(x_a) and using calculus to minimize f(a). What is the lowest eigenvalue of A?

    A = [  3  −1   0 ]
        [ −1   2  −1 ]
        [  0  −1   3 ].
5.3.4 ([ND77, Ex. 10.4.5]). For each matrix A below, use RA to obtain lower bounds on the
greatest eigenvalue and upper bounds on the least eigenvalue.

(a) [ 3 −1 0; −1 2 −1; 0 −1 3 ]

(b) [ 7 −16 −8; −16 7 8; −8 8 −5 ]

(c) [ 2 −1 0; −1 3 −1; 0 −1 2 ]
5.3.5 ([ND77, Ex. 10.4.6]). Using v3 = (1, −1, −1) as an eigenvector associated with the
largest eigenvalue λ3 of the matrix A of Exercise 5.3.2, use RA to obtain lower bounds on
the second largest eigenvalue λ2 .

5.3.6 ([ND77, Ex. 10.5.3, 10.5.4]). Consider the matrix

    A = [ 0.4  0.1  0.1 ]
        [ 0.1  0.3  0.2 ]
        [ 0.1  0.2  0.3 ].
(a) Use U = Span{(1, 1, 1)}⊥ in the min-max theorem (Theorem 5.3.5) to obtain an upper
bound on the second largest eigenvalue of A.
(b) Repeat with U = Span{(1, 2, 3)}⊥ .
Index

1-norm, 38 elementary row operations, 13


2-norm, 38 Euclidean norm, 38
∞-norm, 39
F, 5
A+ , 24 F× , 5
absolute error, 46 forward substitution, 26
absolute value, 37 free variable, 16
algebraic multiplicity, 53 Frobenius norm, 42
fundamental subspaces, 91
back substitution, 26
block form, 10 Gauss–Jordan elimination, 15
block triangulation, 98 gaussian algorithm, 14
Gaussian elimination, 13
C, 5 generalized eigenspace, 99
Cauchy–Schwarz inequality, 44 generalized inverse, 24
characteristic polynomial, 53 geometric multiplicity, 53
Cholesky factorization, 68 Gershgorin circle theorem, 81
coefficient matrix, 15 Gershgorin discs, 81
column space, 12 GL(n, F), 5
commute, 10 Gram matrix, 23, 86
complex conjugate, 37 Gram–Schmidt orthogonalization algorithm, 49
condition number, 44
conjugate transpose, 56 hermitian conjugate, 56
hermitian matrix, 57
diag(a1 , . . . , an ), 6
diagonalizable, 53 idempotent, 98
diagonalization, 53 identity matrix, 9
diagonalized Rayleigh quotient, 112 ill-conditioned, 36, 46
Dimension Theorem, 12 initial value problem, 104
dominant eigenvalue, 77 inner product, 48
dominant eigenvector, 77 inverse, 22
dot product, 8, 48 inverse power method, 79
invertible, 22
Eckart–Young–Mirsky Theorem, 94
eigenspace, 53 Jordan block, 100
eigenvalue, 53 kernel, 12
eigenvector, 53
elementary matrix, 13 L(V, W ), 11
elementary permutation matrix, 32 leading 1, 14


leading variable, 16 null space, 12


left inverse, 17 nullity, 12
left-invertible, 17
level set, 108 operator norm, 40
linear transformation, 11 orthogonal, 49
lower reduced, 27 basis, 49
lower triangular, 26 orthogonal complement, 51
lower unitriangular, 70 orthogonal matrix, 58
LU decomposition, 27 orthogonal projection, 51
LU factorization, 27 orthogonally diagonalizable, 62
orthonormal, 49
Mm,n (F), 5 basis, 49
magnitude, 37 overdetermined, 15
main diagonal, 25
p-norm, 39
Markov chains, 79
P A = LU factorization, 33
matrix
permutation matrix, 32
addition, 6
Pi,j , 13
arithmetic, 6
PLU factorization, 33
difference, 7
polarization identity, 107
multiplication, 8
positive definite, 66
negative, 6
positive semi-definite, 66
scalar multiple, 7
power method, 77
size, 5
Principal axes theorem, 62
square, 7
principal submatrices, 67
subtraction, 7
principal axes, 62
symmetric, 7
principal components, 94
matrix exponential, 102
pseudoinverse, 24, 95
matrix norm, 40
matrix product, 8 QR factorization, 71
matrix-vector product, 8 QR method, 80
maximum norm, 39 quadratic form, 106, 108
Mi (a), 13
middle inverse, 94 R, 5
Moore–Penrose inverse, 24, 95 R>0 , 65
rank, 12, 15
N, 5 Rank-Nullity Theorem, 12
negative definite, 66 Rayleigh quotient, 112
negative semi-definite, 66 reduced QR factorization, 72
nilpotent, 100 reduced row-echelon form, 14
non-leading variable, 16 reduced row-echelon matrix, 14
nonsingular, 22 regular stochastic matrix, 79
norm, 37 relative error, 46
normal equations, 76 right inverse, 20
normal matrix, 63 right-invertible, 20
normed vector space, 37 row space, 12

row-echelon form, 14 wide matrix, 15


row-echelon matrix, 14
Z, 5
scalar, 5 zero matrix, 7
scalar multiplication, 7
Schur decomposition, 61
Schur’s theorem, 60
signature, 111
similar matrices, 53
singular, 22
singular matrix, 87
singular value decomposition, 85
singular values, 87
size of a matrix, 5
spectral theorem, 61
spectrum, 62
square
linear system, 15
matrix, 7
standard basis vector, 6
steady state vector, 79
stochastic matrix, 79
SVD, 85
Sylvester’s law of inertia, 111
symmetic matrix, 7
symmetric, 57
tall matrix, 15
thin QR factorization, 72
trace, 61
transpose, 7
triangular, 26
triangule inequality, 37
two-sided inverse, 22
underdetermined, 15
unitarily diagonalizable, 60
unitary matrix, 58
unknown, 15
upper triangular, 25
variable, 15
vector, 6
vector norm, 37
vector space, 7
well-conditioned, 46
Bibliography

[AK08] Grégoire Allaire and Sidi Mahmoud Kaber. Numerical linear algebra, volume 55 of Texts in Applied Mathematics. Springer, New York, 2008. Translated from the 2002 French original by Karim Trabelsi. URL: https://doi.org/10.1007/978-0-387-68918-0.

[BV18] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares. Cambridge University Press, Cambridge, 2018. URL: http://vmls-book.stanford.edu/.

[Mey00] Carl Meyer. Matrix analysis and applied linear algebra. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2000. With 1 CD-ROM (Windows, Macintosh and UNIX) and a solutions manual (iv+171 pp.). URL: https://doi.org/10.1137/1.9780898719512.

[ND77] Ben Noble and James W. Daniel. Applied linear algebra. Prentice-Hall, Inc., Englewood Cliffs, N.J., second edition, 1977.

[Nic] W. K. Nicholson. Linear Algebra With Applications. URL: https://lyryx.com/products/mathematics/linear-algebra-applications/.

[Pen55] R. Penrose. A generalized inverse for matrices. Proc. Cambridge Philos. Soc., 51:406–413, 1955.

[Tre] S. Treil. Linear Algebra Done Wrong. URL: http://www.math.brown.edu/~treil/papers/LADW/LADW.html.

[Usm87] Riaz A. Usmani. Applied linear algebra, volume 105 of Monographs and Textbooks in Pure and Applied Mathematics. Marcel Dekker, Inc., New York, 1987.

