Applied Linear Algebra
MAT 3341
Spring/Summer 2019
Alistair Savage
University of Ottawa
Contents

Preface
1 Matrix algebra
1.1 Conventions and notation
1.2 Matrix arithmetic
1.3 Matrices and linear transformations
1.4 Gaussian elimination
1.5 Matrix inverses
1.6 LU factorization
2 Matrix norms, sensitivity, and conditioning
2.1 Motivation
2.2 Normed vector spaces
2.3 Matrix norms
2.4 Conditioning
3 Orthogonality
3.1 Orthogonal complements and projections
3.2 Diagonalization
3.3 Hermitian and unitary matrices
3.4 The spectral theorem
3.5 Positive definite matrices
3.6 QR factorization
3.7 Computing eigenvalues
4 Generalized diagonalization
4.1 Singular value decomposition
4.2 Fundamental subspaces and principal components
4.3 Pseudoinverses
4.4 Jordan canonical form
4.5 The matrix exponential
Index
Preface
These are notes for the course Applied Linear Algebra (MAT 3341) at the University of
Ottawa. This is a third course in linear algebra. The prerequisites are uOttawa courses
MAT 1322 and (MAT 2141 or MAT 2342).
In this course we will explore aspects of linear algebra that are of particular use in concrete
applications. For example, we will learn how to factor matrices in various ways that aid in
solving linear systems. We will also learn how one can effectively compute estimates of
eigenvalues when solving for precise ones is impractical. In addition, we will investigate the
theory of quadratic forms. The course will involve a mixture of theory and computation. It
is important to understand why our methods work (the theory) in addition to being able to
apply the methods themselves (the computation).
Acknowledgements: I would like to thank Benoit Dionne, Monica Nevins, and Mike Newman
for sharing with me their lecture notes for this course.
Alistair Savage
Chapter 1
Matrix algebra
We begin this chapter by briefly recalling some matrix algebra that you learned in previous
courses. In particular, we review matrix arithmetic (matrix addition, scalar multiplication,
the transpose, and matrix multiplication), linear transformations, and gaussian elimination
(row reduction). Next we discuss matrix inverses. Although you have seen the concept of a
matrix inverse in previous courses, we delve into the topic in further detail. In particular,
we will investigate the concept of one-sided inverses. We then conclude the chapter with a
discussion of LU factorization, which is a very useful technique for solving linear systems.
We let Mm,n (F) denote the set of all m × n matrices with entries in F. We let GL(n, F)
denote the set of all invertible n × n matrices with entries in F. (Here ‘GL’ stands for general linear.)
We will use boldface lowercase letters a, b, x, y, etc. to denote vectors. (In class, we
will often write vectors as ~a, ~b, etc. since bold is hard to write on the blackboard.) Most of
the time, our vectors will be elements of Fn . (Although, in general, they can be elements of
any vector space.) For vectors in Fn , we denote their components with the corresponding
non-bold letter with subscripts. We will write vectors x ∈ Fn in column notation:
x = [ x1 ]
    [ x2 ]
    [ ⋮  ]
    [ xn ] ,    x1 , x2 , . . . , xn ∈ F.

To save space, we will sometimes also write this column vector horizontally, in tuple notation:

x = (x1 , x2 , . . . , xn ).
For 1 ≤ i ≤ n, we let ei denote the i-th standard basis vector of Fn . This is the vector
ei = (0, . . . , 0, 1, 0, . . . , 0),
where the 1 is in the i-th position. Then {e1 , e2 , · · · , en } is a basis for Fn . Indeed, every
x ∈ Fn can be written uniquely as the linear combination
x = x 1 e 1 + x2 e 2 + · · · + xn e n .
If A = [aij ] and B = [bij ] are matrices of the same size, we define their sum entrywise:

A + B = [aij + bij ].

If A and B are of different sizes, then the sum A + B is not defined. We define the negative of a matrix A by

−A = [−aij ]

and, more generally, the scalar multiple of A by a scalar k ∈ F by

kA = [kaij ].
We denote the zero matrix by 0. This is the matrix with all entries equal to zero. Note that
there is some possibility for confusion here since we will use 0 to denote the real (or complex)
number zero, as well as the zero matrices of different sizes. The context should make it clear
which zero we mean. The context should also make clear what size of zero matrix we are
considering. For example, if A ∈ Mm,n (F) and we write A + 0, then 0 must denote the m × n
zero matrix here.
The following theorem summarizes the important properties of matrix addition and scalar
multiplication.
Proposition 1.2.1. Let A, B, and C be m × n matrices and let k, p ∈ F be scalars. Then
we have the following:
(a) A + B = B + A (commutativity)
(b) A + (B + C) = (A + B) + C (associativity)
(c) 0 + A = A (0 is an additive identity)
(d) A + (−A) = 0 (−A is the additive inverse of A)
(e) k(A + B) = kA + kB (scalar multiplication is distributive over matrix addition)
(f) (k + p)A = kA + pA (scalar multiplication is distributive over scalar addition)
(g) (kp)A = k(pA)
(h) 1A = A
Remark 1.2.2. Proposition 1.2.1 can be summarized as stating that the set Mm,n (F) is a
vector space over the field F under the operations of matrix addition and scalar multiplication.
1.2.2 Transpose
The transpose of an m × n matrix A, written AT , is the n × m matrix whose rows are the
columns of A in the same order. In other words, the (i, j)-entry of AT is the (j, i)-entry of
A. So,
if A = [aij ], then AT = [aji ].
We say the matrix A is symmetric if AT = A. Note that this implies that all symmetric
matrices are square, that is, they are of size n × n for some n.
The matrix

[  1 −5  7 ]
[ −5  0  8 ]
[  7  8  9 ]

is symmetric.
Proposition 1.2.4. Let A and B denote matrices of the same size, and let k ∈ F. Then we
have the following:
(a) (AT )T = A
(b) (kA)T = kAT
(c) (A + B)T = AT + B T
If A ∈ Mm,n (F) has columns a1 , a2 , . . . , an and x = (x1 , . . . , xn ) ∈ Fn , then we define the product

Ax := x1 a1 + x2 a2 + · · · + xn an ∈ Fm .
Example 1.2.5. If

A = [  2  −1   0 ]
    [  3  1/2  π ]               x = [ −1 ]
    [ −2   1   1 ]     and           [  1 ]
    [  0   0   0 ]                   [  2 ] ,

then

Ax = −1 (2, 3, −2, 0) + 1 (−1, 1/2, 1, 0) + 2 (0, π, 1, 0) = (−3, −5/2 + 2π, 5, 0).
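As a quick numerical illustration (not part of the original notes), here is a minimal sketch, assuming Python with NumPy, that checks the computation in Example 1.2.5 by forming Ax both directly and as a linear combination of the columns of A:

    import numpy as np

    A = np.array([[2.0, -1.0, 0.0],
                  [3.0, 0.5, np.pi],
                  [-2.0, 1.0, 1.0],
                  [0.0, 0.0, 0.0]])
    x = np.array([-1.0, 1.0, 2.0])

    # Matrix-vector product computed directly.
    Ax = A @ x

    # The same product as the linear combination x1*a1 + x2*a2 + x3*a3 of the columns of A.
    combo = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]

    print(Ax)                      # [-3, -5/2 + 2*pi, 5, 0]
    print(np.allclose(Ax, combo))  # True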
Recall that the dot product of two vectors x, y ∈ Fn is

x · y = x1 y1 + x2 y2 + · · · + xn yn . (1.1)
Then another way to compute the matrix product is as follows: the (i, j)-entry of AB is the dot product of the i-th row of A and the j-th column of B. In other words, if A ∈ Mm,n (F) and B ∈ Mn,k (F), then

C = AB  ⟺  cij = ai1 b1j + ai2 b2j + · · · + ain bnj   for all 1 ≤ i ≤ m, 1 ≤ j ≤ k.
Example 1.2.6. If

A = [ 2 0 −1  1 ]                B = [ 0 1 −1 ]
    [ 0 3  2 −1 ]     and            [ 1 0  2 ]
                                     [ 0 0 −2 ]
                                     [ 3 1  0 ] ,

then

AB = [ 3  3 0 ]
     [ 0 −1 2 ] .
Recall that the n × n identity matrix is the matrix
I := [ 1 0 · · · 0 ]
     [ 0 1 · · · 0 ]
     [ ⋮ ⋮  ⋱   ⋮ ]
     [ 0 0 · · · 1 ] .
Even though there is an n × n identity matrix for each n, the size of I should be clear from
the context. For instance, if A ∈ Mm,n (F) and we write AI, then I is the n × n identity
matrix. If, on the other hand, we write IA, then I is the m × m identity matrix. In case we
need to specify the size to avoid confusion, we will write In for the n × n identity matrix.
Proposition 1.2.7 (Properties of matrix multiplication). Suppose A, B, and C are matrices
of sizes such that the indicated matrix products are defined. Furthermore, suppose a is a
scalar. Then:
(a) IA = A = AI (I is a multiplicative identity)
(b) A(BC) = (AB)C (associativity)
(c) A(B + C) = AB + AC (distributivity on the left)
(d) (B + C)A = BA + CA (distributivity on the right)
(e) a(AB) = (aA)B = A(aB)
(f) (AB)T = B T AT
Note that matrix multiplication is not commutative in general. First of all, it is possible
that the product AB is defined but BA is not. This is the case when A ∈ Mm,n (F) and B ∈ Mn,k (F) with m ≠ k. Now suppose A ∈ Mm,n (F) and B ∈ Mn,m (F). Then AB and BA are both defined, but they are different sizes when m ≠ n. However, even if m = n, so that
A and B are both square matrices, we can have AB ≠ BA. For example, if

A = [ 1 0 ]               B = [ 0 1 ]
    [ 0 0 ]     and           [ 0 0 ]

then

AB = [ 0 1 ]   ≠   [ 0 0 ]  = BA.
     [ 0 0 ]       [ 0 0 ]
Of course, it is possible for AB = BA for some specific matrices (e.g. the zero or identity
matrices). In this case, we say that A and B commute. But since this does not hold in
general, we say that matrix multiplication is not commutative.
Similarly, if

X = [  2 1 ]               A = [  3 −5 ]
    [ −1 0 ]     and           [ −2  9 ] ,

then

[ 0  e2  X ]    [ 0 0 |  2  1 ]
[         A ] = [ 0 1 | −1  0 ]
                [-----+-------]
                [ 0 0 |  3 −5 ]
                [ 0 0 | −2  9 ] ,
where we have used horizontal and vertical lines to indicate the blocks. Note that we could
infer from the sizes of X and A, that the 0 in the block matrix must be the zero vector in
F4 and that e2 must be the second standard basis vector in F4 .
Provided the sizes of the blocks match up, we can multiply matrices in block form using
the usual rules for matrix multiplication. For example,
[ A B ] [ X ]   [ AX + BY ]
[ C D ] [ Y ] = [ CX + DY ]
as long as the products AX, BY , CX, and DY are defined. That is, we need the number
of columns of A to equal the number of rows of X, etc. (See [Nic, Th. 2.3.4].) Note that,
since matrix multiplication is not commutative, the order of the multiplication of the blocks
is very important here.
We can also compute transposes of matrices in block form. For example,
              [ AT ]                    [ A B ]T    [ AT  C T ]
[ A B C ]T =  [ B T ]      and          [ C D ]  =  [ B T  DT ] .
              [ C T ]
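As a short illustration (not from the original notes), here is a minimal sketch, assuming Python with NumPy, that assembles block matrices and checks that multiplying block-by-block agrees with ordinary matrix multiplication; the block sizes are chosen only for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 3)); B = rng.standard_normal((2, 4))
    C = rng.standard_normal((5, 3)); D = rng.standard_normal((5, 4))
    X = rng.standard_normal((3, 6)); Y = rng.standard_normal((4, 6))

    # Assemble the block matrices and multiply them as ordinary matrices.
    M = np.block([[A, B], [C, D]])     # (2+5) x (3+4)
    N = np.block([[X], [Y]])           # (3+4) x 6

    # Multiply block-by-block, as in the displayed formula above.
    blockwise = np.block([[A @ X + B @ Y], [C @ X + D @ Y]])

    print(np.allclose(M @ N, blockwise))   # True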
Exercises.
Recommended exercises: Exercises in [Nic, §§2.1–2.3].
Recall that any matrix A ∈ Mm,n (F) defines a linear transformation

TA : Fn → Fm ,   x ↦ Ax. (1.3)
T x = T (x1 e1 + · · · + xn en )
= x1 T (e1 ) + · · · + xn T (en ) (since T is linear)
= Ax.
Exercises.
Recommended exercises: Exercises in [Nic, §2.6].
Definition 1.4.1 (Elementary row operations). The following are called elementary row operations on a matrix A with entries in F:
(I) interchange two rows;
(II) multiply a row by a nonzero element of F;
(III) add a multiple of one row to a different row.
An elementary matrix is a matrix obtained by performing a single elementary row operation on an identity matrix In . The type of the elementary matrix is the type of the corresponding row operation performed on In .
Proposition 1.4.4. (a) Every elementary matrix is invertible and the inverse is an ele-
mentary matrix of the same type.
(b) Performing an elementary row operation on a matrix A is equivalent to multiplying A
on the left by the corresponding elementary matrix.
(b) We give the proof for row operations of type III, and leave the proofs for types I and
II as exercises. (See Exercise 1.4.1.) Fix 1 ≤ i, j ≤ n, with i 6= j. Note that
ek A = row k of A,
Definition 1.4.5 (Row-echelon form). A matrix is in row-echelon form (and will be called a row-echelon matrix ) if:
• all zero rows are at the bottom;
• the first nonzero entry from the left in each nonzero row is a 1, called the leading 1 for that row;
• each leading 1 is to the right of all leading 1s in the rows above it.
A row-echelon matrix is in reduced row-echelon form (and will be called a reduced row-echelon matrix ) if, in addition,
• each leading 1 is the only nonzero entry in its column.
Remark 1.4.6. Some references do not require the leading entry (i.e. the first nonzero entry
from the left in a nonzero row) to be 1 in row-echelon form.
Proposition 1.4.7. Every matrix A ∈ Mm,n (F) can be transformed to a row-echelon form
matrix R by performing elementary row operations. Equivalently, there exist finitely many
elementary matrices E1 , E2 , . . . , Ek such that R = E1 E2 · · · Ek A.
Proof. You saw this in previous courses, and so we will omit the proof here. In fact, there
is a precise algorithm, called the gaussian algorithm, for bringing a matrix to row-echelon
form. See [Nic, Th. 1.2.1] for details.
Corollary 1.4.8. A square matrix A is invertible if and only if it is a product of elementary matrices.

Proof. Since the elementary matrices are invertible by Proposition 1.4.4(a), if A is a product of elementary matrices, then A is a product of invertible matrices, and hence A is invertible. Conversely, suppose A is invertible. Then it can be row-reduced to the identity matrix I. Hence, by Proposition 1.4.7, there are elementary matrices E1 , E2 , . . . , Ek such that I = E1 E2 · · · Ek A. Then

A = (E1 E2 · · · Ek )−1 = Ek−1 · · · E2−1 E1−1 ,

which is again a product of elementary matrices.
Reducing a matrix all the way to reduced row-echelon form is sometimes called Gauss–
Jordan elimination.
Recall that the rank of a matrix A, denoted rank A, is the dimension of the column space
of A. Equivalently, rank A is the number of nonzero rows (which is equal to the number of
leading 1s) in any row-echelon matrix U that is row equivalent to A (i.e. that can be obtained
from A by row operations). Thus we see that rank A is also the dimension of the row space
of A, as noted earlier.
Recall that every linear system consisting of m linear equations in n variables can be
written in matrix form
Ax = b,
where A is an m × n matrix, called the coefficient matrix ,
x = [ x1 ]
    [ x2 ]
    [ ⋮  ]
    [ xn ]
is the vector of variables (or unknowns), and b is the vector of constant terms. We say that
• the linear system is overdetermined if there are more equations than unknowns (i.e. if
m > n),
• the linear system is underdetermined if there are more unknowns than equations (i.e.
if m < n),
• the linear system is square if there are the same number of unknowns as equations (i.e.
if m = n),
• an m × n matrix is tall if m > n, and
• an m × n matrix is wide if m < n.
It follows immediately that the linear system Ax = b is
• overdetermined if and only if A is tall,
• underdetermined if and only if A is wide, and
• square if and only if A is square.
Example 1.4.9. As a refresher, let’s solve the following underdetermined system of linear
equations:
−4x3 + x4 + 2x5 = 11
4x1 − 2x2 + 8x3 + x4 − 5x5 = 5
2x1 − x2 + 2x3 + x4 − 3x5 = 2
We write down the augmented matrix and row reduce:
[ 0  0 −4  1  2 | 11 ]
[ 4 −2  8  1 −5 |  5 ]
[ 2 −1  2  1 −3 |  2 ]

R1 ↔ R3:

[ 2 −1  2  1 −3 |  2 ]
[ 4 −2  8  1 −5 |  5 ]
[ 0  0 −4  1  2 | 11 ]

R2 − 2R1:

[ 2 −1  2  1 −3 |  2 ]
[ 0  0  4 −1  1 |  1 ]
[ 0  0 −4  1  2 | 11 ]

R3 + R2:

[ 2 −1  2  1 −3 |  2 ]
[ 0  0  4 −1  1 |  1 ]
[ 0  0  0  0  3 | 12 ]
One can now easily solve the linear system using a technique called back substitution, which
we will discuss in Section 1.6.1. However, to further illustrate the process of row reduction,
let’s continue with gaussian elimination:
(1/3)R3:

[ 2 −1  2  1 −3 |  2 ]
[ 0  0  4 −1  1 |  1 ]
[ 0  0  0  0  1 |  4 ]

R1 + 3R3 and R2 − R3:

[ 2 −1  2  1  0 | 14 ]
[ 0  0  4 −1  0 | −3 ]
[ 0  0  0  0  1 |  4 ]

(1/4)R2:

[ 2 −1  2    1   0 |  14  ]
[ 0  0  1  −1/4  0 | −3/4 ]
[ 0  0  0    0   1 |   4  ]

R1 − 2R2:

[ 2 −1  0   3/2  0 | 31/2 ]
[ 0  0  1  −1/4  0 | −3/4 ]
[ 0  0  0    0   1 |   4  ]

(1/2)R1:

[ 1 −1/2  0   3/4  0 | 31/4 ]
[ 0   0   1  −1/4  0 | −3/4 ]
[ 0   0   0    0   1 |   4  ]
The matrix is now in reduced row-echelon form. This reduced matrix corresponds to the
equivalent linear system:
x1 − (1/2)x2 + (3/4)x4 = 31/4
          x3 − (1/4)x4 = −3/4
                    x5 = 4
The leading variables are the variables corresponding to leading 1s in the reduced row-echelon
matrix: x1 , x3 , and x5 . The non-leading variables, or free variables, are x2 and x4 . We let
the free variables be parameters:
x2 = s, x4 = t, s, t ∈ F.
Then we solve for the leading variables in terms of these parameters giving the general
solution in parametric form:
x1 = 31/4 + (1/2)s − (3/4)t
x2 = s
x3 = −3/4 + (1/4)t
x4 = t
x5 = 4
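As an optional check (not part of the original notes), the reduced row-echelon form above can be reproduced exactly with a computer algebra system. Here is a minimal sketch assuming Python with SymPy, which works in exact rational arithmetic:

    from sympy import Matrix

    # Augmented matrix of the system in Example 1.4.9.
    M = Matrix([[0, 0, -4, 1, 2, 11],
                [4, -2, 8, 1, -5, 5],
                [2, -1, 2, 1, -3, 2]])

    R, pivots = M.rref()   # reduced row-echelon form and pivot columns
    print(R)
    # Matrix([[1, -1/2, 0, 3/4, 0, 31/4],
    #         [0,    0, 1, -1/4, 0, -3/4],
    #         [0,    0, 0,    0, 1,    4]])
    print(pivots)          # (0, 2, 4): the leading variables are x1, x3, x5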
Exercises.
1.4.1. Complete the proof of Proposition 1.4.4 by verifying the equalities in (1.8) and veri-
fying part (b) for the case of elementary matrices of types I and II.
Definition 1.5.1 (Left inverse). Let A ∈ Mm,n (F). A matrix X satisfying

XA = I
is called a left inverse of A. If such a left inverse exists, we say that A is left-invertible. Note
that if A has size m × n, then any left inverse X will have size n × m.
Examples 1.5.2. (a) If A ∈ M1,1 (F), then we can think of A simply as a scalar. In this
case a left inverse is equivalent to the inverse of the scalar. Thus, A is left-invertible if
and only if it is nonzero, and in this case it has only one left inverse.
(b) Any nonzero vector a ∈ Fn is left-invertible. Indeed, if ai ≠ 0 for some 1 ≤ i ≤ n, then

(1/ai ) eTi a = [ 1 ].

Hence (1/ai ) eTi is a left inverse of a. For example, if

a = [  2 ]
    [  0 ]
    [ −1 ]
    [  1 ] ,

then

[ 1/2 0 0 0 ],    [ 0 0 −1 0 ],    and    [ 0 0 0 1 ]

are all left inverses of a. In fact, a has infinitely many left inverses. See Exercise 1.5.1.
Proposition 1.5.3. If A has a left inverse, then the columns of A are linearly independent.
Proof. Suppose A ∈ Mm,n (F) has a left inverse B, and let a1 , . . . , an be the columns of A.
Suppose that
x1 a1 + · · · + xn an = 0
for some x1 , . . . , xn ∈ F. Thus, taking x = (x1 , . . . , xn ), we have
Ax = x1 a1 + · · · + xn an = 0.
Thus
x = Ix = BAx = B0 = 0.
This implies that x1 = x2 = · · · = xn = 0. Hence the columns of A are linearly independent.
Corollary 1.5.4. If A has a left inverse, then A is square or tall.

Proof. Suppose A is a wide matrix, i.e. A is m × n with m < n. Then it has n columns,
each of which is a vector in Fm . Since m < n, these columns cannot be linearly independent.
Hence A cannot have a left inverse.
So we see that only square or tall matrices can be left-invertible. Of course, not every
square or tall matrix is left-invertible (e.g. consider the zero matrix).
Now suppose we want to solve a system of linear equations
Ax = b
in the case where A has a left inverse C. If this system has a solution x, then
Cb = CAx = Ix = x.
Keep in mind that this method only works when A has a left inverse. In particular, A must
be square or tall. So this method only has a chance of working for square or overdetermined
systems.
Example 1.5.6. Using the same matrix A from Example 1.5.5, consider the over-determined
linear system
Ax = (1, −1, 0).

We compute

B(1, −1, 0) = (1/9, 1/9, 0).
Thus, if the system has a solution, it must be (1/9, 1/9, 0). However, we check that
A(1/9, 1/9, 0) = (7/9, −10/9, −2/9) ≠ (1, −1, 0).
Thus, the system has no solution. Of course, we could also compute using the left inverse C
as above, or see that
1 1
1/9 1/2
B −1 =
6= = C −1 .
1/9 −1/2
0 0
If the system had a solution, both (1/9, 1/9, 0) and (1/2, −1/2, 0) would be the unique
solution, which is clearly not possible.
Definition 1.5.7 (Right inverse). Let A ∈ Mm,n (F). A matrix X satisfying

AX = I
is called a right inverse of A. If such a right inverse exists, we say that A is right-invertible.
Note that if A has size m × n, then any right inverse X will have size n × m.
Suppose A has a right inverse B. Then
B T AT = (AB)T = I,

and so B T is a left inverse of AT . Similarly, if C is a left inverse of A, then

AT C T = (CA)T = I,
and so C T is a right inverse of AT . This allows us to translate our results about left inverses
to results about right inverses.
Proposition 1.5.8. (a) The matrix A is left-invertible if and only if AT is right invertible.
Furthermore, if C is a left inverse of A, then C T is a right inverse of AT
(b) Similarly, A is right-invertible if and only if AT is left-invertible. Furthermore, if B is
a right inverse of A, then B T is a left inverse of AT .
(c) If a matrix is right-invertible then its rows are linearly independent.
(d) If A has a right inverse, then A is square or wide.
Proof. (a) We proved this above.
(b) We proved this above. Alternatively, it follows from part (a) and the fact that
(AT )T = A.
(c) This follows from Proposition 1.5.3 and part (b) since the rows of A are the columns
of AT .
(d) This follows from Corollary 1.5.4 and part (b) since A is square or wide if and only
if AT is square or tall.
We can also transpose Examples 1.5.2.
Examples 1.5.9. (a) If A ∈ M1,1 (F), then a right inverse is equivalent to the inverse of the
scalar. Thus, A is right-invertible if and only if it is nonzero, and in this case it has
only one right inverse.
(b) Any nonzero row matrix a = [ a1 · · · an ] ∈ M1,n (F) is right-invertible. Indeed, if ai ≠ 0 for some 1 ≤ i ≤ n, then

a ((1/ai ) ei ) = [ 1 ].

Hence (1/ai ) ei is a right inverse of a.
(c) The matrix

A = [ 4 −6 −1 ]
    [ 3 −4 −1 ]

has right inverses

B = (1/9) [ −7  11 ]               C = (1/2) [  0  0 ]
          [ −8  10 ]      and                [ −1  1 ]
          [ 11 −16 ]                         [  4 −6 ] .
Now suppose we want to solve a linear system
Ax = b
in the case where A has a right inverse B. Note that
ABb = Ib = b.
Thus x = Bb is a solution to this system. Hence, the system has a solution for any b. Of course, there can be other solutions; the solution x = Bb is just one of them.
This gives us a method to solve any linear system Ax = b in the case that A has a right
inverse. Of course, this implies that A is square or wide. So this method only has a chance
of working for square or underdetermined systems.
Example 1.5.10. Using the matrix A from Example 1.5.9(c) with right inverses B and C,
the linear system

Ax = (1, 1)

has solutions

B(1, 1) = (4/9, 2/9, −5/9)     and     C(1, 1) = (0, 0, −1).
(Of course, there are more. As you learned in previous courses, any linear system with more
than one solution has infinitely many solutions.) Indeed, we can find a solution of Ax = b
for any b.
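The claims in Examples 1.5.9(c) and 1.5.10 are easy to verify numerically. A minimal sketch, assuming Python with NumPy (not part of the original notes):

    import numpy as np

    A = np.array([[4.0, -6.0, -1.0],
                  [3.0, -4.0, -1.0]])
    B = np.array([[-7.0, 11.0], [-8.0, 10.0], [11.0, -16.0]]) / 9
    C = np.array([[0.0, 0.0], [-1.0, 1.0], [4.0, -6.0]]) / 2
    b = np.array([1.0, 1.0])

    # Both B and C are right inverses of A ...
    print(np.allclose(A @ B, np.eye(2)), np.allclose(A @ C, np.eye(2)))   # True True

    # ... so both Bb and Cb solve Ax = b, even though they are different vectors.
    print(B @ b)   # [ 4/9  2/9 -5/9]
    print(C @ b)   # [ 0  0 -1]
    print(np.allclose(A @ (B @ b), b), np.allclose(A @ (C @ b), b))       # True True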
A matrix X satisfying

AX = I = XA

is called a (two-sided) inverse of A. Such an inverse, if it exists, is unique: if X and Y are both inverses of A, then

X = IX = Y AX = Y I = Y.
Proposition 1.5.15. Suppose A ∈ Mm,n (R).
(a) The columns of A are linearly independent if and only if the matrix

AT A ∈ Mn,n (R)

is invertible.
(b) The matrix A is left-invertible if and only if its columns are linearly independent. Furthermore, if the columns of A are linearly independent, then (AT A)−1 AT is a left inverse of A.

Proof. (a) First suppose that the columns of A are linearly independent, and suppose that

AT Ax = 0

for some x ∈ Rn . Multiplying on the left by xT gives (Ax) · (Ax) = xT AT Ax = 0,
which implies that Ax = 0. (Recall that, for any vector v ∈ Rn , we have v · v = 0 if and
only if v = 0.) Because the columns of A are linearly independent, this implies that x = 0.
Since the only solution to AT Ax = 0 is x = 0, we conclude that AT A is invertible.
Now suppose the columns of A are linearly dependent. Thus, there exists a nonzero
x ∈ Rn such that Ax = 0. Multiplying on the left by AT gives
AT Ax = 0 with x ≠ 0, and so AT A is not invertible.
If the columns of A are linearly independent (in particular, A is square or tall), then the
particular left inverse (AT A)−1 AT described in Proposition 1.5.15 is called the pseudoinverse
of A, the generalized inverse of A, or the Moore–Penrose inverse of A, and is denoted A+ .
Recall that left inverses are not unique in general. So this is just one left inverse. However,
when A is square, we have

(AT A)−1 AT = A−1 (AT )−1 AT = A−1 ,

and so the pseudoinverse reduces to the ordinary inverse (which is unique). Note that this
equation does not make sense when A is not square or, more generally, when A is not
invertible.
We also have a right analogue of Proposition 1.5.15.
Proposition 1.5.16. (a) The rows of A are linearly independent if and only if the matrix
AAT is invertible.
(b) A matrix is right-invertible if and only if its rows are linearly independent. Further-
more, if A is right-invertible, then AT (AAT )−1 is a right inverse of A.
(a) We have

rows of A are lin. ind. ⇐⇒ columns of AT are lin. ind.
                        ⇐⇒ (AT )T AT = AAT is invertible. (by Proposition 1.5.15(a))
(b) We have
A is right-invertible ⇐⇒ AT is left-invertible
⇐⇒ columns of AT are lin. ind. (by Proposition 1.5.15(b))
⇐⇒ rows of A are lin. ind.
Propositions 1.5.15 and 1.5.16 give us a method to compute right and left inverses, if they
exist. Precisely, these results reduce the problem to the computation of two-sided inverses,
which you have done in previous courses. Later in the course we will develop other, more
efficient, methods for computing left- and right-inverses. (See Sections 3.6 and 4.3.)
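As a quick illustration of these formulas (not part of the original notes), here is a minimal sketch assuming Python with NumPy; the matrix is chosen only for the example:

    import numpy as np

    # A tall matrix with linearly independent columns.
    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])

    # Left inverse from Proposition 1.5.15: (A^T A)^{-1} A^T.
    left_inv = np.linalg.inv(A.T @ A) @ A.T
    print(np.allclose(left_inv @ A, np.eye(2)))        # True
    print(np.allclose(left_inv, np.linalg.pinv(A)))    # True: this is the pseudoinverse A+

    # For a wide matrix with linearly independent rows, Proposition 1.5.16
    # gives the right inverse A^T (A A^T)^{-1}.
    W = A.T
    right_inv = W.T @ np.linalg.inv(W @ W.T)
    print(np.allclose(W @ right_inv, np.eye(2)))       # True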
Exercises.
1.5.1. Suppose A is a matrix with left inverses B and C. Show that, for any scalars α and
β satisfying α + β = 1, the matrix αB + βC is also a left inverse of A. It follows that if a
matrix has two different left inverses, then it has an infinite number of left inverses.
1.5.3. Let A ∈ Mm,n (R) and let TA : Rn → Rm be the corresponding linear map (see (1.3)).
(a) Prove that A has a left inverse if and only if TA is injective.
(b) Prove that A has a right inverse if and only if TA is surjective.
(c) Prove that A has a two-sided inverse if and only if TA is an isomorphism.
1.5.4. Consider the matrix
A = [  1 1 1 ]
    [ −2 1 4 ] .
(a) Is A left-invertible? If so, find a left inverse.
(b) Compute AAT .
(c) Is A right-invertible? If so, find a right inverse.
Additional recommended exercises from [BV18]: 11.2, 11.3, 11.5, 11.6, 11.7, 11.12, 11.13,
11.17, 11.18, 11.22.
1.6 LU factorization
In this section we discuss a certain factorization of matrices that is very useful in solving
linear systems. We follow here the presentation in [Nic, §2.7].
1.6.2 LU factorization
Suppose A is an m × n matrix. Then we can use row reduction to transform A to a row-
echelon matrix U , which is therefore upper triangular. As discussed in Section 1.4, this
reduction can be performed by multiplying on the left by elementary matrices:
A → E1 A → E2 E1 A → · · · → Ek Ek−1 · · · E2 E1 A = U.
It follows that
A = LU,   where   L = (Ek Ek−1 · · · E2 E1 )−1 = E1 −1 E2 −1 · · · Ek−1 −1 Ek −1 .
As long as we do not require that U be reduced then, except for row interchanges, none of
the above row operations involve adding a row to a row above it. Therefore, if we can avoid
row interchanges, all the Ei are lower triangular. In this case, L is lower triangular (and
invertible) by Lemma 1.6.2. Thus, we have the following result. We say that a matrix can
be lower reduced if it can be reduced to row-echelon form without using row interchanges.
Proposition 1.6.3. If A can be lower reduced to a row-echelon (hence upper triangular)
matrix U , then we have
A = LU
for some lower triangular, invertible matrix L.
Definition 1.6.4 (LU factorization). A factorization A = LU as in Proposition 1.6.3 is
called an LU factorization or LU decomposition of A.
It is possible that no LU factorization exists, when A cannot be reduced to row-echelon
form without using row interchanges. We will discuss in Section 1.6.3 how to handle this
situation. However, if an LU factorization exists, then the gaussian algorithm gives U and a
procedure for finding L.
We have highlighted the leading column at each step. In each leading column, we divide the
top row by the top (pivot) entry to create a 1 in the pivot position. Then we use the leading
one to create zeros below that entry. Then we have
A = LU,   where   L = [  1 0 0 ]
                      [ −2 2 0 ]
                      [ −1 3 1 ] .
The matrix L is obtained from the identity matrix I3 by replacing the bottom of the first two
columns with the highlighted columns above. Note that rank A = 2, which is the number of
highlighted columns.
The method of Example 1.6.5 works in general, provided A can be lower reduced. Note
that we did not need to calculate the elementary matrices used in our row operations.
Algorithm 1.6.6 (LU algorithm). Suppose A is an m × n matrix of rank r, and that A can
be lower reduced to a row-echelon matrix U . Then A = LU where L is a lower triangular,
invertible matrix constructed as follows:
Then we have Ge1 = c1 . By Lemma 1.6.2, G is lower triangular. In addition, each Ej , and
hence each Ej−1 , is the result of either multiplying row 1 of Im by a nonzero scalar or adding
a multiple of row 1 to another row. Thus, in block form,
G = [ c1    0   ]
    [     Im−1  ]
Thus A = LU , where
U = [ 0 1 X1 ]
    [ 0 0 U1 ]

is row-echelon and

L = [ c1    0   ] [ 1  0  ]   [ c1   0  ]
    [     Im−1  ] [ 0  L1 ] = [      L1 ]  = L(m) (c1 , c2 , . . . , cr ).
LU factorization is very important in practice. It often happens that one wants to solve
a series of systems
Ax = b1 , Ax = b2 , · · · , Ax = bk
with the same coefficient matrix. It is very efficient to first solve the first system by gaus-
sian elimination, simultaneously creating an LU factorization of A. Then one can use this
factorization to solve the remaining systems quickly by forward and back substitution.
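As an illustration (not part of the original notes), here is a minimal sketch assuming Python with SciPy: the matrix is factored once, and the stored factorization is then reused to solve several systems by forward and back substitution.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4))

    # Factor once (this also records any row interchanges that were needed) ...
    lu, piv = lu_factor(A)

    # ... then solve cheaply for several right-hand sides.
    for k in range(3):
        b = rng.standard_normal(4)
        x = lu_solve((lu, piv), b)
        print(np.allclose(A @ x, b))   # True each time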
Let’s do one more example, this time where the matrix A is invertible.
Consider the matrix

A = [ 0 1 ]
    [ 1 0 ] ,

and suppose that A = LU is an LU factorization, with

L = [ ℓ11   0  ]               U = [ u11  u12 ]
    [ ℓ21  ℓ22 ]     and           [  0   u22 ]

lower and upper triangular, respectively. Then we have

A = [ 0 1 ]   [ ℓ11 u11        ℓ11 u12       ]
    [ 1 0 ] = [ ℓ21 u11   ℓ21 u12 + ℓ22 u22  ] .
In particular, ℓ11 u11 = 0, which implies that ℓ11 = 0 or u11 = 0. But this would mean that L is singular or U is singular. In either case, we would have

det(A) = det(L) det(U ) = 0,
which contradicts the fact that det(A) = −1. Therefore, A has no LU decomposition. By
Algorithm 1.6.6, this means that A cannot be lower reduced. The problem is that we need
to use a row interchange to reduce A to row-echelon form.
The following theorem tells us how to handle LU factorization in general.

Theorem 1.6.9. Suppose A is an m × n matrix. Then there exists a permutation matrix P such that P A = LU for some lower triangular, invertible matrix L and some row-echelon matrix U .
Proof. The only thing that can go wrong in the LU algorithm (Algorithm 1.6.6) is that, in
step (b), the leading column (i.e. first nonzero column) may have a zero entry at the top.
This can be remedied by an elementary row operation that swaps two rows. This corresponds
to multiplication by a permutation matrix (see Proposition 1.4.4(b)). Thus, if U is a row
echelon form of A, then we can write

Lr Pr Lr−1 Pr−1 · · · L2 P2 L1 P1 A = U, (1.10)

where each Li is lower triangular and invertible and each Pi is an elementary permutation matrix (or the identity).
Now, each permutation matrix can be “moved past” each lower triangular matrix to the
right of it, in the sense that, if k > j, then
Pk Lj = L′j Pk ,

where L′j = L(m) (e1 , . . . , ej−1 , c′′j ) for some column c′′j of length m − j + 1. See Exercise 1.6.2. Thus, from (1.10) we obtain

(Lr L′r−1 · · · L′2 L′1 )(Pr Pr−1 · · · P2 P1 )A = U,

for some lower triangular matrices L′1 , L′2 , . . . , L′r−1 . Setting P = Pr Pr−1 · · · P2 P1 , this implies that P A has an LU factorization, since Lr L′r−1 · · · L′2 L′1 is lower triangular and invertible by
Lemma 1.6.2.
Note that Theorem 1.6.9 generalizes Proposition 1.6.3. If A can be lower reduced, then
we can take P = Im in Theorem 1.6.9, which then states that A has an LU factorization.
A matrix that is the product of elementary matrices corresponding to row interchanges
is called a permutation matrix . (We also consider the identity matrix to be a permutation
matrix.) Every permutation matrix P is obtained from the identity matrix by permuting
the rows. Then P A is the matrix obtained from A by performing the same permutation
on the rows of A. The matrix P is a permutation matrix if and only if it has exactly one
1 in each row and column, and all other entries are zero. The elementary permutation
matrices are those matrices obtained from the identity matrix by a single row exchange.
Every permutation matrix is a product of elementary ones.
Theorem 1.6.9 is an important factorization theorem that applies to any matrix. If A
is any matrix, this theorem asserts that there exists a permutation matrix P and an LU
factorization P A = LU . Furthermore, it tells us how to find P , and we then know how to
find L and U .
Note that Pi = Pi−1 for each i (since any elementary permutation matrix is its own
inverse). Thus, the matrix A can be factored as
A = P −1 LU,
where P −1 is a permutation matrix, L is lower triangular and invertible, and U is a row-
echelon matrix. This is called a PLU factorization or a P A = LU factorization of A.
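Such factorizations are available in standard numerical libraries. A minimal sketch, assuming Python with SciPy (not part of these notes); note that SciPy's convention is A = P L U, so its permutation matrix plays the role of P −1 in the notation above:

    import numpy as np
    from scipy.linalg import lu

    A = np.array([[0.0, 1.0],
                  [1.0, 0.0]])

    # SciPy returns A = P @ L @ U; since P is a permutation matrix,
    # this is the factorization P.T @ A = L @ U in the notation of the text.
    P, L, U = lu(A)
    print(np.allclose(A, P @ L @ U))       # True
    print(np.allclose(P.T @ A, L @ U))     # True
    print(P)   # swaps the two rows
    print(L)   # identity (lower triangular)
    print(U)   # identity (upper triangular / row-echelon)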
1.6.4 Uniqueness
Theorem 1.6.9 is an existence theorem. It tells us that a PLU factorization always exists.
However, it leaves open the question of uniqueness. In general, LU factorizations (and hence
PLU factorizations) are not unique. For example,
[ −1 0 ] [ 1 4 −5 ]   [ −1 −4   5 ]   [ −1 0 ] [ 1 4 −5 ]
[  4 1 ] [ 0 0  0 ] = [  4 16 −20 ] = [  4 8 ] [ 0 0  0 ] .
(In fact, one can put any value in the (2, 2)-position of the 2 × 2 matrix and obtain the same
result.) The key to this non-uniqueness is the zero row in the row-echelon matrix. Note
that, if A is m × n, then the matrix U has no zero row if and only if A has rank m.
Theorem 1.6.11. Suppose A is an m × n matrix with LU factorization A = LU . If A has
rank m (that is, U has no zero row), then L and U are uniquely determined by A.
Proof. Suppose A = M V is another LU factorization of A. Thus, M is lower triangular and
invertible, and V is row-echelon. Thus we have
LU = M V, (1.11)
Then

N U = V   =⇒   [ a     aY        ]   [ 1  Z  ]
               [ X   XY + N1 U1  ] = [ 0  V1 ] .
This implies that
a = 1, Y = Z, X = 0, and N1 U1 = V1 .
By the induction hypothesis, the equality N1 U1 = V1 implies N1 = I. Hence N = I, as
desired.
Recall that an m × m matrix is invertible if and only if it has rank m. Thus, we get the
following special case of Theorem 1.6.11.
Exercises.
1.6.1. Prove Lemma 1.6.2.
1.6.2 ([Nic, Ex. 2.7.11]). Recall the notation L(m) (c1 , c2 , . . . , cr ) from the proof of Algo-
rithm 1.6.6. Suppose 1 ≤ i < j < k ≤ m, and let ci be a column of length m − i + 1. Show
that there is another column c′i of length m − i + 1 such that

Pj,k L(m) (e1 , . . . , ei−1 , ci ) = L(m) (e1 , . . . , ei−1 , c′i ) Pj,k .
Here Pj,k is the m × m elementary permutation matrix (see Definition 1.4.2). Hint: Recall that Pj,k −1 = Pj,k . Write

Pj,k = [ Ii      0       ]
       [ 0   Pj−i,k−i    ]
in block form.
Chapter 2

Matrix norms, sensitivity, and conditioning

In this chapter we will consider the issue of how sensitive a linear system is to small changes
or errors in its coefficients. This is particularly important in applications, where these
coefficients are often the results of measurement, and thus inherently subject to some level
of error. Therefore, our goal is to develop some precise measure of how sensitive a linear
system is to such changes.
2.1 Motivation
Consider the linear systems
Ax = b   and   Ax′ = b′ ,

where

A = [ 1     1      ]        b = [ 2       ]        b′ = [ 2       ]
    [ 1  1.00001   ] ,          [ 2.00001 ] ,           [ 2.00002 ] .

Since det A ≠ 0, the matrix A is invertible, and hence these linear systems have unique solutions. Indeed it is not hard to see that the solutions are

x = [ 1 ]               x′ = [ 0 ]
    [ 1 ]     and            [ 2 ] .
Note that even though the vector b′ is very close to b, the solutions to the two systems are quite different. So the solution is very sensitive to the entries of the vector of constants b.
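This behaviour is easy to reproduce on a computer. A minimal sketch, assuming Python with NumPy (not part of the original notes):

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 1.00001]])
    b = np.array([2.0, 2.00001])
    b_prime = np.array([2.0, 2.00002])

    x = np.linalg.solve(A, b)
    x_prime = np.linalg.solve(A, b_prime)

    print(x)        # approximately [1, 1]
    print(x_prime)  # approximately [0, 2]
    # A change of 1e-5 in one entry of b moved the solution by a distance of order 1.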
When the solution to the system
Ax = b
is highly sensitive to the entries of the coefficient matrix A or the vector b of constant terms,
we say that the system is ill-conditioned . Ill-conditioned systems are especially problematic
when the coefficients are obtained from experimental results (which always come associated
with some error) or when computations are carried out by computer (which can involve
round-off error).
So how do we know if a linear system is ill-conditioned? To do this, we need to discuss
vector and matrix norms.
Exercises.
2.1.1. Consider the following linear systems:
[ 400  −201 ] [ x1 ]   [  200 ]           [ 401  −201 ] [ x1 ]   [  200 ]
[ −800  401 ] [ x2 ] = [ −200 ]    and    [ −800  401 ] [ x2 ] = [ −200 ] .
Solve these two linear systems (feel free to use a computer) to see how the small change
in the coefficient matrix results in a large change in the solution. So the solution is very
sensitive to the entries of the coefficient matrix.
2.2 Normed vector spaces

Recall that every complex number can be written in the form

a + bi,   a, b ∈ R.
Definition 2.2.1 (Vector norm, normed vector space). A norm (or vector norm) on an F-vector space V (e.g. V = Fn ) is a function

k · k: V → R

such that, for all u, v ∈ V and c ∈ F, we have
(N1) kvk ≥ 0,
(N2) if kvk = 0 then v = 0,
(N3) kcvk = |c| kvk, and
(N4) ku + vk ≤ kuk + kvk (the triangle inequality).
(Note that the codomain of the norm is R, regardless of whether F = R or F = C.) A vector space equipped with a norm is called a normed vector space. Thus, a normed vector space is a pair (V, k · k), where k · k is a norm on V . However, we will often just refer to V as a normed vector space, leaving it implied that we have a specific norm k · k in mind.
(It is a bit harder to show this is a norm in general. The proof uses Minkowski’s inequality.)
As p approaches ∞, this becomes the norm

kvk∞ = max{|v1 |, |v2 |, . . . , |vn |},   v = (v1 , . . . , vn ) ∈ Fn , (2.2)

which is called the ∞-norm or maximum norm. See Exercise 2.2.1. In this course, we'll
focus mainly on the cases p = 1, 2, ∞.
Remark 2.2.4. It is a theorem of analysis that all norms on Fn are equivalent in the sense that, if k · k and k · k′ are two norms on Fn , then there is a c ∈ R, c > 0, such that

(1/c) kvk ≤ kvk′ ≤ c kvk   for all v ∈ Fn .
This implies that they induce the same topology on Fn . That’s beyond the scope of this
course, but it means that, in practice, we can choose whichever norm best suits our particular
application.
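For concreteness (this example is not from the original notes), here is a minimal sketch, assuming Python with NumPy, computing the three norms used most often in this course for a single vector:

    import numpy as np

    v = np.array([3.0, -4.0, 1.0])

    print(np.linalg.norm(v, 1))        # 1-norm: |3| + |-4| + |1| = 8
    print(np.linalg.norm(v, 2))        # 2-norm: sqrt(9 + 16 + 1) = sqrt(26)
    print(np.linalg.norm(v, np.inf))   # infinity-norm: max(3, 4, 1) = 4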
Exercises.
2.2.1. Show that (2.2) defines a norm on Fn .
Is this a norm on Fn ? If yes, prove it. If not, show that one of the axioms of a norm is
violated.
Definition 2.3.1 (Matrix norm). Let A ∈ Mm,n (F). If k · kp and k · kq are norms on Fn and Fm respectively, we define

kAkp,q = max { kAxkq /kxkp : x ∈ Fn , x ≠ 0 } . (2.3)

This is called the matrix norm, or operator norm, of A with respect to the norms k · kp and k · kq . We also say that kAkp,q is the matrix norm associated to the norms k · kp and k · kq .
The next lemma tells us that, in order to compute the matrix norm (2.3), it is not
necessary to check the value of kAxkq /kxkp for every x 6= 0. Instead, it is enough to
consider unit vectors.
Proposition 2.3.2. For any A ∈ Mm,n (F), we have

kAkp,q = max { kAxkq : x ∈ Fn , kxkp = 1 } .

Proof. Let

S = { kAxkq /kxkp : x ∈ Fn , x ≠ 0 }   and   T = { kAxkq : x ∈ Fn , kxkp = 1 } .
These are both sets of nonnegative real numbers, and we wish to show that they have the
same maximum. To do this, we will show that these sets are actually the same (which is a
stronger assertion).
First note that, if kxkp = 1, then

kAxkq /kxkp = kAxkq .
Thus T ⊆ S.
Now we want to show the reverse inclusion: S ⊆ T . Suppose s ∈ S. Then there is some
x ≠ 0 such that

s = kAxkq /kxkp .
Define

c = 1/kxkp ∈ R.

Then, by (N3),

kcxkp = |c| kxkp = (1/kxkp ) kxkp = 1.
Thus we have

s = kAxkq /kxkp = (|c| kAxkq )/(|c| kxkp ) = kcAxkq /kcxkp = kA(cx)kq /kcxkp = kA(cx)kq ∈ T, (by (N3))

since kcxkp = 1. Thus we have shown the reverse inclusion S ⊆ T .
Having shown both inclusions, it follows that S = T , and hence their maxima are equal.
Remark 2.3.3. In general, a set of real numbers may not have a maximum. (Consider the
set R itself.) Thus, it is not immediately clear that kAkp,q is indeed well defined. However,
one can use Proposition 2.3.2 to show that it is well defined. The set {x ∈ Fn : kxkp = 1} is
compact, and the function x 7→ kAxkq is continuous. It follows from a theorem in analysis
that this function attains a maximum.
When we use the same type of norm (e.g. the 1-norm, the 2-norm, or the ∞-norm) in
both the domain and the codomain, we typically use a single subscript on the matrix norm.
Thus, for instance,
Note that, in principle, we could choose different types of norms for the domain and codomain.
However, in practice, we usually choose the same one.
This set is the union of the four line segments joining the points (1, 0), (0, 1), (−1, 0), and (0, −1) in turn (the boundary of a square with these vertices).
Note that, in Example 2.3.4, the matrix norm kAk1 was precisely the 1-norm of one of its
columns (the second column, to be precise). The following result gives the general situation.
Theorem 2.3.5. Suppose A ∈ Mm,n (F) has columns a1 , . . . , an and rows r1 , . . . , rm . Then:
(a) kAk1 = max{ka1 k1 , . . . , kan k1 }, the maximum absolute column sum;
(b) kAk∞ = max{kr1 k1 , . . . , krm k1 }, the maximum absolute row sum;
(c) kAk2 ≤ kAkF , where kAkF denotes the Frobenius norm of A (the square root of the sum of the squares of the absolute values of the entries of A).

Proof. We will prove part (a) and leave parts (b) and (c) as Exercise 2.3.1.
Recall that aj = Aej is the j-th column of A, for 1 ≤ j ≤ n. We have kej k1 = 1 and

kAej k1 = |a1,j | + |a2,j | + · · · + |am,j | = kaj k1 .

Hence kAk1 ≥ kaj k1 for every j. For the reverse inequality, let x = (x1 , . . . , xn ) ∈ Fn . Then

Ax = x1 a1 + · · · + xn an .

Thus

kAxk1 = kx1 a1 + · · · + xn an k1
       ≤ kx1 a1 k1 + · · · + kxn an k1 (by the triangle inequality (N4))
       ≤ |x1 | ka1 k1 + · · · + |xn | kan k1 . (by (N3))

Now, suppose the column of A with the maximum 1-norm is the j-th column, and suppose kxk1 = 1. Then we have

kAxk1 ≤ |x1 | kaj k1 + · · · + |xn | kaj k1 = (|x1 | + · · · + |xn |) kaj k1 = kaj k1 .
This completes the proof.
Note that part (c) of Theorem 2.3.5 involves an inequality. In practice, the norm kAk2 is difficult to compute, while the Frobenius norm is much easier.
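As a numerical check (not from the original notes), here is a minimal sketch assuming Python with NumPy, comparing the built-in matrix norms with the column-sum and row-sum descriptions above:

    import numpy as np

    A = np.array([[1.0, -2.0, 3.0],
                  [-4.0, 5.0, -6.0]])

    # 1-norm: maximum absolute column sum; infinity-norm: maximum absolute row sum.
    col_sums = np.abs(A).sum(axis=0)
    row_sums = np.abs(A).sum(axis=1)
    print(np.linalg.norm(A, 1), col_sums.max())         # 9.0  9.0
    print(np.linalg.norm(A, np.inf), row_sums.max())    # 15.0  15.0

    # The 2-norm is bounded above by the Frobenius norm.
    print(np.linalg.norm(A, 2) <= np.linalg.norm(A, 'fro'))   # True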
The following theorem summarizes the most important properties of matrix norms.
Theorem 2.3.6 (Properties of matrix norms). Suppose k · k is a family of norms on Fn ,
n ≥ 1. We also use the notation kAk for the matrix norm with respect to these vector norms.
(a) For all v ∈ Fn and A ∈ Mm,n (F), we have kAvk ≤ kAk kvk.
(b) kIk = 1.
(c) For all A ∈ Mm,n (F), we have kAk ≥ 0 and kAk = 0 if and only if A = 0.
(d) For all c ∈ F and A ∈ Mm,n (F), we have kcAk = |c| kAk.
(e) For all A, B ∈ Mm,n (F), we have kA + Bk ≤ kAk + kBk.
(f) For all A ∈ Mm,n (F) and B ∈ Mn,k (F), we have kABk ≤ kAk kBk.
(g) For all A ∈ Mn,n (F), we have kAk k ≤ kAkk for all k ≥ 1.
(h) If A ∈ GL(n, F), then kA−1 k ≥ 1/kAk.
Proof. We prove parts (a) and (h) and leave the remaining parts as Exercise 2.3.3. Suppose
v ∈ Fn and A ∈ Mm,n (F). If v = 0, then
kAvk = 0 = kAk kvk.
Now suppose v ≠ 0. Then

kAvk/kvk ≤ max { kAxk/kxk : x ∈ Fn , x ≠ 0 } = kAk.

Multiplying both sides by kvk then gives (a).
Now suppose A ∈ GL(n, F). By the definition of kAk, we can choose x ∈ Fn , x ≠ 0, such that

kAk = kAxk/kxk.

Then we have

1/kAk = kxk/kAxk = kA−1 (Ax)k/kAxk ≤ kA−1 k

by part (a) applied to A−1 .
Exercises.
2.3.1. Prove parts (b) and (c) of Theorem 2.3.5. For part (c), use the Cauchy–Schwarz
inequality
|uT v| ≤ kuk kvk, u, v ∈ Fn ,
(here we view the 1 × 1 matrix uT v as an element of F) and note that the entries of the
product Ax are of the form uT x, where u is a row of A.
2.3.2. Suppose A ∈ Mn,n (F) and that λ is an eigenvalue of A. Show that, for any choice of
vector norm on Fn , we have kAk ≥ |λ|, where kAk is the associated matrix norm of A.
2.3.5. Suppose A ∈ Mm,n (F) and that there is some fixed k ∈ R such that kAvk ≤ k · kvk for all v ∈ Fn . (Here we have fixed some arbitrary norms on Fn and Fm .) Show that kAk ≤ k.
2.3.6. For each of the following matrices A, find kAk1 and kAk∞ .

(a) A = [ −4 ]
        [  1 ]
        [  5 ]

(b) A = [ −4 1 5 ]

(c) A = [ −9 0 2 9 ]

(d) A = [ −5 8 ]
        [  6 2 ]

(e) A = [ 2i  5 ]
        [  4 −i ]
        [  3  5 ]
2.4 Conditioning
Our goal is to develop some measure of how “good” a matrix is as a coefficient matrix of
a linear system. That is, we want some measure that allows us to know whether or not a
matrix can exhibit the bad behaviour we saw in Section 2.1.
Definition 2.4.1 (Condition number). Suppose A ∈ GL(n, F) and let k · k denote a norm on Fn as well as the associated matrix norm. The value

κ(A) := kAk kA−1 k

is called the condition number of the matrix A, relative to the choice of norm k · k.
Note that the condition number depends on the choice of norm. The fact that κ(A) ≥ 1
follows from Theorem 2.3.6(h).
Examples 2.4.2. (a) Consider the matrix from Section 2.1. We have

A = [ 1     1     ]              A−1 = 10^5 [ 1.00001  −1 ]
    [ 1  1.00001  ]     and                 [   −1      1 ] .

Therefore, with respect to the 1-norm,

κ(A) = kAk1 kA−1 k1 = (2.00001)(2.00001 · 10^5 ) = (2.00001)^2 · 10^5 ≈ 4 · 10^5 .
(b) If

B = [ 2 2 ]                    B −1 = −(1/2) [  3 −2 ]
    [ 4 3 ] ,     then                       [ −4  2 ] .

Thus, with respect to the 1-norm,

κ(B) = 6 · (7/2) = 21.
(c) If a ∈ R and

C = [ 1 a ]                C −1 = [ 1 −a ]
    [ 0 1 ]     and               [ 0  1 ] ,

then kCk1 = kC −1 k1 = 1 + |a|, and so, with respect to the 1-norm,

κ(C) = (1 + |a|)^2 .

Since a is arbitrary, this example shows that the condition number can be arbitrarily large.
Lemma 2.4.3. If c ∈ F× , then for every invertible matrix A, we have κ(cA) = κ(A).
Proof. Note that (cA)−1 = c−1 A−1 . Thus
κ(cA) = kcAk kc−1 A−1 k = |c| |c−1 | kAk kA−1 k = |cc−1 |κ(A) = κ(A).
Example 2.4.4. If

M = [ 10^16     0    ]
    [    0    10^16  ] ,

then M = 10^16 I. Hence κ(M ) = κ(I) = 1 by Lemma 2.4.3 and Theorem 2.3.6(b).
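The condition numbers in Examples 2.4.2 can also be computed directly. A minimal sketch, assuming Python with NumPy (not part of the original notes):

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 1.00001]])
    B = np.array([[2.0, 2.0],
                  [4.0, 3.0]])

    # Condition numbers with respect to the 1-norm.
    print(np.linalg.cond(A, 1))   # about 4.00004e5, i.e. (2.00001)^2 * 1e5
    print(np.linalg.cond(B, 1))   # 21.0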
Now that we’ve defined the condition number of a matrix, what does it have to do with
the situation discussed in Section 2.1? Suppose we want to solve the system
Ax = b.

Suppose that, instead of the exact vector of constants b, we work with a perturbed vector b′ = b + ∆b (due, for example, to measurement or round-off error), and we solve the perturbed system

Ax′ = b′ ,

where x′ = x + ∆x. The following result bounds the relative change in the solution in terms of the relative change in b.

Theorem 2.4.5. With the notation above, for any choice of vector norm and associated matrix norm, we have

k∆xk/kxk ≤ κ(A) · k∆bk/kbk.
Example 2.4.6. Consider the situation from Section 2.1 and Example 2.4.2(a):
A = [ 1     1      ]
    [ 1  1.00001   ] ,    b = (2, 2.00001),    b′ = (2, 2.00002),    x = (1, 1),    x′ = (0, 2).
Thus

k∆bk1 = 10^−5 ,   kbk1 = 4.00001,   k∆xk1 = 2,   kxk1 = 2.

So we have

κ(A) · k∆bk1 /kbk1 = (2.00001)^2 · 10^5 · 10^−5 /4.00001 ≥ 1 = k∆xk1 /kxk1 ,
kbk 4.00001 kxk
as predicted by Theorem 2.4.5. The fact that A is ill-conditioned explains the phenomenon
we noticed in Section 2.1: that a small change in b can result in a large change in the solution
x to the system Ax = b.
Exercises.
2.4.1. Show that there always exists a choice of b and ∆b such that we have equality in
Theorem 2.4.5.
2.4.2. If s is very large, we know from Example 2.4.2(c) that the matrix
C = [ 1 s ]
    [ 0 1 ]

is ill-conditioned. Show that, if b = (1, 1), then the system Cx = b satisfies

k∆xk/kxk ≤ 3 k∆bk/kbk
and is therefore well-conditioned. (Here we use the 1-norm.) On the other hand, find a
choice of b for which the system Cx = b is ill-conditioned.
Chapter 3

Orthogonality
The notion of orthogonality is fundamental in linear algebra. You’ve encountered this con-
cept in previous courses. Here we will delve into this subject in further detail. We begin by
briefly reviewing the Gram–Schmidt algorithm, orthogonal complements, orthogonal projec-
tion, and diagonalization. We then discuss hermitian and unitary matrices, which are com-
plex analogues of symmetric and orthogonal matrices that you’ve seen before. Afterwards,
we learn about Schur decomposition and prove the important spectral and Cayley–Hamilton
theorems. We also define positive definite matrices and consider Cholesky and QR factor-
izations. We conclude with a discussion of computing/estimating eigenvalues, including the
Gershgorin circle theorem.
For v = (v1 , . . . , vn ) ∈ Fn , we write

v̄ = (v̄1 , . . . , v̄n ) ∈ Fn

for the componentwise complex conjugate of v. We then define an inner product on Fn by

hu, vi = ū1 v1 + ū2 v2 + · · · + ūn vn . (3.1)
When F = R, this is the usual dot product. (Note that, in [Nic, §8.7], the inner product
is defined with the complex conjugation on the second vector.) The inner product has the
following important properties: For all u, v, w ∈ Fn and c, d ∈ F, we have
(IP1) hu, vi = hv, ui,
(IP2) hcu + dv, wi = c̄hu, wi + d̄hv, wi,
(IP3) hu, cv + dwi = chu, vi + dhu, wi,
(IP4) hu, ui ∈ R and hu, ui ≥ 0, and
(IP5) if hu, ui = 0, then u = 0.

More generally, a function

h·, ·i : V × V → R

satisfying (IP1)–(IP5) is called an inner product on V . In light of (IP4), for any inner
product, we may define

kvk = √(hv, vi) ,

and one can check that this defines a norm on V . For the purposes of this course, we will stick to the particular inner product (3.1). In this case kvk is the usual 2-norm:

kvk = √( |v1 |2 + · · · + |vn |2 ).
Proof. You saw this in previous courses, so we will omit the proof here. It can be found in
[Nic, Th. 8.1.1].
The Gram–Schmidt algorithm takes a linearly independent set {v1 , . . . , vm } of vectors in Fn and produces an orthogonal set spanning the same subspace, as follows:

u1 = v1 ,
u2 = v2 − (hu1 , v2 i/ku1 k2 ) u1 ,
u3 = v3 − (hu1 , v3 i/ku1 k2 ) u1 − (hu2 , v3 i/ku2 k2 ) u2 ,
⋮
um = vm − (hu1 , vm i/ku1 k2 ) u1 − (hu2 , vm i/ku2 k2 ) u2 − · · · − (hum−1 , vm i/kum−1 k2 ) um−1 .

Then {u1 , . . . , um } is an orthogonal set of nonzero vectors with

Span{u1 , . . . , um } = Span{v1 , . . . , vm }.
Example 3.1.4. Let’s find an orthogonal basis for the row space of
A = [ 1 −1  0 1 ]
    [ 2 −1 −2 3 ]
    [ 0  2  0 1 ] .
Let v1 , v2 , v3 denote the rows of A. One can check that these rows are linearly independent.
(Reduce A to echelon form and note that it has rank 3.) So they give a basis of the row
space. Let’s use the Gram–Schmidt algorithm to find an orthogonal basis:
u1 = v1 = (1, −1, 0, 1),
u2 = v2 − (hu1 , v2 i/ku1 k2 ) u1 = (2, −1, −2, 3) − (6/3)(1, −1, 0, 1) = (0, 1, −2, 1),
u3 = v3 − (hu1 , v3 i/ku1 k2 ) u1 − (hu2 , v3 i/ku2 k2 ) u2
   = (0, 2, 0, 1) − (−1/3)(1, −1, 0, 1) − (3/6)(0, 1, −2, 1) = (1/3, 7/6, 1, 5/6).
It can be nice to eliminate the fractions (see Remark 3.1.5), so
{(1, −1, 0, 1), (0, 1, −2, 1), (2, 7, 6, 5)}
is an orthogonal basis for the row space of A. If we wanted an orthonormal basis, we would
divide each of these basis vectors by its norm.
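The computation in Example 3.1.4 can be reproduced with a few lines of code. A minimal sketch, assuming Python with NumPy and real vectors (not part of the original notes):

    import numpy as np

    def gram_schmidt(vectors):
        """Orthogonalize a linearly independent list of real vectors, in order."""
        basis = []
        for v in vectors:
            u = v.astype(float)
            for w in basis:
                u = u - (w @ u) / (w @ w) * w   # subtract the projection onto w
            basis.append(u)
        return basis

    rows = [np.array([1.0, -1.0, 0.0, 1.0]),
            np.array([2.0, -1.0, -2.0, 3.0]),
            np.array([0.0, 2.0, 0.0, 1.0])]

    u1, u2, u3 = gram_schmidt(rows)
    print(u2)                          # [0, 1, -2, 1]
    print(6 * u3)                      # [2, 7, 6, 5], after clearing denominators
    print(u1 @ u2, u1 @ u3, u2 @ u3)   # all (numerically) zero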
Orthogonal (especially orthonormal) bases are particularly nice since it is easy to write
a vector as a linear combination of the elements of such a basis.
Proposition 3.1.6. Suppose {u1 , . . . , um } is an orthogonal basis of a subspace U of Fn .
Then, for any v ∈ U , we have
v = (hu1 , vi/ku1 k2 ) u1 + (hu2 , vi/ku2 k2 ) u2 + · · · + (hum , vi/kum k2 ) um . (3.2)
Thus cj = huj , vi/kuj k2 , as desired.
Remark 3.1.7. What happens if you apply the Gram–Schmidt algorithm to a set of vectors
that is not linearly independent? Remember that a list of vectors v1 , . . . , vm is linearly
dependent if and only if one of the vectors, say vk , is a linear combination of the previous
ones. Then, using Proposition 3.1.6, one can see that the Gram–Schmidt algorithm will give
uk = 0. Thus, you can still apply the Gram–Schmidt algorithm to linearly dependent sets,
as long as you simply throw out any zero vectors that you obtain in the process.
For a subspace U of Fn , the orthogonal complement of U is the set

U ⊥ = {v ∈ Fn : hv, ui = 0 for all u ∈ U }.

We read U ⊥ as “U -perp”.
(a) U ⊥ is a subspace of Fn .
(b) {0}⊥ = Fn and (Fn )⊥ = {0}.
(c) If U = Span{u1 , . . . , uk }, then U ⊥ = {v ∈ Fn : hv, ui i = 0 for all i = 1, 2, . . . , k}.
Proof. You saw these properties of the orthogonal complement in previous courses, so we
will not repeat the proofs there. See [Nic, Lem. 8.1.2].
If U is a subspace of Fn with orthogonal basis {u1 , . . . , um } and x ∈ Fn , then

projU x = (hu1 , xi/ku1 k2 ) u1 + · · · + (hum , xi/kum k2 ) um

is called the orthogonal projection of x onto U . For the zero subspace U = {0}, we define
proj{0} x = 0.
Exercises.
3.1.1. Prove that the inner product defined by (3.1) satisfies conditions (IP1)–(IP5).
3.2 Diagonalization
In this section we quickly review the topics of eigenvectors, eigenvalues, and diagonalization
that you saw in previous courses. For a more detailed review of this material, see [Nic, §3.3].
Throughout this section, we suppose that A ∈ Mn,n (F). Recall that if Ax = λx for some scalar λ ∈ F and some nonzero x ∈ Fn , then λ is called an eigenvalue of A and x an eigenvector of A corresponding to λ. The eigenvalues of A are exactly the roots of the characteristic polynomial cA (x) = det(xI − A).
If F = C, then the characteristic polynomial will factor completely. That is, we have

cA (x) = (x − λ1 )^mλ1 (x − λ2 )^mλ2 · · · (x − λk )^mλk ,

where λ1 , . . . , λk are distinct. Then mλi is called the algebraic multiplicity of λi . It follows that mλ1 + mλ2 + · · · + mλk = n.
For an eigenvalue λ of A, the set of solutions to the equation
Ax = λx or (A − λI)x = 0
is called the eigenspace of A corresponding to λ, and is denoted Eλ .
Recall also that A is diagonalizable if there exists an invertible matrix P such that P −1 AP is diagonal; equivalently,

A = P DP −1

for some invertible matrix P and diagonal matrix D.
We say that matrices A, B ∈ Mn,n (C) are similar if there exists some invertible matrix P
such that A = P BP −1 . So a matrix is diagonalizable if and only if it is similar to a triangular
matrix.
Theorem 3.2.1. Suppose A ∈ Mn,n (C). The following statements are equivalent.
(a) A is diagonalizable.
(b) Cn has a basis consisting of eigenvectors of A.
(c) dim Eλ = mλ for every eigenvalue λ of A.
In this case, if P is an invertible matrix whose columns form a basis of eigenvectors of A, with the i-th column corresponding to the eigenvalue λi , then

P −1 AP = D = [ λ1  0  · · ·  0  ]
              [ 0   λ2 · · ·  0  ]
              [ ⋮   ⋮   ⋱     ⋮  ]
              [ 0   0  · · ·  λn ] .
cA (x) = det(xI − A) = det [ x − 3    0      0   ]
                           [  −1    x − 3   −1  ]   = (x − 3) det [ x − 3   −1  ]   = (x − 3)2 (x + 1).
                           [   4      0    x + 1 ]                [   0    x + 1 ]

Thus the eigenvalues are 3 and −1, with algebraic multiplicities
m−1 = 1, m3 = 2.
For the eigenvalue 3, we compute the corresponding eigenspace E3 by solving the system
(A − 3I)x = 0:
[  0  0   0 | 0 ]                     [ 1 0 1 | 0 ]
[  1  0   1 | 0 ]    —row reduce→     [ 0 0 0 | 0 ]
[ −4  0  −4 | 0 ]                     [ 0 0 0 | 0 ] .
Thus
E3 = Span{(1, 0, −1), (0, 1, 0)}.
In particular, this eigenspace is 2-dimensional, with basis {(1, 0, −1), (0, 1, 0)}.
Thus
E−1 = Span{(0, 1, 4)}.
In particular, this eigenspace is 1-dimensional, with basis {(0, 1, 4)}.
Since we have dim Eλ = mλ for each eigenvalue λ, the matrix A is diagonalizable. In
particular, we have A = P DP −1 , where
P = [  1 0 0 ]                D = [ 3 0  0 ]
    [  0 1 1 ]      and           [ 0 3  0 ]
    [ −1 0 4 ]                    [ 0 0 −1 ] .
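As a quick illustration (not part of the original notes, and using a small symmetric matrix chosen only for the example rather than the matrix above), here is a minimal sketch assuming Python with NumPy that computes a diagonalization A = P DP −1 numerically:

    import numpy as np

    # An illustrative symmetric matrix.
    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    # Columns of P are eigenvectors; D holds the eigenvalues on its diagonal.
    eigenvalues, P = np.linalg.eig(A)
    D = np.diag(eigenvalues)

    print(eigenvalues)                                 # 3 and 1 (order may vary)
    print(np.allclose(A, P @ D @ np.linalg.inv(P)))    # True: A = P D P^{-1}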
Then
cB (x) = det(xI − B) = det [  x  0  0 ]
                           [ −2  x  0 ]   = x3 .
                           [  0  3  x ]
Thus B has only one eigenvalue λ = 0, with algebraic multiplicity 3. To find E0 we solve
[ 0  0  0 ]
[ 2  0  0 ] x = 0.
[ 0 −3  0 ]
Exercises.
Recommended exercises: Exercises in [Nic, §3.3].
For example, the real matrix

A = [  0 1 ]
    [ −1 0 ]

has no real eigenvalues: its characteristic polynomial is

cA (x) = det(xI − A) = x2 + 1,
which has roots ±i. Then one can find the associated eigenvectors in the usual way to see
that
A(1, i) = (i, −1) = i(1, i)     and     A(1, −i) = (−i, −1) = −i(1, −i).
Thus, when considering eigenvalues, eigenvectors, and diagonalization, it makes much more
sense to work over the complex numbers.
Definition 3.3.1 (Conjugate transpose). The conjugate transpose (or hermitian conjugate)
AH of a complex matrix A is defined by

AH = (Ā)T ,

that is, AH is obtained from A by transposing and taking the complex conjugate of every entry (in either order).
Note that AH = AT when A is real. In many ways, the conjugate transpose is the “cor-
rect” complex analogue of the transpose for real matrices, in the sense that many theorems
for real matrices involving the transpose remain true for complex matrices when you replace
“transpose” by “conjugate transpose”. We can also rewrite the inner product (3.1) as
hu, vi = uH v. (3.7)
(a) (AH )H = A.
(b) (A + B)H = AH + B H .
(c) (cA)H = c̄AH .
(d) (AB)H = B H AH .
Thus A is hermitian.
Proposition 3.3.7. If A ∈ Mn,n (C) is hermitian, then every eigenvalue of A is real.

Proof. Suppose

Ax = λx,   x ∈ Cn , x ≠ 0, λ ∈ C.

Then

λhx, xi = hx, λxi (by (IP3))
        = hx, Axi
        = hAx, xi (by Proposition 3.3.6)
        = hλx, xi
        = λ̄hx, xi. (by (IP2))

Thus we have

(λ − λ̄)hx, xi = 0.

Since x ≠ 0, (IP5) implies that λ = λ̄. Hence λ ∈ R.
Proposition 3.3.8. If A ∈ Mn,n (C) is hermitian, then eigenvectors of A corresponding to
distinct eigenvalues are orthogonal.
Proof. Suppose
Ax = λx, Ay = µy, x, y 6= 0, λ 6= µ.
Since A is hermitian, we have λ, µ ∈ R by Proposition 3.3.7. Then we have
λhx, yi = hλx, yi (by (IP2) and λ ∈ R)
= hAx, yi
= hx, Ayi (by Proposition 3.3.6)
= hx, µyi
= µhx, yi. (by (IP3))
Thus we have
(λ − µ)hx, yi = 0.
Since λ 6= µ, it follows that hx, yi = 0.
Proposition 3.3.9. The following conditions are equivalent for a matrix U ∈ Mn,n (C).
(a) U is invertible and U −1 = U H .
(b) The rows of U are orthonormal.
(c) The columns of U are orthonormal.
Proof. The proof of this result is almost identical to the characterization of orthogonal
matrices that you saw in previous courses. For details, see [Nic, Th. 8.2.1] and [Nic, Th. 8.7.6].
You saw in previous courses that symmetric real matrices are always diagonalizable. We
will see in the next section that the same is true for complex hermitian matrices. Before
discussing the general theory, let’s do a simple example that illustrates some of the ideas
we’ve seen in this section.
Exercises.
3.3.1. Recall that, for θ ∈ R, we define the complex exponential
eiθ = cos θ + i sin θ.
Find necessary and sufficient conditions on α, β, γ, θ ∈ R for the matrix
[ e^(iα)   e^(iβ) ]
[ e^(iγ)   e^(iθ) ]
to be hermitian. Your final answer should not involve any complex exponentials or trigono-
metric functions.
One of our goals in this section is to show that every hermitian matrix is unitarily diago-
nalizable. We first prove an important theorem which has this result as an easy consequence.
Theorem 3.4.2 (Schur’s theorem). If A ∈ Mn,n (C), then there exists a unitary matrix U
such that
U H AU = T
is upper triangular. Moreover, the entries on the main diagonal of T are the eigenvalues of
A (including multiplicities).
Proof. The proof is by induction on n, the case n = 1 being trivial. Let λ1 be an eigenvalue of A, and let y1 be a corresponding eigenvector with ky1 k = 1. Extend y1 to an orthonormal basis

{y1 , y2 , . . . , yn }

of Cn . Then

U1 = [ y1 y2 · · · yn ]

is a unitary matrix. In block form, we have

U1H AU1 = [ y1H ]
          [ y2H ] [ λ1 y1   Ay2   · · ·   Ayn ]   =   [ λ1   X1 ]
          [  ⋮  ]                                     [ 0    A1 ] .
          [ ynH ]
By the induction hypothesis, there is a unitary matrix W1 ∈ Mn−1,n−1 (C) such that

W1H A1 W1 = T1
is upper triangular.
Finally, since A and T are similar matrices, they have the same eigenvalues, and these
eigenvalues are the diagonal entries of T since T is upper triangular.
By Schur’s theorem (Theorem 3.4.2), every matrix A ∈ Mn,n (C) can be written in the
form
A = U T U H = U T U −1 (3.9)
where U is unitary, T is upper triangular, and the diagonal entries of T are the eigenvalues of A. The expression (3.9) is called a Schur decomposition of A.
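Schur decompositions can be computed numerically. A minimal sketch, assuming Python with NumPy and SciPy (not part of the original notes), using a random complex matrix chosen only for illustration:

    import numpy as np
    from scipy.linalg import schur

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

    # Complex Schur decomposition: A = U T U^H with U unitary, T upper triangular.
    T, U = schur(A, output='complex')

    print(np.allclose(A, U @ T @ U.conj().T))                      # True
    print(np.allclose(np.tril(T, -1), 0))                          # True: T is upper triangular
    print(np.allclose(np.sort_complex(np.diag(T)),
                      np.sort_complex(np.linalg.eigvals(A))))      # diagonal of T = eigenvalues of A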
Recall that the trace of a square matrix A = [aij ] is

tr A = a11 + a22 + · · · + ann .
In other words, tr A is the sum of the entries of A on the main diagonal. Similar matrices
have the same trace and determinant (see Exercises 3.4.1 and 3.4.2).
Corollary 3.4.3. Suppose A ∈ Mn,n (C), and let λ1 , . . . , λn denote the eigenvalues of A,
including multiplicities. Then
det A = λ1 λ2 · · · λn and tr A = λ1 + λ2 + · · · + λn .
Proof. Since the statements are clearly true for triangular matrices, the corollary follows from
the fact mentioned above that similar matrices have the same determinant and trace.
Schur’s theorem states that every complex square matrix can be “unitarily triangular-
ized”. However, not every complex square matrix can be unitarily diagonalized. For example,
the matrix

[ 1 1 ]
[ 0 1 ]

cannot be unitarily diagonalized. You can see this by finding the eigenvectors of this matrix and seeing
that there is no basis of eigenvectors (there is only one eigenvalue, but its corresponding
eigenspace only has dimension one).
Theorem 3.4.4 (Spectral theorem). Every hermitian matrix is unitarily diagonalizable. In
other words, if A is a hermitian matrix, then there exists a unitary matrix U such that U H AU
is diagonal.
Proof. Suppose A is a hermitian matrix. By Schur’s Theorem (Theorem 3.4.2), there exists
a unitary matrix U such that U H AU = T is upper triangular. Then we have
T H = (U H AU )H = U H AH (U H )H = U H AU = T.

Thus T is hermitian. Since T is also upper triangular, it follows that T is diagonal, as desired.
Theorem 3.4.5 (Real spectral theorem, principal axes theorem). The following conditions
are equivalent for A ∈ Mn,n (R).
cA (x) = det(xI − A) = x2 + 4.
Thus the eigenvalues are 2i and −2i. The corresponding eigenvectors are
[ −1 ]              [  i ]
[  i ]     and      [ −1 ] .
These vectors are orthogonal and both have length √2. Therefore

U = (1/√2) [ −1   i ]
           [  i  −1 ]
is a unitary matrix such that
U H AU = [ 2i    0  ]
         [  0  −2i  ]
is diagonal.
Why does the converse of Theorem 3.4.4 fail? Why doesn’t the proof that an orthogonally
diagonalizable real matrix is symmetric carry over to the complex setting? Let’s recall the
proof in the real case. Suppose A is orthogonally diagonalizable. Then there exists a real
orthogonal matrix P (so P −1 = P T ) and a real diagonal matrix D such that A = P DP T .
Then we have

AT = (P DP T )T = P DT P T = P DP T = A,
where we used the fact that D = DT for a diagonal matrix. Hence A is symmetric. However,
suppose we assume that A is a unitarily diagonalizable. Then there exists a unitary matrix
U and a complex diagonal matrix D such that A = U DU H . We have

AH = (U DU H )H = U DH U H ,
and here we’re stuck. We won’t have DH = D unless the entries of the diagonal matrix
D are all real. So the argument fails. It turns out that we need to introduce a stronger
condition on the matrix A.
Definition 3.4.7 (Normal matrix). A matrix N ∈ Mn,n (C) is normal if N N H = N H N .
Clearly every hermitian matrix is normal. Note that the matrix A in Example 3.4.6 is
also normal since
AAH = [ 0 −2 ] [  0 2 ]   [ 4 0 ]   [  0 2 ] [ 0 −2 ]
      [ 2  0 ] [ −2 0 ] = [ 0 4 ] = [ −2 0 ] [ 2  0 ] = AH A.
First suppose that A is unitarily diagonalizable, say

U H AU = D
for some unitary matrix U and diagonal matrix D. Since diagonal matrices commute with
each other, we have DDH = DH D. Now

DDH = (U H AU )(U H AH U ) = U H AAH U

and
DH D = (U H AH U )(U H AU ) = U H AH AU.
Hence
U H (AAH )U = U H (AH A)U.
Multiplying on the left by U and on the right by U H gives AAH = AH A, as desired.
Now suppose that A ∈ Mn,n (C) is normal, so that AAH = AH A. By Schur’s theorem
(Theorem 3.4.2), we can write
U H AU = T
for some unitary matrix U and upper triangular matrix T . Then T is also normal since
(a) A − λ1 I has zero first column, since the first column of A is (λ1 , 0, . . . , 0).
(b) Then (A − λ1 I)(A − λ2 I) has the first two columns zero because the second column of
(A − λ2 I) is of the form (b, 0, . . . , 0) for some b ∈ C.
(c) Next (A − λ1 I)(A − λ2 I)(A − λ3 I) has the first three columns zero because the third
column of (A − λ3 I) is of the form (c, d, 0, . . . , 0) for some c, d ∈ C.
Continuing in this manner, we see that (A − λ1 I)(A − λ2 I) · · · (A − λn I) has all n columns
zero, and hence is the zero matrix.
Exercises.
3.4.1. Suppose that A and B are similar square matrices. Show that det A = det B.
(a) Suppose that N ∈ Mn,n (C) is upper triangular with all diagonal entries equal to zero. Show that, for each 1 ≤ j ≤ n,

N ej ∈ Span{e1 , e2 , . . . , ej−1 },
where ei is the i-th standard basis vector. (When j = 1, we interpret the set {e1 , e2 , . . . , ej−1 }
as the empty set. Recall that Span ∅ = {0}.)
(b) Again, suppose that N ∈ Mn,n (C) is upper triangular with diagonal entries equal to
zero. Show that N n = 0.
(c) Suppose A ∈ Mn,n (C) has eigenvalues λ1 , . . . , λn , with multiplicity. Show that A = P + N for some P, N ∈ Mn,n (C) satisfying N n = 0 and P = U DU H , where U is unitary and D = diag(λ1 , . . . , λn ). Hint: Use Schur's theorem.
Definition 3.5.1 (Positive definite). A hermitian matrix A ∈ M_{n,n}(C) is positive definite if
⟨x, Ax⟩ ∈ R_{>0} for all x ∈ C^n, x ≠ 0.
By Proposition 3.3.7, we know that the eigenvalues of a hermitian matrix are real.
Proposition 3.5.2. A hermitian matrix is positive definite if and only if all its eigenvalues
λ are positive, that is, λ > 0.
Proof. Suppose A is a hermitian matrix. By the spectral theorem (Theorem 3.4.4), there
exists a unitary matrix U such that U H AU = D = diag(λ1 , . . . , λn ), where λ1 , . . . , λn are
the eigenvalues of A. For x ∈ Cn , define
\[
y = U^H x = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
\]
Then
\[
\langle x, Ax\rangle = x^H A x = x^H (U D U^H) x = (U^H x)^H D U^H x = y^H D y
= \lambda_1 |y_1|^2 + \lambda_2 |y_2|^2 + \cdots + \lambda_n |y_n|^2. \tag{3.10}
\]
If every λ_i > 0, then (3.10) implies that ⟨x, Ax⟩ > 0, since y ≠ 0 (because x ≠ 0 and
U is invertible) and so some |y_j|^2 > 0. So A is positive definite.
Conversely, suppose A is positive definite. For j ∈ {1, 2, . . . , n}, let x = U e_j ≠ 0. Then
y = e_j, and so (3.10) gives
\[
\lambda_j = \langle x, Ax\rangle > 0.
\]
Hence all the eigenvalues of A are positive.
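A minimal Python sketch of Proposition 3.5.2 (the sample matrix is an assumption): positive definiteness of a hermitian matrix can be tested by checking that all of its eigenvalues are positive.

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Check positive definiteness of a hermitian matrix via its (real) eigenvalues."""
    if not np.allclose(A, A.conj().T):
        raise ValueError("matrix is not hermitian")
    return bool(np.all(np.linalg.eigvalsh(A) > tol))

A = np.array([[2.0, 1j], [-1j, 2.0]])    # hermitian, eigenvalues 1 and 3
print(is_positive_definite(A))            # True
print(is_positive_definite(-A))           # False (eigenvalues -1 and -3)
```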
Remark 3.5.3. A hermitian matrix is positive semi-definite if ⟨x, Ax⟩ ≥ 0 for all x ≠ 0 in
C^n. Then one can show that a hermitian matrix is positive semi-definite if and only if all
its eigenvalues λ are nonnegative, that is, λ ≥ 0. (See Exercise 3.5.1.) One can also consider
negative definite and negative semi-definite matrices and they have analogous properties.
However, we will focus here on positive definite matrices.
Example 3.5.6. Let's show that, for any invertible matrix U ∈ M_{n,n}(C), the matrix A = U^H U
is positive definite. Indeed, we have
\[
A^H = (U^H U)^H = U^H U = A,
\]
so A is hermitian. Moreover, for any x ≠ 0,
\[
\langle x, Ax\rangle = x^H U^H U x = (Ux)^H (Ux) = \|Ux\|^2 > 0,
\]
where the last inequality follows from (IP5) and the fact that Ux ≠ 0, since x ≠ 0 and U is
invertible.
In fact, we will see that the converse to Example 3.5.6 is also true. Before verifying this,
we discuss another important concept.
Definition 3.5.7 (Principal submatrices). If A ∈ Mn,n (C), let (r) A denote the r × r subma-
trix in the upper-left corner of A; that is, (r) A is the matrix obtained from A by deleting the
last n − r rows and columns. The matrices (1) A, (2) A, . . . , (n) A = A are called the principal
submatrices of A.
Example 3.5.8. If
\[
A = \begin{bmatrix} 5 & 7 & 2-i \\ 6i & 0 & -3i \\ 2+9i & -3 & 1 \end{bmatrix},
\]
then
\[
{}^{(1)}A = \begin{bmatrix} 5 \end{bmatrix}, \qquad
{}^{(2)}A = \begin{bmatrix} 5 & 7 \\ 6i & 0 \end{bmatrix}, \qquad
{}^{(3)}A = A.
\]
Lemma 3.5.9. If A ∈ M_{n,n}(C) is positive definite, then so is each principal submatrix {}^{(r)}A for
r = 1, 2, . . . , n.
Proof. Write
\[
A = \begin{bmatrix} {}^{(r)}A & P \\ Q & R \end{bmatrix}
\]
in block form. First note that
\[
\begin{bmatrix} {}^{(r)}A & P \\ Q & R \end{bmatrix} = A = A^H
= \begin{bmatrix} ({}^{(r)}A)^H & Q^H \\ P^H & R^H \end{bmatrix}.
\]
Hence ({}^{(r)}A)^H = {}^{(r)}A, and so {}^{(r)}A is hermitian.
Now let y ∈ C^r, y ≠ 0. Define
\[
x = \begin{bmatrix} y \\ 0 \end{bmatrix} \in \mathbb{C}^n.
\]
B = {}^{(n-1)}A.
Then B is hermitian and satisfies (b). Hence, by Lemma 3.5.9 and our induction hypothesis,
we have
B = UHU
for some upper triangular matrix U ∈ Mn−1,n−1 (C) with positive real entries on the main
diagonal. Since A is hermitian, it has block form
\[
A = \begin{bmatrix} B & p \\ p^H & b \end{bmatrix}, \qquad p \in \mathbb{C}^{n-1}, \ b \in \mathbb{R}.
\]
If we define
\[
x = (U^H)^{-1} p \qquad\text{and}\qquad c = b - x^H x,
\]
then block multiplication gives
\[
A = \begin{bmatrix} U^H U & p \\ p^H & b \end{bmatrix}
= \begin{bmatrix} U^H & 0 \\ x^H & 1 \end{bmatrix} \begin{bmatrix} U & x \\ 0 & c \end{bmatrix}. \tag{3.11}
\]
Taking determinants and using (1.2) gives
\[
\det A = \det(U^H)\,\det(U)\,c = |\det U|^2\, c.
\]
(Here we use Exercise 3.3.2.) Since det A > 0 by (b), it follows that c > 0. Thus, the
factorization (3.11) can be modified to give
\[
A = \begin{bmatrix} U^H & 0 \\ x^H & \sqrt{c} \end{bmatrix} \begin{bmatrix} U & x \\ 0 & \sqrt{c} \end{bmatrix}.
\]
Since U is upper triangular with positive real entries on the main diagonal, this proves the
induction step.
It remains to prove the uniqueness assertion in the statement of the theorem. Suppose
that
A = U H U = U1H U1
are two Cholesky factorizations. Then, by Lemma 1.6.2,
\[
D := U U_1^{-1} = (U^H)^{-1} U_1^H \tag{3.12}
\]
is both upper triangular (since U and U_1 are) and lower triangular (since U^H and U_1^H are).
Thus D is a diagonal matrix. It follows from (3.12) that
\[
U = D U_1 \qquad\text{and}\qquad U_1 = D^H U,
\]
and so
\[
U = D U_1 = D D^H U.
\]
Since U is invertible, this implies that DDH = I. Because the diagonal entries of D are
positive real numbers (since this is true for U and U1 ), it follows that D = I. Thus U = U1 ,
as desired.
Remark 3.5.11. (a) If the real matrix A ∈ Mn,n (R) is symmetric (hence also hermitian),
then the matrix U appearing in the Cholesky factorization A = U H U also has real
entries, and so A = U T U . See [Nic, Th. 8.3.3].
(b) Positive semi-definite matrices also have Cholesky factorizations, as long as we allow
the diagonal entries of U to be zero. However, the factorization is no longer unique
in general.
Theorem 3.5.10 tells us that every positive definite matrix has a Cholesky factorization.
But how do we find the Cholesky factorization?
Algorithm 3.5.12.
(a) Transform A to an upper triangular matrix U_1 with positive real diagonal entries using
row operations, each of which adds a multiple of a row to a lower row.
(b) Obtain U from U_1 by dividing each row of U_1 by the square root of the diagonal entry
in that row.
The key is that step (a) is possible for any positive definite matrix A. Let’s do an example
before proving Algorithm 3.5.12.
Example 3.5.13. Consider the hermitian matrix
\[
A = \begin{bmatrix} 2 & i & -3 \\ -i & 5 & 2i \\ -3 & -2i & 10 \end{bmatrix}.
\]
We can compute
\[
\det {}^{(1)}A = 2 > 0, \qquad \det {}^{(2)}A = 9 > 0, \qquad \det {}^{(3)}A = \det A = 49 > 0.
\]
Thus, by Theorem 3.5.10, A is positive definite and has a unique Cholesky factorization. We
carry out step (a) of Algorithm 3.5.12 as follows:
\[
A = \begin{bmatrix} 2 & i & -3 \\ -i & 5 & 2i \\ -3 & -2i & 10 \end{bmatrix}
\xrightarrow[\;R_3 + \frac{3}{2}R_1\;]{\;R_2 + \frac{i}{2}R_1\;}
\begin{bmatrix} 2 & i & -3 \\ 0 & 9/2 & i/2 \\ 0 & -i/2 & 11/2 \end{bmatrix}
\xrightarrow{\;R_3 + \frac{i}{9}R_2\;}
\begin{bmatrix} 2 & i & -3 \\ 0 & 9/2 & i/2 \\ 0 & 0 & 49/9 \end{bmatrix} = U_1.
\]
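The following Python sketch carries out steps (a) and (b) of Algorithm 3.5.12 on this matrix and compares the result with `numpy.linalg.cholesky` (which returns the lower triangular factor L with A = L L^H, so U = L^H by uniqueness); the code is only an illustration of the algorithm above.

```python
import numpy as np

A = np.array([[ 2,    1j, -3  ],
              [-1j,   5,   2j ],
              [-3,  -2j,  10  ]], dtype=complex)

# Step (a): row operations R_i <- R_i + c*R_j (j < i) to reach upper triangular U1.
U1 = A.copy()
n = A.shape[0]
for j in range(n):
    for i in range(j + 1, n):
        U1[i, :] += (-U1[i, j] / U1[j, j]) * U1[j, :]

# Step (b): divide each row by the square root of its (positive real) diagonal entry.
U = U1 / np.sqrt(np.diag(U1))[:, None]

print(np.allclose(U.conj().T @ U, A))   # A = U^H U
L = np.linalg.cholesky(A)               # numpy returns L with A = L L^H
print(np.allclose(L.conj().T, U))       # so U = L^H, by uniqueness
```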
Proof of Algorithm 3.5.12. Suppose A is positive definite, and let A = U H U be the Cholesky
factorization. Let D = diag(d1 , . . . , dn ) be the common diagonal of U and U H . (So the di
are positive real numbers.) Then U H D−1 is lower unitriangular (lower triangular with ones
on the diagonal). Thus L = (U H D−1 )−1 is also lower unitriangular. Therefore we can write
L = Er · · · E2 E1 In ,
where each Ei is an elementary matrix corresponding to a row operation that adds a multiple
of one row to a lower row (we modify columns right to left). Then we have
Er · · · E2 E1 A = LA = D(U H )−1 U H U = DU
is upper triangular with positive real entries on the diagonal. This proves that step (a) of
the algorithm is possible.
Now consider step (b). We have already shown that we can find a lower unitriangular
matrix L1 and an invertible upper triangular matrix U1 , with positive real entries on the
diagonal, such that L1 A = U1 . (In the notation above, L1 = Er · · · E1 and U1 = DU .) Since
A is hermitian, we have
\[
U_1 L_1^H = (L_1 A) L_1^H = L_1 A^H L_1^H = L_1 (L_1 A)^H = L_1 U_1^H. \tag{3.13}
\]
Let D_1 = diag(d_1, . . . , d_n) denote the diagonal matrix with the same diagonal entries as U_1.
Then (3.13) implies that
\[
L_1 U_1^H D_1^{-1} = U_1 L_1^H D_1^{-1}.
\]
This is both upper triangular (since U_1 L_1^H D_1^{-1} is) and lower unitriangular (since L_1 U_1^H D_1^{-1}
is), and so must equal I_n. Thus
\[
U_1^H D_1^{-1} = L_1^{-1}.
\]
Now let
\[
D_2 = \operatorname{diag}\bigl(\sqrt{d_1}, \dots, \sqrt{d_n}\bigr),
\]
so that D_2^2 = D_1, and set U = D_2^{-1} U_1. Then U is upper triangular with positive real diagonal
entries, and
\[
U^H U = U_1^H D_2^{-1} D_2^{-1} U_1 = U_1^H D_1^{-1} U_1 = L_1^{-1} U_1 = L_1^{-1}(L_1 A) = A.
\]
Since U = D_2^{-1} U_1 is the matrix obtained from U_1 by dividing each row by the square root
of its diagonal entry, this completes the proof of step (b).
Suppose we have a linear system
Ax = b,
where A is a hermitian positive definite (e.g. real symmetric positive definite) matrix. Then
we can find the Cholesky decomposition A = U^H U and consider the linear system
U H U x = b.
As with the LU decomposition, we can first solve U H y = b by forward substitution, and then
solve U x = y by back substitution. For linear systems that can be put in symmetric form,
using the Cholesky decomposition is roughly twice as efficient as using the LU decomposition.
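A minimal Python sketch (with assumed data) of this solution strategy, using forward and back substitution on the triangular factors:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])     # symmetric positive definite
b = np.array([2.0, 1.0, 3.0])

U = np.linalg.cholesky(A).T          # real case: A = U^T U with U upper triangular
y = solve_triangular(U.T, b, lower=True)   # forward substitution: U^T y = b
x = solve_triangular(U, y, lower=False)    # back substitution:    U x = y

print(np.allclose(A @ x, b))
```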
Exercises.
3.5.1. Show that a hermitian matrix is positive semi-definite if and only if all its eigenvalues
λ are nonnegative, that is, λ ≥ 0.
3.6 QR factorization
Unitary matrices are very easy to invert, since the conjugate transpose is the inverse. Thus,
it is useful to factor an arbitrary matrix as a product of a unitary matrix and a triangular
matrix (which we’ve seen are also nice in many ways). We tackle this problem in this section.
A good reference for this material is [Nic, §8.4]. (However, see Remark 3.6.2.)
Note that the bottom m − n rows of an m × n upper triangular matrix (with m ≥ n) are
zero rows. Thus, we can write a QR factorization A = QR in block form
\[
A = QR = Q \begin{bmatrix} R_1 \\ 0 \end{bmatrix}
= \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = Q_1 R_1,
\]
where R_1 is an n × n upper triangular matrix whose entries on the main diagonal are
nonnegative real numbers, 0 is the (m − n) × n zero matrix, Q_1 is m × n, Q_2 is m × (m − n),
and Q_1, Q_2 both have orthonormal columns. The factorization A = Q_1 R_1 is called a thin
QR factorization or reduced QR factorization of A.
Remark 3.6.2. You may find different definitions of QR factorization in other references. In
particular, some references, including [Nic, §8.6], refer to the reduced QR factorization as
a QR factorization (without using the word “reduced”). In addition, [Nic, §8.6] imposes
the condition that the columns of A are linearly independent. This is not needed for the
existence result we will prove below (Theorem 3.6.3), but it will be needed for uniqueness
(Theorem 3.6.6). When consulting other references, be sure to look at which definition they
are using to avoid confusion.
Note that, given a reduced QR factorization, one can easily obtain a QR factorization
by extending the columns of Q1 to an orthonormal basis, and defining Q2 to be the matrix
whose columns are the additional vectors in this basis. It follows that, when m > n, the
QR factorization is not unique. (One can extend to an orthonormal basis in more than
one way.) However, the reduced QR factorization has some chance of being unique. (See
Theorem 3.6.6.)
The power of the QR factorization comes from the fact that there are computer algorithms
that can compute it with good control over round-off error. Finding the QR factorization
involves the Gram–Schmidt algorithm.
Recall that a matrix A ∈ Mm,n (C) has linearly independent columns if and only if its
rank is n, which can only occur if A is tall or square (i.e. m ≥ n).
Theorem 3.6.3. Every tall or square matrix has a QR factorization.
Proof. We will prove the theorem under the additional assumption that A has full rank (i.e.
rank A = n), and then make some remarks about how one can modify the proof to work
without this assumption. We show that A has a reduced QR factorization, from which it
follows (as discussed above) that A has a QR factorization.
Suppose
A = a1 a2 · · · an ∈ Mm,n (C)
with linearly independent columns a1 , a2 , . . . , an . We can use the Gram-Schmidt algorithm
to obtain an orthogonal set u1 , . . . , un spanning the column space of A. Namely, we set
u1 = a1 and
\[
u_k = a_k - \sum_{j=1}^{k-1} \frac{\langle u_j, a_k\rangle}{\|u_j\|^2}\, u_j
\qquad\text{for } k = 2, 3, \dots, n. \tag{3.14}
\]
If we define
\[
q_k = \frac{1}{\|u_k\|}\, u_k \qquad\text{for each } k = 1, 2, \dots, n,
\]
then the q_1, . . . , q_n are orthonormal and (3.14) becomes
\[
\|u_k\|\, q_k = a_k - \sum_{j=1}^{k-1} \langle q_j, a_k\rangle\, q_j. \tag{3.15}
\]
Hence
\begin{align*}
a_1 &= \|u_1\| q_1,\\
a_2 &= \langle q_1, a_2\rangle q_1 + \|u_2\| q_2,\\
a_3 &= \langle q_1, a_3\rangle q_1 + \langle q_2, a_3\rangle q_2 + \|u_3\| q_3,\\
&\;\;\vdots\\
a_n &= \langle q_1, a_n\rangle q_1 + \langle q_2, a_n\rangle q_2 + \cdots + \langle q_{n-1}, a_n\rangle q_{n-1} + \|u_n\| q_n.
\end{align*}
Writing these equations in matrix form gives us the factorization we’re looking for:
\[
A = \begin{bmatrix} a_1 & a_2 & a_3 & \cdots & a_n \end{bmatrix}
= \begin{bmatrix} q_1 & q_2 & q_3 & \cdots & q_n \end{bmatrix}
\begin{bmatrix}
\|u_1\| & \langle q_1, a_2\rangle & \langle q_1, a_3\rangle & \cdots & \langle q_1, a_n\rangle \\
0 & \|u_2\| & \langle q_2, a_3\rangle & \cdots & \langle q_2, a_n\rangle \\
0 & 0 & \|u_3\| & \cdots & \langle q_3, a_n\rangle \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \|u_n\|
\end{bmatrix}.
\]
What do we do if rank A < n? In this case, the columns of A are linearly dependent,
so some of the columns are in the span of the previous ones. Suppose that ak is the first
such column. Then, as noted in Remark 3.1.7, we get uk = 0. We can fix the proof as
follows: Let v1 , . . . , vr be an orthonormal basis of (col A)⊥ . Then, if we obtain uk = 0 in
the Gram–Schmidt algorithm above, we let qk be one of the vi , using each vi exactly once.
We then continue the proof as above, and the matrix R will have a row of zeros in the k-th
row for each k that gave uk = 0 in the Gram–Schmidt algorithm.
Remark 3.6.4. Note that for a tall or square real matrix, the proof of Theorem 3.6.3 shows
us that we can find a QR factorization with Q and R real matrices.
Example 3.6.5. Let's find a QR factorization of the matrix
\[
A = \begin{bmatrix} a_1 & a_2 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}.
\]
Thus
\[
u_1 = a_1 = (1, 0, 1), \qquad
u_2 = a_2 - \frac{\langle u_1, a_2\rangle}{\|u_1\|^2}\, u_1 = (2, 1, 0) - \frac{2}{2}(1, 0, 1) = (1, 1, -1).
\]
So we have
\[
q_1 = \frac{1}{\|u_1\|}\, u_1 = \frac{1}{\sqrt{2}}(1, 0, 1), \qquad
q_2 = \frac{1}{\|u_2\|}\, u_2 = \frac{1}{\sqrt{3}}(1, 1, -1).
\]
We define
\[
Q_1 = \begin{bmatrix} q_1 & q_2 \end{bmatrix}
= \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{3} \\ 0 & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{3} \end{bmatrix}
\qquad\text{and}\qquad
R_1 = \begin{bmatrix} \|u_1\| & \langle q_1, a_2\rangle \\ 0 & \|u_2\| \end{bmatrix}
= \begin{bmatrix} \sqrt{2} & \sqrt{2} \\ 0 & \sqrt{3} \end{bmatrix}.
\]
Extending {q_1, q_2} to an orthonormal basis of R^3, we take u_3 = (1, -2, -1) (which is orthogonal to u_1 and u_2) and
\[
q_3 = \frac{1}{\|u_3\|}\, u_3 = \frac{1}{\sqrt{6}}(1, -2, -1).
\]
Then, setting
\[
Q = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}
= \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{3} & 1/\sqrt{6} \\ 0 & 1/\sqrt{3} & -2/\sqrt{6} \\ 1/\sqrt{2} & -1/\sqrt{3} & -1/\sqrt{6} \end{bmatrix}
\qquad\text{and}\qquad
R = \begin{bmatrix} \sqrt{2} & \sqrt{2} \\ 0 & \sqrt{3} \\ 0 & 0 \end{bmatrix},
\]
we obtain a QR factorization A = QR.
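The following Python sketch implements the Gram–Schmidt construction from the proof of Theorem 3.6.3 and reproduces the reduced factorization above; it assumes the example matrix is A = [[1, 2], [0, 1], [1, 0]], consistent with the columns a_1 = (1, 0, 1), a_2 = (2, 1, 0) used above.

```python
import numpy as np

def reduced_qr(A):
    """Reduced QR via the Gram-Schmidt recursion (3.14)-(3.15); assumes full column rank."""
    A = np.asarray(A, dtype=complex)
    m, n = A.shape
    Q1 = np.zeros((m, n), dtype=complex)
    R1 = np.zeros((n, n), dtype=complex)
    for k in range(n):
        u = A[:, k].copy()
        for j in range(k):
            R1[j, k] = Q1[:, j].conj() @ A[:, k]   # <q_j, a_k>
            u -= R1[j, k] * Q1[:, j]
        R1[k, k] = np.linalg.norm(u)               # ||u_k||
        Q1[:, k] = u / R1[k, k]
    return Q1, R1

A = np.array([[1, 2], [0, 1], [1, 0]])
Q1, R1 = reduced_qr(A)
print(np.allclose(Q1 @ R1, A))      # A = Q1 R1
print(np.round(R1.real, 6))         # diagonal sqrt(2), sqrt(3), as in the example
```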
Now that we know that QR factorizations exist, what about uniqueness? As noted earlier,
here we should focus on reduced QR factorizations.
Theorem 3.6.6. Every tall or square matrix A with linearly independent columns has a
unique reduced QR factorization A = QR. Furthermore, the matrix R is invertible.
Proof. Suppose A ∈ Mm,n (C), m ≥ n, has linearly independent columns. We know from
Theorem 3.6.3 that A has a QR factorization A = QR. Furthermore, since Q is invertible,
we have
rank R = rank(QR) = rank A = n.
Since R is m × n and m ≥ n, the columns of R are also linearly independent. It follows that
the entries on the main diagonal of R must be nonzero, and hence positive (since they are
nonnegative real numbers). This also holds for the upper triangular matrix appearing in a
reduced QR factorization.
Now suppose
A = QR and A = Q1 R1
are two reduced QR factorizations of A. By the above, the entries on the main diagonal of
R and R1 are positive. Since R and R1 are square and upper triangular, this implies that
they are invertible. We wish to show that Q = Q1 and R = R1 . Label the columns of Q and
Q1 :
Q = c1 c2 · · · cn and Q1 = d1 d2 · · · dn .
Since Q and Q1 have orthonormal columns, we have
\[
Q^H Q = I_n = Q_1^H Q_1.
\]
(Note that Q and Q_1 are not unitary matrices unless they are square. If they are not
square, then Q^H is only a left inverse of Q, so QQ^H and Q_1Q_1^H need not equal the identity.
Recall our discussion of one-sided inverses in Section 1.5.) Therefore, the equality QR = Q_1R_1
implies Q_1^H Q = R_1 R^{-1}. Let
\[
[t_{ij}] = Q_1^H Q = R_1 R^{-1}. \tag{3.16}
\]
Since R and R_1 are upper triangular with positive real diagonal entries, so is R_1R^{-1} = [t_{ij}];
in particular, t_{ij} = 0 for i > j. On the other hand, since Q_1^H Q = [t_{ij}], we have
\[
\langle d_i, c_j\rangle = d_i^H c_j = t_{ij} \qquad\text{for all } i, j.
\]
Because the columns c_j lie in col A = Span{d_1, . . . , d_n} and the d_i are orthonormal, each
c_j = t_{1j}d_1 + \cdots + t_{jj}d_j, since ⟨d_i, c_j⟩ = t_{ij} = 0 for i > j. Writing out these equations explicitly, we have:
c1 = t11 d1 ,
c2 = t12 d1 + t22 d2 ,
c3 = t13 d1 + t23 d2 + t33 d3 , (3.17)
c4 = t14 d1 + t24 d2 + t34 d3 + t44 d4 ,
..
.
The first equation gives c_1 = t_{11}d_1; taking norms and using the fact that t_{11} > 0, we get
\[
1 = \|c_1\| = t_{11}\|d_1\| = t_{11}, \qquad\text{and so}\qquad c_1 = d_1. \tag{3.18}
\]
Then t_{12} = ⟨d_1, c_2⟩ = ⟨c_1, c_2⟩ = 0, and so the second equation gives
\[
c_2 = t_{22} d_2.
\]
As in (3.18), this implies that c_2 = d_2. Then t_{13} = 0 and t_{23} = 0 follow in the same way.
Continuing in this way, we conclude that ci = di for all i. Thus, Q = Q1 and, by (3.16),
R = R1 , as desired.
So far our results are all about tall or square matrices. What about wide matrices?
Corollary 3.6.8. Every invertible (hence square) matrix A has unique factorizations A =
QR and A = LP, where Q and P are unitary, R is upper triangular with positive real diagonal
entries, and L is lower triangular with positive real diagonal entries.
\[
A^H A = R^H Q^H Q R = R^H R,
\]
and so
\[
(A^H A)^{-1} = R^{-1}\,(R^{-1})^H.
\]
Since R is upper triangular, it is easy to find its inverse. So this gives us an efficient method
of finding left inverses.
Students who took MAT 2342 learned about finding best approximations to (possibly
inconsistent) linear systems. (See [Nic, §5.6].) In particular, if Ax = b is a linear system,
then any solution z to the normal equations
\[
(A^H A)\, z = A^H b
\]
is a best approximation to a solution of Ax = b. When A^H A is invertible, the normal equations
have the unique solution
\[
z = (A^H A)^{-1} A^H b.
\]
As noted above, (A^H A)^{-1} A^H is a left inverse to A. We saw in Section 1.5.1 that if Ax = b
has a solution, then it must be (A^H A)^{-1} A^H b. What we're saying here is that, even if there is
no solution, (A^H A)^{-1} A^H b is the best approximation to a solution.
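A minimal Python sketch (with assumed data) of this idea: with a reduced QR factorization A = Q_1R_1, the best approximation (A^H A)^{-1}A^H b can be computed by back substitution from R_1 z = Q_1^H b, avoiding an explicit inverse.

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]])   # tall, linearly independent columns
b = np.array([1.0, 2.0, 3.0])

Q1, R1 = np.linalg.qr(A)                # numpy's default "reduced" QR factorization
z = solve_triangular(R1, Q1.T @ b)      # solves R1 z = Q1^H b by back substitution

print(np.allclose(z, np.linalg.lstsq(A, b, rcond=None)[0]))   # same best approximation
```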
Exercises.
Recommended exercises: Exercises in [Nic, §8.4]. Keep in mind the different definition of
QR factorization used in [Nic] (see [Nic, Def. 8.6]).
The power method uses the fact that, under some mild assumptions, the sequence ky1 k, ky2 k, . . .
converges to λn , and the sequence x1 , x2 , · · · converges to a corresponding eigenvector.
We should be precise about what we mean by convergence here. You learned about
convergence of sequences of real numbers in calculus. What about convergence of vectors?
We say that a sequence of vectors x1 , x2 , . . . converges to a vector v if
lim kxk − vk = 0.
k→∞
This is equivalent to the components of the vectors xk converging (in the sense of sequences
of scalars) to the components of v.
Theorem 3.7.1. With the notation and assumptions from above, suppose that the initial
vector x_0 is of the form
\[
x_0 = \sum_{i=1}^{n} a_i v_i \qquad\text{with } a_n \neq 0.
\]
Then the sequence \|y_k\| converges to λ_n, and the sequence x_k converges to a unit eigenvector
of A corresponding to λ_n.
Proof. Note that, by definition, x_k is a unit vector that is a positive real multiple of
\[
A^k x_0 = A^k\Bigl(\sum_{i=1}^{n} a_i v_i\Bigr) = \sum_{i=1}^{n} a_i A^k v_i = \sum_{i=1}^{n} a_i \lambda_i^k v_i.
\]
Thus
\[
x_k = \frac{\sum_{i=1}^{n} a_i \lambda_i^k v_i}{\bigl\|\sum_{i=1}^{n} a_i \lambda_i^k v_i\bigr\|}
= \frac{a_n v_n + \sum_{i=1}^{n-1} a_i \bigl(\tfrac{\lambda_i}{\lambda_n}\bigr)^k v_i}
{\bigl\|a_n v_n + \sum_{i=1}^{n-1} a_i \bigl(\tfrac{\lambda_i}{\lambda_n}\bigr)^k v_i\bigr\|}.
\]
(Note that we could divide by λ_n since λ_n ≠ 0. Also, since a_n ≠ 0, the norm in the
denominator is nonzero.) Since |λ_i| < λ_n for i ≠ n, we have
\[
\Bigl(\frac{\lambda_i}{\lambda_n}\Bigr)^{\!k} \to 0 \qquad\text{as } k \to \infty.
\]
Hence, as k → ∞, we have
\[
x_k \to \frac{a_n v_n}{\|a_n v_n\|} = \frac{a_n}{|a_n|}\, v_n.
\]
Similarly,
\[
\|y_{k+1}\| = \|A x_k\|
= \frac{\bigl\|a_n \lambda_n v_n + \sum_{i=1}^{n-1} a_i \lambda_i \bigl(\tfrac{\lambda_i}{\lambda_n}\bigr)^{k} v_i\bigr\|}
{\bigl\|a_n v_n + \sum_{i=1}^{n-1} a_i \bigl(\tfrac{\lambda_i}{\lambda_n}\bigr)^{k} v_i\bigr\|}
= \lambda_n\,
\frac{\bigl\|a_n v_n + \sum_{i=1}^{n-1} a_i \bigl(\tfrac{\lambda_i}{\lambda_n}\bigr)^{k+1} v_i\bigr\|}
{\bigl\|a_n v_n + \sum_{i=1}^{n-1} a_i \bigl(\tfrac{\lambda_i}{\lambda_n}\bigr)^{k} v_i\bigr\|}
\to \lambda_n.
\]
Remarks 3.7.2. (a) It is crucial that the largest eigenvalue be real. On the other hand,
the assumption that it is positive can be avoided since, if it is negative, we can apply
Theorem 3.7.1 to −A.
(b) If there are several eigenvalues with maximum norm, then the sequences kyk k and xk
will not converge in general. On the other hand, if λn has multiplicity greater than one, but
is the unique eigenvalue with maximum norm, then the sequence kyk k will always converge
to λn , but the sequence xk may not converge.
(c) If you choose an initial vector x0 at random, it is very unlikely that you will choose
one with an = 0. (This would be the same as choosing a random real number and ending up
with the real number 0.) Thus, the condition in Theorem 3.7.1 that a_n ≠ 0 is not a serious
obstacle in practice.
(d) It is possible to compute the smallest eigenvalue (in norm) by applying the power
method to A^{-1}. This is called the inverse power method. It is computationally more involved,
since one must solve a linear system at each iteration.
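A minimal Python sketch of the power method, with y_k and x_k as above. The sample matrix [[5, 1], [1, 5]] is an assumption (not necessarily the matrix used in the example below); it has dominant eigenvalue 6 with unit eigenvector (1, 1)/√2.

```python
import numpy as np

def power_method(A, x0, iterations=50):
    x = x0 / np.linalg.norm(x0)
    for _ in range(iterations):
        y = A @ x
        norm_y = np.linalg.norm(y)   # converges to the dominant eigenvalue
        x = y / norm_y               # converges to a corresponding unit eigenvector
    return norm_y, x

A = np.array([[5.0, 1.0], [1.0, 5.0]])     # eigenvalues 6 and 4
estimate, vector = power_method(A, np.array([1.0, 0.0]))
print(estimate)                             # approximately 6
print(vector)                               # approximately (1, 1)/sqrt(2)
```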
The \|y_k\| are converging to 6, while the x_k are converging to \frac{1}{\sqrt{2}}(1, 1) ≈ (0.707107, 0.707107),
which is an eigenvector of eigenvalue 6.
Students who took MAT 2342 learned about Markov chains. Markov chains are a partic-
ular case of the power method. The matrix A is stochastic if each entry is a nonnegative real
number and the sum of the entries in each column is equal to one. A stochastic matrix is
regular if there exists a positive integer k such that all entries of Ak are strictly positive. One
can prove that every regular stochastic matrix has 1 as a dominant eigenvalue and a unique
steady state vector, which is an eigenvector of eigenvalue 1, whose entries are all nonnegative
and sum to 1. One can then use Theorem 3.7.1 to find this steady state vector. See [Nic,
§2.9] for further details.
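A minimal sketch (with an assumed 2 × 2 regular stochastic matrix) of finding the steady state vector by iteration:

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])      # columns sum to 1, all entries positive, so A is regular

x = np.array([1.0, 0.0])        # any initial probability vector
for _ in range(100):
    x = A @ x                   # entries stay nonnegative and sum to 1

print(x)                         # approximately (2/3, 1/3), the steady state vector
print(np.allclose(A @ x, x))     # an eigenvector of eigenvalue 1
```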
Remark 3.7.4. Note that the power method only allows us to compute the dominant eigen-
value (or the smallest eigenvalue in norm if we use the inverse power method). What if
we want to find other eigenvalues? In this case, the power method has serious limitations.
If A is hermitian, one can first find the dominant eigenvalue λn with eigenvector vn , and
then repeat the power method with an initial vector orthogonal to vn . At each iteration, we
subtract the projection onto vn to ensure that we remain in the subspace orthogonal to vn .
However, this is quite computationally intensive, and so is not practical for computing all
eigenvalues.
0 < |λ1 | < |λ2 | < · · · < |λn−1 | < |λn |. (3.21)
(The general case is beyond the scope of this course.) These conditions on A ensure that it
is invertible and diagonalizable, with distinct real eigenvalues. We do not assume that A is
symmetric.
The QR method consists in computing a sequence of matrices A1 , A2 , . . . with A1 = A
and
Ak+1 = Rk Qk ,
where Ak = Qk Rk is the QR factorization of Ak , for k ≥ 1. Note that, since A is invertible, it
has a unique QR factorization by Corollary 3.6.8. Recall that the matrices Qk are orthogonal
(since they are real and unitary) and the matrices Rk are upper triangular.
Since
Ak+1 = Rk Qk = Q−1 T
k Ak Qk = Qk Ak Qk , (3.22)
the matrices A1 , A2 , . . . are all similar, and hence have the same eigenvalues.
Theorem 3.7.5. Suppose A ∈ Mn,n (R) satisfies (3.21). In addition, assume that P −1
admits an LU factorization, where P is the matrix of eigenvectors of A, that is, A =
P diag(λ1 , . . . , λn )P −1 . Then the sequence A1 , A2 , . . . produced by the QR method converges
to an upper triangular matrix whose diagonal entries are the eigenvalues of A.
Proof. We will omit the proof of this theorem. The interested student can find a proof in
[AK08, Th. 10.6.1].
Remark 3.7.6. If the matrix A is symmetric, then so are the Ak , by (3.22). Thus, the limit
of the Ak is a diagonal matrix.
In practice, the convergence of the QR method can be slow when the eigenvalues are
close together. The speed can be improved in certain ways.
• One can use shifts: at each step, one chooses a scalar s_k and computes the QR factorization
A_k − s_k I = Q_k R_k. Then
\[
Q_k^{-1} A_k Q_k = Q_k^{-1}(Q_k R_k + s_k I)Q_k = R_k Q_k + s_k I,
\]
and so we take A_{k+1} = R_k Q_k + s_k I. If the shifts s_k are carefully chosen, one can greatly
improve convergence.
• One can first bring the matrix A to upper Hessenberg form, which is a matrix that is
nearly upper triangular (one allows nonzero entries just below the diagonal), using a
technique based on Householder reduction. Then the convergence in the QR algorithm
is faster.
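A minimal Python sketch of the basic, unshifted QR method described above, applied to an assumed sample symmetric matrix:

```python
import numpy as np

def qr_method(A, iterations=200):
    Ak = np.array(A, dtype=float)
    for _ in range(iterations):
        Qk, Rk = np.linalg.qr(Ak)    # A_k = Q_k R_k
        Ak = Rk @ Qk                 # A_{k+1} = R_k Q_k, similar to A_k
    return Ak

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])      # symmetric, distinct positive eigenvalues
Ak = qr_method(A)
print(np.round(np.sort(np.diag(Ak)), 6))                 # approximate eigenvalues
print(np.round(np.sort(np.linalg.eigvals(A).real), 6))   # for comparison
```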
Let A = [a_{ij}] ∈ M_{n,n}(C). For 1 ≤ i ≤ n, let
\[
R_i = \sum_{j \neq i} |a_{ij}|
\]
be the sum of the absolute values of the non-diagonal entries in the i-th row. For a ∈ C and
r ∈ R_{≥0}, let
\[
D(a, r) = \{z \in \mathbb{C} : |z - a| \le r\}
\]
be the closed disc with centre a and radius r. The discs D(aii , Ri ) are called the Gershgorin
discs of A.
Theorem 3.7.8 (Gershgorin circle theorem). Every eigenvalue of A lies in at least one of
the Gershgorin discs D(aii , Ri ).
Proof. Suppose λ is an eigenvalue of A with eigenvector x = (x_1, . . . , x_n). Choose i so that |x_i|
is as large as possible, and scale x so that x_i = 1 (hence |x_j| ≤ 1 for all j). Comparing the i-th
entries of Ax = λx gives λ − a_{ii} = \sum_{j \neq i} a_{ij} x_j. Then we have
\begin{align*}
|\lambda - a_{ii}| &= \Bigl|\sum_{j \neq i} a_{ij} x_j\Bigr| \\
&\le \sum_{j \neq i} |a_{ij}|\,|x_j| && \text{(by the triangle inequality)} \\
&\le \sum_{j \neq i} |a_{ij}| && \text{(since } |x_j| \le 1 \text{ for all } j\text{)} \\
&= R_i.
\end{align*}
Corollary 3.7.9. The eigenvalues of A lie in the Gershgorin discs corresponding to the
columns of A. More precisely, each eigenvalue lies in at least one of the discs D\bigl(a_{jj}, \sum_{i \neq j} |a_{ij}|\bigr).
Proof. Since A and AT have the same eigenvalues, we can apply Theorem 3.7.8 to AT .
One can interpret the Gershgorin circle theorem as saying that if the off-diagonal entries
of a square matrix have small norms, then the eigenvalues of the matrix are close to the
diagonal entries of the matrix.
It has characteristic polynomial x^2 − 4 = (x − 2)(x + 2), and so its eigenvalues are ±2. The
Gershgorin circle theorem tells us that the eigenvalues lie in the discs D(0, 1) and D(0, 4).
[Figure: the Gershgorin discs D(0, 1) and D(0, 4) in the complex plane, with the eigenvalues ±2 marked on the real axis.]
[Figure: the Gershgorin discs for Example 3.7.11 in the complex plane.]
Note that, in Examples 3.7.10 and 3.7.11, it was not the case that each Gershgorin disc
contained one eigenvalue. There was one disc that contained no eigenvalues, and one disc
that contained two eigenvalues. In general, one has the following strengthened version of the
Gershgorin circle theorem.
Theorem 3.7.12. If the union of k Gershgorin discs is disjoint from the union of the other
n − k discs, then the former union contains exactly k eigenvalues of A, and the latter union
contains exactly n − k eigenvalues of A.
Proof. The proof of this theorem uses a continuity argument, where one starts with Gersh-
gorin discs that are points, and gradually enlarges them. For details, see [Mey00, p. 498].
Example 3.7.13. Let's use the Gershgorin circle theorem to estimate the eigenvalues of
\[
A = \begin{bmatrix} -7 & 0.3 & 0.2 \\ 5 & 0 & 2 \\ 1 & -1 & 10 \end{bmatrix}.
\]
The Gershgorin discs are
\[
D(-7, 0.5), \qquad D(0, 7), \qquad D(10, 2).
\]
By Theorem 3.7.12, we know that two eigenvalues lie in the union of discs D(−7, 0.5)∪D(0, 7)
and one lies in the disc D(10, 2).
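A short Python sketch computing the Gershgorin discs of this matrix and checking that every eigenvalue lies in at least one of them:

```python
import numpy as np

A = np.array([[-7.0, 0.3, 0.2],
              [ 5.0, 0.0, 2.0],
              [ 1.0, -1.0, 10.0]])

centres = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centres)   # R_i = sum of off-diagonal |a_ij|
print(list(zip(centres, radii)))                       # (-7, 0.5), (0, 7), (10, 2)

for lam in np.linalg.eigvals(A):
    inside = any(abs(lam - c) <= r for c, r in zip(centres, radii))
    print(lam, inside)                                 # every eigenvalue lies in some disc
```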
Exercises.
3.7.1. Suppose A ∈ Mn,n (C) has eigenvalues λ1 , . . . , λr . (We only list each eigenvalue once
here, even if it has multiplicity greater than one.) Prove that the Gershgorin discs for A are
precisely the sets {λ1 }, . . . , {λr } if and only if A is diagonal.
3.7.2. Let
\[
A = \begin{bmatrix}
1.3 & 0.5 & 0.1 & 0.2 & 0.1 \\
-0.2 & 0.7 & 0 & 0.2 & 0.1 \\
-2 & 1 & 4 & 0.1 & -0.1 \\
0 & 0.2 & -0.1 & 2 & 1 \\
0.05 & 0 & 0.1 & 0.5 & 1
\end{bmatrix}.
\]
Use the Gershgorin circle theorem to prove that A is invertible.
3.7.3. Suppose that P is a permutation matrix. (Recall that this means that each row and
column of P have one entry equal to 1 and all other entries equal to 0.)
(a) Show that there are two possibilities for the Gershgorin discs of P .
(b) Show, using different methods, that the eigenvalues of a permutation matrix all have
absolute value 1. Compare this with your results from (a).
(a) Calculate the matrix B = CAC −1 . Recall that A and B have the same eigenvalues.
(b) Give the Gershgorin discs for B, and find the values of z that give the strongest
conclusion for the eigenvalues.
(c) What can we conclude about the eigenvalues of A?
Chapter 4
Generalized diagonalization
In this and previous courses, you’ve seen the concept of diagonalization. Diagonalizing
a matrix makes it very easy to work with in many ways: you know the eigenvalues and
eigenvectors, you can easily compute powers of the matrix, etc. However, you know that not
all matrices are diagonalizable. So it is natural to ask if there is some slightly more general
result concerning a nice form in which all matrices can be written. In this chapter we will
consider two such forms: singular value decomposition and Jordan canonical form. Students
who took MAT 2342 also saw singular value decomposition a bit in that course.
(We allow the possibility that r = 0, so that λi = 0 for all i, and also the possibility that
r = n.) By Proposition 3.3.9,
We wish to show that rank A = r, where r is defined in (4.1). We do this by showing that
Since {Aq1 , Aq2 , . . . , Aqr } is orthogonal, it is linearly independent. It remains to show that
it spans im TA . So we need to show that
x = t1 q1 + t2 q2 + · · · + tn qn for some t1 , t2 , . . . , tn ∈ C.
as desired.
DA = diag(σ1 , . . . , σr ).
We also have, by (4.6) and (4.7),
\[
\sigma_i p_i = \|A q_i\|\, p_i = A q_i \qquad\text{for } i = 1, 2, \dots, r.
\]
Using this and (4.5), we have
\[
AQ = \begin{bmatrix} Aq_1 & Aq_2 & \cdots & Aq_n \end{bmatrix}
= \begin{bmatrix} \sigma_1 p_1 & \cdots & \sigma_r p_r & 0 & \cdots & 0 \end{bmatrix}.
\]
Then we compute
\[
P \Sigma_A = \begin{bmatrix} p_1 & \cdots & p_r & p_{r+1} & \cdots & p_m \end{bmatrix}
\begin{bmatrix}
\sigma_1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & \sigma_r & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & 0 & \cdots & 0
\end{bmatrix}
= \begin{bmatrix} \sigma_1 p_1 & \cdots & \sigma_r p_r & 0 & \cdots & 0 \end{bmatrix}
= AQ.
\]
Since Q−1 = QH , it follows that A = P ΣA QH . Thus we have proved the following theorem.
Theorem 4.1.6. Let A ∈ M_{m,n}(C), and let σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0 be the positive singular
values of A. Then r = rank A and we have a factorization
\[
A = P \Sigma_A Q^H,
\]
where P ∈ M_{m,m}(C) and Q ∈ M_{n,n}(C) are unitary matrices.
Note that the SVD is not unique. For example, if r < m, then there are infinitely
many ways to extend {p1 , . . . , pr } to an orthonormal basis {p1 , . . . , pm } of Cm . Each such
extension leads to a different matrix P in the SVD. For another example illustrating non-
uniqueness, consider A = In . Then ΣA = In , and A = P ΣA P H is a SVD of A for any unitary
n × n matrix P .
Remark 4.1.7 (Real SVD). If A ∈ Mm,n (R), then we can find a SVD where P and Q are
real (hence orthogonal) matrices. To see this, observe that our proof is valid if we replace C
by R everywhere.
Since this matrix is real, we can write A^T instead of A^H in our SVD algorithm. We have
\[
A^T A = \begin{bmatrix} 2 & 0 & -2 \\ 0 & 8 & 0 \\ -2 & 0 & 2 \end{bmatrix}
\qquad\text{and}\qquad
A A^T = \begin{bmatrix} 6 & -2 \\ -2 & 6 \end{bmatrix}.
\]
As expected, both of these matrices are symmetric. It's easier to find the eigenvalues of
AA^T. This has characteristic polynomial
\[
\det(xI - AA^T) = (x - 6)^2 - 4 = (x - 8)(x - 4).
\]
Thus, the eigenvalues of AA^T are λ_1 = 8 and λ_2 = 4, both with multiplicity one. It follows
from Lemma 4.1.2(b) that the eigenvalues of A^T A are λ_1 = 8, λ_2 = 4, and λ_3 = 0, all with
multiplicity one. So the positive singular values of A are
\[
\sigma_1 = \sqrt{\lambda_1} = 2\sqrt{2} \qquad\text{and}\qquad \sigma_2 = \sqrt{\lambda_2} = 2.
\]
\[
Q = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}
= \begin{bmatrix} 0 & -1/\sqrt{2} & 1/\sqrt{2} \\ 1 & 0 & 0 \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix},
\qquad
\Sigma = \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \end{bmatrix}
= \begin{bmatrix} 2\sqrt{2} & 0 & 0 \\ 0 & 2 & 0 \end{bmatrix}.
\]
In practice, SVDs are not computed using the above method. There are sophisticated
numerical algorithms for calculating the singular values, P , Q, and the rank of an m × n
matrix to a high degree of accuracy. Such algorithms are beyond the scope of this course.
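For a quick check of the example above, the following Python sketch uses `numpy.linalg.svd`. It assumes the example matrix is A = [[1, 2, −1], [1, −2, −1]], which is consistent with the displayed A^T A and AA^T (the matrix itself is not reproduced here).

```python
import numpy as np

A = np.array([[1.0, 2.0, -1.0],
              [1.0, -2.0, -1.0]])      # assumed example matrix

P, sigma, Qt = np.linalg.svd(A)         # numpy returns the third factor as Q^T (here real)
print(np.round(sigma, 6))               # [2*sqrt(2), 2]

Sigma = np.zeros_like(A)
np.fill_diagonal(Sigma, sigma)
print(np.allclose(P @ Sigma @ Qt, A))   # A = P Sigma Q^T
```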
Our algorithm gives us a way of finding one SVD. However, since SVDs are not unique,
it is natural to ask how they are related.
Proof. We have
\[
A^H A = (P \Sigma Q^H)^H\, P \Sigma Q^H = Q \Sigma^H P^H P \Sigma Q^H = Q \Sigma^H \Sigma\, Q^H. \tag{4.10}
\]
Exercises.
4.1.1. Show that A ∈ Mm,n (C) is the zero matrix if and only if all of its singular values are
zero.
\begin{align*}
x \in \operatorname{null} A &\iff Ax = 0 \\
&\iff \begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix} x = 0 \\
&\iff \begin{bmatrix} a_1 x \\ \vdots \\ a_m x \end{bmatrix} = 0 && \text{(block multiplication)} \\
&\iff a_i x = 0 \text{ for all } i \\
&\iff \langle \bar a_i, x\rangle = 0 \text{ for all } i \\
&\iff x \in \bigl(\operatorname{Span}\{\bar a_1, \dots, \bar a_m\}\bigr)^{\perp} = (\operatorname{row} \bar A)^{\perp}. && \text{(by Lemma 3.1.9(c))}
\end{align*}
(b) You saw this in previous courses, so we will not repeat the proof here. See [Nic,
Lem. 8.6.4].
(c) We leave this as an exercise. The proof can be found in [Nic, Lem. 8.6.4].
Now we can see that any SVD for a matrix A immediately gives orthonormal bases for
the fundamental subspaces of A.
Theorem 4.2.3. Suppose A ∈ M_{m,n}(F). Let A = P ΣQ^H be a SVD for A, where
\[
P = \begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix} \in M_{m,m}(F), \qquad
Q = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix} \in M_{n,n}(F), \qquad
\Sigma = \begin{bmatrix} \operatorname{diag}(d_1, \dots, d_r) & 0 \\ 0 & 0 \end{bmatrix},
\]
with d_1 ≥ d_2 ≥ · · · ≥ d_r > 0.
Then
(a) r = rank A, and the positive singular values of A are d1 , d2 , . . . , dr ;
(b) the fundamental spaces are as follows:
(i) {u1 , . . . , ur } is an orthonormal basis of col A,
(ii) {ur+1 , . . . , um } is an orthonormal basis of null AH ,
(iii) {vr+1 , . . . , vn } is an orthonormal basis of null A,
(iv) {v1 , . . . , vr } is an orthonormal basis of row A.
Proof. (a) This is Lemma 4.1.9.
(b) (i) Since Q is invertible, we have col A = col(AQ) = col(P Σ). Also
\[
P \Sigma = \begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix}
\begin{bmatrix} \operatorname{diag}(d_1, \dots, d_r) & 0 \\ 0 & 0 \end{bmatrix}
= \begin{bmatrix} d_1 u_1 & \cdots & d_r u_r & 0 & \cdots & 0 \end{bmatrix}.
\]
Thus
col A = Span{d1 u1 , . . . , dr ur } = Span{u1 , . . . , ur }.
Since the u1 , . . . , ur are orthonormal, they are linearly independent. So {u1 , . . . , ur } is an
orthonormal basis for col A.
(ii) We have
(iii) We first show that the proposed basis has the correct size. By the Rank-Nullity
Theorem (see (1.7)), we have
\[
\dim(\operatorname{null} A) = n - \operatorname{rank} A = n - r = \dim \operatorname{Span}\{v_{r+1}, \dots, v_n\}.
\]
Hence, once we prove the inclusion
\[
\operatorname{Span}\{v_{r+1}, \dots, v_n\} \subseteq \operatorname{null} A, \tag{4.11}
\]
it will follow that Span{vr+1 , . . . , vn } = null A (since the two spaces have the same dimen-
sion), and hence that {vr+1 , . . . , vn } is a basis for null A (since it is linearly independent
because it is orthonormal).
To show the inclusion (4.11), it is enough to show that vj ∈ null A (i.e. Avj = 0) for
j > r. Define
Thus, for 1 ≤ i ≤ n,
\[
\|A v_j\|^2 = (A v_j)^H A v_j = v_j^H A^H A v_j = v_j^H (d_j^2 v_j) = d_j^2 \|v_j\|^2 = d_j^2.
\]
Then
\begin{align*}
\operatorname{row} \bar A &= \bigl((\operatorname{row} \bar A)^{\perp}\bigr)^{\perp} && \text{(by Lemma 4.2.2(b))} \\
&= \bigl(\operatorname{Span}\{v_{r+1}, \dots, v_n\}\bigr)^{\perp} \\
&= \operatorname{Span}\{v_1, \dots, v_r\}. && \text{(by Lemma 4.2.2(c))}
\end{align*}
For example, suppose we wish to solve a homogeneous linear system Ax = 0 of m equations in n variables.
The set of solutions is precisely null A. If we compute a SVD A = U ΣV^T for A then, in the
notation of Theorem 4.2.3 (with P = U and Q = V), the set {v_{r+1}, . . . , v_n} is an orthonormal basis of the solution set.
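A minimal Python sketch (with an assumed sample matrix) of extracting an orthonormal basis of the solution set from a SVD:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])        # rank 1, so null A has dimension 2

U, s, Vh = np.linalg.svd(A)
tol = 1e-10
r = int(np.sum(s > tol))                # numerical rank
null_basis = Vh[r:].conj().T            # columns v_{r+1}, ..., v_n

print(r)                                                  # 1
print(np.allclose(A @ null_basis, 0))                     # each basis vector solves Ax = 0
print(np.allclose(null_basis.T @ null_basis, np.eye(2)))  # and the basis is orthonormal
```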
SVDs are also closely related to principal component analysis. If A = P ΣQ^H is a SVD
then, in the notation of Theorem 4.2.3, we have
\begin{align*}
A = P \Sigma Q^H
&= \begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix}
\begin{bmatrix} \operatorname{diag}(\sigma_1, \dots, \sigma_r) & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}^H \\
&= \begin{bmatrix} \sigma_1 u_1 & \cdots & \sigma_r u_r & 0 & \cdots & 0 \end{bmatrix}
\begin{bmatrix} v_1^H \\ \vdots \\ v_n^H \end{bmatrix} \\
&= \sigma_1 u_1 v_1^H + \cdots + \sigma_r u_r v_r^H. && \text{(block multiplication)}
\end{align*}
Thus, we have written A as a sum of r rank one matrices (see Exercise 4.2.1), called the
principal components of A. For 1 ≤ t ≤ r, the matrix
At := σ1 u1 v1H + . . . + σt ut vtH
Exercises.
4.2.1. Suppose u ∈ Cn and v ∈ Cm . Show that if u and v are both nonzero, then the rank
of the matrix uvH is 1.
4.3 Pseudoinverses
In Section 1.5 we discussed left and right inverses. In particular, we discussed the pseudoin-
verse in Section 1.5.4. This was a particular left/right inverse; in general left/right inverses
are not unique. We’ll now discuss this concept in a bit more detail, seeing what property
uniquely characterizes the pseudoinverse and how we can compute it using a SVD. A good
reference for this material is [Nic, §8.6.4].
(a) Suppose B is a left inverse of A, so that BA = I, and let C be a middle inverse of A, so that ACA = A. Then
\[
ACA = A \implies BACA = BA \implies CA = I,
\]
and so C is a left inverse of A. Thus, middle inverses of A are the same as left inverses.
(b) If A is right invertible, then middle inverses of A are the same as right inverses. The
proof is analogous to the one above.
(c) It follows that if A is invertible, then middle inverses are the same as inverses.
In general, middle inverses are not unique, even for square matrices.
Example 4.3.3. If
\[
A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},
\]
then
\[
B = \begin{bmatrix} 1 & b \\ 0 & 0 \\ 0 & 0 \end{bmatrix}
\]
is a middle inverse for any b.
While the middle inverse is not unique in general, it turns out that it is unique if we
require that AB and BA be hermitian.
Theorem 4.3.4 (Penrose's Theorem). For any A ∈ M_{m,n}(C), there exists a unique B ∈
M_{n,m}(C) such that
If A is invertible, then A^+ = A^{-1}, as follows from Example 4.3.2. Also, the symmetry in
the conditions (P1) and (P2) implies that A^{++} = A.
The following proposition shows that the terminology pseudoinverse, as used above, co-
incides with our use of this terminology in Section 1.5.4.
Proof. We prove the first statement; the proof of the second is similar. If rank A = m,
then the rows of A are linearly independent and so, by Proposition 1.5.16(a) (we worked
over R there, but the same result holds over C if we replace the transpose by the conjugate
transpose), AAH is invertible. Then
and
\[
\bigl(A^H (A A^H)^{-1}\bigr)\, A\, \bigl(A^H (A A^H)^{-1}\bigr)
= A^H (A A^H)^{-1} (A A^H)(A A^H)^{-1} = A^H (A A^H)^{-1},
\]
and
\[
\bigl(A^H (A A^H)^{-1} A\bigr)^H = A^H (A A^H)^{-1} A,
\]
It turns out that if we have a SVD for A, then it is particularly easy to compute the
pseudoinverse.
Proposition 4.3.7. Suppose A ∈ M_{m,n}(C) and A = P ΣQ^H is a SVD for A as in Definition 4.1.1, with
\[
\Sigma = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}_{m \times n}, \qquad
D = \operatorname{diag}(d_1, d_2, \dots, d_r), \qquad d_1, d_2, \dots, d_r \in \mathbb{R}_{>0}.
\]
Then A^+ = Q \Sigma' P^H, where
\[
\Sigma' = \begin{bmatrix} D^{-1} & 0 \\ 0 & 0 \end{bmatrix}_{n \times m}.
\]
σ1 = 1, σ2 = 0, σ3 = 0.
Thus
\[
\Sigma_A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} = A
\qquad\text{and}\qquad
\Sigma' = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} = A^T.
\]
We then compute
\[
p_1 = \frac{1}{\sigma_1} A q_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.
\]
We extend this to an orthonormal basis of C^2 by choosing
\[
p_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.
\]
Hence
\[
P = \begin{bmatrix} p_1 & p_2 \end{bmatrix} = I_2.
\]
Thus a SVD for A is A = P \Sigma_A Q^T = \Sigma_A = A. Therefore the pseudoinverse of A is
\[
A^+ = Q \Sigma' P^T = \Sigma' = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} = A^T.
\]
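A minimal Python sketch of Proposition 4.3.7: build A^+ from a SVD by inverting the positive singular values, and compare with `numpy.linalg.pinv`. The matrix used is the one from the example above.

```python
import numpy as np

def pseudoinverse(A, tol=1e-12):
    P, s, Qh = np.linalg.svd(A)
    Sigma_plus = np.zeros((A.shape[1], A.shape[0]))   # n x m
    r = int(np.sum(s > tol))
    Sigma_plus[:r, :r] = np.diag(1.0 / s[:r])         # diag(1/d_1, ..., 1/d_r)
    return Qh.conj().T @ Sigma_plus @ P.conj().T      # A^+ = Q Sigma' P^H

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
print(pseudoinverse(A))                               # equals A^T, as in the example
print(np.allclose(pseudoinverse(A), np.linalg.pinv(A)))
```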
We conclude this section with a list of properties of the pseudoinverse, many of which
parallel properties of the inverse.
(a) A++ = A.
(b) If A is invertible, then A+ = A−1 .
(c) The pseudoinverse of a zero matrix is its transpose.
(d) (AT )+ = (A+ )T .
(e) (Ā)^+ = \overline{A^+}; that is, the pseudoinverse of the conjugate is the conjugate of the pseudoinverse.
(f) (AH )+ = (A+ )H .
(g) (zA)+ = z −1 A+ for z ∈ C× .
(h) If A ∈ Mm,n (R), then A+ ∈ Mn,m (R).
Proof. We’ve already proved the first two. The proof of the remaining properties is left as
Exercise 4.3.3.
Exercises.
4.3.1. Suppose that B is a middle inverse for A. Prove that B T is a middle inverse for AT .
Proposition 4.4.1 (Block triangulation). Suppose A ∈ M_{n,n}(C) has characteristic polynomial
\[
c_A(x) = (x - \lambda_1)^{m_1} (x - \lambda_2)^{m_2} \cdots (x - \lambda_k)^{m_k},
\]
where λ_1, λ_2, . . . , λ_k are the distinct eigenvalues of A. Then there exists an invertible matrix
P such that
\[
P^{-1} A P = \begin{bmatrix}
U_1 & 0 & 0 & \cdots & 0 \\
0 & U_2 & 0 & \cdots & 0 \\
0 & 0 & U_3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & U_k
\end{bmatrix},
\]
where each U_i is an m_i × m_i upper triangular matrix with all entries on the main diagonal
equal to λ_i.
Proof. The proof proceeds by induction on n. See [Nic, Th. 11.1.1] for details.
where mλ is the algebraic multiplicity of λ. (We use the notation Gλ (A) if we want to
emphasize the matrix.)
Recall that the geometric multiplicity of the eigenvalue λ of A is dim Eλ and we have
dim Eλ ≤ mλ . This inequality can be strict, as in Example 4.4.3. (In fact, it is strict for
some eigenvalue precisely when the matrix A is not diagonalizable.)
(Exercise 4.4.1). Thus, it suffices to show that dim null(P −1 BP ) = mi . Using the notation
of Proposition 4.4.1, we have λ = λi for some i and
\begin{align*}
P^{-1} B P &= (\lambda_i I - P^{-1} A P)^{m_i} \\
&= \begin{bmatrix}
\lambda_i I - U_1 & 0 & \cdots & 0 \\
0 & \lambda_i I - U_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_i I - U_k
\end{bmatrix}^{m_i} \\
&= \begin{bmatrix}
(\lambda_i I - U_1)^{m_i} & 0 & \cdots & 0 \\
0 & (\lambda_i I - U_2)^{m_i} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & (\lambda_i I - U_k)^{m_i}
\end{bmatrix}.
\end{align*}
Definition 4.4.5 (Jordan block). For n ∈ Z_{>0} and λ ∈ C, the Jordan block J(n, λ) is the
n × n matrix with λ's on the main diagonal, 1's on the diagonal above, and 0's elsewhere.
By convention, we set J(1, λ) = [λ].
We have
\[
J(1, \lambda) = \begin{bmatrix} \lambda \end{bmatrix}, \quad
J(2, \lambda) = \begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix}, \quad
J(3, \lambda) = \begin{bmatrix} \lambda & 1 & 0 \\ 0 & \lambda & 1 \\ 0 & 0 & \lambda \end{bmatrix}, \quad
J(4, \lambda) = \begin{bmatrix} \lambda & 1 & 0 & 0 \\ 0 & \lambda & 1 & 0 \\ 0 & 0 & \lambda & 1 \\ 0 & 0 & 0 & \lambda \end{bmatrix}, \quad \text{etc.}
\]
Our goal is to show that Proposition 4.4.1 holds with each block Ui replaced by a Jordan
block. The key is to show the result for λ = 0. We say a linear operator (or matrix) T
is nilpotent if T m = 0 for some m ≥ 1. Every eigenvalue of a nilpotent linear operator or
matrix is equal to zero (Exercise 4.4.2). The converse also holds by Proposition 4.4.1 and
Exercise 3.4.3(b).
Lemma 4.4.6. If A ∈ Mn,n (C) is nilpotent, then there exists an invertible matrix P such
that
P −1 AP = diag(J1 , J2 , . . . , Jk ),
where each Ji is a Jordan block J(m, 0) for some m.
Proof. The proof proceeds by induction on n. See [Nic, Lem. 11.2.1] for details.
Theorem 4.4.7 (Jordan canonical form). Suppose A ∈ Mn,n (C) has distinct (i.e. non-
repeated) eigenvalues λ1 , . . . , λk . For 1 ≤ i ≤ k, let mi be the algebraic multiplicity of λi .
Then there exists an invertible matrix P such that
\[
P^{-1} A P = \operatorname{diag}(J_1, \dots, J_m), \tag{4.13}
\]
where each J_\ell is a Jordan block corresponding to some eigenvalue λ_i. Furthermore, the sum
of the sizes of the Jordan blocks corresponding to λ_i is equal to m_i. The form (4.13) is called
a Jordan canonical form of A.
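Numerically the Jordan canonical form is unstable, but it can be computed in exact arithmetic. A minimal sketch using SymPy (the sample matrix is an assumption):

```python
import sympy as sp

A = sp.Matrix([[5, 4, 2, 1],
               [0, 1, -1, -1],
               [-1, -1, 3, 0],
               [1, 1, -1, 2]])        # assumed sample matrix

P, J = A.jordan_form()                # returns (P, J) with A = P * J * P^{-1}
sp.pprint(J)                          # J is block diagonal with Jordan blocks
print(sp.simplify(P * J * P.inv()) == A)   # True
```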
where each Ui is upper triangular with entries on the main diagonal equal to λi . Suppose
that for each Ui we can find an invertible matrix Pi as in the statement of the theorem. Then
\[
\begin{bmatrix} P_1 & & & \\ & P_2 & & \\ & & \ddots & \\ & & & P_k \end{bmatrix}^{-1}
\bigl(Q^{-1} A Q\bigr)
\begin{bmatrix} P_1 & & & \\ & P_2 & & \\ & & \ddots & \\ & & & P_k \end{bmatrix}
=
\begin{bmatrix}
P_1^{-1} U_1 P_1 & 0 & \cdots & 0 \\
0 & P_2^{-1} U_2 P_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P_k^{-1} U_k P_k
\end{bmatrix}
\]
P^{-1}(A - \lambda I)P = \operatorname{diag}(J_1, \dots, J_\ell),
Exercises.
4.4.1. Show that (4.12) is an isomorphism.
4.4.2. Show that if T : V → V is a nilpotent linear operator, then zero is the only eigenvalue
of T .
(You probably only saw this for x ∈ R, but it holds for complex values of x as well.) So we
might naively try to define
\[
e^A = \sum_{m=0}^{\infty} \frac{1}{m!} A^m.
\]
In fact, this is the correct definition. However, we need to justify that this infinite sum
makes sense (i.e. that it converges) and figure out how to compute it. To do this, we use
the Jordan canonical form.
where we have used the fact (which follows from (4.15)) that eJ(n,0) is upper triangular
with all entries on the main diagonal equal to 1 (hence det eJ(n,0) = 1). Now suppose A is
arbitrary. Since similar matrices have the same determinant and trace (see Exercises 3.4.1
and 3.4.2), we can use Theorem 4.4.7 to assume that A is in Jordan canonical form. So
A = diag(J1 , . . . , Jk )
The matrix exponential can be used to solve certain initial value problems. You probably
learned in calculus that the solution to the differential equation
\[
x'(t) = a\,x(t), \qquad x(t_0) = x_0, \qquad a, x_0 \in \mathbb{R},
\]
is
\[
x(t) = e^{(t - t_0)a}\, x_0.
\]
In fact, this continues to hold more generally. If
\[
x(t) = (x_1(t), \dots, x_n(t)), \qquad x'(t) = (x_1'(t), \dots, x_n'(t)), \qquad x_0 \in \mathbb{R}^n, \qquad A \in M_{n,n}(\mathbb{C}),
\]
then the solution to the initial value problem x'(t) = A x(t), x(t_0) = x_0, is
\[
x(t) = e^{(t - t_0)A}\, x_0.
\]
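A minimal Python sketch (with an assumed 2 × 2 system) of this solution formula, using `scipy.linalg.expm`:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])            # x'' = -x written as a first order system
x0 = np.array([1.0, 0.0])
t0 = 0.0

def x(t):
    return expm((t - t0) * A) @ x0     # x(t) = e^{(t - t0) A} x0

print(np.round(x(np.pi / 2), 6))       # approximately (0, -1): cosine/sine behaviour
```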
In addition to the matrix exponential, it is also possible to compute other functions of
matrices (sin A, cos A, etc.) using Taylor series for these functions. See [Usm87, Ch. 6] for
further details.
Exercises.
4.5.1. Prove that eA is invertible for any A ∈ Mn,n (C).
4.5.2. Let
\[
A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}
\qquad\text{and}\qquad
B = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}.
\]
Show that e^A e^B ≠ e^{A+B}.
4.5.3. If
\[
A = \begin{bmatrix}
3 & 1 & 0 & 0 & 0 \\
0 & 3 & 1 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & -4 & 1 \\
0 & 0 & 0 & 0 & -4
\end{bmatrix},
\]
compute e^A.
4.5.4. If
\[
A = \begin{bmatrix} 0 & 2 & -3 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},
\]
compute e^A by summing the power series.
4.5.5. Solve the initial value problem x'(t) = Ax(t), x(1) = (1, −2, 3), where A is the matrix
of Example 4.5.1.
Chapter 5
Quadratic forms
In your courses in linear algebra you have spent a great deal of time studying linear functions.
The next level of complexity involves quadratic functions. These have a large number of
applications, including in the theory of conic sections, optimization, physics, and statistics.
Matrix methods allow for a unified treatment of quadratic functions. Conversely, quadratic
functions provide useful insights into the theory of eigenvectors and eigenvalues. Good
references for the material in this chapter are [Tre, Ch. 7] and [ND77, Ch. 10].
5.1 Definitions
A quadratic form on R^n is a homogeneous polynomial of degree two in the variables x_1, . . . , x_n.
In other words, a quadratic form is a polynomial Q(x) = Q(x1 , . . . , xn ) having only terms of
degree two. So only terms that are scalar multiples of x2k and xj xk are allowed.
Every quadratic form on R^n can be written in the form Q(x) = ⟨x, Ax⟩ = x^T Ax for some
matrix A ∈ M_{n,n}(R). In general, the matrix A is not unique. For instance, the quadratic
form
\[
Q(x) = 2x_1^2 - x_2^2 + 6x_1x_2
\]
can be written as ⟨x, Ax⟩ where A is any of the matrices
\[
\begin{bmatrix} 2 & 6 \\ 0 & -1 \end{bmatrix}, \qquad
\begin{bmatrix} 2 & 0 \\ 6 & -1 \end{bmatrix}, \qquad
\begin{bmatrix} 2 & 3 \\ 3 & -1 \end{bmatrix}.
\]
In fact, we can choose any matrix of the form
\[
\begin{bmatrix} 2 & 6 - a \\ a & -1 \end{bmatrix}, \qquad a \in F.
\]
Note that only the choice of a = 3 yields a symmetric matrix.
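A short Python sketch checking that all three matrices above represent the same quadratic form Q(x) = 2x_1^2 − x_2^2 + 6x_1x_2:

```python
import numpy as np

def Q(x):
    return 2 * x[0]**2 - x[1]**2 + 6 * x[0] * x[1]

matrices = [np.array([[2, 6], [0, -1]]),
            np.array([[2, 0], [6, -1]]),
            np.array([[2, 3], [3, -1]])]    # the symmetric choice (a = 3)

rng = np.random.default_rng(0)
x = rng.standard_normal(2)                  # a random test vector
print([np.isclose(x @ A @ x, Q(x)) for A in matrices])   # all True
```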
Lemma 5.1.1. For any quadratic form Q(x) on Rn , there is a unique symmetric matrix
A ∈ Mn,n (R) such that Q(x) = hx, Axi.
Proof. Suppose
\[
Q(x) = \sum_{i \le j} c_{ij}\, x_i x_j.
\]
We can also consider quadratic forms on Cn . Typically, we still want the quadratic form
to take real values. Before we consider such quadratic forms, we state an important identity.
(a) If F = C, then
\[
\langle x, Ay\rangle = \frac{1}{4} \sum_{\alpha = \pm 1, \pm i} \alpha\, \langle x + \alpha y,\, A(x + \alpha y)\rangle.
\]
Now suppose that hx, Axi ∈ R for all x ∈ Cn , and A = [aij ]. Using the polarization
identity (Lemma 5.1.3) we have, for all x, y ∈ Cn ,
\begin{align*}
\langle x, Ay\rangle
&= \tfrac{1}{4}\bigl[\langle x + y, A(x + y)\rangle - \langle x - y, A(x - y)\rangle + i\langle x + iy, A(x + iy)\rangle - i\langle x - iy, A(x - iy)\rangle\bigr] \\
&\overset{\text{(IP2), (IP3)}}{=} \tfrac{1}{4}\bigl[\langle x + y, A(x + y)\rangle - \langle y - x, A(y - x)\rangle + i\langle y - ix, A(y - ix)\rangle - i\langle y + ix, A(y + ix)\rangle\bigr] \\
&\overset{\text{(IP1)}}{=} \langle y, Ax\rangle = \langle Ax, y\rangle \\
&= \langle x, A^H y\rangle.
\end{align*}
(In the third equality, we used the fact that all the inner products appearing in the expression
are real.) It follows that A = AH (e.g. take x = ei and y = ej ).
In light of Lemma 5.1.4, we define a quadratic form on C^n to be a function of the form
Q(x) = ⟨x, Ax⟩ = x^H Ax, where A ∈ M_{n,n}(C) is hermitian.
Exercises.
5.1.1. Prove Lemma 5.1.3.
Find all matrices A ∈ M3,3 (R) such that Q(x) = hx, Axi. Which one is symmetric?
Q(x) = c
is an ellipse or a hyperbola (or parallel lines, which is a kind of degenerate hyperbola, or the
empty set) whose axes are parallel to the x-axis and/or y-axis. For example, if F = R, the
set defined by
x_1^2 + 4x_2^2 = 4
is the ellipse
[Figure: the ellipse x_1^2 + 4x_2^2 = 4, with axes labelled x_1 and x_2.]
where S ∈ M_{n,n}(F) is some invertible matrix, so that x = Sy. Then we have
\[
\langle x, Ax\rangle = (Sy)^H A (Sy) = y^H (S^H A S)\, y = \langle y, (S^H A S)\, y\rangle.
\]
Therefore, in the new variables y, the quadratic form has matrix S^H AS.
The set {x : ⟨x, Ax⟩ = 1} is the same ellipse, but in the basis (1/√2, −1/√2), (1/√2, 1/√2).
In other words, it is the same ellipse, rotated −π/4.
where
• the number of 1’s is equal to the number of positive eigenvalues of A (with multiplicity),
• the number of −1’s is equal to the number of negative eigenvalues of A (with multiplic-
ity),
• the number of 0’s is equal to the multiplicity of zero as an eigenvalue of A.
Proof. We leave the proof as Exercise 5.2.3.
The sequence (1, . . . , 1, −1, . . . , −1, 0, . . . , 0) appearing in Proposition 5.2.2 is called the
signature of the hermitian matrix A.
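A minimal Python sketch computing the signature by counting the signs of the eigenvalues (the sample matrix, taken from the quadratic form of Section 5.1, is used only as an illustration):

```python
import numpy as np

def signature(A, tol=1e-10):
    eigenvalues = np.linalg.eigvalsh(A)          # real, since A is hermitian
    plus = int(np.sum(eigenvalues > tol))
    minus = int(np.sum(eigenvalues < -tol))
    zero = len(eigenvalues) - plus - minus
    return (1,) * plus + (-1,) * minus + (0,) * zero

A = np.array([[2.0, 3.0], [3.0, -1.0]])          # det < 0: one positive, one negative eigenvalue
print(signature(A))                               # (1, -1)
```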
Exercises.
5.2.1 ([Tre, Ex. 7.2.2]). For the matrix
\[
A = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix},
\]
unitarily diagonalize the corresponding quadratic form. That is, find a diagonal matrix D
and a unitary matrix U such that D = U^H A U.
5.2.2 ([ND77, Ex. 10.2.4]). For each quadratic form Q(x) below, find new variables y in
which the form is diagonal. Then graph the level curves Q(y) = t (in the coordinates y) for
various values of t.
(a) x_1^2 + 4x_1x_2 − 2x_2^2
(b) x_1^2 + 12x_1x_2 + 4x_2^2
(c) 2x_1^2 + 4x_1x_2 + 4x_2^2
(d) −5x_1^2 − 8x_1x_2 − 5x_2^2
(e) 11x_1^2 + 2x_1x_2 + 3x_2^2
5.2.3. Prove Proposition 5.2.2.
U^H A U = D = diag(λ_1, . . . , λ_n), and define
\[
y = U^{-1} x = U^H x, \qquad\text{so that}\qquad x = U y.
\]
Then
\[
\langle x, Ax\rangle = x^H A x = (Uy)^H A\, Uy = y^H U^H A U y = y^H D y = \langle y, Dy\rangle
\]
and
\[
\langle x, x\rangle = x^H x = (Uy)^H Uy = y^H U^H U y = y^H y = \langle y, y\rangle.
\]
Thus
\[
R_A(x) = \frac{\langle x, Ax\rangle}{\langle x, x\rangle} = \frac{\langle y, Dy\rangle}{\langle y, y\rangle} = R_D(y). \tag{5.1}
\]
So we can always reduce the problem of maximizing a Rayleigh quotient to one of maximizing
a diagonalized Rayleigh quotient.
Theorem 5.3.1. Suppose A ∈ M_{n,n}(C) is hermitian with eigenvalues
\[
\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n.
\]
Then:
(a) λ_1 ≤ R_A(x) ≤ λ_n for all x ≠ 0;
(b) R_A(x) = λ_1 if and only if x is an eigenvector of A corresponding to λ_1;
(c) R_A(x) = λ_n if and only if x is an eigenvector of A corresponding to λ_n.
y = ei ⇐⇒ x = U ei = vi ,
\begin{align*}
R_D(x) = \frac{\langle x, Dx\rangle}{\langle x, x\rangle}
&= \frac{\langle x, D(x_1 e_1 + \cdots + x_n e_n)\rangle}{\langle x, x\rangle} \\
&= \frac{\langle x_1 e_1 + \cdots + x_n e_n,\; \lambda_1 x_1 e_1 + \cdots + \lambda_n x_n e_n\rangle}{\langle x, x\rangle} \\
&= \frac{\lambda_1 |x_1|^2 + \cdots + \lambda_n |x_n|^2}{\langle x, x\rangle} \\
&\ge \frac{\lambda_1 |x_1|^2 + \cdots + \lambda_1 |x_n|^2}{|x_1|^2 + \cdots + |x_n|^2} = \lambda_1,
\end{align*}
with equality holding if and only if xi = 0 for all i with λi > λ1 ; that is, if and only if
Dx = λ1 x.
(c) This follows from applying (b) to −A.
We have
\[
R_A(1, 0, 0) = \frac{\langle (1, 0, 0), (1, 2, 1)\rangle}{1} = 1
\qquad\text{and}\qquad
R_A(0, 1, 0) = \frac{\langle (0, 1, 0), (2, -1, 0)\rangle}{1} = -1.
\]
Thus, by Theorem 5.3.1, we have
\[
\lambda_1 \le -1 \qquad\text{and}\qquad 1 \le \lambda_3.
\]
Additional choices of x can give us improved bounds. In fact, λ_1 = −√6, λ_2 = −1, and
λ_3 = √6. Note that R_A(0, 1, 0) = −1 is an eigenvalue, even though (0, 1, 0) is not an
eigenvector. This does not contradict Theorem 5.3.1 since −1 is neither the minimum nor
the maximum eigenvalue.
We studied matrix norms in Section 2.3. In Theorem 2.3.5 we saw that there are easy ways
to compute the 1-norm kAk1 and the ∞-norm kAk∞ . But we only obtained an inequality
for the 2-norm kAk2 . We can now say something more precise.
Corollary 5.3.3 (Matrix 2-norm). For any A ∈ Mm,n (C), the matrix 2-norm kAk2 is equal
to σ1 , the largest singular value of A.
Proof. We have
\begin{align*}
\|A\|_2^2 &= \max\Bigl\{ \Bigl(\frac{\|Ax\|_2}{\|x\|_2}\Bigr)^{\!2} : x \neq 0 \Bigr\}
= \max\Bigl\{ \frac{\|Ax\|_2^2}{\|x\|_2^2} : x \neq 0 \Bigr\} \\
&= \max\Bigl\{ \frac{\langle Ax, Ax\rangle}{\langle x, x\rangle} : x \neq 0 \Bigr\}
= \max\Bigl\{ \frac{\langle x, A^H A x\rangle}{\langle x, x\rangle} : x \neq 0 \Bigr\}
= \lambda,
\end{align*}
where λ is the largest eigenvalue of the hermitian matrix A^H A (by Theorem 5.3.1). Since
σ_1 = √λ, we are done.
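A quick numerical check of Corollary 5.3.3 (the sample matrix is an assumption):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 4.0]])

sigma1 = np.linalg.svd(A, compute_uv=False)[0]      # largest singular value
two_norm = np.linalg.norm(A, ord=2)                 # matrix 2-norm
rayleigh_max = np.max(np.linalg.eigvalsh(A.T @ A))  # largest eigenvalue of A^H A

print(np.isclose(two_norm, sigma1))                 # ||A||_2 = sigma_1
print(np.isclose(two_norm**2, rayleigh_max))        # and ||A||_2^2 = lambda_max(A^H A)
```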
Theorem 5.3.1 relates the maximum (resp. minimum) eigenvalue of a hermitian matrix
A to the maximum (resp. minimum) value of the Rayleigh quotient RA (x). But what if we
are interested in the intermediate eigenvalues?
Theorem 5.3.4 (Rayleigh's principle). Suppose A ∈ M_{n,n}(C) is hermitian, with eigenvalues
λ_1 ≤ λ_2 ≤ · · · ≤ λ_n and an orthonormal set of associated eigenvectors v_1, . . . , v_n. For
1 ≤ j ≤ n, let
\[
S_j = \{x \in \mathbb{C}^n : x \neq 0,\ \langle v_i, x\rangle = 0 \text{ for } 1 \le i \le j\}
\quad\text{and}\quad
T_j = \{x \in \mathbb{C}^n : x \neq 0,\ \langle v_i, x\rangle = 0 \text{ for } n - j + 1 \le i \le n\}.
\]
Then:
(a) RA (x) ≥ λj+1 for all x ∈ Sj , and RA (x) = λj+1 for x ∈ Sj if and only if x is an
eigenvector associated to λj+1 .
(b) RA (x) ≤ λn−j for all x ∈ Tj , and RA (x) = λn−j for x ∈ Tj if and only if x is an
eigenvector associated to λn−j .
Proof. The proof is very similar to that of Theorem 5.3.1, so we will omit it.
Rayleigh’s principle (Theorem 5.3.4) characterizes each eigenvector and eigenvalue of an
n × n hermitian matrix in terms of an extremum (i.e. maximization or minimization) prob-
lem. However, this characterization of the eigenvectors/eigenvalues other than the largest
and smallest requires knowledge of eigenvectors other than the one being characterized. In
particular, to characterize or estimate λj for 1 < j < n, we need the eigenvectors v1 , . . . , vj−1
or vj+1 , . . . , vn .
The following result remedies the aforementioned issue by giving a characterization of
each eigenvalue that is independent of the other eigenvalues.
5.3. Rayleigh’s principle and the min-max theorem 115
Theorem 5.3.5 (Min-max theorem). Suppose A ∈ Mn,n (C) is hermitian with eigenvalues
λ1 ≤ λ2 ≤ · · · ≤ λn .
Then
\[
\lambda_k = \min_{U : \dim U = k} \bigl\{ \max\{ R_A(x) : x \in U,\ x \neq 0 \} \bigr\}
\]
and
\[
\lambda_k = \max_{U : \dim U = n-k+1} \bigl\{ \min\{ R_A(x) : x \in U,\ x \neq 0 \} \bigr\},
\]
where the first min/max in each expression is over subspaces U ⊆ Cn of the given dimension.
Proof. We prove the first assertion, since the proof of the second is similar. Since A is
hermitian, it is unitarily diagonalizable by the spectral theorem (Theorem 3.4.4). So we can
choose an orthonormal basis
u1 , u2 , . . . , un
of eigenvectors, with ui an eigenvector of eigenvalue λi .
Suppose U ⊆ C^n with dim U = k. Then
\[
\dim\bigl(U \cap \operatorname{Span}\{u_k, \dots, u_n\}\bigr)
\ge \dim U + \dim \operatorname{Span}\{u_k, \dots, u_n\} - n = k + (n - k + 1) - n = 1.
\]
Hence
\[
U \cap \operatorname{Span}\{u_k, \dots, u_n\} \neq \{0\}.
\]
So we can choose a nonzero vector
\[
v = \sum_{i=k}^{n} a_i u_i \in U,
\]
and
\[
R_A(v) = \frac{\langle v, Av\rangle}{\langle v, v\rangle}
= \frac{\sum_{i=k}^{n} \lambda_i |a_i|^2}{\sum_{i=k}^{n} |a_i|^2}
\ge \frac{\sum_{i=k}^{n} \lambda_k |a_i|^2}{\sum_{i=k}^{n} |a_i|^2} = \lambda_k.
\]
Since this is true for all U, we have
\[
\lambda_k \le \min_{U : \dim U = k} \bigl\{ \max\{ R_A(x) : x \in U,\ x \neq 0 \} \bigr\}.
\]
For the reverse inequality, let
\[
V = \operatorname{Span}\{u_1, \dots, u_k\}.
\]
Then
\[
\max\{ R_A(x) : x \in V,\ x \neq 0 \} \le \lambda_k,
\]
since λ_k is the largest eigenvalue of A with an eigenvector in V. Thus we also have
\[
\min_{U : \dim U = k} \bigl\{ \max\{ R_A(x) : x \in U,\ x \neq 0 \} \bigr\} \le \lambda_k,
\]
which gives the first assertion.
Consider again the matrix A of Example 5.3.2. Let's use the min-max theorem to get some bounds on λ_2. So n = 3 and k = 2. Take
\[
V = \operatorname{Span}\{(1, 0, 0), (0, 0, 1)\}.
\]
Then dim V = 2. The nonzero elements of V are the vectors (x_1, 0, x_3) with (x_1, x_3) ≠ (0, 0). We have
\[
R_A(x_1, 0, x_3) = \frac{\langle (x_1, 0, x_3), (x_1, 2x_1 + 5x_3, -x_3)\rangle}{|x_1|^2 + |x_3|^2}
= \frac{|x_1|^2 - |x_3|^2}{|x_1|^2 + |x_3|^2}.
\]
This attains its maximum value of 1 at any scalar multiple of (1, 0, 0) and its minimum value
of −1 at any scalar multiple of (0, 0, 1). Thus, by the min-max theorem (Theorem 5.3.5), we
have
\[
\lambda_2 \le \max\{R_A(x) : x \in V,\ x \neq 0\} = 1
\]
and
\[
\lambda_2 \ge \min\{R_A(x) : x \in V,\ x \neq 0\} = -1.
\]
has maximum value 1/2, when x is a scalar multiple of (1, 1). So it is crucial that A be
hermitian in the theorems of this section.
5.3. Rayleigh’s principle and the min-max theorem 117
Exercises.
5.3.1. Prove that x_0 maximizes R_A(x) subject to x ≠ 0 and yields the maximum value
M = R_A(x_0) if and only if x_1 = x_0/‖x_0‖ maximizes ⟨x, Ax⟩ subject to ‖x‖ = 1 and yields
⟨x_1, Ax_1⟩ = M. Hint: The proof is similar to that of Proposition 2.3.2.
5.3.2 ([ND77, Ex. 10.4.1]). Use the Rayleigh quotient to find lower bounds for the largest
eigenvalue and upper bounds for the smallest eigenvalue of
\[
A = \begin{bmatrix} 0 & -1 & 0 \\ -1 & -1 & 1 \\ 0 & 1 & 0 \end{bmatrix}.
\]
5.3.3 ([ND77, Ex. 10.4.2]). An eigenvector associated with the lowest eigenvalue of the matrix
below has the form xa = (1, a, 1). Find the exact value of a by defining the function
f (a) = RA (xa ) and using calculus to minimize f (a). What is the lowest eigenvalue of A?
\[
A = \begin{bmatrix} 3 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 3 \end{bmatrix}.
\]
5.3.4 ([ND77, Ex. 10.4.5]). For each matrix A below, use RA to obtain lower bounds on the
greatest eigenvalue and upper bounds on the least eigenvalue.
(a) $\begin{bmatrix} 3 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 3 \end{bmatrix}$
(b) $\begin{bmatrix} 7 & -16 & -8 \\ -16 & 7 & 8 \\ -8 & 8 & -5 \end{bmatrix}$
(c) $\begin{bmatrix} 2 & -1 & 0 \\ -1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}$
5.3.5 ([ND77, Ex. 10.4.6]). Using v3 = (1, −1, −1) as an eigenvector associated with the
largest eigenvalue λ3 of the matrix A of Exercise 5.3.2, use RA to obtain lower bounds on
the second largest eigenvalue λ2 .
[AK08] Grégoire Allaire and Sidi Mahmoud Kaber. Numerical linear algebra, volume 55 of Texts in Applied Mathematics. Springer, New York, 2008. Translated from the 2002 French original by Karim Trabelsi. URL: https://doi.org/10.1007/978-0-387-68918-0.
[Mey00] Carl Meyer. Matrix analysis and applied linear algebra. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2000. With 1 CD-ROM (Windows, Macintosh and UNIX) and a solutions manual (iv+171 pp.). URL: https://doi.org/10.1137/1.9780898719512.
[ND77] Ben Noble and James W. Daniel. Applied linear algebra. Prentice-Hall, Inc., Englewood Cliffs, N.J., second edition, 1977.
[Pen55] R. Penrose. A generalized inverse for matrices. Proc. Cambridge Philos. Soc., 51:406–413, 1955.