Advanced Linear and Matrix Algebra
Nathaniel Johnston
Department of Mathematics and
Computer Science
Mount Allison University
Sackville, NB, Canada
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
For Devon
…who was very eager at age 2 to contribute to this book:
ndfshfjds kfdshdsf kdfsh kdsfhfdsk hdfsk
The Purpose of this Book
Linear algebra, more so than any other mathematical subject, can be approached in numerous ways.
Many textbooks present the subject in a very concrete and numerical manner, spending much of their
time solving systems of linear equations and having students perform laborious row reductions on
matrices. Many other books instead focus very heavily on linear transformations and other
basis-independent properties, almost to the point that their connection to matrices is considered an
inconvenient after-thought that students should avoid using at all costs.
This book is written from the perspective that both linear transformations and matrices are useful
objects in their own right, but it is the connection between the two that really unlocks the magic of
linear algebra. Sometimes when we want to know something about a linear transformation, the easiest
way to get an answer is to grab onto a basis and look at the corresponding matrix. Conversely, there
are many interesting families of matrices and matrix operations that seemingly have nothing to do with
linear transformations, yet can nonetheless illuminate how some basis-independent objects and
properties behave.
This book introduces many difficult-to-grasp objects such as vector spaces, dual spaces, and tensor
products. Because it is expected that this book will accompany one of the first courses where students
are exposed to such abstract concepts, we typically sandwich this abstractness between concrete
examples. That is, we first introduce or emphasize a standard, prototypical example of the object to be
introduced (e.g., Rn ), then we discuss its abstract generalization (e.g., vector spaces), and finally we
explore other specific examples of that generalization (e.g., the vector space of polynomials and the
vector space of matrices).
This book also delves somewhat deeper into matrix decompositions than most others do. We of course
cover the singular value decomposition as well as several of its applications, but we also spend quite a
bit of time looking at the Jordan decomposition, Schur triangularization, and spectral decomposition,
and we compare and contrast them with each other to highlight when each one is appropriate to use.
Computationally-motivated decompositions like the QR and Cholesky decompositions are also
covered in some of this book’s many “Extra Topic” sections.
This book is a continuation of the author's introductory linear algebra text, and it assumes familiarity with the basics of the subject, including concepts like the determinant of a matrix, as well as eigenvalues and eigenvectors. These preliminary topics are briefly reviewed in Appendix A.1.
Because these books aim to not overlap with each other or repeat content, we do not discuss some topics that are instead explored in that book. In particular, diagonalization of a matrix via its eigenvalues and eigenvectors is discussed in the introductory book and not here. However, many extensions and variations of diagonalization, such as the spectral decomposition (Section 2.1.2) and the Jordan decomposition (Section 2.4), are explored here.
This text makes heavy use of notes in the margin, which are used to introduce some additional
terminology or provide reminders that would be distracting in the main text. They are most commonly
used to try to address potential points of confusion for the reader, so it is best not to skip them.
For example, if we want to clarify why a particular piece of notation is the way it is, we do so in the
margin so as to not derail the main discussion. Similarly, if we use some basic fact that students are
expected to be aware of (but have perhaps forgotten) from an introductory linear algebra course, the
margin will contain a brief reminder of why it’s true.
Exercises
Several exercises can be found at the end of every section in this book, and whenever possible there
are three types of them:
• There are computational exercises that ask the reader to implement some algorithm or make
use of the tools presented in that section to solve a numerical problem.
• There are true/false exercises that test the reader’s critical thinking skills and reading com-
prehension by asking them whether some statements are true or false.
• There are proof exercises that ask the reader to prove a general statement. These typically are
either routine proofs that follow straight from the definition (and thus were omitted from the
main text itself), or proofs that can be tackled via some technique that we saw in that section.
Roughly half of the exercises are marked with an asterisk (∗), which means that they have a solution provided in Appendix C. Exercises marked with two asterisks (∗∗) are referenced in the main text and are thus particularly important (and also have solutions in Appendix C).
The material covered in Chapters 1 and 2 is mostly standard material in upper-year undergraduate
linear algebra courses. In particular, Chapter 1 focuses on abstract structures like vector spaces and
inner products, to show students that the tools developed in their previous linear algebra course can be
applied to a much wider variety of objects than just lists of numbers like vectors in Rn . Chapter 2 then
explores how we can use these new tools at our disposal to gain a much deeper understanding of
matrices.
Chapter 3 covers somewhat more advanced material—multilinearity and the tensor product—which is
aimed particularly at advanced undergraduate students (though we note that no knowledge of abstract
algebra is assumed). It could serve perhaps as content for part of a third course, or as an independent
study in linear algebra. Alternatively, that chapter is also quite aligned with the author’s research
interests as a quantum information theorist, and it could be used as supplemental reading for students
who are trying to learn the basics of the field.
Sectioning
The sectioning of the book is designed to make it as simple to teach from as possible. The author
spends approximately the following amount of time on each chunk of this book:
• Subsection: 1–1.5 hour lecture
• Section: 2 weeks (3–4 subsections per section)
• Chapter: 5–6 weeks (3–4 sections per chapter)
• Book: 12-week course (2 chapters, plus some extra sections)
Of course, this is just a rough guideline, as some sections are longer than others. Furthermore, some
instructors may choose to include material from Chapter 3, or from some of the numerous in-depth
“Extra Topic” sections. Alternatively, the additional topics covered in those sections can serve as
independent study topics for students.
Half of this book's sections are called "Extra Topic" sections. The purpose of arranging the book in this way is to provide a clear main path through it (Sections 1.1–1.4, 2.1–2.4, and 3.1–3.3) that can be supplemented by the Extra Topic sections at the reader's/instructor's discretion. It is expected that many courses will not even make it to Chapter 3, and will instead opt to explore some of the earlier Extra Topic sections.
We want to emphasize that the Extra Topic sections are not labeled as such because they are less
important than the main sections, but only because they are not prerequisites for any of the main
sections. For example, norms and isometries (Section 1.D) are used constantly throughout advanced
mathematics, but they are presented in an Extra Topic section since the other sections of this book do
not depend on them (and also because they lean quite a bit into “analysis territory”, whereas most
of the rest of the book stays firmly in “algebra territory”).
Similarly, the author expects that many instructors will include the section on the direct sum and
orthogonal complements (Section 1.B) as part of their course’s core material, but this can be done at
their discretion. The subsections on dual spaces and multilinear forms from Section 1.3.2 can be
omitted reasonably safely to make up some time if needed, as can the subsection on Gershgorin discs
(Section 2.2.2), without drastically affecting the book’s flow.
For a graph that depicts the various dependencies of the sections of this book on each other, see
Figure H.
Figure H: A graph depicting the dependencies of the sections of this book on each other. Solid arrows indicate that the section
is required before proceeding to the section that it points to, while dashed arrows indicate recommended (but not required) prior
reading. The main path through the book consists of Sections 1–4 of each chapter. The extra sections A–D are optional and can
be explored at the reader’s discretion, as none of the main sections depend on them.
Acknowledgments
Thanks are extended to Geoffrey Cruttwell, Mark Hamilton, David Kribs, Chi-Kwong Li, Benjamin
Lovitz, Neil McKay, Vern Paulsen, Rajesh Pereira, Sarah Plosker, Jamie Sikora, and John Watrous for
various discussions that have either directly or indirectly improved the quality of this book.
Thank you to Everett Patterson, as well as countless other students in my linear algebra classes at
Mount Allison University, for drawing my attention to typos and parts of the book that could be
improved.
Parts of the layout of this book were inspired by the Legrand Orange Book template by Velimir
Gayevskiy and Mathias Legrand at LaTeXTemplates.com.
Finally, thank you to my wife Kathryn for tolerating me during the years of my mental absence glued
to this book, and thank you to my parents for making me care about both learning and teaching.
1. Vector Spaces
Margin note: Rn denotes the set of vectors with n (real) entries, and Mm,n denotes the set of m × n matrices.

Our first exposure to linear algebra is typically a very concrete thing—it consists of some basic facts about Rn (a set made up of lists of real numbers, called vectors) and Mm,n (a set made up of m × n arrays of real numbers, called matrices). We can do numerous useful things in these sets, such as solve systems of linear equations, multiply matrices together, and compute the rank, determinant, and eigenvalues of a matrix.
When we look carefully at our procedures for doing these calculations,
as well as our proofs for why they work, we notice that most of them do not
actually require much more than the ability to add vectors together and multiply
them by scalars. However, there are many other mathematical settings where
addition and scalar multiplication work, and almost all of our linear algebraic
techniques work in these more general settings as well.
With this in mind, our goal right now is considerably different from what
it was in introductory linear algebra—we want to see exactly how far we can
push our techniques. Instead of defining objects and operations in terms of
explicit formulas and then investigating what properties they satisfy (as we
have done up until now), we now focus on the properties that those familiar
objects have and ask what other types of objects have those properties.
For example, in a typical introduction to linear algebra, the dot product of
two vectors v and w in Rn is defined by
v · w = v1 w1 + v2 w2 + · · · + vn wn ,
and then the “nice” properties that the dot product satisfies are investigated. For
example, students typically learn the facts that
v·w = w·v and v · (w + x) = (v · w) + (v · x) for all v, w, x ∈ Rn
almost immediately after being introduced to the dot product. In this chapter,
we flip this approach around and instead define an “inner product” to be any
function satisfying those same properties, and then show that everything we
learned about the dot product actually applies to every single inner product
(even though many inner products look, on the surface, quite different from the
dot product).
1.1 Vector Spaces and Subspaces

In order to use our linear algebraic tools with objects other than vectors in Rn ,
we need a proper definition that tells us what types of objects we can consider in
a linear algebra setting. The following definition makes this precise and serves
as the foundation for this entire chapter. Although the definition looks like an
absolute beast, the intuition behind it is quite straightforward—the objects that
we work with should behave “like” vectors in Rn . That is, they should have the
same properties (like commutativity: v + w = w + v) that vectors in Rn have
with respect to vector addition and scalar multiplication.
More specifically, the following definition lists 10 properties that must be
satisfied in order for us to call something a “vector space” (like R3 ) or a “vector”
(like (1, 3, −2) ∈ R3 ). These 10 properties can be thought of as the answers to
the question “what properties of vectors in Rn can we list without explicitly
referring to their entries?”
Definition 1.1.1 (Vector Space). Let F be a set of scalars (usually either R or C) and let V be a set with two operations called addition and scalar multiplication. We write the addition of v, w ∈ V as v + w, and the scalar multiplication of c ∈ F and v as cv.

Margin note: R is the set of real numbers and C is the set of complex numbers (see Appendix A.3).

If the following ten conditions hold for all v, w, x ∈ V and all c, d ∈ F, then V is called a vector space and its elements are called vectors:

a) v + w ∈ V (closure under addition)
b) v + w = w + v (commutativity)
c) (v + w) + x = v + (w + x) (associativity)
d) There exists a "zero vector" 0 ∈ V such that v + 0 = v.
e) There exists a vector −v such that v + (−v) = 0.
f) cv ∈ V (closure under scalar multiplication)
g) c(v + w) = cv + cw (distributivity)
h) (c + d)v = cv + dv (distributivity)
i) c(dv) = (cd)v
j) 1v = v

Margin note: Notice that the first five properties concern addition, while the last five concern scalar multiplication.
Remark 1.1.1 (Fields and Sets of Scalars). Not much would be lost throughout this book if we were to replace the set of scalars F from Definition 1.1.1 with "either R or C". In fact, in many cases it is even enough to just explicitly choose F = C, since oftentimes if a property holds over C then it automatically holds over R simply because R ⊆ C.

However, F more generally can be any "field", which is a set of objects in which we can add, subtract, multiply, and divide according to the usual laws of arithmetic (e.g., ab = ba and a(b + c) = ab + ac for all a, b, c ∈ F)—see Appendix A.4. Just like we can keep Rn in mind as the standard example of a vector space, we can keep R and C in mind as the standard examples of a field.

Margin note: It's also useful to know that every field has a "0" and a "1": numbers such that 0a = 0 and 1a = a for all a ∈ F.
We now look at several examples of sets that are and are not vector spaces, to try to get used to this admittedly long and cumbersome definition. As our first example, we show that Rn is indeed a vector space (which should not be surprising—the definition of a vector space was designed specifically so as to mimic Rn). For instance, properties (c) and (g) of Definition 1.1.1 hold because

(v + w) + x = (v1 + w1, . . . , vn + wn) + (x1, . . . , xn)
            = (v1 + w1 + x1, . . . , vn + wn + xn)
            = (v1, . . . , vn) + (w1 + x1, . . . , wn + xn) = v + (w + x), and
c(v + w) = c(v1 + w1, . . . , vn + wn)
         = (cv1 + cw1, . . . , cvn + cwn) = cv + cw.

The remaining properties can be checked in the same way.
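Readers who like to experiment can also spot-check these properties numerically. The following is a small added sketch (it assumes NumPy, which this book does not otherwise use); checking random vectors only provides evidence, of course, not a proof.

```python
# A minimal numerical sanity check (not a proof) that the usual operations
# on R^n satisfy several of the properties (b)-(j) of Definition 1.1.1.
import numpy as np

rng = np.random.default_rng(0)
n = 4
v, w, x = rng.standard_normal((3, n))   # three random vectors in R^n
c, d = rng.standard_normal(2)           # two random scalars

assert np.allclose(v + w, w + v)                 # (b) commutativity
assert np.allclose((v + w) + x, v + (w + x))     # (c) associativity
assert np.allclose(v + np.zeros(n), v)           # (d) zero vector
assert np.allclose(v + (-v), np.zeros(n))        # (e) additive inverse
assert np.allclose(c * (v + w), c * v + c * w)   # (g) distributivity
assert np.allclose((c + d) * v, c * v + d * v)   # (h) distributivity
assert np.allclose(c * (d * v), (c * d) * v)     # (i)
assert np.allclose(1 * v, v)                     # (j)
print("All spot-checked properties hold (up to floating-point error).")
```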
The set of all infinite sequences of scalars (which we are perhaps less used to) is also a vector space. That is, the sequence space

F^N = { (x1, x2, x3, . . .) : xj ∈ F for all j ∈ N }

is a vector space with the standard addition and scalar multiplication operations (we leave the proof of this fact to Exercise 1.1.7).

Margin note: N is the set of natural numbers. That is, N = {1, 2, 3, . . .}.
To get a bit more comfortable with vector spaces, we now look at several
examples of vector spaces that, on the surface, look significantly different from
these spaces of tuples and sequences.
Example 1.1.2 (The Set of Matrices is a Vector Space). Show that Mm,n, the set of m × n matrices, is a vector space.

Solution:
We have to check the ten properties described by Definition 1.1.1. If A, B, C ∈ Mm,n and c, d ∈ F then:

a) A + B is also an m × n matrix (i.e., A + B ∈ Mm,n).
b) For each 1 ≤ i ≤ m and 1 ≤ j ≤ n, the (i, j)-entry of A + B is ai,j + bi,j = bi,j + ai,j, which is also the (i, j)-entry of B + A. It follows that A + B = B + A.
c) The fact that (A + B) + C = A + (B + C) follows similarly from looking at the (i, j)-entry of each of these matrices and using associativity of addition in F.
d) The "zero vector" in this space is the zero matrix O (i.e., the m × n matrix with every entry equal to 0), since A + O = A.
e) We define −A to be the matrix whose (i, j)-entry is −ai,j, so that A + (−A) = O.
f) cA is also an m × n matrix (i.e., cA ∈ Mm,n).
g) The (i, j)-entry of c(A + B) is c(ai,j + bi,j) = cai,j + cbi,j, which is the (i, j)-entry of cA + cB, so c(A + B) = cA + cB.
h) Similarly to property (g), the (i, j)-entry of (c + d)A is (c + d)ai,j = cai,j + dai,j, which is also the (i, j)-entry of cA + dA. It follows that (c + d)A = cA + dA.
i) Similarly to property (g), the (i, j)-entry of c(dA) is c(dai,j) = (cd)ai,j, which is also the (i, j)-entry of (cd)A, so c(dA) = (cd)A.
j) The fact that 1A = A is clear.

Margin note: The "(i, j)-entry" of a matrix is the scalar in its i-th row and j-th column. We denote the (i, j)-entry of A and B by ai,j and bi,j, respectively, or sometimes by [A]i,j and [B]i,j.
Example 1.1.3 (The Set of Functions is a Vector Space). Show that the set of real-valued functions F = { f : R → R } is a vector space.

Solution:
Once again, we have to check the ten properties described by Definition 1.1.1. To do this, we will repeatedly use the fact that two functions are the same if and only if their outputs are always the same (i.e., f = g if and only if f(x) = g(x) for all x ∈ R). With this observation in mind, we note that if f, g, h ∈ F and c, d ∈ R then:

a) f + g is the function defined by (f + g)(x) = f(x) + g(x) for all x ∈ R. In particular, f + g is also a function, so f + g ∈ F.
b) For all x ∈ R, we have (f + g)(x) = f(x) + g(x) = g(x) + f(x) = (g + f)(x), so f + g = g + f.
c) For all x ∈ R, we have ((f + g) + h)(x) = (f(x) + g(x)) + h(x) = f(x) + (g(x) + h(x)) = (f + (g + h))(x), so (f + g) + h = f + (g + h).
d) The "zero vector" in this space is the function 0 with the property that 0(x) = 0 for all x ∈ R.
e) Given a function f, the function −f is simply defined by (−f)(x) = −f(x) for all x ∈ R. Then (f + (−f))(x) = f(x) + (−f)(x) = f(x) − f(x) = 0 for all x ∈ R, so f + (−f) = 0.
f) cf is the function defined by (cf)(x) = cf(x) for all x ∈ R. In particular, cf is also a function, so cf ∈ F.
g) For all x ∈ R, we have (c(f + g))(x) = c(f(x) + g(x)) = cf(x) + cg(x) = (cf + cg)(x), so c(f + g) = cf + cg.
h) For all x ∈ R, we have ((c + d)f)(x) = (c + d)f(x) = cf(x) + df(x) = (cf + df)(x), so (c + d)f = cf + df.
i) For all x ∈ R, we have (c(df))(x) = c(df(x)) = (cd)f(x) = ((cd)f)(x), so c(df) = (cd)f.
j) For all x ∈ R, we have (1f)(x) = 1f(x) = f(x), so 1f = f.

Margin note: All of these properties of functions are trivial. The hardest part of this example is not the math, but rather getting our heads around the unfortunate notation that results from using parentheses to group function addition and also to denote inputs of functions.
In the previous three examples, we did not explicitly specify what the
addition and scalar multiplication operations were upfront, since there was an
obvious choice on each of these sets. However, it may not always be clear what
these operations should actually be, in which case they have to be explicitly
defined before we can start checking whether or not they turn the set into a
vector space.
In other words, vector spaces are a package deal with their addition and
scalar multiplication operations—a set might be a vector space when the op-
erations are defined in one way but not when they are defined another way.
Furthermore, the operations that we call addition and scalar multiplication
might look nothing like what we usually call “addition” or “multiplication”.
All that matters is that those operations satisfy the ten properties from Defini-
tion 1.1.1.
Example 1.1.4 (A Vector Space With Weird Operations). Let V = {x ∈ R : x > 0} be the set of positive real numbers. Show that V is a vector space when we define addition ⊕ on it via usual multiplication of real numbers (i.e., x ⊕ y = xy) and scalar multiplication ⊙ on it via exponentiation (i.e., c ⊙ x = x^c).

Margin note: In this example, we use bold variables like x when we think of objects as vectors in V, and we use non-bold variables like x when we think of them just as positive real numbers.

Solution:
Once again, we have to check the ten properties described by Definition 1.1.1. Well, if x, y, z ∈ V and c, d ∈ R then:

a) x ⊕ y = xy, which is still a positive number since x and y are both positive, so x ⊕ y ∈ V.
b) x ⊕ y = xy = yx = y ⊕ x.
c) (x ⊕ y) ⊕ z = (xy)z = x(yz) = x ⊕ (y ⊕ z).
d) The "zero vector" in this space is the number 1 (so we write 0 = 1), since 0 ⊕ x = 1x = x.
e) For each vector x ∈ V, we define −x = 1/x. This works because x ⊕ (−x) = x(1/x) = 1 = 0.
f) c ⊙ x = x^c, which is still positive since x is positive, so c ⊙ x ∈ V.
g) c ⊙ (x ⊕ y) = (xy)^c = x^c y^c = (c ⊙ x) ⊕ (c ⊙ y).
h) (c + d) ⊙ x = x^(c+d) = x^c x^d = (c ⊙ x) ⊕ (d ⊙ x).
i) c ⊙ (d ⊙ x) = (x^d)^c = x^(cd) = (cd) ⊙ x.
j) 1 ⊙ x = x^1 = x.
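These weird operations can also be spot-checked numerically. The following added sketch (it assumes NumPy; the names v_add and s_mul are ours, introduced only for this illustration) checks several of the properties for random positive numbers and random real scalars.

```python
# A numerical spot-check of Example 1.1.4: on V = {x in R : x > 0},
# "addition" is ordinary multiplication and "scalar multiplication"
# is exponentiation. Random samples provide evidence, not a proof.
import numpy as np

rng = np.random.default_rng(1)
x, y, z = rng.uniform(0.1, 10.0, size=3)   # random positive "vectors"
c, d = rng.standard_normal(2)              # random real scalars

def v_add(a, b):   # x "plus" y is xy
    return a * b

def s_mul(c, a):   # c "times" x is x**c
    return a ** c

zero = 1.0         # the "zero vector" of V is the number 1
assert np.isclose(v_add(x, y), v_add(y, x))                      # commutativity
assert np.isclose(v_add(v_add(x, y), z), v_add(x, v_add(y, z)))  # associativity
assert np.isclose(v_add(x, zero), x)                             # zero vector
assert np.isclose(v_add(x, 1.0 / x), zero)                       # -x is 1/x
assert np.isclose(s_mul(c, v_add(x, y)), v_add(s_mul(c, x), s_mul(c, y)))
assert np.isclose(s_mul(c + d, x), v_add(s_mul(c, x), s_mul(d, x)))
assert np.isclose(s_mul(c, s_mul(d, x)), s_mul(c * d, x))
assert np.isclose(s_mul(1.0, x), x)
print("All spot-checked properties hold.")
```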
The previous examples illustrate that vectors, vector spaces, addition, and
scalar multiplication can all look quite different from the corresponding con-
cepts in Rn . As one last technicality, we note that whether or not a set is a
vector space depends on the field F that is being considered. The field can often,
but not always, be inferred from context.
Before presenting an example that demonstrates why the choice of field is
not always obvious, we establish some notation and remind the reader of some
terminology. The transpose of a matrix A ∈ Mm,n is the matrix AT ∈ Mn,m
whose (i, j)-entry is a j,i . That is, AT is obtained from A by reflecting its entries
across its main diagonal. For example, if

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, then A^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}.
Similarly, the conjugate transpose of A ∈ Mm,n(C) is the matrix A* ∈ Mn,m(C) whose (i, j)-entry is \overline{a_{j,i}}, where the horizontal line denotes complex conjugation (i.e., \overline{a + ib} = a − ib). In other words, A* = \overline{A}^T. For example, if

A = \begin{bmatrix} 1 & 3 - i & 2i \\ 2 + 3i & -i & 0 \end{bmatrix}, then A^* = \begin{bmatrix} 1 & 2 - 3i \\ 3 + i & i \\ -2i & 0 \end{bmatrix}.

Margin note: Complex conjugation is reviewed in Appendix A.3.2.
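As an added aside (not part of the book's text), both operations are easy to compute with NumPy if a matrix is stored as a (complex) array; the sketch below reproduces the two examples above.

```python
# The transpose and conjugate transpose of the two example matrices,
# computed with NumPy (which the book itself does not assume).
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(A.T)          # [[1 4], [2 5], [3 6]]

B = np.array([[1, 3 - 1j, 2j],
              [2 + 3j, -1j, 0]])
print(B.conj().T)   # [[1, 2-3j], [3+1j, 1j], [-2j, 0]]
```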
Example 1.1.5 (Is the Set of Hermitian Matrices a Vector Space?). Let M^H_n be the set of n × n Hermitian matrices (i.e., matrices A ∈ Mn(C) satisfying A* = A). Show that if the field is F = R then M^H_n is a vector space, but if F = C then it is not.

Solution:
Before presenting the solution, we clarify that the field F does not specify what the entries of the Hermitian matrices are: in both cases (i.e., when F = R and when F = C), the entries of the matrices themselves can be complex. The field F associated with a vector space just determines what types of scalars are used in scalar multiplication.

To illustrate this point, we start by showing that if F = C then M^H_n is not a vector space. For example, property (f) of vector spaces fails because

A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} ∈ M^H_2, but iA = \begin{bmatrix} 0 & i \\ i & 0 \end{bmatrix} ∉ M^H_2.

That is, M^H_2 (and by similar reasoning, M^H_n) is not closed under multiplication by complex scalars.

On the other hand, to see that M^H_n is a vector space when F = R, we check the ten properties described by Definition 1.1.1. If A, B, C ∈ M^H_n and c, d ∈ R then:

a) (A + B)* = A* + B* = A + B, so A + B ∈ M^H_n.
d) The zero matrix O is Hermitian, so we choose it as the "zero vector" in M^H_n.
e) If A ∈ M^H_n then −A ∈ M^H_n too.
f) Since c is real, we have (cA)* = cA* = cA, so cA ∈ M^H_n (whereas if c were complex then we would just have (cA)* = \overline{c}A, which does not necessarily equal cA, as we saw above).

All of the other properties of vector spaces follow immediately using the same arguments that we used to show that Mm,n is a vector space in Example 1.1.2.
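The closure claims in this example are also easy to check numerically. The short sketch below is an addition (it assumes NumPy and a helper called is_hermitian that we introduce only for this illustration): it confirms that a real scalar multiple of a Hermitian matrix is still Hermitian, while iA is not.

```python
# Spot-checking Example 1.1.5: Hermitian matrices are closed under
# real scalar multiplication, but not under complex scalar multiplication.
import numpy as np

def is_hermitian(M):
    return np.allclose(M, M.conj().T)

A = np.array([[0, 1],
              [1, 0]], dtype=complex)
print(is_hermitian(A))         # True
print(is_hermitian(3.7 * A))   # True: real multiples stay Hermitian
print(is_hermitian(1j * A))    # False: iA is not Hermitian
```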
In cases where we wish to clarify which field we are using for scalar multiplication in a vector space V, we say that V is a vector space "over" that field. For example, we say that M^H_n (the set of n × n Hermitian matrices) is a vector space over R, but not a vector space over C. Alternatively, we refer to the field in question as the ground field of V (so the ground field of M^H_n, for example, is R).
Despite the various vector spaces that we have seen looking so different on
the surface, not much changes when we do linear algebra in this more general
setting. To get a feeling for how we can prove things about vector spaces in
general, we now prove our very first theorem. We can think of this theorem as
answering the question of why we did not include some other properties like
“0v = 0” in the list of defining properties of vector spaces, even though we did
include “1v = v”. The reason is simply that listing these extra properties would
be redundant, as they follow from the ten properties that we did list.
Theorem 1.1.1. Suppose V is a vector space over a field F and v ∈ V. Then
a) 0v = 0, and
b) (−1)v = −v.

Proof. To see that part (a) holds, we compute

0v = 0v + 0                        (property (d))
   = 0v + (0v + (−(0v)))           (property (e))
   = (0v + 0v) + (−(0v))           (property (c))
   = (0 + 0)v + (−(0v))            (property (h))
   = 0v + (−(0v))                  (0 + 0 = 0 in every field)
   = 0.                            (property (e))

Margin note: Proving things about vector spaces will quickly become less tedious than in this theorem—as we develop more tools, we will find ourselves referencing the defining properties (a)–(j) less and less.

Now that we have 0v = 0 to work with, proving that (−1)v = −v is a bit more straightforward: v + (−1)v = 1v + (−1)v = (1 + (−1))v = 0v = 0, so (−1)v is an additive inverse of v, which means that (−1)v = −v.
1.1.1 Subspaces
It is often useful to work with vector spaces that are contained within other vector spaces. This situation comes up often enough that it gets its own name:

Definition 1.1.2 (Subspace). If V is a vector space over a field F and S ⊆ V is itself a vector space with respect to the same addition, scalar multiplication, and field F, then S is called a subspace of V.
Theorem 1.1.2 (Determining if a Set is a Subspace). Let V be a vector space over a field F and let S ⊆ V be non-empty. Then S is a subspace of V if and only if the following two conditions hold:
a) If v, w ∈ S then v + w ∈ S, and (closure under addition)
b) if v ∈ S and c ∈ F then cv ∈ S. (closure under scalar mult.)

Proof. For the "only if" direction, properties (a) and (b) in this theorem are properties (a) and (f) in Definition 1.1.1 of a vector space, so of course they must hold if S is a subspace (since subspaces are vector spaces).

For the "if" direction, we have to show that all ten properties (a)–(j) in Definition 1.1.1 of a vector space hold for S:
• Properties (a) and (f) hold by hypothesis.
• Properties (b), (c), (g), (h), (i), and (j) hold for all vectors in V, so they certainly hold for all vectors in S too, since S ⊆ V.
• For property (d), we need to show that 0 ∈ S. If v ∈ S then we know that 0v ∈ S too since S is closed under scalar multiplication. However, we know from Theorem 1.1.1(a) that 0v = 0, so we are done.
• For property (e), we need to show that if v ∈ S then −v ∈ S too. We know that (−1)v ∈ S since S is closed under scalar multiplication, and we know from Theorem 1.1.1(b) that (−1)v = −v, so we are done.

Margin note: Every subspace, just like every vector space, must contain a zero vector.
Example 1.1.6 (The Set of Polynomials is a Subspace). Let P p be the set of real-valued polynomials of degree at most p. Show that P p is a subspace of F, the vector space of all real-valued functions.

Margin note: The degree of a polynomial is the largest exponent to which the variable is raised, so a polynomial of degree at most p looks like f(x) = a_p x^p + · · · + a_1 x + a_0, where a_p, . . ., a_1, a_0 ∈ R. The degree of f is exactly p if a_p ≠ 0. See Appendix A.2 for an introduction to polynomials.

Solution:
We just have to check the two properties described by Theorem 1.1.2. Hopefully it is somewhat clear that adding two polynomials of degree at most p results in another polynomial of degree at most p, and similarly multiplying a polynomial of degree at most p by a scalar results in another one, but we make this computation explicit.

Suppose f(x) = a_p x^p + · · · + a_1 x + a_0 and g(x) = b_p x^p + · · · + b_1 x + b_0 are polynomials of degree at most p (i.e., f, g ∈ P p) and c ∈ R is a scalar. Then:

a) We compute
(f + g)(x) = (a_p x^p + · · · + a_1 x + a_0) + (b_p x^p + · · · + b_1 x + b_0)
           = (a_p + b_p)x^p + · · · + (a_1 + b_1)x + (a_0 + b_0),
which is again a polynomial of degree at most p, so f + g ∈ P p.
b) Similarly,
(cf)(x) = c(a_p x^p + · · · + a_1 x + a_0) = (ca_p)x^p + · · · + (ca_1)x + (ca_0),
which is again a polynomial of degree at most p, so cf ∈ P p.

It follows from Theorem 1.1.2 that P p is a subspace of F.
Example 1.1.7 (The Set of Upper Triangular Matrices is a Subspace). Show that the set of n × n upper triangular matrices is a subspace of Mn.

Margin note: Recall that a matrix A is called upper triangular if ai,j = 0 whenever i > j. For example, a 2 × 2 upper triangular matrix has the form A = \begin{bmatrix} a & b \\ 0 & c \end{bmatrix}.

Solution:
Again, we have to check the two properties described by Theorem 1.1.2, and it is again somewhat clear that both of these properties hold. For example, if we add two matrices with zeros below the diagonal, their sum will still have zeros below the diagonal.

We now formally prove that these properties hold. Suppose A, B ∈ Mn are upper triangular (i.e., ai,j = bi,j = 0 whenever i > j) and c ∈ F is a scalar.

a) The (i, j)-entry of A + B is ai,j + bi,j = 0 + 0 = 0 whenever i > j, and
b) the (i, j)-entry of cA is cai,j = c · 0 = 0 whenever i > j.

Since both properties are satisfied, we conclude that the set of n × n upper triangular matrices is indeed a subspace of Mn.
Example 1.1.8 (The Set of Non-Invertible Matrices is Not a Subspace). Show that the set of non-invertible 2 × 2 matrices is not a subspace of M2.

Margin note: Similar examples can be used to show that, for all n ≥ 1, the set of non-invertible n × n matrices is not a subspace of Mn.

Solution:
This set is not a subspace because it is not closed under addition. For example, the following matrices are not invertible:

A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, B = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}.

However, their sum A + B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} is the identity matrix, which is invertible. It follows that property (a) of Theorem 1.1.2 does not hold, so this set is not a subspace of M2.
Example 1.1.9 (The Set of Integer-Entry Vectors is Not a Subspace). Show that Z3, the set of 3-entry vectors with integer entries, is not a subspace of R3.

Solution:
This set is a subset of R3 and is closed under addition, but it is not a subspace because it is not closed under scalar multiplication. Because we are asking whether or not it is a subspace of R3, which uses R as its ground field, we must use the same scalars here as well. However, if v ∈ Z3 and c ∈ R then cv may not be in Z3. For example, if

v = (1, 2, 3) ∈ Z3 then (1/2)v = (1/2, 1, 3/2) ∉ Z3.

It follows that property (b) of Theorem 1.1.2 does not hold, so this set is not a subspace of R3.
Example 1.1.10 (The Set of Eventually-Zero Sequences is a Subspace). Show that the set c00 ⊂ F^N of sequences with only finitely many non-zero entries is a subspace of the sequence space F^N.

Margin note: In the notation c00, the "c" refers to the sequences converging and the "00" refers to how they converge to 0 and eventually equal 0.

Solution:
Once again, we have to check properties (a) and (b) of Theorem 1.1.2. That is, if v, w ∈ c00 have only finitely many non-zero entries and c ∈ F is a scalar, then we have to show that (a) v + w and (b) cv each have finitely many non-zero entries as well.

These properties are both straightforward to show—if v has m non-zero entries and w has n non-zero entries then v + w has at most m + n non-zero entries and cv has either m non-zero entries (if c ≠ 0) or 0 non-zero entries (if c = 0).
Definition 1.1.3 (Linear Combinations). Suppose V is a vector space over a field F. A linear combination of the vectors v1, v2, . . . , vk ∈ V is any vector of the form

c1 v1 + c2 v2 + · · · + ck vk,

where c1, c2, . . . , ck ∈ F.

Example 1.1.11. Determine whether or not x2 − 3x − 4 is a linear combination of the polynomials x2 − x + 2 and 2x2 − 3x + 1 in P 2.

Solution:
Write f(x) = x2 − 3x − 4, g(x) = x2 − x + 2, and h(x) = 2x2 − 3x + 1. We want to determine whether or not there exist scalars c1, c2 ∈ R such that f = c1 g + c2 h, i.e., such that
x2 − 3x − 4 = c1 (x2 − x + 2) + c2 (2x2 − 3x + 1)
= (c1 + 2c2 )x2 + (−c1 − 3c2 )x + (2c1 + c2 ).
We now use the fact that two polynomials are equal if and only if their
coefficients are equal: we set the coefficients of x2 on both sides of the
equation equal to each other, the coefficients of x equal to each other, and
the constant terms equal to each other. This gives us the linear system
1 = c1 + 2c2
−3 = −c1 − 3c2
−4 = 2c1 + c2 .
This linear system can be solved using standard techniques (e.g., Gaussian
elimination) to find the unique solution c1 = −3, c2 = 2. It follows that f
is a linear combination of g and h: f (x) = −3g(x) + 2h(x).
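The same coefficient-matching computation can be carried out numerically. The sketch below is an addition (it assumes NumPy): the columns of M hold the coefficients of g and h, the vector b holds those of f, and a least-squares solve of the overdetermined system recovers c1 and c2.

```python
# Solving the coefficient-matching system from the example above:
# f(x) = x^2 - 3x - 4 as c1*g(x) + c2*h(x), with g(x) = x^2 - x + 2
# and h(x) = 2x^2 - 3x + 1.
import numpy as np

# Rows correspond to the coefficients of x^2, x, and the constant term.
M = np.array([[ 1,  2],
              [-1, -3],
              [ 2,  1]], dtype=float)
b = np.array([1, -3, -4], dtype=float)

c, residual, rank, _ = np.linalg.lstsq(M, b, rcond=None)
print(c)                       # [-3.  2.]  ->  f = -3g + 2h
print(np.allclose(M @ c, b))   # True: the overdetermined system is consistent
```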
Example 1.1.12 (Linear Combinations of Matrices). Determine whether or not the identity matrix I ∈ M2(C) is a linear combination of the three matrices

X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, Y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}, and Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}.

Margin note: The four matrices I, X, Y, Z ∈ M2(C) are sometimes called the Pauli matrices.

Solution:
We want to know whether or not there exist c1, c2, c3 ∈ C such that I = c1 X + c2 Y + c3 Z. Writing this matrix equation out more explicitly gives

\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = c1 \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} + c2 \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix} + c3 \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} = \begin{bmatrix} c3 & c1 - ic2 \\ c1 + ic2 & -c3 \end{bmatrix}.
The (1, 1)-entry of the above matrix equation tells us that c3 = 1, but
the (2, 2)-entry tells us that c3 = −1, so this system of equations has no
solution. It follows that I is not a linear combination of X, Y , and Z.
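Numerically, the same conclusion can be reached by flattening each matrix into a vector and checking whether vec(I) lies in the span of vec(X), vec(Y), vec(Z). The sketch below is an addition (it assumes NumPy) and reports that no exact solution exists.

```python
# Checking that I is not a complex linear combination of the Pauli
# matrices X, Y, Z by flattening each matrix into a length-4 vector.
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I = np.eye(2, dtype=complex)

M = np.column_stack([X.ravel(), Y.ravel(), Z.ravel()])  # 4x3 coefficient matrix
c, *_ = np.linalg.lstsq(M, I.ravel(), rcond=None)       # best approximation
print(np.allclose(M @ c, I.ravel()))                    # False: no exact solution
```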
Linear combinations are useful for the way that they combine both of the
vector space operations (vector addition and scalar multiplication)—instead of
phrasing linear algebraic phenomena in terms of those two operations, we can
often phrase them more elegantly in terms of linear combinations.
For example, we saw in Theorem 1.1.2 that a subspace is a subset of a
vector space that is closed under vector addition and scalar multiplication.
Equivalently, we can combine those two operations and just say that a subspace
is a subset of a vector space that is closed under linear combinations (see
Exercise 1.1.11). On the other hand, a subset B of a vector space that is not
closed under linear combinations is necessarily not a subspace.
However, it is often useful to consider the smallest subspace containing B.
We call this smallest subspace the span of B, and to construct it we just take all
linear combinations of members of B:
Definition 1.1.4 (Span). Suppose V is a vector space and B ⊆ V is a set of vectors. The span of B, denoted by span(B), is the set of all (finite!) linear combinations of vectors from B:

span(B) = \left\{ \sum_{j=1}^{k} c_j v_j : k ∈ N, c_j ∈ F and v_j ∈ B for all 1 ≤ j ≤ k \right\}.
In Rn , for example, the span of a single vector is the line through the origin
in the direction of that vector, and the span of two non-parallel vectors is the
plane through the origin containing those vectors (see Figure 1.1).
Figure 1.1: The span of a set of vectors is the smallest subspace that contains
all of those vectors. In Rn , this smallest subspace is a line, plane, or hyperplane
containing all of the vectors in the set.
In most other vector spaces the span does not have quite such a clean geometric interpretation, but algebraically spans still work much like they do in Rn. For example, span(1, x, x2) = P 2 (the vector space of polynomials with degree at most 2) since every polynomial f ∈ P 2 can be written in the form f(x) = c1 + c2 x + c3 x2 for some c1, c2, c3 ∈ R. Indeed, this is exactly what it means for a polynomial to have degree at most 2. More generally, span(1, x, x2, . . . , x p) = P p.

Margin note: We use span(1, x, x2) as slight shorthand for span({1, x, x2})—we sometimes omit the curly set braces.
However, it is important to keep in mind that linear combinations are
always finite, even if B is not. To illustrate this point, consider the vector space
P = span(1, x, x2 , x3 , . . .), which is the set of all polynomials (of any degree).
If we recall from calculus that we can represent the function f (x) = ex in the
form
e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \cdots,

we might expect that ex ∈ P, since we have written ex as a sum of scalar multiples of 1, x, x2, and so on. However, ex ∉ P since ex can only be written as an infinite sum of polynomials, not a finite one (see Figure 1.2).

Margin note: This is called a Taylor series for ex (see Appendix A.2.2).
(roughly speaking,
they contain their
edges / limits /
F (all real-valued functions)
boundaries), this
example illustrates P (polynomials) cos(x) ex
the fact that some
vector spaces (like
P) are not. 2x85 + 3x7 − x2 + 6
P2 (quadratic functions)
x2 − x + 3
P 1 (linear functions)
To make this idea of
a function being on P 0 (constant functions)
the “boundary” of P
precise, we need a 2x − 1
way of measuring
f (x) = 3
the distance
between
functions—we
describe how to do
this in Sections 1.3.4
and 1.D.
Figure 1.2: The vector spaces P 0 ⊂ P 1 ⊂ P 2 ⊂ · · · ⊂ P p ⊂ · · · ⊂ P ⊂ F are subspaces
of each other in the manner indicated. The vector space P of all polynomials is
interesting for the fact that it does not contain its boundary: functions like ex and
cos(x) can be approximated by polynomials and are thus on the boundary of P,
but are not polynomials themselves (i.e., they cannot be written as a finite linear
combination of 1, x, x2 , . . .).
As we suggested earlier, our primary reason for being interested in the span
of a set of vectors is that it is always a subspace. We now state and prove this
fact rigorously.
Theorem 1.1.3 (Spans are Subspaces). Let V be a vector space and let B ⊆ V. Then span(B) is a subspace of V.

Proof. We need to check that the two closure properties described by Theorem 1.1.2 are satisfied. We thus suppose that v, w ∈ span(B) and b ∈ F. Then (by the definition of span(B)) there exist scalars c1, c2, . . . , ck, d1, d2, . . . , dℓ ∈ F and vectors v1, . . . , vk, w1, . . . , wℓ ∈ B such that v = c1 v1 + · · · + ck vk and w = d1 w1 + · · · + dℓ wℓ. It follows that

v + w = c1 v1 + · · · + ck vk + d1 w1 + · · · + dℓ wℓ  and  bv = (bc1)v1 + · · · + (bck)vk

are also (finite) linear combinations of vectors from B, so v + w ∈ span(B) and bv ∈ span(B), as required.
Definition 1.1.5 (Linear Dependence and Independence). Suppose V is a vector space over a field F and B ⊆ V. We say that B is linearly dependent if there exist scalars c1, c2, . . ., ck ∈ F, at least one of which is not zero, and vectors v1, v2, . . ., vk ∈ B such that

c1 v1 + c2 v2 + · · · + ck vk = 0.

If B is not linearly dependent, then it is called linearly independent. Equivalently, B is linearly independent if the only way to write

c1 v1 + c2 v2 + · · · + ck vk = 0

with v1, v2, . . ., vk ∈ B is to choose c1 = c2 = · · · = ck = 0.

Figure 1.3: A set of vectors is linearly independent if and only if each vector contributes a new direction or dimension.
Example 1.1.14 (Linear Independence of Matrices). Determine whether or not the following set of matrices is linearly independent in M2(R):

B = \left\{ \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}, \begin{bmatrix} -1 & 2 \\ 2 & -1 \end{bmatrix} \right\}.

Solution:
Since this set is finite, we want to check whether the equation

c1 \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + c2 \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} + c3 \begin{bmatrix} -1 & 2 \\ 2 & -1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}

has a non-zero solution. Comparing entries on the left- and right-hand sides gives the linear system

c1 + c2 − c3 = 0
c1 − c2 + 2c3 = 0
c1 − c2 + 2c3 = 0
c1 + c2 − c3 = 0.

Margin note: In general, to find an explicit linear combination that demonstrates linear dependence, we can choose the free variable(s) to be any non-zero value(s) that we like.

Solving this linear system via our usual methods reveals that c3 is a free variable (so there are infinitely many solutions) and c1 = −(1/2)c3, c2 = (3/2)c3. It follows that B is linearly dependent, and in particular, choosing c3 = 2 gives c1 = −1 and c2 = 3, so

− \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + 3 \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} + 2 \begin{bmatrix} -1 & 2 \\ 2 & -1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.
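The free-variable description of the solutions can also be found symbolically. The following added sketch (it assumes SymPy, which the book does not use) flattens the three matrices into columns and computes the null space of the resulting 4 × 3 matrix, recovering the same dependence relation.

```python
# Finding the dependence relation of Example 1.1.14 via a null space
# computation: stack the (flattened) matrices as columns of M.
import sympy as sp

A1 = sp.Matrix([[1, 1], [1, 1]])
A2 = sp.Matrix([[1, -1], [-1, 1]])
A3 = sp.Matrix([[-1, 2], [2, -1]])

M = sp.Matrix.hstack(A1.reshape(4, 1), A2.reshape(4, 1), A3.reshape(4, 1))
print(M.nullspace())   # [Matrix([[-1/2], [3/2], [1]])]; scaling by 2 gives (-1, 3, 2)
```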
Example 1.1.16. To see that the set B = {1, x, x2, x3, . . .} is linearly independent, we want to determine whether the equation

c0 + c1 x + c2 x2 + · · · + c p x p = 0    (1.1.1)

has a unique solution (linear independence) or infinitely many solutions (linear dependence). By plugging x = 0 into that equation, we see that c0 = 0. Taking the derivative of both sides of Equation (1.1.1) then reveals that

c1 + 2c2 x + 3c3 x2 + · · · + pc p x p−1 = 0,

and plugging x = 0 into this equation gives c1 = 0. By repeating this procedure (i.e., taking the derivative and then plugging in x = 0) we similarly see that c2 = c3 = · · · = c p = 0, so B is linearly independent.

Margin note: All we are doing here is showing that if f ∈ P p satisfies f(x) = 0 for all x then all of its coefficients equal 0. This likely seems "obvious", but it is still good to pin down why it is true.
Example 1.1.17 (Linear Independence of a Set of Functions). Is the set B = {cos(x), sin(x), sin2(x)} linearly independent in F?

Solution:
Since this set is finite, we want to determine whether or not there exist scalars c1, c2, c3 ∈ R (not all equal to 0) such that

c1 cos(x) + c2 sin(x) + c3 sin2(x) = 0 for all x ∈ R.

Plugging in x = 0 tells us that c1 = 0. Then plugging in x = π/2 tells us that c2 + c3 = 0, and plugging in x = 3π/2 tells us that −c2 + c3 = 0. Solving this system of equations involving c2 and c3 reveals that c2 = c3 = 0 as well, so B is a linearly independent set.

Margin note: We could also plug in other values of x to get other equations involving c1, c2, c3.
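The "plug in sample points" strategy used here translates directly into a rank computation. The sketch below is an addition (it assumes NumPy): it evaluates the three functions at the same x values used above and checks that the resulting 3 × 3 matrix has full rank, so only the trivial solution exists.

```python
# Numerical version of the argument in Example 1.1.17: evaluate
# c1*cos(x) + c2*sin(x) + c3*sin(x)^2 = 0 at a few sample points.
import numpy as np

xs = np.array([0.0, np.pi / 2, 3 * np.pi / 2])
M = np.column_stack([np.cos(xs), np.sin(xs), np.sin(xs) ** 2])
print(np.linalg.matrix_rank(M))   # 3, so c1 = c2 = c3 = 0 is the only solution
```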
Example 1.1.18 (Linear Dependence of a Set of Functions). Is the set B = {sin2(x), cos2(x), cos(2x)} linearly independent in F?

Solution:
Since this set is finite, we want to determine whether or not there exist scalars c1, c2, c3 ∈ R (not all equal to 0) such that

c1 sin2(x) + c2 cos2(x) + c3 cos(2x) = 0 for all x ∈ R.

On the surface, this set looks linearly independent, and we could try proving it by plugging in specific x values to get some equations involving c1, c2, c3 (just like in Example 1.1.17). However, it turns out that this won't work—the resulting system of linear equations will not have a unique solution, regardless of the x values we choose. To see why this is the case, recall the trigonometric identity

cos(2x) = cos2(x) − sin2(x).

In particular, this tells us that if c1 = 1, c2 = −1 and c3 = 1, then

sin2(x) − cos2(x) + cos(2x) = 0,

so B is linearly dependent.

Margin note: This identity follows from the angle-sum identity cos(θ + φ) = cos(θ) cos(φ) − sin(θ) sin(φ). In particular, plug in θ = x and φ = x.
1.1.3 Bases
In the final two examples of the previous subsection (i.e., Examples 1.1.17 and 1.1.18, which involve functions in F), we had to fiddle around and "guess" an approach that would work to show linear (in)dependence. In Example 1.1.17, we stumbled upon a proof that the set is linearly independent by luckily picking a bunch of x values that gave us a linear system with a unique solution, and in Example 1.1.18 we were only able to prove linear dependence because we conveniently already knew a trigonometric identity that related the given functions to each other.

Margin note: Another method of proving linear independence in F is explored in Exercise 1.1.21.
The reason that determining linear (in)dependence in F is so much more
difficult than in P p or Mm,n is that we do not have a nice basis for F that we can
work with, whereas we do for the other vector spaces that we have introduced
so far (Rn , polynomials, and matrices). In fact, we have been working with
those nice bases already without even realizing it, and without even knowing
what a basis is in vector spaces other than Rn .
For example, the set {1, x, x2, . . . , x p} is a basis (called the standard basis) of P p, since every f ∈ P p can be written in exactly one way in the form

f(x) = c0 + c1 x + c2 x2 + · · · + c p x p.

Similarly, the set {E1,1, E1,2, . . . , Em,n} is a basis (called the standard basis) of Mm,n, where Ei,j denotes the matrix with a 1 in its (i, j)-entry and zeros elsewhere. This fact is hopefully believable enough, but it is proved explicitly in Exercise 1.1.13.
Example 1.1.20 (Strange Basis of Polynomials). Is the set of polynomials B = {1, x, 2x2 − 1, 4x3 − 3x} a basis of P 3?

Solution:
We start by checking whether or not span(B) = P 3 . That is, we deter-
mine whether or not an arbitrary polynomial a0 + a1 x + a2 x2 + a3 x3 ∈ P 3
can be written as a linear combination of the members of B:
c0 + c1 x + c2 (2x2 − 1) + c3 (4x3 − 3x) = a0 + a1 x + a2 x2 + a3 x3 .
Example 1.1.21 (Bases of Even and Odd Polynomials). Let P E and P O be the sets of even and odd polynomials, respectively:

P E = { f ∈ P : f(−x) = f(x) }  and  P O = { f ∈ P : f(−x) = −f(x) }.

Show that P E is a subspace of P and find a basis of it.

Solution:
To see that P E is a subspace of P, we check the two closure properties of Theorem 1.1.2:

a) If f, g ∈ P E then (f + g)(−x) = f(−x) + g(−x) = f(x) + g(x) = (f + g)(x), so f + g ∈ P E too.
b) If f ∈ P E and c ∈ R then (cf)(−x) = cf(−x) = cf(x) = (cf)(x), so cf ∈ P E too.

To find a basis of P E, we first notice that {1, x2, x4, . . .} ⊂ P E. This set is linearly independent since it is a subset of the linearly independent set {1, x, x2, x3, . . .} from Example 1.1.16. To see that it spans P E, we notice that if

f(x) = a0 + a1 x + a2 x2 + a3 x3 + · · · ∈ P E

then f(x) + f(−x) = 2f(x), so

f(x) = (f(x) + f(−x))/2 = a0 + a2 x2 + a4 x4 + · · ·,

which is a (finite) linear combination of 1, x2, x4, . . .. It follows that {1, x2, x4, . . .} is a basis of P E.
Just as was the case in Rn , the reason why we require a basis to span a vec-
tor space V is so that we can write every vector v ∈ V as a linear combination
of those basis vectors, and the reason why we require a basis to be linearly
independent is so that those linear combinations are unique. The following the-
orem pins this observation down, and it roughly says that linear independence
is the property needed to remove redundancies in linear combinations.
Theorem 1.1.4 (Uniqueness of Linear Combinations). Suppose B is a basis of a vector space V. For every v ∈ V, there is exactly one way to write v as a linear combination of the vectors in B.
Exercises

∗(c) {sin2(x), cos2(x), 1} ⊂ F
(d) {sin(x), cos(x), 1} ⊂ F
∗(e) {sin(x + 1), sin(x), cos(x)} ⊂ F
(f) {e^x, e^{−x}} ⊂ F
∗∗(g) {e^x, xe^x, x^2 e^x} ⊂ F

1.1.6 Consider the subset S = {(a + bi, a − bi) : a, b ∈ R} of C2. Explain why S is a vector space over the field R, but not over C.

∗∗1.1.7 Show that the sequence space F^N is a vector space.

∗∗1.1.8 Show that the set M^S_n of n × n symmetric matrices (i.e., matrices A satisfying A^T = A) is a subspace of Mn.

1.1.9 Let P_n denote the set of n-variable polynomials, P_n^p denote the set of n-variable polynomials of degree at most p, and HP_n^p denote the set of n-variable homogeneous polynomials (i.e., polynomials in which every term has the exact same degree) of degree p, together with the 0 function. For example, x^3 y + xy^2 ∈ P_2^4 and x^4 y^2 z + xy^3 z^3 ∈ HP_3^7.
(a) Show that P_n is a vector space.
(b) Show that P_n^p is a subspace of P_n.
(c) Show that HP_n^p is a subspace of P_n^p.
[Side note: We explore HP_n^p extensively in Section 3.B.]

∗∗1.1.10 Let C be the set of continuous real-valued functions, and let D be the set of differentiable real-valued functions.
(a) Briefly explain why C is a subspace of F.
(b) Briefly explain why D is a subspace of F.

∗1.1.17 Let S1 and S2 be subspaces of a vector space V.
(a) Show that S1 ∩ S2 is also a subspace of V.
(b) Provide an example to show that S1 ∪ S2 might not be a subspace of V.

1.1.18 Let B and C be subsets of a vector space V.
(a) Show that span(B ∩ C) ⊆ span(B) ∩ span(C).
(b) Provide an example for which span(B ∩ C) = span(B) ∩ span(C) and another example for which span(B ∩ C) ⊊ span(B) ∩ span(C).

∗∗1.1.19 Let S1 and S2 be subspaces of a vector space V. The sum of S1 and S2 is defined by S1 + S2 = {v + w : v ∈ S1, w ∈ S2}.
(a) If V = R3, S1 is the x-axis, and S2 is the y-axis, what is S1 + S2?
(b) If V = M2, M^S_2 is the subspace of symmetric matrices, and M^{sS}_2 is the subspace of skew-symmetric matrices (i.e., the matrices B satisfying B^T = −B), what is M^S_2 + M^{sS}_2?
(c) Show that S1 + S2 is always a subspace of V.

∗∗1.1.22 Let a1, a2, . . . , an be n distinct real numbers (i.e., ai ≠ aj whenever i ≠ j). Show that the set of functions {e^{a1 x}, e^{a2 x}, . . . , e^{an x}} is linearly independent. [Hint: Construct the Wronskian of this set of functions (see Exercise 1.1.21) and notice that it is the determinant of a Vandermonde matrix.]
1.2 Coordinates and Linear Transformations

Figure 1.4: When we write a vector as a linear combination of basis vectors, the coefficients of that linear combination tell us how far it points in the direction of those basis vectors.
It is often easier to just keep track of and work with these coefficients c1, c2, . . ., cn in a linear combination, rather than the original vector v itself. However, there is one technicality that we have to deal with before we can do this: the fact that bases are sets of vectors, and sets do not care about order. For example, {e1, e2} and {e2, e1} are both the same standard basis of R2. However, we want to be able to talk about things like the "first" vector in a basis and the "third" coefficient in a linear combination. For this reason, we typically consider bases to be ordered—even though they are written using the same notation as sets, their order really is meant as written.

Margin note: We abuse notation a bit when using bases. Order does not matter in sets like {e1, e2}, but it matters if that set is a basis.
Definition 1.2.1 (Coordinate Vectors). Suppose V is a vector space over a field F with a finite (ordered) basis B = {v1, v2, . . . , vn}, and v ∈ V. Then the unique scalars c1, c2, . . ., cn ∈ F for which

v = c1 v1 + c2 v2 + · · · + cn vn

are called the coordinates of v with respect to B, and the vector

[v]B = (c1, c2, . . . , cn)

is called the coordinate vector of v with respect to B.
Example 1.2.1 (Checking Linear (In)Dependence via Coordinate Vectors). Show that the following sets of vectors C are linearly independent in the indicated vector space V:
a) V = R4, C = {(0, 2, 0, 1), (0, −1, 1, 2), (3, −2, 1, 0)},
b) V = P 3, C = {2x + x3, −x + x2 + 2x3, 3 − 2x + x2},
c) V = M2, C = \left\{ \begin{bmatrix} 0 & 2 \\ 0 & 1 \end{bmatrix}, \begin{bmatrix} 0 & -1 \\ 1 & 2 \end{bmatrix}, \begin{bmatrix} 3 & -2 \\ 1 & 0 \end{bmatrix} \right\}.

Solution:
a) We want to check whether or not the equation

c1 (0, 2, 0, 1) + c2 (0, −1, 1, 2) + c3 (3, −2, 1, 0) = (0, 0, 0, 0)

has a unique solution. Explicitly, this linear system has the form
3c3 = 0
2c1 − c2 − 2c3 = 0
c2 + c3 = 0
c1 + 2c2 =0
which can be solved via Gaussian elimination to see that the unique solution is c1 = c2 = c3 = 0, so C is indeed a linearly independent set.

Margin note: Need a refresher on solving linear systems? See Appendix A.1.1.
b) We could check linear independence directly via the methods of
Section 1.1.2, but we instead compute the coordinate vectors of the
members of C with respect to the standard basis B = {1, x, x2 , x3 }
of P 3 :
[2x + x3]B = (0, 2, 0, 1),
[−x + x2 + 2x3]B = (0, −1, 1, 2), and
[3 − 2x + x2]B = (3, −2, 1, 0).
These coordinate vectors are exactly the vectors from part (a), which
we already showed are linearly independent in R4 , so C is linearly
independent in P 3 as well.
c) Again, we start by computing the coordinate vectors of the members
of C with respect to the standard basis B = {E1,1 , E1,2 , E2,1 , E2,2 } of
M2 :
Margin note: The notation here is quite unfortunate. The inner square brackets indicate that these are matrices, while the outer square brackets denote the coordinate vectors of these matrices.

\left[ \begin{bmatrix} 0 & 2 \\ 0 & 1 \end{bmatrix} \right]_B = (0, 2, 0, 1),
\left[ \begin{bmatrix} 0 & -1 \\ 1 & 2 \end{bmatrix} \right]_B = (0, −1, 1, 2), and
\left[ \begin{bmatrix} 3 & -2 \\ 1 & 0 \end{bmatrix} \right]_B = (3, −2, 1, 0).
Again, these coordinate vectors are exactly the vectors from part (a),
which we already showed are linearly independent in R4 , so C is
linearly independent in M2 .
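Since all three parts reduce to the same coordinate vectors, one computation settles them all. The sketch below is an addition (it assumes NumPy): it stacks the three coordinate vectors as rows and checks that the matrix has rank 3.

```python
# The coordinate vectors from Example 1.2.1 (with respect to the standard
# bases of R^4, P^3, and M_2 respectively) are the same vectors in R^4.
import numpy as np

coords = np.array([[0,  2, 0, 1],    # [2x + x^3]_B
                   [0, -1, 1, 2],    # [-x + x^2 + 2x^3]_B
                   [3, -2, 1, 0]])   # [3 - 2x + x^2]_B
print(np.linalg.matrix_rank(coords))  # 3 -> the set is linearly independent
```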
The above example highlights the fact that we can really think of R4 , P 3 ,
and M2 as “essentially the same” vector spaces, just dressed up and displayed
differently: we can work with the polynomial a0 + a1 x + a2 x2 + a3 x3 in the same way that we work with the vector (a0, a1, a2, a3) ∈ R4, which we can work with in the same way as the matrix

\begin{bmatrix} a_0 & a_1 \\ a_2 & a_3 \end{bmatrix} ∈ M2.
However, it is also often useful to represent vectors in bases other than the
standard basis, so we briefly present an example that illustrates how to do this.
Example 1.2.2 (Coordinate Vectors in Weird Bases). Find the coordinate vector of 2 + 7x + x2 ∈ P 2 with respect to the basis B = {x + x2, 1 + x2, 1 + x}.

Margin note: We did not explicitly prove that B is a basis of P 2 here—try to convince yourself that it is.

Solution:
We want to find scalars c1, c2, c3 ∈ R such that

2 + 7x + x2 = c1 (x + x2) + c2 (1 + x2) + c3 (1 + x).

By matching up coefficients of powers of x on the left- and right-hand sides above, we arrive at the following system of linear equations:

c2 + c3 = 2
c1 + c3 = 7
c1 + c2 = 1

Margin note: To find c1, c2, c3, just apply Gaussian elimination like usual.

This linear system has c1 = 3, c2 = −2, c3 = 4 as its unique solution, so our desired coordinate vector is

[2 + 7x + x2]B = (c1, c2, c3) = (3, −2, 4).
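The same small linear system can of course be solved by a computer. The following added sketch (it assumes NumPy) sets up the coefficient matrix row by row, exactly as in the system above, and solves it.

```python
# Solving the linear system of Example 1.2.2 for the coordinates of
# 2 + 7x + x^2 with respect to B = {x + x^2, 1 + x^2, 1 + x}.
import numpy as np

M = np.array([[0, 1, 1],    # constant terms
              [1, 0, 1],    # coefficients of x
              [1, 1, 0]],   # coefficients of x^2
             dtype=float)
b = np.array([2, 7, 1], dtype=float)
print(np.linalg.solve(M, b))   # [ 3. -2.  4.]
```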
Theorem 1.2.1 (Linearly Independent Sets Versus Spanning Sets). Suppose n ≥ 1 is an integer and V is a vector space with a basis B consisting of n vectors.
a) Any set of more than n vectors in V must be linearly dependent, and
b) any set of fewer than n vectors cannot span V.

Proof. For property (a), suppose that a set C has m > n vectors, which we call v1, v2, . . . , vm. To see that C is necessarily linearly dependent, we must show that there exist scalars c1, c2, . . ., cm, not all equal to zero, such that

c1 v1 + c2 v2 + · · · + cm vm = 0.

Rewriting each vj in terms of its coordinate vector [vj]B ∈ Fn turns this equation into a homogeneous linear system of n equations in the m unknowns c1, c2, . . ., cm. This linear system has more unknowns than equations (i.e., its coefficient matrix has more columns than rows) and thus must have infinitely many solutions. In particular, it has at least one non-zero solution, from which it follows that C is linearly dependent.

Part (b) is proved in a similar way and thus left as Exercise 1.2.36.
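The counting argument behind part (a) is easy to see numerically. The following added sketch (it assumes NumPy) puts 5 random vectors of R4 into the columns of a 4 × 5 matrix: its rank is at most 4, and a vector in its null space gives an explicit dependence relation.

```python
# Illustrating Theorem 1.2.1(a): any 5 vectors in R^4 are linearly dependent.
import numpy as np

rng = np.random.default_rng(3)
V = rng.standard_normal((4, 5))      # columns are 5 random vectors in R^4
print(np.linalg.matrix_rank(V))      # at most 4 < 5, so the columns are dependent

c = np.linalg.svd(V)[2][-1]          # a vector in the null space of V
print(np.linalg.norm(V @ c))         # ~0: a non-trivial dependence relation
```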
By recalling that a basis of a vector space V is a set that both spans V and
is linearly independent, we immediately get the following corollary:
Corollary 1.2.2 (Uniqueness of Size of Bases). Suppose n ≥ 1 is an integer and V is a vector space with a basis consisting of n vectors. Then every basis of V has exactly n vectors.

For example, we saw in Example 1.1.19 that the set

\left\{ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \right\}
is a basis of M2 (C), which is consistent with Corollary 1.2.2 since the standard
basis {E1,1 , E1,2 , E2,1 , E2,2 } also contains exactly 4 matrices. In a sense, this
tells us that M2 (C) contains 4 “degrees of freedom” or requires 4 (complex)
numbers to describe each matrix that it contains. This quantity (the number of
vectors in any basis) gives a useful description of the “size” of the vector space
that we are working with, so we give it a name:

Definition 1.2.2 (Dimension). Suppose V is a vector space. a) If V has a basis consisting of finitely many vectors, then V is called finite-dimensional and the dimension of V, denoted by dim(V), is the number of vectors in any of its bases. b) Otherwise, V is called infinite-dimensional.

Table 1.1: The standard basis and dimension of some common vector spaces.
There is one special case that is worth special attention, and that is the vector space V = {0}. The only basis of this vector space is the empty set {}, not {0} as we might first guess, since any set containing 0 is necessarily linearly dependent. Since the empty set contains no vectors, we conclude that dim({0}) = 0.

Margin note: We refer to {0} as the zero vector space.
If we already know the dimension of the vector space that we are working
with, it becomes much simpler to determine whether or not a given set is a
basis of it. In particular, if a set contains the right number of vectors (i.e., its
size coincides with the dimension of the vector space) then we can show that it
is a basis just by showing that it is linearly independent or spanning—the other
property comes for free (see Exercise 1.2.27).
Remark 1.2.1 (Existence of Bases?). Notice that we have not proved a theorem saying that all vector spaces have bases (whereas this fact is true of subspaces of Rn and is typically
proved in this case in introductory linear algebra textbooks). The reason
for this omission is that constructing bases is much more complicated in
arbitrary (potentially infinite-dimensional) vector spaces.
While it is true that all finite-dimensional vector spaces have bases (in
fact, this is baked right into Definition 1.2.2(a)), it is much less clear what
a basis of (for example) the vector space F of real-valued functions would
look like. We could try sticking all of the “standard” functions that we
know of into a set like
{1, x, x2 , . . . , |x|, ln(x2 + 1), cos(x), sin(x), 1/(3 + 2x2 ), e6x , . . .},
but there will always be lots of functions left over that are not linear
combinations of these familiar functions.
It turns out that the existence of bases depends on something called the "axiom of choice", which is a mathematical axiom that is independent of the other set-theoretic underpinnings of modern mathematics. In other words, we can neither prove that every vector space has a basis, nor can we construct a vector space that does not have one.

Margin note: Most (but not all!) mathematicians accept the axiom of choice, and thus would say that every vector space has a basis. However, the axiom of choice is non-constructive, so we still cannot actually write down explicit examples of bases in many infinite-dimensional vector spaces.

From a practical point of view, this means that it is simply not possible to write down a basis of many vector spaces like F, even if they exist. Even in the space C of continuous real-valued functions, any basis necessarily contains some extremely hideous and pathological functions. Roughly speaking, the reason for this is that any finite linear combination of "nice" functions will still be "nice", but there are many "not nice" continuous functions out there.
For example, there are continuous functions that are nowhere differ-
entiable (i.e., they do not have a derivative anywhere). This means that
no matter how much we zoom in on the graph of the function, it never
starts looking like a straight line, but rather looks jagged at all zoom levels.
Every basis of C must contain strange functions like these, and an explicit
example is the Weierstrass function W defined by

W(x) = \sum_{n=1}^{\infty} \frac{\cos(2^n x)}{2^n},

whose graph is displayed below.

Margin note: Again, keep in mind that this does not mean that W is a linear combination of cos(2x), cos(4x), cos(8x), . . ., since this sum has infinitely many terms in it.
[Graph of the Weierstrass function W(x) on the interval −2 ≤ x ≤ 2.]
Definition 1.2.3 (Change-of-Basis Matrix). Suppose V is a vector space with bases B = {v1, v2, . . . , vn} and C. The change-of-basis matrix from B to C, denoted by PC←B, is the n × n matrix whose columns are the coordinate vectors [v1]C, [v2]C, . . . , [vn]C:

PC←B = [ [v1]C | [v2]C | · · · | [vn]C ].
It is worth emphasizing the fact that in the above definition, we place the
coordinate vectors [v1 ]C , [v2 ]C , . . ., [vn ]C into the matrix PC←B as its columns,
not its rows. Nonetheless, when considering those vectors in isolation, we still
write them using round parentheses and in a single row, like [v2 ]C = (1, 4, −3),
just as we have been doing up until this point. The reason for this is that vectors
(i.e., members of Fn ) do not have a shape—they are just lists of numbers, and
we can arrange those lists however we like.
The change-of-basis matrix from B to C, as its name suggests, converts
coordinate vectors with respect to B into coordinate vectors with respect to C.
That is, we have the following theorem:
Theorem 1.2.3 (Change-of-Basis Matrices). Suppose B and C are bases of a finite-dimensional vector space V, and let PC←B be the change-of-basis matrix from B to C. Then
a) PC←B [v]B = [v]C for all v ∈ V, and
b) PC←B is invertible and PC←B^{-1} = PB←C.
Furthermore, PC←B is the unique matrix with property (a).
Proof. To see that property (a) holds, write [v]B = (c1, c2, . . . , cn), so that v = c1v1 + c2v2 + · · · + cnvn. Then PC←B [v]B = c1[v1]C + c2[v2]C + · · · + cn[vn]C = [c1v1 + c2v2 + · · · + cnvn]C = [v]C, where the middle equality follows from linearity of coordinate vectors.

To see that property (b) holds, we just note that using property (a) twice tells us that PB←C PC←B [v]B = [v]B for all v ∈ V, so PB←C PC←B = I, which implies PC←B^{-1} = PB←C.
Finally, to see that PC←B is the unique matrix satisfying property (a), suppose
P ∈ Mn is any matrix for which P[v]B = [v]C for all v ∈ V. For every 1 ≤ j ≤ n,
if v = v j then we see that [v]B = [v j ]B = e j (the j-th standard basis vector in
Fn ), so P[v]B = Pe j is the j-th column of P. On the other hand, it is also the
case that P[v]B = [v]C = [v j ]C . The j-th column of P thus equals [v j ]C for each
1 ≤ j ≤ n, so P = PC←B .
The two properties described by the above theorem are illustrated in Fig-
ure 1.5: the bases B and C provide two different ways of making the vector
space V look like Fn , and the change-of-basis matrices PC←B and PB←C convert
these two different representations into each other.
(One of the advantages of using change-of-basis matrices, rather than computing each coordinate vector directly, is that we can re-use change-of-basis matrices to change multiple vectors between bases.)

[Figure 1.5: The coordinate maps [·]B and [·]C each represent V as Fn, and the change-of-basis matrices PC←B and PB←C = PC←B^{-1} convert the coordinate vector [v]B into [v]C and vice versa.]
Example 1.2.3 (Computing a Change-of-Basis Matrix). Find the change-of-basis matrices PB←C and PC←B for the bases B = {x + x^2, 1 + x^2, 1 + x} and C = {1, x, x^2} of P^2.
Solution:
To start, we find the coordinate vectors of the members of B with
respect to the basis C. Since C is the standard basis of P 2 , we can eyeball
these coordinate vectors:
    [x + x^2]C = (0, 1, 1),   [1 + x^2]C = (1, 0, 1),   and   [1 + x]C = (1, 1, 0).

Placing these coordinate vectors as columns gives

    PC←B = [ 0 1 1 ]
           [ 1 0 1 ]
           [ 1 1 0 ].
The simplest method for finding PB←C from here is to compute the
inverse of PC←B :
    PB←C = PC←B^{-1} = (1/2) [ -1  1  1 ]
                             [  1 -1  1 ]
                             [  1  1 -1 ].

(This inverse can be found by row-reducing [ A | I ] to [ I | A^{-1} ]; see Appendix A.1.3.)
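For readers who like to check such computations numerically, here is a minimal Python sketch (the variable names and the test polynomial are ours, not the book's): it builds PC←B from the coordinate vectors above, inverts it to get PB←C, and spot-checks Theorem 1.2.3(a).

```python
import numpy as np

# Columns of P_CB are the coordinate vectors [x + x^2]_C, [1 + x^2]_C, [1 + x]_C.
P_CB = np.array([[0., 1., 1.],
                 [1., 0., 1.],
                 [1., 1., 0.]])
P_BC = np.linalg.inv(P_CB)   # change-of-basis matrix from C back to B
print(P_BC)                  # 0.5 * [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]

# Test polynomial: p(x) = 2(x + x^2) + 3(1 + x^2) - (1 + x), so [p]_B = (2, 3, -1).
p_B = np.array([2., 3., -1.])
print(P_CB @ p_B)            # (2, 1, 5), i.e. p(x) = 2 + x + 5x^2 in the basis C
```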
The previous example was not too difficult since C happened to be the standard basis of P^2, so we could compute PC←B just by "eyeballing" the standard basis coefficients, and we could then compute PB←C just by taking the inverse of PC←B. However, if C weren't the standard basis, computing the columns of PC←B would have been much more difficult, since each column would require us to solve a linear system. (Generally, converting vectors into the standard basis is much easier than converting into other bases.)
A quicker and easier way to compute a change-of-basis matrix is to change
from the input basis B to the standard basis E (if one exists in the vector space
being considered), and then change from E to the output basis C, as described
by the following theorem:
Theorem 1.2.4 (Computing Change-of-Basis Matrices, Method 1). Let V be a finite-dimensional vector space with bases B, C, and E. Then

    PC←B = PC←E PE←B.
Example 1.2.4 (Computing a Change-of-Basis Matrix Between Ugly Bases). Use Theorem 1.2.4 to find the change-of-basis matrix PC←B, where

    B = { [1 0; 0 1], [0 1; 1 0], [0 −i; i 0], [1 0; 0 −1] }   and
    C = { [1 0; 0 0], [1 1; 0 0], [1 1; 1 0], [1 1; 1 1] }

are bases of M2(C) (here [a b; c d] denotes the 2 × 2 matrix with rows (a, b) and (c, d)). Then compute [v]C if [v]B = (1, 2, 3, 4).

(We showed that the set B is indeed a basis of M2(C) in Example 1.1.19. You should try to convince yourself that C is also a basis.)

Solution:
To start, we compute the change-of-basis matrices from B and C into the standard basis
    E = { [1 0; 0 0], [0 1; 0 0], [0 0; 1 0], [0 0; 0 1] }.

The coordinate vectors of the members of B with respect to E are (1, 0, 0, 1), (0, 1, 1, 0), (0, −i, i, 0), and (1, 0, 0, −1), and we get PE←B by placing these vectors into a matrix as its columns:

    PE←B = [ 1 0  0  1 ]
           [ 0 1 −i  0 ]
           [ 0 1  i  0 ]
           [ 1 0  0 −1 ].

Similarly, the columns of PE←C are the coordinate vectors of the members of C, and Theorem 1.2.4 then tells us that PC←B = PC←E PE←B = PE←C^{-1} PE←B.
Finally, since we were asked for [v]C if [v]B = (1, 2, 3, 4), we compute

    [v]C = PC←B [v]B = [ 1 −1   i  1 ] [ 1 ]   [ 3 + 3i ]
                       [ 0  0 −2i  0 ] [ 2 ] = [  −6i   ]
                       [−1  1   i  1 ] [ 3 ]   [ 5 + 3i ]
                       [ 1  0   0 −1 ] [ 4 ]   [  −3    ].
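Since the arithmetic in this example is easy to mangle by hand, here is a hedged numerical sketch of Theorem 1.2.4 for these particular bases (the helper name vec and the other variable names are ours; vec simply stacks a 2 × 2 matrix into its coordinate vector with respect to E):

```python
import numpy as np

def vec(M):
    """Coordinate vector of a 2x2 matrix with respect to E = {E11, E12, E21, E22}."""
    return np.array([M[0, 0], M[0, 1], M[1, 0], M[1, 1]], dtype=complex)

B = [np.array([[1, 0], [0, 1]]), np.array([[0, 1], [1, 0]]),
     np.array([[0, -1j], [1j, 0]]), np.array([[1, 0], [0, -1]])]
C = [np.array([[1, 0], [0, 0]]), np.array([[1, 1], [0, 0]]),
     np.array([[1, 1], [1, 0]]), np.array([[1, 1], [1, 1]])]

P_EB = np.column_stack([vec(M) for M in B])   # change of basis from B into E
P_EC = np.column_stack([vec(M) for M in C])   # change of basis from C into E
P_CB = np.linalg.inv(P_EC) @ P_EB             # Theorem 1.2.4: P_CB = P_CE P_EB

v_B = np.array([1, 2, 3, 4], dtype=complex)
print(np.round(P_CB @ v_B, 10))               # [3+3j, -6j, 5+3j, -3]
```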
The previous example perhaps seemed a bit long and involved. Indeed, to
make use of Theorem 1.2.4, we have to invert PE←C and then multiply two
matrices together. The following theorem tells us that we can reduce the amount
of work required slightly by instead row reducing a certain cleverly-chosen
matrix based on PE←B and PE←C . This takes about the same amount of time
as inverting PE←C , and lets us avoid the matrix multiplication step that comes
afterward.
Corollary 1.2.5 (Computing Change-of-Basis Matrices, Method 2). Let V be a finite-dimensional vector space with bases B, C, and E. Then the reduced row echelon form of the augmented matrix [ PE←C | PE←B ] is [ I | PC←B ].
To help remember this corollary, notice that [PE←C | PE←B ] has C and B in
the same order as [I | PC←B ], and we start with [PE←C | PE←B ] because changing
into the standard basis E is easy.
Proof of Corollary 1.2.5. Recall that PE←C is invertible, so its reduced row
echelon form is I and thus the RREF of [ PE←C | PE←B ] has the form [ I | X ]
for some matrix X.
To see that X = PC←B (and thus complete the proof), recall that sequences of elementary row operations correspond to multiplication on the left by invertible matrices (see Appendix A.1.3), so there is an invertible matrix Q such that Q[ PE←C | PE←B ] = [ I | X ]. In particular, Q PE←C = I, so Q = PE←C^{-1} = PC←E, and therefore X = Q PE←B = PC←E PE←B = PC←B by Theorem 1.2.4.
Example 1.2.5 (Computing a Change-of-Basis Matrix via Row Operations). Use Corollary 1.2.5 to find the change-of-basis matrix PC←B, where

    B = { [1 0; 0 1], [0 1; 1 0], [0 −i; i 0], [1 0; 0 −1] }   and
    C = { [1 0; 0 0], [1 1; 0 0], [1 1; 1 0], [1 1; 1 1] }

are the same bases of M2(C) as in Example 1.2.4.
Solution:
We row reduce the augmented matrix [ PE←C | PE←B ] (with PE←C and PE←B as in Example 1.2.4) until the left block becomes the identity matrix I. It then follows from Corollary 1.2.5 that the right block is PC←B:

    PC←B = [ 1 −1   i  1 ]
           [ 0  0 −2i  0 ]
           [−1  1   i  1 ]
           [ 1  0   0 −1 ],

which is the same matrix that we found in Example 1.2.4.
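Row reducing [ PE←C | PE←B ] by hand is equivalent to solving the matrix equation PE←C X = PE←B, so a numerical analogue of Corollary 1.2.5 is a single call to a linear solver. A minimal sketch (the matrices P_EC and P_EB are the same ones as in the previous snippet; the names are ours):

```python
import numpy as np

P_EC = np.array([[1, 1, 1, 1],
                 [0, 1, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 0, 1]], dtype=complex)
P_EB = np.array([[1, 0, 0, 1],
                 [0, 1, -1j, 0],
                 [0, 1, 1j, 0],
                 [1, 0, 0, -1]], dtype=complex)

# Solving P_EC @ X = P_EB plays the role of row reducing [P_EC | P_EB] to [I | X].
P_CB = np.linalg.solve(P_EC, P_EB)
print(P_CB)   # [[1, -1, 1j, 1], [0, 0, -2j, 0], [-1, 1, 1j, 1], [1, 0, 0, -1]]
```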
Definition 1.2.4 (Linear Transformations). Let V and W be vector spaces over the same field F. A linear transformation is a function T : V → W that satisfies the following two properties:
a) T(v + w) = T(v) + T(w) for all v, w ∈ V, and
b) T(cv) = cT(v) for all v ∈ V and c ∈ F.
Example 1.2.6 (The Transpose is a Linear Transformation). Show that the matrix transpose is a linear transformation. That is, show that the function T : Mm,n → Mn,m defined by T(A) = A^T is a linear transformation.
Solution:
We need to show that the two properties of Definition 1.2.4 hold. That
is, we need to show that (A + B)^T = A^T + B^T and (cA)^T = cA^T for all
A, B ∈ Mm,n and c ∈ F. Both of these properties follow almost immediately
from the definition, so we do not dwell on them.
Example 1.2.7 (The Trace is a Linear Transformation). The trace is the function tr : Mn(F) → F that adds up the diagonal entries of a matrix:

    tr(A) = a_{1,1} + a_{2,2} + · · · + a_{n,n}   for all A ∈ Mn(F).
Example 1.2.8 (The Derivative is a Linear Transformation). Show that the derivative is a linear transformation. That is, show that the function D : D → F defined by D(f) = f′ is a linear transformation.

Solution:
We need to show that the two properties of Definition 1.2.4 hold. That is, we need to show that (f + g)′ = f′ + g′ and (cf)′ = cf′ for all f, g ∈ D and c ∈ R. Both of these properties are typically presented in introductory calculus courses, so we do not prove them here.

(Recall that D is the vector space of differentiable real-valued functions. Sorry for using two different "D"s in the same sentence to mean different things: curly D is a vector space, while block D is the derivative linear transformation.)

There are also a couple of particularly commonly-occurring linear transformations that it is useful to give names to: the zero transformation O : V → W is the one defined by O(v) = 0 for all v ∈ V, and the identity transformation I : V → V is the one defined by I(v) = v for all v ∈ V. If we wish to clarify which spaces the identity and zero transformations are acting on, we use subscripts as in I_V and O_{V,W}.
On the other hand, it is perhaps helpful to see an example of a linear
algebraic function that is not a linear transformation.
Example 1.2.9 Show that the determinant, det : Mn (F) → F, is not a linear transformation
The Determinant is Not when n ≥ 2.
a Linear Transformation
Solution:
To see that det is not a linear transformation, we need to show that at least one of the two defining properties from Definition 1.2.4 does not hold. Well, det(I) = 1, so det(I + I) = det(2I) = 2^n ≠ 2 = det(I) + det(I) whenever n ≥ 2, and thus property (a) of that definition fails (property (b) also fails for a similar reason).
We now do for linear transformations what we did for vectors in the previous
section: we give them coordinates so that we can explicitly write them down
using numbers from the ground field F. More specifically, just like every vector
in a finite-dimensional vector space can be associated with a vector in Fn , every
linear transformation between vector spaces can be associated with a matrix in
Mm,n (F):
Theorem 1.2.6 (Standard Matrix of a Linear Transformation). Let V and W be vector spaces with bases B and D, respectively, where B = {v1, v2, . . . , vn} and W is m-dimensional. A function T : V → W is a linear transformation if and only if there exists a matrix [T]D←B ∈ Mm,n for which

    [T(v)]D = [T]D←B [v]B   for all v ∈ V.

Furthermore, the unique matrix [T]D←B with this property is called the standard matrix of T with respect to the bases B and D, and it is

    [T]D←B = [ [T(v1)]D | [T(v2)]D | · · · | [T(vn)]D ].
In other words, this theorem tells us that instead of working with a vector
v ∈ V, applying the linear transformation T : V → W to it, and then converting
it into a coordinate vector with respect to the basis D, we can convert v to its
coordinate vector [v]B and then multiply by the matrix [T ]D←B (see Figure 1.6).
[Figure 1.6: Applying T to v and then computing [T(v)]D gives the same result as computing [v]B and then multiplying by [T]D←B.]
Proof. We show that the matrix [T]D←B displayed above satisfies [T]D←B [v]B = [T(v)]D, and that no other matrix has this property. To see
that [T ]D←B [v]B = [T (v)]D , suppose that [v]B = (c1 , c2 , . . . , cn ) (i.e., v = c1 v1 +
c2 v2 + · · · + cn vn ) and do block matrix multiplication:
    [T]D←B [v]B = [ [T(v1)]D | · · · | [T(vn)]D ] [v]B          (definition of [T]D←B)
                = c1[T(v1)]D + · · · + cn[T(vn)]D              (block matrix mult.)
                = [c1T(v1) + · · · + cnT(vn)]D                 (by Exercise 1.2.22)
                = [T(c1v1 + · · · + cnvn)]D                    (linearity of T)
                = [T(v)]D.                                     (v = c1v1 + · · · + cnvn)

(This proof is almost identical to that of Theorem 1.2.3. The reason for this is that change-of-basis matrices are exactly the standard matrices of the identity transformation.)
The proof of uniqueness of [T ]D←B is almost identical to the proof of
uniqueness of PC←B from Theorem 1.2.3, so we leave it to Exercise 1.2.35.
For simplicity of notation, in the special case when V = W and B = D
we denote the standard matrix [T ]B←B simply by [T ]B . If V furthermore has a
standard basis E then we sometimes denote [T ]E simply by [T ].
Example 1.2.10 (Standard Matrix of the Transposition Map). Find the standard matrix [T] of the transposition map T : M2 → M2 with respect to the standard basis E = {E1,1, E1,2, E2,1, E2,2}.

Solution:
We need to compute the coordinate vectors of E1,1^T = E1,1, E1,2^T = E2,1, E2,1^T = E1,2, and E2,2^T = E2,2, and place them (in that order) as columns into the matrix [T]. To double-check the result, write A = [a b; c d] and note that [A]_E = (a, b, c, d), so
    [T]_E [A]_E = [ 1 0 0 0 ] [ a ]   [ a ]
                  [ 0 0 1 0 ] [ b ] = [ c ] = [A^T]_E,
                  [ 0 1 0 0 ] [ c ]   [ b ]
                  [ 0 0 0 1 ] [ d ]   [ d ]

as desired. (We generalize this example to higher dimensions in Exercise 1.2.12.)
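The identity [T]_E [A]_E = [A^T]_E is easy to test numerically. A small sketch (the row-by-row flattening below matches the ordering of E; all names are ours):

```python
import numpy as np

# Standard matrix of the transposition map on M2 with respect to E.
T_E = np.array([[1, 0, 0, 0],
                [0, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 1]])

A = np.array([[1., 2.],
              [3., 4.]])
A_E = A.reshape(-1)        # [A]_E = (a, b, c, d), flattened row by row
print(T_E @ A_E)           # (1, 3, 2, 4), i.e. [A^T]_E
print(A.T.reshape(-1))     # the same vector, computed directly
```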
Example 1.2.11 (Standard Matrix of the Derivative). Find the standard matrix [D]C←B of the derivative map D : P^3 → P^2 with respect to the standard bases B = {1, x, x^2, x^3} ⊂ P^3 and C = {1, x, x^2} ⊂ P^2.

Solution:
We need to compute the coefficient vectors of D(1) = 0, D(x) = 1, D(x^2) = 2x, and D(x^3) = 3x^2, and place them (in that order) into the matrix [D]C←B:

    [0]C = (0, 0, 0),   [1]C = (1, 0, 0),   [2x]C = (0, 2, 0),   [3x^2]C = (0, 0, 3).

It follows that

    [D]C←B = [ [0]C | [1]C | [2x]C | [3x^2]C ] = [ 0 1 0 0 ]
                                                 [ 0 0 2 0 ]
                                                 [ 0 0 0 3 ].
For example, if f(x) = a0 + a1x + a2x^2 + a3x^3 then [f]B = (a0, a1, a2, a3), so

    [D]C←B [f]B = [ 0 1 0 0 ] [ a0 ]   [  a1 ]
                  [ 0 0 2 0 ] [ a1 ] = [ 2a2 ] = [a1 + 2a2x + 3a3x^2]C,
                  [ 0 0 0 3 ] [ a2 ]   [ 3a3 ]
                              [ a3 ]

which is indeed the coordinate vector of f′(x).
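A quick numerical sanity check of this standard matrix, applied to a test polynomial of our own choosing:

```python
import numpy as np

# Standard matrix of D : P^3 -> P^2 with respect to the monomial bases.
D_CB = np.array([[0., 1., 0., 0.],
                 [0., 0., 2., 0.],
                 [0., 0., 0., 3.]])

# f(x) = 4 - x + 5x^2 + 2x^3, so [f]_B = (4, -1, 5, 2).
f_B = np.array([4., -1., 5., 2.])
print(D_CB @ f_B)   # (-1, 10, 6), i.e. f'(x) = -1 + 10x + 6x^2
```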
Keep in mind that if we change the input and/or output vector spaces of the
derivative map D, then its standard matrix can look quite a bit different (after
all, even just changing the bases used on these vector spaces can change the
standard matrix). For example, if we considered D as a linear transformation
from P 3 to P 3 , instead of from P 3 to P 2 as in the previous example, then its
standard matrix would be 4 × 4 instead of 3 × 4 (it would have an extra zero
row at its bottom).
The next example illustrates this observation with the derivative map on a
slightly more exotic vector space.
Example 1.2.12 (Standard Matrix of the Derivative, Again). Let B = {e^x, xe^x, x^2 e^x} be a basis of the vector space V = span(B). Find the standard matrix [D]B of the derivative map D : V → V with respect to B.

Solution:
We need to compute the coefficient vectors of D(e^x) = e^x, D(xe^x) = e^x + xe^x, and D(x^2 e^x) = 2xe^x + x^2 e^x, and place them (in that order) as columns into the matrix [D]B:

    [e^x]B = (1, 0, 0),   [e^x + xe^x]B = (1, 1, 0),   [2xe^x + x^2 e^x]B = (0, 2, 1).

It follows that

    [D]B = [ [e^x]B | [e^x + xe^x]B | [2xe^x + x^2 e^x]B ] = [ 1 1 0 ]
                                                             [ 0 1 2 ]
                                                             [ 0 0 1 ].

(Recall that spans are always subspaces. B was shown to be linearly independent in Exercise 1.1.2(g), so it is indeed a basis of V.)
[Figure 1.7: The composition of S and T, denoted by S ◦ T, is the function that sends v ∈ V to S(T(v)) ∈ X.]
In fact, this is the primary reason that matrix multiplication is defined in the
seemingly bizarre way that it is—we want it to capture the idea of applying
one matrix (linear transformation) after another to a vector.
Theorem 1.2.7 (Composition of Linear Transformations). Suppose V, W, and X are finite-dimensional vector spaces with bases B, C, and D, respectively. If T : V → W and S : W → X are linear transformations then S ◦ T : V → X is a linear transformation, and its standard matrix is

    [S ◦ T]D←B = [S]D←C [T]C←B.
(Notice that the middle C subscripts match, while the left D subscripts and the right B subscripts also match.)

Proof. Thanks to the uniqueness condition of Theorem 1.2.6, we just need to show that [(S ◦ T)(v)]D = [S]D←C [T]C←B [v]B for all v ∈ V. To this end, we compute [(S ◦ T)(v)]D by using Theorem 1.2.6 applied to each of S and T individually:

    [(S ◦ T)(v)]D = [S(T(v))]D = [S]D←C [T(v)]C = [S]D←C [T]C←B [v]B.
In particular, for a linear transformation T : V → V we use the notation

    T^k = T ◦ T ◦ · · · ◦ T   (k copies)

for the k-fold composition of T with itself.
Example 1.2.13 (Iterated Derivatives). Use standard matrices to compute the fourth derivative of x^2 e^x + 2xe^x.

Solution:
We let B = {e^x, xe^x, x^2 e^x} and V = span(B) so that we can make use of the standard matrix

    [D]B = [ 1 1 0 ]
           [ 0 1 2 ]
           [ 0 0 1 ].

With this standard matrix in hand, one way to find the fourth derivative of x^2 e^x + 2xe^x is to multiply its coordinate vector (0, 2, 1) by [D]B^4:

    [(x^2 e^x + 2xe^x)′′′′]B = [D]B^4 [x^2 e^x + 2xe^x]B = [ 1 4 12 ] [ 0 ]   [ 20 ]
                                                           [ 0 1  8 ] [ 2 ] = [ 10 ].
                                                           [ 0 0  1 ] [ 1 ]   [  1 ]

It follows that the fourth derivative of x^2 e^x + 2xe^x is 20e^x + 10xe^x + x^2 e^x.
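A minimal sketch of the same computation in Python, using matrix_power for [D]B^4 (the variable names are ours):

```python
import numpy as np

D_B = np.array([[1., 1., 0.],
                [0., 1., 2.],
                [0., 0., 1.]])

g_B = np.array([0., 2., 1.])                      # [x^2 e^x + 2x e^x]_B
fourth = np.linalg.matrix_power(D_B, 4) @ g_B
print(fourth)   # (20, 10, 1): the fourth derivative is 20e^x + 10xe^x + x^2 e^x
```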
The following theorem describes how standard matrices interact with changes of basis: if B and C are bases of V, D and E are bases of W, and T : V → W is a linear transformation, then PE←D [T]D←C PC←B = [T]E←B.

Proof. We simply multiply the matrix PE←D [T]D←C PC←B on the right by an arbitrary coordinate vector [v]B, where v ∈ V. Well,

    PE←D [T]D←C PC←B [v]B = PE←D [T]D←C [v]C      (since PC←B [v]B = [v]C)
                          = PE←D [T(v)]D          (by Theorem 1.2.6)
                          = [T(v)]E.

However, we know from Theorem 1.2.6 that [T]E←B is the unique matrix for which [T]E←B [v]B = [T(v)]E for all v ∈ V, so it follows that PE←D [T]D←C PC←B = [T]E←B, as claimed.

(As always, notice in this theorem that adjacent subscripts always match: the two "D"s are next to each other, as are the two "C"s.)
A schematic that illustrates the statement of the above theorem is provided
by Figure 1.8. All it says is that there are two different (but equivalent) ways
of converting [v]B into [T (v)]E : we could multiply [v]B by [T ]E←B , or we
could convert [v]B into basis C, then multiply by [T ]D←C , and then convert
into basis E.
[Figure 1.8: Two equivalent routes from [v]B to [T(v)]E: multiply by [T]E←B directly, or convert with PC←B, multiply by [T]D←C, and then convert with PE←D. Read the figure as starting at the top-right corner and moving to the bottom-left.]
Example 1.2.14 (Representing the Transpose in Weird Bases). Compute the standard matrix [T]C←B of the transpose map T : M2(C) → M2(C), where

    B = { [1 0; 0 1], [0 1; 1 0], [0 −i; i 0], [1 0; 0 −1] }   and
    C = { [1 0; 0 0], [1 1; 0 0], [1 1; 1 0], [1 1; 1 1] }

are the same bases of M2(C) as in Examples 1.2.4 and 1.2.5.
Solution:
Using the change-of-basis matrices from Example 1.2.4 and the standard matrix [T]E from Example 1.2.10, we compute

    [T]C←B = PC←E [T]E PE←B
           = [ 1 −1  0  0 ] [ 1 0 0 0 ] [ 1 0  0  1 ]
             [ 0  1 −1  0 ] [ 0 0 1 0 ] [ 0 1 −i  0 ]
             [ 0  0  1 −1 ] [ 0 1 0 0 ] [ 0 1  i  0 ]
             [ 0  0  0  1 ] [ 0 0 0 1 ] [ 1 0  0 −1 ]
           = [ 1 −1  0  0 ] [ 1 0  0  1 ]
             [ 0  1 −1  0 ] [ 0 1  i  0 ]
             [ 0  0  1 −1 ] [ 0 1 −i  0 ]
             [ 0  0  0  1 ] [ 1 0  0 −1 ]
           = [ 1 −1 −i  1 ]
             [ 0  0 2i  0 ]
             [−1  1 −i  1 ]
             [ 1  0  0 −1 ].

(We could also compute [T]C←B directly from the definition, but that is a lot more work, as it requires the computation of numerous ugly coordinate vectors.)
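Since this triple product is error-prone by hand, here is a short numerical verification (the matrices are copied from the examples above; the variable names are ours):

```python
import numpy as np

P_CE = np.array([[1, -1, 0, 0],
                 [0, 1, -1, 0],
                 [0, 0, 1, -1],
                 [0, 0, 0, 1]], dtype=complex)      # inverse of P_EC
T_E = np.array([[1, 0, 0, 0],
                [0, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 1]], dtype=complex)       # transpose map in the basis E
P_EB = np.array([[1, 0, 0, 1],
                 [0, 1, -1j, 0],
                 [0, 1, 1j, 0],
                 [1, 0, 0, -1]], dtype=complex)

print(P_CE @ T_E @ P_EB)
# [[ 1, -1, -1j,  1],
#  [ 0,  0,  2j,  0],
#  [-1,  1, -1j,  1],
#  [ 1,  0,   0, -1]]
```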
Theorem 1.2.9 (Invertibility of Linear Transformations). Suppose V and W are finite-dimensional vector spaces with bases B and D, respectively, and dim(V) = dim(W). Then a linear transformation T : V → W is invertible if and only if its standard matrix [T]D←B is invertible, and

    [T^{-1}]B←D = ([T]D←B)^{-1}.
Proof. We make use of the fact from Exercise 1.2.30 that the standard matrix
of the identity transformation is the identity matrix: [T ]B = I if and only if
T = IV .
For the "only if" direction, note that if T is invertible then we have

    I = [I_V]B = [T^{-1} ◦ T]B = [T^{-1}]B←D [T]D←B.
Since [T −1 ]B←D and [T ]D←B multiply to the identity matrix, it follows that they
are inverses of each other.
For the "if" direction, suppose that [T]D←B is invertible, with inverse matrix A. Then there is some linear transformation S : W → V such that A = [S]B←D, so for all v ∈ V we have

    [v]B = A[T]D←B [v]B = [S]B←D [T]D←B [v]B = [(S ◦ T)(v)]B.

It follows that S ◦ T = I_V, and a similar argument shows that T ◦ S = I_W, so T is invertible.

(We show in Exercise 1.2.21 that if dim(V) ≠ dim(W) then T cannot possibly be invertible.)
Solution:
We let B = {e^x, xe^x, x^2 e^x} and V = span(B) so that we can make use of the standard matrix

    [D]B = [ 1 1 0 ]
           [ 0 1 2 ]
           [ 0 0 1 ].

(The "standard" way to compute this indefinite integral directly would be to use integration by parts twice.)
via this method, we could run into trouble since D : P 3 → P 3 is not invertible:
direct computation reveals that its standard matrix is 4 × 4, but has rank 3. After
all, it has a 1-dimensional null space consisting of the coordinate vectors of
the constant functions, which are all mapped to 0. One way to get around this
problem is to use the pseudoinverse, which we introduce in Section 2.C.1.
L(c1 , c2 , c3 , . . .) = (c2 , c3 , c4 , . . .)
so R ◦ L 6= IFN .
We can also introduce many other properties and quantities associated with
linear transformations, such as their range, null space, rank, and eigenvalues.
In all of these cases, the definitions are almost identical to what they were for
matrices, and in the finite-dimensional case they can all be handled simply by
appealing to what we already know about matrices.
In particular, we now define the following concepts concerning a linear
transformation T : V → W:
•  range(T) = {T(x) : x ∈ V},
•  null(T) = {x ∈ V : T(x) = 0},
•  rank(T) = dim(range(T)), and
•  nullity(T) = dim(null(T)).

(We show that range(T) is a subspace of W and null(T) is a subspace of V in Exercise 1.2.24.)
In all of these cases, we can compute these quantities by converting everything to standard matrices and coordinate vectors, doing our computations on matrices using the techniques that we already know, and then converting the answers back into statements about the original linear transformation.
Example 1.2.16 Determine the range, null space, rank, and nullity of the derivative map
Range and Null Space D : P 3 → P 3.
of the Derivative
Solution:
We first compute these objects directly from the definitions.
• The range of D is the set of all polynomials of the form D(p), where
p ∈ P 3 . Since the derivative of a degree-3 polynomial has degree-2
(and conversely, every degree-2 polynomial is the derivative of some
degree-3 polynomial), we conclude that range(D) = P 2 .
• The null space of D is the set of all polynomials p for which D(p) =
0. Since D(p) = 0 if and only if p is constant, we conclude that
null(D) = P 0 (the constant functions).
• The rank of D is the dimension of its range, which is dim(P 2 ) = 3.
• The nullity of D is the dimension of its null space: dim(P 0 ) = 1.
Alternatively, we could have arrived at these answers by working
with the standard matrix of D. Using an argument analogous to that of
Example 1.2.11, we see that the standard matrix of D with respect to the
standard basis B = {1, x, x^2, x^3} is

    [D]B = [ 0 1 0 0 ]
           [ 0 0 2 0 ]
           [ 0 0 0 3 ]
           [ 0 0 0 0 ].

(In Example 1.2.11, D was a linear transformation into P^2 instead of P^3, so its standard matrix there was 3 × 4 instead of 4 × 4.)

Straightforward computation then shows the following:
•  range([D]B) = span((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)). These three vectors are [1]B, [x]B, and [x^2]B, so range(D) = span(1, x, x^2) = P^2, as we saw earlier.
•  null([D]B) = span((1, 0, 0, 0)). Since (1, 0, 0, 0) = [1]B, we conclude that null(D) = span(1) = P^0, as we saw earlier.
•  rank([D]B) = 3, so rank(D) = 3 as well.
•  nullity([D]B) = 1, so nullity(D) = 1 as well.
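The matrix computations in the second half of this example can be reproduced with a few lines of numpy; the null-space computation below uses the singular value decomposition (this is our own sketch, not the book's code):

```python
import numpy as np

D_B = np.array([[0., 1., 0., 0.],
                [0., 0., 2., 0.],
                [0., 0., 0., 3.],
                [0., 0., 0., 0.]])

print(np.linalg.matrix_rank(D_B))    # 3, so rank(D) = 3 and nullity(D) = 4 - 3 = 1

# A basis of null([D]_B): right singular vectors for the zero singular values.
_, s, Vt = np.linalg.svd(D_B)
rank = int(np.sum(s > 1e-12))
print(Vt[rank:])                      # one row, +/-(1, 0, 0, 0), i.e. [1]_B
```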
In the above example, we were able to learn about the range, null space,
rank, and nullity of a linear transformation by considering the corresponding
properties of its standard matrix (with respect to any basis). These facts are
hopefully intuitive enough (after all, the entire reason we introduced standard
matrices is because they act on Fn in the same way that the linear transformation
acts on V), but they are proved explicitly in Exercise 1.2.25.
We can also define eigenvalues and eigenvectors of linear transformations in almost the exact same way that is done for matrices. If V is a vector space over a field F then a non-zero vector v ∈ V is an eigenvector of a linear transformation T : V → V with corresponding eigenvalue λ ∈ F if T(v) = λv. We furthermore say that the eigenspace corresponding to a particular eigenvalue is the set consisting of all eigenvectors corresponding to that eigenvalue, together with 0.

(If T : V → W with V ≠ W then T(v) and λv live in different vector spaces, so it does not make sense to talk about them being equal.)
Just like the range and null space, eigenvalues and eigenvectors can be
computed either straight from the definition or via the corresponding properties
of a standard matrix.
Example 1.2.17 (Eigenvalues and Eigenvectors of the Transpose). Compute the eigenvalues and corresponding eigenspaces of the transpose map T : M2 → M2.

Solution:
We compute the eigenvalues and eigenspaces of T by making use of its standard matrix with respect to the standard basis E = {E1,1, E1,2, E2,1, E2,2} that we computed in Example 1.2.10:

    [T] = [ 1 0 0 0 ]
          [ 0 0 1 0 ]
          [ 0 1 0 0 ]
          [ 0 0 0 1 ].
The eigenvalues of [T] are 1 and −1 (its characteristic polynomial is (λ − 1)^3(λ + 1)), and the eigenspace of [T] corresponding to the eigenvalue −1 is null([T] + I), which has basis B = {(0, 1, −1, 0)}. This vector is the coordinate vector of the matrix

    [  0 1 ]
    [ −1 0 ],

so the eigenspace of T corresponding to the eigenvalue −1 consists of the scalar multiples of this matrix (i.e., the 2 × 2 skew-symmetric matrices). A similar computation shows that the eigenspace corresponding to the eigenvalue 1 is spanned by E1,1, E1,2 + E2,1, and E2,2 (i.e., it consists of the 2 × 2 symmetric matrices).
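A numerical check of these eigenvalues and eigenvectors (our own sketch; since [T] is real symmetric we can use eigh, which returns the eigenvalues in ascending order and eigenvectors normalized to unit length):

```python
import numpy as np

T = np.array([[1., 0., 0., 0.],
              [0., 0., 1., 0.],
              [0., 1., 0., 0.],
              [0., 0., 0., 1.]])

vals, vecs = np.linalg.eigh(T)   # T is real symmetric
print(vals)                      # [-1., 1., 1., 1.]
print(vecs[:, 0])                # proportional to (0, 1, -1, 0)
```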
Example 1.2.18 (Square Root of the Transpose). Find a square root of the transpose map T : M2(C) → M2(C). That is, find a linear transformation S : M2(C) → M2(C) with the property that S^2 = T.

Solution:
Since we already know how to solve problems like this for matrices, we just do the corresponding matrix computation on the standard matrix [T] rather than trying to solve it "directly" on T itself. That is, we find a matrix square root of [T] via diagonalization.

(Throughout this example, E = {E1,1, E1,2, E2,1, E2,2} is the standard basis of M2.)

First, we diagonalize [T]. We learned in Example 1.2.17 that [T] has eigenvalues 1 and −1, with corresponding eigenspace bases equal to {(1, 0, 0, 0), (0, 1, 1, 0), (0, 0, 0, 1)} and {(0, 1, −1, 0)}, respectively, which leads us
to choose
    P = [ 1 0 0  0 ]
        [ 0 1 0  1 ]
        [ 0 1 0 −1 ]
        [ 0 0 1  0 ],

    P^{-1} = (1/2) [ 2 0  0 0 ]
                   [ 0 1  1 0 ]
                   [ 0 0  0 2 ]
                   [ 0 1 −1 0 ],   and

    D = [ 1 0 0  0 ]
        [ 0 1 0  0 ]
        [ 0 0 1  0 ]
        [ 0 0 0 −1 ].

(Recall that to diagonalize a matrix, we place the eigenvalues along the diagonal of D and bases of the corresponding eigenspaces as columns of P, in the same order.)
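The rest of the computation (taking a square root of D and conjugating back) is mechanical, so here is a hedged Python sketch of it; the choice sqrt(−1) = i below is just one of the possible square roots, and all names are ours:

```python
import numpy as np

T = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]], dtype=complex)
P = np.array([[1, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 1, 0, -1],
              [0, 0, 1, 0]], dtype=complex)
D = np.diag([1, 1, 1, -1]).astype(complex)

print(np.allclose(P @ D @ np.linalg.inv(P), T))   # True: P diagonalizes [T]

sqrtD = np.diag([1, 1, 1, 1j])                    # one square root of D
S = P @ sqrtD @ np.linalg.inv(P)                  # standard matrix of a square root of T
print(np.allclose(S @ S, T))                      # True
print(np.round(S, 10))
```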
Example 1.2.19 (How to Take Half of a Derivative). Let B = {sin(x), cos(x)} and V = span(B). Find a square root D^{1/2} of the derivative map D : V → V and use it to compute D^{1/2}(sin(x)) and D^{1/2}(cos(x)).

Solution:
Just like in the previous example, instead of working with D itself, we work with its standard matrix, which we can compute straightforwardly to be

    [D]B = [ 0 −1 ]
           [ 1  0 ].

(We think of D^{1/2} as the "half derivative", just like D^2 is the second derivative.)
While we could compute [D]B^{1/2} by diagonalizing [D]B (we do exactly this in Exercise 1.2.18), it is simpler to recognize [D]B as a counter-clockwise rotation by π/2 radians. (Recall that the standard matrix of a counter-clockwise rotation by angle θ is

    [ cos(θ) −sin(θ) ]
    [ sin(θ)  cos(θ) ];

plugging in θ = π/2 gives exactly [D]B.) One square root of a rotation by π/2 is a rotation by π/4, so we can choose

    [D]B^{1/2} = [ cos(π/4) −sin(π/4) ] = (1/√2) [ 1 −1 ]
                 [ sin(π/4)  cos(π/4) ]          [ 1  1 ].

[Figure: [D]B rotates [sin(x)]B = e1 to [D(sin(x))]B = [cos(x)]B = e2, and rotates [cos(x)]B to [D(cos(x))]B = −e1.]
Unraveling this standard matrix back into the linear transformation D^{1/2} shows that the half derivatives of sin(x) and cos(x) are

    D^{1/2}(sin(x)) = (1/√2)(sin(x) + cos(x))   and
    D^{1/2}(cos(x)) = (1/√2)(cos(x) − sin(x)).

(To double-check our work, we could apply D^{1/2} to sin(x) twice and see that we get cos(x).)
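A short numerical check that a rotation by π/4 really is a square root of [D]B, and of the two half derivatives above (our own sketch):

```python
import numpy as np

D_B = np.array([[0., -1.],
                [1.,  0.]])
theta = np.pi / 4
half_D = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])   # rotation by pi/4

print(np.allclose(half_D @ half_D, D_B))   # True
print(half_D @ np.array([1., 0.]))         # [D^(1/2) sin(x)]_B = (1, 1)/sqrt(2)
print(half_D @ np.array([0., 1.]))         # [D^(1/2) cos(x)]_B = (-1, 1)/sqrt(2)
```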
∗∗1.2.12 Show that the standard matrix of the transpose map T : Mm,n → Mn,m with respect to the standard basis E = {E1,1, E1,2, . . . , Em,n} is the mn × mn block matrix that, for all i and j, has E_{j,i} in its (i, j)-block:

    [T] = [ E1,1  E2,1  · · ·  Em,1 ]
          [ E1,2  E2,2  · · ·  Em,2 ]
          [  .     .     .      .   ]
          [ E1,n  E2,n  · · ·  Em,n ].

∗1.2.15 Let T : P^2 → P^2 be the linear transformation defined by T(f(x)) = f(x + 1).

(a) Show that B is a basis of V if and only if it is linearly independent.
(b) Show that B is a basis of V if and only if it spans V.

1.2.20 Suppose V and W are vector spaces and T : V → W is an invertible linear transformation. Let B = {v1, v2, . . . , vn} ⊆ V be a set of vectors.
(a) Show that if B is linearly independent then so is T(B) = {T(v1), T(v2), . . . , T(vn)}.
(b) Show that if B spans V then T(B) spans W.
(c) Show that if B is a basis of V then T(B) is a basis of W.
(d) Provide an example to show that none of the results of parts (a), (b), or (c) hold if T is not invertible.
[Side note: This means that the function T : V → Fn defined by T(v) = [v]B is an invertible linear transformation.]

∗∗1.2.32 Show that the "right shift" map R from Remark 1.2.2 is indeed a linear transformation.
1.3 Isomorphisms and Linear Forms

Now that we are familiar with how most of the linear algebraic objects from
Rn (e.g., subspaces, linear independence, bases, linear transformations, and
eigenvalues) generalize to vector spaces in general, we take a bit of a detour to
discuss some ideas that it did not really make sense to talk about when Rn was
the only vector space in sight.
1.3.1 Isomorphisms
Recall that every finite-dimensional vector space V has a basis B, and we can
use that basis to represent a vector v ∈ V as a coordinate vector [v]B ∈ Fn ,
where F is the ground field. We used this correspondence between V and
Fn to motivate the idea that these vector spaces are “the same” in the sense
that, in order to do a linear algebraic calculation in V, we can instead do the
corresponding calculation on coordinate vectors in Fn .
We now make this idea of vector spaces being “the same” a bit more precise
and clarify under exactly which conditions this “sameness” happens.
Definition 1.3.1 (Isomorphisms). Suppose V and W are vector spaces over the same field. We say that V and W are isomorphic, denoted by V ≅ W, if there exists an invertible linear transformation T : V → W (called an isomorphism from V to W).
We can think of isomorphic vector spaces as having the same structure and
the same vectors as each other, but different labels on those vectors. This is
perhaps easiest to illustrate by considering the vector spaces M1,2 and M2,1
of row and column vectors, respectively. Vectors (i.e., matrices) in these vector spaces have the forms

    [ a  b ] ∈ M1,2   and   [ a ]
                            [ b ] ∈ M2,1.

The fact that we write the entries of vectors in M1,2 in a row whereas we write those from M2,1 in a column is often just as irrelevant as if we used a different font when writing the entries of the vectors in one of these vector spaces. Indeed, vector addition and scalar multiplication in these spaces are both performed entrywise, so it does not matter how we arrange or order those entries.

(The expression "the map is not the territory" seems relevant here: we typically only care about the underlying vectors, not the form we use to write them down.)
To formally see that M1,2 and M2,1 are isomorphic, we just construct the
“obvious” isomorphism between them: the transpose map T : M1,2 → M2,1
satisfies
    T([ a  b ]) = [ a ]
                  [ b ].
Furthermore, we already noted in Example 1.2.6 that the transpose is a linear
transformation, and it is clearly invertible (it is its own inverse, since transposing
twice gets us back to where we started), so it is indeed an isomorphism. The
same argument works to show that each of Fn , Mn,1 , and M1,n are isomorphic.
Remark 1.3.1 (Column Vectors and Row Vectors). The fact that Fn, Mn,1, and M1,n are isomorphic justifies something that is typically done right from the beginning in linear algebra: treating members of Fn (vectors), members of Mn,1 (column vectors), and members of M1,n (row vectors) as the same thing.

When we talk about isomorphisms and vector spaces being "the
same”, we only mean with respect to the 10 defining properties of vector
spaces (i.e., properties based on vector addition and scalar multiplication).
We can add column vectors in the exact same way that we add row vectors, and similarly scalar multiplication works the exact same for those two types of vectors. However, other operations like matrix multiplication may behave differently on these two sets (e.g., if A ∈ Mn then Ax makes sense when x is a column vector, but not when it is a row vector).

(A "morphism" is a function, and the prefix "iso" means "identical". The word "isomorphism" thus means a function that keeps things "the same".)
As an even simpler example of an isomorphism, we have implicitly been
using one when we say things like vT w = v · w for all v, w ∈ Rn . Indeed, the
quantity v · w is a scalar in R, whereas vT w is actually a 1 × 1 matrix (after
all, it is obtained by multiplying a 1 × n matrix by an n × 1 matrix), so it does
not quite make sense to say that they are “equal” to each other. However, the
spaces R and M1,1 (R) are trivially isomorphic, so we typically sweep this
technicality under the rug.
Before proceeding, it is worthwhile to specifically point out some basic
properties of isomorphisms that follow almost immediately from facts that we
already know about (invertible) linear transformations in general:
•  If T : V → W is an isomorphism then so is T^{-1} : W → V.
•  If T : V → W and S : W → X are isomorphisms then so is S ◦ T : V → X. In particular, if V ≅ W and W ≅ X then V ≅ X.

(We prove these two properties in Exercise 1.3.6.)

For example, we essentially showed in Example 1.2.1 that M2 ≅ R^4 and R^4 ≅ P^3, so it follows that M2 ≅ P^3 as well. The fact that these vector spaces are isomorphic is hopefully not surprising, since each of them is 4-dimensional.
Example 1.3.1 (Isomorphism of a Space of Functions). Show that the vector spaces V = span(e^x, xe^x, x^2 e^x) and R^3 are isomorphic.

Solution:
The standard way to show that two spaces are isomorphic is to construct an isomorphism between them. To this end, consider the linear transformation T : R^3 → V defined by

    T(a, b, c) = ae^x + bxe^x + cx^2 e^x.

(Our method of coming up with this map T is very naïve: just send a basis of R^3 to a basis of V. This technique works fairly generally and is a good way of coming up with "obvious" isomorphisms.)

It is straightforward to show that this function is a linear transformation, so we just need to convince ourselves that it is invertible. To this end, we recall from Exercise 1.1.2(g) that B = {e^x, xe^x, x^2 e^x} is linearly independent and thus a basis of V, so we can construct the standard matrix [T]B←E, where E = {e1, e2, e3} is the standard basis of R^3:

    [T]B←E = [ [T(1, 0, 0)]B | [T(0, 1, 0)]B | [T(0, 0, 1)]B ]
           = [ [e^x]B | [xe^x]B | [x^2 e^x]B ] = [ 1 0 0 ]
                                                 [ 0 1 0 ]
                                                 [ 0 0 1 ].

Since [T]B←E is clearly invertible (the identity matrix is its own inverse), T is invertible too and is thus an isomorphism.
Example 1.3.2 (Polynomials are Isomorphic to Eventually-Zero Sequences). Show that the vector space of polynomials P and the vector space of eventually-zero sequences c00 from Example 1.1.10 are isomorphic.

Solution:
As always, our method of showing that these two spaces are isomorphic is to explicitly construct an isomorphism between them. As with the previous examples, there is an "obvious" choice of isomorphism T : P → c00, and it is defined by

    T(a0 + a1x + a2x^2 + · · · + apx^p) = (a0, a1, a2, . . . , ap, 0, 0, 0, . . .).
Theorem 1.3.1 (Isomorphisms of Finite-Dimensional Vector Spaces). Suppose V is an n-dimensional vector space over a field F. Then V ≅ F^n.

Proof. We just recall from Exercise 1.2.22 that if B is any basis of V then the function T : V → F^n defined by T(v) = [v]B is an invertible linear transformation (i.e., an isomorphism).
In particular, the above theorem tells us that any two vector spaces of the
same (finite) dimension over the same field are necessarily isomorphic, since
they are both isomorphic to Fn and thus to each other. The following corollary
states this observation precisely and also establishes its converse.
Corollary 1.3.2 (Finite-Dimensional Vector Spaces are Isomorphic). Suppose V and W are vector spaces over the same field and V is finite-dimensional. Then V ≅ W if and only if dim(V) = dim(W).

Proof. We already explained how Theorem 1.3.1 gives us the "if" direction, so we now prove the "only if" direction. To this end, we just note that if V ≅ W then there is an invertible linear transformation T : V → W, so Exercise 1.2.21 tells us that dim(V) = dim(W).
Remark 1.3.2 (Why Isomorphisms?). In a sense, we did not actually do anything new in this subsection: we already knew about linear transformations and invertibility, so it seems natural to wonder why we would bother adding the "isomorphism" layer of terminology on top of it.

While it's true that there's nothing really "mathematically" new about isomorphisms, the important thing is the new perspective that it gives us. It is very useful to be able to think of vector spaces as being the same as each other, as it can provide us with new intuition or cut down the amount of work that we have to do.

(An isomorphism T : V → V, i.e., from a vector space back to itself, is called an automorphism. If it is necessary to clarify which type of isomorphism we are talking about, we call an isomorphism in the linear algebra sense a vector space isomorphism.)

For example, instead of having to do a computation or think about vector space properties in P^3 or M2, we can do all of our work in R^4, which is likely a fair bit more intuitive. Similarly, we can always work with whichever of c00 (the space of eventually-zero sequences) or P (the space of polynomials) we prefer, since an answer to any linear algebraic question in one of those spaces can be straightforwardly converted into an answer to the corresponding question in the other space via the isomorphism that we constructed in Example 1.3.2.

More generally, isomorphisms are used throughout all of mathematics, not just in linear algebra. In general, they are defined to be invertible maps that preserve whatever structure is being studied.
Definition 1.3.2 (Linear Forms). Suppose V is a vector space over a field F. Then a linear transformation f : V → F is called a linear form.
(Linear forms are sometimes instead called linear functionals.)

Linear forms can be thought of as giving us snapshots of vectors: knowing the value of f(v) tells us what v looks like from one particular direction or angle (just like having a photograph tells us what an object looks like from one side), but not necessarily what it looks like as a whole. For example, the linear transformations f1, f2, . . ., fn described above (i.e., the coordinate functions that send a vector in F^n to one of its entries) each give us one of v's coordinates (they tell us what v looks like in the direction of one of the standard basis vectors), but tell us nothing about its other coordinates.
Alternatively, linear forms can be thought of as the building blocks that
make up more general linear transformations. For example, consider the linear
transformation T : R2 → R2 (which is not a linear form) defined by
T (x, y) = (x + 2y, 3x − 4y) for all (x, y) ∈ R2 .
If we define f (x, y) = x + 2y and g(x, y) = 3x − 4y then it is straightforward
to check that f and g are each linear forms, and T (x, y) = ( f (x, y), g(x, y)).
That is, T just outputs the value of two linear forms. Similarly, every linear
transformation into an n-dimensional vector space can be thought of as being
made up of n linear forms (one for each of the n output dimensions).
Example 1.3.3. Let v ∈ F^n. Show that the function fv : F^n → F defined by fv(w) = v1w1 + v2w2 + · · · + vnwn is a linear form.

Solution:
This follows immediately from the more general fact that multiplication by a matrix is a linear transformation. Indeed, if we let A = v ∈ M1,n be v regarded as a row vector, then fv(w) = Aw for all column vectors w.
(Recall that the dot product on R^n is defined by v · w = v1w1 + · · · + vnwn.)

In particular, if F = R then the previous example tells us that

    fv(w) = v · w   for all w ∈ R^n

is a linear form. This is actually the "standard" example of a linear form, and
the one that we should keep in mind as our intuition builder. We will see shortly
that every linear form on a finite-dimensional vector space can be written in this
way (in the exact same sense that every linear transformation can be written as
a matrix).
Example 1.3.6 (Integration is a Linear Form). Let C[a, b] be the vector space of continuous real-valued functions on the interval [a, b]. Show that the function I : C[a, b] → R defined by

    I(f) = ∫_a^b f(x) dx

is a linear form.
We now pin down the claim that we made earlier that every linear form
on a finite-dimensional vector space looks like one half of the dot product. In
particular, to make this work we just do what we always do when we want to
make abstract vector space concepts more concrete—we represent vectors as
coordinate vectors with respect to some basis.
Theorem 1.3.3 (The Form of Linear Forms). Let B be a basis of a finite-dimensional vector space V over a field F, and let f : V → F be a linear form. Then there exists a unique vector v ∈ V such that

    f(w) = [v]B^T [w]B   for all w ∈ V,

where we are treating [v]B and [w]B as column vectors.
Example 1.3.7 (The Evaluation Map as a Dot Product). Let E2 : P^3 → R be the evaluation map from Example 1.3.5, defined by E2(f) = f(2), and let E = {1, x, x^2, x^3} be the standard basis of P^3. Find a polynomial g ∈ P^3 such that E2(f) = [g]E · [f]E for all f ∈ P^3.

Solution:
If we write f(x) = a + bx + cx^2 + dx^3 (i.e., [f]E = (a, b, c, d)) then

    E2(f) = f(2) = a + 2b + 4c + 8d = (1, 2, 4, 8) · (a, b, c, d).

It follows that we can choose g(x) = 1 + 2x + 4x^2 + 8x^3, since then [g]E = (1, 2, 4, 8).

(This example just illustrates how Theorem 1.3.3 works out for the linear form E2.)
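A tiny numerical illustration of this correspondence (the test polynomial is our own):

```python
import numpy as np

g_E = np.array([1., 2., 4., 8.])    # [g]_E for g(x) = 1 + 2x + 4x^2 + 8x^3
f_E = np.array([3., -1., 0., 2.])   # f(x) = 3 - x + 2x^3, a test polynomial

print(np.dot(g_E, f_E))             # 17.0
print(3 - 1 * 2 + 2 * 2**3)         # f(2) = 17, matching E_2(f)
```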
Definition 1.3.3 (Dual of a Vector Space). Let V be a vector space over a field F. Then the dual of V, denoted by V*, is the vector space consisting of all linear forms on V.
The fact that V ∗ is indeed a vector space is established by Exercise 1.2.28(a),
and part (b) of that same exercise even tells us that dim(V ∗ ) = dim(V) when
V is finite-dimensional, so V and V ∗ are isomorphic by Corollary 1.3.2. In
fact, one simple isomorphism from V ∗ to V is exactly the one that sends a
linear form f to its corresponding vector v from Theorem 1.3.3. However, this
isomorphism between V and V ∗ is somewhat strange, as it depends on the
particular choice of basis that we make on V—if we change the basis B in
Theorem 1.3.3 then the vector v corresponding to a linear form f changes as
well.
The fact that the isomorphism between V and V ∗ is basis-dependent sug-
gests that something somewhat unnatural is going on, as many (even finite-
dimensional) vector spaces do not have a “natural” or “standard” choice of
basis. However, if we go one step further and consider the double-dual space
Yes, it is pretty V ∗∗ consisting of linear forms acting on V ∗ then things become a bit more
awkward to think
about what the well-behaved, so we now briefly explore this double-dual space.
members of V ∗∗ are. Using the exact same ideas as earlier, if V is finite-dimensional then we
They are linear forms
acting on linear
still have dim(V ∗∗ ) = dim(V ∗ ) = dim(V), so all three of these vector spaces
forms (i.e., functions are isomorphic. However, V and V ∗∗ are isomorphic in a much more natural
of functions). way than V and V ∗ , since there is a basis-independent choice of isomorphism
between them. To see what it is, notice that for every vector v ∈ V we can
define a linear form φv ∈ V** by φv(f) = f(v) for all f ∈ V*. To see that φv really is linear (and thus a member of V**), we just compute

    φv(f + g) = (f + g)(v)        (definition of φv)
              = f(v) + g(v)        (definition of "+" in V*)
              = φv(f) + φv(g),     (definition of φv)

and similarly φv(cf) = (cf)(v) = cf(v) = cφv(f) for all c ∈ F.
Example 1.3.8 (The Double-Dual of Polynomials). Show that for each φ ∈ (P^3)** there exists f ∈ P^3 such that

    φ(Ex) = f(x)   for all x ∈ R,

where Ex denotes the evaluation map Ex(g) = g(x) from Example 1.3.5, and

    f(x) = Ex(f) = (c1,x E1 + c2,x E2 + c3,x E3 + c4,x E4)(f)
                 = c1,x E1(f) + c2,x E2(f) + c3,x E3(f) + c4,x E4(f)
                 = c1,x f(1) + c2,x f(2) + c3,x f(3) + c4,x f(4).
Theorem 1.3.4. If V is a finite-dimensional vector space then the function T : V → V** defined by T(v) = φv is an isomorphism.

Proof. We must show that T is linear and invertible, and we again do not have to be clever to do so. Rather, as long as we keep track of what space all of these objects live in and parse the notation carefully, then linearity and invertibility of T follow almost immediately from the relevant definitions.

(Yes, T is a function that sends vectors to functions of functions of vectors. If you recall that V itself might be a vector space made up of functions then your head might explode.)

Before showing that T is linear, we first make a brief note on notation. Since T maps into V**, we know that T(v) is a function acting on V*. We thus use the (admittedly unfortunate) notation T(v)(f) to refer to the scalar value that results from applying the function T(v) to f ∈ V*. Once this notational nightmare is understood, the proof of linearity is straightforward:
This double-dual space V** and its correspondence with V likely still seems quite abstract, so it is useful to think about what it means when V = F^n, which we typically think of as consisting of column vectors. Theorem 1.3.3 tells us that each f ∈ (F^n)* corresponds to some row vector v^T (in the sense that f(w) = v^T w for all w ∈ F^n). Theorem 1.3.4 says that if we go one step further, then each φ ∈ (F^n)** corresponds to some column vector w ∈ F^n in the sense that φ(f) = f(w) = v^T w for all f ∈ V* (i.e., for all v^T).

(The close relationship between V and V** is why we use the term "dual space" in the first place: duality refers to an operation or concept that, when applied a second time, gets us back to where we started.)

For this reason, it is convenient (and for the most part, acceptable) to think of V as consisting of column vectors and V* as consisting of the corresponding
where we started. row vectors. In fact, this is exactly why we use the notation V ∗ for the dual
space in the first place—it is completely analogous to taking the (conjugate)
transpose of the vector space V. The fact that V ∗ is isomorphic to V, but in a
way that depends on the particular basis chosen, is analogous to the fact that if
v ∈ Fn is a column vector then v and vT have the same size (and entries) but
not shape, and the fact that V ∗∗ is so naturally isomorphic to V is analogous to
the fact that (vT )T and v have the same size and shape (and are equal).
Remark 1.3.3 While V ∗ is defined as a set of linear forms on V, this can be cumbersome
Linear Forms Versus to think about once we start considering vector spaces like V ∗∗ (and,
Vector Pairings heaven forbid, V ∗∗∗ ), as it is somewhat difficult to make sense of what a
function of a function (of a function...) “looks like”.
Instead, notice that if w ∈ V and f ∈ V ∗ then the expression f (w) is
linear in each of w and f , so we can think of it just as combining members
of two vector spaces together in a linear way, rather than as members of
one vector space acting on the members of another. One way of making
this observation precise is via Theorem 1.3.3, which says that applying a
linear form f ∈ V ∗ to w ∈ V is the same as taking the dot product of two
vectors [v]B and [w]B (at least in the finite-dimensional case).
Definition 1.3.4 (Bilinear Forms). Suppose V and W are vector spaces over the same field F. Then a function f : V × W → F is called a bilinear form if it satisfies the following properties:
a) It is linear in its first argument:
   i) f(v1 + v2, w) = f(v1, w) + f(v2, w) and
   ii) f(cv1, w) = c f(v1, w) for all c ∈ F, v1, v2 ∈ V, and w ∈ W.
b) It is linear in its second argument:
   i) f(v, w1 + w2) = f(v, w1) + f(v, w2) and
   ii) f(v, cw1) = c f(v, w1) for all c ∈ F, v ∈ V, and w1, w2 ∈ W.

(Recall that the notation f : V × W → F means that f takes two vectors as input, one from V and one from W, and provides a scalar from F as output.)

While the above definition might seem like a mouthful, it simply says that f is a bilinear form exactly if it becomes a linear form when one of its inputs is held constant. That is, for every fixed vector w ∈ W the function gw : V → F defined by gw(v) = f(v, w) is a linear form, and similarly for every fixed vector v ∈ V the function hv : W → F defined by hv(w) = f(v, w) is a linear form.
Yet again, we look at some examples to try to get a feeling for what bilinear
forms look like.
Example 1.3.9. Show that the dot product f : R^n × R^n → R, defined by f(v, w) = v · w, is a bilinear form.
Solution:
We could work through the four defining properties of bilinear forms,
but an easier way to solve this problem is to recall from Example 1.3.3 that
the dot product is a linear form if we keep the first vector v fixed, which
establishes property (b) in Definition 1.3.4.
Since v·w = w·v, it is also the case that the dot product is a linear form
if we keep the second vector fixed, which in turn establishes property (a)
in Definition 1.3.4. It follows that the dot product is indeed a bilinear form.
The real dot product is the prototypical example of a bilinear form, so keep it in mind when working with bilinear forms abstractly to help make them seem a bit more concrete. Perhaps even more simply, notice that multiplication (of real numbers) is a bilinear form. That is, if we define a function f : R × R → R simply via f(x, y) = xy, then f is a bilinear form. This of course makes sense since multiplication of real numbers is just the one-dimensional dot product.

(The function f(x, y) = xy is a bilinear form but not a linear transformation. Linear transformations must be linear "as a whole", whereas bilinear forms just need to be linear with respect to each variable independently.)

In order to simplify proofs that certain functions are bilinear forms, we can check linearity in the first argument by showing that f(v1 + cv2, w) = f(v1, w) + c f(v2, w) for all c ∈ F, v1, v2 ∈ V, and w ∈ W, rather than checking vector addition and scalar multiplication separately as in conditions (a)(i) and (a)(ii) of Definition 1.3.4 (and we similarly check linearity in the second argument).
Example 1.3.10 (The Dual Pairing is a Bilinear Form). Let V be a vector space over a field F. Show that the function g : V* × V → F defined by

    g(f, v) = f(v)   for all f ∈ V*, v ∈ V

is a bilinear form.

Solution:
We just notice that g is linear in each of its input arguments individually. For the first input argument, we have

    g(f1 + cf2, v) = (f1 + cf2)(v) = f1(v) + cf2(v) = g(f1, v) + c g(f2, v),

and linearity in the second argument follows similarly, since each f ∈ V* is itself linear.
Example 1.3.11 (Matrices are Bilinear Forms). Let A ∈ Mm,n(F) be a matrix. Show that the function f : F^m × F^n → F defined by

    f(v, w) = v^T A w   for all v ∈ F^m, w ∈ F^n

is a bilinear form.

Solution:
Once again, we just check the defining properties from Definition 1.3.4, all of which follow straightforwardly from the corresponding properties of matrix multiplication. (In this example, and as always unless specified otherwise, F refers to an arbitrary field.)

a) For all v1, v2 ∈ F^m, w ∈ F^n, and c ∈ F we have (v1 + cv2)^T A w = v1^T A w + c v2^T A w = f(v1, w) + c f(v2, w), and property (b) follows similarly.
Example 1.3.12. Show that the function f : F^2 × F^2 → F defined by

    f(v, w) = 3v1w1 − 4v1w2 + 5v2w1 + v2w2   for all v, w ∈ F^2

is a bilinear form.

Solution:
We could check the defining properties from Definition 1.3.4, but an easier way to show that f is bilinear is to notice that we can group its coefficients into a matrix as follows:

    f(v, w) = [ v1  v2 ] [ 3 −4 ] [ w1 ]
                         [ 5  1 ] [ w2 ],

so bilinearity follows from Example 1.3.11. (Notice that the coefficient of v_i w_j goes in the (i, j)-entry of the matrix. This always happens.)
Theorem 1.3.5 (The Form of Bilinear Forms). Let B and C be bases of m- and n-dimensional vector spaces V and W, respectively, over a field F, and let f : V × W → F be a bilinear form. There exists a unique matrix A ∈ Mm,n(F) such that

    f(v, w) = [v]B^T A [w]C   for all v ∈ V, w ∈ W.

Proof. We just use the fact that bilinear forms are linear when we keep one of their inputs constant, and we then leech off of the representation of linear forms that we already know from Theorem 1.3.3.

Specifically, if we denote the vectors in the basis B by B = {v1, v2, . . . , vm} then [vj]B = ej for all 1 ≤ j ≤ m, and Theorem 1.3.3 tells us that the linear form gj : W → F defined by gj(w) = f(vj, w) can be written as gj(w) = aj^T [w]C for some fixed (column) vector aj ∈ F^n. If we let A be the matrix with rows a1^T, . . . , am^T (i.e., A^T = [ a1 | · · · | am ]) then

    f(vj, w) = gj(w) = aj^T [w]C = ej^T A [w]C = [vj]B^T A [w]C   for all 1 ≤ j ≤ m, w ∈ W.

(Here we use the fact that ej^T A equals the j-th row of A, i.e., aj^T.)
Example 1.3.13 (The 2 × 2 Determinant as a Matrix). Find the matrix A ∈ M2(F) with the property that

    det([ v | w ]) = v^T A w   for all v, w ∈ F^2.

Solution:
Recall that det([ v | w ]) = v1w2 − v2w1, while direct calculation shows that v^T A w = a1,1 v1w1 + a1,2 v1w2 + a2,1 v2w1 + a2,2 v2w2.
By simply comparing these two expressions, we see that the unique matrix A that makes them equal to each other has entries a1,1 = 0, a1,2 = 1, a2,1 = −1, and a2,2 = 0. That is,

    det([ v | w ]) = v^T A w   if and only if   A = [  0 1 ]
                                                    [ −1 0 ].

(The fact that A is skew-symmetric, i.e., A^T = −A, corresponds to the fact that swapping two columns of a matrix multiplies its determinant by −1, i.e., det([ v | w ]) = −det([ w | v ]).)
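The matrix of Theorem 1.3.5 can always be extracted from a bilinear form by evaluating it on pairs of standard basis vectors, since a_{i,j} = f(e_i, e_j). A small Python sketch doing this for the 2 × 2 determinant and checking the result against v^T A w (all names are ours):

```python
import numpy as np

def f(v, w):
    """The bilinear form f(v, w) = det([v | w]) on F^2."""
    return np.linalg.det(np.column_stack([v, w]))

e = np.eye(2)
A = np.array([[f(e[:, i], e[:, j]) for j in range(2)] for i in range(2)])
print(np.round(A))                       # [[0, 1], [-1, 0]]

rng = np.random.default_rng(0)
v, w = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(f(v, w), v @ A @ w))    # True
```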
Multilinear Forms
In light of Example 1.3.13, it might be tempting to think that the determinant
of a 3 × 3 matrix can be represented via a single fixed 3 × 3 matrix, but this
is not the case—the determinant of a 3 × 3 matrix is not a bilinear form, but
rather it is linear in the three columns of the input matrix. More generally, the
determinant of a p × p matrix is multilinear—linear in each of its p columns.
This generalization of bilinearity is captured by the following definition, which
requires that the function being considered is a linear form when all except for
one of its inputs are held constant.
Definition 1.3.5 (Multilinear Forms). Suppose V1, V2, . . . , Vp are vector spaces over the same field F. A function f : V1 × V2 × · · · × Vp → F is called a multilinear form if, for each 1 ≤ j ≤ p and each v1 ∈ V1, v2 ∈ V2, . . ., vp ∈ Vp, it is the case that the function g : Vj → F defined by

    g(v) = f(v1, . . . , v_{j−1}, v, v_{j+1}, . . . , vp)   for all v ∈ Vj

is a linear form.

(A multilinear form with p input arguments is sometimes called p-linear.)
[Figure: a 3-dimensional array of scalars a_{i,j,k}, drawn as a stack of matrices.]
Theorem 1.3.6 (The Form of Multilinear Forms). Suppose V1, . . . , Vp are finite-dimensional vector spaces over a field F and f : V1 × · · · × Vp → F is a multilinear form. For each 1 ≤ i ≤ p and vi ∈ Vi, let v_{i,j} denote the j-th coordinate of vi with respect to some basis of Vi. Then there exists a unique p-dimensional array (with entries {a_{j1,...,jp}}), called the standard array of f, such that

    f(v1, v2, . . . , vp) = ∑_{j1, j2, . . . , jp} a_{j1, j2, . . . , jp} v_{1,j1} v_{2,j2} · · · v_{p,jp}   for all v1 ∈ V1, . . . , vp ∈ Vp.
Proof. We already know from Theorems 1.3.3 and 1.3.5 that this result holds when p = 1 or p = 2, which establishes the base case of the induction. For the inductive step, suppose that the result is true for all (p − 1)-linear forms acting on V2 × · · · × Vp. If we let B = {w1, w2, . . . , wm} be a basis of V1 then the inductive hypothesis tells us that the (p − 1)-linear forms g_{j1} : V2 × · · · × Vp → F defined by

    g_{j1}(v2, . . . , vp) = f(w_{j1}, v2, . . . , vp)
can be written as

    f(w_{j1}, v2, . . . , vp) = g_{j1}(v2, . . . , vp) = ∑_{j2, . . . , jp} a_{j1, j2, . . . , jp} v_{2,j2} · · · v_{p,jp}

for some fixed family of scalars {a_{j1, j2, . . . , jp}}.

(We use j1 here instead of just j since it will be convenient later on. The scalar a_{j1, j2, . . . , jp} here depends on j1, and not just on j2, . . . , jp, since each choice of w_{j1} gives a different (p − 1)-linear form g_{j1}.)
basis vectors w1 , w2 , . . . , wm (i.e., v1 = v1,1 w1 + v1,2 w2 + · · · + v1,m wm ), it then
follows from linearity of the first argument of f that
follows that

    f(v1, . . . , vp) = f( ∑_{j1=1}^m v_{1,j1} w_{j1}, v2, . . . , vp )                          (v1 = ∑_{j1} v_{1,j1} w_{j1})
                      = ∑_{j1=1}^m v_{1,j1} f(w_{j1}, v2, . . . , vp)                            (multilinearity of f)
                      = ∑_{j1=1}^m v_{1,j1} ∑_{j2, . . . , jp} a_{j1, j2, . . . , jp} v_{2,j2} · · · v_{p,jp}   (inductive hypothesis)
for all v1 ∈ V1 , . . . , v p ∈ V p , which completes the inductive step and shows that
the family of scalars {a j1 ,..., j p } exists.
To see that the scalars {a j1 ,..., j p } are unique, just note that if we choose v1
to be the j1 -th member of the basis of V1 , v2 to be the j2 -th member of the
basis of V2 , and so on, then we get
f (v1 , . . . , v p ) = a j1 ,..., j p .
In particular, these scalars are completely determined by f .
For example, the determinant of a 3 × 3 matrix is a 3-linear (trilinear?)
form, so it can be represented by a single fixed 3 × 3 × 3 array (or equivalently,
a family of 33 = 27 scalars), just like the determinant of a 2 × 2 matrix is a
bilinear form and thus can be represented by a 2 × 2 matrix (i.e., a family of
22 = 4 scalars). The following example makes this observation explicit.
Example 1.3.14. Find the standard array of the 3 × 3 determinant, i.e., of the multilinear form f(v, w, x) = det([ v | w | x ]) on F^3 × F^3 × F^3.

Solution:
We recall the following explicit formula for the determinant of a 3 × 3 matrix:

    det([ v | w | x ]) = v1w2x3 + v2w3x1 + v3w1x2 − v1w3x2 − v2w1x3 − v3w2x1.
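Reading the coefficients off of this formula, the standard array has a_{j1,j2,j3} = +1 when (j1, j2, j3) is an even permutation of (1, 2, 3), −1 when it is an odd permutation, and 0 otherwise (the Levi-Civita pattern). A hedged numerical check of this claim, using our own code (itertools and einsum):

```python
import itertools
import numpy as np

# Build the 3x3x3 standard array of the determinant.
eps = np.zeros((3, 3, 3))
for (i, j, k) in itertools.permutations(range(3)):
    # det of the row-permuted identity is the sign of the permutation (+1 or -1).
    eps[i, j, k] = round(np.linalg.det(np.eye(3)[[i, j, k]]))

rng = np.random.default_rng(1)
v, w, x = rng.standard_normal((3, 3))
lhs = np.einsum('ijk,i,j,k->', eps, v, w, x)     # sum of a_{ijk} v_i w_j x_k
rhs = np.linalg.det(np.column_stack([v, w, x]))
print(np.isclose(lhs, rhs))                       # True
```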
Definition 1.3.6 (Inner Product). Suppose that F = R or F = C and that V is a vector space over F. Then an inner product on V is a function ⟨·, ·⟩ : V × V → F such that the following three properties hold for all c ∈ F and all v, w, x ∈ V:
a) ⟨v, w⟩ = \overline{⟨w, v⟩}   (conjugate symmetry)
b) ⟨v, w + cx⟩ = ⟨v, w⟩ + c⟨v, x⟩   (linearity)
c) ⟨v, v⟩ ≥ 0, with equality if and only if v = 0.   (positive definiteness)

(The notation ⟨·, ·⟩ : V × V → F means that ⟨·, ·⟩ is a function that takes in two vectors from V and outputs a single number from F.)

If F = R then inner products are indeed bilinear forms, since property (b) gives linearity in the second argument and then the symmetry condition (a) guarantees that it is also linear in its first argument. However, if F = C then they are instead sesquilinear forms ("sesquilinear" means "one-and-a-half linear"): they are linear in their second argument, but only conjugate linear in their first argument:

    ⟨v + cx, w⟩ = \overline{⟨w, v + cx⟩} = \overline{⟨w, v⟩} + \overline{c} \overline{⟨w, x⟩} = ⟨v, w⟩ + \overline{c}⟨x, w⟩.
Remark 1.3.4 (Why a Complex Conjugate?). Perhaps the only "weird" property in the definition of an inner product is the fact that we require ⟨v, w⟩ = \overline{⟨w, v⟩} rather than the seemingly simpler ⟨v, w⟩ = ⟨w, v⟩. The reason for this strange choice is that if F = C then there does not actually exist any function satisfying ⟨v, w⟩ = ⟨w, v⟩ as well as properties (b) and (c): if there did, then for all v ≠ 0 we would have

    0 < ⟨iv, iv⟩ = i⟨iv, v⟩ = i⟨v, iv⟩ = i^2⟨v, v⟩ = −⟨v, v⟩ < 0,

which is impossible.
To get a bit more comfortable with inner products, we now present several
examples of standard inner products in the various vector spaces that we have
been working with. As always though, keep the real dot product in mind as the
canonical example of an inner product around which we build our intuition.
Just like in R^n, we often denote the inner product (i.e., the dot product) from the above example by v · w instead of ⟨v, w⟩.
Example 1.3.16 (The Frobenius Inner Product). Show that the function ⟨·, ·⟩ : Mm,n(F) × Mm,n(F) → F defined by

    ⟨A, B⟩ = tr(A*B)   for all A, B ∈ Mm,n(F)

is an inner product.

Solution:
Direct computation shows that tr(A*B) = ∑_{i,j} \overline{a_{i,j}} b_{i,j}. In other words, this inner product is what we get if we forget about the shape of A and B and just take their dot product as if they were vectors in F^{mn}. The fact that this is an inner product now follows directly from the fact that the dot product on F^n is an inner product.
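A quick numerical illustration of the fact that ⟨A, B⟩ = tr(A*B) is just the dot product of A and B flattened into vectors (the test matrices are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
B = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))

frobenius = np.trace(A.conj().T @ B)     # <A, B> = tr(A* B)
as_vectors = np.vdot(A, B)               # vdot flattens and conjugates the first argument
print(np.isclose(frobenius, as_vectors)) # True
```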
Example 1.3.17 (An Inner Product on Continuous Functions). Let a < b be real numbers and let C[a, b] be the vector space of continuous functions on the real interval [a, b]. Show that the function ⟨·, ·⟩ : C[a, b] × C[a, b] → R defined by

    ⟨f, g⟩ = ∫_a^b f(x)g(x) dx   for all f, g ∈ C[a, b]

is an inner product. In particular, positive definiteness holds because ⟨f, f⟩ = ∫_a^b f(x)^2 dx ≥ 0, with equality if and only if f(x) = 0 for all x ∈ [a, b] (i.e., f is the zero function).
We can make a bit more sense of the above inner product on C[a, b] if we
think of definite integrals as “continuous sums”. While the dot product v · w on
Rn adds up all values of v j w j for 1 ≤ j ≤ n, this inner product h f , gi on C[a, b]
“adds up” all values of f (x)g(x) for a ≤ x ≤ b (and weighs them appropriately
so that the sum is finite).
All of the inner products that we have seen so far are called the standard
inner products on spaces that they act on. That is, the dot product is the standard
inner product on Rn or Cn , the Frobenius inner product is the standard inner
product on Mm,n , and
$$\langle f, g\rangle = \int_a^b f(x)g(x)\,dx$$

is the standard inner product on C[a, b]. We similarly use P[a, b] and Pᵖ[a, b] to denote the spaces of polynomials (of degree at most p, in the latter case) acting on the real interval [a, b], and we assume that the inner product acting on these spaces is this standard one unless we indicate otherwise.
Inner products can also look quite a bit different from the standard ones
that we have seen so far, however. The following example illustrates how the
same vector space can have multiple different inner products, and at first glance
they might look quite different than the standard inner product.
Example 1.3.18. Show that the function ⟨·, ·⟩ : R² × R² → R defined by

$$\langle v, w\rangle = v_1w_1 + 2v_1w_2 + 2v_2w_1 + 5v_2w_2 \quad \text{for all } v, w \in \mathbb{R}^2$$

is an inner product on R².
Solution:
Properties (a) and (b) of Definition 1.3.6 follow fairly quickly from the
definition of this function, but proving property (c) is somewhat trickier.
To this end, it is helpful to rewrite this function in the form

$$\langle v, w\rangle = (v_1 + 2v_2)(w_1 + 2w_2) + v_2w_2.$$

[Side note: We will show in Theorem 1.4.3 that we can always rewrite inner products in a similar manner so as to make their positive definiteness "obvious".]

It follows that

$$\langle v, v\rangle = (v_1 + 2v_2)^2 + v_2^2 \ge 0,$$

with equality if and only if v₁ + 2v₂ = v₂ = 0, which happens if and only if v₁ = v₂ = 0 (i.e., v = 0), as desired.
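As a hedged numerical check of this rewriting (NumPy assumed; the matrix P below is an illustrative choice, not notation from the text), this inner product equals vᵀ(PᵀP)w with P = [[1, 2], [0, 1]], which makes its positive definiteness visible since ⟨v, v⟩ = ‖Pv‖²:

```python
import numpy as np

P = np.array([[1.0, 2.0],
              [0.0, 1.0]])
A = P.T @ P                      # equals [[1, 2], [2, 5]]

rng = np.random.default_rng(3)
v, w = rng.standard_normal(2), rng.standard_normal(2)

lhs = (v[0] + 2 * v[1]) * (w[0] + 2 * w[1]) + v[1] * w[1]   # the rewritten form
rhs = v @ A @ w                                              # v^T (P^T P) w
print(np.isclose(lhs, rhs), np.isclose(v @ A @ v, np.linalg.norm(P @ v) ** 2))
```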
Recall that a vector space V is not just a set of vectors, but rather it also
includes a particular addition and scalar multiplication operation as part of it.
Similarly, if we have a particular inner product in mind then we typically group
it together with V and call it an inner product space. If there is a possibility
for confusion among different inner products (e.g., because there are multiple
different inner product spaces V and W being used simultaneously) then we
may write them using notation like ⟨·, ·⟩_V or ⟨·, ·⟩_W.
The Norm Induced by the Inner Product
Now that we have inner products to work with, we can define the length of
a vector in a manner that is completely analogous to how we did it with the
dot product in Rn . However, in this setting of general vector spaces, we are a
bit beyond the point of being able to draw a geometric picture of what length
means (for example, the “length” of a matrix does not quite make sense), so
we change terminology slightly and instead call this function a “norm”.
Definition 1.3.7 (Norm Induced by the Inner Product). Suppose that V is an inner product space. Then the norm induced by the inner product is the function ‖·‖ : V → R defined by

$$\|v\| \overset{\text{def}}{=} \sqrt{\langle v, v\rangle} \quad \text{for all } v \in V.$$
When V = Rⁿ or V = Cⁿ and the inner product is just the usual dot product, the norm induced by the inner product is just the usual length of a vector, given by

$$\|v\| = \sqrt{v \cdot v} = \sqrt{|v_1|^2 + |v_2|^2 + \cdots + |v_n|^2}.$$

[Side note: We use the notation ‖·‖ to refer to the standard length on Rⁿ or Cⁿ, but also to refer to the norm induced by whichever inner product we are currently discussing. If there is ever a chance for confusion, we use subscripts to distinguish different norms.]

However, if we change which inner product we are working with, then the norm induced by the inner product changes as well. For example, the norm on R² induced by the weird inner product of Example 1.3.18 has the form

$$\|v\|_* = \sqrt{\langle v, v\rangle} = \sqrt{(v_1 + 2v_2)^2 + v_2^2},$$

which is different from the norm that we are used to (we use the notation ‖·‖_* just to differentiate this norm from the standard length ‖·‖).
The norm induced by the standard (Frobenius) inner product on Mm,n from Example 1.3.16 is given by

$$\|A\|_F = \sqrt{\operatorname{tr}(A^*A)} = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{i,j}|^2}.$$

This norm on matrices is often called the Frobenius norm, and it is usually written as ‖A‖_F rather than just ‖A‖ to avoid confusion with another matrix norm that we will see a bit later in this book. [Side note: The Frobenius norm is also sometimes called the Hilbert–Schmidt norm and denoted by ‖A‖_HS.]
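A brief numerical sketch (NumPy assumed) confirms that the trace formula, the entrywise formula, and NumPy's built-in Frobenius norm all agree:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))

via_trace   = np.sqrt(np.trace(A.T @ A))        # sqrt(tr(A* A))
via_entries = np.sqrt(np.sum(np.abs(A) ** 2))   # sqrt of the sum of |a_ij|^2
built_in    = np.linalg.norm(A, 'fro')          # NumPy's Frobenius norm
print(np.allclose([via_trace, via_entries], built_in))  # True
```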
Perhaps not surprisingly, the norm induced by an inner product satisfies the
same basic properties as the length of a vector in Rn . The next few theorems
are devoted to establishing these properties.
Proof. Both of these properties follow fairly quickly from the definition. For property (a), we compute

$$\|cv\| = \sqrt{\langle cv, cv\rangle} = \sqrt{c\langle cv, v\rangle} = \sqrt{c\overline{\langle v, cv\rangle}} = \sqrt{c\overline{c}\langle v, v\rangle} = \sqrt{|c|^2\|v\|^2} = |c|\,\|v\|.$$

Property (b) follows immediately from the property of inner products that says that ⟨v, v⟩ ≥ 0, with equality if and only if v = 0.
The above theorem tells us that it makes sense to break vectors into their "length" and "direction", just like we do with vectors in Rⁿ. Specifically, we can write every vector v ∈ V in the form v = ‖v‖u, where u ∈ V has ‖u‖ = 1 (so we call u a unit vector). In particular, if v ≠ 0 then we can choose u = v/‖v‖, which we think of as encoding the direction of v.
The two other main properties that vector length in Rn satisfies are the
Cauchy–Schwarz inequality and the triangle inequality. We now show that
these same properties hold for the norm induced by any inner product.
Rearranging and taking the square root of both sides gives us |⟨v, w⟩| ≤ ‖v‖‖w‖, which is exactly the Cauchy–Schwarz inequality.
To see that equality holds if and only if {v, w} is a linearly dependent set, suppose that |⟨v, w⟩| = ‖v‖‖w‖. We can then follow the above proof backward to see that 0 = ‖cv + dw‖², so cv + dw = 0 (where c = ‖w‖ ≠ 0), so {v, w} is linearly dependent. In the opposite direction, if {v, w} is linearly dependent then either v = 0 (in which case equality clearly holds in the Cauchy–Schwarz inequality since both sides equal 0) or w = cv for some c ∈ F. Then

$$|\langle v, w\rangle| = |\langle v, cv\rangle| = |c|\,\|v\|^2 = \|v\|\,\|cv\| = \|v\|\,\|w\|,$$

so equality holds.
If we apply the Cauchy–Schwarz inequality to the Frobenius inner product on Mm,n then it says that |tr(A*B)|² ≤ tr(A*A) tr(B*B) for all A, B ∈ Mm,n, and if we apply it to the standard inner product on C[a, b] then it says that
$$\left(\int_a^b f(x)g(x)\,dx\right)^2 \le \left(\int_a^b f(x)^2\,dx\right)\left(\int_a^b g(x)^2\,dx\right) \quad \text{for all } f, g \in C[a, b].$$
These examples illustrate the utility of thinking abstractly about vector spaces.
These matrix and integral inequalities are tricky to prove directly from prop-
erties of the trace and integrals, but follow straightforwardly when we forget
about the fine details and only think about vector space properties.
Just as was the case in Rn , the triangle inequality now follows very quickly
from the Cauchy–Schwarz inequality.
We can then take the square root of both sides of this inequality to see that ‖v + w‖ ≤ ‖v‖ + ‖w‖, as desired.
The above argument demonstrates that equality holds in the triangle inequality if and only if Re(⟨v, w⟩) = |⟨v, w⟩| = ‖v‖‖w‖. We know from the Cauchy–Schwarz inequality that the second of these equalities holds if and only if {v, w} is linearly dependent (i.e., v = 0 or w = cv for some c ∈ F). In this case, the first equality holds if and only if v = 0 or Re(⟨v, cv⟩) = |⟨v, cv⟩|. Well, Re(⟨v, cv⟩) = Re(c)‖v‖² and |⟨v, cv⟩| = |c|‖v‖², so we see that equality holds in the triangle inequality if and only if Re(c) = |c| (i.e., 0 ≤ c ∈ R).
It is worth noting that isomorphisms in general do not preserve inner products, since inner products cannot be derived from only scalar multiplication and vector addition (which isomorphisms do preserve). [Side note: The fact that inner products cannot be expressed in terms of vector addition and scalar multiplication means that they give us some extra structure that vector spaces alone do not have.] For example, the isomorphism T : R² → R² defined by T(v) = (v₁ + 2v₂, v₂) has the property that

$$T(v) \cdot T(w) = (v_1 + 2v_2, v_2) \cdot (w_1 + 2w_2, w_2) = v_1w_1 + 2v_1w_2 + 2v_2w_1 + 5v_2w_2.$$

In other words, T turns the usual dot product on R² into the weird inner product from Example 1.3.18.
However, even though isomorphisms do not preserve inner products, they
at least do always convert one inner product into another one. That is, if V
and W are vector spaces, ⟨·, ·⟩_W is an inner product on W, and T : V → W is an isomorphism, then we can define an inner product on V via ⟨v₁, v₂⟩_V = ⟨T(v₁), T(v₂)⟩_W (see Exercise 1.3.25).
1.3.2 Determine which of the following functions are and are not linear forms.
∗(a) The function f : Rⁿ → R defined by f(v) = ‖v‖.
(b) The function f : Fⁿ → F defined by f(v) = v₁.
∗(c) The function f : M₂ → M₂ defined by
$$f\!\left(\begin{bmatrix} a & b \\ c & d\end{bmatrix}\right) = \begin{bmatrix} c & a \\ d & b\end{bmatrix}.$$
(d) The determinant of a matrix (i.e., the function det : Mₙ → F).
∗(e) The function g : P → R defined by g(f) = f′(3), where f′ is the derivative of f.
(f) The function g : C → R defined by g(f) = cos(f(0)).

1.3.3 Determine which of the following functions are inner products on the indicated vector space.
(c) On R², the function ⟨v, w⟩ = 3v₁w₁ + v₁w₂ + v₂w₁ + 3v₂w₂.
∗(d) On Mₙ, the function ⟨A, B⟩ = tr(A* + B).
(e) On P², the function ⟨ax² + bx + c, dx² + ex + f⟩ = ad + be + cf.
∗(f) On C[−1, 1], the function
$$\langle f, g\rangle = \int_{-1}^{1} \frac{f(x)g(x)}{\sqrt{1 - x^2}}\,dx.$$

1.3.4 Determine which of the following statements are true and which are false.
∗(a) If T : V → W is an isomorphism then so is T −1 :
W → V.
(b) Rn is isomorphic to Cn .
∗(c) Two vector spaces V and W over the same field are
isomorphic if and only if dim(V) = dim(W).
(d) If w ∈ Rⁿ then the function f_w : Rⁿ → R defined by f_w(v) = v · w for all v ∈ Rⁿ is a linear form.
∗(e) If w ∈ Cⁿ then the function f_w : Cⁿ → C defined by f_w(v) = v · w for all v ∈ Cⁿ is a linear form.
(f) If A ∈ Mₙ(R) is invertible then so is the bilinear form f(v, w) = vᵀAw.
∗(g) If E_x : P² → R is the evaluation map defined by E_x(f) = f(x) then E₁ + E₂ = E₃.

∗∗1.3.6 Suppose V, W, and X are vector spaces and T : V → W and S : W → X are isomorphisms.
(a) Show that T⁻¹ : W → V is an isomorphism.
(b) Show that S ∘ T : V → X is an isomorphism.

∗∗1.3.7 Show that for every linear form f : Mm,n(F) → F, there exists a matrix A ∈ Mm,n(F) such that f(X) = tr(AᵀX) for all X ∈ Mm,n(F).

∗∗1.3.8 Show that the result of Exercise 1.3.7 still holds if we replace every instance of Mm,n(F) by Mₙ^S(F) or every instance of Mm,n(F) by Mₙ^H (and set F = R in this latter case).

∗∗1.3.9 Suppose V is an inner product space. Show that ⟨v, 0⟩ = 0 for all v ∈ V.

(a) Show that if the ground field is R then
$$\langle v, w\rangle = \tfrac{1}{4}\big(\|v + w\|^2 - \|v - w\|^2\big) \quad \text{for all } v, w \in V.$$
(b) Show that if the ground field is C then
$$\langle v, w\rangle = \frac{1}{4}\sum_{k=0}^{3} i^k\,\|v + i^k w\|^2.$$

1.3.15 Let V be an inner product space and let ‖·‖ be the norm induced by V's inner product.
(a) Show that if the ground field is R then
$$\langle v, w\rangle = \|v\|\,\|w\|\left(1 - \frac{1}{2}\left\|\frac{v}{\|v\|} - \frac{w}{\|w\|}\right\|^2\right) \quad \text{for all } v, w \in V.$$
[Side note: This representation of the inner product implies the Cauchy–Schwarz inequality as an immediate corollary.]
(b) Show that if the ground field is C then
$$\langle v, w\rangle = \|v\|\,\|w\|\left(1 - \frac{1}{2}\left\|\frac{v}{\|v\|} - \frac{w}{\|w\|}\right\|^2 + i - \frac{i}{2}\left\|\frac{v}{\|v\|} - \frac{iw}{\|w\|}\right\|^2\right).$$
(b) Suppose W = V and C = B. Show that f is conjugate symmetric (i.e., $f(v, w) = \overline{f(w, v)}$ for all v ∈ V and w ∈ W) if and only if the matrix A from part (a) is Hermitian.
(c) Suppose W = V and C = B. Show that f is an inner product if and only if the matrix A from part (a) is Hermitian and satisfies v*Av ≥ 0 for all v ∈ Cᵐ with equality if and only if v = 0.
[Side note: A matrix A with these properties is called positive definite, and we explore such matrices in Section 2.2.]

1.3.19 Suppose V is a vector space with basis B = {v₁, ..., vₙ}. Define linear functionals f₁, ..., fₙ ∈ V* by
$$f_i(v_j) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.}\end{cases}$$
Show that the set B* = {f₁, ..., fₙ} is a basis of V*.
[Side note: B* is called the dual basis of B.]

1.3.20 Let c₀₀ be the vector space of eventually-zero real sequences from Example 1.1.10. Show that c₀₀* ≅ R^N.
[Side note: This example shows that the dual of an infinite-dimensional vector space can be much larger than the original vector space. For example, c₀₀ has dimension equal to the cardinality of N, but this exercise shows that c₀₀* has dimension at least as large as the cardinality of R. See also Exercise 1.3.23.]

∗∗1.3.21 Let E_x : Pᵖ → R be the evaluation map defined by E_x(f) = f(x) (see Example 1.3.5) and let E = {1, x, x², ..., xᵖ} be the standard basis of Pᵖ. Find a polynomial g_x ∈ Pᵖ such that E_x(f) = [g_x]_E · [f]_E for all f ∈ Pᵖ.

∗∗1.3.22 Let E_x : Pᵖ → R be the evaluation map defined by E_x(f) = f(x). Show that if c₀, c₁, ..., c_p ∈ R are distinct then {E_{c₀}, E_{c₁}, ..., E_{c_p}} is a basis of (Pᵖ)*.

1.3.23 Let E_x : P → R be the evaluation map defined by E_x(f) = f(x). Show that the set {E_x : x ∈ R} is linearly independent.

∗∗1.3.24 Let T : V → V** be the canonical double-dual isomorphism described by Theorem 1.3.4. Complete the proof of that theorem by showing that if B = {v₁, v₂, ..., vₙ} ⊆ V is linearly independent then so is C = {T(v₁), T(v₂), ..., T(vₙ)} ⊆ V**.

∗∗1.3.25 Suppose that V and W are vector spaces over a field F, ⟨·, ·⟩_W is an inner product on W, and T : V → W is an isomorphism. Show that the function ⟨·, ·⟩_V : V × V → F defined by ⟨v₁, v₂⟩_V = ⟨T(v₁), T(v₂)⟩_W is an inner product on V.

1.3.26 Suppose V is a vector space over a field F, f : V × ··· × V → F is a multilinear form, and T₁, ..., Tₙ : V → V are linear transformations. Show that the function g : V × ··· × V → F defined by g(v₁, ..., vₙ) = f(T₁(v₁), ..., Tₙ(vₙ)) is a multilinear form.

∗∗1.3.27 An array A is called antisymmetric if swapping any two of its indices swaps the sign of its entries (i.e., a_{j₁,...,j_k,...,j_ℓ,...,j_p} = −a_{j₁,...,j_ℓ,...,j_k,...,j_p} for all k ≠ ℓ). Show that for each p ≥ 2 there is, up to scaling, only one p × p × ··· × p (p times) antisymmetric array.
[Side note: This array corresponds to the determinant in the sense of Theorem 1.3.6.]
[Hint: Make use of permutations (see Appendix A.1.5) or properties of the determinant.]
1.4 Orthogonality and Adjoints

Now that we know how to generalize the dot product from Rⁿ to other vector
spaces (via inner products), it is worth revisiting the concept of orthogonality.
Recall that two vectors v, w ∈ Rn are orthogonal when v · w = 0. We define
orthogonality in general inner product spaces completely analogously:
Definition 1.4.1 (Orthogonality). Suppose V is an inner product space. Two vectors v, w ∈ V are called orthogonal if ⟨v, w⟩ = 0.
$$c_1v_1 + c_2v_2 + \cdots + c_kv_k = 0. \tag{1.4.1}$$

Taking the inner product of v₁ with both sides of Equation (1.4.1) and using mutual orthogonality shows that c₁‖v₁‖² = ⟨v₁, 0⟩ = 0. Since all of the vectors in B are non-zero we know that ‖v₁‖ ≠ 0, so this implies c₁ = 0.
A similar computation involving ⟨v₂, 0⟩ shows that c₂ = 0, and so on up to ⟨v_k, 0⟩ showing that c_k = 0, so we conclude c₁ = c₂ = ··· = c_k = 0 and thus B is linearly independent.
Example 1.4.1 (Checking Mutual Orthogonality of Polynomials). Show that the set B = {1, x, 2x² − 1} ⊂ P²[−1, 1] is mutually orthogonal with respect to the inner product

$$\langle f, g\rangle = \int_{-1}^{1} \frac{f(x)g(x)}{\sqrt{1 - x^2}}\,dx.$$

[Side note: Recall that P²[−1, 1] is the vector space of polynomials of degree at most 2 acting on the interval [−1, 1].]

Solution:
We explicitly compute all 3 possible inner products between these polynomials and find that each of them equals 0.
When combined with Theorem 1.4.1, the above example shows that the set
{1, x, 2x2 − 1} is linearly independent in P 2 [−1, 1]. It is worth observing that a
set of vectors may be mutually orthogonal with respect to one inner product but
not another—all that is needed to show linear independence in this way is that
it is mutually orthogonal with respect to at least one inner product. Conversely,
for every linearly independent set there is some inner product with respect to
which it is mutually orthogonal, at least in the finite-dimensional case (see
Exercise 1.4.25).
Definition 1.4.2 (Orthonormal Bases). Suppose V is an inner product space with basis B ⊂ V. We say that B is an orthonormal basis of V if
a) ⟨v, w⟩ = 0 for all v ≠ w ∈ B, and (mutual orthogonality)
b) ‖v‖ = 1 for all v ∈ B. (normalization)
[Side note: The notation |B| means the number of vectors in B.]

Proof. Recall that Theorem 1.4.1 tells us that if B is mutually orthogonal and its members are unit vectors (and thus non-zero) then B is linearly independent. We then make use of Exercise 1.2.27(a), which tells us that a set with dim(V) vectors is a basis if and only if it is linearly independent.
For example, performing the integration indicated above shows that ⟨1, x⟩ = (1²)/2 − (0²)/2 = 1/2 ≠ 0, so the polynomials 1 and x are not orthogonal in this inner product.
Example 1.4.2 (An Orthonormal Basis of Polynomials). Construct an orthonormal basis of P²[−1, 1] with respect to the inner product

$$\langle f, g\rangle = \int_{-1}^{1} \frac{f(x)g(x)}{\sqrt{1 - x^2}}\,dx.$$

Solution:
We already showed that the set B = {1, x, 2x² − 1} is mutually orthogonal with respect to this inner product in Example 1.4.1. To turn this set into an orthonormal basis, we just normalize these polynomials (i.e., divide them by their norms):

$$\|1\| = \sqrt{\langle 1, 1\rangle} = \sqrt{\int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}}\,dx} = \sqrt{\pi},$$
$$\|x\| = \sqrt{\langle x, x\rangle} = \sqrt{\int_{-1}^{1} \frac{x^2}{\sqrt{1 - x^2}}\,dx} = \sqrt{\frac{\pi}{2}}, \quad\text{and}$$
$$\|2x^2 - 1\| = \sqrt{\langle 2x^2 - 1, 2x^2 - 1\rangle} = \sqrt{\int_{-1}^{1} \frac{(2x^2 - 1)^2}{\sqrt{1 - x^2}}\,dx} = \sqrt{\frac{\pi}{2}}.$$

[Side note: Be careful here: we might guess that ‖1‖ = 1, but this is not true. We have to go through the computation with the indicated inner product. These integrals can be evaluated by substituting x = sin(u).]

It follows that the set C = {1/√π, √2 x/√π, √2(2x² − 1)/√π} is a mutually orthogonal set of normalized vectors. Since C consists of dim(P²) = 3 vectors, we know from Theorem 1.4.2 that it is an orthonormal basis of P²[−1, 1] (with respect to this inner product).
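The sketch below is a hedged numerical check of this example (SciPy assumed available); the substitution x = cos(t) is used only to avoid the endpoint singularity of the weight 1/√(1 − x²):

```python
import numpy as np
from scipy.integrate import quad

basis = [lambda x: 1 / np.sqrt(np.pi),
         lambda x: np.sqrt(2) * x / np.sqrt(np.pi),
         lambda x: np.sqrt(2) * (2 * x**2 - 1) / np.sqrt(np.pi)]

def inner(f, g):
    # <f, g> = int_{-1}^{1} f(x)g(x)/sqrt(1-x^2) dx = int_0^pi f(cos t) g(cos t) dt
    val, _ = quad(lambda t: f(np.cos(t)) * g(np.cos(t)), 0, np.pi)
    return val

gram = np.array([[inner(f, g) for g in basis] for f in basis])
print(np.allclose(gram, np.eye(3), atol=1e-8))  # True: the basis is orthonormal
```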
Theorem 1.4.3 says that the same is true if we replace E by any orthonormal
basis of Mm,n . However, for some inner products it is not quite so obvious how
to find an orthonormal basis and thus represent it as a dot product.
Example 1.4.4 (A Polynomial Inner Product as the Dot Product). Find a basis B of P¹[0, 1] (with the standard inner product) with the property that ⟨f, g⟩ = [f]_B · [g]_B for all f, g ∈ P¹[0, 1].

Solution:
First, we recall that the standard inner product on P¹[0, 1] is

$$\langle f, g\rangle = \int_0^1 f(x)g(x)\,dx.$$
The following corollary shows that we can similarly think of the norm in-
duced by any inner product as “the same” as the usual vector length on Rn or Cn :
Figure 1.11: The length of v = (1, 2) is ‖v‖ = √(1² + 2²) = √5. On the other hand, ‖v‖_* measures the length of v when it is represented in the basis B = {(1, 0), (−2, 1)}: ‖v‖_* = √((1 + 4)² + 2²) = √29 and ‖[v]_B‖ = √(5² + 2²) = √29.
that shows that finding coordinate vectors with respect to orthonormal bases is
trivial. For example, recall that if B = {e1 , e2 , . . . , en } is the standard basis of
Rn then, for each 1 ≤ j ≤ n, the j-th coordinate of a vector v ∈ Rn is simply
e j · v = v j . The following theorem says that coordinates in any orthonormal
basis can similarly be found simply by computing inner products (instead of
solving a linear system, like we have to do to find coordinate vectors with
respect to general bases).
Proof. We simply make use of Theorem 1.4.3 to represent the inner product as the dot product of coordinate vectors. In particular, we recall that [u_j]_B = e_j for all 1 ≤ j ≤ n and then compute

$$\big(\langle u_1, v\rangle, \langle u_2, v\rangle, \dots, \langle u_n, v\rangle\big) = \big([u_1]_B \cdot [v]_B,\; [u_2]_B \cdot [v]_B,\; \dots,\; [u_n]_B \cdot [v]_B\big) = \big(e_1 \cdot [v]_B,\; e_2 \cdot [v]_B,\; \dots,\; e_n \cdot [v]_B\big),$$

which equals [v]_B since e₁ · [v]_B is the first entry of [v]_B, e₂ · [v]_B is its second entry, and so on.

[Side note: [u_j]_B = e_j simply because u_j = 0u₁ + ··· + 1u_j + ··· + 0uₙ, and sticking the coefficients of that linear combination in a vector gives e_j.]
The above theorem can be interpreted as telling us that the inner product
between v ∈ V and a unit vector u ∈ V measures how far v points in the
direction of u (see Figure 1.12).
Example 1.4.5 (A Coordinate Vector with Respect to the Pauli Basis). Compute [A]_B (the coordinate vector of A ∈ M₂(C) with respect to B) if

$$B = \left\{ \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 0 \\ 0 & 1\end{bmatrix},\; \frac{1}{\sqrt{2}}\begin{bmatrix} 0 & 1 \\ 1 & 0\end{bmatrix},\; \frac{1}{\sqrt{2}}\begin{bmatrix} 0 & -i \\ i & 0\end{bmatrix},\; \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 0 \\ 0 & -1\end{bmatrix} \right\} \quad\text{and}\quad A = \begin{bmatrix} 5 & 2 - 3i \\ 2 + 3i & -3 \end{bmatrix}.$$

Solution:
We already showed that B is an orthonormal basis of M₂(C) (with respect to the Frobenius inner product) in Example 1.4.3, so we can compute the entries of [A]_B simply by taking the inner product of A with each member of B, as sketched below.
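The following is a hedged numerical sketch of that computation (NumPy assumed), not the book's worked solution: each coordinate is the Frobenius inner product ⟨u_j, A⟩ = tr(u_j*A) with a normalized Pauli matrix.

```python
import numpy as np

s = 1 / np.sqrt(2)
basis = [s * np.array([[1, 0], [0, 1]], dtype=complex),
         s * np.array([[0, 1], [1, 0]], dtype=complex),
         s * np.array([[0, -1j], [1j, 0]]),
         s * np.array([[1, 0], [0, -1]], dtype=complex)]
A = np.array([[5, 2 - 3j], [2 + 3j, -3]], dtype=complex)

coords = np.array([np.trace(U.conj().T @ A) for U in basis])  # <u_j, A> = tr(u_j* A)
print(np.round(coords, 10))                                   # the coordinate vector [A]_B
print(np.allclose(sum(c * U for c, U in zip(coords, basis)), A))  # reconstructs A
```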
Figure 1.13: An illustration of our method for turning any basis of R2 into an or-
thonormal basis of R2 . The process works by (b) normalizing one of the vectors,
(c) moving the other vector so that they are orthogonal, and then (d) normalizing
the second vector. In higher dimensions, the process continues in the same way
by repositioning the vectors one at a time so that they are orthogonal to the rest,
and then normalizing.
Proof of Theorem 1.4.6. We prove this result by induction on j. For the base
j = 1 case, we simply note that u1 is indeed a unit vector and span(u1 ) =
span(v1 ) since u1 and v1 are scalar multiples of each other.
For the inductive step, suppose that for some particular j we know that
{u1 , u2 , . . . , u j } is a mutually orthogonal set of unit vectors and
$$\operatorname{span}(u_1, u_2, \dots, u_j) = \operatorname{span}(v_1, v_2, \dots, v_j). \tag{1.4.2}$$

To see that u_{j+1} is orthogonal to each of u₁, ..., u_j, we let 1 ≤ k ≤ j and compute

$$\langle u_k, u_{j+1}\rangle = \left\langle u_k, \frac{v_{j+1} - \sum_{i=1}^{j}\langle u_i, v_{j+1}\rangle u_i}{\big\|v_{j+1} - \sum_{i=1}^{j}\langle u_i, v_{j+1}\rangle u_i\big\|}\right\rangle \qquad \text{(definition of } u_{j+1})$$
$$= \frac{\langle u_k, v_{j+1}\rangle - \sum_{i=1}^{j}\langle u_i, v_{j+1}\rangle\langle u_k, u_i\rangle}{\big\|v_{j+1} - \sum_{i=1}^{j}\langle u_i, v_{j+1}\rangle u_i\big\|} \qquad \text{(expand the inner product)}$$
$$= \frac{\langle u_k, v_{j+1}\rangle - \langle u_k, v_{j+1}\rangle}{\big\|v_{j+1} - \sum_{i=1}^{j}\langle u_i, v_{j+1}\rangle u_i\big\|} = 0, \qquad \text{(since } \langle u_k, u_i\rangle = 0 \text{ if } i \neq k \text{ and } \langle u_k, u_k\rangle = 1)$$
By rearranging the definition of u_{j+1}, we see that v_{j+1} ∈ span(u₁, u₂, ..., u_{j+1}). When we combine this fact with Equation (1.4.2), this implies

$$\operatorname{span}(u_1, u_2, \dots, u_{j+1}) \supseteq \operatorname{span}(v_1, v_2, \dots, v_{j+1}).$$

The v_i's are linearly independent, so the span on the right has dimension j + 1. Similarly, the u_i's are linearly independent (they are mutually orthogonal, so linear independence follows from Theorem 1.4.1), so the span on the left also has dimension j + 1, and thus the two spans must in fact be equal.

[Side note: The fact that the only (j + 1)-dimensional subspace of a (j + 1)-dimensional vector space is that vector space itself is hopefully intuitive enough, but it was proved explicitly in Exercise 1.2.31.]
Since finite-dimensional inner product spaces (by definition) have a basis
consisting of finitely many vectors, and Theorem 1.4.6 tells us how to convert
any such basis into one that is orthonormal, we now know that every finite-
dimensional inner product space has an orthonormal basis:
Corollary 1.4.7 (Existence of Orthonormal Bases). Every finite-dimensional inner product space has an orthonormal basis.

We now illustrate how to use the Gram–Schmidt process to find orthonormal bases of various inner product spaces.
Example 1.4.7 (Finding an Orthonormal Basis of a Plane). Find an orthonormal basis (with respect to the usual dot product) of the plane S ⊂ R³ with equation x − y − 2z = 0.

Solution:
We start by picking any basis of S. Since S is 2-dimensional, a basis is made up of any two vectors in S that are not multiples of each other. By inspection, v₁ = (2, 0, 1) and v₂ = (3, 1, 1) are vectors that work, so we choose B = {v₁, v₂}. [Side note: v₁ and v₂ can be found by choosing x and y arbitrarily and using the equation x − y − 2z = 0 to solve for z.]

To create an orthonormal basis from B, we apply the Gram–Schmidt process: we define

$$u_1 = \frac{v_1}{\|v_1\|} = \frac{1}{\sqrt{5}}(2, 0, 1)$$

and

$$w_2 = v_2 - (u_1 \cdot v_2)u_1 = (3, 1, 1) - \frac{7}{5}(2, 0, 1) = \frac{1}{5}(1, 5, -2), \qquad u_2 = \frac{w_2}{\|w_2\|} = \frac{1}{\sqrt{30}}(1, 5, -2).$$

It follows that C = {u₁, u₂} = {(1/√5)(2, 0, 1), (1/√30)(1, 5, −2)} is an orthonormal basis of S, as displayed below:

B = {v₁, v₂} = {(2, 0, 1), (3, 1, 1)}   and   C = {u₁, u₂} = {(1/√5)(2, 0, 1), (1/√30)(1, 5, −2)}.

[Side note: Just like other bases, orthonormal bases are very non-unique. There are many other orthonormal bases of S.]
[Figure: the original basis B = {v₁, v₂} and the orthonormal basis C = {u₁, u₂} plotted in R³.]
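Before moving on, here is a minimal sketch of the Gram–Schmidt procedure just used, written for vectors in Rⁿ with the usual dot product (NumPy assumed; the function name `gram_schmidt` is illustrative):

```python
import numpy as np

def gram_schmidt(vectors):
    ortho = []
    for v in vectors:
        w = v - sum(np.dot(u, v) * u for u in ortho)  # subtract projections onto earlier u's
        ortho.append(w / np.linalg.norm(w))           # normalize
    return ortho

u1, u2 = gram_schmidt([np.array([2.0, 0.0, 1.0]), np.array([3.0, 1.0, 1.0])])
print(u1, u2)                           # approx (2,0,1)/sqrt(5) and (1,5,-2)/sqrt(30)
print(np.isclose(np.dot(u1, u2), 0.0))  # the output vectors are orthogonal
```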
Example 1.4.8 (Finding an Orthonormal Basis of Polynomials). Find an orthonormal basis of P²[0, 1] with respect to the inner product

$$\langle f, g\rangle = \int_0^1 f(x)g(x)\,dx.$$

Solution:
Once again, we apply the Gram–Schmidt process to the standard basis B = {1, x, x²} to create an orthonormal basis C = {h₁, h₂, h₃}. [Side note: Recall that the standard basis is not orthonormal in this inner product since, for example, ⟨1, x⟩ = 1/2.] To start, we define h₁(x) = 1/‖1‖ = 1. The next member of the orthonormal basis is computed via

$$g_2(x) = x - \langle h_1, x\rangle h_1(x) = x - \langle 1, x\rangle 1 = x - 1/2,$$
$$h_2(x) = g_2(x)/\|g_2\| = (x - 1/2)\Big/\sqrt{\int_0^1 (x - 1/2)^2\,dx} = \sqrt{3}(2x - 1).$$

[Side note: Notice that {h₁, h₂} is exactly the orthonormal basis of P¹[0, 1] that we constructed back in Example 1.4.4. We were doing the Gram–Schmidt process back there without realizing it.]

The last member of the orthonormal basis C is similarly computed via

$$g_3(x) = x^2 - \langle h_1, x^2\rangle h_1(x) - \langle h_2, x^2\rangle h_2(x) = x^2 - \langle 1, x^2\rangle 1 - 12\langle x - 1/2, x^2\rangle(x - 1/2) = x^2 - 1/3 - (x - 1/2) = x^2 - x + 1/6, \quad\text{and}$$
$$h_3(x) = g_3(x)/\|g_3\| = (x^2 - x + 1/6)\Big/\sqrt{\int_0^1 (x^2 - x + 1/6)^2\,dx} = \sqrt{5}(6x^2 - 6x + 1).$$

It follows that C = {1, √3(2x − 1), √5(6x² − 6x + 1)} is an orthonormal basis of P²[0, 1]. While this basis looks a fair bit uglier than the standard basis {1, x, x²} algebraically, its members are more symmetric about the midpoint x = 1/2 and more evenly distributed across the interval [0, 1] geometrically, as shown below:
[Figure: plots of the standard basis B = {1, x, x²} and the orthonormal basis C = {h₁, h₂, h₃} on the interval [0, 1].]

[Side note: However, we should not expect to be able to directly "see" whether or not a basis of P²[0, 1] (or any of its variants) is orthonormal.]
Definition 1.4.3 (Adjoint Transformation). Suppose that V and W are inner product spaces and T : V → W is a linear transformation. Then a linear transformation T* : W → V is called the adjoint of T if

$$\langle T(v), w\rangle = \langle v, T^*(w)\rangle \quad \text{for all } v \in V \text{ and } w \in W.$$

[Side note: The ground field must be R or C in order for inner products, and thus adjoints, to make sense.]

For example, if A ∈ Mm,n(F) then (Av) · w = v · (A*w) for all v and w, so the conjugate transpose matrix A* is the adjoint of A. However, it also makes sense to talk about the adjoint of linear transformations between more exotic vector spaces, as we now demonstrate with the trace (which we recall from Example 1.2.7 is the linear transformation tr : Mₙ(F) → F that adds up the diagonal entries of a matrix).
Example 1.4.9 (The Adjoint of the Trace). Show that the adjoint of the trace tr : Mₙ(F) → F with respect to the standard Frobenius inner product is given by

$$\operatorname{tr}^*(c) = cI \quad \text{for all } c \in F.$$

Solution:
Our goal is to show that ⟨c, tr(A)⟩ = ⟨cI, A⟩ for all A ∈ Mₙ(F) and all c ∈ F. [Side note: In the equation ⟨c, tr(A)⟩ = ⟨cI, A⟩, the left inner product is on F (i.e., it is the 1-dimensional dot product $\langle c, \operatorname{tr}(A)\rangle = \overline{c}\operatorname{tr}(A)$) and the right inner product is the Frobenius inner product on Mₙ(F).] Recall that the Frobenius inner product is defined by ⟨A, B⟩ = tr(A*B), so this condition is equivalent to

$$\overline{c}\operatorname{tr}(A) = \operatorname{tr}\big((cI)^*A\big) \quad \text{for all } A \in M_n(F),\; c \in F.$$

This equation holds simply by virtue of linearity of the trace:

$$\operatorname{tr}\big((cI)^*A\big) = \operatorname{tr}\big(\overline{c}I^*A\big) = \overline{c}\operatorname{tr}(IA) = \overline{c}\operatorname{tr}(A),$$

as desired.
to us? Second, how do we know that there is not another linear transformation
that is also an adjoint of the trace? That is, how do we know that tr∗ (c) = cI is
the adjoint of the trace rather than just an adjoint of it?
The following theorem answers both of these questions by showing that, in
finite dimensions, every linear transformation has exactly one adjoint, and it
can be computed by making use of orthonormal bases of the two vector spaces
V and W.
Theorem 1.4.8 (Existence and Uniqueness of the Adjoint). Suppose that V and W are finite-dimensional inner product spaces with orthonormal bases B and C, respectively. If T : V → W is a linear transformation then there exists a unique adjoint transformation T* : W → V, and its standard matrix satisfies

$$\big[T^*\big]_{B\leftarrow C} = [T]^*_{C\leftarrow B}.$$
[Side note: In infinite dimensions, some linear transformations fail to have an adjoint (see Remark 1.4.1). However, if it exists then it is still unique (see Exercise 1.4.23).]

Proof. To prove uniqueness of T*, suppose that T* exists, let v ∈ V and w ∈ W, and compute ⟨T(v), w⟩ in two different ways:

$$\langle T(v), w\rangle = \langle v, T^*(w)\rangle \qquad \text{(definition of } T^*)$$
$$= [v]_B \cdot \big[T^*(w)\big]_B \qquad \text{(Theorem 1.4.3)}$$
$$= [v]_B \cdot \big(\big[T^*\big]_{B\leftarrow C}[w]_C\big) \qquad \text{(definition of standard matrix)}$$
$$= [v]^*_B \big[T^*\big]_{B\leftarrow C}[w]_C. \qquad \text{(definition of dot product)}$$

Similarly,
We thus might expect that if we equip P² with the standard inner product

$$\langle f, g\rangle = \int_0^1 f(x)g(x)\,dx$$
To fix the above problem and find the actual adjoint of D, we must mimic
the above calculation with an orthonormal basis of P 2 rather than the standard
basis E = {1, x, x2 }.
Example 1.4.10 (The Adjoint of the Derivative). Compute the adjoint of the differentiation map D : P²[0, 1] → P²[0, 1] (with respect to the standard inner product).

Solution:
Fortunately, we already computed an orthonormal basis of P²[0, 1] back in Example 1.4.8, and it is C = {h₁, h₂, h₃}, where

$$h_1(x) = 1, \quad h_2(x) = \sqrt{3}(2x - 1), \quad\text{and}\quad h_3(x) = \sqrt{5}(6x^2 - 6x + 1).$$

Differentiating these basis polynomials gives

$$D(h_1(x)) = 0, \qquad D(h_2(x)) = 2\sqrt{3} = 2\sqrt{3}\,h_1(x), \qquad\text{and}\qquad D(h_3(x)) = 12\sqrt{5}x - 6\sqrt{5} = 2\sqrt{15}\,h_2(x).$$
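A hedged sketch of how these computations produce the adjoint (NumPy assumed): in the orthonormal basis C, the matrix of D has entries ⟨h_i, D(h_j)⟩, and the matrix of D* is simply its transpose since the ground field is R.

```python
import numpy as np

# Columns record D(h_j) in the basis {h1, h2, h3}: D(h2) = 2*sqrt(3) h1, D(h3) = 2*sqrt(15) h2.
D_matrix = np.array([[0.0, 2 * np.sqrt(3), 0.0],
                     [0.0, 0.0,            2 * np.sqrt(15)],
                     [0.0, 0.0,            0.0]])
D_adjoint = D_matrix.T   # [D*]_C = [D]_C^T over the real field

# For example, the first column says D*(h1) = 2*sqrt(3) h2.
print(D_adjoint[:, 0])
```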
Example 1.4.11 (The Adjoint of the Transpose). Show that the adjoint of the transposition map T : Mm,n → Mn,m, with respect to the Frobenius inner product, is also the transposition map.

Solution:
Our goal is to show that ⟨Aᵀ, B⟩ = ⟨A, Bᵀ⟩ for all A ∈ Mm,n and B ∈ Mn,m. Recall that the Frobenius inner product is defined by ⟨A, B⟩ = tr(A*B), so this is equivalent to

$$\operatorname{tr}\big(\overline{A}B\big) = \operatorname{tr}\big(A^*B^T\big) \quad \text{for all } A \in M_{m,n},\; B \in M_{n,m}.$$

[Side note: Recall that $\overline{A} = (A^T)^*$ is the entrywise complex conjugate of A.]

These two quantities can be shown to be equal by brute-force calculation of the traces and matrix multiplications in terms of the entries of A and B, but a more elegant way is to use properties of the trace and transpose:

$$\operatorname{tr}\big(\overline{A}B\big) = \operatorname{tr}\big((\overline{A}B)^T\big) \qquad \text{(transpose does not change trace)}$$
$$= \operatorname{tr}\big(B^TA^*\big) \qquad \text{(transpose of a product)}$$
$$= \operatorname{tr}\big(A^*B^T\big). \qquad \text{(cyclic commutativity of trace)}$$
Remark 1.4.1 (The Adjoint in Infinite Dimensions). The reason that Theorem 1.4.8 specifies that the vector spaces must be finite-dimensional is that some linear transformations acting on infinite-dimensional vector spaces do not have an adjoint. To demonstrate this phenomenon, consider the vector space c₀₀ of all eventually-zero sequences of real numbers (which we first introduced in Example 1.1.10), together with the inner product

$$\big\langle (v_1, v_2, v_3, \dots), (w_1, w_2, w_3, \dots)\big\rangle = \sum_{i=1}^{\infty} v_iw_i.$$

[Side note: Since all of the sequences here are eventually zero, all of the sums considered here only have finitely many non-zero terms, so we do not need to worry about limits or convergence.]

Then consider the linear transformation T : c₀₀ → c₀₀ defined by

$$T(v_1, v_2, v_3, \dots) = \left(\sum_{i=1}^{\infty} v_i,\; \sum_{i=2}^{\infty} v_i,\; \sum_{i=3}^{\infty} v_i,\; \dots\right).$$

A straightforward calculation reveals that, if the adjoint T* : c₀₀ → c₀₀ exists, it must have the form

$$T^*(w_1, w_2, w_3, \dots) = \left(\sum_{i=1}^{1} w_i,\; \sum_{i=1}^{2} w_i,\; \sum_{i=1}^{3} w_i,\; \dots\right).$$
For example, the identity matrix is unitary since its columns are the standard basis vectors e₁, e₂, ..., eₙ, which form the standard basis of Fⁿ, which is orthonormal. As a slightly less trivial example, consider the matrix

$$U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & -1 \\ 1 & 1\end{bmatrix},$$

which we can show is unitary simply by noting that {(1, 1)/√2, (1, −1)/√2} (i.e., the set consisting of the columns of U) is an orthonormal basis of R².
[Side note: For a refresher on how to think of matrices geometrically as linear transformations, see Appendix A.1.2.]

We can make geometric sense of unitary matrices if we recall that the columns of a matrix tell us where that matrix sends the standard basis vectors e₁, e₂, ..., eₙ. Thus, just like invertible matrices are those that send the unit square grid to a parallelogram grid (without squishing it down to a smaller dimension), unitary matrices are those that send the unit square grid to a (potentially rotated or reflected) unit square grid, as in Figure 1.14.
[Side note: We will show shortly, in Examples 1.4.12 and 1.4.13, that all rotation matrices and all reflection matrices are indeed unitary.]

Figure 1.14: A non-zero matrix P ∈ M₂ is invertible if and only if it sends the unit square grid to a parallelogram grid (whereas it is non-invertible if and only if it sends that grid to a line). A matrix U ∈ M₂ is unitary if and only if it sends the unit square grid to a unit square grid that is potentially rotated and/or reflected, but not skewed.
For this reason, we often think of unitary matrices as the most “rigid” or
“well-behaved” invertible matrices that exist—they preserve not just the dimen-
sion of Fn , but also its shape (but maybe not its orientation). The following
theorem provides several additional characterizations of unitary matrices that
can help us understand them in other ways and perhaps make them a bit more
intuitive.
Table 1.2: A comparison of the properties of invertible matrices and the corresponding stronger properties of unitary matrices. The final properties (that invertible matrices are vector space automorphisms while unitary matrices are inner product space automorphisms) mean that invertible matrices preserve linear combinations, whereas unitary matrices preserve linear combinations as well as the dot product (property (e) of Theorem 1.4.9).
The final two properties of Theorem 1.4.9 provide us with another natural
geometric interpretation of unitary matrices. Condition (f) tells us that unitary
matrices are exactly those that preserve the length of every vector. Similarly,
since the dot product can be used to measure angles between vectors, condi-
tion (e) says that unitary matrices are exactly those that preserve the angle
between every pair of vectors, as in Figure 1.15.
Figure 1.15: Unitary matrices are those that preserve the lengths of vectors as well as the angles between them.
[Side note: To prove this theorem, we show that the 6 properties imply each other.]

Proof of Theorem 1.4.9. We start by showing that conditions (a)–(d) are equivalent to each other. The equivalence of conditions (c) and (d) follows from the fact that, for square matrices, a one-sided inverse is necessarily a two-sided inverse.
To see that (a) is equivalent to (d), we write U = [u₁ | u₂ | ··· | uₙ] and then use block matrix multiplication to multiply by U*:

$$U^*U = \begin{bmatrix} u_1^* \\ u_2^* \\ \vdots \\ u_n^* \end{bmatrix}\big[\,u_1 \mid u_2 \mid \cdots \mid u_n\,\big] = \begin{bmatrix} u_1\cdot u_1 & u_1\cdot u_2 & \cdots & u_1\cdot u_n \\ u_2\cdot u_1 & u_2\cdot u_2 & \cdots & u_2\cdot u_n \\ \vdots & \vdots & \ddots & \vdots \\ u_n\cdot u_1 & u_n\cdot u_2 & \cdots & u_n\cdot u_n \end{bmatrix}.$$
This product equals I if and only if its diagonal entries equal 1 and its off-
diagonal entries equal 0. In other words, U ∗U = I if and only if ui · ui = 1 for
all i and ui · u j = 0 whenever i 6= j. This says exactly that {u1 , u2 , . . . , un } is a
set of mutually orthogonal normalized vectors. Since it consists of exactly n
as desired.
[Side note: The implication (f) ⇒ (e) is the "tough one" of this proof.]

For the implication (f) ⇒ (e), note that if ‖Uv‖² = ‖v‖² for all v ∈ Fⁿ then (Uv) · (Uv) = v · v for all v ∈ Fⁿ. If x, y ∈ Fⁿ then this tells us (by choosing v = x + y) that (U(x + y)) · (U(x + y)) = (x + y) · (x + y). Expanding this dot product on both the left and right then gives

$$(Ux)\cdot(Ux) + 2\operatorname{Re}\big((Ux)\cdot(Uy)\big) + (Uy)\cdot(Uy) = x\cdot x + 2\operatorname{Re}(x\cdot y) + y\cdot y.$$

[Side note: Here we use the fact that $(Ux)\cdot(Uy) + (Uy)\cdot(Ux) = (Ux)\cdot(Uy) + \overline{(Ux)\cdot(Uy)} = 2\operatorname{Re}\big((Ux)\cdot(Uy)\big)$.]

By then using the facts that (Ux) · (Ux) = x · x and (Uy) · (Uy) = y · y, we can simplify the above equation to the form

$$\operatorname{Re}\big((Ux)\cdot(Uy)\big) = \operatorname{Re}(x\cdot y).$$

so in this case we have (Ux) · (Uy) = x · y for all x, y ∈ Fⁿ too, establishing (e).
Finally, to see that (e) ⇒ (d), note that if we rearrange (Uv) · (Uw) = v · w slightly, we get

$$\big((U^*U - I)v\big)\cdot w = 0 \quad \text{for all } v, w \in \mathbb{F}^n.$$

If we choose w = (U*U − I)v then this implies ‖(U*U − I)v‖² = 0 for all v ∈ Fⁿ, so (U*U − I)v = 0 for all v ∈ Fⁿ. This in turn implies U*U − I = O, so U*U = I, which completes the proof.
Checking whether or not a matrix is unitary is now quite simple, since we just have to check whether or not U*U = I. For example, if we return to the matrix

$$U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & -1 \\ 1 & 1\end{bmatrix}$$

from earlier, we can now check that it is unitary simply by computing

$$U^*U = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ -1 & 1\end{bmatrix}\begin{bmatrix} 1 & -1 \\ 1 & 1\end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1\end{bmatrix}.$$

[Side note: The fact that this matrix is unitary makes sense geometrically if we notice that it rotates R² counter-clockwise by π/4 (45°).]
Example 1.4.12 (Rotation Matrices are Unitary). Recall from introductory linear algebra that the standard matrix of the linear transformation R^θ : R² → R² that rotates R² counter-clockwise by an angle of θ is

$$\big[R^\theta\big] = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta)\end{bmatrix}.$$

Show that [R^θ] is unitary.

Solution:
Since rotation matrices do not change the length of vectors, we know that they must be unitary. To verify this a bit more directly we compute [R^θ]*[R^θ]:

$$\big[R^\theta\big]^*\big[R^\theta\big] = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta)\end{bmatrix}\begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta)\end{bmatrix} = \begin{bmatrix} \cos^2(\theta) + \sin^2(\theta) & \sin(\theta)\cos(\theta) - \cos(\theta)\sin(\theta) \\ \cos(\theta)\sin(\theta) - \sin(\theta)\cos(\theta) & \sin^2(\theta) + \cos^2(\theta)\end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1\end{bmatrix}.$$

[Side note: Recall that sin²(θ) + cos²(θ) = 1 for all θ ∈ R.]

Since [R^θ]*[R^θ] = I, we conclude that [R^θ] is unitary.
Example 1.4.13 (Reflection Matrices are Unitary). Recall from introductory linear algebra that the standard matrix of the linear transformation F_u : Rⁿ → Rⁿ that reflects Rⁿ through the line in the direction of the unit vector u ∈ Rⁿ is

$$[F_u] = 2uu^T - I.$$

Solution:
We compute

$$[F_u]^*[F_u] = (2uu^T - I)^T(2uu^T - I) = 4uu^Tuu^T - 4uu^T + I = 4uu^T - 4uu^T + I = I,$$

where the third equality comes from the fact that u is a unit vector, so uᵀu = ‖u‖² = 1. Since [F_u]*[F_u] = I, we conclude that [F_u] is unitary.
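A quick numerical sanity check of the last two examples (NumPy assumed; the particular θ and u below are arbitrary choices):

```python
import numpy as np

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation matrix

u = np.array([3.0, 4.0]) / 5.0                    # a unit vector
F = 2 * np.outer(u, u) - np.eye(2)                # reflection through the line spanned by u

print(np.allclose(R.T @ R, np.eye(2)), np.allclose(F.T @ F, np.eye(2)))  # True True
```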
Again, the previous two examples provide exactly the intuition that we
should have for unitary matrices—they are the ones that rotate and/or reflect
Fn , but do not stretch, shrink, or otherwise distort it. They can be thought of as
very rigid linear transformations that leave the size and shape of Fn intact, but
possibly change its orientation.
Remark 1.4.2 (Orthogonal Matrices). Many sources refer to real unitary matrices as orthogonal matrices (but still refer to complex unitary matrices as unitary). However,
1.4.4 Projections
As one final application of inner products and orthogonality, we now introduce
projections, which we roughly think of as linear transformations that squish
vectors down into some given subspace. For example, when discussing the
Gram–Schmidt process, we implicitly used the fact that if u ∈ V is a unit vector
then the linear transformation P_u : V → V defined by P_u(v) = ⟨u, v⟩u squishes v down onto span(u), as in Figure 1.16. Indeed, P_u is a projection onto span(u).
Figure 1.16: Given a unit vector u, the linear transformation P_u(v) = ⟨u, v⟩u is a projection onto the line in the direction of u.
Figure 1.18: A rank-2 projection P : R3 → R3 projects onto a plane (its range). After it
projects once, projecting again has no additional effect, so P2 = P. An orthogonal
projection is one for which, as in (b), P(v) is always orthogonal to v − P(v).
Example 1.4.14 (Determining if a Matrix is a Projection). Determine which of the following matrices are projections. If they are projections, determine whether or not they are orthogonal and describe the subspace of Rⁿ that they project onto.

$$\text{a) } P = \begin{bmatrix} 1 & -1 \\ 1 & 1\end{bmatrix} \qquad \text{b) } Q = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1/2 \\ 0 & 0 & 0\end{bmatrix} \qquad \text{c) } R = \begin{bmatrix} 5/6 & 1/6 & -1/3 \\ 1/6 & 5/6 & 1/3 \\ -1/3 & 1/3 & 1/3\end{bmatrix}$$
Solutions:
a) This matrix is not a projection, since direct computation shows that

$$P^2 = \begin{bmatrix} 0 & -2 \\ 2 & 0\end{bmatrix} \neq \begin{bmatrix} 1 & -1 \\ 1 & 1\end{bmatrix} = P.$$
[Side note: We return to oblique projections in Section 1.B.2.]

Although projections in general have their uses, we are primarily interested in orthogonal projections, and will focus on them for the remainder of this section. One of the nicest features of orthogonal projections is that they are uniquely determined by the subspace that they project onto (i.e., there is only one orthogonal projection for each subspace), and they can be computed in a straightforward way from any orthonormal basis of that subspace, at least in finite dimensions.
In order to get more comfortable with constructing and making use of or-
thogonal projections, we start by describing what they look like in the concrete
setting of matrices that project Fn down onto some subspace of it.
[Side note: Recall that (AB)* = B*A*. Plugging in B = A* gives (AA*)* = AA*.]

Proof. We start by showing that the matrix P = AA* is indeed an orthogonal projection onto S. To verify this claim, we write A = [u₁ | u₂ | ··· | uₘ]. Then notice that P* = (AA*)* = AA* = P and
In the special case when S is 1-dimensional (i.e., a line), the above result simply says that P = uu*, where u is a unit vector pointing in the direction of that line. It follows that Pv = (uu*)v = (u · v)u, which recovers the fact that we noted earlier about functions of this form (well, functions of the form P(v) = ⟨u, v⟩u) projecting down onto the line in the direction of u.
More generally, if we expand out the product P = AA* using block matrix multiplication, we see that if {u₁, u₂, ..., uₘ} is an orthonormal basis of S then

$$P = \big[\,u_1 \mid u_2 \mid \cdots \mid u_m\,\big]\begin{bmatrix} u_1^* \\ u_2^* \\ \vdots \\ u_m^* \end{bmatrix} = \sum_{j=1}^{m} u_ju_j^*.$$

[Side note: This is a special case of the rank-one sum decomposition from Theorem A.1.3.]
Example 1.4.15 (Finding an Orthogonal Projection Onto a Plane). Construct the orthogonal projection P onto the plane S ⊂ R³ with equation x − y − 2z = 0.

Solution:
Recall from Example 1.4.7 that one orthonormal basis of S is

C = {u₁, u₂} = {(1/√5)(2, 0, 1), (1/√30)(1, 5, −2)}.

[Side note: Even though there are lots of orthonormal bases of S, they all produce the same projection P.]

It follows from Theorem 1.4.10 that the (unique!) orthogonal projection onto S is

$$P = u_1u_1^* + u_2u_2^* = \frac{1}{5}\begin{bmatrix} 2 \\ 0 \\ 1\end{bmatrix}\begin{bmatrix} 2 & 0 & 1\end{bmatrix} + \frac{1}{30}\begin{bmatrix} 1 \\ 5 \\ -2\end{bmatrix}\begin{bmatrix} 1 & 5 & -2\end{bmatrix} = \frac{1}{6}\begin{bmatrix} 5 & 1 & 2 \\ 1 & 5 & -2 \\ 2 & -2 & 2\end{bmatrix}.$$

[Side note: It is worth comparing P to the orthogonal projection onto the plane x − y + 2z = 0 from Example 1.4.14(c).]
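A hedged numerical verification of this example (NumPy assumed): build P from the orthonormal basis and check the defining properties of an orthogonal projection.

```python
import numpy as np

u1 = np.array([2.0, 0.0, 1.0]) / np.sqrt(5)
u2 = np.array([1.0, 5.0, -2.0]) / np.sqrt(30)
P = np.outer(u1, u1) + np.outer(u2, u2)

print(np.allclose(P, np.array([[5, 1, 2], [1, 5, -2], [2, -2, 2]]) / 6))  # matches the matrix above
print(np.allclose(P @ P, P), np.allclose(P.T, P))                          # P^2 = P and P* = P
print(np.allclose(P @ np.array([1.0, 1.0, 0.0]), [1.0, 1.0, 0.0]))         # fixes (1, 1, 0), which lies in S
```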
One of the most useful features of orthogonal projections is that they do not
just project a vector v anywhere in their range, but rather they always project
down to the closest vector in their range, as illustrated in Figure 1.19. This
observation hopefully makes some intuitive sense, since the shortest path from
us to the ceiling above us is along a line pointing straight up (i.e., the shortest
path is orthogonal to the ceiling), but we make it precise in the following
theorem.
[Side note: The distance between two vectors v and w is ‖v − w‖.]

Figure 1.19: The fastest way to get from a vector to a nearby plane is to go "straight down" to it. In other words, the closest vector to v in a subspace S is P(v), where P is the orthogonal projection onto S. That is, ‖v − P(v)‖ ≤ ‖v − w‖ for all w ∈ S.
where the final equality follows from the Pythagorean theorem for inner products (Exercise 1.3.12). Since ‖P(v) − w‖² ≥ 0, it follows immediately that ‖v − w‖ ≥ ‖v − P(v)‖. Furthermore, equality holds if and only if ‖P(v) − w‖ = 0, which happens if and only if P(v) = w.
Example 1.4.16 (Finding the Closest Vector in a Plane). Find the closest vector to v = (3, −2, 2) in the plane S ⊂ R³ defined by the equation x − 4y + z = 0.

Solution:
Theorem 1.4.11 tells us that the closest vector to v in S is Pv, where P is the orthogonal projection onto S. To construct P, we first need an orthonormal basis of S so that we can use Theorem 1.4.10. To find an orthonormal basis of S, we proceed as we did in Example 1.4.7: we apply the Gram–Schmidt process to any set consisting of two linearly independent vectors in S, like B = {(1, 0, −1), (0, 1, 4)}. [Side note: To find these vectors (1, 0, −1) and (0, 1, 4), choose x and y arbitrarily and then solve for z via x − 4y + z = 0.]

Applying the Gram–Schmidt process to B gives the following orthonormal basis C of S:

C = {(1/√2)(1, 0, −1), (1/3)(2, 1, 2)}.
[Side note: Recall that "inconsistent" means "has no solutions".]

Then find the closest thing to a solution; that is, find a vector x ∈ R³ that minimizes ‖Ax − b‖.

Solution:
To see that this linear system is inconsistent (i.e., has no solutions), we just row reduce the augmented matrix [ A | b ]:

$$\left[\begin{array}{ccc|c} 1 & 2 & 3 & 2 \\ 4 & 5 & 6 & -1 \\ 7 & 8 & 9 & 0 \end{array}\right] \xrightarrow{\text{row reduce}} \left[\begin{array}{ccc|c} 1 & 0 & -1 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right].$$
The bottom row of this row echelon form tells us that the original linear
system has no solutions, since it corresponds to the equation 0x1 + 0x2 +
0x3 = 1.
The fact that this linear system is inconsistent means exactly that b ∉ range(A), and to find that "closest thing" to a solution (i.e., to minimize ‖Ax − b‖), we just orthogonally project b onto range(A). We thus start by constructing an orthonormal basis C of range(A):

C = {(1/√66)(1, 4, 7), (1/√11)(−3, −1, 1)}.

[Side note: This orthonormal basis can be found by applying the Gram–Schmidt process to the first two columns of A, which span its range (see Theorem A.1.2).]

The orthogonal projection onto range(A) is then

$$P = BB^* = \frac{1}{6}\begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5\end{bmatrix}, \quad\text{where}\quad B = \begin{bmatrix} 1/\sqrt{66} & -3/\sqrt{11} \\ 4/\sqrt{66} & -1/\sqrt{11} \\ 7/\sqrt{66} & 1/\sqrt{11}\end{bmatrix},$$

[Side note: We develop a more direct method of constructing the projection onto range(A) in Exercise 1.4.30.]

which tells us (via Theorem 1.4.11) that the vector Ax that minimizes
‖Ax − b‖ is

$$A\mathbf{x} = P\mathbf{b} = \frac{1}{6}\begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5\end{bmatrix}\begin{bmatrix} 2 \\ -1 \\ 0\end{bmatrix} = \frac{1}{3}\begin{bmatrix} 4 \\ 1 \\ -2\end{bmatrix}.$$
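As a hedged check (NumPy assumed), a standard least-squares solver minimizes the same quantity ‖Ax − b‖, so A times its output should agree with the projection Pb computed above:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
b = np.array([2.0, -1.0, 0.0])

x, *_ = np.linalg.lstsq(A, b, rcond=None)                  # minimizes ||Ax - b||
print(np.allclose(A @ x, np.array([4.0, 1.0, -2.0]) / 3))  # True: A x equals Pb = (4, 1, -2)/3
```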
The method of Example 1.4.17 for using projections to find the “closest
thing” to a solution of an unsolvable linear system is called linear least squares,
and it is extremely widely-used in statistics. If we want to fit a model to a set of
data points, we typically have far more data points (equations) than parameters
of the model (variables), so our model will typically not exactly match all of the
data. However, with linear least squares we can find the model that comes as
close as possible to matching the data. We return to this method and investigate
it in more depth and with some new machinery in Section 2.C.1.
While Theorem 1.4.10 only applies directly in the finite-dimensional case, we can extend it somewhat to the case where only the subspace being
projected onto is finite-dimensional, but the source vector space V is potentially
infinite-dimensional. We have to be slightly more careful in this situation, since
it does not make sense to even talk about the standard matrix of the projection
when V is infinite-dimensional:
$$(P\circ P)(v) = \sum_{i=1}^{n}\langle u_i, P(v)\rangle u_i \qquad \text{(definition of } P(v)\text{)}$$
$$= \sum_{i=1}^{n}\Big\langle u_i, \sum_{j=1}^{n}\langle u_j, v\rangle u_j\Big\rangle u_i \qquad \text{(definition of } P(v)\text{)}$$
$$= \sum_{i,j=1}^{n}\langle u_j, v\rangle\langle u_i, u_j\rangle u_i \qquad \text{(linearity of the inner product)}$$
$$= \sum_{i=1}^{n}\langle u_i, v\rangle u_i = P(v). \qquad (\{u_1, \dots, u_n\} \text{ is an orthonormal basis)}$$

[Side note: It is worth comparing this result to Theorem 1.4.5: we are just giving coordinates to P(v) with respect to {u₁, u₂, ..., uₙ}.]
Example 1.4.18 (Finding the Closest Polynomial to a Function). Find the degree-2 polynomial f with the property that the integral

$$\int_{-1}^{1}\big(e^x - f(x)\big)^2\,dx$$

is as small as possible.

Solution:
The important fact to identify here is that we are being asked to minimize ‖eˣ − f(x)‖² (which is equivalent to minimizing ‖eˣ − f(x)‖) as f ranges over the subspace P²[−1, 1] of C[−1, 1]. By Theorem 1.4.11, our goal is thus to construct P(eˣ), where P is the orthogonal projection from C[−1, 1] onto P²[−1, 1], which is guaranteed to exist by Theorem 1.4.12 since P²[−1, 1] is finite-dimensional. To construct P, we need an orthonormal basis of P²[−1, 1], and one such basis is

$$C = \{p_1(x), p_2(x), p_3(x)\} = \left\{\frac{1}{\sqrt{2}},\; \sqrt{\frac{3}{2}}\,x,\; \sqrt{\frac{5}{8}}\,(3x^2 - 1)\right\}.$$

[Side note: This basis can be found by applying the Gram–Schmidt process to the standard basis {1, x, x²} much like we did in Example 1.4.8.]

Next, we compute the inner product of eˣ with each of these basis
polynomials:

$$\langle p_1(x), e^x\rangle = \frac{1}{\sqrt{2}}\int_{-1}^{1} e^x\,dx = \frac{1}{\sqrt{2}}\Big(e - \frac{1}{e}\Big) \approx 1.6620,$$
$$\langle p_2(x), e^x\rangle = \sqrt{\frac{3}{2}}\int_{-1}^{1} xe^x\,dx = \frac{\sqrt{6}}{e} \approx 0.9011, \quad\text{and}$$
$$\langle p_3(x), e^x\rangle = \sqrt{\frac{5}{8}}\int_{-1}^{1} (3x^2 - 1)e^x\,dx = \frac{\sqrt{5}(e^2 - 7)}{\sqrt{2}\,e} \approx 0.2263.$$
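The sketch below (SciPy assumed) recomputes these coefficients numerically and combines them into the projected polynomial P(eˣ) = Σᵢ ⟨pᵢ, eˣ⟩pᵢ(x); the variable names are illustrative.

```python
import numpy as np
from scipy.integrate import quad

p = [lambda x: 1 / np.sqrt(2),
     lambda x: np.sqrt(3 / 2) * x,
     lambda x: np.sqrt(5 / 8) * (3 * x**2 - 1)]

# Coefficients <p_i, e^x> on [-1, 1]; should be approx [1.6620, 0.9011, 0.2263].
coeffs = [quad(lambda x, pi=pi: pi(x) * np.exp(x), -1, 1)[0] for pi in p]
print(np.round(coeffs, 4))

projection = lambda x: sum(c * pi(x) for c, pi in zip(coeffs, p))
print(round(projection(0.5), 4))  # value of the best quadratic approximation at x = 0.5
```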
It is worth noting that the solution to the previous example is rather close to the degree-2 Taylor polynomial for eˣ, which is x²/2 + x + 1. [Side note: For a refresher on Taylor polynomials, see Appendix A.2.2.] The reason that these polynomials are close to each other, but not exactly the same as each other, is that the Taylor polynomial is the polynomial that best approximates eˣ at x = 0, whereas the one that we constructed in the previous example is the polynomial that best approximates eˣ on the whole interval [−1, 1] (see Figure 1.20). In fact, if we similarly use orthogonal projections to find polynomials that approximate eˣ on the interval [−c, c] for some scalar c > 0, those polynomials get closer and closer to the Taylor polynomial as c goes to 0 (see Exercise 1.4.32).
Figure 1.20: Orthogonal projections of ex and its Taylor polynomials each approx-
imate it well, but orthogonal projections provide a better approximation over
an entire interval ([−1, 1] in this case) while Taylor polynomials provide a better
approximation at a specific point (x = 0 in this case).
to be the degree-p Taylor polynomial for eˣ centered at 0, then

$$\lim_{p\to\infty}\big\|e^x - T_p(x)\big\|^2 = \lim_{p\to\infty}\int_{-1}^{1}\big(e^x - T_p(x)\big)^2\,dx = 0.$$

[Side note: A proof that this limit equals 0 is outside of the scope of this book, but is typically covered in differential calculus courses. The idea is simply that as the degree of the Taylor polynomial increases, it approximates eˣ better and better.]

Since ‖eˣ − P(eˣ)‖ ≤ ‖eˣ − Tₚ(x)‖ for all p, this implies ‖eˣ − P(eˣ)‖ = 0, so eˣ = P(eˣ). However, this is impossible since eˣ ∉ P[−1, 1], so the orthogonal projection P does not actually exist.

We close this section with one final simple result that says that orthogonal projections never increase the norm of a vector. While this hopefully seems
somewhat intuitive, it is important to keep in mind that it does not hold for
oblique projections. For example, if the sun is straight overhead then shadows
(i.e., projections) are shorter than the objects that cast them, but if the sun is
low in the sky then our shadow may be longer than we are tall.
Proof. We just move things around in the inner product and use the Cauchy–Schwarz inequality:

$$\|P(v)\|^2 = \langle P(v), P(v)\rangle = \langle (P^*\circ P)(v), v\rangle = \langle P^2(v), v\rangle = \langle P(v), v\rangle \le \|P(v)\|\,\|v\|. \tag{1.4.3}$$

[Side note: Keep in mind that P is an orthogonal projection, so P* = P and P² = P here.]
(a) Show that A*A is a diagonal matrix.
(b) Give an example to show that AA* might not be a diagonal matrix.

∗1.4.15 Let ω = e^{2πi/n} (which is an n-th root of unity). Show that the Fourier matrix F ∈ Mₙ(C) defined by

$$F = \frac{1}{\sqrt{n}}\begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega & \omega^2 & \cdots & \omega^{n-1} \\ 1 & \omega^2 & \omega^4 & \cdots & \omega^{2n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{n-1} & \omega^{2n-2} & \cdots & \omega^{(n-1)(n-1)}\end{bmatrix}$$

is unitary.
[Side note: F is, up to scaling, a Vandermonde matrix.]
[Hint: Try to convince yourself that $\sum_{k=0}^{n-1}\omega^k = 0$.]

1.4.16 Suppose that B and C are bases of a finite-dimensional inner product space V.
(a) Show that if B and C are each orthonormal then [v]_B · [w]_B = [v]_C · [w]_C.
(b) Provide an example that shows that if B and C are not both orthonormal bases, then it might be the case that [v]_B · [w]_B ≠ [v]_C · [w]_C.

∗∗1.4.17 Suppose A ∈ Mm,n(F) and B ∈ Mn,m(F).
(a) Suppose F = R. Show that (Ax) · y = x · (By) for all x ∈ Rⁿ and y ∈ Rᵐ if and only if B = Aᵀ.
(b) Suppose F = C. Show that (Ax) · y = x · (By) for all x ∈ Cⁿ and y ∈ Cᵐ if and only if B = A*.
[Side note: In other words, the adjoint of A is either Aᵀ or A*, depending on the ground field.]

∗∗1.4.18 Suppose F = R or F = C, E is the standard basis of Fⁿ, and B is any basis of Fⁿ. Show that a change-of-basis matrix P_{E←B} is unitary if and only if B is an orthonormal basis of Fⁿ.

∗∗1.4.19 Suppose F = R or F = C and B, C ∈ Mm,n(F). Show that the following are equivalent:
a) B*B = C*C,
b) (Bv) · (Bw) = (Cv) · (Cw) for all v, w ∈ Fⁿ, and
c) ‖Bv‖ = ‖Cv‖ for all v ∈ Fⁿ.
[Hint: If C = I then we get some of the characterizations of unitary matrices from Theorem 1.4.9. Mimic that proof.]

∗∗1.4.22 Suppose V and W are finite-dimensional inner product spaces and T : V → W is a linear transformation with adjoint T*. Show that rank(T*) = rank(T).

∗∗1.4.23 Show that every linear transformation T : V → W has at most one adjoint map, even when V and W are infinite-dimensional. [Hint: Use Exercise 1.4.27.]

∗∗1.4.24 Suppose F = R or F = C and consider a function ⟨·, ·⟩ : Fⁿ × Fⁿ → F.
(a) Show that ⟨·, ·⟩ is an inner product if and only if there exists an invertible matrix P ∈ Mₙ(F) such that ⟨v, w⟩ = v*(P*P)w for all v, w ∈ Fⁿ. [Hint: Change a basis in Theorem 1.4.3.]
(b) Find a matrix P associated with the weird inner product from Example 1.3.18. That is, for that inner product find a matrix P ∈ M₂(R) such that ⟨v, w⟩ = v*(P*P)w for all v, w ∈ R².
(c) Explain why ⟨·, ·⟩ is not an inner product if the matrix P from part (a) is not invertible.

∗∗1.4.25 Suppose V is a finite-dimensional vector space and B ⊂ V is linearly independent. Show that there is an inner product on V with respect to which B is orthonormal. [Hint: Use the method of Exercise 1.4.24 to construct an inner product.]

1.4.26 Find an inner product on R² with respect to which the set {(1, 0), (1, 1)} is an orthonormal basis.

∗∗1.4.27 Suppose V and W are inner product spaces and T : V → W is a linear transformation. Show that ⟨T(v), w⟩ = 0 for all v ∈ V and w ∈ W if and only if T = O.

∗∗1.4.28 Suppose V is an inner product space and T : V → V is a linear transformation.
(a) Suppose the ground field is C. Show that ⟨T(v), v⟩ = 0 for all v ∈ V if and only if T = O. [Hint: Mimic part of the proof of Theorem 1.4.9.]
(b) Suppose the ground field is R. Show that ⟨T(v), v⟩ = 0 for all v ∈ V if and only if T* = −T.
∗∗1.4.30 Suppose that A ∈ Mm,n(F) has linearly independent columns.
(a) Show that A*A is invertible.
(b) Show that P = A(A*A)⁻¹A* is the orthogonal projection onto range(A).
[Side note: This exercise generalizes Theorem 1.4.10 to the case when the columns of A are just a basis of its range, but not necessarily an orthonormal one.]

∗∗1.4.31 Show that if P ∈ Mₙ is a projection then there exists an invertible matrix Q ∈ Mₙ such that

$$P = Q\begin{bmatrix} I_r & O \\ O & O\end{bmatrix}Q^{-1},$$

where r = rank(P). In other words, every projection is diagonalizable and has all eigenvalues equal to 0 or 1.
[Hint: What eigen-properties do the vectors in the range and null space of P have?]
[Side note: We prove a stronger decomposition for orthogonal projections in Exercise 2.1.24.]

∗∗1.4.32 Let 0 < c ∈ R be a scalar.
(a) Suppose P_c : C[−c, c] → P²[−c, c] is an orthogonal projection. Compute P_c(eˣ). [Hint: We worked through the c = 1 case in Example 1.4.18.]
(b) Compute the polynomial lim_{c→0⁺} P_c(eˣ) and notice that it equals the degree-2 Taylor polynomial of eˣ at x = 0. Provide a (not necessarily rigorous) explanation for why we would expect these two polynomials to coincide.

1.4.33 Much like we can use polynomials to approximate functions via orthogonal projections, we can also use trigonometric functions to approximate them. Doing so gives us something called the function's Fourier series.
(a) Show that, for each n ≥ 1, the set
Bₙ = {1, sin(x), sin(2x), ..., sin(nx), cos(x), cos(2x), ..., cos(nx)}
is mutually orthogonal in the usual inner product on C[−π, π].
(b) Rescale the members of Bₙ so that they have norm equal to 1.
(c) Orthogonally project the function f(x) = x onto span(Bₙ).
[Side note: These are called the Fourier approximations of f, and letting n → ∞ gives its Fourier series.]
(d) Use computer software to plot the function f(x) = x from part (c), as well as its projection onto span(B₅), on the interval [−π, π].
Having inner products to work with also let us introduce orthonormal bases and unitary matrices, which can be thought of as the "best-behaved" bases and invertible matrices that exist, respectively. [Side note: We could have also used inner products to define angles in general vector spaces.]

In the finite-dimensional case, none of the aforementioned topics change much when going from Rⁿ to abstract vector spaces, since every such vector space is isomorphic to Rⁿ (or Fⁿ, if the vector space is over a field F). In particular, to check some purely linear-algebraic concept like linear independence
ular, to check some purely linear-algebraic concept like linear independence
or invertibility, we can simply convert abstract vectors and linear transforma-
tions into vectors in Fn and matrices in Mm,n (F), respectively, and check the
corresponding property there. To check some property that also depends on
an inner product like orthogonality or the length of a vector, we can similarly
convert everything into vectors in Fn and matrices in Mm,n (F) as long as we
are careful to represent the vectors and matrices in orthonormal bases.
For this reason, not much is lost in (finite-dimensional) linear algebra if we
explicitly work with Fn instead of abstract vector spaces, and with matrices
instead of linear transformations. We will often switch back and forth between
these two perspectives, depending on which is more convenient for the topic
at hand. For example, we spend most of Chapter 2 working specifically with
matrices, though we will occasionally make a remark about what our results
say for linear transformations.
Recall that the trace of a square matrix A ∈ Mn is the sum of its diagonal
entries:
tr(A) = a1,1 + a2,2 + · · · + an,n .
While we are already familiar with some nice features of the trace (such as the fact that it is similarity-invariant, and the fact that tr(A) also equals the sum of the eigenvalues of A), it still seems somewhat arbitrary—why does adding up the diagonal entries of a matrix tell us anything interesting? [Side note: Recall that a function f is similarity-invariant if f(A) = f(PAP⁻¹) for all invertible P.]

In this section, we explore the trace in a bit more depth in an effort to
explain where it “comes from”, in the same sense that the determinant can
be thought of as the answer to the question of how to measure how much
a linear transformation expands or contracts space. In brief, the trace is the
“most natural” or “most useful” linear form on Mn —it can be thought of as the
additive counterpart of the determinant, which in some sense is similarly the
“most natural” multiplicative function on Mn .
Even though matrix multiplication itself is not commutative, the property that tr(AB) = tr(BA) lets
us treat it as if it were commutative in some situations. To illustrate what we
mean by this, we note that the following example can be solved very quickly
with the trace, but is quite difficult to solve otherwise.
Example 1.A.1 (The Matrix AB − BA). Show that there do not exist matrices A, B ∈ Mn such that AB − BA = I.

[Side note: The matrix AB − BA is sometimes called the commutator of A and B, and is denoted by [A, B] = AB − BA.]

Solution:
To see why such matrices cannot exist, simply take the trace of both sides of the equation:

    tr(AB − BA) = tr(AB) − tr(BA) = tr(AB) − tr(AB) = 0,

but tr(I) = n. Since n ≠ 0, no such matrices A and B can exist.
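A quick numerical sanity check (not from the text, and assuming NumPy is available) makes the same point: the trace of any commutator is zero, up to floating-point error, so it can never equal tr(I) = n.

```python
# Sanity check: tr(AB) = tr(BA), so the commutator AB - BA always has trace 0.
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

print(np.trace(A @ B) - np.trace(B @ A))   # ~0 (floating-point error only)
print(np.trace(A @ B - B @ A))             # ~0, whereas tr(I) = n = 4
```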
Remark 1.A.1 (A Characterization of Trace-Zero Matrices). The previous example can actually be extended into a theorem. Using the exact same logic as in that example, we can see that the matrix equation AB − BA = C can only ever have a solution when tr(C) = 0, since

    tr(C) = tr(AB − BA) = tr(AB) − tr(BA) = tr(AB) − tr(AB) = 0.

Remarkably, the converse of this observation is also true—for any matrix C with tr(C) = 0, there exist matrices A and B of the same size such that AB − BA = C. [Side note: This fact is proved in [AM57].]
One of the most remarkable facts about the trace is that, not only does it
satisfy this commutativity property, but it is essentially the only linear form
that does so. In particular, the following theorem says that the only linear forms
for which f (AB) = f (BA) are the trace and its scalar multiples:
To this end, we first recall that since f is linear it must be the case that f(O) = 0. Next, we notice that E_{i,j}E_{j,j} = E_{i,j} but E_{j,j}E_{i,j} = O whenever i ≠ j, so property (a) of f given in the statement of the theorem implies

    0 = f(O) = f(E_{j,j}E_{i,j}) = f(E_{i,j}E_{j,j}) = f(E_{i,j}) whenever i ≠ j,

which is one of the two facts that we wanted to show.

To similarly prove the other fact (i.e., f(E_{j,j}) = 1 for all 1 ≤ j ≤ n), we notice that E_{1,j}E_{j,1} = E_{1,1}, but E_{j,1}E_{1,j} = E_{j,j} for all 1 ≤ j ≤ n, so

    f(E_{j,j}) = f(E_{j,1}E_{1,j}) = f(E_{1,j}E_{j,1}) = f(E_{1,1}) for all 1 ≤ j ≤ n.

However, since I = E_{1,1} + E_{2,2} + · · · + E_{n,n}, it then follows that

    n = f(I) = f(E_{1,1} + E_{2,2} + · · · + E_{n,n}) = f(E_{1,1}) + f(E_{2,2}) + · · · + f(E_{n,n}) = n f(E_{1,1}),

so f(E_{1,1}) = 1, and similarly f(E_{j,j}) = 1 for all 1 ≤ j ≤ n, which completes the proof.
There are also a few other ways of thinking of the trace as the unique
function with certain properties. For example, there are numerous functions of
matrices that are similarity-invariant (e.g., the rank, trace, and determinant),
but the following corollary says that the trace is (up to scaling) the only one
that is linear.
Proof. To see that (a) =⇒ (b), we proceed much like in the proof of Theorem 1.A.1—our goal is to show that f(E_{j,j}) = 1 and f(E_{i,j}) = 0 for all 1 ≤ i ≠ j ≤ n.

[Side note: In fact, this shows that if f is a linear function for which f(P) = 1 for all rank-1 projections P ∈ Mn then f = tr.]

The reason that f(E_{j,j}) = 1 is simply that E_{j,j} is a rank-1 projection for each 1 ≤ j ≤ n. To see that f(E_{i,j}) = 0 when i ≠ j, notice that E_{j,j} + E_{i,j} is also a rank-1 (oblique) projection, so f(E_{j,j} + E_{i,j}) = 1. However, since f(E_{j,j}) = 1, linearity of f tells us that f(E_{i,j}) = 0. It follows that for every matrix A ∈ Mn(F) we have

    f(A) = f( ∑_{i,j=1}^n a_{i,j} E_{i,j} ) = ∑_{i,j=1}^n a_{i,j} f(E_{i,j}) = ∑_{j=1}^n a_{j,j} = tr(A),

as desired.

The fact that (b) =⇒ (a) is left to Exercise 1.A.4.
If we have an inner product to work with (and are thus working over the
field F = R or F = C), we can ask whether or not the above theorem can be
strengthened to consider only orthogonal projections (i.e., projections P for
which P∗ = P). It turns out that if F = C then this works—if f (P) = rank(P) for
all orthogonal projections P ∈ Mn (C) then f (A) = tr(A) for all A ∈ Mn (C).
However, this is not true if F = R (see Exercise 1.A.7).
Figure 1.21: The determinant det(I + xA) can be split up into four pieces, as indicated here at the bottom-right. The blue region has area approximately equal to 1, the purple region has area proportional to x², and the orange region has area x·a_{1,1} + x·a_{2,2} = x·tr(A). When x is close to 0, this orange region is much larger than the purple region and thus determines the growth rate of det(I + xA).
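The claim in this caption, that the first-order growth of det(I + xA) for small x is governed by x·tr(A), can be sanity-checked numerically. The following is a small sketch (not from the text) using NumPy:

```python
# Numerical check that det(I + xA) ~ 1 + x*tr(A) for small x, i.e. that the trace
# is the derivative of the determinant at the identity.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))

for x in [1e-1, 1e-2, 1e-3, 1e-4]:
    approx = (np.linalg.det(np.eye(2) + x * A) - 1) / x
    print(x, approx, np.trace(A))   # approx -> tr(A) as x -> 0
```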
It is often useful to break apart a large vector space into multiple subspaces
that do not intersect each other (except at the zero vector, where intersection
is unavoidable). For example, it is somewhat natural to think of R2 as being
made up of two copies of R, since every vector (x, y) ∈ R2 can be written in
the form (x, 0) + (0, y), and the subspaces {(x, 0) : x ∈ R} and {(0, y) : y ∈ R}
are each isomorphic to R in a natural way.
Similarly, we can think of R3 as being made up of either three copies of R,
or a copy of R2 and a copy of R, as illustrated in Figure 1.22. The direct sum
provides a way of making this idea precise, and we explore it thoroughly in
this section.
Figure 1.22: The direct sum lets us break down R3 (and other vector spaces) into
smaller subspaces that do not intersect each other except at the origin.
Definition 1.B.1 (The Internal Direct Sum). Let V be a vector space with subspaces S1 and S2. We say that V is the (internal) direct sum of S1 and S2, denoted by V = S1 ⊕ S2, if
a) span(S1 ∪ S2) = V, and
b) S1 ∩ S2 = {0}.

[Side note: For now, we just refer to this as a direct sum (without caring about the "internal" part of its name). We will distinguish between this and another type of direct sum later.]

The two defining properties of the direct sum mimic very closely the two defining properties of bases (Definition 1.1.6). Just like bases must span the entire vector space, so too must subspaces in a direct sum, and just like bases must be "small enough" that they are linearly independent, subspaces in a direct sum must be "small enough" that they only contain the zero vector in common.

It is also worth noting that the defining property (a) of the direct sum is equivalent to saying that every vector v ∈ V can be written in the form v = v1 + v2 for some v1 ∈ S1 and v2 ∈ S2. The reason for this is simply that in any linear combination of vectors from S1 and S2, we can group the terms
from S1 into v1 and the terms from S2 into v2. That is, if we write

    v = (c1x1 + c2x2 + · · · + ckxk) + (d1y1 + d2y2 + · · · + dmym),        (1.B.1)
            call this v1                      call this v2

where x1, . . . , xk ∈ S1, y1, . . . , ym ∈ S2, and c1, . . . , ck, d1, . . . , dm ∈ F, then we can just define v1 and v2 to be the parenthesized terms indicated above.

[Side note: In other words, the direct sum is a special case of the (not necessarily direct) sum V = S1 + S2 from Exercise 1.1.19.]
Example 1.B.1 (Checking Whether or Not Subspaces Make a Direct Sum). Determine whether or not R3 = S1 ⊕ S2, where S1 and S2 are the given subspaces.
a) S1 is the x-axis and S2 is the y-axis.
b) S1 is the xy-plane and S2 is the yz-plane.
c) S1 is the line through the origin in the direction of the vector (0, 1, 1) and S2 is the xy-plane.

Solutions:
a) To determine whether or not V = S1 ⊕ S2, we must check the two defining properties of the direct sum from Definition 1.B.1. Indeed, property (a) does not hold since span(S1 ∪ S2) is just the xy-plane, not all of R3, so R3 ≠ S1 ⊕ S2.
   [Side note: In part (a) of this example, S1 and S2 are "too small".]
   [Figure: the x-axis S1 and the y-axis S2 in R3; together they span only the xy-plane.]
b) This time span(S1 ∪ S2) = R3, but property (b) of Definition 1.B.1 fails: S1 ∩ S2 is the y-axis, which contains non-zero vectors, so R3 ≠ S1 ⊕ S2.
c) To see that span(S1 ∪ S2) = R3 we must show that we can write every vector (x, y, z) ∈ R3 as a linear combination of vectors from S1 and S2. One way to do this is to notice that

       (x, y, z) = (0, z, z) + (x, y − z, 0) ∈ span(S1 ∪ S2),

   since (0, z, z) ∈ S1 and (x, y − z, 0) ∈ S2, so span(S1 ∪ S2) = R3.
   To see that S1 ∩ S2 = {0}, suppose (x, y, z) ∈ S1 ∩ S2. Since (x, y, z) ∈ S1, we know that x = 0 and y = z. Since (x, y, z) ∈ S2, we know that z = 0. It follows that (x, y, z) = (0, 0, 0) = 0, so S1 ∩ S2 = {0}, and thus R3 = S1 ⊕ S2.
   [Figure: the line S1 in the direction (0, 1, 1) and the xy-plane S2 in R3.]
   [Side note: Notice that the line S1 is not orthogonal to the plane S2. The intuition here is the same as that of linear independence, not of orthogonality.]
The direct sum extends straightforwardly to three or more subspaces as well. In general, we say that V is the (internal) direct sum of subspaces S1, S2, . . . , Sk if span(S1 ∪ S2 ∪ · · · ∪ Sk) = V and

    Si ∩ span( ∪_{j≠i} Sj ) = {0}   for all 1 ≤ i ≤ k.        (1.B.2)

[Side note: The notation with the big "∪" union symbol here is analogous to big-Σ notation for sums.]

Indeed, Equation (1.B.2) looks somewhat complicated on the surface, but it just says that each subspace Si (1 ≤ i ≤ k) has no non-zero intersection with (the span of) the rest of them. In this case we write either

    V = S1 ⊕ S2 ⊕ · · · ⊕ Sk    or    V = ⊕_{j=1}^k Sj.

For example, if S1, S2, and S3 are the x-, y-, and z-axes in R3, then R3 = S1 ⊕ S2 ⊕ S3. More generally, Rn can be written as the direct sum of its n coordinate axes. Even more generally, given any basis {v1, v2, . . . , vk} of a finite-dimensional vector space V, it is the case that

    V = span(v1) ⊕ span(v2) ⊕ · · · ⊕ span(vk).

[Side note: This claim is proved in Exercise 1.B.4.]
Proof. We already noted that v can be written in this form back in Equa-
tion (1.B.1). To see uniqueness, suppose that there exist v1 , w1 ∈ S1 and
v2 , w2 ∈ S2 such that
v = v1 + v2 = w1 + w2 .
Subtracting v2 + w1 from the above equation shows that v1 − w1 = w2 − v2. Since v1 − w1 ∈ S1 and w2 − v2 ∈ S2, and S1 ∩ S2 = {0}, this implies v1 − w1 = w2 − v2 = 0, so v1 = w1 and v2 = w2, as desired.

[Side note: Recall that S1 is a subspace, so if v1, w1 ∈ S1 then v1 − w1 ∈ S1 too (and similarly for S2).]
Similarly, we now show that if we combine bases of the subspaces S1 and
S2 then we get a basis of V = S1 ⊕ S2 . This hopefully makes some intuitive
sense—every vector in V can be represented uniquely as a sum of vectors in S1
and S2 , and every vector in those subspaces can be represented uniquely as a
linear combination of their basis vectors.
Theorem 1.B.2 (Bases of Direct Sums). Suppose that V is a vector space and S1, S2 ⊆ V are subspaces with bases B and C, respectively. Then
a) span(B ∪ C) = V if and only if span(S1 ∪ S2) = V, and
b) B ∪ C is linearly independent if and only if S1 ∩ S2 = {0}.
In particular, B ∪ C is a basis of V if and only if V = S1 ⊕ S2.

[Side note: Here, B ∪ C refers to the union of B and C as a multiset, so that if there is a common vector in each of B and C then we immediately regard B ∪ C as linearly dependent.]

Proof. For part (a), we note that if span(B ∪ C) = V then it must be the case that span(S1 ∪ S2) = V as well, since span(B ∪ C) ⊂ span(S1 ∪ S2). In the other direction, if span(S1 ∪ S2) = V then we can write every v ∈ V in the form v = v1 +
v2 , where v1 ∈ S1 and v2 ∈ S2 . Since B and C are bases of S1 and S2 , respectively,
we can write v1 and v2 as linear combinations of vectors from those sets:
v = v1 + v2 = c1 x1 + c2 x2 + · · · + ck xk + d1 y1 + d2 y2 + · · · + dm ym ,
where x1 , x2 , . . ., xk ∈ S1 , y1 , y2 , . . ., ym ∈ S2 , and c1 , c2 , . . ., ck , d1 , d2 , . . .,
dm ∈ F. We have thus written v as a linear combination of vectors from B ∪C,
so span(B ∪C) = V.
For part (b), suppose that S1 ∩ S2 = {0} and consider some linear combi-
nation of vectors from B ∪C that equals the zero vector:
    (c1x1 + c2x2 + · · · + ckxk) + (d1y1 + d2y2 + · · · + dmym) = 0,        (1.B.3)
            call this v1                      call this v2

where the vectors and scalars come from the same spaces as they did in part (a). Our goal is to show that c1 = c2 = · · · = ck = 0 and d1 = d2 = · · · = dm = 0, which implies linear independence of B ∪ C.

[Side note: As was the case with proofs about bases, these proofs are largely uninspiring definition-chasing affairs. The theorems themselves should be somewhat intuitive though.]

To this end, notice that Equation (1.B.3) says that 0 = v1 + v2, where v1 ∈ S1 and v2 ∈ S2. Since 0 = 0 + 0 is another way of writing 0 as a sum of something from S1 and something from S2, Theorem 1.B.1 tells us that v1 = 0 and v2 = 0. It follows that
c1 x1 + c2 x2 + · · · + ck xk = 0 and d1 y1 + d2 y2 + · · · + dm ym = 0,
Example 1.B.2 (Even and Odd Polynomials as a Direct Sum). Let P^E and P^O be the subspaces of P consisting of the even and odd polynomials, respectively:

    P^E = { f ∈ P : f(−x) = f(x) }   and   P^O = { f ∈ P : f(−x) = −f(x) }.

Show that P = P^E ⊕ P^O.

[Side note: We introduced P^E and P^O in Example 1.1.21.]

Solution:
We could directly show that span(P^E ∪ P^O) = P and P^E ∩ P^O = {0}, but perhaps an easier way is to consider how the bases of these vector spaces relate to each other. Recall from Example 1.1.21 that B = {1, x², x⁴, . . .} is a basis of P^E and C = {x, x³, x⁵, . . .} is a basis of P^O. Since

    B ∪ C = {1, x, x², x³, . . .}
For example, we can now see straight away that the subspaces S1 and S2
from Examples 1.B.1(a) and (b) do not form a direct sum decomposition of R3
since in part (a) we have dim(S1 ) + dim(S2 ) = 1 + 1 = 2 6= 3, and in part (b)
we have dim(S1 ) + dim(S2 ) = 2 + 2 = 4 6= 3. It is perhaps worthwhile to work
through a somewhat more exotic example in a vector space other than Rn .
    A = A^T = −A,

from which it follows that A = O, so M_n^S ∩ M_n^sS = {O}.

[Side note: This example only works if the ground field does not have 1 + 1 = 0 (like in Z2, for example). Fields with this property are said to have "characteristic 2", and they often require special attention.]

For property (a), we have to show that every matrix A ∈ Mn can be written in the form A = B + C, where B^T = B and C^T = −C. We can check via direct computation that

    A = ½(A + A^T) + ½(A − A^T),
          symmetric     skew-symmetric

where the first of these matrices is symmetric and the second is skew-symmetric, so Mn = M_n^S ⊕ M_n^sS. Moreover,

    dim(M_n^S) = n(n + 1)/2,   dim(M_n^sS) = n(n − 1)/2,   and   dim(Mn) = n².

Note that these quantities are in agreement with Corollary 1.B.3, since

    dim(M_n^S) + dim(M_n^sS) = n(n + 1)/2 + n(n − 1)/2 = n² = dim(Mn).
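A brief numerical illustration (not from the text, assuming NumPy) of this decomposition: every square matrix splits into a symmetric plus a skew-symmetric piece, and the two pieces are exactly the ones displayed above.

```python
# Split a matrix into its symmetric and skew-symmetric parts, as in Example 1.B.3.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))

B = (A + A.T) / 2      # symmetric part:       B.T == B
C = (A - A.T) / 2      # skew-symmetric part:  C.T == -C

print(np.allclose(A, B + C))   # True
print(np.allclose(B, B.T))     # True
print(np.allclose(C, -C.T))    # True
```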
Remark 1.B.1 (The Complex Cartesian Decomposition). While the decomposition of Example 1.B.3 works fine for complex matrices (as well as matrices over almost any other field), a slightly different decomposition that makes use of the conjugate transpose is typically used in that setting instead. Indeed, we can write every matrix A ∈ Mn(C) as a sum of a Hermitian and a skew-Hermitian matrix via

    A = ½(A + A∗) + ½(A − A∗).        (1.B.4)
         Hermitian    skew-Hermitian

[Figure: a complex number a + ib plotted in the complex plane, with real part a and imaginary part b.]

Indeed, Hermitian matrices are often thought of as the "matrix version" of real numbers, and skew-Hermitian matrices as the "matrix version" of imaginary numbers. [Side note: We return to this idea of "matrix versions" of certain subsets of complex numbers in Figure 2.6.]
Proof. Part (a) of the theorem follows immediately from Corollary 1.B.3, so
we focus our attention on part (b). Also, condition (iii) immediately implies
conditions (i) and (ii), since those two conditions define exactly what V =
S1 ⊕ S2 means. We thus just need to show that condition (i) implies (iii) and
that (ii) also implies (iii).
To see that condition (i) implies condition (iii), we first note that if B and C
However, equality must actually be attained since Theorem 1.B.2 tells us that
(since (i) holds) span(B ∪C) = V. It then follows from Exercise 1.2.27 (since
B ∪ C spans V and has dim(V) vectors) that B ∪ C is a basis of V, so using
Theorem 1.B.2 again tells us that V = S1 ⊕ S2 .
The proof that condition (ii) implies condition (iii) is similar, and left as
Exercise 1.B.6.
Definition 1.B.2 (Orthogonal Complement). Suppose V is an inner product space and B ⊆ V is a set of vectors. The orthogonal complement of B, denoted by B⊥, is the subspace of V consisting of the vectors that are orthogonal to everything in B:

    B⊥ = { v ∈ V : ⟨v, w⟩ = 0 for all w ∈ B }.

[Side note: B⊥ is read as "B perp", where "perp" is short for perpendicular.]

For example, two lines in R2 that are perpendicular to each other are orthog-
onal complements of each other, as are a plane and a line in R3 that intersect
at right angles (see Figure 1.23). The idea is that orthogonal complements
break an inner product space down into subspaces that only intersect at the zero
vector, much like (internal) direct sums do, but with the added restriction that
those subspaces must be orthogonal to each other.
[Figure 1.23: perpendicular lines in R2 (in the directions (2, 1) and (1, −2)), and a line and plane in R3 that intersect at right angles, with (2, 2, 0) one of the vectors shown.]
[Side note: We replaced the basis vector (1, 1, 0) with (2, 2, 0) just to make this picture a bit prettier. The author is very superficial.]
Example 1.B.5 (Orthogonal Complements of Matrices). Describe the orthogonal complement of the set in Mn consisting of just the identity matrix: B = {I}.

Solution:
Recall from Example 1.3.16 that the standard (Frobenius) inner product on Mn is ⟨X, Y⟩ = tr(X∗Y), so X ∈ B⊥ if and only if ⟨X, I⟩ = tr(X∗) = 0. This is equivalent to simply requiring that tr(X) = 0.
Example 1.B.6 (Orthogonal Complements of Polynomials). Describe the orthogonal complement of the set B = {x, x³} ⊆ P³[−1, 1].

Solution:
Our goal is to find all polynomials f(x) = ax³ + bx² + cx + d with the property that

    ⟨f(x), x⟩ = ∫_{−1}^{1} x f(x) dx = 0   and   ⟨f(x), x³⟩ = ∫_{−1}^{1} x³ f(x) dx = 0.
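The two orthogonality conditions above can be worked out symbolically. Here is a short sketch (not from the text) using SymPy; the conclusion drawn in the comments follows from the computation rather than from the book's own wording:

```python
# Impose the two orthogonality conditions on f(x) = a*x^3 + b*x^2 + c*x + d.
import sympy as sp

x, a, b, c, d = sp.symbols('x a b c d')
f = a*x**3 + b*x**2 + c*x + d

eq1 = sp.integrate(x * f, (x, -1, 1))      # <f, x>   = 2a/5 + 2c/3
eq2 = sp.integrate(x**3 * f, (x, -1, 1))   # <f, x^3> = 2a/7 + 2c/5

print(sp.solve([eq1, eq2], [a, c]))        # {a: 0, c: 0}
# So, within P^3[-1, 1], the orthogonal complement of {x, x^3} consists of the
# even polynomials b*x^2 + d.
```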
Proof. First note that B and C are disjoint since the only vector that S and
S ⊥ have in common is 0, since that is the only vector orthogonal to itself.
With that in mind, write B = {u1 , u2 , . . . , um } and C = {v1 , v2 , . . . , vn }, so
that B ∪ C = {u1 , u2 , . . . , um , v1 , v2 , . . . , vn }. To see that B ∪ C is a mutually
orthogonal set, notice that
We thus just need to show that span(B ∪C) = V. To this end, recall from
Exercise 1.4.20 that we can extend B ∪C to an orthonormal basis of V: we can
find k ≥ 0 unit vectors w1 , w2 , . . . wk such that the set
{u1 , u2 , . . . , um , v1 , v2 , . . . , vn , w1 , w2 , . . . wk }
The above theorem has a few immediate (but very useful) corollaries. For
example, we can now show that orthogonal complements really are a stronger
version of direct sums:
Proof. If property (a) holds (i.e., S2 = S1⊥) then the fact that ⟨v, w⟩ = 0 for all v ∈ S1 and w ∈ S2 is clear, so we just need to show that V = S1 ⊕ S2. Well, Theorem 1.B.5 tells us that if B and C are orthonormal bases of S1 and S2, respectively, then B ∪ C is an orthonormal basis of V. Theorem 1.B.2 then tells us that V = S1 ⊕ S2.

[Side note: In particular, this theorem tells us that if V is finite-dimensional then for every subspace S ⊆ V we have V = S ⊕ S⊥.]

In the other direction, property (b) immediately implies S2 ⊆ S1⊥, so we just need to show the opposite inclusion. To that end, suppose w ∈ S1⊥ (i.e., ⟨v, w⟩ = 0 for all v ∈ S1). Then, since w ∈ V = S1 ⊕ S2, we can write w = w1 + w2 for some w1 ∈ S1 and w2 ∈ S2. It follows that

    0 = ⟨v, w⟩ = ⟨v, w1 + w2⟩ = ⟨v, w1⟩ + ⟨v, w2⟩ = ⟨v, w1⟩

for all v ∈ S1, where the final equality follows from the fact that w2 ∈ S2 and thus ⟨v, w2⟩ = 0 by property (b). Choosing v = w1 then gives ⟨w1, w1⟩ = 0, so w1 = 0, so w = w2 ∈ S2, as desired.
This fact is proved in Exercise 1.B.14 and illustrated in Figure 1.24. Note in
particular that after taking the orthogonal complement of any set B once, further
orthogonal complements just bounce back and forth between B⊥ and span(B).
Figure 1.24: The orthogonal complement of B is B⊥ , and the orthogonal com-
plement of B⊥ is span(B). After that point, taking the orthogonal complement of
span(B) results in B⊥ again, and vice-versa.
The above example tells us that every projection breaks space down into two
they live in, and similarly the dimensions of range(A∗ ) and null(A) add up
to the dimension of the input space R5 that they live in. Furthermore, it is
straightforward to check that the vector in this basis of null(A∗ ) is orthogonal
to each of the basis vectors for range(A), and the vectors in this basis for null(A)
are orthogonal to each of the basis vectors for range(A∗ ). All of these facts
can be explained by observing that the fundamental subspaces of any linear
transformation are in fact orthogonal complements of each other:
Theorem 1.B.7 (Orthogonality of the Fundamental Subspaces). Suppose V and W are finite-dimensional inner product spaces and T : V → W is a linear transformation. Then
a) range(T)⊥ = null(T∗), and
b) null(T)⊥ = range(T∗).

[Side note: To help remember this theorem, note that each equation in it contains exactly one T, one T∗, one range, one null space, and one orthogonal complement.]

Proof. The proof of this theorem is surprisingly straightforward. For part (a), we just argue as we did in Example 1.B.8(b)—we observe that w ∈ range(T)⊥ is equivalent to several other conditions:

    w ∈ range(T)⊥ ⟺ ⟨T(v), w⟩ = 0 for all v ∈ V
                   ⟺ ⟨v, T∗(w)⟩ = 0 for all v ∈ V
                   ⟺ T∗(w) = 0
                   ⟺ w ∈ null(T∗).
It follows that range(T )⊥ = null(T ∗ ), as desired. Part (b) of the theorem can
now be proved by making use of part (a), and is left to Exercise 1.B.15.
The way to think of this theorem is as saying that, for every linear transfor-
mation T : V → W, we can decompose the input space V into an orthogonal
direct sum V = range(T ∗ ) ⊕ null(T ) such that T acts like an invertible map
on one space (range(T ∗ )) and acts like the zero map on the other (null(T )).
Similarly, we can decompose the output space W into an orthogonal direct sum
W = range(T ) ⊕ null(T ∗ ) such that T maps all of V onto one space (range(T ))
and maps nothing into the other (null(T∗)). These relationships between the four fundamental subspaces are illustrated in Figure 1.26. [Side note: Okay, T maps things to 0 ∈ null(T∗), but that's it! The rest of null(T∗) is untouched.]

For example, the matrix A from Equation (1.B.5) acts as a rank 3 linear transformation that sends R5 to R4. This means that there is a 3-dimensional subspace range(A∗) ⊆ R5 on which A just "shuffles things around" to another 3-dimensional subspace range(A) ⊆ R4. The orthogonal complement of range(A∗) is null(A), which accounts for the other 2 dimensions of R5 that are "squashed away". [Side note: We return to the fundamental subspaces in Section 2.3.1.]

If we recall from Exercise 1.4.22 that every linear transformation T acting on a finite-dimensional inner product space has rank(T) = rank(T∗), we im-
mediately get the following corollary that tells us how large the fundamental
subspaces are compared to each other.
Corollary 1.B.8 (Dimensions of the Fundamental Subspaces). Suppose V and W are finite-dimensional inner product spaces and T : V → W is a linear transformation. Then
a) rank(T) + nullity(T) = dim(V), and
b) rank(T) + nullity(T∗) = dim(W).
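The following is a small numerical illustration (not from the text, assuming NumPy and SciPy) of Theorem 1.B.7 and Corollary 1.B.8, using a rank-3 matrix viewed as a map from R5 to R4 like the matrix A discussed above:

```python
# Check the rank/nullity counts and the orthogonality of range(A^*) and null(A).
import numpy as np
from scipy.linalg import null_space, orth

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3)) @ rng.standard_normal((3, 5))   # a rank-3 map R^5 -> R^4

r = np.linalg.matrix_rank(A)
N = null_space(A)      # orthonormal basis of null(A)
M = null_space(A.T)    # orthonormal basis of null(A^*)  (A is real, so A^* = A^T)
R = orth(A.T)          # orthonormal basis of range(A^*)

print(r + N.shape[1])              # 5 = dim(R^5): rank(A) + nullity(A)
print(r + M.shape[1])              # 4 = dim(R^4): rank(A) + nullity(A^*)
print(np.allclose(R.T @ N, 0))     # True: range(A^*) is orthogonal to null(A)
print(np.linalg.matrix_rank(np.hstack([R, N])))   # 5: together they fill out R^5
```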
[Side note: We make the idea that T acts "like" an invertible linear transformation on range(T∗) precise in Exercise 1.B.16.]
Figure 1.26: Given a linear transformation T : V → W, range(T ∗ ) and null(T ) are
orthogonal complements in V, while range(T ) and null(T ∗ ) are orthogonal comple-
ments in W. These particular orthogonal decompositions of V and W are useful
because T acts like the zero map on null(T ) (i.e., T (vn ) = 0 for each vn ∈ null(T ))
and like an invertible linear transformation on range(T ∗ ) (i.e., for each w ∈ range(T )
there exists a unique vr ∈ range(T ∗ ) such that T (vr ) = w).
Definition 1.B.3 (The External Direct Sum). Let V and W be vector spaces over the same field F. Then the external direct sum of V and W, denoted by V ⊕ W, is the vector space with vectors and operations defined as follows:
Vectors: ordered pairs (v, w), where v ∈ V and w ∈ W.
Vector addition: (v1, w1) + (v2, w2) = (v1 + v2, w1 + w2) for all v1, v2 ∈ V and w1, w2 ∈ W.
Scalar mult.: c(v, w) = (cv, cw) for all c ∈ F, v ∈ V, and w ∈ W.

[Side note: In other words, V ⊕ W is the Cartesian product of V and W, together with the entry-wise addition and scalar multiplication operations.]

It is hopefully believable that the external direct sum V ⊕ W really is a vector space, so we leave the proof of that claim to Exercise 1.B.19. For now, we look at the canonical example that motivates the external direct sum.
Example 1.B.9 (The Direct Sum of Fn). Show that F ⊕ F2 (where "⊕" here means the external direct sum) is isomorphic to F3.
Solution:
By definition, F ⊕ F2 consists of vectors of the form (x, (y, z)), where
x ∈ F and (y, z) ∈ F2 (together with the “obvious” vector addition and
scalar multiplication operations). It is straightforward to check that erasing
the inner set of parentheses is an isomorphism (i.e., the linear map T :
F ⊕ F2 → F3 defined by T (x, (y, z)) = (x, y, z) is an isomorphism), so
F ⊕ F2 ≅ F3.
Theorem 1.B.9 (Bases of the External Direct Sum). Suppose V and W are vector spaces with bases B and C, respectively, and define the following subsets of the external direct sum V ⊕ W:

    B′ = { (v, 0) : v ∈ B }   and   C′ = { (0, w) : w ∈ C }.

It follows that

    (x, y) = (x, 0) + (0, y) = ( ∑_{i=1}^k ci vi, 0 ) + ( 0, ∑_{j=1}^m dj wj ) = ∑_{i=1}^k ci (vi, 0) + ∑_{j=1}^m dj (0, wj),
Corollary 1.B.10 (Dimension of (External) Direct Sums). Suppose V and W are finite-dimensional vector spaces with external direct sum V ⊕ W. Then

    dim(V ⊕ W) = dim(V) + dim(W).

[Side note: Compare this corollary (and its proof) to Corollary 1.B.3.]

Proof. Just observe that the sets B′ and C′ from Theorem 1.B.9 have empty intersection (keep in mind that they cannot even have (0, 0) in common, since B and C are bases so 0 ∉ B, C), so

    dim(V ⊕ W) = |B′ ∪ C′| = |B′| + |C′| = |B| + |C| = dim(V) + dim(W),

as claimed.
We close this section by noting that, in practice, people often just talk about
the direct sum, without specifying whether they mean the internal or external
one (much like we use the same notation for each of them). The reason for this
is two-fold:
• First, it is always clear from context whether a direct sum is internal or
external. If the components in the direct sum were first defined and then
a larger vector space was constructed via their direct sum, it is external.
On the other hand, if a single vector space was first defined and then
it was broken down into two component subspaces, the direct sum is
internal.
• Second, the internal and external direct sums are isomorphic in a natural
way. If V and W are vector spaces with external direct sum V ⊕ W, then
we cannot quite say that V ⊕ W is the internal direct sum of V and W,
since they are not even subspaces of V ⊕ W. However, V ⊕ W is the
internal direct sum of its subspaces
V′ = { (v, 0) : v ∈ V }   and   W′ = { (0, w) : w ∈ W },
1.B.3 Determine which of the following statements are true and which are false.
(a) If S1, S2 ⊆ V are subspaces such that V = S1 ⊕ S2 then V = S2 ⊕ S1 too.
∗(b) If V is an inner product space then V⊥ = {0}.
(c) If A ∈ Mm,n, v ∈ range(A), and w ∈ null(A) then v · w = 0.
∗(d) If A ∈ Mm,n, v ∈ range(A), and w ∈ null(A∗) then v · w = 0.
(e) If a vector space V has subspaces S1, S2, . . . , Sk satisfying span(S1 ∪ S2 ∪ · · · ∪ Sk) = V and Si ∩ Sj = {0} for all i ≠ j then V = S1 ⊕ · · · ⊕ Sk.
∗(f) The set {(e1, e1), (e1, e2), (e1, e3), (e2, e1), (e2, e2), (e2, e3)} is a basis of the external direct sum R2 ⊕ R3.
(g) The external direct sum P² ⊕ P³ has dimension 6.

∗∗1.B.4 Suppose V is a finite-dimensional vector space with basis {v1, v2, . . . , vk}. Show that V = span(v1) ⊕ span(v2) ⊕ · · · ⊕ span(vk).

∗∗1.B.5 Suppose V is a vector space with subspaces S1, S2 ⊆ V that have bases B and C, respectively. Show that if B ∪ C (as a multiset) is linearly independent then S1 ∩ S2 = {0}.
[Side note: This completes the proof of Theorem 1.B.2.]

∗∗1.B.6 Complete the proof of Theorem 1.B.4 by showing that condition (b)(ii) implies condition (b)(iii). That is, show that if V is a finite-dimensional vector space with subspaces S1, S2 ⊆ V such that dim(V) = dim(S1) + dim(S2) and S1 ∩ S2 = {0}, then V = S1 ⊕ S2.

∗∗1.B.7 Suppose V is a finite-dimensional vector space with subspaces S1, S2 ⊆ V. Show that there is at most one projection P : V → V with range(P) = S1 and null(P) = S2.
[Side note: As long as S1 ⊕ S2 = V, there is actually exactly one such projection, by Exercise 1.B.8.]

∗∗1.B.8 Suppose A, B ∈ Mm,n have linearly independent columns and range(A) ⊕ null(B∗) = Fm (where F = R or F = C).
(a) Show that B∗A is invertible.
(b) Show that P = A(B∗A)⁻¹B∗ is the projection onto range(A) along null(B∗).
[Side note: Compare this exercise with Exercise 1.4.30, which covered orthogonal projections.]

1.B.9 Let S be a subspace of a finite-dimensional inner product space V. Show that if P is the orthogonal projection onto S then I − P is the orthogonal projection onto S⊥.

∗∗1.B.10 Let B be a set of vectors in an inner product space V. Show that B⊥ is a subspace of V.

1.B.11 Suppose that B and C are sets of vectors in an inner product space V such that B ⊆ C. Show that C⊥ ⊆ B⊥.

1.B.12 Suppose V is a finite-dimensional inner product space and S, W1, W2 ⊆ V are subspaces for which V = S ⊕ W1 = S ⊕ W2.
(a) Show that if ⟨v, w1⟩ = ⟨v, w2⟩ = 0 for all v ∈ S, w1 ∈ W1 and w2 ∈ W2 then W1 = W2.
(b) Provide an example to show that W1 may not equal W2 if we do not have the orthogonality requirement of part (a).

1.B.13 Show that if V is a finite-dimensional inner product space and B ⊆ V then B⊥ = span(B)⊥.

∗∗1.B.14 Show that if V is a finite-dimensional inner product space and B ⊆ V then (B⊥)⊥ = span(B).
[Hint: Make use of Exercises 1.B.12 and 1.B.13.]

∗∗1.B.15 Prove part (b) of Theorem 1.B.7. That is, show that if V and W are finite-dimensional inner product spaces and T : V → W is a linear transformation then null(T)⊥ = range(T∗).

∗∗1.B.16 Suppose V and W are finite-dimensional inner product spaces and T : V → W is a linear transformation. Show that the linear transformation S : range(T∗) → range(T) defined by S(v) = T(v) is invertible.
[Side note: S is called the restriction of T to range(T∗).]

∗∗1.B.17 Let P^E[−1, 1] and P^O[−1, 1] denote the subspaces of even and odd polynomials, respectively, in P[−1, 1]. Show that (P^E[−1, 1])⊥ = P^O[−1, 1].

∗∗1.B.18 Let C[0, 1] be the inner product space of continuous functions on the real interval [0, 1] and let S ⊂ C[0, 1] be the subspace S = { f ∈ C[0, 1] : f(0) = 0 }.
(a) Show that S⊥ = {0}. [Hint: If f ∈ S⊥, consider the function g ∈ S defined by g(x) = x f(x).]
(b) Show that (S⊥)⊥ ≠ S.
[Side note: This result does not contradict Theorem 1.B.6 or Exercise 1.B.14 since C[0, 1] is not finite-dimensional.]

∗∗1.B.19 Show that if V and W are vector spaces over the same field then their external direct sum V ⊕ W is a vector space.

∗∗1.B.20 Complete the proof of Theorem 1.B.9 by showing that if B and C are bases of vector spaces V and W, respectively, then the set B′ ∪ C′ (where B′ and C′ are as defined in the statement of that theorem) is linearly independent.

∗∗1.B.21 In this exercise, we pin down the details that show that the internal and external direct sums are isomorphic. Let V and W be vector spaces over the same field.
(a) Show that the sets V′ = {(v, 0) : v ∈ V} and W′ = {(0, w) : w ∈ W} are subspaces of the external direct sum V ⊕ W.
(b) Show that V′ ≅ V and W′ ≅ W.
(c) Show that V ⊕ W = V′ ⊕ W′, where the direct sum on the left is external and the one on the right is internal.
1.C Extra Topic: The QR Decomposition
A = LU,
Theorem 1.C.1 (QR Decomposition). Suppose F = R or F = C, and A ∈ Mm,n(F). There exists a unitary matrix U ∈ Mm(F) and an upper triangular matrix T ∈ Mm,n(F) with non-negative real entries on its diagonal such that

    A = UT.
Proof of Theorem 1.C.1. We start by proving this result in the special case when m ≥ n and A has linearly independent columns. We partition A as a block matrix according to its columns, which we denote by v1, v2, . . ., vn ∈ Fm:

    A = [ v1 | v2 | · · · | vn ].

[Side note: In particular, this first argument covers the case when A is square and invertible (see Theorem A.1.1).]

Applying the Gram–Schmidt process (Theorem 1.4.6) to v1, . . ., vn gives an orthonormal set {u1, . . ., un} for which

    vj = t1,j u1 + t2,j u2 + · · · + tj,j uj,

where

    tj,j = ‖vj − ∑_{i=1}^{j−1} (ui · vj) ui‖   and   ti,j = { ui · vj  if i < j,
                                                             0        if i > j.

[Side note: This formula for ti,j follows from rearranging the formula in Theorem 1.4.6 so as to solve for vj (and choosing the inner product to be the dot product). It also follows from Theorem 1.4.5, which tells us that tj,j = uj · vj too.]

We then extend {u1, u2, . . . , un} to an orthonormal basis {u1, u2, . . . , um} of Fm and define U = [ u1 | u2 | · · · | um ], noting that orthonormality of its columns implies that U is unitary. We also define T ∈ Mm,n to be the upper triangular matrix with ti,j as its (i, j)-entry for all 1 ≤ i ≤ m and 1 ≤ j ≤ n (noting that its diagonal entries tj,j are clearly real and non-negative, as required). Block matrix multiplication then shows that the j-th column of UT is

    UT ej = [ u1 | u2 | · · · | um ] (t1,j, . . . , tj,j, 0, . . . , 0)ᵀ = t1,j u1 + t2,j u2 + · · · + tj,j uj = vj,
Example 1.C.1 (Computing a QR Decomposition). Compute the QR decomposition of the matrix

    A = [  1  3  3
           2  2 −2
          −2  2  1 ].

Solution:
As suggested by the proof of Theorem 1.C.1, we can find the QR decomposition of A by applying the Gram–Schmidt process to the columns v1, v2, and v3 of A to recursively construct mutually orthogonal vectors wj = vj − ∑_{i=1}^{j−1}(ui · vj)ui and their normalizations uj = wj/‖wj‖ for j = 1, 2, 3:

    j | wj           | uj            | uj · v1 | uj · v2 | uj · v3
    1 | (1, 2, −2)   | (1, 2, −2)/3  |    3    |    1    |   −1
    2 | (8, 4, 8)/3  | (2, 1, 2)/3   |    –    |    4    |    2
    3 | (2, −2, −1)  | (2, −2, −1)/3 |    –    |    –    |    3

[Side note: The "lower-triangular" inner products like u2 · v1 exist, but are irrelevant for the Gram–Schmidt process and QR decomposition. Also, the "diagonal" inner products come for free since uj · vj = ‖wj‖, and we already computed this norm when computing uj = wj/‖wj‖.]

It follows that A has QR decomposition A = UT, where

    U = [ u1 | u2 | u3 ] = (1/3) [  1  2  2
                                    2  1 −2
                                   −2  2 −1 ]
and
    T = [ u1·v1  u1·v2  u1·v3       [ 3  1 −1
            0    u2·v2  u2·v3   =     0  4  2
            0      0    u3·v3 ]       0  0  3 ].
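A quick cross-check of this example can be done in NumPy (not a tool used in the text). Note that np.linalg.qr may return a factor R with negative diagonal entries, so the signs are flipped below to match the convention of Theorem 1.C.1:

```python
# Cross-check Example 1.C.1 against NumPy's built-in QR decomposition.
import numpy as np

A = np.array([[1., 3., 3.],
              [2., 2., -2.],
              [-2., 2., 1.]])

Q, R = np.linalg.qr(A)
signs = np.sign(np.diag(R))
U, T = Q * signs, signs[:, None] * R   # A = (Q D)(D R) with D = diag(signs), D^2 = I

print(np.round(3 * U))        # matches [[1, 2, 2], [2, 1, -2], [-2, 2, -1]]
print(np.round(T))            # matches [[3, 1, -1], [0, 4, 2], [0, 0, 3]]
print(np.allclose(U @ T, A))  # True
```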
Example 1.C.2 (Computing a Rectangular QR Decomposition). Compute a QR decomposition of the matrix

    A = [  3  0  1  2
          −2 −1 −3  2
          −6 −2 −2  5 ].

Solution:
Since A has more columns than rows, its columns cannot possibly form a linearly independent set. We thus just apply the Gram–Schmidt process to its leftmost 3 columns v1, v2, and v3, while just computing dot products with its 4th column v4 for later use:

[Side note: We showed in the proof of Theorem 1.C.1 that the 4th column of T is U∗v4, whose entries are exactly the dot products in the final column here: uj · v4 for j = 1, 2, 3.]

    j | wj              | uj             | uj · v1 | uj · v2 | uj · v3 | uj · v4
    1 | (3, −2, −6)     | (3, −2, −6)/7  |    7    |    2    |    3    |   −4
    2 | (−6, −3, −2)/7  | (−6, −3, −2)/7 |    –    |    1    |    1    |   −4
    3 | (4, −12, 6)/7   | (2, −6, 3)/7   |    –    |    –    |    2    |    1
Example 1.C.3 (Computing a Tall/Thin QR Decomposition). Compute a QR decomposition of the matrix

    A = [ −1 −3  1
          −2  2 −2
           2 −2  0
           4  0 −2 ].

Solution:
Since A has more rows than columns, we can start by applying the Gram–Schmidt process to its columns, but this will only get us the leftmost 3 columns of the unitary matrix U:

    j | wj                | uj               | uj · v1 | uj · v2 | uj · v3
    1 | (−1, −2, 2, 4)    | (−1, −2, 2, 4)/5 |    5    |   −1    |   −1
    2 | (−16, 8, −8, 4)/5 | (−4, 2, −2, 1)/5 |    –    |    4    |   −2
    3 | (−4, −8, −2, −4)/5| (−2, −4, −1, −2)/5|   –    |    –    |    2

To find its 4th column, we just extend its first 3 columns {u1, u2, u3} to an orthonormal basis of R4. Up to sign, the unique unit vector u4 that works as the 4th column of U is u4 = (2, −1, −4, 2)/5, so it follows that A has QR decomposition A = UT, where

    U = [ u1 | u2 | u3 | u4 ] = (1/5) [ −1 −4 −2  2
                                        −2  2 −4 −1
                                         2 −2 −1 −4
                                         4  1 −2  2 ]
and
    T = [ u1·v1  u1·v2  u1·v3       [ 5 −1 −1
            0    u2·v2  u2·v3   =     0  4 −2
            0      0    u3·v3         0  0  2
            0      0      0   ]       0  0  0 ].

[Side note: We showed that every mutually orthogonal set of unit vectors can be extended to an orthonormal basis in Exercise 1.4.20. To do so, just add a vector not in the span of the current set, apply Gram–Schmidt, and repeat.]
Remark 1.C.1 (Computing QR Decompositions). The method of computing the QR decomposition that we presented here, based on the Gram–Schmidt process, is typically not actually used in practice. The reason for this is that the Gram–Schmidt process is numer-
ically unstable. If a set of vectors is “close” to linearly dependent then
changing those vectors even slightly can drastically change the resulting
orthonormal basis, and thus small errors in the entries of A can lead to a
wildly incorrect QR decomposition.
Numerically stable methods for computing the QR decomposition (and
numerical methods for linear algebraic tasks in general) are outside of the
scope of this book, so the interested reader is directed to a book like [TB97]
for their treatment.
Ax = UT x = U(T x) = Uy = b,
Example 1.C.4 (Solving Linear Systems via a QR Decomposition). Use the QR decomposition to find all solutions of the linear system

    [  3  0  1  2    [ w       [ 1
      −2 −1 −3  2      x    =    0
      −6 −2 −2  5 ]    y         4 ].
                       z ]

Solution:
We constructed the following QR decomposition A = UT of the coefficient matrix A in Example 1.C.2:

    U = (1/7) [  3 −6  2
                −2 −3 −6
                −6 −2  3 ]
and
    T = [ 7  2  3 −4
          0  1  1 −4
          0  0  2  1 ].

It follows that z is a free variable, and w, x, and y are leading variables with w = −z/2, x = −3 + 9z/2, and y = 1 − z/2. It follows that the solutions of this linear system (as well as the original linear system) are the vectors of the form (w, x, y, z) = (0, −3, 1, 0) + z(−1, 9, −1, 2)/2.
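The arithmetic in this example is easy to verify with NumPy (again, not a tool used in the text): since U is unitary (here real orthogonal), the triangular system is Tx = U∗b, and the stated solution family does satisfy the original system.

```python
# Verify Example 1.C.4: A = UT, compute y = U^T b, and check the solution family.
import numpy as np

A = np.array([[3., 0., 1., 2.], [-2., -1., -3., 2.], [-6., -2., -2., 5.]])
b = np.array([1., 0., 4.])
U = np.array([[3., -6., 2.], [-2., -3., -6.], [-6., -2., 3.]]) / 7
T = np.array([[7., 2., 3., -4.], [0., 1., 1., -4.], [0., 0., 2., 1.]])

print(np.allclose(U @ T, A))   # True: A = UT as in Example 1.C.2
print(U.T @ b)                 # [-3., -2., 2.]: the right-hand side of Tx = y

particular = np.array([0., -3., 1., 0.])
direction = np.array([-1., 9., -1., 2.]) / 2
for z in (0.0, 1.0, -2.5):
    print(np.allclose(A @ (particular + z * direction), b))   # True for every z
```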
Remark 1.C.2 (Multiple Methods for Solving Repeated Linear Systems). While solving a linear system via the QR decomposition is simpler than solving it directly via Gaussian elimination, actually computing the QR decomposition in the first place takes just as long as solving the linear system directly. For this reason, the QR decomposition is typically only used in this context to solve multiple linear systems, each of which has the same coefficient matrix (which we can pre-compute a QR decomposition of) but different right-hand-side vectors.

There are two other standard methods for solving repeated linear systems of the form Axj = bj ( j = 1, 2, 3, . . .) that it is worth comparing to the QR decomposition:
• We could pre-compute A⁻¹ and then set xj = A⁻¹bj for each j. This method has the advantage of being conceptually simple, but it is slower and less numerically stable than the QR decomposition.
• We could pre-compute an LU decomposition A = LU, where L is lower triangular and U is upper triangular, and then solve the pair of triangular linear systems Lyj = bj and Uxj = yj for each j. This method is roughly twice as quick as the QR decomposition, but its numerical stability lies somewhere between that of the QR decomposition and the method based on A⁻¹ described above.

[Side note: If A ∈ Mn then all three of these methods take O(n³) operations to do the pre-computation and then O(n²) operations to solve a linear system, compared to the O(n³) operations needed to solve a linear system directly via Gaussian elimination.]

Again, justification of the above claims is outside of the scope of this book, so we direct the interested reader to a book on numerical linear algebra like [TB97].
Proof of Theorem 1.C.1. We just string together three facts about the determi-
nant that we already know:
[Side note: We review these properties of determinants in Appendix A.1.5.]
• The determinant is multiplicative, so |det(A)| = |det(U)||det(T)|,
• U is unitary, so Exercise 1.4.11 tells us that |det(U)| = 1, and
• T is upper triangular, so |det(T)| = |t1,1 · t2,2 · · · tn,n| = t1,1 · t2,2 · · · tn,n,
with the final equality following from the fact that the diagonal entries of
T are non-negative.
1.C.3 Determine which of the following statements are true and which are false.
∗(a) Every matrix A ∈ Mm,n(R) has a QR decomposition A = UT with U and T real.
(b) If A ∈ Mn has QR decomposition A = UT then det(A) = t1,1 · t2,2 · · · tn,n.
∗(c) If A ∈ Mm,n has QR decomposition A = UT then range(A) = range(T).
(d) If A ∈ Mn is invertible and has QR decomposition A = UT then, for each 1 ≤ j ≤ n, the span of the leftmost j columns of A equals the span of the leftmost j columns of U.

1.C.4 Suppose that A ∈ Mn is an upper triangular matrix.
(a) Show that A is invertible if and only if all of its diagonal entries are non-zero.
(b) Show that if A is invertible then A⁻¹ is also upper triangular. [Hint: First show that if b has its last k entries equal to 0 (for some k) then the solution x to Ax = b also has its last k entries equal to 0.]
(c) Show that if A is invertible then the diagonal entries of A⁻¹ are the reciprocals of the diagonal entries of A, in the same order.

∗∗1.C.5 Show that if A ∈ Mn is invertible then its QR decomposition is unique.
[Hint: Exercise 1.4.12 might help.]

1.C.6 Suppose that A ∈ M2 is the (non-invertible) matrix with QR decomposition A = UT, where

    U = (1/5) [ 3  4        T = [ 0  0
                4 −3 ]  and       0  1 ].

Find another QR decomposition of A.
[Side note: Contrast this with the fact from Exercise 1.C.5 that QR decompositions of invertible matrices are unique.]

∗1.C.7 In this exercise, we investigate when the QR decomposition of a rectangular matrix A ∈ Mm,n is unique.
(a) Show that if n ≥ m and the left m × m block of A is invertible then its QR decomposition is unique.
(b) Provide an example that shows that if n < m then the QR decomposition of A may not be unique, even if its top n × n block is invertible.

1.C.8 Suppose A ∈ Mm,n. In this exercise, we demonstrate the existence of several variants of the QR decomposition.
(a) Show that there exists a unitary matrix U and a lower triangular matrix S such that A = US.
[Hint: What happens to a matrix's QR decomposition if we swap its rows and/or columns?]
(b) Show that there exists a unitary matrix U and an upper triangular matrix T such that A = TU.
(c) Show that there exists a unitary matrix U and a lower triangular matrix S such that A = SU.
Definition 1.D.1 (Norm of a Vector). Suppose that F = R or F = C and that V is a vector space over F. Then a norm on V is a function ‖·‖ : V → R such that the following three properties hold for all c ∈ F and all v, w ∈ V:
a) ‖cv‖ = |c|‖v‖ (absolute homogeneity)
b) ‖v + w‖ ≤ ‖v‖ + ‖w‖ (triangle inequality)
c) ‖v‖ ≥ 0, with equality if and only if v = 0 (positive definiteness)
The motivation for why each of the above defining properties should hold
for any reasonable measure of “length” or “size” is hopefully somewhat clear—
(a) if we multiply a vector by a scalar then its length is scaled by the same amount,
(b) the shortest path between two points is the straight line connecting them,
and (c) we do not want lengths to be negative.
Every norm induced by an inner product is indeed a norm, as was estab-
lished by Theorems 1.3.7 (which established properties (a) and (c)) and 1.3.9
(which established property (b)). However, there are also many different norms
out there that are not induced by any inner product, both on Rn and on other
vector spaces. We now present several examples.
c) The fact that ‖v‖1 = |v1| + · · · + |vn| ≥ 0 is clear. To see that ‖v‖1 = 0 if and only if v = 0, we similarly just notice that |v1| + · · · + |vn| = 0 if and only if v1 = · · · = vn = 0 (indeed, if any vj were non-zero then |vj| > 0, so |v1| + · · · + |vn| > 0 too).

[Figure: the vector v = (4, 3) in R2, which has ‖v‖1 = 4 + 3 = 7.]
In analogy with the 1-norm from the above example, we refer to the usual vector length on Cn, defined by

    ‖v‖2 = √(|v1|² + |v2|² + · · · + |vn|²),

as the 2-norm in this section, in reference to the exponent that appears in the terms being summed. We sometimes denote it by ‖·‖2 instead of ‖·‖ to avoid confusion with the notation used for norms in general.
Example 1.D.2 (The ∞-Norm (or "Max" Norm)). The ∞-norm (or max norm) on Cn is the function defined by

    ‖v‖∞ = max_{1≤j≤n} |vj|   for all v ∈ Cn.
Figure 1.27: The unit balls for the 1-, 2-, and ∞-norms on R2 .
to explicitly find constants c and C that work. For example, we can directly
check that for any vector v ∈ Cn we have
    ‖v‖∞ = max_{1≤j≤n} |vj| ≤ |v1| + |v2| + · · · + |vn| = ‖v‖1,
[Side note: Be careful—the unit ball of the norm 2‖·‖∞ is half as large as that of ‖·‖∞, not twice as big. After all, 2‖·‖∞ ≤ 1 if and only if ‖·‖∞ ≤ 1/2.]

Figure 1.28: An illustration in R2 of the fact that (a) ‖v‖∞ ≤ ‖v‖1 ≤ 2‖v‖∞ and (b) ½‖v‖1 ≤ ‖v‖∞ ≤ ‖v‖1. In particular, one norm is lower bounded by another one if and only if the unit ball of the former norm is contained within the unit ball of the latter.
which is exactly the ∞-norm of v (and thus explains why we called it the
∞-norm in the first place). To informally see why this limit holds, we just
notice that increasing the exponent p places more and more importance on the
largest component of v compared to the others (this is proved more precisely in
Exercise 1.D.8).
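This limiting behaviour is easy to see numerically. The following short sketch (not from the text, assuming NumPy) shows the p-norms of a fixed vector shrinking toward its ∞-norm as p grows:

```python
# The p-norm of a fixed vector decreases toward its infinity-norm as p grows.
import numpy as np

v = np.array([3.0, -4.0, 1.0])
for p in [1, 2, 4, 10, 100]:
    print(p, np.linalg.norm(v, ord=p))
print("inf", np.linalg.norm(v, ord=np.inf))   # 4.0, the limit of the values above
```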
To verify that the p-norm is indeed a norm, we have to check the three properties of norms from Definition 1.D.1. Properties (a) and (c) (absolute homogeneity and positive definiteness) are straightforward enough:

[Side note: It is commonly the case that the triangle inequality is the difficult property of a norm to verify, while the other two properties (absolute homogeneity and positive definiteness) are much simpler.]

a) If v ∈ Cn and c ∈ C then

    ‖cv‖p = (|cv1|^p + · · · + |cvn|^p)^{1/p} = |c|(|v1|^p + · · · + |vn|^p)^{1/p} = |c|‖v‖p.

c) The fact that ‖v‖p ≥ 0 is clear. To see that ‖v‖p = 0 if and only if v = 0, we just notice that ‖v‖p = 0 if and only if |v1|^p + · · · + |vn|^p = 0, if and only if v1 = · · · = vn = 0, if and only if v = 0.
Proving that the triangle inequality holds for the p-norm is much more
involved, so we state it separately as a theorem. This inequality is impor-
tant enough and useful enough that it is given its own name—it is called
Minkowski’s inequality.
[Figure: the graph of f(x) = x^p, with the chord between the points (x1, x1^p) and (x2, x2^p) lying above the graph; it illustrates the inequality (1 − t)x1^p + tx2^p ≥ ((1 − t)x1 + tx2)^p.]
[Side note: For an introduction to convex functions, see Appendix A.5.2.]

Figure 1.29: The function f(x) = x^p (pictured here with p = 2.3) is convex on the interval [0, ∞), so any line segment between two points on its graph lies above the graph itself.
Taking the p-th root of both sides of this inequality gives us exactly Minkowski’s
inequality and thus completes the proof.
Equivalence and Hölder’s Inequality
Now that we know that p-norms are indeed norms, it is instructive to try to draw
their unit balls, much like we did in Figure 1.27 for the 1-, 2-, and ∞-norms.
The following theorem shows that the p-norm can only decrease as p increases,
which will help us draw the unit balls shortly.
Since w = v/‖v‖q, this implies that ‖w‖p = ‖v‖p/‖v‖q ≥ 1, so ‖v‖p ≥ ‖v‖q, as desired.
The argument above covers the case when both p and q are finite. The case
when q = ∞ is proved in Exercise 1.D.7.
By recalling that larger norms have smaller unit balls, we can interpret Theorem 1.D.3 as saying that the unit ball of the p-norm is contained within the unit ball of the q-norm whenever 1 ≤ p ≤ q ≤ ∞. When we combine this with the fact that we already know what the unit balls look like when p = 1, 2, or ∞ (refer back to Figure 1.27), we get a pretty good idea of what they look like for all values of p (see Figure 1.30). In particular, the sides of the unit ball gradually "bulge out" as p increases from 1 to ∞.

[Side note: The reason for the unit ball inclusion is that ‖v‖p ≤ 1 implies ‖v‖q ≤ ‖v‖p = 1 too. Generally (even beyond just p-norms), larger norms have smaller unit balls.]
Figure 1.30: The unit ball of the p-norm on R2 for several values of 1 ≤ p ≤ ∞.
and

    ‖x‖2 = √(∑_{j=1}^n 1²) = √n   and   ‖y‖2 = √(∑_{j=1}^n |vj|²) = ‖v‖2.

It follows that ‖v‖1 = |x · y| ≤ ‖x‖2‖y‖2 = √n‖v‖2, as claimed.

To prove the second inequality, we notice that

    ‖v‖2 = √(∑_{i=1}^n |vi|²) ≤ √(∑_{i=1}^n (max_{1≤j≤n} |vj|)²) = √(∑_{i=1}^n ‖v‖∞²) = √n‖v‖∞.

[Side note: Again, the inequality here follows simply because each |vi| is no larger than the largest |vi|.]
Before we can prove an analogous result for arbitrary p- and q-norms, we
first need one more technical helper theorem. Just like Minkowski’s inequality
(Theorem 1.D.2) generalizes the triangle inequality from the 2-norm to arbitrary
p-norms, the following theorem generalizes the Cauchy–Schwarz inequality
from the 2-norm to arbitrary p-norms.
Proof of Theorem 1.D.5. Before proving the desired inequality involving vectors, we first prove the following inequality involving real numbers x, y ≥ 0:

    xy ≤ x^p/p + y^q/q.        (1.D.2)

[Side note: Inequality (1.D.2) is called Young's inequality, and it depends on the fact that 1/p + 1/q = 1.]

To see why this inequality holds, first notice that it is trivial if x = 0 or y = 0, so we can assume from now on that x, y > 0. Consider the function f : (0, ∞) → R defined by f(x) = ln(x). Standard calculus techniques show that f is concave (sometimes called "concave down" in introductory calculus courses) whenever x > 0—any line connecting two points on the graph of f lies below its graph (see Figure 1.32). [Side note: ln(x) refers to the natural logarithm (i.e., the base-e logarithm) of x.]
[Figure: the graph of f(x) = ln(x), with the chord between (x1, ln(x1)) and (x2, ln(x2)) lying below the graph; it illustrates the inequality ln((1 − t)x1 + tx2) ≥ (1 − t)ln(x1) + t·ln(x2).]
[Side note: To see that f is concave, we notice that its second derivative is f′′(x) = −1/x², which is negative as long as x > 0 (see Theorem A.5.2).]

Figure 1.32: The function f(x) = ln(x) is concave on the interval (0, ∞), so any line segment between two points on its graph lies below the graph itself.
Example 1.D.3 (Computing the p-Norm of a Function). Compute the p-norms of the function f(x) = x + 1 in C[0, 1].

Solution:
For finite values of p, we just directly compute the value of the desired integral:

    ‖x + 1‖p = ( ∫₀¹ (x + 1)^p dx )^{1/p} = ( (x + 1)^{p+1}/(p + 1) |_{x=0}^{1} )^{1/p} = ( (2^{p+1} − 1)/(p + 1) )^{1/p}.

[Side note: We do not have to take the absolute value of f since f(x) ≥ 0 on the interval [0, 1].]

For the p = ∞ case, we just notice that the maximum value of f(x) on the interval [0, 1] is

    ‖x + 1‖∞ = max_{0≤x≤1} {x + 1} = 1 + 1 = 2.
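A quick symbolic check of this example (not from the text, assuming SymPy) evaluates the closed form for several values of p and shows the values approaching the ∞-norm:

```python
# p-norms of f(x) = x + 1 on [0, 1]: exact values via SymPy for several p.
import sympy as sp

x = sp.symbols('x')
for p in [1, 2, 5, 20, 100]:
    norm_p = sp.integrate((x + 1)**p, (x, 0, 1))**sp.Rational(1, p)
    print(p, float(norm_p))
# p = 1 gives 3/2, p = 2 gives sqrt(7/3) ~ 1.53, and the values approach
# the infinity-norm max_{0 <= x <= 1} (x + 1) = 2 as p grows.
```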
Furthermore, most of the previous theorems that we saw concerning p-norms carry through with minimal changes to this new setting too, simply by replacing vectors by functions and sums by integrals. To illustrate this fact, we now prove Minkowski's inequality for the p-norm of a function. However, we leave numerous other properties of the p-norm of a function to the exercises. [Side note: One property of p-norms that does not carry over is the rightmost inequality in Theorem 1.D.6.]
Taking the p-th root of both sides of this inequality gives us exactly Minkowski’s
inequality.
Minkowski's inequality establishes the triangle inequality of the p-norm of a function, just like it did for the p-norm of vectors in Cn. The other two defining properties of norms (absolute homogeneity and positive definiteness) are much easier to prove, and are left as Exercise 1.D.13. [Side note: We prove Hölder's inequality for functions in Exercise 1.D.14.]
Theorem 1.D.8 (Jordan–von Neumann Theorem). Let V be a normed vector space. Then ‖·‖V is induced by an inner product if and only if

    2‖v‖V² + 2‖w‖V² = ‖v + w‖V² + ‖v − w‖V²   for all v, w ∈ V.
Before proving this theorem, we note that the equation 2‖v‖V² + 2‖w‖V² = ‖v + w‖V² + ‖v − w‖V² is sometimes called the parallelogram law, since it relates the lengths of the sides of a parallelogram to the lengths of its diagonals, as illustrated in Figure 1.33. In particular, it says that the sum of squares of the side lengths of a parallelogram always equals the sum of squares of the diagonal lengths of that parallelogram, as long as the way that we measure "length" comes from an inner product.

[Side note: The parallelogram law is a generalization of the Pythagorean theorem, which is what we get if we apply the parallelogram law to rectangles.]

[Figure: a parallelogram with sides v and w and diagonals v + w and v − w.]

Figure 1.33: The Jordan–von Neumann theorem says that a norm is induced by an inner product if and only if the sum of squares of norms of diagonals of a parallelogram equals the sum of squares of norms of its sides (i.e., if and only if the parallelogram law holds for that norm).

[Side note: We originally proved the "only if" direction back in Exercise 1.3.13.]

Proof of Theorem 1.D.8. To see the "only if" claim, we note that if ‖·‖V is induced by an inner product ⟨·, ·⟩, then
    ‖v + w‖V² + ‖v − w‖V² = ⟨v + w, v + w⟩ + ⟨v − w, v − w⟩
                          = ⟨v, v⟩ + ⟨v, w⟩ + ⟨w, v⟩ + ⟨w, w⟩ + ⟨v, v⟩ − ⟨v, w⟩ − ⟨w, v⟩ + ⟨w, w⟩
                          = 2⟨v, v⟩ + 2⟨w, w⟩ = 2‖v‖V² + 2‖w‖V².
The “if” direction of the proof is much more involved, and we only prove
the case when the underlying field is F = R (we leave the F = C case to
Exercise 1.D.16). To this end, suppose that the parallelogram law holds and
define the following function ⟨·, ·⟩ : V × V → R (which we will show is an inner product):

    ⟨v, w⟩ = ¼(‖v + w‖V² − ‖v − w‖V²).

[Side note: This formula for an inner product in terms of its induced norm is sometimes called the polarization identity. We first encountered this identity back in Exercise 1.3.14.]

To see that this function really is an inner product, we have to check the three defining properties of inner products from Definition 1.3.6. Property (a) is straightforward, since

    ⟨v, w⟩ = ¼(‖v + w‖V² − ‖v − w‖V²) = ¼(‖w + v‖V² − ‖w − v‖V²) = ⟨w, v⟩.

Similarly, property (c) follows fairly quickly from the definition, since

    ⟨v, v⟩ = ¼(‖v + v‖V² − ‖v − v‖V²) = ¼‖2v‖V² = ‖v‖V²,

which is non-negative and equals 0 if and only if v = 0. In fact, this furthermore shows that the norm induced by this inner product is indeed ‖v‖V. [Side note: Properties (a) and (c) do not actually depend on the parallelogram law holding—that is only required for property (b).]

All that remains is to prove that property (b) of inner products holds (i.e., ⟨v, w + cx⟩ = ⟨v, w⟩ + c⟨v, x⟩ for all v, w, x ∈ V and all c ∈ R). This task is significantly more involved than proving properties (a) and (c) was, so we split it up into four steps:
i) First, we show that ⟨v, w + x⟩ = ⟨v, w⟩ + ⟨v, x⟩ for all v, w, x ∈ V. To this end we use the parallelogram law (i.e., the hypothesis of this theorem) to see that
By repeating this entire argument with w and x replaced by −w and −x, respectively, we similarly see that

    ‖v − w − x‖V² = ‖w‖V² + ‖x‖V² + ‖v − w‖V² + ‖v − x‖V² − ½‖v + w − x‖V² − ½‖v − w + x‖V².        (1.D.4)

[Side note: Inner products provide more structure than just a norm does, so this theorem can be thought of as telling us exactly when that extra structure is present.]

Finally, we can use Equations (1.D.3) and (1.D.4) to get what we want:

    ⟨v, w + x⟩ = ¼(‖v + w + x‖V² − ‖v − w − x‖V²)
               = ¼(‖v + w‖V² + ‖v + x‖V² − ‖v − w‖V² − ‖v − x‖V²)
               = ¼(‖v + w‖V² − ‖v − w‖V²) + ¼(‖v + x‖V² − ‖v − x‖V²)
               = ⟨v, w⟩ + ⟨v, x⟩,

    q⟨v, cx⟩ = q⟨v, p(x/q)⟩ = qp⟨v, x/q⟩ = p⟨v, qx/q⟩ = p⟨v, x⟩,

where we used (ii) and the fact that p and q are integers in the second and third equalities. Dividing the above equation through by q then gives ⟨v, cx⟩ = (p/q)⟨v, x⟩ = c⟨v, x⟩, as desired.
iv) Finally, to see that ⟨v, cx⟩ = c⟨v, x⟩ for all c ∈ R, we use the fact that, for each fixed v, w ∈ V, the function f : R → R defined by

    f(c) = (1/c)⟨v, cw⟩ = (1/(4c))(‖v + cw‖V² − ‖v − cw‖V²)

is continuous (this follows from the fact that all norms are continuous when restricted to a finite-dimensional subspace like span(v, w)—see Section 2.D and Appendix B.1 for more discussion of this fact—and that compositions and sums/differences of continuous functions are continuous). [Side note: Yes, this proof is still going on.]
Well, we just showed in (iii) that f(c) = ⟨v, w⟩ for all c ∈ Q. When combined with continuity of f, this means that f(c) = ⟨v, w⟩ for all c ∈ R. This type of fact is typically covered in analysis courses and texts, but hopefully it is intuitive enough even if you have not taken such a course—the rational numbers are dense in R, so if there were a real number c with f(c) ≠ ⟨v, w⟩ then the graph of f would have to "jump" at c (see Figure 1.34) and thus f would not be continuous.

[Side note: We expand on this type of argument considerably in Section 2.D.]
[Figure 1.34 plots a function f(c) that equals ⟨v, w⟩ for all c ∈ Q but satisfies f(c1) ≠ ⟨v, w⟩ at some c1 ∉ Q.]

Figure 1.34: If a continuous function f(c) is constant on the rationals, then it must be constant on all of R—otherwise, it would have a discontinuity.
We have thus shown that h·, ·i is indeed an inner product, and that it induces
k · kV , so the proof is complete.
The parallelogram law thus fails whenever 4 ≠ 2^(1+2/p) (i.e., whenever p ≠ 2): testing it on the standard basis vectors e1 and e2 gives ‖e1 + e2‖_p² + ‖e1 − e2‖_p² = 2·2^(2/p) = 2^(1+2/p), whereas 2‖e1‖_p² + 2‖e2‖_p² = 4. It follows from Theorem 1.D.8 that the p-norms are not induced by any inner product when p ≠ 2.
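This failure of the parallelogram law for p ≠ 2 is also easy to see numerically. The following snippet is a quick check of my own (it is not from the book, and it assumes NumPy is available); it tests the law on e1 and e2 in R² for a few values of p:

import numpy as np

v, w = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for p in [1, 1.5, 2, 3, np.inf]:
    lhs = np.linalg.norm(v + w, p)**2 + np.linalg.norm(v - w, p)**2
    rhs = 2 * np.linalg.norm(v, p)**2 + 2 * np.linalg.norm(w, p)**2
    # lhs equals 2^(1 + 2/p) while rhs equals 4; they agree only when p = 2
    print(f"p = {p}: lhs = {lhs:.4f}, rhs = {rhs:.4f}")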
In fact, the only norms that we have investigated so far that are induced
by an inner product are the “obvious” ones: the 2-norm on Fn and C[a, b], the
Frobenius norm on Mm,n (F), and the norms that we can construct from the
inner products described by Corollary 1.4.4.
1.D.3 Isometries
Recall from Section 1.4.3 that if F = R or F = C then a unitary matrix U ∈
Mn (F) is one that preserves the usual 2-norm of all vectors: kUvk2 = kvk2
for all v ∈ Fn . Unitary matrices are extraordinary for the fact that they have
so many different, but equivalent, characterizations (see Theorem 1.4.9). In
this section, we generalize this idea and look at what we can say about linear
transformations that preserve any particular norm, including ones that are not
induced by any inner product.
Definition 1.D.4 (Isometries). Let V and W be normed vector spaces and let T : V → W be a linear transformation. We say that T is an isometry if ‖T(v)‖_W = ‖v‖_V for all v ∈ V.

[Side note: In other words, a unitary matrix is an isometry of the 2-norm on Fn.]

For example, the linear transformation T : R² → R³ (where we use the usual 2-norm on R² and R³) defined by T(v1, v2) = (v1, v2, 0) is an isometry since

‖T(v)‖2 = √(v1² + v2² + 0²) = √(v1² + v2²) = ‖v‖2 for all v ∈ R².
[Side note: Recall that a rotation matrix (by an angle θ counter-clockwise) is a matrix of the form [cos(θ) −sin(θ); sin(θ) cos(θ)].]

Geometrically, part (a) makes sense because U acts as a rotation counter-clockwise by π/4:

U = [cos(π/4)  −sin(π/4); sin(π/4)  cos(π/4)],

and rotating vectors does not change their 2-norm (see Example 1.4.12). However, rotating vectors can indeed change their 1-norm, as demonstrated in part (b):
[Figure: rotating e1 = (1, 0) counter-clockwise by π/4 gives Ue1 = (1, 1)/√2. Both vectors have 2-norm equal to 1, but ‖Ue1‖1 = 1/√2 + 1/√2 = √2 while ‖e1‖1 = 1.]
In situations where we have an inner product to work with (like in part (a)
of the previous example), we can characterize isometries in much the same way
that we characterized unitary matrices back in Section 1.4.3. In fact, the proof
of the upcoming theorem is so similar to that of the equivalence of conditions
(d)–(f) in Theorem 1.4.9 that we defer it to Exercise 1.D.22.
Theorem 1.D.9 Let V and W be inner product spaces and let T : V → W be a linear
Characterization transformation. Then the following are equivalent:
of Isometries a) T is an isometry (with respect to the norms induced by the inner
products on V and W),
b) T ∗ ◦ T = IV , and
c) hT (x), T (y)iW = hx, yiV for all x, y ∈ V.
If we take the standard matrix of both sides of part (b) of this theorem with respect to an orthonormal basis B of V, then we see that T is an isometry if and only if

[T]_B∗ [T]_B = [T∗]_B [T]_B = [T∗ ∘ T]_B = [I_V]_B = I.

[Side note: The first equality here follows from Theorem 1.4.8 and relies on B being an orthonormal basis.]

In the special case when V = W, this observation can be phrased as follows:
! If B is an orthonormal basis of an inner product space V, then
T : V → V is an isometry of the norm induced by the inner
product on V if and only if [T ]B is unitary.
On the other hand, if a norm is not induced by an inner product then it can
be quite a bit more difficult to get our hands on what its isometries are. The
following theorem provides the answer for the ∞-norm:
P ∈ Mn(F) is an isometry of the ∞-norm if and only if every row and column of P has a single non-zero entry, and that entry has magnitude (i.e., absolute value) equal to 1.

[Side note: Real matrices of this type are sometimes called signed permutation matrices.]

Before proving this theorem, it is worth illustrating exactly what it means. When F = R, it says that isometries of the ∞-norm only have entries 1, −1, and 0, and each row and column has exactly one non-zero entry. For example,

P = [0  −1  0; 0  0  1; −1  0  0]
In particular, this implies that every column of P has at least one entry with
absolute value 1. Similarly, P(e j ± ek ) = p j ± pk for all j, k, so
max_{1≤i≤n} |pi,j ± pi,k| = ‖pj ± pk‖∞ = ‖P(ej ± ek)‖∞ = ‖ej ± ek‖∞ = 1 for all 1 ≤ j ≠ k ≤ n.

It follows that if |pi,j| = 1 then pi,k = 0 for all k ≠ j (i.e., the rest of the i-th row of P equals zero). Since each column pj contains an entry with |pi,j| = 1, it follows that every row and column contains exactly one non-zero entry—the one with absolute value 1.

[Side note: It turns out that the 1-norm has the exact same isometries (see Exercise 1.D.24). In fact, all p-norms other than the 2-norm have these same isometries [CL92, LS94], but proving this is beyond the scope of this book.]
of this book. In particular, it is worth noting that the above theorem says that all isome-
tries of the ∞-norm are also isometries of the 2-norm, since P∗ P = I for any
matrix of the form described by Theorem 1.D.10.
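As a quick sanity check of my own (not part of the book; it assumes NumPy is available), the signed permutation matrix P displayed above indeed preserves the ∞-norm of an arbitrary vector and satisfies P∗P = I:

import numpy as np

P = np.array([[0, -1, 0],
              [0,  0, 1],
              [-1, 0, 0]], dtype=float)
v = np.random.default_rng(0).standard_normal(3)
print(np.linalg.norm(P @ v, np.inf), np.linalg.norm(v, np.inf))  # equal
print(np.allclose(P.T @ P, np.eye(3)))                           # True, so P is also a 2-norm isometry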
lim_{p→∞} ‖v‖_p = ‖v‖_∞ for all v ∈ Cn.
(a) Briefly explain why lim_{p→∞} |v|^p = 0 whenever v ∈ C satisfies |v| < 1.
(b) Show that if v ∈ Cn is such that ‖v‖_∞ = 1 then lim_{p→∞} ‖v‖_p = 1.
[Hint: Be slightly careful—there might be multiple values of j for which |vj| = 1.]
(c) Use absolute homogeneity of the p-norms and the ∞-norm to show that lim_{p→∞} ‖v‖_p = ‖v‖_∞ for all v ∈ Cn.

for all v, w ∈ V. [Hint: Consider vectors of the form v = x + y and w = x − y.]

∗∗1.D.16 Prove the "if" direction of the Jordan–von Neumann theorem when the field is F = C. In particular, show that if the parallelogram law holds on a complex normed vector space V then the function defined by

⟨v, w⟩ = (1/4) Σ_{k=0}^{3} i^k ‖v + i^k w‖_V²

is an inner product on V.
[Hint: Show that ⟨v, iw⟩ = i⟨v, w⟩ and apply the theorem from the real case to the real and imaginary parts of ⟨v, w⟩.]
∗∗1.D.9 Prove the p = 1, q = ∞ case of Hölder's inequality. That is, show that |v · w| ≤ ‖v‖1‖w‖∞ for all v, w ∈ Cn.

∗∗1.D.10 In this exercise, we determine when equality holds in Minkowski's inequality (Theorem 1.D.2) on Cn.
(a) Show that if p > 1 or p = ∞ then ‖v + w‖_p = ‖v‖_p + ‖w‖_p if and only if either w = 0 or v = cw for some 0 ≤ c ∈ R.
(b) Show that if p = 1 then ‖v + w‖_p = ‖v‖_p + ‖w‖_p if and only if there exist non-negative scalars {cj} ⊆ R such that, for each 1 ≤ j ≤ n, either wj = 0 or vj = cjwj.
[Hint: Make use of Exercise 1.D.11.]

1.D.11 In this exercise, we determine when equality holds in Hölder's inequality (Theorem 1.D.5) on Cn.

1.D.17 Suppose A ∈ Mm,n(C), and 1 ≤ p ≤ ∞.
(a) Show that the following quantity ‖·‖_p is a norm on Mm,n(C):

‖A‖_p := max{ ‖Av‖_p / ‖v‖_p : v ∈ Cn, v ≠ 0 }.

[Side note: This is called the induced p-norm of A. In the p = 2 case it is also called the operator norm of A, and we explore it in Section 2.3.3.]
(b) Show that ‖Av‖_p ≤ ‖A‖_p‖v‖_p for all v ∈ Cn.
(c) Show that ‖A∗‖_q = ‖A‖_p, where 1 ≤ q ≤ ∞ is such that 1/p + 1/q = 1.
(d) Show that this norm is not induced by an inner product, regardless of the value of p (as long as m, n ≥ 2).
1.D.18 Suppose A ∈ Mm,n(C), and consider the induced p-norms introduced in Exercise 1.D.17.
(a) Show that the induced 1-norm of A is its maximal column sum: ‖A‖1 = max_{1≤j≤n} Σ_{i=1}^{m} |ai,j|.
(b) Show that the induced ∞-norm of A is its maximal row sum: ‖A‖∞ = max_{1≤i≤m} Σ_{j=1}^{n} |ai,j|.

1.D.19 Find all isometries of the norm on R² defined by ‖v‖ = max{|v1|, 2|v2|}.

∗1.D.20 Suppose that V and W are finite-dimensional normed vector spaces. Show that if T : V → W is an isometry then dim(V) ≤ dim(W).

∗∗1.D.24 Show that P ∈ Mn(C) is an isometry of the 1-norm if and only if every row and column of P has a single non-zero entry, and that entry has absolute value 1.
[Hint: Try mimicking the proof of Theorem 1.D.10. It might be helpful to use the result of Exercise 1.D.10(b).]

∗∗1.D.25 In the proof of Theorem 1.D.10, we only proved the "only if" direction in the case when F = R. Explain what changes/additions need to be made to the proof to handle the F = C case.

∗∗1.D.26 Show that the 1-norm and the ∞-norm on P[0, 1] are inequivalent. That is, show that there do not exist real scalars c, C > 0 such that

c‖p‖1 ≤ ‖p‖∞ ≤ C‖p‖1 for all p ∈ P[0, 1].

[Hint: Consider the polynomials of the form pn(x) = x^n for some fixed n ≥ 1.]
Irving Kaplansky
Many of the most useful linear algebraic tools are those that involve breaking matrices down into the product of two or more simpler matrices (or equivalently, breaking linear transformations down into the composition of two or more simpler linear transformations). The standard example of this technique from introductory linear algebra is diagonalization, which says that it is often the case that we can write a matrix A ∈ Mn in the form A = PDP⁻¹, where P is invertible and D is diagonal.

[Side note: We also explored the QR decomposition in Section 1.C, which says that every matrix can be written as a product of a unitary matrix and an upper triangular matrix.]

More explicitly, some variant of the following theorem is typically proved in introductory linear algebra courses:
and D^k is trivial to compute (for diagonal matrices, matrix multiplication is the same as entrywise multiplication). It follows that after diagonalizing a matrix, we can compute any power of it via just two matrix multiplications—pre-multiplication of D^k by P, and post-multiplication of D^k by P⁻¹. In a sense, we have off-loaded the difficulty of computing matrix powers into the difficulty of diagonalizing a matrix.

[Side note: Another brief review of diagonalization is provided in Appendix A.1.7.]
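The following short sketch (my own illustration, not the book's, assuming NumPy is available) carries out exactly this computation for the matrix A = [2 1; 1 2] that appears in Figure 2.1 below:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
evals, P = np.linalg.eig(A)          # columns of P are eigenvectors of A
k = 10
Dk = np.diag(evals**k)               # D^k: just an entrywise power of the diagonal
Ak = P @ Dk @ np.linalg.inv(P)       # A^k = P D^k P^{-1}
print(np.allclose(Ak, np.linalg.matrix_power(A, k)))  # True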
[Figure 2.1 shows the vectors v1, v2 and their images Av1, Av2.]

Figure 2.1: The matrix A = [2  1; 1  2] "looks diagonal" (i.e., just stretches space, but does not skew or rotate it) when viewed in the basis B = {v1, v2} = {(1, 1), (−1, 1)}.
We say that two matrices A, B ∈ Mn(F) are similar if there exists an invertible matrix P ∈ Mn(F) such that A = PBP⁻¹ (and we call such a transformation of A into B a similarity transformation).

[Side note: With this terminology in hand, a matrix is diagonalizable if and only if it is similar to a diagonal matrix.]

The argument that we just provided shows that two matrices are similar if and only if they represent the same linear transformation on Fn, just represented in different bases. The main goal of this chapter is to better understand similarity transformations (and thus change of bases). More specifically, we investigate the following questions:
1) What if A is not diagonalizable—how “close” to diagonal can we make
D = P−1 AP when P is invertible? That is, how close to diagonal can we
make A via a similarity transformation?
2) How simple can we make A if we multiply by something other than P
and P−1 on the left and right? For example, how simple can we make A
via a unitary similarity transformation—a similarity transformation
in which the invertible matrix P is unitary? Since a change-of-basis
matrix PE←B is unitary if and only if B is an orthonormal basis of Fn
(see Exercise 1.4.18), this is equivalent to asking how simple we can
make the standard matrix of a linear transformation if we represent it in
an orthonormal basis.
We start this chapter by probing question (2) above—how simple can we
make a matrix via a unitary similarity? That is, given a matrix A ∈ Mn (C),
how simple can U ∗ AU be if U is a unitary matrix? Note that we need the field
here to be R or C so that it even makes sense to talk about unitary matrices,
and for now we restrict our attention to C since the answer in the real case will
follow as a corollary of the answer in the complex case.
We know that we cannot hope in general to get a diagonal matrix via unitary similarity, since not every matrix is even diagonalizable via any similarity. However, the following theorem, which is the main workhorse for the rest of this section, says that we can get partway there and always get an upper triangular matrix.

[Side note: Recall that an upper triangular matrix is one with zeros in every position below the main diagonal.]
Theorem 2.1.1 Suppose A ∈ Mn (C). There exists a unitary matrix U ∈ Mn (C) and an
Schur upper triangular matrix T ∈ Mn (C) such that
Triangularization
A = UTU ∗ .
Proof. We prove the result by induction on n (the size of A). For the base case,
we simply notice that the result is trivial if n = 1, since every 1 × 1 matrix is
upper triangular.
For the inductive step, suppose that every (n − 1) × (n − 1) matrix can be upper triangularized (i.e., can be written in the form described by the theorem). Since A is complex, the fundamental theorem of algebra (Theorem A.3.1) tells us that its characteristic polynomial pA has a root, so A has an eigenvalue. If we denote one of its corresponding eigenvectors by v ∈ Cn, then we recall that eigenvectors are non-zero by definition, so we can scale it so that ‖v‖ = 1.

[Side note: This step is why we need to work over C, not R. Not all polynomials factor over R, so not all real matrices have real eigenvalues.]
We can then extend v to an orthonormal basis of Cn via Exercise 1.4.20. In other words, we can find a unitary matrix V ∈ Mn(C) with v as its first column:

V = [v | V2],

where V2 ∈ Mn,n−1(C) satisfies V2∗v = 0 (since V is unitary, v is orthogonal to every column of V2). Direct computation then shows that

V∗AV = [v | V2]∗ A [v | V2] = [v∗; V2∗][Av | AV2] = [v∗Av  v∗AV2; V2∗Av  V2∗AV2] = [λ  v∗AV2; 0  V2∗AV2].

[Side note: To extend v to an orthonormal basis, pick any vector not in its span, apply Gram–Schmidt, and repeat. The final step in this computation follows because Av = λv, v∗v = ‖v‖² = 1, and V2∗v = 0.]
We now apply the inductive hypothesis—since V2∗AV2 is an (n − 1) × (n − 1) matrix, there exists a unitary matrix U2 ∈ Mn−1(C) and an upper triangular T2 ∈ Mn−1(C) such that V2∗AV2 = U2T2U2∗. It follows that

V∗AV = [λ  v∗AV2; 0  U2T2U2∗] = [1  0ᵀ; 0  U2] [λ  v∗AV2U2; 0  T2] [1  0ᵀ; 0  U2∗].

By multiplying on the left by V and on the right by V∗, we see that A = UTU∗, where

U = V [1  0ᵀ; 0  U2]  and  T = [λ  v∗AV2U2; 0  T2].

Since U is unitary and T is upper triangular, this completes the inductive step and the proof.

[Side note: Recall from Exercise 1.4.8 that the product of unitary matrices is unitary.]
In a Schur triangularization A = UTU ∗ , the diagonal entries of T are
necessarily the same as the eigenvalues of A. To see why this is the case, recall
that (a) A and T must have the same eigenvalues since they are similar, and
(b) the eigenvalues of a triangular matrix are its diagonal entries. However,
the other pieces of Schur triangularization are highly non-unique—the unitary
matrix U and the entries in the strictly upper triangular portion of T can vary
wildly.
Solutions:
a) We construct the triangularization by mimicking the proof of Theo-
rem 2.1.1, so we start by constructing a unitary matrix whose first
column is an eigenvector of A. It is straightforward to show that its
eigenvalues are −1 and 6, with corresponding eigenvectors (1, −1) and (2, 5), respectively.

[Side note: We computed these eigenvalues and eigenvectors in Example A.1.1. See Appendix A.1.6 if you need a refresher on eigenvalues and eigenvectors.]

If we (arbitrarily) choose the eigenvalue/eigenvector pair λ = −1 and v = (1, −1)/√2, then we can extend v to the orthonormal basis

{ (1, −1)/√2, (1, 1)/√2 }
b) Again, we mimic the proof of Theorem 2.1.1 and thus start by con-
structing a unitary matrix whose first column is an eigenvector of
B. Its eigenvalues are −1, 3, and 4, with corresponding eigenvec-
tors (−4, 1, 3), (1, 1, 0), and (2, 2, 1), respectively. We (arbitrarily) choose the pair λ = 4, v = (2, 2, 1)/3, and then we compute an orthonormal basis of C³ that contains v (notice that we normalized v so that this is possible).

[Side note: If we chose a different eigenvalue/eigenvector pair, we would still get a valid Schur triangularization of B, but it would look completely different than the one we compute here.]

To this end, we note that the unit vector w = (1, −1, 0)/√2 is clearly orthogonal to v, so we just need to find one more unit vector orthogonal to both v and w. We can either solve a linear system, use the Gram–Schmidt process, or use the cross product to find x = (−1, −1, 4)/√18, which does the job. It follows that the set

{ v, w, x } = { (2, 2, 1)/3, (1, −1, 0)/√2, (−1, −1, 4)/√18 }
We now iterate—we repeat this entire procedure with the bottom-right 2 × 2 block of V1∗BV1.

[Side note: For larger matrices, we have to iterate even more. For example, to compute a Schur triangularization of a 4 × 4 matrix, we must compute an eigenvector of a 4 × 4 matrix, then a 3 × 3 matrix, and then a 2 × 2 matrix.]

Well, the eigenvalues of

[−1  0; 4  3]

are −1 and 3, with corresponding eigenvectors (1, −1) and (0, 1), respectively. The orthonormal basis { (1, −1)/√2, (1, 1)/√2 } contains one of its eigenvectors, so we define V2 to be the unitary matrix with these vectors as its columns. Then B has Schur triangularization B = UTU∗ via
U = V1 [1  0ᵀ; 0  V2]
  = [2/3  1/√2  −1/√18; 2/3  −1/√2  −1/√18; 1/3  0  4/√18] [1  0  0; 0  1/√2  1/√2; 0  −1/√2  1/√2]
  = (1/3) [2  2  1; 2  −1  −2; 1  −2  2]

[Side note: In these examples, the eigenvectors were simple enough to extend to orthonormal bases by inspection. In more complicated situations, use the Gram–Schmidt process.]

and

T = U∗BU = (1/9) [2  2  1; 2  −1  −2; 1  −2  2] [1  2  2; 2  1  2; 3  −3  4] [2  2  1; 2  −1  −2; 1  −2  2]
         = [4  −1  3; 0  −1  −4; 0  0  3].
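A quick numerical double-check of my own (not part of the book; it assumes NumPy is available): with the U and T just computed, U is unitary and UTU∗ recovers B.

import numpy as np

B = np.array([[1, 2, 2],
              [2, 1, 2],
              [3, -3, 4]], dtype=float)
U = np.array([[2, 2, 1],
              [2, -1, -2],
              [1, -2, 2]], dtype=float) / 3
T = np.array([[4, -1, 3],
              [0, -1, -4],
              [0, 0, 3]], dtype=float)
print(np.allclose(U.T @ U, np.eye(3)))   # True: U is unitary (here real, so U^* = U^T)
print(np.allclose(U @ T @ U.T, B))       # True: B = U T U^*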
Remark 2.1.1 (Schur Triangularizations are Complex). Schur triangularizations are one of the few things that really do require us to work with complex matrices, as opposed to real matrices (or even matrices over some more exotic field). The reason for this is that the first step in finding a Schur triangularization is finding an eigenvector, and some real matrices do not have any real eigenvectors. For example, the matrix

A = [1  −2; 1  −1]

has no real eigenvalues and thus no real Schur triangularization (since the diagonal entries of its triangularization T are necessarily the eigenvalues of A). However, it does have a complex Schur triangularization: A = UTU∗, where

U = (1/√6) [√2(1 + i)  1 + i; √2  −2]  and  T = (1/√2) [i√2  3 − i; 0  −i√2].
Before proving this result, we clarify exactly what we mean by it. Since the characteristic polynomial pA really is a polynomial, there are coefficients a0, a1, . . ., an ∈ C such that

pA(λ) = an λ^n + · · · + a2 λ² + a1 λ + a0.

[Side note: In fact, an = (−1)^n.]

What we mean by pA(A) is that we plug A into this polynomial in the naïve way—we just replace each power of λ by the corresponding power of A:

pA(A) = an A^n + · · · + a2 A² + a1 A + a0 I.

[Side note: Recall that, for every matrix A ∈ Mn, we have A⁰ = I.]

For example, if A = [1  2; 3  4] then

pA(λ) = det(A − λI) = det([1 − λ  2; 3  4 − λ]) = (1 − λ)(4 − λ) − 6 = λ² − 5λ − 2,
which has its leftmost k + 1 columns equal to the zero vector. This completes
the inductive step and the proof.
For example, because the characteristic polynomial of a 2 × 2 matrix A is pA(λ) = λ² − tr(A)λ + det(A), in this case the Cayley–Hamilton theorem says that every 2 × 2 matrix satisfies the equation A² = tr(A)A − det(A)I, which can be verified directly by giving names to the 4 entries of A and computing all of the indicated quantities:

tr(A)A − det(A)I = (a + d) [a  b; c  d] − (ad − bc) [1  0; 0  1] = [a² + bc  ab + bd; ac + cd  bc + d²] = A².

[Side note: More generally, if A ∈ Mn(C) then the constant coefficient of pA is det(A) and its coefficient of λ^(n−1) is (−1)^(n−1) tr(A).]
Solution:
As noted earlier, the characteristic polynomial of A is pA(λ) = λ² − 5λ − 2, so A² − 5A − 2I = O. Rearranging somewhat gives A² = 5A + 2I. To get higher powers of A as linear combinations of A and I, just multiply this equation through by A:

A³ = 5A² + 2A = 5(5A + 2I) + 2A = 25A + 10I + 2A = 27A + 10I,  and
A⁴ = 27A² + 10A = 27(5A + 2I) + 10A = 135A + 54I + 10A = 145A + 54I.

[Side note: Another way to solve this would be to square both sides of the equation A² = 5A + 2I, which gives A⁴ = 25A² + 20A + 4I. Substituting in A² = 5A + 2I then gives the answer.]
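These identities are easy to confirm numerically. The following is a quick check of my own (not from the book, assuming NumPy is available):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
I = np.eye(2, dtype=int)
print(np.array_equal(A @ A, 5 * A + 2 * I))                             # True: A^2 = 5A + 2I
print(np.array_equal(np.linalg.matrix_power(A, 3), 27 * A + 10 * I))    # True: A^3 = 27A + 10I
print(np.array_equal(np.linalg.matrix_power(A, 4), 145 * A + 54 * I))   # True: A^4 = 145A + 54I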
Example 2.1.3 (Matrix Inverses via Cayley–Hamilton). Use the Cayley–Hamilton theorem to find the inverse of the matrix

A = [1  2; 3  4].

Solution:
As before, we know from the Cayley–Hamilton theorem that A² − 5A − 2I = O. If we solve this equation for I, we get I = (A² − 5A)/2. Factoring this equation gives I = A(A − 5I)/2, from which it follows that A is invertible, with inverse

A⁻¹ = (A − 5I)/2 = (1/2) ([1  2; 3  4] − [5  0; 0  5]) = (1/2) [−4  2; 3  −1].

[Side note: Alternatively, first check that det(A) ≠ 0 so that we know A⁻¹ exists, and then find it by multiplying both sides of A² − 5A − 2I = O by A⁻¹.]
It is maybe worth noting that the final example above is somewhat contrived,
since it only works out so cleanly due to the given matrix having a very simple
characteristic polynomial. For more complicated matrices, large powers are
typically still best computed via diagonalization.
The fact that A is not unitary follows from the fact that A∗ A (as computed
above) does not equal the identity matrix, and the fact that A is neither
Hermitian nor skew-Hermitian can be seen just by inspecting the entries
of the matrix.
Our primary interest in normal matrices comes from the following theorem,
which says that normal matrices are exactly those that can be diagonalized
by a unitary matrix. Equivalently, they are exactly the matrices whose Schur
triangularizations are actually diagonal.
Theorem 2.1.4 (Spectral Decomposition — Complex Version). Suppose A ∈ Mn(C). Then there exists a unitary matrix U ∈ Mn(C) and a diagonal matrix D ∈ Mn(C) such that

A = UDU∗

if and only if A is normal (i.e., A∗A = AA∗).
If A is normal and A = UTU∗ is a Schur triangularization of A (Theorem 2.1.1), then T = U∗AU is normal as well, so T∗T = TT∗. Comparing the (1,1)-entries of these two products gives

[T∗T]1,1 = |t1,1|²  and  [TT∗]1,1 = |t1,1|² + |t1,2|² + · · · + |t1,n|²,

so |t1,1|² = |t1,1|² + |t1,2|² + · · · + |t1,n|². Since each term in this sum is non-negative, it follows that t1,2 = · · · = t1,n = 0, so the only non-zero entry in the first row of T is its (1,1)-entry. Comparing the (2,2)-entries of T∗T and TT∗ in the same way gives

[T∗T]2,2 = |t1,2|² + |t2,2|² = |t2,2|²  and  [TT∗]2,2 = |t2,2|² + |t2,3|² + · · · + |t2,n|²,

which implies |t2,2|² = |t2,2|² + |t2,3|² + · · · + |t2,n|². Since each term in this sum is non-negative, it follows that |t2,3|² = · · · = |t2,n|² = 0, so the only non-zero entry in the second row of T is its (2,2)-entry.
By repeating this argument for [T ∗ T ]k,k = [T T ∗ ]k,k for each 3 ≤ k ≤ n, we
similarly see that all of the off-diagonal entries in T equal 0, so T is diagonal.
We can thus simply choose D = T , and the proof is complete.
The spectral decomposition is one of the most important theorems in all of
linear algebra, and it can be interpreted in at least three different ways, so it is
worth spending some time on how best to think about it.
Interpretation 1. A matrix is normal if and only if its Schur triangularizations are actually
diagonalizations. To be completely clear, the following statements about
a matrix A ∈ Mn (C) are all equivalent:
• A is normal.
• At least one Schur triangularization of A is diagonal.
• Every Schur triangularization of A is diagonal.
The equivalence of the final two points above might seem somewhat
strange given how non-unique Schur triangularizations are, but most of
that non-uniqueness comes from the strictly upper triangular portion of
the triangular matrix T in A = UTU ∗ . Setting all of the strictly upper
triangular entries of T to 0 gets rid of most of this non-uniqueness.
Interpretation 2. A matrix is normal if and only if it is diagonalizable (in the usual sense
of Theorem 2.0.1) via a unitary matrix. In particular, recall that the columns of the matrix P in a diagonalization A = PDP⁻¹ necessarily form a basis (of Cn) of eigenvectors of A. It follows that we can compute spectral decompositions simply via the usual diagonalization procedure, but making sure to choose the eigenvectors to be an orthonormal basis of Cn.

[Side note: The "usual diagonalization procedure" is illustrated in Appendix A.1.7.]
This method is much quicker and easier to use than the method suggested
by Interpretation 1, since Schur triangularizations are awful to compute.
Example 2.1.6 (A Small Spectral Decomposition). Find a spectral decomposition of the matrix A = [1  2; 2  1].

Solution:
We start by finding its eigenvalues:

det(A − λI) = det([1 − λ  2; 2  1 − λ]) = (1 − λ)² − 4 = λ² − 2λ − 3.
Setting this polynomial equal to 0 and solving (either via factoring or the
The first row of the RREF above tells us that the eigenvec-
tors v = (v1 , v2 , v3 , v4 ) satisfy v1 = −v2 , and its second row
tells us that v3 = −v4 , so the eigenvectors of A correspond-
ing to the eigenvalue λ = 0 are exactly those of the form
v = v2 (−1, 1, 0, 0) + v4 (0, 0, −1, 1).
We want to choose two vectors of this form (since λ = 0 has
multiplicity 2), but we have to be slightly careful to choose
them so that not only are they normalized, but they are also
orthogonal to each other. In this case, there is an obvious choice: (−1, 1, 0, 0)/√2 and (0, 0, −1, 1)/√2 are orthogonal, so we choose them.

[Side note: We could have also chosen (1, −1, 0, 0)/√2 and (0, 0, 1, −1)/√2, or (1, −1, 1, −1)/2 and (1, −1, −1, 1)/2, but not (1, −1, 0, 0)/√2 and (1, −1, 1, −1)/2, for example.]

To construct a spectral decomposition of A, we then place the eigenvalues as the diagonal entries in a diagonal matrix D, and we place the normalized and orthogonal eigenvectors to which they correspond as columns (in the same order) in a matrix U:

D = [0  0  0  0; 0  0  0  0; 0  0  2 + 2i  0; 0  0  0  2 − 2i],
U = (1/(2√2)) [−2  0  1 + i  1 − i; 2  0  1 + i  1 − i; 0  −2  1 − i  1 + i; 0  2  1 − i  1 + i].

To double-check our work, we could verify that U is indeed unitary and that A = UDU∗, so we have indeed found a spectral decomposition of A.

[Side note: I'm glad that we did not do this for a 5 × 5 matrix. Maybe we'll save that for the exercises?]

In the previous example, we were able to find an orthonormal basis of the
2-dimensional eigenspace just by inspection. However, it might not always be
so easy—if we cannot just “see” an orthonormal basis of an eigenspace, then
we can construct one by applying the Gram–Schmidt process (Theorem 1.4.6)
to any basis of the space.
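For Hermitian (or real symmetric) matrices, this whole procedure is automated by NumPy's eigh routine, which always returns an orthonormal eigenbasis. The snippet below is a minimal sketch of my own (not the book's; it assumes NumPy, and for a general normal matrix one would instead fall back on a Schur decomposition), run on the matrix from Example 2.1.6:

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
evals, U = np.linalg.eigh(A)                  # orthonormal eigenvector columns
D = np.diag(evals)
print(np.allclose(U @ D @ U.conj().T, A))     # True: A = U D U^*
print(np.allclose(U.conj().T @ U, np.eye(2))) # True: U is unitary
print(evals)                                  # [-1.  3.]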
Now that we know that every normal matrix is diagonalizable, it’s worth
taking a moment to remind ourselves of the relationships between the various
families of matrices that we have introduced so far. See Figure 2.2 for a visual
representation of these relationships and a reminder of which matrix families
contain each other.
Theorem 2.1.6 (Spectral Decomposition — Real Version). Suppose A ∈ Mn(R). Then there exists a unitary matrix U ∈ Mn(R) and a diagonal matrix D ∈ Mn(R) such that

A = UDUᵀ

if and only if A is symmetric (i.e., A = Aᵀ).
Proof. We already proved the “only if” direction of the theorem, so we jump
straight to proving the “if” direction. To this end, recall that the complex
spectral decomposition (Theorem 2.1.4) says that we can find a complex unitary
matrix U ∈ Mn (C) and diagonal matrix D ∈ Mn (C) such that A = UDU ∗ ,
and the columns of U are eigenvectors of A corresponding to the eigenvalues
along the diagonal of D.
To see that D must in fact be real, we observe that since A is real and
symmetric, we have
λ‖v‖² = λv∗v = v∗Av = v∗A∗v = (v∗Av)∗ = (λv∗v)∗ = λ̄‖v‖²,

which implies λ = λ̄ (since every eigenvector v is, by definition, non-zero), so λ is real.

[Side note: In fact, this argument also shows that every (potentially complex) Hermitian matrix has only real eigenvalues.]
To see that we can similarly choose U to be real, we must construct an
orthonormal basis of Rn consisting of eigenvectors of A. To do so, we first
recall from Corollary 2.1.5 that eigenvectors v and w of A corresponding to
different eigenvalues λ 6= µ, respectively, are necessarily orthogonal to each
other, since real symmetric matrices are normal.
It thus suffices to show that we can find a real orthonormal basis of each
eigenspace of A, and since we can use the Gram–Schmidt process (Theo-
rem 1.4.6) to find an orthonormal basis of any subspace, it suffices to just show
that each eigenspace of A is the span of a set of real vectors. The key observation that demonstrates this fact is that if a vector v ∈ Cn is in the eigenspace of A corresponding to an eigenvalue λ then so is v̄, since taking the complex conjugate of the equation Av = λv (and using the fact that A and λ are real) gives Av̄ = λv̄.

[Side note: Recall that v̄ is the complex conjugate of v (i.e., if v = x + iy for some x, y ∈ Rn then v̄ = x − iy). Also, Re(v) = (v + v̄)/2 and Im(v) = (v − v̄)/(2i) are called the real and imaginary parts of v. Refer to Appendix A.3 for a refresher on complex numbers.]

In particular, this implies that if {v1, v2, . . ., vk} is any basis of the eigenspace, then the set {Re(v1), Im(v1), Re(v2), Im(v2), . . ., Re(vk), Im(vk)} has the same span. Since each vector in this set is real, the proof is now complete.

Another way of phrasing the real spectral decomposition is as saying that
if V is a real inner product space then a linear transformation T : V → V is
self-adjoint if and only if T looks diagonal in some orthonormal basis of V.
Geometrically, this means that T is self-adjoint if and only if it looks like a
rotation and/or reflection (i.e., a unitary matrix U), composed with a diagonal
scaling, composed with a rotation and/or reflection back (i.e., the inverse unitary
matrix U∗).

[Side note: The real spectral decomposition can also be proved "directly", without making use of its complex version. See Exercise 2.1.20.]

This gives us the exact same geometric interpretation of the real spectral decomposition that we had for its complex counterpart—symmetric matrices are exactly those that stretch (but do not rotate or skew) some orthonormal basis. However, the important distinction here from the case of normal matrices is that the eigenvalues and eigenvectors of symmetric matrices are real, so we can really visualize this all happening in Rn rather than in Cn, as in Figure 2.3.
[Figure 2.3 shows the orthonormal basis u1 = (1, −1)/√2, u2 = (1, 1)/√2 and its image under A = UDUᵀ, with Au1 = −u1 and Au2 = 3u2.]
Example 2.1.8 (A Real Spectral Decomposition). Find a real spectral decomposition of the matrix

A = [1  2  2; 2  1  2; 2  2  1].

Solution:
Since this matrix is symmetric, we know that it has a real spectral decomposition. To compute it, we first find its eigenvalues:

det(A − λI) = det([1 − λ  2  2; 2  1 − λ  2; 2  2  1 − λ])
            = (1 − λ)³ + 8 + 8 − 4(1 − λ) − 4(1 − λ) − 4(1 − λ)
            = (1 + λ)²(5 − λ).
Before moving on, it is worth having a brief look at Table 2.1, which sum-
marizes the relationship between the real and complex spectral decompositions,
and what they say about normal, Hermitian, and symmetric matrices.
2.1.1 For each of the following matrices, say whether it is (i) unitary, (ii) Hermitian, (iii) skew-Hermitian, (iv) normal. It may have multiple properties or even none of the listed properties.
∗(a) [2  2; −2  2]    (b) [1  2; 3  4]
∗(c) (1/√5) [1  2i; 2i  1]    (d) [1  0; 0  1]
∗(e) [0  0; 0  0]    (f) [1  1 + i; 1 + i  1]
∗(g) [0  −i; i  0]    (h) [i  1; −1  2i]

2.1.2 Determine which of the following matrices are normal.
∗(a) [2  −1; 1  3]    (b) [1  1  1; 1  1  1; 1  1  1]
∗(c) [1  1; −1  1]    (d) [1  2  3; 3  1  2; 2  3  1]
∗(e) [1  2  −3i; 2  2  2; 3i  2  4]    (f) [2 + 3i  0  0; 0  7i  0; 0  0  18]
∗(g) [1  2  0; 3  4  5; 0  6  7]    (h) [√2  √2  i; 1  1  1; √2  i  1]

2.1.3 Compute a Schur triangularization of the following matrices.
∗(a) [6  −3; 2  1]    (b) [7  −5; −1  3]

2.1.4 Find a spectral decomposition of each of the following normal matrices.
∗(a) [3  2; 2  3]    (b) [1  1; −1  1]
∗(c) [0  −i; i  0]    (d) [1  0  1; 0  1  0; 1  0  1]
∗(e) [1  1  0; 0  1  1; 1  0  1]    (f) [2i  0  0; 0  1 + i  −1 + i; 0  −1 + i  1 + i]

2.1.5 Determine which of the following statements are true and which are false.
(a) If A = UTU∗ is a Schur triangularization of A then the eigenvalues of A are along the diagonal of T.
∗(b) If A, B ∈ Mn(C) are normal then so is A + B.
(c) If A, B ∈ Mn(C) are normal then so is AB.
∗(d) The set of normal matrices is a subspace of Mn(C).
(e) If A, B ∈ Mn(C) are similar and A is normal, then B is normal too.
∗(f) If A ∈ Mn(R) is normal then there exists a unitary matrix U ∈ Mn(R) and a diagonal matrix D ∈ Mn(R) such that A = UDUᵀ.
(g) If A = UTU∗ is a Schur triangularization of A then A² = UT²U∗ is a Schur triangularization of A².
∗(h) If all of the eigenvalues of A ∈ Mn(C) are real, it must be Hermitian.
(i) If A ∈ M3(C) has eigenvalues 1, 1, and 0 (counted via algebraic multiplicity), then A³ = 2A² − A.

2.1.6 Suppose A ∈ Mn(C). Show that there exists a unitary matrix U ∈ Mn(C) and a lower triangular matrix L ∈ Mn(C) such that

A = ULU∗.

∗2.1.7 Suppose A ∈ M2(C). Use the Cayley–Hamilton theorem to find an explicit formula for A⁻¹ in terms of the entries of A (assuming that A is invertible).
∗∗2.1.13 Show that A ∈ Mn(C) is normal if and only if ‖Av‖ = ‖A∗v‖ for all v ∈ Cn.
[Hint: Make use of Exercise 1.4.19.]

∗∗2.1.14 Show that A ∈ Mn(C) is normal if and only if A∗ ∈ span{I, A, A², A³, . . .}.
[Hint: Apply the spectral decomposition to A and think about interpolating polynomials.]

2.1.15 Suppose A, B ∈ Mn(C) are unitarily similar (i.e., there is a unitary U ∈ Mn(C) such that B = UAU∗). Show that if A is normal then so is B.

2.1.16 Suppose T ∈ Mn(C) is upper triangular. Show that T is diagonal if and only if it is normal.
[Hint: We actually showed this fact somewhere in this section—just explain where.]
[Side note: This exercise generalizes Exercise 1.4.12.]

∗∗2.1.17 Suppose A ∈ Mn(C) is normal.
(a) Show that A is Hermitian if and only if its eigenvalues are real.
(b) Show that A is skew-Hermitian if and only if its eigenvalues are imaginary.

2.1.22 Suppose U ∈ Mn(C) is a skew-symmetric unitary matrix (i.e., Uᵀ = −U and U∗U = I).
(a) Show that n must be even (i.e., show that no such matrix exists when n is odd).
(b) Show that the eigenvalues of U are ±i, each with multiplicity equal to n/2.
(c) Show that there is a unitary matrix V ∈ Mn(C) such that U = VBV∗, where

Y = [0  1; −1  0]  and  B = [Y  O  · · ·  O; O  Y  · · ·  O; ⋮  ⋮  ⋱  ⋮; O  O  · · ·  Y].

[Hint: Find a complex spectral decomposition of Y.]

∗∗2.1.23 In this exercise, we finally show that if V is a finite-dimensional inner product space over R and P : V → V is a projection then P is orthogonal (i.e., ⟨P(v), v − P(v)⟩ = 0 for all v ∈ V) if and only if it is self-adjoint (i.e., P = P∗). Recall that we proved this when the ground field is C in Exercise 1.4.29.
(a) Show that if P is self-adjoint then it is orthogonal.
(b) Use Exercise 1.4.28 to show that if P is orthogonal then the linear transformation T = P − P∗ ∘ P satisfies T∗ = −T.
(c) Use part (b) to show that if P is orthogonal then it is self-adjoint. [Hint: Represent T in an orthonormal basis, take its trace, and use Exercise 2.1.12.]

2.1.24 Show that if P ∈ Mn(C) is an orthogonal projection (i.e., P = P∗ = P²) then there exists a unitary matrix U ∈ Mn(C) such that

P = U [Ir  O; O  O] U∗,

where r = rank(P).

2.1.25 A circulant matrix is a matrix C ∈ Mn(C) of the form

C = [c0  c1  c2  · · ·  cn−1; cn−1  c0  c1  · · ·  cn−2; cn−2  cn−1  c0  · · ·  cn−3; ⋮  ⋮  ⋮  ⋱  ⋮; c1  c2  c3  · · ·  c0],

where c0, c1, . . ., cn−1 are scalars. Show that C can be diagonalized by the Fourier matrix F from Exercise 1.4.15. That is, show that F∗CF is diagonal.
[Hint: It suffices to show that the columns of F are eigenvectors of C.]

∗2.1.26 Suppose A, B ∈ Mn(C) commute (i.e., AB = BA). Show that there is a vector v ∈ Cn that is an eigenvector of each of A and B.
[Hint: This is hard. Maybe just prove it in the case when A has distinct eigenvalues, which is much easier. The general case can be proved using techniques like those used in the proof of Schur triangularization.]

2.1.27 Suppose A, B ∈ Mn(C) commute (i.e., AB = BA). Show that there is a common unitary matrix U ∈ Mn(C) that triangularizes them both:

A = UT1U∗  and  B = UT2U∗

for some upper triangular T1, T2 ∈ Mn(C).
[Hint: Use Exercise 2.1.26 and mimic the proof of Schur triangularization.]

2.1.28 Suppose A, B ∈ Mn(C) are normal. Show that A and B commute (i.e., AB = BA) if and only if there is a common unitary matrix U ∈ Mn(C) that diagonalizes them both:

A = UD1U∗  and  B = UD2U∗

for some diagonal D1, D2 ∈ Mn(C).
[Hint: Leech off of Exercise 2.1.27.]
[Side note: This result is still true if we replace "normal" by "real symmetric" and C by R throughout the exercise.]

2.1.29 Suppose A, B ∈ Mn(C) are diagonalizable. Show that A and B commute (i.e., AB = BA) if and only if there is a common invertible matrix P ∈ Mn(C) that diagonalizes them both:

A = PD1P⁻¹  and  B = PD2P⁻¹

for some diagonal D1, D2 ∈ Mn(C).
[Hint: When does a diagonal matrix D1 commute with another matrix? Try proving the claim when the eigenvalues of A are distinct first, since that case is much easier. For the general case, induction might help you sidestep some of the ugly details.]
We have now seen that normal matrices play a particularly important role in linear algebra, especially when decomposing matrices. There is one particularly important sub-family of normal matrices, which we now turn our attention to, that plays perhaps an even more important role.
[Figure 2.4 shows a vector v and its image Av, as well as the standard basis vectors e1, e2 and their images Ae1, Ae2.]

[Side note: For another geometric interpretation of positive (semi)definiteness, see Section 2.A.1.]
Figure 2.4: As a linear transformation, a positive definite matrix is one that keeps
vectors pointing “mostly” in the same direction. In particular, the angle θ between
v and Av never exceeds π/2, so Av is always in the same half-space as v. Positive
semidefiniteness allows for v and Av to be perpendicular (i.e., orthogonal), so Av
can be on the boundary of the half-space defined by v.
Proof. We prove the theorem by showing that (a) =⇒ (b) =⇒ (d) =⇒ (c)
=⇒ (a).
To see that (a) =⇒ (b), let v be an eigenvector of A with corresponding
eigenvalue λ . Then Av = λ v, and multiplying this equation on the left by v∗
shows that v∗ Av = λ v∗ v = λ kvk2 . Since A is positive semidefinite, we know
that v∗ Av ≥ 0, so it follows that λ ≥ 0 too.
To see that (b) =⇒ (d), we just apply the spectral decomposition theorem
(either the complex Theorem 2.1.4 or the real Theorem 2.1.6, as appropriate)
to A.
To see why (d) ⟹ (c), let √D be the diagonal matrix that is obtained by taking the (non-negative) square root of the diagonal entries of D, and define B = √D U∗. Then B∗B = (√D U∗)∗(√D U∗) = U√D√D U∗ = UDU∗ = A.

[Side note: Another characterization of positive semidefiniteness is provided in Exercise 2.2.25.]

Finally, to see that (c) ⟹ (a), we let v ∈ Fn be any vector and we note that

v∗Av = v∗B∗Bv = (Bv)∗(Bv) = ‖Bv‖² ≥ 0.
It follows that A is positive semidefinite, so the proof is complete.
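The conditions of this theorem also suggest how positive semidefiniteness is checked in practice. The snippet below is a hedged numerical sketch of my own (not from the book, assuming NumPy is available) that mirrors conditions (b) and (c) for the matrix A = [1 −1; −1 1] from the start of this section:

import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

# (b): all eigenvalues of the Hermitian matrix A are non-negative
print(np.all(np.linalg.eigvalsh(A) >= -1e-12))   # True

# (c): A = B^* B, with B = sqrt(D) U^* built from the spectral decomposition
evals, U = np.linalg.eigh(A)
B = np.diag(np.sqrt(np.clip(evals, 0, None))) @ U.conj().T
print(np.allclose(B.conj().T @ B, A))            # True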
Example 2.2.1 (Demonstrating Positive Semidefiniteness). Explicitly show that all four properties of Theorem 2.2.1 hold for the matrix

A = [1  −1; −1  1].
Solution:
We already showed that v∗ Av = |v1 − v2 |2 ≥ 0 for all v ∈ C2 at the start
of this section, which shows that A is PSD. We now verify that properties
(b)–(d) of Theorem 2.2.1 are all satisfied as well.
For property (b), we can explicitly compute the eigenvalues of A:

det([1 − λ  −1; −1  1 − λ]) = (1 − λ)² − 1 = λ² − 2λ = λ(λ − 2) = 0,

so the eigenvalues of A are 0 and 2, which are indeed non-negative.

[Side note: In practice, checking non-negativity of its eigenvalues is the simplest of these methods of checking positive semidefiniteness.]
For property (d), we want to find a unitary matrix U such that A =
UDU ∗ , where D has 2 and 0 (the eigenvalues of A) along its diagonal.
We know from the spectral decomposition that we can construct U by
placing the normalized eigenvectors of A into U as columns. Eigenvectors
corresponding to the eigenvalues 2 and 0 are v = (1, −1) and v = (1, 1),
respectively, so

U = (1/√2) [1  1; −1  1]  and  D = [2  0; 0  0].

[Side note: We check property (d) before (c) so that we can mimic the proof of Theorem 2.2.1.]
The proof of the above theorem is almost identical to that of Theorem 2.2.1,
so we leave it to Exercise 2.2.23. Instead, we note that there are two additional
characterizations of positive definite matrices that we would like to present, but
we first need some “helper” theorems that make it a bit easier to work with positive
(semi)definite matrices. The first of these results tells us how we can manipulate
positive semidefinite matrices without destroying positive semidefiniteness.
Proof. These properties all follow fairly quickly from the definition of positive semidefiniteness, so we leave the proof of (a), (b), and (c) to Exercise 2.2.24.

[Side note: We return to this problem of asking what operations transform PSD matrices into PSD matrices in Section 3.A.]

To show that property (d) holds, observe that for all v ∈ Fn we have

0 ≤ (Pv)∗A(Pv) = v∗(P∗AP)v,

so P∗AP is positive semidefinite as well. If A is positive definite then positive definiteness of P∗AP is equivalent to the requirement that Pv ≠ 0 whenever
is positive (semi)definite. Then the diagonal blocks A1,1, A2,2, . . ., An,n must be positive (semi)definite.

[Side note: The diagonal blocks here must be square for this block matrix to make sense (e.g., A1,1 is square since it has A1,2 to its right and A1,2∗ below it).]

Proof. We use property (d) of Theorem 2.2.3. In particular, consider the block matrices

P1 = [I; O; ⋮; O],  P2 = [O; I; ⋮; O],  . . .,  Pn = [O; O; ⋮; I],

where the sizes of the O and I blocks are such that the matrix multiplication APj makes sense for all 1 ≤ j ≤ n. It is then straightforward to verify that Pj∗APj = Aj,j for all 1 ≤ j ≤ n, so Aj,j must be positive semidefinite by Theorem 2.2.3(d). Furthermore, each Pj has rank equal to its number of columns, so Aj,j is positive definite whenever A is.
Example 2.2.2 (Showing Matrices are Not Positive Semidefinite). Show that the following matrices are not positive semidefinite.

a) [3  2  1; 2  4  2; 1  2  −1]        b) [3  −1  1  −1; −1  1  2  1; 1  2  1  2; −1  1  2  3]

diagonal block is

[1  2; 2  1],

which is exactly the matrix B from Equation (2.2.1) that we showed is not positive semidefinite earlier.
Theorem 2.2.5 A function h·, ·i : Fn × Fn → F is an inner product if and only if there exists
Positive Definite a positive definite matrix A ∈ Mn (F) such that
Matrices Make
Inner Products hv, wi = v∗ Aw for all v, w ∈ Fn .
Proof. We start by showing that if ⟨·, ·⟩ is an inner product then such a positive definite matrix A must exist. To this end, recall from Theorem 1.4.3 that there exists a basis B of Fn such that

⟨v, w⟩ = [v]_B · [w]_B for all v, w ∈ Fn.

[Side note: If you need a refresher on inner products, refer back to Section 1.4.]

Well, let PB←E be the change-of-basis matrix from the standard basis E to B. Then [v]_B = PB←E v and [w]_B = PB←E w, so

⟨v, w⟩ = [v]_B · [w]_B = (PB←E v) · (PB←E w) = v∗(PB←E∗ PB←E)w

for all v, w ∈ Fn. Since change-of-basis matrices are invertible, it follows from Theorem 2.2.2(c) that A = PB←E∗ PB←E is positive definite, which is what we wanted.
In the other direction, we must show that every function h·, ·i of the form
hv, wi = v∗ Aw is necessarily an inner product when A is positive definite. We
thus have to show that all three defining properties of inner products from
Definition 1.3.6 hold.
For property (a), we note that A = A∗, so the complex conjugate of ⟨w, v⟩ = w∗Av is

(w∗Av)∗ = v∗A∗w = v∗Aw = ⟨v, w⟩ for all v, w ∈ Fn.

[Side note: Here we used the fact that if c ∈ C is a scalar, then c̄ = c∗. If F = R then these complex conjugations just vanish.]

For property (b), we check that

⟨v, w + cx⟩ = v∗A(w + cx) = v∗Aw + c(v∗Ax) = ⟨v, w⟩ + c⟨v, x⟩

for all v, w, x ∈ Fn and all c ∈ F.
Finally, for property (c) we note that A is positive definite, so

⟨v, v⟩ = v∗Av ≥ 0 for all v ∈ Fn,

with equality if and only if v = 0. It follows that ⟨·, ·⟩ is indeed an inner product, which completes the proof.

[Side note: Notice that we called inner products "positive definite" way back when we first introduced them in Definition 1.3.6.]
is an inner product on R2 .
Solution:
We already showed this function is an inner product back in Exam-
ple 1.3.18 in a rather brute-force manner. Now that we understand inner
products better, we can be much more slick—we just notice that we can
rewrite this function in the form

⟨v, w⟩ = vᵀAw,  where  A = [1  2; 2  5].

[Side note: To construct this matrix A, just notice that multiplying out vᵀAw gives Σ_{i,j} ai,j vi wj, so we just let ai,j be the coefficient of vi wj.]

It is straightforward to check that A is positive definite (its eigenvalues are 3 ± 2√2 > 0, for example), so Theorem 2.2.5 tells us that this function is an inner product.
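A small check of my own (not from the book, assuming NumPy is available): A = [1 2; 2 5] has positive eigenvalues, and consequently vᵀAv > 0 for the (randomly chosen, non-zero) test vectors below, just as positive definiteness requires.

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 5.0]])
print(np.linalg.eigvalsh(A))             # both eigenvalues positive (3 +- 2*sqrt(2))
rng = np.random.default_rng(1)
for _ in range(3):
    v = rng.standard_normal(2)
    print(v @ A @ v > 0)                 # True each time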
Theorem 2.2.6 Suppose A ∈ Mn is self-adjoint. Then A is positive definite if and only if,
Sylvester’s for all 1 ≤ k ≤ n, the determinant of the top-left k × k block of A is strictly
Criterion positive.
Example 2.2.4 (Applying Sylvester's Criterion). Use Sylvester's criterion to show that the following matrix is positive definite:

A = [2  −1  i; −1  2  1; −i  1  2].

Solution:
We have to check that the top-left 1 × 1, 2 × 2, and 3 × 3 blocks of A all have positive determinants:

det([2]) = 2 > 0,
det([2  −1; −1  2]) = 4 − 1 = 3 > 0,  and
det([2  −1  i; −1  2  1; −i  1  2]) = 8 + i − i − 2 − 2 − 2 = 2 > 0.

It follows from Sylvester's criterion that A is positive definite.

[Side note: For 2 × 2 matrices, recall that det([a  b; c  d]) = ad − bc. For larger matrices, we can use Theorem A.1.4 (or many other methods) to compute determinants.]
Proof of Sylvester’s criterion. For the “only if” direction of the proof, recall
from Theorem 2.2.4 that if A is positive definite then so is the top-left k × k
block of A, which we will call Ak for the remainder of this proof. Since Ak is
positive definite, its eigenvalues are positive, so det(Ak ) (i.e., the product of
those eigenvalues) is positive too, as desired.
The “if” direction is somewhat trickier to pin down, and we prove it by
induction on n (the size of A). For the base case, if n = 1 then it is clear that
det(A) > 0 implies that A is positive definite since the determinant of a scalar
just equals that scalar itself.
For the inductive step, assume that the theorem holds for (n − 1) × (n − 1)
matrices. To see that it must then hold for n × n matrices, notice that if A ∈
Mn (F) is as in the statement of the theorem and det(Ak ) > 0 for all 1 ≤ k ≤ n,
Let λi and λj be any two eigenvalues of A with corresponding orthogonal eigenvectors v and w, respectively. Then define x = wn v − vn w and notice that x ≠ 0 (since {v, w} is linearly independent), but xn = wn vn − vn wn = 0. Since xn = 0 and An−1 is positive definite, it follows that

0 < x∗Ax = (wn v − vn w)∗A(wn v − vn w)
        = |wn|² v∗Av − w̄n vn v∗Aw − v̄n wn w∗Av + |vn|² w∗Aw
        = λi|wn|² v∗v − λj w̄n vn v∗w − λi v̄n wn w∗v + λj|vn|² w∗w
        = λi|wn|² ‖v‖² − 0 − 0 + λj|vn|² ‖w‖².

[Side note: We can choose v and w to be orthogonal by the spectral decomposition, so v∗w = v · w = 0. We have to be careful if vn = wn = 0, since then x = 0; in this case, we instead define x = v to fix up the proof.]

It is thus not possible that both λi ≤ 0 and λj ≤ 0. Since λi and λj were
arbitrary eigenvalues of A, it follows that A must have at most one non-positive
eigenvalue. However, if it had exactly one non-positive eigenvalue then it would
be the case that det(A) = λ1 λ2 · · · λn ≤ 0, which we know is not the case. It
follows that all of A’s eigenvalues are strictly positive, so A is positive definite
by Theorem 2.2.2, which completes the proof.
For this matrix, the top-left block clearly has determinant 0, and straightfor-
ward computation shows that det(A) = 0 as well. However, A is not positive
semidefinite, since it has −1 as an eigenvalue.
The following theorem shows that there is indeed some hope though, and
Sylvester’s criterion does apply to 2 × 2 matrices as long as we add in the
requirement that the bottom-right entry is non-negative as well.
diagonal entries (Theorem 2.2.4). The fact that det(A) ≥ 0 follows from the
fact that A has non-negative eigenvalues, and det(A) is the product of them.
For the “if” direction, recall that the characteristic polynomial of A is
pA (λ ) = det(A − λ I) = λ 2 − tr(A)λ + det(A).
It follows from the quadratic formula that the eigenvalues of A are
λ = (1/2)(tr(A) ± √(tr(A)² − 4 det(A))).

To see that these eigenvalues are non-negative (and thus A is positive semidefinite), we just observe that if det(A) ≥ 0 then tr(A)² − 4 det(A) ≤ tr(A)², so tr(A) − √(tr(A)² − 4 det(A)) ≥ 0.

[Side note: This proof shows that we can replace the inequalities a1,1, a2,2 ≥ 0 in the statement of this theorem with the single inequality tr(A) ≥ 0.]

For example, if we apply this theorem to the matrices

A = [1  −1; −1  1]  and  B = [1  2; 2  1]

from Equation (2.2.1), we see immediately that A is positive semidefinite (but not positive definite) because det(A) = 1 − 1 = 0 ≥ 0, but B is not positive semidefinite because det(B) = 1 − 4 = −3 < 0.
are

det([a]) = a,  det([d]) = d,  and  det([a  b; b̄  d]) = ad − |b|²,

and the principal minors of a 3 × 3 Hermitian matrix

B = [a  b  c; b̄  d  e; c̄  ē  f]

[Side note: Notice that these three principal minors are exactly the quantities that determined positive semidefiniteness in Theorem 2.2.7(a).]
Theorem 2.2.8 (Gershgorin Disc Theorem). Suppose A ∈ Mn and define the following objects for each 1 ≤ i ≤ n:
• ri = Σ_{j≠i} |ai,j| (the sum of the absolute values of the off-diagonal entries of the i-th row of A), and
• D(ai,i, ri) = {z ∈ C : |z − ai,i| ≤ ri} (the disc in the complex plane centered at ai,i with radius ri).
Then every eigenvalue of A is contained in at least one of the discs D(a1,1, r1), . . ., D(an,n, rn).
Example 2.2.5 (Gershgorin Discs). Draw the Gershgorin discs for the following matrix, and show that its eigenvalues are contained in these discs:

A = [−1  2; −i  1 + i].

Solution:
The Gershgorin discs are D(−1, 2) and D(1 + i, 1). Direct calculation shows that the eigenvalues of A are 1 and −1 + i, which are indeed contained within these discs:

[Side note: The radius of the second disc is 1 because |−i| = 1.]
[Figure: the Gershgorin discs D(−1, 2) and D(1 + i, 1) in the complex plane, containing the eigenvalues λ1 = 1 and λ2 = −1 + i.]

[Side note: In this subsection, we focus on the F = C case since Gershgorin discs live naturally in the complex plane. These same results apply if F = R simply because real matrices are complex.]
Now notice that the sum on the left can be split up into the form

Σ_j ai,j vj = Σ_{j≠i} ai,j vj + ai,i (again, since vi = 1).

By combining and slightly rearranging the two equations above, we get

λ − ai,i = Σ_{j≠i} ai,j vj.

Finally, taking the absolute value of both sides of this equation then shows that

|λ − ai,i| = |Σ_{j≠i} ai,j vj| ≤ Σ_{j≠i} |ai,j||vj| ≤ Σ_{j≠i} |ai,j| = ri,

where the first inequality is the triangle inequality and the second holds since |vj| ≤ 1 for all j ≠ i, which means exactly that λ ∈ D(ai,i, ri).

[Side note: In fact, this proof shows that each eigenvalue lies in the Gershgorin disc corresponding to the largest entry of its corresponding eigenvectors. We used the fact that |wz| = |w||z| for all w, z ∈ C here.]
Example 2.2.6 Draw the Gershgorin discs based on the columns of the matrix from
Gershgorin Discs Example 2.2.5, and show that its eigenvalues are contained in those discs.
via Columns
Solution:
The Gershgorin discs based on the columns of this matrix are D(−1, 1) and D(1 + i, 2), which do indeed contain the eigenvalues 1 and −1 + i:
[Figure: the column-based Gershgorin discs D(−1, 1) and D(1 + i, 2), shown together with the row-based discs from Example 2.2.5, containing the eigenvalues λ1 = 1 and λ2 = −1 + i.]
Remark 2.2.2 (Counting Eigenvalues in Gershgorin Discs). Be careful when using the Gershgorin disc theorem. Since every complex matrix has exactly n eigenvalues (counting multiplicities) and n Gershgorin discs, it is tempting to conclude that each disc must contain an eigenvalue, but this is not necessarily true. For example, consider the matrix

A = [1  2; −1  −1],

which has Gershgorin discs D(1, 2) and D(−1, 1). However, its eigenvalues are ±i, both of which are contained in D(1, 2) and neither of which are contained in D(−1, 1):

[Figure: the discs D(1, 2) and D(−1, 1), with both eigenvalues λ1 = i and λ2 = −i lying in D(1, 2) only.]
However, in the case when the Gershgorin discs do not overlap, we can indeed conclude that each disc must contain exactly one eigenvalue. Slightly more generally, if we partition the Gershgorin discs into disjoint sets, then each set must contain exactly as many eigenvalues as Gershgorin discs.

[Side note: Proofs of the statements in this remark are above our pay grade, so we leave them to more specialized books like [HJ12].]

For example, the matrix

B = [4  −1  −1; 1  −1  1; 1  0  5]
has Gershgorin discs D(4, 2), D(−1, 2), and D(5, 1). Since D(4, 2) and
D(5, 1) overlap, but D(−1, 2) is disjoint from them, we know that one of
B’s eigenvalues must be contained in D(−1, 2), and the other two must be
contained in D(4, 2) ∪ D(5, 1). Indeed, the eigenvalues of B are approxi-
mately λ1 ≈ −0.8345 and λ2,3 ≈ 4.4172 ± 0.9274i:
[Figure: the discs D(−1, 2), D(4, 2), and D(5, 1); the eigenvalue λ1 ≈ −0.8345 lies in D(−1, 2), while λ2 ≈ 4.4172 + 0.9274i and λ3 ≈ 4.4172 − 0.9274i lie in D(4, 2) ∪ D(5, 1).]
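The disc membership claimed above is easy to verify numerically. The snippet below is an illustration of my own (not from the book, assuming NumPy is available) for the matrix B of this remark:

import numpy as np

B = np.array([[4, -1, -1],
              [1, -1, 1],
              [1, 0, 5]], dtype=float)
centers = np.diag(B)
radii = np.sum(np.abs(B), axis=1) - np.abs(centers)   # r_i = sum of |b_ij| over j != i
for lam in np.linalg.eigvals(B):
    in_discs = np.abs(lam - centers) <= radii
    print(lam, in_discs.any())        # every eigenvalue lies in at least one disc D(b_ii, r_i)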
Our primary purpose for introducing Gershgorin discs is that they will help
us show that some matrices are positive (semi)definite shortly. First, we need
to introduce one additional family of matrices:
This matrix has Gershgorin discs D(2, 1), D(7, 1), and D(5, 2). Furthermore,
since A is Hermitian we know that its eigenvalues are real and are thus contained
in the real interval [1, 8] (see Figure 2.5). In particular, this implies that its
eigenvalues are strictly positive, so A is positive definite. This same type of
argument works in general, and leads immediately to the following theorem:
Figure 2.5: The Gershgorin discs of the Hermitian matrix from Equation (2.2.2) all lie in the right half of the complex plane, so it is positive definite.
onal entries of A are real and non-negative, the centers of these Gershgorin discs are located on the right half of the real axis. Furthermore, since A is diagonally dominant, the radii of these discs are no larger than the coordinates of their centers, so they do not cross the imaginary axis. It follows that every eigenvalue of A is non-negative (or strictly positive if A is strictly diagonally dominant), so A is positive semidefinite by Theorem 2.2.1 (or positive definite by Theorem 2.2.2).

[Side note: This is a one-way theorem: diagonally dominant matrices are PSD, but PSD matrices may not be diagonally dominant (see the matrix from Example 2.2.3, for example).]
Theorem 2.2.10 Suppose B,C ∈ Mm,n (F). The following are equivalent:
Unitary Freedom a) There exists a unitary matrix U ∈ Mm (F) such that C = UB,
of PSD b) B∗ B = C∗C,
Decompositions c) (Bv) · (Bw) = (Cv) · (Cw) for all v, w ∈ Fn , and
d) kBvk = kCvk for all v ∈ Fn .
Proof. We showed directly above the statement of the theorem that (a) implies (b), and we demonstrated the equivalence of conditions (b), (c), and (d) in Exercise 1.4.19. We thus only need to show that the conditions (b), (c), and (d) together imply condition (a).

[Side note: If C = I then this theorem gives many of the characterizations of unitary matrices that we saw back in Theorem 1.4.9.]
To this end, note that since B∗ B is positive semidefinite, it has a set of
eigenvectors {v1 , . . . , vn } (with corresponding eigenvalues λ1 , . . . , λn , respec-
tively) that form an orthonormal basis of Fn . By Exercise 2.1.19, we know that
r = rank(B∗ B) of these eigenvalues are non-zero, which we arrange so that
λ1 , . . . , λr are the non-zero ones. We now prove some simple facts about these
eigenvalues and eigenvectors:
i) r ≤ m, since rank(XY ) ≤ min{rank(X), rank(Y )} in general, so r =
rank(B∗ B) ≤ rank(B) ≤ min{m, n}.
ii) Bv1, . . ., Bvr are non-zero and Bvr+1 = · · · = Bvn = 0. These facts follow from noticing that, for each 1 ≤ j ≤ n, we have

‖Bvj‖² = (Bvj) · (Bvj) = vj∗(B∗Bvj) = vj∗(λj vj) = λj‖vj‖²,

which equals zero if and only if λj = 0.

[Side note: Recall that vj is an eigenvector, and eigenvectors are (by definition) non-zero.]
iii) The set {Bv1 , . . . , Bvr } is mutually orthogonal, since if i 6= j then
By multiplying the equation B∗U1 = C∗U2 on the right by U2∗ and then taking
the conjugate transpose of both sides, we see that C = (U2U1∗ )B. Since U2U1∗
is unitary, we are done.
Recall from Theorem 1.4.9 that unitary matrices preserve the norm (induced by the usual dot product) and angles between vectors. The equivalence of conditions (a) and (b) in the above theorem can be thought of as the converse—if two sets of vectors have the same norms and pairwise angles as each other, then there must be a unitary matrix that transforms one set into the other. That is, if {v1, . . ., vn} ⊆ Fm and {w1, . . ., wn} ⊆ Fm are such that ‖vj‖ = ‖wj‖ and vi · vj = wi · wj for all i, j then there exists a unitary matrix U ∈ Mm(F) such that wj = Uvj for all j. After all, if B = [v1 | v2 | · · · | vn] and C = [w1 | w2 | · · · | wn] then B∗B = C∗C if and only if these norms and dot products agree.

[Side note: Now is a good time to have a look back at Figure 1.15.]
Theorem 2.2.10 also raises the question of how simple we can make the
matrix B in a positive semidefinite decomposition A = B∗ B. The following
theorem provides one possible answer: we can choose B so that it is also
positive semidefinite.
Theorem 2.2.11 (Principal Square Root of a Matrix). Suppose A ∈ Mn(F) is positive semidefinite. There exists a unique positive semidefinite matrix P ∈ Mn(F), called the principal square root of A, such that A = P².
Proof. To see that such a matrix P exists, we use the standard method of applying functions to diagonalizable matrices. Specifically, we use the spectral decomposition to write A = UDU∗, where U ∈ Mn(F) is unitary and D ∈ Mn(R) is diagonal with non-negative real numbers (the eigenvalues of A) as its diagonal entries. If we then define P = U√D U∗, where √D is the diagonal matrix that contains the non-negative square roots of the entries of D, then

P² = (U√D U∗)(U√D U∗) = U(√D)²U∗ = UDU∗ = A,

as desired.
To see that P is unique, suppose that Q ∈ Mn(F) is another positive semidefinite matrix for which Q² = A. We can use the spectral decomposition to write Q = V FV∗, where V ∈ Mn(F) is unitary and F ∈ Mn(R) is diagonal with non-negative real numbers as its diagonal entries. Since V F²V∗ = Q² = A = UDU∗, we conclude that the eigenvalues of Q² equal those of A, and thus the diagonal entries of F² equal those of D.
Since we are free in the spectral decomposition to order the eigenvalues along the diagonals of F and D however we like, we can assume without loss of generality that F = √D. (Be careful—it is tempting to try to show that V = U, but this is not true in general. For example, if D = I then U and V can be anything.) It then follows from the fact that P² = A = Q² that V DV∗ = UDU∗, so WD = DW, where W = U∗V. Our goal now is to show that this implies V√D V∗ = U√D U∗ (i.e., Q = P), which is equivalent to W√D = √D W.
To this end, suppose that P has k distinct eigenvalues (i.e., √D has k distinct diagonal entries), which we denote by λ1, λ2, . . ., λk, and denote the multiplicity of each λj by mj. We can then write √D in block diagonal form as follows, where we have grouped repeated eigenvalues (if any) so as to be next to each other (i.e., in the same block); again, the spectral decomposition lets us order the diagonal entries of √D (and thus of D) however we like:

√D = [λ1·Im1 O ··· O; O λ2·Im2 ··· O; ⋮ ⋮ ⋱ ⋮; O O ··· λk·Imk].
If we partition W as a block matrix via blocks of the same size and shape, i.e.,

W = [W1,1 W1,2 ··· W1,k; W2,1 W2,2 ··· W2,k; ⋮ ⋮ ⋱ ⋮; Wk,1 Wk,2 ··· Wk,k],

then block matrix multiplication shows that the equation WD = DW is equivalent to λi²Wi,j = λj²Wi,j for all 1 ≤ i ≠ j ≤ k. Since λi² ≠ λj² when i ≠ j, this forces Wi,j = O whenever i ≠ j. It follows that W is block diagonal and thus commutes with √D as well, so Q = V√D V∗ = U√D U∗ = P.
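The construction from the existence half of this proof is straightforward to carry out numerically. The following is a minimal sketch (assuming NumPy and SciPy are available; scipy.linalg.sqrtm computes a matrix square root, which for a positive semidefinite matrix agrees with the principal square root).

```python
# A minimal sketch (assuming NumPy/SciPy): build P = U sqrt(D) U* from a
# spectral decomposition of a positive semidefinite A and check that P^2 = A.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B.T @ B                               # A = B*B is positive semidefinite

evals, U = np.linalg.eigh(A)              # spectral decomposition A = U D U*
P = U @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ U.T

print(np.allclose(P @ P, A))              # True: P^2 = A
print(np.allclose(P, sqrtm(A)))           # matches SciPy's principal square root
```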
Theorem 2.2.12 (Polar Decomposition). Suppose A ∈ Mn(F). There exists a unitary matrix U ∈ Mn(F) and a positive semidefinite matrix P ∈ Mn(F) such that

A = UP.

Furthermore, P is unique and is given by P = √(A∗A), and U is unique if A is invertible.
(This is sometimes called the right polar decomposition of A. A left polar decomposition is A = QV, where Q is PSD and V is unitary. These two decompositions coincide (i.e., Q = P and V = U) if and only if A is normal.)

Proof. Since A∗A is positive semidefinite, we know from Theorem 2.2.11 that we can define P = √(A∗A) so that A∗A = P² = P∗P and P is positive semidefinite. We then know from Theorem 2.2.10 that there exists a unitary matrix U ∈ Mn(F) such that A = UP.
To see the uniqueness of P, suppose A = U1P1 = U2P2, where U1, U2 ∈ Mn(F) are unitary and P1, P2 ∈ Mn(F) are positive semidefinite. Then P1² = P1∗U1∗U1P1 = A∗A = P2∗U2∗U2P2 = P2². Since P1² = P2² is positive semidefinite, it has a unique principal square root, so P1 = P2.
If A is invertible then uniqueness of U follows from the fact that we can
rearrange the decomposition A = UP to the form U ∗ = PA−1 , so U = (PA−1 )∗ .
The matrix √(A∗A) in the polar decomposition can be thought of as the "matrix version" of the absolute value of a complex number |z| = √(z̄z). In fact, this matrix is sometimes even denoted by |A| = √(A∗A). In a sense, this matrix provides a "regularization" of A that preserves many of its properties (such as rank and Frobenius norm—see Exercise 2.2.22), but also adds the desirable positive semidefiniteness property. (Similarly, the matrices √(A∗A) and √(AA∗) might be called the left and right absolute values of A, respectively, and they are equal if and only if A is normal.)
Example 2.2.8 (Computing a Polar Decomposition). Compute the polar decomposition of the matrix A = [2 −1; 2 4].

Solution:
In order to find the polar decomposition A = UP, our first priority is to compute A∗A and then set P = √(A∗A):

A∗A = [2 2; −1 4][2 −1; 2 4] = [8 6; 6 17],
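A quick numerical check of this computation (a sketch only, assuming SciPy is available): scipy.linalg.polar returns the right polar decomposition A = UP directly.

```python
# A minimal check (assuming NumPy/SciPy) of the polar decomposition of the
# matrix A from this example.
import numpy as np
from scipy.linalg import polar

A = np.array([[2.0, -1.0],
              [2.0,  4.0]])
U, P = polar(A)                          # right polar decomposition A = U P

print(np.allclose(U @ P, A))             # True
print(np.allclose(U.T @ U, np.eye(2)))   # U is unitary (here: real orthogonal)
print(np.allclose(P @ P, A.T @ A))       # P is the principal square root of A*A
```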
The polar decomposition is directly analogous to the fact that every complex
number z ∈ C can be written in the form z = reiθ , where r = |z| is non-negative
and eiθ lies on the unit circle in the complex plane. Indeed, we have already
been thinking about positive semidefinite matrices as analogous to non-negative
real numbers, and it similarly makes sense to think of unitary matrices as anal-
ogous to numbers on the complex unit circle (indeed, this is exactly what
they are in the 1 × 1 case). After all, multiplication by e^{iθ} rotates numbers in the complex plane but does not change their absolute value, just like unitary matrices rotate and/or reflect vectors but do not change their length. (Have a look at Appendix A.3.3 if you need a refresher on the polar form of a complex number.)
In fact, just like we think of PSD matrices as analogous to non-negative real numbers and unitary matrices as analogous to numbers on the complex unit circle, there are many other sets of matrices that it is useful to think of as analogous to subsets of the complex plane. For example, we think of the sets of matrices shown in Figure 2.6 as analogous to the indicated subsets of the complex plane.
Figure 2.6: Several sets of normal matrices visualized on the complex plane as the sets of complex numbers to which they are analogous: 0 ∼ O, 1 ∼ I, the unit circle ∼ unitary matrices, the real numbers ∼ Hermitian matrices, and the non-negative real numbers ∼ positive semidefinite matrices. (Thinking about sets of matrices geometrically, even though it is a very high-dimensional space that we cannot really picture properly, is a very useful technique for building intuition.)
Remark 2.2.3 (Unitary Multiplication and PSD Decompositions). Given a positive semidefinite matrix A ∈ Mn, there are actually multiple different ways to simplify the matrix B in the PSD decomposition A = B∗B, and which one is best to use depends on what we want to do with it.
We showed in Theorem 2.2.11 that we can choose B to be positive semidefinite, and this led naturally to the polar decomposition of a matrix. However, there is another matrix decomposition (the Cholesky decomposition of Section 2.B.2) that says we can instead make B upper triangular with non-negative real numbers on the diagonal. This similarly leads to the QR decomposition of Section 1.C. The relationships between these four matrix decompositions are summarized here:
Decomposition of B                                            Decomposition of A = B∗B
polar decomposition: B = UP (P is positive semidefinite)      principal square root: A = P²
QR decomposition: B = UT (T is upper triangular with          Cholesky decomposition: A = T∗T
  non-negative real numbers on the diagonal)
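A small numerical sketch of this summary (assuming NumPy and SciPy are available): both the principal square root and the Cholesky factor give valid choices of B in a PSD decomposition A = B∗B.

```python
# A minimal sketch (assuming NumPy/SciPy): two "simplified" PSD decompositions
# A = B*B, with B either the principal square root or the Cholesky factor.
import numpy as np
from scipy.linalg import sqrtm, cholesky

rng = np.random.default_rng(1)
C = rng.standard_normal((3, 3))
A = C.T @ C + 0.1 * np.eye(3)            # positive definite test matrix

P = sqrtm(A)                             # principal square root: A = P*P
T = cholesky(A, lower=False)             # upper triangular with positive diagonal: A = T*T

print(np.allclose(P.conj().T @ P, A))    # True
print(np.allclose(T.conj().T @ T, A))    # True
```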
2.2.10 Let J ∈ Mn be the n × n matrix with all entries equal to 1. Show that J is positive semidefinite.

∗∗2.2.11 Show that if A ∈ Mn is positive semidefinite and aj,j = 0 for some 1 ≤ j ≤ n then the entire j-th row and j-th column of A must consist of zeros. [Hint: What is v∗Av if v has just two non-zero entries?]

2.2.12 Show that every (not necessarily Hermitian) strictly diagonally dominant matrix is invertible.

∗∗2.2.13 Show that the block diagonal matrix A = [A1 O ··· O; O A2 ··· O; ⋮ ⋮ ⋱ ⋮; O O ··· An] is positive (semi)definite if and only if each of A1, A2, . . . , An is positive (semi)definite.

∗∗2.2.14 Show that A ∈ Mn is normal if and only if there exists a unitary matrix U ∈ Mn such that A∗ = UA.

2.2.15 A matrix A ∈ Mn(R) with non-negative entries is called row stochastic if its rows each add up to 1 (i.e., ai,1 + ai,2 + · · · + ai,n = 1 for each 1 ≤ i ≤ n). Show that each eigenvalue λ of a row stochastic matrix has |λ| ≤ 1.

∗∗2.2.16 Suppose F = R or F = C, and A ∈ Mn(F).
(a) Show that A is self-adjoint if and only if there exist positive semidefinite matrices P, N ∈ Mn(F) such that A = P − N. [Hint: Apply the spectral decomposition to A.]
(b) Show that if F = C then A can be written as a linear combination of 4 or fewer positive semidefinite matrices, even if it is not Hermitian. [Hint: Have a look at Remark 1.B.1.]
(c) Explain why the result of part (b) does not hold if F = R.

2.2.17 Suppose A ∈ Mn is self-adjoint with p strictly positive eigenvalues. Show that the largest integer r with the property that P∗AP is positive definite for some P ∈ Mn,r is r = p.

2.2.18 Suppose B ∈ Mn. Show that tr(√(B∗B)) ≥ |tr(B)|. [Side note: In other words, out of all possible PSD decompositions of a PSD matrix, its principal square root is the one with the largest trace.]

∗∗2.2.19 Suppose that A, B, C ∈ Mn are positive semidefinite.
(a) Show that tr(A) ≥ 0.
(b) Show that tr(AB) ≥ 0. [Hint: Decompose A and B.]
(c) Provide an example to show that it is not necessarily the case that tr(ABC) ≥ 0. [Hint: Finding an example by hand might be tricky. If you have trouble, try writing a computer program that searches for an example by generating A, B, and C randomly; a random-search sketch along these lines appears after these exercises.]

∗∗2.2.20 Suppose that A ∈ Mn is self-adjoint.
(a) Show that A is positive semidefinite if and only if tr(AB) ≥ 0 for all positive semidefinite B ∈ Mn. [Side note: This provides a converse to the statement of Exercise 2.2.19(b).]
(b) Show that A is positive definite if and only if tr(AB) > 0 for all positive semidefinite O ≠ B ∈ Mn.

∗∗2.2.21 Suppose that A ∈ Mn is self-adjoint.
(a) Show that if there exists a scalar c ∈ R such that tr(AB) ≥ c for all positive semidefinite B ∈ Mn then A is positive semidefinite and c ≤ 0.
(b) Show that if there exists a scalar c ∈ R such that tr(AB) > c for all positive definite B ∈ Mn then A is positive semidefinite and c ≤ 0.

∗∗2.2.22 Let |A| = √(A∗A) be the absolute value of the matrix A ∈ Mn(F) that was discussed after Theorem 2.2.12.
(a) Show that rank(|A|) = rank(A).
(b) Show that ‖|A|‖F = ‖A‖F.
(c) Show that ‖|A|v‖ = ‖Av‖ for all v ∈ Fn.

∗∗2.2.23 Prove Theorem 2.2.2. [Hint: Mimic the proof of Theorem 2.2.1 and just make minor changes where necessary.]

∗∗2.2.24 Recall Theorem 2.2.3, which described some ways in which we can combine PSD matrices to create new PSD matrices.
(a) Prove part (a) of the theorem.
(b) Prove part (b) of the theorem.
(c) Prove part (c) of the theorem.

∗∗2.2.25 Suppose A ∈ Mn(F) is self-adjoint.
(a) Show that A is positive semidefinite if and only if there exists a set of vectors {v1, v2, . . . , vn} ⊂ Fn such that ai,j = vi · vj for all 1 ≤ i, j ≤ n. [Side note: A is sometimes called the Gram matrix of {v1, v2, . . . , vn}.]
(b) Show that A is positive definite if and only if the set of vectors from part (a) is linearly independent.
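As mentioned in the hint to Exercise 2.2.19(c), a short random search can turn up the required example. The following is a minimal sketch of such a search (assuming NumPy is available; whether it finds an example in a given run depends on the random draws, so it only reports what it found).

```python
# A random-search sketch in the spirit of the hint to Exercise 2.2.19(c):
# generate random PSD matrices A, B, C and look for tr(ABC) < 0.
import numpy as np

rng = np.random.default_rng(2)

def random_psd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T                        # X X^T is positive semidefinite

found = None
for _ in range(20000):
    A, B, C = random_psd(3), random_psd(3), random_psd(3)
    t = np.trace(A @ B @ C)
    if t < -1e-10:
        found = (A, B, C, t)
        break

print("example found with tr(ABC) =", found[3]) if found else print("no example in this run")
```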
2.2.26 Suppose that A, B ∈ Mn are positive definite.
(a) Show that all eigenvalues of AB are real and positive. [Hint: Multiply AB on the left by √A−1 and on the right by √A.]
(b) Part (a) does not imply that AB is positive definite. Why not?

2.2.27 Let A, B ∈ Mn be positive definite matrices.
(a) Show that det(I + B) ≥ 1 + det(B).
(b) Show that det(A + B) ≥ det(A) + det(B). [Hint: det(A + B) = det(A) det(I + √A−1 B √A−1).]
[Side note: The stronger inequality (det(A + B))^{1/n} ≥ (det(A))^{1/n} + (det(B))^{1/n} is also true for positive definite matrices (and is called Minkowski's determinant theorem), but proving it is quite difficult.]

∗∗2.2.28 In this exercise, we show that if A ∈ Mn(C) is written in terms of its columns as A = [a1 | a2 | · · · | an], then |det(A)| ≤ ‖a1‖‖a2‖ · · · ‖an‖. [Side note: This is called Hadamard's inequality.]
(a) Explain why it suffices to prove this inequality in the case when ‖aj‖ = 1 for all 1 ≤ j ≤ n. Make this assumption throughout the rest of this question.
(b) Show that det(A∗A) ≤ 1. [Hint: Use the AM–GM inequality (Theorem A.5.3).]
(c) Conclude that |det(A)| ≤ 1 as well, thus completing the proof.
(d) Explain under which conditions equality is attained in Hadamard's inequality.
Theorem 2.3.1 (Singular Value Decomposition (SVD)). Suppose F = R or F = C, and A ∈ Mm,n(F). There exist unitary matrices U ∈ Mm(F) and V ∈ Mn(F), and a diagonal matrix Σ ∈ Mm,n(R) with non-negative entries, such that

A = UΣV∗.
Before proving the singular value decomposition, it’s worth comparing the
ways in which it is “better” and “worse” than the other matrix decompositions
that we already know:
• Better: It applies to every single matrix (even rectangular ones). Every other matrix decomposition we have seen so far had at least some restrictions (e.g., diagonalization only applies to matrices with a basis of eigenvectors, the spectral decomposition only applies to normal matrices, and Schur triangularization only applies to square matrices).
• Better: The matrix Σ in the middle of the SVD is diagonal (and even has real non-negative entries). Schur triangularization only results in an upper triangular middle piece, and even diagonalization and the spectral decomposition do not guarantee an entrywise non-negative matrix.
• Worse: It requires two unitary matrices U and V, whereas all of our previous decompositions only required one unitary matrix or invertible matrix.
(If m ≠ n then A∗A and AA∗ have different sizes, but they still have essentially the same eigenvalues—whichever one is larger just has some extra 0 eigenvalues. The same is actually true of AB and BA for any A and B; see Exercise 2.B.11.)
Proof of the singular value decomposition. To see that singular value decompositions exist, suppose that m ≥ n and construct a spectral decomposition of the positive semidefinite matrix A∗A = V DV∗ (if m < n then we instead use the matrix AA∗, but otherwise the proof is almost identical). Since A∗A is positive semidefinite, its eigenvalues (i.e., the diagonal entries of D) are real and non-negative, so we can define Σ ∈ Mm,n by [Σ]j,j = √(dj,j) for all j, and [Σ]i,j = 0 if i ≠ j. (In other words, Σ is the principal square root of D, but with extra zero rows so that it has the same size as A.)
It follows that

A∗A = V DV∗ = V Σ∗ΣV∗ = (ΣV∗)∗(ΣV∗),

so the equivalence of conditions (a) and (b) in Theorem 2.2.10 tells us that there exists a unitary matrix U ∈ Mm(F) such that A = U(ΣV∗), which is a singular value decomposition of A.
To check the "furthermore" claims we just note that if A = UΣV∗ with U and V unitary and Σ diagonal then it must be the case that

A∗A = (UΣV∗)∗(UΣV∗) = V(Σ∗Σ)V∗

(we must write Σ∗Σ here, instead of Σ², because Σ might not be square), which is a diagonalization of A∗A. Since the only way to diagonalize a matrix is
via its eigenvalues and eigenvectors (refer back to Theorem 2.0.1, for example),
it follows that the columns of V are eigenvectors of A∗ A and the diagonal
entries of Σ∗ Σ (i.e., the squares of the diagonal entries of Σ) are the eigenvalues
of A∗ A. The statements about eigenvalues and eigenvectors of AA∗ are proved
in a similar manner.
We typically denote singular values (i.e., the diagonal entries of Σ) by σ1, σ2, . . . , σmin{m,n}, and we order them so that σ1 ≥ σ2 ≥ · · · ≥ σmin{m,n}. Note that exactly rank(A) of A's singular values are non-zero, since rank(UΣV∗) = rank(Σ), and the rank of a diagonal matrix is the number of non-zero diagonal entries. (For a refresher on these facts about the rank of a matrix, see Exercise 2.1.19 and Theorem A.1.2, for example.) We often denote the rank of A simply by r, so we have σ1 ≥ · · · ≥ σr > 0 and σr+1 = · · · = σmin{m,n} = 0.
Example 2.3.1 (Computing Singular Values). Compute the singular values of the matrix A = [3 2; −2 0].

Solution:
The singular values are the square roots of the eigenvalues of A∗A. Direct computation gives A∗A = [13 6; 6 4], which has characteristic polynomial λ² − 17λ + 16 = (λ − 1)(λ − 16). We thus conclude that the eigenvalues of A∗A are λ1 = 16 and λ2 = 1, so the singular values of A are σ1 = √16 = 4 and σ2 = √1 = 1.
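A numerical check of this example (a sketch only, assuming NumPy is available):

```python
# Verify the singular values of A = [[3, 2], [-2, 0]] numerically.
import numpy as np

A = np.array([[3.0, 2.0],
              [-2.0, 0.0]])
print(np.linalg.svd(A, compute_uv=False))    # [4. 1.]
print(np.sqrt(np.linalg.eigvalsh(A.T @ A)))  # same values, in ascending order
```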
Example 2.3.2 (Computing a Singular Value Decomposition). Compute a singular value decomposition of the matrix A = [1 2 3; −1 0 1; 3 2 1].

Solution:
As discussed earlier, our first step is to find the "V" and "Σ" pieces of A's singular value decomposition, which we do by constructing a spectral decomposition of A∗A and taking the square roots of its eigenvalues. Well, direct calculation shows that

A∗A = [11 8 5; 8 8 8; 5 8 11],

which has characteristic polynomial

pA∗A(λ) = det(A∗A − λI) = −λ³ + 30λ² − 144λ = −λ(λ − 6)(λ − 24).

(This is a good time to remind yourself of how to calculate eigenvalues and eigenvectors, in case you have gotten rusty. We do not go through all of the details here.)
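The remaining steps of this example (finding V, Σ, and U) can be checked numerically. Here is a minimal sketch (assuming NumPy; np.linalg.svd returns A = U·diag(s)·Vh, so its output may differ from a hand computation by signs or by the ordering of equal singular values).

```python
# The eigenvalues of A*A are 24, 6 and 0, so the singular values of A are
# 2*sqrt(6), sqrt(6) and 0; np.linalg.svd confirms this and returns U and V.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [-1.0, 0.0, 1.0],
              [3.0, 2.0, 1.0]])

U, s, Vh = np.linalg.svd(A)                 # A = U @ diag(s) @ Vh
print(s)                                    # approx [4.899, 2.449, 0] = [2*sqrt(6), sqrt(6), 0]
print(np.allclose(U @ np.diag(s) @ Vh, A))  # True
```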
Before delving into what makes the singular value decomposition so useful, it is worth noting that if A ∈ Mm,n(F) has singular value decomposition A = UΣV∗ then A^T and A∗ have singular value decompositions

A^T = V̄ Σ^T U^T and A∗ = V Σ∗U∗.

(Here we use V̄ to mean the entrywise complex conjugate of V. That is, V̄ = (V∗)^T.) In particular, Σ, Σ^T, and Σ∗ all have the same diagonal entries, so A, A^T, and A∗ all have the same singular values.
Figure 2.7: The singular value decomposition says that every linear transformation (i.e., multiplication by a matrix A) can be thought of as a rotation/reflection V∗, followed by a scaling along the standard axes Σ, followed by another rotation/reflection U.
Figure 2.8: Every linear transformation on R3 sends the unit sphere to a (possibly degenerate) ellipsoid. The linear transformation displayed here is the one with the standard matrix A from Example 2.3.2.
The fact that the unit sphere is turned into a 2D ellipse by this matrix
corresponds to the fact that it has rank 2, so its range is 2-dimensional. In fact,
the first two left singular vectors (which point in the directions of the major and
minor axes of the ellipse) form an orthonormal basis of the range. Similarly, the
third right singular vector, v3, has the property that Av3 = UΣV∗v3 = UΣe3 =
σ3Ue3 = 0, since σ3 = 0. It follows that v3 is in the null space of A.
This same type of argument works in general and leads to the following the-
orem, which shows that the singular value decomposition provides orthonormal
bases for each of the four fundamental subspaces of a matrix:
Theorem 2.3.2 (Bases of the Fundamental Subspaces). Let A ∈ Mm,n be a matrix with rank(A) = r and singular value decomposition A = UΣV∗, where

U = [u1 | u2 | · · · | um] and V = [v1 | v2 | · · · | vn].

Then
a) {u1, u2, . . . , ur} is an orthonormal basis of range(A),
b) {ur+1, ur+2, . . . , um} is an orthonormal basis of null(A∗),
c) {v1, v2, . . . , vr} is an orthonormal basis of range(A∗), and
d) {vr+1, vr+2, . . . , vn} is an orthonormal basis of null(A).
Dividing both sides by σ j then shows that u1 , u2 , . . . , ur are all in the range
of A. Since range(A) is (by definition) r-dimensional, and {u1 , u2 , . . . , ur } is a
set of r mutually orthogonal unit vectors, it must be an orthonormal basis of
range(A).
Similarly, σj = 0 whenever j ≥ r + 1, so Avj = UΣV∗vj = UΣej = σjUej = 0 for each such j.
It follows that vr+1 , vr+2 , . . . , vn are all in the null space of A. Since the rank-
nullity theorem (Theorem A.1.2(e)) tells us that null(A) has dimension n − r,
and {vr+1 , vr+2 , . . . , vn } is a set of n − r mutually orthogonal unit vectors, it
must be an orthonormal basis of null(A).
The corresponding facts about range(A∗ ) and null(A∗ ) follow by applying
these same arguments to A∗ instead of A.
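A short sketch of how this theorem is used in practice (assuming NumPy is available; the matrix below is the one from Example 2.3.2):

```python
# Read off orthonormal bases of the four fundamental subspaces from an SVD.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [-1.0, 0.0, 1.0],
              [3.0, 2.0, 1.0]])

U, s, Vh = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))                 # numerical rank

range_A     = U[:, :r]                     # u_1, ..., u_r
null_Astar  = U[:, r:]                     # u_{r+1}, ..., u_m
range_Astar = Vh[:r, :].conj().T           # v_1, ..., v_r
null_A      = Vh[r:, :].conj().T           # v_{r+1}, ..., v_n

print(np.allclose(A @ null_A, 0))              # columns of null_A lie in null(A)
print(np.allclose(A.conj().T @ null_Astar, 0)) # columns of null_Astar lie in null(A*)
```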
Remark 2.3.1 (A Geometric Interpretation of the Adjoint). Up until now, the transpose of a matrix (and more generally, the adjoint of a linear transformation) has been one of the few linear algebraic concepts that we have not interpreted geometrically. The singular value decomposition lets us finally fill this gap.
Notice that if A has singular value decomposition A = UΣV∗ then A∗ and A−1 (if it exists) have singular value decompositions

A∗ = V Σ∗U∗ and A−1 = V Σ−1U∗,

where Σ−1 is the diagonal matrix with 1/σ1, . . ., 1/σn on its diagonal. (Keep in mind that A∗ exists even if A is not square, in which case Σ∗ has the same diagonal entries as Σ, but a different shape.)
(Figure: the unit circle and the standard basis vectors e1 and e2 under the maps A∗ and A−1. A might be more complicated than this, since it might send the unit circle to an ellipse rather than to a circle, but it is a bit easier to visualize the relationships between these pictures when it sends circles to circles.)
Theorem 2.3.3 (Orthogonal Rank-One Sum Decomposition). Suppose F = R or F = C, and A ∈ Mm,n(F) has rank(A) = r. There exist orthonormal sets of vectors {u1, . . . , ur} ⊂ Fm and {v1, . . . , vr} ⊂ Fn such that

A = σ1u1v1∗ + σ2u2v2∗ + · · · + σr ur vr∗,

where σ1 ≥ σ2 ≥ · · · ≥ σr > 0 are the non-zero singular values of A.

Proof. For simplicity, we again assume that m ≤ n throughout this proof, since nothing substantial changes if m > n. Use the singular value decomposition to write A = UΣV∗, where U and V are unitary and Σ is diagonal, and then write U and V in terms of their columns:

U = [u1 | u2 | · · · | um] and V = [v1 | v2 | · · · | vn].
In fact, not only does the orthogonal rank-one sum decomposition follow
from the singular value decomposition, but they are actually equivalent—we
can essentially just follow the above proof backward to retrieve the singular
value decomposition from the orthogonal rank-one sum decomposition. For
this reason, this decomposition is sometimes just referred to as the singular
value decomposition itself.
Solution:
This is the same matrix from Example 2.3.3, which has singular value decomposition

U = (1/√6)[√2 −√3 −1; √2 0 2; √2 √3 −1],   Σ = [√6 0 0 0; 0 √2 0 0; 0 0 0 0],   and

V = (1/√2)[0 −1 0 1; 1 0 1 0; 1 0 −1 0; 0 1 0 1].
Theorem 2.3.4 (Singular Values of Normal Matrices). If A ∈ Mn is a normal matrix then its singular values are the absolute values of its eigenvalues.
The fact that the three maximizations above really are equivalent to each other follows simply from rescaling the vectors v that are being maximized over. In particular, v is a vector that maximizes ‖Av‖/‖v‖ (i.e., the first maximization) if and only if v/‖v‖ is a unit vector that maximizes ‖Av‖ (i.e., the second and third maximizations). The operator norm is typically considered the "default" matrix norm, so if we use the notation ‖A‖ without any subscripts or other indicators, we typically mean the operator norm (just like ‖v‖ for a vector v ∈ Fn typically refers to the norm induced by the dot product if no other context is provided).
Remark 2.3.2 (Induced Matrix Norms). In the definition of the operator norm (Definition 2.3.1), the norm used on vectors in Fn is the usual norm induced by the dot product:

‖v‖ = √(v · v) = √(|v1|² + |v2|² + · · · + |vn|²).

However, it is also possible to define matrix norms (or more generally, norms of linear transformations between any two normed vector spaces—see Section 1.D) based on any norms on the input and output spaces. That is, given any normed vector spaces V and W, we define the induced norm of a linear transformation T : V → W by

‖T‖ := max{ ‖T(v)‖W : v ∈ V, ‖v‖V = 1 },

and the geometric interpretation of this norm is similar to that of the operator norm—‖T‖ measures how much T can stretch a vector, when "stretch" is measured in whatever norms we chose for V and W.
(Most of the results from later in this section break down if we use a weird norm on the input and output vector space. For example, induced matrix norms are often very difficult to compute, but the operator norm is easy to compute.)
Notice that a matrix cannot stretch any vector by more than a multiplicative factor of its operator norm. That is, if A ∈ Mm,n and B ∈ Mn,p then ‖Aw‖ ≤ ‖A‖‖w‖ and ‖Bv‖ ≤ ‖B‖‖v‖ for all v ∈ Fp and w ∈ Fn. (Be careful: in expressions like ‖A‖‖w‖, the first norm is the operator norm of a matrix and the second norm is the norm of a vector induced by the dot product.) It follows that

‖(AB)v‖ = ‖A(Bv)‖ ≤ ‖A‖‖Bv‖ ≤ ‖A‖‖B‖‖v‖ for all v ∈ Fp.

Dividing both sides by ‖v‖ shows that ‖(AB)v‖/‖v‖ ≤ ‖A‖‖B‖ for all v ≠ 0, so ‖AB‖ ≤ ‖A‖‖B‖. We thus say that the operator norm is submultiplicative. It turns out that the Frobenius norm is also submultiplicative, and we state these two results together as a theorem.
Figure 2.9: A visual representation of the operator norm. Matrices (linear transforma-
tions) transform the unit circle into an ellipse. The operator norm is the distance of the
farthest point on that ellipse from the origin (i.e., the length of its semi-major axis).
To prove this relationship between the operator norm and singular values
algebraically, we first need the following helper theorem that shows that mul-
tiplying a matrix on the left or right by a unitary matrix does not change its
operator norm. For this reason, we say that the operator norm is unitarily
invariant, and it turns out that the Frobenius norm also has this property:
Theorem 2.3.6 (Unitary Invariance). Let A ∈ Mm,n and suppose U ∈ Mm and V ∈ Mn are unitary matrices. Then

‖UAV‖ = ‖A‖ and ‖UAV‖F = ‖A‖F.

Proof. For the operator norm, we start by showing that every unitary matrix U ∈ Mm has ‖U‖ = 1. To this end, just recall from Theorem 1.4.9 that ‖Uv‖ = ‖v‖ for all v ∈ Fm, so ‖Uv‖/‖v‖ = 1 for all v ≠ 0, which implies ‖U‖ = 1. (The fact that ‖U‖ = 1 whenever U is unitary is useful. Remember it!) We then know from submultiplicativity of the operator norm that
Theorem 2.3.7 (Matrix Norms in Terms of Singular Values). Suppose A ∈ Mm,n has rank r and singular values σ1 ≥ σ2 ≥ · · · ≥ σr > 0. Then

‖A‖ = σ1 and ‖A‖F = √(σ1² + σ2² + · · · + σr²).

so ‖Σ‖ ≥ σ1. For the opposite inequality, note that for every v ∈ Fn we have

‖Σv‖ = ‖(σ1v1, . . . , σrvr, 0, . . . , 0)‖ = √(∑_{i=1}^r σi²|vi|²) ≤ √(∑_{i=1}^r σ1²|vi|²) ≤ σ1‖v‖,

since σ1 ≥ σ2 ≥ · · · ≥ σr.
Example 2.3.5 (Computing Matrix Norms). Compute the operator and Frobenius norms of A = [1 2 3; −1 0 1; 3 2 1].

Solution:
We saw in Example 2.3.2 that this matrix has non-zero singular values σ1 = 2√6 and σ2 = √6. By Theorem 2.3.7, it follows that

‖A‖ = σ1 = 2√6 and ‖A‖F = √(σ1² + σ2²) = √((2√6)² + (√6)²) = √(24 + 6) = √30.
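A numerical check of this example (a sketch only, assuming NumPy is available):

```python
# The operator norm is the largest singular value; the Frobenius norm is the
# square root of the sum of the squared singular values.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [-1.0, 0.0, 1.0],
              [3.0, 2.0, 1.0]])

print(np.linalg.norm(A, 2))      # operator norm: 2*sqrt(6) ≈ 4.899
print(np.linalg.norm(A, 'fro'))  # Frobenius norm: sqrt(30) ≈ 5.477
```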
with the final equality following from unitary invariance of the operator norm (Theorem 2.3.6). Since Σ∗Σ is a diagonal matrix with largest entry σ1², it follows that ‖Σ∗Σ‖ = σ1² = ‖A‖², which completes the proof. (We use Theorem 2.3.7 twice here at the end: once to see that ‖Σ∗Σ‖ = σ1² and once to see that σ1 = ‖A‖.)
We close this section by noting that there are actually many matrix norms
out there (just like we saw that there are many vector norms in Section 1.D), and
many of the most useful ones come from singular values just like the operator
and Frobenius norms. We explore another particularly important matrix norm
of this type in Exercises 2.3.17–2.3.19.
2.3.4 Determine which of the following statements are true and which are false.
∗(a) If λ is an eigenvalue of A ∈ Mn(C) then |λ| is a singular value of A.
(b) If σ is a singular value of A ∈ Mn(C) then σ² is a singular value of A².
∗(c) If σ is a singular value of A ∈ Mm,n(C) then σ² is a singular value of A∗A.
(d) If A ∈ Mm,n(C) then ‖A∗A‖ = ‖A‖².
∗(e) If A ∈ Mm,n(C) then ‖A∗A‖F = ‖A‖F².
(f) If A ∈ Mn(C) is a diagonal matrix then its singular values are its diagonal entries.
∗(g) If A ∈ Mn(C) has singular value decomposition A = UDV∗ then A² = UD²V∗.
(h) If U ∈ Mn is unitary then ‖U‖ = 1.
∗(i) Every matrix has the same singular values as its transpose.

∗∗2.3.5 Show that A ∈ Mn is unitary if and only if all of its singular values equal 1.

2.3.14 Let A ∈ Mm,n have singular values σ1 ≥ σ2 ≥ · · · ≥ 0. Show that the block matrix [O A; A∗ O] has eigenvalues ±σ1, ±σ2, . . ., together with |m − n| extra 0 eigenvalues.

∗∗2.3.15 Suppose A ∈ Mm,n and c ∈ R is a scalar. Show that the block matrix [cIm A; A∗ cIn] is positive semidefinite if and only if ‖A‖ ≤ c.

∗∗2.3.16 Show that the operator norm (Definition 2.3.1) is in fact a norm (i.e., satisfies the three properties of Definition 1.D.1).
2.3.22 Suppose that the 2 × 2 block matrix [A B; B∗ C] is positive semidefinite. Show that range(B) ⊆ range(A). [Hint: Show instead that null(A) ⊆ null(B∗).]

2.3.23 Use the Jordan–von Neumann theorem (Theorem 1.D.8) to determine whether or not the operator norm is induced by some inner product on Mm,n.

2.3.24 Suppose A ∈ Mm,n has singular values σ1, σ2, . . ., σp (where p = min{m, n}) and QR decomposition (see Section 1.C) A = UT with U ∈ Mm unitary and T ∈ Mm,n upper triangular. Show that the product of the singular values of A equals the product of the diagonal entries of T:

σ1 · σ2 · · · σp = t1,1 · t2,2 · · · tp,p.

∗∗2.3.25 Suppose P ∈ Mn(C) is a non-zero projection (i.e., P² = P).
(a) Show that ‖P‖ ≥ 1.
(b) Show that if P∗ = P (i.e., P is an orthogonal projection) then ‖P‖ = 1.
(c) Show that if ‖P‖ = 1 then P∗ = P. [Hint: Schur triangularize P.]
[Side note: This can be seen as the converse of Theorem 1.4.13.]

∗∗2.3.26 A matrix A ∈ Mn(C) is called complex symmetric if A^T = A. For example, A = [1 i; i 2−i] is complex symmetric. In this exercise, we show that the singular value decomposition of these matrices can be chosen to have a special form.
(a) Provide an example to show that a complex symmetric matrix may not be normal and thus may not have a spectral decomposition.
(b) Suppose A ∈ Mn(C) is complex symmetric. Show that there exists a unitary matrix V ∈ Mn(C) such that, if we define B = V^T AV, then B is complex symmetric and B∗B is real. [Hint: Apply the spectral decomposition to A∗A.]
(c) Let B be as in part (b) and define BR = (B + B∗)/2 and BI = (B − B∗)/(2i). Show that BR and BI are real, symmetric, and commute. [Hint: B = BR + iBI and B∗B is real.] [Side note: Here we are using the Cartesian decomposition of B introduced in Remark 1.B.1.]
(d) Let B be as in part (b). Show that there is a unitary matrix W ∈ Mn(R) such that W^T BW is diagonal. [Hint: Use Exercise 2.1.28—which matrices have we found that commute?]
(e) Use the unitary matrices V and W from parts (b) and (d) of this exercise to conclude that there exists a unitary matrix U ∈ Mn(C) and a diagonal matrix D ∈ Mn(R) with non-negative entries such that A = UDU^T. [Side note: This is called the Takagi factorization of A. Be somewhat careful—the entries on the diagonal of D are the singular values of A, not its eigenvalues.]
Definition 2.4.1 (Jordan Blocks). Given a scalar λ ∈ C and an integer k ≥ 1, the Jordan block of order k corresponding to λ is the matrix Jk(λ) ∈ Mk(C) of the form

Jk(λ) = [λ 1 0 ··· 0 0; 0 λ 1 ··· 0 0; 0 0 λ ··· 0 0; ⋮ ⋮ ⋮ ⋱ ⋮ ⋮; 0 0 0 ··· λ 1; 0 0 0 ··· 0 λ].

(We say that Jk(λ) has λ along its diagonal and 1 along its "superdiagonal".)
Theorem 2.4.1 (Jordan Decomposition). If A ∈ Mn(C) then there exists an invertible matrix P ∈ Mn(C) and Jordan blocks Jk1(λ1), Jk2(λ2), . . . , Jkm(λm) such that

A = P [Jk1(λ1) O ··· O; O Jk2(λ2) ··· O; ⋮ ⋮ ⋱ ⋮; O O ··· Jkm(λm)] P−1.

Furthermore, this block diagonal matrix is called the Jordan canonical form of A, and it is unique up to re-ordering the diagonal blocks.
(We will see shortly that the numbers λ1, λ2, . . . , λm are necessarily the eigenvalues of A listed according to geometric multiplicity. Also, we must have k1 + · · · + km = n.)

For example, the following matrices are in Jordan canonical form:

[4 0 0 0; 0 2 1 0; 0 0 2 1; 0 0 0 2],   [2 1 0 0; 0 2 0 0; 0 0 4 0; 0 0 0 5],   and   [3 1 0 0; 0 3 0 0; 0 0 3 1; 0 0 0 3].
Diagonal matrices are all in Jordan canonical form, and their Jordan blocks
all have sizes 1 × 1, so the Jordan decomposition really is a generalization of
diagonalization. We now start introducing the tools needed to prove that every
matrix has such a decomposition, to show that it is unique, and to actually
compute it.
Solution:
Since A is triangular (in fact, it is in Jordan canonical form), we see immediately that its only eigenvalue is 2 with algebraic multiplicity 5. We then compute

A − 2I = [0 1 0 0 0; 0 0 1 0 0; 0 0 0 0 0; 0 0 0 0 1; 0 0 0 0 0],   (A − 2I)² = [0 0 1 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0],

(We partition A as a block matrix in this way just to better highlight what happens to its Jordan blocks when computing the powers (A − 2I)^k.)
Theorem 2.4.2 (Jordan Canonical Form from Geometric Multiplicities). Suppose A ∈ Mn(C) has eigenvalue λ with order-k geometric multiplicity γk. Then for each k ≥ 1, every Jordan canonical form of A has
a) γk − γk−1 Jordan blocks Jj(λ) of order j ≥ k, and
b) 2γk − γk+1 − γk−1 Jordan blocks Jk(λ) of order exactly k.
Before proving this theorem, we note that properties (a) and (b) are ac-
tually equivalent to each other—each one can be derived from the other via
straightforward algebraic manipulations. We just present them both because
property (a) is a bit simpler to work with, but property (b) is what we actually
want.
Since the nullity of a block diagonal matrix is just the sum of the nullities of its diagonal blocks, it suffices to prove the following two claims:
• nullity((Jj(µ) − λI)^k) = 0 whenever λ ≠ µ. To see why this is the case, simply notice that Jj(µ) − λI has diagonal entries (and thus eigenvalues, since it is upper triangular) equal to µ − λ ≠ 0. It follows that Jj(µ) − λI is invertible whenever λ ≠ µ, so it and its powers all have full rank (and thus nullity 0).
• nullity((Jj(λ) − λI)^k) = k whenever 0 ≤ k ≤ j. To see why this is the case, notice that

Jj(λ) − λI = [0 1 0 ··· 0 0; 0 0 1 ··· 0 0; 0 0 0 ··· 0 0; ⋮ ⋮ ⋮ ⋱ ⋮ ⋮; 0 0 0 ··· 0 1; 0 0 0 ··· 0 0],

which is a simple enough matrix that we can compute the nullities of its powers fairly directly (this computation is left to Exercise 2.4.17). (That is, Jj(λ) − λI = Jj(0); in Exercise 2.4.17, we call this matrix N1.)
Indeed, if we let mk (1 ≤ k ≤ n) denote the number of occurrences of the Jordan block Jk(λ) along the diagonal of J, the above two claims tell us that γ1 = m1 + m2 + m3 + · · · + mn, γ2 = m1 + 2m2 + 2m3 + · · · + 2mn, γ3 = m1 + 2m2 + 3m3 + · · · + 3mn, and in general

γk = ∑_{j=1}^{k} j·mj + ∑_{j=k+1}^{n} k·mj.

Subtracting these formulas from each other gives γk − γk−1 = ∑_{j=k}^{n} mj, which is exactly statement (a) of the theorem. Statement (b) of the theorem then follows from subtracting the formula in statement (a) from a shifted version of itself:

2γk − γk+1 − γk−1 = (γk − γk−1) − (γk+1 − γk) = ∑_{j=k}^{n} mj − ∑_{j=k+1}^{n} mj = mk.
The above theorem has the following immediate corollaries that can some-
times be used to reduce the amount of work needed to construct a matrix’s
Jordan canonical form:
• The geometric multiplicity γ1 of the eigenvalue λ counts the number of
Jordan blocks corresponding to λ .
• If γk = γk+1 for a particular value of k then γk = γk+1 = γk+2 = · · · .
Furthermore, if we make use of the fact that the sum of the orders of the Jordan blocks corresponding to a particular eigenvalue λ must equal its algebraic multiplicity (i.e., the number of times that λ appears along the diagonal of the Jordan canonical form), we get a bound on how many geometric multiplicities we have to compute in order to construct a Jordan canonical form:

! If λ is an eigenvalue of a matrix with algebraic multiplicity µ and geometric multiplicities γ1, γ2, γ3, . . ., then γk ≤ µ for each k ≥ 1. Furthermore, γk = µ whenever k ≥ µ.

(Furthermore, A ∈ Mn(C) is diagonalizable if and only if, for each of its eigenvalues, the multiplicities satisfy γ1 = γ2 = · · · = µ.)
As one final corollary of the above theorem, we are now able to show
that Jordan canonical forms are indeed unique (up to re-ordering their Jordan
blocks), assuming that they exist in the first place:
Example 2.4.2 (Our First Jordan Canonical Form). Compute the Jordan canonical form of A = [−6 0 −2 1; 0 −3 −2 1; 3 0 1 1; −3 0 2 2].

Solution:
The eigenvalues of this matrix are (listed according to algebraic multiplicity) 3, −3, −3, and −3. Since λ = 3 has algebraic multiplicity 1, we know that J1(3) = [3] is one of the Jordan blocks in the Jordan canonical form J of A. Similarly, since λ = −3 has algebraic multiplicity 3, we know that the orders of the Jordan blocks for λ = −3 must sum to 3. That is, the Jordan canonical form of A must be one of

[3 0 0 0; 0 −3 1 0; 0 0 −3 1; 0 0 0 −3],   [3 0 0 0; 0 −3 0 0; 0 0 −3 1; 0 0 0 −3],   [3 0 0 0; 0 −3 0 0; 0 0 −3 0; 0 0 0 −3].
via standard techniques, but we omit the details here due to the size of this matrix. In particular, the geometric multiplicity of λ = 1 is γ1 = 4, and we furthermore have

γ2 = nullity((A − I)²) = nullity [0 0 0 0 0 0 0 0;
                                  0 0 0 0 0 0 0 0;
                                  0 −1 −1 0 0 1 −1 0;
                                  0 1 1 0 0 −1 1 0;
                                  0 0 0 0 0 0 0 0;
                                  0 −1 −1 0 0 1 −1 0;
                                  0 0 0 0 0 0 0 0;
                                  0 0 0 0 0 0 0 0] = 7,
A Return to Similarity
Recall that two matrices A, B ∈ Mn (C) are called similar if there exists an
invertible matrix P ∈ Mn (C) such that A = PBP−1 . Two matrices are similar if
and only if there is a common linear transformation T between n-dimensional
vector spaces such that A and B are both standard matrices of T (with respect to
different bases). Many tools from introductory linear algebra can help determine
whether or not two matrices are similar in certain special cases (e.g., if two
matrices are similar then they have the same characteristic polynomial, and thus
the same eigenvalues, trace, and determinant), but a complete characterization
of similarity relies on the Jordan canonical form:
Proof. The equivalence of conditions (b) and (c) follows immediately from
Theorem 2.4.2, which tells us that we can determine the orders of a matrix’s
Jordan blocks from its geometric multiplicities, and vice-versa. We thus just
focus on demonstrating that conditions (a) and (b) are equivalent.
To see that (a) implies (b), suppose A and B are similar so that we can write
A = QBQ−1 for some invertible Q. If B = PJP−1 is a Jordan decomposition
of B (i.e., the Jordan canonical form of B is J) then A = Q(PJP−1 )Q−1 =
(QP)J(QP)−1 is a Jordan decomposition of A with the same Jordan canonical
form J.
(If A = PJP−1 is a Jordan decomposition, we can permute the columns of P to put the diagonal blocks of J in any order we like.) For the reverse implication, suppose that A and B have identical Jordan canonical forms. That is, we can find invertible P and Q such that the Jordan decompositions of A and B are A = PJP−1 and B = QJQ−1, where J is their (shared) Jordan canonical form. Rearranging these equations gives J = P−1AP and J = Q−1BQ, so P−1AP = Q−1BQ. Rearranging one more time then gives A = P(Q−1BQ)P−1 = (PQ−1)B(PQ−1)−1, so A and B are similar.
Example 2.4.4 (Checking Similarity). Determine whether or not the following matrices are similar:

A = [1 1 0 0; 0 1 0 0; 0 0 1 1; 0 0 0 1]   and   B = [1 0 0 0; 0 1 1 0; 0 0 1 1; 0 0 0 1].

Solution:
These matrices are already in Jordan canonical form. (This example is tricky if we do not use Theorem 2.4.3, since A and B have the same rank and characteristic polynomial.) In particular, the Jordan blocks of A are

[1 1; 0 1]   and   [1 1; 0 1],

whereas the Jordan blocks of B are

[1]   and   [1 1 0; 0 1 1; 0 0 1].

Since they have different Jordan blocks, A and B are not similar.
Example 2.4.5 (Checking Similarity (Again)). Determine whether or not the following matrices are similar:

A = [−6 0 −2 1; 0 −3 −2 1; 3 0 1 1; −3 0 2 2]   and   B = [−3 0 3 0; −2 2 −5 2; 2 1 −4 −2; −2 2 1 −1].

Solution:
We already saw in Example 2.4.2 that A has Jordan canonical form

J = [3 0 0 0; 0 −3 1 0; 0 0 −3 1; 0 0 0 −3].

It follows that, to see whether or not A and B are similar, we need to check whether or not B has the same Jordan canonical form J.
To this end, we note that B also has eigenvalues (listed according to algebraic multiplicity) 3, −3, −3, and −3. Since λ = 3 has algebraic multiplicity 1, we know that J1(3) = [3] is one of the Jordan blocks in the Jordan canonical form of B. Similarly, since λ = −3 has algebraic multiplicity 3, we know that the orders of the Jordan blocks for λ = −3 sum to 3.
It is straightforward to show that the eigenvalue λ = −3 has geometric multiplicity γ1 = 1 (its corresponding eigenvectors are all scalar multiples of (1, 0, 0, 1)), so it must have just one corresponding Jordan block. That is, the Jordan canonical form of B is indeed the matrix J above, so A and B are similar.
Example 2.4.6 (Checking Similarity (Yet Again)). Determine whether or not the following matrices are similar:

A = [1 2 3 4; 2 1 0 3; 3 0 1 2; 4 3 2 1]   and   B = [1 0 0 4; 0 2 3 0; 0 3 2 0; 4 0 0 1].

Solution:
These matrices do not even have the same trace (tr(A) = 4, but tr(B) = 6), so they cannot possibly be similar. We could have also computed their Jordan canonical forms to show that they are not similar, but it saves a lot of work to do the easier checks based on the trace, determinant, eigenvalues, and rank first!
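A minimal symbolic sketch of these cheap similarity checks (assuming SymPy is available): similar matrices must share trace, determinant, and characteristic polynomial, so comparing those first can rule out similarity without computing Jordan canonical forms.

```python
# Quick necessary conditions for similarity: compare trace and determinant.
from sympy import Matrix

A = Matrix([[1, 2, 3, 4], [2, 1, 0, 3], [3, 0, 1, 2], [4, 3, 2, 1]])
B = Matrix([[1, 0, 0, 4], [0, 2, 3, 0], [0, 3, 2, 0], [4, 0, 0, 1]])

print(A.trace(), B.trace())   # 4 and 6: different traces, so A and B are not similar
print(A.det(), B.det())       # determinants (another invariant worth checking first)
```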
decomposition—a fact that we have been taking for granted up until now.
In order to get an idea of how we might compute a matrix's Jordan decomposition (or even just convince ourselves that it exists), suppose for a moment that the Jordan canonical form of a matrix A ∈ Mk(C) has just a single Jordan block. That is, suppose we could write A = PJk(λ)P−1 for some invertible P ∈ Mk(C) and some scalar λ ∈ C (necessarily equal to the one and only eigenvalue of A). Our usual block matrix multiplication techniques show that if v^(j) is the j-th column of P then

Av^(1) = PJk(λ)P−1v^(1) = P [λ 1 ··· 0; 0 λ ··· 0; ⋮ ⋮ ⋱ ⋮; 0 0 ··· λ] e1 = λPe1 = λv^(1),

and

Av^(j) = PJk(λ)P−1v^(j) = PJk(λ)ej = P(λej + ej−1) = λv^(j) + v^(j−1)

for all 2 ≤ j ≤ k. (Recall that Pej = v^(j), and multiplying both sides on the left by P−1 shows that P−1v^(j) = ej. We use superscripts for v^(j) here since we will use subscripts for something else shortly.)
Phrased slightly differently, v^(1) is an eigenvector of A with eigenvalue λ (i.e., (A − λI)v^(1) = 0) and v^(2), v^(3), . . . , v^(k) are vectors that form a "chain" with the property that (A − λI)v^(j) = v^(j−1) for all 2 ≤ j ≤ k.
We thus guess that vectors that form a “chain” leading down to an eigenvec-
tor are the key to constructing a matrix’s Jordan decomposition, which leads
naturally to the following definition:
Theorem 2.4.4 (Multiplication by a Single Jordan Chain). Suppose v^(1), v^(2), . . . , v^(k) is a Jordan chain corresponding to an eigenvalue λ of A ∈ Mn(C). If we define

P = [v^(1) | v^(2) | · · · | v^(k)] ∈ Mn,k(C)

then AP = PJk(λ).
Solution:
The eigenvalues of this matrix are (listed according to their algebraic multiplicity) 2, 4, and 4. An eigenvector corresponding to λ = 2 is v1 = (0, 1, 1), and an eigenvector corresponding to λ = 4 is v2 = (1, 0, 1). (At this point, we know that the Jordan canonical form of A must be [2 0 0; 0 4 1; 0 0 4].)
However, the eigenspace corresponding to the eigenvalue λ = 4 is just 1-dimensional (i.e., its geometric multiplicity is γ1 = 1), so it is not possible to find a linearly independent set of two eigenvectors corresponding to λ = 4.
Instead, we must find a Jordan chain v2^(1), v2^(2) with the property that v2^(1) = v2 and (A − 4I)v2^(2) = v2^(1).
It follows that the third entry of v2^(2) is free, while its other two entries are not. If we choose the third entry to be 0 then we get v2^(2) = (1/2, 1/2, 0) as one possible solution.
Now that we have a set of 3 vectors v1, v2^(1), v2^(2), we can place them as columns in a 3 × 3 matrix, and that matrix should bring A into its Jordan canonical form:

P = [v1 | v2^(1) | v2^(2)] = [0 1 1/2; 1 0 1/2; 1 1 0].
The procedure carried out in the above example works as long as each
eigenspace is 1-dimensional—to find the Jordan decomposition of a matrix A,
we just find an eigenvector for each eigenvalue and then extend it to a Jordan
chain so as to fill up the columns of a square matrix P. Doing so results in P
being invertible (a fact which is not obvious, but we will prove shortly) and
P−1 AP being the Jordan canonical form of A. The following example illustrates
this procedure again with a longer Jordan chain.
Solution:
We noted in Example 2.4.2 that this matrix has eigenvalues 3 (with al-
gebraic multiplicity 1) and −3 (with algebraic multiplicity 3 and geometric
multiplicity 1). An eigenvector corresponding to λ = 3 is v1 = (0, 0, 1, 2),
and an eigenvector corresponding to λ = −3 is v2 = (0, 1, 0, 0).
To construct a Jordan decomposition of A, we do what we did in the previous example: we find a Jordan chain v2^(1), v2^(2), v2^(3) corresponding to λ = −3. We choose v2^(1) = v2 = (0, 1, 0, 0) to be the eigenvector that we already computed, and then we find v2^(2) by solving the linear system (A + 3I)v2^(2) = v2^(1) for v2^(2):

[−3 0 −2 1 | 0; 0 0 −2 1 | 1; 3 0 4 1 | 0; −3 0 2 5 | 0]   -- row reduce -->   [1 0 0 0 | 1/3; 0 0 1 0 | −1/3; 0 0 0 1 | 1/3; 0 0 0 0 | 0].

It follows that the second entry of v2^(2) is free, while its other entries are not. If we choose the second entry to be 0 then we get v2^(2) = (1/3, 0, −1/3, 1/3) as one possible solution.
We still need one more vector to complete this Jordan chain, so we just repeat this procedure: we solve the linear system (A + 3I)v2^(3) = v2^(2) for v2^(3):

[−3 0 −2 1 | 1/3; 0 0 −2 1 | 0; 3 0 4 1 | −1/3; −3 0 2 5 | 1/3]   -- row reduce -->   [1 0 0 0 | −1/9; 0 0 1 0 | 0; 0 0 0 1 | 0; 0 0 0 0 | 0].

(The same sequence of row operations can be used to find v2^(3) as was used to find v2^(2). This happens in general, since the only difference between these linear systems is the right-hand side.)
Similar to before, the second entry of v2^(3) is free, while its other entries are not. If we choose the second entry to be 0, then we get v2^(3) = (−1/9, 0, 0, 0) as one possible solution.
Now that we have a set of 4 vectors v1, v2^(1), v2^(2), v2^(3), we can place them as columns in a 4 × 4 matrix P = [v1 | v2^(1) | v2^(2) | v2^(3)], and that matrix should bring A into its Jordan canonical form:

P = [0 0 1/3 −1/9; 0 1 0 0; 1 0 −1/3 0; 2 0 1/3 0]   and   P−1 = [0 0 1/3 1/3; 0 1 0 0; 0 0 −2 1; −9 0 −6 3].

(Be careful—we might be tempted to normalize v2^(3) to (1, 0, 0, 0), but we cannot do this! We can choose the free variable (the second entry of v2^(3) in this case), but not the overall scaling in Jordan chains.)
Straightforward calculation then shows that

P−1AP = [3 0 0 0; 0 −3 1 0; 0 0 −3 1; 0 0 0 −3],

which is indeed the Jordan canonical form of A that we identified in Example 2.4.2.
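A symbolic check of this example (a sketch, assuming SymPy is available; SymPy's jordan_form returns a pair (P, J) with A = P·J·P⁻¹, and its choice of P and block ordering may differ from the one found by hand):

```python
# Verify the hand computation: P^{-1} A P equals the Jordan canonical form J,
# and SymPy's own jordan_form produces the same blocks.
from sympy import Matrix, Rational

A = Matrix([[-6, 0, -2, 1],
            [ 0,-3, -2, 1],
            [ 3, 0,  1, 1],
            [-3, 0,  2, 2]])

P = Matrix([[0, 0, Rational(1, 3), Rational(-1, 9)],
            [0, 1, 0, 0],
            [1, 0, Rational(-1, 3), 0],
            [2, 0, Rational(1, 3), 0]])

print(P.inv() * A * P)      # the Jordan canonical form J found above
print(A.jordan_form()[1])   # SymPy's J (same blocks, possibly reordered)
```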
Then B1 ∪ B2 ∪ · · · ∪ Bm is a basis of Cn.
(In this theorem and proof, unions like C1 ∪ · · · ∪ Cm are meant as multisets, so that if there were any vector in multiple Cj's then their union would necessarily be linearly dependent.)
with the second equality following from the fact that T1 − λ1I is an upper triangular µ1 × µ1 matrix with all diagonal entries equal to 0, so (T1 − λ1I)^µ1 = O by Exercise 2.4.16. Similarly, for each 2 ≤ j ≤ m the matrix (Tj − λ1I)^µ1 has non-zero diagonal entries and is thus invertible, so we see that the first µ1 columns of P form a basis of null((A − λ1I)^µ1). A similar argument shows that the next µ2 columns of P form a basis of null((A − λ2I)^µ2), and so on.
We have thus proved that, for each 1 ≤ j ≤ m, there exists a basis Cj of null((A − λjI)^µj) such that C1 ∪ C2 ∪ · · · ∪ Cm is a basis of Cn (in particular, |C1 ∪ C2 ∪ · · · ∪ Cm| = n). Since we also have |Bj| = |Cj| for all 1 ≤ j ≤ m, it must be the case that |B1 ∪ B2 ∪ · · · ∪ Bm| = |C1 ∪ C2 ∪ · · · ∪ Cm| = n, which implies that B1 ∪ B2 ∪ · · · ∪ Bm is indeed a basis of Cn.
Remark 2.4.1 (The Jordan Decomposition and Direct Sums). Another way of phrasing Theorem 2.4.5 is as saying that, for every matrix A ∈ Mn(C), we can write Cn as a direct sum (see Section 1.B) of its generalized eigenspaces:

Cn = null((A − λ1I)^µ1) ⊕ null((A − λ2I)^µ2) ⊕ · · · ⊕ null((A − λmI)^µm).
We are now in a position to rigorously prove that every matrix has a Jordan
decomposition. We emphasize that it really is worth reading through this proof
(even if we did not do so for the proof of Theorem 2.4.5), since it describes
an explicit procedure for actually constructing the Jordan decomposition in
general.
It is worth noting that the vectors that we add to the set Ck in Step 2 of the
above proof are the “tops” of the Jordan chains of order exactly k. Since each
Jordan chain corresponds to one Jordan block in the Jordan canonical form,
Theorem 2.4.2 tells us that we must add |Ck \Bk+1 | = 2γk − γk+1 − γk−1 such
vectors for all k ≥ 1, where γk is the corresponding geometric multiplicity of
order k.
While this procedure for constructing the Jordan decomposition likely
seems quite involved, we emphasize that things “usually” are simple enough
that we can just use our previous worked examples as a guide. For now though,
we work through a rather large example so as to illustrate the full algorithm in
a bit more generality.
Example 2.4.10 (A Hideous Jordan Decomposition). Construct a Jordan decomposition of the 8 × 8 matrix from Example 2.4.3.

Solution:
As we noted in Example 2.4.3, the only eigenvalue of A is λ = 1, which has algebraic multiplicity µ = 8 and geometric multiplicities γ1 = 4, γ2 = 7, and γk = 8 whenever k ≥ 3. In the language of the proof of Theorem 2.4.1, Step 1 thus tells us to set k = 8 and B9 = {}.
In Step 2, nothing happens when k = 8 since we add |C8\B9| = 2γ8 − γ9 − γ7 = 16 − 8 − 8 = 0 vectors. It follows that nothing happens in Step 3 as well, so B8 = C8 = B9 = {}. A similar argument shows that Bk = Bk+1 = {} for all k ≥ 4, so we proceed directly to the k = 3 case of Step 2.
(Just try to follow along with the steps of the algorithm presented here, without worrying about the nasty calculations needed to compute any particular vectors or matrices we present.)
k = 3: In Step 2, we start by setting C3 = B4 = {}, and then we need to add 2γ3 − γ4 − γ2 = 16 − 8 − 7 = 1 vector from

null((A − λI)³) \ null((A − λI)²) = C8 \ null((A − λI)²)

to C3. One vector that works is (0, 0, −1, 0, 0, −1, 1, 0), so we choose C3 to be the set containing this single vector, which we call v1^(3).
In Step 3, we just extend v1^(3) to a Jordan chain by computing

v1^(2) = (A − I)v1^(3) = (−1, 1, 0, 0, 1, 1, 0, −1) and
v1^(1) = (A − I)v1^(2) = (0, 0, −1, 1, 0, −1, 0, 0),
Proof. For each 1 ≤ n < k, let Nn ∈ Mk(C) denote the matrix with ones on its n-th superdiagonal and zeros elsewhere (i.e., [Nn]i,j = 1 if j − i = n and [Nn]i,j = 0 otherwise). Then Jk(λ) = λI + N1, and we show in Exercise 2.4.17 that these matrices satisfy N1^n = Nn for all 1 ≤ n < k, and N1^n = O when n ≥ k. (Recall that [Nn]i,j refers to the (i, j)-entry of Nn.)
We now prove the statement of this theorem when we choose to center the Taylor series from Definition 2.4.4 at a = λ. That is, we write f as a Taylor series centered at λ, so that

f(x) = ∑_{n=0}^∞ (f^(n)(λ)/n!)(x − λ)^n,   so   f(Jk(λ)) = ∑_{n=0}^∞ (f^(n)(λ)/n!)(Jk(λ) − λI)^n.
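The key point here, that the Taylor series centered at λ terminates after k terms when applied to Jk(λ), is easy to check numerically. A minimal sketch for f = exp (assuming NumPy and SciPy are available):

```python
# Since (J_k(lam) - lam*I)^n = N_1^n vanishes for n >= k, the Taylor series for
# f(J_k(lam)) centered at lam has only k non-zero terms.  Here f = exp, k = 3.
import numpy as np
from scipy.linalg import expm
from math import factorial

k, lam = 3, 0.7
J = lam * np.eye(k) + np.diag(np.ones(k - 1), 1)    # J_k(lam)
N1 = J - lam * np.eye(k)

fJ = sum(np.exp(lam) / factorial(n) * np.linalg.matrix_power(N1, n)
         for n in range(k))                          # f^(n)(lam) = e^lam for f = exp
print(np.allclose(fJ, expm(J)))                      # True
```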
Theorem 2.4.7 (Matrix Functions via the Jordan Decomposition). Suppose A ∈ Mn(C) has Jordan decomposition as in Theorem 2.4.1, and f : C → C is analytic on some open disc containing all of the eigenvalues of A. Then

f(A) = P [f(Jk1(λ1)) O ··· O; O f(Jk2(λ2)) ··· O; ⋮ ⋮ ⋱ ⋮; O O ··· f(Jkm(λm))] P−1.
Proof. We just exploit the fact that matrix powers, and thus Taylor series, behave very well with block diagonal matrices and the Jordan decomposition. For any a in the open disc on which f is analytic, we have

f(A) = ∑_{j=0}^∞ (f^(j)(a)/j!)(A − aI)^j
     = P ( ∑_{j=0}^∞ (f^(j)(a)/j!) [(Jk1(λ1) − aI)^j ··· O; ⋮ ⋱ ⋮; O ··· (Jkm(λm) − aI)^j] ) P−1
     = P [f(Jk1(λ1)) ··· O; ⋮ ⋱ ⋮; O ··· f(Jkm(λm))] P−1,

as claimed. (In the second equality, we use the fact that (PJP−1)^j = PJ^jP−1 for all j.)
By combining the previous two theorems, we can explicitly apply any
analytic function f to any matrix by first constructing that matrix’s Jordan
decomposition, applying f to each of the Jordan blocks in its Jordan canonical
form via the formula of Theorem 2.4.6, and then stitching those computations
together via Theorem 2.4.7.
Example 2.4.11 (A Matrix Exponential via the Jordan Decomposition). Compute e^A if A = [5 1 −1; 1 3 −1; 2 0 2].

Solution:
We computed the following Jordan decomposition A = PJP−1 of this matrix in Example 2.4.7:

P = (1/2)[0 2 1; 2 0 1; 2 2 0],   J = [2 0 0; 0 4 1; 0 0 4],   P−1 = (1/2)[−1 1 1; 1 −1 1; 2 2 −2].

Applying the function f(x) = e^x to the two Jordan blocks in J via the formula of Theorem 2.4.6 gives

f([2]) = e²   and   f([4 1; 0 4]) = [f(4) f′(4); 0 f(4)] = [e⁴ e⁴; 0 e⁴].

(Here we use the fact that f′(x) = e^x too.)
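The computation can then be finished by assembling e^A = P f(J) P−1 from these pieces. A numerical sketch (assuming NumPy and SciPy are available; scipy.linalg.expm computes the matrix exponential directly, which gives an independent check):

```python
# Assemble e^A = P f(J) P^{-1} block-by-block and compare with SciPy's expm.
import numpy as np
from scipy.linalg import expm

A = np.array([[5.0, 1.0, -1.0],
              [1.0, 3.0, -1.0],
              [2.0, 0.0,  2.0]])
P = 0.5 * np.array([[0.0, 2.0, 1.0],
                    [2.0, 0.0, 1.0],
                    [2.0, 2.0, 0.0]])

e2, e4 = np.exp(2.0), np.exp(4.0)
fJ = np.array([[e2, 0.0, 0.0],          # f(J_1(2)) = [e^2]
               [0.0, e4, e4],           # f(J_2(4)) = [[e^4, e^4], [0, e^4]]
               [0.0, 0.0, e4]])

eA = P @ fJ @ np.linalg.inv(P)
print(np.allclose(eA, expm(A)))         # True
```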
When we define matrix functions in this way, they interact with matrix multiplication how we would expect them to. For example, the principal square root function f(x) = √x is analytic everywhere except on the set of non-positive real numbers, so Theorems 2.4.6 and 2.4.7 tell us how to compute the principal square root √A of any matrix A ∈ Mn(C) whose eigenvalues can be placed in an open disc avoiding that strip in the complex plane. As we would hope, this matrix satisfies (√A)² = A, and if A is positive definite then this method produces exactly the positive definite principal square root √A described by Theorem 2.2.11.
Example 2.4.12 (The Principal Square Root of a Non-Positive-Semidefinite Matrix). Compute √A if A = [3 2 3; 1 2 −3; −1 −1 4].

Solution:
We can use the techniques of this section to see that A has Jordan decomposition A = PJP−1, where

P = [−2 −1 2; 2 1 −1; 0 −1 1],   J = [1 0 0; 0 4 1; 0 0 4],   P−1 = (1/2)[0 1 1; 2 2 −2; 2 2 0].

(Try computing a Jordan decomposition of A on your own.)
To apply the function f(x) = √x to the two Jordan blocks in J via the formula of Theorem 2.4.6, we note that f′(x) = 1/(2√x), so that

f([1]) = 1   and   f([4 1; 0 4]) = [f(4) f′(4); 0 f(4)] = [2 1/4; 0 2].
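A numerical sketch of assembling √A = P f(J) P−1 from these pieces and checking that it squares back to A (assuming NumPy is available; A is rebuilt from P and J so that the check is self-contained):

```python
# Build sqrt(A) = P f(J) P^{-1} with f applied block-by-block, then verify
# that it squares back to A = P J P^{-1}.
import numpy as np

P = np.array([[-2.0, -1.0,  2.0],
              [ 2.0,  1.0, -1.0],
              [ 0.0, -1.0,  1.0]])
J = np.array([[1.0, 0.0, 0.0],
              [0.0, 4.0, 1.0],
              [0.0, 0.0, 4.0]])
Pinv = np.linalg.inv(P)

A = P @ J @ Pinv
fJ = np.array([[1.0, 0.0, 0.0],
               [0.0, 2.0, 0.25],        # f(J_2(4)) = [[2, 1/4], [0, 2]]
               [0.0, 0.0, 2.0]])
sqrtA = P @ fJ @ Pinv

print(np.allclose(sqrtA @ sqrtA, A))    # True
```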
Remark 2.4.2 (Is Analyticity Needed?). If a function f is not analytic on any disc containing a certain scalar λ ∈ C then the results of this section say nothing about how to compute f(A) for matrices with λ as an eigenvalue. Indeed, it may be the case that there is no sensible way to even define f(A) for these matrices, as the sum in Definition 2.4.4 may not converge. For example, if f(x) = √x is the principal square root function, then f is not analytic at x = 0 (despite being defined there), so it is not obvious how to compute (or even define) √A if A has 0 as an eigenvalue.
We can sometimes get around this problem by just applying the formulas of Theorems 2.4.6 and 2.4.7 anyway, as long as the quantities used in those formulas all exist. For example, we could say that if

A = [2i 0; 0 0]   then   √A = [√(2i) 0; 0 0] = [1+i 0; 0 0],

even though A has 0 as an eigenvalue. On the other hand, this method says that if

B = [0 1; 0 0]   then   √B = [√0 1/(2√0); 0 √0],

which makes no sense (notice the division by 0 in the top-right corner of √B, corresponding to the fact that the square root function is not differentiable at 0). (We are again using the fact that if f(x) = √x then f′(x) = 1/(2√x) here.) Indeed, this matrix B does not have a square root at all, nor does any matrix with a Jordan block of size 2 × 2 or larger corresponding to the eigenvalue 0.
Similarly, Definition 2.4.4 does not tell us what √C means if

C = [i 0; 0 −i],

even though the principal square root function f(x) = √x is analytic on open discs containing the eigenvalues i and −i. The problem here is that there is no common disc D containing i and −i on which f is analytic, since any such disc must contain 0 as well. In practice, we just ignore this problem and apply the formulas of Theorems 2.4.6 and 2.4.7 anyway to get

√C = [√i 0; 0 √(−i)] = (1/√2)[1+i 0; 0 1−i].
Solution:
We can use the techniques of this section to see that A has Jordan decomposition A = PJP−1, where

P = [2 2 0; −1 0 1; 0 1 0],   J = [1/2 1 0; 0 1/2 1; 0 0 1/2],   P−1 = (1/2)[1 0 −2; 0 0 2; 1 2 −2].
2.4.4 Compute the indicated matrix.
∗(a) √A, where A = [2 0; 1 2].
(b) e^A, where A = [1 −1; 1 −1].
(c) sin(A), where A = [1 2 −1; 3 4 −2].
(d) cos(A), where A = [1 2 −1; 3 4 −2].
∗(e) e^A, where A = [1 −2 −1; 0 3 1; 0 −4 −1].
(f) √A, where A = [1 −2 −1; 0 3 1; 0 −4 −1].

2.4.5 Determine which of the following statements are true and which are false.

∗2.4.11 Show that e^A is invertible for all A ∈ Mn(C).

2.4.12 Suppose A, B ∈ Mn(C).
(a) Show that if AB = BA then e^{A+B} = e^A e^B. [Hint: It is probably easier to use the definition of e^{A+B} rather than the Jordan decomposition. You may use the fact that you can rearrange infinite sums arising from analytic functions just like finite sums.]
(b) Provide an example to show that e^{A+B} may not equal e^A e^B if AB ≠ BA.

2.4.13 Show that a matrix A ∈ Mn(C) is invertible if and only if it can be written in the form A = Ue^X, where U ∈ Mn(C) is unitary and X ∈ Mn(C) is Hermitian.

∗∗2.4.14 Show that sin²(A) + cos²(A) = I for all A ∈ Mn(C).
2.4.15 Show that every matrix A ∈ M_n(C) can be written in the form A = D + N, where D ∈ M_n(C) is diagonalizable, N ∈ M_n(C) is nilpotent (i.e., N^k = O for some integer k ≥ 1), and DN = ND.
[Side note: This is called the Jordan–Chevalley decomposition of A.]

∗∗ 2.4.16 Suppose A ∈ M_n is strictly upper triangular (i.e., it is upper triangular with diagonal entries equal to 0).
(a) Show that, for each 1 ≤ k ≤ n, the first k superdiagonals of A^k consist entirely of zeros. That is, show that [A^k]_{i,j} = 0 whenever j − i < k.
(b) Show that A^n = O.

∗∗ 2.4.17 For each 1 ≤ n < k, let N_n ∈ M_k denote the matrix with ones on its n-th superdiagonal and zeros elsewhere (i.e., [N_n]_{i,j} = 1 if j − i = n and [N_n]_{i,j} = 0 otherwise).
(a) Show that N_1^n = N_n for all 1 ≤ n < k, and N_1^n = O when n ≥ k.
(b) Show that nullity(N_n) = min{k, n}.
In this chapter, we learned about several new matrix decompositions, and how
they fit in with and generalize the matrix decompositions that we already knew
about, like diagonalization. For example, we learned about a generalization of
diagonalization that applies to all matrices, called the Jordan decomposition
(Theorem 2.4.1), and we learned about a special case of diagonalization called
the spectral decomposition (Theorems 2.1.4 and 2.1.6) that applies to normal
or symmetric matrices (depending on whether the field is C or R, respectively).
See Figure 2.10 for a reminder of which decompositions from this chapter are
special cases of each other.
[Figure 2.10 diagram omitted; among the decompositions shown are Diagonalization (Theorem 2.0.1) and the Spectral decomposition (Theorems 2.1.4 and 2.1.6).]

Figure 2.10: Some matrix decompositions from this chapter. The decompositions in the top row apply to any square complex matrix (and the singular value decomposition even applies to rectangular matrices). Black lines between two decompositions indicate that the lower decomposition is a special case of the one above it that applies to a smaller set of matrices.
One common way of thinking about some matrix decompositions is as providing us with a canonical form for matrices that is (a) unique, and (b) captures all of the “important” information about that matrix (where the exact meaning of “important” depends on what we want to do with the matrix or what it represents). For example,

[We call a property of a matrix A “basis independent” if a change of basis does not affect it (i.e., A and PAP^{-1} share that property for all invertible P).]

• If we are thinking of A ∈ M_{m,n} as representing a linear system, the relevant canonical form is its reduced row echelon form (RREF), which is unique and contains all information about the solutions of that linear system.
• If we are thinking of A ∈ M_n as representing a linear transformation and are only interested in basis-independent properties of it (e.g., its rank), the relevant canonical form is its Jordan decomposition, which is
Table 2.2: A summary of the matrix decompositions that answer the question of
how simple a matrix can be made upon multiplying it by an invertible matrix on
the left and/or right. These decompositions are all canonical and apply to every
matrix (with the understanding that the matrix must be square for similarity to make
sense).
We can similarly ask how simple we can make a matrix upon multiplying it on the left and/or right by unitary matrices, but it turns out that the answers to this question are not so straightforward. For example, Schur triangularization (Theorem 2.1.1) tells us that by applying a unitary similarity to a matrix A ∈ M_n(C), we can make it upper triangular (i.e., we can find a unitary matrix U ∈ M_n(C) such that A = UTU^* for some upper triangular matrix T ∈ M_n(C)). However, this upper triangular form is not canonical, since most matrices have numerous different Schur triangularizations that look nothing like one another.

[Since Schur triangularization is not canonical, we cannot use it to answer the question of, given two matrices A, B ∈ M_n(C), whether or not there exists a unitary matrix U ∈ M_n(C) such that A = UBU^*.]

A related phenomenon happens when we consider one-sided multiplication by a unitary matrix. There are two decompositions that give completely different answers for what type of matrix can be reached in this way—the polar decomposition (Theorem 2.2.12) says that every matrix can be made positive semidefinite, whereas the QR decomposition (Theorem 1.C.1) says that every matrix can be made upper triangular. Furthermore, both of these forms are “not quite canonical”—they are unique as long as the original matrix is invertible, but they are not unique otherwise.

Of the decompositions that consider multiplication by unitary matrices, only the singular value decomposition (Theorem 2.3.1) is canonical. We summarize these observations in Table 2.3 for easy reference.
Table 2.3: A summary of the matrix decompositions that answer the question of how simple a matrix can be made upon multiplying it by unitary matrices on the left and/or right. These decompositions all apply to every matrix A (with the understanding that A must be square for unitary similarity to make sense).

We also learned about the special role that the set of normal matrices, as well as its subsets of unitary, Hermitian, skew-Hermitian, and positive semidefinite matrices, plays in the realm of matrix decompositions. For example, the spectral decomposition tells us that normal matrices are exactly the matrices that can not only be diagonalized, but can be diagonalized via a unitary matrix (rather than just an invertible matrix).
While we have already given characterization theorems for unitary matrices (Theorem 1.4.9) and positive semidefinite matrices (Theorem 2.2.1) that provide numerous different equivalent conditions that could be used to define them, we have not yet done so for normal matrices. For ease of reference, we provide such a characterization here, though we note that most of these properties were already proved earlier in various exercises.
Theorem 2.5.1 (Characterization of Normal Matrices)
Suppose A ∈ M_n(C) has eigenvalues λ_1, ..., λ_n. The following are equivalent:
a) A is normal,
b) A = UDU^* for some unitary U ∈ M_n(C) and diagonal D ∈ M_n(C),
c) there is an orthonormal basis of C^n consisting of eigenvectors of A,
d) ‖A‖_F² = |λ_1|² + ··· + |λ_n|²,
e) the singular values of A are |λ_1|, ..., |λ_n|,
f) (Av) · (Aw) = (A^*v) · (A^*w) for all v, w ∈ C^n, and
g) ‖Av‖ = ‖A^*v‖ for all v ∈ C^n.

[Compare statements (f) and (g) of this theorem to the characterizations of unitary matrices given in Theorem 1.4.9.]

Proof. We have already discussed the equivalence of (a), (b), and (c) extensively, as (b) and (c) are just different statements of the spectral decomposition. The equivalence of (a) and (d) was proved in Exercise 2.1.12, the fact that (a) implies (e) was proved in Theorem 2.3.4, and its converse in Exercise 2.3.20. Finally, the fact that (a), (f), and (g) are equivalent follows from taking B = A^* in Exercise 1.4.19.
It is worth noting, however, that there are even more equivalent charac-
terizations of normality, though they are somewhat less important than those
discussed above. See Exercises 2.1.14, 2.2.14, and 2.5.4, for example.
2.5.1 For each of the following matrices, say which of the following matrix decompositions can be applied to it: (i) diagonalization (i.e., A = PDP^{-1} with P invertible and D diagonal), (ii) Schur triangularization, (iii) spectral decomposition, (iv) singular value decomposition, and/or (v) Jordan decomposition.
∗ (a) [1 1; −1 1]   (b) [0 −i; i 0]
∗ (c) [1 0; 1 1]   (d) [1 2; 3 4]
∗ (e) [1 0; 1 1.0001]   (f) [1 2 3; 4 5 6]
∗ (g) [0 0; 0 0]   (h) The 75 × 75 matrix with every entry equal to 1.

2.5.2 Determine which of the following statements are true and which are false.
∗ (a) Any two eigenvectors that come from different eigenspaces of a normal matrix must be orthogonal.
(b) Any two distinct eigenvectors of a Hermitian matrix must be orthogonal.
∗ (c) If A ∈ M_n has polar decomposition A = UP then P is the principal square root of A^*A.

2.5.3 Suppose A ∈ M_n(C) is normal and B ∈ M_n(C) commutes with A (i.e., AB = BA).
(a) Show that A^* and B commute as well.
[Hint: Compute ‖A^*B − BA^*‖_F².]
[Side note: This result is called Fuglede's theorem.]
(b) Show that if B is also normal then so is AB.

∗∗ 2.5.4 Suppose A ∈ M_n(C) has eigenvalues λ_1, ..., λ_k (listed according to geometric multiplicity) with corresponding eigenvectors v_1, ..., v_k, respectively, that form a linearly independent set. Show that A is normal if and only if A^* has eigenvalues \overline{λ_1}, ..., \overline{λ_k} with corresponding eigenvectors v_1, ..., v_k, respectively.
[Hint: One direction is much harder than the other. For the difficult direction, be careful not to assume that k = n without actually proving it—Schur triangularization might help.]

2.5.5 Two Hermitian matrices A, B ∈ M_n^H are called ∗-congruent if there exists an invertible matrix S ∈ M_n(C) such that A = SBS^*.
(a) Show that for every Hermitian matrix A ∈ M_n^H there exist non-negative integers p and n such that A is ∗-congruent to
\begin{bmatrix} I_p & O & O \\ O & -I_n & O \\ O & O & O \end{bmatrix}.
[Hint: This follows quickly from a decomposition that we learned about in this chapter.]
(b) Show that every Hermitian matrix is ∗-congruent to exactly one matrix of the form described by part (a) (i.e., p and n are determined by A).
[Hint: Exercise 2.2.17 might help here.]
[Side note: This shows that the form described by part (a) is a canonical form for ∗-congruence, a fact that is called Sylvester's law of inertia.]

2.5.6 Suppose A, B ∈ M_n(C) are diagonalizable and A has distinct eigenvalues.
(a) Show that A and B commute if and only if B ∈ span{I, A, A², A³, ...}.
[Hint: Use Exercise 2.1.29 and think about interpolating polynomials.]
(b) Provide an example to show that if A does not have distinct eigenvalues then it may be the case that A and B commute even though B ∉ span{I, A, A², A³, ...}.

∗ 2.5.7 Suppose that A ∈ M_n is a positive definite matrix for which each entry is either 0 or 1. In this exercise, we show that the only such matrix is A = I.
(a) Show that tr(A) ≤ n.
(b) Show that det(A) ≥ 1.
[Hint: First show that det(A) is an integer.]
(c) Show that det(A) ≤ 1.
[Hint: Use the AM–GM inequality (Theorem A.5.3) with the eigenvalues of A.]
(d) Use parts (a), (b), and (c) to show that every eigenvalue of A equals 1 and thus A = I.
Simply matching up this form with the coefficients of q shows that we can choose

A = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 3 & 0 \\ 1 & 0 & 3 \end{bmatrix} \quad \text{so that} \quad q(v) = v^T A v.

[The fact that the (2,3)-entry of A equals 0 is a result of the fact that q has no “yz” term.]

It follows that q is a quadratic form, since q(v) = f(v, v), where f(v, w) = v^T A w is a bilinear form.

In fact, the converse holds as well—not only are polynomials with degree-2 terms quadratic forms, but every quadratic form on R^n can be written as a polynomial with degree-2 terms. We now state and prove this observation.
Proof. We have already discussed why conditions (b) and (d) are equiva-
lent, and the equivalence of (c) and (a) follows immediately from applying
Theorem 1.3.5 to the bilinear form f associated with q (i.e., the bilinear form with the property that q(v) = f(v, v) for all v ∈ R^n). Furthermore, the fact that (d) implies (c) is trivial, so the only remaining implication to prove is that (c) implies (d).

To this end, simply notice that if q(v) = v^T A v for all v ∈ R^n then

v^T(A + A^T)v = v^T A v + v^T A^T v = q(v) + (v^T A v)^T = q(v) + v^T A v = 2q(v).

[v^T A v is a scalar and thus equals its own transpose.]

We can thus replace A by the symmetric matrix (A + A^T)/2 without changing the quadratic form q.

For the “furthermore” claim, suppose A ∈ M_n^S is symmetric and define a linear transformation T by T(A) = q, where q(v) = v^T A v. The equivalence of conditions (a) and (d) shows that q is a quadratic form, and every quadratic form is in the range of T. To see that T is invertible (and thus an isomorphism), we just need to show that T(A) = 0 implies A = O. In other words, we need to show that if q(v) = v^T A v = 0 for all v ∈ R^n then A = O. This fact follows immediately from Exercise 1.4.28(b), so we are done.
[Recall from Example 1.B.3 that the matrix (A + A^T)/2 from the above proof is the symmetric part of A.]

Since the above theorem tells us that every quadratic form can be written in terms of a symmetric matrix, we can use any tools that we know of for manipulating symmetric matrices to help us better understand quadratic forms. For example, the quadratic form q(x, y) = (3/2)x² + xy + (3/2)y² can be written in the form q(v) = v^T A v, where

v = \begin{bmatrix} x \\ y \end{bmatrix} \quad \text{and} \quad A = \frac{1}{2}\begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}.
Proof. We know from Theorem 2.A.1 that there exists a symmetric matrix A ∈ M_n(R) such that q(v) = v^T A v for all v ∈ R^n. Applying the real spectral decomposition (Theorem 2.1.6) to A gives us a unitary matrix U ∈ M_n(R) and a diagonal matrix D ∈ M_n(R) such that A = UDU^T, so q(v) = (U^T v)^T D (U^T v), as before.
If we write U = [u_1 | u_2 | ··· | u_n] and let λ_1, λ_2, ..., λ_n denote the eigenvalues of A, listed in the order in which they appear on the diagonal of D, then

q(v) = (U^T v)^T D (U^T v)
     = \begin{bmatrix} u_1 \cdot v & u_2 \cdot v & \cdots & u_n \cdot v \end{bmatrix}
       \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}
       \begin{bmatrix} u_1 \cdot v \\ u_2 \cdot v \\ \vdots \\ u_n \cdot v \end{bmatrix}
     = \lambda_1 (u_1 \cdot v)^2 + \lambda_2 (u_2 \cdot v)^2 + \cdots + \lambda_n (u_n \cdot v)^2.

[Here we use the fact that u_j^T v = u_j · v for all 1 ≤ j ≤ n.]
Positive Eigenvalues

If the symmetric matrix A ∈ M_n^S(R) corresponding to a quadratic form q(v) = v^T A v is positive semidefinite then, by definition, we must have q(v) ≥ 0 for all v. For this reason, we say in this case that q itself is positive semidefinite (PSD), and we say that it is furthermore positive definite (PD) if q(v) > 0 whenever v ≠ 0.

All of our results about positive semidefinite matrices carry over straightforwardly to the setting of PSD quadratic forms. In particular, the following fact follows immediately from Theorem 2.2.1:
[Recall that the scalars in Corollary 2.A.2 are the eigenvalues of A.]

! A quadratic form q : R^n → R is positive semidefinite if and only if each of the scalars in Corollary 2.A.2 is non-negative. It is positive definite if and only if those scalars are all strictly positive.
[A level set of q is the set of solutions to q(v) = c, where c is a given scalar. They are horizontal slices of q's graph.]

If q is positive definite then it has ellipses (or ellipsoids, or hyperellipsoids, depending on the dimension) as its level sets. The principal radii of the level set q(v) = 1 are equal to 1/√λ_1, 1/√λ_2, ..., 1/√λ_n, where λ_1, λ_2, ..., λ_n are
Example 2.A.2 (A Positive Definite Quadratic Form)
Plot the level sets of the quadratic form q(x, y) = (3/2)x² + xy + (3/2)y² and then graph it.

Solution:
We already diagonalized this quadratic form back in Equation (2.A.1). In particular, if we let v = (x, y) then

q(v) = 2(u_1 · v)² + (u_2 · v)², \quad \text{where} \quad u_1 = (1, 1)/√2, \quad u_2 = (1, −1)/√2.

It follows that the level sets of q are ellipses rotated so that their principal axes point in the directions of u_1 = (1, 1)/√2 and u_2 = (1, −1)/√2, and the level set q(v) = 1 has corresponding principal radii equal to 1/√2 and 1, respectively. These level sets, as well as the resulting graph of q, are displayed below:
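A quick numerical version of this example (a sketch using NumPy, which is not part of the text): diagonalizing the symmetric matrix of q recovers the eigenvalues 1 and 2, and hence the principal radii 1 and 1/√2 of the level set q(v) = 1.

```python
import numpy as np

# Symmetric matrix of q(x, y) = (3/2)x^2 + xy + (3/2)y^2 (see Equation (2.A.1)).
A = np.array([[1.5, 0.5],
              [0.5, 1.5]])
evals, evecs = np.linalg.eigh(A)   # eigenvalues in increasing order, orthonormal eigenvectors
print(evals)                       # [1. 2.]
print(1 / np.sqrt(evals))          # principal radii of the level set q(v) = 1
```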
If a quadratic form acts on R³ instead of R² then its level sets are ellipsoids (rather than ellipses), and its graph is a “hyperparaboloid” living in R⁴ that is a bit difficult to visualize. For example, applying the spectral decomposition to the quadratic form

It follows that the level sets of q are ellipsoids with principal axes pointing in the directions of u_1, u_2, and u_3. Furthermore, the corresponding principal radii for the level set q(v) = 1 are 1/2, 1/2, and 1/√2, respectively, as displayed in Figure 2.11(a).
Negative Eigenvalues

[A symmetric matrix with all-negative eigenvalues is called negative definite.]

If all of the eigenvalues of a symmetric matrix are strictly negative then the level sets of the associated quadratic form are still ellipses (or ellipsoids, or hyperellipsoids), but its graph is instead a (hyper)paraboloid that opens down (not up). These observations follow simply from noticing that if q has negative eigenvalues then −q (which has the same level sets as q) has positive eigenvalues and is thus positive definite.
Some Zero Eigenvalues

[The shapes that arise in this section are the conic sections and their higher-dimensional counterparts.]

If a quadratic form is just positive semidefinite (i.e., one or more of the eigenvalues of the associated symmetric matrix equal zero) then its level sets are degenerate—they look like lower-dimensional ellipsoids that are stretched into higher-dimensional space. For example, the quadratic form

does not actually depend on z at all, so its level sets extend arbitrarily far in the z direction, as in Figure 2.11(b).
[Figure 2.11 panels omitted; they show the level sets q(x, y, z) = 1 and q(x, y, z) = 2 of the two quadratic forms above.]

[These elliptical cylinders can be thought of as ellipsoids with one of their radii equal to ∞.]

Figure 2.11: Level sets of positive (semi)definite quadratic forms are (hyper)ellipsoids.
Example 2.A.3 (A Positive Semidefinite Quadratic Form)
Plot the level sets of the quadratic form q(x, y) = x² − 2xy + y² and then graph it.

Solution:
This quadratic form is simple enough that we can simply eyeball a diagonalization of it:

which has eigenvalues 2 and 0. The level sets of q are thus degenerate ellipses (which are just pairs of lines, since “ellipses” in R¹ are just pairs of points). Furthermore, the graph of this quadratic form is a degenerate paraboloid (i.e., a parabolic sheet), as shown below:

[We can think of these level sets (i.e., pairs of parallel lines) as ellipses with one of their principal radii equal to ∞.]

[Plot omitted; it shows the level sets q(x, y) = c for c = 0, 1, 2, 3, 4 and the graph z = q(x, y).]
Example 2.A.4 (An Indefinite Quadratic Form)
Plot the level sets of the quadratic form q(x, y) = xy and then graph it.

Solution:
Applying the spectral decomposition to the symmetric matrix associated with q reveals that it has eigenvalues λ_1 = 1/2 and λ_2 = −1/2, with corresponding eigenvectors (1, 1) and (1, −1), respectively. It follows that q can be diagonalized as

q(x, y) = \frac{1}{4}(x + y)^2 - \frac{1}{4}(x - y)^2.

Since there are terms being both added and subtracted, we conclude that the level sets of q are hyperbolas and its graph is a saddle, as shown below:

[The z = 0 level set is exactly the x- and y-axes, which separate the other two families of hyperbolic level sets.]

[Plot omitted; it shows the level sets q(x, y) = c for c = −4, ..., −1 and c = 1, ..., 4 together with the graph z = xy.]
Figure 2.12: The quadratic form q(x, y, z) = x2 + y2 − z2 is indefinite, so its level sets
are hyperboloids. The one exception is the level set q(x, y, z) = 0, which is a double
cone that serves as a boundary between the two types of hyperboloids that exist
(one-sheeted and two-sheeted).
(h) q(x, y, z) = 3x² + 3y² + 3z² − 2xy − 2xz − 2yz
∗ (i) q(x, y, z) = x² + 2y² + 2z² − 2xy + 2xz − 4yz
(j) q(w, x, y, z) = w² + 2x² − y² − 2xy + xz − 2wx + yz

2.A.2 Determine what type of object the graph of the given equation in R² is (e.g., ellipse, hyperbola, two lines, or maybe even nothing at all).
∗ (a) x² + 2y² = 1   (b) x² − 2y² = −4
∗ (c) x² + 2xy + 2y² = 1   (d) x² + 2xy + 2y² = −2
∗ (e) 2x² + 4xy + y² = 3   (f) 2x² + 4xy + y² = 0
∗ (g) 2x² + 4xy + y² = −1   (h) 2x² + xy + 3y² = 0

2.A.3 Determine what type of object the graph of the given equation in R³ is (e.g., ellipsoid, hyperboloid of one sheet, hyperboloid of two sheets, two touching cones, or maybe even nothing at all).
∗ (a) x² + 3y² + 2z² = 2
(b) x² + 3y² + 2z² = −3
∗ (c) 2x² + 2xy − 2xz + 2yz = 1
(d) 2x² + 2xy − 2xz + 2yz = −2
∗ (e) x² + y² + 2z² − 2xz − 2yz = 3
(f) x² + y² + 4z² − 2xy − 4xz = 2
∗ (g) 3x² + y² + 2z² − 2xy − 2xz − yz = 1
(h) 2x² + y² + 4z² − 2xy − 4xz = 0

2.A.4 Determine which of the following statements are true and which are false.
∗ (a) Quadratic forms are bilinear forms.
(b) The sum of two quadratic forms is a quadratic form.
∗ (c) A quadratic form q(v) = v^T A v is positive semidefinite if and only if A is positive semidefinite.
(d) The graph of a non-zero quadratic form q : R² → R is either an ellipse, a hyperbola, or two lines.
∗ (e) The function f : R → R defined by f(x) = x² is a quadratic form.

2.A.5 Determine which values of a ∈ R make the following quadratic form positive definite:
q(x, y, z) = x² + y² + z² − a(xy + xz + yz).
(d) 2x2 + 2xy − 2xz + 2yz = −2
One common technique for making large matrices easier to work with is to
break them up into 2 × 2 block matrices and then try to use properties of the
smaller blocks to determine corresponding properties of the large matrix. For
example, solving a large linear system Qx = b directly might be quite time-
consuming, but we can make it smaller and easier to solve by writing Q as a
2 × 2 block matrix, and similarly writing x and b as “block vectors”, as follows:
" # " #
A B x1 b
Q= , x= , and b = 1 .
C D x2 b2
If A is invertible then one way to solve this linear system is to subtract CA^{-1} times the first equation from the second equation, which puts it into the form

Ax_1 + Bx_2 = b_1
(D − CA^{-1}B)x_2 = b_2 − CA^{-1}b_1.

[This procedure is sometimes called “block Gaussian elimination”, and it really can be thought of as a block matrix version of the Gaussian elimination algorithm that we already know.]

This version of the linear system perhaps looks uglier at first glance, but it has the useful property of being block upper triangular: we can solve it by first solving for x_2 in the smaller linear system (D − CA^{-1}B)x_2 = b_2 − CA^{-1}b_1 and then solving for x_1 in the (also smaller) linear system Ax_1 + Bx_2 = b_1. By using this technique, we can solve a 2n × 2n linear system Qx = b entirely via matrices that are n × n.
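The procedure described above is easy to put into code. Here is a minimal Python sketch (with NumPy assumed, and assuming the block A is invertible); the helper name solve_block is ours, not the book's.

```python
import numpy as np

def solve_block(A, B, C, D, b1, b2):
    """Solve [[A, B], [C, D]] [x1; x2] = [b1; b2] by block Gaussian elimination.

    A sketch only: it assumes A is invertible and does no pivoting."""
    CAinv = C @ np.linalg.inv(A)
    S = D - CAinv @ B                          # Schur complement of A
    x2 = np.linalg.solve(S, b2 - CAinv @ b1)   # second (eliminated) equation
    x1 = np.linalg.solve(A, b1 - B @ x2)       # back-substitute into the first equation
    return x1, x2

rng = np.random.default_rng(0)
n = 3
A, B, C, D = (rng.standard_normal((n, n)) for _ in range(4))
b1, b2 = rng.standard_normal(n), rng.standard_normal(n)
x1, x2 = solve_block(A, B, C, D, b1, b2)
Q = np.block([[A, B], [C, D]])
print(np.allclose(Q @ np.concatenate([x1, x2]), np.concatenate([b1, b2])))   # True
```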
This type of reasoning can also be applied directly to the block matrix Q
(rather than the corresponding linear system), and doing so leads fairly quickly
to the following very useful block matrix decomposition:
[We do not require A and D to have the same size in this theorem (and thus B and C need not be square).]

Proof. We simply multiply together the block matrices on the right:

\begin{bmatrix} I & O \\ CA^{-1} & I \end{bmatrix}
\begin{bmatrix} A & O \\ O & D - CA^{-1}B \end{bmatrix}
\begin{bmatrix} I & A^{-1}B \\ O & I \end{bmatrix}
= \begin{bmatrix} I & O \\ CA^{-1} & I \end{bmatrix}
  \begin{bmatrix} A & A(A^{-1}B) \\ O & D - CA^{-1}B \end{bmatrix}
= \begin{bmatrix} A & B \\ (CA^{-1})A & (CA^{-1}B) + (D - CA^{-1}B) \end{bmatrix}
= \begin{bmatrix} A & B \\ C & D \end{bmatrix},

as claimed.
We will use the above decomposition throughout this section to come up with formulas for things like det(Q) and Q^{-1}, or find a way to determine positive semidefiniteness of Q, based only on corresponding properties of its blocks A, B, C, and D. (The matrix D − CA^{-1}B that appears here is called the Schur complement of A in Q.)
To illustrate what we can do with the Schur complement and the block
matrix decomposition of Theorem 2.B.1, we now present formulas for the
determinant and inverse of a 2 × 2 block matrix.
Note that in the final step we used the fact that the determinant of a block diagonal matrix equals the product of the determinants of its blocks (this fact is often covered in introductory linear algebra texts—see the end of Appendix A.1.5).

[If A is not invertible then these block matrix formulas do not work. For one way to sometimes get around this problem, see Exercise 2.B.8.]

In particular, notice that det(Q) = 0 if and only if det(S) = 0 (since we are assuming that A is invertible, so we know that det(A) ≠ 0). By recalling that a matrix has determinant 0 if and only if it is not invertible, this tells us that Q is invertible if and only if its Schur complement is invertible. This fact is re-stated in the following theorem, along with an explicit formula for its inverse.
Proof. We just take the inverse of both sides of the block matrix decomposition of Theorem 2.B.1:

\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}
= \left( \begin{bmatrix} I & O \\ CA^{-1} & I \end{bmatrix}
         \begin{bmatrix} A & O \\ O & S \end{bmatrix}
         \begin{bmatrix} I & A^{-1}B \\ O & I \end{bmatrix} \right)^{-1}
= \begin{bmatrix} I & A^{-1}B \\ O & I \end{bmatrix}^{-1}
  \begin{bmatrix} A & O \\ O & S \end{bmatrix}^{-1}
  \begin{bmatrix} I & O \\ CA^{-1} & I \end{bmatrix}^{-1}
= \begin{bmatrix} I & -A^{-1}B \\ O & I \end{bmatrix}
  \begin{bmatrix} A^{-1} & O \\ O & S^{-1} \end{bmatrix}
  \begin{bmatrix} I & O \\ -CA^{-1} & I \end{bmatrix}.

[Recall that (XYZ)^{-1} = Z^{-1}Y^{-1}X^{-1}.]

[If the blocks are all 1 × 1 then the formula provided by this theorem simplifies to the familiar formula
Q^{-1} = \frac{1}{\det(Q)}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
that we already know for 2 × 2 matrices.]

Note that in the final step we used the fact that the inverse of a block diagonal matrix is just the matrix with inverted diagonal blocks and the fact that for any matrix X we have

\begin{bmatrix} I & X \\ O & I \end{bmatrix}^{-1} = \begin{bmatrix} I & -X \\ O & I \end{bmatrix}.
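As a numerical check of this formula (a NumPy sketch under the assumption that both A and the Schur complement S are invertible), the three-factor product above agrees with a directly computed inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A, B, C, D = (rng.standard_normal((n, n)) for _ in range(4))
Q = np.block([[A, B], [C, D]])

Ainv = np.linalg.inv(A)
S = D - C @ Ainv @ B                    # Schur complement of A in Q
I, O = np.eye(n), np.zeros((n, n))

Qinv = (np.block([[I, -Ainv @ B], [O, I]])
        @ np.block([[Ainv, O], [O, np.linalg.inv(S)]])
        @ np.block([[I, O], [-C @ Ainv, I]]))

print(np.allclose(Qinv, np.linalg.inv(Q)))   # True
```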
Example 2.B.2 (Using the Schur Complement)
Use the Schur complement to compute the determinant of the following matrix:

Q = \begin{bmatrix} 2 & 1 & 1 & 0 \\ 1 & 2 & 1 & 1 \\ 1 & 1 & 2 & 1 \\ 0 & 1 & 1 & 2 \end{bmatrix}.

Solution:
We recall the top-left A block and the Schur complement S of A in Q from Example 2.B.1:

A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \quad \text{and} \quad S = \frac{1}{3}\begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}.
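The determinant formula det(Q) = det(A) det(S) can be verified numerically for this matrix with a few lines of NumPy (a sketch, not part of the text); both sides evaluate to 4:

```python
import numpy as np

Q = np.array([[2, 1, 1, 0],
              [1, 2, 1, 1],
              [1, 1, 2, 1],
              [0, 1, 1, 2]], dtype=float)
A, B = Q[:2, :2], Q[:2, 2:]
C, D = Q[2:, :2], Q[2:, 2:]
S = D - C @ np.linalg.inv(A) @ B           # Schur complement of A in Q

print(np.linalg.det(A) * np.linalg.det(S))  # 4.0 (up to rounding)
print(np.linalg.det(Q))                     # 4.0
```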
Proof. We notice that in this case, the decomposition of Theorem 2.B.1 simplifies slightly to Q = P^* D P, where

P = \begin{bmatrix} I & A^{-1}B \\ O & I \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} A & O \\ O & S \end{bmatrix}.
(P^{-1})^* Q (P^{-1}) = D.

A = T^* T.
[A “leading entry” is the first non-zero entry in a row.]

Before proving this theorem, we recall that T being in row echelon form implies that it is upper triangular, but is actually a slightly stronger requirement than just upper triangularity. For example, the matrices

\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix}

are both upper triangular, but only the one on the left is in row echelon form. We also note that the choice of m = rank(A) in this theorem is optimal in some sense:

[If A is positive definite then T is square, upper triangular, and its diagonal entries are strictly positive.]

• If m < rank(A) then no such decomposition of A is possible (even if we ignore the upper triangular requirement) since if T ∈ M_{m,n} then rank(A) = rank(T^*T) = rank(T) ≤ m.
• If m > rank(A) then decompositions of this type exist (for example, we can just pad the matrix T from the m = rank(A) case with extra rows of zeros at the bottom), but they are no longer unique—see Remark 2.B.1.
Proof of Theorem 2.B.5. We prove the result by induction on n (the size of A). For the base case, the result is clearly true if n = 1 since we can choose T = [√a], which is an upper triangular 1 × 1 matrix with a non-negative
is a Cholesky decomposition of A.

Case 2: a_{1,1} ≠ 0. We can write A as the block matrix

A = \begin{bmatrix} a_{1,1} & a_{2,1}^* \\ a_{2,1} & A_{2,2} \end{bmatrix},

where A_{2,2} ∈ M_{n-1} and a_{2,1} ∈ F^{n-1} is a column vector. By applying Theorem 2.B.1, we see that if S = A_{2,2} − a_{2,1}a_{2,1}^*/a_{1,1} is the Schur complement of a_{1,1} in A then we can decompose A in the form

A = \begin{bmatrix} 1 & 0^T \\ a_{2,1}/a_{1,1} & I \end{bmatrix}
    \begin{bmatrix} a_{1,1} & 0^T \\ 0 & S \end{bmatrix}
    \begin{bmatrix} 1 & a_{2,1}^*/a_{1,1} \\ 0 & I \end{bmatrix}.

[The Schur complement exists because a_{1,1} ≠ 0 in this case, so a_{1,1} is invertible.]
Example 2.B.3 (Finding a Cholesky Decomposition)
Find the Cholesky decomposition of the matrix A = \begin{bmatrix} 4 & -2 & 2 \\ -2 & 2 & -2 \\ 2 & -2 & 3 \end{bmatrix}.

Solution:
To construct A's Cholesky decomposition, we mimic the proof of Theorem 2.B.5. We start by writing A as a block matrix with 1 × 1 top-left block:

A = \begin{bmatrix} a_{1,1} & a_{2,1}^* \\ a_{2,1} & A_{2,2} \end{bmatrix}, \quad \text{where} \quad a_{1,1} = 4, \quad a_{2,1} = \begin{bmatrix} -2 \\ 2 \end{bmatrix}, \quad A_{2,2} = \begin{bmatrix} 2 & -2 \\ -2 & 3 \end{bmatrix}.

It follows from the proof of Theorem 2.B.5 that the Cholesky decomposition of A is

A = \begin{bmatrix} \sqrt{a_{1,1}} & a_{2,1}^*/\sqrt{a_{1,1}} \\ 0 & T_2 \end{bmatrix}^*
    \begin{bmatrix} \sqrt{a_{1,1}} & a_{2,1}^*/\sqrt{a_{1,1}} \\ 0 & T_2 \end{bmatrix}
  = \begin{bmatrix} 2 & -1 & 1 \\ 0 & T_2 \end{bmatrix}^*
    \begin{bmatrix} 2 & -1 & 1 \\ 0 & T_2 \end{bmatrix},   (2.B.1)

where the first row of each factor is (2, −1, 1) and the bottom block is [0 | T_2].

[At this point, we have done one step of the induction in the proof that A has a Cholesky decomposition. The next step is to apply the same procedure to S, and so on.]
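The recursion used in this example is short enough to implement directly. The following Python sketch (our own helper, with NumPy assumed, not part of the text) mirrors the proof of Theorem 2.B.5 for positive definite matrices: peel off the first row, form the Schur complement, and recurse. Applied to the matrix A above it returns the upper triangular factor T = [2 −1 1; 0 1 −1; 0 0 1]; note that np.linalg.cholesky returns the lower triangular factor, i.e., the conjugate transpose of this T.

```python
import numpy as np

def cholesky_upper(A):
    """Upper triangular T with A = T* T, following the proof's recursion.

    A sketch only: assumes A is real positive definite (so every pivot is > 0)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.sqrt(A)
    a11, a21, A22 = A[0, 0], A[1:, 0], A[1:, 1:]
    S = A22 - np.outer(a21, a21) / a11          # Schur complement of a11 in A
    T = np.zeros((n, n))
    T[0, 0] = np.sqrt(a11)
    T[0, 1:] = a21 / np.sqrt(a11)
    T[1:, 1:] = cholesky_upper(S)               # recurse on the Schur complement
    return T

A = np.array([[ 4, -2,  2],
              [-2,  2, -2],
              [ 2, -2,  3]], dtype=float)
T = cholesky_upper(A)
print(T)                                        # [[2 -1 1], [0 1 -1], [0 0 1]]
print(np.allclose(T.T @ T, A))                  # True
```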
where A_{2,2} = xx^* + T_2^*T_2. We could thus just choose x ∈ F^{n-1} small enough so that A_{2,2} − xx^* is positive semidefinite and thus has a Cholesky decomposition A_{2,2} − xx^* = T_2^*T_2.
[The Cholesky decomposition is a special case of the LU decomposition A = LU. If A is positive semidefinite, we can choose L = U^*.]

For example, it is straightforward to verify that if

A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 2 & 3 \\ 0 & 3 & 5 \end{bmatrix}
2.B.2 Compute the Schur complement of the top-left 2 × 2 block in each of the following matrices, and use it to compute the determinant of the given matrix and determine whether or not it is positive (semi)definite.
∗ (a) [2 1 1 0; 1 3 1 1; 1 1 2 0; 0 1 0 1]   (b) [3 2 0 −1; 2 2 1 0; 0 1 2 0; −1 0 0 3]

2.B.3 Determine which of the following statements are true and which are false.
∗ (a) The Schur complement of the top-left 2 × 2 block of a 5 × 5 matrix is a 3 × 3 matrix.
(b) If a matrix is positive definite then so are the Schur complements of any of its top-left blocks.
∗ (c) Every matrix has a Cholesky decomposition.

∗ 2.B.4 Find infinitely many different decompositions of the matrix
A = [0 0 0; 0 1 1; 0 1 2]
of the form A = T^*T, where T ∈ M_3 is upper triangular.

2.B.5 Suppose X ∈ M_{m,n} and c ∈ R is a scalar. Use the Schur complement to show that the block matrix
Q = [cI_m X; X^* cI_n]
is positive semidefinite if and only if ‖X‖ ≤ c.
[Side note: You were asked to prove this directly in Exercise 2.3.15.]

∗ 2.B.7 Suppose A ∈ M_n is invertible and S = D − CA^{-1}B is the Schur complement of A in the 2 × 2 block matrix
Q = [A B; C D].
Show that rank(Q) = rank(A) + rank(S).

∗∗ 2.B.8 Consider the 2 × 2 block matrix
Q = [A B; C D].
In this exercise, we show how to come up with block matrix formulas for Q if the D ∈ M_n block is invertible (rather than the A block). Suppose that D is invertible and let S = A − BD^{-1}C (S is called the Schur complement of D in Q).
(a) Show how to write
Q = U [S O; O D] L,
where U is an upper triangular block matrix with ones on its diagonal and L is a lower triangular block matrix with ones on its diagonal.
(b) Show that det(Q) = det(D) det(S).
(c) Show that Q is invertible if and only if S is invertible, and find a formula for its inverse.
(d) Show that Q is positive (semi)definite if and only if S is positive (semi)definite.

2.B.9 Suppose A ∈ M_{m,n} and B ∈ M_{n,m}. Show that
det(I_m + AB) = det(I_n + BA).
[Side note: This is called Sylvester's determinant identity.]
[Hint: Compute the determinant of the block matrix
[I_m −A; B I_n]
using both Schur complements (see Exercise 2.B.8).]

[Side note: In other words, AB and BA have the same eigenvalues, counting algebraic multiplicity, but with AB having m − n extra zero eigenvalues.]

∗∗ 2.B.12 Show that the Cholesky decomposition described by Theorem 2.B.5 is unique.
[Hint: Follow along with the given proof of that theorem and show inductively that if Cholesky decompositions in M_{n-1} are unique then they are also unique in M_n.]
has infinitely many solutions, like (x, y, z) = (1, 1, 1) and (x, y, z) = (2, −1, 2),
even though its coefficient matrix has rank 2 (which we showed in Exam-
ple 2.3.2) and is thus not invertible.
With this example in mind, it seems natural to ask whether or not there
exists a matrix A† with the property that we can find a solution to the linear
system Ax = b (when it exists, but even if A is not invertible) by setting x = A† b.
Definition 2.C.1 (Pseudoinverse of a Matrix)
Suppose F = R or F = C, and A ∈ M_{m,n}(F) has orthogonal rank-one sum decomposition

A = \sum_{j=1}^{r} \sigma_j u_j v_j^*.

Then the pseudoinverse of A is the matrix

A^\dagger = \sum_{j=1}^{r} \frac{1}{\sigma_j} v_j u_j^*.
[Recall that singular values are unique, but singular vectors are not.]

The advantage of the pseudoinverse over the regular inverse is that every matrix has one. Before we can properly explore the pseudoinverse and see what we can do with it though, we have to prove that it is well-defined. That is, we have to show that no matter which orthogonal rank-one sum decomposition
The fact that this is the orthogonal projection onto range(A) follows from
the fact that {u1 , u2 , . . . , ur } forms an orthonormal basis of range(A) (Theo-
rem 2.3.2(a)), together with Theorem 1.4.10.
The fact that A^†A is the orthogonal projection onto range(A^*) follows from computing A^†A = \sum_{j=1}^{r} v_j v_j^* in a manner similar to above, and then recalling from Theorem 2.3.2(c) that {v_1, v_2, ..., v_r} forms an orthonormal basis of range(A^*). On the other hand, the fact that A^†A is the orthogonal projection onto range(A^†) (and thus range(A^*) = range(A^†)) follows from swapping the roles of A and A^† (and using the fact that (A^†)^† = A) in part (a).

The proofs of parts (b) and (d) of the theorem are almost identical, so we leave them to Exercise 2.C.7.
We now show the converse of the above theorem—the pseudoinverse A† is
the only matrix with the property that A† A and AA† are the claimed orthogonal
projections. In particular, this shows that the pseudoinverse is well-defined and
does not depend on which orthogonal rank-one sum decomposition of A was
used to construct it—each one of them results in a matrix with the properties of
Theorem 2.C.1, and there is only one such matrix.
Theorem 2.C.2 (Well-Definedness of the Pseudoinverse)
If A ∈ M_{m,n} and B ∈ M_{n,m} are such that AB is the orthogonal projection onto range(A) and BA is the orthogonal projection onto range(B) = range(A^*) then B = A^†.
Proof. We know from Theorem 2.C.1 that AA^† is the orthogonal projection onto range(A), and we know from Theorem 1.4.10 that orthogonal projections are uniquely determined by their range, so AB = AA^†. A similar argument (making use of the fact that range(A^†) = range(A^*) = range(B)) shows that BA = A^†A.

Since projections leave everything in their range unchanged, we conclude that (BA)Bx = Bx for all x ∈ F^n, so BAB = B, and a similar argument shows that A^†AA^† = A^†. Putting these facts together shows that
Example 2.C.1 (Computing a Pseudoinverse)
Compute the pseudoinverse of the matrix A = \begin{bmatrix} 1 & 2 & 3 \\ -1 & 0 & 1 \\ 3 & 2 & 1 \end{bmatrix}.

Solution:
We already saw in Example 2.3.2 that the singular value decomposition of this matrix is A = UΣV^*, where

U = \frac{1}{\sqrt{6}}\begin{bmatrix} \sqrt{3} & \sqrt{2} & 1 \\ 0 & \sqrt{2} & -2 \\ \sqrt{3} & -\sqrt{2} & -1 \end{bmatrix}, \quad
V = \frac{1}{\sqrt{6}}\begin{bmatrix} \sqrt{2} & -\sqrt{3} & -1 \\ \sqrt{2} & 0 & 2 \\ \sqrt{2} & \sqrt{3} & -1 \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} 2\sqrt{6} & 0 & 0 \\ 0 & \sqrt{6} & 0 \\ 0 & 0 & 0 \end{bmatrix}.
Example 2.C.2 (Computing a Rectangular Pseudoinverse)
Compute the pseudoinverse of the matrix A = \begin{bmatrix} 1 & 1 & 1 & -1 \\ 0 & 1 & 1 & 0 \\ -1 & 1 & 1 & 1 \end{bmatrix}.

Solution:
This is the same matrix from Example 2.3.3, which has singular value decomposition A = UΣV^*. It follows that

A^\dagger = V\Sigma^\dagger U^*
= \frac{1}{\sqrt{12}}
\begin{bmatrix} 0 & -1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1/\sqrt{6} & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} \sqrt{2} & \sqrt{2} & \sqrt{2} \\ -\sqrt{3} & 0 & \sqrt{3} \\ -1 & 2 & -1 \end{bmatrix}
= \frac{1}{12}\begin{bmatrix} 3 & 0 & -3 \\ 2 & 2 & 2 \\ 2 & 2 & 2 \\ -3 & 0 & 3 \end{bmatrix}.

[Verify this matrix multiplication on your own. It builds character.]
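Readers can double-check this result without grinding through the multiplication by hand: NumPy's pinv (a library routine, not part of the text) computes the Moore–Penrose pseudoinverse via the singular value decomposition, and multiplying its output by 12 recovers the integer matrix above.

```python
import numpy as np

A = np.array([[ 1, 1, 1, -1],
              [ 0, 1, 1,  0],
              [-1, 1, 1,  1]], dtype=float)

print(np.round(12 * np.linalg.pinv(A), 6))
# [[ 3.  0. -3.]
#  [ 2.  2.  2.]
#  [ 2.  2.  2.]
#  [-3.  0.  3.]]
```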
Theorem 2.C.3 (Pseudoinverses Solve Linear Systems)
Suppose F = R or F = C, and A ∈ M_{m,n}(F) and b ∈ F^m are such that the linear system Ax = b has at least one solution. Then x = A^†b is a solution, and furthermore if y ∈ F^n is any other solution then ‖A^†b‖ < ‖y‖.

Proof. The linear system Ax = b has a solution if and only if b ∈ range(A). To see that x = A^†b is a solution in this case, we simply notice that

Ax = A(A^†b) = (AA^†)b = b,

since AA^† is the orthogonal projection onto range(A), by Theorem 2.C.1(a).

To see that ‖A^†b‖ ≤ ‖y‖ for all solutions y of the linear system (i.e., all y such that Ay = b), we note that

A^†b = A^†(Ay) = (A^†A)y.
that we originally introduced in Equation (2.C.1). The solution set of this linear system consists of the vectors of the form (0, 3, 0) + z(1, −2, 1), where z is a free variable. This set contains some vectors that are hideous (e.g., choosing z = 341 gives the solution (x, y, z) = (341, −679, 341)), but it also contains some vectors that are not so hideous (e.g., choosing z = 1 gives the solution (x, y, z) = (1, 1, 1), which is the solution found by the pseudoinverse). The guarantee that the pseudoinverse finds the smallest-norm solution means that we do not have to worry about it returning “large and ugly” solutions like (341, −679, 341). Geometrically, it means that it finds the solution closest to the origin (see Figure 2.13).

Figure 2.13: Every point on the line (0, 3, 0) + z(1, −2, 1) is a solution of the linear system from Equation (2.C.1). The pseudoinverse finds the solution (x, y, z) = (1, 1, 1), which is the point on that line closest to the origin.
[Think of Ax − b as the error in our solution x: it is the difference between what Ax actually is and what we want it to be (b).]

Not only does the pseudoinverse find the “best” solution when a solution exists, it even finds the “best” non-solution when no solution exists. To make sense of this statement, we again think in terms of norms and distances—if no solution to a linear system Ax = b exists, then it seems reasonable that the “next best thing” to a solution would be the vector that makes Ax as close to b as possible. In other words, we want to find the vector x that minimizes ‖Ax − b‖. The following theorem shows that choosing x = A^†b also solves this problem:
[Here, Ay is just an arbitrary element of range(A).]

Proof. We know from Theorem 1.4.11 that the closest point to b in range(A) is Pb, where P is the orthogonal projection onto range(A). Well,
as desired.
This method of finding the closest thing to a solution of a linear system is
called linear least squares, and it is particularly useful when trying to fit data
to a model, such as finding a line (or plane, or curve) of best fit for a set of data.
For example, suppose we have many data points (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ),
and we want to find the line that best describes the relationship between x and
y as in Figure 2.14.
Figure 2.14: The line of best fit for a set of points is the line that minimizes the sum of squares of vertical displacements of the points from the line (highlighted here in orange).
To find this line, we consider the “ideal” scenario—we try (and typically
fail) to find a line that passes exactly through all n data points by setting up the
corresponding linear system:
Example 2.C.3 (Finding a Line of Best Fit)
Find the line of best fit for the points (−2, 0), (−1, 1), (0, 0), (1, 2), and (2, 2).

Solution:
To find the line of best fit, we set up the system of linear equations that we would like to solve. Ideally, we would like to find a line y = mx + b that goes through all 5 data points. That is, we want to find m and b such that

0 = −2m + b,   1 = −1m + b,   0 = 0m + b,
2 = 1m + b,    2 = 2m + b.

It's not difficult to see that this linear system has no solution, but we can find the closest thing to a solution (i.e., the line of best fit) by using the
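As a quick cross-check of this example (a NumPy sketch, not part of the text), the same least-squares problem can be solved with np.linalg.lstsq, which is equivalent to applying the pseudoinverse; for the five points above it returns the line y = x/2 + 1.

```python
import numpy as np

xs = np.array([-2, -1, 0, 1, 2], dtype=float)
ys = np.array([ 0,  1, 0, 2, 2], dtype=float)

# Coefficient matrix of the (unsolvable) system  m*x_i + b = y_i.
A = np.column_stack([xs, np.ones_like(xs)])
(m, b), *_ = np.linalg.lstsq(A, ys, rcond=None)   # least-squares solution, i.e. pinv(A) @ ys
print(m, b)                                       # 0.5 1.0, i.e. the line y = x/2 + 1
```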
This exact same method also works for finding the “plane of best fit”
for data points (x1 , y1 , z1 ), (x2 , y2 , z2 ), . . . , (xn , yn , zn ), and so on for higher-
dimensional data as well (see Exercise 2.C.5). We can even do things like
find quadratics of best fit, exponentials of best fit, or other weird functions of
best fit (see Exercise 2.C.6).
By putting together all of the results of this section, we see that the pseu-
doinverse gives the “best solution” to a system of linear equations Ax = b in
all cases:
• If the system has a unique solution, it is x = A^†b.
• If the system has infinitely many solutions, then x = A^†b is the smallest solution—it minimizes ‖x‖.
• If the system has no solutions, then x = A^†b is the closest thing to a solution—it minimizes ‖Ax − b‖.
the primary reasons why we would do this is that it allows us to compress the data that is represented by a matrix, since a full (real) n × n matrix requires us to store n² real numbers, but a rank-k matrix only requires us to store 2kn + k real numbers. To see this, note that we can store a low-rank matrix via its orthogonal rank-one sum decomposition

A = \sum_{j=1}^{k} \sigma_j u_j v_j^*,
Proof of Theorem 2.C.5. Pick any matrix B ∈ M_{m,n} with rank(B) = k, which necessarily has (n − k)-dimensional null space by the rank–nullity theorem (Theorem A.1.2(e)). Also, consider the vector space V_{k+1} = span{v_1, v_2, ..., v_{k+1}}, which is (k + 1)-dimensional. Since (n − k) + (k + 1) = n + 1, we know from Exercise 1.5.2(b) that null(B) ∩ V_{k+1} is at least 1-dimensional, so there exists a unit vector w ∈ null(B) ∩ V_{k+1}. Then we have

‖A − B‖ ≥ ‖(A − B)w‖                                             (since ‖w‖ = 1)
        = ‖Aw‖                                                   (since w ∈ null(B))
        = \left\| \sum_{j=1}^{r} \sigma_j u_j v_j^* w \right\|    (orthogonal rank-one sum decomp. of A)
        = \left\| \sum_{j=1}^{k+1} \sigma_j (v_j^* w) u_j \right\|.   (w ∈ V_{k+1}, so v_j^* w = 0 when j > k + 1)
At this point, we note that Theorem 1.4.5 tells us that (v_1^* w, ..., v_{k+1}^* w) is the coefficient vector of w in the basis {v_1, v_2, ..., v_{k+1}} of V_{k+1}. This then implies, via Corollary 1.4.4, that ‖w‖² = \sum_{j=1}^{k+1} |v_j^* w|², and similarly that

\left\| \sum_{j=1}^{k+1} \sigma_j (v_j^* w) u_j \right\|^2 = \sum_{j=1}^{k+1} \sigma_j^2 |v_j^* w|^2.

With this observation in hand, we now continue the chain of inequalities from above:

‖A − B‖ ≥ \sqrt{\sum_{j=1}^{k+1} \sigma_j^2 |v_j^* w|^2}          (since {u_1, ..., u_{k+1}} is an orthonormal set)
        ≥ \sigma_{k+1} \sqrt{\sum_{j=1}^{k+1} |v_j^* w|^2}         (since \sigma_j ≥ \sigma_{k+1})
        = \sigma_{k+1}                                             (since \sum_{j=1}^{k+1} |v_j^* w|^2 = ‖w‖² = 1)
        = ‖A − A_k‖,                                               (since A − A_k = \sum_{j=k+1}^{r} \sigma_j u_j v_j^*)

as desired.
That is, find the matrix B with rank(B) = 1 that minimizes ‖A − B‖.

Solution:
Recall from Example 2.3.2 that the singular value decomposition of this matrix is A = UΣV^*, where

U = \frac{1}{\sqrt{6}}\begin{bmatrix} \sqrt{3} & \sqrt{2} & 1 \\ 0 & \sqrt{2} & -2 \\ \sqrt{3} & -\sqrt{2} & -1 \end{bmatrix}, \quad
V = \frac{1}{\sqrt{6}}\begin{bmatrix} \sqrt{2} & -\sqrt{3} & -1 \\ \sqrt{2} & 0 & 2 \\ \sqrt{2} & \sqrt{3} & -1 \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} 2\sqrt{6} & 0 & 0 \\ 0 & \sqrt{6} & 0 \\ 0 & 0 & 0 \end{bmatrix}.

rank-1 matrix B:

B = U \begin{bmatrix} 2\sqrt{6} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} V^*
  = \frac{1}{6}\begin{bmatrix} \sqrt{3} & \sqrt{2} & 1 \\ 0 & \sqrt{2} & -2 \\ \sqrt{3} & -\sqrt{2} & -1 \end{bmatrix}
    \begin{bmatrix} 2\sqrt{6} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
    \begin{bmatrix} \sqrt{2} & \sqrt{2} & \sqrt{2} \\ -\sqrt{3} & 0 & \sqrt{3} \\ -1 & 2 & -1 \end{bmatrix}
  = \begin{bmatrix} 2 & 2 & 2 \\ 0 & 0 & 0 \\ 2 & 2 & 2 \end{bmatrix}.

[As always, multiplying matrices like these together is super fun.]
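The same answer drops out of a few lines of NumPy (an illustrative sketch, not part of the text): truncating the SVD of the matrix from Example 2.3.2 after its largest singular value reproduces the rank-1 approximation above.

```python
import numpy as np

A = np.array([[ 1, 2, 3],
              [-1, 0, 1],
              [ 3, 2, 1]], dtype=float)

U, s, Vh = np.linalg.svd(A)
B = s[0] * np.outer(U[:, 0], Vh[0, :])   # keep only the largest singular value
print(np.round(B, 6))                    # [[2 2 2], [0 0 0], [2 2 2]]
```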
We can use this method to compress pretty much any information that we
can represent with a matrix, but it works best when there is some correlation
between the entries in the rows and columns of the matrix (e.g., this method
does not help much if we just place inherently 1-dimensional data like a text file
into a matrix of some arbitrary shape). For example, we can use it to compress
black-and-white images by representing the brightness of each pixel in the
image by a number, arranging those numbers in a matrix of the same size and
shape as the image, and then applying the Eckart–Young–Mirsky theorem to
that matrix.
Similarly, we can compress color images by using the fact that every color
can be obtained from mixing red, green, and blue, so we can use three matrices:
one for each of those primary colors. Figure 2.15 shows the result of applying
a rank-k approximation of this type to an image for k = 1, 5, 20, and 100.
Remark 2.C.1 (Low-Rank Approximation in Other Matrix Norms)
It seems natural to ask how low-rank matrix approximation changes if we use a matrix norm other than the operator norm. It turns out that, for a wide variety of matrix norms (including the Frobenius norm), nothing changes at all. For example, one rank-k matrix B that minimizes ‖A − B‖_F is exactly the same as the one that minimizes ‖A − B‖. That is, the closest rank-k approximation does not change at all, even if we measure “closeness” in this very different way.

[This fact about unitarily-invariant norms is beyond the scope of this book—see (Mir60) for a proof.]

More generally, low-rank approximation works the same way for every matrix norm that is unitarily-invariant (i.e., if Theorem 2.3.6 holds for a particular matrix norm, then so does the Eckart–Young–Mirsky theorem). For example, Exercise 2.3.19 shows that something called the “trace norm” is unitarily-invariant, so the Eckart–Young–Mirsky theorem still works if we replace the operator norm with it.
[The rank-1 approximation is interesting because we can actually see that it has rank 1: every row and column is a multiple of every other row and column, which is what creates the banding effect in the image. The rank-100 approximation is almost indistinguishable from the original image.]
Figure 2.15: A picture of the author’s cats that has been compressed via the Eckart–Young–Mirsky theorem. The images are (a) uncompressed, and compressed via a (b) rank-1, (c) rank-5, (d) rank-20, and (e) rank-100 approximation. The original image is 500 × 700 with full rank 500.
(d) A = [1 1 1; 1 0 1; 0 1 0], b = (1, 1, 1).

2.C.2 Find the line of best fit (in the sense of Example 2.C.3) for the following collections of data points.
[Side note: Exercise 2.C.10 provides a way of solving this problem that avoids computing the pseudoinverse.]
∗ (a) (1, 1), (2, 4).
(b) (−1, 4), (0, 1), (1, −1).
∗ (c) (1, −1), (3, 2), (4, 7).
(d) (−1, 3), (0, 1), (1, −1), (2, 2).

2.C.3 Find the best rank-k approximations of each of the following matrices for the given value of k.
∗ (a) A = [3 1; 1 3], k = 1.
(b) A = [1 −2 3; 3 2 1], k = 1.
∗ (c) A = [1 0 2; 0 1 1; −2 −1 1], k = 1.
(d) A = [1 0 2; 0 1 1; −2 −1 1], k = 2.

2.C.4 Determine which of the following statements are true and which are false.
∗ (a) I^† = I.
(b) The function T : M_3(R) → M_3(R) defined by T(A) = A^† (the pseudoinverse of A) is a linear transformation.
∗ (c) For all A ∈ M_4(C), it is the case that range(A^†) = range(A).
(d) If A ∈ M_{m,n}(R) and b ∈ R^m are such that the linear system Ax = b has a solution then there is a unique solution vector x ∈ R^n that minimizes ‖x‖.
∗ (e) For every A ∈ M_{m,n}(R) and b ∈ R^m there is a unique vector x ∈ R^n that minimizes ‖Ax − b‖.

∗∗ 2.C.5 Find the plane z = ax + by + c of best fit for the following 4 data points (x, y, z):
(0, −1, −1), (0, 0, 1), (0, 1, 3), (2, 0, 3).

∗∗ 2.C.6 Find the curve of the form y = c_1 sin(x) + c_2 cos(x) that best fits the following 3 data points (x, y):
(0, −1), (π/2, 1), (π, 0).

∗∗ 2.C.7 Prove parts (b) and (d) of Theorem 2.C.1.

2.C.8 Suppose F = R or F = C, and A ∈ M_{m,n}(F), B ∈ M_{n,p}(F), and C ∈ M_{p,r}(F). Explain how to compute ABC if we know AB, B, and BC, but not necessarily A or C themselves.
[Hint: This would be trivial if B were square and invertible.]

∗ 2.C.9 In this exercise, we derive explicit formulas for the pseudoinverse in some special cases.
(a) Show that if A ∈ M_{m,n} has linearly independent columns then A^† = (A^*A)^{-1}A^*.
(b) Show that if A ∈ M_{m,n} has linearly independent rows then A^† = A^*(AA^*)^{-1}.
[Side note: A^*A and AA^* are indeed invertible in these cases. See Exercise 1.4.30, for example.]

2.C.10 Show that if x_1, x_2, ..., x_n are not all the same then the line of best fit y = mx + b for the points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) is the unique solution of the linear system
A^*Ax = A^*b,
where
A = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \quad b = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \text{and} \quad x = \begin{bmatrix} m \\ b \end{bmatrix}.
[Hint: Use Exercise 2.C.9.]
Figure 2.16: If f(x) = sin(x) for all x ∈ Q and f is continuous, then it must be the case that f(x) = sin(x) for all x ∈ R.
Remark 2.D.1 (Defining Exponential Functions)
This idea of extending a function from Q to all of R via continuity is actually how some common functions are defined. For example, what does the expression 2^x even mean if x is irrational? We build up to the answer to this question one step at a time:
• If x is a positive integer then 2^x = 2·2···2 (x times). If x is a negative integer then 2^x = 1/2^{-x}. If x = 0 then 2^x = 1.
• If x is rational, so we can write x = p/q for some integers p, q (q ≠ 0), then 2^x = 2^{p/q} = \sqrt[q]{2^p}.
• If x is irrational, we define 2^x by requiring that the function f(x) = 2^x be continuous. That is, we set

  2^x = \lim_{\substack{r \to x \\ r \text{ is rational}}} 2^r.

[Recall that “irrational” just means “not rational”. Some well-known irrational numbers include √2, π, and e.]
[Recall that [A_k]_{i,j} is the (i, j)-entry of A_k.]

Before proceeding, we clarify that limits of matrices are simply meant entrywise: \lim_{k→∞} A_k = A means that \lim_{k→∞} [A_k]_{i,j} = [A]_{i,j} for all i and j. We illustrate with an example.
Example 2.D.1 (Limits of Matrices)
Let A_k = \frac{1}{k}\begin{bmatrix} k & 2k-1 \\ 2k-1 & 4k+1 \end{bmatrix} and B = \begin{bmatrix} 2 & 1 \\ 4 & 2 \end{bmatrix}. Compute \lim_{k→∞} (A_k^{-1} B).

Solution:
We start by computing

A_k^{-1} = \frac{k}{5k-1}\begin{bmatrix} 4k+1 & 1-2k \\ 1-2k & k \end{bmatrix}.

[Recall that the inverse of a 2 × 2 matrix is
\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}.]

In particular, it is worth noting that \lim_{k→∞} A_k^{-1} does not exist (the entries of A_k^{-1} get larger and larger as k increases), which should not be surprising since

\lim_{k→∞} A_k = \lim_{k→∞} \frac{1}{k}\begin{bmatrix} k & 2k-1 \\ 2k-1 & 4k+1 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}

is not invertible. However, if we multiply on the right by B then

A_k^{-1}B = \frac{k}{5k-1}\begin{bmatrix} 4k+1 & 1-2k \\ 1-2k & k \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 4 & 2 \end{bmatrix} = \frac{k}{5k-1}\begin{bmatrix} 6 & 3 \\ 2 & 1 \end{bmatrix},

so

\lim_{k→∞} (A_k^{-1}B) = \lim_{k→∞} \left( \frac{k}{5k-1}\begin{bmatrix} 6 & 3 \\ 2 & 1 \end{bmatrix} \right) = \frac{1}{5}\begin{bmatrix} 6 & 3 \\ 2 & 1 \end{bmatrix}.
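The convergence (and the blow-up of A_k^{-1} on its own) is easy to see numerically; the following sketch (NumPy assumed, not part of the text) prints A_k^{-1}B for a few values of k:

```python
import numpy as np

B = np.array([[2., 1.],
              [4., 2.]])
for k in (10, 1_000, 100_000):
    Ak = np.array([[k, 2*k - 1],
                   [2*k - 1, 4*k + 1]]) / k
    print(k, np.linalg.inv(Ak) @ B)
# The products approach (1/5) * [[6, 3], [2, 1]], even though inv(Ak) itself diverges.
```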
\lim_{k→∞} A_k = A,

• if λ_i ≠ λ_j then

|(λ_i + i/k) − (λ_j + j/k)| = |(λ_i − λ_j) + (i − j)/k|
                            ≥ |λ_i − λ_j| − |i − j|/k
                            ≥ |λ_i − λ_j| − (n − 1)/k > 0.
Remark 2.D.2 (Limits in Normed Vector Spaces)
Limits can actually be defined in any normed vector space—we have just restricted attention to M_{m,n} since that is the space where these concepts are particularly useful for us, and the details simplify in this case since matrices have entries that we can latch onto.

In general, as long as a vector space V is finite-dimensional, we can define limits in V by first fixing a basis B of V and then saying that

\lim_{k→∞} v_k = v \quad \text{means that} \quad \lim_{k→∞} [v_k]_B = [v]_B,

where the limit on the right is just meant entrywise. It turns out that this notion of limit does not depend on which basis B of V we choose (i.e., for any two bases B and C of V, it is the case that \lim_{k→∞} [v_k]_B = [v]_B if and only if \lim_{k→∞} [v_k]_C = [v]_C).

If V is infinite-dimensional then this approach does not work, since we may not be able to construct a basis of V in the first place, so we may not have any notion of “entrywise” limits to work with. We can get around this problem by picking some norm on V (see Section 1.D) and saying that
does not in the ∞-norm. To see why this is the case, we note that

‖f_k − 0‖_1 = \int_0^1 |x^k|\,dx = \left.\frac{x^{k+1}}{k+1}\right|_{x=0}^{1} = \frac{1}{k+1} \quad \text{and} \quad ‖f_k − 0‖_∞ = \max_{x∈[0,1]} |x^k| = 1^k = 1 \quad \text{for all } k ≥ 1.

[Plot omitted; it shows the graphs of f_1(x) = x, x², ..., x^{20} on [0, 1], whose 1-norms tend to 0.]
Definition 2.D.2 (Continuous Functions)
Suppose F = R or F = C. We say that a function f : M_{m,n}(F) → M_{r,s}(F) is continuous if it is the case that

\lim_{k→∞} f(A_k) = f(A) \quad \text{whenever} \quad \lim_{k→∞} A_k = A.
Proof. Since B is dense in M_{m,n}(F), for any A ∈ M_{m,n}(F) we can find matrices A_1, A_2, ... ∈ B such that \lim_{k→∞} A_k = A. Then

f(A) = \lim_{k→∞} f(A_k) = \lim_{k→∞} g(A_k) = g(A),

with the middle equality following from the fact that A_k ∈ B so f(A_k) = g(A_k).
For example, the above theorem tells us that if f : Mn (C) → C is a contin-
uous function for which f (A) = det(A) whenever A is invertible, then f must
in fact be the determinant function, and that if g : Mn (C) → Mn (C) is a con-
tinuous function for which g(A) = A2 − 2A + 3I whenever A is diagonalizable,
[We originally proved that AB is similar to BA (when A or B is invertible) back in Exercise 1.A.5.]

Proof. We start by noticing that if A is invertible then AB and BA are similar, since

AB = A(BA)A^{-1}.

In particular, this tells us that f(AB) = f(A(BA)A^{-1}) = f(BA) whenever A is invertible. If we could show that f(AB) = f(BA) for all A and B then we would be done, since Theorem 1.A.1 would then tell us that f(A) = tr(A) for all A ∈ M_n.

However, this final claim follows immediately from continuity of f and density of the set of invertible matrices. In particular, if we fix any matrix B ∈ M_n and define f_B(A) = f(AB) and g_B(A) = f(BA) then we just showed that f_B(A) = g_B(A) for all invertible A ∈ M_n and all (not necessarily invertible) B ∈ M_n. Since f_B and g_B are continuous, it follows from Theorem 2.D.3 that f_B(A) = g_B(A) (so f(AB) = f(BA)) for all A, B ∈ M_n, which completes the proof.
If the reader is uncomfortable with the introduction of the functions f_B and g_B at the end of the above proof, it can instead be finished a bit more directly by making use of some of the ideas from the proof of Theorem 2.D.1. In particular, for any (potentially non-invertible) matrix A ∈ M_n and integer k > 1, we define A_k = A + \frac{1}{k}I and note that \lim_{k→∞} A_k = A and A_k is invertible when k is large. We then compute

f(AB) = f\left(\lim_{k→∞} A_k B\right)      (since \lim_{k→∞} A_k = A)
      = \lim_{k→∞} f(A_k B)                 (since f is continuous)
      = \lim_{k→∞} f(B A_k)                 (since each A_k is invertible)
      = f\left(\lim_{k→∞} B A_k\right) = f(BA),   (since f is continuous)

as desired.
Theorem 2.D.5 (QR Decomposition (Again))
Suppose F = R or F = C, and A ∈ M_{m,n}(F). There exists a unitary matrix U ∈ M_m(F) and an upper triangular matrix T ∈ M_{m,n}(F) with non-negative real entries on its diagonal such that

A = UT.
Proof of Theorem 2.D.5. We assume that n ≥ m and simply note that a completely analogous argument works if n < m. We write A = [ B | C ] where B ∈ M_m and define A_k = [ B + \frac{1}{k}I | C ] for each integer k ≥ 1. Since B + \frac{1}{k}I is invertible (i.e., its columns are linearly independent) whenever k is sufficiently large, the proof of Theorem 1.C.1 tells us that A_k has a QR decomposition A_k = U_k T_k whenever k is sufficiently large. We now use limit arguments to show that A itself also has such a decomposition.

Since the set of unitary matrices is closed and bounded, the Bolzano–Weierstrass theorem tells us that there is a sequence k_1 < k_2 < k_3 < ··· with the property that U = \lim_{j→∞} U_{k_j} exists and is unitary. Similarly, if \lim_{j→∞} T_{k_j} exists then it must be upper triangular since each T_{k_j} is upper triangular. To see that this limit does exist, we compute

\lim_{j→∞} T_{k_j} = \lim_{j→∞} U_{k_j}^* A_{k_j} = \left(\lim_{j→∞} U_{k_j}\right)^* \left(\lim_{j→∞} A_{k_j}\right) = U^* \lim_{k→∞} A_k = U^* A.

[Recall that limits are taken entrywise, and the limit of the 0 entries in the lower triangular portion of each T_{k_j} is just 0.]
If the reader does not like having to make use of the Bolzano–Weierstrass
theorem as we did above, another proof of the QR decomposition is outlined in
Exercise 2.D.7.
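For completeness, here is how one might obtain a QR decomposition of the form described by Theorem 2.D.5 numerically (a NumPy sketch, not the book's construction): np.linalg.qr does not promise non-negative diagonal entries of T, but flipping the signs of the offending rows of T (and the matching columns of U) fixes that without changing the product.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
U, T = np.linalg.qr(A, mode="complete")      # U is 4x4 orthogonal, T is 4x3 upper triangular

s = np.sign(np.diag(T))                      # signs of the diagonal entries of T
s[s == 0] = 1
U[:, :len(s)] *= s                           # flip matching columns of U ...
T[:len(s), :] *= s[:, None]                  # ... and rows of T, so U @ T is unchanged

print(np.allclose(U @ T, A), np.all(np.diag(T) >= 0))   # True True
```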
From the Inverse to the Adjugate

One final method of extending a property of matrices from those that are invertible to those that perhaps are not is to make use of something called the “adjugate” of a matrix:

While the above definition likely seems completely bizarre at first glance, it is motivated by the following two properties:

[The adjugate is sometimes defined in terms of the cofactors of A instead (see (Joh20, Section 3.A), for example). These two definitions are equivalent.]

• The function adj : M_n → M_n is continuous, since matrix multiplication, addition, and the coefficients of characteristic polynomials are all continuous, and
• The adjugate satisfies A^{-1} = adj(A)/det(A) whenever A ∈ M_n is invertible. To verify this claim, first recall that the constant term of the characteristic polynomial is a_0 = p_A(0) = det(A − 0I) = det(A). Using the Cayley–Hamilton theorem (Theorem 2.1.3) then tells us that

  p_A(A) = (−1)^n A^n + a_{n−1}A^{n−1} + ··· + a_1 A + det(A)I = O,

  and multiplying this equation by A^{-1} (if it exists) shows that
Theorem 2.D.6 (Jacobi's Formula)
Suppose that F = R or F = C and A(t) ∈ M_n(F) is a matrix whose entries depend in a continuously differentiable way on a parameter t ∈ F. If we let dA/dt denote the matrix that is obtained by taking the derivative of each entry of A with respect to t, then

\frac{d}{dt}\det\bigl(A(t)\bigr) = \mathrm{tr}\!\left(\mathrm{adj}\bigl(A(t)\bigr)\,\frac{dA}{dt}\right).
Theorem 2.D.7 (Trigonometric Identities for Matrices)
For all A ∈ M_n(C) it is the case that sin²(A) + cos²(A) = I.

[We proved this theorem via the Jordan decomposition in Exercise 2.4.14.]

Proof. Recall that sin²(x) + cos²(x) = 1 for all x ∈ C. It follows that if A is diagonalizable via A = PDP^{-1}, where D has diagonal entries λ_1, λ_2, ..., λ_n, then

sin²(A) + cos²(A)
= P \begin{bmatrix} \sin^2(\lambda_1) & 0 & \cdots & 0 \\ 0 & \sin^2(\lambda_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sin^2(\lambda_n) \end{bmatrix} P^{-1}
+ P \begin{bmatrix} \cos^2(\lambda_1) & 0 & \cdots & 0 \\ 0 & \cos^2(\lambda_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \cos^2(\lambda_n) \end{bmatrix} P^{-1}
= PP^{-1} = I.

Since f(x) = sin²(x) + cos²(x) is analytic on C and thus continuous on M_n(C), it follows from Theorem 2.D.3 that sin²(A) + cos²(A) = I for all (not necessarily diagonalizable) A ∈ M_n(C).
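A numerical spot-check of this identity (a sketch using SciPy's matrix sine and cosine, which are not part of the text) on a randomly generated matrix, which need not be normal:

```python
import numpy as np
from scipy.linalg import sinm, cosm

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))          # a generic test matrix
lhs = sinm(A) @ sinm(A) + cosm(A) @ cosm(A)
print(np.allclose(lhs, np.eye(4)))       # True
```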
As another example to illustrate the utility of this technique, we now
provide an alternate proof of the Cayley–Hamilton theorem (Theorem 2.1.3)
that avoids the technical argument that we originally used to prove it via Schur
triangularization (which also has a messy, technical proof).
Up until now, all of our explorations in linear algebra have focused on vec-
tors, matrices, and linear transformations—all of our matrix decompositions,
change of basis techniques, algorithms for solving linear systems or construct-
ing orthonormal bases, and so on have had the purpose of deepening our
understanding of these objects. We now introduce a common generalization
of all of these objects, called “tensors”, and investigate which of our tools and
techniques do and do not still work in this new setting.
For example, just like we can think of vectors (that is, the type of vectors
that live in Fn ) as 1-dimensional lists of numbers and matrices as 2-dimensional
arrays of numbers, we can think of tensors as any-dimensional arrays of
numbers. Perhaps more usefully though, recall that we can think of vectors
geometrically as arrows in space, and matrices as linear transformations that
move those arrows around. We can similarly think of tensors as functions that
move vectors, matrices, or even other more general tensors themselves around
in a linear way. In fact, they can even move multiple vectors, matrices, or tensors
around (much like bilinear forms and multilinear forms did in Section 1.3.3).
In a sense, this chapter is where we really push linear algebra to its ultimate
limit, and see just how far our techniques can extend. Tensors provide a common
generalization of almost every single linear algebraic object that we have seen—
not only are vectors, matrices, linear transformations, linear forms, and bilinear
forms examples of tensors, but so are more exotic operations like the cross
product, the determinant, and even matrix multiplication itself.
Before diving into the full power of tensors, we start by considering a new
operation on Fn and Mm,n (F) that contains much of their essence. After all,
tensors themselves are quite abstract and will take some time to get our heads
around, so it will be useful to have this very concrete motivating operation in
our minds when we are introduced to them.
Definition 3.1.1 (Kronecker Product). The Kronecker product of matrices A ∈ M_{m,n} and B ∈ M_{p,q} is the block matrix

    A ⊗ B  def=  [ a_{1,1}B  a_{1,2}B  · · ·  a_{1,n}B ]
                 [ a_{2,1}B  a_{2,2}B  · · ·  a_{2,n}B ]
                 [    ...       ...     . .      ...   ]
                 [ a_{m,1}B  a_{m,2}B  · · ·  a_{m,n}B ]  ∈ M_{mp,nq}.

[Margin note: Recall that a_{i,j} is the (i,j)-entry of A.]
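For readers following along on a computer: NumPy's np.kron implements exactly this block-matrix definition. The sketch below (not part of the text) uses 2 × 2 matrices chosen to match the products displayed in the solutions that follow.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[2, 1],
              [0, 1]])

# np.kron builds the block matrix [a_{1,1}B a_{1,2}B; a_{2,1}B a_{2,2}B].
print(np.kron(A, B))
# [[2 1 4 2]
#  [0 1 0 2]
#  [6 3 8 4]
#  [0 3 0 4]]
```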
Solutions:
a) A ⊗ B = [ B  2B ]   [ 2 1 4 2 ]
           [ 3B 4B ] = [ 0 1 0 2 ]
                       [ 6 3 8 4 ]
                       [ 0 3 0 4 ].

b) B ⊗ A = [ 2A  A ]   [ 2 4 1 2 ]
           [ O   A ] = [ 6 8 3 4 ]
                       [ 0 0 1 2 ]
                       [ 0 0 3 4 ].

c) v ⊗ w = [ w  ]   [ 3 ]
           [ 2w ] = [ 4 ]
                    [ 6 ]
                    [ 8 ].

[Margin note: As always, the bars that we use to partition these block matrices are just provided for ease of visualization—they have no mathematical meaning.]
The above example shows that the Kronecker product is not commutative in general: A ⊗ B might not equal B ⊗ A. However, it does have most of the other "standard" properties that we would expect a matrix product to have, and we collect several of them in the following theorem.

[Margin note: We will see in Theorem 3.1.8 that, even though A ⊗ B and B ⊗ A are typically not equal, they share many of the same properties.]
Theorem 3.1.1 (Vector Space Properties of the Kronecker Product). Suppose A, B, and C are matrices with sizes such that the operations below make sense, and let c ∈ F be a scalar. Then
a) (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)       (associativity)
b) A ⊗ (B + C) = A ⊗ B + A ⊗ C      (left distributivity)
c) (A + B) ⊗ C = A ⊗ C + B ⊗ C      (right distributivity)
d) (cA) ⊗ B = A ⊗ (cB) = c(A ⊗ B)
Proof. The proofs of all of these statements are quite similar to each other, so
we only explicitly prove part (b)—the remaining parts of the theorem are left
to Exercise 3.1.20.
To this end, we just fiddle around with block matrices a bit:

    A ⊗ (B + C) = [ a_{1,1}(B + C)  a_{1,2}(B + C)  · · ·  a_{1,n}(B + C) ]
                  [ a_{2,1}(B + C)  a_{2,2}(B + C)  · · ·  a_{2,n}(B + C) ]
                  [       ...             ...        . .         ...      ]
                  [ a_{m,1}(B + C)  a_{m,2}(B + C)  · · ·  a_{m,n}(B + C) ]

                = [ a_{1,1}B  · · ·  a_{1,n}B ]   [ a_{1,1}C  · · ·  a_{1,n}C ]
                  [    ...     . .      ...   ] + [    ...     . .      ...   ]
                  [ a_{m,1}B  · · ·  a_{m,n}B ]   [ a_{m,1}C  · · ·  a_{m,n}C ]

                = A ⊗ B + A ⊗ C,

as desired.
In particular, associativity of the Kronecker product (i.e., property (a) of the above theorem) tells us that we can unambiguously define Kronecker powers of matrices by taking the Kronecker product of a matrix with itself repeatedly, without having to worry about the exact order in which we perform those products. That is, for any integer p ≥ 1 we can define

    A^{⊗p}  def=  A ⊗ A ⊗ · · · ⊗ A   (p copies).

[Margin note: Associativity of the Kronecker product also tells us that expressions like A ⊗ B ⊗ C make sense.]
Solutions:
a) I^{⊗2} = I ⊗ I = [ I O ]   [ 1 0 0 0 ]
                    [ O I ] = [ 0 1 0 0 ]
                              [ 0 0 1 0 ]
                              [ 0 0 0 1 ].

b) H^{⊗2} = H ⊗ H = [ H  H ]   [ 1  1  1  1 ]
                    [ H −H ] = [ 1 −1  1 −1 ]
                               [ 1  1 −1 −1 ]
                               [ 1 −1 −1  1 ].

c) H^{⊗3} = H ⊗ (H^{⊗2}) = [ H^{⊗2}  H^{⊗2} ]   [ 1  1  1  1  1  1  1  1 ]
                           [ H^{⊗2} −H^{⊗2} ] = [ 1 −1  1 −1  1 −1  1 −1 ]
                                                [ 1  1 −1 −1  1  1 −1 −1 ]
                                                [ 1 −1 −1  1  1 −1 −1  1 ]
                                                [ 1  1  1  1 −1 −1 −1 −1 ]
                                                [ 1 −1  1 −1 −1  1 −1  1 ]
                                                [ 1  1 −1 −1 −1 −1  1  1 ]
                                                [ 1 −1 −1  1 −1  1  1 −1 ].

[Margin note: We could also compute H^{⊗3} = (H^{⊗2}) ⊗ H. We would get the same answer.]
Remark 3.1.1 (Hadamard Matrices). Notice that the matrices H, H^{⊗2}, and H^{⊗3} from Example 3.1.2 all have the following two properties: their entries are all ±1, and their columns are mutually orthogonal. Matrices with these properties are called Hadamard matrices, and the Kronecker product gives one method of constructing them: H^{⊗k} is a Hadamard matrix for all k ≥ 1.

One of the longest-standing unsolved questions in linear algebra asks which values of n are such that there exists an n × n Hadamard matrix. The above argument shows that they exist whenever n = 2^k for some k ≥ 1 (since H^{⊗k} is a 2^k × 2^k matrix), but it is expected that they exist whenever n is a multiple of 4. For example, here is a 12 × 12 Hadamard matrix that cannot be constructed via the Kronecker product:

    [ 1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 ]
    [ 1  1 −1  1 −1 −1 −1  1  1  1 −1  1 ]
    [ 1  1  1 −1  1 −1 −1 −1  1  1  1 −1 ]
    [ 1 −1  1  1 −1  1 −1 −1 −1  1  1  1 ]
    [ 1  1 −1  1  1 −1  1 −1 −1 −1  1  1 ]
    [ 1  1  1 −1  1  1 −1  1 −1 −1 −1  1 ]
    [ 1  1  1  1 −1  1  1 −1  1 −1 −1 −1 ]
    [ 1 −1  1  1  1 −1  1  1 −1  1 −1 −1 ]
    [ 1 −1 −1  1  1  1 −1  1  1 −1  1 −1 ]
    [ 1 −1 −1 −1  1  1  1 −1  1  1 −1  1 ]
    [ 1  1 −1 −1 −1  1  1  1 −1  1  1 −1 ]
    [ 1 −1  1 −1 −1 −1  1  1  1 −1  1  1 ].

Many constructions of Hadamard matrices of other sizes are known, and currently the smallest n for which it is not known if there exists a Hadamard matrix is n = 668.

[Margin note: Entire books have been written about Hadamard matrices and the various ways of constructing them [Aga85, Hor06].]
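As an aside (not from the text), the Kronecker-power construction of Hadamard matrices is easy to check numerically. The sketch below builds H^{⊗3} with functools.reduce and verifies the two defining properties; the helper name kron_power is an arbitrary choice.

```python
import numpy as np
from functools import reduce

def kron_power(A, p):
    """A^{⊗p} = A ⊗ A ⊗ ... ⊗ A (p copies); associativity makes the grouping irrelevant."""
    return reduce(np.kron, [A] * p)

H = np.array([[1, 1],
              [1, -1]])
H3 = kron_power(H, 3)                                        # the 8 x 8 matrix H^{⊗3}
print(np.array_equal(np.abs(H3), np.ones((8, 8))))           # entries are all ±1
print(np.array_equal(H3.T @ H3, 8 * np.eye(8, dtype=int)))   # columns are mutually orthogonal
```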
Notice that, in part (a) of the above example, the Kronecker product of two identity matrices was simply a larger identity matrix—this happens in general, regardless of the sizes of the identity matrices in the product. Similarly, the Kronecker product of two diagonal matrices is always diagonal.

The Kronecker product also plays well with usual matrix multiplication and other operations like the transpose and inverse. We summarize these additional properties here:

[Margin note: We look at some other sets of matrices that are preserved by the Kronecker product shortly, in Theorem 3.1.4.]
Theorem 3.1.2 (Algebraic Properties of the Kronecker Product). Suppose A, B, C, and D are matrices with sizes such that the operations below make sense. Then
a) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD),
b) (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, if either side of this expression exists,
c) (A ⊗ B)^T = A^T ⊗ B^T, and
d) (A ⊗ B)^* = A^* ⊗ B^* if the matrices are complex.
Proof. The proofs of all of these statements are quite similar to each other
and follow directly from the definitions of the relevant operations, so we
only explicitly prove part (a)—the remaining parts of the theorem are left to
Exercise 3.1.21.
To see why part (a) of the theorem holds, we compute (A ⊗ B)(C ⊗ D) via block matrix multiplication:

    (A ⊗ B)(C ⊗ D) = [ a_{1,1}B  · · ·  a_{1,n}B ] [ c_{1,1}D  · · ·  c_{1,p}D ]
                     [    ...     . .      ...   ] [    ...     . .      ...   ]
                     [ a_{m,1}B  · · ·  a_{m,n}B ] [ c_{n,1}D  · · ·  c_{n,p}D ]

                   = [ (∑_{j=1}^n a_{1,j}c_{j,1})BD  · · ·  (∑_{j=1}^n a_{1,j}c_{j,p})BD ]
                     [              ...               . .                ...             ]
                     [ (∑_{j=1}^n a_{m,j}c_{j,1})BD  · · ·  (∑_{j=1}^n a_{m,j}c_{j,p})BD ]

                   = [ ∑_{j=1}^n a_{1,j}c_{j,1}  · · ·  ∑_{j=1}^n a_{1,j}c_{j,p} ]
                     [           ...              . .              ...           ] ⊗ (BD)
                     [ ∑_{j=1}^n a_{m,j}c_{j,1}  · · ·  ∑_{j=1}^n a_{m,j}c_{j,p} ]

                   = (AC) ⊗ (BD),

as desired.

[Margin note: So that these matrix multiplications actually make sense, we are assuming that A ∈ M_{m,n} and C ∈ M_{n,p}.]

[Margin note: This calculation looks ugly, but really it's just applying the definition of matrix multiplication multiple times.]
It is worth noting that Theorems 3.1.1 and 3.1.2 still work if we replace
all of the matrices by vectors, since we can think of those vectors as 1 × n or
m × 1 matrices. Doing this in parts (a) and (d) of the above theorem shows us
that if v, w ∈ Fn and x, y ∈ Fm are (column) vectors, then
(v ⊗ x) · (w ⊗ y) = (v ⊗ x)∗ (w ⊗ y) = (v∗ w)(x∗ y) = (v · w)(x · y).
In other words, the dot product of two Kronecker products is just the product
of the individual dot products. In particular, this means that v ⊗ x is orthogonal
to w ⊗ y if and only if v is orthogonal to w or x is orthogonal to y (or both).
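The following NumPy sketch (not part of the text) spot-checks Theorem 3.1.2(a) and the dot-product identity above on randomly generated matrices and vectors; the sizes and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A, C = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
B, D = rng.standard_normal((2, 5)), rng.standard_normal((5, 3))

# Theorem 3.1.2(a): (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))

# Dot products of Kronecker products multiply: (v ⊗ x) · (w ⊗ y) = (v · w)(x · y)
v, w = rng.standard_normal(4), rng.standard_normal(4)
x, y = rng.standard_normal(3), rng.standard_normal(3)
print(np.isclose(np.kron(v, x) @ np.kron(w, y), (v @ w) * (x @ y)))
```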
Properties Preserved by the Kronecker Product
Because the Kronecker product and Kronecker powers can create such large
matrices so quickly, it is important to understand how properties of A ⊗ B
relate to the corresponding properties of A and B themselves. For example, the
following theorem shows that we can compute the eigenvalues, determinant,
and trace of A ⊗ B directly from A and B themselves.
Proof. Part (a) of the theorem follows almost immediately from the fact that
the Kronecker product plays well with matrix multiplication and scalar multi-
plication (i.e., Theorems 3.1.1 and 3.1.2):
(A ⊗ B)(v ⊗ w) = (Av) ⊗ (Bw) = (λ v) ⊗ (µw) = λ µ(v ⊗ w),
so v ⊗ w is an eigenvector of A ⊗ B corresponding to eigenvalue λ µ, as claimed.
Part (b) follows directly from the definition of the Kronecker product:

    tr(A ⊗ B) = tr[ a_{1,1}B  a_{1,2}B  · · ·  a_{1,n}B ]
                  [ a_{2,1}B  a_{2,2}B  · · ·  a_{2,n}B ]
                  [    ...       ...     . .      ...   ]
                  [ a_{n,1}B  a_{n,2}B  · · ·  a_{n,n}B ]
              = a_{1,1}tr(B) + a_{2,2}tr(B) + · · · + a_{n,n}tr(B) = tr(A)tr(B).

[Margin note: Note in particular that parts (b) and (c) of this theorem imply that tr(A ⊗ B) = tr(B ⊗ A) and det(A ⊗ B) = det(B ⊗ A).]
Finally, for part (c) we note that

    det(A ⊗ B) = det( (A ⊗ I_n)(I_m ⊗ B) ) = det(A ⊗ I_n) det(I_m ⊗ B).

Since I_m ⊗ B is block diagonal, its determinant just equals the product of the determinants of its diagonal blocks:

    det(I_m ⊗ B) = det[ B  O  · · ·  O ]
                      [ O  B  · · ·  O ]
                      [ ..  ..   . .  .. ]
                      [ O  O  · · ·  B ]  = det(B)^m.

A similar argument shows that det(A ⊗ I_n) = det(A)^n, so we get det(A ⊗ B) = det(A)^n det(B)^m, as claimed.

[Margin note: We will see another way to prove this determinant equality (as long as F = R or F = C) in Exercise 3.1.9.]
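Here is a quick numerical spot-check (not from the text) of these eigenvalue, trace, and determinant relationships for randomly generated A ∈ M_3 and B ∈ M_4; the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.standard_normal((m, m))
B = rng.standard_normal((n, n))
K = np.kron(A, B)

# Every product of an eigenvalue of A with an eigenvalue of B is an eigenvalue of A ⊗ B.
eigK = np.linalg.eigvals(K)
prods = np.outer(np.linalg.eigvals(A), np.linalg.eigvals(B)).ravel()
print(all(np.isclose(eigK, p).any() for p in prods))                             # True

# tr(A ⊗ B) = tr(A) tr(B)   and   det(A ⊗ B) = det(A)^n det(B)^m
print(np.isclose(np.trace(K), np.trace(A) * np.trace(B)))                        # True
print(np.isclose(np.linalg.det(K), np.linalg.det(A)**n * np.linalg.det(B)**m))   # True
```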
The Kronecker product also preserves several useful families of matrices.
For example, it follows straightforwardly from the definition of the Kronecker
product that if A ∈ Mm and B ∈ Mn are both upper triangular, then so is A ⊗ B.
We summarize some observations of this type in the following theorem:
(where U_1 and U_2 are unitary, and T_1 and T_2 are upper triangular), then to find a Schur triangularization of A ⊗ B we can simply compute the Kronecker products U_1 ⊗ U_2 and T_1 ⊗ T_2, since A ⊗ B = (U_1 ⊗ U_2)(T_1 ⊗ T_2)(U_1 ⊗ U_2)^*, with U_1 ⊗ U_2 unitary and T_1 ⊗ T_2 upper triangular.
An analogous argument shows that the Kronecker product also preserves di-
agonalizations of matrices (in the sense of Theorem 2.0.1), as well as spectral
decompositions, QR decompositions, singular value decompositions, and polar
decompositions.
Because these decompositions behave so well under the Kronecker prod-
uct, we can use them to get our hands on any matrix properties that can be
inferred from these decompositions. For example, by looking at how the Kro-
necker product interacts with the singular value decomposition, we arrive at
the following theorem:
Again, all of these properties follow fairly quickly from the relevant definitions and the fact that if A and B have singular value decompositions A = U_1Σ_1V_1^* and B = U_2Σ_2V_2^*, then A ⊗ B = (U_1 ⊗ U_2)(Σ_1 ⊗ Σ_2)(V_1 ⊗ V_2)^*.
The one notable matrix decomposition that does not behave quite so cleanly under the Kronecker product is the Jordan decomposition. In particular, if J_1 ∈ M_m(C) and J_2 ∈ M_n(C) are two matrices in Jordan canonical form then J_1 ⊗ J_2 may not be in Jordan canonical form. For example, if

    J_1 = J_2 = [ 1 1 ]     then     J_1 ⊗ J_2 = [ 1 1 1 1 ]
                [ 0 1 ]                          [ 0 1 0 1 ]
                                                 [ 0 0 1 1 ]
                                                 [ 0 0 0 1 ],

which is not in Jordan canonical form.
is a basis of Fmn (see Exercise 3.1.17, and note that a similar statement holds
for matrices in Mmp,nq being written as a linear combination of matrices of
the form A ⊗ B). In the special case when B and C are the standard bases of
Fm and Fn , respectively, B ⊗C is also the standard basis of Fmn . Furthermore,
ordering the basis vectors ei ⊗ e j by placing their subscripts in lexicographical
order produces exactly the “usual” ordering of the standard basis vectors of
F^{mn}. For example, if m = 2 and n = 3 then e_1 ⊗ e_1, e_1 ⊗ e_2, e_1 ⊗ e_3, e_2 ⊗ e_1, e_2 ⊗ e_2, e_2 ⊗ e_3 are, in this order, exactly the standard basis vectors e_1, e_2, . . ., e_6 of F^6.
In fact, this same observation works when taking the Kronecker product of 3 or
more standard basis vectors as well.
When working with vectors that are (linear combinations of) Kronecker
products of other vectors, we typically want to know what the dimensions of
the different factors of the Kronecker product are. For example, if we say that
v ⊗ w ∈ F6 , we might wonder whether v and w live in 2- and 3-dimensional
spaces, respectively, or 3- and 2-dimensional spaces. To alleviate this issue,
we use the notation Fm ⊗ Fn to mean Fmn , but built out of Kronecker products
of vectors from Fm and Fn , in that order (and we similarly use the notation
Mm,n ⊗ M p,q for Mmp,nq ). When working with the Kronecker product of
many vectors, we often use the shorthand notation

    (F^n)^{⊗p}  def=  F^n ⊗ F^n ⊗ · · · ⊗ F^n   (p copies).
Furthermore, we say that any vector v = v_1 ⊗ v_2 ⊗ · · · ⊗ v_p ∈ (F^n)^{⊗p} that can be written as a Kronecker product (rather than as a linear combination of Kronecker products) is an elementary tensor.

[Margin note: We will clarify what the word "tensor" means in the coming sections.]
In other words, this theorem tells us about the standard matrix of the linear transformation T_{A,B} : M_{m,n} → M_{p,r} defined by T_{A,B}(X) = AXB^T, where A ∈ M_{p,m} and B ∈ M_{r,n} are fixed. In particular, it says that the standard matrix of T_{A,B} (with respect to the standard basis E = {E_{1,1}, E_{1,2}, . . . , E_{m,n}}) is simply [T_{A,B}] = A ⊗ B.

[Margin note: Recall that when working with standard matrices with respect to the standard basis E, we use the shorthand [T] = [T]_E.]

Proof of Theorem 3.1.7. We start by showing that if X = E_{i,j} for some i, j then vec(AE_{i,j}B^T) = (A ⊗ B)vec(E_{i,j}). To see this, note that E_{i,j} = e_ie_j^T, so using Theorem 3.1.6 twice tells us that

    vec(AE_{i,j}B^T) = vec(Ae_ie_j^TB^T) = vec((Ae_i)(Be_j)^T) = (Ae_i) ⊗ (Be_j)
                     = (A ⊗ B)(e_i ⊗ e_j) = (A ⊗ B)vec(e_ie_j^T) = (A ⊗ B)vec(E_{i,j}).
If we then use the fact that we can write X = ∑_{i,j} x_{i,j}E_{i,j}, the result follows from the fact that vectorization is linear:

    vec(AXB^T) = vec( A(∑_{i,j} x_{i,j}E_{i,j})B^T ) = ∑_{i,j} x_{i,j} vec(AE_{i,j}B^T)
               = ∑_{i,j} x_{i,j}(A ⊗ B)vec(E_{i,j}) = (A ⊗ B)vec( ∑_{i,j} x_{i,j}E_{i,j} )
               = (A ⊗ B)vec(X),

as desired.

[Margin note: This proof technique is quite common when we want to show that a linear transformation acts in a certain way: we show that it acts that way on a basis, and then use linearity to show that it must do the same thing on the entire vector space.]
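A short NumPy check of the identity vec(AXB^T) = (A ⊗ B)vec(X) (not part of the text): since the text's vec stacks a matrix row by row, it corresponds to NumPy's default row-major flattening. The matrix sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))   # A ∈ M_{p,m} with p = 3, m = 2
B = rng.standard_normal((4, 5))   # B ∈ M_{r,n} with r = 4, n = 5
X = rng.standard_normal((2, 5))   # X ∈ M_{m,n}

# Row-wise vectorization, matching the book's vec (stack the rows on top of one another).
vec = lambda M: M.reshape(-1)

print(np.allclose(vec(A @ X @ B.T), np.kron(A, B) @ vec(X)))  # True
```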
The above theorem is nothing revolutionary, but it is useful because it
provides an explicit and concrete way of solving many problems that we have
(at least implicitly) encountered before. For example, suppose we had a fixed
matrix A ∈ Mn and we wanted to find all matrices X ∈ Mn that commute with
it. One way to do this would be to multiply out AX and XA, set entries of those
matrices equal to each other, and solve the resulting linear system. To make
the details of this linear system more explicit, however, we can notice that
AX = XA if and only if AX − XA = O, which (by taking the vectorization of
both sides of the equation and applying Theorem 3.1.7) is equivalent to
(A ⊗ I − I ⊗ AT )vec(X) = 0.
This is a linear system that we can solve “directly”, as we now illustrate with
an example.
Example 3.1.3 (Finding Matrices that Commute). Find all matrices X ∈ M_2 that commute with A = [1 1; 0 0].

Solution:
As noted above, one way to tackle this problem is to solve the linear system (A ⊗ I − I ⊗ A^T)vec(X) = 0. The coefficient matrix of this linear system is

    A ⊗ I − I ⊗ A^T = [ 1 0 1 0 ]   [ 1 0 0 0 ]   [  0 0  1 0 ]
                      [ 0 1 0 1 ] − [ 1 0 0 0 ] = [ −1 1  0 1 ]
                      [ 0 0 0 0 ]   [ 0 0 1 0 ]   [  0 0 −1 0 ]
                      [ 0 0 0 0 ]   [ 0 0 1 0 ]   [  0 0 −1 0 ].

From here we can see that, if we label the entries of vec(X) as vec(X) = (x_1, x_2, x_3, x_4), then x_3 = 0 and −x_1 + x_2 + x_4 = 0, so x_1 = x_2 + x_4 (x_2 and x_4 are free). It follows that vec(X) = (x_2 + x_4, x_2, 0, x_4), so the matrices X that commute with A are exactly the ones of the form

    X = mat( (x_2 + x_4, x_2, 0, x_4) ) = [ x_2 + x_4  x_2 ]
                                          [     0      x_4 ],

where x_2 and x_4 range over all scalars.
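The same computation is easy to carry out numerically. The sketch below (not from the text) feeds the coefficient matrix A ⊗ I − I ⊗ A^T to scipy.linalg.null_space and confirms that each basis vector of the null space, reshaped back into a matrix, commutes with A.

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 1.0],
              [0.0, 0.0]])
n = A.shape[0]
I = np.eye(n)

# AX = XA  <=>  (A ⊗ I − I ⊗ A^T) vec(X) = 0  (row-wise vec, as in the text)
M = np.kron(A, I) - np.kron(I, A.T)
basis = null_space(M)              # columns span the 2-dimensional solution space

for col in basis.T:
    X = col.reshape(n, n)          # "mat" undoes vec
    print(np.allclose(A @ X, X @ A))   # True for every basis vector
```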
Definition 3.1.3 (Swap Matrix). Given positive integers m and n, the swap matrix W_{m,n} ∈ M_{mn} is the matrix defined in any of the three following (equivalent) ways:
a) W_{m,n} = [T], the standard matrix of the transposition map T : M_{m,n} → M_{n,m} with respect to the standard basis E,
b) W_{m,n}(e_i ⊗ e_j) = e_j ⊗ e_i for all 1 ≤ i ≤ m, 1 ≤ j ≤ n, and
c) W_{m,n} = [ E_{1,1}  E_{2,1}  · · ·  E_{m,1} ]
             [ E_{1,2}  E_{2,2}  · · ·  E_{m,2} ]
             [   ...      ...     . .     ...   ]
             [ E_{1,n}  E_{2,n}  · · ·  E_{m,n} ].

If the dimensions m and n are clear from context or irrelevant, we denote this matrix simply by W.

[Margin note: The entries of W_{m,n} are all just 0 or 1, and this result works over any field F. If F = R or F = C then W_{m,n} is unitary.]
For example, we already showed in Example 1.2.10 that the standard matrix of the transpose map T : M_2 → M_2 is

    [T] = [ 1 0 0 0 ]
          [ 0 0 1 0 ]
          [ 0 1 0 0 ]
          [ 0 0 0 1 ],

so this is the swap matrix W_{2,2}. We can check that it satisfies definition (b) by directly computing each of W_{2,2}(e_1 ⊗ e_1), W_{2,2}(e_1 ⊗ e_2), W_{2,2}(e_2 ⊗ e_1), and W_{2,2}(e_2 ⊗ e_2). Similarly, it also satisfies definition (c) since we can write it as the block matrix

    W_{2,2} = [ E_{1,1}  E_{2,1} ]   [ 1 0 0 0 ]
              [ E_{1,2}  E_{2,2} ] = [ 0 0 1 0 ]
                                     [ 0 1 0 0 ]
                                     [ 0 0 0 1 ].

[Margin note: We use W (instead of S) to denote the swap matrix because the letter "S" is going to become overloaded later in this section. We can think of "W" as standing for "sWap".]
To see that these three definitions agree in general (i.e., when m and n do not necessarily both equal 2), we first compute

    [T] = [ [E_{1,1}^T]_E | [E_{1,2}^T]_E | · · · | [E_{m,n}^T]_E ]        (definition of [T])
        = [ vec(E_{1,1}) | vec(E_{2,1}) | · · · | vec(E_{n,m}) ]           (definition of vec)
        = [ vec(e_1e_1^T) | vec(e_2e_1^T) | · · · | vec(e_ne_m^T) ]        (E_{i,j} = e_ie_j^T for all i, j)
        = [ e_1 ⊗ e_1 | e_2 ⊗ e_1 | · · · | e_n ⊗ e_m ].                   (Theorem 3.1.6)

The equivalence of definitions (a) and (b) of the swap matrix follows fairly quickly now, since multiplying a matrix by e_i ⊗ e_j just results in one of the columns of that matrix, and in this case we will have [T](e_i ⊗ e_j) = e_j ⊗ e_i. Similarly, the equivalence of definitions (a) and (c) follows just by explicitly writing out what the columns of the block matrix (c) are: they are exactly e_1 ⊗ e_1, e_2 ⊗ e_1, . . ., e_n ⊗ e_m, which we just showed are also the columns of [T].

[Margin note: Some books call the swap matrix the commutation matrix and denote it by K_{m,n}.]
b) Again, we use characterization (c) to construct this swap matrix:

    W_{3,3} = [ E_{1,1}  E_{2,1}  E_{3,1} ]   [ 1 · · · · · · · · ]
              [ E_{1,2}  E_{2,2}  E_{3,2} ] = [ · · · 1 · · · · · ]
              [ E_{1,3}  E_{2,3}  E_{3,3} ]   [ · · · · · · 1 · · ]
                                              [ · 1 · · · · · · · ]
                                              [ · · · · 1 · · · · ]
                                              [ · · · · · · · 1 · ]
                                              [ · · 1 · · · · · · ]
                                              [ · · · · · 1 · · · ]
                                              [ · · · · · · · · 1 ].

[Margin note: Here we use dots (·) instead of zeros for ease of visualization.]

[Margin note: Matrices (like the swap matrix) with a single 1 in each row and column, and zeros elsewhere, are called permutation matrices. Compare this with the signed or complex permutation matrices from Theorem 1.D.10.]

The swap matrix W_{m,n} has some very nice properties, which we prove in Exercise 3.1.18. In particular, every row and column has a single non-zero entry (equal to 1), if F = R or F = C then it is unitary, and if m = n then it is symmetric. If the dimensions m and n are clear from context or irrelevant, we just denote this matrix by W for simplicity.

The name "swap matrix" comes from the fact that it swaps the two factors in any Kronecker product: W(v ⊗ w) = w ⊗ v for all v ∈ F^m and w ∈ F^n. To see this, just write each of v and w as linear combinations of the standard basis vectors (v = ∑_i v_ie_i and w = ∑_j w_je_j) and then use characterization (b) of swap matrices:
matrices:
!
W (v ⊗ w) = W ∑ vi ei ⊗ ∑ w j e j = ∑ vi w jW (ei ⊗ e j )
i j i, j
= ∑ vi w j e j ⊗ ei = ∑ w je j ⊗ ∑ vi ei = w ⊗ v.
i, j j i
More generally, the following theorem shows that swap matrices also solve exactly the problem that we introduced them to solve—they can be used to transform A ⊗ B into B ⊗ A:

Theorem 3.1.8 (Almost-Commutativity of the Kronecker Product). Suppose A ∈ M_{m,n} and B ∈ M_{p,q}. Then B ⊗ A = W_{m,p}(A ⊗ B)W_{n,q}^T.

Proof. Notice that if we write A and B in terms of their columns as A = [ a_1 | a_2 | · · · | a_n ] and B = [ b_1 | b_2 | · · · | b_q ], respectively, then

    A = ∑_{i=1}^n a_ie_i^T   and   B = ∑_{j=1}^q b_je_j^T.

[Margin note: Alternatively, we could use Theorem A.1.3 here to write A and B as a sum of rank-1 matrices.]
= B ⊗ A,

as desired.

For the matrices A and B from Example 3.1.1, W = W_{2,2} is the swap matrix, so we just need to perform the indicated matrix multiplications:

    B ⊗ A = W(A ⊗ B)W^T = [ 1 0 0 0 ] [ 2 1 4 2 ] [ 1 0 0 0 ]   [ 2 4 1 2 ]
                          [ 0 0 1 0 ] [ 0 1 0 2 ] [ 0 0 1 0 ] = [ 6 8 3 4 ]
                          [ 0 1 0 0 ] [ 6 3 8 4 ] [ 0 1 0 0 ]   [ 0 0 1 2 ]
                          [ 0 0 0 1 ] [ 0 3 0 4 ] [ 0 0 0 1 ]   [ 0 0 3 4 ].

Note that this agrees with Example 3.1.1, where we computed each of A ⊗ B and B ⊗ A explicitly.
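As a numerical aside (not from the text), the sketch below constructs W_{m,n} from characterization (b) of Definition 3.1.3 and spot-checks both the swap property W(v ⊗ w) = w ⊗ v and Theorem 3.1.8 on random matrices; the helper name swap_matrix and the chosen sizes are arbitrary.

```python
import numpy as np

def swap_matrix(m, n):
    """W_{m,n}, built from characterization (b): W(e_i ⊗ e_j) = e_j ⊗ e_i."""
    W = np.zeros((m * n, m * n))
    Im, In = np.eye(m), np.eye(n)
    for i in range(m):
        for j in range(n):
            W += np.outer(np.kron(In[j], Im[i]), np.kron(Im[i], In[j]))
    return W

m, n, p, q = 2, 3, 4, 2
rng = np.random.default_rng(3)
v, w = rng.standard_normal(m), rng.standard_normal(n)
A, B = rng.standard_normal((m, n)), rng.standard_normal((p, q))

print(np.allclose(swap_matrix(m, n) @ np.kron(v, w), np.kron(w, v)))   # W(v ⊗ w) = w ⊗ v
print(np.allclose(np.kron(B, A),
                  swap_matrix(m, p) @ np.kron(A, B) @ swap_matrix(n, q).T))  # Theorem 3.1.8
```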
Definition 3.1.4 (The Symmetric Subspace). Suppose n, p ≥ 1 are integers. The symmetric subspace S_n^p is the subspace of (F^n)^{⊗p} consisting of vectors that are unchanged by swap matrices:

    S_n^p  def=  { v ∈ (F^n)^{⊗p} : W_σ v = v for all σ ∈ S_p }.
Properties of S_n^2:
    basis:      {e_j ⊗ e_j : 1 ≤ j ≤ n} ∪ {e_i ⊗ e_j + e_j ⊗ e_i : 1 ≤ i < j ≤ n}
    dimension:  C(n+1, 2) = n(n+1)/2

For example, the members of S_2^2 are the vectors of the form (a, b, b, c), which are isomorphic via matricization to the 2 × 2 (symmetric) matrices of the form

    [ a b ]
    [ b c ].
The following theorem generalizes these properties to higher values of p.
Theorem 3.1.10 (Properties of the Symmetric Subspace). The symmetric subspace S_n^p ⊆ (F^n)^{⊗p} has the following properties:
a) One projection onto S_n^p is given by (1/p!) ∑_{σ∈S_p} W_σ,
b) dim(S_n^p) = C(n+p−1, p), and
c) the following set is a basis of S_n^p:

    { ∑_{σ∈S_p} W_σ(e_{j_1} ⊗ e_{j_2} ⊗ · · · ⊗ e_{j_p}) : 1 ≤ j_1 ≤ j_2 ≤ · · · ≤ j_p ≤ n }.

Furthermore, if F = R or F = C then the projection in part (a) and the basis in part (c) are each orthogonal.

[Margin note: When we say "orthogonal" here, we mean with respect to the usual dot product on (F^n)^{⊗p}.]
Proof. We begin by proving property (a). Define P = (1/p!) ∑_{σ∈S_p} W_σ to be the proposed projection onto S_n^p. It is straightforward to check that P^2 = P = P^T, so P is an (orthogonal, if F = R or F = C) projection onto some subspace of (F^n)^{⊗p}—we leave the proof of those statements to Exercise 3.1.15.

It thus now suffices to show that range(P) = S_n^p. To this end, first notice that for all τ ∈ S_p we have

    W_τP = (1/p!) ∑_{σ∈S_p} W_τW_σ = (1/p!) ∑_{σ∈S_p} W_{τ∘σ} = P,

with the final equality following from the fact that every permutation in S_p can be written in the form τ ∘ σ for some σ ∈ S_p. It follows that everything in range(P) is unchanged by W_τ (for all τ ∈ S_p), so range(P) ⊆ S_n^p.

To prove the opposite inclusion, we just notice that if v ∈ S_n^p then

    Pv = (1/p!) ∑_{σ∈S_p} W_σv = (1/p!) ∑_{σ∈S_p} v = v,

so v ∈ range(P) and thus S_n^p ⊆ range(P). Since we already proved the opposite inclusion, it follows that range(P) = S_n^p, so P is a projection onto S_n^p as claimed.

[Margin note: The fact that composing all permutations by τ gives the set of all permutations follows from the fact that S_p is a group (i.e., every permutation is invertible). To write a particular permutation ρ ∈ S_p as ρ = τ ∘ σ, just choose σ = τ^{-1} ∘ ρ.]
To prove property (c) (we will prove property (b) shortly), we first notice that the columns of the projection P from part (a) have the form

    P(e_{j_1} ⊗ e_{j_2} ⊗ · · · ⊗ e_{j_p}) = (1/p!) ∑_{σ∈S_p} W_σ(e_{j_1} ⊗ e_{j_2} ⊗ · · · ⊗ e_{j_p}).

To demonstrate property (b), we simply notice that the basis from part (c) of the theorem contains as many vectors as there are multisets {j_1, j_2, . . . , j_p} ⊆ {1, 2, . . . , n}. A standard combinatorics result says that there are exactly

    C(n+p−1, p) = (n+p−1)! / ( p!(n−1)! )

such multisets (see Remark 3.1.2), which completes the proof.

[Margin note: A multiset is just a set in which repetition is allowed, like {1, 2, 2, 3}. Order does not matter in multisets (just like regular sets).]
Remark 3.1.2. We now explain why there are exactly C(n+p−1, p) p-element multisets with entries chosen from an n-element set (a fact that we made use of at the end of the proof of Theorem 3.1.10). We represent each multiset graphically via "stars and bars", where p stars represent the members of a multiset and n − 1 bars separate the values of those stars. For example, in the n = 5, p = 6 case, the multisets {1, 2, 3, 3, 5, 5} and {1, 1, 1, 2, 4, 4} would be represented by the stars and bars arrangements

    ★|★|★★||★★   and   ★★★|★||★★|,

respectively.

Notice that there are a total of n + p − 1 positions in such an arrangement of stars and bars (p positions for the stars and n − 1 positions for the bars), and each arrangement is completely determined by the positions that we choose for the p stars. It follows that there are C(n+p−1, p) such configurations of stars and bars, and thus exactly that many multisets of size p chosen from a set of size n, as claimed.

[Margin note: It is okay for bars to be located at the start or end, and it is also okay for multiple bars to appear consecutively.]
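For the computationally inclined (this is not part of the text), the projection from Theorem 3.1.10(a) can be assembled explicitly by summing permutation operators on (F^n)^{⊗p}, and its rank then matches the dimension formula from part (b). The helper W_sigma below realizes W_σ by permuting tensor factors; the choice n = 3, p = 3 is arbitrary.

```python
import numpy as np
from itertools import permutations
from math import factorial, comb

def W_sigma(n, p, perm):
    """Permutation operator on (F^n)^{⊗p} that permutes the p Kronecker factors by perm."""
    N = n ** p
    W = np.zeros((N, N))
    for col in range(N):
        e = np.zeros(N)
        e[col] = 1.0
        W[:, col] = e.reshape((n,) * p).transpose(perm).reshape(N)
    return W

n, p = 3, 3
P = sum(W_sigma(n, p, s) for s in permutations(range(p))) / factorial(p)

print(np.allclose(P @ P, P), np.allclose(P.T, P))        # P is an orthogonal projection
print(np.linalg.matrix_rank(P) == comb(n + p - 1, p))    # rank = dim(S_n^p) = C(n+p-1, p)
```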
The orthogonal basis vectors from part (c) of Theorem 3.1.10 do not form an orthonormal basis because they are not properly normalized. For example, in the n = 2, p = 3 case, we have dim(S_2^3) = C(4, 3) = 4, and the basis of S_2^3 described by the theorem consists of the following 4 vectors:

    Basis Vector                                                    Tuple (j_1, j_2, j_3)
    6 e_1 ⊗ e_1 ⊗ e_1                                               (1, 1, 1)
    2 (e_1 ⊗ e_1 ⊗ e_2 + e_1 ⊗ e_2 ⊗ e_1 + e_2 ⊗ e_1 ⊗ e_1)         (1, 1, 2)
    2 (e_1 ⊗ e_2 ⊗ e_2 + e_2 ⊗ e_1 ⊗ e_2 + e_2 ⊗ e_2 ⊗ e_1)         (1, 2, 2)
    6 e_2 ⊗ e_2 ⊗ e_2                                               (2, 2, 2)

[Margin note: This tells us that, in the n = 2, p = 3 case, S_2^3 consists of the vectors of the form (a, b, b, c, b, c, c, d).]

In order to turn this basis into an orthonormal one, we must divide each vector in it by √(p! m_1! m_2! · · · m_n!), where m_j denotes the multiplicity of j in the corresponding tuple (j_1, j_2, . . . , j_p). For example, for the basis vector 2(e_1 ⊗ e_1 ⊗ e_2 + e_1 ⊗ e_2 ⊗ e_1 + e_2 ⊗ e_1 ⊗ e_1) corresponding to the tuple (1, 1, 2) above, we have m_1 = 2 and m_2 = 1, so we divide that vector by √(p! m_1! m_2!) = √(3! 2! 1!) = 2√3 to normalize it:

    (1/√3)(e_1 ⊗ e_1 ⊗ e_2 + e_1 ⊗ e_2 ⊗ e_1 + e_2 ⊗ e_1 ⊗ e_1).
We close our discussion of the symmetric subspace by showing that it could
be defined in another (equivalent) way—as the span of Kronecker powers of
vectors in Fn (as long as F = R or F = C).
Theorem 3.1.11 (Tensor-Power Basis of the Symmetric Subspace). Suppose F = R or F = C. The symmetric subspace S_n^p ⊆ (F^n)^{⊗p} is the span of Kronecker powers of vectors:

    S_n^p = span{ v^{⊗p} : v ∈ F^n }.
Remark 3.1.3 (The Spectral Decomposition in the Symmetric Subspace). Another way to see that symmetric matrices can be written as a linear combination of symmetric rank-1 matrices is to make use of the real spectral decomposition (Theorem 2.1.6, if F = R) or the Takagi factorization (Exercise 2.3.26, if F = C). In particular, if F = R and A ∈ M_n^S has {u_1, u_2, . . . , u_n} as an orthonormal basis of eigenvectors with corresponding eigenvalues λ_1, λ_2, . . ., λ_n, then

    A = ∑_{j=1}^n λ_j u_ju_j^T

is one way of writing A in the desired form. If we trace things back through the isomorphism between S_n^2 and M_n^S then this shows in the p = 2 case that we can write every vector v ∈ S_n^2 in the form

    v = ∑_{j=1}^n λ_j u_j ⊗ u_j,

and a similar argument works when F = C if we use the Takagi factorization instead.

[Margin note: When F = C, a symmetric (not Hermitian!) matrix might not be normal, so the complex spectral decomposition might not apply to it, which is why we must use the Takagi factorization.]

Notice that what we have shown here is stronger than the statement of Theorem 3.1.11 (in the p = 2 case), which does not require the set {u_1, u_2, . . . , u_n} to be orthogonal. Indeed, when p ≥ 3, this stronger claim
v = λ1 v1 ⊗ v1 ⊗ v1 + λ2 v2 ⊗ v2 ⊗ v2
Definition 3.1.5 (The Antisymmetric Subspace). Suppose n, p ≥ 1 are integers. The antisymmetric subspace A_n^p is the following subspace of (F^n)^{⊗p}:

    A_n^p  def=  { v ∈ (F^n)^{⊗p} : W_σ v = sgn(σ)v for all σ ∈ S_p }.
As suggested above, in the p = 2 case the antisymmetric subspace is isomorphic to the set M_n^{sS} of skew-symmetric matrices via matricization, since Wv = −v if and only if mat(v)^T = −mat(v). With this in mind, we now remind ourselves of some of the properties of M_n^{sS} here, as well as the corresponding properties of A_n^2 that they imply:

    Properties of M_n^{sS}:   basis {E_{i,j} − E_{j,i} : 1 ≤ i < j ≤ n},        dimension C(n, 2) = n(n−1)/2
    Properties of A_n^2:      basis {e_i ⊗ e_j − e_j ⊗ e_i : 1 ≤ i < j ≤ n},    dimension C(n, 2) = n(n−1)/2

For example, the members of A_2^2 are the vectors of the form (0, a, −a, 0), which are isomorphic via matricization to the 2 × 2 (skew-symmetric) matrices of the form

    [  0  a ]
    [ −a  0 ].

[Margin note: Recall that the sign sgn(σ) of a permutation σ is (−1) raised to the number of transpositions needed to generate it (see Appendix A.1.5).]
The following theorem generalizes these properties to higher values of p.
Theorem 3.1.12 (Properties of the Antisymmetric Subspace). The antisymmetric subspace A_n^p ⊆ (F^n)^{⊗p} has the following properties:
a) One projection onto A_n^p is given by (1/p!) ∑_{σ∈S_p} sgn(σ)W_σ,
b) dim(A_n^p) = C(n, p), and
c) the following set is a basis of A_n^p:

    { ∑_{σ∈S_p} sgn(σ)W_σ(e_{j_1} ⊗ · · · ⊗ e_{j_p}) : 1 ≤ j_1 < · · · < j_p ≤ n }.
For example, the unique (up to scaling) vectors in A22 and A33 are
e1 ⊗ e2 − e2 ⊗ e1 and e1 ⊗ e2 ⊗ e3 + e2 ⊗ e3 ⊗ e1 + e3 ⊗ e1 ⊗ e2
− e1 ⊗ e3 ⊗ e2 − e2 ⊗ e1 ⊗ e3 − e3 ⊗ e2 ⊗ e1 ,
respectively.
It is worth comparing these antisymmetric vectors to the formula for the determinant of a matrix, which for matrices A ∈ M_2 and B ∈ M_3 takes the forms

    det(A) = a_{1,1}a_{2,2} − a_{1,2}a_{2,1}   and
    det(B) = b_{1,1}b_{2,2}b_{3,3} + b_{1,2}b_{2,3}b_{3,1} + b_{1,3}b_{2,1}b_{3,2} − b_{1,3}b_{2,2}b_{3,1} − b_{1,2}b_{2,1}b_{3,3} − b_{1,1}b_{2,3}b_{3,2},

respectively, and which has the following form in general for matrices C ∈ M_n:

    det(C) = ∑_{σ∈S_n} sgn(σ) c_{1,σ(1)}c_{2,σ(2)} · · · c_{n,σ(n)}.

[Margin note: This formula is stated as Theorem A.1.4 in Appendix A.1.5.]
The fact that the antisymmetric vector in Ann looks so much like the formula for
the determinant of an n × n matrix is no coincidence—we will see in Section 3.2
(Example 3.2.9 in particular) that there is a well-defined sense in which the
determinant “is” this unique antisymmetric vector.
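Analogously to the symmetric case, the projection of Theorem 3.1.12(a) can be checked numerically (again, not part of the text): attach sgn(σ) to each permutation operator and compare the resulting rank with C(n, p). The helpers and the choice n = 4, p = 3 are arbitrary; sgn(σ) is computed here as the determinant of the corresponding p × p permutation matrix.

```python
import numpy as np
from itertools import permutations
from math import factorial, comb

def W_sigma(n, p, perm):
    """Permutation operator on (F^n)^{⊗p} that permutes the p Kronecker factors by perm."""
    N = n ** p
    W = np.zeros((N, N))
    for col in range(N):
        e = np.zeros(N)
        e[col] = 1.0
        W[:, col] = e.reshape((n,) * p).transpose(perm).reshape(N)
    return W

def sign(perm):
    """sgn(σ), computed as the determinant of the p x p permutation matrix of perm."""
    return round(np.linalg.det(np.eye(len(perm))[list(perm)]))

n, p = 4, 3
Q = sum(sign(s) * W_sigma(n, p, s) for s in permutations(range(p))) / factorial(p)

print(np.allclose(Q @ Q, Q), np.allclose(Q.T, Q))    # an orthogonal projection
print(np.linalg.matrix_rank(Q) == comb(n, p))        # rank = dim(A_n^p) = C(n, p)
```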
(b) Show that every eigenvalue of A ⊗ I_n + I_m ⊗ B is the sum of an eigenvalue of A and an eigenvalue of B.

∗∗3.1.8 A Sylvester equation is a matrix equation of the form

    AX + XB = C,

where A ∈ M_m(C), B ∈ M_n(C), and C ∈ M_{m,n}(C) are given, and the goal is to solve for X ∈ M_{m,n}(C).
(a) Show that the equation AX + XB = C is equivalent to (A ⊗ I + I ⊗ B^T)vec(X) = vec(C).
(b) Show that a Sylvester equation has a unique solution if and only if A and −B do not share a common eigenvalue. [Hint: Make use of part (a) and the result of Exercise 3.1.7.]

∗∗3.1.9 Let A ∈ M_m(C) and B ∈ M_n(C).
(a) Show that every eigenvalue of A ⊗ B is of the form λµ for some eigenvalues λ of A and µ of B. [Side note: This exercise is sort of the converse of Theorem 3.1.3(a).]
(b) Use part (a) to show that det(A ⊗ B) = det(A)^n det(B)^m.

3.1.10 Suppose F = R or F = C.
(a) Construct the orthogonal projection onto S_2^3. That is, write this projection down as an 8 × 8 matrix.
(b) Construct the orthogonal projection onto A_2^3.

∗∗3.1.11 Suppose x ∈ F^m ⊗ F^n.
(a) Show that there exist linearly independent sets {v_j} ⊂ F^m and {w_j} ⊂ F^n such that

    x = ∑_{j=1}^{min{m,n}} v_j ⊗ w_j.

(b) Show that if F = R or F = C then the sets {v_j} and {w_j} from part (a) can be chosen to be mutually orthogonal. [Side note: This is sometimes called the Schmidt decomposition of x.]

3.1.12 Compute a Schmidt decomposition (see Exercise 3.1.11) of x = (2, 1, 0, 0, 1, −2) ∈ R^2 ⊗ R^3.

∗∗3.1.13 Show that Theorem 3.1.11 does not hold when F = Z_2 is the field with 2 elements (see Appendix A.4) and n = 2, p = 3.

∗∗3.1.14 Suppose F = R or F = C. Show that if v ∈ S_n^p and w ∈ A_n^p then v · w = 0.

∗∗3.1.15 In this exercise, we complete the proof of Theorem 3.1.10. Let P = (1/p!) ∑_{σ∈S_p} W_σ.
(a) Show that P^T = P.
(b) Show that P^2 = P.

∗∗3.1.16 Show that if {w_1, w_2, . . . , w_k} ⊆ F^n is linearly independent and {v_1, v_2, . . . , v_k} ⊆ F^m is any set then the equation

    ∑_{j=1}^k v_j ⊗ w_j = 0

implies v_1 = v_2 = · · · = v_k = 0.

∗∗3.1.17 Show that if B and C are bases of F^m and F^n, respectively, then the set

    B ⊗ C = {v ⊗ w : v ∈ B, w ∈ C}

is a basis of F^{mn}. [Hint: Use Exercises 1.2.27(a) and 3.1.16.]

∗∗3.1.18 Show that the swap matrix W_{m,n} has the following properties:
(a) Each row and column of W_{m,n} has exactly one non-zero entry, equal to 1.
(b) If F = R or F = C then W_{m,n} is unitary.
(c) If m = n then W_{m,n} is symmetric.

3.1.19 Show that 1 and −1 are the only eigenvalues of the swap matrix W_{n,n}, and the corresponding eigenspaces are S_n^2 and A_n^2, respectively.

∗∗3.1.20 Recall Theorem 3.1.1, which established some of the basic properties of the Kronecker product.
(a) Prove part (a) of the theorem.
(b) Prove part (c) of the theorem.
(c) Prove part (d) of the theorem.

∗∗3.1.21 Recall Theorem 3.1.2, which established some of the ways that the Kronecker product interacts with other matrix operations.
(a) Prove part (b) of the theorem.
(b) Prove part (c) of the theorem.
(c) Prove part (d) of the theorem.

∗∗3.1.22 Prove Theorem 3.1.4.

∗∗3.1.23 Prove Theorem 3.1.5.

∗∗3.1.24 Prove Theorem 3.1.12.

3.1.25 Let 1 ≤ p ≤ ∞ and let ‖·‖_p denote the p-norm from Section 1.D.1. Show that ‖v ⊗ w‖_p = ‖v‖_p‖w‖_p for all v ∈ C^m, w ∈ C^n.

∗∗3.1.26 Let 1 ≤ p, q ≤ ∞ be such that 1/p + 1/q = 1. We now provide an alternate proof of Hölder's inequality (Theorem 1.D.5), which says that

    |v · w| ≤ ‖v‖_p‖w‖_q   for all v, w ∈ C^n.

(a) Explain why it suffices to prove this inequality in the case when ‖v‖_p = ‖w‖_q = 1. Make this assumption throughout the rest of this exercise.
(b) Show that, for each 1 ≤ j ≤ n, either |v_jw_j| ≤ |v_j|^p or |v_jw_j| ≤ |w_j|^q.
(c) Show that |v · w| ≤ ‖v‖_p^p + ‖w‖_q^q = 2.
(d) This is not quite what we wanted (we wanted to show that |v · w| ≤ 1, not 2). To fix this problem, let k ≥ 1 be an integer and replace v and w in part (c) by v^{⊗k} and w^{⊗k}, respectively. What happens as k gets large?
Definition 3.2.1 (Multilinear Transformations). Suppose V_1, V_2, . . . , V_p and W are vector spaces over the same field. A multilinear transformation is a function T : V_1 × V_2 × · · · × V_p → W with the property that, if we fix 1 ≤ j ≤ p and any p − 1 vectors v_i ∈ V_i (1 ≤ i ≤ p, i ≠ j), then the function S : V_j → W defined by

    S(v) = T(v_1, . . . , v_{j−1}, v, v_{j+1}, . . . , v_p)

is a linear transformation.
The above definition is a bit of a mouthful, but the idea is simply that a
multilinear transformation is a function that looks like a linear transformation
on each of its inputs individually. When there are just p = 2 input spaces
we refer to these functions as bilinear transformations, and we note that
bilinear forms (refer back to Section 1.3.3) are the special case that arises
when the output space is W = F. Similarly, we sometimes call a multilinear
transformation with p input spaces a p-linear transformation (much like we
sometimes called multilinear forms p-linear forms).
Solution:
The following facts about the cross product are equivalent to C being
bilinear:
• (v + w) × x = v × x + w × x,
• v × (w + x) = v × w + v × x, and
• (cv) × w = v × (cw) = c(v × w) for all v, w, x ∈ R3 and c ∈ R.
These properties are all straightforward to prove directly from the definition of the cross product, so we just prove the first one here:

    (v + w) × x = ( (v_2 + w_2)x_3 − (v_3 + w_3)x_2, (v_3 + w_3)x_1 − (v_1 + w_1)x_3, (v_1 + w_1)x_2 − (v_2 + w_2)x_1 )
                = (v_2x_3 − v_3x_2, v_3x_1 − v_1x_3, v_1x_2 − v_2x_1) + (w_2x_3 − w_3x_2, w_3x_1 − w_1x_3, w_1x_2 − w_2x_1)
                = v × x + w × x,

as claimed.

[Margin note: We proved these properties in Section 1.A of [Joh20].]
then T× is bilinear. To verify this claim, we just have to recall that matrix
multiplication is both left- and right-distributive, so for any matrices A,
B, and C of appropriate size and any scalar c, we have
spaces, since the output space is trivial). The sum of these two numbers (i.e.,
either p or p + 1, depending on whether W = F or not) is called its order.
For example, matrix multiplication (between two matrices) is a multilinear
transformation of type (2, 1) and order 2 + 1 = 3, and the dot product has type
(2, 0) and order 2 + 0 = 2. We will see shortly that the order of a multilinear
transformation tells us the dimensionality of an array of numbers that should
be used to represent that transformation (just like vectors can be represented
via a 1D list of numbers and linear transformations can be represented via a 2D
array of numbers/a matrix). For now though, we spend some time clarifying
what types of multilinear transformations correspond to which sets of linear
algebraic objects that we are already familiar with:
• Transformations of type (1, 1) are functions T : V → W that act linearly
on V. In other words, they are exactly linear transformations, which we
already know and love. Furthermore, the order of a linear transformation
is 1 + 1 = 2, which corresponds to the fact that we can represent them
via matrices, which are 2D arrays of numbers.
• Transformations of type (1, 0) are linear transformations f : V → F, which are linear forms. The order of a linear form is 1 + 0 = 1, which corresponds to the fact that we can represent them via vectors (via Theorem 1.3.3), which are 1D arrays of numbers. In particular, recall that we think of these as row vectors.

[Margin note: As a slightly more trivial special case, scalars can be thought of as multilinear transformations of type (0, 0).]
• Transformations of type (2, 0) are bilinear forms T : V1 × V2 → F, which
have order 2 + 0 = 2. The order once again corresponds to the fact
(Theorem 1.3.5) that they can be represented naturally by matrices (2D
arrays of numbers).
• Slightly more generally, transformations of type (p, 0) are multilinear
forms T : V1 × · · · × V p → F. The fact that they have order p + 0 = p
corresponds to the fact that we can represent them via Theorem 1.3.6 as
p-dimensional arrays of scalars.
• Transformations of type (0, 1) are linear transformations T : F → W,
which are determined completely by the value of T (1). In particular, for
every such linear transformation T , there exists a vector w ∈ W such that
T (c) = cw for all c ∈ F. The fact that the order of these transformations
is 0 + 1 = 1 corresponds to the fact that we can represent them via the
vector w (i.e., a 1D array of numbers) in this way, though this time we
think of it as a column vector.
We summarize the above special cases, as well the earlier examples of bilin-
ear transformations like matrix multiplication, in Figure 3.1 for easy reference.
[Figure 3.1: A summary of the various multilinear transformations, and their types, that we have encountered so far: type (1, 0) linear forms (row vectors), such as the trace; type (2, 0) bilinear forms (matrices), such as the dot product; type (p, 0) p-linear forms, such as the determinant; type (0, 0) scalars; type (0, 1) column vectors; type (1, 1) linear transformations (matrices); type (2, 1) bilinear transformations, such as the cross product, matrix multiplication, and the Kronecker product; and type (p, 1) multilinear transformations.]

[Margin note: We do not give any explicit examples of type-(0, 0), (0, 1), or (1, 1) transformations here (i.e., scalars, column vectors, or linear transformations) since you are hopefully familiar enough with these objects by now that you can come up with some yourself.]
3.2.2 Arrays
We saw back in Theorem 1.3.6 that we can represent multilinear forms (i.e.,
type-(p, 0) multilinear transformations) on finite-dimensional vector spaces as
multi-dimensional arrays of scalars, much like we represent linear transforma-
tions as matrices. We now show that we can similarly do this for multilinear
transformations of type (p, 1). This fact should not be too surprising—the idea
is simply that each multilinear transformation is determined completely by how
it acts on basis vectors in each of the input arguments.
Theorem 3.2.1 (Multilinear Transformations as Arrays). Suppose V_1, . . . , V_p and W are finite-dimensional vector spaces over the same field and T : V_1 × · · · × V_p → W is a multilinear transformation. Let v_{1,j}, . . ., v_{p,j} denote the j-th coordinate of v_1 ∈ V_1, . . ., v_p ∈ V_p with respect to some bases of V_1, . . ., V_p, respectively, and let {w_1, w_2, . . .} be a basis of W. Then there exists a unique family of scalars {a_{i;j_1,...,j_p}} such that

    T(v_1, . . . , v_p) = ∑_{i,j_1,...,j_p} a_{i;j_1,...,j_p} v_{1,j_1} · · · v_{p,j_p} w_i

for all v_1 ∈ V_1, . . ., v_p ∈ V_p.
Proof. In the p = 1 case, this theorem simply says that, once we fix bases of
V and W, we can represent every linear transformation T : V → W via its
standard matrix, whose (i, j)-entry we denote here by ai; j . This is a fact that we
already know to be true from Theorem 1.2.6. We now prove this more general
result for multilinear transformations via induction on p, and note that linear
transformations provide the p = 1 base case.
For the inductive step, we proceed exactly as we did in the proof of Theorem 1.3.6. Suppose that the result is true for all (p − 1)-linear transformations acting on V_2 × · · · × V_p. If we let B = {x_1, x_2, . . . , x_m} be a basis of V_1 then the inductive hypothesis tells us that the (p − 1)-linear transformations S_{j_1} : V_2 × · · · × V_p → W defined by

    S_{j_1}(v_2, . . . , v_p) = T(x_{j_1}, v_2, . . . , v_p)

can be written as

    T(x_{j_1}, v_2, . . . , v_p) = S_{j_1}(v_2, . . . , v_p) = ∑_{i,j_2,...,j_p} a_{i;j_1,j_2,...,j_p} v_{2,j_2} · · · v_{p,j_p} w_i

for some fixed family of scalars {a_{i;j_1,j_2,...,j_p}}. If we write an arbitrary vector v_1 ∈ V_1 as a linear combination of the basis vectors x_1, x_2, . . . , x_m (i.e., v_1 = v_{1,1}x_1 + v_{1,2}x_2 + · · · + v_{1,m}x_m), it then follows via linearity that

    T(v_1, v_2, . . . , v_p) = T( ∑_{j_1=1}^m v_{1,j_1}x_{j_1}, v_2, . . . , v_p )                                          (v_1 = ∑_{j_1} v_{1,j_1}x_{j_1})
                            = ∑_{j_1=1}^m v_{1,j_1} T(x_{j_1}, v_2, . . . , v_p)                                            (multilinearity of T)
                            = ∑_{j_1=1}^m v_{1,j_1} ∑_{i,j_2,...,j_p} a_{i;j_1,j_2,...,j_p} v_{2,j_2} · · · v_{p,j_p} w_i   (inductive hypothesis)

[Margin note: We use j_1 here instead of just j since it will be convenient later on.]

[Margin note: The scalar a_{i;j_1,j_2,...,j_p} here depends on j_1 (not just i, j_2, . . . , j_p) since each choice of x_{j_1} gives a different (p − 1)-linear transformation S_{j_1}.]
    a_{3;1,2} = 1,   a_{2;1,3} = −1,   a_{3;2,1} = −1,   a_{1;2,3} = 1,   a_{2;3,1} = 1,   a_{1;3,2} = −1,

and a_{i;j_1,j_2} = 0 otherwise. We can arrange these scalars into a 3D array, which we display below with i indexing the rows, j_1 indexing the columns, and j_2 indexing the "layers".

[3D array display omitted: the three 3 × 3 layers j_2 = 1, 2, 3 contain the six nonzero scalars listed above, with all other entries equal to 0.]

[Margin note: Recall that the first subscript in a_{i;j_1,j_2} corresponds to the output of the transformation and the next two subscripts correspond to the inputs.]
[Margin note: If you ever used a mnemonic like

    v × w = det[ e_1 e_2 e_3 ]
               [ v_1 v_2 v_3 ]
               [ w_1 w_2 w_3 ]

to compute v × w, this is why it works.]

It is worth noting that the standard array of the cross product that we computed in the above example is exactly the same as the standard array of the determinant (of a 3 × 3 matrix) that we constructed back in Example 1.3.14. In a sense, the determinant and cross product are the exact same thing, just written in a different way. They are both order-3 multilinear transformations, but the determinant on M_3 is of type (3, 0) whereas the cross product is of type (2, 1), which explains why they have such similar properties (e.g., they can both be used to find the area of parallelograms in R^2 and the volume of parallelepipeds in R^3).
Example 3.2.3 (Matrix Multiplication as a 3D Array). Construct the standard array of the matrix multiplication map T_× : M_2 × M_2 → M_2 with respect to the standard basis of M_2.

Solution:
Since T_× is an order-3 tensor, its corresponding array is 3-dimensional and specifically of size 4 × 4 × 4. To compute the scalars {a_{i;j_1,j_2}}, we again plug the basis vectors (matrices) of the input spaces into T_× and write the results in terms of the basis vectors of the output space. Rather than compute all 16 of the required matrix products explicitly, we simply recall that

    T_×(E_{k,a}, E_{b,ℓ}) = E_{k,a}E_{b,ℓ} = E_{k,ℓ} if a = b, and O otherwise.
We thus have

    a_{1;1,1} = 1,   a_{1;2,3} = 1,   a_{2;1,2} = 1,   a_{2;2,4} = 1,
    a_{3;3,1} = 1,   a_{3;4,3} = 1,   a_{4;3,2} = 1,   a_{4;4,4} = 1,

and a_{i;j_1,j_2} = 0 otherwise. We can arrange these scalars into a 3D array, which we display below with i indexing the rows, j_1 indexing the columns, and j_2 indexing the "layers".

[3D array display omitted: the four 4 × 4 layers j_2 = 1, 2, 3, 4 contain the eight nonzero scalars listed above, with all other entries equal to 0.]

[Margin note: For example, a_{1;2,3} = 1 because E_{1,2} (the 2nd basis vector) times E_{2,1} (the 3rd basis vector) equals E_{1,1} (the 1st basis vector).]
It is worth noting that there is no "standard" order for which of the three subscripts should represent the rows, columns, and layers of a 3D array, and swapping the roles of the subscripts can cause the array to look quite different. For example, if we use i to index the rows, j_2 to index the columns, and j_1 to index the layers, then this array instead looks like the following:

[3D array display omitted: the same eight nonzero scalars, now arranged with j_1 = 1, 2, 3, 4 indexing the layers.]

[Margin note: Even though there is no standard order for which dimensions correspond to the subscripts j_1, . . ., j_k, we always use i (which corresponds to the single output vector space) to index the rows of the standard array.]

[Margin note: One of the author's greatest failings is his inability to draw 4D objects on 2D paper.]
Solution:
We already computed the standard array of C in Example 3.2.2, so we just have to arrange the 27 scalars from that array into its 3 × 9 standard block matrix as follows:

    A = [ 0 0  0   0 0 1   0 −1 0 ]
        [ 0 0 −1   0 0 0   1  0 0 ]
        [ 0 1  0  −1 0 0   0  0 0 ].

[Margin note: For example, the leftmost "−1" here comes from the fact that, in the standard array A of C, a_{2;1,3} = −1 (the subscripts say row 2, block column 1, column 3).]
column 3. As one particular case that requires some special attention, consider what
the standard block matrix of a bilinear form f : V1 × V2 → F looks like. Since
bilinear forms are multilinear forms of order 2, we could arrange the entries
of their standard arrays into a matrix of size dim(V1 ) × dim(V2 ), just as we did
back in Theorem 1.3.5. However, if we follow the method described above for
turning a standard array into a standard block matrix, we notice that we instead
get a 1 × dim(V1 ) block matrix (i.e., row vector) whose entries are themselves
1 × dim(V2 ) row vectors. For example, the standard block matrix {a1; j1 , j2 } of
a bilinear form acting on two 3-dimensional vector spaces would have the form
    [ a_{1;1,1} a_{1;1,2} a_{1;1,3} | a_{1;2,1} a_{1;2,2} a_{1;2,3} | a_{1;3,1} a_{1;3,2} a_{1;3,3} ].
Example 3.2.5 (The Standard Block Matrix of the Dot Product). Construct the standard block matrix of the dot product on R^4 with respect to the standard basis.

Solution:
Recall from Example 1.3.9 that the dot product is a bilinear form and is thus a multilinear transformation of type (2, 0). To compute the entries of its standard block matrix, we compute its value when all 16 possible combinations of standard basis vectors are plugged into it:

    e_{j_1} · e_{j_2} = 1 if j_1 = j_2, and 0 otherwise.

It follows that the standard block matrix of the dot product is the 1 × 16 row vector

    A = [ 1 0 0 0 | 0 1 0 0 | 0 0 1 0 | 0 0 0 1 ].
Notice that the standard block matrix of the dot product that we constructed
in the previous example is simply the identity matrix, read row-by-row. This is
not a coincidence—it follows from the fact that the matrix representation of the
dot product (in the sense of Theorem 1.3.5) is the identity matrix. We now state
this observation slightly more generally and more prominently:
Remark 3.2.2 (Same Order but Different Type). One of the advantages of the standard block matrix over a standard array is that its shape tells us its type (rather than just its order). For example, we saw in Examples 3.2.2 and 1.3.14 that the cross product and the determinant have the same standard arrays. However, their standard block matrices are different—they are matrices of size 3 × 9 and 1 × 27, respectively.

This distinction can be thought of as telling us that the cross product and determinant of a 3 × 3 matrix are the same object (e.g., anything that can be done with the determinant of a 3 × 3 matrix can be done with the cross product, and vice-versa), but the way that they act on other objects is different (the cross product takes in 2 vectors and outputs 1 vector, whereas the determinant takes in 3 vectors and outputs a scalar).

We can similarly construct standard block matrices of tensors (see Remark 3.2.1), and the only additional wrinkle is that they can have block rows too (not just block columns). For example, bilinear forms f : V_1 × V_2 → F of type (2, 0), linear transformations T : V → W of type (1, 1), and type-(0, 2) tensors f : W_1^* × W_2^* → F each give rise to a different shape of standard block matrix.

[Margin note: Similarly, while matrices represent both bilinear forms and linear transformations, the way that they act in those two settings is different (since their types are different but their orders are not).]
In general, just like there are two possible shapes for tensors of order 1
(row vectors and column vectors), there are three possible shapes for
tensors of order 2 (corresponding to types (2, 0), (1, 1), and (0, 2)) and
p + 1 possible shapes for tensors of order p. Each of these different types
corresponds to a different shape of the standard block matrix and a different
way in which the tensor acts on vectors.
Just like the action of a linear transformation on a vector can be represented via multiplication by that transformation's standard matrix, so too can the action of a multilinear transformation on multiple vectors. However, we need one additional piece of machinery to make this statement actually work (after all, what does it even mean to multiply a matrix by multiple vectors?), and that machinery is the Kronecker product.

[Margin note: Refer back to Theorem 1.2.6 for a precise statement about how linear transformations relate to matrix multiplication.]
Theorem 3.2.2 (Standard Block Matrix of a Multilinear Transformation). Suppose V_1, . . ., V_p and W are finite-dimensional vector spaces with bases B_1, . . ., B_p and D, respectively, and T : V_1 × · · · × V_p → W is a multilinear transformation. If A is the standard block matrix of T with respect to those bases, then

    [T(v_1, . . . , v_p)]_D = A( [v_1]_{B_1} ⊗ · · · ⊗ [v_p]_{B_p} )   for all v_1 ∈ V_1, . . . , v_p ∈ V_p.
Before proving this result, we look at some examples to clarify exactly what it is saying. Just as with linear transformations, the main idea here is that we can now represent arbitrary multilinear transformations (on finite-dimensional vector spaces) via explicit matrix calculations. Even more amazingly though, this theorem tells us that the Kronecker product turns multilinear things (the transformation T) into linear things (multiplication by the matrix A). In a sense, the Kronecker product can be used to absorb all of the multilinearity of multilinear transformations, turning them into linear transformations (which are much easier to work with).

[Margin note: We could use notation like [T]_{D←B_1,...,B_p} to denote the standard block matrix of T with respect to the bases B_1, . . ., B_p and D. That seems like a bit much, though.]
Example 3.2.6 (The Cross Product via the Kronecker Product). Verify Theorem 3.2.2 for the cross product C : R^3 × R^3 → R^3 with respect to the standard basis of R^3.

Solution:
Our goal is simply to show that if A is the standard block matrix of the cross product then C(v, w) = A(v ⊗ w) for all v, w ∈ R^3. If we recall this standard block matrix

    A = [ 0 0  0   0 0 1   0 −1 0 ]
        [ 0 0 −1   0 0 0   1  0 0 ]
        [ 0 1  0  −1 0 0   0  0 0 ]

from Example 3.2.4 and note that v ⊗ w = (v_1w_1, v_1w_2, v_1w_3, v_2w_1, v_2w_2, v_2w_3, v_3w_1, v_3w_2, v_3w_3), then we can simply compute

    A(v ⊗ w) = (v_2w_3 − v_3w_2, v_3w_1 − v_1w_3, v_1w_2 − v_2w_1),

which is exactly the cross product C(v, w) = v × w.

[Margin note: In this setting, we interpret v ⊗ w as a column vector, just like we do for v in a matrix equation like Av = b.]
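A one-line NumPy confirmation of this example (not from the text): multiplying the standard block matrix by v ⊗ w reproduces np.cross(v, w) for randomly chosen v and w.

```python
import numpy as np

# Standard block matrix of the cross product (from Example 3.2.4).
A = np.array([[0, 0,  0,  0, 0, 1, 0, -1, 0],
              [0, 0, -1,  0, 0, 0, 1,  0, 0],
              [0, 1,  0, -1, 0, 0, 0,  0, 0]])

rng = np.random.default_rng(4)
v, w = rng.standard_normal(3), rng.standard_normal(3)
print(np.allclose(A @ np.kron(v, w), np.cross(v, w)))   # True
```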
Example 3.2.7 (The Dot Product via the Kronecker Product). Verify Theorem 3.2.2 for the dot product on R^4 with respect to the standard basis.

Solution:
Our goal is to show that if A is the standard block matrix of the dot product (which we computed in Example 3.2.5) then v · w = A(v ⊗ w) for all v, w ∈ R^4. If we recall that this standard block matrix is

    A = [ 1 0 0 0 | 0 1 0 0 | 0 0 1 0 | 0 0 0 1 ],

then we see that

    A(v ⊗ w) = v_1w_1 + v_2w_2 + v_3w_3 + v_4w_4,

which is indeed the dot product v · w.

[Margin note: For space reasons, we do not explicitly list the product A(v ⊗ w)—it is a 1 × 16 matrix times a 16 × 1 matrix. The four terms in the dot product correspond to the four "1"s in A.]
Example 3.2.8 (Matrix Multiplication via the Kronecker Product). Verify Theorem 3.2.2 for matrix multiplication on M_2 with respect to the standard basis.

Solution:
Once again, our goal is to show that if E = {E_{1,1}, E_{1,2}, E_{2,1}, E_{2,2}} is the standard basis of M_2, A is the standard block matrix

    A = [ 1 0 0 0   0 0 1 0   0 0 0 0   0 0 0 0 ]
        [ 0 1 0 0   0 0 0 1   0 0 0 0   0 0 0 0 ]
        [ 0 0 0 0   0 0 0 0   1 0 0 0   0 0 1 0 ]
        [ 0 0 0 0   0 0 0 0   0 1 0 0   0 0 0 1 ],

and

    [B]_E = (b_{1,1}, b_{1,2}, b_{2,1}, b_{2,2})   and   [C]_E = (c_{1,1}, c_{1,2}, c_{2,1}, c_{2,2}),

then A([B]_E ⊗ [C]_E) = [BC]_E.
Example 3.2.9 (The Standard Block Matrix of the Determinant is the Antisymmetric Kronecker Product). Construct the standard block matrix of the determinant (as a multilinear form from F^n × · · · × F^n to F) with respect to the standard basis of F^n.

Solution:
Recall that if A, B ∈ M_n and B is obtained from A by interchanging two of its columns, then det(B) = − det(A). As a multilinear form acting on the columns of a matrix, this means that the determinant is antisymmetric:

    det(v_1, . . . , v_{j_1}, . . . , v_{j_2}, . . . , v_n) = − det(v_1, . . . , v_{j_2}, . . . , v_{j_1}, . . . , v_n).

In terms of the standard block matrix A of the determinant (a 1 × n^n row vector), this says that AW_σv = sgn(σ)Av for all σ ∈ S_n and v ∈ (F^n)^{⊗n}. By using the fact that W_σ^T = W_{σ^{-1}}, we then see that A^T = sgn(σ)W_{σ^{-1}}A^T for all σ ∈ S_n, which exactly means that A^T lives in the antisymmetric subspace of (F^n)^{⊗n}: A^T ∈ A_n^n.

We now recall from Section 3.1.3 that A_n^n is 1-dimensional, and in particular A^T must have the form

    A^T = c ∑_{σ∈S_n} sgn(σ) e_{σ(1)} ⊗ e_{σ(2)} ⊗ · · · ⊗ e_{σ(n)}

for some c ∈ F. Furthermore, since det(I) = 1, we see that A(e_1 ⊗ · · · ⊗ e_n) = 1, which tells us that c = 1. We have thus shown that the standard block matrix of the determinant is

    A = ∑_{σ∈S_n} sgn(σ) e_{σ(1)} ⊗ e_{σ(2)} ⊗ · · · ⊗ e_{σ(n)},

interpreted as a row vector.

[Margin note: This example also shows that the determinant is the only function with the following three properties: multilinearity, antisymmetry, and det(I) = 1.]
Proof of Theorem 3.2.2. There actually is not much that needs to be done to prove this theorem—all of the necessary bits and pieces just fit together via naïve (but messy) direct computation. For ease of notation, we let v_{1,j}, . . ., v_{p,j} denote the j-th entry of [v_1]_{B_1}, . . ., [v_p]_{B_p}, respectively, just like in Theorem 3.2.1.

On the one hand, using the definition of the Kronecker product tells us that, for each integer i, the i-th entry of A([v_1]_{B_1} ⊗ · · · ⊗ [v_p]_{B_p}) is

    [ A([v_1]_{B_1} ⊗ · · · ⊗ [v_p]_{B_p}) ]_i = ∑_{j_1,j_2,...,j_p} a_{i;j_1,j_2,...,j_p} v_{1,j_1}v_{2,j_2} · · · v_{p,j_p}.

On the other hand, this is exactly what Theorem 3.2.1 tells us that the i-th entry of [T(v_1, . . . , v_p)]_D is (in the notation of that theorem, this was the coefficient of w_i, the i-th basis vector in D), which is all we needed to observe to complete the proof.

[Margin note: The notation here is a bit unfortunate: [·]_B refers to a coordinate vector and [·]_i refers to the i-th entry of a vector.]
Operator Norm

The operator norm of a linear transformation T : V → W between finite-dimensional inner product spaces is

    ‖T‖ = max{ ‖T(v)‖ : v ∈ V, ‖v‖ = 1 },

where ‖v‖ refers to the norm of v induced by the inner product on V, and ‖T(v)‖ refers to the norm of T(v) induced by the inner product on W. In the special case when V = W = F^n, this is just the operator norm of a matrix from Section 2.3.3.

[Margin note: A similar definition works if V and W are infinite-dimensional, but we must replace "max" with "sup", and even then the value of the supremum might be ∞.]

To extend this idea to multilinear transformations, we just optimize over unit vectors in each input argument. That is, if T : V_1 × V_2 × · · · × V_p → W is a multilinear transformation between finite-dimensional inner product spaces, then we define its operator norm as follows:

    ‖T‖  def=  max{ ‖T(v_1, v_2, . . . , v_p)‖ : v_j ∈ V_j and ‖v_j‖ = 1 for all 1 ≤ j ≤ p }.
only if its columns form a linearly dependent set. By combining these two
facts, we see that det(v1 , v2 , . . . , vn ) = 0 if and only if {v1 , v2 , . . . , vn } is
linearly dependent. In other words,
    ker(det) = { (v_1, v_2, . . . , v_n) ∈ F^n × · · · × F^n : {v_1, v_2, . . . , v_n} is linearly dependent }.
To get our hands on range(T ) and ker(T ) in general, we can make use
of the standard block matrix A of T , just like we did for the operator norm.
Specifically, making use of Theorem 3.2.2 immediately gives us the following
result:
Theorem 3.2.3 (Range and Kernel in Terms of Standard Block Matrices). Suppose V1, . . . , Vp and W are finite-dimensional vector spaces with bases B1, . . . , Bp and D, respectively, and T : V1 × · · · × Vp → W is a multilinear transformation. If A is the standard block matrix of T with respect to these bases, then

range(T) = { w ∈ W : [w]D = A([v1]B1 ⊗ · · · ⊗ [vp]Bp) for some v1 ∈ V1, . . . , vp ∈ Vp },  and
ker(T) = { (v1, . . . , vp) ∈ V1 × · · · × Vp : A([v1]B1 ⊗ · · · ⊗ [vp]Bp) = 0 }.
That is, the range and kernel of a multilinear transformation consist of the
elementary tensors in the range and null space of its standard block matrix,
respectively (up to fixing bases of the vector spaces so that this statement
actually makes sense). This fact is quite analogous to our earlier observation that
the operator norm of a multilinear transformation T is obtained by maximizing
kAvk over elementary tensors v = v1 ⊗ · · · ⊗ v p .
Example 3.2.14 (Kernel of the Cross Product). Compute the kernel of the cross product C : R3 × R3 → R3 directly, and also via Theorem 3.2.3.

Solution:
To compute the kernel directly, recall that ‖C(v, w)‖ = ‖v‖‖w‖|sin(θ)|, where θ is the angle between v and w (we already used this formula in Example 3.2.10). It follows that C(v, w) = 0 if and only if v = 0, w = 0, or sin(θ) = 0. That is, ker(C) consists of all pairs (v, w) for which v and w lie on a common line.

[Side note: In other words, ker(C) consists of sets {v, w} that are linearly dependent (compare with Example 3.2.13).]

To instead arrive at this result via Theorem 3.2.3, we first recall that C has standard block matrix

A = [ 0  0  0  0  0  1  0 −1  0
      0  0 −1  0  0  0  1  0  0
      0  1  0 −1  0  0  0  0  0 ].

Standard techniques from introductory linear algebra show that the null space of A is the symmetric subspace: null(A) = S_3^2. The kernel of C thus consists of the elementary tensors in S_3^2 (once appropriately re-interpreted just as pairs of vectors, rather than as elementary tensors).
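The following sketch (our own illustration) rebuilds the 3 × 9 standard block matrix above and checks that its null space is 6-dimensional, matching the dimension of the symmetric subspace S_3^2, and that every vector of the form v ⊗ v lies in it.

# Check that null(A) has dimension 6 and contains every v (x) v.
import numpy as np

A = np.array([
    [0, 0, 0, 0, 0, 1, 0, -1, 0],
    [0, 0, -1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, -1, 0, 0, 0, 0, 0],
], dtype=float)

rank = np.linalg.matrix_rank(A)
print(9 - rank)  # nullity = 6 = dim of the symmetric subspace of R^3 (x) R^3

v = np.array([1.0, -2.0, 3.0])
print(np.allclose(A @ np.kron(v, v), 0))  # True, since C(v, v) = 0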
We know from Theorem 3.1.11 that these elementary tensors are ex-
actly the ones of the form c(v ⊗ v) for some v ∈ R3 . Since c(v ⊗ v) =
(cv) ⊗ v = v ⊗ (cv), it follows that ker(C) consists of all pairs (v, w) for
Rank
Recall that the rank of a linear transformation is the dimension of its range.
Since the range of a multilinear transformation is not necessarily a subspace,
we cannot use this same definition in this more general setting. Instead, we
treat the rank-one sum decomposition (Theorem A.1.3) as the definition of the
rank of a matrix, and we then generalize it to linear transformations and then
multilinear transformations.
Once we do this, we see that one way of expressing the rank of a linear
transformation T : V → W is as the minimal integer r such that there exist linear
forms f1 , f2 , . . . , fr : V → F and vectors w1 , . . . , wr ∈ W with the property that
T(v) = ∑_{j=1}^r f_j(v) w_j    for all v ∈ V.    (3.2.1)

[Side note: Notice that Equation (3.2.1) implies that the range of T is contained within the span of w1, . . . , wr, so rank(T) ≤ r.]

Indeed, this way of defining the rank of T makes sense if we recall that, in the case when V = Fn and W = Fm, each f_j looks like a row vector: there exist vectors x1, . . . , xr ∈ Fn such that f_j(v) = x_j^T v for all 1 ≤ j ≤ r. In this case, Equation (3.2.1) says that

T(v) = ∑_{j=1}^r (x_j^T v) w_j = (∑_{j=1}^r w_j x_j^T) v    for all v ∈ Fn,
which simply means that the standard matrix of T is ∑rj=1 w j xTj . In other words,
Equation (3.2.1) just extends the idea of a rank-one sum decomposition from
matrices to linear transformations.
To generalize Equation (3.2.1) to multilinear transformations, we just intro-
duce additional linear forms—one for each of the input vector spaces.
By iterating this procedure, we can compute the product of large matrices much more quickly than we can via the definition of matrix multiplication. For example, computing the product of two 4 × 4 matrices directly from the definition of matrix multiplication requires 64 scalar multiplications. However, if we partition those matrices as 2 × 2 block matrices and multiply them via the clever method above, we just need to perform 7 multiplications of 2 × 2 matrices, each of which can be implemented via 7 scalar multiplications in the same way, for a total of 7² = 49 scalar multiplications.

[Side note: To apply Strassen's algorithm to matrices of size that is not a power of 2, just pad them with rows and columns of zeros as necessary.]

This faster method of matrix multiplication is called Strassen's algorithm, and it requires only O(n^{log2(7)}) ≈ O(n^{2.8074}) scalar operations, versus the standard matrix multiplication algorithm's O(n³) scalar operations. The fact that we can multiply two 2 × 2 matrices together via just 7 scalar multiplications (i.e., the fact that Strassen's algorithm exists) follows immediately from the 7-term rank sum decomposition (†) for the 2 × 2 matrix multiplication transformation. After all, the matrices M_j (1 ≤ j ≤ 7) are simply the products f_{1,j}(A) f_{2,j}(B).

[Side note: The expressions like "O(n³)" here are examples of big-O notation. For example, "O(n³) operations" means "no more than Cn³ operations, for some scalar C".]

Similarly, finding clever rank sum decompositions of larger matrix multiplication transformations leads to even faster algorithms for matrix multiplication, and these techniques have been used to construct an algorithm that multiplies two n × n matrices in O(n^{2.3729}) scalar operations. One of the most important open questions in all of linear algebra asks whether or not, for every ε > 0, there exists such an algorithm that requires only O(n^{2+ε}) scalar operations.

[Side note: Even the rank of the 3 × 3 matrix multiplication transformation is currently unknown—all we know is that it is between 19 and 23, inclusive. See [Sto10] and references therein for details.]
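The following sketch (ours; the seven products below follow one standard labeling, which may differ from the labeling used in the decomposition (†) above) implements the recursive idea for matrices whose size is a power of 2.

# A minimal recursive Strassen multiplication for n x n matrices with n a
# power of 2 (illustration only; no padding or cut-off optimizations).
import numpy as np

def strassen(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B
    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]

    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(2)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
print(np.allclose(strassen(A, B), A @ B))  # True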
A = ∑_{j=1}^r w_j (v_j^{(1)} ⊗ v_j^{(2)} ⊗ · · · ⊗ v_j^{(p)})^T,

we learn immediately the important fact that the rank of a multilinear transformation is bounded from below by the rank of its standard block matrix (after all, every rank-one sum decomposition of the form described by Theorem 3.2.4 is also of the form described by Theorem A.1.3, but not vice-versa):
[Side note: This function is called the permanent, and its formula is similar to that of the determinant, but with the signs of permutations ignored.]

3.2.2 Compute the standard block matrix (with respect to the standard bases of the given spaces) of each of the following multilinear transformations:
(a) The function T : R2 × R2 → R2 defined by T(v, w) = (2v1w1 − v2w2, v1w1 + 2v2w1 + 3v2w2).
∗(b) The function T : R3 × R2 → R3 defined by T(v, w) = (v2w1 + 2v1w2 + 3v3w1, v1w1 + v2w2 + v3w1, v3w2 − v1w1).
∗(c) The Kronecker product T⊗ : R2 × R2 → R4 (i.e., the bilinear transformation defined by T⊗(v, w) = v ⊗ w).
(d) Given a fixed matrix X ∈ M2, the function TX : M2 × R2 → R2 defined by TX(A, v) = AXv.

3.2.3 Determine which of the following statements are true and which are false.
∗(a) The dot product D : Rn × Rn → R is a multilinear transformation of type (n, n, 1).
(b) The multilinear transformation from Exercise 3.2.2(b) has type (2, 1).
∗(c) The standard block matrix of the matrix multiplication transformation T× : Mm,n × Mn,p → Mm,p has size mn²p × mp.
(d) The standard block matrix of the Kronecker product T⊗ : Rm × Rn → Rmn (i.e., the bilinear transformation defined by T⊗(v, w) = v ⊗ w) is the mn × mn identity matrix.

3.2.5 Suppose T× : Mm,n × Mn,p → Mm,p is the bilinear transformation that multiplies two matrices. Compute ‖T×‖.
[Hint: Keep in mind that the norm induced by the inner product on Mm,n is the Frobenius norm. What inequalities do we know involving that norm?]

3.2.6 Show that the operator norm of a multilinear transformation is in fact a norm (i.e., satisfies the three properties of Definition 1.D.1).

3.2.7 Show that the range of a multilinear form f : V1 × · · · × Vp → F is always a subspace of F (i.e., range(f) = {0} or range(f) = F).
[Side note: Recall that this is not true of multilinear transformations.]

∗3.2.8 Describe the range of the cross product (i.e., the bilinear transformation C : R3 × R3 → R3 from Example 3.2.1).

∗3.2.10 Suppose T× : Mm,n × Mn,p → Mm,p is the bilinear transformation that multiplies two matrices. Show that mp ≤ rank(T×) ≤ mnp.

∗3.2.11 We motivated many properties of multilinear transformations so as to generalize properties of linear transformations. In this exercise, we show that they also generalize properties of bilinear forms.
Suppose V and W are finite-dimensional vector spaces over a field F, f : V × W → F is a bilinear form, and A is the matrix associated with f via Theorem 1.3.5.
(a) Show that rank(f) = rank(A).
(b) Show that if F = R or F = C, and we choose the bases B and C in Theorem 1.3.5 to be orthonormal, then ‖f‖ = ‖A‖.

3.2.12 Let D : Rn × Rn → R be the dot product.
(a) Show that ‖D‖ = 1.
(b) Show that rank(D) = n.

∗∗3.2.13 Let C : R3 × R3 → R3 be the cross product.
(a) Construct the "naïve" rank sum decomposition that shows that rank(C) ≤ 6.
[Hint: Mimic Example 3.2.15.]
(b) Show that C(v, w) = ∑_{j=1}^5 f_{1,j}(v) f_{2,j}(w) x_j for all v, w ∈ R3, and thus rank(C) ≤ 5, where
f_{1,1}(v) = v1        f_{1,2}(v) = v1 + v3
f_{1,3}(v) = −v2       f_{1,4}(v) = v2 + v3
f_{1,5}(v) = v2 − v1
x1 = e1 + e3           x2 = e1
x3 = e2 + e3           x4 = e2
x5 = e1 + e2 + e3.
[Side note: It is actually the case that rank(C) = 5, but this is quite difficult to show.]

3.2.14 Use the rank sum decomposition from Example 3.2.13(b) to come up with a formula for the cross product C : R3 × R3 → R3 that involves only 5 real number multiplications, rather than 6.
[Hint: Mimic Remark 3.2.3.]

∗3.2.15 Let det : Rn × · · · × Rn → R be the determinant.
(a) Suppose n = 2. Compute rank(det).
(b) Suppose n = 3. What is rank(det)? You do not need to rigorously justify your answer—it is given away in one of the other exercises in this section.
[Side note: rank(det) is not known in general. For example, if n = 5 then the best bounds we know are 17 ≤ rank(det) ≤ 20.]

3.2.16 Provide an example to show that in the rank sum decomposition of Theorem 3.2.4, if r = rank(T) then the sets of vectors {wj}_{j=1}^r ⊂ FdW and {v_{1,j}}_{j=1}^r ⊂ Fd1, . . ., {v_{k,j}}_{j=1}^r ⊂ Fdk cannot necessarily be chosen to be linearly independent (in contrast with Theorem A.1.3, where they could).
[Hint: Consider one of the multilinear transformations whose rank we mentioned in this section.]
We now introduce an operation, called the tensor product, that combines two
vector spaces into a new vector space. This operation can be thought of as a
generalization of the Kronecker product that not only lets us multiply together
vectors (from Fn ) and matrices of different sizes, but also allows us to multiply
together vectors from any vector spaces. For example, we can take the tensor
product of a polynomial from P 3 with a matrix from M2,7 while retaining a
vector space structure. Perhaps more usefully though, it provides us with a
systematic way of treating multilinear transformations as linear transformations,
without having to explicitly construct a standard block matrix representation as
in Theorem 3.2.2.
Definition 3.3.1 (Tensor Product). Suppose V and W are vector spaces over a field F. Their tensor product is the (unique up to isomorphism) vector space V ⊗ W, also over the field F, with vectors and operations satisfying the following properties:
a) For every pair of vectors v ∈ V and w ∈ W, there is an associated vector (called an elementary tensor) v ⊗ w ∈ V ⊗ W, and every vector in V ⊗ W can be written as a linear combination of these elementary tensors.
b) Vector addition satisfies
   v ⊗ (w + y) = (v ⊗ w) + (v ⊗ y)   and
   (v + x) ⊗ w = (v ⊗ w) + (x ⊗ w)   for all v, x ∈ V, w, y ∈ W.
c) Scalar multiplication satisfies c(v ⊗ w) = (cv) ⊗ w = v ⊗ (cw) for all c ∈ F, v ∈ V, and w ∈ W.
d) For every vector space X over F and every bilinear transformation T : V × W → X, there exists a linear transformation S : V ⊗ W → X such that
   T(v, w) = S(v ⊗ w)   for all v ∈ V and w ∈ W.

[Side note: Unique "up to isomorphism" means that all vector spaces satisfying these properties are isomorphic to each other. Property (d) is the universal property that forces V ⊗ W to be as large as possible.]
The way to think about property (d) above (the universal property) is that
it forces V ⊗ W to be so large that we can squash it down (via the linear
transformation S) onto any other vector space X that is similarly constructed
from V and W in a bilinear way. In this sense, the tensor product can be thought
of as “containing” (up to isomorphism) every bilinear combination of vectors
from V and W.
To make the tensor product seem more concrete, it is good to keep the
Kronecker product in the back of our minds as the motivating example. For
example, Fm ⊗ Fn = Fmn and Mm,n ⊗ Mp,q = Mmp,nq contain all vectors of the form v ⊗ w and all matrices of the form A ⊗ B, respectively, as well as their linear combinations (this is property (a) of the above definition). Properties (b) and (c) of the above definition are then just chosen to be analogous to the properties that we proved for the Kronecker product in Theorem 3.1.1, and property (d) gets us all linear combinations of Kronecker products (rather than just some subspace of them). In fact, Theorem 3.2.2 is essentially equivalent to property (d) in this case: the standard matrix of S equals the standard block matrix of T.

[Side note: Throughout this entire section, every time we encounter a new theorem, ask yourself what it means for the Kronecker product of vectors or matrices.]
We now work through another example to try to start building up some
intuition for how the tensor product works more generally.
where there are only finitely many non-zero terms in the sum. Since each term x^j y^k is an elementary tensor (it is x^j ⊗ y^k), it follows that every h ∈ P2 is a linear combination of elementary tensors, as desired.

[Side note: For example, if f(x) = 2x² − 4 and g(x) = x³ + x − 1 then (f ⊗ g)(x, y) = f(x)g(y) = (2x² − 4)(y³ + y − 1).]

b), c) These properties follow almost immediately from bilinearity of usual multiplication: if f, g, h ∈ P and c ∈ R then

(f ⊗ (g + ch))(x, y) = f(x)(g(y) + ch(y))
                     = f(x)g(y) + c f(x)h(y)
                     = (f ⊗ g)(x, y) + c(f ⊗ h)(x, y)
SY →Z (v ⊗Y w) = v ⊗Z w for all v ∈ V, w ∈ W.
Theorem 3.3.1 (Bases of Tensor Products). Suppose V and W are vector spaces over the same field with bases B and C, respectively. Then their tensor product V ⊗ W exists and has the following set as a basis:

B ⊗ C := { e ⊗ f | e ∈ B, f ∈ C }.
Proof. We start with the trickiest part of this proof to get our heads around—we
simply define B ⊗C to be a linearly independent set and V ⊗ W to be its span.
That is, we are not defining V ⊗ W via Definition 3.3.1, but rather we are
defining it as the span of a collection of vectors, and then we will show that
it has all of the properties of Definition 3.3.1 (thus establishing that a vector
space with those properties really does exist). Before proceeding, we make
some points of clarification:
• Throughout this proof, we are just thinking of "e ⊗ f" as a symbol that means nothing more than "the ordered pair of vectors e and f".
• Similarly, we do not yet know that V ⊗ W really is the tensor product of V and W, but rather we are just thinking of it as the vector space that has B ⊗ C as a basis:

V ⊗ W = { ∑_{i,j} c_{i,j} (ei ⊗ fj) : c_{i,j} ∈ F, ei ∈ B, fj ∈ C for all i, j },

[Side note: If you are uncomfortable with defining V ⊗ W to be a vector space made up of abstract symbols, look ahead to Remark 3.3.1, which might clear things up a bit.]
Remark 3.3.1 The tensor product is actually quite analogous to the external direct sum
The Direct Sum and from Section 1.B.3. Just like the direct sum V ⊕ W can be thought of
Tensor Product as a way of constructing a new vector space by “adding” two vector
spaces V and W, the tensor product V ⊗ W can be thought of as a way of
constructing a new vector space by “multiplying” V and W.
More specifically, both of these vector spaces V ⊕ W and V ⊗ W are
constructed out of ordered pairs of vectors from V and W, along with
a specification of how to add those ordered pairs and multiply them by
scalars. In V ⊕ W we denoted those ordered pairs by (v, w), whereas in
V ⊗ W we denote those ordered pairs by v ⊗ w. The biggest difference
between these two constructions is that these ordered pairs are the only
members of V ⊕ W, whereas some members of V ⊗ W are just linear
combinations of these ordered pairs.
This analogy between the direct sum and tensor product is even more
explicit if we look at what they do to bases B and C of V and W, respec-
tively: a basis of V ⊕ W is the disjoint union B ∪ C, whereas a basis of
V ⊗ W is the Cartesian product B ×C (up to relabeling the members of
these sets appropriately). For example, if V = R2 and W = R3 have bases
B = {v1, v2} and C = {v3, v4, v5}, respectively, then R2 ⊕ R3 ≅ R5 has basis

{(v1, 0), (v2, 0), (0, v3), (0, v4), (0, v5)},

which is just the disjoint union of B and C, but with each vector turned into an ordered pair so that this statement makes sense. Similarly, R2 ⊗ R3 ≅ R6 has basis

{v1 ⊗ v3, v1 ⊗ v4, v1 ⊗ v5, v2 ⊗ v3, v2 ⊗ v4, v2 ⊗ v5},

which is just the Cartesian product of B and C, but with each ordered pair written as an elementary tensor so that this statement makes sense.

[Side note: In this basis of R2 ⊕ R3, each subscript from B and C appears exactly once. In the basis of R2 ⊗ R3, each ordered pair of subscripts from B and C appears exactly once.]
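As a concrete illustration (ours, using the standard basis vectors of R2 and R3 in place of the abstract v1, v2 and v3, v4, v5), the six Kronecker products really do form a basis of R6.

# Build the six vectors v_i (x) w_j from the standard bases of R^2 and R^3
# and confirm that they are linearly independent (illustration only).
import numpy as np

B = np.eye(2)   # basis of R^2
C = np.eye(3)   # basis of R^3

vectors = [np.kron(v, w) for v in B for w in C]
M = np.column_stack(vectors)          # 6 x 6 matrix
print(np.linalg.matrix_rank(M))       # 6, so the vectors form a basis of R^6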
Even after working through the proof of Theorem 3.3.1, tensor products
still likely feel very abstract—how do we actually construct them? The reason
that they feel this way is that they are only defined up to isomorphism, so it’s
somewhat impossible to say what they look like. We could say that Fm ⊗ Fn =
Fmn and that v ⊗ w is just the Kronecker product of v ∈ Fm and w ∈ Fn , but
we could just as well say that Fm ⊗ Fn = Mm,n (F) and that v ⊗ w = vwT
is the outer product of v and w. After all, these spaces are isomorphic via
vectorization/matricization, so they look the same when we just describe what
linear algebraic properties they satisfy.
For this reason, when we construct “the” tensor product of two vector
spaces, we have some freedom in how we represent it. In the following
Example 3.3.2 (The Tensor Product of Matrices and Polynomials). Describe the vector space M2(R) ⊗ P 2 and its elementary tensors.

Solution:
If we take inspiration from the Kronecker product, which places a copy of one vector space on each basis vector (i.e., entry) of the other vector space, it seems natural to guess that we can represent M2 ⊗ P 2 as the vector space of 2 × 2 matrices whose entries are polynomials of degree at most 2. That is, we guess that

M2 ⊗ P 2 = { [ f11(x)  f12(x)
               f21(x)  f22(x) ] : fi,j ∈ P 2 for 1 ≤ i, j ≤ 2 }.

[Side note: Recall that P 2 is the space of polynomials of degree at most 2.]

Equivalently,

M2 ⊗ P 2 = { f : R → M2 | f(x) = Ax² + Bx + C for some A, B, C ∈ M2 }.
Theorem 3.3.3 (Associativity of the Tensor Product). If V, W, and X are vector spaces over the same field then

(V ⊗ W) ⊗ X ≅ V ⊗ (W ⊗ X).

[Side note: Recall that "≅" means "is isomorphic to".]

Proof. For each x ∈ X, define a bilinear map Tx : V × W → V ⊗ (W ⊗ X) by Tx(v, w) = v ⊗ (w ⊗ x). Since Tx is bilinear, the universal property of the tensor product (i.e., property (d) of Definition 3.3.1) says that there exists a linear transformation Sx : V ⊗ W → V ⊗ (W ⊗ X) that acts via Sx(v ⊗ w) = Tx(v, w) = v ⊗ (w ⊗ x).

Next, define a bilinear map T̃ : (V ⊗ W) × X → V ⊗ (W ⊗ X) via T̃(u, x) = Sx(u) for all u ∈ V ⊗ W and x ∈ X. By using the universal property again, we see that there exists a linear transformation S̃ : (V ⊗ W) ⊗ X → V ⊗ (W ⊗ X) that acts via S̃(u ⊗ x) = T̃(u, x) = Sx(u). If u = v ⊗ w then this says that

S̃((v ⊗ w) ⊗ x) = Sx(v ⊗ w) = v ⊗ (w ⊗ x).

This argument can also be reversed to find a linear transformation that sends v ⊗ (w ⊗ x) to (v ⊗ w) ⊗ x, so S̃ is invertible and thus an isomorphism.
Now that we know that the tensor product is associative, we can unambigu-
ously refer to the tensor product of 3 vector spaces V, W, and X as V ⊗ W ⊗ X ,
since it does not matter whether we take this to mean (V ⊗ W) ⊗ X or V ⊗
(W ⊗ X ). The same is true when we tensor together 4 or more vector spaces,
and in this setting we say that an elementary tensor in V1 ⊗ V2 ⊗ · · · ⊗ V p is a
vector of the form v1 ⊗ v2 ⊗ · · · ⊗ v p , where v j ∈ V j for each 1 ≤ j ≤ p.
It is similarly the case that the tensor product is commutative in the sense that V ⊗ W ≅ W ⊗ V (see Exercise 3.3.10). For example, if V = Fm and W = Fn then the swap operator Wm,n of Definition 3.1.3 is an isomorphism from V ⊗ W = Fm ⊗ Fn to W ⊗ V = Fn ⊗ Fm. More generally, higher-order tensor product spaces V1 ⊗ V2 ⊗ · · · ⊗ Vp are also isomorphic to the tensor product of the same spaces in any other order. That is, if σ : {1, 2, . . . , p} → {1, 2, . . . , p} is a permutation then

V1 ⊗ V2 ⊗ · · · ⊗ Vp ≅ Vσ(1) ⊗ Vσ(2) ⊗ · · · ⊗ Vσ(p).

[Side note: Keep in mind that V ⊗ W ≅ W ⊗ V does not mean that v ⊗ w = w ⊗ v for all v ∈ V, w ∈ W. Rather, it just means that there is an isomorphism that sends each v ⊗ w to w ⊗ v.]
Again, if each of these vector spaces is Fn (and thus the tensor product is the
Kronecker product) then the swap matrix Wσ introduced in Section 3.1.3 is the
standard isomorphism between these spaces.
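The sketch below (ours; the entrywise construction of the swap matrix is one standard way to build it, since Definition 3.1.3 itself is not reproduced here) verifies the defining property W(v ⊗ w) = w ⊗ v numerically.

# Construct a swap matrix W with W (v (x) w) = w (x) v and verify the
# property on random vectors (illustration only).
import numpy as np

def swap_matrix(m, n):
    W = np.zeros((m * n, m * n))
    Im, In = np.eye(m), np.eye(n)
    for i in range(m):
        for j in range(n):
            W += np.outer(np.kron(In[j], Im[i]), np.kron(Im[i], In[j]))
    return W

m, n = 2, 3
W = swap_matrix(m, n)
rng = np.random.default_rng(3)
v, w = rng.standard_normal(m), rng.standard_normal(n)
print(np.allclose(W @ np.kron(v, w), np.kron(w, v)))  # True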
We motivated the tensor product as a generalization of the Kronecker
product that applies to any vector spaces (as opposed to just Fn and/or Mm,n ).
We now note that if we fix bases of the vector spaces that we are working
with then the tensor product really does look like the Kronecker product of
coordinate vectors, as we would hope:
Theorem 3.3.4 (Kronecker Product of Coordinate Vectors). Suppose V1, V2, . . ., Vp are finite-dimensional vector spaces over the same field with bases B1, B2, . . ., Bp, respectively. Then

B1 ⊗ B2 ⊗ · · · ⊗ Bp := { b^{(1)} ⊗ b^{(2)} ⊗ · · · ⊗ b^{(p)} | b^{(j)} ∈ Bj for all 1 ≤ j ≤ p }

is a basis of V1 ⊗ V2 ⊗ · · · ⊗ Vp, and [v1 ⊗ v2 ⊗ · · · ⊗ vp]_{B1⊗B2⊗···⊗Bp} = [v1]B1 ⊗ [v2]B2 ⊗ · · · ⊗ [vp]Bp for all vj ∈ Vj.

For example, if B1 = {v1, v2} and B2 = {w1, w2, w3} then

B1 ⊗ B2 = {v1 ⊗ w1, v1 ⊗ w2, v1 ⊗ w3, v2 ⊗ w1, v2 ⊗ w2, v2 ⊗ w3}.

It might sometimes be convenient to instead arrange the entries of the coordinate vector [v1 ⊗ v2 ⊗ · · · ⊗ vp]_{B1⊗B2⊗···⊗Bp} into a p-dimensional dim(V1) × dim(V2) × · · · × dim(Vp) array, rather than a long dim(V1) dim(V2) · · · dim(Vp)-entry vector as we did here. What the most convenient representation of v1 ⊗ v2 ⊗ · · · ⊗ vp is depends heavily on context—what the tensor product represents and what we are trying to do with it.
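A brief illustration of this rearrangement (ours): with the standard bases, the coordinate vector of v1 ⊗ v2 ⊗ v3 is the Kronecker product of the coordinate vectors, and reshaping it recovers the 3-dimensional array of products.

# The coordinates of v1 (x) v2 (x) v3 can be stored either as a long vector
# or as a dim(V1) x dim(V2) x dim(V3) array; reshaping converts between them.
import numpy as np

rng = np.random.default_rng(4)
v1, v2, v3 = rng.standard_normal(2), rng.standard_normal(3), rng.standard_normal(4)

long_vector = np.kron(np.kron(v1, v2), v3)   # length 2*3*4 = 24
array_form = long_vector.reshape(2, 3, 4)    # 3-dimensional array

# Entry (i, j, k) of the array is just the product v1[i] * v2[j] * v3[k].
print(np.allclose(array_form, np.einsum('i,j,k->ijk', v1, v2, v3)))  # True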
Definition 3.3.2 (Tensor Rank). Suppose V1, V2, . . . , Vp are finite-dimensional vector spaces over the same field and v ∈ V1 ⊗ V2 ⊗ · · · ⊗ Vp. The tensor rank (or simply the rank) of v, denoted by rank(v), is the minimal integer r such that v can be written as a sum of r elementary tensors:

v = ∑_{i=1}^r v_i^{(1)} ⊗ v_i^{(2)} ⊗ · · · ⊗ v_i^{(p)},

where v_i^{(j)} ∈ Vj for each 1 ≤ i ≤ r and 1 ≤ j ≤ p.
The tensor rank generalizes the rank of a matrix in the following sense: if p = 2 and V1 = Fm and V2 = Fn, then Fm ⊗ Fn ≅ Mm,n(F), and the tensor rank in this space really is just the usual matrix rank. After all, when we represent Fm ⊗ Fn in this way, its elementary tensors are the matrices v ⊗ w = vwT, and we know from Theorem A.1.3 that the rank of a matrix is the smallest number of these rank-1 matrices needed to sum to it.

In fact, a similar argument shows that when all of the spaces are finite-dimensional, the tensor rank is equivalent to the rank of a multilinear transformation. After all, we showed in Theorem 3.2.4 that the rank of a multilinear transformation T is the least integer r such that its standard block matrix A can be written in the form

A = ∑_{j=1}^r w_j (v_j^{(1)} ⊗ v_j^{(2)} ⊗ · · · ⊗ v_j^{(p)})^T.

Well, the vectors in the above sum are exactly the elementary tensors if we represent Fd1 ⊗ Fd2 ⊗ · · · ⊗ Fdp ⊗ FdW as M_{dW, d1d2···dp}(F) in the natural way.
Flattenings and Bounds
Since the rank of a multilinear transformation is difficult to compute, so is tensor rank. However, there are a few bounds that we can use to help narrow it down somewhat. The simplest of these bounds comes from just forgetting about part of the tensor product structure of V1 ⊗ V2 ⊗ · · · ⊗ Vp. That is, if we let {S1, S2, . . . , Sk} be any partition of {1, 2, . . . , p} (i.e., S1, S2, . . ., Sk are sets such that S1 ∪ S2 ∪ · · · ∪ Sk = {1, 2, . . . , p} and Si ∩ Sj = {} whenever i ≠ j) then

V1 ⊗ V2 ⊗ · · · ⊗ Vp ≅ (⊗_{i∈S1} Vi) ⊗ (⊗_{i∈S2} Vi) ⊗ · · · ⊗ (⊗_{i∈Sk} Vi).

[Side note: The notation ⊗_{i∈Sj} Vi means the tensor product of each Vi where i ∈ Sj. It is analogous to big-Σ notation for sums.]

To be clear, we are thinking of the vector space on the right as a tensor product of just k vector spaces—its elementary tensors are the ones of the form v1 ⊗ v2 ⊗ · · · ⊗ vk, where vj ∈ ⊗_{i∈Sj} Vi for each 1 ≤ j ≤ k.
Perhaps the most natural isomorphism between these two spaces comes from recalling that we can write each v ∈ V1 ⊗ V2 ⊗ · · · ⊗ Vp as a sum of elementary tensors

v = ∑_{j=1}^r v_j^{(1)} ⊗ v_j^{(2)} ⊗ · · · ⊗ v_j^{(p)},

and if we simply regroup those products we get the following vector ṽ ∈ (⊗_{i∈S1} Vi) ⊗ (⊗_{i∈S2} Vi) ⊗ · · · ⊗ (⊗_{i∈Sk} Vi):

ṽ = ∑_{j=1}^r (⊗_{i∈S1} v_j^{(i)}) ⊗ (⊗_{i∈S2} v_j^{(i)}) ⊗ · · · ⊗ (⊗_{i∈Sk} v_j^{(i)}).
We call ṽ a flattening of v, and we note that the rank of ṽ never exceeds the
rank of v, simply because the procedure that we used to construct ṽ from v
turns a sum of r elementary tensors into another sum of r elementary tensors.
We state this observation as the following theorem:
Theorem 3.3.5 Suppose V1 , V2 , . . . , V p are finite-dimensional vector spaces over the same
Tensor Rank of field and v ∈ V1 ⊗ V2 ⊗ · · · ⊗ V p . If ṽ is a flattening of v then
Flattenings
rank(v) ≥ rank(ṽ).
We emphasize that the opposite inequality does not hold in general, since some of the elementary tensors in (⊗_{i∈S1} Vi) ⊗ (⊗_{i∈S2} Vi) ⊗ · · · ⊗ (⊗_{i∈Sk} Vi) do not correspond to elementary tensors in V1 ⊗ V2 ⊗ · · · ⊗ Vp, since the former space is a "coarser" tensor product of just k ≤ p spaces.

[Side note: Flattenings are most useful when k = 2, as we can then easily compute their rank by thinking of them as matrices.]
For example, if v ∈ Fd1 ⊗ Fd2 ⊗ Fd3 ⊗ Fd4 then we obtain flattenings of
v by just grouping some of these tensor product factors together. There are
many ways to do this, so these flattenings of v may live in many different
spaces, some of which are listed below along with the partition {S1 , S2 , . . . , Sk }
of {1, 2, 3, 4} that they correspond to:
Example 3.3.3 (Computing Tensor Rank Bounds via Flattenings). Show that the following vector v ∈ (C2)⊗4 has tensor rank 4:

v = e1 ⊗ e1 ⊗ e1 ⊗ e1 + e1 ⊗ e2 ⊗ e1 ⊗ e2 + e2 ⊗ e1 ⊗ e2 ⊗ e1 + e2 ⊗ e2 ⊗ e2 ⊗ e2.

Solution:
It is clear that rank(v) ≤ 4, since v was provided to us as a sum of 4 elementary tensors. To compute lower bounds of rank(v), we could use any of its many flattenings. For example, if we choose the partition (C2)⊗4 ≅ C2 ⊗ C8 then we get the following flattening ṽ of v:

ṽ = e1 ⊗ (e1 ⊗ e1 ⊗ e1) + e1 ⊗ (e2 ⊗ e1 ⊗ e2) + e2 ⊗ (e1 ⊗ e2 ⊗ e1) + e2 ⊗ (e2 ⊗ e2 ⊗ e2)
  = e1 ⊗ e1 + e1 ⊗ e6 + e2 ⊗ e3 + e2 ⊗ e8,

[Side note: In the final line here, the first factor of terms like e1 ⊗ e6 lives in C2 while the second lives in C8.]
There are many situations where none of a vector’s flattenings have rank
equal to the vector itself, so it may be the case that none of the lower bounds
obtained in this way are tight. We thus introduce one more way of bounding
tensor rank that is sometimes a bit stronger than these bounds based on flatten-
ings. The rough idea behind this bound is that if f ∈ V1∗ is a linear form then the linear transformation f ⊗ I ⊗ · · · ⊗ I : V1 ⊗ V2 ⊗ · · · ⊗ Vp → V2 ⊗ · · · ⊗ Vp defined by

(f ⊗ I ⊗ · · · ⊗ I)(v1 ⊗ v2 ⊗ · · · ⊗ vp) = f(v1)(v2 ⊗ · · · ⊗ vp)

sends elementary tensors to elementary tensors, and can thus be used to help us investigate tensor rank. We note that the universal property of the tensor product (Definition 3.3.1(d)) tells us that this function f ⊗ I ⊗ · · · ⊗ I actually exists and is well-defined: existence of f ⊗ I ⊗ · · · ⊗ I follows from first constructing a multilinear transformation g : V1 × V2 × · · · × Vp → V2 ⊗ · · · ⊗ Vp that acts via g(v1, v2, . . . , vp) = f(v1)(v2 ⊗ · · · ⊗ vp) and then using the universal property to see that there exists a function f ⊗ I ⊗ · · · ⊗ I satisfying (f ⊗ I ⊗ · · · ⊗ I)(v1 ⊗ v2 ⊗ · · · ⊗ vp) = g(v1, v2, . . . , vp).

[Side note: We only specified how f ⊗ I ⊗ · · · ⊗ I acts on elementary tensors here. How it acts on the rest of the space is determined via linearity.]
Theorem 3.3.6 (Another Lower Bound on Tensor Rank). Suppose V1, V2, . . . , Vp are finite-dimensional vector spaces over the same field and v ∈ V1 ⊗ V2 ⊗ · · · ⊗ Vp. Define

Sv = { (f ⊗ I ⊗ · · · ⊗ I)(v) | f ∈ V1∗ },
in this case. However, the following example shows that some vectors in (C2 )⊗3
have tensor rank 3.
Example 3.3.4 (Tensor Rank Can Exceed Local Dimension). Show that the following vector v ∈ (C2)⊗3 has tensor rank 3:

v = e1 ⊗ e1 ⊗ e2 + e1 ⊗ e2 ⊗ e1 + e2 ⊗ e1 ⊗ e1.

Solution:
It is clear that rank(v) ≤ 3, since v was provided to us as a sum of 3 elementary tensors. On the other hand, to see that rank(v) ≥ 3, we construct the subspace Sv ⊆ C2 ⊗ C2 described by Theorem 3.3.6:

Sv = { (wT e1)(e1 ⊗ e2 + e2 ⊗ e1) + (wT e2)(e1 ⊗ e1) | w ∈ C2 }
   = { (a, b, b, 0) | a, b ∈ C }.

[Side note: Here we have associated the linear form f with a row vector wT (via Theorem 1.3.3) and defined a = wT e2 and b = wT e1 for simplicity of notation.]

It is clear that dim(Sv) = 2. Furthermore, we can see that the only elementary tensors in Sv are those of the form (a, 0, 0, 0), since the matricization of (a, b, b, 0) is

[ a  b
  b  0 ],
lim_{k→∞} vk = e1 ⊗ e1 ⊗ e2 + e1 ⊗ e2 ⊗ e1 + e2 ⊗ e1 ⊗ e1

is the vector with tensor rank 3 from Example 3.3.4. To deal with issues like this, we define the border rank of a vector v ∈ Fd1 ⊗ · · · ⊗ Fdp to be the smallest integer r such that v can be written as a limit of vectors with tensor rank no larger than r. We just showed that the vector from Example 3.3.4 has border rank ≤ 2, despite having tensor rank 3.

[Side note: In fact, its border rank is exactly 2. See Exercise 3.3.12.]
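The definition of the approximating vectors vk appears just before this excerpt and is not shown here; one standard choice, used below purely as an illustration and as an assumption on our part, is vk = k(e1 + k⁻¹e2) ⊗ (e1 + k⁻¹e2) ⊗ (e1 + k⁻¹e2) − k e1 ⊗ e1 ⊗ e1, which is a sum of only 2 elementary tensors for every k.

# Numerical check: each v_k is a sum of 2 elementary tensors, yet v_k -> v,
# where v has tensor rank 3; hence v has border rank at most 2.
import numpy as np

e1, e2 = np.eye(2)

def triple(x):
    return np.kron(np.kron(x, x), x)

v = (np.kron(np.kron(e1, e1), e2) + np.kron(np.kron(e1, e2), e1)
     + np.kron(np.kron(e2, e1), e1))

for k in [10, 100, 1000]:
    vk = k * triple(e1 + e2 / k) - k * triple(e1)
    print(k, np.linalg.norm(vk - v))   # error shrinks like 1/k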
Nothing like this happens when p = 2 (i.e., when we can think of tensor rank in Fd1 ⊗ Fd2 as the usual rank of a matrix in Md1,d2(F)): if A1, A2, . . . ∈ Md1,d2(F) each have rank(Ak) ≤ r then

rank(lim_{k→∞} Ak) ≤ r

too (as long as this limit exists). This fact can be verified using the techniques of Section 2.D—recall that the singular values of a matrix are continuous in its entries, so if each Ak has at most r non-zero singular values then the same must be true of their limit (in fact, this was exactly Exercise 2.D.3).

[Side note: In other words, the rank of a matrix can "jump down" in a limit, but it cannot "jump up". When p ≥ 3, tensor rank can jump either up or down in limits.]
The other unfortunate aspect to tensor rank is that it is field-dependent. For
example, if we let e+ = e1 + e2 ∈ R2 and e− = e1 − e2 ∈ R2 then the vector
singular values (and thus the rank) of a real matrix do not depend on whether
we consider it as a member of Mm,n (R) or Mm,n (C). Another way to see that
matrix rank does not suffer from this problem is to just notice that allowing
complex arithmetic in Gaussian elimination does not increase the number of
zero rows that we can obtain in a row echelon form of a real matrix.
Remark 3.3.2 (We Are Out of Our Depth). Tensors form one of the most ubiquitous and actively-researched areas in all of science right now. We have provided a brief introduction to the topic and its basic motivation, but there are entire textbooks devoted to exploring properties of tensors and what can be done with them, and we cannot possibly do the subject justice here. See [KB09] and [Lan12], for example, for a more thorough treatment.
[Hint: Mimic Example 3.3.4. Where does this argument break down if we use complex numbers instead of real numbers?]

3.3.3 Determine which of the following statements are true and which are false.
∗(a) C ⊗ C ≅ C.
(b) If we think of C as a 2-dimensional vector space over R (e.g., with basis {1, i}) then C ⊗ C ≅ C.
∗(c) If v ∈ Fm ⊗ Fn then rank(v) ≤ min{m, n}.
(d) If v ∈ Fm ⊗ Fn ⊗ Fp then rank(v) ≤ min{m, n, p}.
∗(e) The tensor rank of v ∈ Rd1 ⊗ Rd2 ⊗ · · · ⊗ Rdp is at least as large as its tensor rank as a member of Cd1 ⊗ Cd2 ⊗ · · · ⊗ Cdp.

∗∗3.3.4 Show that the linear transformation S : V ⊗ W → X from Definition 3.3.1(d) is necessarily unique.

∗∗3.3.5 Verify the claim of Example 3.3.2 that we can represent M2 ⊗ P 2 as the set of functions f : F → M2 of the form f(x) = Ax² + Bx + C for some A, B, C ∈ M2, with elementary tensors defined by (A ⊗ f)(x) = A f(x). That is, verify that the four defining properties of Definition 3.3.1 hold for this particular representation of M2 ⊗ P 2.

for all v ∈ V1 ⊗ · · · ⊗ Vp.

3.3.8 In this exercise, we generalize some of the observations that we made about the vector from Example 3.3.4. Suppose {x1, x2}, {y1, y2}, and {z1, z2} are bases of C2 and let

v = x1 ⊗ y1 ⊗ z2 + x1 ⊗ y2 ⊗ z1 + x2 ⊗ y1 ⊗ z1 ∈ (C2)⊗3.

(a) Show that v has tensor rank 3.
(b) Show that v has border rank 2.

3.3.9 Show that the 3-variable polynomial f(x, y, z) = x + y + z cannot be written in the form f(x, y, z) = p1(x)q1(y)r1(z) + p2(x)q2(y)r2(z) for any single-variable polynomials p1, p2, q1, q2, r1, r2.
[Hint: You can prove this directly, but it might be easier to leech off of some vector that we showed has tensor rank 3.]

∗∗3.3.10 Show that if V and W are vector spaces over the same field then V ⊗ W ≅ W ⊗ V.

3.3.11 Show that if V, W, and X are vector spaces over the same field then the tensor product distributes over the external direct sum (see Section 1.B.3) in the sense that
3.A.1 Representations
We start by precisely defining the types of linear transformations that we are now going to focus our attention on.
Example 3.A.1 (The Reduction Map). Construct the standard matrix of the linear map ΦR : Mn → Mn (called the reduction map) defined by ΦR(X) = tr(X)I − X.

Solution:
We first compute

ΦR(Ei,j) = tr(Ei,j)I − Ei,j = { I − Ej,j  if i = j,
                                −Ei,j     otherwise.

If we define e+ := vec(I) = ∑_{i=1}^n ei ⊗ ei then

vec(ΦR(Ei,j)) = { e+ − ej ⊗ ej  if i = j,
                  −ei ⊗ ej      otherwise.

It follows that [ΦR] = e+e+^T − I. For example, in the n = 3 case this standard matrix has the form

[ΦR] = [ ·  ·  ·  ·  1  ·  ·  ·  1
         · −1  ·  ·  ·  ·  ·  ·  ·
         ·  · −1  ·  ·  ·  ·  ·  ·
         ·  ·  · −1  ·  ·  ·  ·  ·
         1  ·  ·  ·  ·  ·  ·  ·  1
         ·  ·  ·  ·  · −1  ·  ·  ·
         ·  ·  ·  ·  ·  · −1  ·  ·
         ·  ·  ·  ·  ·  ·  · −1  ·
         1  ·  ·  ·  1  ·  ·  ·  · ].

[Side note: As usual, we use dots (·) to denote entries equal to 0.]

From the standard matrix of a matrix-valued linear map Φ, it is straightforward to deduce many of its properties. For example, since e+e+^T has rank 1 and

‖e+‖ = ‖∑_{i=1}^n ei ⊗ ei‖ = √n,
we conclude that if ΦR is the reduction map from Example 3.A.1, then [ΦR] = e+e+^T − I (and thus ΦR itself) has one eigenvalue equal to n − 1 and the other n² − 1 eigenvalues (counted according to multiplicity) equal to −1.

[Side note: For example, the (unique up to scalar multiplication) eigenvector of ΦR with eigenvalue n − 1 is I: ΦR(I) = nI − I = (n − 1)I.]
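A quick numerical confirmation of this eigenvalue claim for n = 3 (our own illustration):

# Verify that [Phi_R] = e+ e+^T - I has one eigenvalue n - 1 and the
# remaining n^2 - 1 eigenvalues equal to -1, for n = 3.
import numpy as np

n = 3
e_plus = np.eye(n).reshape(-1)            # vec(I) = sum_i e_i (x) e_i
Phi_R = np.outer(e_plus, e_plus) - np.eye(n * n)

eigenvalues = np.sort(np.linalg.eigvalsh(Phi_R))
print(np.round(eigenvalues, 6))  # eight -1's followed by a single 2 (= n - 1)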
The Choi Matrix
There is another matrix representation of a matrix-valued linear map Φ that has
some properties that often make it easier to work with than the standard matrix.
The idea here is that Φ : Mm,n → M p,q (like every linear transformation) is
completely determined by how it acts on a basis of the input space, so it is
completely determined by the mn matrices Φ(E1,1 ), Φ(E1,2 ), . . ., Φ(Em,n ), and
it is often convenient to arrange these matrices into a single large block matrix.
Definition 3.A.2 (Choi Matrix). The Choi matrix of a matrix-valued linear map Φ : Mm,n → Mp,q is the mp × nq matrix

CΦ := ∑_{i=1}^m ∑_{j=1}^n Φ(Ei,j) ⊗ Ei,j.
We can think of the Choi matrix as a p × q block matrix whose blocks are m × n. When partitioned in this way, the (i, j)-entry of each block is determined by the corresponding entry of Φ(Ei,j). For example, if Φ : M3 → M3 is such that

Φ(E1,1) = [ a  b  c
            d  e  f
            g  h  i ],   then   CΦ = [ a ∗ ∗ b ∗ ∗ c ∗ ∗
                                       ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
                                       ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
                                       d ∗ ∗ e ∗ ∗ f ∗ ∗
                                       ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
                                       ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
                                       g ∗ ∗ h ∗ ∗ i ∗ ∗
                                       ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
                                       ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ].

[Side note: We use asterisks (∗) here to denote entries whose values we do not care about right now.]
Equivalently, Theorem 3.1.8 tells us that CΦ is just the swapped version of the block matrix whose (i, j)-block equals Φ(Ei,j):

CΦ = Wm,p^T ( ∑_{i=1}^m ∑_{j=1}^n Ei,j ⊗ Φ(Ei,j) ) Wn,q

   = Wm,p^T [ Φ(E1,1)  Φ(E1,2)  · · ·  Φ(E1,n)
              Φ(E2,1)  Φ(E2,2)  · · ·  Φ(E2,n)
                 ⋮         ⋮      ⋱        ⋮
              Φ(Em,1)  Φ(Em,2)  · · ·  Φ(Em,n) ] Wn,q.

[Side note: Notice that Ei,j ⊗ A is a block matrix with A in its (i, j)-block, whereas A ⊗ Ei,j is a block matrix with the entries of A in the (i, j)-entries of each block.]
Example 3.A.2 (The Choi Matrix of the Reduction Map). Construct the Choi matrix of the linear map ΦR : Mn → Mn defined by ΦR(X) = tr(X)I − X (i.e., the reduction map from Example 3.A.1).

Solution:
As we noted in Example 3.A.1,

ΦR(Ei,j) = tr(Ei,j)I − Ei,j = { I − Ej,j  if i = j,
                                −Ei,j     otherwise.

where e+ = ∑_{i=1}^n ei ⊗ ei as before.
We emphasize that the Choi matrix CΦ of a linear map Φ does not act as a linear transformation in the same way that Φ does. That is, it is not the case that vec(Φ(X)) = CΦ vec(X); the Choi matrix CΦ behaves as a linear transformation very differently than Φ does. However, CΦ does make it easier to identify some important properties of Φ. For example, we say that Φ : Mn → Mm is transpose-preserving if Φ(X^T) = Φ(X)^T for all X ∈ Mn—a property that is encoded very naturally in CΦ:

[Side note: Definition 3.1.3(c) tells us that the Choi matrix of the transpose is the swap matrix (just like its standard matrix).]

Since E_{i,j}^T = E_{j,i}, these two conditions are equivalent to each other.

For example, the reduction map from Example 3.A.2 is transpose-preserving—a fact that can be verified in a straightforward manner from its definition or by noticing that its Choi matrix is symmetric.
In the case when the ground field is F = C and we focus on the conjugate
transpose (i.e., adjoint), things work out even more cleanly. We say that Φ :
Mn → Mm is adjoint preserving if Φ(X ∗ ) = Φ(X)∗ for all X ∈ Mn , and
we say that it is Hermiticity-preserving if Φ(X) is Hermitian whenever X ∈
Mn (C) is Hermitian. The following result says that these two families of maps
coincide with each other when F = C and can each be easily identified from
their Choi matrices.
Theorem 3.A.2 (Hermiticity-Preserving Linear Maps). Suppose Φ : Mn(C) → Mm(C) is a linear map. The following are equivalent:
a) Φ is Hermiticity-preserving,
b) Φ is adjoint-preserving, and
c) CΦ is Hermitian.
Proof. The equivalence of properties (b) and (c) follows from the same argu-
ment as in the proof of Theorem 3.A.1, just with a complex conjugation thrown
on top of the transposes. We thus focus on the equivalence of properties (a)
and (b).
To see that (b) implies (a) we notice that if Φ is adjoint-preserving and
X is Hermitian then Φ(X)∗ = Φ(X ∗ ) = Φ(X), so Φ(X) is Hermitian too. For
the converse, notice that we can write every matrix X ∈ Mn (C) as a linear
combination of Hermitian matrices:
X = (1/2)(X + X∗) + (1/(2i))(iX − iX∗),

where both X + X∗ and iX − iX∗ are Hermitian.

[Side note: It is worth comparing this with the Cartesian decomposition of Remark 1.B.1.]

It follows that if Φ is Hermiticity-preserving then Φ(X + X∗) and Φ(iX − iX∗) are each Hermitian, so

Φ(X)∗ = Φ( (1/2)(X + X∗) + (1/(2i))(iX − iX∗) )∗
      = ( (1/2)Φ(X + X∗) + (1/(2i))Φ(iX − iX∗) )∗
      = (1/2)Φ(X + X∗) − (1/(2i))Φ(iX − iX∗)
      = Φ( (1/2)(X + X∗) − (1/(2i))(iX − iX∗) )
      = Φ(X∗)
                       Transpose-Preserving       Bisymmetric
Characterizations:     Φ ∘ T = T ∘ Φ              Φ = Φ ∘ T = T ∘ Φ
                       CΦ = CΦ^T                  CΦ = CΦ^T = Γ(CΦ)

[Side note: Γ refers to the partial transpose, which we introduce in Section 3.A.2.]
Operator-Sum Representations
The third and final representation of matrix-valued linear maps that we will use is one that is a bit more "direct"—it does not aim to represent Φ : Mm,n → Mp,q as a matrix, but rather provides a formula to clarify how Φ acts on matrices. Specifically, an operator-sum representation of Φ is a formula of the form

Φ(X) = ∑_i Ai X Bi^T    for all X ∈ Mm,n,

where {Ai} ⊆ Mp,m and {Bi} ⊆ Mq,n are fixed families of matrices.

[Side note: If the input and output spaces are square (i.e., m = n and p = q) then the Ai and Bi matrices have the same size.]

It is perhaps not immediately obvious that every matrix-valued linear map even has an operator-sum representation, so before delving into any examples, we present a theorem that tells us how to convert a standard matrix or Choi matrix into an operator-sum representation (and vice-versa).
Theorem 3.A.3 (Converting Between Representations). Suppose Φ : Mm,n → Mp,q is a matrix-valued linear map. The following are equivalent:
a) [Φ] = ∑_i Ai ⊗ Bi.
b) CΦ = ∑_i vec(Ai) vec(Bi)^T.
c) Φ(X) = ∑_i Ai X Bi^T for all X ∈ Mm,n.
Proof. The equivalence of the representations (a) and (c) follows immediately from Theorem 3.1.7, which tells us that

vec(∑_i Ai X Bi^T) = (∑_i Ai ⊗ Bi) vec(X)

for all X ∈ Mm,n. Since it is similarly the case that vec(Φ(X)) = [Φ]vec(X) for all X ∈ Mm,n, we conclude that Φ(X) = ∑_i Ai X Bi^T for all X ∈ Mm,n if and only if [Φ] = ∑_i Ai ⊗ Bi.

[Side note: Here we use the fact that standard matrices are unique.]

To see that an operator-sum representation (c) corresponds to a Choi matrix of the form (b), we first write each Ai and Bi in terms of their columns: Ai = [ai,1 | ai,2 | · · · | ai,m] and Bi = [bi,1 | bi,2 | · · · | bi,n], so that vec(Ai) = ∑_{j=1}^m ai,j ⊗ ej and vec(Bi) = ∑_{j=1}^n bi,j ⊗ ej. Then

CΦ = ∑_{j=1}^m ∑_{k=1}^n Φ(Ej,k) ⊗ Ej,k
   = ∑_i ∑_{j=1}^m ∑_{k=1}^n (Ai Ej,k Bi^T) ⊗ Ej,k
   = ∑_i ∑_{j=1}^m ∑_{k=1}^n (ai,j bi,k^T) ⊗ (ej ek^T)
   = ∑_i ( ∑_{j=1}^m ai,j ⊗ ej )( ∑_{k=1}^n bi,k ⊗ ek )^T
   = ∑_i vec(Ai) vec(Bi)^T,

as claimed. Furthermore, each of these steps can be reversed to see that (b) implies (c) as well.
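A numerical sanity check of these equivalences (ours) for a randomly chosen operator-sum representation; here vec is the row-major vectorization vec(A) = ∑_j (Aej) ⊗ ej used in the proof above.

# Check that Phi(X) = sum_i A_i X B_i^T has standard matrix sum_i A_i (x) B_i
# and Choi matrix sum_i vec(A_i) vec(B_i)^T (illustration only).
import numpy as np

rng = np.random.default_rng(5)
m, n, p, q, terms = 2, 3, 2, 2, 2
As = [rng.standard_normal((p, m)) for _ in range(terms)]
Bs = [rng.standard_normal((q, n)) for _ in range(terms)]

def phi(X):
    return sum(A @ X @ B.T for A, B in zip(As, Bs))

vec = lambda M: M.reshape(-1)              # vec(M) = sum_j (M e_j) (x) e_j

# (a) Standard matrix: vec(Phi(X)) = (sum_i A_i (x) B_i) vec(X).
std = sum(np.kron(A, B) for A, B in zip(As, Bs))
X = rng.standard_normal((m, n))
print(np.allclose(vec(phi(X)), std @ vec(X)))      # True

# (b) Choi matrix: sum_{j,k} Phi(E_{jk}) (x) E_{jk} = sum_i vec(A_i) vec(B_i)^T.
C = np.zeros((m * p, n * q))
for j in range(m):
    for k in range(n):
        E = np.zeros((m, n)); E[j, k] = 1.0
        C += np.kron(phi(E), E)
print(np.allclose(C, sum(np.outer(vec(A), vec(B)) for A, B in zip(As, Bs))))  # True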
In particular, part (b) of this theorem tells us that we can convert any
rank-one sum decomposition of the Choi matrix of Φ into an operator-sum
representation. Since every matrix has a rank-one sum decomposition, every
matrix-valued linear map has an operator-sum representation as well. In fact,
by leeching directly off of the many things that we know about rank-one sum
decompositions, we immediately arrive at the following corollary:
Corollary 3.A.4 (The Size of Operator-Sum Decompositions). Every matrix-valued linear map Φ : Mm,n(F) → Mp,q(F) has an operator-sum representation of the form

Φ(X) = ∑_{i=1}^{rank(CΦ)} Ai X Bi^T    for all X ∈ Mm,n

with both sets {Ai} and {Bi} linearly independent. Furthermore, if F = R or F = C then the sets {Ai} and {Bi} can be chosen to be mutually orthogonal with respect to the Frobenius inner product.

[Side note: In particular, rank(CΦ) ≤ min{mp, nq} is the minimum number of terms in any operator-sum representation of Φ. This quantity is sometimes called the Choi rank of Φ.]

Proof. This result for an arbitrary field F follows immediately from the fact (see Theorem A.1.3) that CΦ has a rank-one sum decomposition of the form

CΦ = ∑_{i=1}^{rank(CΦ)} vi wi^T,

where {vi}_{i=1}^r ⊂ Fmp and {wi}_{i=1}^r ⊂ Fnq are linearly independent sets of column vectors. Theorem 3.A.3 then gives us the desired operator-sum representation of Φ(X) by choosing Ai = mat(vi) and Bi = mat(wi) for all i. To see that the orthogonality claim holds when F = R or F = C, we instead use the orthogonal rank-one sum decomposition provided by the singular value decomposition (Theorem 2.3.3).
[Side note: The overline on CΦ in this theorem just means complex conjugation (which has no effect if F = R).]

Proof. Property (a) is just the special case of Theorem 1.4.8 that arises when working with the standard bases of Mm,n(F) and Mp,q(F), which are orthonormal. Property (c) then follows almost immediately from Theorem 3.A.3: if Φ(X) = ∑_i Ai X Bi^T then

[Φ∗] = [Φ]∗ = ( ∑_i Ai ⊗ Bi )∗ = ∑_i Ai∗ ⊗ Bi∗,
Definition 3.A.4 (Kronecker Product of Matrix-Valued Linear Maps). Suppose Φ and Ψ are matrix-valued linear maps acting on Mm,n and Mp,q, respectively. Then Φ ⊗ Ψ is the matrix-valued linear map defined by

(Φ ⊗ Ψ)(A ⊗ B) = Φ(A) ⊗ Ψ(B)    for all A ∈ Mm,n, B ∈ Mp,q.
For brevity, we often denote these maps by Γ := T ⊗ I and Γ := I ⊗ T, respectively, and we think of each of them as one "half" of the transpose of a block matrix—after all, Γ ∘ Γ = T.

[Side note: The notations Γ and Γ were chosen to each look like half of a T.]

We can similarly define the partial trace maps as follows:

tr1 := tr ⊗ In : Mmn → Mn    and    tr2 := Im ⊗ tr : Mmn → Mm.
Example 3.A.4 (Partial Transposes, Traces, and Maps). Compute each of the following matrices if

A = [ 1  2  0 −1
      2 −1  3  2
     −3  2  2  1
      1  0  2  3 ].

a) Γ(A),
b) tr1(A), and
c) (I ⊗ Φ)(A), where Φ(X) = tr(X)I − X is the map from Example 3.A.1.

Solutions:
a) To compute Γ(A) = (I ⊗ T)(A), we just transpose each block of A:

Γ(A) = [ 1  2  0  3
         2 −1 −1  2
        −3  1  2  2
         2  0  1  3 ].
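The worked solutions to parts (b) and (c) are not shown in this excerpt; the following sketch (ours) computes all three block-wise operations for the 2 × 2 block partition used above.

# Compute Gamma(A), tr_1(A), and (I (x) Phi)(A) for the 4 x 4 matrix A,
# treated as a 2 x 2 block matrix with 2 x 2 blocks (illustration only).
import numpy as np

A = np.array([[1, 2, 0, -1],
              [2, -1, 3, 2],
              [-3, 2, 2, 1],
              [1, 0, 2, 3]], dtype=float)

def blocks(M, k=2):
    return [[M[k*i:k*(i+1), k*j:k*(j+1)] for j in range(2)] for i in range(2)]

B = blocks(A)
phi = lambda X: np.trace(X) * np.eye(2) - X    # the reduction map on M_2

gamma_A = np.block([[B[i][j].T for j in range(2)] for i in range(2)])      # (a)
tr1_A = B[0][0] + B[1][1]                       # (b) sum of diagonal blocks
I_phi_A = np.block([[phi(B[i][j]) for j in range(2)] for i in range(2)])   # (c)

print(gamma_A)
print(tr1_A)
print(I_phi_A)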
−1 −3 −2 2
Finally, it is worthwhile to notice that if e+ := vec(I) = ∑_i ei ⊗ ei, then

e+e+^T = ( ∑_i ei ⊗ ei )( ∑_j ej ⊗ ej )^T = ∑_{i,j} (ei ej^T) ⊗ (ei ej^T) = ∑_{i,j} Ei,j ⊗ Ei,j.

[Side note: This special vector e+ was originally introduced in Example 3.A.1.]

In particular, this means that the Choi matrix of a matrix-valued linear map Φ can be constructed by applying Φ to one half of this special rank-1 matrix e+e+^T:

(Φ ⊗ I)(e+e+^T) = ∑_{i,j} Φ(Ei,j) ⊗ Ei,j = CΦ.
Remark 3.A.1 (Linear Preserver Problems). Many standard linear preserver problems have answers that are very similar to each other. For example, the linear maps Φ : Mn → Mn that preserve the determinant (i.e., the maps Φ that satisfy det(Φ(X)) = det(X) for all
Example 3.A.5 (The Reduction Map is Positive). Show that the linear map ΦR : Mn → Mn defined by ΦR(X) = tr(X)I − X (i.e., the reduction map from Example 3.A.1) is positive.

Solution:
Just notice that if the eigenvalues of X are λ1, . . ., λn then the eigenvalues of ΦR(X) = tr(X)I − X are tr(X) − λn, . . ., tr(X) − λ1. Since tr(X) = λ1 + · · · + λn, each eigenvalue of ΦR(X) is the sum of n − 1 of the eigenvalues of X, so the eigenvalues of ΦR(X) are non-negative whenever X is positive semidefinite. That is, ΦR(X) is positive semidefinite whenever X is positive semidefinite, so ΦR is positive.
In particular, 1-positive linear maps are exactly the positive maps themselves, and if a map is (k + 1)-positive then it is necessarily k-positive as well (see Exercise 3.A.18). We now look at some examples of linear maps that are and are not k-positive.

[Side note: It is worth noting that Φ ⊗ Ik is positive if and only if Ik ⊗ Φ is positive.]
The matrix on the right has eigenvalues 1 (with multiplicity 3) and −1 (with multiplicity 1), so it is not positive semidefinite. It follows that T ⊗ I2 is not positive, so T : M2 → M2 is not 2-positive.

[Side note: Recall that the transposition map is 1-positive.]

To see that T is not 2-positive when n > 2, we can just embed this same example into higher dimensions by padding the input matrix e+e+^T with extra rows and columns of zeros.
One useful technique that sometimes makes working with positive maps simpler is the observation that a linear map Φ : Mn → Mm is positive if and only if Φ(vv∗) is positive semidefinite for all v ∈ Fn. The reason for this is the spectral decomposition, which says that every positive semidefinite matrix X ∈ Mn can be written in the form

X = ∑_{j=1}^n vj vj∗

for some {vj} ⊂ Fn.

[Side note: The spectral decomposition is either of Theorems 2.1.4 (if F = C) or 2.1.6 (if F = R).]
The following example makes use of this technique to show that the set of
k-positive linear maps is not just contained within the set of (k + 1)-positive
maps, but in fact this inclusion is strict as long as 1 ≤ k < n.
(Φ ⊗ Ik)(vv∗) = ∑_{i,j=1}^k ( k(wj∗wi)I − wiwj∗ ) ⊗ xixj∗

              = ∑_{i=1}^k ∑_{j=i+1}^k (wi ⊗ xi − wj ⊗ xj)(wi ⊗ xi − wj ⊗ xj)∗ + k ∑_{i=1}^k (I − wiwi∗) ⊗ xixi∗.

[Side note: Verifying the second equality here is ugly, but routine—we do not need to be clever. Since ‖wi‖ = 1, the term I − wiwi∗ is positive semidefinite.]
The above example raises a natural question—if the sets of k-positive linear
maps are strict subsets of each other when 1 ≤ k ≤ n, what about when k > n?
For example, is there a matrix-valued linear map that is n-positive but not (n + 1)-
positive? The following theorem shows that no, this is not possible—any map
that is n-positive is actually necessarily completely positive (i.e., k-positive for
all k ≥ 1). Furthermore, there is a simple characterization of these maps in terms of either their Choi matrices or their operator-sum representations.
Theorem 3.A.6 (Characterization of Completely Positive Maps). Suppose Φ : Mn(F) → Mm(F) is a linear map. The following are equivalent:
a) Φ is completely positive.
b) Φ is min{m, n}-positive.
c) CΦ is positive semidefinite.
d) There exist matrices {Ai} ⊂ Mm,n(F) such that Φ(X) = ∑_i Ai X Ai∗.
Proof. We prove this theorem via the cycle of implications (a) =⇒ (b) =⇒ (c) =⇒ (d) =⇒ (a). The fact that (a) =⇒ (b) is immediate from the relevant definitions, so we start by showing that (b) =⇒ (c).

[Side note: This result is sometimes called Choi's theorem. Recall that e+ = ∑_{i=1}^n ei ⊗ ei.]

To this end, first notice that if n ≤ m then Φ being n-positive tells us that (Φ ⊗ In)(e+e+∗) is positive semidefinite. However, we noted earlier that (Φ ⊗ In)(e+e+∗) = CΦ, so CΦ is positive semidefinite. On the other hand, if
The fact that (c) =⇒ (d) follows from using the spectral decomposition to
write CΦ = ∑i vi v∗i . Theorem 3.A.3 tells us that if we define Ai = mat(vi ) then
Φ has the operator-sum representation Φ(X) = ∑i Ai XA∗i , as desired.
To see that (d) =⇒ (a) and complete the proof, we notice that if k ≥ 1 and
X ∈ Mn ⊗ Mk is positive semidefinite, then so is
Example 3.A.8 (The Reduction Map is Decomposable). Show that the linear map Ψ : Mn → Mn defined by Ψ(X) = tr(X)I − X^T is completely positive.

Solution:
We just compute the Choi matrix of Ψ. Since the Choi matrix of the map X ↦ tr(X)I is I ⊗ I and the Choi matrix of the transpose map T is the swap matrix Wn,n, we conclude that CΨ = I ⊗ I − Wn,n. Since Wn,n is Hermitian and unitary, its eigenvalues all equal 1 or −1, so the eigenvalues of CΨ all equal 0 or 2, and CΨ is positive semidefinite. Theorem 3.A.6 then implies that Ψ is completely positive.

[Side note: This map Ψ is sometimes called the Werner–Holevo map.]

In particular, notice that if ΦR(X) = tr(X)I − X is the reduction map from Example 3.A.5 then ΦR = T ∘ Ψ has the form (3.A.1) and is thus decomposable.
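A numerical confirmation (ours) for n = 3: building CΨ from Definition 3.A.2 gives I ⊗ I − W_{3,3}, whose eigenvalues are all 0 or 2.

# Verify that the Choi matrix of Psi(X) = tr(X) I - X^T equals
# I (x) I - W_{n,n} and is positive semidefinite, for n = 3.
import numpy as np

n = 3
I = np.eye(n)

def psi(X):
    return np.trace(X) * I - X.T

C = np.zeros((n * n, n * n))
W = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = 1.0
        C += np.kron(psi(E), E)
        W += np.kron(E.T, E)         # the swap matrix W_{n,n}

print(np.allclose(C, np.eye(n * n) - W))     # True
print(np.round(np.linalg.eigvalsh(C), 6))    # only 0's and 2's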
is positive semidefinite for all v ∈ F3. To this end, recall that we can prove that this matrix is positive semidefinite by checking that all of its principal minors are nonnegative (see Remark 2.2.1).

[Side note: A principal minor of a square matrix A is the determinant of a square submatrix that is obtained by deleting some rows and the same columns of A.]

There are three 1 × 1 principal minors of ΦC(vv∗):

|v1|² + |v3|²,   |v2|² + |v1|²,   and   |v3|² + |v2|²,

which are clearly all nonnegative. Similarly, one of the three 2 × 2 principal minors of ΦC(vv∗) is

det([ |v1|² + |v3|²   −v1v̄2
      −v2v̄1          |v2|² + |v1|² ]) = (|v1|² + |v3|²)(|v2|² + |v1|²) − |v1|²|v2|²
                                       = |v1|⁴ + |v1|²|v3|² + |v2|²|v3|²
                                       ≥ 0.

The calculation for the other two 2 × 2 principal minors is almost identical, so we move right on to the one and only 3 × 3 principal minor of ΦC(vv∗):

det(ΦC(vv∗)) = (|v1|² + |v3|²)(|v2|² + |v1|²)(|v3|² + |v2|²) − 2|v1|²|v2|²|v3|²
               − (|v1|² + |v3|²)|v2|²|v3|² − (|v2|² + |v1|²)|v1|²|v3|² − (|v3|² + |v2|²)|v1|²|v2|²
             = |v1|²|v3|⁴ + |v2|²|v1|⁴ + |v3|²|v2|⁴ − 3|v1|²|v2|²|v3|².
In order to show that this quantity is nonnegative, we recall that the AM–GM inequality (Theorem A.5.3) tells us that if x, y, z ≥ 0 then (x + y + z)/3 ≥ (xyz)^{1/3}. Choosing x = |v1|²|v3|⁴, y = |v2|²|v1|⁴ and z = |v3|²|v2|⁴ gives

(|v1|²|v3|⁴ + |v2|²|v1|⁴ + |v3|²|v2|⁴)/3 ≥ (|v1|⁶|v2|⁶|v3|⁶)^{1/3} = |v1|²|v2|²|v3|²,

which is exactly what we wanted. It follows that ΦC is positive, as claimed.

[Side note: The AM–GM inequality is introduced in Appendix A.5.2.]
To see that ΦC cannot be written in the form ΦC = Ψ1 + T ∘ Ψ2 with Ψ1 and Ψ2 completely positive, we first note that constructing the Choi matrix of both sides of this equation shows that it is equivalent to show that we cannot write CΦC = X + Y^Γ with both X and Y positive semidefinite. Suppose for the sake of establishing a contradiction that CΦC can be written in this form. A straightforward computation shows that the Choi matrix of ΦC is

CΦC = [  1  ·  ·  · −1  ·  ·  · −1
         ·  ·  ·  ·  ·  ·  ·  ·  ·
         ·  ·  1  ·  ·  ·  ·  ·  ·
         ·  ·  ·  1  ·  ·  ·  ·  ·
        −1  ·  ·  ·  1  ·  ·  · −1
         ·  ·  ·  ·  ·  ·  ·  ·  ·
         ·  ·  ·  ·  ·  ·  ·  ·  ·
         ·  ·  ·  ·  ·  ·  ·  1  ·
        −1  ·  ·  · −1  ·  ·  ·  1 ].

[Side note: Recall that Y^Γ refers to (T ⊗ I)(Y); the first partial transpose of Y. Here, X and Y are the Choi matrices of Ψ1 and Ψ2, respectively.]
Since (CΦC)_{2,2} = 0 and X and Y are positive semidefinite, it is necessarily the case that y_{2,2} = 0. The fact that (CΦC)_{6,6} = (CΦC)_{7,7} = 0 similarly implies y_{6,6} = y_{7,7} = 0. If a diagonal entry of a PSD matrix equals 0 then its entire row and column must equal 0 by Exercise 2.2.11, so we conclude in particular that

The reason that we focused on these entries of Y is that, under partial transposition, they are the entries that are moved to the locations of the "−1" entries of CΦC above. That is,

−1 = (CΦC)_{1,5} = x_{1,5} + (Y^Γ)_{1,5} = x_{1,5} + y_{4,2} = x_{1,5},

and a similar argument shows that x_{1,9} = x_{5,1} = x_{5,9} = x_{9,1} = x_{9,5} = −1 as well.

[Side note: It is also worth noting that this linear map ΦC is not 2-positive—see Exercise 3.A.4.]
Positive semidefiniteness of Y also tells us that y_{1,1}, y_{5,5}, y_{9,9} ≥ 0, which (since (CΦC)_{1,1} = (CΦC)_{5,5} = (CΦC)_{9,9} = 1) implies x_{1,1}, x_{5,5}, x_{9,9} ≤ 1. We have now learned enough about X to see that it is not positive semidefinite: it is straightforward to check that

e+^T X e+ = x_{1,1} + x_{5,5} + x_{9,9} − 6 ≤ 3 − 6 = −3,

so X is not positive semidefinite. This is a contradiction that completes the proof.

[Side note: Recall that e+ = ∑_{i=1}^3 ei ⊗ ei = (1, 0, 0, 0, 1, 0, 0, 0, 1).]
On the other hand, if the dimensions m and n are both sufficiently small,
then a somewhat deep result (which we now state without proof) says that every
positive map is indeed decomposable.
Figure 3.2: A schematic that depicts the relationships between the sets of positive, k-positive, completely positive, and decomposable linear maps, as well as the locations of the various positive linear maps that we have seen so far within these sets.
Remark 3.A.2 (The Set of Positive Linear Maps). The proof of Theorem 3.A.8 is quite technical—it was originally proved in the m = n = 2 case in [Stø63] (though there are now some slightly simpler proofs of this case available [MO15]) and in the {m, n} = {2, 3} case in [Wor76]. If n = 2 and m ≥ 4 then there are positive linear maps that are not decomposable (see Exercise 3.A.22), just like we showed in the m, n ≥ 3 case in Theorem 3.A.7.
In higher dimensions, the structure of the set of positive linear maps is
not understood very well, and constructing “strange” positive linear maps
like the one from Theorem 3.A.7 is an active research area. The fact that
this set is so much simpler when mn ≤ 6 can be thought of roughly as a
statement that there just is not enough room in those small dimensions for
the “true nature” of the set of positive maps to become apparent.
3.A.6 Show that every adjoint-preserving linear map Φ : Mn → Mm can be written in the form Φ = Ψ1 − Ψ2, where Ψ1, Ψ2 : Mn → Mm are completely positive.

∗3.A.7 Show that if Φ : Mn → Mm is positive then so is Φ∗.

∗3.A.8 A matrix-valued linear map Φ : Mn → Mm is called unital if Φ(In) = Im. Show that the following are equivalent:
i) Φ is unital,

3.A.10 Suppose Φ, Ψ : Mn → Mn are linear maps.
(a) Show that CΨ∘Φ = (Ψ ⊗ I)(CΦ).
(b) Show that CΦ∘Ψ∗ = (I ⊗ Ψ̄)(CΦ), where Ψ̄ is the linear map defined by Ψ̄(X) = \overline{Ψ(X̄)} for all X ∈ Mn.

∗∗3.A.15 Show that Φ : Mn(R) → Mm(R) is both bisymmetric and decomposable if and only if there exists a completely positive map Ψ : Mn(R) → Mm(R) such that Φ = Ψ + T ∘ Ψ.

∗∗3.A.16 Show that the linear map Φ ⊗ Ψ from Definition 3.A.4 is well-defined. That is, show that it is uniquely determined by Φ and Ψ, and that the value of (Φ ⊗ Ψ)(C) does not depend on the particular way in which we write C = ∑_i Ai ⊗ Bi.

∗∗3.A.19 Suppose a linear map Φ : Mn → Mm is such that Φ(X) is positive definite whenever X is positive definite. Use the techniques from Section 2.D to show that Φ is positive.

3.A.20 Suppose U ∈ Mn(C) is a skew-symmetric unitary matrix (i.e., U^T = −U and U∗U = I). Show that the map ΦU : Mn(C) → Mn(C) defined by ΦU(X) = tr(X)I − X − UX^TU∗ is positive.
[Hint: Show that x · (Ux) = 0 for all x ∈ Cn. Exercise 2.1.22 might help, but is not necessary.]
[Side note: ΦU is called a Breuer–Hall map, and it is worth comparing it to the (also positive) reduction map of Example 3.A.1. While the reduction map is decomposable, Breuer–Hall maps are not.]

3.A.21 Consider the linear map Φ : M2 → M4 defined by

Φ(X) = tr(X)I4 − [ 0         0      x1,2   tr(X)/2
                   0         0      x2,1   x1,2
                   x2,1      x1,2   0      0
                   tr(X)/2   x2,1   0      0 ].

(a) Show that Φ is positive.
[Hint: Show that Φ(X) is diagonally dominant.]
[Side note: In fact, Φ is decomposable—see Exercise 3.C.15.]
(b) Construct CΦ.
(c) Show that Φ is not completely positive.

∗∗3.A.22 Consider the linear map Φ : M2 → M4 defined by

Φ([ a  b
    c  d ]) = [ 4a − 2b − 2c + 3d   2b − 2a   0          0
                2c − 2a             2a        b          0
                0                   c         2d         −2b − d
                0                   0         −2c − d    4a + 2d ].

(a) Show that if X is positive definite then the determinants of the top-left 1 × 1, 2 × 2, and 3 × 3 blocks of Φ(X) are strictly positive.
(b) Show that if X is positive definite then det(Φ(X)) > 0 as well.
[Hint: This is hard. Try factoring this determinant as det(X)p(X), where p is a polynomial in the entries of X. Computer software might help.]
(c) Use Sylvester's Criterion (Theorem 2.2.6) and Exercise 3.A.19 to show that Φ is positive.
[Side note: However, Φ is not decomposable—see Exercise 3.C.17.]
We then used our various linear algebraic tools like the real spectral decom-
Every multivariable
position (Theorem 2.1.6) to learn more about the structure of these quadratic
polynomial can be forms. Furthermore, analogous results hold for quadratic forms f : Cn → C,
written as a sum of and they can be proved simply by instead making use of the complex spectral
homogeneous decomposition (Theorem 2.1.4).
polynomials (e.g., a
degree-2 polynomial In general, a polynomial for which every term has the exact same degree
is a sum of a is called a homogeneous polynomial. Linear and quadratic forms constitute
quadratic form,
the degree-1 and degree-2 homogeneous polynomials, respectively, and in this
linear form, and a
scalar). section we investigate their higher-degree brethren. In particular, we will see
378 Chapter 3. Tensors and Multilinearity
By the spectral Proof. In order to help us prove this result, we first define an inner product
decomposition, if
h·, ·i : HP np ×HP np → F by setting h f , gi equal to a certain weighted dot product
p = 2 then we can
choose r = n in this of the vectors of coefficients of f , g ∈ HP np . Specifically, if
theorem. In general,
we can choose r = f (x1 , x2 , . . . , xn ) = ∑ ak1 ,k2 ,...,kn x1k1 x2k2 · · · xnkn and
dim HP np = n+p−1
p
k1 +k2 +···+kn =p
(we will see how this
g(x1 , x2 , . . . , xn ) = ∑ bk1 ,k2 ,...,kn x1k1 x2k2 · · · xnkn ,
dimension is
k1 +k2 +···+kn =p
computed shortly).
then we define
ak1 ,k2 ,...,kn bk1 ,k2 ,...,kn
h f , gi = ∑ p , (3.B.2)
k1 +···+kn =p k1 ,k2 ,...,kn
Multinomial where k1 ,k2p,...,kn = k1! k2p! ! ··· kn ! is a multinomial coefficient. We show that this
coefficients and the
multinomial theorem function is an inner product in Exercise 3.B.11.
(Theorem A.2.2) are Importantly, notice that if g is the p-th power of a linear form (i.e., there
introduced in
are scalars c1 , c2 , . . . , cn ∈ F such that g(x1 , x2 , . . . , xn ) = (c1 x1 + c2 x2 + · · · +
Appendix A.2.1.
cn xn ) p ) then we can use the multinomial theorem (Theorem A.2.2) to see that
p k1 k2
g(x1 , x2 , . . . , xn ) = ∑ c c
1 2 · · · c kn
n x1k1 x2k2 · · · xnkn ,
k +k +···+kn =p k 1 , k 2 , . . . , kn
1 2
Our goal is thus to show that this span is not a proper subspace of HP np .
Suppose for the sake of establishing a contradiction that this span were a
proper subspace of HP np . Then there would exist a non-zero homogeneous
For example, we polynomial f ∈ HP np orthogonal to each p-th power of a linear form:
could choose f to
be any non-zero h f , gi = 0 whenever g(x1 , x2 , . . . , xn ) = (c1 x1 + c2 x2 + · · · + cn xn ) p .
vector in the
orthogonal Equation (3.B.3) then tells us that f (c1 , c2 , . . . , cn ) = 0 for all c1 , c2 , . . . , cn ∈ F,
complement (see which implies f is the zero polynomial, which is the desired contradiction.
Section 1.B.2) of the
span of the g’s. In the case when p is odd, the linear combination of powers described by
Theorem 3.B.1 really is just a sum of powers, since we can absorb any scalar
(positive or negative) inside the linear forms. For example, every cubic form
f : Fn → F can be written in the form
r
f (x1 , x2 , . . . , xn ) = ∑ (ci,1 x1 + ci,2 x2 + · · · + ci,n xn )3 .
i=1
In fact, if F = C then this can be done regardless of whether p is even or odd.
However, if p is even and F = R then the best we can do is absorb the absolute
value of scalars λi into the powers of linear forms (i.e., we can assume without
loss of generality that λi = ±1 for each 1 ≤ i ≤ m).
380 Chapter 3. Tensors and Multilinearity
c1 + c3 + c4 = 4, 3c3 − 3c4 = 0,
3c3 + 3c4 = 12, and c2 + c3 − c4 = −6.
This linear system has (c1 , c2 , c3 , c4 ) = (1, −6, 2, 2) as its unique solution,
which gives us the decomposition
The key observation that connects homogeneous polynomials with the other
topics of this chapter is that HP np is isomorphic to the symmetric subspace
Snp ⊂ (Fn )⊗p that we explored in Section 3.1.3. To see why this is the case,
recall that one basis of Snp consists of the vectors of the form
( )
∑ Wσ (e j1 ⊗ e j2 ⊗ · · · ⊗ e j p ) : 1 ≤ j1 ≤ j2 ≤ · · · ≤ j p ≤ n .
σ ∈S p
3.B Extra Topic: Homogeneous Polynomials 381
∑ Wσ (e j1 ⊗ e j2 ⊗ · · · ⊗ e j p )
σ ∈S p
w = c1 (1, 0)⊗4 + c2 (0, 1)⊗4 + c3 (1, 1)⊗4 + c4 (1, 2)⊗4 + c5 (2, 1)⊗4 .
If we explicitly write out what all of these vectors are (they have 16
entries each!), we get a linear system consisting of 16 equations and 5
variables. However, many of those equations are identical to each other,
and after discarding those duplicates we arrive at the following linear
system consisting of just 5 equations:
c1 + c3 + c4 + 16c5 = 2
c2 + c3 + 16c4 + c5 = 2
c3 + 2c4 + 8c5 = 3
c3 + 4c4 + 4c5 = 1
c3 + 8c4 + 2c5 = 3
This linear system has (c1 , c2 , c3 , c4 , c5 ) = (−8, −8, −7, 1, 1) as its unique
solution, which gives us the decomposition
w = (1, 2)⊗4 + (2, 1)⊗4 − 7(1, 1)⊗4 − 8(1, 0)⊗4 − 8(0, 1)⊗4 .
Remark 3.B.1 Recall from Section 3.3.3 that the tensor rank of w ∈ (Fn )⊗p (denoted by
Symmetric rank(w)) is the least integer r such that we can write
Tensor Rank
r
(1) (2) (p)
w = ∑ vi ⊗ vi ⊗ · · · ⊗ vi ,
i=1
( j)
where vi ∈ Fn for each 1 ≤ i ≤ r and 1 ≤ j ≤ p.
Now that we know that Theorem 3.1.11 holds, we could similarly
define (as long as F = R or F = C) the symmetric tensor rank of any
w ∈ Snp (denoted by rankS (w)) to be the least integer r such that we can
If F = C then we can write
omit the λi scalars r
here since
√ we can w = ∑ λi v⊗p
i ,
absorb p λi into vi . If i=1
F = R then we can
similarly assume that where λi ∈ F and vi ∈ Fn for each 1 ≤ i ≤ r.
λi = ±1 for all i.
Perhaps not surprisingly, only a few basic facts are known about the
symmetric tensor rank:
• When p = 2, the symmetric tensor rank just equals the usual rank:
rank(w) = rankS (w) for all w ∈ Sn2 . This is yet another consequence
of the spectral decomposition (see Exercise 3.B.6).
• In general, rank(w) ≤ rankS (w) for all w ∈ Snp . This fact follows
directly from the definitions of these two quantities—when comput-
ing the symmetric rank, we minimize over a strictly smaller set of
sums than for the non-symmetric rank.
• There are cases when rank(w) < rankS (w), even just when p = 3,
but they are not easy to construct or explain [Shi18].
3.B Extra Topic: Homogeneous Polynomials 383
n+p−1 n+p−1
• Since dim Snp = p , we have rankS (w) ≤ p for all
w ∈ Snp .
Solution:
If we divide f through by x34 , we get
f (x1 , x2 , x3 ) x14 x 3 x2 x2 x1 x2
4
= 4 + 2 1 4 + 3 12 + 4 3 2 + 5,
x3 x3 x3 x3 x3
Theorem 3.B.2 Suppose f ∈ HP 2p (R). Then f is positive semidefinite if and only if it can
Two-Variable PSD be written as a sum of squares of polynomials.
Homogeneous
Polynomials
Proof. The “if” direction is trivial (even for polynomials of more than 2 vari-
ables), so we only show the “only if” direction. Also, thanks to dehomogeniza-
tion, it suffices to just prove that every PSD (not necessarily homogeneous)
polynomial in a single variable can be written as a sum of squares.
To this end, we induct on the degree p of the polynomial f . In the p = 2
base case, we can complete the square in order to write f in its vertex form
where a > 0 and the vertex of the graph of f is located at the point (h, k) (see
Figure 3.3(a)). Since f (x) ≥ 0 for all x we know that f (h) = k ≥ 0, so this
vertex form is really a sum of squares decomposition of f .
x = −1 x=2
(order 1) (order 2)
x
(2, 1) -1 1 2 3
x
-1 1 2 3 4 5
For the inductive step, suppose f has even degree p ( f can clearly not be
positive semidefinite if p is odd). Let k ≥ 0 be the minimal value of f (which
exists, since p is even) and let h be a root of the polynomial f (x) − k. Since
3.B Extra Topic: Homogeneous Polynomials 385
for some integer q ≥ 1 and some polynomial g. Since k ≥ 0 and g(x) = ( f (x) −
k)/(x − h)2q ≥ 0 is a PSD polynomial of degree p − 2q < p, it follows from the
inductive hypothesis that g can be written as a sum of squares of polynomials,
so f can as well.
It is also known [Hil88] that every PSD polynomial in HP 43 (R) can be
written as a sum of squares of polynomials, though the proof is rather technical
and outside of the scope of this book. Remarkably, this is the last case where
such a property holds—for all other values of n and p, there exist PSD polyno-
mials that cannot be written as a sum of squares of polynomials. The following
example presents such a polynomial in HP 63 (R).
Similar examples show that there are PSD polynomials that cannot be
written as a sum of squares of polynomials in HP np (R) whenever n ≥ 3 and
p ≥ 6 is even. An example that demonstrates the existence of such polynomials
in HP 44 (R) (and thus HP np (R) whenever n ≥ 4 and p ≥ 4 is even) is provided in
Exercise 3.B.7. Finally, a summary of these various results about the connection
between positive semidefiniteness and the ability to write a polynomial as a
sum of squares of polynomials is provided by Table 3.2.
n (number of variables)
p (degree) 1 2 3 4 5
2 X (spectral decomp.)
4 7 (Exer. 3.B.7)
X (Thm. 3.B.2)
X[Hil88]
X (trivial)
6 7 (Exam. 3.B.4)
8
10
12
Table 3.2: A summary of which values of n and p lead to a real n-variable degree-p
PSD homogeneous polynomial necessarily being expressible as a sum of squares
of polynomials.
Remark 3.B.2 Remarkably, even though some positive semidefinite polynomials cannot
Sums of Squares of be written as a sum of squares of polynomials, they can all be written as a
Rational Functions sum of squares of rational functions [Art27]. For example, we showed in
Example 3.B.4 that the PSD homogeneous polynomial
f (x, y, z) = x2 y4 + y2 z4 + z2 x4 − 3x2 y2 z2
then we can write f in the following form that makes it “obvious” that it
3.B Extra Topic: Homogeneous Polynomials 387
is positive semidefinite:
This is not quite a sum of squares of rational functions, but it can be turned
into one straightforwardly (see Exercise 3.B.8).
form
q(x, y) = f (x, y, x, y) = xT Φ yyT x.
The following theorem pins down what properties of Φ lead to various desirable
properties of the associated biquadratic form q.
Table 3.3: A summary of various isomorphisms that relate certain families of ho-
mogeneous polynomials to certain families of matrices and matrix-valued linear
maps.
Solution:
It is straightforward to show that q(x, y) = xT Φ yyT x, where Ψ :
We were lucky here M3 → M3 is the (completely positive) Werner–Holevo map defined by
that we could
“eyeball” a CP map
Ψ(X) = tr(X)I − X T (see Example 3.A.8). It follows that if we can find
Ψ for which an operator-sum representation Ψ(X) = ∑i Ai XATi then we will get a sum-
q(x, y) = xT Ψ(yyT )x. In of-squares representation q(x, y) = ∑i (xT Ai y)2 as an immediate corollary.
general, finding a
suitable map Ψ to To construct such an operator-sum representation, we mimic the proof
show that q is a sum of Theorem 3.A.6: we construct CΨ and then let the Ai ’s be the matri-
of squares can be cizations of its scaled eigenvectors. We recall from Example 3.A.8 that
done via
CΨ = I − W3,3 , and applying the spectral decomposition to this matrix
semidefinite
programming (see shows that CΨ = ∑3i=1 vi vTi , where
Exercise 3.C.18).
v1 = (0, 1, 0, −1, 0, 0, 0, 0, 0),
v2 = (0, 0, 1, 0, 0, 0, −1, 0, 0), and
v3 = (0, 0, 0, 0, 0, 1, 0, −1, 0).
This biquadratic form is the Choi map described by Theorem 3.A.7, then
(as well as the Choi
map from
Theorem 3.A.7) was yT ΦC xxT y = x12 (y21 + y22 ) + x22 (y22 + y23 ) + x32 (y23 + y21 )
introduced by Choi − 2x1 x2 y1 y2 − 2x2 x3 y2 y3 − 2x3 x1 y3 y1 = q(x, y).
[Cho75].
However, ΦC is not bisymmetric, so we instead consider the bisym-
metric map
1
Ψ = ΦC + T ◦ ΦC ,
2
which also has the property that q(x, y) = yT Ψ xxT y. Since ΦC is
positive, so is Ψ, so we know from Theorem 3.B.3 that q is positive
We show that q is semidefinite.
PSD “directly” in
Exercise 3.B.10. b) On the other hand, even though we know that ΦC is not decompos-
able, this does not directly imply that the map Ψ from part (a) is not
decomposable, so we cannot directly use Theorem 3.B.3 to see that
q cannot be written as a sum of squares of bilinear forms. We thus
prove this property of q directly.
If q could be written as a sum of squares of bilinear forms, then it
would have the form
r
q(x, y) = ∑ ai x1 y1 + bi x1 y2 + ci x1 y3 + di x2 y1 + ei x2 y2 + fi x2 y3
i=1
2
+ hi x3 y1 + ji x3 y2 + ki x3 y3
for some families of scalars {ai }, {bi }, . . . , {ki } ∈ R and some in-
teger r. By matching up terms of q with terms in this hypothetical
sum-of-squares representation of q, we can learn about their coeffi-
cients.
For example, since the coefficient of x12 y23 in q is 0, whereas the
coefficient of x12 y23 in this sum-of-squares representation is ∑ri=1 c2i ,
we learn that ci = 0 for all 1 ≤ i ≤ r. A similar argument with the
coefficients of x22 y21 and x32 y22 then shows that di = ji = 0 for all
1 ≤ i ≤ r as well. It follows that our hypothetical sum-of-squares
representation of q actually has the somewhat simpler form
r 2
q(x, y) = ∑ ai x1 y1 + bi x1 y2 + ei x2 y2 + fi x2 y3 + hi x3 y1 + ki x3 y3 .
i=1
On the other hand, recall from Theorem 3.A.8 that all positive maps Φ :
Mn → Mm are decomposable when (m, n) = (2, 2), (m, n) = (2, 3), or (m, n) =
(3, 2). It follows immediately that all biquadratic forms q : Rm × Rn → R can
be written as a sum of squares of bilinear forms, under the same restrictions on
m and n. We close this section by noting the following strengthening of this
result: PSD biquadratic forms can be written as a sum of squares of bilinear
forms as long as one of m or n equals 2.
This fact is not true In particular, this result tells us via the isomorphism between biquadratic
when (m, n) = (2, 4) if forms and bisymmetric maps that if Φ : Mn (R) → Mm (R) is bisymmetric with
Φ is not
min{m, n} = 2 then Φ being positive is equivalent to it being decomposable.
bisymmetric—see
Exercise 3.C.17. We do not prove this theorem, however, as it is quite technical—the interested
reader is directed to [Cal73].
3.B.1 Write each of the following homogeneous polyno- ∗(c) Every positive semidefinite polynomial in HP 62 (R)
mials as a linear combination of powers of linear forms, in can be written as a sum of squares of polynomials.
the sense of Theorem 3.B.1. (d) Every positive semidefinite polynomial in HP 63 (R)
can be written as a sum of squares of polynomials.
∗(a) 3x2 + 3y2 − 2xy ∗(e) Every positive semidefinite polynomial in HP 28 (R)
(b) 2x2 + 2y2 − 3z2 − 4xy + 6xz + 6yz can be written as a sum of squares of polynomials.
∗(c) 2x3 − 9x2 y + 3xy2 − y3
(d) 7x3 + 3x2 y + 15xy2
∗(e) x 2 y + y2 z + z 2 x 3.B.4 Compute the dimension of Pnp , the vector space of
(f) 6xyz − x3 − y3 − z3 (non-homogeneous) polynomials of degree at most p in n
∗(g) 2x4 − 8x3 y − 12x2 y2 − 32xy3 − 10y4 variables.
(h) x2 y2
∗∗3.B.5 Show that the polynomial f (x, y) = x2 y2 cannot
3.B.2 Write each of the following vectors from the sym- be written as a sum of 4-th powers of linear forms.
metric subspace Snp as a linear combination of vectors of
the form v⊗p . ∗∗3.B.6 We claimed in Remark 3.B.1 that the symmetric
∗(a) (2, 3, 3, 5) ∈ S22 tensor rank of a vector w ∈ Sn2 equals its usual tensor rank:
(b) (1, 3, −1, 3, −3, 3, −1, 3, 1) ∈ S32 rank(w) = rankS (w).
∗(c) (1, −2, −2, 0, −2, 0, 0, −1) ∈ S23 a) Prove this claim if the ground field is R.
(d) (2, 0, 0, 4, 0, 4, 4, 6, 0, 4, 4, 6, 4, 6, 6, 8) ∈ S24 [Hint: Use the real spectral decomposition.]
b) Prove this claim if the ground field is C.
3.B.3 Determine which of the following statements are [Hint: Use the Takagi factorization of Exer-
true and which are false. cise 2.3.26.]
∗(a) If g is a dehomogenization of a homogeneous poly-
nomial f then the degree of g equals that of f .
(b) If f ∈ HP np (R) is non-zero and positive semidefinite
then p must be even.
3.B Extra Topic: Homogeneous Polynomials 393
∗∗ 3.B.7 In this exercise, we show that the polynomial ∗∗3.B.10 Solve Example 3.B.6(a) “directly”. That is, show
f ∈ HP 44 defined by that the biquadratic form q : R3 × R3 → R defined by
f (w, x, y, z) = w4 + x2 y2 + y2 z2 + z2 x2 − 4wxyz q(x, y) = y21 (x12 + x32 ) + y22 (x22 + x12 ) + y23 (x32 + x22 )
is positive semidefinite, but cannot be written as a sum of − 2x1 x2 y1 y2 − 2x2 x3 y2 y3 − 2x3 x1 y3 y1
squares of polynomials (much like we did for a polynomial
in HP 63 in Example 3.B.4). is positive semidefinite without appealing to Theorem 3.B.3
or the Choi map from Theorem 3.A.7.
a) Show that f is positive semidefinite.
b) Show that f cannot be written as a sum of squares of [Hint: This is hard. One approach is to fix 5 of the 6 variables
polynomials. and use the quadratic formula on the other one.]
∗∗3.B.8 Write the polynomial from Remark 3.B.2 as a ∗∗ 3.B.11 Show that the function defined in Equa-
sum of squares of rational functions. tion (3.B.2) really is an inner product.
[Hint: Multiply and divide the “obvious” PSD decomposi-
tion of f by another copy of x2 + y2 + z2 .] 3.B.12 Let v ∈ (C2 )⊗3 be the vector that we showed has
tensor rank 3 in Example 3.3.4. Show that it also has sym-
metric tensor rank 3.
∗∗3.B.9 Show that a function q : Rm × Rn → R has the
form described by Definition 3.B.2 if and only if there is a [Hint: The decompositions that we provided of the vector
quadrilinear form f : Rm × Rn × Rm × Rn → R such that from Equation (3.3.3) might help.]
q(x, y) = f (x, y, x, y) for all x ∈ Rm , y ∈ Rn .
then A B, since
1 −i
A−B = ,
i 1
which is positive semidefinite (its eigenvalues are 2 and 0).
We prove these This ordering of Hermitian matrices is called the Loewner partial order,
basic properties of
and it shares many of the same properties as the usual ordering on R (in fact,
the Loewner partial
order in for 1 × 1 Hermitian matrices it is the usual ordering on R). For example, if
Exercise 3.C.1. A, B,C ∈ MH n then:
Reflexive: it is the case that A A,
Antisymmetric: if A B and B A then A = B, and
Transitive: if A B and B C then A C.
In fact, these three properties are exactly the defining properties of a partial
order—a function on a set (not necessarily a set of matrices) that behaves like
we would expect something that we call an “ordering” to.
394 Chapter 3. Tensors and Multilinearity
Two matrices However, there is one important property that the ordering on R has that is
A, B ∈ MH n for which missing from the Loewner partial order: If a, b ∈ R then it is necessarily the
A 6 B and B 6 A are
called
case that either a ≥ b or b ≥ a (or both), but the analogous statement about the
non-comparable, Loewner partial order on MH n does not hold. For example, if
and their existence is
why this is called a 1 0 0 0
“partial” order (as A= and B =
opposed to a “total” 0 0 0 1
order like the one on
R). then each of A − B and B − A have −1 as an eigenvalue, so A 6 B and B 6 A.
Recall (from Figure 2.6, for example) that we can think of Hermitian
matrices and positive semidefinite matrices as the “matrix versions” of real
Since we are numbers and non-negative real numbers, and the standard inner product on
working with
Hermitian matrices MH n is the Frobenius inner product hC, Xi = tr(CX) (just like the standard
(i.e., C, X ∈ MHn ), we inner product on Rn is the dot product hc, xi = c · x). Since we now similarly
do not need the think of the Loewner partial order as the “matrix version” of the ordering of
conjugate transpose
real numbers, the following definition hopefully seems like a somewhat natural
in the Frobenius inner
product tr(C∗ X). generalization of linear programs—it just involves replacing every operation
that is specific to R or Rn with its “natural” matrix-valued generalization.
standard form) and its optimal value is the maximal or minimal value that the
One wrinkle that objective function can attain subject to the constraints (i.e., it is the “solution” of
occurs for SDPs that
did not occur for the semidefinite program). A feasible point is a matrix X ∈ MH n that satisfies
linear programs is all of the constraints of the SDP (i.e., Φ(X) B and X O), and the feasible
that the maximum or region is the set consisting of all feasible points.
minimum in a
semidefinite In addition to turning multiple constraints and matrices into a single block-
program might not diagonal matrix and constraint, as in the previous example, we can also use
be attained (i.e., by techniques similar to those that are used for linear programs to transform a
“maximum” we
really mean
wide variety of optimization problems into the standard form of semidefinite
“supremum” and by programs. For example:
“minimum” we really • We can turn a minimization problem into a maximization problem by
mean
“infimum”)—see multiplying the objective function by −1 and then multiplying the result-
Example 3.C.6. ing optimal value by −1, as illustrated in Figure 3.4.
• We can turn an “=” constraint into a pair of “” and “” constraints
All of these (since the Loewner partial order is antisymmetric).
modifications are
completely • We can turn a “” constraint into a “” constraint by multiplying it by
analogous to how −1 and flipping the direction of the inequality (see Exercise 3.C.1(e)).
we can manipulate
inequalities and • We can turn an unconstrained (i.e., not necessarily PSD) matrix variable
equalities involving X into a pair of PSD matrix variables by setting X = X + − X − where
real numbers. X + , X − O (see Exercise 2.2.16).
−tr(CX) tr(CX)
-5 -4 -3 -2 -1 0 1 2 3 4 5
Figure 3.4: Minimizing tr(CX) is essentially the same as maximizing −tr(CX); the final
answers just differ by a minus sign.
Solution:
Since Y is unconstrained, we replace it by the pair of PSD variables
Y + and Y − via Y = Y + −Y − and Y + ,Y − O. Making this change puts
the SDP into the form
Here we used the
fact that minimize: tr(CX) + tr(DY + ) − tr(DY − )
Ψ(Y + −Y − ) = subject to: X + Ψ(Y + ) − Ψ(Y − ) = I
Ψ(Y + ) − Ψ(Y − ), since Φ(X) + Y+ − Y− O
Ψ is linear. +
X, Y , Y− O
After making this final substitution, the original linear program can be
written in primal standard form as follows:
− maximize: tr CeXe
subject to: e Xe Be
Φ
Xe O
minimize: "x #
subject to: xIm A
O (3.C.3)
A∗ xIn
x≥0
However, writing the SDP explicitly in standard form like this is perhaps not
terribly useful—once an optimization problem has been written in a form
involving a linear objective function and only linear entrywise and positive
semidefinite constraints, the fact that it is an SDP (i.e., can be converted into
primal standard form) is usually clear.
While computing the operator norm of a matrix via semidefinite program-
ming is not really a wise choice (it is much quicker to just compute it via the
fact that kAk equals the largest singular value of A), this same technique lets us
use the operator norm in the objective function or in the constraints of SDPs.
For example, if A ∈ Mn then the optimization problem
In words, this SDP
finds the closest (in
the sense of the minimize: kY − Ak
operator norm) PSD subject to: y j, j = a j, j for all 1≤ j≤n (3.C.4)
matrix to A that has Y O
the same diagonal
entries as A.
is a semidefinite program, since it can be written in the form
minimize: "x #
subject to: xIm Y −A
O
Y − A∗ xIn
y j, j = a j, j for all 1≤ j≤n
x, Y 0,
which in turn could be written in the primal standard form of Definition 3.C.1
if desired.
However, we have to be slightly more careful here than with our earlier
manipulations of SDPs—the fact that the optimization problem (3.C.4) is a
semidefinite program relies crucially on the fact that we are minimizing the
norm in the objective function. The analogous maximization problem is not
3.C Extra Topic: Semidefinite Programming 399
Remark 3.C.1 One way to think about the fact that we can use semidefinite programming
Semidefinite Programs to minimize the operator norm, but not maximize it, is via the fact that it is
and Convexity convex:
(1 − t)A + tB
≤ (1 − t)kAk + tkBk for all A, B ∈ Mm,n , 0 ≤ t ≤ 1.
For an introduction Generally speaking, convex functions are easy to minimize since their
to convex functions,
graphs “open up”. It follows that any local minimum that they have is nec-
see Appendix A.5.2.
We can think of essarily a global minimum, so we can minimize them simply by following
minimizing a convex any path on the graph that leads down, as illustrated below:
function just like
rolling a marble y
z
down the side of a
bowl—the marble
z = f (x, y)
never does anything
clever, but it finds y = f (x)
the global minimum
every time anyway. y x
x
as the Frobenius norm (well, its square anyway—see Exercise 3.C.9), the trace
norm (see Exercise 2.3.17 and the upcoming Example 3.C.5), or the maxi-
mum or minimum eigenvalue (see Exercise 3.C.3). However, we must also be
slightly careful about where we place these quantities in a semidefinite program.
Every norm is convex, Norms are all convex, so they can be placed in a “minimize” objective function
but not every norm
or on the smaller half of a “≤” constraint, while the minimum eigenvalue is
can be computed
via semidefinite concave and thus can be placed in a “maximize” objective function or on the
programming. larger half of a “≥” constraint. The possible placements of these quantities are
summarized in Table 3.4.
minimize: λmax (X) maximize: λmin (A − X)
subject to: kA − XkF ≤ 2 subject to: λmax (X) + kXktr ≤ 1
X O X O
are both semidefinite programs, whereas neither of the following two optimiza-
tion problems are:
The left problem is
not an SDP due to
minimize: tr(X) maximize: λmax (A − X)
the norm equality
constraint and the subject to: kA − Xktr = 1 subject to: X A
right problem is not X O X O.
an SDP due to
maximizing λmax . We also note that Table 3.4 is not remotely exhaustive. Many other linear
algebraic quantities, particularly those involving eigenvalues, singular val-
ues, and/or norms, can be incorporated into semidefinite programs—see Exer-
cises 3.C.10 and 3.C.13.
y y−x ≤ 1 y y−x ≤ 1
x + 2y = 11
3 3
(x, y) = (1, 2) x + 2y = 9
2 2
x + 2y = 7
1 1
x + 2y = 5
x x
-1 1 2 3 4 -1 1 2 3 4
x + 2y = 3
-1 -1
z = −3
x+y ≤ 3 x + 2y = −3 x + 2y =x +
−1y ≤ 3x + 2y = 1
Figure 3.5: The feasible region of the linear program that asks us to maximize x + 2y
subject to the constraints x + y ≤ 3, y − x ≤ 1, and x, y ≥ 0. Its optimal value is 5, which
is attained at (x, y) = (1, 2).
The feasible region of this SDP is displayed in Figure 3.6, and its optimal value
is x = 3, which is attained on one of the curved boundaries of the feasible
region (not at one of its corners).
y y
Sylvester’s criterion 3 x(3 − x) ≥ (y − 1)2 3
(Theorem 2.2.6) can
turn any positive 2
semidefinite 2
constraint into a (x, y) = (3, 1)
1 1
family of polynomial
constraints in the x
matrix’s entries. For
x
-1 1 2 3 4 -1 1 2 3 4
example, -1 -1
Theorem 2.2.7 says
x + 1 ≥ y2
that x = −1 x=1 x=3
" #
x+1 y
O
y 1
if and only if
x + 1 ≥ y2 . Figure 3.6: The feasible region of the semidefinite program (3.C.5). Its optimal value
is 3, which is attained at (x, y) = (3, 1). Notably, this point is not a corner of the
feasible region.
402 Chapter 3. Tensors and Multilinearity
This geometric wrinkle that semidefinite programs have over linear pro-
grams has important implications when it comes to solving them. Recall that
If the coefficients in
a linear program are the standard method of solving a linear program is to use the simplex method,
rational then so is its which works by jumping from corner to adjacent corner of the feasible region
optimal value. This so as to increase the objective function by as much as possible at each step.
statement is not true
This method relies crucially on the fact that the optimal value of the linear
of semidefinite
programs. program occurs at a corner of the feasible region, so it does not generalize
straightforwardly to semidefinite programs.
While the simplex method for linear programs always terminates in a finite
number of steps and can produce an exact description of its optimal value, no
The CVX package such algorithm for semidefinite programs is known. There are efficient methods
[CVX12] for MATLAB for numerically approximating the optimal value of an SDP to any desired
and the CVXPY accuracy, but these algorithms are not really practical to run by hand—they
package [DCAB18]
for Python can be
are instead implemented by various computer software packages, and we do
used to solve not explore these methods here. There are entire books devoted to semidefinite
semidefinite programming (and convex optimization in general), so the interested reader
programs, for is directed to [BV09] for a more thorough treatment. We are interested in
example (both
packages are free).
semidefinite programming primarily for its duality theory, which often lets us
find the optimal solution of an SDP analytically.
3.C.3 Duality
Just as was the case for linear programs, semidefinite programs have a robust
duality theory. However, since we do not have simple methods for explicitly
solving semidefinite programs by hand like we do for linear programs, we will
find that duality plays an even more important role in this setting.
minimize: tr(BY )
Four things “flip” subject to: Φ∗ (Y ) C
when constructing a Y O
dual problem:
“maximize”
becomes “minimize”, The original semidefinite program in Definition 3.C.2 is called the pri-
Φ turns into Φ∗ , the mal problem, and the two of them together are called a primal/dual pair.
“” constraint Although constructing the dual of a semidefinite program is a rather routine
becomes “”, and B
and C switch spots. and mechanical affair, keep in mind that the above definition only applies once
the SDP is written in primal standard form (fortunately, we already know how
to convert any SDP into that form). Also, even though we can construct the
adjoint Φ∗ of any linear map Φ : MH H
n → Mm via Corollary 3.A.5, it is often
quicker and easier to just “eyeball” a formula for the adjoint and then check
that it is correct.
3.C Extra Topic: Semidefinite Programming 403
Example 3.C.3 Construct the dual of the semidefinite program (3.C.3) that computes the
Constructing the operator norm of a matrix A ∈ Mm,n .
Dual of an SDP
Solution:
Our first step is to write this SDP in primal standard form, which we
recall can be done by defining
" # " #
O A −xIm O
B= ∗ , C = −1, and Φ(x) = .
A O O −xIn
for all Y ∈ MH H
m and Z ∈ Mn . We can see from inspection that one map
Recall from ∗
(and thus the map) Φ that works is given by the formula
Theorem 1.4.8 that
adjoints are unique.
Y ∗
Φ∗ = −tr(Y ) − tr(Z).
∗ Z
It follows that the dual of the original semidefinite program has the
The minus sign in following form:
front of this
minimization comes " #" #!
from the fact that O A Y X
we had to convert − minimize: tr
the original SDP from A∗ O X∗ Z
" #
a minimization to a subject to: Y X
maximization in O
order to put it into X∗ Z
primal standard form − tr(Y ) − tr(Z) ≥ −1
(this is also why
C = −1 instead of
C = 1). After simplifying things somewhat, this dual problem can be written in
the somewhat prettier (but equivalent) form
maximize: Re tr(AX ∗ ) #
"
subject to: Y −X
∗ O
−X Z
tr(Y ) + tr(Z) ≤ 2
Just as is the case with linear programs, the dual of a semidefinite program
is remarkable for the fact that it can provide us with upper bounds on the
optimal value of the primal problem (and the primal problem similarly provides
lower bounds to the optimal value of the dual problem):
Proof. Since Y and B − Φ(X) are both positive semidefinite, we know from
Exercise 2.2.19 that
Recall that
tr Φ(X)Y = 0 ≤ tr (B − Φ(X))Y = tr(BY ) − tr Φ(X)Y = tr(BY ) − tr XΦ∗ (Y ) .
tr XΦ∗ (Y ) simply
because we are It follows that tr(BY ) ≥ tr XΦ∗ (Y ) . Then using the fact that X and Φ∗ (Y ) −C
working in the are both positive semidefinite, a similar argument shows that
Frobenius inner
product and Φ∗ is 0 ≤ tr X(Φ∗ (Y ) −C) = tr XΦ∗ (Y )) − tr(XC),
(by definition) the
adjoint of Φ in this so tr XΦ∗ (Y )) ≥ tr(XC). Stringing these two inequalities together shows that
inner product.
tr(BY ) ≥ tr XΦ∗ (Y ) ≥ tr(XC) = tr(CX),
as desired.
Weak duality not only provides us with a way of establishing upper bounds
on the optimal value of our semidefinite program, but it often also lets us easily
determine when we have found its optimal value. In particular, if we can find
feasible points of the primal and dual problems that give the same value when
plugged into their respective objective functions, they must be optimal, since
they cannot possibly be increased or decreased past each other (see Figure 3.7).
-2 -1 0 1 2 3 4 5 6 7 8
Figure 3.7: Weak duality says that the objective function of the primal (maxi-
mization) problem cannot be increased past the objective function of the dual
(minimization) problem.
The following example illustrates how we can use this feature of weak
duality to solve semidefinite programs, at least in the case when they are simple
enough that we can spot what we think the solution should be. That is, weak
duality can help us verify that a conjectured optimal solution really is optimal.
Example 3.C.4 Use weak duality to solve the following SDP in the variable X ∈ MH
3:
Solving an SDP via
Weak Duality
0 1 1
maximize: tr 1 0 1 X
1 1 0
subject to: OX I
Solution:
The objective function of this semidefinite program just adds up the
Actually, the off-diagonal entries of X, and the constraint O X I says that every
objective function
eigenvalue of X is between 0 and 1 (see Exercise 3.C.3). Roughly speaking,
adds up the real
parts of the we thus want to find a PSD matrix X with small diagonal entries and large
off-diagonal entries off-diagonal entries. One matrix that seems to work fairly well is
of X.
1 1 1
1
X= 1 1 1 ,
3
1 1 1
3.C Extra Topic: Semidefinite Programming 405
Since the matrix on the right-hand-side of the constraint in this dual SDP
has eigenvalues 2, −1, and −1, we can find a positive semidefinite matrix
Y satisfying that constraint simply by increasing the negative eigenvalues
to 0 (while keeping the corresponding eigenvectors the same). That is, if
In other words, we we have the spectral decomposition
choose Y to be the
positive semidefinite
part of 0 1 1 2 0 0
1 0 1 = U 0 −1 0 U ∗
0 1 1
1 0 1 ,
1 1 0 0 0 −1
1 1 0
then we choose
in the sense of 2 0 0
Exercise 2.2.16.
Explicitly,
Y =U 0 0 0 U ∗ ,
0 0 0
1 1 1
2
Y = 1 1 1 . which has tr(Y ) = 2.
3
1 1 1 Since we have now found feasible points of both the primal and dual
problems that attain the same objective value of 2, we know that this must
in fact be the optimal value of both problems.
As suggested by the previous example, weak duality is useful for the fact
that it can often be used to solve a semidefinite program without actually
performing any optimization at all, as long as the solution is simple enough
that we can eyeball feasible points of each of the primal and dual problems that
attain the same value. We now illustrate how this procedure can be used to give
us new characterizations of linear algebraic objects.
Example 3.C.5 Suppose A ∈ Mm,n . Use weak duality to show that the following semidefi-
Using Weak Duality to nite program in the variables X ∈ MH H
m and Y ∈ Mn computes kAktr (the
Understand an SDP trace norm of A, which was introduced in Exercise 2.3.17):
for the Trace Norm
minimize: (tr(X) +#tr(Y ))/2
"
subject to: X A
O
A∗ Y
X, Y O
Solution:
To start, we append rows or columns consisting entirely of zeros to A
so as to make it square. Doing so does not affect kAktr or any substantial
406 Chapter 3. Tensors and Multilinearity
so (tr(X) + tr(Y ))/2 = kAktr and thus X and Y produce the desired value
in the objective function. Furthermore, X and Y are feasible since they are
positive semidefinite and so is
If A (and thus Σ) is " # " # " √ # " √ #∗
not square then X A UΣU ∗ UΣV ∗ U Σ U Σ
√
either U Σ or V Σ
√ ∗ = = √ ∗ √ ∗ ,
A∗ Y V Σ∗U ∗ V ΣV ∗
V Σ V Σ
needs to be
padded with some √
zero columns for this where Σ is the entrywise square root of Σ.
decomposition to
make sense. To prove the opposite inequality (i.e., to show that this SDP is bounded
below by kAktr ), we first need to construct its dual. To this end, we first
explicitly list the components of its primal standard form (3.C.2):
" # " # " #
Here, C is negative O A −1 Im O X ∗ −X O
B= ∗ , C= , and Φ = .
because the given A O 2 O In ∗ Y O −Y
SDP is a minimization
problem, so we have
to multiply the It is straightforward to show that Φ∗ = Φ, so the dual SDP (after simplify-
objective function ing a bit) has the form
by −1 to turn it into a
maximization
problem (i.e., to put maximize: "Re tr(AZ)
#
it into primal subject to: X Z
standard form). O
Z∗ Y
X, Y I
Next, we find a feasible point of this SDP that attains the desired value
of kAktr in the objective function. If A has the singular value decomposition
Recall that the fact A = UΣV ∗ then the matrix
that Φ = Φ∗ means
that it is called " # " #
self-adjoint. X Z I VU ∗
=
Z∗ Y UV ∗ I
which shows that kAktr is a lower bound on the value of this SDP. Since
we have now proved both bounds, we are done.
3.C Extra Topic: Semidefinite Programming 407
Remark 3.C.2 The previous examples of solving SDPs via duality might seem somewhat
Numerics Combined “cooked up” at first, especially since in Example 3.C.5 we were told what
with Duality Make a the optimal value is, and we just had to verify it. In practice, we are
Powerful Combination typically not told the optimal value of the SDP that we are working with
ahead of time. However, this is actually not a huge restriction, since we
can use computer software to numerically solve the SDP and then use that
solution to help us eyeball the analytic (non-numerical) solution.
To illustrate what we mean by this, consider the following semidefinite
program that maximizes over symmetric matrices X ∈ MS2 :
It might be difficult to see what the optimal value of this SDP is, but
we can get a helpful nudge by using computer software to solve it and
find that the optimal value is approximately 0.7071, which is attained at a
matrix " #
0.8536 0.3536
X≈ .
0.3536 0.1464
√
The optimal value looks like it is probably 1/ 2 = 0.7071 . . ., and to
prove it we just need to find feasible matrices that attain this value in the
objective functions of the primal and dual problems.
Another way to The matrix X that works in the primal problem must have x1,1 + x2,2 ≤
guess the entries of
1 (and
√ the numerics above suggest that equality holds)√ and x1,1 − x2,2 =
X would be to plug
the decimal 1/ 2.√Solving this 2 × 2 linear system gives x1,1 = (2 + 2)/4 and x2,2 =
approximations of its (2
√ − 2)/4, and the constraint x1,1 − x1,2 = 1/2 then tells us that x1,2 =
entries into a tool like
the Inverse Symbolic 2/4. We thus guess that the maximum value is obtained at the matrix
Calculator [BBP95]. " √ √ #
1 2+ 2 2
X= √ √ ,
4 2 2− 2
Similar to before, we can solve this SDP numerically to find that its optimal
value is also approximately 0.7071, and is attained when
In the previous two examples, we saw that not only did the dual problem
serve as an upper bound on the primal problem, but rather we were able to
find particular feasible points of each problem that resulted in their objective
functions taking on the same value, thus proving optimality. The following
theorem shows that this phenomenon occurs with a great deal of generality—
there are simple-to-check conditions that guarantee that there must be feasible
points in each of the primal and dual problems that attain the optimal value in
their objective functions.
Recall that X O Before stating what these conditions are, we need one more piece of ter-
means that X is
positive definite (i.e., minology: we say that an SDP in primal standard form (3.C.2) is feasible if
PSD and invertible) there exists a matrix X ∈ MH n satisfying all of its constraints (i.e., X O and
and Φ(X) ≺ B means Φ(X) B), and we say that it is strictly feasible if X can be chosen to make
that B − Φ(X) is
both of those inequalities strict (i.e., X O and Φ(X) ≺ B). Feasibility and
positive definite.
strict feasibility for problems written in the dual form of Definition 3.C.2 are
defined analogously. Geometrically, strict feasibility of an SDP means that its
feasible region has an interior—it is not a degenerate lower-dimensional shape
that consists only of boundaries and edges.
Theorem 3.C.2 Suppose that both problems in a primal/dual pair of SDPs are feasible, and
Strong Duality at least one of them is strictly feasible. Then the optimal values of those
SDPs coincide. Furthermore,
The conditions in this a) if the primal problem is strictly feasible then the optimal value is
theorem are
attained in the dual problem, and
sometimes called
the Slater conditions b) if the dual problem is strictly feasible then the optimal value is
for strong duality. attained in the primal problem.
There are other
(somewhat more
technical) Since the proof of this result relies on some facts about convex sets that
conditions that we have not explicitly introduced in the main body of this text, we leave it
guarantee that a
to Appendix B.3. However, it is worth presenting some examples to illustrate
primal/dual pair
share their optimal what the various parts of the theorem mean and why they are important. To
value as well. start, it is worthwhile to demonstrate what we mean when we say that strict
feasibility implies that the optimal value is “attained” by the other problem in a
primal/dual pair.
Example 3.C.6 Show that no feasible point of the following SDP attains its optimal value:
An SDP That
Does Not Attain minimize: x
Its Optimal Value subject to: x 1
O
1 y
x, y ≥ 0.
Solution:
Be careful: even if Recall from Theorem 2.2.7 that the matrix in this SDP is positive
the strong duality
semidefinite if and only if xy ≥ 1. In particular, this means that (x, y) =
conditions of
Theorem 3.C.2 hold (x, 1/x) is a feasible point of this SDP for all x > 0, so certainly the optimal
(i.e., the primal/dual value of this SDP cannot be bigger than 0. However, no feasible point has
SDPs are strictly x = 0, since we would then have xy = 0 6≥ 1. It follows that the optimal
feasible) and the
optimal value is
value of this SDP is 0, but no feasible point attains that value—they just
attained, that does get arbitrarily close to it.
not mean that it is The fact that the optimal value of this SDP is not attained does not
attained at a strictly
feasible point. contradict Theorem 3.C.2 since the dual of this SDP must not be strictly
feasible. To verify this claim, we compute the dual SDP to have the
following form (after simplifying somewhat):
maximize: Re(z)
subject to: x −z 2 0
O
−z y 0 0
Recall from The constraint in the above SDP forces y = 0, so there is no positive
Theorem 2.2.4 that
definite matrix that satisfies it (i.e., this dual SDP is not strictly feasible).
positive definite
matrices have strictly
positive diagonal There are also conditions other than those of Theorem 3.C.2 that can be
entries. used to guarantee that the optimal value of an SDP is attained or its dual
has the same optimal value. For example, the Extreme Value Theorem from
real analysis says that every continuous function on a closed and bounded
set necessarily attains it maximum and minimum values. Since the objective
function of every SDP is linear (and thus continuous), and the feasible set of
every SDP is closed, we obtain the following criterion:
On the other hand, notice that the feasible region of the SDP from Exam-
ple 3.C.6 is unbounded, since we can make x and y as large as we like in it (see
Figure 3.8).
y xy ≥ 1 x=0 x=2 x=4
3 3
2 2
1 1
x x
-1 1 2 3 4 -1 1 2 3 4
-1 -1
Figure 3.8: The feasible region of the semidefinite program from Example 3.C.6.
The fact that the optimal value of this SDP is not attained corresponds to the fact
that its feasible region gets arbitrarily close to the line x = 0 but does not contain
any point on it.
In the previous example, even though the dual SDP was not strictly feasible,
the primal SDP was, so the optimal values of both problems were still forced
410 Chapter 3. Tensors and Multilinearity
to equal each other by Theorem 3.C.2 (despite the optimal value not actually
being attained in the primal problem). We now present an example that shows
that it is also possible for neither SDP to be strictly feasible, and thus for their
optimal values to differ.
Example 3.C.7 Show that the following primal/dual SDPs, which optimize over the vari-
A Primal/Dual SDP ables X ∈ MS3 in the primal and y, z ∈ R in the dual, have different optimal
Pair with Unequal values:
Optimal Values
Primal Dual
maximize: −x2,2 minimize: y
subject to: 2x1,3 + x2,2 = 1 subject to: 0 0 y
x3,3 = 0 0 y+1 0 O
Verify on your own
X O y 0 z
that these problems
are indeed duals of
each other. Furthermore, explain why this phenomenon does not contradict strong
duality (Theorem 3.C.2).
Solution:
In the primal problem, the fact that x3,3 = 0 and X O forces x1,3 = 0
as well. The first constraint then says that x2,2 = 1, so the optimal value of
the primal problem is −1.
In the dual problem, the fact that the top-left entry of the 3 × 3 PSD
Recall from matrix equals 0 forces y = 0, so the optimal value of this dual problem
Exercise 2.2.11 that if
equals 0.
a diagonal entry of
a PSD matrix equals Since these two optimal values are not equal to each other, we know
0 then so does that from Theorem 3.C.2 that neither problem is strictly feasible. Indeed, the
entire row and
column.
primal problem is not strictly feasible since the constraint x3,3 = 0 ensures
that the entries in the final row and column of X all equal 0 (so X cannot
be invertible), and the dual problem is similarly not strictly feasible since
the 3 × 3 PSD matrix must have every entry in its first row and column
equal to 0.
-2 -1 0 1 2 3 4 5 6 7 8
Figure 3.9: Strong duality says that, for many semidefinite programs, the objective
function of the primal (maximization) problem can be increased to the exact
same value that the objective function of the dual (minimization) problem can be
decreased to.
quite straightforward. For example, to see that the SDP from Example 3.C.5 is
strictly feasible, we just need to observe that we can choose X and Y to each be
suitably large multiples of the identity matrix.
Example 3.C.8 Show that the following semidefinite program in the variables Y ∈ MHm,
Using Strong Duality Z ∈ MHn , and X ∈ M m,n (C) computes the operator norm of the matrix
to Solve an SDP A ∈ Mm,n (C):
maximize: Re tr(AX ∗ ) #
"
subject to: Y −X
∗ O
−X Z
tr(Y ) + tr(Z) ≤ 2
Solution:
Recall from Example 3.C.3 that this is the dual of the SDP (3.C.3) that
computes kAk. It follows that the optimal value of this SDP is certainly
bounded above by kAk, so there are two ways we could proceed:
• we could find a feasible point of this SDP that attains the conjectured
optimal value kAk, and then note that the true optimal value must
We find a feasible be kAk by weak duality, or
point attaining this
optimal value in • we could show that the primal SDP (3.C.3) is strictly feasible, so
Exercise 3.C.8. this dual SDP must attain its optimal value, which is the same as the
optimal value kAk of the primal SDP by strong duality.
We opt for the latter method—we show that the primal SDP (3.C.3) is
strictly feasible. To this end, we just note that if x is really, really large (in
When proving strong particular, larger than kAk) then x > 0 and
duality, it is often
useful to just let the " #
variables be very big. xIm A
O,
Try not to get hung A∗ xIn
up on how big is big
enough to make
everything positive so the primal SDP (3.C.3) is strictly feasible and strong duality holds.
definite, as long as it We do not need to, but we can also show that the dual SDP that we
is clear that there is
a big enough started with is strictly feasible by choosing
choice that works.
Y = Im /m, Z = In /n, and X = O.
Remark 3.C.3 Just like linear programs, semidefinite programs can be infeasible (i.e.,
Unbounded and have an empty feasible region) or unbounded (i.e., have an objective
Feasibility SDPs function that can be made arbitrarily large). If a maximization problem is
The corresponding unbounded then we say that its optimal value is ∞, and if it is infeasible
table for linear then we say that its optimal value is −∞ (and of course these two quantities
programs is identical, are swapped for minimization problems).
except with a blank
in the Weak duality immediately tells us that if an SDP is unbounded then
“infeasible/solvable” its dual must be infeasible, which leads to the following possible infeasi-
cells. There exists an ble/unbounded/solvable (i.e., finite optimal value) pairings that primal and
infeasible SDP with
solvable dual (see dual problems can share:
Exercise 3.C.4), but
no such pair of LPs
exists.
412 Chapter 3. Tensors and Multilinearity
Primal problem
Infeasible Solvable Unbounded
Infeasible X X X
Dual
Solvable X X ·
Unbounded X · ·
maximize: 0
subject to: x j, j = 3 for all 1 ≤ j ≤ 5
xi, j = −2 for {i, j} = {1, 4}, {2, 4}, {4, 5}
X O
If such a PSD matrix exists, this SDP has optimal value 0, otherwise it is
infeasible and thus has optimal value −∞.
∗∗3.C.5 Suppose A, B ∈ MH [Hint: Solve Exercise 3.C.9 first, which computes σ12 + · · · +
n are positive semidefinite
and A B. σr2 , and maybe make use of Exercise 3.C.5.]
(a) Provide an example to show that it is not necessarily [Side note: A similar method can be used to construct
the case that A2 B2 . semidefinite programs that compute σ1p +· · ·+σrp whenever
(b) Show that, in spite of part (a), it is the case that p is an integer power of 2.]
1/p
tr(A2 ) ≥ tr(B2 ). [Side note: The quantity σ1p + · · · + σrp is sometimes
[Hint: Factor a difference of squares.] called the Schatten p-norm of A.]
3.C.6 Suppose A, B ∈ MH n are positive semidefinite and 3.C.11 Let A ∈ MH n be positive semidefinite. Show
A B. that the following semidefinite
√ √ √ program in the variable
(a) Show that if A is positive definite then A B. X ∈ Mn (C) computes tr A (i.e., the sum of the square
√ √ −1 roots of the eigenvalues of A):
[Hint:√ Use the fact that B A and
A−1/4 BA−1/4 are similar to each other, where
p√ −1 maximize: "Re tr(X)#
A−1/4 = A .] subject to: I X
(b) √Use the√techniques from Section 2.D.3 to show that O
A B even when A is just positive semidefinite. X∗ A
You may use the fact that the principal square root
function is continuous on the set of positive semidef- [Hint: Use duality or Exercises 2.2.18 and 3.C.6.]
inite matrices.
∗∗3.C.12 Let C ∈ MH n and let k be a positive integer such
∗∗3.C.7 Use computer software to solve the SDP from that 1 ≤ k ≤ n. Consider the following semidefinite program
Remark 3.C.3 and thus fill in the missing entries in the in the variable X ∈ MH n:
matrix
3 ∗ ∗ −2 ∗ maximize: tr(CX)
∗ 3 ∗ −2 ∗ subject to: tr(X) = k
X I
∗ ∗ 3 ∗ ∗
X O
−2 −2 ∗ 3 −2
∗ ∗ ∗ −2 3 (a) Construct the dual of this semidefinite program.
(b) Show that the optimal value of this semidefinite pro-
so as to make it positive semidefinite.
gram is the sum of the k largest eigenvalues of C.
[Hint: Try to “eyeball” optimal solutions for the pri-
∗∗3.C.8 Find a feasible point of the semidefinite program mal and dual problems.]
from Example 3.C.8 that attains its optimal value kAk.
3.C.13 Let A ∈ Mm,n (C). Construct a semidefinite pro-
∗∗ 3.C.9 Let A ∈ Mm,n (C). Show that the following gram for computing the sum of the k largest singular values
semidefinite program in the variable X ∈ MH
n computes of A.
kAk2F :
[Side note: This sum is called the Ky Fan k-norm of A.]
minimize: "tr(X) # [Hint: The results of Exercises 2.3.14 and 3.C.12 may be
subject to: Im A helpful.]
O
A∗ X
X O ∗∗ 3.C.14 Recall that a linear map Φ : Mn → Mm is
called decomposable if there exist completely positive lin-
[Hint: Either use duality and mimic Example 3.C.5, or use ear maps Ψ1 , Ψ2 : Mn → Mm such that Φ = Ψ1 + T ◦ Ψ2 ,
the Schur complement from Section 2.B.1.] where T is the transpose map.
Construct a semidefinite program that determines whether
3.C.10 Let A ∈ Mm,n (C). Show that the following or not a given matrix-valued linear map Φ is decomposable.
semidefinite program in the variables X,Y ∈ MH n computes
σ14 + · · · + σr4 , where σ1 , . . ., σr are the non-zero singular ∗∗3.C.15 Use computer software and the semidefinite pro-
values of A: gram from Exercise 3.C.14 to show that the map Φ : M2 →
M4 from Exercise 3.A.21 is decomposable.
minimize: "tr(Y ) # " #
subject to: Im A I X
, n O 3.C.16 Use computer software and the semidefinite pro-
A∗ X X Y
gram from Exercise 3.C.14 to show that the Choi map Φ
X, Y O from Theorem 3.A.7 is not decomposable.
[Side note: We already demonstrated this claim “directly”
in the proof of that theorem.]
414 Chapter 3. Tensors and Multilinearity
∗∗3.C.17 Use computer software and the semidefinite pro- Construct a semidefinite program that determines whether
gram from Exercise 3.C.14 to show that the map Φ : M2 → or not q can be written as a sum of squares of bilinear forms.
M4 from Exercise 3.A.22 is not decomposable. [Hint: Instead of checking q(x, y) = xT Φ(yyT )x for all x
[Side note: This map Φ is positive, so it serves as an example and y, it suffices to choose finitely many vectors {xi } and
to show that Theorem 3.A.8 does not hold when mn = 8.] {y j } so that the sets {xi xTi } and {y j yTj } span the set of
symmetric matrices.]
Here we review some of the basics of linear algebra that we expect the reader to
be familiar with throughout the main text. We present some of the key results of
introductory linear algebra, but we do not present any proofs or much context.
For a more thorough presentation of these results and concepts, the reader is
directed to an introductory linear algebra textbook like [Joh20].
y y y
Av
A−1 Av = v
v A A−1
−−→ −−−→
x x x
The rank and nullity of A are the dimensions of its range and null space,
respectively. The following theorem summarizes many of the important proper-
ties of the range, null space, rank, and nullity that are typically encountered in
introductory linear algebra courses.
We saw in Theorem A.1.1 that square matrices with maximal rank are
exactly the ones that are invertible. Intuitively, we can think of the rank of
a matrix as a rough measure of “how close to invertible” it is, so matrices
with small rank are in some sense the “least invertible” matrices out there. Of
A.1 Review of Introductory Linear Algebra 419
particular interest are matrices with rank 1, which are exactly the ones that can
be written in the form vwT , where v and w are non-zero column vectors. More
generally, a rank-r matrix can be written as a sum of r matrices of this form,
but not fewer:
Theorem A.1.3 Suppose A ∈ Mm,n . Then the smallest integer r for which there exist sets
Rank-One Sum of vectors {v j }rj=1 ⊂ Rm and {w j }rj=1 ⊂ Rn with
Decomposition
r
y y
e2
A
−−→ Ae2
x Ae1 x
e1
Figure A.3: A 2 × 2 matrix A stretches the unit square (with sides e1 and e2 ) into a
parallelogram with sides Ae1 and Ae2 (the columns of A). The determinant of A is
the area of this parallelogram.
Theorem A.1.4 Suppose A ∈ Mn . Then det(A) = ∑ sgn(σ )aσ (1),1 aσ (2),2 · · · aσ (n),n .
Determinants via σ ∈Sn
Permutations
This theorem is typically not used for actual calculations in practice, as
much faster methods of computing the determinant are known. However, this
formula is useful for the fact that we can use it to quickly derive several
useful properties of the determinant. For example, swapping two columns of a
matrix has the effect of swapping the sign of each permutation in the sum in
Theorem A.1.4, which leads to the following fact:
We can replace
! If B ∈ Mn is obtained from A ∈ Mn by swapping two of its
columns with rows in
columns then det(B) = − det(A).
both of these facts,
since det(AT ) = det(A)
for all A ∈ Mn . Similarly, the determinant is multilinear in the columns of the matrix it acts
on: it acts linearly on each column individually, as long as all other columns are
fixed. That is, much like we can “split up” linear transformations over vector
addition and scalar multiplication, we can similarly split up the determinant
These properties of over vector addition and scalar multiplication in a single column of a matrix:
the determinant
can be derived
from its geometric ! For all matrices A = [ a1 | · · · | an ] ∈ Mn , all v, w ∈ Rn , and
interpretation as well, all scalars c ∈ R, it is the case that
but it’s somewhat
messy to do so. det [ a1 | · · · | v + cw | · · · | an ]
= det [ a1 | · · · | v | · · · | an ] + c · det [ a1 | · · · | w | · · · | an ] .
y y
Av2 = 3v2
This matrix does " #
change the 1 2
A=
direction of any v1 = (−1, 1) v2 = (1, 1) 2 1
vector that is not on −−−−−−−−→
one of the two lines
displayed here. x x
Av1 = −v1
Figure A.4: Matrices do not change the line on which any of their eigenvectors lie,
but rather just scale them by the corresponding eigenvalue. The matrix displayed
here has eigenvectors v1 = (−1, 1) and v2 = (1, 1) with corresponding eigenvalues
−1 and 3, respectively.
λ 2 − 5λ − 6 = 0 ⇐⇒ (λ + 1)(λ − 6) = 0
⇐⇒ λ = −1 or λ = 6,
422 Appendix A. Mathematical Preliminaries
2v1 + 2v2 = 0
.
5v1 + 5v2 = 0
A.1.7 Diagonalization
One of the primary reasons that eigenvalues and eigenvectors are of interest
is that they let us diagonalize a matrix. That is, they give us a way of de-
composing a matrix A ∈ Mn into the form A = PDP−1 , where P is invertible
and D is diagonal. If the entries of P and D can be chosen to be real, we say
that A is diagonalizable over R. However, some real matrices A can only be
diagonalized if we allow P and D to have complex entries (see the upcoming
discussion of complex numbers in Appendix A.3). In that case, we say that A
is diagonalizable over C (but not over R).
A.1 Review of Introductory Linear Algebra 423
To get a feeling for why diagonalizations are useful, notice that computing
a large power of a matrix directly is quite cumbersome, as matrix multiplication
itself is an onerous process, and repeating it numerous times only makes it
worse. However, once we have diagonalized a matrix we can compute an
arbitrary power of it via just two matrix multiplications, since
this diagonalization:
" #" #" #
314 314 −1 1 −1 2 (−1)314 0 −5 2
Since 314 is even, A = PD P =
7 1 5 0 6314 1 1
(−1)314 = 1.
" #
1 5 + 2 · 6314 −2 + 2 · 6314
= .
7 −5 + 5 · 6314 2 + 5 · 6314
Now that we have these quantities, we can just plug them into the binomial
theorem (but we are careful to replace y in the theorem with 2y):
As a strange edge
4
case, we note that 4
0! = 1. (x + 2y)4 = ∑ k x4−k (2y)k
k=0
= x4 (2y)0 + 4x3 (2y)1 + 6x2 (2y)2 + 4x1 (2y)3 + x0 (2y)4
= x4 + 8x3 y + 24x2 y2 + 32xy3 + 16y4 .
we find that the number of times that a particular term x1k1 x2k2 · · · xnkn occurs
Each of the p factors equals the number of ways that we can choose x1 from the p factors a total of
must have some x j
k1 times, x2 from the p factors a total of k2 times, and so on. This quantity is
chosen from it, so
k1 + k2 + · · · + kn = p. called a multinomial coefficient, and it is given by
p p!
= .
k1 , k2 , . . . , kn k1! k2 ! · · · kn !
Now that we have these quantities, we can just plug them into the multino-
Expanding out mial theorem (but we replace y in the theorem with 2y and z with 3z):
(x1 + · · · + xn ) p results
in n+p−1 terms in
p
3 3
general (this fact (x + 2y + 3z) = ∑ xk (2y)` (3z)m
can be proved using k+`+m=3 k, `, m
the method of
Remark 3.1.2). In this = 6x1 (2y)1 (3z)1
example we have + x3 (2y)0 (3z)0 + x0 (2y)3 (3z)0 + x0 (2y)0 (3z)3
n = p = 3, so we get
5
3 = 10 terms. + 3x0 (2y)1 (3z)2 + 3x1 (2y)0 (3z)2 + 3x1 (2y)2 (3z)0
+ 3x0 (2y)2 (3z)1 + 3x2 (2y)0 (3z)1 + 3x2 (2y)1 (3z)0
= 36xyz + x3 + 8y3 + 27z3
+ 54yz2 + 27xz2 + 12xy2 + 36y2 z + 9x2 z + 6x2 y.
We sometimes say is called its degree-p Taylor polynomial, and it has the form
that these Taylor
polynomials are p
centered at a. f (n) (a)
Tp (x) = ∑ (x − a)n
n=0 n!
f 00 (a) f (p) (a)
= f (a) + f 0 (a)(x − a) + (x − a)2 + · · · + (x − a) p ,
Be somewhat 2 p!
careful with the n = 0
term: (x − a)0 = 1 for
as long as f has p derivatives (i.e., f (p) exists).
all x.
For example, the lowest-degree polynomials that best approximate f (x) =
sin(x) near x = 0 are
T1 (x) = T2 (x) = x,
x3
T3 (x) = T4 (x) = x − ,
3!
x3 x5
T5 (x) = T6 (x) = x − + , and
3! 5!
x3 x5 x7
T7 (x) = T8 (x) = x − + − ,
3! 5! 7!
which are graphed in Figure A.5. In particular, notice that T1 is simply the
tangent line at the point (0, 0), and T3 , T5 , and T7 provide better and better
approximations of f (x) = sin(x).
y y = T1 (x) y = T5 (x)
In general, the
graph of T0 is a
horizontal line going
through (a, f (a)) and y = sin(x)
the graph of T1 is the x
-6 -5 -4 -3 -2 -1 1 2 3 4 5 6
tangent line going
through (a, f (a)).
y = T3 (x) y = T7 (x)
Figure A.5: The graphs of the first few Taylor polynomials of f (x) = sin(x). As the
degree of the Taylor polynomial increases, the approximation gets better.
To get a rough feeling for why Taylor polynomials provide the best approxi-
mation of a function near a point, notice that the first p derivatives of Tp (x) and
of f (x) agree with each other at x = a, so the local behavior of these functions
is very similar. The following theorem pins this idea down more precisely and
says that the difference between f (x) and Tp (x) behaves like (x − a) p+1 , which
is smaller than any degree-p polynomial when x is sufficiently close to a:
f (p+1) (c)
f (x) − Tp (x) = (x − a) p+1 .
(p + 1)!
f (p+1) (x) does not get too large near x = a, this limit converges to f (x) near
x = a, so we have
∞
f (n) (a)
f (x) = lim Tp (x) = ∑ (x − a)n
p→∞
n=0 n!
f 00 (a) f 000 (a)
= f (a) + f 0 (a)(x − a) + (x − a)2 + (x − a)3 + · · ·
2 3!
For example, ex , cos(x), and sin(x) have the following Taylor series representa-
tions, which converge for all x ∈ R:
∞ n
x x2 x3 x4 x5
ex = ∑ = 1+x+ + + + +··· ,
n=0 n! 2 3! 4! 5!
∞
(−1)n x2n x2 x4 x6 x8
cos(x) = ∑ = 1 − + − + −··· , and
n=0 (2n)! 2 4! 6! 8!
∞
(−1)n x2n+1 x3 x5 x7 x9
sin(x) = ∑ = x− + − + −···
n=0 (2n + 1)! 3! 5! 7! 9!
Many tasks in advanced linear algebra are based on the concept of eigenvalues
and eigenvectors, and thus require us to be able to find roots of polynomials
(after all, the eigenvalues of a matrix are exactly the roots of its characteris-
tic polynomial). Because many polynomials do not have real roots, much of
linear algebra works out much more cleanly if we instead work with “com-
plex” numbers—a more general type of number with the property that every
polynomial has complex roots (see the upcoming Theorem A.3.1).
To construct the complex numbers, we start by letting i be an object with
the property that i2 = −1. It is clear that i cannot be a real number, but we
nonetheless think of it like a number anyway, as we will see that we can
manipulate it much like we manipulate real numbers. We call any real scalar
The term multiple of i like 2i or −(7/3)i an imaginary number, and they obey the
“imaginary” number
same laws of arithmetic that we might expect them to (e.g., 2i + 3i = 5i and
is absolutely awful.
These numbers are (3i)2 = 32 i2 = −9).
no more We then let C, the set of complex numbers, be the set
make-believe than
def
real numbers C = a + bi : a, b ∈ R
are—they are both
purely mathematical in which addition and multiplication work exactly as they do for real numbers,
constructions and
they are both useful.
as long as we keep in mind that i2 = −1. We call a the real part of a + bi, and
it is sometimes convenient to denote it by Re(a + bi) = a. We similarly call b
its imaginary part and denote it by Im(a + bi) = b.
Remark A.3.1 It might seem extremely strange at first that we can just define a new
Yes, We Can number i and start doing arithmetic with it. However, this is perfectly fine,
Do That and we do this type of thing all the time—one of the beautiful things about
mathematics is that we can define whatever we like. However, for that
definition to actually be useful, it should mesh well with other definitions
and objects that we use.
Complex numbers are useful because they let us do certain things that
A.3 Complex Numbers 429
1 = ε × 0 = ε × (0 + 0) = (ε × 0) + (ε × 0) = 1 + 1 = 2.
We thus cannot work with such a number without breaking at least one of
the usual laws of arithmetic.
Theorem A.3.1 Every non-constant polynomial has at least one complex root.
Fundamental Theorem
of Algebra Equivalently, the fundamental theorem of algebra tells us that every poly-
nomial can be factored as a product of linear terms, as long as we allow the
roots/factors to be complex numbers. That is, every degree-n polynomial p can
be written in the form
For example,
Im
4 + 3i
−2 + 2i
a + bi
b
Re
a
−1 − i 3−i
Figure A.6: The complex plane is a representation of the set C of complex numbers.
Im
b a + bi
Re
a
a + bi = a − bi
Figure A.7: Complex conjugation reflects a complex number through the real axis.
• The previous point tells us that we can multiply any complex number
by another one (its complex conjugate) to get a real number. We can
make use of this trick to come up with a method of dividing by complex
numbers:
In the first step here,
we just cleverly a + bi a + bi c − di
multiply by 1 so as to =
c + di c + di c − di
make the
denominator real. (ac + bd) + (bc − ad)i ac + bd bc − ad
= = + i.
c2 + d 2 c2 + d 2 c2 + d 2
Figure A.8: Every number on the unit circle in the complex plane can be written in
the form cos(θ ) + i sin(θ ) for some θ ∈ [0, 2π).
x2 x3 x4 x5
ex = 1 + x + + + + +··· ,
2 3! 4! 5!
x2 x4
cos(x) = 1 − + − · · · , and
2! 4!
x3 x5
sin(x) = x − + ··· .
3! 5!
Just like these Taylor In particular, if we plug x = iθ into the Taylor series for ex , then we see that
series converge for
all x ∈ R, they also
θ2 θ3 θ4 θ5
converge for all x ∈ C. eiθ = 1 + iθ − −i + +i −···
2 3! 4! 5!
θ2 θ4 θ3 θ5
= 1− + −··· +i θ − + +··· ,
2 4! 3! 5!
which equals cos(θ ) + i sin(θ ). We have thus proved the remarkable fact, called
In other words, Euler’s formula, that
Re(eiθ ) = cos(θ ) and
Im(eiθ ) = sin(θ ). eiθ = cos(θ ) + i sin(θ ) for all θ ∈ [0, 2π).
By making use of Euler’s formula, we see that we can write every complex
number z ∈ C in the form z = reiθ , where r is the magnitude of z (i.e., r = |z|)
and θ is the angle that z makes with the positive real axis (see Figure A.9). This
is called the polar form of z, and we can convert back and forth between the
polar form z = reiθ and its Cartesian form z = a + bi via the formulas
In the formula for θ ,
p
sign(b) = ±1, a = r cos(θ ) r= a2 + b2
depending on a
whether b is positive b = r sin(θ ) θ = sign(b) arccos √ .
or negative. If b < 0 a2 + b2
then we get
−π < θ < 0, which There is no simple way to “directly” add two complex numbers that are
we can put in the in polar form, but multiplication is quite straightforward: (r1 eiθ1 )(r2 eiθ2 ) =
interval [0, 2π) by (r1 r2 )ei(θ1 +θ2 ) . We can thus think of complex numbers as stretched rotations—
adding 2π to it.
multiplying by reiθ stretches numbers in the complex plane by a factor of r and
rotates them counter-clockwise by an angle of θ .
Because the polar form of complex numbers works so well with multipli-
cation, we can use it to easily compute powers and roots. Indeed, repeatedly
A.3 Complex Numbers 433
Im
a + bi = reiθ
b
eiθ
θ
Re
a
Figure A.9: Every complex number can be written in Cartesian form a + bi and also
in polar form reiθ .
You should try to
convince yourself multiplying a complex number in polar form by itself gives (reiθ )n = rn einθ for
that raising any of all positive integers n ≥ 1. We thus see that every non-zero complex number
these numbers to has at least n distinct n-th roots (and in fact, exactly n distinct n-th roots). In
the n-th power results
in reiθ . Use the fact particular, the n roots of z = reiθ are
that e2πi = e0i = 1.
r1/n eiθ /n , r1/n ei(θ +2π)/n , r1/n ei(θ +4π)/n , ..., r1/n ei(θ +2(n−1)π)/n .
Im
eπ i/2 = i
√ √
e5πi/6 = (− 3 + i)/2 eπi/6 = ( 3 + i)/2
π /6
Re
-2 2
e9πi/6 = −i
This definition of
principal roots
requires us to make Among the n distinct n-th roots of a complex number z = reiθ , we call
sure that θ ∈ [0, 2π).
434 Appendix A. Mathematical Preliminaries
r1/n eiθ /n its principal n-th root, which we denote by z1/n . The principal root
of a complex number is the one with the smallest angle so that, for example, if
z is a positive real number (i.e., θ = 0) then its principal roots are positive real
numbers as well. Similarly, the principal square root of a complex number is the
one in the upper half of the complex plane (for example, the principal square
root of −1 is i, not −i), and we showed in √ Example A.3.1 that the principal
cube root of z = eπi/2 = i is z1/3 = eπi/6 = ( 3 + i)/2.
work exactly the same as they do for complex numbers. We can thus think
In the language of of the complex number a + bi as “the same thing” as the 2 × 2 matrix
Section 1.3.1, the set
of 2 × 2 matrices of aI + bJ.
this special form is This representation of complex numbers perhaps makes it clearer why
“isomorphic” to C.
However, they are
they act like rotations when multiplying by them: we can rewrite aI + bJ
not just isomorphic as " # " #
as vector spaces, a −b p cos(θ ) − sin(θ )
but even as fields aI + bJ = = a2 + b2 ,
(i.e., the field
b a sin(θ ) cos(θ )
operations of √
Definition A.4.1 are where we recognize a2 + b2 as the magnitude of the complex number
preserved). a + bi, and the matrix on the right as the one that rotates R2√counter-
2 2
√ of θ (here, θ is chosen so that cos(θ ) = a/ a + b
clockwise by an angle
2
and sin(θ ) = b/ a + b ).2
A.4 Fields
Introductory linear algebra is usually carried out via vectors and matrices made
up of real numbers. However, most of its results and methods carry over just
fine if we instead make use of complex numbers, for example (in fact, much of
linear algebra works out better when using complex numbers rather than real
numbers).
This raises the question of what sets of “numbers” we can make use of in
linear algebra. In fact, it even raises the question of what a “number” is in the
first place. We saw in Remark A.3.2 that we can think of complex numbers
as certain special 2 × 2 matrices, so it seems natural to wonder why we think
The upcoming of them as “numbers” even though we do not think of general matrices in the
definition generalizes
R in much the same same way.
way that This appendix answers these questions. We say that a field is a set in which
Definition 1.1.1
generalizes Rn .
addition and multiplication behave in the same way that they do in the set of
A.4 Fields 435
Definition A.4.1 Let F be a set with two operations called addition and multiplication.
Field We write the addition of a ∈ F and b ∈ F as a + b, and the multiplication
of a and b as ab.
If the following conditions hold for all a, b, c ∈ F then F is called a field:
a) a+b ∈ F (closure under addition)
b) a+b = b+a (commutativity of addition)
c) (a + b) + c = a + (b + c) (associativity of addition)
d) There is a zero element 0 ∈ F such that 0 + a = a.
e) There is an additive inverse −a ∈ F such that a + (−a) = 0.
Notice that the first f) ab ∈ F (closure under multiplication)
five properties
g) ab = ba (commutativity of multiplication)
concern addition,
the next five h) (ab)c = a(bc) (associativity of multiplication)
properties concern i) There is a unit element 1 ∈ F such that 1a = a.
multiplication, and j) If a 6= 0, there is a multiplicative inverse 1/a ∈ F such that
the final property
combines them.
a(1/a) = 1.
k) a(b + c) = (ab) + (bc) (distributivity)
The sets R and C of real and complex numbers are of course fields (we
defined fields specifically so as to mimic R and C, after all). Similarly, it is
straightforward to show that the set Q of rational numbers—real numbers of
the form p/q where p and q 6= 0 are integers—is a field. After all, if a = p1 /q1
and b = p2 /q2 are rational then so are a + b = (p1 q2 + p2 q1 )/(q1 q2 ) and
This is analogous to ab = (p1 p2 )/(q1 q2 ), and all other field properties follow immediately from the
how we only need
to prove closure to corresponding facts about real numbers.
show that a Another less obvious example of a field is the set Z2 of numbers with arith-
subspace is a vector
space
metic modulo 2. That is, Z2 is simply the set {0, 1}, but with the understanding
(Theorem 1.1.2)—all that addition and multiplication of these numbers work as follows:
other properties are
inherited from the 0+0 = 0 0+1 = 1 1+0 = 1 1 + 1 = 0, and
parent structure.
0×0 = 0 0×1 = 0 1×0 = 0 1 × 1 = 1.
In other words, these operations work just as they normally do, with the
exception that 1 + 1 = 0 instead of 1 + 1 = 2 (since there is no “2” in Z2 ).
Phrased differently, we can think of this as a simplified form of arithmetic that
only keeps track of whether or not a number is even or odd (with 0 for even
and 1 for odd). All field properties are straightforward to show for Z2 , as long
as we understand that −1 = 1.
More generally, if p is a prime number then the set Z p = {0, 1, 2, . . . , p − 1}
with arithmetic modulo p is a field. For example, addition and multiplication in
Z5 = {0, 1, 2, 3, 4} work as follows:
+ 0 1 2 3 4 × 0 1 2 3 4
0 0 1 2 3 4 0 0 0 0 0 0
1 1 2 3 4 0 1 0 1 2 3 4
2 2 3 4 0 1 2 0 2 4 1 3
3 3 4 0 1 2 3 0 3 1 4 2
4 4 0 1 2 3 4 0 4 3 2 1
436 Appendix A. Mathematical Preliminaries
Again, all of the field properties of Z p are straightforward to show, with the
exception of property (j)—the existence of multiplicative inverses. The standard
Property (j) is why we way to prove this property is to use a theorem called Bézout’s identity, which
need p to be prime. says that, for all 0 ≤ a < p, we can find integers x and y such that ax + py = 1.
For example, Z4 is We can rearrange this equation to get ax = 1 − py, so that we can choose
not a field since 2
does not have a 1/a = x. For example, in Z5 we have 1/1 = 1, 1/2 = 3, 1/3 = 2, and 1/4 = 4
multiplicative inverse. (with 1/2 = 3, for example, corresponding to the fact that the integer equation
2x + 5y = 1 can be solved when x = 3).
A.5 Convexity
In linear algebra, the sets of vectors that arise most frequently are subspaces
(which are closed under linear combinations) and the functions that arise most
frequently are linear transformations (which preserve linear combinations). In
this appendix, we briefly discuss the concept of convexity, which is a slight
weakening of these linearity properties that can be thought of as requiring the
set or graph of the function to “not bend inward” rather than “be flat”.
If we required To convince ourselves that this definition captures the geometric idea
cv + dw ∈ S for all
involving line segments between two points, we observe that if t = 0 then
c, d ∈ R then S would
be a subspace. The (1 − t)v + tw = v, if t = 1 then (1 − t)v + tw = w, and as t increases from 0
restriction that the to 1 the quantity (1 − t)v + tw travels along the line segment from v to w (see
coefficients are Figure A.10).
non-negative and
add up to 1 makes
convex sets more
v
general. v
t =0 1 w
3 2 w
3
t =1
Figure A.10: A set is convex (a) if all line segments between two points in that set
are contained within it, and it is non-convex (b) otherwise.
Another way of thinking about convex sets is as ones with no holes or sides
A.5 Convexity 437
that bend inward—their sides can be flat or bend outward only. Some standard
examples of convex sets include
• Any interval in R.
• Any subspace of a real vector space.
Recall that even
though the entries of • Any quadrant in R2 or orthant in Rn .
• The set of positive (semi)definite matrices in MH
Hermitian matrices
can be complex, n (see Section 2.2).
MHn is a real vector The following theorem pins down a very intuitive, yet extremely powerful,
space (see idea—if two convex sets are disjoint (i.e., do not contain any vectors in com-
Example 1.1.5).
mon) then we must be able to fit a line/plane/hyperplane (i.e., a flat surface that
cuts V into two halves) between them.
The way that we formalize this idea is based on linear forms (see Sec-
tion 1.3), which are linear transformations from V to R. We also make use of
open sets, which are sets of vectors that do not include any of their boundary
points (standard examples include any subinterval of R that does not include
its endpoints, and the set of positive definite matrices in MHn ).
f (v) = c + 1
S f (v) = c
f (v) = c
f (v) = c − 1 S
Figure A.11: The separating hyperplane theorem (Theorem A.5.1) tells us (a) that
we can squeeze a hyperplane between any two disjoint convex sets. This is (b) not
necessarily possible if one of the sets is non-convex.
While the separating hyperplane theorem itself perhaps looks a bit abstract,
it can be made more concrete by choosing a particular vector space V and
thinking about how we can represent linear forms acting on V. For example,
if V = Rn then Theorem 1.3.3 tells us that for every linear form f : Rn → R
In fact, v is exactly
the unique (up to there exists a vector v ∈ Rn such that f (x) = v · x for all x ∈ Rn . It follows that,
scaling) vector that in this setting, if S and T are convex and T is open then there exists a vector
is orthogonal to the v ∈ Rn and a scalar c ∈ R such that
separating
hyperplane.
v·y > c ≥ v·x for all x ∈ S and y ∈ T.
438 Appendix A. Mathematical Preliminaries
Similarly, we know from Exercise 1.3.7 that for every linear form f : Mn →
R, there exists a matrix A ∈ Mn such that f (X) = tr(AX) for all X ∈ Mn . The
separating hyperplane theorem then tells us (under the usual hypotheses on S
and T ) that
tr(AY ) > c ≥ tr(AX) for all X ∈ S and Y ∈ T.
Definition A.5.2 Suppose V is a real vector space and S ⊆ V is a convex set. We say that a
Convex Function function f : S → R is convex if
f (1 − t)v + tw ≤ (1 − t) f (v) + t f (w) whenever v, w ∈ S
and 0 ≤ t ≤ 1.
y y = f (x)
y y = f (x)
4
The region above a
4
function’s graph is
called its epigraph. 3
(1 − t) f (x1 ) + t f (x2 ) 3
¢
≥ f (1 − t)x1 + tx2 2
1
1
x
1 2 3 4 x
(1 − t)x1 + tx2 1 2 3 4
Figure A.12: A function is convex if and only if (a) every line segment between two
points on its graph lies above its graph, if and only if (b) the region above its graph
is convex.
Figure A.13: A function is concave if and only if every line segment between two
points on its graph lies below its graph, if and only if the region below its graph is
convex.
The following theorem provides what is probably the simplest method of
checking whether or not a function of one real variable is convex, concave, or
neither.
For example, we can see that the function f (x) = x4 is convex on R since
f 00 (x)= 12x2 ≥ 0, whereas the natural logarithm g(x) = ln(x) is concave on
Similarly, a function is (0, ∞) since g00 (x) = −1/x2 ≤ 0. In fact, the second derivative of g(x) = ln(x)
strictly convex if is strictly negative, which tells us that any line between two points on its graph
f (1 − t)v + tw <
(1 − t) f (v) + t f (w) is strictly below its graph—it only touches the graph at the endpoints of the
whenever v 6= w and line segment. We call such a function strictly concave.
0 < t < 1. Notice the
strict inequalities.
One of the most useful applications of the fact that the logarithm is (strictly)
concave is the following inequality that relates the arithmetic and geometric
means of a set of n real numbers:
In this appendix, we prove some of the technical results that we made use of
throughout the main body of the textbook, but whose proofs are messy enough
(or unenlightening enough, or simply not “linear algebra-y” enough) that they
are hidden away here.
We mentioned near the start of Section 1.D that all norms on a finite-dimensional
vector space are equivalent to each other. That is, the statement of Theo-
rem 1.D.1 told us that, given any two norms k · ka and k · kb on V, there exist
real scalars c,C > 0 such that
Proof of Theorem 1.D.1. We first note that Exercise 1.D.28 tells us that it
suffices to prove this theorem in Fn (where F = R or F = C). Furthermore,
transitivity of equivalence of norms (Exercise 1.D.27) tells us that it suffices
just to show that every norm on Fn is equivalent to the usual vector length
The 2-norm is (2-norm) k · k, so the remainder of the proof is devoted to showing why this is
q the case. In particular, we will show that any norm k · ka is equivalent to the
kvk = |v1 |2 + · · · + |vn |2 .
usual 2-norm k · k.
To see that there is a constant C > 0 such that kvka ≤ Ckvk for all v ∈ Fn ,
we notice that
Recall that e1 , . . ., en
denote the standard kvka = kv1 e1 + · · · + vn en ka (v in standard basis)
basis vectors in Fn .
≤ |v1 |ke1 ka + · · · + |vn |ken ka (triangle ineq. for k · ka )
q q
≤ |v1 |2 + · · · + |vn |2 ke1 k2a + · · · + ken k2a (Cauchy−Schwarz ineq.)
q
= kvk ke1 k2a + · · · + ken k2a . (formula for kvk)
To see that there is similarly a constant c > 0 such that ckvk ≤ kvka for
all v ∈ V, we need to put together three key observations. First, observe that it
suffices to show that c ≤ kvka whenever v is such that kvk = 1, since otherwise
we can just scale v so that this is the case and both sides of the desired inequality
scale appropriately.
Second, we notice that the function k · ka is continuous on Fn , since the
Some details about inequality kvka ≤ Ckvk tells us that if a limk→∞ vk = v then
limits and continuity
can be found in
Section 2.D.2.
lim kvk − vk = 0
k→∞
=⇒ lim kvk − vka = 0 ( lim kvk − vka ≤ C lim kvk − vk = 0)
k→∞ k→∞ k→∞
=⇒ lim kvk ka = kvka . (reverse triangle ineq.: Exercise 1.D.6)
k→∞
Third, we define
Sn = v ∈ Fn : kvk = 1 ⊂ Fn ,
which we can think of as the unit circle (or sphere, or hypersphere, depending
on n), as illustrated in Figure B.14, and we notice that this set is closed and
A closed and bounded.
bounded subset of
Fn is typically called
compact. y
z
S2 S3
x
1 y
Figure B.14: The set Sn ⊂ Fn of unit vectors can be thought of as the unit circle, or
sphere, or hypersphere, depending on the dimension.
Continuous functions To put all of this together, we recall that the Extreme Value Theorem from
also attain their analysis says that every continuous function on a closed and bounded subset of
maximum value on Fn attains its minimum value on that set (i.e., it does not just approach some
a closed and
minimum value). It follows that the norm k · ka attains a minimum value (which
bounded set, but we
only need the we call c) on Sn , and since each vector in Sn is non-zero we know that c > 0.
minimum value. We thus conclude that c ≤ kvka whenever v ∈ Sn , which is exactly what we
wanted to show.
The first result that we present here demonstrates the claim that we made
in the middle of the proof of Theorem 2.4.5.
Theorem B.2.1 If two matrices A ∈ Mm (C) and C ∈ Mn (C) do not have any eigenvalues
Similarity of Block in common then, for all B ∈ Mm,n (C), the following block matrices are
Triangular Matrices similar: " # " #
A B A O
and .
O C O C
It thus suffices to show that there exists a matrix X ∈ Mm,n (C) such that
We provide another AX − XC = B. Since B ∈ Mm,n (C) is arbitrary, this is equivalent to showing
proof that such an X
that the linear transformation T : Mm,n (C) → Mm,n (C) defined by T (X) =
exists in Exercise 3.1.8.
AX − XC is invertible, which we do by showing that T (X) = O implies X = O.
To this end, suppose that T (X) = O, so AX = XC. Simply using associativ-
ity of matrix multiplication then shows that this implies
Theorem B.2.2 The set Bk from the proof of Theorem 2.4.1 is linearly independent.
Linear Independence
of Jordan Chains
444 Appendix B. Additional Proofs
Proof. Throughout this proof, we use the same notation as in the proof of
Theorem 2.4.1, but before proving anything specific about Bk , we prove the
following claim:
Claim: If S1 , S2 are subspaces of Cn and x ∈ Cn is a vector for which x ∈ /
span(S1 ∪ S2 ), then span({x} ∪ S1 ) ∩ span(S2 ) = span(S1 ) ∩ span(S2 ).
Note that
span(S1 ∪ S2 ) = S1 + S2 To see why this holds, notice that if {y1 , . . . , ys } and {z1 , . . . , zt } are
is the sum of S1 and bases of S1 and S2 , respectively, then we can write every member of
S2 , which was
span({x} ∪ S1 ) ∩ span(S2 ) in the form
introduced in
Exercise 1.1.19. s t
bx + ∑ ci yi = ∑ di zi .
i=1 i=1
Our goal is to show that this implies ci = di, j = 0 for all i and j. To this
end, notice that if we multiply the linear combination (B.2.1) on the left by
(A − λ I)k−1 then we get
!
p q k−1
0 = (A − λ I)k−1 ∑ ci vi + ∑ ∑ di, j (A − λ I) j+k−1 wi
i=1 i=1 j=0
! (B.2.2)
p q
k−1
= (A − λ I) ∑ ci vi + ∑ di,0 wi ,
i=1 i=1
where the second equality comes from the fact that (A − λ I) j+k−1 wi= 0
whenever j ≥ 1, since Ck \Bk+1 ⊆ null (A − λ I)k ⊆ null (A − λ I) j+k−1 .
In other words, we have shown that ∑i=1p
ci vi + ∑qi=1 di,0 wi ∈ null (A −
λ I)k−1 , which we claim implies di,0 = 0 for all 1 ≤ i ≤ q. To see why this
is the case, we repeatedly use the Claim from the start of this proof with x
ranging over the members of Ck \Bk+1 (i.e., the vectors that we added one
at a time in Step 2 ofthe proof of Theorem 2.4.1), S1 = span(Bk+1 ), and
S2 = null (A − λ I)k−1 to see that
span(Ck ) ∩ null (A − λ I)k−1 = span(Bk+1 ) ∩ null (A − λ I)k−1
(B.2.3)
⊆ span(Bk+1 ).
p
It follows that ∑i=1 ci vi + ∑qi=1 di,0 wi ∈ span(Bk+1 ). Since Ck is linearly inde-
pendent, this then tells us that di,0 = 0 for all i.
B.2 Details of the Jordan Decomposition 445
We now repeat the above argument, but instead of multiplying the linear
combination (B.2.1) on the left by (A − λ I)k−1 , we multiply it on the left by
smaller powers of A−λ I. For example, if we multiply on the left by (A−λ I)k−2
then we get
! !
p q q
0 = (A − λ I)k−2 ∑ ci vi + ∑ di,0 wi + (A − λ I)k−1 ∑ di,1 wi
i=1 i=1 i=1
! !
p q
= (A − λ I)k−2 ∑ ci vi + (A − λ I)k−1 ∑ di,1 wi ,
i=1 i=1
where the first equality comes from the fact that (A − λ I) j+k−2 wi = 0 whenever
j ≥ 2, and the second equality following from the fact that we already showed
that di,0 = 0 for all 1 ≤ i ≤ q.
In other words, p
we have shown that ∑i=1 ci vik−2 + (A − λ I) ∑qi=1 di,1 wi ∈
null (A − λ I) k−2 . If we notice that null (A − λ I) ⊆ null (A − λ I)k−1
p q
then it follows that ∑i=1 ci vi + ∑i=1 di,1 wi ∈ null (A − λ I)k−1 , so repeating
the same argument from earlier shows that di,1 = 0 for all 1 ≤ i ≤ q.
Repeating this argument shows that di, j = 0 for all i and j, so the linear
p
combination (B.2.1) simply says that ∑i=1 ci vi = 0. Since Bk+1 is linearly
independent, this immediately implies ci = 0 for all 1 ≤ i ≤ p, which finally
shows that Bk is linearly independent as well.
The final proof of this subsection shows that the way we defined analytic
functions of matrices (Definition 2.4.4) is actually well-defined. That is, we
now show that this definition leads to the formula of Theorem 2.4.6 regardless
of which value a we choose to center the Taylor series at, whereas we originally
just proved that theorem in the case when a = λ . Note that in the statement
of this theorem and its proof, we let Nn ∈ Mk (C) denote the matrix with
ones on its n-th superdiagonal and zeros elsewhere, just as in the proof of
Theorem 2.4.6.
The left-hand side of Proof. Since Jk (λ ) = λ I + N1 , we can rewrite the sum on the left as
the concluding line
of this theorem
is just ∞
f (n) (a) n ∞
f (n) (a) n
f Jk (λ ) , and the ∑ λ I + N1 − aI = ∑ (λ − a)I + N1 .
right-hand side is just n=0 n! n=0 n!
the large matrix
formula given by Using the binomial theorem (Theorem A.2.1) then shows that this sum equals
Theorem 2.4.6. !
∞
f (n) (a) k n n− j
∑ n! ∑ j (λ − a) N j .
n=0 j=0
Swapping the order of these sums and using the fact that nj = n!/((n − j)! j!)
then puts it into the form
We can swap the !
order of summation
k
1 ∞
f (n) (a)
∑ ∑ (λ − a)n− j N j . (B.2.4)
j=0 j! n=0 (n − j)!
here because the
sum over n con-
verges
for each j.
446 Appendix B. Additional Proofs
f (a) (n)
Since f is analytic we know that f (λ ) = ∑∞ n
n=0 n! (λ − a) . Further-
( j) ( j)
more, f must also be analytic, and replacing f by f in this Taylor series
(n)
f (a)
shows that f ( j) (λ ) = ∑∞n=0 (n− j)! (λ − a)
n− j . Substituting this expression into
Primal Dual
X ∈ MH
n and Y ∈ Mm
H
maximize: tr(CX) minimize: tr(BY )
are matrix variables.
subject to: Φ(X) B subject to: Φ∗ (Y ) C
X O Y O
We start by proving a helper theorem that does most of the heavy lifting
for strong duality. We note that we use the closely-related Exercises 2.2.20
and 2.2.21 in this proof, which tell us that A O if and only if tr(AB) ≥ 0
whenever B O, as well as some other closely-related variants of this fact
involving positive definite matrices.
tr(BY ) ≤ 0.
Proof. To see that (b) implies (a), suppose that Y is as described in (b). Then if
a matrix X as described in (a) did exist, we would have
In particular, the first
and second 0 < tr (B − Φ(X))Y (since Y 6= O,Y O, B − Φ(X) O)
inequalities here use ∗
Exercise 2.2.20. = tr(BY ) − tr XΦ (Y ) (linearity of trace, definition of adjoint)
≤ 0, (since X O, Φ∗ (Y ) O, tr(BY ) ≤ 0)
For the opposite implication, we notice that if (a) holds then the set
S = {B − Φ(X) : X ∈ MH
n is positive semidefinite}
Theorem B.3.2 Suppose that both problems in a primal/dual pair of SDPs are feasible, and
Strong Duality at least one of them is strictly feasible. Then the optimal values of those
for Semidefinite SDPs coincide. Furthermore,
Programs a) if the primal problem is strictly feasible then the optimal value is
attained in the dual problem, and
b) if the dual problem is strictly feasible then the optimal value is
attained in the primal problem.
Proof. We just prove part (a) of the theorem, since part (b) then follows from
swapping the roles of the primal and dual problems.
Let α be the optimal value of the primal problem (which is not necessarily
e : MH
attained). If we define Φ H
n → Mm+1 and B ∈ Mm+1 by
H
" #
e −tr(CX) 0 e −α 0
Φ(X) = and B = ,
0 Φ(X) 0 B
The inequality tr(BeYe ) ≤ 0 then tells us that tr(BY ) ≤ yα, and the constraint
e ∗ (Ye ) O tells us that Φ∗ (Y ) yC. If y 6= 0 then it follows that Y /y is
Φ
Keep in mind that
Ye O implies Y O
a feasible point of the dual problem that produces a value in the objective
and y ≥ 0. function no larger than α (and thus necessarily equal to α, by weak duality),
which is exactly what we wanted to find. All that remains is to show that y 6= 0,
so that we know that Y /y exists.
To pin down this final detail, we notice that if y = 0 then Φ∗ (Y ) = Φe ∗ (Ye )
If y = 0, we know that O and tr(BY ) = tr(BeYe ) ≤ 0. Applying Theorem B.3.1 (to B and Φ this time)
Y 6= O since Ye 6= O. shows that the primal problem is not strictly feasible, which is a contradiction
that completes the proof.
C. Selected Exercise Solutions
we can first choose cn to give us whatever coef- 1.1.14 (a) We check the two closure properties from
ficient of xn we want, then choose cn−1 to give Theorem 1.1.2: if f , g ∈ P O then ( f +
us whatever coefficient of xn−1 we want, and g)(−x) = f (−x) + g(−x) = − f (x) − g(x) =
so on. −( f + g)(x), so f + g ∈ P O too, and if c ∈ R
1.1.4 (b) True. Since V must contain a zero vector 0, we then (c f )(−x) = c f (−x) = −c f (x), so c f ∈
can just define W = {0}. P O too.
(d) False. For example, let V = R2 and let B = (b) We first notice that {x, x3 , x5 , . . .} ⊂ P O . This
{(1, 0)},C = {(2, 0)}. set is linearly independent since it is a subset
(f) False. For example, if V = P and B = of the linearly independent set {1, x, x2 , x3 , . . .}
{1, x, x2 , . . .} then span(B) = P, but there is from Example 1.1.16. To see that it spans P O ,
no finite subset of B whose span is all of P. we notice that if
(h) False. The set {0} is linearly dependent (but f (x) = a0 + a1 x + a2 x2 + a3 x3 + · · · ∈ P O
every other single-vector set is indeed linearly
independent). then f (x) − f (−x) = 2 f (x), so
2 f (x) = f (x) − f (−x)
1.1.5 Property (i) of vector spaces tells us that (−c)v =
= (a0 + a1 x + a2 x2 + a3 x3 + · · · )
(−1)(cv), and Theorem 1.1.1(b) tells us that
(−1)(cv) = −(cv). − (a0 − a1 x + a2 x2 − a3 x3 + · · · )
= 2(a1 x + a3 x3 + a5 x5 + · · · ).
1.1.7 All 10 of the defining properties from Definition 1.1.1
are completely straightforward and follow imme- It follows that f (x) = a1 x +a3 x3 +a5 x5 +· · · ∈
diately from the corresponding properties of F. For span(x, x3 , x5 , . . .), so {x, x3 , x5 , . . .} is indeed
example v + w = w + v (property (b)) for all v, w ∈ FN a basis of P O .
because v j + w j = w j + v j for all j ∈ N.
1.1.17 (a) We have to show that the following two proper-
1.1.8 If A, B ∈ MSn and c ∈ F then (A + B)T = AT + BT = ties hold for S1 ∩ S2 : (a) if v, w ∈ S1 ∩ S2 then
A + B, so A + B ∈ MSn and (cA)T = cAT = cA, so v + w ∈ S1 ∩ S2 as well, and (b) if v ∈ S1 ∩ S2
cA ∈ MSn . and c ∈ F then cv ∈ S1 ∩ S2 .
For property (a), we note that v+w ∈ S1 (since
1.1.10 (a) C is a subspace of F because if f , g are con- S1 is a subspace) and v + w ∈ S2 (since S2 is
tinuous and c ∈ R then f + g and c f are also a subspace). It follows that v + w ∈ S1 ∩ S2 as
continuous (both of these facts are typically well.
proved in calculus courses). Property (b) is similar: we note that cv ∈ S1
(b) D is a subspace of F because if f , g are differ- (since S1 is a subspace) and cv ∈ S2 (since S2
entiable and c ∈ R then f + g and c f are also is a subspace). It follows that cv ∈ S1 ∩ S2 as
differentiable. In particular, ( f + g)0 = f 0 + g0 well.
and (c f )0 = c f 0 (both of these facts are typi- (b) If V = R2 and S1 = {(x, 0) : x ∈ R}, S2 =
cally proved in calculus courses). {(0, y) : y ∈ R} then S1 ∪ S2 is the set of points
with at least one of their coordinates equal to 0.
1.1.11 If S is closed under linear combinations then cv ∈ S It is not a subspace since (1, 0), (0, 1) ∈ S1 ∪S2
and v + w ∈ S whenever c ∈ F and v, w ∈ S simply but (1, 0) + (0, 1) = (1, 1) 6∈ S1 ∪ S2 .
because cv and v + w are both linear combinations of
members of S, so S is a subspace of V. 1.1.19 (a) S1 + S2 consists of all vectors of the form
Conversely, if S is a subspace of V and v1 , . . . , vk ∈ S (x, 0, 0) + (0, y, 0) = (x, y, 0), which is the xy-
then c j v j ∈ S for each 1 ≤ j ≤ k. Then repeat- plane.
edly using closure under vector addition gives (b) We claim that MS2 + MsS 2 is all of M2 . To
c1 v1 + c2 v2 ∈ S, so (c1 v1 + c2 v2 ) + c3 v3 ∈ S, and so verify this claim, we need to show that we can
on to c1 v1 + c2 v2 + · · · + ck vk ∈ S. write every 2 × 2 matrix as a sum of a symmet-
ric and a skew-symmetric matrix. This can be
1.1.13 The fact that {E1,1 , E1,2 , . . . , Em,n } spans Mm,n is done as# follows:
" " # " #
clear: every matrix A ∈ Mm,n can be written in the a b
=
1 2a b+c
+
1 0 b−c
form c d 2 b+c 2d 2 c−b 0
m n
A= ∑ [Note: We discuss a generalization of this fact
∑ ai, j Ei, j . in Example 1.B.3 and Remark 1.B.1.]
i=1 j=1
Linear independence of {E1,1 , E1,2 , . . . , Em,n } follows (c) We check that the two properties of Theo-
from the fact that if rem 1.1.2 are satisfied by S1 + S2 . For prop-
m n erty (a), notice that if v, w ∈ S1 + S2 then there
∑ ∑ ci, j Ei, j = O exist v1 , w1 ∈ S1 and v2 , w2 ∈ S2 such that
i=1 j=1 v = v1 + v2 and w = w1 + w2 . Then
then v + w = (v1 + v2 ) + (w1 + w2 )
c1,1 c1,2 ··· c1,n
c2,1 c2,2 ··· c2,n = (v1 + w1 ) + (v2 + w2 ) ∈ S1 + S2 .
.. ..
.. = O, Similarly, if c ∈ F then
..
. . . . cv = c(v1 + v2 ) = cv1 + cv2 ∈ S1 + S2 ,
cm,1 cm,2 ··· cm,n
since cv1 ∈ S1 and cv2 ∈ C2 .
so ci, j = 0 for all i, j.
C.1 Vector Spaces 451
1.1.20 If every vector in V can be written as a linear combina- (c) Notice that the k-th derivative of xk is the scalar
tion of the members of B, then B spans V by definition, function k!, and every higher-order derivative
so we only need to show linear independence of B. To of xk is 0. It follows that W (x) is an upper
this end, suppose that v ∈ V can be written as a linear triangular matrix with 1, 1!, 2!, 3!, . . . on its di-
combination of the members of B in exactly one way: agonal, so det(W (x)) = 1 · 1! · 2! · · · n! 6= 0.
(d) We start by showing that det(W (x)) = 0 for all
v = c1 v1 + c2 v2 + · · · + ck vk
x. First, we note that f20 (x) = 2|x| for all x. To
for some v1 , v2 , . . . , vk ∈ B. If B were linearly depen- see this, split it into three cases: If x > 0 then
dent, then there must be a non-zero linear combination f2 (x) = x2 , which has derivative 2x = 2|x|. If
of the form x < 0 then f2 (x) = −x2 , which has derivative
−2x = 2|x|. If x = 0 then we have to use the
0 = d1 w1 + d2 w2 + · · · + dm wm limit definition of a derivative
for some w1 , w2 , . . . , wm ∈ B. By adding these two f2 (h) − f2 (0)
linear combinations, we see that f20 (0) = lim
h→0 h
v = c1 v1 + · · · + ck vk + d1 w1 + · · · + dm wm . h|h| − 0
= lim = lim |h| = 0,
h→0 h h→0
Since not all of the d j ’s are zero, this is a different
linear combination that gives v, which contradicts which also equals 2|x| (since x = 0). With that
uniqueness. We thus conclude that B must in fact be out of the way, we can compute
linearly independent. " #!
x2 x|x|
det(W (x)) = det
1.1.21 (a) Since f1 , f2 , . . . , fn are linearly dependent, it 2x 2|x|
follows that there exist scalars c1 , c2 , . . . , cn
such that = 2x2 |x| − 2x2 |x| = 0
1.2.12 We prove this in the upcoming Definition 3.1.3, and (b) Suppose v∈V. Then Exercise 1.2.22 tells us that
the text immediately following it.
c1 v1 + c2 v2 + · · · + cm vm = v
1.2.13 If AT = λ A then a j,i = λ ai, j for all i, j, so ai, j = if and only if
λ (λ ai, j ) = λ 2 ai, j for all i, j. If A 6= O then this im-
plies λ = ±1. The fact that the eigenspaces are MSn c1 [v1 ]B + c2 [v2 ]B + · · · + cm [vm ]B
and MsS n follows directly from the defining properties = [c1 v1 + c2 v2 + · · · + cm vm ]B = [v]B .
AT = A and AT = −A of these spaces.
In particular, we can find c1 , c2 , . . . , cm to
1.2.15 (a) range(T ) = P 2 , null(T ) = {0}. solve the first equation if and only if we
(b) The only eigenvalue of T is λ = 1, and the can find c1 , c2 , . . . , cm to solve the second
corresponding eigenspace is P 0 (the constant equation, so v ∈ span(v1 , v2 , . . . , vm ) if and
functions). only if [v]B ∈ span [v1 ]B , [v2 ]B , . . . , [vm ]B and
(c) One square root is the transformation S : P 2 → thus span(v1 , v2 , . . . , vm ) = V if and only if
P 2 given by S( f (x)) = f (x + 1/2). span [v1 ]B , [v2 ]B , . . . , [vm ]B = Fn .
(c) This follows immediately from combining
1.2.18 [D]B can be (complex) diagonalized as [D]B = PDP−1 parts (a) and (b).
via " # " #
1 1 i 0 1.2.24 (a) If v, w ∈ range(T ) and c ∈ F then there ex-
P= , D= . ist x, y ∈ V such that T (x) = v and T (y) =
−i i 0 −i
w. Then T (x + cy) = v + cw, so v + cw ∈
It follows that range(T ) too, so range(T ) is a subspace of w.
1/2
[D]B = PD1/2 P−1 (b) If v, w ∈ null(T ) and c ∈ F then T (v + cw) =
" # T (v) + cT (w) = 0 + c0 = 0, so v + cw ∈
1 1+i 0 null(T ) too, so null(T ) is a subspace of V.
=√ P P−1
2 0 1−i
" # 1.2.25 (a) Wejustnotethatw ∈ range(T )ifandonlyifthere
1 1 −1
=√ , exists v ∈ V such that T (v) = w, if and only if
2 1 1 there exists [v]B such that [T ]D←B [v]B = [w]D ,
which is the same square root we found in Exam- if and only if [w]D ∈ range([T ]D←B ).
ple 1.2.19. (b) Similar to part (a), v ∈ null(T ) if and only if
T (v) = 0, if and only if [T ]D←B [v]B = [0]B = 0,
1.2.21 It then follows from Exercise 1.2.20(c) that if and only if [v]B ∈ null([T ]D←B ).
if {v1 , v2 , . . . , vn } is a basis of V then (c) Using methods like in part (a), we can
{T (v1 ), T (v2 ), . . . , T (vn )} is a basis of W, so show that if w1 , . . . , wn is a basis of
dim(W ) = n = dim(V). range(T ) then [w1 ]D , . . . , [wn ]D is a ba-
sis of range([T ]D←B ), so rank(T ) =
1.2.22 (a) Write B={v1 , v2 , . . ., vn } and give names to the dim(range(T )) = dim(range([T ]D←B )) =
entries of [v]B and [w]B : [v]B = (c1 , c2 , . . . , cn ) rank([T ]D←B ).
and [w]B = (d1 , d2 , . . . , dn ), so that (d) Using methods like in part (b), we can
show that if v1 , . . . , vn is a basis of
v = c1 v1 + c2 v2 + · · · + cn vn null(T ) then [v1 ]B , . . . , [vn ]B is a basis of
w = d1 v1 + d2 v2 + · · · + dn vn . null([T ]D←B ), so nullity(T ) = dim(null(T )) =
By adding these equations we see that dim(null([T ]D←B )) = nullity([T ]D←B ).
v + w = (c1 + d1 )v1 + · · · + (cn + dn )vn , 1.2.27 (a) The “only if” implication is trivial since a ba-
which means that [v+w]B =(c1 +d1 , c2 +d2 , . . . , sis of V must, by definition, be linearly in-
cn +dn ), which is the same as [v]B +[w]B . dependent. Conversely, we note from Exer-
(b) Using the same notation as in part (a), we have cise 1.2.26(a) that, since B is linearly indepen-
cv = (cc1 )v1 + (cc2 )v2 + · · · + (ccn )vn , dent, we can add 0 or more vectors to B to
create a basis of V. However, B already has
which means that [cv]B = (cc1 , cc2 , . . . , cck ), dim(V) vectors, so the only possibility is that
which is the same as c[v]B . B becomes a basis when we add 0 vectors to
(c) It is clear that if v = w then [v]B = [w]B . For it—i.e., B itself is already a basis of V.
the converse, we note that parts (a) and (b) (b) Again, the “only if” implication is trivial since
tell us that if [v]B = [w]B then [v − w]B = 0, so a basis of V must span V. Conversely, Exer-
v − w = 0v1 + · · · + 0vn = 0, so v = w. cise 1.2.26(b) tells us that we can remove 0
or more vectors from B to create a basis of V.
1.2.23 (a) By Exercise 1.2.22, we know that However, we know that all bases of V contain
c1 v1 + c2 v2 + · · · + cm vm = 0 dim(V) vectors, and B already contains exactly
if and only if this many vectors, so the only possibility is that
B becomes a basis when we remove 0 vectors
c1 [v1 ]B + c2 [v2 ]B + · · · + cm [vm ]B from it—i.e., B itself is already a basis of V.
= [c1 v1 + c2 v2 + · · · + cm vm ]B = [0]B = 0.
In particular, this means that {v1 , v2 , . . . , vm } is
linearly independent if and only if c1 = c2 =
· · · = cm = 0 is the unique
solution to these equa-
tions, if and only if [v1 ]B , [v2 ]B , . . . , [vm ]B ⊂
n
F is linearly independent.
454 Appendix C. Selected Exercise Solutions
1.2.28 (a) All 10 vector space properties from Defini- 1.2.32 We check the two properties of Definition 1.2.4:
tion 1.1.1 are straightforward, so we do not (a) R (v1 , v2 , . . .) + (w1 , w2 , . . .) = R(v1 + w1 , v2 +
show them all here. Property (a) just says that w2 , . . .) = (0, v1 + w1 , v2 + w2 , . . .) = (0, v1 , v2 , . . .) +
the sum of two linear transformations is again (0, w1 , w2 , . . .) = R(v1 , v2 , . . .) + R(w1 , w2 , . . .),
a linear transformation, for example. and (b) R c(v1 , v2 , . . .) = R(cv1 , cv2 , . . .) =
(b) dim(L(V, W)) = mn, which can be seen by (0, cv1 , cv2 , . . .) = c(0, v1 , v2 , . . .) = cR(v1 , v2 , . . .).
noting that if {v1 , . . . , vn } and {w1 , . . . , wm }
are bases of V and W, respectively, then the 1.2.34 It is clear that dim(P 2 (Z2 )) ≤ 3 since {1, x, x2 } spans
mn linear transformations defined by P 2 (Z2 ) (just like it did in the real case). However, it
( is not linearly independent since the two polynomials
wi if j = k, f (x) = x and f (x) = x2 are the same on Z2 (i.e., they
Ti, j (vk ) =
0 otherwise provide the same output for all inputs). The set {1, x}
is indeed a basis of P 2 (Z2 ), so its dimension is 2.
form a basis of L(V, W). In fact, the standard
matrices of these linear transformations make 1.2.35 Suppose P ∈ Mm,n is any matrix such that P[v]B =
up the standard basis of Mm,n : [Ti, j ]D←B = Ei, j [T (v)]D for all v ∈ V. For every 1 ≤ j ≤ n, if v = v j
for all i, j. then we see that [v]B = e j (the j-th standard basis
vector in Rn ), so P[v]B = Pe j is the j-th column
1.2.30 [T ]B = In if and only if the j-th column of [T ]B equals of P. On the other hand, it is also the case that
e j for each j. If we write B = {v1 , v2 , . . . , vn } then P[v]B = [T (v)]D = [T (v j )]D . It follows that the j-
we see that [T ]B = In if and only if [T (v j )]B = e j for th column of P is [T (v j )]D for each 1 ≤ j ≤ n, so
all j, which is equivalent to T (v j ) = v j for all j (i.e., P = [T ]D←B , which shows uniqueness of [T ]D←B .
T = IV ).
1.2.36 Suppose that a set C has m < n vectors, which we call
1.2.31 (a) If B is a basis of W then by Exercise 1.2.26 v1 , v2 , . . . , vm . To see that C does not span V, we want
that we can extend B to a basis C ⊇ B of V. to show that there exists x ∈ V such that the equation
However, since dim(V) = dim(W) we know c1 v1 + · · · + cm vm = x does not have a solution. This
that B and C have the same number of vectors, equation is equivalent to
so B = C, so V = W.
(b) Let V = c00 from Example 1.1.10 and let W c1 [v1 ]B + · · · + cm [vm ]B = [x]B ,
be the subspace of V with the property that
which is a system of n linear equations in m variables.
the first entry of every member of w equals 0.
Since m < n, this is a “tall and skinny” linear system,
Then dim(V) = dim(W) = ∞ but V 6= W.
so applying Gaussian elimination to the augmented
matrix form [ A | [x]B ] of this linear system results in
a row echelon form [ R | [y]B ] where R has at least
one zero row at the bottom. If x is chosen so that the
bottom entry of [y]B is non-zero then this linear system
has no solution, as desired.
(b) S ◦ T is linear (even when S and T are any (b) It is still true that alternating implies skew-
not necessarily invertible linear transforma- symmetric, but the converse fails because
tions) since (S ◦ T )(v + cw) = S(T (v + cw)) = f (v, v) = − f (v, v) no longer implies f (v, v) =
S(T (v) + cT (w)) = S(T (v)) + cS(T (w)) = 0 (since 1 + 1 = 0 implies −1 = 1 in this field).
(S ◦ T )(v) + c(S ◦ T )(w). Furthermore, S ◦ T is
invertible since T −1 ◦ S−1 is its inverse. 1.3.18 (a) We could prove this directly by mimicking the
proof of Theorem 1.3.5, replacing the trans-
1.3.7 If we use the standard basis of Mm,n (F), then we poses by conjugate transposes. Instead, we
know from Theorem 1.3.3 that there are scalars prove it by denoting the vectors in the basis
{ai, j } such that f (X) = ∑i, j ai, j xi, j . If we let A be B by B = {v1 , . . . , vm } and defining a new bi-
the matrix with (i, j)-entry equal to ai, j , we get linear function g : V × W → C by
tr(AX) = ∑i, j ai, j xi, j as well.
g(v j , w) = f (v j , w) for all v j ∈ B, w ∈ W
1.3.8 Just repeat the argument of Exercise 1.3.7 with the and extending to all v ∈ V via linearity. That
bases of MSn (F) and MH n that we presented in the is, if v = c1 v1 + · · · + cm vm (i.e., [v]B =
solution to Exercise 1.2.2. (v1 , . . . , vm )) then
1.3.9 Just recall that inner products are linear in their second g(v, w) = g(c1 v1 + · · · + cm vm , w)
entry, so hv, 0i = hv, 0vi = 0hv, vi = 0. = c1 g(v1 , w) + · · · + cm g(vm , wm )
1.3.10 Recall that = c1 f (v1 , w) + · · · + cm f (vm , wm )
s = f (c1 v1 + · · · + cm vm , w).
m n
kAkF = ∑ ∑ |ai, j |2 Then Theorem 1.3.5 tells us that we can write
i=1 j=1
g in the form
which is clearly unchanged if we reorder the entries g(v, w) = [v]TB A[w]C .
of A (i.e., kAkF = kAT kF ) and is also unchanged if we
take the complex conjugate of some or all entries (so When combined with the definition of g above,
kA∗ kF = kAT kF ). we see that
f (v, w) = g(c1 v1 + · · · + cm vm , w)
1.3.12 The result follows from expanding out the norm
in terms of the inner product: kv + wk2 = hv + = ([v]B )T A[w]C = [v]∗B A[w]C ,
w, v + wi = hv, vi + hv, wi + hw, vi + hw, wi = as desired.
kvk2 + 0 + 0 + kwk2 . (b) If A = A∗ (i.e., A is Hermitian) then
1.3.13 We just compute f (v, w) = [v]∗B A[w]B = [v]∗B A∗ [w]B
kv + wk2 + kv − wk2 = ([w]∗B A[v]B )∗ = f (w, v),
= hv + w, v + wi + hv − w, v − wi with the final equality following from the fact
that [w]∗B A[v]B = f (w, v) is a scalar and thus
= hv, vi + hv, wi + hw, vi + hw, wi
equal to its own transpose.
+ hv, vi − hv, wi − hw, vi + hw, wi In the other direction, if f (v, w) = f (w, v) for
= 2hv, vi + 2hw, wi all v and w then in particular this holds if
[v]B = ei and [w]B = e j , so
= 2kvk2 + 2kwk2 .
ai, j = e∗i Ae j = [v]∗B A[w]B = f (v, w)
1.3.14 (a) Expanding the norm in terms of the inner prod-
uct gives = f (w, v) = ([w]∗B A[v]B )∗ = (e∗j Aei )∗ = a j,i .
Since this equality holds for all i and j, we
kv + wk2 − kv − wk2
conclude that A is Hermitian.
= hv + w, v + wi − hv − w, v − wi (c) Part (b) showed the equivalence of A being
= hv, vi + 2hv, wi + hw, wi Hermitian and f being conjugate symmetric,
so we just need to show that positive definite-
− hv, vi + 2hv, wi − hw, wi,
ness of f (i.e., f (v, v) ≥ 0 with equality if
= 4hv, wi, and only if f = 0) is equivalent to v∗ Av ≥ 0
from which the desired equality follows. with equality if and only if v = 0. This equiva-
(b) This follows via the same method as in part (a). lence follows immediately from recalling that
All that changes is that the algebra is uglier, f (v, v) = [v]∗B A[v]B .
and we have to be careful to not forget the com-
plex conjugate in the property hw, vi = hv, wi. 1.3.21 gx (y) = 1 + xy + x2 y2 + · · · + x p y p .
1.3.16 (a) If f is alternating then for all v, w ∈ V we 1.3.22 We note that dim((P p )∗ ) = dim(P p ) = p + 1, which
have 0 = f (v + w, v + w) = f (v, v) + f (v, w) + is the size of the proposed basis, so we just need to
f (w, v) + f (w, w) = f (v, w) + f (w, v), so show that it is linearly independent. To this end, we
f (v, w) = − f (w, v), which means that f is note that if
skew-symmetric. Conversely, if f is skew- d0 Ec0 + d1 Ec1 + · · · + d p Ec p = 0
symmetric then choosing v = w tells us that
f (v, v) = − f (v, v), so f (v, v) = 0 for all v ∈ V, then
so f is alternating. d0 f (c0 ) + d1 f (c1 ) + · · · + d p f (c p ) = 0
456 Appendix C. Selected Exercise Solutions
for all f ∈ P p . However, if we choose f to be the non- 1.3.25 We just check the three defining proper-
zero polynomial with roots at each of c1 , . . ., c p (but ties of inner products, each of which fol-
not c0 , since f has degree p and thus at most p roots) lows from the corresponding properties of
then this tells us that d0 f (c0 ) = 0, so d0 = 0. A similar the inner product on W. (a) hv1 , v2 iV =
argument with polynomials having roots at all except hT (v1 ), T (v2 )iW = hT (v2 ), T (v1 )iW = hv2 , v1 iV .
for one of the c j ’s shows that d0 = d1 = · · · = d p = 0, (b) hv1 , v2 + cv3 iV = hT (v1 ), T (v2 + cv3 )iW =
which gives linear independence. hT (v1 ), T (v2 ) + cT (v3 )iW = hT (v1 ), T (v2 )iW +
chT (v1 , T (v3 )iW = hv1 , v2 iV + chv1 , v3 iV . (c)
1.3.24 Suppose that hv1 , v1 iV = hT (v1 ), T (v1 )iW ≥ 0, with equality if
and only if T (v1 ) = 0, which happens if and only
c1 T (v1 ) · · · + cn T (vn ) = 0.
if v1 = 0. As a side note, the equality condition of
By linearity of T , this implies T (c1 v1 + · · · + cn vn ) = property (c) is the only place where we used the fact
0, so φc1 v1 +···+cn vn = 0, which implies that T is an isomorphism (i.e., invertible).
φc1 v1 +···+cn vn ( f ) = f (c1 v1 + · · · + cn vn ) 1.3.27 If jk = j` then the condition a j1 ,..., jk ,..., j` ,..., j p =
= c1 f (v1 ) + · · · + cn f (vn ) = 0 −a j1 ,..., j` ,..., jk ,..., j p tells us that a j1 ,..., jk ,..., j` ,..., j p = 0
whenever two or more of the subscripts are equal
for all f ∈ V ∗ . Since B is linearly independent, we to each other. If all of the subscripts are dis-
can choose f so that f (v1 ) = 1 and f (v2 ) = · · · = tinct from each other (i.e., there is a permutation
f (vn ) = 0, which implies c1 = 0. A similar argu- σ : {1, 2, . . . , p} → {1, 2, . . . , p} such that σ (k) = jk
ment (involving different choices of f ) shows that for all 1 ≤ k ≤ p) then we see that aσ (1),...,σ (p) =
c2 = · · · = cn = 0 as well, so C is linearly independent. sgn(σ )a1,2,...,p , where sgn(σ ) is the sign of σ (see
Appendix A.1.5). It follows that A is completely de-
termined by the value of a1,2,...,p , so it is unique up to
scaling.
1.4.5 (a) True. This follows from Theorem 1.4.8.
(c) False. If A = A^T and B = B^T then (AB)^T = B^T A^T = BA, which in general does not equal AB. For an explicit counter-example, you can choose
    A = [ 1   0 ]    and    B = [ 0  1 ]
        [ 0  −1 ]               [ 1  0 ].
(e) False. For example, U = V = I are unitary, but U + V = 2I is not.
(g) True. On any inner product space V, I_V satisfies I_V² = I_V and I_V* = I_V.

1.4.8 If U and V are unitary then (UV)*(UV) = V*U*UV = V*V = I, so UV is also unitary.

1.4.10 If λ is an eigenvalue of U corresponding to an eigenvector v then Uv = λv. Unitary matrices preserve length, so ‖v‖ = ‖Uv‖ = ‖λv‖ = |λ|‖v‖. Dividing both sides of this equation by ‖v‖ (which is OK since eigenvectors are non-zero) gives 1 = |λ|.

1.4.11 Since U*U = I, we have 1 = det(I) = det(U*U) = det(U*) det(U) = conj(det(U)) det(U) = |det(U)|². Taking the square root of both sides of this equation gives |det(U)| = 1.

1.4.12 We know from Exercise 1.4.10 that the eigenvalues (diagonal entries) of U must all have magnitude 1. Since the columns of U each have norm 1, it follows that the off-diagonal entries of U must each be 0.

1.4.15 Direct computation shows that the (i, j)-entry of F*F equals (1/n) ∑_{k=0}^{n−1} ω^{(j−i)k}. If i = j (i.e., j − i = 0) then ω^{(j−i)k} = 1, so the (i, i)-entry of F*F equals 1. If i ≠ j then ω^{j−i} ≠ 1 is an n-th root of unity, so we claim that this sum equals 0 (and thus F*F = I, so F is unitary).
To see why ∑_{k=0}^{n−1} ω^{(j−i)k} = 0 when i ≠ j, we use a standard formula for summing a geometric series:
    ∑_{k=0}^{n−1} ω^{(j−i)k} = (1 − ω^{(j−i)n}) / (1 − ω^{j−i}) = 0,
with the final equality following from the fact that ω^{(j−i)n} = 1.
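As a quick numerical sanity check of Exercise 1.4.15 (an addition, not from the text), the snippet below builds the n × n Fourier matrix with entries ω^{jk}/√n (an indexing and normalization convention assumed here) and verifies that F*F = I:

import numpy as np

n = 8
omega = np.exp(2j * np.pi / n)
F = np.array([[omega ** (j * k) for k in range(n)] for j in range(n)]) / np.sqrt(n)
print(np.allclose(F.conj().T @ F, np.eye(n)))   # True, so F is unitary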
1.4.17 (a) For the “if” direction we note that (Ax) · y = (Ax)^T y = x^T A^T y and x · (A^T y) = x^T A^T y, which are the same. For the “only if” direction, we can either let x and y range over all standard basis vectors to see that the (i, j)-entry of A equals the (j, i)-entry of B, or use the “if” direction together with uniqueness of the adjoint (Theorem 1.4.8).
(b) Almost identical to part (a), but just recall that the complex dot product has a complex conjugate in it, so (Ax) · y = (Ax)* y = x* A* y and x · (A* y) = x* A* y, which are equal to each other.

1.4.18 This follows directly from the fact that if B = {v_1, . . . , v_n} then [v_j]_E = v_j for all 1 ≤ j ≤ n, so P_{E←B} = [ v_1 | v_2 | · · · | v_n ], which is unitary if and only if its columns (i.e., the members of B) form an orthonormal basis of F^n.

1.4.19 We mimic the proof of Theorem 1.4.9. To see that (a) implies (c), suppose B*B = C*C. Then for all v ∈ F^n we have
    ‖Bv‖ = √((Bv) · (Bv)) = √(v* B* B v) = √(v* C* C v) = √((Cv) · (Cv)) = ‖Cv‖.
For the implication (c) ⟹ (b), note that if ‖Bv‖² = ‖Cv‖² for all v ∈ F^n then (Bv) · (Bv) = (Cv) · (Cv). If x, y ∈ F^n then this tells us (by choosing v = x + y) that
    (B(x + y)) · (B(x + y)) = (C(x + y)) · (C(x + y)).
Expanding this dot product on both the left and right then gives
    (Bx) · (Bx) + 2Re((Bx) · (By)) + (By) · (By) = (Cx) · (Cx) + 2Re((Cx) · (Cy)) + (Cy) · (Cy).
By then using the facts that (Bx) · (Bx) = (Cx) · (Cx) and (By) · (By) = (Cy) · (Cy), we can simplify the above equation to the form
    Re((Bx) · (By)) = Re((Cx) · (Cy)).
If F = R then this implies (Bx) · (By) = (Cx) · (Cy) for all x, y ∈ F^n, as desired. If instead F = C then we can repeat the above argument with v = x + iy to see that
    Im((Bx) · (By)) = Im((Cx) · (Cy)),
so in this case we have (Bx) · (By) = (Cx) · (Cy) for all x, y ∈ F^n too, establishing (b).
Finally, to see that (b) ⟹ (a), note that if we rearrange (Bv) · (Bw) = (Cv) · (Cw) slightly, we get
    ((B*B − C*C)v) · w = 0 for all v, w ∈ F^n.
If we choose w = (B*B − C*C)v then this implies ‖(B*B − C*C)v‖² = 0 for all v ∈ F^n, so (B*B − C*C)v = 0 for all v ∈ F^n. This in turn implies B*B − C*C = O, so B*B = C*C, which completes the proof.

1.4.20 Since B is mutually orthogonal and thus linearly independent, Exercise 1.2.26(a) tells us that there is a (not necessarily orthonormal) basis D of V such that B ⊆ D. Applying the Gram–Schmidt process to this basis D results in an orthonormal basis C of V that also contains B (since the vectors from B are already mutually orthogonal and normalized, the Gram–Schmidt process does not affect them).

1.4.22 Pick orthonormal bases B and C of V and W, respectively, so that rank(T) = rank([T]_{C←B}) = rank([T]_{C←B}*) = rank([T*]_{B←C}) = rank(T*). In the second equality, we used the fact that rank(A*) = rank(A) for all matrices A, which is typically proved in introductory linear algebra courses.

1.4.23 If R and S are two adjoints of T, then
    ⟨v, R(w)⟩ = ⟨T(v), w⟩ = ⟨v, S(w)⟩ for all v, w.
Rearranging slightly gives
    ⟨v, (R − S)(w)⟩ = 0 for all v ∈ V, w ∈ W.
Exercise 1.4.27 then shows that R − S = O, so R = S, as desired.

1.4.24 (a) If v, w ∈ F^n, E is the standard basis of F^n, and B is any basis of F^n, then
    P_{B←E} v = [v]_B and P_{B←E} w = [w]_B.
By plugging this fact into Theorem 1.4.3, we see that ⟨·, ·⟩ is an inner product if and only if it has the form
    ⟨v, w⟩ = [v]_B · [w]_B = (P_{B←E} v) · (P_{B←E} w) = v* (P_{B←E}* P_{B←E}) w.
Recalling that change-of-basis matrices are invertible and every invertible matrix is a change-of-basis matrix completes the proof.
(b) The matrix
    P = [ 1  2 ]
        [ 0  1 ]
works.
(c) If P is not invertible then it may be the case that ⟨v, v⟩ = 0 even if v ≠ 0, which violates the third defining property of inner products. In particular, it will be the case that ⟨v, v⟩ = 0 whenever v ∈ null(P).

1.4.25 If B = {v_1, . . . , v_n} then set Q = [ v_1 | · · · | v_n ] ∈ M_n, which is invertible since B is linearly independent. Then set P = Q^{−1}, so that the function

1.4.29 (a) If P = P* then
    ⟨P(v), v − P(v)⟩ = ⟨P²(v), v − P(v)⟩ = ⟨P(v), P*(v − P(v))⟩ = ⟨P(v), P(v − P(v))⟩ = ⟨P(v), P(v) − P(v)⟩ = 0,
so P is orthogonal.
(b) If P is orthogonal (i.e., ⟨P(v), v − P(v)⟩ = 0 for all v ∈ V) then ⟨(P − P* ∘ P)(v), v⟩ = 0 for all v ∈ V, so Exercise 1.4.28 tells us that P − P* ∘ P = O, so P = P* ∘ P. Since P* ∘ P is self-adjoint, it follows that P is as well.

1.4.30 (a) If the columns of A are linearly independent then rank(A) = n, and we know in general that rank(A*A) = rank(A), so rank(A*A) = n as well. Since A*A is an n × n matrix, this tells us that it is invertible.
(b) We first show that P is an orthogonal projection:
    P² = A(A*A)^{−1}(A*A)(A*A)^{−1}A* = A(A*A)^{−1}A* = P,
Constructing P_c(e^x) via this basis (as we did in Example 1.4.18) gives the following polynomial (we use sinh(c) = (e^c − e^{−c})/2 and cosh(c) = (e^c + e^{−c})/2 to simplify things a bit):
    P_c(e^x) = (15/(2c^5)) ((c² + 3) sinh(c) − 3c cosh(c)) x²
             − (3/c³) (sinh(c) − c cosh(c)) x
             − (3/(2c³)) ((c² + 5) sinh(c) − 5c cosh(c)).
(b) Standard techniques from calculus like L'Hôpital's rule show that as c → 0, the coefficients of x², x, and 1 in P_c(e^x) above go to 1/2, 1, and 1, respectively, so
    lim_{c→0⁺} P_c(e^x) = (1/2)x² + x + 1,
which we recognize as the degree-2 Taylor polynomial of e^x at x = 0. This makes intuitive sense since P_c(e^x) is the best approximation of e^x on the interval [−c, c], whereas the Taylor polynomial is its best approximation at x = 0.
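As a numerical check of the formula above (an addition, not from the text), the sketch below compares its coefficients with a discrete least-squares degree-2 fit of e^x on a fine grid over [−c, c], which approximates the continuous projection; the grid size and the value of c are arbitrary choices.

import numpy as np

c = 0.8
x = np.linspace(-c, c, 20001)
fit = np.polyfit(x, np.exp(x), 2)            # [x^2 coefficient, x coefficient, constant]

sh, ch = np.sinh(c), np.cosh(c)
closed_form = [15 / (2 * c**5) * ((c**2 + 3) * sh - 3 * c * ch),
               -3 / c**3 * (sh - c * ch),
               -3 / (2 * c**3) * ((c**2 + 5) * sh - 5 * c * ch)]
print(np.round(fit, 6))
print(np.round(closed_form, 6))              # agrees; tends to [0.5, 1, 1] as c -> 0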
(c) If we let A = E_{j,j+1} and B = E_{j+1,j} (with 1 ≤ j < n), then we get AB − BA = E_{j,j+1}E_{j+1,j} − E_{j+1,j}E_{j,j+1} = E_{j,j} − E_{j+1,j+1}. Similarly, if A = E_{i,k} and B = E_{k,j} (with i ≠ j) then we get AB − BA = E_{i,k}E_{k,j} − E_{k,j}E_{i,k} = E_{i,j}. It follows that W contains each of the n² − 1 matrices in the following set:
    {E_{j,j} − E_{j+1,j+1}} ∪ {E_{i,j} : i ≠ j}.
Furthermore, it is straightforward to show that this set is linearly independent, so dim(W) ≥ n² − 1. Since W ⊆ Z and dim(Z) = n² − 1, it follows that dim(W) = n² − 1 as well, so W = Z.

1.A.7 (a) We note that f(E_{j,j}) = 1 for all 1 ≤ j ≤ n since E_{j,j} is a rank-1 orthogonal projection. To see that f(E_{j,k}) = 0 whenever 1 ≤ j ≠ k ≤ n (and thus show that f is the trace), we notice that E_{j,j} + E_{j,k} + E_{k,j} + E_{k,k} is a rank-2 orthogonal projection, so applying f to it gives a value of 2. Linearity of f then says that f(E_{j,k}) + f(E_{k,j}) = 0 whenever j ≠ k. A similar argument shows that E_{j,j} + iE_{j,k} − iE_{k,j} + E_{k,k} is a rank-2 orthogonal projection, so linearity of f says that f(E_{j,k}) − f(E_{k,j}) = 0 whenever j ≠ k. Putting these two facts together shows that f(E_{j,k}) = f(E_{k,j}) = 0, which completes the proof.
(b) Consider the linear form f : M_2(R) → R defined by
    f( [ a  b ] ) = a + b − c + d.
       [ c  d ]
This linear form f coincides with the trace on all symmetric matrices (and thus all orthogonal projections P ∈ M_2(R)), but not on all A ∈ M_2(R).

1.A.8 (a) First use the fact that I = AA^{−1} and multiplicativity of the determinant to write
    f_{A,B}(x) = det(A + xB) = det(A) det(I + x(A^{−1}B)).
It then follows from Theorem 1.A.4 that
    f′_{A,B}(0) = det(A) tr(A^{−1}B).
(b) The definition of the derivative says that
    (d/dt) det(A(t)) = lim_{h→0} [det(A(t + h)) − det(A(t))] / h.
Using Taylor's theorem (Theorem A.2.3) tells us that
    A(t + h) = A(t) + h (dA/dt) + h P(h),
where P(h) is a matrix whose entries depend on h in a way so that lim_{h→0} P(h) = O. Substituting this expression for A(t + h) into the earlier expression for the derivative of det(A(t)) shows that
    (d/dt) det(A(t)) = lim_{h→0} [det(A(t) + h (dA/dt) + h P(h)) − det(A(t))] / h
                     = lim_{h→0} [det(A(t) + h (dA/dt)) − det(A(t))] / h
                     = f′_{A(t),dA/dt}(0),
where the second-to-last equality follows from the fact that adding hP(h) inside the determinant, which is a polynomial in the entries of its argument, just adds lots of terms that have h[P(h)]_{i,j} as a factor for various values of i and j. Since lim_{h→0} h[P(h)]_{i,j}/h = 0 for all i and j, these terms make no contribution to the value of the limit.
(c) This follows immediately from choosing B = dA/dt in part (a) and then using the result that we proved in part (b).
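A small numerical check of the identity from parts (a)–(c), namely (d/dt) det A(t) = det(A(t)) tr(A(t)^{−1} A′(t)); this snippet is an addition rather than part of the solution, and the particular A(t) = A_0 + tA_1 is just a convenient smooth example.

import numpy as np

rng = np.random.default_rng(0)
A0, A1 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
A = lambda t: A0 + t * A1                    # so that dA/dt = A1
t, h = 0.3, 1e-6
numeric = (np.linalg.det(A(t + h)) - np.linalg.det(A(t - h))) / (2 * h)
exact = np.linalg.det(A(t)) * np.trace(np.linalg.inv(A(t)) @ A1)
print(numeric, exact)                        # agree up to the finite-difference error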
(d) True. We know from Theorem 1.B.7 that the By uniqueness of projections (Exercise 1.B.7),
range of A and the null space of A∗ are orthog- we now just need to show that range(P) =
onal complements of each other. range(A) and null(P) = null(B∗ ). These facts
(f) False. R2 ⊕ R3 has dimension 5, so it can- both follow quickly from the facts that
not have a basis consisting of 6 vectors. range(AC) ⊆ range(A) and null(A) ⊆ null(CA)
Instead, the standard basis of R2 ⊕ R3 is for all matrices A and C.
{(e1 , 0), (e2 , 0), (0, e1 ), (0, e2 ), (0, e3 )}. More specifically, these facts immediately tell
us that range(P) ⊆ range(A) and null(B∗ ) ⊆
1.B.4 We need to show that span(span(v1 ), . . . , span(vk )) = null(P). To see the opposite inclusions, no-
V and tice that PA = A so range(A) ⊆ range(P), and
[ B∗ P = B∗ so null(P) ⊆ null(B∗ ).
span(vi ) ∩ span v j = {0}
j6=i
1.B.10 To see that the orthogonal complement is indeed
for all 1 ≤ i ≤ k. The first property fol- a subspace, we need to check that it is non-empty
lows immediately from the fact that {v1 , . . . , vk } (which it is, since it contains 0) and verify that the
is a basis of V so span(v1 , . . . , vk ) = V, so two properties of Theorem 1.1.2 hold. For prop-
span(span(v1 ), . . . , span(vk )) = V too. erty (a), suppose v1 , v2 ∈ B⊥ . Then for all w ∈ B we
have hv1 + v2 , wi = hv1 , wi + hv2 , wi = 0 + 0 = 0,
To seeSthe other property, suppose w ∈ span(vi ) ∩
span j6=i v j for some 1 ≤ i ≤ k. Then there exist so v1 + v2 ∈ B⊥ . For property (b), suppose v ∈ B⊥
scalars c1 , c2 , . . ., ck such that w = ci vi = ∑ j6=i c j v j , and c ∈ F. Then for all w ∈ B we have hcv, wi =
which (since {v1 , v2 , . . . , vk } is a basis and thus lin- chv, wi = c0 = 0, so cv ∈ B⊥ .
early independent) implies c1 = c2 = · · · = ck = 0, ⊥
so w = 0. 1.B.14 Since B⊥ = span(B) (by Exercise 1.B.13), it
suffices to prove that (S ⊥ )⊥ = S when S is a
1.B.5 If v ∈ S1 ∩ S2 when we can write v = c1 v1 + subspace of V. To see this, just recall from The-
· · · + ck vk and v = d1 w1 + · · · + dm wm for orem 1.B.6 that S ⊕ S ⊥ = V. However, it is also
some v1 , . . . , vk ∈ B, w1 , . . . , wm ∈ C, and scalars the case that (S ⊥ )⊥ ⊕ S ⊥ = V, so S = (S ⊥ )⊥ by
c1 , . . . , ck , d1 , . . . , dm . By subtracting these two equa- Exercise 1.B.12.
tions for v from each other, we see that
1.B.15 We already know from part (a) that range(T )⊥ =
c1 v1 + · · · + ck vk − d1 w1 − · · · − dm wm = 0, null(T ∗ ) for all T . If we replace T by T ∗ throughout
which (since B ∪C is linearly independent) implies that equation, we see that range(T ∗ )⊥ = null(T ).
c1 = · · · = ck = 0 and d1 = · · · = dm = 0, so v = 0. Taking the orthogonal complement of both sides
(and using the fact that (S ⊥ )⊥ = S for all sub-
1.B.6 It suffices to show that span(S1 ∪ S2 ) = V. To spaces S, thanks to Exercise 1.B.14) we then see
this end, recall that Theorem 1.B.2 says that if B range(T ∗ ) = null(T )⊥ , as desired.
and C are bases of S1 and S2 , respectively, then
(since (ii) holds) B ∪C is linearly independent. Since 1.B.16 Since rank(T ) = rank(T ∗ ) (i.e., dim(range(T )) =
|B ∪C| = dim(V), Exercise 1.2.27 implies that B ∪C dim(range(T ∗ )), we just need to show that v = 0 is
is a basis of V, so using Theorem 1.B.2 again tells the only solution to S(v) = 0. To this end, notice
us that V = S1 ⊕ S2 . that if v ∈ range(T ∗ ) is such that S(v) = 0 then v ∈
null(T ) too. However, since range(T ∗ ) = null(T )⊥ ,
1.B.7 Notice that if P is any projection with range(P) = S1 we know that the only such v is v = 0, which is
and null(P) = S2 , then we know from Example 1.B.8 exactly what we wanted to show.
that V = S1 ⊕ S2 . If B and C are bases of S1 and S2 ,
respectively, then Theorem 1.B.2 tells us that B ∪C 1.B.17 Unfortunately, we cannot quite use Theorem 1.B.6,
is a basis of V and thus P is completely determined since that result only applies to finite-dimensional
by how it acts on the members of B and C. However, inner product spaces, but we R 1 can use the same idea.
P being a projection tells us exactly how it behaves First notice that h f , gi = −1 f (x)g(x) dx = 0 for
on those two sets: Pv = v for all v ∈ B and Pw = 0 all f ∈ P E [−1, 1] and g ∈ P O [−1, 1]. This follows
for all w ∈ C, so P is completely determined. immediately from the fact that if f is even and g is
odd then f g is odd and thus has integral (across any
1.B.8 (a) Since range(A) ⊕ null(B∗ ) = Fm , we know that interval that is symmetric through the origin) equal
if B∗ Av = 0 then Av = 0 (since Av ∈ range(A) to 0. It follows that (P E [−1, 1])⊥ ⊇ P O [−1, 1]. To
so the only way Av can be in null(B∗ ) too is see that equality holds, suppose g ∈ (P E [−1, 1])⊥
if Av = 0). Since A has linearly independent and write g = gE + gO , where gE ∈ P E [−1, 1] and
columns, this further implies v = 0, so the lin- gO ∈ P O [−1, 1] (which we know we can do thanks to
ear system B∗ Av = 0 in fact has a unique solu- Example 1.B.2). If h f , gi = 0 for all f ∈ P E [−1, 1]
tion. Since B∗ A is square, it must be invertible. then h f , gE i = h f , gO i = 0, so h f , gE i = 0 (since
(b) We first show that P is a projection: h f , gO i = 0 when f is even and gO is odd), which
implies (by choosing f = gE ) that gE = 0, so
P2 = A(B∗ A)−1 (B∗ A)(B∗ A)−1 B∗ g ∈ P O [−1, 1].
= A(B∗ A)−1 B∗ = P.
1.B.18 (a) Suppose f ∈ S ⊥ so that h f , gi = 0 for all for some {(vi , 0)} ⊆ B0 (i.e., {vi } ⊆ B), {(0, w j )} ⊆
g ∈ S. Notice that the function g(x) = x f (x) C0 (i.e., {w j } ⊆ C), and scalars c1 , . . . , ck , d1 , . . . , dm .
satisfies g(0) = 0, so g ∈ S, so 0 = h f , gi = This implies c1 v1 + · · · + ck vk = 0, which (since B
R1 2 2
0 x( f (x)) dx. Since x( f (x)) is continuous
is linearly independent) implies c1 = · · · = ck = 0.
and non-negative on the interval [0, 1], the only A similar argument via linear independence of C
way this integral can equal 0 is if x( f (x))2 = 0 shows that d1 = · · · = dm = 0, which implies B0 ∪C0
for all x, so (by continuity of f ) f (x) = 0 for is linearly independent as well.
all 0 ≤ x ≤ 1 (i.e., f = 0).
(b) It follows from part (a) that (S ⊥ )⊥ = {0}⊥ = 1.B.21 (a) If (v1 , 0), (v2 , 0) ∈ V 0 (i.e., v1 , v2 ∈ V) and
C[0, 1] 6= S. c is a scalar then (v1 , 0) + c(v2 , 0) = (v1 +
cv2 , 0) ∈ V 0 as well, since v1 + cv2 ∈ V. It fol-
1.B.19 Most of the 10 properties from Definition 1.1.1 lows that V 0 is a subspace of V ⊕ W, and a
follow straightforwardly from the corresponding similar argument works for W 0 .
properties of V and W, so we just note that the zero (b) The function T : V → V 0 defined by T (v) =
vector in V ⊕ W is (0, 0), and −(v, w) = (−v, w) for (v, 0) is clearly an isomorphism.
all v ∈ V and w ∈ W. (c) We need to show that V 0 ∩ W 0 = {(0, 0)} and
span(V 0 , W 0 ) = V ⊕ W. For the first property,
1.B.20 Suppose suppose (v, w) ∈ V 0 ∩ W 0 . Then w = 0 (since
(v, w) ∈ V 0 ) and v = 0 (since (v, w) ∈ W 0 ), so
k m
(v, w) = (0, 0), as desired. For the second prop-
∑ ci (vi , 0) + ∑ d j (0, w j ) = (0, 0) erty, just notice that every (v, w) ∈ V ⊕ W can
i=1 j=1
be written in the form (v, w) = (v, 0) + (0, w),
where (v, 0) ∈ V 0 and (0, w) ∈ W 0 .
Second, k f k p ≥ 0 for all f ∈ C[a, b] because inte- In particular, this means that there is a non-zero vec-
grating a non-negative function gives a non-negative tor (which has non-zero norm) c1 v1 + · · · + cn vn ∈ V
answer. Furthermore, k f k p = 0 implies f is the zero that gets sent to the zero vector (which has norm 0),
function since otherwise f (x) > 0 for some x ∈ [a, b] so T is not an isometry.
and thus f (x) > 0 on some subinterval of [a, b] by
continuity of f , so the integral and thus k f k p would 1.D.22 We prove this theorem by showing the chain of im-
both have to be strictly positive as well. plications (b) =⇒ (a) =⇒ (c) =⇒ (b), and mimic
the proof of Theorem 1.4.9.
1.D.14 We just mimic the proof of Hölder’s inequality To see that (b) implies (a), suppose T ∗ ◦T = IV . Then
for vectors in Cn . Without loss of generality, we for all v ∈ V we have
just need to prove the theorem in the case when
k f k p = kgkq = 1. By Young’s inequality, we know kT (v)k2W = hT (v), T (v)i = hv, (T ∗ ◦ T )(v)i
that = hv, vi = kvk2V ,
| f (x)| p |g(x)|q
| f (x)g(x)| ≤ +
p q so T is an isometry.
for all x ∈ [a, b]. Integrating then gives For (a) =⇒ (c), note that if T is an isometry then
Z b Z b Z b kT (v)k2W = kvk2V for all v ∈ V, so expanding these
| f (x)| p |g(x)|q
| f (x)g(x)| dx ≤ dx + dx quantities in terms of the inner product (like we did
a a p a q above) shows that
q
k f k pp kgkq 1 1
= + = + = 1, hT (v), T (v)i = hv, vi for all v ∈ V.
p q p q
as desired. Well, if x, y ∈ V then this tells us (by choosing
v = x + y) that
1.D.16 First, we compute
hT (x + y), T (x + y)i = hx + y, x + yi.
1 3 1
hv, vi = ∑ ik kv + ik vk2V = kvk2V ,
4 k=0 Expanding the inner product on both sides of the
above equation then gives
which is clearly non-negative with equality if and
only if v = 0. Similarly, hT (x), T (x)i + 2Re hT (x), T (y)i + hT (y), T (y)i
1 3 1 = hx, xi + 2Re hx, yi + hy, yi.
hv, wi = ∑ ik kv + ik wk2V
4 k=0
By then using the fact that hT (x), T (x)i = hx, xi and
1 3 1 hT (y), T (y)i = hy, yi, we can simplify the above
= ∑ k kik v + wk2V equation to the form
4 k=0 i
1 3 k Re hT (x), T (y)i = Re hx, yi .
= ∑ i kw + ik vk2V = hw, vi.
4 k=0 If V is a vector space over R, then this implies
All that remains is to show that hv, w + cxi = hv, wi + hT (x), T (y)i = hx, yi for all x, y ∈ V. If instead V
chv, xi for all v, w, x ∈ V and all c ∈ C. The fact that is a vector space over C then we can repeat the above
hv, w + xi = hv, wi + hv, xi for all v, w, x ∈ V can argument with v = x + iy to see that
be proved in a manner identical to that given in the
Im hT (x), T (y)i = Im hx, yi ,
proof of Theorem 1.D.8, so we just need to show
that hv, cwi = chv, wi for all v, w ∈ V and c ∈ C. As so in this case we have hT (x), T (y)i = hx, yi for all
suggested by the hint, we first notice that x, y ∈ V too, establishing (c).
1 3 1 Finally, to see that (c) =⇒ (b), note that if we rear-
hv, iwi = ∑ ik kv + ik+1 wk2V
4 k=0
range the equation hT (x), T (y)i = hx, yi slightly, we
get
1 3 1
= ∑ ik−1 kv + iwk2V
4 k=0
hx, (T ∗ ◦ T − IV )(y)i = 0 for all x, y ∈ V.
Well, if we choose x = (T ∗ ◦ T − IV )(y) then this
1 3 1 implies
=i ∑ ik kv + iwk2V = ihv, wi.
4 k=0
k(T ∗ ◦ T − IV )(y)k2V = 0 for all y ∈ V.
When we combine this observation with the fact that
this inner product reduces to exactly the one from This implies (T ∗ ◦T −IV )(y) = 0, so (T ∗ ◦T )(y) = y
the proof of Theorem 1.D.8 when v and w are real, for all y ∈ V, which means exactly that T ∗ ◦ T = IV ,
we see that hv, cwi = chv, wi simply by splitting thus completing the proof.
everything into their real and imaginary parts.
1.D.24 For the “if” direction, we note (as was noted in the
1.D.20 If dim(V) > dim(W) and {v1 , . . . , vn } is a basis of proof of Theorem 1.D.10) that any P with the speci-
V, then {T (v1 ), . . . , T (vn )} is necessarily linearly fied form just permutes the entries of v and possibly
dependent (since it contains more than dim(W) vec- multiplies them by a number with absolute value
tors). There thus exist scalars c1 , . . . , cn , not all equal 1, and such an operation leaves kvk1 = ∑ j |v j | un-
to 0, such that changed.
T (c1 v1 + · · · + cn vn ) = c1 T (v1 ) + · · · + cn T (vn )
= 0.
For the “only if” direction, suppose P = 1.D.26 Notice that if pn (x) = xn then
p1 | p2 | · · · | pn is an isometry of the 1-norm. Z 1
Then Pe j = p j for all j, so 1
kpn k1 = xn dx = and
0 n+1
kp j k1 = kPe j k1 = ke j k1 = 1 for all 1 ≤ j ≤ n.
kpn k∞ = max xn = 1.
Similarly, P(e j + ek ) = p j + pk for all j, k, so 0≤x≤1
kp j + pk k1 = kP(e j + ek )k1 = ke j + ek k1 = 2 In particular, this means that there does not exist a
for all j and k. We know from the triangle inequal- constant C > 0 such that 1 = kpn k∞ ≤ Ckpn k1 =
ity (or equivalently from Exercise 1.D.10(b)) that C/(n + 1) for all n ≥ 1, since we would need
the above equality can only hold if there exist non- C ≥ n + 1 for all n ≥ 1.
negative real constants ci, j,k ≥ 0 such that, for each
i, j, k, it is the case that either pi, j = ci, j,k pi,k or 1.D.27 Since k · ka and k · kb are equivalent, there exist
pi,k = 0. scalars c,C > 0 such that
However, we can repeat this argument with the fact
ckvka ≤ kvkb ≤ Ckvka for all v ∈ V.
that P(e j − ek ) = p j − pk for all j, k to see that
kp j − pk k1 = kP(e j − ek )k1 = ke j − ek k1 = 2 Similarly, equivalence of k · kb and k · kc tells us that
there exist scalars d, D > 0 such that
for all j and k as well. Then by using Exer-
cise 1.D.10(b) again, we see that there exist non- dkvkb ≤ kvkc ≤ Dkvkb for all v ∈ V.
negative real constants di, j,k ≥ 0 such that, for each
i, j, k, it is the case that either pi, j = −di, j,k pi,k or Basic algebraic manipulations of these inequalities
pi,k = 0. show that
Since each ci, j,k and di, j,k is non-negative, it follows cdkvka ≤ kvkc ≤ CDkvka for all v ∈ V,
that if pi,k 6= 0 then pi, j = 0 for all j 6= k. In other
words, each row of P contains at most one non-zero so k · ka and k · kc are equivalent too.
entry (and each row must indeed contain at least
one non-zero entry since P is invertible by Exer- 1.D.28 Both directions of this claim follow just from notic-
cise 1.D.21). ing that all three defining properties of k · kV follow
Every row thus has exactly one non-zero entry. By immediately from the three corresponding properties
using (again) the fact that isometries must be invert- of k · kFn (e.g., if we know that k · kF
n is a
norm then
ible, it follows that each of the non-zero entries must we can argue that kvk = 0 implies
[v]B
Fn = 0, so
occur in a distinct column (otherwise there would [v]B = 0, so v = 0).
be a zero column). The fact that each non-zero entry More generally, we recall that the function that sends
has absolute value 1 follows from simply noting that a vector v ∈ V to its coordinate vector [v]B ∈ Fn is
P must preserve the 1-norm of each standard basis an isomorphism. It is straightforward to check that
vector e j . if T : V → W is an isomorphism then the function
defined by kvk = kT (v)kW is a norm on V (com-
1.D.25 Instead of just noting that max |pi, j ± pi,k | = 1 pare with the analogous statement for inner products
1≤i≤n
for all 1 ≤ j, k ≤ n, we need to observe that given in Exercise 1.3.25).
max1≤i≤n |pi, j + zpi,k | = 1 whenever z ∈ C is
such that |z| = 1. The rest of the proof follows with
no extra changes needed.
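As a numerical illustration of the 1-norm isometries characterized in Exercises 1.D.24 and 1.D.25 (an addition, not from the text), the snippet below builds a permutation matrix scaled by unimodular phases and checks that it leaves the 1-norm unchanged; the specific permutation and phases are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
perm = np.eye(4)[[2, 0, 3, 1]]                                  # a permutation matrix
phases = np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, 4)))     # diagonal with |d_jj| = 1
P = phases @ perm
v = rng.standard_normal(4) + 1j * rng.standard_normal(4)
print(np.sum(np.abs(v)), np.sum(np.abs(P @ v)))                 # the two 1-norms agree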
2.1.4 In all parts of this question, we call the given matrix A.
(a) A = UDU*, where
    D = [ 5  0 ]    and    U = (1/√2) [ 1   1 ]
        [ 0  1 ]                      [ 1  −1 ].
(c) A = UDU*, where
    D = [ 1   0 ]    and    U = (1/√2) [ 1  −1 ]
        [ 0  −1 ]                      [ i   i ].
(e) A = UDU*, where
    D = (1/2) [ 1 + √3 i      0       0 ]    and    U = (1/(2√3)) [     2           2       −2 ]
              [     0      1 − √3 i   0 ]                         [ −1 + √3 i   −1 − √3 i   −2 ]
              [     0          0      4 ]                         [ −1 − √3 i   −1 + √3 i   −2 ].

2.1.5 (b) False. For example, the matrices
    A = [ 1  0 ]    and    B = [  0  1 ]
        [ 0  0 ]               [ −1  0 ]
are normal, but A + B is not.
(d) False. This was shown in part (b) above.
(f) False. By the real spectral decomposition, we know that such a decomposition of A is possible if and only if A is symmetric. There are (real) normal matrices that are not symmetric (and they require a complex D and/or U).
(h) False. For a counter-example, just pick any (non-diagonal) upper triangular matrix with real diagonal entries. However, this becomes true if you add in the restriction that A is normal and has real eigenvalues.

2.1.7 Recall that p_A(λ) = λ² − tr(A)λ + det(A). By the Cayley–Hamilton theorem, it follows that
    p_A(A) = A² − tr(A)A + det(A)I = O.
Multiplying this equation through by A^{−1} shows that det(A)A^{−1} = tr(A)I − A, which we can rearrange as
    A^{−1} = (1/det(A)) [  d  −b ]
                        [ −c   a ].
The reader may have seen this formula when learning introductory linear algebra.

2.1.9 (a) The characteristic polynomial of A is p_A(λ) = λ³ − 3λ² + 4, so the Cayley–Hamilton theorem tells us that A³ − 3A² + 4I = O. Moving I to one side and then factoring A out of the other side then gives I = A((3/4)A − (1/4)A²). It follows that A^{−1} = (3/4)A − (1/4)A².
(b) From part (a) we know that A³ − 3A² + 4I = O. Multiplying through by A gives A⁴ − 3A³ + 4A = O, which can be solved for A to get A = (1/4)(3A³ − A⁴). Plugging this into the formula A^{−1} = (3/4)A − (1/4)A² gives us what we want: A^{−1} = (9/16)A³ − (3/16)A⁴ − (1/4)A².

2.1.12 If A has Schur triangularization A = UTU*, then cyclic commutativity of the trace shows that ‖A‖_F = ‖T‖_F. Since the diagonal entries of T are the eigenvalues of A, we have
    ‖T‖_F = √( ∑_{j=1}^n |λ_j|² + ∑_{i<j} |t_{i,j}|² ).
It follows that
    ‖A‖_F = √( ∑_{j=1}^n |λ_j|² )
if and only if T can actually be chosen to be diagonal, which we know from the spectral decomposition happens if and only if A is normal.

2.1.13 Just apply Exercise 1.4.19 with B = A and C = A*. In particular, the equivalence of conditions (a) and (c) gives us what we want.

2.1.14 If A* ∈ span{I, A, A², A³, . . .} then A* commutes with A (and thus A is normal) since A commutes with each of its powers.
On the other hand, if A is normal then it has a spectral decomposition A = UDU*, and A* = U conj(D) U*. Then let p be the interpolating polynomial with the property that p(λ_j) = conj(λ_j) for all 1 ≤ j ≤ n (some of the eigenvalues of A might be repeated, but that is OK because the eigenvalues of A* are then repeated as well, so we do not run into a problem with trying to set p(λ_j) to equal two different values). Then p(A) = A*, so A* ∈ span{I, A, A², A³, . . .}. In particular, this tells us that if A has k distinct eigenvalues then A* ∈ span{I, A, A², . . . , A^{k−1}}.

2.1.17 In all cases, we write A in a spectral decomposition A = UDU*.
(a) A is Hermitian if and only if A* = (UDU*)* = UD*U* equals A = UDU*. Multiplying on the left by U* and the right by U shows that this happens if and only if D* = D, if and only if the entries of D (i.e., the eigenvalues of A) are all real.
(b) The same as part (a), but noting that D* = −D if and only if its entries (i.e., the eigenvalues of A) are all imaginary.
(c) A is unitary if and only if I = A*A = (UDU*)*(UDU*) = UD*DU*. Multiplying on the left by U* and the right by U shows that A is unitary if and only if D*D = I, which (since D is diagonal) is equivalent to |d_{j,j}|² = 1 for all 1 ≤ j ≤ n.
(d) If A is not normal then we can let it be triangular (but not diagonal) with whatever diagonal entries (i.e., eigenvalues) we like. Such a matrix is not normal (see Exercise 2.1.16) and thus is not Hermitian, skew-Hermitian, or unitary.

2.1.19 (a) Use the spectral decomposition to write A = UDU*, where D is diagonal with the eigenvalues of A along its diagonal. If we recall that rank is similarity-invariant then we see that rank(A) = rank(D), and the rank of a diagonal matrix equals the number of non-zero entries that it has (i.e., rank(D) equals the number of non-zero eigenvalues of A).
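The following numerical illustration of Exercise 2.1.12 above is an addition (not from the text): for a normal matrix the Frobenius norm equals the square root of ∑|λ_j|², while for a non-normal matrix it is strictly larger; the two test matrices are arbitrary examples.

import numpy as np

normal = np.array([[1, 1j], [1j, 1]])        # satisfies M M* = M* M
non_normal = np.array([[1, 5], [0, 2]])      # upper triangular, not diagonal
for M in (normal, non_normal):
    eigs = np.linalg.eigvals(M)
    print(np.linalg.norm(M, 'fro'), np.sqrt(np.sum(np.abs(eigs) ** 2)))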
(b) Any non-zero upper triangular matrix with Since U is unitary and D is diagonal, this com-
all diagonal entries (i.e., eigenvalues) equal pletes the inductive step and the proof.
to 0 works. They have non-zero rank, but no
non-zero eigenvalues. 2.1.23 (a) See solution to Exercise 1.4.29(a).
(b) Orthogonality of P tells us that hP(v), v −
2.1.20 (a) We already proved this in the proof of Theo- P(v)i = 0, so hT (v), vi = 0 for all v ∈ V. Ex-
rem 2.1.6. Since A is real and symmetric, we ercise 1.4.28 then tells us that T ∗ = −T .
have (c) If B is an orthonormal basis of V then [T ]TB =
λ kvk2 = λ v∗ v = v∗ Av = v∗ A∗ v −[T ]B , so tr([T ]B ) = tr([T ]TB ) = −tr([T ]B ). It
follows that tr([P]B − [P]TB [P]B ) = tr([T ]B ) = 0,
= (v∗ Av)∗ = (λ v∗ v)∗ = λ kvk2 , so tr([P]B ) = tr([P]TB [P]B ) = k[P]TB [P]B k2F .
which implies λ = λ (since every eigenvector v Since tr([P]B ) equals the sum of the eigen-
is, by definition, non-zero), so λ is real. values of [P]B , all of which are 0 or 1, it
(b) Let λ be a (necessarily real) eigenvalue of A with follows from Exercise 2.1.12 that [P]B is nor-
corresponding eigenvector v ∈ Cn . Then mal and thus has a spectral decomposition
[P]B = UDU ∗ . The fact that [P]TB = [P]∗B = [P]B
Av = Av = λ v = λ v,
and thus P∗ = P follows from again recalling
so v is also an eigenvector corresponding to λ . that the diagonal entries of D are all 0 or 1 (and
Since linear combinations of eigenvectors (corre- thus real).
sponding to the same eigenvalue) are still eigen-
vectors, we conclude that Re(v) = (v + v)/2 is a 2.1.26 If A has distinct eigenvalues, we can just notice that
real eigenvector of A corresponding to the eigen- if v is an eigenvector of A (i.e., Av = λ v for some λ )
value λ . then ABv = BAv = λ Bv, so Bv is also an eigenvector
(c) We proceed by induction on n (the size of A) and of A corresponding to the same eigenvalue. However,
note that the n = 1 base case is trivial since every the eigenvalues of A being distinct means that its
1 × 1 real symmetric matrix is real diagonal. For eigenspaces are 1-dimensional, so Bv must in fact be
the inductive step, let λ be a real eigenvalue of a multiple of v: Bv = µv for some scalar µ, which
A with corresponding real eigenvector v ∈ Rn . means exactly that v is an eigenvector of B as well.
By using the Gram–Schmidt process we can find If the eigenvalues of A are not distinct, we instead
a unitary matrix V ∈ Mn (R) with v as its first proceed by induction on n (the size of the matrices).
column: The base case n = 1 is trivial, and for the inductive
V = v | V2 , step we suppose that the result holds for matrices of
where V2 ∈ Mn,n−1 (R) satisfies V2T v = 0 (since size (n−1)×(n−1) and smaller. Let {v1 , . . . , vk } be
V is unitary, v is orthogonal to every column of an orthonormal basis of any one of the eigenspaces S
V2 ). Then direct computation shows that of A. If we let V1 = v1 | · · · | vk then we can extend
T V1 to a unitary matrix V = V1 | V2 . Furthermore,
V T AV = v | V2 A v | V2
T T Bv j ∈ S for all 1 ≤ j ≤ k, by the same argument
v v Av vT AV2 used in the previous paragraph, and straightforward
= T Av | AV2 = T T
V2 λV2 v V2 AV2 calculation shows that
T
∗
λ 0 V1 AV1 V1∗ AV2
= . V ∗ AV = and
0 V2T AV2 O ∗
V2 AV2
∗
We now apply the inductive hypothesis—since V1 BV1 V1∗ BV2
V2T AV2 is an (n − 1) × (n − 1) symmetric matrix, V ∗ BV = .
O V2∗ BV2
there exists a unitary matrix U2 ∈ Mn−1 (R) and
a diagonal D2 ∈ Mn−1 (R) such that V2T AV2 = Since the columns of V1 form an orthonormal basis
U2 D2U2T . It follows that of an eigenspace of A, we have V1∗ AV1 = λ Ik , where
λ is the corresponding eigenvalue, so V1∗ AV1 and
λ 0T
V T AV = V1∗ BV1 commute. By the inductive hypothesis, V1∗ AV1
0 U2 D2U2T
and V1∗ BV1 share a common eigenvector x ∈ Ck , and
1 0 T λ 0T 1 0T it follows that V1 x is a common eigenvector of each
= .
0 U2 0 D2 0 U2T of A and B.
By multiplying on the left by V and on the right
by V T , we see that A = UDU T , where
1 0T λ 0T
U =V and D = .
0 U2 0 D2
(c) D(1, 0), D(2, 0), and D(3, 0). Note that these 2.2.11 First note that a j,i = ai, j since PSD matrices are
discs are really just points located at 1, 2, and 3 Hermitian, so it suffices to show that ai, j = 0 for all
in the complex plane, so the eigenvalues of this i. To this end, let v ∈ Fn be a vector with v j = −1
matrix must be 1, 2, and 3 (which we could see and all entries except for vi and v j equal to 0. Then
directly since it is diagonal). v∗ Av = a j, j − 2Re(vi ai, j ). If it were the case that
(e) D(1, 3), D(3, 3), and another (redundant) copy ai, j 6= 0, then we could choose vi to be a sufficiently
of D(1, 3). large multiple of ai, j so that v∗ Av < 0. Since this
" # contradicts positive semidefiniteness of A, we con-
2.2.3 (a) 1 1 −1 clude that ai, j = 0.
√
2 −1 1
2.2.13 The “only if” direction follows immediately from
(c) 1 0 0 Theorem 2.2.4. For the “if” direction, note that for
√
0 2 0 any vector v (of suitable size that we partition in the
√ same way as A) we have
0 0 3
√ √
(e)
1 + 3 2 1 − 3 A1 · · · O v1
1 h i
. .
√ 2 .. .. ...
∗
2 4 v Av = v1 · · · vn .
∗ ∗
2 3 √ √ . .
1− 3 2 1+ 3 O · · · An vn
2.2.4 (a) This matrix is positive semidefinite so we can
just set U = I and choose P to be itself. = v∗1 A1 v1 + · · · + v∗n An vn .
(d) Since this matrix is invertible, its polar decom- Since A1 , A2 , . . . , An are positive (semi)definite it
position is unique: follows that each term in this sum is non-negative
1 −2 2 2 1 0 (or strictly positive, as appropriate), so the sum is as
1 well, so A is positive (semi)definite.
U = −2 1 2 , P = 1 3 1 .
3
2 2 1 0 1 1 2.2.14 This follows immediately from Theorem 2.2.10,
2.2.5 (a) False. The matrix B from Equation (2.2.1) is a which says that A∗ A = B∗ B if and only if there exists
counter-example. a unitary matrix U such that B = UA. If we choose
(c) True. The identity matrix is Hermitian and has B = A∗ then we see that A∗ A = AA∗ (i.e., A is normal)
all eigenvalues equal to 1. if and only if A∗ = UA.
(e) False. For example, let
" # " # 2.2.16 (a) For the “if” direction just note that any real
2 1 1/2 1 linear combination of PSD (self-adjoint) ma-
A= and B = ,
1 1/2 1 2 trices is self-adjoint. For the “only” direction,
which are both positive semidefinite. Then note that if A = UDU ∗ is a spectral decompo-
" # sition of A then we can define P = UD+U ∗
2 4 and N = −UD−U ∗ , where D+ and D− are the
AB = , diagonal matrices containing the strictly pos-
1 2
itive and negative entries of D, respectively.
which is not even Hermitian, so it cannot be Then P and N are each positive semidefinite
positive semidefinite. and P − N = UD+U ∗ + UD−U ∗ = U(D+ +
(g) False. A counter-example to this claim was D− )U ∗ = UDU ∗ = A.
provided in Remark 2.2.2. [Side note: The P and N constructed in this way
(i) False. I has many non-PSD square roots (e.g., are called the positive semidefinite part and
any diagonal matrix with ±1) entries on its negative semidefinite part of A, respectively.]
diagonal. (b) We know from Remark 1.B.1 that we can write
A as a linear combination of the two Hermitian
2.2.6 (a) x ≥ 0, since a diagonal matrix is PSD if and matrices A + A∗ and iA − iA∗ :
only if its diagonal entries are non-negative.
(c) x = 0. The matrix (which we will call A) is 1 1
A= (A + A∗ ) + (iA − iA∗ ).
clearly PSD if x = 0, and if x 6= 0 then we note 2 2i
from Exercise 2.2.11 that if A is PSD and has Applying the result of part (a) to these 2 Hermi-
a diagonal entry equal to 0 then every entry in tian matrices writes A as a linear combination
that row and column must equal 0 too. of 4 positive semidefinite matrices.
More explicitly, we can see that A is not (c) If F = R then every PSD matrix is symmetric,
PSD by letting v = (0, −1, v3 ) and computing and the set of symmetric matrices is a vector
v∗ Av = 1 − 2v3 x, which is negative as long as space. It follows that there is no way to write a
we choose v3 large enough. non-symmetric matrix as a (real) linear combi-
nation of (real) PSD matrices.
2.2.7 If A is PSD then we can write A = UDU ∗ , where U
is unitary and D is diagonal with non-negative real 2.2.19 (a) Since A is PSD, we know that a j, j = e∗j Ae j ≥ 0
diagonal entries. However, since A is also unitary, for all 1 ≤ j ≤ n. Adding up these non-negative
its eigenvalues (i.e., the diagonal entries of D) must diagonal entries shows that tr(A) ≥ 0.
lie on the unit circle in the complex plane. The only
non-negative real number on the unit circle is 1, so
D = I, so A = UDU ∗ = UU ∗ = I.
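As a numerical illustration of Exercise 2.2.11 above (an addition, not from the text), the snippet constructs a positive semidefinite matrix with a zero diagonal entry as a Gram matrix B^T B in which one column of B is zero, and confirms that the corresponding row and column vanish; the random B is just an example.

import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
B[:, 1] = 0                          # forces the (1, 1) diagonal entry of B^T B to be 0
A = B.T @ B                          # positive semidefinite with a zero diagonal entry
print(np.round(A[1, :], 12))         # entire row 1 is zero
print(np.round(A[:, 1], 12))         # entire column 1 is zero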
(b) Write A = D∗ D and B = E ∗ E (which we 2.2.23 To see that (a) =⇒ (b), let v be an eigenvector of
can do since A and B are PSD). Then A with corresponding eigenvalue λ . Then Av = λ v,
cyclic commutativity of the trace shows and multiplying this equation on the left by v∗ shows
that tr(AB) = tr(D∗ DE ∗ E) = tr(ED∗ DE ∗ ) = that v∗ Av = λ v∗ v = λ kvk2 . Since A is positive defi-
tr((DE ∗ )∗ (DE ∗ )). Since (DE ∗ )∗ (DE ∗ ) is nite, we know that v∗ Av > 0, so it follows that λ > 0
PSD, it follows from part (a) that this quan- too.
tity is non-negative. To see that (b) =⇒ (d), we just apply the spectral
(c) One example that works is decomposition theorem (either the complex Theo-
" # " # rem 2.1.4 or the real Theorem 2.1.6, as appropriate)
1 1 1 −2 to A.
A= , B= , √
1 1 −2 4 To see why (d) =⇒ (c), let D be the diagonal
" # matrix that is obtained by taking the (strictly posi-
4 −2 tive) square√root of the diagonal entries of D, and
C= .
−2 1 define B = DU ∗ . Then B is invertible since it is
the
√ product √of two invertible
√ ∗ √matrices, and B∗ B =
It is straightforward to verify that each of ( DU ) ( DU ) = U D DU = UDU ∗ = A.
∗ ∗ ∗ ∗
A, B, and C are positive semidefinite, but Finally, to see that (c) =⇒ (a), we let v ∈ Fn be any
tr(ABC) = −4. non-zero and we note that
2.2.20 (a) The “only if” direction is exactly Exer- v∗ Av = v∗ B∗ Bv = (Bv)∗ (Bv) = kBvk2 > 0,
cise 2.2.19(b). For the “if” direction, note that with the final inequality being strict because B is
if A is not positive semidefinite then it has a invertible so Bv 6= 0 whenever v 6= 0.
strictly negative eigenvalue λ < 0 with a cor-
responding eigenvector v. If we let B = vv∗ 2.2.24 In all parts of this question, we prove the statement
then tr(AB) = tr(Avv∗ ) = v∗ Av = v∗ (λ v) = for positive definiteness. For semidefiniteness, just
λ kvk2 < 0. make the inequalities not strict.
(b) In light of part (a), we just need to show (a) If A and B are self-adjoint then so is A + B, and
that if A is positive semidefinite but not pos- if v∗ Av > 0 and v∗ Bv > 0 for all v∈Fn then
itive definite, then there is a PSD B such v∗ (A + B)v=v∗ Av + v∗ Bv>0 for all v ∈ Fn too.
that tr(AB) = 0. To this end, we just let v (b) If A is self-adjoint then so is cA (recall c is real),
be an eigenvector of A corresponding to the and v∗ (cA)v = c(v∗ Av) > 0 for all v ∈ Fn when-
eigenvalue 0 of A and set B = vv∗ . Then ever c > 0 and A is positive definite.
tr(AB) = tr(Avv∗ ) = v∗ Av = v∗ (0v) = 0. (c) (AT )∗ = (A∗ )T , so AT is self-adjoint whenever
A is, and AT always has the same eigenvalues as
2.2.21 (a) The fact that c ≤ 0 follows simply from choos- A, so if A is positive definite (i.e., has positive
ing B = O. If A were not positive semidefinite eigenvalues) then so is AT .
then it has a strictly negative eigenvalue λ < 0
with a corresponding eigenvector v. If we let 2.2.25 (a) We use characterization (c) of positive semidef-
x ≥ 0 be a real number and B = I + xvv∗ then initeness from Theorem 2.2.1. If we let
{v1 , v2 , . . . , vn } be the columns of B (i.e., B =
tr(AB) = tr A(I + xvv∗ ) = tr(A) + xλ kvk2 , [ v1 | v2 | · · · | vn ]) then we have
which can be made arbitrarily large and neg- ∗
v1
ative (in particular, more negative than c) by v∗2
choosing x sufficiently large.
B∗ B = . [ v1 | v2 | · · · | vn ]
(b) To see that c ≤ 0, choose B = εI so that ..
tr(AB) = εtr(A), which tends to 0+ as ε → 0+ . v∗n
Positive semidefiniteness of A then follows via ∗
the same argument as in the proof of part (a) v1 v1 v∗1 v2 · · · v∗1 vn
∗
(notice that the matrix B = I + xvv∗ from that v2 v1 v∗2 v2 · · · v∗2 vn
proof is positive definite). = . .. .. .
. ..
. . . .
2.2.22 (a) Recall from Theorem A.1.2 that rank(A∗ A) =
v∗n v1 v∗n v2 ··· v∗n vn
rank(A). Furthermore, if A∗ A has spectral de-
composition A∗ A = UDU ∗ then rank(A∗ A) In particular, it follows that A = B∗ B (i.e., A
equals the number of non-zero diagonal en- is positive semidefinite) if and only if ai, j =
tries of D, which equals the v∗i v j = vi · v j for all 1 ≤ i, j ≤ n.
√ number of non-
zero diagonal entries (b) The set {v1 , v2 , . . . , vn } is linearly indepen-
√ of D, which equals
rank(|A|) = rank( A∗ A). dent if and only if the matrix B from our
p
(b) Just recall that kAkF = tr(A∗ A), so
|A|
F = proof of part (a) is invertible, if and only if
p p rank(B∗ B) = rank(B) = n, if and only if B∗ B
tr(|A|2 ) = tr(A∗ A) too.
2 is invertible, if and only if B∗ B is positive
(c) We compute
|A|v
= (|A|v) · (|A|v) = definite.
v∗ |A|2 v = v∗ A∗ Av = (Av) · (Av) = kAvk2 for
all v ∈ Fn . 2.2.28 (a) This follows from multilinearity of the deter-
minant: multiplying one of the columns of a
matrix multiplies its determinant by the same
amount.
(b) Since A∗ A is positive semidefinite, its eigenval- (d) Equality is attained if and only if the columns
ues λ1 , λ2 , . . . , λn are non-negative, so we can of A form a mutually orthogonal set. The
apply the AM–GM inequality to them to get reason for this is that equality is attained
!n in the AM–GM inequality if and only if
1 n
∗
det(A A) = λ1 λ2 · · · λn ≤ ∑ λj λ1 = λ2 = . . . = λn , which happens if and
n j=1 only if A∗ A is the identity matrix, so A must
n n be unitary in part (b) above. However, the
1 1
= tr(A∗ A) = kAk2F argument in part (b) relied on having already
n n
!n scaled A so that its columns have length 1—
1 n after “un-scaling” the columns, we see that any
= ∑ ka j k2 = 1n = 1.
n j=1 matrix with orthogonal columns also attains
equality.
(c) We just recall that the determinant is mul-
tiplicative, so det(A∗ A) = det(A∗ ) det(A) =
| det(A)|2 , so | det(A)|2 ≤ 1, so | det(A)| ≤ 1.
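A quick numerical check of Exercise 2.2.28 (an addition, not from the text): after rescaling the columns of a random matrix to have norm 1, its determinant has absolute value at most 1, with equality for a matrix with orthonormal columns; the random inputs are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
A /= np.linalg.norm(A, axis=0)                       # every column now has norm 1
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))     # orthonormal columns
print(abs(np.linalg.det(A)), abs(np.linalg.det(Q)))  # first is < 1, second is 1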
2.3.12 (a) If A = UΣV ∗ is an SVD (with the diag- 2.3.17 (a) Suppose A has SVD A = UΣV ∗ . If we let B =
onal entries of Σ being σ1 ≥ σ2 ≥ · · · ) UV ∗ then B is unitary and thus has kUV ∗ k = 1
then kABk2F = tr((UΣV ∗ B)∗ (UΣV ∗ B)) = (by Exercise 2.3.5, for example), so the given
tr(B∗V Σ∗ ΣV ∗ B) = tr(Σ∗ ΣV ∗ BB∗V ) ≤ maximization is at least as large as hA, Bi =
σ12 tr(V ∗ BB∗V ) = σ12 tr(BB∗ ) = σ12 kBk2F = tr(V ΣU ∗UV ∗ ) = tr(Σ) = kAktr .
kAk2 kBk2F , where the inequality in the middle To show the opposite inequality, note that
comes from the fact that multiplying V ∗ BB∗V if B is any matrix for which kBk ≤ 1
on the left by Σ∗ Σ multiplies its j-th diagonal then |hA,∗Bi| = |hUΣV , Bi| = |hΣ,U ∗ BV i| =
∗
To see that A = O (and thus complete the proof), (c) To see that BR is real we note that symme-
note that P2 = P implies T 2 = T , which in turn try of B ensures that the (i, j)-entry of BR is
implies A2 = A, so Ak = A for all k ≥ 1. However, (bi, j +bi, j ))/2 = Re(bi, j ) (a similar calculation
the diagonal of A consists entirely of zeros, so shows that the (i, j)-entry of BI is Im(bi, j )).
the first diagonal above the main diagonal in A2 Since BR and BI are clearly Hermitian, they
consists of zeros, the diagonal above that one must be symmetric. To see that they commute,
in A3 consists of zeros, and so on. Since these we compute
powers of A all equal A itself, we conclude that
A = O. B∗ B = (BR + iBI )∗ (BR + iBI )
= B2R + B2I + i(BR BI − BI BR ).
2.3.26 (a) One simple example is
" # Since B∗ B, BR , and BI are all real, this implies
0 1+i BR BI − BI BR = O, so BR and BI commute.
A= , (d) Since BR and BI are real symmetric and com-
1+i 1
mute, we know by (the “Side note” underneath
which has of) Exercise 2.1.28 that there exists a unitary
" # matrix W ∈ Mn (R) such that each of W T BRW
0 −2i and W T BIW are diagonal. Since B = BR + iBI ,
A∗ A − AA∗ = ,
2i 0 we conclude that
so A is complex symmetric but not normal. W T BW = W T (BR + iBI )W
(b) Since A∗ A is positive semidefinite, its spectral
= (W T BRW ) + (W T BIW )
decomposition has the form A∗ A = V DV ∗ for
some real diagonal D and unitary V . Then is diagonal too.
∗ T ∗ T ∗ ∗ (e) By part (d), we know that if U = VW then
B B = (V AV ) (V AV ) = V A AV = D,
U T AU = W T (V T AV )W = W T BW
so B is real (and diagonal and entrywise non-
negative, but we do not need those properties). is diagonal. It does not necessarily have non-
Furthermore, B is complex symmetric (for all negative (or even real) entries on its diagonal,
unitary matrices V ) since but this can be fixed by multiplying U on
the right by a suitable diagonal unitary matrix,
BT = (V T AV )T = V T AT V = V T AV = B.
which can be used to adjust the complex phases
of the diagonal matrix as we like.
" # " #
(a) 2 2 (c) 1 0 2.4.5 (a) False. Any matrix with a Jordan block of size
2 1 −2 −1 2 × 2 or larger (i.e., almost any matrix from
this section) serves as a counter-example.
(e) 0 1 1 (g) 1 −2 −1
(c) False. The Jordan canonical forms J1 and J2
2 1 2 1 1 0 are only unique up to re-ordering of their Jor-
1 1 2 2 −1 0 dan blocks (so we can shift the diagonal blocks
of J1 around to get J2 ).
(e) True. If A and B are diagonalizable then their
2.4.3 (a) Not similar, since their traces are not the same Jordan canonical forms are diagonal and so
(8 and 10). Theorem 2.4.3 tells us they are similar if those
(c) Not similar, since their determinants are not 1 × 1 Jordan blocks (i.e., their eigenvalues) co-
the same (0 and −2). incide.
(g) True. Since the function f (x) = ex is analytic 2.4.14 Since sin2 (x) + cos2 (x) = 1 for all x ∈ C, the func-
on all of C, Theorems 2.4.6 and 2.4.7 tell tion f (x) = sin2 (x) + cos2 (x) is certainly analytic (it
us that this sum converges for all A (and fur- is constant). Furthermore, f 0 (x) = 0 for all x ∈ C
thermore we can compute it via the Jordan and more generally f (k) (x) = 0 for all x ∈ C and in-
decomposition). tegers k ≥ 1. It follows from Theorem 2.4.6 that
f Jk (λ ) = I for all Jordan blocks Jk (λ ), and
2.4.8 We can choose C to be the standard basis of R2 Theorem 2.4.7 then tells us that f (A) = I for all
and T (v) = Av. Then it is straightforward to check A ∈ Mn (C).
that [T ]C = A. All that is left to do is find a basis
D such that [T ]D = B. To do so, we find a Jordan 2.4.16 (a) This follows directly from the definition of ma-
decomposition of A and B, which are actually diag- trix multiplication:
onalizations
" in#this case. In
" particular,
# if we define k
1 2 1 1 A2 = ∑ ai,` a`, j ,
P1 = and P2 = then i, j
−1 5 −1 −2 `=1
2.5.4 For the “only if” direction, use the spectral decomposition to write A = UDU*, so A* = UD*U* = U conj(D) U*. It follows that A and A* have the same eigenspaces and the corresponding eigenvalues are just complex conjugates of each other, as claimed.

2.5.7 (a) tr(A) is the sum of the diagonal entries of A, each of which is 0 or 1, so tr(A) ≤ 1 + 1 + · · · + 1 = n.
(b) The fact that det(A) is an integer follows from formulas like Theorem A.1.4 and the fact that each entry of A is an integer. Since det(A) > 0 (by positive definiteness of A), it follows that det(A) ≥ 1.
(c) The AM–GM inequality tells us that det(A)^{1/n} = (λ_1 · · · λ_n)^{1/n} ≤ (λ_1 + · · · + λ_n)/n = tr(A)/n. Since tr(A) ≤ n by part (a), we conclude that det(A) ≤ 1.
(d) Parts (b) and (c) tell us that det(A) = 1, so equality holds in the AM–GM inequality in part (c). By the equality condition of the AM–GM inequality, we conclude that λ_1 = . . . = λ_n, so they all equal 1 and thus A has spectral decomposition A = UDU* with D = I, so A = UIU* = UU* = I.
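As a numerical check of the AM–GM step used in part (c) of Exercise 2.5.7 (an addition, not from the text), the snippet verifies det(A)^{1/n} ≤ tr(A)/n for a randomly generated positive semidefinite matrix:

import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = B.T @ B                                           # positive semidefinite
n = A.shape[0]
print(np.linalg.det(A) ** (1 / n), np.trace(A) / n)   # the first value never exceeds the second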
2.B.11 We need to transform det(AB − λI_m) into a form that lets us use Sylvester's determinant identity (Exercise 2.B.9). Well,
    det(AB − λI_m) = (−1)^m det(λI_m − AB)
                   = (−λ)^m det(I_m + (−A/λ)B)
                   = (−λ)^m det(I_n + B(−A/λ))
                   = ((−λ)^m / λ^n) det(λI_n − BA)
                   = (−λ)^{m−n} det(BA − λI_n),
which is what we wanted to show.

2.B.12 In the 1 × 1 case, it is clear that if A = [a] then the Cholesky decomposition A = [√a]*[√a] is unique. For the inductive step, we assume that the Cholesky decompositions of (n − 1) × (n − 1) matrices are unique. If a_{1,1} = 0 (i.e., Case 2 of the proof of Theorem 2.B.5) then solving
    A = [ 0    0^T    ] = [ x | B ]* [ x | B ]
        [ 0  A_{2,2}  ]
for x and B reveals that ‖x‖² = x*x = 0, so x = 0, so the only way for this to be a Cholesky decomposition of A is if B = T, where A_{2,2} = T*T is a Cholesky decomposition of A_{2,2} (which is unique by the inductive hypothesis).
On the other hand, if a_{1,1} ≠ 0 (i.e., Case 1 of the proof of Theorem 2.B.5) then the Schur complement S has unique Cholesky decomposition S = T*T, so we know that the Cholesky decomposition of A must be of the form
    A = [ √a_{1,1}  x* ]* [ √a_{1,1}  x* ] = [ a_{1,1}   a_{2,1}* ]
        [    0      T  ]  [    0      T  ]   [ a_{2,1}   A_{2,2}  ],
where x ∈ F^{n−1} is some unknown column vector. Just performing the matrix multiplication reveals that it must be the case that x = a_{2,1}/√a_{1,1}, so the Cholesky decomposition of A is unique in this case as well.
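The following numerical check of the identity in Exercise 2.B.11 above is an addition (not from the text); the sizes m = 4, n = 2, the value of λ, and the random A, B are arbitrary choices.

import numpy as np

rng = np.random.default_rng(5)
m, n, lam = 4, 2, 1.7
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))
lhs = np.linalg.det(A @ B - lam * np.eye(m))
rhs = (-lam) ** (m - n) * np.linalg.det(B @ A - lam * np.eye(n))
print(lhs, rhs)                                 # equal up to floating-point error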
3.1.15 (a) Just notice that PT = ∑σ ∈S p WσT /p! = 3.1.21 (a) If A−1 and B−1 exist then we just use Theo-
∑σ ∈S p Wσ −1 /p! = ∑σ ∈S p Wσ /p! = P, with the rem 3.1.2(a) to see that (A ⊗ B)(A−1 ⊗ B−1 ) =
second equality following from the fact that (AA−1 ) ⊗ (BB−1 ) = I ⊗ I = I, so (A ⊗ B)−1 =
Wσ is unitary so WσT = Wσ−1 = Wσ −1 , and the A−1 ⊗ B−1 . If (A ⊗ B)−1 exists then we can use
third equality coming from the fact that chang- Theorem 3.1.3(a) to see that A and B have no
ing the order in which we sum over S p does zero eigenvalues (otherwise A ⊗ B would too),
not change the sum itself. so A−1 and B−1 exist.
(b) We compute (b) The ( j, i)-block of (A ⊗ B)T is the (i, j)-block
of A ⊗ BT , which equals ai, j BT , which is also
P2 = ∑ Wσ /p! ∑ Wτ /p! the ( j, i)-block of AT ⊗ BT .
σ ∈S p τ∈S p
(c) Just apply part (b) above (part (c) of the the-
= ∑ Wσ ◦τ /(p!)2 orem) to the complex conjugated matrices A
σ ,τ∈S p and B.
= ∑ p!Wσ /(p!)2 = P,
σ ∈S p 3.1.22 (a) If A and B are upper triangular then the (i, j)-
block of A ⊗ B is ai, j B. If i > j then this entire
with the third equality following from the fact block equals O since A is upper triangular. If
that summing Wσ ◦τ over all σ and all τ in S p i = j then this block is upper triangular since
just sums each Wσ a total of |S p | = p! times B is. If A and B are lower triangular then just
(once for each value of τ ∈ S p ). apply this result to AT and BT .
(b) Use part (a) and the fact that diagonal matrices
3.1.16 By looking at the blocks in the equation ∑kj=1 v j ⊗ are both upper lower triangular.
w j = 0, we see that ∑kj=1 [v j ]i w j = 0 for all 1 ≤ i ≤ m. (c) If A and B are normal then (A ⊗ B)∗ (A ⊗
By linear independence, this implies [v j ]i = 0 for B) = (A∗ A) ⊗ (B∗ B) = (AA∗ ) ⊗ (BB∗ ) = (A ⊗
each i and j, so v j = 0 for each j. B)(A ⊗ B)∗ .
(d) If A∗ A = B∗ B = I then (A ⊗ B)∗ (A ⊗ B) =
3.1.17 Since B ⊗ C consists of mn vectors, it suffices via (A∗ A) ⊗ (B∗ B) = I ⊗ I = I.
Exercise 1.2.27(a) to show that it is linearly inde- (e) If AT = A and BT = B then (A ⊗ B)T = AT ⊗
pendent. To this end, write B = {v1 , . . . , vm } and BT = A ⊗ B. Similarly, if A∗ = A and B∗ = B
C = {w1 , . . . , wn } and suppose then (A ⊗ B)∗ = A∗ ⊗ B∗ = A ⊗ B.
m n n
!
m
(f) If the eigenvalues of A and B are non-negative
then so are the eigenvalues of A ⊗ B, by Theo-
∑ ∑ ci, j vi ⊗ w j = ∑ ∑ ci, j vi ⊗ w j = 0.
rem 3.1.3.
i=1 j=1 j=1 i=1
We then know from Exercise 3.1.16 that (since C 3.1.23 In all parts of this exercise, we use the fact that if
is linearly independent) ∑m i=1 ci, j vi = 0 for each A = U1 Σ1V1∗ and B = U2 Σ2V2∗ are singular value
1 ≤ j ≤ n. By linear independence of B, this implies decompositions then so is A ⊗ B = (U1 ⊗U2 )(Σ1 ⊗
ci, j = 0 for each i and j, so B ⊗C is linearly indepen- Σ2 )(V1 ⊗V2 )∗ .
dent too. (a) If σ is a diagonal entry of Σ1 and τ is a diagonal
entry of Σ2 then σ τ is a diagonal entry of Σ1 ⊗ Σ2 .
3.1.18 (a) This follows almost immediately from Defini- (b) rank(A ⊗ B) equals the number of non-zero en-
tion 3.1.3(b), which tells us that the columns tries of Σ1 ⊗ Σ2 , which equals the product of the
of Wm,n are (in some order) {e j ⊗ ei }i, j , which number of non-zero entries of Σ1 and Σ2 (i.e.,
are the standard basis vectors in Fmn . In other rank(A)rank(B)).
words, Wm,n is obtained from the identity ma- (c) If Av ∈ range(A) and Bw ∈ range(B) then (Av) ⊗
trix by permuting its columns in some way. (Bw) = (A ⊗ B)(v ⊗ w) ∈ range(A ⊗ B). Since
(b) Since the columns of Wm,n are standard basis range(A ⊗ B) is a subspace, this shows that “⊇”
vectors, the dot products of its columns with inclusion. For the opposite inclusion, recall from
each other equal 1 (the dot product of a column Theorem 2.3.2 that range(A ⊗ B) is spanned by the
with itself) or 0 (the dot product of a column rank(A ⊗ B) columns of U1 ⊗U2 that correspond
with another column). This means exactly that to non-zero diagonal entries of Σ1 ⊗ Σ2 , which are
m,n = I, so Wm,n is unitary.
∗ W
Wm,n all of the form u1 ⊗ u2 for some u1 range(A) and
(c) This follows immediately from Defini- u2 ∈ range(B).
tion 3.1.3(c) and the fact that Ei,∗ j = E j,i . (d) Similar to part (c) (i.e., part (d) of the Theorem),
but using columns of V1 ⊗V2 instead of U1 ⊗U2 .
3.1.20 (a) The (i, j)-block of A ⊗ B is ai, j B, so the (k, `)- (e) kA ⊗ Bk = kAkkBk since we know from part (a)
block within the (i, j)-block of (A ⊗ B) ⊗C is of this theorem that the largest singular value of
ai, j bk,`C. On the other hand, the (k, `)-block of A ⊗ B is the product of the largest singular values
B ⊗ C is bk,`C, so the (k, `)-block within the
(i, j)-block of A ⊗ (B ⊗C) is ai, j bk,`C. p F = kAkF kBkF , just
of A and B. To see that kA⊗Bk
note that kA ⊗ BkF = tr((A ⊗ B)∗ (A ⊗ B)) =
(b) The (i, j)-block of (A + B) ⊗C is (ai, j + bi, j )C, p p p
tr((A∗ A) ⊗ (B∗ B)) = tr(A∗ A) tr(B∗ B) =
which equals the (i, j)-block of A ⊗C + B ⊗C,
kAkF kBkF .
which is ai, j C + bi, j C.
(c) The (i, j)-block of these matrices equal
(cai, j )B, ai, j (cB), c(ai, j B), respectively, which
are all equal.
3.1.24 For property (a), let P = (1/p!) ∑_{σ∈S_p} sgn(σ)W_σ and note that P^2 = P = P^T via an argument almost identical to that of Exercise 3.1.15. It thus suffices to show that range(P) = A_n^p. To this end, notice that for all τ ∈ S_p we have

    sgn(τ)W_τ P = (1/p!) ∑_{σ∈S_p} sgn(τ)sgn(σ) W_τ W_σ = (1/p!) ∑_{σ∈S_p} sgn(τ∘σ) W_{τ∘σ} = P.

It follows that everything in range(P) is unchanged by sgn(τ)W_τ (for all τ ∈ S_p), so range(P) ⊆ A_n^p. To prove the opposite inclusion, we just notice that if v ∈ A_n^p then

    Pv = (1/p!) ∑_{σ∈S_p} sgn(σ)W_σ v = (1/p!) ∑_{σ∈S_p} v = v,

so v ∈ range(P) and thus A_n^p ⊆ range(P), so P is a projection onto A_n^p as claimed.
To prove property (c), we first notice that the columns of the projection P from part (a) have the form

    P(e_{j_1} ⊗ e_{j_2} ⊗ ··· ⊗ e_{j_p}) = (1/p!) ∑_{σ∈S_p} sgn(σ)W_σ(e_{j_1} ⊗ e_{j_2} ⊗ ··· ⊗ e_{j_p}),

where 1 ≤ j_1, j_2, . . ., j_p ≤ n. To turn this set of vectors into a basis of range(P) = A_n^p, we omit the columns that are equal to each other or equal to 0 by only considering the columns for which 1 ≤ j_1 < j_2 < ··· < j_p ≤ n. If F = R or F = C (so we have an inner product to work with) then these remaining vectors are mutually orthogonal and thus form an orthogonal basis of range(P), and otherwise they are linearly independent (and thus form a basis of range(P)) since the coordinates of their non-zero entries in the standard basis form disjoint subsets of {1, 2, . . ., n^p}. If we multiply each of these vectors by p! then they form the basis described in the statement of the theorem.
To demonstrate property (b), we simply notice that the basis from part (c) of the theorem contains as many vectors as there are sets of p distinct numbers {j_1, j_2, . . ., j_p} ⊆ {1, 2, . . ., n}, which equals the binomial coefficient (n choose p).

3.1.26 (a) We can scale v and w freely, since if we replace v and w by cv and dw (where c, d > 0) then the inequality becomes

    cd|v · w| = |(cv) · (dw)| ≤ ‖cv‖_p ‖dw‖_q = cd‖v‖_p ‖w‖_q,

which is equivalent to the original inequality that we want to prove. We thus choose c = 1/‖v‖_p and d = 1/‖w‖_q.
(b) We simply notice that we can rearrange the condition 1/p + 1/q = 1 to the form 1/(q − 1) = p − 1. Then dividing the left inequality above by |v_j| shows that it is equivalent to |w_j| ≤ |v_j|^{p−1}, whereas dividing the right inequality by |w_j| and then raising it to the power 1/(q − 1) = p − 1 shows that it is equivalent to |v_j|^{p−1} ≤ |w_j|. It is thus clear that at least one of these two inequalities must hold (they simply point in opposite directions of each other), as claimed.
As a minor technicality, we should note that the above argument only works if p, q > 1 and each v_j, w_j ≠ 0. If p = 1 then |v_j w_j| ≤ |v_j|^p (since ‖w‖_q = 1, so |w_j| ≤ 1). If q = 1 then |v_j w_j| ≤ |w_j|^q. If v_j = 0 or w_j = 0 then both inequalities hold trivially.
(c) Part (b) implies that |v_j w_j| ≤ |v_j|^p + |w_j|^q for each 1 ≤ j ≤ n, so

    |v · w| = |∑_{j=1}^n v_j w_j| ≤ ∑_{j=1}^n |v_j w_j| ≤ ∑_{j=1}^n |v_j|^p + ∑_{j=1}^n |w_j|^q = ‖v‖_p^p + ‖w‖_q^q = 2.

(d) Notice that the inequality above does not depend on the dimension n at all. It follows that if we pick a positive integer k, replacing v and w by v^{⊗k} and w^{⊗k} respectively shows that

    2 ≥ |v^{⊗k} · w^{⊗k}| = |v · w|^k,

with the final equality coming from Exercise 3.1.25. Taking the k-th root of this inequality shows us that |v · w| ≤ 2^{1/k} for all integers k ≥ 1. Since lim_{k→∞} 2^{1/k} = 1, it follows that |v · w| ≤ 1, which completes the proof.
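The projection in Exercise 3.1.24 is easy to build explicitly for small n and p. The sketch below is not part of the original solutions; it assumes NumPy, the helper names are ad hoc, and it constructs P = (1/p!) ∑_σ sgn(σ)W_σ directly from permutation matrices before checking that P is a projection of rank (n choose p).

```python
# Build the antisymmetrizer P = (1/p!) * sum_sigma sgn(sigma) W_sigma on (R^n)^{(x)p}
# and verify that it is an orthogonal projection of rank C(n, p).  (Illustrative sketch.)
import itertools, math
import numpy as np

def perm_sign(sigma):
    # Sign of a permutation (given as a tuple) via inversion counting.
    sign = 1
    for i in range(len(sigma)):
        for j in range(i + 1, len(sigma)):
            if sigma[i] > sigma[j]:
                sign = -sign
    return sign

def antisymmetrizer(n, p):
    dim = n ** p
    P = np.zeros((dim, dim))
    for sigma in itertools.permutations(range(p)):
        W = np.zeros((dim, dim))              # W_sigma permutes the p tensor factors
        for idx in itertools.product(range(n), repeat=p):
            src = sum(idx[k] * n ** (p - 1 - k) for k in range(p))
            permuted = tuple(idx[sigma[k]] for k in range(p))
            dst = sum(permuted[k] * n ** (p - 1 - k) for k in range(p))
            W[dst, src] = 1.0
        P += perm_sign(sigma) * W
    return P / math.factorial(p)

n, p = 4, 2
P = antisymmetrizer(n, p)
assert np.allclose(P @ P, P) and np.allclose(P, P.T)   # P^2 = P = P^T
print(np.linalg.matrix_rank(P), math.comb(n, p))       # both print 6 = C(4, 2)
```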
The bound rank(T_×) ≥ mp comes from noting that the standard block matrix A of T_× (with respect to the standard basis) is an mp × mn²p matrix with full rank mp (it is straightforward to check that there are standard block matrices that multiply to E_{i,j} for each i, j). It follows that rank(T_×) ≥ rank(A) = mp.

3.2.11 (a) By definition, rank(f) is the least r such that we can write f(v, w) = ∑_{j=1}^r f_j(v)g_j(w) for some linear forms {f_j} and {g_j}. If we represent each of these linear forms as row vectors via f_j(v) = x_j^T[v]_B and g_j(w) = y_j^T[w]_C then we can write this sum in the form f(v, w) = [v]_B^T (∑_{j=1}^r x_j y_j^T)[w]_C. Since f(v, w) = [v]_B^T A[w]_C, Theorem A.1.3 tells us that r = rank(A) as well.

…∑_{j=1}^5 f_{1,j}(v) f_{2,j}(w) x_j
    = v_1(w_2 + w_3)(e_1 + e_3) − (v_1 + v_3)w_2 e_1 − v_2(w_1 + w_3)(e_2 + e_3) + (v_2 + v_3)w_1 e_2 + (v_2 − v_1)w_3(e_1 + e_2 + e_3)
    = (v_1(w_2 + w_3) − (v_1 + v_3)w_2 + (v_2 − v_1)w_3)e_1 + (−v_2(w_1 + w_3) + (v_2 + v_3)w_1 + (v_2 − v_1)w_3)e_2 + (v_1(w_2 + w_3) − v_2(w_1 + w_3) + (v_2 − v_1)w_3)e_3
    = (v_2w_3 − v_3w_2)e_1 + (v_3w_1 − v_1w_3)e_2 + (v_1w_2 − v_2w_1)e_3 = C(v, w).

3.2.16 Recall from Exercise 3.2.13 that if C : R³ × R³ → R³ is the cross product then rank(C) = 5. An optimal rank sum decomposition of the standard block matrix of C thus makes use of sets consisting of 5 vectors in R³, which cannot possibly be linearly independent. A similar argument works with the matrix multiplication transformation T_× : M_2 × M_2 → M_2, which has rank 7 and vectors living in a 4-dimensional space.
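The five-term decomposition displayed above can be verified mechanically. The following check is not part of the original solutions and assumes NumPy; the vectors are chosen at random purely for illustration.

```python
# Check (illustrative only): the five-term decomposition above reproduces the
# cross product C(v, w) = v x w for randomly chosen vectors.
import numpy as np

rng = np.random.default_rng(1)
v, w = rng.standard_normal(3), rng.standard_normal(3)
e1, e2, e3 = np.eye(3)

decomposition = (v[0] * (w[1] + w[2]) * (e1 + e3)
                 - (v[0] + v[2]) * w[1] * e1
                 - v[1] * (w[0] + w[2]) * (e2 + e3)
                 + (v[1] + v[2]) * w[0] * e2
                 + (v[1] - v[0]) * w[2] * (e1 + e2 + e3))

assert np.allclose(decomposition, np.cross(v, w))
print("the five-term decomposition equals the cross product")
```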
…which has determinant −a² − b², and thus rank 2, whenever (a, b) ≠ (0, 0). It follows that S_v does not have a basis consisting of elementary tensors, so Theorem 3.3.6 tells us that rank(v) > dim(S_v) = 2, so rank(v) = 3.
This argument breaks down if we work over the complex numbers, since the above matrix can be made rank 1 by choosing a = 1, b = i, for example, and S_v is spanned by the elementary tensors (1, i, i, −1) and (1, −i, −i, −1).

3.3.3 (a) True. This is simply a consequence of C being 1-dimensional, so dim(C ⊗ C) = dim(C) dim(C) = 1. Since all vector spaces of the same (finite) dimension over the same field are isomorphic, we conclude that C ⊗ C ≅ C.
(c) True. This follows from the fact that rank(v) = rank(mat(v)), where mat(v) ∈ M_{m,n}(F).
(e) True. This follows from every real tensor rank decomposition being a complex one, since R ⊆ C.

3.3.4 If there were two such linear transformations S_1, S_2 : V ⊗ W → X then we would have (S_1 − S_2)(v ⊗ w) = 0 for all v ∈ V and w ∈ W. Since V ⊗ W is spanned by elementary tensors, it follows that (S_1 − S_2)(x) = 0 for all x ∈ V ⊗ W, so S_1 − S_2 is the zero linear transformation and S_1 = S_2.

3.3.5 For property (a), we just note that it is straightforward to see that every function of the form f(x) = Ax² + Bx + C can be written as a linear combination of elementary tensors—Ax², Bx, and C are themselves elementary tensors. For (b) and (c), we com-

…We know from Exercise 1.1.22 that the set of functions {e^x, e^{2x}, . . ., e^{kx}, e^{(k+1)x}} is linearly independent. However, this contradicts the above linear system, which says that {e^x, e^{2x}, . . ., e^{kx}, e^{(k+1)x}} is contained within the (k-dimensional) span of the functions f_1, . . ., f_k.

3.3.10 Let T : V × W → W ⊗ V be the bilinear transformation defined by T(v, w) = w ⊗ v. By the universal property, there exists a linear transformation S : V ⊗ W → W ⊗ V with S(v ⊗ w) = w ⊗ v. A similar argument shows that there exists a linear transformation that sends w ⊗ v to v ⊗ w. Since elementary tensors span the entire tensor product space, it follows that S is invertible and thus an isomorphism.

3.3.12 Suppose

    v = lim_{k→∞} v_k^(1) ⊗ v_k^(2) ⊗ ··· ⊗ v_k^(p),

where we absorb scalars among these Kronecker factors so that ‖v_k^(j)‖ = 1 for all k and all j ≥ 2. The length of a vector (i.e., its norm induced by the dot product) is continuous in its entries, so

    ‖v‖ = lim_{k→∞} ‖v_k^(1) ⊗ v_k^(2) ⊗ ··· ⊗ v_k^(p)‖ = lim_{k→∞} ‖v_k^(1)‖.

In particular, this implies that the vectors {v_k^(j)} have bounded norm, so the Bolzano–Weierstrass theorem tells us that there is a sequence k_1, k_2, k_3, . . . with the property that lim_{ℓ→∞} v_{k_ℓ}^(1)
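The matricization fact used in 3.3.3(c) above is easy to see in coordinates. The snippet below is an illustrative sketch, not part of the original solutions; it assumes NumPy and identifies F^m ⊗ F^n with F^{mn} via the Kronecker product, so that matricization is just a reshape.

```python
# Illustration (not from the text): the tensor rank of v in R^m (x) R^n can be read
# off as the ordinary matrix rank of its matricization mat(v).
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
a1, b1, a2, b2 = (rng.standard_normal(k) for k in (m, n, m, n))

v = np.kron(a1, b1) + np.kron(a2, b2)   # a sum of two elementary tensors in R^{mn}
mat_v = v.reshape(m, n)                 # matricization: mat(a (x) b) = a b^T
print(np.linalg.matrix_rank(mat_v))     # prints 2 for generic choices of a_i, b_i
```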
3.A.2 We mimic the proof of Theorem 3.A.6: we construct C_Ψ and then let the A_i's be the matricizations of its scaled eigenvectors. We recall from Example 3.A.8 that C_Ψ = I − W_{3,3}, and applying the spectral decomposition to this matrix shows that C_Ψ = ∑_{i=1}^3 v_i v_i^T, where

    v_1 = (0, 1, 0, −1, 0, 0, 0, 0, 0),
    v_2 = (0, 0, 1, 0, 0, 0, −1, 0, 0), and
    v_3 = (0, 0, 0, 0, 0, 1, 0, −1, 0).

It follows that Ψ(X) = ∑_{i=1}^3 A_i X A_i^T, where A_i = mat(v_i) for all i, so

    A_1 = [  0  1  0 ]      A_2 = [  0  0  1 ]      A_3 = [  0  0  0 ]
          [ −1  0  0 ],           [  0  0  0 ],           [  0  0  1 ].
          [  0  0  0 ]            [ −1  0  0 ]            [  0 −1  0 ]

3.A.4 Direct computation shows that (Φ_C ⊗ I_2)(xx^*) equals

    [  2  0 −1  0  0 −1 ]
    [  0  0  0  0  0  0 ]
    [ −1  0  1  0  0 −1 ]
    [  0  0  0  1  0  0 ]
    [  0  0  0  0  1  0 ]
    [ −1  0 −1  0  0  1 ],

which is not positive semidefinite (it has 1 − √3 ≈ −0.7321 as an eigenvalue), so Φ_C is not 2-positive.

3.A.5 Recall that ‖X‖_F = ‖vec(X)‖ for all matrices X, so ‖Φ(X)‖_F = ‖X‖_F for all X if and only if ‖[Φ]vec(X)‖ = ‖vec(X)‖ for all vec(X). It follows from Theorem 1.4.9 that this is equivalent to [Φ] being unitary.

3.A.7 By Exercise 2.2.20, Φ is positive if and only if tr(Φ(X)Y) ≥ 0 for all positive semidefinite X and Y. Using the definition of the adjoint Φ^* shows that this is equivalent to tr(XΦ^*(Y)) ≥ 0 for all PSD X and Y, which (also via Exercise 2.2.20) is equivalent to Φ^* being positive.

3.A.12 (a) If Φ is transpose-preserving and X is symmetric then Φ(X)^T = Φ(X^T) = Φ(X).
(b) For example, consider the linear map Φ : M_2 → M_2 defined by

    Φ( [ a  b ] ) = [     a       b ]
       [ c  d ]     [ (b + c)/2   d ] .

If the input is symmetric (i.e., b = c) then so is the output, but Φ(E_{1,2}) ≠ Φ(E_{2,1})^T, so Φ is not transpose-preserving.

3.A.13 (a) Recall from Exercise 2.2.16 that every self-adjoint A ∈ M_n can be written as a linear combination of two PSD matrices: A = P − N. Then Φ(A) = Φ(P) − Φ(N), and since P and N are PSD, so are Φ(P) and Φ(N), so Φ(A) is also self-adjoint (this argument works even if F = R). If F = C then this means that Φ is Hermiticity-preserving, so it follows from Theorem 3.A.2 that Φ is also adjoint-preserving.
(b) The same map from the solution to Exercise 3.A.12(b) works here. It sends PSD matrices to PSD matrices (in fact, it acts as the identity map on all symmetric matrices), but is not transpose-preserving.

3.A.14 We know from Theorem 3.A.1 that T ∘ Φ = Φ ∘ T (i.e., Φ is transpose-preserving) if and only if C_Φ = C_Φ^T. The additional requirement that Φ = Φ ∘ T tells us (in light of Exercise 3.A.10) that C_Φ = (I ⊗ T)(C_Φ) = Γ(C_Φ). As a side note, observe that this condition is also equivalent to C_Φ = C_Φ^T = Γ(C_Φ), since Γ(C_Φ) = Γ(C_Φ^T).

3.A.15 If Φ is bisymmetric and decomposable then Φ = T ∘ Φ = Φ ∘ T = Ψ_1 + T ∘ Ψ_2 for some completely positive Ψ_1, Ψ_2. Then

    2Φ = Φ + (T ∘ Φ) = (Ψ_1 + Ψ_2) + T ∘ (Ψ_1 + Ψ_2),
so we can choose Ψ = (Ψ_1 + Ψ_2)/2. Conversely, if Φ = Ψ + T ∘ Ψ for some completely positive Ψ then Φ is clearly decomposable, and it is bisymmetric because

    T ∘ Φ = T ∘ (Ψ + T ∘ Ψ) = T ∘ Ψ + Ψ = Φ,

and

    Φ ∘ T = (Ψ + T ∘ Ψ) ∘ T = Ψ ∘ T + T ∘ Ψ ∘ T = T ∘ (T ∘ Ψ ∘ T) + (T ∘ Ψ ∘ T) = T ∘ Ψ + Ψ = Φ,

where we used the fact that T ∘ Ψ ∘ T = Ψ thanks to Ψ being real and CP (and thus transpose-preserving).

3.A.16 Recall that {E_{i,j} ⊗ E_{k,ℓ}} is a basis of M_{mp,nq}, and Φ ⊗ Ψ must satisfy

    (Φ ⊗ Ψ)(E_{i,j} ⊗ E_{k,ℓ}) = Φ(E_{i,j}) ⊗ Ψ(E_{k,ℓ})

for all i, j, k, and ℓ. Since linear maps are determined by how they act on a basis, this shows that Φ ⊗ Ψ, if it exists, is unique. Well-definedness (i.e., existence of Φ ⊗ Ψ) comes from reversing this argument—if we define Φ ⊗ Ψ by (Φ ⊗ Ψ)(E_{i,j} ⊗ E_{k,ℓ}) = Φ(E_{i,j}) ⊗ Ψ(E_{k,ℓ}) then linearity shows that Φ ⊗ Ψ also satisfies (Φ ⊗ Ψ)(A ⊗ B) = Φ(A) ⊗ Ψ(B) for all A and B.

3.A.18 If X ∈ M_n ⊗ M_k is positive semidefinite then we can construct a PSD matrix X̃ ∈ M_n ⊗ M_{k+1} simply by padding it with n rows and columns of zeros without affecting positive semidefiniteness. Then (Φ ⊗ I_k)(X) = (Φ ⊗ I_{k+1})(X̃) is positive semidefinite by (k + 1)-positivity of Φ, and since X was arbitrary this shows that Φ is k-positive.

3.A.19 We need to show that Φ(X) is positive semidefinite whenever X is positive semidefinite. To this end, notice that every PSD matrix X can be written as a limit of positive definite matrices X_1, X_2, X_3, . . .: X = lim_{k→∞} X_k, where X_k = X + I/k. Linearity (and thus continuity) of Φ then tells us that

    Φ(X) = Φ(lim_{k→∞} X_k) = lim_{k→∞} Φ(X_k),

which is positive semidefinite since each Φ(X_k) is positive definite (recall that the coefficients of characteristic polynomials are continuous, so the roots of a characteristic polynomial cannot jump from positive to negative in a limit).

3.A.22 (a) If X is Hermitian (so c = b̄, and a and d are real) then the determinants of the top-left 1 × 1, 2 × 2, and 3 × 3 blocks of Φ(X) are

    4a − 4Re(b) + 3d,    4a² − 4|b|² + 6ad,    and    8a²d + 12ad² − 4a|b|² + 4Re(b)|b|² − 11d|b|²,

respectively. If X is positive definite then |Re(b)| ≤ |b| < √(ad) ≤ (a + d)/2, which easily shows that the 1 × 1 and 2 × 2 determinants are positive. For the 3 × 3 determinant, after we use |b|² < ad three times and Re(b) > −√(ad) once, we see that

    8a²d + 12ad² − 4a|b|² + 4Re(b)|b|² − 11d|b|² > 4a²d + ad² − 4(ad)^{3/2}.

This quantity is non-negative since the inequality 4a²d + ad² ≥ 4(ad)^{3/2} is equivalent to (4a²d + ad²)² ≥ 16a³d³, which is equivalent to a²d²(d − 4a)² ≥ 0.
(b) After simplifying and factoring as suggested by this hint, we find

    det(Φ(X)) = 2 det(X)p(X),  where  p(X) = 16a² + 30ad + 9d² − 8aRe(b) − 12Re(b)d − 8|b|².

If we use the facts that |b|² < ad and Re(b) < (a + d)/2 then we see that

    p(X) > 12a² + 12ad + 3d² > 0,

so det(Φ(X)) > 0 as well.
(c) Sylvester's Criterion tells us that Φ(X) is positive definite whenever X is positive definite. Exercise 3.A.19 then says that Φ is positive.
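The Kraus-type representation found in Exercise 3.A.2 above can be double-checked numerically: the Choi matrix of Ψ(X) = ∑_i A_i X A_i^T should come out to I − W_{3,3}. The sketch below is not part of the original solutions; it assumes NumPy and uses one common ordering convention for the Choi matrix (for this particular matrix, swapping the two tensor factors gives the same answer, so the choice of convention does not matter here).

```python
# Check (illustrative only): the Choi matrix of Psi(X) = sum_i A_i X A_i^T,
# with the A_i from Exercise 3.A.2, equals I - W_{3,3}.
import numpy as np

A = [np.array([[0, 1, 0], [-1, 0, 0], [0, 0, 0]]),
     np.array([[0, 0, 1], [0, 0, 0], [-1, 0, 0]]),
     np.array([[0, 0, 0], [0, 0, 1], [0, -1, 0]])]

def Psi(X):
    return sum(Ak @ X @ Ak.T for Ak in A)

E = lambda i, j: np.outer(np.eye(3)[i], np.eye(3)[j])   # standard basis matrix E_{i,j}
C_Psi = sum(np.kron(E(i, j), Psi(E(i, j))) for i in range(3) for j in range(3))
W = sum(np.kron(E(i, j), E(j, i)) for i in range(3) for j in range(3))  # swap matrix W_{3,3}

assert np.allclose(C_Psi, np.eye(9) - W)
print("the Choi matrix of Psi equals I - W_{3,3}")
```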
…(c) (1, 0)^{⊗3} + (0, 1)^{⊗3} − (1, 1)^{⊗3} + (1, −1)^{⊗3}.

3.B.3 (a) False. The degree of g might be strictly less than that of f. For example, if f(x_1, x_2) = x_1x_2 + x_2² then dividing through by x_2² gives the dehomogenization g(x) = x + 1.
(c) True. This follows from Theorem 3.B.2.
(e) True. This follows from the real spectral decomposition (see Corollary 2.A.2).

3.B.5 We just observe that in any 4th power of a linear form (ax + by)⁴, if the coefficients of the x⁴ and y⁴ terms equal 0 then a = b = 0 as well, so no quartic form without x⁴ or y⁴ terms can be written as a sum of 4th powers of linear forms.

3.B.6 For both parts of this question, we just need to show that rank_S(w) ≤ rank(w), since the other inequality is trivial.
(a) w ∈ S_n^2 if and only if mat(w) is a symmetric matrix. Applying the real spectral decomposition shows that

    mat(w) = ∑_{i=1}^r λ_i v_i v_i^T,

where r = rank(mat(w)) = rank(w). It follows that w = ∑_{i=1}^r λ_i v_i ⊗ v_i, so rank_S(w) ≤ r = rank(w).
(b) The Takagi factorization (Exercise 2.3.26) says that we can write mat(w) = UDU^T, where U is unitary and D is diagonal (with the singular values of mat(w) on its diagonal). If r = rank(mat(w)) = rank(w) and the first r columns of U are v_1, . . ., v_r, then

    mat(w) = ∑_{i=1}^r d_{i,i} v_i v_i^T,

so w = ∑_{i=1}^r d_{i,i} v_i ⊗ v_i, which tells us that rank_S(w) ≤ r = rank(w), as desired.

3.B.7 (a) Apply the AM–GM inequality to the 4 quantities w⁴, x²y², y²z², and z²x².
(b) If we could write f as a sum of squares of quadratic forms, those quadratic forms must contain no x², y², or z² terms (or else f would have an x⁴, y⁴, or z⁴ term, respectively). It then follows that they also have no wx, wy, or wz terms (or else f would have a w²x², w²y², or w²z² term, respectively) and thus no way to create the cross term −4wxyz in f, which is a contradiction.

3.B.8 Recall from Remark 3.B.2 that

    f(x, y, z) = [g(x, y, z)² + g(y, z, x)² + g(z, x, y)²] / (x² + y² + z²)
               + [h(x, y, z)² + h(y, z, x)² + h(z, x, y)²] / (2(x² + y² + z²)).

If we multiply and divide the right-hand side by x² + y² + z² and then multiply out the numerators, we get f(x, y, z) as a sum of 18 squares of rational functions. For example, the first three terms in this sum of squares are of the form

    (x² + y² + z²)g(x, y, z)² / (x² + y² + z²)²
        = (xg(x, y, z) / (x² + y² + z²))² + (yg(x, y, z) / (x² + y² + z²))² + (zg(x, y, z) / (x² + y² + z²))².

3.B.9 Recall from Theorem 1.3.6 that we can write every quadrilinear form f in the form

    f(x, y, z, w) = ∑_{i,j,k,ℓ} b_{i,j,k,ℓ} x_i z_j y_k w_ℓ.

If we plug z = x and w = y into this equation then we see that

    f(x, y, x, y) = ∑_{i,j,k,ℓ} b_{i,j,k,ℓ} x_i x_j y_k y_ℓ.

If we define a_{i,j;k,ℓ} = (b_{i,j,k,ℓ} + b_{j,i,k,ℓ} + b_{i,j,ℓ,k} + b_{j,i,ℓ,k})/4 then we get

    f(x, y, x, y) = ∑_{i≤j, k≤ℓ} a_{i,j;k,ℓ} x_i x_j y_k y_ℓ = q(x, y),

and the converse is similar.

3.B.10 We think of q(x, y) as a (quadratic) polynomial in y_3, keeping x_1, x_2, x_3, y_1, and y_2 fixed. That is, we write q(x, y) = ay_3² + by_3 + c, where a, b, and c are complicated expressions that depend on x_1, x_2, x_3, y_1, and y_2:

    a = x_2² + x_3²,
    b = −2x_1x_3y_1 − 2x_2x_3y_2, and
    c = x_1²y_1² + x_1²y_2² + x_2²y_2² + x_3²y_1² − 2x_1x_2y_1y_2.

If a = 0 then x_2 = x_3 = 0, so q(x, y) = x_1²(y_1² + y_2²) ≥ 0, as desired, so we assume from now on that a > 0. Our goal is to show that q has at most one real root, since that implies it is never negative. To this end, we note that after performing a laborious calculation to expand out the discriminant b² − 4ac of q, we get the following sum-of-squares decomposition:

    b² − 4ac = −4(x_1x_2y_1 − x_2²y_2)² − 4(x_1x_2y_2 − x_3²y_1)² − 4(x_2x_3y_1 − x_1x_3y_2)².

It follows that b² − 4ac ≤ 0, so q has at most one real root and is thus positive semidefinite.

3.B.11 We check the 3 defining properties of Definition 1.3.6. Properties (a) and (b) are both immediate, and property (c) comes from computing

    ⟨f, f⟩ = ∑_{k_1+···+k_n=p} |a_{k_1,k_2,...,k_n}|² / (p choose k_1, k_2, . . ., k_n),

where (p choose k_1, k_2, . . ., k_n) is the multinomial coefficient. This quantity is clearly non-negative and equals zero if and only if a_{k_1,k_2,...,k_n} = 0 for all k_1, k_2, . . ., k_n (i.e., if and only if f = 0).
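The argument in 3.B.6(a) above is constructive and easy to carry out numerically. The sketch below is not part of the original solutions; it assumes NumPy, identifies R^n ⊗ R^n with R^{n²} via the Kronecker product (so that mat(·) is a reshape), and uses a random symmetric matricization.

```python
# Illustration (not from the text): a symmetric matricization yields a symmetric
# tensor decomposition w = sum_i lambda_i v_i (x) v_i via the real spectral decomposition.
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
M = M + M.T                      # mat(w) is symmetric
w = M.reshape(-1)                # w in R^3 (x) R^3, with mat(w) = M

vals, vecs = np.linalg.eigh(M)   # real spectral decomposition M = sum_i vals[i] v_i v_i^T
recon = sum(vals[i] * np.kron(vecs[:, i], vecs[:, i]) for i in range(3))
assert np.allclose(recon, w)
print("w equals a sum of", np.linalg.matrix_rank(M), "symmetric elementary tensors")
```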
3.C.12 (a) The dual of this SDP can be written in the following form, where Y ∈ M_n^H and z ∈ R are the variables:

    minimize:    tr(Y) + kz
    subject to:  Y + zI ⪰ C
                 Y ⪰ O

(b) If X is the orthogonal projection onto the span of the eigenspaces corresponding to the k largest eigenvalues of C then X is a feasible point of the primal problem of this SDP and tr(CX) is the sum of those k largest eigenvalues, as claimed. On the other hand, if C has spectral decomposition C = ∑_{j=1}^n λ_j v_j v_j^* then we can choose z = λ_{k+1} (where λ_1 ≥ ··· ≥ λ_n are the eigenvalues of C) and Y = ∑_{j=1}^k (λ_j − z)v_j v_j^*. It is straightforward to check that Y ⪰ O and Y + zI ⪰ C, so Y is feasible. Furthermore, tr(Y) + kz = ∑_{j=1}^k λ_j − kz + kz = ∑_{j=1}^k λ_j, as desired.

3.C.14 If Φ = Ψ_1 + T ∘ Ψ_2 then C_Φ = C_{Ψ_1} + Γ(C_{Ψ_2}), so Φ is decomposable if and only if there exist PSD matrices X and Y such that C_Φ = X + Γ(Y). It follows that the optimal value of the following SDP is 0 if Φ is decomposable, and it is −∞ otherwise (this is a feasibility SDP—see Remark 3.C.3):

    maximize:    0
    subject to:  X + Γ(Y) = C_Φ
                 X, Y ⪰ O

3.C.15 We can write Φ = Ψ_1 + T ∘ Ψ_2, where 2C_{Ψ_1} and 2C_{Ψ_2} are the (PSD) matrices

    [  3  0  0  0  0 −2 −1  0 ]          [  1  0  0  0  0  0 −1  0 ]
    [  0  1  0  0  0  0  0 −1 ]          [  0  3  0  0 −2  0  0 −1 ]
    [  0  0  2  0  0  0  0 −2 ]          [  0  0  2  0  0 −2  0  0 ]
    [  0  0  0  2 −2  0  0  0 ]   and    [  0  0  0  2  0  0 −2  0 ]
    [  0  0  0 −2  2  0  0  0 ]          [  0 −2  0  0  2  0  0  0 ]
    [ −2  0  0  0  0  2  0  0 ]          [  0  0 −2  0  0  2  0  0 ]
    [ −1  0  0  0  0  0  1  0 ]          [ −1  0  0 −2  0  0  3  0 ]
    [  0 −1 −2  0  0  0  0  3 ]          [  0 −1  0  0  0  0  0  1 ],

respectively.

3.C.17 (a) The Choi matrix of Φ is

    [  4 −2 −2  2  0  0  0  0 ]
    [ −2  3  0  0  0  0  0  0 ]
    [ −2  0  2  0  0  1  0  0 ]
    [  2  0  0  0  0  0  0  0 ]
    [  0  0  0  0  0  0  0 −2 ]
    [  0  0  1  0  0  2  0 −1 ]
    [  0  0  0  0  0  0  4  0 ]
    [  0  0  0  0 −2 −1  0  2 ].

Using the same dual SDP from the solution to Exercise 3.C.16, we find that the matrix Z given by

    [   2   0   0 −28   0   0   0   0 ]
    [   0  16   0   0   0   0   0   0 ]
    [   0   0  49   0   0 −49   0   0 ]
    [ −28   0   0 392   0   0   0   0 ]
    [   0   0   0   0 392   0   0  28 ]
    [   0   0 −49   0   0  49   0   0 ]
    [   0   0   0   0   0   0  16   0 ]
    [   0   0   0   0  28   0   0   2 ]

is feasible and has tr(C_Φ Z) = −2. Scaling Z up by an arbitrary positive scalar shows that the optimal value of this SDP is −∞, and thus the primal problem is infeasible and so Φ is not decomposable.

3.C.18 Recall from Theorem 3.B.3 that q is a sum of squares of bilinear forms if and only if we can choose Φ to be decomposable and bisymmetric. Let {x_i x_i^T} and {y_j y_j^T} be the rank-1 bases of M_m^S and M_n^S, respectively, described by Exercise 1.2.8. It follows that the following SDP determines whether or not q is a sum of squares, since it determines whether or not there is a decomposable bisymmetric map Φ that represents q:

    maximize:    0
    subject to:  x_i^T Φ(y_j y_j^T) x_i = q(x_i, y_j)  for all i, j
                 Φ = T ∘ Φ
                 Φ = Φ ∘ T
                 Φ = Ψ_1 + T ∘ Ψ_2
                 C_{Ψ_1}, C_{Ψ_2} ⪰ O

[Side note: We can actually choose Ψ_1 = Ψ_2 to make this SDP slightly simpler, thanks to Exercise 3.A.15.]
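The dual SDP in 3.C.12(a) is simple enough to solve directly with an off-the-shelf modeling tool. The sketch below is not part of the original solutions; it assumes the CVXPY package (with an SDP-capable solver such as the bundled SCS), and, since the primal problem is not restated here, it only codes the displayed dual and compares its optimal value with the sum of the k largest eigenvalues of a random symmetric C.

```python
# Solve the dual SDP from 3.C.12(a) for a random symmetric C and compare its optimal
# value with the sum of the k largest eigenvalues of C.  (Illustrative sketch only.)
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, k = 5, 2
C = rng.standard_normal((n, n))
C = (C + C.T) / 2                                   # random symmetric "cost" matrix

Y = cp.Variable((n, n), symmetric=True)
z = cp.Variable()
problem = cp.Problem(cp.Minimize(cp.trace(Y) + k * z),
                     [Y + z * np.eye(n) >> C,       # Y + zI dominates C in the Loewner order
                      Y >> 0])                      # Y is positive semidefinite
problem.solve()

top_k = np.sort(np.linalg.eigvalsh(C))[-k:].sum()
print(problem.value, top_k)                         # these agree up to solver tolerance
```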
transpose . . . 6
transpose-preserving . . . 361
type . . . 321

U
unit ball . . . 148
unit vector . . . 75
unital . . . 376
unitary matrix . . . 96
unitary similarity . . . 168
universal property . . . 343

V
vector . . . 1, 2
vector space . . . 2
vectorization . . . 305

W
Wronskian . . . 23

Y
Young's inequality . . . 154

Z
zero
    transformation . . . 37
    vector . . . 2
    vector space . . . 28, 317
Symbol Index
⊥          orthogonal complement . . . 128
≅          isomorphic . . . 54
⪰, ⪯       Loewner partial order . . . 393
⊕          direct sum . . . 121, 135
⊗          Kronecker product, tensor product . . . 298, 343
0          zero vector . . . 2
O          zero matrix or zero transformation . . . 4, 37
[A]_{i,j}  (i, j)-entry of the matrix A . . . 4
[T]_B      standard matrix of T with respect to basis B . . . 37
‖·‖        the length/norm of a vector or the operator norm of a matrix . . . 74, 221
‖A‖_F      Frobenius norm of the matrix A . . . 74
A^T        transpose of the matrix A . . . 6
A^*        conjugate transpose of the matrix A, adjoint of the linear transformation A . . . 6, 92
A_n^p      antisymmetric subspace of (F^n)^{⊗p} . . . 316
|B|        size (i.e., number of members) of the set B . . . 82
C          complex numbers . . . 2, 428
C^n        vectors/tuples with n complex entries . . . 3
C          vector space of continuous functions . . . 10
C[a, b]    continuous functions on the real interval [a, b] . . . 72
c_00       infinite real sequences with finitely many non-zero entries . . . 11
D          vector space of differentiable functions . . . 10
e_j        j-th standard basis vector in F^n . . . 19
E_{i,j}    standard basis matrix with 1 in its (i, j)-entry and 0s elsewhere . . . 19
F          a field (often R or C) . . . 2, 434
F^n        vectors/tuples with n entries from F . . . 3
F^N        vector space of infinite sequences with entries from F . . . 4
F          vector space of real-valued functions . . . 5
HP_n^p     homogeneous polynomials in n variables with degree p . . . 23, 378
I          identity matrix or identity transformation . . . 12, 37
M_{m,n}    m × n matrices . . . 1
M_n        n × n matrices . . . 1
M_n^{sS}   n × n skew-symmetric matrices . . . 22
M_n^{sH}   n × n (complex) skew-Hermitian matrices . . . 126
M_n^H      n × n (complex) Hermitian matrices . . . 7
M_n^S      n × n symmetric matrices . . . 10
N          natural numbers {1, 2, 3, . . .} . . . 4
P          polynomials in one variable . . . 14
P^E, P^O   even and odd polynomials . . . 20
P_n        polynomials in n variables . . . 23
P^p        polynomials in one variable with degree ≤ p . . . 9
P_n^p      polynomials in n variables with degree ≤ p . . . 23
Q          rational numbers . . . 285, 435
R          real numbers . . . 2
R^n        vectors/tuples with n real entries . . . 1
S_n^p      symmetric subspace of (F^n)^{⊗p} . . . 312
V^*        dual of the vector space V . . . 61
Z          integers . . . 11