
Nathaniel Johnston

Advanced Linear and Matrix Algebra
Nathaniel Johnston
Department of Mathematics and
Computer Science
Mount Allison University
Sackville, NB, Canada

ISBN 978-3-030-52814-0 ISBN 978-3-030-52815-7 (eBook)


https://doi.org/10.1007/978-3-030-52815-7

Mathematics Subject Classification: 15Axx, 97H60, 00-01

© Springer Nature Switzerland AG 2021


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in
this book are believed to be true and accurate at the date of publication. Neither the publisher nor
the authors or the editors give a warranty, expressed or implied, with respect to the material
contained herein or for any errors or omissions that may have been made. The publisher remains
neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
For Devon
…who was very eager at age 2 to contribute to this book:
ndfshfjds kfdshdsf kdfsh kdsfhfdsk hdfsk
The Purpose of this Book
Linear algebra, more so than any other mathematical subject, can be approached in numerous ways.
Many textbooks present the subject in a very concrete and numerical manner, spending much of their
time solving systems of linear equations and having students perform laborious row reductions on
matrices. Many other books instead focus very heavily on linear transformations and other
basis-independent properties, almost to the point that their connection to matrices is considered an
inconvenient after-thought that students should avoid using at all costs.
This book is written from the perspective that both linear transformations and matrices are useful
objects in their own right, but it is the connection between the two that really unlocks the magic of
linear algebra. Sometimes when we want to know something about a linear transformation, the easiest
way to get an answer is to grab onto a basis and look at the corresponding matrix. Conversely, there
are many interesting families of matrices and matrix operations that seemingly have nothing to do with
linear transformations, yet can nonetheless illuminate how some basis-independent objects and
properties behave.
This book introduces many difficult-to-grasp objects such as vector spaces, dual spaces, and tensor
products. Because it is expected that this book will accompany one of the first courses where students
are exposed to such abstract concepts, we typically sandwich this abstractness between concrete
examples. That is, we first introduce or emphasize a standard, prototypical example of the object to be
introduced (e.g., Rn ), then we discuss its abstract generalization (e.g., vector spaces), and finally we
explore other specific examples of that generalization (e.g., the vector space of polynomials and the
vector space of matrices).
This book also delves somewhat deeper into matrix decompositions than most others do. We of course
cover the singular value decomposition as well as several of its applications, but we also spend quite a
bit of time looking at the Jordan decomposition, Schur triangularization, and spectral decomposition,
and we compare and contrast them with each other to highlight when each one is appropriate to use.
Computationally-motivated decompositions like the QR and Cholesky decompositions are also
covered in some of this book’s many “Extra Topic” sections.

Continuation of Introduction to Linear and Matrix Algebra


This book is the second part of a two-book series, following the book Introduction to Linear and
Matrix Algebra [Joh20]. The reader is expected to be familiar with the basics of linear algebra covered
in that book (as well as other introductory linear algebra books): vectors in Rn , the dot product,
matrices and matrix multiplication, Gaussian elimination, the inverse, range, null space, rank, and


determinant of a matrix, as well as eigenvalues and eigenvectors. These preliminary topics are briefly
reviewed in Appendix A.1.
Because these books aim to not overlap with each other and repeat content, we do not discuss some
topics that are instead explored in that book. In particular, diagonalization of a matrix via its eigen-
values and eigenvectors is discussed in the introductory book and not here. However, many extensions
and variations of diagonalization, such as the spectral decomposition (Section 2.1.2) and Jordan
decomposition (Section 2.4) are explored here.

Features of this Book


This book makes use of numerous features to make it as easy to read and understand as possible. Here
we highlight some of these features and discuss how to best make use of them.

Notes in the Margin

This text makes heavy use of notes in the margin, which are used to introduce some additional
terminology or provide reminders that would be distracting in the main text. They are most commonly
used to try to address potential points of confusion for the reader, so it is best not to skip them.
For example, if we want to clarify why a particular piece of notation is the way it is, we do so in the
margin so as to not derail the main discussion. Similarly, if we use some basic fact that students are
expected to be aware of (but have perhaps forgotten) from an introductory linear algebra course, the
margin will contain a brief reminder of why it’s true.

Exercises
Several exercises can be found at the end of every section in this book, and whenever possible there
are three types of them:
• There are computational exercises that ask the reader to implement some algorithm or make
use of the tools presented in that section to solve a numerical problem.
• There are true/false exercises that test the reader’s critical thinking skills and reading com-
prehension by asking them whether some statements are true or false.
• There are proof exercises that ask the reader to prove a general statement. These typically are
either routine proofs that follow straight from the definition (and thus were omitted from the
main text itself), or proofs that can be tackled via some technique that we saw in that section.
Roughly half of the exercises are marked with an asterisk (∗), which means that they have a solution
provided in Appendix C. Exercises marked with two asterisks (∗∗) are referenced in the main text and
are thus particularly important (and also have solutions in Appendix C).

To the Instructor and Independent Reader


This book is intended to accompany a second course in linear algebra, either at the advanced
undergraduate or graduate level. The only prerequisites that are expected of the reader are an intro-
ductory course in linear algebra (which is summarized in Appendix A.1) and some familiarity with
mathematical proofs. It will help the reader to have been exposed to complex numbers, though we do
little more than multiply and add them (their basics are reviewed in Appendix A.3).

The material covered in Chapters 1 and 2 is mostly standard material in upper-year undergraduate
linear algebra courses. In particular, Chapter 1 focuses on abstract structures like vector spaces and
inner products, to show students that the tools developed in their previous linear algebra course can be
applied to a much wider variety of objects than just lists of numbers like vectors in Rn . Chapter 2 then
explores how we can use these new tools at our disposal to gain a much deeper understanding of
matrices.
Chapter 3 covers somewhat more advanced material—multilinearity and the tensor product—which is
aimed particularly at advanced undergraduate students (though we note that no knowledge of abstract
algebra is assumed). It could serve perhaps as content for part of a third course, or as an independent
study in linear algebra. Alternatively, that chapter is also quite aligned with the author’s research
interests as a quantum information theorist, and it could be used as supplemental reading for students
who are trying to learn the basics of the field.

Sectioning
The sectioning of the book is designed to make it as simple to teach from as possible. The author
spends approximately the following amount of time on each chunk of this book:
• Subsection: 1–1.5 hour lecture
• Section: 2 weeks (3–4 subsections per section)
• Chapter: 5–6 weeks (3–4 sections per chapter)
• Book: 12-week course (2 chapters, plus some extra sections)
Of course, this is just a rough guideline, as some sections are longer than others. Furthermore, some
instructors may choose to include material from Chapter 3, or from some of the numerous in-depth
“Extra Topic” sections. Alternatively, the additional topics covered in those sections can serve as
independent study topics for students.

Extra Topic Sections

Half of this book’s sections are called “Extra Topic” sections. The purpose of the book being arranged
in this way is that it provides a clear main path through the book (Sections 1.1–1.4, 2.1–2.4, and 3.1–
3.3) that can be supplemented by the Extra Topic sections at the reader’s/instructor’s discretion. It is
expected that many courses will not even make it to Chapter 3, and instead will opt to explore some
of the earlier Extra Topic sections instead.
We want to emphasize that the Extra Topic sections are not labeled as such because they are less
important than the main sections, but only because they are not prerequisites for any of the main
sections. For example, norms and isometries (Section 1.D) are used constantly throughout advanced
mathematics, but they are presented in an Extra Topic section since the other sections of this book do
not depend on them (and also because they lean quite a bit into “analysis territory”, whereas most
of the rest of the book stays firmly in “algebra territory”).
Similarly, the author expects that many instructors will include the section on the direct sum and
orthogonal complements (Section 1.B) as part of their course’s core material, but this can be done at
their discretion. The subsections on dual spaces and multilinear forms from Section 1.3.2 can be
omitted reasonably safely to make up some time if needed, as can the subsection on Gershgorin discs
(Section 2.2.2), without drastically affecting the book’s flow.
For a graph that depicts the various dependencies of the sections of this book on each other, see
Figure H.

[Figure H appears here: a dependency graph linking Introduction to Linear and Matrix Algebra, the main sections (§1.1–§1.4 of Chapter 1: Vector Spaces, §2.1–§2.4 of Chapter 2: Matrix Decompositions, and §3.1–§3.3 of Chapter 3: Tensors and Multilinearity), and the extra sections (§1.A–§1.D, §2.A–§2.D, and §3.A–§3.C).]

Figure H: A graph depicting the dependencies of the sections of this book on each other. Solid arrows indicate that the section is required before proceeding to the section that it points to, while dashed arrows indicate recommended (but not required) prior reading. The main path through the book consists of Sections 1–4 of each chapter. The extra sections A–D are optional and can be explored at the reader’s discretion, as none of the main sections depend on them.

Acknowledgments
Thanks are extended to Geoffrey Cruttwell, Mark Hamilton, David Kribs, Chi-Kwong Li, Benjamin
Lovitz, Neil McKay, Vern Paulsen, Rajesh Pereira, Sarah Plosker, Jamie Sikora, and John Watrous for
various discussions that have either directly or indirectly improved the quality of this book.
Thank you to Everett Patterson, as well as countless other students in my linear algebra classes at
Mount Allison University, for drawing my attention to typos and parts of the book that could be
improved.

Parts of the layout of this book were inspired by the Legrand Orange Book template by Velimir
Gayevskiy and Mathias Legrand at LaTeXTemplates.com.
Finally, thank you to my wife Kathryn for tolerating me during the years of my mental absence glued
to this book, and thank you to my parents for making me care about both learning and teaching.

Sackville, NB, Canada Nathaniel Johnston


Table of Contents

Preface
    The Purpose of this Book
    Features of this Book
    To the Instructor and Independent Reader

Chapter 1: Vector Spaces
    1.1 Vector Spaces and Subspaces
        1.1.1 Subspaces
        1.1.2 Spans, Linear Combinations, and Independence
        1.1.3 Bases
        Exercises
    1.2 Coordinates and Linear Transformations
        1.2.1 Dimension and Coordinate Vectors
        1.2.2 Change of Basis
        1.2.3 Linear Transformations
        1.2.4 Properties of Linear Transformations
        Exercises
    1.3 Isomorphisms and Linear Forms
        1.3.1 Isomorphisms
        1.3.2 Linear Forms
        1.3.3 Bilinearity and Beyond
        1.3.4 Inner Products
        Exercises
    1.4 Orthogonality and Adjoints
        1.4.1 Orthonormal Bases
        1.4.2 Adjoint Transformations
        1.4.3 Unitary Matrices
        1.4.4 Projections
        Exercises
    1.5 Summary and Review
    1.A Extra Topic: More About the Trace
    1.B Extra Topic: Direct Sum, Orthogonal Complement
    1.C Extra Topic: The QR Decomposition
    1.D Extra Topic: Norms and Isometries

Chapter 2: Matrix Decompositions
    2.1 The Schur and Spectral Decompositions
        2.1.1 Schur Triangularization
        2.1.2 Normal Matrices and the Complex Spectral Decomposition
        2.1.3 The Real Spectral Decomposition
        Exercises
    2.2 Positive Semidefiniteness
        2.2.1 Characterizing Positive (Semi)Definite Matrices
        2.2.2 Diagonal Dominance and Gershgorin Discs
        2.2.3 Unitary Freedom of PSD Decompositions
        Exercises
    2.3 The Singular Value Decomposition
        2.3.1 Geometric Interpretation and the Fundamental Subspaces
        2.3.2 Relationship with Other Matrix Decompositions
        2.3.3 The Operator Norm
        Exercises
    2.4 The Jordan Decomposition
        2.4.1 Uniqueness and Similarity
        2.4.2 Existence and Computation
        2.4.3 Matrix Functions
        Exercises
    2.5 Summary and Review
    2.A Extra Topic: Quadratic Forms and Conic Sections
    2.B Extra Topic: Schur Complements and Cholesky
    2.C Extra Topic: Applications of the SVD
    2.D Extra Topic: Continuity and Matrix Analysis

Chapter 3: Tensors and Multilinearity
    3.1 The Kronecker Product
        3.1.1 Definition and Basic Properties
        3.1.2 Vectorization and the Swap Matrix
        3.1.3 The Symmetric and Antisymmetric Subspaces
        Exercises
    3.2 Multilinear Transformations
        3.2.1 Definition and Basic Examples
        3.2.2 Arrays
        3.2.3 Properties of Multilinear Transformations
        Exercises
    3.3 The Tensor Product
        3.3.1 Motivation and Definition
        3.3.2 Existence and Uniqueness
        3.3.3 Tensor Rank
        Exercises
    3.4 Summary and Review
    3.A Extra Topic: Matrix-Valued Linear Maps
    3.B Extra Topic: Homogeneous Polynomials
    3.C Extra Topic: Semidefinite Programming

Appendix A: Mathematical Preliminaries
    A.1 Review of Introductory Linear Algebra
        A.1.1 Systems of Linear Equations
        A.1.2 Matrices as Linear Transformations
        A.1.3 The Inverse of a Matrix
        A.1.4 Range, Rank, Null Space, and Nullity
        A.1.5 Determinants and Permutations
        A.1.6 Eigenvalues and Eigenvectors
        A.1.7 Diagonalization
    A.2 Polynomials and Beyond
        A.2.1 Monomials, Binomials and Multinomials
        A.2.2 Taylor Polynomials and Taylor Series
    A.3 Complex Numbers
        A.3.1 Basic Arithmetic and Geometry
        A.3.2 The Complex Conjugate
        A.3.3 Euler’s Formula and Polar Form
    A.4 Fields
    A.5 Convexity
        A.5.1 Convex Sets
        A.5.2 Convex Functions

Appendix B: Additional Proofs
    B.1 Equivalence of Norms
    B.2 Details of the Jordan Decomposition
    B.3 Strong Duality for Semidefinite Programs

Appendix C: Selected Exercise Solutions
    C.1 Chapter 1: Vector Spaces
    C.2 Chapter 2: Matrix Decompositions
    C.3 Chapter 3: Tensors and Multilinearity

Bibliography
Index
Symbol Index
1. Vector Spaces

It is my experience that proofs involving matrices can be shortened by 50% if one throws the matrices out.

Emil Artin

Our first exposure to linear algebra is typically a very concrete thing—it consists of some basic facts about Rn (a set made up of lists of real numbers, called vectors) and Mm,n (a set made up of m × n arrays of real numbers, called matrices). We can do numerous useful things in these sets, such as solve systems of linear equations, multiply matrices together, and compute the rank, determinant, and eigenvalues of a matrix.
[Margin note: Rn denotes the set of vectors with n (real) entries and Mm,n denotes the set of m × n matrices.]
When we look carefully at our procedures for doing these calculations,
as well as our proofs for why they work, we notice that most of them do not
actually require much more than the ability to add vectors together and multiply
them by scalars. However, there are many other mathematical settings where
addition and scalar multiplication work, and almost all of our linear algebraic
techniques work in these more general settings as well.
With this in mind, our goal right now is considerably different from what
it was in introductory linear algebra—we want to see exactly how far we can
push our techniques. Instead of defining objects and operations in terms of
explicit formulas and then investigating what properties they satisfy (as we
have done up until now), we now focus on the properties that those familiar
objects have and ask what other types of objects have those properties.
For example, in a typical introduction to linear algebra, the dot product of
two vectors v and w in Rn is defined by
v · w = v1 w1 + v2 w2 + · · · + vn wn ,
and then the “nice” properties that the dot product satisfies are investigated. For
example, students typically learn the facts that
v·w = w·v and v · (w + x) = (v · w) + (v · x) for all v, w, x ∈ Rn
almost immediately after being introduced to the dot product. In this chapter,
we flip this approach around and instead define an “inner product” to be any
function satisfying those same properties, and then show that everything we
learned about the dot product actually applies to every single inner product
(even though many inner products look, on the surface, quite different from the
dot product).
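As a concrete (if trivial) illustration of these two properties, here is a minimal numerical check. This sketch is not part of the book's text, and the specific vectors are arbitrary:

```python
import numpy as np

# Arbitrary vectors in R^3 (any choice works).
v = np.array([1.0, -2.0, 3.0])
w = np.array([4.0, 0.5, -1.0])
x = np.array([2.0, 2.0, 2.0])

# The dot product v . w = v1*w1 + v2*w2 + ... + vn*wn.
print(np.dot(v, w))                                                # 0.0
# Commutativity: v . w == w . v
print(np.isclose(np.dot(v, w), np.dot(w, v)))                      # True
# Distributivity: v . (w + x) == (v . w) + (v . x)
print(np.isclose(np.dot(v, w + x), np.dot(v, w) + np.dot(v, x)))   # True
```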


1.1 Vector Spaces and Subspaces

In order to use our linear algebraic tools with objects other than vectors in Rn ,
we need a proper definition that tells us what types of objects we can consider in
a linear algebra setting. The following definition makes this precise and serves
as the foundation for this entire chapter. Although the definition looks like an
absolute beast, the intuition behind it is quite straightforward—the objects that
we work with should behave “like” vectors in Rn . That is, they should have the
same properties (like commutativity: v + w = w + v) that vectors in Rn have
with respect to vector addition and scalar multiplication.
More specifically, the following definition lists 10 properties that must be
satisfied in order for us to call something a “vector space” (like R3 ) or a “vector”
(like (1, 3, −2) ∈ R3 ). These 10 properties can be thought of as the answers to
the question “what properties of vectors in Rn can we list without explicitly
referring to their entries?”

Definition 1.1.1 (Vector Space). Let F be a set of scalars (usually either R or C) and let V be a set with two operations called addition and scalar multiplication. We write the addition of v, w ∈ V as v + w, and the scalar multiplication of c ∈ F and v as cv.
If the following ten conditions hold for all v, w, x ∈ V and all c, d ∈ F, then V is called a vector space and its elements are called vectors:
a) v + w ∈ V (closure under addition)
b) v + w = w + v (commutativity)
c) (v + w) + x = v + (w + x) (associativity)
d) There exists a “zero vector” 0 ∈ V such that v + 0 = v.
e) There exists a vector −v such that v + (−v) = 0.
f) cv ∈ V (closure under scalar multiplication)
g) c(v + w) = cv + cw (distributivity)
h) (c + d)v = cv + dv (distributivity)
i) c(dv) = (cd)v
j) 1v = v

[Margin note: R is the set of real numbers and C is the set of complex numbers (see Appendix A.3).]
[Margin note: Notice that the first five properties concern addition, while the last five concern scalar multiplication.]

Remark 1.1.1 (Fields and Sets of Scalars). Not much would be lost throughout this book if we were to replace the set of scalars F from Definition 1.1.1 with “either R or C”. In fact, in many cases it is even enough to just explicitly choose F = C, since oftentimes if a property holds over C then it automatically holds over R simply because R ⊆ C.
However, F more generally can be any “field”, which is a set of objects in which we can add, subtract, multiply, and divide according to the usual laws of arithmetic (e.g., ab = ba and a(b + c) = ab + ac for all a, b, c ∈ F)—see Appendix A.4. Just like we can keep Rn in mind as the standard example of a vector space, we can keep R and C in mind as the standard examples of a field.

[Margin note: It’s also useful to know that every field has a “0” and a “1”: numbers such that 0a = 0 and 1a = a for all a ∈ F.]

We now look at several examples of sets that are and are not vector spaces,
to try to get used to this admittedly long and cumbersome definition. As our
first example, we show that Rn is indeed a vector space (which should not be
surprising—the definition of a vector space was designed specifically so as to
mimic Rn ).

Example 1.1.1 (Euclidean Space is a Vector Space). Show that Rn is a vector space.

Solution:
We have to check the ten properties described by Definition 1.1.1. If v, w, x ∈ Rn and c, d ∈ R then:
a) v + w ∈ Rn (there is nothing to prove here—it follows directly from the definition of vector addition in Rn).
b) We just repeatedly use commutativity of real number addition:

    v + w = (v1 + w1, . . . , vn + wn) = (w1 + v1, . . . , wn + vn) = w + v.

c) This property follows in a manner similar to property (b) by making use of associativity of real number addition:

    (v + w) + x = (v1 + w1, . . . , vn + wn) + (x1, . . . , xn)
                = (v1 + w1 + x1, . . . , vn + wn + xn)
                = (v1, . . . , vn) + (w1 + x1, . . . , wn + xn) = v + (w + x).

d) The zero vector 0 = (0, 0, . . . , 0) ∈ Rn clearly satisfies v + 0 = v.
e) We simply choose −v = (−v1, . . . , −vn), which indeed satisfies v + (−v) = 0.
f) cv ∈ Rn (again, there is nothing to prove here—it follows straight from the definition of scalar multiplication in Rn).
g) We just expand each of c(v + w) and cv + cw in terms of their entries:

    c(v + w) = c(v1 + w1, . . . , vn + wn) = (cv1 + cw1, . . . , cvn + cwn) = cv + cw.

h) Similarly to property (g), we just notice that for each 1 ≤ j ≤ n, the j-th entry of (c + d)v is (c + d)vj = cvj + dvj, which is also the j-th entry of cv + dv.
i) Just like properties (g) and (h), we just notice that for each 1 ≤ j ≤ n, the j-th entry of c(dv) is c(dvj) = (cd)vj, which is also the j-th entry of (cd)v.
j) The fact that 1v = v is clear.

[Margin note: vj, wj, and xj denote the j-th entries of v, w, and x, respectively.]

We should keep Rn in our mind as the prototypical example of a vector space. We will soon prove theorems about vector spaces in general and see plenty of exotic vector spaces, but it is absolutely fine to get our intuition about how vector spaces work from Rn itself. In fact, we should do this every time we define a new concept in this chapter—keep in mind what the standard example of the new abstractly-defined object is, and use that standard example to build up our intuition.
It is also the case that Cn (the set of vectors/tuples with n complex entries) is a vector space, and the proof of this fact is almost identical to the argument that we provided for Rn in Example 1.1.1. In fact, the set of all infinite sequences of scalars (rather than finite tuples of scalars like we are used to) is also a vector space. That is, the sequence space

    FN def= { (x1, x2, x3, . . .) : xj ∈ F for all j ∈ N }

is a vector space with the standard addition and scalar multiplication operations (we leave the proof of this fact to Exercise 1.1.7).

[Margin note: For all fields F, the set Fn (of ordered n-tuples of elements of F) is also a vector space.]
[Margin note: N is the set of natural numbers. That is, N = {1, 2, 3, . . .}.]
To get a bit more comfortable with vector spaces, we now look at several
examples of vector spaces that, on the surface, look significantly different from
these spaces of tuples and sequences.

Example 1.1.2 (The Set of Matrices is a Vector Space). Show that Mm,n, the set of m × n matrices, is a vector space.

Solution:
We have to check the ten properties described by Definition 1.1.1. If A, B, C ∈ Mm,n and c, d ∈ F then:
a) A + B is also an m × n matrix (i.e., A + B ∈ Mm,n).
b) For each 1 ≤ i ≤ m and 1 ≤ j ≤ n, the (i, j)-entry of A + B is ai,j + bi,j = bi,j + ai,j, which is also the (i, j)-entry of B + A. It follows that A + B = B + A.
c) The fact that (A + B) + C = A + (B + C) follows similarly from looking at the (i, j)-entry of each of these matrices and using associativity of addition in F.
d) The “zero vector” in this space is the zero matrix O (i.e., the m × n matrix with every entry equal to 0), since A + O = A.
e) We define −A to be the matrix whose (i, j)-entry is −ai,j, so that A + (−A) = O.
f) cA is also an m × n matrix (i.e., cA ∈ Mm,n).
g) The (i, j)-entry of c(A + B) is c(ai,j + bi,j) = cai,j + cbi,j, which is the (i, j)-entry of cA + cB, so c(A + B) = cA + cB.
h) Similarly to property (g), the (i, j)-entry of (c + d)A is (c + d)ai,j = cai,j + dai,j, which is also the (i, j)-entry of cA + dA. It follows that (c + d)A = cA + dA.
i) Similarly to property (g), the (i, j)-entry of c(dA) is c(dai,j) = (cd)ai,j, which is also the (i, j)-entry of (cd)A, so c(dA) = (cd)A.
j) The fact that 1A = A is clear.

[Margin note: The “(i, j)-entry” of a matrix is the scalar in its i-th row and j-th column. We denote the (i, j)-entry of A and B by ai,j and bi,j, respectively, or sometimes by [A]i,j and [B]i,j.]

If we wish to emphasize which field F the entries of the matrices in Mm,n come from, we denote it by Mm,n(F). For example, if we say that A ∈ M2,3(R) and B ∈ M4,5(C) then we are saying that A is a 2 × 3 matrix with real entries and B is a 4 × 5 matrix with complex entries. We use the briefer notation Mm,n if the choice of field is unimportant or clear from context, and we use the even briefer notation Mn when the matrices are square (i.e., m = n).
[Margin note: Since R ⊂ C, real entries are also complex, so the entries of B might be real.]
Typically, proving that a set is a vector space is not hard—all of the proper-
ties either follow directly from the relevant definitions or via a one-line proof.
However, it is still important to actually verify that all ten properties hold,
especially when we are first learning about vector spaces, as we will shortly
see some examples of sets that look somewhat like vector spaces but are not.
Example 1.1.3 (The Set of Functions is a Vector Space). Show that the set of real-valued functions F def= { f : R → R } is a vector space.

Solution:
Once again, we have to check the ten properties described by Definition 1.1.1. To do this, we will repeatedly use the fact that two functions are the same if and only if their outputs are always the same (i.e., f = g if and only if f(x) = g(x) for all x ∈ R). With this observation in mind, we note that if f, g, h ∈ F and c, d ∈ R then:
a) f + g is the function defined by (f + g)(x) = f(x) + g(x) for all x ∈ R. In particular, f + g is also a function, so f + g ∈ F.
b) For all x ∈ R, we have

    (f + g)(x) = f(x) + g(x) = g(x) + f(x) = (g + f)(x),

   so f + g = g + f.
c) For all x ∈ R, we have

    ((f + g) + h)(x) = (f(x) + g(x)) + h(x) = f(x) + (g(x) + h(x)) = (f + (g + h))(x),

   so (f + g) + h = f + (g + h).
d) The “zero vector” in this space is the function 0 with the property that 0(x) = 0 for all x ∈ R.
e) Given a function f, the function −f is simply defined by (−f)(x) = −f(x) for all x ∈ R. Then

    (f + (−f))(x) = f(x) + (−f)(x) = f(x) − f(x) = 0

   for all x ∈ R, so f + (−f) = 0.
f) cf is the function defined by (cf)(x) = cf(x) for all x ∈ R. In particular, cf is also a function, so cf ∈ F.
g) For all x ∈ R, we have

    (c(f + g))(x) = c(f(x) + g(x)) = cf(x) + cg(x) = (cf + cg)(x),

   so c(f + g) = cf + cg.
h) For all x ∈ R, we have

    ((c + d)f)(x) = (c + d)f(x) = cf(x) + df(x) = (cf + df)(x),

   so (c + d)f = cf + df.
i) For all x ∈ R, we have

    (c(df))(x) = c(df(x)) = (cd)f(x) = ((cd)f)(x),

   so c(df) = (cd)f.
j) For all x ∈ R, we have (1f)(x) = 1f(x) = f(x), so 1f = f.

[Margin note: All of these properties of functions are trivial. The hardest part of this example is not the math, but rather getting our heads around the unfortunate notation that results from using parentheses to group function addition and also to denote inputs of functions.]

In the previous three examples, we did not explicitly specify what the
addition and scalar multiplication operations were upfront, since there was an
obvious choice on each of these sets. However, it may not always be clear what
these operations should actually be, in which case they have to be explicitly
defined before we can start checking whether or not they turn the set into a
vector space.
In other words, vector spaces are a package deal with their addition and
scalar multiplication operations—a set might be a vector space when the op-
erations are defined in one way but not when they are defined another way.
Furthermore, the operations that we call addition and scalar multiplication
might look nothing like what we usually call “addition” or “multiplication”.
All that matters is that those operations satisfy the ten properties from Defini-
tion 1.1.1.

Example 1.1.4 (A Vector Space With Weird Operations). Let V = {x ∈ R : x > 0} be the set of positive real numbers. Show that V is a vector space when we define addition ⊕ on it via usual multiplication of real numbers (i.e., x ⊕ y = xy) and scalar multiplication ⊙ on it via exponentiation (i.e., c ⊙ x = x^c).

Solution:
Once again, we have to check the ten properties described by Definition 1.1.1. Well, if x, y, z ∈ V and c, d ∈ R then:
a) x ⊕ y = xy, which is still a positive number since x and y are both positive, so x ⊕ y ∈ V.
b) x ⊕ y = xy = yx = y ⊕ x.
c) (x ⊕ y) ⊕ z = (xy)z = x(yz) = x ⊕ (y ⊕ z).
d) The “zero vector” in this space is the number 1 (so we write 0 = 1), since 0 ⊕ x = 1x = x.
e) For each vector x ∈ V, we define −x = 1/x. This works because x ⊕ (−x) = x(1/x) = 1 = 0.
f) c ⊙ x = x^c, which is still positive since x is positive, so c ⊙ x ∈ V.
g) c ⊙ (x ⊕ y) = (xy)^c = x^c y^c = (c ⊙ x) ⊕ (c ⊙ y).
h) (c + d) ⊙ x = x^(c+d) = x^c x^d = (c ⊙ x) ⊕ (d ⊙ x).
i) c ⊙ (d ⊙ x) = (x^d)^c = x^(cd) = (cd) ⊙ x.
j) 1 ⊙ x = x^1 = x.

[Margin note: In this example, we use bold variables like x when we think of objects as vectors in V, and we use non-bold variables like x when we think of them just as positive real numbers.]
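The following is a small computational sketch of these nonstandard operations (it is not part of the book's text, and the names vec_add and scalar_mult are our own stand-ins for ⊕ and ⊙):

```python
import math

def vec_add(x, y):
    """'Addition' in V: the ordinary product of positive reals, x (+) y = xy."""
    return x * y

def scalar_mult(c, x):
    """'Scalar multiplication' in V: exponentiation, c (.) x = x**c."""
    return x ** c

x, y = 2.0, 5.0
c, d = 3.0, -0.5
zero_vector = 1.0  # the "zero vector" of V is the number 1

# Property (d): 0 (+) x = x
print(math.isclose(vec_add(zero_vector, x), x))                     # True
# Property (g): c (.) (x (+) y) = (c (.) x) (+) (c (.) y)
print(math.isclose(scalar_mult(c, vec_add(x, y)),
                   vec_add(scalar_mult(c, x), scalar_mult(c, y))))  # True
# Property (h): (c + d) (.) x = (c (.) x) (+) (d (.) x)
print(math.isclose(scalar_mult(c + d, x),
                   vec_add(scalar_mult(c, x), scalar_mult(d, x))))  # True
```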

The previous examples illustrate that vectors, vector spaces, addition, and
scalar multiplication can all look quite different from the corresponding con-
cepts in Rn . As one last technicality, we note that whether or not a set is a
vector space depends on the field F that is being considered. The field can often,
but not always, be inferred from context.
Before presenting an example that demonstrates why the choice of field is
not always obvious, we establish some notation and remind the reader of some
terminology. The transpose of a matrix A ∈ Mm,n is the matrix AT ∈ Mn,m
whose (i, j)-entry is a j,i . That is, AT is obtained from A by reflecting its entries
across its main diagonal. For example, if

    A = [ 1  2  3 ]              [ 1  4 ]
        [ 4  5  6 ]    then AT = [ 2  5 ] .
                                 [ 3  6 ]
Similarly, the conjugate transpose of A ∈ Mm,n(C) is the matrix A∗ ∈ Mn,m(C) whose (i, j)-entry is the complex conjugate of aj,i (complex conjugation sends a + ib to a − ib, and is denoted by a horizontal line over its argument). In other words, A∗ is obtained from A by transposing it and conjugating each of its entries. For example, if

    A = [ 1       3 − i   2i ]                 [ 1       2 − 3i ]
        [ 2 + 3i  −i      0  ]    then   A∗ =  [ 3 + i   i      ] .
                                               [ −2i     0      ]

[Margin note: Complex conjugation is reviewed in Appendix A.3.2.]

Finally, we say that a matrix A ∈ Mn is symmetric if AT = A and that a matrix B ∈ Mn(C) is Hermitian if B∗ = B. For example, if

    A = [ 1  2 ]         B = [ 2       1 + 3i ]               C = [ 1  2 ]
        [ 2  3 ] ,           [ 1 − 3i  −4     ] ,   and           [ 3  4 ] ,

then A is symmetric (and Hermitian), B is Hermitian (but not symmetric), and C is neither.

Example 1.1.5 (Is the Set of Hermitian Matrices a Vector Space?). Let MHn be the set of n × n Hermitian matrices. Show that if the field is F = R then MHn is a vector space, but if F = C then it is not.

Solution:
Before presenting the solution, we clarify that the field F does not specify what the entries of the Hermitian matrices are: in both cases (i.e., when F = R and when F = C), the entries of the matrices themselves can be complex. The field F associated with a vector space just determines what types of scalars are used in scalar multiplication.
To illustrate this point, we start by showing that if F = C then MHn is not a vector space. For example, property (f) of vector spaces fails because

    A = [ 0  1 ]                         [ 0  i ]
        [ 1  0 ]  ∈ MH2,    but    iA =  [ i  0 ]  ∉ MH2.

That is, MH2 (and by similar reasoning, MHn) is not closed under multiplication by complex scalars.
On the other hand, to see that MHn is a vector space when F = R, we check the ten properties described by Definition 1.1.1. If A, B, C ∈ MHn and c, d ∈ R then:
a) (A + B)∗ = A∗ + B∗ = A + B, so A + B ∈ MHn.
d) The zero matrix O is Hermitian, so we choose it as the “zero vector” in MHn.
e) If A ∈ MHn then −A ∈ MHn too.
f) Since c is real, we have (cA)∗ = cA∗ = cA, so cA ∈ MHn (whereas if c were complex then we would just have (cA)∗ = c̄A∗ = c̄A, which does not necessarily equal cA, as we saw above).
All of the other properties of vector spaces follow immediately using the same arguments that we used to show that Mm,n is a vector space in Example 1.1.2.
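The closure failure over C (and the closure over R) can also be checked numerically. The following is a small sketch, not part of the book, using NumPy; the helper is_hermitian is our own:

```python
import numpy as np

def is_hermitian(M):
    """A matrix is Hermitian when it equals its own conjugate transpose."""
    return np.allclose(M, M.conj().T)

A = np.array([[0, 1],
              [1, 0]], dtype=complex)   # Hermitian (in fact real symmetric)

print(is_hermitian(A))        # True
print(is_hermitian(2.5 * A))  # True  -- real scalar multiples stay Hermitian
print(is_hermitian(1j * A))   # False -- multiplying by i leaves the set
```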

In cases where we wish to clarify which field we are using for scalar
multiplication in a vector space V, we say that V is a vector space “over” that
field. For example, we say that MH n (the set of n × n Hermitian matrices) is a
vector space over R, but not a vector space over C. Alternatively, we refer to
the field in question as the ground field of V (so the ground field of MHn, for example, is R).
Despite the various vector spaces that we have seen looking so different on
the surface, not much changes when we do linear algebra in this more general
setting. To get a feeling for how we can prove things about vector spaces in
general, we now prove our very first theorem. We can think of this theorem as
answering the question of why we did not include some other properties like
“0v = 0” in the list of defining properties of vector spaces, even though we did
include “1v = v”. The reason is simply that listing these extra properties would
be redundant, as they follow from the ten properties that we did list.

Theorem 1.1.1 (Zero Vector and Additive Inverses via Scalar Multiplication). Suppose V is a vector space and v ∈ V. Then
a) 0v = 0, and
b) (−1)v = −v.

Proof. To show that 0v = 0, we carefully use the various defining properties of vector spaces from Definition 1.1.1:

    0v = 0v + 0                    (property (d))
       = 0v + (0v + (−(0v)))       (property (e))
       = (0v + 0v) + (−(0v))       (property (c))
       = (0 + 0)v + (−(0v))        (property (h))
       = 0v + (−(0v))              (0 + 0 = 0 in every field)
       = 0.                        (property (e))

Now that we have 0v = 0 to work with, proving that (−1)v = −v is a bit more straightforward:

    0 = 0v                         (we just proved this)
      = (1 − 1)v                   (1 − 1 = 0 in every field)
      = 1v + (−1)v                 (property (h))
      = v + (−1)v.                 (property (j))

It follows that (−1)v = −v, which completes the proof. ∎

[Margin note: Proving things about vector spaces will quickly become less tedious than in this theorem—as we develop more tools, we will find ourselves referencing the defining properties (a)–(j) less and less.]


From now on, we write vector subtraction in the usual way v − w that we
are used to. The above theorem ensures that there is no ambiguity when we
write subtraction in this way, since it does not matter if v − w is taken to mean
v + (−w) or v + (−1)w.

1.1.1 Subspaces
It is often useful to work with vector spaces that are contained within other
vector spaces. This situation comes up often enough that it gets its own name:

Definition 1.1.2 (Subspace). If V is a vector space and S ⊆ V, then S is a subspace of V if S is itself a vector space with the same addition, scalar multiplication, and ground field as V.

It turns out that checking whether or not something is a subspace is much


simpler than checking whether or not it is a vector space. We already saw this
somewhat in Example 1.1.5, where we only explicitly showed that four vector
space properties hold for MH n instead of all ten. The reason we could do this is
that MH n is a subset of M n (C), so it inherited many of the properties of vector
spaces “for free” from the fact that we already knew that Mn (C) is a vector
space.
It turns out that even checking four properties for subspaces is overkill—we
only have to check two:

Theorem 1.1.2 (Determining if a Set is a Subspace). Let V be a vector space over a field F and let S ⊆ V be non-empty. Then S is a subspace of V if and only if the following two conditions hold:
a) If v, w ∈ S then v + w ∈ S, and (closure under addition)
b) if v ∈ S and c ∈ F then cv ∈ S. (closure under scalar mult.)

Proof. For the “only if” direction, properties (a) and (b) in this theorem are properties (a) and (f) in Definition 1.1.1 of a vector space, so of course they must hold if S is a subspace (since subspaces are vector spaces).
For the “if” direction, we have to show that all ten properties (a)–(j) in Definition 1.1.1 of a vector space hold for S:
• Properties (a) and (f) hold by hypothesis.
• Properties (b), (c), (g), (h), (i), and (j) hold for all vectors in V, so they certainly hold for all vectors in S too, since S ⊆ V.
• For property (d), we need to show that 0 ∈ S. If v ∈ S then we know that 0v ∈ S too since S is closed under scalar multiplication. However, we know from Theorem 1.1.1(a) that 0v = 0, so we are done.
• For property (e), we need to show that if v ∈ S then −v ∈ S too. We know that (−1)v ∈ S since S is closed under scalar multiplication, and we know from Theorem 1.1.1(b) that (−1)v = −v, so we are done. ∎

[Margin note: Every subspace, just like every vector space, must contain a zero vector.]
Example 1.1.6 (The Set of Polynomials is a Subspace). Let Pp be the set of real-valued polynomials of degree at most p. Show that Pp is a subspace of F, the vector space of all real-valued functions.

Solution:
We just have to check the two properties described by Theorem 1.1.2. Hopefully it is somewhat clear that adding two polynomials of degree at most p results in another polynomial of degree at most p, and similarly multiplying a polynomial of degree at most p by a scalar results in another one, but we make this computation explicit.
Suppose f(x) = ap x^p + · · · + a1 x + a0 and g(x) = bp x^p + · · · + b1 x + b0 are polynomials of degree at most p (i.e., f, g ∈ Pp) and c ∈ R is a scalar. Then:
a) We compute

    (f + g)(x) = (ap x^p + · · · + a1 x + a0) + (bp x^p + · · · + b1 x + b0)
               = (ap + bp)x^p + · · · + (a1 + b1)x + (a0 + b0),

   which is also a polynomial of degree at most p, so f + g ∈ Pp.
b) Just like before, we compute

    (cf)(x) = c(ap x^p + · · · + a1 x + a0)
            = (cap)x^p + · · · + (ca1)x + (ca0),

   which is also a polynomial of degree at most p, so cf ∈ Pp.

[Margin note: The degree of a polynomial is the largest exponent to which the variable is raised, so a polynomial of degree at most p looks like f(x) = ap x^p + · · · + a1 x + a0, where ap, . . ., a1, a0 ∈ R. The degree of f is exactly p if ap ≠ 0. See Appendix A.2 for an introduction to polynomials.]

Slightly more generally, the set Pp(F) of polynomials with coefficients from a field F is also a subspace of the vector space F(F) of functions from F to itself, even when F ≠ R. However, Pp(F) has some subtle properties that make it difficult to work with in general (see Exercise 1.2.34, for example), so we only consider it in the case when F = R or F = C.
Similarly, the set of continuous functions C is also a subspace of F, as is the set of differentiable functions D (see Exercise 1.1.10). Since every polynomial is differentiable, and every differentiable function is continuous, it is even the case that Pp is a subspace of D, which is a subspace of C, which is a subspace of F.
[Margin note: Recall that a function is called “differentiable” if it has a derivative.]

Example 1.1.7 (The Set of Upper Triangular Matrices is a Subspace). Show that the set of n × n upper triangular matrices is a subspace of Mn.

Solution:
Again, we have to check the two properties described by Theorem 1.1.2, and it is again somewhat clear that both of these properties hold. For example, if we add two matrices with zeros below the diagonal, their sum will still have zeros below the diagonal.
We now formally prove that these properties hold. Suppose A, B ∈ Mn are upper triangular (i.e., ai,j = bi,j = 0 whenever i > j) and c ∈ F is a scalar.
a) The (i, j)-entry of A + B is ai,j + bi,j = 0 + 0 = 0 whenever i > j, and
b) the (i, j)-entry of cA is cai,j = c · 0 = 0 whenever i > j.
Since both properties are satisfied, we conclude that the set of n × n upper triangular matrices is indeed a subspace of Mn.

[Margin note: Recall that a matrix A is called upper triangular if ai,j = 0 whenever i > j. For example, a 2 × 2 upper triangular matrix has the form A = [a b; 0 c].]
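As a quick computational sanity check of this closure (a sketch that is not part of the book; NumPy's triu zeroes out all entries below the main diagonal):

```python
import numpy as np

def is_upper_triangular(M):
    """True when every entry below the main diagonal is zero."""
    return np.allclose(M, np.triu(M))

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 4.0, 5.0],
              [0.0, 0.0, 6.0]])
B = np.array([[7.0, 0.0, 1.0],
              [0.0, 8.0, 2.0],
              [0.0, 0.0, 9.0]])

print(is_upper_triangular(A + B))    # True: closed under addition
print(is_upper_triangular(-3 * A))   # True: closed under scalar multiplication
```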

Similar to the previous example, the set MSn of n × n symmetric matrices


is also a subspace of Mn (see Exercise 1.1.8). It is worth noting, however, that
the set MH n of n × n Hermitian matrices is not a subspace of Mn (C) unless we
regard Mn (C) as a vector space over R instead of C (which is possible, but
quite non-standard).

Example 1.1.8 (The Set of Non-Invertible Matrices is Not a Subspace). Show that the set of non-invertible 2 × 2 matrices is not a subspace of M2.

Solution:
This set is not a subspace because it is not closed under addition. For example, the following matrices are not invertible:

    A = [ 1  0 ]          B = [ 0  0 ]
        [ 0  0 ] ,            [ 0  1 ] .

However, their sum is

    A + B = [ 1  0 ]
            [ 0  1 ] ,

which is invertible. It follows that property (a) of Theorem 1.1.2 does not hold, so this set is not a subspace of M2.

[Margin note: Similar examples can be used to show that, for all n ≥ 1, the set of non-invertible n × n matrices is not a subspace of Mn.]
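The same failure of closure can be observed numerically; the sketch below (not from the book) tests invertibility via the determinant:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 0.0]])
B = np.array([[0.0, 0.0],
              [0.0, 1.0]])

# A 2x2 matrix is invertible exactly when its determinant is non-zero.
print(np.linalg.det(A), np.linalg.det(B))   # 0.0 0.0  (both non-invertible)
print(np.linalg.det(A + B))                 # 1.0      (the sum is invertible)
```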

Example 1.1.9 (The Set of Integer-Entry Vectors is Not a Subspace). Show that Z3, the set of 3-entry vectors with integer entries, is not a subspace of R3.

Solution:
This set is a subset of R3 and is closed under addition, but it is not a subspace because it is not closed under scalar multiplication. Because we are asking whether or not it is a subspace of R3, which uses R as its ground field, we must use the same scalars here as well. However, if v ∈ Z3 and c ∈ R then cv may not be in Z3. For example, if

    v = (1, 2, 3) ∈ Z3    then    (1/2)v = (1/2, 1, 3/2) ∉ Z3.

It follows that property (b) of Theorem 1.1.2 does not hold, so this set is not a subspace of R3.

Example 1.1.10 (The Set of Eventually-Zero Sequences is a Subspace). Show that the set c00 ⊂ FN of sequences with only finitely many non-zero entries is a subspace of the sequence space FN.

Solution:
Once again, we have to check properties (a) and (b) of Theorem 1.1.2. That is, if v, w ∈ c00 have only finitely many non-zero entries and c ∈ F is a scalar, then we have to show that (a) v + w and (b) cv each have finitely many non-zero entries as well.
These properties are both straightforward to show—if v has m non-zero entries and w has n non-zero entries then v + w has at most m + n non-zero entries and cv has either m non-zero entries (if c ≠ 0) or 0 non-zero entries (if c = 0).

[Margin note: In the notation c00, the “c” refers to the sequences converging and the “00” refers to how they converge to 0 and eventually equal 0.]

1.1.2 Spans, Linear Combinations, and Independence


We now start re-introducing various aspects of linear algebra that we have
already seen in an introductory course in the more concrete setting of subspaces
of Rn . All of these concepts (in particular, spans, linear combinations, and
linear (in)dependence for now) behave almost exactly the same in general
vector spaces as they do in Rn (and its subspaces), so our presentation of these
topics is quite brief.

Definition 1.1.3 (Linear Combinations). Suppose V is a vector space over a field F. A linear combination of the vectors v1, v2, . . . , vk ∈ V is any vector of the form

    c1 v1 + c2 v2 + · · · + ck vk,

where c1, c2, . . . , ck ∈ F.

[Margin note: It is important that the sum presented here is finite.]
For example, in the vector space P 2 of polynomials with degree at most 2,
the polynomial p(x) = 3x2 + x − 2 is a linear combination of the polynomials
x2 , x, and 1 (and in fact every polynomial in P 2 is a linear combination of x2 , x,
and 1). We now start looking at some less straightforward examples.

Example 1.1.11 (Linear Combinations of Polynomials). Suppose f, g, h ∈ P2 are given by the formulas

    f(x) = x2 − 3x − 4,   g(x) = x2 − x + 2,   and   h(x) = 2x2 − 3x + 1.

Determine whether or not f is a linear combination of g and h.


Solution:
We want to know whether or not there exist c1 , c2 ∈ R such that

f (x) = c1 g(x) + c2 h(x).

Writing this equation out more explicitly gives

x2 − 3x − 4 = c1 (x2 − x + 2) + c2 (2x2 − 3x + 1)
= (c1 + 2c2 )x2 + (−c1 − 3c2 )x + (2c1 + c2 ).

We now use the fact that two polynomials are equal if and only if their
coefficients are equal: we set the coefficients of x2 on both sides of the
equation equal to each other, the coefficients of x equal to each other, and
the constant terms equal to each other. This gives us the linear system

1 = c1 + 2c2
−3 = −c1 − 3c2
−4 = 2c1 + c2 .

This linear system can be solved using standard techniques (e.g., Gaussian
elimination) to find the unique solution c1 = −3, c2 = 2. It follows that f
is a linear combination of g and h: f (x) = −3g(x) + 2h(x).
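Since the system has more equations than unknowns, one way to confirm the computation numerically is a least-squares solve followed by a residual check. The following sketch is not part of the book's text; it simply re-derives the coefficients with NumPy:

```python
import numpy as np

# Columns hold the (x^2, x, constant) coefficients of g and h;
# the right-hand side holds the coefficients of f.
M = np.array([[ 1.0,  2.0],    # x^2:       c1 + 2*c2 =  1
              [-1.0, -3.0],    # x:        -c1 - 3*c2 = -3
              [ 2.0,  1.0]])   # constant: 2*c1 +   c2 = -4
b = np.array([1.0, -3.0, -4.0])

c, residual, rank, _ = np.linalg.lstsq(M, b, rcond=None)
print(c)                       # [-3.  2.]  ->  f = -3g + 2h
print(np.allclose(M @ c, b))   # True: the overdetermined system is consistent
```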

Example 1.1.12 (Linear Combinations of Matrices). Determine whether or not the identity matrix I ∈ M2(C) is a linear combination of the three matrices

    X = [ 0  1 ]        Y = [ 0  −i ]             Z = [ 1   0 ]
        [ 1  0 ] ,          [ i   0 ] ,   and         [ 0  −1 ] .

[Margin note: The four matrices I, X, Y, Z ∈ M2(C) are sometimes called the Pauli matrices.]

Solution:
We want to know whether or not there exist c1, c2, c3 ∈ C such that I = c1 X + c2 Y + c3 Z. Writing this matrix equation out more explicitly gives

    [ 1  0 ]        [ 0  1 ]        [ 0  −i ]        [ 1   0 ]
    [ 0  1 ]  = c1  [ 1  0 ]  + c2  [ i   0 ]  + c3  [ 0  −1 ]

                    [ c3         c1 − ic2 ]
                 =  [ c1 + ic2   −c3      ] .

The (1, 1)-entry of the above matrix equation tells us that c3 = 1, but
the (2, 2)-entry tells us that c3 = −1, so this system of equations has no
solution. It follows that I is not a linear combination of X, Y , and Z.
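This inconsistency can also be confirmed numerically. In the sketch below (not part of the book), each 2 × 2 matrix is flattened into a vector in C4 and a least-squares solve shows that no coefficients reproduce I exactly:

```python
import numpy as np

I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Treat each 2x2 matrix as a vector in C^4: the columns of M are X, Y, Z.
M = np.column_stack([X.flatten(), Y.flatten(), Z.flatten()])
b = I.flatten()

c, *_ = np.linalg.lstsq(M, b, rcond=None)
print(np.allclose(M @ c, b))   # False: I is not a linear combination of X, Y, Z
```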

Linear combinations are useful for the way that they combine both of the
vector space operations (vector addition and scalar multiplication)—instead of
phrasing linear algebraic phenomena in terms of those two operations, we can
often phrase them more elegantly in terms of linear combinations.
For example, we saw in Theorem 1.1.2 that a subspace is a subset of a
vector space that is closed under vector addition and scalar multiplication.
Equivalently, we can combine those two operations and just say that a subspace
is a subset of a vector space that is closed under linear combinations (see
Exercise 1.1.11). On the other hand, a subset B of a vector space that is not
closed under linear combinations is necessarily not a subspace.
However, it is often useful to consider the smallest subspace containing B.
We call this smallest subspace the span of B, and to construct it we just take all
linear combinations of members of B:

Definition 1.1.4 (Span). Suppose V is a vector space and B ⊆ V is a set of vectors. The span of B, denoted by span(B), is the set of all (finite!) linear combinations of vectors from B:

    span(B) def= { c1 v1 + c2 v2 + · · · + ck vk : k ∈ N, cj ∈ F and vj ∈ B for all 1 ≤ j ≤ k }.

Furthermore, if span(B) = V then we say that V is spanned by B.

In Rn , for example, the span of a single vector is the line through the origin
in the direction of that vector, and the span of two non-parallel vectors is the
plane through the origin containing those vectors (see Figure 1.1).

[Margin note: Two or more vectors can also span a line, if they are all parallel.]

Figure 1.1: The span of a set of vectors is the smallest subspace that contains all of those vectors. In Rn, this smallest subspace is a line, plane, or hyperplane containing all of the vectors in the set.

When we work in other vector spaces, we lose much of this geometric



interpretation, but algebraically spans still work much like they do in Rn. For example, span(1, x, x2) = P2 (the vector space of polynomials with degree at most 2) since every polynomial f ∈ P2 can be written in the form f(x) = c1 + c2 x + c3 x2 for some c1, c2, c3 ∈ R. Indeed, this is exactly what it means for a polynomial to have degree at most 2. More generally, span(1, x, x2, . . . , x^p) = Pp.
[Margin note: We use span(1, x, x2) as slight shorthand for span({1, x, x2})—we sometimes omit the curly set braces.]

However, it is important to keep in mind that linear combinations are always finite, even if B is not. To illustrate this point, consider the vector space P = span(1, x, x2, x3, . . .), which is the set of all polynomials (of any degree). If we recall from calculus that we can represent the function f(x) = e^x in the form

    e^x = ∑_{n=0}^{∞} x^n / n! = 1 + x + x^2/2 + x^3/6 + x^4/24 + · · · ,

we might expect that e^x ∈ P, since we have written e^x as a sum of scalar multiples of 1, x, x2, and so on. However, e^x ∉ P since e^x can only be written as an infinite sum of polynomials, not a finite one (see Figure 1.2).
[Margin note: This is called a Taylor series for e^x (see Appendix A.2.2).]
[Margin note: While all subspaces of Rn are “closed” (roughly speaking, they contain their edges / limits / boundaries), this example illustrates the fact that some vector spaces (like P) are not.]
[Margin note: To make this idea of a function being on the “boundary” of P precise, we need a way of measuring the distance between functions—we describe how to do this in Sections 1.3.4 and 1.D.]

[Figure 1.2 appears here: nested sets P0 (constant functions) ⊂ P1 (linear functions) ⊂ P2 (quadratic functions) ⊂ P (polynomials) ⊂ F (all real-valued functions), with example elements such as f(x) = 3, 2x − 1, x2 − x + 3, and 2x85 + 3x7 − x2 + 6, and with cos(x) and e^x lying in F but outside P.]
Figure 1.2: The vector spaces P0 ⊂ P1 ⊂ P2 ⊂ · · · ⊂ Pp ⊂ · · · ⊂ P ⊂ F are subspaces of each other in the manner indicated. The vector space P of all polynomials is interesting for the fact that it does not contain its boundary: functions like e^x and cos(x) can be approximated by polynomials and are thus on the boundary of P, but are not polynomials themselves (i.e., they cannot be written as a finite linear combination of 1, x, x2, . . .).

As we suggested earlier, our primary reason for being interested in the span
of a set of vectors is that it is always a subspace. We now state and prove this
fact rigorously.

Theorem 1.1.3 Let V be a vector space and let B ⊆ V. Then span(B) is a subspace of V.
Spans are Subspaces

Proof. We need to check that the two closure properties described by Theorem 1.1.2 are satisfied. We
thus suppose that v, w ∈ span(B) and b ∈ F. Then (by the definition of span(B)) there exist scalars
c1 , c2 , . . . , ck , d1 , d2 , . . . , dℓ ∈ F and vectors v1 , v2 , . . . , vk , w1 , w2 , . . . , wℓ ∈ B such that

    v = c1 v1 + c2 v2 + · · · + ck vk   and   w = d1 w1 + d2 w2 + · · · + dℓ wℓ .

The sets {v1 , . . . , vk } and {w1 , . . . , wℓ } may be disjoint or they may overlap—it does not matter.

To establish property (a) of Theorem 1.1.2, we note that

    v + w = (c1 v1 + c2 v2 + · · · + ck vk ) + (d1 w1 + d2 w2 + · · · + dℓ wℓ ),

which is a linear combination of v1 , v2 , . . . , vk , w1 , w2 , . . . , wℓ , so we conclude that v + w ∈ span(B).
Similarly, to show that property (b) holds, we check that

    bv = b(c1 v1 + c2 v2 + · · · + ck vk ) = (bc1 )v1 + (bc2 )v2 + · · · + (bck )vk ,

which is a linear combination of v1 , v2 , . . . , vk , so bv ∈ span(B). Since properties (a) and (b) are both
satisfied, span(B) is a subspace of V. ∎
Just like subspaces and spans, linear dependence and independence in
general vector spaces are defined almost identically to how they are defined in
Rn . The biggest difference in this more general setting is that it is now useful
to consider linear independence of sets containing infinitely many vectors
(whereas any infinite subset of Rn is necessarily linearly dependent, so the
definition of linear independence in Rn typically bakes finiteness right in).

Definition 1.1.5 Suppose V is a vector space over a field F and B ⊆ V. We say that B is
Linear Dependence linearly dependent if there exist scalars c1 , c2 , . . ., ck ∈ F, at least one of
and Independence which is not zero, and vectors v1 , v2 , . . ., vk ∈ B such that

c1 v1 + c2 v2 + · · · + ck vk = 0.

If B is not linearly dependent then it is called linearly independent.

Notice that we choose finitely many vectors v1 , v2 , . . ., vk in this definition even if B is infinite.

Linear dependence and independence work very similarly to how they did in Rn . The rough intuition
for them is that linearly dependent sets are
“redundant” in the sense that one of the vectors does not point in a “really
different” direction than the other vectors—it can be obtained from the others
via some linear combination. Geometrically, in Rn this corresponds to two or
more vectors lying on a common line, three or more vectors lying on a common
plane, and so on (see Figure 1.3).

Figure 1.3: A set of vectors is linearly independent if and only if each vector
contributes a new direction or dimension.

Furthermore, just as was the case with linear (in)dependence in Rn , we still


have all of the following properties that make working with linear (in)dependence
easier in certain special cases:

• A set of two vectors B = {v, w} is linearly dependent if and only if v and


w are scalar multiples of each other.
• A finite set of vectors B = {v1 , v2 , . . . , vk } is linearly independent if and
only if the equation

c1 v1 + c2 v2 + · · · + ck vk = 0

has a unique solution: c1 = c2 = · · · = ck = 0.


• A (not necessarily finite) set B is linearly dependent if and only if there
exists a vector v ∈ B that can be written as a linear combination of the
other vectors from B.

Example 1.1.13 (Linear Independence of Polynomials). Is the set B = {2x2 + 1, x2 − x + 1} linearly
independent in P 2 ?

Solution:
Since this set contains just two polynomials, it suffices to just check whether or not they are scalar
multiples of each other. Since they are not scalar multiples of each other, B is linearly independent.

Example 1.1.14 (Linear Independence of Matrices). Determine whether or not the following set of
matrices is linearly independent in M2 (R):

    B = { [1 1; 1 1], [1 −1; −1 1], [−1 2; 2 −1] }.

Solution:
Since this set is finite, we want to check whether the equation

    c1 [1 1; 1 1] + c2 [1 −1; −1 1] + c3 [−1 2; 2 −1] = [0 0; 0 0]

has a unique solution (which would necessarily be c1 = c2 = c3 = 0, corresponding to linear
independence) or infinitely many solutions (corresponding to linear dependence). We can solve for
c1 , c2 , and c3 by comparing entries of the matrices on the left- and right-hand sides above to get the
linear system

    c1 + c2 − c3 = 0
    c1 − c2 + 2c3 = 0
    c1 − c2 + 2c3 = 0
    c1 + c2 − c3 = 0.

Solving this linear system via our usual methods reveals that c3 is a free variable (so there are
infinitely many solutions) and c1 = −(1/2)c3 , c2 = (3/2)c3 . It follows that B is linearly dependent,
and in particular, choosing c3 = 2 gives c1 = −1 and c2 = 3, so

    −[1 1; 1 1] + 3[1 −1; −1 1] + 2[−1 2; 2 −1] = [0 0; 0 0].

In general, to find an explicit linear combination that demonstrates linear dependence, we can choose
the free variable(s) to be any non-zero value(s) that we like.
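The computation in this example amounts to finding a non-zero vector in the null space of the 4 × 3
matrix whose columns are the flattened matrices of B. The following short computation is not part of
the original text; it is a sketch of how one might verify the dependence with Python and SymPy.

    import sympy as sp

    # Columns are the matrices of B, flattened entry-by-entry (row-major order).
    A = sp.Matrix([
        [1,  1, -1],   # (1,1)-entries
        [1, -1,  2],   # (1,2)-entries
        [1, -1,  2],   # (2,1)-entries
        [1,  1, -1],   # (2,2)-entries
    ])

    # Any non-zero vector in the null space gives coefficients of a linear dependence.
    null_basis = A.nullspace()
    print(null_basis[0].T)   # e.g. [-1/2, 3/2, 1]; scaling by 2 recovers (-1, 3, 2)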

Example 1.1.15 (Linear Independence of Many Polynomials). Is the set B = {1, x, x2 , . . . , x p } linearly
independent in P p ?

Solution:
Since this set is finite, we want to check whether the equation

    c0 + c1 x + c2 x2 + · · · + c p x p = 0                    (1.1.1)

has a unique solution (linear independence) or infinitely many solutions (linear dependence). By
plugging x = 0 into that equation, we see that c0 = 0. Taking the derivative of both sides of
Equation (1.1.1) then reveals that

    c1 + 2c2 x + 3c3 x2 + · · · + pc p x p−1 = 0,

and plugging x = 0 into this equation gives c1 = 0. By repeating this procedure (i.e., taking the
derivative and then plugging in x = 0) we similarly see that c2 = c3 = · · · = c p = 0, so B is linearly
independent.

All we are doing here is showing that if f ∈ P p satisfies f (x) = 0 for all x then all of its coefficients
equal 0. This likely seems “obvious”, but it is still good to pin down why it is true.

Example 1.1.16 (Linear Independence of an Infinite Set). Is the set B = {1, x, x2 , x3 , . . .} linearly
independent in P?

Solution:
This example is a bit trickier than the previous one since the set contains infinitely many polynomials
(vectors), and we are not asking whether or not there exist scalars c0 , c1 , c2 , . . ., not all equal to zero,
such that

    c0 + c1 x + c2 x2 + c3 x3 + · · · = 0.

Instead, we are asking whether or not there exists some finite linear combination of 1, x, x2 , x3 , . . .
that adds to 0 (and does not have all coefficients equal to 0).

Let p be the largest power of x in such a linear combination—we want to know if there exist (not all
zero) scalars c0 , c1 , c2 , . . . , c p such that

    c0 + c1 x + c2 x2 + · · · + c p x p = 0.

It follows from the exact same argument used in Example 1.1.15 that this equation has the unique
solution c0 = c1 = c2 = · · · = c p = 0, so B is a linearly independent set.

Recall that in Rn it was not possible to have a linearly independent set consisting of infinitely many
vectors (such sets had at most n vectors). This is one way in which general vector spaces can differ
from Rn .

The previous examples could all be solved more or less just by “plugging
and chugging”—we just set up the linear (in)dependence equation

c1 v1 + c2 v2 + · · · + ck vk = 0

and worked through the calculation to solve for c1 , c2 , . . . , ck . We now present


some examples to show that, in some more exotic vector spaces like F , check-
ing linear independence is not always so straightforward, as there is no obvious
or systematic way to solve that linear (in)dependence equation.

Example 1.1.17 (Linear Independence of a Set of Functions). Is the set B = {cos(x), sin(x), sin2 (x)}
linearly independent in F ?

Solution:
Since this set is finite, we want to determine whether or not there exist scalars c1 , c2 , c3 ∈ R (not all
equal to 0) such that

    c1 cos(x) + c2 sin(x) + c3 sin2 (x) = 0.

Plugging in x = 0 tells us that c1 = 0. Then plugging in x = π/2 tells us that c2 + c3 = 0, and plugging
in x = 3π/2 tells us that −c2 + c3 = 0. Solving this system of equations involving c2 and c3 reveals
that c2 = c3 = 0 as well, so B is a linearly independent set.

We could also plug in other values of x to get other equations involving c1 , c2 , c3 .
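The strategy of plugging in sample points can be automated: evaluating the functions at several x
values produces an ordinary linear system in c1 , c2 , c3 , and if that system has only the trivial solution
then the functions are linearly independent. The sketch below is our own illustration in Python (the
choice of sample points is ours), not part of the original text.

    import numpy as np

    # Evaluate each function of B = {cos(x), sin(x), sin^2(x)} at a few sample points.
    xs = np.array([0.0, np.pi / 2, 3 * np.pi / 2, 1.0])
    A = np.column_stack([np.cos(xs), np.sin(xs), np.sin(xs) ** 2])

    # If this matrix has full column rank, the only solution of
    # c1*cos(x) + c2*sin(x) + c3*sin^2(x) = 0 (at these points) is c1 = c2 = c3 = 0,
    # which proves linear independence.  (Rank deficiency would be inconclusive.)
    print(np.linalg.matrix_rank(A))   # expected: 3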

Example 1.1.18 (Linear Dependence of a Set of Functions). Is the set B = {sin2 (x), cos2 (x), cos(2x)}
linearly independent in F ?

Solution:
Since this set is finite, we want to determine whether or not there exist scalars c1 , c2 , c3 ∈ R (not all
equal to 0) such that

    c1 sin2 (x) + c2 cos2 (x) + c3 cos(2x) = 0.

On the surface, this set looks linearly independent, and we could try proving it by plugging in specific
x values to get some equations involving c1 , c2 , c3 (just like in Example 1.1.17). However, it turns out
that this won’t work—the resulting system of linear equations will not have a unique solution,
regardless of the x values we choose. To see why this is the case, recall the trigonometric identity

    cos(2x) = cos2 (x) − sin2 (x).

In particular, this tells us that if c1 = 1, c2 = −1 and c3 = 1, then

    sin2 (x) − cos2 (x) + cos(2x) = 0,

so B is linearly dependent.

This identity follows from the angle-sum identity cos(θ + φ ) = cos(θ ) cos(φ ) − sin(θ ) sin(φ ). In
particular, plug in θ = x and φ = x.

We introduce a somewhat more systematic method of proving linear in-


dependence of a set of functions in F in Exercise 1.1.21. However, checking
linear (in)dependence in general can still be quite difficult.

1.1.3 Bases
In the final two examples of the previous subsection (i.e., Examples 1.1.17 and 1.1.18, which involve
functions in F ), we had to fiddle around and “guess” an approach that would work to show linear
(in)dependence. In Example 1.1.17, we stumbled upon a proof that the set is linearly independent by
luckily picking a bunch of x values that gave us a linear system with a unique solution, and in
Example 1.1.18 we were only able to prove linear dependence because we conveniently already knew
a trigonometric identity that related the given functions to each other.

Another method of proving linear independence in F is explored in Exercise 1.1.21.
The reason that determining linear (in)dependence in F is so much more
difficult than in P p or Mm,n is that we do not have a nice basis for F that we can
work with, whereas we do for the other vector spaces that we have introduced
so far (Rn , polynomials, and matrices). In fact, we have been working with
those nice bases already without even realizing it, and without even knowing
what a basis is in vector spaces other than Rn .

With this in mind, we now introduce bases of arbitrary vector spaces


and start discussing how they make vector spaces easier to work with. Not
surprisingly, they are defined in almost exactly the same way as they are for
subspaces of Rn .

Definition 1.1.6 (Bases). A basis of a vector space V is a set of vectors in V that
    a) spans V, and
    b) is linearly independent.

The prototypical example of a basis is the set consisting of the standard basis vectors
e j = (0, . . . , 0, 1, 0, . . . , 0) ∈ Rn for 1 ≤ j ≤ n, where there is a single 1 in the j-th position and 0s
elsewhere. It is straightforward to show that the set {e1 , e2 , . . . , en } ⊂ Rn is a basis of Rn , and in fact
it is called the standard basis of Rn .

For example, in R3 there are three standard basis vectors: e1 = (1, 0, 0), e2 = (0, 1, 0), and e3 = (0, 0, 1).
As an example of a basis of a vector space other than Rn , consider the
set B = {1, x, x2 , . . . , x p } ⊂ P p . We showed in Example 1.1.15 that this set is
linearly independent, and it spans P p since every polynomial f ∈ P p can be
written as a linear combination of the members of B:

f (x) = c0 + c1 x + c2 x2 + · · · + c p x p .

It follows that B is a basis of P p , and in fact it is called the standard basis of P p .

Similarly, there is also a standard basis of Mm,n . We define Ei, j ∈ Mm,n to be the matrix with a 1 in
its (i, j)-entry and zeros elsewhere (these are called the standard basis matrices). For example, in
M2,2 the standard basis matrices are

    E1,1 = [1 0; 0 0],   E1,2 = [0 1; 0 0],   E2,1 = [0 0; 1 0],   and   E2,2 = [0 0; 0 1].

The set {E1,1 , E1,2 , . . . , Em,n } is a basis (called the standard basis) of Mm,n . This fact is hopefully
believable enough, but it is proved explicitly in Exercise 1.1.13.

Keep in mind that, just like in Rn , bases are very non-unique. If a vector space has one basis then it
has many different bases.

Example 1.1.19 (Strange Basis of Matrices). Recall the matrices

    X = [0 1; 1 0],   Y = [0 −i; i 0],   and   Z = [1 0; 0 −1],

which we introduced in Example 1.1.12. Is the set of matrices B = {I, X,Y, Z} a basis of M2 (C)?

Solution:
We start by checking whether or not span(B) = M2 (C). That is, we determine whether or not an
arbitrary matrix A ∈ M2 (C) can be written as a linear combination of the members of B:

    c1 [1 0; 0 1] + c2 [0 1; 1 0] + c3 [0 −i; i 0] + c4 [1 0; 0 −1] = [a1,1 a1,2 ; a2,1 a2,2 ].

By setting the entries of the matrices on the left- and right-hand sides equal to each other, we arrive
at the system of linear equations

    c1 + c4 = a1,1 ,    c2 − ic3 = a1,2 ,
    c2 + ic3 = a2,1 ,   c1 − c4 = a2,2 .

Keep in mind that a1,1 , a1,2 , a2,1 , and a2,2 are constants in this linear system, whereas the variables
are c1 , c2 , c3 , and c4 .

It is straightforward to check that, no matter what A is, this linear system has the unique solution

    c1 = (a1,1 + a2,2 )/2,    c2 = (a1,2 + a2,1 )/2,
    c3 = i(a1,2 − a2,1 )/2,   c4 = (a1,1 − a2,2 )/2.

It follows that A is always a linear combination of the members of B, so span(B) = M2 (C). Similarly,
since this linear combination is unique (and in particular, it is unique when A = O is the zero matrix),
we see that B is linearly independent and thus a basis of M2 (C).
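As a quick numerical illustration (this sketch is ours, not part of the original text), the unique
coefficients in the basis {I, X,Y, Z} can be computed by solving the 4 × 4 linear system above; the
Python code below does this for a randomly chosen A and compares against the closed-form solution
from the example.

    import numpy as np

    I = np.eye(2)
    X = np.array([[0, 1], [1, 0]], dtype=complex)
    Y = np.array([[0, -1j], [1j, 0]])
    Z = np.array([[1, 0], [0, -1]], dtype=complex)

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))

    # Columns of M are I, X, Y, Z flattened row-by-row; solving M c = vec(A) gives
    # the (unique) coordinates of A with respect to the basis {I, X, Y, Z}.
    M = np.column_stack([P.reshape(-1) for P in (I, X, Y, Z)])
    c = np.linalg.solve(M, A.reshape(-1))

    # The solution should match the closed-form expressions from the example.
    expected = np.array([A[0, 0] + A[1, 1], A[0, 1] + A[1, 0],
                         1j * (A[0, 1] - A[1, 0]), A[0, 0] - A[1, 1]]) / 2
    print(np.allclose(c, expected))   # True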

Example 1.1.20 (Strange Basis of Polynomials). Is the set of polynomials B = {1, x, 2x2 − 1, 4x3 − 3x}
a basis of P 3 ?

Solution:
We start by checking whether or not span(B) = P 3 . That is, we deter-
mine whether or not an arbitrary polynomial a0 + a1 x + a2 x2 + a3 x3 ∈ P 3
can be written as a linear combination of the members of B:
c0 + c1 x + c2 (2x2 − 1) + c3 (4x3 − 3x) = a0 + a1 x + a2 x2 + a3 x3 .

By setting the coefficients of each power of x equal to each other (like we


did in Example 1.1.11), we arrive at the system of linear equations
c0 − c2 = a0 , c1 − 3c3 = a1 ,
2c2 = a2 , 4c3 = a3 .

It is straightforward to check that, no matter what a0 , a1 , a2 , and a3


are, this linear system has the unique solution
    c0 = a0 + (1/2)a2 ,   c1 = a1 + (3/4)a3 ,   c2 = (1/2)a2 ,   c3 = (1/4)a3 .
It follows that a0 + a1 x + a2 x2 + a3 x3 is a linear combination of the mem-
bers of B regardless of the values of a0 , a1 , a2 , and a3 , so span(B) = P 3 .
Similarly, since the solution is unique (and in particular, it is unique when
a0 + a1 x + a2 x2 + a3 x3 is the zero polynomial), we see that B is linearly
independent and thus a basis of P 3 .

Example 1.1.21 (Bases of Even and Odd Polynomials). Let P E and P O be the sets of even and odd
polynomials, respectively:

    P E = { f ∈ P : f (−x) = f (x)}   and   P O = { f ∈ P : f (−x) = − f (x)}.

Show that P E and P O are subspaces of P, and find bases of them.


Solution:
We show that P E is a subspace of P by checking the two closure

properties from Theorem 1.1.2.


a) If f , g ∈ P E then

( f + g)(−x) = f (−x) + g(−x) = f (x) + g(x) = ( f + g)(x),

so f + g ∈ P E too.
b) If f ∈ P E and c ∈ R then

(c f )(−x) = c f (−x) = c f (x) = (c f )(x),

so c f ∈ P E too.
To find a basis of P E , we first notice that {1, x2 , x4 , . . .} ⊂ P E . This set
is linearly independent since it is a subset of the linearly independent set
{1, x, x2 , x3 , . . .} from Example 1.1.16. To see that it spans P E , we notice
that if
f (x) = a0 + a1 x + a2 x2 + a3 x3 + · · · ∈ P E
then f (x) + f (−x) = 2 f (x), so

2 f (x) = f (x) + f (−x)


= (a0 + a1 x + a2 x2 + · · · ) + (a0 − a1 x + a2 x2 − · · · )
= 2(a0 + a2 x2 + a4 x4 + · · · ).

It follows that f (x) = a0 + a2 x2 + a4 x4 + · · · ∈ span(1, x2 , x4 , . . .), so the


set {1, x2 , x4 , . . .} spans P E and is thus a basis of it.
A similar argument shows that P O is a subspace of P with basis
{x, x3 , x5 , . . .}, but we leave the details to Exercise 1.1.14.

Just as was the case in Rn , the reason why we require a basis to span a vec-
tor space V is so that we can write every vector v ∈ V as a linear combination
of those basis vectors, and the reason why we require a basis to be linearly
independent is so that those linear combinations are unique. The following the-
orem pins this observation down, and it roughly says that linear independence
is the property needed to remove redundancies in linear combinations.

Theorem 1.1.4 (Uniqueness of Linear Combinations). Suppose B is a basis of a vector space V. For
every v ∈ V, there is exactly one way to write v as a linear combination of the vectors in B.

Conversely, if B is a set with the property that every vector v ∈ V can be written as a linear
combination of the members of B in exactly one way, then B must be a basis of V (see Exercise 1.1.20).

Proof. Since B spans V, we know that every v ∈ V can be written as a linear combination of the
vectors in B, so all we have to do is use linear independence of B to show that this linear combination
is unique.

Suppose that v could be written as a (finite) linear combination of vectors from B in two ways:

    v = c1 v1 + c2 v2 + · · · + ck vk   and   v = d1 w1 + d2 w2 + · · · + dℓ wℓ .        (1.1.2)

It might be the case that some of the vi vectors equal some of the w j vectors. Let m be the number of
such pairs of equal vectors in these linear combinations, and order the vectors so that vi = wi when
1 ≤ i ≤ m, but vi ≠ w j when i, j > m.

Subtracting these linear combinations from each other then gives

    0 = v − v = ((c1 − d1 )v1 + · · · + (cm − dm )vm )
              + (cm+1 vm+1 + · · · + ck vk ) − (dm+1 wm+1 + · · · + dℓ wℓ ).

Since B is a basis, and thus linearly independent, this linear combination tells us that c1 − d1 = 0, . . .,
cm − dm = 0, cm+1 = · · · = ck = 0, and dm+1 = · · · = dℓ = 0. It follows that the two linear combinations
in Equation (1.1.2) are in fact the same linear combination, which proves uniqueness. ∎
Most properties of bases of subspaces of Rn carry over to bases of arbitrary
vector spaces, as long as those bases are finite. However, we will see in the next
section that vector spaces with infinite bases (like P) can behave somewhat
strangely, and for some vector spaces (like F) it can be difficult to even say
whether or not they have a basis.

Exercises (solutions to starred exercises on page 449)

1.1.1 Determine which of the following sets are and are not subspaces of the indicated vector space.
    ∗(a) The set {(x, y) ∈ R2 : x ≥ 0, y ≥ 0} in R2 .
    (b) The set M_n^{sS} of skew-symmetric matrices (i.e., matrices B satisfying BT = −B) in Mn .
    ∗(c) The set of invertible matrices in M3 (C).
    (d) The set of polynomials f satisfying f (4) = 0 in P 3 .
    ∗(e) The set of polynomials f satisfying f (2) = 2 in P 4 .
    (f) The set of polynomials with even degree in P.
    ∗(g) The set of even functions (i.e., functions f satisfying f (x) = f (−x) for all x ∈ R) in F .
    (h) The set of functions { f ∈ F : f ′(x) + f (x) = 2} in F (here f ′ denotes the derivative of f ).
    ∗(i) The set of functions { f ∈ F : f ′′(x) − 2 f (x) = 0} in F (here f ′′ denotes the second
         derivative of f ).
    (j) The set of functions { f ∈ F : sin(x) f ′(x) + ex f (x) = 0} in F .
    ∗(k) The set of matrices with trace (i.e., sum of diagonal entries) equal to 0.

1.1.2 Determine which of the following sets are and are not linearly independent. If the set is linearly
dependent, explicitly write one of the vectors as a linear combination of the other vectors.
    ∗(a) { [1 2; 3 4] } ⊂ M2
    (b) {1 + x, 1 + x2 , x2 − x} ⊂ P 2
    ∗(c) {sin2 (x), cos2 (x), 1} ⊂ F
    (d) {sin(x), cos(x), 1} ⊂ F
    ∗(e) {sin(x + 1), sin(x), cos(x)} ⊂ F
    (f) {ex , e−x } ⊂ F
    ∗∗(g) {ex , xex , x2 ex } ⊂ F

1.1.3 Determine which of the following sets are and are not bases of the indicated vector space.
    ∗(a) { [1 0; 0 1], [1 1; 0 1], [1 0; 1 1] } ⊂ M2
    (b) { [1 0; 0 1], [1 1; 0 1], [1 0; 1 1], [1 1; 1 1] } ⊂ M2
    ∗(c) { [1 0; 0 1], [1 1; 0 1], [1 0; 1 1], [1 1; 1 0] } ⊂ M2
    (d) {x + 1, x − 1} ⊂ P 1
    ∗(e) {x2 + 1, x + 1, x2 − x} ⊂ P 2
    (f) {x2 + 1, x + 1, x2 − x + 1} ⊂ P 2
    ∗(g) {1, x − 1, (x − 1)2 , (x − 1)3 , . . .} ⊂ P
    (h) {ex , e−x } ⊂ F

1.1.4 Determine which of the following statements are true and which are false.
    (a) The empty set is a vector space.
    ∗(b) Every vector space V ≠ {0} contains a subspace W such that W ≠ V.
    (c) If B and C are subsets of a vector space V with B ⊆ C then span(B) ⊆ span(C).
    ∗(d) If B and C are subsets of a vector space V with span(B) ⊆ span(C) then B ⊆ C.
    (e) Linear combinations must contain only finitely many terms in the sum.
    ∗(f) If B is a subset of a vector space V then the span of B is equal to the span of some finite
         subset of B.
    (g) Bases must contain finitely many vectors.
    ∗(h) A set containing a single vector must be linearly independent.

∗1.1.5 Suppose V is a vector space over a field F. Show that if v ∈ V and c ∈ F then (−c)v = −(cv).

1.1.6 Consider the subset S = {(a + bi, a − bi) : a, b ∈ R} of C2 . Explain why S is a vector space
over the field R, but not over C.

∗∗1.1.7 Show that the sequence space F^N is a vector space.

∗∗1.1.8 Show that the set M_n^{S} of n × n symmetric matrices (i.e., matrices A satisfying AT = A) is
a subspace of Mn .

1.1.9 Let P_n denote the set of n-variable polynomials, P_n^p denote the set of n-variable polynomials
of degree at most p, and HP_n^p denote the set of n-variable homogeneous polynomials (i.e.,
polynomials in which every term has the exact same degree) of degree p, together with the 0 function.
For example, x3 y + xy2 ∈ P_2^4 and x4 y2 z + xy3 z3 ∈ HP_3^7 .
    (a) Show that P_n is a vector space.
    (b) Show that P_n^p is a subspace of P_n .
    (c) Show that HP_n^p is a subspace of P_n^p .
        [Side note: We explore HP_n^p extensively in Section 3.B.]

∗∗1.1.10 Let C be the set of continuous real-valued functions, and let D be the set of differentiable
real-valued functions.
    (a) Briefly explain why C is a subspace of F .
    (b) Briefly explain why D is a subspace of F .

∗∗1.1.11 Let V be a vector space over a field F and let S ⊆ V be non-empty. Show that S is a
subspace of V if and only if it is closed under linear combinations:

    c1 v1 + c2 v2 + · · · + ck vk ∈ S   for all   c1 , c2 , . . . , ck ∈ F and v1 , v2 , . . . , vk ∈ S.

1.1.12 Recall the geometric series

    1 + x + x2 + x3 + · · · = 1/(1 − x)   whenever |x| < 1.

    (a) Why does the above expression not imply that 1/(1 − x) is a linear combination of
        1, x, x2 , . . .?
    (b) Show that the set {1/(1 − x), 1, x, x2 , x3 , . . .} of functions from (−1, 1) to R is linearly
        independent.

∗∗1.1.13 Prove that the standard basis {E1,1 , E1,2 , . . . , Em,n } is indeed a basis of Mm,n .

∗∗1.1.14 We showed in Example 1.1.21 that the set of even polynomials P E is a subspace of P with
{1, x2 , x4 , . . .} as a basis.
    (a) Show that the set of odd polynomials P O is also a subspace of P.
    (b) Show that {x, x3 , x5 , . . .} is a basis of P O .

1.1.15 We showed in Example 1.1.19 that the set of Pauli matrices B = {I, X,Y, Z} is a basis of
M2 (C). Show that it is also a basis of M_2^{H} (the vector space of 2 × 2 Hermitian matrices).

1.1.16 In F^N , let e j = (0, . . . , 0, 1, 0, . . .) be the sequence with a single 1 in the j-th entry, and all
other terms equal to 0.
    (a) Show that {e1 , e2 , e3 , . . .} is a basis of the subspace c00 from Example 1.1.10.
    (b) Explain why {e1 , e2 , e3 , . . .} is not a basis of F^N .

∗1.1.17 Let S1 and S2 be subspaces of a vector space V.
    (a) Show that S1 ∩ S2 is also a subspace of V.
    (b) Provide an example to show that S1 ∪ S2 might not be a subspace of V.

1.1.18 Let B and C be subsets of a vector space V.
    (a) Show that span(B ∩C) ⊆ span(B) ∩ span(C).
    (b) Provide an example for which span(B ∩C) = span(B) ∩ span(C) and another example for
        which span(B ∩C) ⊊ span(B) ∩ span(C).

∗∗1.1.19 Let S1 and S2 be subspaces of a vector space V. The sum of S1 and S2 is defined by

    S1 + S2 = {v + w : v ∈ S1 , w ∈ S2 }.

    (a) If V = R3 , S1 is the x-axis, and S2 is the y-axis, what is S1 + S2 ?
    (b) If V = M2 , M_2^{S} is the subspace of symmetric matrices, and M_2^{sS} is the subspace of
        skew-symmetric matrices (i.e., the matrices B satisfying BT = −B), what is M_2^{S} + M_2^{sS} ?
    (c) Show that S1 + S2 is always a subspace of V.

∗∗1.1.20 Show that the converse of Theorem 1.1.4 holds. That is, show that if V is a vector space and
B is a set with the property that every vector v ∈ V can be written as a linear combination of the
members of B in exactly one way, then B must be a basis of V.

∗∗1.1.21 In the space of functions F , proving linear dependence or independence can be quite
difficult. In this question, we derive one method that can help us if the functions are smooth enough
(i.e., can be differentiated enough). Given a set of n functions f1 , f2 , . . . , fn : R → R that are
differentiable at least n − 1 times, we define the following n × n matrix (note that each entry in this
matrix is a function):

    W (x) = [ f1 (x)          f2 (x)          · · ·   fn (x)
              f1 ′(x)         f2 ′(x)         · · ·   fn ′(x)
              f1 ′′(x)        f2 ′′(x)        · · ·   fn ′′(x)
              ⋮               ⋮                       ⋮
              f1^(n−1) (x)    f2^(n−1) (x)    · · ·   fn^(n−1) (x) ].

The notation f^(n−1) (x) above means the (n − 1)-th derivative of f .
    (a) Show that if f1 , f2 , . . . , fn are linearly dependent, then det(W (x)) = 0 for all x ∈ R.
        [Hint: What can you say about the columns of W (x)?]
        [Side note: The determinant det(W (x)) is called the Wronskian of f1 , f2 , . . . , fn .]
    (b) Use the result of part (a) to show that the set {x, ln(x), sin(x)} is linearly independent.
    (c) Use the result of part (a) to show that the set {1, x, x2 , . . . , xn } is linearly independent.
    (d) This method cannot be used to prove linear dependence! For example, show that the
        functions f1 (x) = x2 and f2 (x) = x|x| are linearly independent, but det(W (x)) = 0 for all x.

∗∗1.1.22 Let a1 , a2 , . . . , an be n distinct real numbers (i.e., ai ≠ a j whenever i ≠ j). Show that the set
of functions {e^{a1 x} , e^{a2 x} , . . . , e^{an x} } is linearly independent. [Hint: Construct the Wronskian of
this set of functions (see Exercise 1.1.21) and notice that it is the determinant of a Vandermonde
matrix.]

1.2 Coordinates and Linear Transformations

We now start investigating what we can do with bases of vector spaces. It is


perhaps not surprising that we can do many of the same things that we do
with bases of subspaces of Rn , like define the dimension of arbitrary vector
spaces and construct coordinate vectors. However, bases really shine in this
more general setting because we can use them to turn any question about a
vector space V (as long as it has a finite basis) into one about Rn or Cn (or Fn ,
where V is a vector space over the field F).
In other words, we spend this section showing that we can think of general
vector spaces (like polynomials or matrices) just as copies of Rn that have just
been written in a different way, and we can think of linear transformations on
these vector spaces just as matrices.

1.2.1 Dimension and Coordinate Vectors


Recall from Theorem 1.1.4 that every vector v in a vector space V can be
written as a linear combination of the vectors from a basis B in exactly one
way. We can interpret this fact as saying that the vectors in a basis each specify
a different direction, and the (unique) coefficients in the linear combination
specify how far v points in each of those directions.
For example, if B = {e1 , e2 } is the standard basis of R2 then when we write v = c1 e1 + c2 e2 , the
coefficient c1 tells us how far v points in the direction of e1 (i.e., along the x-axis) and c2 tells us how
far v points in the direction of e2 (i.e., along the y-axis). In other words, v = (c1 , c2 ). A similar
interpretation works with other bases of R2 (see Figure 1.4) and with bases of other vector spaces.

Figure 1.4: When we write a vector as a linear combination of basis vectors, the
coefficients of that linear combination tell us how far it points in the direction of
those basis vectors.

It is often easier to just keep track of and work with these coefficients c1 , c2 , . . ., cn in a linear
combination, rather than the original vector v itself. However, there is one technicality that we have
to deal with before we can do this: the fact that bases are sets of vectors, and sets do not care about
order. For example, {e1 , e2 } and {e2 , e1 } are both the same standard basis of R2 . However, we want
to be able to talk about things like the “first” vector in a basis and the “third” coefficient in a linear
combination. For this reason, we typically consider bases to be ordered—even though they are written
using the same notation as sets, their order really is meant as written.

We abuse notation a bit when using bases. Order does not matter in sets like {e1 , e2 }, but it matters if
that set is a basis.

Definition 1.2.1 (Coordinate Vectors). Suppose V is a vector space over a field F with a finite
(ordered) basis B = {v1 , v2 , . . . , vn }, and v ∈ V. Then the unique scalars c1 , c2 , . . ., cn ∈ F for which

    v = c1 v1 + c2 v2 + · · · + cn vn

are called the coordinates of v with respect to B, and the vector

    [v]B = (c1 , c2 , . . . , cn )

is called the coordinate vector of v with respect to B.

It is worth noting that coordinate vectors in Rn (or Fn in general) with


respect to the standard basis are equal to those vectors themselves: if B =
{e1 , e2 , . . . , en } then [v]B = v for all v ∈ Rn , since the first entry of v is the
coefficient of e1 , the second entry of v is the coefficient of e2 , and so on. We
can also construct coordinate vectors with respect to different bases and of
vectors from other vector spaces though.
For example, the coordinate vector of 3 + 2x + x2 ∈ P 2 with respect to the
basis B = {1, x, x2 } is [3 + 2x + x2 ]B = (3, 2, 1): we just place the coefficients
of the polynomial in a vector in R3 , and this vector completely specifies the
polynomial. More generally, the coordinate vector of the polynomial a0 + a1 x +
a2 x2 + · · · + a p x p with respect to the standard basis B = {1, x, x2 , . . . , x p } is

    [a0 + a1 x + a2 x2 + · · · + a p x p ]B = (a0 , a1 , a2 , . . . , a p ).
The key idea behind coordinate vectors is that they let us use bases to
treat every vector space with a finite basis just like we treat Fn (where F is
the ground field and n is the number of vectors in the basis). In particular,
coordinate vectors with respect to any basis B interact with vector addition and
scalar multiplication how we would expect them to (see Exercise 1.2.22):
[v + w]B = [v]B + [w]B and [cv]B = c[v]B for all v, w ∈ V, c ∈ F.
It follows that we can answer pretty much any linear algebraic question
about vectors in an n-dimensional vector space V just by finding their coordinate
vectors and then answering the corresponding question in Fn (which we can do
via techniques from introductory linear algebra). For example, a set is linearly
independent if and only if the set consisting of their coordinate vectors is
linearly independent (see Exercise 1.2.23).

Example 1.2.1 (Checking Linear (In)Dependence via Coordinate Vectors). Show that the following
sets of vectors C are linearly independent in the indicated vector space V:
    a) V = R4 , C = {(0, 2, 0, 1), (0, −1, 1, 2), (3, −2, 1, 0)},

    b) V = P 3 , C = {2x + x3 , −x + x2 + 2x3 , 3 − 2x + x2 }, and
    c) V = M2 , C = { [0 2; 0 1], [0 −1; 1 2], [3 −2; 1 0] }.
Solutions:
a) Recall that C is linearly independent if and only if the linear system

c1 (0, 2, 0, 1) + c2 (0, −1, 1, 2) + c3 (3, −2, 1, 0) = (0, 0, 0, 0)

has a unique solution. Explicitly, this linear system has the form
3c3 = 0
2c1 − c2 − 2c3 = 0
c2 + c3 = 0
c1 + 2c2 =0

       which can be solved via Gaussian elimination to see that the unique solution is c1 = c2 = c3 = 0,
       so C is indeed a linearly independent set. (Need a refresher on solving linear systems? See
       Appendix A.1.1.)
    b) We could check linear independence directly via the methods of Section 1.1.2, but we instead
       compute the coordinate vectors of the members of C with respect to the standard basis
       B = {1, x, x2 , x3 } of P 3 :

           [2x + x3 ]B = (0, 2, 0, 1),
           [−x + x2 + 2x3 ]B = (0, −1, 1, 2), and
           [3 − 2x + x2 ]B = (3, −2, 1, 0).

These coordinate vectors are exactly the vectors from part (a), which
we already showed are linearly independent in R4 , so C is linearly
independent in P 3 as well.
c) Again, we start by computing the coordinate vectors of the members
of C with respect to the standard basis B = {E1,1 , E1,2 , E2,1 , E2,2 } of
M2 :
           [[0 2; 0 1]]B = (0, 2, 0, 1),
           [[0 −1; 1 2]]B = (0, −1, 1, 2), and
           [[3 −2; 1 0]]B = (3, −2, 1, 0).

       (The notation here is quite unfortunate. The inner square brackets indicate that these are
       matrices, while the outer square brackets denote the coordinate vectors of these matrices.)

Again, these coordinate vectors are exactly the vectors from part (a),
which we already showed are linearly independent in R4 , so C is
linearly independent in M2 .
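Once coordinate vectors have been computed, linear independence can be checked numerically by
computing the rank of the matrix whose columns are those coordinate vectors. The following brief
Python sketch is our own illustration (not part of the text) for the set from this example.

    import numpy as np

    # Coordinate vectors of the three polynomials/matrices with respect to the
    # standard basis; each one becomes a column of A.
    A = np.array([[0,  0,  3],
                  [2, -1, -2],
                  [0,  1,  1],
                  [1,  2,  0]])

    # Three columns and rank 3 means the only solution of A c = 0 is c = 0,
    # i.e., the set C is linearly independent.
    print(np.linalg.matrix_rank(A))   # 3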

The above example highlights the fact that we can really think of R4 , P 3 , and M2 as “essentially the
same” vector spaces, just dressed up and displayed differently: we can work with the polynomial
a0 + a1 x + a2 x2 + a3 x3 in the same way that we work with the vector (a0 , a1 , a2 , a3 ) ∈ R4 , which we
can work with in the same way as the matrix [a0 a1 ; a2 a3 ] ∈ M2 .

However, it is also often useful to represent vectors in bases other than the
standard basis, so we briefly present an example that illustrates how to do this.

Example 1.2.2 (Coordinate Vectors in Weird Bases). Find the coordinate vector of 2 + 7x + x2 ∈ P 2
with respect to the basis B = {x + x2 , 1 + x2 , 1 + x}.

Solution:
We want to find scalars c1 , c2 , c3 ∈ R such that

    2 + 7x + x2 = c1 (x + x2 ) + c2 (1 + x2 ) + c3 (1 + x).

(We did not explicitly prove that B is a basis of P 2 here—try to convince yourself that it is.)

By matching up coefficients of powers of x on the left- and right-hand sides above, we arrive at the
following system of linear equations:

    c2 + c3 = 2
    c1 + c3 = 7
    c1 + c2 = 1

To find c1 , c2 , c3 , just apply Gaussian elimination like usual. This linear system has c1 = 3, c2 = −2,
c3 = 4 as its unique solution, so our desired coordinate vector is

    [2 + 7x + x2 ]B = (c1 , c2 , c3 ) = (3, −2, 4).
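The same computation can be phrased as solving a single linear system whose coefficient matrix has
as its columns the standard-basis coordinate vectors of the basis polynomials. The short Python sketch
below is ours (not from the text).

    import numpy as np

    # Columns: coordinate vectors of x + x^2, 1 + x^2, 1 + x with respect to the
    # standard basis {1, x, x^2} of P^2.
    B = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])

    v = np.array([2, 7, 1])          # 2 + 7x + x^2 in the standard basis

    print(np.linalg.solve(B, v))     # [ 3. -2.  4.]  = coordinates with respect to B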

By making use of coordinate vectors, we can leech many of our results


concerning bases of subspaces of Rn directly into this more general setting. For
example, the following theorem tells us that we can roughly think of a spanning
set as one that is “big” and a linearly independent set as one that is “small” (just
like we did when we worked with these concepts for subspaces of Rn ), and bases
are exactly the sweet spot in the middle where these two properties meet.

Theorem 1.2.1 (Linearly Independent Sets Versus Spanning Sets). Suppose n ≥ 1 is an integer and V
is a vector space with a basis B consisting of n vectors.
    a) Any set of more than n vectors in V must be linearly dependent, and
    b) any set of fewer than n vectors cannot span V.

Proof. For property (a), suppose that a set C has m > n vectors, which we call
v1 , v2 , . . . , vm . To see that C is necessarily linearly dependent, we must show
that there exist scalars c1 , c2 , . . ., cm , not all equal to zero, such that

c1 v1 + c2 v2 + · · · + cm vm = 0.

If we compute the coordinate vector of both sides of this equation with


respect to B then we see that it is equivalent to

c1 [v1 ]B + c2 [v2 ]B + · · · + cm [vm ]B = 0,

which is a homogeneous system of n linear equations in m variables. Since


m > n, this is a “short and fat” linear system (i.e., its coefficient matrix has

more columns than rows) and thus must have infinitely many solutions. In
particular, it has at least one non-zero solution, from which it follows that C is
linearly dependent.
Part (b) is proved in a similar way and thus left as Exercise 1.2.36. 
By recalling that a basis of a vector space V is a set that both spans V and
is linearly independent, we immediately get the following corollary:

Corollary 1.2.2 (Uniqueness of Size of Bases). Suppose n ≥ 1 is an integer and V is a vector space
with a basis consisting of n vectors. Then every basis of V has exactly n vectors.

For example, we saw in Example 1.1.19 that the set

    { [1 0; 0 1], [0 1; 1 0], [0 −i; i 0], [1 0; 0 −1] }

is a basis of M2 (C), which is consistent with Corollary 1.2.2 since the standard
basis {E1,1 , E1,2 , E2,1 , E2,2 } also contains exactly 4 matrices. In a sense, this
tells us that M2 (C) contains 4 “degrees of freedom” or requires 4 (complex)
numbers to describe each matrix that it contains. This quantity (the number of
vectors in any basis) gives a useful description of the “size” of the vector space
that we are working with, so we give it a name:

Definition 1.2.2 (Dimension of a Vector Space). A vector space V is called...
    a) finite-dimensional if it has a finite basis. Its dimension, denoted by dim(V), is the number
       of vectors in any of its bases.
    b) infinite-dimensional if it does not have a finite basis, and we write dim(V) = ∞.

For example, since Rn has {e1 , e2 , . . . , en } as a basis, which has n vectors,


we see that dim(Rn ) = n. Similarly, the standard basis {E1,1 , E1,2 , . . . , Em,n } of
Mm,n contains mn vectors (matrices), so dim(Mm,n ) = mn. We summarize the
dimensions of some of the other commonly-occurring vector spaces that we
have been investigating in Table 1.1.

Vector space V Standard basis dim(V)


Fn {e1 , e2 , . . . , en } n
Mm,n {E1,1 , E1,2 , . . . , Em,n } mn
Pp {1, x, x2 , . . . , x p } p+1
P {1, x, x2 , . . .} ∞
F – ∞

Table 1.1: The standard basis and dimension of some common vector spaces.

There is one special case that is worth special attention, and that is the vector space V = {0} (we
refer to {0} as the zero vector space). The only basis of this vector space is the empty set {}, not {0}
as we might first guess, since any set containing 0 is necessarily linearly dependent. Since the empty
set contains no vectors, we conclude that dim({0}) = 0.
If we already know the dimension of the vector space that we are working
with, it becomes much simpler to determine whether or not a given set is a

basis of it. In particular, if a set contains the right number of vectors (i.e., its
size coincides with the dimension of the vector space) then we can show that it
is a basis just by showing that it is linearly independent or spanning—the other
property comes for free (see Exercise 1.2.27).

Remark 1.2.1 (Existence of Bases?). Notice that we have not proved a theorem saying that all vector
spaces have bases (whereas this fact is true of subspaces of Rn and is typically proved in this case in
introductory linear algebra textbooks). The reason
for this omission is that constructing bases is much more complicated in
arbitrary (potentially infinite-dimensional) vector spaces.
While it is true that all finite-dimensional vector spaces have bases (in
fact, this is baked right into Definition 1.2.2(a)), it is much less clear what
a basis of (for example) the vector space F of real-valued functions would
look like. We could try sticking all of the “standard” functions that we
know of into a set like

{1, x, x2 , . . . , |x|, ln(x2 + 1), cos(x), sin(x), 1/(3 + 2x2 ), e6x , . . .},

but there will always be lots of functions left over that are not linear
combinations of these familiar functions.
It turns out that the existence of bases depends on something called the “axiom of choice”, which is a
mathematical axiom that is independent of the other set-theoretic underpinnings of modern
mathematics. In other words, we can neither prove that every vector space has a basis, nor can we
construct a vector space that does not have one.

Most (but not all!) mathematicians accept the axiom of choice, and thus would say that every vector
space has a basis. However, the axiom of choice is non-constructive, so we still cannot actually write
down explicit examples of bases in many infinite-dimensional vector spaces.

From a practical point of view, this means that it is simply not possible to write down a basis of many
vector spaces like F , even if they exist. Even in the space C of continuous real-valued functions, any
basis necessarily contains some extremely hideous and pathological functions. Roughly speaking, the
reason for this is that any finite linear combination of “nice” functions will still be “nice”, but there
are many “not nice” continuous functions out there.

For example, there are continuous functions that are nowhere differentiable (i.e., they do not have a
derivative anywhere). This means that no matter how much we zoom in on the graph of the function,
it never starts looking like a straight line, but rather looks jagged at all zoom levels. Every basis of C
must contain strange functions like these, and an explicit example is the Weierstrass function W
defined by

    W (x) = ∑_{n=1}^{∞} cos(2^n x)/2^n ,

whose graph is displayed below. (Again, keep in mind that this does not mean that W is a linear
combination of cos(2x), cos(4x), cos(8x), . . ., since this sum has infinitely many terms in it.)

[Graph of the Weierstrass function W (x) on the interval −2 ≤ x ≤ 2.]

1.2.2 Change of Basis


When we change the basis B that we are working with, the resulting coordinate
vectors change as well. For instance, we showed in Example 1.2.2 that, with
respect to a certain basis B of P 2 , we have

    [2 + 7x + x2 ]B = (3, −2, 4).

However, if C = {1, x, x2 } is the standard basis of P 2 then we have

    [2 + 7x + x2 ]C = (2, 7, 1).

In order to convert coordinate vectors between two bases B and C, we could


of course compute v from [v]B and then compute [v]C from v. However, there
is a “direct” way of doing this conversion that avoids the intermediate step of
computing v.

Definition 1.2.3 (Change-of-Basis Matrix). Suppose V is a vector space with bases
B = {v1 , v2 , . . . , vn } and C. The change-of-basis matrix from B to C, denoted by PC←B , is the n × n
matrix whose columns are the coordinate vectors [v1 ]C , [v2 ]C , . . . , [vn ]C :

    PC←B = [ [v1 ]C | [v2 ]C | · · · | [vn ]C ].

It is worth emphasizing the fact that in the above definition, we place the
coordinate vectors [v1 ]C , [v2 ]C , . . ., [vn ]C into the matrix PC←B as its columns,
not its rows. Nonetheless, when considering those vectors in isolation, we still
write them using round parentheses and in a single row, like [v2 ]C = (1, 4, −3),
just as we have been doing up until this point. The reason for this is that vectors
(i.e., members of Fn ) do not have a shape—they are just lists of numbers, and
we can arrange those lists however we like.
The change-of-basis matrix from B to C, as its name suggests, converts
coordinate vectors with respect to B into coordinate vectors with respect to C.
That is, we have the following theorem:

Theorem 1.2.3 (Change-of-Basis Matrices). Suppose B and C are bases of a finite-dimensional vector
space V, and let PC←B be the change-of-basis matrix from B to C. Then
    a) PC←B [v]B = [v]C for all v ∈ V, and
    b) PC←B is invertible and (PC←B )−1 = PB←C .
Furthermore, PC←B is the unique matrix with property (a).

Proof. Let B = {v1 , v2 , . . . , vn } so that we have names for the vectors in B.

For property (a), suppose v ∈ V and write v = c1 v1 + c2 v2 + · · · + cn vn , so that [v]B = (c1 , c2 , . . . , cn ).
We can then directly compute

    PC←B [v]B = [ [v1 ]C | · · · | [vn ]C ] (c1 , . . . , cn )    (definition of PC←B )
              = c1 [v1 ]C + · · · + cn [vn ]C                    (block matrix multiplication)
              = [c1 v1 + · · · + cn vn ]C                        (by Exercise 1.2.22)
              = [v]C .                                           (v = c1 v1 + c2 v2 + · · · + cn vn )

To see that property (b) holds, we just note that using property (a) twice tells us that
PB←C PC←B [v]B = [v]B for all v ∈ V, so PB←C PC←B = I, which implies (PC←B )−1 = PB←C .

Finally, to see that PC←B is the unique matrix satisfying property (a), suppose P ∈ Mn is any matrix
for which P[v]B = [v]C for all v ∈ V. For every 1 ≤ j ≤ n, if v = v j then we see that [v]B = [v j ]B = e j
(the j-th standard basis vector in Fn ), so P[v]B = Pe j is the j-th column of P. On the other hand, it is
also the case that P[v]B = [v]C = [v j ]C . The j-th column of P thus equals [v j ]C for each 1 ≤ j ≤ n, so
P = PC←B . ∎

The two properties described by the above theorem are illustrated in Fig-
ure 1.5: the bases B and C provide two different ways of making the vector
space V look like Fn , and the change-of-basis matrices PC←B and PB←C convert
these two different representations into each other.

One of the advantages of using change-of-basis matrices, rather than computing each coordinate
vector directly, is that we can re-use change-of-basis matrices to change multiple vectors between
bases.

[Figure 1.5 shows a vector v in V being sent to its coordinate vectors [v]B and [v]C in Fn via the maps
[ · ]B and [ · ]C , with PC←B mapping [v]B to [v]C and (PC←B )−1 = PB←C mapping back.]

Figure 1.5: A visualization of the relationship between vectors, their coordinate vectors, and
change-of-basis matrices. There are many different bases that let us think of V as Fn , and
change-of-basis matrices let us convert between them.

Example 1.2.3 (Computing a Change-of-Basis Matrix). Find the change-of-basis matrices PB←C and
PC←B for the bases

    B = {x + x2 , 1 + x2 , 1 + x}   and   C = {1, x, x2 }

of P 2 . Then find the coordinate vector of 2 + 7x + x2 with respect to B.

Solution:
To start, we find the coordinate vectors of the members of B with respect to the basis C. Since C is
the standard basis of P 2 , we can eyeball these coordinate vectors:

    [x + x2 ]C = (0, 1, 1),   [1 + x2 ]C = (1, 0, 1),   and   [1 + x]C = (1, 1, 0).

These vectors are the columns of the change-of-basis matrix PC←B :

    PC←B = [0 1 1; 1 0 1; 1 1 0].

The simplest method for finding PB←C from here is to compute the inverse of PC←B :

    PB←C = (PC←B )−1 = (1/2) [−1 1 1; 1 −1 1; 1 1 −1].

(This inverse can be found by row-reducing [ A | I ] to [ I | A−1 ]; see Appendix A.1.3.)

To then find [2 + 7x + x2 ]B , we just multiply [2 + 7x + x2 ]C (which is easy to compute, since C is the
standard basis) by PB←C :

    [2 + 7x + x2 ]B = PB←C [2 + 7x + x2 ]C = (1/2) [−1 1 1; 1 −1 1; 1 1 −1] (2, 7, 1) = (3, −2, 4),

which is the same answer that we found in Example 1.2.2.
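Numerically, this entire computation amounts to building PC←B column-by-column and then inverting
(or, better, solving a linear system). The brief Python sketch below is our own illustration of
Example 1.2.3, not part of the text.

    import numpy as np

    # Columns of P_{C<-B}: coordinate vectors of x + x^2, 1 + x^2, 1 + x with
    # respect to the standard basis C = {1, x, x^2}.
    P_C_B = np.array([[0, 1, 1],
                      [1, 0, 1],
                      [1, 1, 0]])

    P_B_C = np.linalg.inv(P_C_B)          # change-of-basis matrix from C to B

    v_C = np.array([2, 7, 1])             # [2 + 7x + x^2]_C
    print(P_B_C @ v_C)                    # [ 3. -2.  4.]  = [2 + 7x + x^2]_B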

The previous example was not too difficult since C happened to be the standard basis of P 2 , so we
could compute PC←B just by “eyeballing” the standard basis coefficients, and we could then compute
PB←C just by taking the inverse of PC←B . However, if C weren’t the standard basis, computing the
columns of PC←B would have been much more difficult—each column would require us to solve a
linear system.

Generally, converting vectors into the standard basis is much easier than converting into other bases.
A quicker and easier way to compute a change-of-basis matrix is to change
from the input basis B to the standard basis E (if one exists in the vector space
being considered), and then change from E to the output basis C, as described
by the following theorem:

Theorem 1.2.4 (Computing Change-of-Basis Matrices, Method 1). Let V be a finite-dimensional
vector space with bases B, C, and E. Then

    PC←B = PC←E PE←B .

Proof. We just use the properties of change-of-basis matrices that we know from Theorem 1.2.3. If
v ∈ V is any vector then

    (PC←E PE←B )[v]B = PC←E (PE←B [v]B ) = PC←E [v]E = [v]C .

It then follows from uniqueness of change-of-basis matrices that PC←E PE←B = PC←B . ∎

Think of the “E”s in the middle of PC←B = PC←E PE←B as “canceling out”. The notation was designed
specifically so that this works.

To actually make use of the above theorem, we note that if E is chosen to be


the standard basis of V then PE←B can be computed by eyeballing coefficients,
and PC←E can be computed by first eyeballing the entries of PE←C and then
inverting (like in Example 1.2.3).

Example 1.2.4 (Computing a Change-of-Basis Matrix Between Ugly Bases). Use Theorem 1.2.4 to
find the change-of-basis matrix PC←B , where

    B = { [1 0; 0 1], [0 1; 1 0], [0 −i; i 0], [1 0; 0 −1] }   and
    C = { [1 0; 0 0], [1 1; 0 0], [1 1; 1 0], [1 1; 1 1] }

are bases of M2 (C). Then compute [v]C if [v]B = (1, 2, 3, 4).

We showed that the set B is indeed a basis of M2 (C) in Example 1.1.19. You should try to convince
yourself that C is also a basis.

Solution:
To start, we compute the change-of-basis matrices from B and C into the standard basis

    E = { [1 0; 0 0], [0 1; 0 0], [0 0; 1 0], [0 0; 0 1] },

which can be done by inspection:

    [[1 0; 0 1]]E = (1, 0, 0, 1),      [[0 1; 1 0]]E = (0, 1, 1, 0),
    [[0 −i; i 0]]E = (0, −i, i, 0),    and   [[1 0; 0 −1]]E = (1, 0, 0, −1).

Coordinate vectors of matrices with respect to the standard basis can be obtained just by writing the
entries of the matrices, in order, row-by-row.

We then get PE←B by placing these vectors into a matrix as its columns:

    PE←B = [1 0 0 1; 0 1 −i 0; 0 1 i 0; 1 0 0 −1].

A similar procedure can be used to compute PE←C , and then we find PC←E by inverting:

    PE←C = [1 1 1 1; 0 1 1 1; 0 0 1 1; 0 0 0 1],   so
    PC←E = (PE←C )−1 = [1 −1 0 0; 0 1 −1 0; 0 0 1 −1; 0 0 0 1].

To put this all together and compute PC←B , we multiply:

    PC←B = PC←E PE←B = [1 −1 0 0; 0 1 −1 0; 0 0 1 −1; 0 0 0 1] [1 0 0 1; 0 1 −i 0; 0 1 i 0; 1 0 0 −1]
                     = [1 −1 i 1; 0 0 −2i 0; −1 1 i 1; 1 0 0 −1].

Finally, since we were asked for [v]C if [v]B = (1, 2, 3, 4), we compute

    [v]C = PC←B [v]B = [1 −1 i 1; 0 0 −2i 0; −1 1 i 1; 1 0 0 −1] (1, 2, 3, 4) = (3 + 3i, −6i, 5 + 3i, −3).

The previous example perhaps seemed a bit long and involved. Indeed, to
make use of Theorem 1.2.4, we have to invert PE←C and then multiply two
matrices together. The following theorem tells us that we can reduce the amount
of work required slightly by instead row reducing a certain cleverly-chosen
matrix based on PE←B and PE←C . This takes about the same amount of time
as inverting PE←C , and lets us avoid the matrix multiplication step that comes
afterward.

Corollary 1.2.5 (Computing Change-of-Basis Matrices, Method 2). Let V be a finite-dimensional
vector space with bases B, C, and E. Then the reduced row echelon form of the augmented matrix

    [ PE←C | PE←B ]   is   [ I | PC←B ].

To help remember this corollary, notice that [PE←C | PE←B ] has C and B in
the same order as [I | PC←B ], and we start with [PE←C | PE←B ] because changing
into the standard basis E is easy.

Proof of Corollary 1.2.5. Recall that PE←C is invertible, so its reduced row
echelon form is I and thus the RREF of [ PE←C | PE←B ] has the form [ I | X ]
for some matrix X.
To see that X = PC←B (and thus complete the proof), recall that sequences of
elementary row operations correspond to multiplication on the left by invertible
matrices (see Appendix A.1.3), so there is an invertible matrix Q such that

Q[ PE←C | PE←B ] = [ I | X ].

On the other hand, block matrix multiplication shows that

Q[ PE←C | PE←B ] = [ QPE←C | QPE←B ].

Comparing the left blocks of these matrices shows that QPE←C = I, so Q = (PE←C )−1 = PC←E .
Comparing the right blocks of these matrices then shows that X = QPE←B = PC←E PE←B = PC←B , as
claimed. ∎

Example 1.2.5 (Computing a Change-of-Basis Matrix via Row Operations). Use Corollary 1.2.5 to
find the change-of-basis matrix PC←B , where

    B = { [1 0; 0 1], [0 1; 1 0], [0 −i; i 0], [1 0; 0 −1] }   and
    C = { [1 0; 0 0], [1 1; 0 0], [1 1; 1 0], [1 1; 1 1] }

are bases of M2 (C).

Solution:
Recall that we already computed PE←B and PE←C in Example 1.2.4:

    PE←B = [1 0 0 1; 0 1 −i 0; 0 1 i 0; 1 0 0 −1]   and   PE←C = [1 1 1 1; 0 1 1 1; 0 0 1 1; 0 0 0 1].

To compute PC←B , we place these matrices side-by-side in an augmented matrix and then row reduce:

    [ PE←C | PE←B ] = [ 1 1 1 1 |  1  0   0   1 ]
                      [ 0 1 1 1 |  0  1  −i   0 ]
                      [ 0 0 1 1 |  0  1   i   0 ]
                      [ 0 0 0 1 |  1  0   0  −1 ]

      R1 − R2  →      [ 1 0 0 0 |  1 −1   i   1 ]
                      [ 0 1 1 1 |  0  1  −i   0 ]
                      [ 0 0 1 1 |  0  1   i   0 ]
                      [ 0 0 0 1 |  1  0   0  −1 ]

      R2 − R3  →      [ 1 0 0 0 |  1 −1   i   1 ]
                      [ 0 1 0 0 |  0  0 −2i   0 ]
                      [ 0 0 1 1 |  0  1   i   0 ]
                      [ 0 0 0 1 |  1  0   0  −1 ]

      R3 − R4  →      [ 1 0 0 0 |  1 −1   i   1 ]
                      [ 0 1 0 0 |  0  0 −2i   0 ]
                      [ 0 0 1 0 | −1  1   i   1 ]
                      [ 0 0 0 1 |  1  0   0  −1 ]

                    = [ I | PC←B ].

Since the left block is now the identity matrix I, it follows from Corollary 1.2.5 that the right block is
PC←B :

    PC←B = [1 −1 i 1; 0 0 −2i 0; −1 1 i 1; 1 0 0 −1],

which agrees with the answer that we found in Example 1.2.4.
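In exact arithmetic this row reduction can be reproduced with a computer algebra system; for
instance, the following Python/SymPy sketch (ours, not the book's) row reduces [ PE←C | PE←B ] and
reads off PC←B .

    import sympy as sp

    i = sp.I
    P_E_B = sp.Matrix([[1, 0,  0,  1],
                       [0, 1, -i,  0],
                       [0, 1,  i,  0],
                       [1, 0,  0, -1]])
    P_E_C = sp.Matrix([[1, 1, 1, 1],
                       [0, 1, 1, 1],
                       [0, 0, 1, 1],
                       [0, 0, 0, 1]])

    # Row reduce the augmented matrix [ P_E_C | P_E_B ]; by Corollary 1.2.5 the
    # right block of the RREF is P_C_B.
    rref, _ = P_E_C.row_join(P_E_B).rref()
    P_C_B = rref[:, 4:]
    print(P_C_B)   # Matrix([[1, -1, I, 1], [0, 0, -2*I, 0], [-1, 1, I, 1], [1, 0, 0, -1]])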



1.2.3 Linear Transformations


When working with vectors in Rn , we often think of matrices as functions that move vectors around.
That is, we fix a matrix A ∈ Mm,n and consider the function T : Rn → Rm defined by T (v) = Av.
When we do this, many of the usual linear algebraic properties of the matrix A can be interpreted as
properties of the function T : the rank of A is the dimension of the range of T , the absolute value of
det(A) is a measure of how much T expands space, and so on.

Whenever we multiply a matrix by a vector, we interpret that vector as a column vector so that the
matrix multiplication makes sense.

Any function T that is constructed in this way inherits some nice properties from matrix
multiplication: T (v + w) = A(v + w) = Av + Aw = T (v) + T (w) for all vectors v, w ∈ Rn , and
similarly T (cv) = A(cv) = cAv = cT (v) for all vectors v ∈ Rn and all scalars c ∈ R. We now explore
functions between arbitrary vector spaces that have these same two properties.

Definition 1.2.4 (Linear Transformations). Let V and W be vector spaces over the same field F. A
linear transformation is a function T : V → W that satisfies the following two properties:
    a) T (v + w) = T (v) + T (w) for all v, w ∈ V, and
    b) T (cv) = cT (v) for all v ∈ V and c ∈ F.

All of our old examples of linear transformations (i.e., matrices) acting on


V = Rn satisfy this definition, and our intuition concerning how they move
vectors around Rn still applies. However, there are linear transformations acting
on other vector spaces that perhaps seem a bit surprising at first, so we spend
some time looking at examples.

Example 1.2.6 (The Transpose is a Linear Transformation). Show that the matrix transpose is a linear
transformation. That is, show that the function T : Mm,n → Mn,m defined by T (A) = AT is a linear
transformation.
Solution:
We need to show that the two properties of Definition 1.2.4 hold. That
is, we need to show that (A + B)T = AT + BT and (cA)T = cAT for all
A, B ∈ Mm,n and c ∈ F. Both of these properties follow almost immediately
from the definition, so we do not dwell on them.

Example 1.2.7 (The Trace is a Linear Transformation). The trace is the function tr : Mn (F) → F that
adds up the diagonal entries of a matrix:

    tr(A) = a1,1 + a2,2 + · · · + an,n   for all   A ∈ Mn (F).

Show that the trace is a linear transformation.


Solution:
We need to show that the two properties of Definition 1.2.4 hold (here the notation [A + B]i, j means
the (i, j)-entry of A + B):
    a) tr(A + B) = [A + B]1,1 + · · · + [A + B]n,n = a1,1 + b1,1 + · · · + an,n + bn,n
                 = (a1,1 + · · · + an,n ) + (b1,1 + · · · + bn,n ) = tr(A) + tr(B), and
    b) tr(cA) = [cA]1,1 + · · · + [cA]n,n = ca1,1 + · · · + can,n = c tr(A).
It follows that the trace is a linear transformation.

Example 1.2.8 (The Derivative is a Linear Transformation). Show that the derivative is a linear
transformation. That is, show that the function D : D → F defined by D( f ) = f ′ is a linear
transformation.
Solution:
We need to show that the two properties of Definition 1.2.4 hold. That is, we need to show that
( f + g)′ = f ′ + g′ and (c f )′ = c f ′ for all f , g ∈ D and c ∈ R. Both of these properties are typically
presented in introductory calculus courses, so we do not prove them here.

Recall that D is the vector space of differentiable real-valued functions. Sorry for using two different
“D”s in the same sentence to mean different things (curly D is a vector space, block D is the derivative
linear transformation).

There are also a couple of particularly commonly-occurring linear transformations that it is useful to
give names to: the zero transformation O : V → W is the one defined by O(v) = 0 for all v ∈ V, and
the identity transformation I : V → V is the one defined by I(v) = v for all v ∈ V. If we wish to clarify
which spaces the identity and zero transformations are acting on, we use subscripts as in IV and OV,W .

On the other hand, it is perhaps helpful to see an example of a linear algebraic function that is not a
linear transformation.
Example 1.2.9 (The Determinant is Not a Linear Transformation). Show that the determinant, det : Mₙ(F) → F, is not a linear transformation when n ≥ 2.

Solution:
To see that det is not a linear transformation, we need to show that at least one of the two defining properties from Definition 1.2.4 fails. Well, det(I) = 1, so

    det(I + I) = det(2I) = 2ⁿ det(I) = 2ⁿ ≠ 2 = det(I) + det(I),

so property (a) of that definition fails (property (b) also fails for a similar reason).

We now do for linear transformations what we did for vectors in the previous
section: we give them coordinates so that we can explicitly write them down
using numbers from the ground field F. More specifically, just like every vector
in a finite-dimensional vector space can be associated with a vector in Fn , every
linear transformation between vector spaces can be associated with a matrix in
Mm,n (F):

Theorem 1.2.6 (Standard Matrix of a Linear Transformation). Let V and W be vector spaces with bases B and D, respectively, where B = {v₁, v₂, . . . , vₙ} and W is m-dimensional. A function T : V → W is a linear transformation if and only if there exists a matrix [T]_{D←B} ∈ M_{m,n} for which

    [T(v)]_D = [T]_{D←B} [v]_B   for all v ∈ V.

Furthermore, the unique matrix [T]_{D←B} with this property is called the standard matrix of T with respect to the bases B and D, and it is

    [T]_{D←B} = [ [T(v₁)]_D | [T(v₂)]_D | · · · | [T(vₙ)]_D ].

In other words, this theorem tells us that instead of working with a vector
v ∈ V, applying the linear transformation T : V → W to it, and then converting
it into a coordinate vector with respect to the basis D, we can convert v to its
coordinate vector [v]B and then multiply by the matrix [T ]D←B (see Figure 1.6).

[Figure 1.6: A visualization of the relationship between linear transformations, their standard matrices, and coordinate vectors. Just like T sends v to T(v), the standard matrix [T]_{D←B} sends the coordinate vector [v]_B to the coordinate vector [T(v)]_D = [T]_{D←B}[v]_B. Read this figure as starting at the top-right corner (v ∈ V) and moving to the bottom-left ([T(v)]_D ∈ Fᵐ).]

Proof of Theorem 1.2.6. It is straightforward to show that a function that multiplies a coordinate vector by a matrix is a linear transformation, so we only prove that for every linear transformation T : V → W, the matrix

    [T]_{D←B} = [ [T(v₁)]_D | [T(v₂)]_D | · · · | [T(vₙ)]_D ]

satisfies [T]_{D←B}[v]_B = [T(v)]_D, and no other matrix has this property. To see that [T]_{D←B}[v]_B = [T(v)]_D, suppose that [v]_B = (c₁, c₂, . . . , cₙ) (i.e., v = c₁v₁ + c₂v₂ + · · · + cₙvₙ) and do block matrix multiplication:

    [T]_{D←B}[v]_B = [ [T(v₁)]_D | · · · | [T(vₙ)]_D ] [v]_B     (definition of [T]_{D←B})
                   = c₁[T(v₁)]_D + · · · + cₙ[T(vₙ)]_D           (block matrix mult.)
                   = [c₁T(v₁) + · · · + cₙT(vₙ)]_D               (by Exercise 1.2.22)
                   = [T(c₁v₁ + · · · + cₙvₙ)]_D                  (linearity of T)
                   = [T(v)]_D.                                   (v = c₁v₁ + · · · + cₙvₙ)

(This proof is almost identical to that of Theorem 1.2.3; the reason for this is that change-of-basis matrices are exactly the standard matrices of the identity transformation.)

The proof of uniqueness of [T]_{D←B} is almost identical to the proof of uniqueness of P_{C←B} from Theorem 1.2.3, so we leave it to Exercise 1.2.35. ∎
For simplicity of notation, in the special case when V = W and B = D
we denote the standard matrix [T ]B←B simply by [T ]B . If V furthermore has a
standard basis E then we sometimes denote [T ]E simply by [T ].

Example 1.2.10 (Standard Matrix of the Transposition Map). Find the standard matrix [T] of the transposition map T : M₂ → M₂ with respect to the standard basis E = {E_{1,1}, E_{1,2}, E_{2,1}, E_{2,2}}.

Solution:
We need to compute the coordinate vectors of E_{1,1}ᵀ, E_{1,2}ᵀ, E_{2,1}ᵀ, and E_{2,2}ᵀ, and place them (in that order) into the matrix [T]:

    [E_{1,1}ᵀ]_E = [ 1 0 ; 0 0 ]_E = (1, 0, 0, 0),   [E_{1,2}ᵀ]_E = [ 0 0 ; 1 0 ]_E = (0, 0, 1, 0),
    [E_{2,1}ᵀ]_E = [ 0 1 ; 0 0 ]_E = (0, 1, 0, 0),   [E_{2,2}ᵀ]_E = [ 0 0 ; 0 1 ]_E = (0, 0, 0, 1).

It follows that

    [T] = [T]_E = [ [E_{1,1}ᵀ]_E | [E_{1,2}ᵀ]_E | [E_{2,1}ᵀ]_E | [E_{2,2}ᵀ]_E ] = [ 1 0 0 0 ; 0 0 1 0 ; 0 1 0 0 ; 0 0 0 1 ].

(The standard matrix of T looks different depending on which bases B and D are used, just like coordinate vectors look different depending on the basis B. We generalize this example to higher dimensions in Exercise 1.2.12.)

We could be done at this point, but as a bit of a sanity check, it is perhaps useful to verify that [T]_E [A]_E = [Aᵀ]_E for all A ∈ M₂. To this end, we notice that if

    A = [ a b ; c d ]   then   [A]_E = (a, b, c, d),

so

    [T]_E [A]_E = [ 1 0 0 0 ; 0 0 1 0 ; 0 1 0 0 ; 0 0 0 1 ] (a, b, c, d) = (a, c, b, d) = [ a c ; b d ]_E = [Aᵀ]_E,

as desired.
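Computations like this one are easy to double-check by machine. The following minimal Python/NumPy sketch (the helper vec and all variable names are ad hoc choices for illustration, not anything standard) builds [T] column by column, exactly as in the example, and verifies that it sends [A]_E to [Aᵀ]_E:

    import numpy as np

    # Coordinate vector of a 2x2 matrix with respect to E = {E11, E12, E21, E22}:
    # simply read off the entries row by row.
    vec = lambda A: A.reshape(-1)

    # The standard basis matrices E_{i,j} of M_2.
    E_basis = [np.outer(np.eye(2)[i], np.eye(2)[j]) for i in range(2) for j in range(2)]

    # Column j of [T] is the coordinate vector of the transpose of the j-th basis matrix.
    T_std = np.column_stack([vec(M.T) for M in E_basis])
    print(T_std)                                   # the 4x4 matrix from Example 1.2.10

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(np.allclose(T_std @ vec(A), vec(A.T)))   # True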

Example 1.2.11 (Standard Matrix of the Derivative). Find the standard matrix [D]_{C←B} of the derivative map D : P³ → P² with respect to the standard bases B = {1, x, x², x³} ⊂ P³ and C = {1, x, x²} ⊂ P².

Solution:
We need to compute the coordinate vectors of D(1) = 0, D(x) = 1, D(x²) = 2x, and D(x³) = 3x², and place them (in that order) into the matrix [D]_{C←B}:

    [0]_C = (0, 0, 0),   [1]_C = (1, 0, 0),   [2x]_C = (0, 2, 0),   [3x²]_C = (0, 0, 3).

It follows that

    [D]_{C←B} = [ [0]_C | [1]_C | [2x]_C | [3x²]_C ] = [ 0 1 0 0 ; 0 0 2 0 ; 0 0 0 3 ].

Once again, as a bit of a sanity check, it is perhaps useful to verify that [D]_{C←B}[f]_B = [f′]_C for all f ∈ P³. If f(x) = a₀ + a₁x + a₂x² + a₃x³ then [f]_B = (a₀, a₁, a₂, a₃), so

    [D]_{C←B}[f]_B = [ 0 1 0 0 ; 0 0 2 0 ; 0 0 0 3 ] (a₀, a₁, a₂, a₃) = (a₁, 2a₂, 3a₃) = [a₁ + 2a₂x + 3a₃x²]_C.

Since a₁ + 2a₂x + 3a₃x² = (a₀ + a₁x + a₂x² + a₃x³)′ = f′(x), we thus see that we indeed have [D]_{C←B}[f]_B = [f′]_C, as expected.
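The same column-by-column recipe can be carried out symbolically. Here is a small SymPy sketch (the helper coords is our own ad hoc function, not a SymPy routine) that differentiates each basis polynomial of P³ and records its coordinates with respect to C = {1, x, x²}:

    import sympy as sp

    x = sp.symbols('x')
    B = [sp.Integer(1), x, x**2, x**3]   # standard basis of P^3
    C = [sp.Integer(1), x, x**2]         # standard basis of P^2

    def coords(p, basis):
        """Coordinates of a polynomial with respect to {1, x, x^2, ...}."""
        p = sp.expand(p)
        return [p.coeff(x, k) for k in range(len(basis))]

    # Column j of [D]_{C<-B} is the coordinate vector of D(B[j]).
    D_CB = sp.Matrix([coords(sp.diff(b, x), C) for b in B]).T
    print(D_CB)                          # Matrix([[0, 1, 0, 0], [0, 0, 2, 0], [0, 0, 0, 3]])

    # Sanity check on a general f(x) = a0 + a1 x + a2 x^2 + a3 x^3.
    a0, a1, a2, a3 = sp.symbols('a0 a1 a2 a3')
    f = a0 + a1*x + a2*x**2 + a3*x**3
    print(D_CB * sp.Matrix(coords(f, B)))   # Matrix([[a1], [2*a2], [3*a3]])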

Keep in mind that if we change the input and/or output vector spaces of the derivative map D, then its standard matrix can look quite a bit different (after all, even just changing the bases used on these vector spaces can change the standard matrix). For example, if we considered D as a linear transformation from P³ to P³, instead of from P³ to P² as in the previous example, then its standard matrix would be 4 × 4 instead of 3 × 4 (it would have an extra zero row at its bottom).

The next example illustrates this observation with the derivative map on a slightly more exotic vector space.

Example 1.2.12 (Standard Matrix of the Derivative (Again)). Let B = {eˣ, xeˣ, x²eˣ} be a basis of the vector space V = span(B). Find the standard matrix [D]_B of the derivative map D : V → V with respect to B. (Recall that spans are always subspaces. B was shown to be linearly independent in Exercise 1.1.2(g), so it is indeed a basis of V.)

Solution:
We need to compute the coordinate vectors of D(eˣ) = eˣ, D(xeˣ) = eˣ + xeˣ, and D(x²eˣ) = 2xeˣ + x²eˣ, and place them (in that order) as columns into the matrix [D]_B:

    [eˣ]_B = (1, 0, 0),   [eˣ + xeˣ]_B = (1, 1, 0),   [2xeˣ + x²eˣ]_B = (0, 2, 1).

It follows that

    [D]_B = [ [eˣ]_B | [eˣ + xeˣ]_B | [2xeˣ + x²eˣ]_B ] = [ 1 1 0 ; 0 1 2 ; 0 0 1 ].

It is often useful to consider the effect of applying two or more linear transformations to a vector, one after another. Rather than thinking of these linear transformations as separate objects that are applied in sequence, we can combine their effect into a single new function that is called their composition. More specifically, suppose V, W, and X are vector spaces and T : V → W and S : W → X are linear transformations. We say that the composition of S and T, denoted by S ◦ T, is the function defined by

    (S ◦ T)(v) = S(T(v))   for all v ∈ V.

That is, the composition S ◦ T of two linear transformations is the function that we get if we apply T first and then S. (Like most things in this section, the composition S ◦ T should be read right-to-left: the input vector v goes in on the right, then T acts on it, and then S acts on that.) In other words, while T sends V to W and S sends W to X, the composition S ◦ T skips the intermediate step and sends V directly to X, as illustrated in Figure 1.7.

Importantly, S ◦ T is also a linear transformation, and its standard matrix can be obtained simply by multiplying together the standard matrices of S and T.

[Figure 1.7: The composition of S and T, denoted by S ◦ T, is the function that sends v ∈ V to S(T(v)) ∈ X: first T sends v to T(v) ∈ W, and then S sends T(v) to S(T(v)) = (S ◦ T)(v).]

In fact, this is the primary reason that matrix multiplication is defined in the
seemingly bizarre way that it is—we want it to capture the idea of applying
one matrix (linear transformation) after another to a vector.

Theorem 1.2.7 (Composition of Linear Transformations). Suppose V, W, and X are finite-dimensional vector spaces with bases B, C, and D, respectively. If T : V → W and S : W → X are linear transformations then S ◦ T : V → X is a linear transformation, and its standard matrix is

    [S ◦ T]_{D←B} = [S]_{D←C} [T]_{C←B}.

(Notice that the middle C subscripts match, while the left D subscripts and the right B subscripts also match.)

Proof. Thanks to the uniqueness condition of Theorem 1.2.6, we just need to show that [(S ◦ T)(v)]_D = [S]_{D←C}[T]_{C←B}[v]_B for all v ∈ V. To this end, we compute [(S ◦ T)(v)]_D by using Theorem 1.2.6 applied to each of S and T individually:

    [(S ◦ T)(v)]_D = [S(T(v))]_D = [S]_{D←C}[T(v)]_C = [S]_{D←C}[T]_{C←B}[v]_B.

It follows that S ◦ T is a linear transformation, and its standard matrix is [S]_{D←C}[T]_{C←B}, as claimed. ∎
In the special case when the linear transformations that we are composing are equal to each other, we denote their composition by T² = T ◦ T. More generally, we can define powers of a linear transformation T : V → V via

    Tᵏ = T ◦ T ◦ · · · ◦ T   (k copies).

Example 1.2.13 (Iterated Derivatives). Use standard matrices to compute the fourth derivative of x²eˣ + 2xeˣ.

Solution:
We let B = {eˣ, xeˣ, x²eˣ} and V = span(B) so that we can make use of the standard matrix

    [D]_B = [ 1 1 0 ; 0 1 2 ; 0 0 1 ]

of the derivative map D that we computed in Example 1.2.12. We know from Theorem 1.2.7 that the standard matrix of D² = D ◦ D (i.e., the linear map that takes the derivative twice) is

    [D²]_B = [D]_B² = [ 1 2 2 ; 0 1 4 ; 0 0 1 ].

Similar reasoning tells us that the standard matrix of D⁴ = D ◦ D ◦ D ◦ D (i.e., the linear map that takes the derivative four times) is

    [D⁴]_B = ([D²]_B)² = [ 1 4 12 ; 0 1 8 ; 0 0 1 ].

With this standard matrix in hand, one way to find the fourth derivative of x²eˣ + 2xeˣ is to multiply its coordinate vector (0, 2, 1) by [D⁴]_B:

    [(x²eˣ + 2xeˣ)′′′′]_B = [D⁴]_B [x²eˣ + 2xeˣ]_B = [ 1 4 12 ; 0 1 8 ; 0 0 1 ] (0, 2, 1) = (20, 10, 1).

It follows that (x²eˣ + 2xeˣ)′′′′ = 20eˣ + 10xeˣ + x²eˣ.
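If you want to verify this computation numerically, a few lines of Python/NumPy suffice (the variable names below are just for illustration):

    import numpy as np

    D_B = np.array([[1, 1, 0],
                    [0, 1, 2],
                    [0, 0, 1]])            # derivative on span{e^x, xe^x, x^2 e^x}

    D4 = np.linalg.matrix_power(D_B, 4)    # standard matrix of the fourth derivative
    print(D4)                              # [[1 4 12], [0 1 8], [0 0 1]]

    v = np.array([0, 2, 1])                # coordinates of x^2 e^x + 2x e^x
    print(D4 @ v)                          # [20 10  1], i.e. 20e^x + 10xe^x + x^2 e^x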


Recall from earlier that we learned how to convert a coordinate vector from the basis B to another one C by using a change-of-basis matrix P_{C←B}. We now learn how to do the same thing for linear transformations: there is a reasonably direct method of converting a standard matrix with respect to bases B and D to a standard matrix with respect to bases C and E. (The basis E here does not have to be the standard basis; there just are not enough letters in the alphabet.)

Theorem 1.2.8 (Change of Basis for Linear Transformations). Suppose V is a finite-dimensional vector space with bases B and C, W is a finite-dimensional vector space with bases D and E, and T : V → W is a linear transformation. Then

    [T]_{E←B} = P_{E←D} [T]_{D←C} P_{C←B}.

Proof. We simply multiply the matrix P_{E←D}[T]_{D←C}P_{C←B} on the right by an arbitrary coordinate vector [v]_B, where v ∈ V. Well,

    P_{E←D}[T]_{D←C}P_{C←B}[v]_B = P_{E←D}[T]_{D←C}[v]_C   (since P_{C←B}[v]_B = [v]_C)
                                 = P_{E←D}[T(v)]_D         (by Theorem 1.2.6)
                                 = [T(v)]_E.

However, we know from Theorem 1.2.6 that [T]_{E←B} is the unique matrix for which [T]_{E←B}[v]_B = [T(v)]_E for all v ∈ V, so it follows that P_{E←D}[T]_{D←C}P_{C←B} = [T]_{E←B}, as claimed. ∎

(As always, notice in this theorem that adjacent subscripts always match: the two "D"s are next to each other, as are the two "C"s.)
A schematic that illustrates the statement of the above theorem is provided
by Figure 1.8. All it says is that there are two different (but equivalent) ways
of converting [v]B into [T (v)]E : we could multiply [v]B by [T ]E←B , or we
could convert [v]B into basis C, then multiply by [T ]D←C , and then convert
into basis E.
[Figure 1.8: A visualization of the relationship between linear transformations, standard matrices, change-of-basis matrices, and coordinate vectors. In particular, the bottom row illustrates Theorem 1.2.8, which says that we can construct [T]_{E←B} from [T]_{D←C} by multiplying on the right and left by the appropriate change-of-basis matrices. This image is basically just a combination of Figures 1.5 and 1.6. Again, read this figure as starting at the top-right corner and moving to the bottom-left.]

Example 1.2.14 (Representing the Transpose in Weird Bases). Compute the standard matrix [T]_{C←B} of the transpose map T : M₂(C) → M₂(C), where

    B = { [ 1 0 ; 0 1 ], [ 0 1 ; 1 0 ], [ 0 −i ; i 0 ], [ 1 0 ; 0 −1 ] }   and
    C = { [ 1 0 ; 0 0 ], [ 1 1 ; 0 0 ], [ 1 1 ; 1 0 ], [ 1 1 ; 1 1 ] }

are bases of M₂(C).

Solution:
Recall that if E = {E_{1,1}, E_{1,2}, E_{2,1}, E_{2,2}} is the standard basis of M₂(C) then we already computed P_{E←B} and P_{C←E} in Example 1.2.4:

    P_{E←B} = [ 1 0 0 1 ; 0 1 −i 0 ; 0 1 i 0 ; 1 0 0 −1 ]   and   P_{C←E} = [ 1 −1 0 0 ; 0 1 −1 0 ; 0 0 1 −1 ; 0 0 0 1 ],

as well as [T]_E in Example 1.2.10:

    [T]_E = [ 1 0 0 0 ; 0 0 1 0 ; 0 1 0 0 ; 0 0 0 1 ].

Theorem 1.2.8 tells us that we can compute [T]_{C←B} just by multiplying these matrices together in the appropriate order:

    [T]_{C←B} = P_{C←E} [T]_E P_{E←B}
              = [ 1 −1 0 0 ; 0 1 −1 0 ; 0 0 1 −1 ; 0 0 0 1 ] [ 1 0 0 0 ; 0 0 1 0 ; 0 1 0 0 ; 0 0 0 1 ] [ 1 0 0 1 ; 0 1 −i 0 ; 0 1 i 0 ; 1 0 0 −1 ]
              = [ 1 −1 0 0 ; 0 1 −1 0 ; 0 0 1 −1 ; 0 0 0 1 ] [ 1 0 0 1 ; 0 1 i 0 ; 0 1 −i 0 ; 1 0 0 −1 ]
              = [ 1 −1 −i 1 ; 0 0 2i 0 ; −1 1 −i 1 ; 1 0 0 −1 ].

(We could also compute [T]_{C←B} directly from the definition, but that is a lot more work, as it requires the computation of numerous ugly coordinate vectors.)
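Since all three matrices are known explicitly, the product above is easy to double-check numerically; here is a minimal NumPy sketch (the variable names are ad hoc):

    import numpy as np

    P_EB = np.array([[1, 0,   0,  1],
                     [0, 1, -1j,  0],
                     [0, 1,  1j,  0],
                     [1, 0,   0, -1]])
    P_CE = np.array([[1, -1,  0,  0],
                     [0,  1, -1,  0],
                     [0,  0,  1, -1],
                     [0,  0,  0,  1]])
    T_E = np.array([[1, 0, 0, 0],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1]])

    print(P_CE @ T_E @ P_EB)   # reproduces [T]_{C<-B} from Example 1.2.14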

1.2.4 Properties of Linear Transformations


Now that we know how to represent linear transformations as matrices, we
can introduce many properties of linear transformations that are analogous to
properties of their standard matrices. Because of the close relationship between
these two concepts, all of the techniques that we know for dealing with matrices
carry over immediately to this more general setting, as long as the underlying
vector spaces are finite-dimensional.
For example, we say that a linear transformation T : V → W is invertible if there exists a linear transformation T⁻¹ : W → V such that

    T⁻¹(T(v)) = v  for all v ∈ V   and   T(T⁻¹(w)) = w  for all w ∈ W.

In other words, T⁻¹ ◦ T = I_V and T ◦ T⁻¹ = I_W. Our intuitive understanding of the inverse of a linear transformation is identical to that of matrices (T⁻¹ is the linear transformation that "undoes" what T "does"), and the following theorem says that invertibility of T can in fact be determined from its standard matrix exactly how we might hope:

Theorem 1.2.9 (Invertibility of Linear Transformations). Suppose V and W are finite-dimensional vector spaces with bases B and D, respectively, and dim(V) = dim(W). Then a linear transformation T : V → W is invertible if and only if its standard matrix [T]_{D←B} is invertible, and

    [T⁻¹]_{B←D} = ([T]_{D←B})⁻¹.

Proof. We make use of the fact from Exercise 1.2.30 that the standard matrix of the identity transformation is the identity matrix: [T]_B = I if and only if T = I_V.

For the "only if" direction, note that if T is invertible then we have

    I = [I_V]_B = [T⁻¹ ◦ T]_B = [T⁻¹]_{B←D} [T]_{D←B}.

Since [T⁻¹]_{B←D} and [T]_{D←B} multiply to the identity matrix, it follows that they are inverses of each other.

For the "if" direction, suppose that [T]_{D←B} is invertible, with inverse matrix A. Then there is some linear transformation S : W → V such that A = [S]_{B←D}, so for all v ∈ V we have

    [v]_B = A[T]_{D←B}[v]_B = [S]_{B←D}[T]_{D←B}[v]_B = [(S ◦ T)(v)]_B.

This implies [S ◦ T]_B = I, so S ◦ T = I_V, and a similar argument shows that T ◦ S = I_W. It follows that T is invertible, and its inverse is S. ∎

(We show in Exercise 1.2.21 that if dim(V) ≠ dim(W) then T cannot possibly be invertible.)
In a sense, the above theorem tells us that when considering the invertibil-
ity of linear transformations, we do not actually need to do anything new—
everything just carries over from the invertibility of matrices straightforwardly.
The important difference here is our perspective—we can now use everything
we know about invertible matrices in many other situations where we want
to invert an operation. The following example illustrates this observation by
inverting the derivative (i.e., integrating).

Example 1.2.15 (Indefinite Integrals via Standard Matrices). Use standard matrices to compute

    ∫ (3x²eˣ − xeˣ) dx.

(The "standard" way to compute this indefinite integral directly would be to use integration by parts twice.)

Solution:
We let B = {eˣ, xeˣ, x²eˣ} and V = span(B) so that we can make use of the standard matrix

    [D]_B = [ 1 1 0 ; 0 1 2 ; 0 0 1 ]

of the derivative map D that we computed in Example 1.2.12. Since the indefinite integral and derivative are inverse operations of one another on V, D⁻¹ is the integration linear transformation. It then follows from Theorem 1.2.9 that the standard matrix of D⁻¹ is

    [D⁻¹]_B = [D]_B⁻¹ = [ 1 −1 2 ; 0 1 −2 ; 0 0 1 ].

(See Appendix A.1.3 if you need a refresher on how to compute the inverse of a matrix; recall that applying Gaussian elimination to [ A | I ] produces [ I | A⁻¹ ].)

It follows that, to find the coefficient vector of the indefinite integral of 3x²eˣ − xeˣ, we compute

    [D⁻¹]_B [3x²eˣ − xeˣ]_B = [ 1 −1 2 ; 0 1 −2 ; 0 0 1 ] (0, −1, 3) = (7, −7, 3).

We thus conclude that

    ∫ (3x²eˣ − xeˣ) dx = 7eˣ − 7xeˣ + 3x²eˣ + C.
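Here is a quick NumPy check of this computation (again, the variable names are our own, chosen only for illustration):

    import numpy as np

    D_B = np.array([[1, 1, 0],
                    [0, 1, 2],
                    [0, 0, 1]])        # derivative on span{e^x, xe^x, x^2 e^x}

    D_inv = np.linalg.inv(D_B)         # standard matrix of integration on V
    print(D_inv)                       # [[1 -1  2], [0  1 -2], [0  0  1]]

    v = np.array([0, -1, 3])           # coordinates of 3x^2 e^x - x e^x
    print(D_inv @ v)                   # [ 7 -7  3]  ->  7e^x - 7xe^x + 3x^2 e^x (+ C)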

It is worth pointing out that the method of integration presented in Example 1.2.15 does not work if the derivative map D is not invertible on the vector space in question. For example, if we tried to compute

    ∫ x³ dx

via this method, we could run into trouble since D : P³ → P³ is not invertible: direct computation reveals that its standard matrix is 4 × 4, but has rank 3. After all, it has a 1-dimensional null space consisting of the coordinate vectors of the constant functions, which are all mapped to 0. One way to get around this problem is to use the pseudoinverse, which we introduce in Section 2.C.1.

Remark 1.2.2 (Invertibility in Infinite-Dimensional Vector Spaces). Because we can think of linear transformations acting on finite-dimensional vector spaces as matrices, almost any property or theorem involving matrices can be carried over to this new setting without much trouble.

However, if the vector spaces involved are infinite-dimensional, then many of the nice properties that matrices have can break down. For example, a standard result of introductory linear algebra says that every one-sided inverse of a matrix, and thus every one-sided inverse of a linear transformation between finite-dimensional vector spaces, is necessarily a two-sided inverse as well. (A "one-sided inverse" of a linear transformation T is a linear transformation T⁻¹ such that T⁻¹ ◦ T = I_V or T ◦ T⁻¹ = I_W, but not necessarily both.) However, this property does not hold in infinite-dimensional vector spaces.

For example, if R : Fᴺ → Fᴺ is the "right shift" map defined by

    R(c₁, c₂, c₃, . . .) = (0, c₁, c₂, c₃, . . .),

then it is straightforward to verify that R is linear (see Exercise 1.2.32). (Recall that Fᴺ is the vector space of infinite sequences of scalars from F.) Furthermore, the "left shift" map L : Fᴺ → Fᴺ defined by

    L(c₁, c₂, c₃, . . .) = (c₂, c₃, c₄, . . .)

is a one-sided inverse of R, since L ◦ R = I_{Fᴺ}. However, L is not a two-sided inverse of R, since

    (R ◦ L)(c₁, c₂, c₃, . . .) = (0, c₂, c₃, . . .),

so R ◦ L ≠ I_{Fᴺ}.

We can also introduce many other properties and quantities associated with linear transformations, such as their range, null space, rank, and eigenvalues. In all of these cases, the definitions are almost identical to what they were for matrices, and in the finite-dimensional case they can all be handled simply by appealing to what we already know about matrices.

In particular, we now define the following concepts concerning a linear transformation T : V → W:
• range(T) = {T(x) : x ∈ V},
• null(T) = {x ∈ V : T(x) = 0},
• rank(T) = dim(range(T)), and
• nullity(T) = dim(null(T)).
(We show that range(T) is a subspace of W and null(T) is a subspace of V in Exercise 1.2.24.)
In all of these cases, we can compute these quantities by converting ev-
erything to standard matrices and coordinate vectors, doing our computations
on matrices using the techniques that we already know, and then converting

back to linear transformations and vectors in V. We now illustrate this tech-


nique with some examples, but keep in mind that if V and/or W are infinite-
dimensional, then these concepts become quite a bit more delicate and are
beyond the scope of this book.

Example 1.2.16 (Range and Null Space of the Derivative). Determine the range, null space, rank, and nullity of the derivative map D : P³ → P³.

Solution:
We first compute these objects directly from the definitions.
• The range of D is the set of all polynomials of the form D(p), where p ∈ P³. Since the derivative of a degree-3 polynomial has degree 2 (and conversely, every degree-2 polynomial is the derivative of some degree-3 polynomial), we conclude that range(D) = P².
• The null space of D is the set of all polynomials p for which D(p) = 0. Since D(p) = 0 if and only if p is constant, we conclude that null(D) = P⁰ (the constant functions).
• The rank of D is the dimension of its range, which is dim(P²) = 3.
• The nullity of D is the dimension of its null space: dim(P⁰) = 1.

Alternatively, we could have arrived at these answers by working with the standard matrix of D. Using an argument analogous to that of Example 1.2.11, we see that the standard matrix of D with respect to the standard basis B = {1, x, x², x³} is

    [D]_B = [ 0 1 0 0 ; 0 0 2 0 ; 0 0 0 3 ; 0 0 0 0 ].

(In Example 1.2.11, D was a linear transformation into P², instead of P³, so its standard matrix there was 3 × 4 instead of 4 × 4.)

Straightforward computation then shows the following:
• range([D]_B) = span((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)). These three vectors are [1]_B, [x]_B, and [x²]_B, so range(D) = span(1, x, x²) = P², as we saw earlier.
• null([D]_B) = span((1, 0, 0, 0)). Since (1, 0, 0, 0) = [1]_B, we conclude that null(D) = span(1) = P⁰, as we saw earlier.
• rank([D]_B) = 3, so rank(D) = 3 as well.
• nullity([D]_B) = 1, so nullity(D) = 1 as well.
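The matrix computations in the second half of this example can be reproduced exactly with SymPy; here is a sketch (with ad hoc variable names):

    import sympy as sp

    D_B = sp.Matrix([[0, 1, 0, 0],
                     [0, 0, 2, 0],
                     [0, 0, 0, 3],
                     [0, 0, 0, 0]])    # derivative on P^3 with respect to {1, x, x^2, x^3}

    print(D_B.rank())                  # 3, so rank(D) = 3 and nullity(D) = 4 - 3 = 1
    print(D_B.nullspace())             # basis [(1, 0, 0, 0)^T] = [1]_B, so null(D) = span(1) = P^0
    print(D_B.columnspace())           # coordinate vectors spanning range(D) = P^2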

In the above example, we were able to learn about the range, null space, rank, and nullity of a linear transformation by considering the corresponding properties of its standard matrix (with respect to any basis). These facts are hopefully intuitive enough (after all, the entire reason we introduced standard matrices is because they act on Fⁿ in the same way that the linear transformation acts on V), but they are proved explicitly in Exercise 1.2.25.

We can also define eigenvalues and eigenvectors of linear transformations in almost the exact same way that is done for matrices. If V is a vector space over a field F then a non-zero vector v ∈ V is an eigenvector of a linear transformation T : V → V with corresponding eigenvalue λ ∈ F if T(v) = λv. (If T : V → W with V ≠ W then T(v) and λv live in different vector spaces, so it does not make sense to talk about them being equal.) We furthermore say that the eigenspace corresponding to a particular eigenvalue is the set consisting of all eigenvectors corresponding to that eigenvalue, together with 0.

Just like the range and null space, eigenvalues and eigenvectors can be
computed either straight from the definition or via the corresponding properties
of a standard matrix.

Example 1.2.17 (Eigenvalues and Eigenvectors of the Transpose). Compute the eigenvalues and corresponding eigenspaces of the transpose map T : M₂ → M₂.

Solution:
We compute the eigenvalues and eigenspaces of T by making use of its standard matrix with respect to the standard basis E = {E_{1,1}, E_{1,2}, E_{2,1}, E_{2,2}} that we computed in Example 1.2.10:

    [T] = [ 1 0 0 0 ; 0 0 1 0 ; 0 1 0 0 ; 0 0 0 1 ].

Then the characteristic polynomial p_{[T]} of [T] is

    p_{[T]}(λ) = det( [ 1−λ 0 0 0 ; 0 −λ 1 0 ; 0 1 −λ 0 ; 0 0 0 1−λ ] )
               = λ²(1 − λ)² − (1 − λ)²
               = (λ² − 1)(1 − λ)²
               = (λ − 1)³(λ + 1).

(Computing determinants of 4 × 4 matrices is normally not particularly fun, but doing so here is not too bad since [T] has so many zero entries.)

It follows that the eigenvalues of [T] are 1, with algebraic multiplicity 3, and −1, with algebraic multiplicity 1. We now find bases of its eigenspaces.

λ = 1: This eigenspace equals the null space of

    [T] − λI = [T] − I = [ 0 0 0 0 ; 0 −1 1 0 ; 0 1 −1 0 ; 0 0 0 0 ],

which has B = {(1, 0, 0, 0), (0, 1, 1, 0), (0, 0, 0, 1)} as a basis. These three basis vectors are the coordinate vectors of the matrices

    [ 1 0 ; 0 0 ],   [ 0 1 ; 1 0 ],   and   [ 0 0 ; 0 1 ],

which thus make up a basis of the λ = 1 eigenspace of T. This makes sense since these matrices span the set M^S_2 of symmetric 2 × 2 matrices. In other words, we have just restated the trivial fact that Aᵀ = A means that A is an eigenvector of the transpose map with eigenvalue 1 (i.e., Aᵀ = λA with λ = 1). (When we say "symmetric" here, we really mean symmetric, Aᵀ = A; we do not mean "Hermitian", even if the underlying field is C.)

λ = −1: Similarly, this eigenspace equals the null space of

    [T] − λI = [T] + I = [ 2 0 0 0 ; 0 1 1 0 ; 0 1 1 0 ; 0 0 0 2 ],

which has basis B = {(0, 1, −1, 0)}. This vector is the coordinate vector of the matrix

    [ 0 1 ; −1 0 ],

which thus makes up a basis of the λ = −1 eigenspace of T. This makes sense since this matrix spans the set M^sS_2 of skew-symmetric 2 × 2 matrices. Similar to before, we have just restated the trivial fact that Aᵀ = −A means that A is an eigenvector of the transpose map with eigenvalue −1 (i.e., Aᵀ = λA with λ = −1).
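SymPy can confirm this eigenvalue computation exactly. The sketch below prints each eigenvalue of [T] together with its algebraic multiplicity and a basis of its eigenspace (SymPy may scale or order the basis vectors differently than we did above):

    import sympy as sp

    T_E = sp.Matrix([[1, 0, 0, 0],
                     [0, 0, 1, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1]])    # standard matrix of the transpose map on M_2

    # Each entry is (eigenvalue, algebraic multiplicity, eigenspace basis).
    for eigenvalue, multiplicity, basis in T_E.eigenvects():
        print(eigenvalue, multiplicity, [list(v) for v in basis])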

The above example generalizes straightforwardly to higher dimensions: for any integer n ≥ 2, the transpose map T : Mₙ → Mₙ only has eigenvalues ±1, and the corresponding eigenspaces are exactly the spaces M^S_n and M^sS_n of symmetric and skew-symmetric matrices, respectively (see Exercise 1.2.13).

Because we can talk about eigenvalues and eigenvectors of linear transformations via their standard matrices, all of our results based on diagonalization automatically apply in this new setting with pretty much no extra work (as long as everything is finite-dimensional). We can thus diagonalize linear transformations, take arbitrary (non-integer) powers of linear transformations, and even apply strange functions like the exponential to linear transformations. (Refer to Appendix A.1.7 if you need a refresher on diagonalization.)

Example 1.2.18 (Square Root of the Transpose). Find a square root of the transpose map T : M₂(C) → M₂(C). That is, find a linear transformation S : M₂(C) → M₂(C) with the property that S² = T.

Solution:
Since we already know how to solve problems like this for matrices, we just do the corresponding matrix computation on the standard matrix [T] rather than trying to solve it "directly" on T itself. That is, we find a matrix square root of [T] via diagonalization. (Throughout this example, E = {E_{1,1}, E_{1,2}, E_{2,1}, E_{2,2}} is the standard basis of M₂.)

First, we diagonalize [T]. We learned in Example 1.2.17 that [T] has eigenvalues 1 and −1, with corresponding eigenspace bases equal to

    {(1, 0, 0, 0), (0, 1, 1, 0), (0, 0, 0, 1)}   and   {(0, 1, −1, 0)},

respectively. It follows that one way to diagonalize [T] as [T] = PDP⁻¹ is to choose

    P = [ 1 0 0 0 ; 0 1 0 1 ; 0 1 0 −1 ; 0 0 1 0 ],   P⁻¹ = (1/2) [ 2 0 0 0 ; 0 1 1 0 ; 0 0 0 2 ; 0 1 −1 0 ],   and
    D = [ 1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 0 −1 ].

(Recall that to diagonalize a matrix, we place the eigenvalues along the diagonal of D and bases of the corresponding eigenspaces as columns of P in the same order.)

To find a square root of [T], we just take a square root of each diagonal entry in D and then multiply these matrices back together. It follows that one square root of [T] is given by

    [T]^{1/2} = P D^{1/2} P⁻¹
              = (1/2) [ 1 0 0 0 ; 0 1 0 1 ; 0 1 0 −1 ; 0 0 1 0 ] [ 1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 0 i ] [ 2 0 0 0 ; 0 1 1 0 ; 0 0 0 2 ; 0 1 −1 0 ]
              = (1/2) [ 1 0 0 0 ; 0 1 0 1 ; 0 1 0 −1 ; 0 0 1 0 ] [ 2 0 0 0 ; 0 1 1 0 ; 0 0 0 2 ; 0 i −i 0 ]
              = (1/2) [ 2 0 0 0 ; 0 1+i 1−i 0 ; 0 1−i 1+i 0 ; 0 0 0 2 ].

(The 1/2 in front of this product comes from P⁻¹.)

To unravel [T]^{1/2} back into our desired linear transformation S, we just look at how it acts on coordinate vectors:

    (1/2) [ 2 0 0 0 ; 0 1+i 1−i 0 ; 0 1−i 1+i 0 ; 0 0 0 2 ] (a, b, c, d) = (1/2) (2a, (1+i)b + (1−i)c, (1−i)b + (1+i)c, 2d),

so we conclude that the linear transformation S : M₂(C) → M₂(C) defined by

    S( [ a b ; c d ] ) = (1/2) [ 2a, (1+i)b + (1−i)c ; (1−i)b + (1+i)c, 2d ]

is a square root of T. (To double-check our work, it is perhaps a good idea to apply S to a matrix twice and see that we indeed end up with the transpose of that matrix.)
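As a numerical version of the double-check suggested above, the following NumPy sketch squares the standard matrix of S and confirms that it equals the standard matrix of the transpose map (variable names are ad hoc):

    import numpy as np

    S_std = 0.5 * np.array([[2, 0,      0,      0],
                            [0, 1 + 1j, 1 - 1j, 0],
                            [0, 1 - 1j, 1 + 1j, 0],
                            [0, 0,      0,      2]])   # [T]^(1/2) computed above

    T_std = np.array([[1, 0, 0, 0],
                      [0, 0, 1, 0],
                      [0, 1, 0, 0],
                      [0, 0, 0, 1]])                   # [T] from Example 1.2.10

    print(np.allclose(S_std @ S_std, T_std))           # True: applying S twice transposes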

As perhaps an even weirder application of this method, we can also take non-integer powers of the derivative map. That is, we can talk about things like the "half derivative" of a function (just like we regularly talk about the second or third derivative of a function).

Example 1.2.19 (How to Take Half of a Derivative). Let B = {sin(x), cos(x)} and V = span(B). Find a square root D^{1/2} of the derivative map D : V → V and use it to compute D^{1/2}(sin(x)) and D^{1/2}(cos(x)). (We think of D^{1/2} as the "half derivative", just like D² is the second derivative.)

Solution:
Just like in the previous example, instead of working with D itself, we work with its standard matrix, which we can compute straightforwardly to be

    [D]_B = [ 0 −1 ; 1 0 ].

While we could compute [D]_B^{1/2} by diagonalizing [D]_B (we do exactly this in Exercise 1.2.18), it is simpler to recognize [D]_B as a counter-clockwise rotation by π/2 radians: it sends [sin(x)]_B = e₁ to [D sin(x)]_B = [cos(x)]_B = e₂, and it sends e₂ to [D cos(x)]_B = −e₁. (Recall that the standard matrix of a counter-clockwise rotation by angle θ is

    [ cos(θ) −sin(θ) ; sin(θ) cos(θ) ],

and plugging in θ = π/2 gives exactly [D]_B.)

It is then trivial to find a square root of [D]_B: we just construct the standard matrix of the linear transformation that rotates counter-clockwise by π/4 radians:

    [D]_B^{1/2} = [ cos(π/4) −sin(π/4) ; sin(π/4) cos(π/4) ] = (1/√2) [ 1 −1 ; 1 1 ].

Unraveling this standard matrix back into the linear transformation D^{1/2} shows that the half derivatives of sin(x) and cos(x) are

    D^{1/2}(sin(x)) = (1/√2)(sin(x) + cos(x))   and   D^{1/2}(cos(x)) = (1/√2)(cos(x) − sin(x)).

(To double-check our work, we could apply D^{1/2} to sin(x) twice and see that we get cos(x).)
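The rotation picture makes this easy to check numerically; here is a short NumPy sketch (variable names chosen only for illustration):

    import numpy as np

    theta = np.pi / 4
    D_half = np.array([[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta),  np.cos(theta)]])   # rotation by pi/4

    D_B = np.array([[0, -1],
                    [1,  0]])                              # rotation by pi/2

    print(np.allclose(D_half @ D_half, D_B))   # True: (D^(1/2))^2 = D on V
    print(D_half @ np.array([1, 0]))           # [0.707..., 0.707...] = [D^(1/2)(sin x)]_B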

We return to diagonalization and the idea of applying strange functions like


the square root to matrices later, in Section 2.4.3.

Exercises (solutions to starred exercises on page 452)

1.2.1 Find the coordinate vector of the given vector v with respect to the indicated basis B.
∗(a) v = 3x − 2, B = {x + 1, x − 1} ⊂ P¹
(b) v = 2x² + 3, B = {x² + 1, x + 1, x² − x + 1} ⊂ P²
∗(c) v = [ 1 2 ; 3 4 ], B = {E_{1,1}, E_{2,2}, E_{1,2}, E_{2,1}} ⊂ M₂
(d) v = [ 1 2 ; 3 4 ], B = { [ 0 1 ; 1 1 ], [ 1 0 ; 1 1 ], [ 1 1 ; 0 1 ], [ 1 1 ; 1 0 ] }

1.2.2 Find a basis of the indicated vector space V and then determine its dimension.
(a) V = M^S_2 (the set of 2 × 2 symmetric matrices)
∗∗(b) V = M^S_n
(c) V = M^sS_2 = {A ∈ M₂ : Aᵀ = −A} (the set of 2 × 2 skew-symmetric matrices)
∗∗(d) V = M^sS_n = {A ∈ Mₙ : Aᵀ = −A}
(e) V = M^H_2 (the set of 2 × 2 Hermitian matrices)
∗(f) V = M^H_n
(g) V = {f ∈ P² : f(0) = 0}
∗(h) V = {f ∈ P³ : f(3) = 0}
(i) V = {f ∈ P⁴ : f(0) = f(1) = 0}
∗(j) V = {f ∈ P³ : f(x) − x f′(x) = 0}
(k) V = span{(0, 1, 1), (1, 2, −1), (−1, −1, 2)}
∗(l) V = span{eˣ, e²ˣ, e³ˣ}

1.2.3 Determine which of the following functions are and are not linear transformations.
(a) The function T : Rⁿ → Rⁿ defined by T(v₁, v₂, . . . , vₙ) = (vₙ, vₙ₋₁, . . . , v₁).
∗(b) The function T : Pᵖ → Pᵖ defined by T(f(x)) = f(2x − 1).
(c) Matrix inversion (i.e., the function Inv : Mₙ → Mₙ defined by Inv(A) = A⁻¹).
∗(d) One-sided matrix multiplication: given a fixed matrix B ∈ Mₙ, the function R_B : Mₙ → Mₙ defined by R_B(A) = AB.
(e) Two-sided matrix conjugation: given a fixed matrix B ∈ M_{m,n}, the function T_B : Mₙ → Mₘ defined by T_B(A) = BAB*.
∗(f) Conjugate transposition (i.e., the function T : Mₙ(C) → Mₙ(C) defined by T(A) = A*).
(g) The function T : P² → P² defined by T(f) = f(0) + f(1)x + f(2)x².

1.2.4 Determine which of the following statements are true and which are false.
∗(a) Every finite-dimensional vector space has a basis.
(b) The vector space M_{3,4} is 12-dimensional.
∗(c) The vector space P³ is 3-dimensional.
(d) If 4 vectors span a particular vector space V, then every set of 6 vectors in V is linearly dependent.
∗(e) The zero vector space V = {0} has dimension 1.
(f) If V is a vector space and v₁, . . . , vₙ ∈ V are such that span(v₁, . . . , vₙ) = V, then {v₁, . . . , vₙ} is a basis of V.
∗(g) The set {1, x, x², x³, . . .} is a basis of C (the vector space of continuous functions).
(h) For all A ∈ Mₙ it is true that tr(A) = tr(Aᵀ).
∗(i) The transposition map T : M_{m,n} → M_{n,m} is invertible.
(j) The derivative map D : P⁵ → P⁵ is invertible.

1.2.5 Find the change-of-basis matrix P_{C←B} between the given bases B and C of their span.
(a) B = {1 + 2x, 2 − x², 1 + x + 2x²}, C = {1, x, x²}
∗(b) B = {1, x, x²}, C = {1 + 2x, 2 − x², 1 + x + 2x²}
(c) B = {1 − x, 2 + x + x², 1 − x²}, C = {x + 3x², x, 1 − x + x²}
∗(d) B = { [ 1 0 ; 0 1 ], [ 1 1 ; 0 1 ], [ 1 0 ; 1 1 ], [ 1 1 ; 1 0 ] }, C = {E_{1,1}, E_{1,2}, E_{2,1}, E_{2,2}}
(e) B = { [ 1 0 ; 0 1 ], [ 1 1 ; 0 1 ], [ 1 0 ; 1 1 ], [ 1 1 ; 1 0 ] }, C = { [ 1 0 ; 0 1 ], [ 0 1 ; 1 0 ], [ 0 −i ; i 0 ], [ 1 0 ; 0 −1 ] }

1.2.6 Find the standard matrix [T]_{D←B} of the linear transformation T with respect to the given bases B and D.
(a) T : R³ → R², T(v) = (v₁ − v₂ + 2v₃, 2v₁ − 3v₂ + v₃), B and D are the standard bases of R³ and R², respectively.
∗(b) T : P² → P², T(f(x)) = f(3x + 1), B = D = {x², x, 1}.
(c) T : P² → P³, T(f(x)) = ∫₁ˣ f(t) dt, B = {1, x, x²}, D = {1, x, x², x³}.
∗(d) T : M₂ → M₂, T(X) = AX − XAᵀ, B = D is the standard basis of M₂, and A = [ 1 2 ; 3 4 ].

1.2.7 The dimension of a vector space depends on its ground field. What is the dimension of Cⁿ as a vector space over C? What is its dimension as a vector space over R?

∗∗1.2.8 Consider the symmetric matrices
S_{i,i} = e_i e_iᵀ for 1 ≤ i ≤ n, and
S_{i,j} = (e_i + e_j)(e_i + e_j)ᵀ for 1 ≤ i < j ≤ n.
Show that B = {S_{i,j} : 1 ≤ i ≤ j ≤ n} is a basis of M^S_n. [Side note: This basis is useful because every member of it has rank 1.]

1.2.9 Let C = {1 + x, 1 + x + x², x + x²} be a basis of P². Find a basis B of P² for which
P_{C←B} = [ 1 1 2 ; 1 2 1 ; 2 1 1 ].

1.2.10 Use the method of Example 1.2.15 to compute the given indefinite integral.
(a) ∫ (e⁻ˣ + 2xe⁻ˣ) dx
∗(b) ∫ eˣ sin(2x) dx
(c) ∫ x sin(x) dx

1.2.11 Recall that we computed the standard matrix [T] of the transpose map on M₂ in Example 1.2.10.
(a) Compute [T]². [Hint: Your answer should be a well-known named matrix.]
(b) Explain how we could have gotten the answer to part (a) without actually computing [T].

∗∗1.2.12 Show that the standard matrix of the transpose map T : M_{m,n} → M_{n,m} with respect to the standard basis E = {E_{1,1}, E_{1,2}, . . . , E_{m,n}} is the mn × mn block matrix that, for all i and j, has E_{j,i} in its (i, j)-block:
[T] = [ E_{1,1} E_{2,1} · · · E_{m,1} ; E_{1,2} E_{2,2} · · · E_{m,2} ; ⋮ ⋮ ⋱ ⋮ ; E_{1,n} E_{2,n} · · · E_{m,n} ].

∗∗1.2.13 Suppose T : Mₙ → Mₙ is the transpose map. Show that 1 and −1 are the only eigenvalues of T, and the corresponding eigenspaces are the spaces M^S_n and M^sS_n of symmetric and skew-symmetric matrices, respectively. [Hint: It is probably easier to work directly with T rather than using its standard matrix like in Example 1.2.17.]

1.2.14 Suppose T : M_{2,3} → M_{3,2} is the transpose map. Explain why T does not have any eigenvalues or eigenvectors even though its standard matrix does.

∗1.2.15 Let T : P² → P² be the linear transformation defined by T(f(x)) = f(x + 1).
(a) Find the range and null space of T.
(b) Find all eigenvalues and corresponding eigenspaces of T.
(c) Find a square root of T.

1.2.16 In Example 1.2.19, we came up with a formula for the half derivatives D^{1/2}(sin(x)) and D^{1/2}(cos(x)). Generalize this by coming up with a formula for Dʳ(sin(x)) and Dʳ(cos(x)), where r ∈ R is arbitrary.

1.2.17 Let D : D → D be the derivative map.
(a) Find an eigenvector of D with corresponding eigenvalue λ = 1.
(b) Show that every real number λ ∈ R is an eigenvalue of D. [Hint: You can tweak the eigenvector that you found in part (a) slightly to change its eigenvalue.]

∗∗1.2.18 In Example 1.2.19, we computed a square root of the matrix [D]_B = [ 0 −1 ; 1 0 ] by considering how it acts on R² geometrically. Now find one of its square roots by diagonalizing it (over C).

1.2.19 Suppose V and W are vector spaces and let B = {v₁, v₂, . . . , vₙ} ⊂ V and C = {w₁, w₂, . . . , wₙ} ⊂ W be sets of vectors.
(a) Show that if B is a basis of V then the function T : V → W defined by T(c₁v₁ + · · · + cₙvₙ) = c₁w₁ + · · · + cₙwₙ is a linear transformation.
(b) Show that if C is also a basis of W then T is invertible.
(c) This definition of T may not make sense if B is not a basis of V; explain why.

1.2.20 Suppose V and W are vector spaces and T : V → W is an invertible linear transformation. Let B = {v₁, v₂, . . . , vₙ} ⊆ V be a set of vectors.
(a) Show that if B is linearly independent then so is T(B) = {T(v₁), T(v₂), . . . , T(vₙ)}.
(b) Show that if B spans V then T(B) spans W.
(c) Show that if B is a basis of V then T(B) is a basis of W.
(d) Provide an example to show that none of the results of parts (a), (b), or (c) hold if T is not invertible.

∗∗1.2.21 Suppose V, W are vector spaces and T : V → W is a linear transformation. Show that if T is invertible then dim(V) = dim(W).

∗∗1.2.22 Let V be a finite-dimensional vector space over a field F and suppose that B is a basis of V.
(a) Show that [v + w]_B = [v]_B + [w]_B for all v, w ∈ V.
(b) Show that [cv]_B = c[v]_B for all v ∈ V and c ∈ F.
(c) Suppose v, w ∈ V. Show that [v]_B = [w]_B if and only if v = w.
[Side note: This means that the function T : V → Fⁿ defined by T(v) = [v]_B is an invertible linear transformation.]

∗∗1.2.23 Let V be an n-dimensional vector space over a field F and suppose that B is a basis of V.
(a) Show that a set {v₁, v₂, . . . , vₘ} ⊂ V is linearly independent if and only if {[v₁]_B, [v₂]_B, . . . , [vₘ]_B} ⊂ Fⁿ is linearly independent.
(b) Show that span(v₁, v₂, . . . , vₘ) = V if and only if span([v₁]_B, [v₂]_B, . . . , [vₘ]_B) = Fⁿ.
(c) Show that a set {v₁, v₂, . . . , vₙ} is a basis of V if and only if {[v₁]_B, [v₂]_B, . . . , [vₙ]_B} is a basis of Fⁿ.

∗∗1.2.24 Suppose V and W are vector spaces and T : V → W is a linear transformation.
(a) Show that range(T) is a subspace of W.
(b) Show that null(T) is a subspace of V.

∗∗1.2.25 Suppose V and W are finite-dimensional vector spaces with bases B and D, respectively, and T : V → W is a linear transformation.
(a) Show that range(T) = {w ∈ W : [w]_D ∈ range([T]_{D←B})}.
(b) Show that null(T) = {v ∈ V : [v]_B ∈ null([T]_{D←B})}.
(c) Show that rank(T) = rank([T]_{D←B}).
(d) Show that nullity(T) = nullity([T]_{D←B}).

1.2.26 Suppose B is a subset of a finite-dimensional vector space V.
(a) Show that if B is linearly independent then there is a basis C of V with B ⊆ C ⊆ V.
(b) Show that if B spans V then there is a basis C of V with C ⊆ B ⊆ V.

∗∗1.2.27 Suppose B is a subset of a finite-dimensional vector space V consisting of dim(V) vectors.
(a) Show that B is a basis of V if and only if it is linearly independent.
(b) Show that B is a basis of V if and only if it spans V.

∗∗1.2.28 Suppose V and W are vector spaces, and let L(V, W) be the set of linear transformations T : V → W.
(a) Show that L(V, W) is a vector space.
(b) If dim(V) = n and dim(W) = m, what is dim(L(V, W))?

1.2.29 Let B and C be bases of a finite-dimensional vector space V. Show that P_{C←B} = I if and only if B = C.

∗∗1.2.30 Let B be a basis of an n-dimensional vector space V and let T : V → V be a linear transformation. Show that [T]_B = Iₙ if and only if T = I_V.

∗∗1.2.31 Suppose that V is a vector space and W ⊆ V is a subspace with dim(W) = dim(V).
(a) Show that if V is finite-dimensional then W = V.
(b) Provide an example to show that the conclusion of part (a) does not necessarily hold if V is infinite-dimensional.

∗∗1.2.32 Show that the "right shift" map R from Remark 1.2.2 is indeed a linear transformation.

1.2.33 Show that the "right shift" map R from Remark 1.2.2 has no eigenvalues or eigenvectors, regardless of the field F. [Side note: Recall that every square matrix over C has an eigenvalue and eigenvector, so the same is true of every linear transformation acting on a finite-dimensional complex vector space. This exercise shows that this claim no longer holds in infinite dimensions.]

∗∗1.2.34 Let Pᵖ(F) denote the vector space of polynomials of degree ≤ p acting on the field F (and with coefficients from F). We noted earlier that dim(Pᵖ(F)) = p + 1 when F = R or F = C. Show that dim(P²(Z₂)) = 2 (not 3), where Z₂ = {0, 1} is the finite field with 2 elements (see Appendix A.4).

∗∗1.2.35 Complete the proof of Theorem 1.2.6 by showing that the standard matrix [T]_{D←B} is unique.

∗∗1.2.36 Prove part (b) of Theorem 1.2.1. That is, show that if a vector space V has a basis B consisting of n vectors, then any set C with fewer than n vectors cannot span V.

1.3 Isomorphisms and Linear Forms

Now that we are familiar with how most of the linear algebraic objects from
Rn (e.g., subspaces, linear independence, bases, linear transformations, and
eigenvalues) generalize to vector spaces in general, we take a bit of a detour to
discuss some ideas that it did not really make sense to talk about when Rn was
the only vector space in sight.

1.3.1 Isomorphisms
Recall that every finite-dimensional vector space V has a basis B, and we can
use that basis to represent a vector v ∈ V as a coordinate vector [v]B ∈ Fn ,
where F is the ground field. We used this correspondence between V and
Fn to motivate the idea that these vector spaces are “the same” in the sense
that, in order to do a linear algebraic calculation in V, we can instead do the
corresponding calculation on coordinate vectors in Fn .
We now make this idea of vector spaces being “the same” a bit more precise
and clarify under exactly which conditions this “sameness” happens.

Definition 1.3.1 (Isomorphisms). Suppose V and W are vector spaces over the same field. We say that V and W are isomorphic, denoted by V ≅ W, if there exists an invertible linear transformation T : V → W (called an isomorphism from V to W).

We can think of isomorphic vector spaces as having the same structure and
the same vectors as each other, but different labels on those vectors. This is
perhaps easiest to illustrate by considering the vector spaces M1,2 and M2,1

of row and column vectors, respectively. Vectors (i.e., matrices) in these vector spaces have the forms

    [ a b ] ∈ M_{1,2}   and   [ a ; b ] ∈ M_{2,1}.

The fact that we write the entries of vectors in M_{1,2} in a row whereas we write those from M_{2,1} in a column is often just as irrelevant as if we used a different font when writing the entries of the vectors in one of these vector spaces. Indeed, vector addition and scalar multiplication in these spaces are both performed entrywise, so it does not matter how we arrange or order those entries. (The expression "the map is not the territory" seems relevant here: we typically only care about the underlying vectors, not the form we use to write them down.)

To formally see that M_{1,2} and M_{2,1} are isomorphic, we just construct the "obvious" isomorphism between them: the transpose map T : M_{1,2} → M_{2,1} satisfies

    T([ a b ]) = [ a ; b ].

Furthermore, we already noted in Example 1.2.6 that the transpose is a linear transformation, and it is clearly invertible (it is its own inverse, since transposing twice gets us back to where we started), so it is indeed an isomorphism. The same argument works to show that each of Fⁿ, M_{n,1}, and M_{1,n} are isomorphic.

Remark 1.3.1 (Column Vectors and Row Vectors). The fact that Fⁿ, M_{n,1}, and M_{1,n} are isomorphic justifies something that is typically done right from the beginning in linear algebra: treating members of Fⁿ (vectors), members of M_{n,1} (column vectors), and members of M_{1,n} (row vectors) as the same thing.

When we talk about isomorphisms and vector spaces being "the same", we only mean with respect to the 10 defining properties of vector spaces (i.e., properties based on vector addition and scalar multiplication). We can add column vectors in the exact same way that we add row vectors, and similarly scalar multiplication works the exact same for those two types of vectors. However, other operations like matrix multiplication may behave differently on these two sets (e.g., if A ∈ Mₙ then Ax makes sense when x is a column vector, but not when it is a row vector). (A "morphism" is a function and the prefix "iso" means "identical"; the word "isomorphism" thus means a function that keeps things "the same".)
As an even simpler example of an isomorphism, we have implicitly been using one when we say things like vᵀw = v · w for all v, w ∈ Rⁿ. Indeed, the quantity v · w is a scalar in R, whereas vᵀw is actually a 1 × 1 matrix (after all, it is obtained by multiplying a 1 × n matrix by an n × 1 matrix), so it does not quite make sense to say that they are "equal" to each other. However, the spaces R and M_{1,1}(R) are trivially isomorphic, so we typically sweep this technicality under the rug.

Before proceeding, it is worthwhile to specifically point out some basic properties of isomorphisms that follow almost immediately from facts that we already know about (invertible) linear transformations in general (we prove these two properties in Exercise 1.3.6):
• If T : V → W is an isomorphism then so is T⁻¹ : W → V.
• If T : V → W and S : W → X are isomorphisms then so is S ◦ T : V → X. In particular, if V ≅ W and W ≅ X then V ≅ X.

For example, we essentially showed in Example 1.2.1 that M₂ ≅ R⁴ and R⁴ ≅ P³, so it follows that M₂ ≅ P³ as well. The fact that these vector spaces

are isomorphic can be demonstrated by noting that the functions T : M₂ → R⁴ and S : R⁴ → P³ defined by

    T([ a b ; c d ]) = (a, b, c, d)   and   S(a, b, c, d) = a + bx + cx² + dx³

are clearly isomorphisms (i.e., invertible linear transformations).

Example 1.3.1 (Isomorphism of a Space of Functions). Show that the vector spaces V = span(eˣ, xeˣ, x²eˣ) and R³ are isomorphic.

Solution:
The standard way to show that two spaces are isomorphic is to construct an isomorphism between them. To this end, consider the linear transformation T : R³ → V defined by

    T(a, b, c) = aeˣ + bxeˣ + cx²eˣ.

(Our method of coming up with this map T is very naïve: just send a basis of R³ to a basis of V. This technique works fairly generally and is a good way of coming up with "obvious" isomorphisms.)

It is straightforward to show that this function is a linear transformation, so we just need to convince ourselves that it is invertible. To this end, we recall from Exercise 1.1.2(g) that B = {eˣ, xeˣ, x²eˣ} is linearly independent and thus a basis of V, so we can construct the standard matrix [T]_{B←E}, where E = {e₁, e₂, e₃} is the standard basis of R³:

    [T]_{B←E} = [ [T(1, 0, 0)]_B | [T(0, 1, 0)]_B | [T(0, 0, 1)]_B ] = [ [eˣ]_B | [xeˣ]_B | [x²eˣ]_B ] = [ 1 0 0 ; 0 1 0 ; 0 0 1 ].

Since [T]_{B←E} is clearly invertible (the identity matrix is its own inverse), T is invertible too and is thus an isomorphism.

Example 1.3.2 (Polynomials are Isomorphic to Eventually-Zero Sequences). Show that the vector space of polynomials P and the vector space of eventually-zero sequences c₀₀ from Example 1.1.10 are isomorphic.

Solution:
As always, our method of showing that these two spaces are isomorphic is to explicitly construct an isomorphism between them. As with the previous examples, there is an "obvious" choice of isomorphism T : P → c₀₀, and it is defined by

    T(a₀ + a₁x + a₂x² + · · · + aₚxᵖ) = (a₀, a₁, a₂, . . . , aₚ, 0, 0, . . .).

It is straightforward to show that this function is a linear transformation, so we just need to convince ourselves that it is invertible. To this end, we just explicitly construct its inverse T⁻¹ : c₀₀ → P:

    T⁻¹(a₀, a₁, a₂, a₃, . . .) = a₀ + a₁x + a₂x² + a₃x³ + · · ·

At first glance, it might seem like the sum on the right is infinite and thus not a polynomial, but recall that every sequence in c₀₀ has only finitely many non-zero entries, so there is indeed some final non-zero term aₚxᵖ in the sum. (If we worked with Fᴺ here instead of c₀₀, this inverse would not work, since the sum a₀ + a₁x + a₂x² + · · · might have infinitely many terms and thus not be a polynomial.) Furthermore, it is straightforward to check that

    (T⁻¹ ◦ T)(a₀ + a₁x + · · · + aₚxᵖ) = a₀ + a₁x + · · · + aₚxᵖ   and
    (T ◦ T⁻¹)(a₀, a₁, a₂, a₃, . . .) = (a₀, a₁, a₂, a₃, . . .)

for all a₀ + a₁x + · · · + aₚxᵖ ∈ P and all (a₀, a₁, a₂, a₃, . . .) ∈ c₀₀, so T is indeed invertible and thus an isomorphism.

As a generalization of Example 1.3.1 and many of the earlier observations that


we made, we now note that coordinate vectors let us immediately conclude that
every n-dimensional vector space over a field F is isomorphic to Fn .

Theorem 1.3.1 (Isomorphisms of Finite-Dimensional Vector Spaces). Suppose V is an n-dimensional vector space over a field F. Then V ≅ Fⁿ.

Proof. We just recall from Exercise 1.2.22 that if B is any basis of V then the function T : V → Fⁿ defined by T(v) = [v]_B is an invertible linear transformation (i.e., an isomorphism). ∎
In particular, the above theorem tells us that any two vector spaces of the
same (finite) dimension over the same field are necessarily isomorphic, since
they are both isomorphic to Fn and thus to each other. The following corollary
states this observation precisely and also establishes its converse.

Corollary 1.3.2 (Finite-Dimensional Vector Spaces are Isomorphic). Suppose V and W are vector spaces over the same field and V is finite-dimensional. Then V ≅ W if and only if dim(V) = dim(W).

Proof. We already explained how Theorem 1.3.1 gives us the "if" direction, so we now prove the "only if" direction. To this end, we just note that if V ≅ W then there is an invertible linear transformation T : V → W, so Exercise 1.2.21 tells us that dim(V) = dim(W). ∎

Remark 1.3.2 (Why Isomorphisms?). In a sense, we did not actually do anything new in this subsection: we already knew about linear transformations and invertibility, so it seems natural to wonder why we would bother adding the "isomorphism" layer of terminology on top of it.

While it's true that there's nothing really "mathematically" new about isomorphisms, the important thing is the new perspective that it gives us. It is very useful to be able to think of vector spaces as being the same as each other, as it can provide us with new intuition or cut down the amount of work that we have to do. (An isomorphism T : V → V, i.e., from a vector space back to itself, is called an automorphism.)

For example, instead of having to do a computation or think about vector space properties in P³ or M₂, we can do all of our work in R⁴, which is likely a fair bit more intuitive. Similarly, we can always work with whichever of c₀₀ (the space of eventually-zero sequences) or P (the space of polynomials) we prefer, since an answer to any linear algebraic question in one of those spaces can be straightforwardly converted into an answer to the corresponding question in the other space via the isomorphism that we constructed in Example 1.3.2.

More generally, isomorphisms are used throughout all of mathematics, not just in linear algebra. (If it is necessary to clarify which type of isomorphism we are talking about, we call an isomorphism in the linear algebra sense a vector space isomorphism.) In general, they are defined to be invertible

maps that preserve whatever the relevant structures or operations are.


In our setting, the relevant operations are scalar multiplication and vector
addition, and those operations being preserved is exactly equivalent to the
invertible map being a linear transformation.

1.3.2 Linear Forms

One of the simplest types of linear transformations are those that send vectors to scalars. For example, it is straightforward to check that the functions f₁, f₂, . . . , fₙ : Rⁿ → R defined by

    f₁(v) = v₁,  f₂(v) = v₂,  . . . ,  fₙ(v) = vₙ,

where v = (v₁, v₂, . . . , vₙ), are linear transformations. We now give this type of linear transformation a name to make it easier to discuss.

Definition 1.3.2 (Linear Forms). Suppose V is a vector space over a field F. Then a linear transformation f : V → F is called a linear form.

Linear forms (sometimes instead called linear functionals) can be thought of as giving us snapshots of vectors: knowing the value of f(v) tells us what v looks like from one particular direction or angle (just like having a photograph tells us what an object looks like from one side), but not necessarily what it looks like as a whole. For example, the linear transformations f₁, f₂, . . ., fₙ described above each give us one of v's coordinates (i.e., they tell us what v looks like in the direction of one of the standard basis vectors), but tell us nothing about its other coordinates.

Alternatively, linear forms can be thought of as the building blocks that make up more general linear transformations. For example, consider the linear transformation T : R² → R² (which is not a linear form) defined by

    T(x, y) = (x + 2y, 3x − 4y)   for all (x, y) ∈ R².

If we define f(x, y) = x + 2y and g(x, y) = 3x − 4y then it is straightforward to check that f and g are each linear forms, and T(x, y) = (f(x, y), g(x, y)). That is, T just outputs the value of two linear forms. Similarly, every linear transformation into an n-dimensional vector space can be thought of as being made up of n linear forms (one for each of the n output dimensions).

Example 1.3.3 ((Half of) the Dot Product is a Linear Form). Suppose v ∈ Fⁿ is a fixed vector. Show that the function f_v : Fⁿ → F defined by

    f_v(w) = v₁w₁ + v₂w₂ + · · · + vₙwₙ   for all w ∈ Fⁿ

is a linear form.

Solution:
This follows immediately from the more general fact that multiplication by a matrix is a linear transformation. Indeed, if we let A = v ∈ M_{1,n} be v written as a row vector, then f_v(w) = Aw for all column vectors w.

In particular, if F = R then the previous example tells us that (recall that the dot product on Rⁿ is defined by v · w = v₁w₁ + · · · + vₙwₙ)

    f_v(w) = v · w   for all w ∈ Rⁿ

is a linear form. This is actually the “standard” example of a linear form, and
the one that we should keep in mind as our intuition builder. We will see shortly
that every linear form on a finite-dimensional vector space can be written in this
way (in the exact same sense that every linear transformation can be written as
a matrix).

Example 1.3.4 (The Trace is a Linear Form)
Show that the trace tr : Mn (F) → F is a linear form.
Solution:
We already showed in Example 1.2.7 that the trace is a linear transformation. Since it outputs scalars, it is necessarily a linear form.

Example 1.3.5 (Evaluation is a Linear Form)
Show that the function E2 : P 3 → R defined by E2 ( f ) = f (2) is a linear form.
Solution:
We just need to show that E2 is a linear transformation, since it is clear
that its output is (by definition) always a scalar. We thus check the two
properties of Definition 1.2.4:
a) E2 ( f + g) = ( f + g)(2) = f (2) + g(2) = E2 ( f ) + E2 (g).
b) E2 (c f ) = (c f )(2) = c f (2) = cE2 ( f ).
[Side note: E2 is called the evaluation map at x = 2. More generally, the function Ex : P → R defined by Ex ( f ) = f (x) is also a linear form, regardless of the value of x ∈ R.]
The steps used in Example 1.3.5 might seem somewhat confusing at first, since we are applying a function (E2 ) to functions ( f and g). It is very important to be careful when working through this type of problem to make sure that the
correct type of object is being fed into each function (e.g., E2 ( f ) makes sense,
but E2 (4) does not, since E2 takes a polynomial as its input).
Also, that example highlights our observation that linear forms give us one
linear “piece of information” about vectors. In this case, knowing the value of
E2 ( f ) = f (2) tells us a little bit about the polynomial (i.e., the vector) f . If we
also knew the value of three other linear forms on f ∈ P 3 (e.g., f (1), f (3), and
f (4)), we could use polynomial interpolation to reconstruct f itself.
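As an added illustration of that last remark (not from the book; a hypothetical Python/NumPy sketch), the four values f (1), f (2), f (3), f (4) really do determine a cubic: solving the corresponding Vandermonde system recovers its coefficients.

  import numpy as np

  # A "hidden" cubic f(x) = 3 - x + 2x^2 + 5x^3, stored as coefficients (a, b, c, d).
  coeffs = np.array([3.0, -1.0, 2.0, 5.0])
  f = lambda x: coeffs @ np.array([1, x, x**2, x**3])

  # The values of the four linear forms E1, E2, E3, E4 applied to f.
  points = np.array([1.0, 2.0, 3.0, 4.0])
  values = np.array([f(x) for x in points])

  # Polynomial interpolation: solve V @ coeffs = values for the coefficients.
  V = np.vander(points, 4, increasing=True)    # rows are (1, x, x^2, x^3)
  recovered = np.linalg.solve(V, values)
  print(np.allclose(recovered, coeffs))        # True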

Example 1.3.6 (Integration is a Linear Form)
Let C[a, b] be the vector space of continuous real-valued functions on the interval [a, b]. Show that the function I : C[a, b] → R defined by
I( f ) = ∫_a^b f (x) dx
is a linear form.
[Side note: Recall that every continuous function is integrable, so this linear form makes sense.]
Solution:
We just need to show that I is a linear transformation, since it is (yet again) clear that its output is always a scalar. We thus check the two properties of Definition 1.2.4:
a) By properties of integrals that are typically covered in calculus courses, we know that for all f , g ∈ C[a, b] we have
I( f + g) = ∫_a^b ( f + g)(x) dx = ∫_a^b f (x) dx + ∫_a^b g(x) dx = I( f ) + I(g).
b) We similarly know that we can pull scalars in and out of integrals:
I(c f ) = ∫_a^b (c f )(x) dx = c ∫_a^b f (x) dx = cI( f )
for all f ∈ C[a, b] and c ∈ R.
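A numerical version of this check is easy to run. The sketch below is an added illustration (not from the book; it assumes Python with NumPy): since any fixed quadrature rule is itself a sum, linearity holds essentially exactly even for a crude approximation of the integral.

  import numpy as np

  # Midpoint Riemann sum on [a, b], used here as a stand-in for the integral.
  a, b = 0.0, 1.0
  n = 10000
  dx = (b - a) / n
  x = np.linspace(a, b, n, endpoint=False) + dx / 2
  I = lambda func: np.sum(func(x)) * dx

  f = lambda t: np.sin(3 * t)
  g = lambda t: t**2 + 1
  c = 2.5

  print(np.isclose(I(lambda t: f(t) + g(t)), I(f) + I(g)))   # True
  print(np.isclose(I(lambda t: c * f(t)), c * I(f)))         # True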

We now pin down the claim that we made earlier that every linear form
on a finite-dimensional vector space looks like one half of the dot product. In
particular, to make this work we just do what we always do when we want to
make abstract vector space concepts more concrete—we represent vectors as
coordinate vectors with respect to some basis.

Theorem 1.3.3 (The Form of Linear Forms)
Let B be a basis of a finite-dimensional vector space V over a field F, and let f : V → F be a linear form. Then there exists a unique vector v ∈ V such that
f (w) = [v]TB [w]B for all w ∈ V,
where we are treating [v]B and [w]B as column vectors.

Proof. Since f is a linear transformation, Theorem 1.2.6 tells us that it has a


standard matrix—a matrix A such that f (w) = A[w]B for all w ∈ V. Since f
maps into F, which is 1-dimensional, the standard matrix A is 1 × n, where
n = dim(V). It follows that A is a row vector, and since every vector in Fn is
the coordinate vector of some vector in V, we can find some v ∈ V such that
A = [v]TB , so that f (w) = [v]TB [w]B .
Uniqueness of v follows immediately from uniqueness of standard matrices
and of coordinate vectors. 
In the special case when F = R or F = C, it makes sense to talk about the
dot product of the coordinate vectors [v]B and [w]B , and the above theorem can
be rephrased as saying that there exists a unique vector v ∈ V such that
f (w) = [v]B · [w]B for all w ∈ V.
[Side note: Recall that the dot product on Cn is defined by v · w = v̄1 w1 + · · · + v̄n wn .]
The only thing to be slightly careful of here is that if F = C then [v]B · [w]B =
[v]∗B [w]B (not [v]B · [w]B = [v]TB [w]B ), so we have to absorb a complex conjugate
into the vector v to make this reformulation work.

Example 1.3.7 (The Evaluation Map as a Dot Product)
Let E2 : P 3 → R be the evaluation map from Example 1.3.5, defined by E2 ( f ) = f (2), and let E = {1, x, x2 , x3 } be the standard basis of P 3 . Find a polynomial g ∈ P 3 such that E2 ( f ) = [g]E · [ f ]E for all f ∈ P 3 .
[Side note: This example just illustrates how Theorem 1.3.3 works out for the linear form E2 .]
Solution:
If we write f (x) = a + bx + cx2 + dx3 (i.e., [ f ]E = (a, b, c, d)) then
E2 ( f ) = f (2) = a + 2b + 4c + 8d = (1, 2, 4, 8) · (a, b, c, d).
It follows that we want to choose g ∈ P 3 so that [g]E = (1, 2, 4, 8). In other


words, we want g(y) = 1 + 2y + 4y2 + 8y3 .
Slightly more generally, for all x ∈ R, the evaluation map Ex ( f ) = f (x)
can be represented in this sense via the polynomial gx (y) = 1 + xy + x2 y2 +
x3 y3 (see Exercise 1.3.21).
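A quick numerical version of this example (an added illustration, not from the book; it assumes Python with NumPy): dotting the coordinate vector (1, 2, 4, 8) of g with the coordinate vector of f recovers f (2).

  import numpy as np

  # Coordinates of f(x) = a + bx + cx^2 + dx^3 in the standard basis E.
  f_coords = np.array([4.0, -1.0, 3.0, 2.0])     # f(x) = 4 - x + 3x^2 + 2x^3
  g_coords = np.array([1.0, 2.0, 4.0, 8.0])      # [g]_E for g(y) = 1 + 2y + 4y^2 + 8y^3

  f = lambda x: f_coords @ np.array([1, x, x**2, x**3])

  print(f(2.0))                  # 30.0
  print(g_coords @ f_coords)     # 30.0, i.e. E_2(f) = [g]_E . [f]_E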

The Dual Space
Theorem 1.3.3 tells us that every linear form on V corresponds to a particular
vector in V, at least when V is finite-dimensional, so it seems like there is
an isomorphism lurking in the background here. We need to make one more
definition before we can discuss this isomorphism properly.

Definition 1.3.3 (Dual of a Vector Space)
Let V be a vector space over a field F. Then the dual of V, denoted by V ∗ , is the vector space consisting of all linear forms on V.
The fact that V ∗ is indeed a vector space is established by Exercise 1.2.28(a),
and part (b) of that same exercise even tells us that dim(V ∗ ) = dim(V) when
V is finite-dimensional, so V and V ∗ are isomorphic by Corollary 1.3.2. In
fact, one simple isomorphism from V ∗ to V is exactly the one that sends a
linear form f to its corresponding vector v from Theorem 1.3.3. However, this
isomorphism between V and V ∗ is somewhat strange, as it depends on the
particular choice of basis that we make on V—if we change the basis B in
Theorem 1.3.3 then the vector v corresponding to a linear form f changes as
well.
The fact that the isomorphism between V and V ∗ is basis-dependent sug-
gests that something somewhat unnatural is going on, as many (even finite-
dimensional) vector spaces do not have a “natural” or “standard” choice of
basis. However, if we go one step further and consider the double-dual space
V ∗∗ consisting of linear forms acting on V ∗ then things become a bit more well-behaved, so we now briefly explore this double-dual space.
[Side note: Yes, it is pretty awkward to think about what the members of V ∗∗ are. They are linear forms acting on linear forms (i.e., functions of functions).]
Using the exact same ideas as earlier, if V is finite-dimensional then we still have dim(V ∗∗ ) = dim(V ∗ ) = dim(V), so all three of these vector spaces are isomorphic. However, V and V ∗∗ are isomorphic in a much more natural way than V and V ∗ , since there is a basis-independent choice of isomorphism
between them. To see what it is, notice that for every vector v ∈ V we can
define a linear form φv ∈ V ∗∗ by

φv ( f ) = f (v) for all f ∈ V ∗. (1.3.1)


[Side note: This spot right here is exactly as abstract as this book gets. It’s been a slow climb to this point, and from here on it’s downhill back into more concrete things.]
Showing that φv is a linear form does not require any clever insight—we just have to check the two defining properties from Definition 1.2.4, and each of these properties follows almost immediately from the relevant definitions. The hard part of this verification is keeping the notation straight and making sure that the correct type of object goes into and comes out of each function at every step:
a) For all f , g ∈ V ∗ we have

φv ( f + g) = ( f + g)(v) (definition of φv )
= f (v) + g(v) (definition of “ + ” in V ∗ )
= φv ( f ) + φv (g) (definition of φv )
b) Similarly, for all c ∈ F and f ∈ V ∗ we have



φv (c f ) = (c f )(v) = c f (v) = cφv ( f ).

Example 1.3.8 (The Double-Dual of Polynomials)
Show that for each φ ∈ (P 3 )∗∗ there exists f ∈ P 3 such that
φ (Ex ) = f (x) for all x ∈ R,

where Ex ∈ (P 3 )∗ is the evaluation map at x ∈ R (see Example 1.3.5).


Solution:
We know from Exercise 1.3.22 that B = {E1 , E2 , E3 , E4 } is a basis of
(P 3 )∗ , so for every x ∈ R there exist scalars c1,x , . . . , c4,x such that

Ex = c1,x E1 + c2,x E2 + c3,x E3 + c4,x E4 .

We then expand the quantities

φ (Ex ) = φ (c1,x E1 + c2,x E2 + c3,x E3 + c4,x E4 )


= c1,x φ (E1 ) + c2,x φ (E2 ) + c3,x φ (E3 ) + c4,x φ (E4 )

and

f (x) = Ex ( f ) = (c1,x E1 + c2,x E2 + c3,x E3 + c4,x E4 )( f )
= c1,x E1 ( f ) + c2,x E2 ( f ) + c3,x E3 ( f ) + c4,x E4 ( f )
= c1,x f (1) + c2,x f (2) + c3,x f (3) + c4,x f (4)

for all f ∈ P 3 and x ∈ R.


If we choose f so that f (1) = φ (E1 ), f (2) = φ (E2 ), f (3) = φ (E3 ), and f (4) = φ (E4 ) (which can be done by polynomial interpolation) then we find that φ (Ex ) = f (x) for all x ∈ R, as desired.
[Side note: In fact, f is uniquely determined by φ .]

The above example is suggestive of a natural isomorphism between V and


V ∗∗ : if V = P 3 then every φ ∈ (P 3 )∗∗ looks the exact same as a polynomial
in P 3 ; we just have to relabel the evaluation map Ex as x itself. The following
theorem pins down how this “natural” isomorphism between V and V ∗∗ works
for other vector spaces.

Theorem 1.3.4 (Canonical Double-Dual Isomorphism)
The function T : V → V ∗∗ defined by T (v) = φv is an isomorphism, where φv ∈ V ∗∗ is as defined in Equation (1.3.1).

Proof. We must show that T is linear and invertible, and we again do not have
to be clever to do so. Rather, as long as we keep track of what space all of these
objects live in and parse the notation carefully, then linearity and invertibility of T follow almost immediately from the relevant definitions.
[Side note: Yes, T is a function that sends vectors to functions of functions of vectors. If you recall that V itself might be a vector space made up of functions then your head might explode.]
Before showing that T is linear, we first make a brief note on notation. Since T maps into V ∗∗ , we know that T (v) is a function acting on V ∗ . We thus use the (admittedly unfortunate) notation T (v)( f ) to refer to the scalar value that results from applying the function T (v) to f ∈ V ∗ . Once this notational nightmare is understood, the proof of linearity is straightforward:
explode.
a) For all f ∈ V ∗ we have

T (v + w)( f ) = φv+w ( f ) (definition of T )


= f (v + w) (definition of φv+w )
= f (v) + f (w) (linearity of each f ∈ V ∗ )
= φv ( f ) + φw ( f ) (definition of φv and φw )
= T (v)( f ) + T (w)( f ). (definition of T )

Since T (v + w)( f ) = (T (v) + T (w))( f ) for all f ∈ V ∗ , we conclude that


T (v + w) = T (v) + T (w).
[Side note: If dim(V) = ∞, invertibility of T fails. Even if V has a basis, it is never the case that any of V, V ∗ , and V ∗∗ are isomorphic (V ∗∗ is “bigger” than V ∗ , which is “bigger” than V).]
b) Similarly,
T (cv)( f ) = φcv ( f ) = f (cv) = c f (v) = cφv ( f ) = cT (v)( f )
for all f ∈ V ∗ , so T (cv) = cT (v) for all c ∈ F.
For invertibility, we claim that if B = {v1 , v2 , . . . , vn } is linearly independent then so is C = {T (v1 ), T (v2 ), . . . , T (vn )} (this claim is pinned down in Exercise 1.3.24). Since C contains n = dim(V) = dim(V ∗∗ ) vectors, it must
be a basis of V ∗∗ by Exercise 1.2.27(a). It follows that [T ]C←B = I, which is
invertible, so T is invertible as well. 

This double-dual space V ∗∗ and its correspondence with V likely still seems
quite abstract, so it is useful to think about what it means when V = Fn , which
we typically think of as consisting of column vectors. Theorem 1.3.3 tells us that each f ∈ (Fn )∗ corresponds to some row vector vT (in the sense that f (w) = vT w for all w ∈ Fn ). Theorem 1.3.4 says that if we go one step further, then each φ ∈ (Fn )∗∗ corresponds to some column vector w ∈ Fn in the sense that φ ( f ) = f (w) = vT w for all f ∈ V ∗ (i.e., for all vT ).
[Side note: The close relationship between V and V ∗∗ is why we use the term “dual space” in the first place—duality refers to an operation or concept that, when applied a second time, gets us back to where we started.]
For this reason, it is convenient (and for the most part, acceptable) to think of V as consisting of column vectors and V ∗ as consisting of the corresponding row vectors. In fact, this is exactly why we use the notation V ∗ for the dual
space in the first place—it is completely analogous to taking the (conjugate)
transpose of the vector space V. The fact that V ∗ is isomorphic to V, but in a
way that depends on the particular basis chosen, is analogous to the fact that if
v ∈ Fn is a column vector then v and vT have the same size (and entries) but
not shape, and the fact that V ∗∗ is so naturally isomorphic to V is analogous to
the fact that (vT )T and v have the same size and shape (and are equal).

Remark 1.3.3 (Linear Forms Versus Vector Pairings)
While V ∗ is defined as a set of linear forms on V, this can be cumbersome to think about once we start considering vector spaces like V ∗∗ (and, heaven forbid, V ∗∗∗ ), as it is somewhat difficult to make sense of what a
function of a function (of a function...) “looks like”.
Instead, notice that if w ∈ V and f ∈ V ∗ then the expression f (w) is
linear in each of w and f , so we can think of it just as combining members
of two vector spaces together in a linear way, rather than as members of
one vector space acting on the members of another. One way of making
this observation precise is via Theorem 1.3.3, which says that applying a
linear form f ∈ V ∗ to w ∈ V is the same as taking the dot product of two
vectors [v]B and [w]B (at least in the finite-dimensional case).
1.3.3 Bilinearity and Beyond
As suggested by Remark 1.3.3, we are often interested not just in applying
linear functions to vectors, but also in combining vectors from different vector
spaces together in a linear way. We now introduce a way of doing exactly this.

Definition 1.3.4 (Bilinear Forms)
Suppose V and W are vector spaces over the same field F. Then a function f : V × W → F is called a bilinear form if it satisfies the following properties:
a) It is linear in its first argument:
i) f (v1 + v2 , w) = f (v1 , w) + f (v2 , w) and
ii) f (cv1 , w) = c f (v1 , w) for all c ∈ F, v1 , v2 ∈ V, and w ∈ W.
b) It is linear in its second argument:
i) f (v, w1 + w2 ) = f (v, w1 ) + f (v, w2 ) and
ii) f (v, cw1 ) = c f (v, w1 ) for all c ∈ F, v ∈ V, and w1 , w2 ∈ W.
[Side note: Recall that the notation f : V × W → F means that f takes two vectors as input—one from V and one from W—and provides a scalar from F as output.]
While the above definition might seem like a mouthful, it simply says that f is a bilinear form exactly if it becomes a linear form when one of its inputs is held constant. That is, for every fixed vector w ∈ W the function gw : V → F defined by gw (v) = f (v, w) is a linear form, and similarly for every fixed vector
v ∈ V the function hv : W → F defined by hv (w) = f (v, w) is a linear form.
Yet again, we look at some examples to try to get a feeling for what bilinear
forms look like.

Example 1.3.9 (The Real Dot Product is a Bilinear Form)
Show that the function f : Rn × Rn → R defined by
f (v, w) = v · w for all v, w ∈ Rn

is a bilinear form.
Solution:
We could work through the four defining properties of bilinear forms,
but an easier way to solve this problem is to recall from Example 1.3.3 that
the dot product is a linear form if we keep the first vector v fixed, which
establishes property (b) in Definition 1.3.4.
Since v·w = w·v, it is also the case that the dot product is a linear form
if we keep the second vector fixed, which in turn establishes property (a)
in Definition 1.3.4. It follows that the dot product is indeed a bilinear form.

[Side note: The function f (x, y) = xy is a bilinear form but not a linear transformation. Linear transformations must be linear “as a whole”, whereas bilinear forms just need to be linear with respect to each variable independently.]
The real dot product is the prototypical example of a bilinear form, so keep it in mind when working with bilinear forms abstractly to help make them seem a bit more concrete. Perhaps even more simply, notice that multiplication (of real numbers) is a bilinear form. That is, if we define a function f : R × R → R simply via f (x, y) = xy, then f is a bilinear form. This of course makes sense since multiplication of real numbers is just the one-dimensional dot product.
In order to simplify proofs that certain functions are bilinear forms, we can check linearity in the first argument by showing that f (v1 + cv2 , w) = f (v1 , w) + c f (v2 , w) for all c ∈ F, v1 , v2 ∈ V, and w ∈ W, rather than checking vector addition and scalar multiplication separately as in conditions (a)(i) and (a)(ii) of Definition 1.3.4 (and we similarly check linearity in the second
argument).
Example 1.3.10 (The Dual Pairing is a Bilinear Form)
Let V be a vector space over a field F. Show that the function g : V ∗ × V → F defined by
g( f , v) = f (v) for all f ∈ V ∗, v ∈ V

is a bilinear form.
Solution:
We just notice that g is linear in each of its input arguments individually.
For the first input argument, we have

g( f1 + c f2 , v) = ( f1 + c f2 )(v) = f1 (v) + c f2 (v) = g( f1 , v) + cg( f2 , v),

for all f1 , f2 ∈ V ∗ , v ∈ V, and c ∈ F simply from the definition of addition


and scalar multiplication of functions. Similarly, for the second input
argument we have

g( f , v1 + cv2 ) = f (v1 + cv2 ) = f (v1 ) + c f (v2 ) = g( f , v1 ) + cg( f , v2 ),

for all f ∈ V ∗ , v1 , v2 ∈ V, and c ∈ F since each f ∈ V ∗ is (by definition)


linear.

Example 1.3.11 (Matrices are Bilinear Forms)
Let A ∈ Mm,n (F) be a matrix. Show that the function f : Fm × Fn → F defined by
f (v, w) = vT Aw for all v ∈ Fm , w ∈ Fn

is a bilinear form.
Solution:
[Side note: In this example (and as always, unless specified otherwise), F refers to an arbitrary field.]
Once again, we just check the defining properties from Definition 1.3.4, all of which follow straightforwardly from the corresponding properties of matrix multiplication:
a) For all v1 , v2 ∈ Fm , w ∈ Fn , and c ∈ F we have

f (v1 + cv2 , w) = (v1 + cv2 )T Aw


= vT1 Aw + cvT2 Aw = f (v1 , w) + c f (v2 , w).

b) Similarly, for all v ∈ Fm , w1 , w2 ∈ Fn , and c ∈ F we have

f (v, w1 + cw2 ) = vT A(w1 + cw2 )


= vT Aw1 + cvT Aw2 = f (v, w1 ) + c f (v, w2 ).

In fact, if F = R and A = I then the bilinear form f in Example 1.3.11


simplifies to f (v, w) = vT w = v · w, so we recover the fact that the real dot
product is bilinear. This example also provides us with a fairly quick way of
showing that certain functions are bilinear forms—if we can write them in
terms of a matrix in this way then bilinearity follows immediately.
Example 1.3.12 (A Numerical Bilinear Form)
Show that the function f : R2 × R2 → R defined by
f (v, w) = 3v1 w1 − 4v1 w2 + 5v2 w1 + v2 w2 for all v, w ∈ R2

is a bilinear form.
Solution:
We could check the defining properties from Definition 1.3.4, but an
easier way to show that f is bilinear is to notice that we can group its
coefficients into a matrix as follows:
f (v, w) = [ v1  v2 ] [ 3  −4 ] [ w1 ]
                      [ 5   1 ] [ w2 ].
[Side note: Notice that the coefficient of vi w j goes in the (i, j)-entry of the matrix. This always happens.]

It follows that f is a bilinear form, since we showed in Example 1.3.11


that any function of this form is.
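As an added spot check (not from the book; a hypothetical Python/NumPy sketch), the entry-by-entry formula above agrees with vT Aw for the matrix of coefficients.

  import numpy as np

  A = np.array([[3.0, -4.0], [5.0, 1.0]])

  def f(v, w):
      # The bilinear form written out entry by entry, as in the example.
      return 3*v[0]*w[0] - 4*v[0]*w[1] + 5*v[1]*w[0] + v[1]*w[1]

  rng = np.random.default_rng(0)
  v, w = rng.standard_normal(2), rng.standard_normal(2)
  print(np.isclose(f(v, w), v @ A @ w))    # True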

In fact, we now show that every bilinear form acting on finite-dimensional


vector spaces can be written in this way. Just like matrices can be used to
represent linear transformations, they can also be used to represent bilinear
forms.

Theorem 1.3.5 (The Form of Bilinear Forms)
Let B and C be bases of m- and n-dimensional vector spaces V and W, respectively, over a field F, and let f : V × W → F be a bilinear form. There exists a unique matrix A ∈ Mm,n (F) such that

f (v, w) = [v]TB A[w]C for all v ∈ V, w ∈ W,

where we are treating [v]B and [w]C as column vectors.

Proof. We just use the fact that bilinear forms are linear when we keep one
of their inputs constant, and we then leech off of the representation of linear
forms that we already know from Theorem 1.3.3.
Specifically, if we denote the vectors in the basis B by B = {v1 , v2 , . . . , vm }
then [v j ]B = e j for all 1 ≤ j ≤ m and Theorem 1.3.3 tells us that the linear form
g j : W → F defined by g j (w) = f (v j , w) can be written as g j (w) = aTj [w]C for some fixed (column) vector a j ∈ Fn . If we let A be the matrix with rows aT1 , . . ., aTm (i.e., AT = [ a1 | · · · | am ]) then
f (v j , w) = g j (w) = aTj [w]C = eTj A[w]C = [v j ]TB A[w]C for all 1 ≤ j ≤ m, w ∈ W.
[Side note: Here we use the fact that eTj A equals the j-th row of A (i.e., aTj ).]

To see that this same equation holds when we replace v j by an arbitrary


v ∈ V, we just use linearity in the first argument of f and the fact that every
v ∈ V can be written as a linear combination of the basis vectors from B (i.e.,
v = c1 v1 + · · · + cm vm for some c1 , . . ., cm ∈ F):
f (v, w) = f ( ∑_{ j=1}^m c j v j , w )
         = ∑_{ j=1}^m c j f (v j , w) = ∑_{ j=1}^m c j [v j ]TB A[w]C
         = ( ∑_{ j=1}^m c j e j )T A[w]C = [v]TB A[w]C

for all v ∈ V and w ∈ W.


Finally, to see that A is unique we just note that if C = {w1 , w2 , . . . , wn }
then f (vi , w j ) = [vi ]TB A[w j ]C = eTi Ae j = ai, j for all 1 ≤ i ≤ m and 1 ≤ j ≤ n,
so the entries of A are completely determined by f . 
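The uniqueness argument also gives a recipe for computing A: its (i, j)-entry is f (vi , w j ). The sketch below is an added illustration (not from the book; it assumes Python with NumPy) that applies this recipe to the bilinear form of Example 1.3.12 using the standard bases.

  import numpy as np

  def f(v, w):
      return 3*v[0]*w[0] - 4*v[0]*w[1] + 5*v[1]*w[0] + v[1]*w[1]

  e = np.eye(2)
  # a_{i,j} = f(e_i, e_j): evaluate the form on all pairs of basis vectors.
  A = np.array([[f(e[i], e[j]) for j in range(2)] for i in range(2)])
  print(A)      # [[ 3. -4.]
                #  [ 5.  1.]]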
As one particularly interesting example of how the above theorem works,
we recall that the determinant is multilinear in the columns of the matrix it
acts on, and thus in particular it can be interpreted as a bilinear form on 2 × 2
matrices. More specifically, if we define a function f : F2 × F2 → F via
 
f (v, w) = det([ v | w ])

then f is a bilinear form and thus can be represented by a 2 × 2 matrix.

Example 1.3.13 (The 2 × 2 Determinant as a Matrix)
Find the matrix A ∈ M2 (F) with the property that
det([ v | w ]) = vT Aw for all v, w ∈ F2 .

Solution:
Recall that det([ v | w ]) = v1 w2 − v2 w1 , while direct calculation
shows that

vT Aw = a1,1 v1 w1 + a1,2 v1 w2 + a2,1 v2 w1 + a2,2 v2 w2 .

By simply comparing these two expressions, we see that the unique matrix
A that makes them equal to each other has entries a1,1 = 0, a1,2 = 1, a2,1 = −1, and a2,2 = 0. That is,
det([ v | w ]) = vT Aw if and only if A = [ 0   1 ]
                                          [ −1  0 ].
[Side note: The fact that A is skew-symmetric (i.e., AT = −A) corresponds to the fact that swapping two columns of a matrix multiplies its determinant by −1 (i.e., det([ v | w ]) = − det([ w | v ])).]

Multilinear Forms
In light of Example 1.3.13, it might be tempting to think that the determinant
of a 3 × 3 matrix can be represented via a single fixed 3 × 3 matrix, but this
is not the case—the determinant of a 3 × 3 matrix is not a bilinear form, but
rather it is linear in the three columns of the input matrix. More generally, the
determinant of a p × p matrix is multilinear—linear in each of its p columns.
This generalization of bilinearity is captured by the following definition, which
requires that the function being considered is a linear form when all except for
one of its inputs are held constant.
Definition 1.3.5 (Multilinear Forms)
Suppose V1 , V2 , . . . , V p are vector spaces over the same field F. A function f : V1 × V2 × · · · × V p → F is called a multilinear form if, for each 1 ≤ j ≤ p and each v1 ∈ V1 , v2 ∈ V2 , . . ., v p ∈ V p , it is the case that the function g : V j → F defined by
g(v) = f (v1 , . . . , v j−1 , v, v j+1 , . . . , v p ) for all v ∈ V j
is a linear form.
[Side note: A multilinear form with p input arguments is sometimes called p-linear.]

When p = 1 or p = 2, this definition gives us exactly linear and bilin-


ear forms, respectively. Just like linear forms can be represented by vectors
(1-dimensional lists of numbers) and bilinear forms can be represented by
matrices (2-dimensional arrays of numbers), multilinear forms in general can
be represented by p-dimensional arrays of numbers.
[Side note: We investigate arrays and multilinearity in much more depth in Chapter 3.]
This characterization of multilinear forms is provided by the upcoming Theorem 1.3.6. To prepare ourselves for this theorem (since it looks quite ugly at first, so it helps to be prepared), recall that the entries of vectors are indexed by a single subscript (as in v j for the j-th entry of v) and the entries of matrices
are indexed by two subscripts (as in ai, j for the (i, j)-entry of A). Similarly, the
entries of a p-dimensional array are indexed by p subscripts (as in a j1 , j2 ,..., j p
for the ( j1 , j2 , . . . , j p )-entry of the p-dimensional array A).
Just like a matrix can be thought of as made up of several column vectors,
a 3-dimensional array can be thought of as made up of several matrices (see
Figure 1.9), a 4-dimensional array can be thought of as made up of several
3-dimensional arrays, and so on (though this quickly becomes difficult to
visualize).

Figure 1.9: A visualization of a 3-dimensional array with 2 rows, 3 columns, and 4 “layers”. We use ai, j,k to denote the entry of this array in the i-th row, j-th column, and k-th layer.

Theorem 1.3.6 (The Form of Multilinear Forms)
Suppose V1 , . . . , V p are finite-dimensional vector spaces over a field F and f : V1 × · · · × V p → F is a multilinear form. For each 1 ≤ i ≤ p and vi ∈ Vi , let vi, j denote the j-th coordinate of vi with respect to some basis of Vi . Then there exists a unique p-dimensional array (with entries {a j1 ,..., j p }), called the standard array of f , such that
f (v1 , . . . , v p ) = ∑_{ j1 ,..., j p } a j1 ,..., j p v1, j1 · · · v p, j p for all v1 ∈ V1 , . . . , v p ∈ V p .

Proof. We proceed just as we did in the proof of the characterization of bilinear


forms (Theorem 1.3.5), using induction on the number p of vector spaces.
We already know from Theorems 1.3.3 and 1.3.5 that this result holds
when p = 1 or p = 2, which establishes the base case of the induction. For
the inductive step, suppose that the result is true for all (p − 1)-linear forms
acting on V2 × · · · × V p . If we let B = {w1 , w2 , . . . , wm } be a basis of V1 then the inductive hypothesis tells us that the (p−1)-linear forms g j1 : V2 × · · · × V p → F defined by
g j1 (v2 , . . . , v p ) = f (w j1 , v2 , . . . , v p )
can be written as
f (w j1 , v2 , . . . , v p ) = g j1 (v2 , . . . , v p ) = ∑_{ j2 ,..., j p } a j1 , j2 ,..., j p v2, j2 · · · v p, j p
for some fixed family of scalars {a j1 , j2 ,..., j p }.
[Side note: We use j1 here instead of just j since it will be convenient later on. The scalar a j1 , j2 ,..., j p here depends on j1 (not just j2 , . . . , j p ) since each choice of w j1 gives a different (p − 1)-linear form g j1 .]
If we write an arbitrary vector v1 ∈ V1 as a linear combination of the basis vectors w1 , w2 , . . . , wm (i.e., v1 = v1,1 w1 + v1,2 w2 + · · · + v1,m wm ), it then
follows from linearity of the first argument of f that
f (v1 , . . . , v p ) = f ( ∑_{ j1 =1}^m v1, j1 w j1 , v2 , . . . , v p )      (v1 = ∑_{ j1 =1}^m v1, j1 w j1 )
                 = ∑_{ j1 =1}^m v1, j1 f (w j1 , v2 , . . . , v p )      (multilinearity of f )
                 = ∑_{ j1 =1}^m v1, j1 ∑_{ j2 ,..., j p } a j1 , j2 ,..., j p v2, j2 · · · v p, j p      (inductive hypothesis)
                 = ∑_{ j1 ,..., j p } a j1 ,..., j p v1, j1 · · · v p, j p      (group sums together)

for all v1 ∈ V1 , . . . , v p ∈ V p , which completes the inductive step and shows that
the family of scalars {a j1 ,..., j p } exists.
To see that the scalars {a j1 ,..., j p } are unique, just note that if we choose v1
to be the j1 -th member of the basis of V1 , v2 to be the j2 -th member of the
basis of V2 , and so on, then we get
f (v1 , . . . , v p ) = a j1 ,..., j p .
In particular, these scalars are completely determined by f . 
For example, the determinant of a 3 × 3 matrix is a 3-linear (trilinear?)
form, so it can be represented by a single fixed 3 × 3 × 3 array (or equivalently,
a family of 33 = 27 scalars), just like the determinant of a 2 × 2 matrix is a
bilinear form and thus can be represented by a 2 × 2 matrix (i.e., a family of
22 = 4 scalars). The following example makes this observation explicit.

Example 1.3.14 (The 3 × 3 Determinant as a 3 × 3 × 3 Array)
Find the 3 × 3 × 3 array A with the property that
det([ v | w | x ]) = ∑_{ i, j,k=1}^3 ai, j,k vi w j xk for all v, w, x ∈ F3 .

Solution:
We recall the following explicit formula for the determinant of a 3 × 3
matrix:
 
det([ v | w | x ]) = v1 w2 x3 + v2 w3 x1 + v3 w1 x2 − v1 w3 x2 − v2 w1 x3 − v3 w2 x1 .
This is exactly the representation of the determinant that we want, and we


can read the coefficients of the array A from it directly:

a1,2,3 = a2,3,1 = a3,1,2 = 1 and a1,3,2 = a2,1,3 = a3,2,1 = −1,


and all other entries of A equal 0. We can visualize this 3-dimensional array as follows, where the first subscript i indexes rows, the second subscript j indexes columns, and the third subscript k indexes layers:

k = 1:  [ 0  0  0 ]     k = 2:  [ 0  0  −1 ]     k = 3:  [ 0   1  0 ]
        [ 0  0  1 ]             [ 0  0   0 ]             [ −1  0  0 ]
        [ 0  −1 0 ]             [ 1  0   0 ]             [ 0   0  0 ]

[Side note: The standard array of the p × p determinant is described fairly explicitly by Theorem A.1.4: a j1 ,..., j p = 0 if any two subscripts equal each other, and a j1 ,..., j p = sgn( j1 , . . . , j p ) otherwise, where sgn is the sign of a permutation.]
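This array can also be checked numerically. The sketch below is an added illustration (not from the book; it assumes Python with NumPy): it builds the array from the six nonzero entries listed above and compares the resulting triple sum against the determinant of a random 3 × 3 matrix.

  import numpy as np

  # Build the 3x3x3 array (zero-based indices, so a[i-1, j-1, k-1] = a_{i,j,k}).
  a = np.zeros((3, 3, 3))
  for (i, j, k) in [(1, 2, 3), (2, 3, 1), (3, 1, 2)]:
      a[i-1, j-1, k-1] = 1.0
  for (i, j, k) in [(1, 3, 2), (2, 1, 3), (3, 2, 1)]:
      a[i-1, j-1, k-1] = -1.0

  rng = np.random.default_rng(1)
  v, w, x = rng.standard_normal((3, 3))
  lhs = np.linalg.det(np.column_stack([v, w, x]))
  rhs = np.einsum('ijk,i,j,k->', a, v, w, x)    # sum over all i, j, k
  print(np.isclose(lhs, rhs))                   # True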

It is worth noting that the array that represents the 3 × 3 determinant in


the above example is antisymmetric, which means that it is skew-symmetric
no matter how we “slice” it. For example, the three layers shown are each
skew-symmetric matrices, but the matrix consisting of each of their top rows
is also skew-symmetric, as is the matrix consisting of each of their central
columns, and so on. Remarkably, the array that corresponds to the determinant
is, up to scaling, the unique array with this property (see Exercise 1.3.27), just
like the matrix
A = [ 0   1 ]
    [ −1  0 ]
is, up to scaling, the unique 2 × 2 skew-symmetric matrix.

1.3.4 Inner Products
The dot product on Rn let us do several important geometrically-motivated
things, like computing the lengths of vectors, the angles between them, and
determining whether or not they are orthogonal. We now generalize the dot
product to other vector spaces, which will let us carry out these same tasks in
this new more general setting.
To this end, we make use of bilinear forms, for which we have already seen
that the dot product serves as the prototypical example. However, bilinear forms
are actually too general, since they do not satisfy all of the “nice” properties that
the dot product satisfies. For example, bilinear forms do not typically mimic the
commutativity property of the dot product (i.e., v · w = w · v, but most bilinear
forms f have f (v, w) 6= f (w, v)), nor are they typically positive definite (i.e.,
v · v ≥ 0, but most bilinear forms do not have f (v, v) ≥ 0).
We now investigate functions that satisfy these two additional properties,
with the caveat that the commutativity condition that we need looks slightly
different than might be naïvely expected:
Definition 1.3.6 (Inner Product)
Suppose that F = R or F = C and that V is a vector space over F. Then an inner product on V is a function h·, ·i : V × V → F such that the following three properties hold for all c ∈ F and all v, w, x ∈ V:
a) hv, wi = \overline{hw, vi} (conjugate symmetry)
b) hv, w + cxi = hv, wi + chv, xi (linearity)
c) hv, vi ≥ 0, with equality if and only if v = 0. (pos. definiteness)
[Side note: The notation h·, ·i : V × V → F means that h·, ·i is a function that takes in two vectors from V and outputs a single number from F.]
If F = R then inner products are indeed bilinear forms, since property (b) gives linearity in the second argument and then the symmetry condition (a) guarantees that it is also linear in its first argument. However, if F = C then they are instead sesquilinear forms—they are linear in their second argument, but only conjugate linear in their first argument:
hv + cx, wi = \overline{hw, v + cxi} = \overline{hw, vi + chw, xi} = hv, wi + c̄hx, wi.
[Side note: “Sesquilinear” means “one-and-a-half linear”.]

Remark 1.3.4 (Why a Complex Conjugate?)
Perhaps the only “weird” property in the definition of an inner product is the fact that we require hv, wi = \overline{hw, vi} rather than the seemingly simpler hv, wi = hw, vi. The reason for this strange choice is that if F = C then there does not actually exist any function satisfying hv, wi = hw, vi as well as properties (b) and (c)—if there did, then for all v 6= 0 we would have
0 < hiv, ivi = ihiv, vi = ihv, ivi = i^2 hv, vi = −hv, vi < 0,
which makes no sense.
It is also worth noting that we restrict our attention to the fields R and C when discussing inner products, since otherwise it is not at all clear what the positive definiteness property hv, vi ≥ 0 even means. Many fields do not have a natural ordering on them, so it does not make sense to discuss whether or not their members are bigger than 0. In fact, there is also no natural ordering on the field C, but that is okay because the conjugate symmetry condition hv, vi = \overline{hv, vi} ensures that hv, vi is real.
[Side note: Some books instead define inner products to be linear in their first argument and conjugate linear in the second.]

To get a bit more comfortable with inner products, we now present several
examples of standard inner products in the various vector spaces that we have
been working with. As always though, keep the real dot product in mind as the
canonical example of an inner product around which we build our intuition.

Example 1.3.15 (Complex Dot Product)
Show that the function h·, ·i : Cn × Cn → C defined by
hv, wi = v∗ w = ∑_{ i=1}^n v̄i wi for all v, w ∈ Cn
is an inner product on Cn .
[Side note: This inner product on Cn is also called the dot product.]
Solution:
We must check that the three properties described by Definition 1.3.6 hold. All of these properties follow very quickly from the corresponding properties of matrix multiplication and conjugate transposition:
a) hv, wi = v∗ w = (w∗ v)∗ = \overline{w∗ v} = \overline{hw, vi}
b) hv, w + cxi = v∗ (w + cx) = v∗ w + cv∗ x = hv, wi + chv, xi
[Side note: Keep in mind that w∗ v is a number, so transposing it has no effect, so (w∗ v)∗ = \overline{w∗ v}.]
c) hv, vi = v∗ v = |v1 |2 + · · · + |vn |2 , which is non-negative, and it


equals 0 if and only if v = 0.
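A quick numeric check of this example (added here, not from the book; it assumes Python with NumPy): NumPy's vdot conjugates its first argument, which is exactly v∗ w, and hv, vi comes out as a non-negative real number.

  import numpy as np

  v = np.array([1 + 2j, 3 - 1j])
  w = np.array([2 - 1j, 1j])

  ip = np.vdot(v, w)                        # conj(v) . w, i.e. v* w
  print(ip)                                 # (-1-2j)
  print(np.isclose(ip, np.conj(v) @ w))     # True
  print(np.vdot(v, v))                      # (15+0j): |1+2i|^2 + |3-i|^2 = 5 + 10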

Just like in Rn , we often denote the inner product (i.e., the dot product)
from the above example by v · w instead of hv, wi.

Example 1.3.16 (The Frobenius Inner Product)
Show that the function h·, ·i : Mm,n (F) × Mm,n (F) → F defined by
hA, Bi = tr(A∗ B) for all A, B ∈ Mm,n (F),
where F = R or F = C, is an inner product on Mm,n (F).
[Side note: Recall that tr(A∗ B) is the trace of A∗ B, which is the sum of its diagonal entries.]
Solution:
We could directly verify that the three properties described by Definition 1.3.6 hold, but it is perhaps more illuminating to compute
hA, Bi = tr(A∗ B) = ∑_{ i=1}^m ∑_{ j=1}^n āi, j bi, j .
[Side note: We call this inner product on Mm,n (F) the Frobenius inner product. Some books call it the Hilbert–Schmidt inner product.]
In other words, this inner product multiplies all of the entries of A by the corresponding entries of B and adds them up, just like the dot product on Fn . For example, if m = n = 2, A = [ a1,1 a1,2 ; a2,1 a2,2 ] and B = [ b1,1 b1,2 ; b2,1 b2,2 ], then
hA, Bi = ā1,1 b1,1 + ā2,1 b2,1 + ā1,2 b1,2 + ā2,2 b2,2
       = (a1,1 , a1,2 , a2,1 , a2,2 ) · (b1,1 , b1,2 , b2,1 , b2,2 ).

More generally, if E is the standard basis of Mm,n (F) then we have

hA, Bi = [A]E · [B]E for all A, B ∈ Mm,n (F).

In other words, this inner product is what we get if we forget about the
shape of A and B and just take their dot product as if they were vectors in
Fmn . The fact that this is an inner product now follows directly from the
fact that the dot product on Fn is an inner product.
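The "flattening" observation is easy to verify numerically. The following sketch is an added illustration (not from the book; it assumes Python with NumPy).

  import numpy as np

  rng = np.random.default_rng(2)
  A = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
  B = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))

  frob = np.trace(A.conj().T @ B)                        # tr(A* B)
  as_vectors = np.vdot(A.reshape(-1), B.reshape(-1))     # dot product of flattened matrices
  print(np.isclose(frob, as_vectors))                    # True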

Example 1.3.17 (An Inner Product on Continuous Functions)
Let a < b be real numbers and let C[a, b] be the vector space of continuous functions on the real interval [a, b]. Show that the function h·, ·i : C[a, b] × C[a, b] → R defined by
h f , gi = ∫_a^b f (x)g(x) dx for all f , g ∈ C[a, b]

is an inner product on C[a, b].


Solution:
We must check that the three properties described by Definition 1.3.6
hold. All of these properties follow quickly from the corresponding prop-
erties of definite integrals:
[Side note: Since all numbers here are real, we do not need to worry about the complex conjugate in property (a).]
a) We simply use commutativity of real number multiplication:
h f , gi = ∫_a^b f (x)g(x) dx = ∫_a^b g(x) f (x) dx = hg, f i.
b) This follows from linearity of integrals:
h f , g + chi = ∫_a^b f (x)( g(x) + ch(x) ) dx = ∫_a^b f (x)g(x) dx + c ∫_a^b f (x)h(x) dx = h f , gi + ch f , hi.
c) We just recall that the integral of a positive function is positive:
h f , f i = ∫_a^b f (x)^2 dx ≥ 0,
with equality if and only if f (x) = 0 for all x ∈ [a, b] (i.e., f is the
zero function).

We can make a bit more sense of the above inner product on C[a, b] if we
think of definite integrals as “continuous sums”. While the dot product v · w on
Rn adds up all values of v j w j for 1 ≤ j ≤ n, this inner product h f , gi on C[a, b]
“adds up” all values of f (x)g(x) for a ≤ x ≤ b (and weighs them appropriately
so that the sum is finite).
All of the inner products that we have seen so far are called the standard
inner products on spaces that they act on. That is, the dot product is the standard
inner product on Rn or Cn , the Frobenius inner product is the standard inner
product on Mm,n , and
h f , gi = ∫_a^b f (x)g(x) dx

is the standard inner product on C[a, b]. We similarly use P[a, b] and P p [a, b]
to denote the spaces of polynomials (of degree at most p, respectively) acting
on the real interval [a, b], and we assume that the inner product acting on these
spaces is this standard one unless we indicate otherwise.
Inner products can also look quite a bit different from the standard ones
that we have seen so far, however. The following example illustrates how the
same vector space can have multiple different inner products, and at first glance
they might look quite different than the standard inner product.

Example 1.3.18 (A Weird Inner Product)
Show that the function h·, ·i : R2 × R2 → R defined by
hv, wi = v1 w1 + 2v1 w2 + 2v2 w1 + 5v2 w2 for all v, w ∈ R2

is an inner product on R2 .
Solution:
Properties (a) and (b) of Definition 1.3.6 follow fairly quickly from the
definition of this function, but proving property (c) is somewhat trickier.
To this end, it is helpful to rewrite this function in the form
hv, wi = (v1 + 2v2 )(w1 + 2w2 ) + v2 w2 .
[Side note: We will show in Theorem 1.4.3 that we can always rewrite inner products in a similar manner so as to make their positive definiteness “obvious”.]
It follows that
hv, vi = (v1 + 2v2 )^2 + v2^2 ≥ 0,
with equality if and only if v1 + 2v2 = v2 = 0, which happens if and only if v1 = v2 = 0 (i.e., v = 0), as desired.
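As an added aside (not from the book; a hypothetical Python/NumPy sketch), this inner product is vT Mw for the symmetric matrix M = [ 1 2 ; 2 5 ], and one standard way to see its positive definiteness numerically is that M's eigenvalues are positive (positive definite matrices are treated later in the book).

  import numpy as np

  M = np.array([[1.0, 2.0], [2.0, 5.0]])

  def ip(v, w):
      # The inner product of Example 1.3.18, written out entry by entry.
      return v[0]*w[0] + 2*v[0]*w[1] + 2*v[1]*w[0] + 5*v[1]*w[1]

  rng = np.random.default_rng(3)
  v, w = rng.standard_normal(2), rng.standard_normal(2)
  print(np.isclose(ip(v, w), v @ M @ w))    # True
  print(np.linalg.eigvalsh(M))              # both eigenvalues are positive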

Recall that a vector space V is not just a set of vectors, but rather it also
includes a particular addition and scalar multiplication operation as part of it.
Similarly, if we have a particular inner product in mind then we typically group
it together with V and call it an inner product space. If there is a possibility
for confusion among different inner products (e.g., because there are multiple
different inner product spaces V and W being used simultaneously) then we
may write them using notation like h·, ·iV or h·, ·iW .
The Norm Induced by the Inner Product
Now that we have inner products to work with, we can define the length of
a vector in a manner that is completely analogous to how we did it with the
dot product in Rn . However, in this setting of general vector spaces, we are a
bit beyond the point of being able to draw a geometric picture of what length
means (for example, the “length” of a matrix does not quite make sense), so
we change terminology slightly and instead call this function a “norm”.

Definition 1.3.7 (Norm Induced by the Inner Product)
Suppose that V is an inner product space. Then the norm induced by the inner product is the function k · k : V → R defined by
kvk := √hv, vi for all v ∈ V.

[Side note: We use the notation k · k to refer to the standard length on Rn or Cn , but also to refer to the norm induced by whichever inner product we are currently discussing. If there is ever a chance for confusion, we use subscripts to distinguish different norms.]
When V = Rn or V = Cn and the inner product is just the usual dot product, the norm induced by the inner product is just the usual length of a vector, given by
kvk = √(v · v) = √( |v1 |^2 + |v2 |^2 + · · · + |vn |^2 ).
However, if we change which inner product we are working with, then the norm induced by the inner product changes as well. For example, the norm on R2 induced by the weird inner product of Example 1.3.18 has the form
kvk∗ = √hv, vi = √( (v1 + 2v2 )^2 + v2^2 ),
which is different from the norm that we are used to (we use the notation k · k∗ just to differentiate this norm from the standard length k · k).
The norm induced by the standard (Frobenius) inner product on Mm,n from Example 1.3.16 is given by
kAkF = √tr(A∗ A) = √( ∑_{ i=1}^m ∑_{ j=1}^n |ai, j |^2 ).
[Side note: The Frobenius norm is also sometimes called the Hilbert–Schmidt norm and denoted by kAkHS .]
This norm on matrices is often called the Frobenius norm, and it is usually written as kAkF rather than just kAk to avoid confusion with another matrix norm that we will see a bit later in this book.
Similarly, the norm on C[a, b] induced by the inner product of Example 1.3.17 has the form
k f k = √( ∫_a^b f (x)^2 dx ).

Perhaps not surprisingly, the norm induced by an inner product satisfies the
same basic properties as the length of a vector in Rn . The next few theorems
are devoted to establishing these properties.

Theorem 1.3.7 (Properties of the Norm Induced by the Inner Product)
Suppose that V is an inner product space, v ∈ V is a vector, and c ∈ F is a scalar. Then the following properties of the norm induced by the inner product hold:
a) kcvk = |c|kvk, and (absolute homogeneity)
b) kvk ≥ 0, with equality if and only if v = 0. (pos. definiteness)

Proof. Both of these properties follow fairly quickly from the definition. For
property (a), we compute
kcvk = √hcv, cvi = √( chcv, vi ) = √( c \overline{hv, cvi} ) = √( c c̄ hv, vi ) = √( |c|^2 kvk^2 ) = |c|kvk.

Property (b) follows immediately from the property of inner products that says
that hv, vi ≥ 0, with equality if and only if v = 0. 
The above theorem tells us that it makes sense to break vectors into their
“length” and “direction”, just like we do with vectors in Rn . Specifically, we can
write every vector v ∈ V in the form v = kvku, where u ∈ V has kuk = 1 (so
we call u a unit vector). In particular, if v 6= 0 then we can choose u = v/kvk,
which we think of as encoding the direction of v.
The two other main properties that vector length in Rn satisfies are the
Cauchy–Schwarz inequality and the triangle inequality. We now show that
these same properties hold for the norm induced by any inner product.

Theorem 1.3.8 (Cauchy–Schwarz Inequality for Inner Products)
If V is an inner product space then
|hv, wi| ≤ kvkkwk for all v, w ∈ V.
Furthermore, equality holds if and only if v and w are collinear (i.e., {v, w}
is a linearly dependent set).

Proof. We start by letting c, d ∈ F be arbitrary scalars and expanding the


quantity kcv + dwk2 in terms of the inner product:
0 ≤ kcv + dwk^2 (pos. definiteness)
  = hcv + dw, cv + dwi (definition of k · k)
  = |c|^2 hv, vi + c̄dhv, wi + cd̄hw, vi + |d|^2 hw, wi (sesquilinearity)
  = |c|^2 kvk^2 + 2Re( c̄dhv, wi ) + |d|^2 kwk^2 . (since z + z̄ = 2Re(z))
[Side note: Re(z) is the real part of z. That is, if z ∈ R then Re(z) = z, and if z = a + ib ∈ C then Re(z) = a.]
If w = 0 then the Cauchy–Schwarz inequality holds trivially (it just says


0 ≤ 0), so we can assume that w 6= 0. We can thus choose c = kwk and d =
−hw, vi/kwk, which tells us that
0 ≤ kvk^2 kwk^2 − 2Re( kwk hw, vi hv, wi / kwk ) + |hw, vi|^2 kwk^2 /kwk^2
  = kvk^2 kwk^2 − |hw, vi|^2 .
[Side note: To simplify, we use the fact that hw, vihv, wi = hw, vi \overline{hw, vi} = |hw, vi|^2 .]

Rearranging and taking the square root of both sides gives us |hv, wi| ≤ kvkkwk,
which is exactly the Cauchy–Schwarz inequality.
To see that equality holds if and only if {v, w} is a linearly dependent set,
suppose that |hv, wi| = kvkkwk. We can then follow the above proof backward
to see that 0 = kcv + dwk^2 , so cv + dw = 0 (where c = kwk 6= 0), so {v, w} is
linearly dependent. In the opposite direction, if {v, w} is linearly dependent
then either v = 0 (in which case equality clearly holds in the Cauchy–Schwarz
inequality since both sides equal 0) or w = cv for some c ∈ F. Then

|hv, wi| = |hv, cvi| = |c|kvk2 = kvkkcvk = kvkkwk,

which completes the proof. 


For example, if we apply the Cauchy–Schwarz inequality to the Frobenius
inner product on Mm,n , it tells us that

| tr(A∗ B) |^2 ≤ tr(A∗ A) tr(B∗ B) for all A, B ∈ Mm,n ,

and if we apply it to the standard inner product on C[a, b] then it says that
( ∫_a^b f (x)g(x) dx )^2 ≤ ( ∫_a^b f (x)^2 dx ) ( ∫_a^b g(x)^2 dx ) for all f , g ∈ C[a, b].

These examples illustrate the utility of thinking abstractly about vector spaces.
These matrix and integral inequalities are tricky to prove directly from prop-
erties of the trace and integrals, but follow straightforwardly when we forget
about the fine details and only think about vector space properties.
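These inequalities are also easy to spot-check numerically. The sketch below is an added illustration (not from the book; it assumes Python with NumPy) and verifies the matrix version of Cauchy–Schwarz for random complex matrices.

  import numpy as np

  rng = np.random.default_rng(4)
  A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
  B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))

  lhs = abs(np.trace(A.conj().T @ B)) ** 2
  rhs = np.trace(A.conj().T @ A).real * np.trace(B.conj().T @ B).real
  print(lhs <= rhs)      # True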
Just as was the case in Rn , the triangle inequality now follows very quickly
from the Cauchy–Schwarz inequality.

Theorem 1.3.9 (The Triangle Inequality for the Norm Induced by the Inner Product)
If V is an inner product space then
kv + wk ≤ kvk + kwk for all v, w ∈ V.
Furthermore, equality holds if and only if v and w point in the same
direction (i.e., v = 0 or w = cv for some 0 ≤ c ∈ R).

Proof. We start by expanding kv + wk2 in terms of the inner product:


kv + wk^2 = hv + w, v + wi
          = hv, vi + hv, wi + hw, vi + hw, wi (sesquilinearity)
          = kvk^2 + 2Re(hv, wi) + kwk^2 (since z + z̄ = 2Re(z))
          ≤ kvk^2 + 2|hv, wi| + kwk^2 (since Re(z) ≤ |z|)
          ≤ kvk^2 + 2kvkkwk + kwk^2 (by Cauchy–Schwarz)
          = (kvk + kwk)^2 .
[Side note: For any z = x + iy ∈ C, we have Re(z) ≤ |z| since Re(z) = x ≤ √(x^2 + y^2 ) = |x + iy| = |z|.]
We can then take the square root of both sides of this inequality to see that
kv + wk ≤ kvk + kwk, as desired.
The above argument demonstrates that equality holds in the triangle in-
equality if and only if Re(hv, wi) = |hv, wi| = kvkkwk. We know from the
Cauchy–Schwarz inequality that the second of these equalities holds if and
only if {v, w} is linearly dependent (i.e., v = 0 or w = cv for some c ∈ F). In
this case, the first equality holds if and only if v = 0 or Re(hv, cvi) = |hv, cvi|.
Well, Re(hv, cvi) = Re(c)kvk2 and |hv, cvi| = |c|kvk2 , so we see that equality
holds in the triangle inequality if and only if Re(c) = |c| (i.e., 0 ≤ c ∈ R). 
It is worth noting that isomorphisms in general do not preserve inner prod-
ucts, since inner products cannot be derived from only scalar multiplication and vector addition (which isomorphisms do preserve). For example, the isomorphism T : R2 → R2 defined by T (v) = (v1 + 2v2 , v2 ) has the property that
T (v) · T (w) = (v1 + 2v2 , v2 ) · (w1 + 2w2 , w2 ) = v1 w1 + 2v1 w2 + 2v2 w1 + 5v2 w2 .
[Side note: The fact that inner products cannot be expressed in terms of vector addition and scalar multiplication means that they give us some extra structure that vector spaces alone do not have.]
In other words, T turns the usual dot product on R2 into the weird inner product
have. In other words, T turns the usual dot product on R2 into the weird inner product
from Example 1.3.18.
However, even though isomorphisms do not preserve inner products, they
at least do always convert one inner product into another one. That is, if V
and W are vector spaces, h·, ·iW is an inner product on W, and T : V → W
is an isomorphism, then we can define an inner product on V via hv1 , v2 iV =
hT (v1 ), T (v2 )iW (see Exercise 1.3.25).
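Concretely, the transported inner product hv1 , v2 iV = hT (v1 ), T (v2 )iW is exactly what produced the weird inner product above. A small check of this (added here, not from the book; it assumes Python with NumPy):

  import numpy as np

  T = lambda v: np.array([v[0] + 2*v[1], v[1]])      # the isomorphism from the text

  def weird_ip(v, w):                                # the inner product of Example 1.3.18
      return v[0]*w[0] + 2*v[0]*w[1] + 2*v[1]*w[0] + 5*v[1]*w[1]

  rng = np.random.default_rng(5)
  v, w = rng.standard_normal(2), rng.standard_normal(2)
  print(np.isclose(T(v) @ T(w), weird_ip(v, w)))     # True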

Exercises solutions to starred exercises on page 454

1.3.1 Determine whether or not the given vector spaces V


and W are isomorphic. 1.3.3 Determine which of the following functions are inner
products.
∗(a) V = R6 , W = M2,3 (R)
(b) V = M2 (C), W = M2 (R) ∗(a) On R2 , the function
∗(c) V = P 8 , W = M3 (R)
hv, wi = v1 w1 + v1 w2 + v2 w2 .
(d) V = R, W is the vector space from Example 1.1.4
∗(e) V = P 8, W = P (b) On R2 , the function
(f) V = {A ∈ M2 : tr(A) = 0}, W = P 2
hv, wi = v1 w1 + v1 w2 + v2 w1 + v2 w2 .

1.3.2 Determine which of the following functions are and (c) On R2 , the function
are not linear forms.
hv, wi = 3v1 w1 + v1 w2 + v2 w1 + 3v2 w2 .
∗(a) The function f : Rn → R defined by f (v) = kvk.
(b) The function f : Fn → F defined by f (v) = v1 . ∗(d) On Mn , the function hA, Bi = tr(A∗ + B).
∗(c) The function f : M2 → M2 defined by (e) On P 2 , the function
" #! 
c a
 hax2 + bx + c, dx2 + ex + f i = ad + be + c f .
a b
f = .
c d d b ∗(f) On C[−1, 1], the function
Z 1
(d) The determinant of a matrix (i.e., the function det : f (x)g(x)
h f , gi = √ dx.
Mn → F). −1 1 − x2
∗(e) The function g : P → R defined by g( f ) = f 0 (3),
where f 0 is the derivative of f . 1.3.4 Determine which of the following statements are
(f) The function g : C → R defined by g( f ) = cos( f (0)). true and which are false.
∗(a) If T : V → W is an isomorphism then so is T −1 :
W → V.
(b) Rn is isomorphic to Cn .
∗(c) Two vector spaces V and W over the same field are
isomorphic if and only if dim(V) = dim(W).
(d) If w ∈ Rn then the function fw : Rn → R defined by (a) Show that if the ground field is R then
fw (v) = v · w for all v ∈ Rn is a linear form.
1 
∗(e) If w ∈ Cn then the function fw : Cn → C defined by hv, wi = kv + wk2 − kv − wk2
fw (v) = v · w for all v ∈ Cn is a linear form. 4
(f) If A ∈ Mn (R) is invertible then so is the bilinear for all v, w ∈ V.
form f (v, w) = vT Aw. (b) Show that if the ground field is C then
∗(g) If Ex : P 2 → R is the evaluation map defined by
Ex ( f ) = f (x) then E1 + E2 = E3 . 1 3 1
hv, wi = ∑ ik kv + ik wk2
4 k=0

1.3.5 Show that if we regard Cn as a vector space over R for all v, w ∈ V.


(instead of over C as usual) then it is isomorphic to R2n . [Side note: This is called the polarization identity.]

∗∗ 1.3.6 Suppose V, W, and X are vector spaces and 1.3.15 Let V be an inner product space and let k · k be the
T : V → W and S : W → X are isomorphisms. norm induced by V’s inner product.
(a) Show that T −1 : W → V is an isomorphism. (a) Show that if the ground field is R then
(b) Show that S ◦ T : V → X is an isomorphism.
2 !
1 v w

hv, wi = kvkkwk 1 − −
∗∗1.3.7 Show that for every linear form f : Mm,n (F) → 2 kvk kwk
F, there exists a matrix A ∈ Mm,n (F) such that f (X) =
tr(AT X) for all X ∈ Mm,n (F). for all v, w ∈ V.
[Side note: This representation of the inner product
implies the Cauchy–Schwarz inequality as an imme-
∗∗1.3.8 Show that the result of Exercise 1.3.7 still holds if diate corollary.]
we replace every instance of Mm,n (F) by MSn (F) or every (b) Show that if the ground field is C then
instance of Mm,n (F) by MH n (and set F = R in this latter 2
case). 1 v w
hv, wi = kvkkwk 1 − −
2 kvk kwk

∗∗1.3.9 Suppose V is an inner product space. Show that 2 !
i v iw
hv, 0i = 0 for all v ∈ V. +i− −
2 kvk kwk

∗∗ 1.3.10 Suppose A ∈ Mm,n (C). Show that kAkF = for all v, w ∈ V.


kAT kF = kA∗ kF .
∗1.3.16 Suppose V is a vector space over a field F in which
1.3.11 Suppose a < b are real numbers and f ∈ C[a, b]. 1 + 1 6= 0, and f : V × V → F is a bilinear form.
Show that (a) Show that f (v, w) = − f (w, v) for all v, w ∈ V (in
 Z b 2 Z b  which case f is called skew-symmetric) if and only
1 1
f (x) dx ≤ f (x)2 dx . if f (v, v) = 0 for all v ∈ V (in which case f is called
b−a a b−a a
alternating).
[Side note: In words, this says that the square of the average (b) Explain what goes wrong in part (a) if 1 + 1 = 0
of a function is never larger than the average of its square.] (e.g., if F = Z2 is the field of 2 elements described
in Appendix A.4).
∗∗1.3.12 Let V be an inner product space and let k · k
be the norm induced by V’s inner product. Show that if 1.3.17 Suppose V is a vector space over a field F and
v, w ∈ V are such that hv, wi = 0 then f : V × V → F is a bilinear form.
kv + wk2 = kvk2 + kwk2 . (a) Show that f (v, w) = f (w, v) for all v, w ∈ V (in
which case f is called symmetric) if and only if the
[Side note: This is called the Pythagorean theorem for matrix A ∈ Mn (F) from Theorem 1.3.5 is symmetric
inner products.] (i.e., satisfies AT = A).
(b) Show that f (v, w) = − f (w, v) for all v, w ∈ V (in
∗∗1.3.13 Let V be an inner product space and let k · k be which case f is called skew-symmetric) if and only
the norm induced by V’s inner product. Show that if the matrix A ∈ Mn (F) from Theorem 1.3.5 is
skew-symmetric (i.e., satisfies AT = −A).
kv + wk2 + kv − wk2 = 2kvk2 + 2kwk2
for all v, w ∈ V. ∗∗1.3.18 Suppose V and W are m- and n-dimensional
vector spaces over C, respectively, and let f : V × W → C
[Side note: This is called the parallelogram law, since it re-
be a sesquilinear form (i.e., a function that is linear in its
lates the norms of the sides of a parallelogram to the norms
second argument and conjugate linear in its first).
of its diagonals.]
(a) Show that if B and C are bases of V and W, respec-
tively, then there exists a unique matrix A ∈ Mm,n (C)
∗∗1.3.14 Let V be an inner product space and let k · k be such that
the norm induced by V’s inner product.
f (v, w) = [v]∗B A[w]C for all v ∈ V, w ∈ W.
(b) Suppose W = V and C = B. Show that f is conju- 1.3.23 Let Ex : P → R be the evaluation map defined by
gate symmetric (i.e., f (v, w) = f (w, v) for all v ∈ V Ex ( f ) = f (x). Show that the set
and w ∈ W) if and only if the matrix A from part (a) 
is Hermitian. Ex : x ∈ R
(c) Suppose W = V and C = B. Show that f is an inner
product if and only if the matrix A from part (a) is is linearly independent.
Hermitian and satisfies v∗ Av ≥ 0 for all v ∈ Cm with
equality if and only if v = 0. ∗∗ 1.3.24 Let T : V → V ∗∗ be the canonical double-
[Side note: A matrix A with these properties is called dual isomorphism described by Theorem 1.3.4. Com-
positive definite, and we explore such matrices in plete the proof of that theorem by showing that if B =
Section 2.2.] {v1 , v2 , . . . , vn } ⊆ V is linearly independent then so is
C = {T (v1 ), T (v2 ), . . . , T (vn )} ⊆ V ∗∗ .
1.3.19 Suppose V is a vector space with basis B =
{v1 , . . . , vn }. Define linear functionals f1 , . . ., fn ∈ V ∗ by ∗∗1.3.25 Suppose that V and W are vector spaces over a
( field F, h·, ·iW is an inner product on W, and T : V → W is
1 if i = j,
fi (v j ) = an isomorphism. Show that the function h·, ·iV : V × V → F
0 otherwise. defined by hv1 , v2 iV = hT (v1 ), T (v2 )iW is an inner prod-
def
Show that the set B∗ = { f1 , . . . , fn } is a basis of V ∗ . uct on V.
[Side note: B∗ is called the dual basis of B.]
1.3.26 Suppose V is a vector space over a field F,
f : V × · · · × V → F is a multilinear form, and T1 , . . . , Tn :
1.3.20 Let c00 be the vector space of eventually-zero real V → V are linear transformations. Show that the function
= RN .
sequences from Example 1.1.10. Show that c∗00 ∼ g : V × · · · × V → F defined by
[Side note: This example shows that the dual of an infinite-
dimensional vector space can be much larger than the origi- g(v1 , . . . , vn ) = f (T1 (v1 ), . . . , Tn (vn ))
nal vector space. For example, c00 has dimension equal to is a multilinear form.
the cardinality of N, but this exercise shows that c∗00 has
dimension at least as large as the cardinality of R. See also
Exercise 1.3.23.] ∗∗ 1.3.27 An array A is called antisymmetric if swap-
ping any two of its indices swaps the sign of its entries
(i.e., a j1 ,..., jk ,..., j` ,..., j p = −a j1 ,..., j` ,..., jk ,..., j p for all k 6= `).
∗∗ 1.3.21 Let Ex : P p → R be the evaluation map de- Show that for each p ≥ 2 there is, up to scaling, only one
fined by Ex ( f ) = f (x) (see Example 1.3.5) and let E = p × p × · · · × p (p times) antisymmetric array.
{1, x, x2 , . . . , x p } be the standard basis of P p . Find a polyno-
mial gx ∈ P p such that Ex ( f ) = [gx ]E · [ f ]E for all f ∈ P p . [Side note: This array corresponds to the determinant in the
sense of Theorem 1.3.6.]

∗∗1.3.22 Let Ex : P p → R be the evaluation map defined [Hint: Make use of permutations (see Appendix A.1.5) or
by Ex ( f ) = f (x). Show that if c0 , c1 , . . . , c p ∈ R are distinct properties of the determinant.]
then {Ec0 , Ec1 , . . . , Ec p } is a basis of (P p )∗ .

1.4 Orthogonality and Adjoints

Now that we know how to generalize the dot product from Rn to other vector
spaces (via inner products), it is worth revisiting the concept of orthogonality.
Recall that two vectors v, w ∈ Rn are orthogonal when v · w = 0. We define
orthogonality in general inner product spaces completely analogously:

Definition 1.4.1 Suppose V is an inner product space. Two vectors v, w ∈ V are called
Orthogonality orthogonal if hv, wi = 0.

In Rn , we could think of “orthogonal” as a synonym for “perpendicular”,


since two vectors in Rn are orthogonal if and only if the angle between them is
π/2. However, in general inner product spaces this geometric picture makes
much less sense (for example, it does not quite make sense to say that the angle
between two polynomials is π/2). For this reason, it is perhaps better to think
of orthogonal vectors as ones that are “as linearly independent as possible” (see
Figure 1.10 for a geometric justification of this interpretation).

Figure 1.10: Orthogonality can be thought of as a stronger version of linear inde-


pendence: not only do the vectors not point in the same direction, but in fact
they point as far away from each other as possible.

In order to better justify this interpretation of orthogonality, we first need a


notion of orthogonality for sets of vectors rather than just pairs of vectors. To
make this leap, we simply say that a set of vectors B is mutually orthogonal
if every two distinct vectors v 6= w ∈ B are orthogonal. For example, a set of
three vectors in R3 is mutually orthogonal if they all make right angles with
each other (like the coordinate axes).
The following result pins down our claim that mutual orthogonality is a
stronger property than linear independence.

Theorem 1.4.1 Suppose V is an inner product space. If B ⊂ V is a mutually orthogonal


Mutual Orthogonality set of non-zero vectors then B is linearly independent.
Implies Linear
Independence
Proof. We start by supposing that v1 , v2 , . . . , vk ∈ B and c1 , c2 , . . . , ck ∈ F
(where F is the ground field) are such that

c1 v1 + c2 v2 + · · · + ck vk = 0. (1.4.1)

Our goal is to show that c1 = c2 = · · · = ck = 0. To this end, we note that


The fact that hv1 , 0i = 0, but also
hv1 , 0i = 0 is hopefully
intuitive enough. If hv1 , 0i = hv1 , c1 v1 + c2 v2 + · · · + ck vk i (by Equation (1.4.1))
not, it was proved in
Exercise 1.3.9. = c1 hv1 , v1 i + c2 hv1 , v2 i + · · · + ck hv1 , vk i (linearity of the i.p.)
= c1 hv1 , v1 i + 0 + · · · + 0 (B is mutually orthogonal)
= c1 kv1 k2 . (definition of norm)

Since all of the vectors in B are non-zero we know that kv1 k 6= 0, so this implies
c1 = 0.
A similar computation involving hv2 , 0i shows that c2 = 0, and so on up to
hvk , 0i showing that ck = 0, so we conclude c1 = c2 = · · · = ck = 0 and thus B
is linearly independent. 

Example 1.4.1 Show that the set B = {1, x, 2x2 − 1} ⊂ P 2 [−1, 1] is mutually orthogonal
Checking Mutual with respect to the inner product
Orthogonality of Z 1
Polynomials f (x)g(x)
h f , gi = √ dx.
Recall that P 2 [−1, 1] −1 1 − x2
is the vector space
of polynomials of Solution:
degree at most 2
acting on the
We explicitly compute all 3 possible inner products between these
interval [−1, 1].

3 polynomials, which requires some integration techniques that we may


have not used in a while:
    h1, xi = ∫_{−1}^{1} x/√(1 − x²) dx = [ −√(1 − x²) ]_{−1}^{1} = 0 − 0 = 0,

    h1, 2x² − 1i = ∫_{−1}^{1} (2x² − 1)/√(1 − x²) dx = [ −x√(1 − x²) ]_{−1}^{1} = 0 − 0 = 0,   and

    hx, 2x² − 1i = ∫_{−1}^{1} (2x³ − x)/√(1 − x²) dx = [ −(2x² + 1)√(1 − x²)/3 ]_{−1}^{1} = 0 − 0 = 0.

[Side note: All 3 of these integrals can be solved by making the substitution x = sin(u) and then integrating with respect to u.]

Since each pair of these polynomials is orthogonal with respect to this


inner product, the set is mutually orthogonal.
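These integrals are also easy to confirm with a computer algebra system. The following snippet is an illustrative editorial sketch (it assumes SymPy is available) that checks all three pairwise inner products symbolically.

```python
import sympy as sp

x = sp.symbols('x')
weight = 1 / sp.sqrt(1 - x**2)                       # weight of this inner product on P^2[-1, 1]
ip = lambda f, g: sp.integrate(f * g * weight, (x, -1, 1))

polys = [sp.Integer(1), x, 2*x**2 - 1]
for i in range(len(polys)):
    for j in range(i + 1, len(polys)):
        print(polys[i], '|', polys[j], '->', ip(polys[i], polys[j]))   # each inner product is 0
```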

When combined with Theorem 1.4.1, the above example shows that the set
{1, x, 2x2 − 1} is linearly independent in P 2 [−1, 1]. It is worth observing that a
set of vectors may be mutually orthogonal with respect to one inner product but
not another—all that is needed to show linear independence in this way is that
it is mutually orthogonal with respect to at least one inner product. Conversely,
for every linearly independent set there is some inner product with respect to
which it is mutually orthogonal, at least in the finite-dimensional case (see
Exercise 1.4.25).

1.4.1 Orthonormal Bases


One of the most useful things that we could do with linear independence was
introduce bases, which in turn let us give coordinates to vectors in arbitrary
finite-dimensional vector spaces. We now spend some time investigating bases
that are not just linearly independent, but are even mutually orthogonal and
scaled so that all vectors in the basis have the same length. We will see that this
additional structure makes these bases more well-behaved and much easier to
work with than others.

Definition 1.4.2 Suppose V is an inner product space with basis B ⊂ V. We say that B is an
Orthonormal Bases orthonormal basis of V if
a) hv, wi = 0 for all v 6= w ∈ B, and (mutual orthogonality)
b) kvk = 1 for all v ∈ B. (normalization)

Before proceeding with examples, we note that determining whether or not


a set is an orthonormal basis is, rather surprisingly, often easier than determining
whether or not it is a (potentially non-orthonormal) basis. The reason for this is
that checking mutual orthogonality (which just requires computing some inner
products) is typically easier than checking linear independence (which requires
solving a linear system), and the following theorem says that this is all that we
have to check as long as the set we are working with has the “right” size (i.e.,
as many vectors as the dimension of the vector space).

Theorem 1.4.2 Suppose V is a finite-dimensional inner product space and B ⊆ V is a mu-


Determining if a Set is tually orthogonal set consisting of unit vectors. Then B is an orthonormal
an Orthonormal Basis basis of V if and only if |B| = dim(V).

The notation |B| Proof. Recall that Theorem 1.4.1 tells us that if B is mutually orthogonal and
means the number
its members are unit vectors (and thus non-zero) then B is linearly independent.
of vectors in B.
We then make use of Exercise 1.2.27(a), which tells us that a set with dim(V)
vectors is a basis if and only if it is linearly independent. 

It is straightforward to check that the standard bases of each of Rn , Cn ,


and Mm,n are in fact orthonormal with respect to the standard inner products
on these spaces (i.e., the dot product on Rn and Cn and the Frobenius inner
product on Mm,n ). On the other hand, we have to be somewhat careful when
working with P p , since the standard basis {1, x, x2 , . . . , x p } is not orthonormal
with respect to any of its standard inner products like
Z 1
h f , gi = f (x)g(x) dx.
0

For example, performing the integration indicated above shows that h1, xi =
(12 )/2 − (02 )/2 = 1/2 6= 0, so the polynomials 1 and x are not orthogonal in
this inner product.

Example 1.4.2 Construct an orthonormal basis of P 2 [−1, 1] with respect to the inner
An Orthonormal Basis product
Z 1
of Polynomials f (x)g(x)
h f , gi = √ dx.
−1 1 − x2
Solution:
We already showed that the set B = {1, x, 2x2 − 1} is mutually orthogo-
nal with respect to this inner product in Example 1.4.1. To turn this set into
an orthonormal basis, we just normalize these polynomials (i.e., divide
them by their norms):
[Side note: Be careful here: we might guess that k1k = 1, but this is not true. We have to go through the computation with the indicated inner product.]

    k1k = √h1, 1i = √( ∫_{−1}^{1} 1/√(1 − x²) dx ) = √π,

    kxk = √hx, xi = √( ∫_{−1}^{1} x²/√(1 − x²) dx ) = √(π/2),   and

    k2x² − 1k = √h2x² − 1, 2x² − 1i = √( ∫_{−1}^{1} (2x² − 1)²/√(1 − x²) dx ) = √(π/2).

[Side note: These integrals can be evaluated by substituting x = sin(u).]

It follows that the set C = { 1/√π, √2 x/√π, √2(2x² − 1)/√π } is
a mutually orthogonal set of normalized vectors. Since C consists of
dim(P 2 ) = 3 vectors, we know from Theorem 1.4.2 that it is an orthonor-
mal basis of P 2 [−1, 1] (with respect to this inner product).
Example 1.4.3 (The Pauli Matrices Form an Orthonormal Basis)  Show that the set

    B = { (1/√2)[1 0; 0 1], (1/√2)[0 1; 1 0], (1/√2)[0 −i; i 0], (1/√2)[1 0; 0 −1] }

is an orthonormal basis of M2(C) with the usual Frobenius inner product.

[Side note: These are, up to the scalar factor 1/√2, the Pauli matrices that we saw earlier in Example 1.1.12.]

Solution:
Recall that the Frobenius inner product works just like the dot product on Cn: hX, Yi is obtained by multiplying the entrywise complex conjugate of X against Y entrywise and adding up the results. We thus can see that most of these matrices are indeed orthogonal, since all of the terms being added up are 0. For example,

    h(1/√2)[0 1; 1 0], (1/√2)[1 0; 0 −1]i = (1/2)(0·1 + 1·0 + 1·0 + 0·(−1)) = (1/2)(0 + 0 + 0 + 0) = 0.

For this reason, the only two other inner products that we explicitly check are the ones where the zeros do not match up in this way:

    h(1/√2)[1 0; 0 1], (1/√2)[1 0; 0 −1]i = (1/2)(1 + 0 + 0 + (−1)) = 0,   and

    h(1/√2)[0 1; 1 0], (1/√2)[0 −i; i 0]i = (1/2)(0 + (−i) + i + 0) = 0.

Since every pair of these matrices is orthogonal, B is mutually orthogonal.


To see that these matrices are properly normalized, we just note that it
is straightforward to check that hX, Xi = 1 for all four of these matrices
X ∈ B. It thus follows from Theorem 1.4.2 that, since B consists of 4
matrices and dim(M2 (C)) = 4, B is an orthonormal basis of M2 (C).
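As a sanity check, the orthonormality of B can also be verified numerically. The sketch below (assuming NumPy is available) computes the Gram matrix of Frobenius inner products hX, Yi = tr(X∗Y) and confirms that it is the 4 × 4 identity.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
B = [M / np.sqrt(2) for M in (I2, sx, sy, sz)]        # the (scaled) Pauli matrices

frob = lambda X, Y: np.trace(X.conj().T @ Y)          # Frobenius inner product <X, Y> = tr(X* Y)
gram = np.array([[frob(X, Y) for Y in B] for X in B])
print(np.allclose(gram, np.eye(4)))                   # True: B is a mutually orthogonal set of unit vectors
```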

We already learned in Section 1.3.1 that all finite-dimensional vector spaces


are isomorphic (i.e., “essentially the same”) as Fn . It thus seems natural to ask
the corresponding question about inner products—do all inner products on a
finite-dimensional inner product space V look like the usual dot product on Fn
on some basis? It turns out that the answer to this question is “yes”, and the
bases for which this happens are exactly orthonormal bases.

Theorem 1.4.3 If B is an orthonormal basis of a finite-dimensional inner product space V


All Inner Products then
Look Like the Dot hv, wi = [v]B · [w]B for all v, w ∈ V.
Product

Proof. Suppose B = {u1 , u2 , . . . un }. Since B is a basis of V, we can write v =


c1 u1 + c2 u2 + · · · + cn un and w = d1 u1 + d2 u2 + · · · + dn un . Using properties

of inner products then reveals that


    hv, wi = h ∑_{i=1}^{n} ci ui , ∑_{j=1}^{n} dj uj i

           = ∑_{i,j=1}^{n} ci dj hui, uji              (sesquilinearity)

           = ∑_{j=1}^{n} cj dj                          (B is an orthonormal basis)

           = (c1, c2, . . . , cn) · (d1, d2, . . . , dn)    (definition of dot product)

           = [v]B · [w]B,                               (definition of coordinate vectors)

as desired. ∎

[Side note: An even more explicit characterization of inner products on V = Rn and V = Cn is presented in Exercise 1.4.24.]
For example, we showed in Example 1.3.16 that if E is the standard basis
of Mm,n then the Frobenius inner product on matrices satisfies

hA, Bi = [A]E · [B]E for all A, B ∈ Mm,n .

Theorem 1.4.3 says that the same is true if we replace E by any orthonormal
basis of Mm,n . However, for some inner products it is not quite so obvious how
to find an orthonormal basis and thus represent it as a dot product.

Example 1.4.4 Find a basis B of P 1 [0, 1] (with the standard inner product) with the
A Polynomial property that h f , gi = [ f ]B · [g]B for all f , g ∈ P 1 [0, 1].
Inner Product
Solution:
as the Dot
Product First, we recall that the standard inner product on P 1 [0, 1] is
Z 1
h f , gi = f (x)g(x) dx.
0

It might be tempting to think that we should choose the standard basis


B = {1, x}, but this does not work (for example, we noted earlier that
h1, xi = 1/2, but [1]B · [x]B = 0).
The problem with the standard basis is that it is not orthonormal in this inner product, so Theorem 1.4.3 does not apply to it. To construct an orthonormal basis of P1[0, 1] in this inner product, we start by finding any vector (function) h2 ∈ P1[0, 1] that is orthogonal to h1(x) = 1. [Side note: We could also choose h1 to be any other non-zero vector, but h1(x) = 1 seems like it will make the algebra work out nicely.] To do so, we write h2(x) = ax + b for some scalars a, b ∈ R and solve hh1, h2i = 0:

    0 = hh1, h2i = h1, ax + bi = ∫_{0}^{1} (ax + b) dx = [ ax²/2 + bx ]_{0}^{1} = a/2 + b.
0 2 0 2
We can solve this equation by choosing a = 2 and b = −1 so that h2 (x) =
2x − 1.
Finally, we just need to rescale h1 and h2 so that they have norm 1. We

can compute their norms as follows:


[Side note: Be careful here: the fact that k1k = 1 is not quite as obvious as it seems, since k1k must be computed via an integral (i.e., the inner product).]

    kh1k = k1k = √h1, 1i = √( ∫_{0}^{1} 1 dx ) = 1   and

    kh2k = k2x − 1k = √h2x − 1, 2x − 1i = √( ∫_{0}^{1} (2x − 1)² dx ) = 1/√3.

It follows that B = { 1, √3(2x − 1) } is an orthonormal basis of P1[0, 1],
so Theorem 1.4.3 tells us that h f , gi = [ f ]B · [g]B for all f , g ∈ P 1 [0, 1].
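The conclusion of this example can be double-checked symbolically: with B = {1, √3(2x − 1)} and coordinates obtained by taking inner products against the basis polynomials, the inner product of two arbitrary members of P1[0, 1] equals the dot product of their coordinate vectors. The snippet below is a sketch of that check (SymPy assumed).

```python
import sympy as sp

x, a0, a1, b0, b1 = sp.symbols('x a0 a1 b0 b1')
ip = lambda f, g: sp.integrate(f * g, (x, 0, 1))      # standard inner product on P^1[0, 1]

h1, h2 = sp.Integer(1), sp.sqrt(3) * (2*x - 1)        # the orthonormal basis B found above
f, g = a0 + a1*x, b0 + b1*x                           # two arbitrary polynomials in P^1[0, 1]

fB = sp.Matrix([ip(h1, f), ip(h2, f)])                # coordinates of f in the orthonormal basis B
gB = sp.Matrix([ip(h1, g), ip(h2, g)])                # coordinates of g in the orthonormal basis B
print(sp.simplify(ip(f, g) - fB.dot(gB)))             # 0, i.e. <f, g> = [f]_B . [g]_B
```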

The following corollary shows that we can similarly think of the norm in-
duced by any inner product as “the same” as the usual vector length on Rn or Cn :

Corollary 1.4.4 (The Norm Induced by an Inner Product Looks Like Vector Length)  If B is an orthonormal basis of a finite-dimensional inner product space V then

    kvk = k[v]Bk   for all v ∈ V.

[Side note: Be slightly careful here: kvk refers to the norm induced by the inner product on V, whereas k[v]Bk refers to the length in Rn or Cn.]

Proof. Just choose w = v in Theorem 1.4.3. Then

    kvk = √hv, vi = √([v]B · [v]B) = k[v]Bk. ∎
just measures how long a vector’s coordinate vector is when it is represented
in an orthonormal basis. To give a bit of geometric intuition to what this
means, consider the norm k · k∗ on R2 induced by the weird inner product of
Example 1.3.18, which has the form
    kvk∗ = √hv, vi = √( (v1 + 2v2)² + v2² ).

We can think of this norm as measuring how long v is when it is represented in


the basis B = {(1, 0), (−2, 1)} instead of in the standard basis (see Figure 1.11).
The reason that this works is simply that B is an orthonormal basis with respect
to the weird inner product that induces this norm (we will see how this basis B was constructed shortly, in Example 1.4.6). [Side note: B is not orthonormal with respect to the usual inner product (the dot product) on R2 though.]

Figure 1.11: The length of v = (1, 2) is kvk = √(1² + 2²) = √5. On the other hand, kvk∗ measures the length of v when it is represented in the basis B = {(1, 0), (−2, 1)}: kvk∗ = √((1 + 4)² + 2²) = √29 and k[v]Bk = √(5² + 2²) = √29.

We typically think of orthonormal bases as particularly “well-behaved”


bases for which everything works out a bit more simply than it does for general
bases. To give a bit of a sense of what we mean by this, we now present a result

that shows that finding coordinate vectors with respect to orthonormal bases is
trivial. For example, recall that if B = {e1 , e2 , . . . , en } is the standard basis of
Rn then, for each 1 ≤ j ≤ n, the j-th coordinate of a vector v ∈ Rn is simply
e j · v = v j . The following theorem says that coordinates in any orthonormal
basis can similarly be found simply by computing inner products (instead of
solving a linear system, like we have to do to find coordinate vectors with
respect to general bases).

Theorem 1.4.5 (Coordinates with Respect to Orthonormal Bases)  If B = {u1, u2, . . . , un} is an orthonormal basis of a finite-dimensional inner product space V then

    [v]B = ( hu1, vi, hu2, vi, . . . , hun, vi )   for all v ∈ V.

Proof. We simply make use of Theorem 1.4.3 to represent the inner product as
[u j ]B = e j simply the dot product of coordinate vectors. In particular, we recall that [u j ]B = e j
because u j = 0u1 +
for all 1 ≤ j ≤ n and then compute
· · · + 1u j + · · · + 0un ,
and sticking the  
coefficients of that hu1 , vi, hu2 , vi, . . . , hun , vi = [u1 ]B · [v]B , [u2 ]B · [v]B , . . . , [un ]B · [v]B
linear combination = (e1 · [v]B , e2 · [v]B , . . . , en · [v]B ),
in a vector gives e j .
which equals [v]B since e1 · [v]B is the first entry of [v]B , e2 · [v]B is its second
entry, and so on. 

The above theorem can be interpreted as telling us that the inner product
between v ∈ V and a unit vector u ∈ V measures how far v points in the
direction of u (see Figure 1.12).

Figure 1.12: The coordinates of a vector v with respect to an orthonormal basis


are simply the inner products of v with the basis vectors. In other words, the inner
product of v with a unit vector tells us how far v extends in that direction.

Example 1.4.5 (A Coordinate Vector with Respect to the Pauli Basis)  Compute [A]B (the coordinate vector of A ∈ M2(C) with respect to B) if

    B = { (1/√2)[1 0; 0 1], (1/√2)[0 1; 1 0], (1/√2)[0 −i; i 0], (1/√2)[1 0; 0 −1] }   and   A = [5, 2 − 3i; 2 + 3i, −3].

Solution:
We already showed that B is an orthonormal basis of M2(C) (with respect to the Frobenius inner product) in Example 1.4.3, so we can compute [A]B via Theorem 1.4.5. [Side note: Recall that the Frobenius inner product is hX, Yi = tr(X∗Y).] In particular,

    h(1/√2)[1 0; 0 1], Ai = tr( (1/√2)[5, 2 − 3i; 2 + 3i, −3] ) = 2/√2 = √2,

    h(1/√2)[0 1; 1 0], Ai = tr( (1/√2)[2 + 3i, −3; 5, 2 − 3i] ) = 4/√2 = 2√2,

    h(1/√2)[0 −i; i 0], Ai = tr( (1/√2)[3 − 2i, 3i; 5i, 3 + 2i] ) = 6/√2 = 3√2,

    h(1/√2)[1 0; 0 −1], Ai = tr( (1/√2)[5, 2 − 3i; −2 − 3i, 3] ) = 8/√2 = 4√2.

[Side note: The fact that [A]B is real follows from A being Hermitian, since B is not just a basis of M2(C) but also of the real vector space MH2 of 2 × 2 Hermitian matrices.]

We thus conclude that [A]B = √2(1, 2, 3, 4). We can verify our work by simply checking that it is indeed the case that

    A = [5, 2 − 3i; 2 + 3i, −3] = [1 0; 0 1] + 2[0 1; 1 0] + 3[0 −i; i 0] + 4[1 0; 0 −1].
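The coordinate computation above lends itself to a short numerical check. The sketch below (NumPy assumed) computes each hX, Ai = tr(X∗A) and confirms that the resulting coordinates rebuild A.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
B = [M / np.sqrt(2) for M in (I2, sx, sy, sz)]

A = np.array([[5, 2 - 3j], [2 + 3j, -3]])
coords = np.array([np.trace(X.conj().T @ A) for X in B])        # [A]_B, entrywise <X, A>
print(np.round(coords.real, 4))                                 # sqrt(2) * [1, 2, 3, 4]
print(np.allclose(sum(c * X for c, X in zip(coords, B)), A))    # True: the coordinates rebuild A
```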

Now that we have demonstrated how to determine whether or not a par-


ticular set is an orthonormal basis, and a little bit of what we can do with
orthonormal bases, we turn to the question of how to construct an orthonormal
basis. While this is reasonably intuitive in familiar inner product spaces like Rn
or Mm,n (in both cases, we can just choose the standard basis), it becomes a
bit more delicate when working in weirder vector spaces or with stranger inner
products like the one on P 2 [−1, 1] from Example 1.4.1:
Z 1
f (x)g(x)
h f , gi = √ dx.
−1 1 − x2
Fortunately, there is indeed a standard method of turning any basis of a finite-
dimensional vector space into an orthonormal one. Before stating the result in
full generality, we illustrate how it works in R2 .
Indeed, suppose that we have a (not necessarily orthonormal) basis B =
{v1 , v2 } of R2 that we want to turn into an orthonormal basis C = {u1 , u2 }.
We start by simply defining u1 = v1 /kv1 k, which will be the first vector in our
orthonormal basis (after all, we want each vector in the basis to be normalized).
To construct the next member of the orthonormal basis, we define
w2 = v2 − (u1 · v2 )u1 , and then u2 = w2 /kw2 k,
where we recall that u1 · v2 measures how far v2 points in the direction u1 .
In words, w2 is the same as v2 , but with the portion of v2 that points in the
direction of u1 removed, leaving behind only the piece of it that is orthogonal
to u1 . The division by its length is just done so that the resulting vector u2 has
length 1 (since we want an orthonormal basis, not just an orthogonal one). This
construction of u1 and u2 from v1 and v2 is illustrated in Figure 1.13.
In higher dimensions, we would then continue in this way, adjusting each
vector in the basis so that it is orthogonal to each of the previous vectors, and
then normalizing it. The following theorem makes this precise and tells us that
the result is indeed always an orthonormal basis, regardless of what vector
space and inner product is being used.

Figure 1.13: An illustration of our method for turning any basis of R2 into an or-
thonormal basis of R2 . The process works by (b) normalizing one of the vectors,
(c) moving the other vector so that they are orthogonal, and then (d) normalizing
the second vector. In higher dimensions, the process continues in the same way
by repositioning the vectors one at a time so that they are orthogonal to the rest,
and then normalizing.

Theorem 1.4.6 (The Gram–Schmidt Process)  Suppose B = {v1, v2, . . . , vn} is a linearly independent set in an inner product space V. Define

    wj = vj − ∑_{i=1}^{j−1} hui, vji ui    and    uj = wj / kwjk    for all j = 1, . . . , n.

[Side note: If j = 1 then the summation that defines wj is empty and thus equals 0, so w1 = v1 and u1 = v1/kv1k.]
mal basis of V.

We emphasize that even though the formulas involved in the Gram–Schmidt


process look ugly at first, each part of the formulas has a straightforward
purpose. We want w j to be orthogonal to ui for each 1 ≤ i < j, so that is why
we subtract off each hui , v j iui . Similarly, we want each u j to be a unit vector,
which is why divide w j by its norm.

Proof of Theorem 1.4.6. We prove this result by induction on j. For the base
j = 1 case, we simply note that u1 is indeed a unit vector and span(u1 ) =
span(v1 ) since u1 and v1 are scalar multiples of each other.
For the inductive step, suppose that for some particular j we know that
{u1 , u2 , . . . , u j } is a mutually orthogonal set of unit vectors and

span(u1 , u2 , . . . , u j ) = span(v1 , v2 , . . . , v j ). (1.4.2)

We know that v j+1 6∈ span(v1 , v2 , . . . , v j ), since B is linearly independent. It


follows that v j+1 6∈ span(u1 , u2 , . . . , u j ) as well, so the definition of u j+1 makes
j
sense (i.e., w j+1 = v j+1 − ∑i=1 hui , v j+1 iui 6= 0, so we are not dividing by 0)
and is a unit vector.
To see that u j+1 is orthogonal to each of u1 , u2 , . . . , u j , suppose that

1 ≤ k ≤ j and compute
    huk, uj+1i = h uk , ( vj+1 − ∑_{i=1}^{j} hui, vj+1i ui ) / k vj+1 − ∑_{i=1}^{j} hui, vj+1i ui k i      (definition of uj+1)

               = ( huk, vj+1i − ∑_{i=1}^{j} hui, vj+1i huk, uii ) / k vj+1 − ∑_{i=1}^{j} hui, vj+1i ui k    (expand the inner product)

               = ( huk, vj+1i − huk, vj+1i ) / k vj+1 − ∑_{i=1}^{j} hui, vj+1i ui k                        (k ≤ j, so huk, uii = 0 unless i = k, and huk, uki = 1)

               = 0.                                                                                        (since huk, vj+1i − huk, vj+1i = 0)

All that remains is to show that

span(u1 , u2 , . . . , u j+1 ) = span(v1 , v2 , . . . , v j+1 ).

The fact that the By rearranging the definition of u j+1 , we see that v j+1 ∈ span(u1 , u2 , . . . , u j+1 ).
only When we combine this fact with Equation (1.4.2), this implies
( j + 1)-dimensional
subspace of a span(u1 , u2 , . . . , u j+1 ) ⊇ span(v1 , v2 , . . . , v j+1 ).
( j + 1)-dimensional
vector space is that
vector space itself is
The vi ’s are linearly independent, so the span on the right has dimension j + 1.
hopefully intuitive Similarly, the ui ’s are linearly independent (they are mutually orthogonal, so
enough, but it was linear independence follows from Theorem 1.4.1), so the span on the left also
proved explicitly in has dimension j + 1, and thus the two spans must in fact be equal. 
Exercise 1.2.31.
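The Gram–Schmidt process translates directly into code. The following is a minimal sketch (NumPy assumed) of the formulas in Theorem 1.4.6 for vectors in Fn, where the inner product is specified by a positive definite matrix M via hv, wi = v∗Mw (taking M = I gives the usual dot product).

```python
import numpy as np

def gram_schmidt(vectors, M=None):
    """Orthonormalize `vectors` with respect to <v, w> = v* M w (M = I recovers the dot product)."""
    n = len(vectors[0])
    M = np.eye(n) if M is None else np.asarray(M)
    ip = lambda v, w: np.conj(v) @ M @ w
    us = []
    for v in vectors:
        w = v - sum(ip(u, v) * u for u in us)   # w_j = v_j - sum_i <u_i, v_j> u_i
        us.append(w / np.sqrt(ip(w, w)))        # u_j = w_j / ||w_j||
    return us

# quick check with the dot product: the outputs are unit vectors and mutually orthogonal
us = gram_schmidt([np.array([2.0, 0.0, 1.0]), np.array([3.0, 1.0, 1.0])])
U = np.column_stack(us)
print(np.allclose(U.T @ U, np.eye(2)))          # True
```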
Since finite-dimensional inner product spaces (by definition) have a basis
consisting of finitely many vectors, and Theorem 1.4.6 tells us how to convert
any such basis into one that is orthonormal, we now know that every finite-
dimensional inner product space has an orthonormal basis:

Corollary 1.4.7 Every finite-dimensional inner product space has an orthonormal basis.
Existence of
Orthonormal Bases We now illustrate how to use the Gram–Schmidt process to find orthonormal
bases of various inner product spaces.

Example 1.4.6 Use the Gram–Schmidt process to construct an orthonormal basis of


Finding an R2 with respect to the weird inner product that we introduced back in
Orthonormal Basis Example 1.3.18:
with Respect to
a Weird Inner hv, wi = v1 w1 + 2v1 w2 + 2v2 w1 + 5v2 w2 .
Product
Solution:
The starting point of the Gram–Schmidt process is a basis of the given inner product space, so we start with the standard basis B = {e1, e2} of R2. [Side note: We could also choose any other basis of R2 as our starting point.] To turn this into an orthonormal (with respect to the weird inner product above) basis, we define

    u1 = e1/ke1k = e1,   since ke1k = √he1, e1i = √(1 + 0 + 0 + 0) = 1.

Next, we let

    w2 = e2 − hu1, e2iu1 = (0, 1) − (0 + 2 + 0 + 0)(1, 0) = (−2, 1),
    u2 = w2/kw2k = (−2, 1),   since kw2k = √(4 − 4 − 4 + 5) = 1.

[Side note: Be careful when computing things like hu1, e2i = he1, e2i. It is tempting to think that it equals e1 · e2 = 0, but hu1, e2i refers to the weird inner product, not the dot product.]

It follows that C = {u1, u2} = {(1, 0), (−2, 1)} is an orthonormal basis of R2 with respect to this weird inner product (in fact, this is exactly the
orthonormal basis that we saw back in Figure 1.11).
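In matrix form, this weird inner product can be written as hv, wi = vᵀMw with M = [1 2; 2 5], so the orthonormality of C is a one-line numerical check. The following is a small sketch of that check (NumPy assumed).

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [2.0, 5.0]])                       # <v, w> = v^T M w is the weird inner product
ip = lambda v, w: v @ M @ w

u1, u2 = np.array([1.0, 0.0]), np.array([-2.0, 1.0])
print(ip(u1, u1), ip(u2, u2), ip(u1, u2))        # 1.0 1.0 0.0, so C = {u1, u2} is orthonormal
```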

Example 1.4.7 Find an orthonormal basis (with respect to the usual dot product) of the
Finding an plane S ⊂ R3 with equation x − y − 2z = 0.
Orthonormal Basis
Solution:
of a Plane
We start by picking any basis of S. Since S is 2-dimensional, a basis
is made up of any two vectors in S that are not multiples of each other. By
inspection, v1 = (2, 0, 1) and v2 = (3, 1, 1) are vectors that work, so we choose B = {v1, v2}. [Side note: v1 and v2 can be found by choosing x and y arbitrarily and using the equation x − y − 2z = 0 to solve for z.]

To create an orthonormal basis from B, we apply the Gram–Schmidt process: we define

    u1 = v1/kv1k = (1/√5)(2, 0, 1)

and

    w2 = v2 − (u1 · v2)u1 = (3, 1, 1) − (7/5)(2, 0, 1) = (1/5)(1, 5, −2),
    u2 = w2/kw2k = (1/√30)(1, 5, −2).
[Side note: Just like other bases, orthonormal bases are very non-unique. There are many other orthonormal bases of S.]

It follows that C = {u1, u2} = { (1/√5)(2, 0, 1), (1/√30)(1, 5, −2) } is an orthonormal basis of S, as displayed below:

[Figure: the original basis B = {v1, v2} = {(2, 0, 1), (3, 1, 1)} and the orthonormal basis C = {u1, u2} = {(1/√5)(2, 0, 1), (1/√30)(1, 5, −2)}, each plotted in the plane S.]

Example 1.4.8 Find an orthonormal basis of P 2 [0, 1] with respect to the inner product
Finding an Z 1
Orthonormal Basis h f , gi = f (x)g(x) dx.
of Polynomials 0

Solution:
[Side note: Recall that the standard basis is not orthonormal in this inner product since, for example, h1, xi = 1/2.]

Once again, we apply the Gram–Schmidt process to the standard basis B = {1, x, x²} to create an orthonormal basis C = {h1, h2, h3}. To start, we define h1(x) = 1/k1k = 1. The next member of the orthonormal basis is computed via

    g2(x) = x − hh1, xih1(x) = x − h1, xi1 = x − 1/2,

    h2(x) = g2(x)/kg2k = (x − 1/2) / √( ∫_{0}^{1} (x − 1/2)² dx ) = √3(2x − 1).

[Side note: Notice that {h1, h2} is exactly the orthonormal basis of P1[0, 1] that we constructed back in Example 1.4.4. We were doing the Gram–Schmidt process back there without realizing it.]

The last member of the orthonormal basis C is similarly computed via

    g3(x) = x² − hh1, x²ih1(x) − hh2, x²ih2(x)
          = x² − h1, x²i1 − 12hx − 1/2, x²i(x − 1/2)
          = x² − 1/3 − (x − 1/2) = x² − x + 1/6,   and

    h3(x) = g3(x)/kg3k = (x² − x + 1/6) / √( ∫_{0}^{1} (x² − x + 1/6)² dx ) = √5(6x² − 6x + 1).
 √ √
It follows that C = 1, 3(2x − 1), 5(6x2 − 6x + 1) is an orthonor-
mal basis of P 2 [0, 1]. While this basis looks a fair bit uglier than the
standard basis {1, x, x2 } algebraically, its members are more symmetric
about the midpoint x = 1/2 and more evenly distributed across the interval
[0, 1] geometrically, as shown below:
[Figure: graphs on [0, 1] of the standard basis B = {1, x, x²} and of the orthonormal basis C = {h1, h2, h3} = { 1, √3(2x − 1), √5(6x² − 6x + 1) }, with curves labeled y = h1(x), y = h2(x), and y = h3(x).]

[Side note: However, we should not expect to be able to directly “see” whether or not a basis of P2[0, 1] (or any of its variants) is orthonormal.]
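The same computation can be reproduced symbolically. The sketch below (SymPy assumed) runs the Gram–Schmidt process on {1, x, x²} with this inner product and recovers h1, h2, h3, up to how SymPy chooses to display them.

```python
import sympy as sp

x = sp.symbols('x')
ip = lambda f, g: sp.integrate(f * g, (x, 0, 1))    # standard inner product on P^2[0, 1]

ortho = []
for v in [sp.Integer(1), x, x**2]:
    w = v - sum(ip(u, v) * u for u in ortho)        # subtract the components along earlier u_i
    ortho.append(sp.expand(w / sp.sqrt(ip(w, w))))  # then normalize

print(ortho)   # 1, sqrt(3)*(2x - 1) and sqrt(5)*(6x^2 - 6x + 1), in expanded form
```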

1.4.2 Adjoint Transformations


There is one final operation that is used extensively in introductory linear
algebra that we have not yet generalized to vector spaces beyond Fn , and that is
the (conjugate) transpose of a matrix. We now fill in this gap by introducing the
adjoint of a linear transformation, which can be thought of as finally answering
the question of why the transpose is an important operation (after all, why
would we expect that swapping the rows and columns of a matrix should tell
us anything useful?).

Definition 1.4.3 Suppose that V and W are inner product spaces and T : V → W is a linear
Adjoint transformation. Then a linear transformation T ∗ : W → V is called the
Transformation adjoint of T if

hT (v), wi = hv, T ∗ (w)i for all v ∈ V, w ∈ W.

For example, it is straightforward (but slightly tedious and unenlightening,


so we leave it to Exercise 1.4.17) to show that real matrices A ∈ Mm,n (R)
In fact, this finally satisfy
explains why the
conjugate transpose
(Av) · w = v · (AT w) for all v ∈ Rn , w ∈ Rm ,
is typically the “right” so the transposed matrix AT is the adjoint of A. Similarly, for every complex
version of the
transpose for matrix A ∈ Mm,n (C) we have
complex matrices.
(Av) · w = v · (A∗ w) for all v ∈ Cn , w ∈ Cm ,

The ground field so the conjugate transpose matrix A∗ is the adjoint of A. However, it also makes
must be R or C in sense to talk about the adjoint of linear transformations between more exotic
order for inner vector spaces, as we now demonstrate with the trace (which we recall from
products, and thus
adjoints, to make Example 1.2.7 is the linear transformation tr : Mn (F) → F that adds up the
sense. diagonal entries of a matrix).

Example 1.4.9 Show that the adjoint of the trace tr : Mn (F) → F with respect to the
The Adjoint standard Frobenius inner product is given by
of the Trace
tr∗ (c) = cI for all c ∈ F.

Solution:
In the equation Our goal is to show that hc, tr(A)i = hcI, Ai for all A ∈ Mn (F) and
hc, tr(A)i = hcI, Ai, the
all c ∈ F. Recall that the Frobenius inner product is defined by hA, Bi =
left inner product is
on F (i.e., it is the tr(A∗ B), so this condition is equivalent to
1-dimensional dot 
product c tr(A) = tr (cI)∗ A for all A ∈ Mn (F), c ∈ F.
hc, tr(A)i = c tr(A)) and
the right inner This equation holds simply by virtue of linearity of the trace:
product is the
Frobenius inner  
product on Mn (F). tr (cI)∗ A = tr cI ∗ A = c tr(IA) = c tr(A),

as desired.

The previous example is somewhat unsatisfying for two reasons. First, it


does not illustrate how to actually find the adjoint of a linear transformation, but
rather it only shows how to verify that two linear transformations are adjoints of
each other. How could have we found the adjoint tr∗ (c) = cI if it were not given

to us? Second, how do we know that there is not another linear transformation
that is also an adjoint of the trace? That is, how do we know that tr∗ (c) = cI is
the adjoint of the trace rather than just an adjoint of it?
The following theorem answers both of these questions by showing that, in
finite dimensions, every linear transformation has exactly one adjoint, and it
can be computed by making use of orthonormal bases of the two vector spaces
V and W.

Theorem 1.4.8 Suppose that V and W are finite-dimensional inner product spaces with
Existence and orthonormal bases B and C, respectively. If T : V → W is a linear trans-
Uniqueness of formation then there exists a unique adjoint transformation T ∗ : W → V,
the Adjoint and its standard matrix satisfies
 ∗ ∗
T B←C = [T ]C←B .

In infinite dimensions, Proof. To prove uniqueness of T ∗ , suppose that T ∗ exists, let v ∈ V and w ∈ W,
some linear
and compute hT (v), wi in two different ways:
transformations fail
to have an adjoint
(see Remark 1.4.1). hT (v), wi = hv, T ∗ (w)i (definition of T ∗ )
 
However, if it exists = [v]B · T ∗ (w) B (Theorem 1.4.3)
then it is still unique  
(see Exercise 1.4.23). = [v]B · T ∗ B←C [w]C ) (definition of standard matrix)
 
= [v]∗B T ∗ B←C [w]C . (definition of dot product)

Similarly,

hT (v), wi = [T (v)]C · [w]C (Theorem 1.4.3)


= [T ]C←B [v]B ) · [w]C (definition of standard matrix)
= [v]∗B [T ]C←B

[w]C . (definition of dot product)
 
It follows that [v]∗B T ∗ B←C [w]C = [v]∗B [T ]C←B
∗ [w]C for all [v]B ∈ Fn and all
m
[w]C ∈ F (where F is the ground field). If we choose v to be the i-th vector in
the basis B and w to be the j-th vector in C, then [v]B = ei and [w]C = e j , so
     
[v]∗B T ∗ B←C [w]C = eTi T ∗ B←C e j is the (i, j)-entry of T ∗ B←C , and

[v]∗B [T ]C←B [w]C = eTi [T ]C←B

ej is the (i, j)-entry of∗
[T ]C←B .
 
Since these quantities are equal for all i and j, it follows that T ∗ B←C =
[T ]C←B
∗ . Uniqueness of T ∗ now follows immediately from uniqueness of stan-
dard matrices.
Existence of T ∗ follows from that fact that we can choose T ∗ to be the
linear transformation with standard matrix [T ]C←B
∗ and then follow the above
argument backward to verify that hT (v), wi = hv, T ∗ (w)i for all v ∈ V and
w ∈ W. 
It is worth emphasizing that the final claim of Theorem 1.4.8 does not
necessarily hold if B or C are not orthonormal. For example, the standard
matrix of the differentiation map D : P 2 → P 2 with respect to the standard
basis E = {1, x, x2 } is
 
0 1 0
[D]E = 0 0 2 .
0 0 0

We thus might expect that if we equip P 2 with the standard inner product
Z 1
h f , gi = f (x)g(x) dx
0

then we would have


 
 ∗ 0 0 0
D E = [D]∗E = 1 0 0 .
0 2 0

which would give D∗ (ax2 + bx + c) = cx + 2bx2 . However, it is straightforward


This formula forD∗ is to check that formula for D∗ is incorrect since, for example,
also incorrect even if
we use pretty much Z 1 Z 1
any other natural h1, D(x)i = D(x) dx = 1 dx = 1, but
inner product on P 2 . 0 0
Z 1 Z 1
hD∗ (1), xi = D∗ (1)x dx = x2 dx = 1/3.
0 0

To fix the above problem and find the actual adjoint of D, we must mimic
the above calculation with an orthonormal basis of P 2 rather than the standard
basis E = {1, x, x2 }.

Example 1.4.10 Compute the adjoint of the differentiation map D : P 2 [0, 1] → P 2 [0, 1]
The Adjoint of (with respect to the standard inner product).
the Derivative
Solution:
Fortunately, we already computed an orthonormal basis of P 2 [0, 1]
back in Example 1.4.8, and it is C = {h1 , h2 , h3 }, where
√ √
h1 (x) = 1, h2 (x) = 3(2x − 1), and h3 (x) = 5(6x2 − 6x + 1).

To find the standard matrix of D with respect to this basis, we compute

    D(h1(x)) = 0,
    D(h2(x)) = 2√3 = 2√3 h1(x),   and
    D(h3(x)) = 12√5 x − 6√5 = 2√15 h2(x).

Then the standard matrices of D and D∗ are given by

    [D]C = [0, 2√3, 0; 0, 0, 2√15; 0, 0, 0]   and   [D∗]C = [D]∗C = [0, 0, 0; 2√3, 0, 0; 0, 2√15, 0],

so D∗ is the linear transformation with the property that D∗(h1) = 2√3 h2, D∗(h2) = 2√15 h3, and D∗(h3) = 0.
This is a fine and dandy answer already, but it can perhaps be made a
bit more intuitive if we instead describe D∗ in terms of what it does to 1,
x, and x², rather than h1, h2, and h3. That is, we compute

[Side note: These linear combinations, like x = (1/(2√3))h2(x) + (1/2)h1(x), can be found by solving linear systems.]

    D∗(1) = D∗(h1(x)) = 2√3 h2(x) = 6(2x − 1),

    D∗(x) = D∗( (1/(2√3))h2(x) + (1/2)h1(x) ) = √5 h3(x) + √3 h2(x) = 2(15x² − 12x + 1),   and

    D∗(x²) = D∗( (1/(6√5))h3(x) + (1/(2√3))h2(x) + (1/3)h1(x) ) = √5 h3(x) + (2/√3)h2(x) = 30x² − 26x + 3.

(As a quick check of that last formula, note that hD(x²), x²i = h2x, x²i = 1/2 = hx², 30x² − 26x + 3i, exactly as the defining property of the adjoint requires.)

[Side note: I guess D∗ is just kind of ugly no matter which basis we represent it in.]

Putting this all together shows that

    D∗(ax² + bx + c) = 30(a + b)x² − 2(13a + 12b − 6c)x + 3a + 2b − 6c.
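Because the formula for D∗ is easy to get wrong, it is worth verifying by computer. The sketch below (SymPy assumed) rebuilds D∗ from the orthonormal basis C, in the spirit of Theorem 1.4.8, and checks the defining property hD(f), gi = hf, D∗(g)i.

```python
import sympy as sp

x = sp.symbols('x')
ip = lambda f, g: sp.integrate(f * g, (x, 0, 1))                 # standard inner product on P^2[0, 1]
h = [sp.Integer(1), sp.sqrt(3)*(2*x - 1), sp.sqrt(5)*(6*x**2 - 6*x + 1)]   # orthonormal basis C

D_C = sp.Matrix(3, 3, lambda i, j: ip(h[i], sp.diff(h[j], x)))   # [D]_C, with (i, j)-entry <h_i, D(h_j)>
Dstar_C = D_C.T                                                  # [D*]_C is its (conjugate) transpose

def D_star(g):
    coords = sp.Matrix([ip(hi, g) for hi in h])                  # [g]_C
    return sp.expand(sum(c * hi for c, hi in zip(Dstar_C * coords, h)))

print(D_star(x**2))                                              # 30*x**2 - 26*x + 3
f, g = 1 + 2*x + 3*x**2, 4 - x + x**2
print(sp.simplify(ip(sp.diff(f, x), g) - ip(f, D_star(g))))      # 0, as the adjoint requires
```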

Example 1.4.11 Show that the adjoint of the transposition map T : Mm,n → Mn,m , with
The Adjoint of respect to the Frobenius inner product, is also the transposition map.
the Transpose
Solution:
Our goal is to show that hAT , Bi = hA, BT i for all A ∈ Mm,n and
B ∈ Mn,m . Recall that the Frobenius inner product is defined by hA, Bi =
tr(A∗ B), so this is equivalent to
Recall that A = (AT )∗  
is the entrywise tr AB = tr A∗ BT for all A ∈ Mm,n , B ∈ Mn,m .
complex conjugate
of A.
These two quantities can be shown to be equal by brute-force calcula-
tion of the traces and matrix multiplications in terms of the entries of A and
B, but a more elegant way is to use properties of the trace and transpose:
 T 
tr AB = tr AB (transpose does not change trace)

= tr BT A∗ (transpose of a product)

= tr A∗ BT . (cyclic commutativity of trace)

The situation presented in Example 1.4.11, where a linear transformation is


its own adjoint, is important enough that we give it a name:

Definition 1.4.4 If V is an inner product space then a linear transformation T : V → V is


Self-Adjoint called self-adjoint if T ∗ = T .
Transformations
For example, a matrix in Mn (R) is symmetric (i.e., A = AT ) if and only
if it is a self-adjoint linear transformation on Rn , and a matrix in Mn (C) is
Hermitian (i.e., A = A∗ ) if and only if it is a self-adjoint linear transformation on
Strictly speaking, the Cn . Slightly more generally, Theorem 1.4.8 tells us that a linear transformation
transposition map on
Mm,n (F) is only
on a finite-dimensional inner product space V is self-adjoint if and only if its
self-adjoint when standard matrix (with respect to some orthonormal basis of V) is symmetric or
m = n, since Hermitian (depending on whether the underlying field is R or C).
otherwise the input The fact that the transposition map on Mn is self-adjoint tells us that
and output vector
spaces are different. its standard matrix (with respect to an orthonormal basis, like the standard
basis) is symmetric. In light of this, it is perhaps worthwhile looking back at

Example 1.2.10, where we explicitly computed its standard matrix in the n = 2


case (which is indeed symmetric).

Remark 1.4.1 The reason that Theorem 1.4.8 specifies that the vector spaces must be
The Adjoint in finite-dimensional is that some linear transformations acting on infinite-
Infinite Dimensions dimensional vector spaces do not have an adjoint. To demonstrate this phe-
nomenon, consider the vector space c00 of all eventually-zero sequences
of real numbers (which we first introduced in Example 1.1.10), together
with inner product
Since all of the

sequences here are (v1 , v2 , v3 , . . .), (w1 , w2 , w3 , . . .) = ∑ vi wi .
eventually zero, all of i=1
the sums considered
here only have Then consider the linear transformation T : c00 → c00 defined by
finitely many
non-zero terms, so
!
 ∞ ∞ ∞
we do not need to T (v1 , v2 , v3 , . . .) = ∑ vi , ∑ vi , ∑ vi , . . . .
worry about limits or i=1 i=2 i=3
convergence.
A straightforward calculation reveals that, if the adjoint T ∗ : c00 → c00
exists, it must have the form
!
1 2 3
T ∗ (w1 , w2 , w3 , . . .) = ∑ wi , ∑ wi , ∑ wi , . . . .
i=1 i=1 i=1

However, this is not actually a linear transformation on c00 since, for


example, it would give

T ∗ (1, 0, 0, . . .) = (1, 1, 1, . . .),

which is not in c00 since its entries are not eventually 0.

1.4.3 Unitary Matrices


Properties of Recall that invertible matrices are exactly the matrices whose columns form
invertible matrices
a basis of Rn (or Fn more generally). Now that we understand orthonormal
like this one are
summarized in bases and think of them as the “nicest” bases out there, it seems natural to ask
Theorem A.1.1. what additional properties invertible matrices have if their columns form an
orthonormal basis, rather than just any old basis. We now give a name to these
matrices.

Definition 1.4.5 If F = R or F = C then a matrix U ∈ Mn (F) is called a unitary matrix


Unitary Matrix if its columns form an orthonormal basis of Fn (with respect to the usual
dot product).

For example, the identity matrix is unitary since its columns are the standard
basis vectors e1 , e2 , . . . , en , which form the standard basis of Fn , which is
orthonormal. As a slightly less trivial example, consider the matrix
 
1 1 −1
U=√ ,
2 1 1
 √ √
which we can show is unitary simply by noting that (1, 1)/ 2, (1, −1)/ 2
(i.e., the set consisting of the columns of U) is an orthonormal basis of R2 .
For a refresher on We can make geometric sense of unitary matrices if we recall that the
how to think of
columns of a matrix tell us where that matrix sends the standard basis vectors
matrices
geometrically as e1 , e2 , . . . , en . Thus, just like invertible matrices are those that send the unit
linear square grid to a parallelogram grid (without squishing it down to a smaller
transformations, see dimension), unitary matrices are those that send the unit square grid to a
Appendix A.1.2.
(potentially rotated or reflected) unit square grid, as in Figure 1.14.
[Side note: We will show shortly, in Examples 1.4.12 and 1.4.13, that all rotation matrices and all reflection matrices are indeed unitary.]

[Figure: the unit square grid spanned by e1 and e2, its image under an invertible matrix P (a parallelogram grid spanned by Pe1 and Pe2), and its image under a unitary matrix U (a rotated unit square grid spanned by Ue1 and Ue2).]

Figure 1.14: A non-zero matrix P ∈ M2 is invertible if and only if it sends the unit
square grid to a parallelogram grid (whereas it is non-invertible if and only if it
sends that grid to a line). A matrix U ∈ M2 is unitary if and only if it sends the unit
square grid to a unit square grid that is potentially rotated and/or reflected, but
not skewed.
For this reason, we often think of unitary matrices as the most “rigid” or
“well-behaved” invertible matrices that exist—they preserve not just the dimen-
sion of Fn , but also its shape (but maybe not its orientation). The following
theorem provides several additional characterizations of unitary matrices that
can help us understand them in other ways and perhaps make them a bit more
intuitive.

Theorem 1.4.9 Suppose F = R or F = C and U ∈ Mn (F). The following are equivalent:


Characterization of a) U is unitary,
Unitary Matrices b) U ∗ is unitary,
c) UU ∗ = I,
d) U ∗U = I,
e) (Uv) · (Uw) = v · w for all v, w ∈ Fn , and
In this theorem, k · k f) kUvk = kvk for all v ∈ Fn .
refers to the
standard length in
Fn . We discuss how Before proving this result, it is worthwhile to think about what some of
to generalize unitary its characterizations really mean. Conditions (c) and (d) tell us that unitary
matrices to other
inner products,
matrices are not only invertible, but they are the matrices U for which their
norms, and vector inverse equals their adjoint (i.e., U −1 = U ∗ ). Algebraically, this is extremely
spaces in convenient since it is trivial to compute the adjoint (i.e., conjugate transpose)
Section 1.D.3. of a matrix, so it is thus trivial to compute the inverse of a unitary matrix.
The other properties of Theorem 1.4.9 can also be thought of as stronger

versions of properties of invertible matrices, as summarized in Table 1.2.

    Property of invertible P               | Property of unitary U
    ---------------------------------------|------------------------------------------
    P−1 exists                             | U−1 = U∗
    kPvk ≠ 0 whenever kvk ≠ 0              | kUvk = kvk for all v
    columns of P are a basis               | columns of U are an orthonormal basis
    vec. space automorphism on Fn          | inner prod. space automorphism on Fn

Table 1.2: A comparison of the properties of invertible matrices and the corre-
sponding stronger properties of unitary matrices. The final properties (that invertible
matrices are vector space automorphisms while unitary matrices are inner product
space automorphisms) means that invertible matrices preserve linear combina-
tions, whereas unitary matrices preserve linear combinations as well as the dot
product (property (e) of Theorem 1.4.9).

The final two properties of Theorem 1.4.9 provide us with another natural
geometric interpretation of unitary matrices. Condition (f) tells us that unitary
matrices are exactly those that preserve the length of every vector. Similarly,
since the dot product can be used to measure angles between vectors, condi-
tion (e) says that unitary matrices are exactly those that preserve the angle
between every pair of vectors, as in Figure 1.15.

y y
Uv4
v1 Uv3
v3
v2
v4
U
−−−−→
unitary Uv2
x x
Uv1

Figure 1.15: Unitary matrices are those that preserve the lengths of vectors as well
as the angles between them.

Proof of Theorem 1.4.9. We start by showing that conditions (a)–(d) are equivalent to each other. The equivalence of conditions (c) and (d) follows from the fact that, for square matrices, a one-sided inverse is necessarily a two-sided inverse.

[Side note: To prove this theorem, we show that the 6 properties imply each other as follows: (a) ⇔ (b) ⇔ (c) ⇔ (d), and then (d) ⇒ (f) ⇒ (e) ⇒ (d).]

To see that (a) is equivalent to (d), we write U = [u1 | u2 | · · · | un] and then use block matrix multiplication to multiply by U∗:

    U∗U = [u1∗; u2∗; · · · ; un∗][u1 | u2 | · · · | un] = [u1·u1, u1·u2, · · · , u1·un; u2·u1, u2·u2, · · · , u2·un; · · · ; un·u1, un·u2, · · · , un·un].

This product equals I if and only if its diagonal entries equal 1 and its off-
diagonal entries equal 0. In other words, U ∗U = I if and only if ui · ui = 1 for
all i and ui · u j = 0 whenever i 6= j. This says exactly that {u1 , u2 , . . . , un } is a
set of mutually orthogonal normalized vectors. Since it consists of exactly n

vectors, Theorem 1.4.2 tells us that this is equivalent to {u1 , u2 , . . . , un } being


an orthonormal basis of Fn , which is exactly the definition of U being unitary.
The equivalence of (b) and (c) follows by applying the same argument as
in the previous paragraph to U ∗ instead of U, so all that remains is to show
that conditions (d), (e), and (f) are equivalent. We prove these equivalences by
showing the chain of implications (d) =⇒ (f) =⇒ (e) =⇒ (d).
To see that (d) implies (f), suppose U ∗U = I. Then for all v ∈ Fn we have

kUvk2 = (Uv) · (Uv) = (U ∗Uv) · v = v · v = kvk2 ,

as desired.
The implication (f) For the implication (f) =⇒ (e), note that if kUvk2 = kvk2 for all v ∈ Fn
=⇒ (e) is the “tough
then (Uv) · (Uv) = v · v for all v ∈ Fn . If x, y ∈ Fn then this tells us (by choosing
one” of this proof.
v = x + y) that U(x + y) · U(x + y) = (x + y) · (x + y). Expanding this dot
product on both the left and right then gives
Here we use the fact 
that (Ux) · (Uy) + (Ux) · (Ux) + 2Re (Ux) · (Uy) + (Uy) · (Uy) = x · x + 2Re(x · y) + y · y.
(Uy) · (Ux) = (Ux) ·
(Uy) + (Ux) · (Uy) =
 By then using the facts that (Ux) · (Ux) = x · x and (Uy) · (Uy) = y · y, we can
2Re (Ux) · (Uy) .
simplify the above equation to the form

Re (Ux) · (Uy) = Re(x · y).

If F = R then this implies (Ux) · (Uy) = x · y for all x, y ∈ Fn , as desired. If


instead F = C then we can repeat the above argument with v = x + iy to see
that

Im (Ux) · (Uy) = Im(x · y),

so in this case we have (Ux) · (Uy) = x · y for all x, y ∈ Fn too, establishing (e).
Finally, to see that (e) =⇒ (d), note that if we rearrange (Uv)·(Uw) = v·w
slightly, we get

(U ∗U − I)v · w = 0 for all v, w ∈ Fn .
2
If we choose w = (U ∗U − I)v then this implies (U ∗U − I)v = 0 for all
v ∈ Fn , so (U ∗U − I)v = 0 for all v ∈ Fn . This in turn implies U ∗U − I = O, so
U ∗U = I, which completes the proof. 
Checking whether or not a matrix is unitary is now quite simple, since we
just have to check whether or not U ∗U = I. For example, if we return to the
matrix

    U = (1/√2)[1 −1; 1 1]

from earlier, we can now check that it is unitary simply by computing

    U∗U = (1/2)[1 1; −1 1][1 −1; 1 1] = [1 0; 0 1].

[Side note: The fact that this matrix is unitary makes sense geometrically if we notice that it rotates R2 counter-clockwise by π/4 (45°).]

Since U ∗U = I, Theorem 1.4.9 tells us that U is unitary. The following example


generalizes the above calculation.
Example 1.4.12 (Rotation Matrices are Unitary)  Recall from introductory linear algebra that the standard matrix of the linear transformation Rθ : R2 → R2 that rotates R2 counter-clockwise by an angle of θ is

    [Rθ] = [cos(θ), −sin(θ); sin(θ), cos(θ)].

Show that [Rθ] is unitary.

Solution:
Since rotation matrices do not change the length of vectors, we know that they must be unitary. To verify this a bit more directly, we compute [Rθ]∗[Rθ]:

    [Rθ]∗[Rθ] = [cos(θ), sin(θ); −sin(θ), cos(θ)][cos(θ), −sin(θ); sin(θ), cos(θ)]
              = [cos²(θ) + sin²(θ), sin(θ)cos(θ) − cos(θ)sin(θ); cos(θ)sin(θ) − sin(θ)cos(θ), sin²(θ) + cos²(θ)]
              = [1 0; 0 1].

[Side note: Recall that sin²(θ) + cos²(θ) = 1 for all θ ∈ R.]

Since [Rθ]∗[Rθ] = I, we conclude that [Rθ] is unitary.

Example 1.4.13 Recall from introductory linear algebra that the standard matrix of the
Reflection Matrices linear transformation Fu : Rn → Rn that reflects Rn through the line in the
are Unitary direction of the unit vector u ∈ Rn is

[Fu ] = 2uuT − I.

Show that [Fu ] is unitary.


Solution:
Again, reflection matrices do not change the length of vectors, so we
know that they must be unitary. To see this a bit more directly, we compute
[Fu ]∗ [Fu ]:

[Fu ]∗ [Fu ] = (2uuT − I)∗ (2uuT − I) = 4u(uT u)uT − 4uuT + I


= 4uuT − 4uuT + I = I,

where the third equality comes from the fact that u is a unit vector, so
uTu = kuk² = 1. Since [Fu]∗[Fu] = I, we conclude that [Fu] is unitary.
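Both families can be spot-checked numerically. The snippet below (NumPy assumed) verifies U∗U = I for one rotation and one reflection.

```python
import numpy as np

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])      # rotation by theta

u = np.array([3.0, 4.0]) / 5.0                       # a unit vector
F = 2 * np.outer(u, u) - np.eye(2)                   # reflection through the line spanned by u

for U in (R, F):
    print(np.allclose(U.T @ U, np.eye(2)))           # True both times: R and F are unitary
```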

Again, the previous two examples provide exactly the intuition that we
should have for unitary matrices—they are the ones that rotate and/or reflect
Fn , but do not stretch, shrink, or otherwise distort it. They can be thought of as
very rigid linear transformations that leave the size and shape of Fn intact, but
possibly change its orientation.

Remark 1.4.2 Many sources refer to real unitary matrices as orthogonal matrices (but
Orthogonal still refer to complex unitary matrices as unitary). However,
Matrices

we dislike this terminology for numerous reasons:


a) The columns of orthogonal matrices (i.e., real unitary matrices) are
mutually orthonormal, not just mutually orthogonal.
b) Two orthogonal matrices (i.e., real unitary matrices) need not be
orthogonal to each other in any particular inner product on Mn (R).
c) There just is no reason to use separate terminology depending on
whether the matrix is real or complex.
We thus do not use the term “orthogonal matrix” again in this book, and
we genuinely hope that it falls out of use.

1.4.4 Projections
As one final application of inner products and orthogonality, we now introduce
projections, which we roughly think of as linear transformations that squish
vectors down into some given subspace. For example, when discussing the
Gram–Schmidt process, we implicitly used the fact that if u ∈ V is a unit vector
then the linear transformation Pu : V → V defined by Pu (v) = hu, viu squishes
v down onto span(u), as in Figure 1.16. Indeed, Pu is a projection onto span(u).


Figure 1.16: Given a unit vector u, the linear transformation Pu (v) = hu, viu is a
projection onto the line in the direction of u.

Intuitively, projecting a vector onto a subspace is analogous to casting a


vector’s shadow onto a surface as in Figure 1.17, or looking at a 3D object from
one side (and thus seeing a 2D image of it, like when we look at objects on
computer screens). The simplest projections to work with mathematically are
those that project at a right angle (as in Figures 1.16 and 1.17(b))—these are
called orthogonal projections, whereas if they project at another angle (as in
Figure 1.17(a)) then they are called oblique projections.
The key observation that lets us get our hands on projections mathematically
is that a linear transformation P : V → V projects onto its range (i.e., it leaves
Recall that P2 = P ◦ P. every vector in its range alone) if and only if P2 = P, since P(v) ∈ range(P)
regardless of v ∈ V, so applying P again does not change it (see Figure 1.18(a)).
If we furthermore require that every vector is projected down at a right angle
(that is, it is projected orthogonally) then we need V to be an inner product
One of the greatest
joys that this author space and we want hP(v), v − P(v)i = 0 for all v ∈ V (see Figure 1.18(b)).
gets is from glossing Remarkably, this property is equivalent to P being self-adjoint (i.e., P = P∗ ),
over ugly details and but proving this fact is somewhat tedious, so we defer it to Exercises 1.4.29
making the reader
and 2.1.23.
work them out.
With all of this in mind, we are finally able to define (orthogonal) projec-
tions in general:
[Side note: When the sun is directly above us in the sky, our shadow is our orthogonal projection. At any other time, our shadow is just an oblique projection.]

Figure 1.17: A projection P can be thought of as a linear transformation that casts


the shadow of a vector onto a subspace from a far-off light source. Here, R3 is
projected onto the xy-plane.

Figure 1.18: A rank-2 projection P : R3 → R3 projects onto a plane (its range). After it
projects once, projecting again has no additional effect, so P2 = P. An orthogonal
projection is one for which, as in (b), P(v) is always orthogonal to v − P(v).

Definition 1.4.6 Suppose that V is a vector space and P : V → V is a linear transformation.


Projections a) If P2 = P then P is called a projection.
b) If V is an inner product space and P2 = P = P∗ then P is called an
orthogonal projection.
We furthermore say that P projects onto range(P).

Example 1.4.14 (Determining if a Matrix is a Projection)  Determine which of the following matrices are projections. If they are projections, determine whether or not they are orthogonal and describe the subspace of Rn that they project onto.

    a) P = [1 −1; 1 1]      b) Q = [1 0 0; 0 1 1/2; 0 0 0]      c) R = [5/6 1/6 −1/3; 1/6 5/6 1/3; −1/3 1/3 1/3]

Solutions:
a) This matrix is not a projection, since direct computation shows that

       P² = [0 −2; 2 0] ≠ [1 −1; 1 1] = P.

b) Again, we just check whether or not Q² = Q:

       Q² = [1 0 0; 0 1 1/2; 0 0 0][1 0 0; 0 1 1/2; 0 0 0] = [1 0 0; 0 1 1/2; 0 0 0] = Q,

   so Q is a projection, but Q∗ ≠ Q so it is not an orthogonal projection. Since its columns span the xy-plane, that is its range (i.e., the subspace that it projects onto).
   [Side note: In fact, Q is exactly the projection onto the xy-plane that was depicted in Figure 1.17(a).]

c) This matrix is a projection, since

       R² = ( (1/6)[5 1 −2; 1 5 2; −2 2 2] )² = (1/36)[30 6 −12; 6 30 12; −12 12 12] = (1/6)[5 1 −2; 1 5 2; −2 2 2] = R.

   Furthermore, R is an orthogonal projection since R∗ = R. We can compute range(R) = span{(2, 0, −1), (0, 2, 1)} using techniques from introductory linear algebra, which is the plane with equation x − y + 2z = 0.
   [Side note: R is the projection that was depicted in Figure 1.18(b).]
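Checks like these are quick to automate. The sketch below (NumPy assumed) tests P² = P and P = P∗ for the three matrices of this example.

```python
import numpy as np

P = np.array([[1.0, -1.0], [1.0, 1.0]])
Q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 0.0]])
R = np.array([[5.0, 1.0, -2.0], [1.0, 5.0, 2.0], [-2.0, 2.0, 2.0]]) / 6.0

for name, X in (('P', P), ('Q', Q), ('R', R)):
    print(name, np.allclose(X @ X, X), np.allclose(X, X.T))
# P False False  (not a projection)
# Q True  False  (an oblique projection)
# R True  True   (an orthogonal projection)
```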

We return to oblique Although projections in general have their uses, we are primarily interested
projections in
in orthogonal projections, and will focus on them for the remainder of this
Section 1.B.2.
section. One of the nicest features of orthogonal projections is that they are
uniquely determined by the subspace that they project onto (i.e., there is only
one orthogonal projection for each subspace), and they can be computed in a
straightforward way from any orthonormal basis of that subspace, at least in
finite dimensions.
In order to get more comfortable with constructing and making use of or-
thogonal projections, we start by describing what they look like in the concrete
setting of matrices that project Fn down onto some subspace of it.

Theorem 1.4.10 Suppose F = R or F = C and let S be an m-dimensional subspace of Fn .


Construction of Then there is a unique orthogonal projection P onto S, and it is given by
Orthogonal
Projections P = AA∗ ,

where A ∈ Mn,m (F) is a matrix with any orthonormal basis of S as its


columns.

Recall that Proof. We start by showing that the matrix P = AA∗ is indeed an orthogonal
(AB)∗ = B∗ A∗ .
Plugging in B = A∗
projection onto S. To verify this claim, we write A = u1 | u2 | · · · | um .
gives (AA∗ )∗ = AA∗ . Then notice that P∗ = (AA∗ )∗ = AA∗ = P and

P2 = (AA∗ )(AA∗ ) = A(A∗ A)A∗


 
Our method of u1 · u1 u1 · u2 ··· u1 · um
computing A∗ A here  u2 · u1 u2 · u2 ··· u2 · um 
  ∗
is almost identical to = A . .. .. ..  A
the one from the  .. . . . 
proof of
Theorem 1.4.9. Recall um · u1 um · u2 ··· um · um
that {u1 , . . . , um } = AIm A∗ = AA∗ = P.
is an orthonormal
basis of S.
It follows that P is an orthogonal projection. To see that range(P) = S, we
simply notice that range(P) = range(A∗ A) = range(A) by Theorem A.1.2, and
range(A) is the span of its columns (by that same theorem), which is S (since
those columns were chosen specifically to be an orthonormal basis of S).
To see that P is unique, suppose that Q is another orthogonal projection onto S. If {u_1, u_2, . . . , u_m} is an orthonormal basis of S then we can extend it to an orthonormal basis {u_1, u_2, . . . , u_n} of all of F^n via Exercise 1.4.20. We then claim that

    Pu_j = Qu_j = \begin{cases} u_j, & \text{if } 1 \le j \le m \\ 0, & \text{if } m < j \le n. \end{cases}

(Uniqueness of P lets us, in finite dimensions, talk about the orthogonal projection onto a given subspace, rather than just an orthogonal projection onto it.)

To see why this is the case, we notice that Pu_j = Qu_j = u_j for 1 ≤ j ≤ m because u_j ∈ S and P, Q leave everything in S alone. The fact that Qu_j = 0 for j > m follows from the fact that Q^2 = Q = Q^∗, so

    ⟨Qu_j, Qu_j⟩ = ⟨Q^∗Qu_j, u_j⟩ = ⟨Q^2 u_j, u_j⟩ = ⟨Qu_j, u_j⟩.

Since u_j is orthogonal to each of u_1, u_2, . . . , u_m and thus to everything in range(Q) = S, we then have ⟨Qu_j, u_j⟩ = 0, so ‖Qu_j‖^2 = ⟨Qu_j, Qu_j⟩ = 0 too, and thus Qu_j = 0. The proof that Pu_j = 0 when j > m is identical.

Since a matrix (linear transformation) is completely determined by how it acts on a basis of F^n, the fact that Pu_j = Qu_j for all 1 ≤ j ≤ n implies P = Q, so all orthogonal projections onto S are the same. ∎

In the special case when S is 1-dimensional (i.e., a line), the above result simply says that P = uu^∗, where u is a unit vector pointing in the direction of that line. It follows that Pv = (uu^∗)v = (u · v)u, which recovers the fact that we noted earlier about functions of this form (well, functions of the form P(v) = ⟨u, v⟩u) projecting down onto the line in the direction of u.
More generally, if we expand out the product P = AA^∗ using block matrix multiplication, we see that if {u_1, u_2, . . . , u_m} is an orthonormal basis of S then

    P = \begin{bmatrix} u_1 | u_2 | \cdots | u_m \end{bmatrix} \begin{bmatrix} u_1^∗ \\ u_2^∗ \\ \vdots \\ u_m^∗ \end{bmatrix} = \sum_{j=1}^{m} u_j u_j^∗.

(This is a special case of the rank-one sum decomposition from Theorem A.1.3.)

Example 1.4.15 (Finding an Orthogonal Projection Onto a Plane). Construct the orthogonal projection P onto the plane S ⊂ R^3 with equation x − y − 2z = 0.

Solution:
Recall from Example 1.4.7 that one orthonormal basis of S is

    C = {u_1, u_2} = { (1/√5)(2, 0, 1), (1/√30)(1, 5, −2) }.

It follows from Theorem 1.4.10 that the (unique!) orthogonal projection onto S is

    P = u_1 u_1^∗ + u_2 u_2^∗ = \frac{1}{5} \begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix} \begin{bmatrix} 2 & 0 & 1 \end{bmatrix} + \frac{1}{30} \begin{bmatrix} 1 \\ 5 \\ -2 \end{bmatrix} \begin{bmatrix} 1 & 5 & -2 \end{bmatrix} = \frac{1}{6} \begin{bmatrix} 5 & 1 & 2 \\ 1 & 5 & -2 \\ 2 & -2 & 2 \end{bmatrix}.

(Even though there are lots of orthonormal bases of S, they all produce the same projection P. It is also worth comparing P to the orthogonal projection onto the plane x − y + 2z = 0 from Example 1.4.14(c).)

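If we do not want to run Gram–Schmidt by hand, an orthonormal basis of the plane can be obtained numerically. The following sketch (our own, using NumPy's QR factorization on any two vectors spanning S) reproduces the projection above:

```python
import numpy as np

# Two (non-orthonormal) vectors that span the plane x - y - 2z = 0, as columns.
B = np.array([[2.0, 1.0],
              [0.0, 5.0],
              [1.0, -2.0]])

# QR gives a matrix A whose columns are an orthonormal basis of range(B) = S.
A, _ = np.linalg.qr(B)
P = A @ A.T                 # Theorem 1.4.10: P = A A*

print(np.round(6 * P, 10))  # approx [[5, 1, 2], [1, 5, -2], [2, -2, 2]]
```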
One of the most useful features of orthogonal projections is that they do not
just project a vector v anywhere in their range, but rather they always project
down to the closest vector in their range, as illustrated in Figure 1.19. This
observation hopefully makes some intuitive sense, since the shortest path from
us to the ceiling above us is along a line pointing straight up (i.e., the shortest
path is orthogonal to the ceiling), but we make it precise in the following
theorem.
Figure 1.19: The fastest way to get from a vector to a nearby plane is to go "straight down" to it. In other words, the closest vector to v in a subspace S is P(v), where P is the orthogonal projection onto S. That is, ‖v − P(v)‖ ≤ ‖v − w‖ for all w ∈ S. (Recall that the distance between two vectors v and w is ‖v − w‖.)
Theorem 1.4.11 (Orthogonal Projections Find the Closest Point in a Subspace). Suppose P is an orthogonal projection onto a subspace S of an inner product space V. Then

    ‖v − P(v)‖ ≤ ‖v − w‖   for all v ∈ V, w ∈ S.

Furthermore, equality holds if and only if w = P(v).

Proof. We first notice that v − P(v) is orthogonal to every vector x ∈ S, since

    ⟨v − P(v), x⟩ = ⟨v − P(v), P(x)⟩ = ⟨P^∗(v − P(v)), x⟩ = ⟨P(v − P(v)), x⟩ = ⟨P(v) − P(v), x⟩ = 0.

If we now let w ∈ S be arbitrary and choose x = P(v) − w (which is in S since P(v) and w are), we see that v − P(v) is orthogonal to P(v) − w. It follows that

    ‖v − w‖^2 = ‖(v − P(v)) + (P(v) − w)‖^2 = ‖v − P(v)‖^2 + ‖P(v) − w‖^2,

where the final equality follows from the Pythagorean theorem for inner products (Exercise 1.3.12). Since ‖P(v) − w‖^2 ≥ 0, it follows immediately that ‖v − w‖ ≥ ‖v − P(v)‖. Furthermore, equality holds if and only if ‖P(v) − w‖ = 0, which happens if and only if P(v) = w. ∎

Example 1.4.16 (Finding the Closest Vector in a Plane). Find the closest vector to v = (3, −2, 2) in the plane S ⊂ R^3 defined by the equation x − 4y + z = 0.
Solution:
Theorem 1.4.11 tells us that the closest vector to v in S is Pv, where
P is the orthogonal projection onto S. To construct P, we first need an
orthonormal basis of S so that we can use Theorem 1.4.10. To find an
orthonormal basis of S, we proceed as we did in Example 1.4.7—we
apply the Gram–Schmidt process to any set consisting of two linearly
independent vectors in S, like B = {(1, 0, −1), (0, 1, 4)}. (To find these vectors (1, 0, −1) and (0, 1, 4), choose x and y arbitrarily and then solve for z via x − 4y + z = 0.)

Applying the Gram–Schmidt process to B gives the following orthonormal basis C of S:

    C = { (1/√2)(1, 0, −1), (1/3)(2, 1, 2) }.

Theorem 1.4.10 then tells us that the orthogonal projection onto S is

    P = AA^∗ = \frac{1}{18} \begin{bmatrix} 17 & 4 & -1 \\ 4 & 2 & 4 \\ -1 & 4 & 17 \end{bmatrix}, where A = \begin{bmatrix} 1/\sqrt{2} & 2/3 \\ 0 & 1/3 \\ -1/\sqrt{2} & 2/3 \end{bmatrix}.

(Even though there are lots of different orthonormal bases of S, they all produce this same orthogonal projection P.)

It follows that the closest vector to v = (3, −2, 2) in S is

    Pv = \frac{1}{18} \begin{bmatrix} 17 & 4 & -1 \\ 4 & 2 & 4 \\ -1 & 4 & 17 \end{bmatrix} \begin{bmatrix} 3 \\ -2 \\ 2 \end{bmatrix} = \frac{1}{18} \begin{bmatrix} 41 \\ 16 \\ 23 \end{bmatrix},

which is illustrated below:

[Figure: the plane x − 4y + z = 0, the vector v = (3, −2, 2), and its orthogonal projection Pv = (41, 16, 23)/18.]

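The same computation can be checked numerically. The sketch below (variable names ours) builds the projection from the spanning set B = {(1, 0, −1), (0, 1, 4)} used above and applies it to v:

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 4.0]])   # columns span the plane x - 4y + z = 0
A, _ = np.linalg.qr(B)        # orthonormal basis of the plane
P = A @ A.T                   # orthogonal projection onto the plane

v = np.array([3.0, -2.0, 2.0])
print(np.round(18 * P))       # approx [[17, 4, -1], [4, 2, 4], [-1, 4, 17]]
print(np.round(18 * (P @ v))) # approx [41, 16, 23]
```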
Example 1.4.17 (Almost Solving a Linear System). Show that the linear system Ax = b is inconsistent, where

    A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} and b = \begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix}.

Then find the closest thing to a solution—that is, find a vector x ∈ R^3 that minimizes ‖Ax − b‖. (Recall that "inconsistent" means "has no solutions".)
Solution:
To see that this linear system is inconsistent (i.e., has no solutions), we just row reduce the augmented matrix [ A | b ]:

    \left[ \begin{array}{ccc|c} 1 & 2 & 3 & 2 \\ 4 & 5 & 6 & -1 \\ 7 & 8 & 9 & 0 \end{array} \right] \xrightarrow{\text{row reduce}} \left[ \begin{array}{ccc|c} 1 & 0 & -1 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{array} \right].

The bottom row of this row echelon form tells us that the original linear system has no solutions, since it corresponds to the equation 0x_1 + 0x_2 + 0x_3 = 1.
The fact that this linear system is inconsistent means exactly that b ∉ range(A), and to find that "closest thing" to a solution (i.e., to minimize ‖Ax − b‖), we just orthogonally project b onto range(A). We thus start by constructing an orthonormal basis C of range(A):

    C = { (1/√66)(1, 4, 7), (1/√11)(−3, −1, 1) }.

(This orthonormal basis can be found by applying the Gram–Schmidt process to the first two columns of A, which span its range; see Theorem A.1.2.)

The orthogonal projection onto range(A) is then

    P = BB^∗ = \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}, where B = \begin{bmatrix} 1/\sqrt{66} & -3/\sqrt{11} \\ 4/\sqrt{66} & -1/\sqrt{11} \\ 7/\sqrt{66} & 1/\sqrt{11} \end{bmatrix}

(we develop a more direct method of constructing the projection onto range(A) in Exercise 1.4.30), which tells us (via Theorem 1.4.11) that the vector Ax that minimizes
‖Ax − b‖ is

    Ax = Pb = \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix} \begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} = \frac{1}{3} \begin{bmatrix} 4 \\ 1 \\ -2 \end{bmatrix}.

Unfortunately, this is not quite what we want—we have found Ax (i.e., the closest vector to b in range(A)), but we want to find x itself. Fortunately, x can now be found by solving the linear system Ax = (4, 1, −2)/3:

    \left[ \begin{array}{ccc|c} 1 & 2 & 3 & 4/3 \\ 4 & 5 & 6 & 1/3 \\ 7 & 8 & 9 & -2/3 \end{array} \right] \xrightarrow{\text{row reduce}} \left[ \begin{array}{ccc|c} 1 & 0 & -1 & -2 \\ 0 & 1 & 2 & 5/3 \\ 0 & 0 & 0 & 0 \end{array} \right].

It follows that x can be chosen to be any vector of the form x = (−2, 5/3, 0) + x_3(1, −2, 1), where x_3 is a free variable. Choosing x_3 = 0 gives x = (−2, 5/3, 0), for example.

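In practice, a vector x that minimizes ‖Ax − b‖ is usually computed with a least-squares solver rather than by building the projection explicitly. A minimal NumPy sketch (illustrative only; since A has rank 2, the solver returns one particular least-squares solution, and the full solution set is that vector plus multiples of (1, −2, 1)):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
b = np.array([2.0, -1.0, 0.0])

# Least-squares solution of Ax = b (minimum-norm one, since A is rank-deficient).
x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print(rank)                # 2
print(np.round(A @ x, 4))  # approx [1.3333, 0.3333, -0.6667] = (4, 1, -2)/3
```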
The method of Example 1.4.17 for using projections to find the “closest
thing” to a solution of an unsolvable linear system is called linear least squares,
and it is extremely widely-used in statistics. If we want to fit a model to a set of
data points, we typically have far more data points (equations) than parameters
of the model (variables), so our model will typically not exactly match all of the
data. However, with linear least squares we can find the model that comes as
close as possible to matching the data. We return to this method and investigate
it in more depth and with some new machinery in Section 2.C.1.
While Theorem 1.4.10 only applies directly in the finite-dimensional case,
we can extend it somewhat to the case where only the subspace being
projected onto is finite-dimensional, but the source vector space V is potentially
infinite-dimensional. We have to be slightly more careful in this situation, since
it does not make sense to even talk about the standard matrix of the projection
when V is infinite-dimensional:

Theorem 1.4.12 (Another Construction of Orthogonal Projections). Suppose V is an inner product space and S ⊆ V is an n-dimensional subspace with orthonormal basis {u_1, u_2, . . . , u_n}. Then the linear transformation P : V → V defined by

    P(v) = ⟨u_1, v⟩u_1 + ⟨u_2, v⟩u_2 + · · · + ⟨u_n, v⟩u_n   for all v ∈ V

is an orthogonal projection onto S. (Uniqueness of P in this case is a bit trickier, and requires an additional assumption like the axiom of choice or V being finite-dimensional.)

Proof. To see that P^2 = P, we just use linearity of the second entry of the inner
product to see that
    P(P(v)) = \sum_{i=1}^{n} ⟨u_i, P(v)⟩ u_i   (definition of P(v))
            = \sum_{i=1}^{n} \Big\langle u_i, \sum_{j=1}^{n} ⟨u_j, v⟩ u_j \Big\rangle u_i   (definition of P(v))
            = \sum_{i,j=1}^{n} ⟨u_j, v⟩ ⟨u_i, u_j⟩ u_i   (linearity of the inner product)
            = \sum_{i=1}^{n} ⟨u_i, v⟩ u_i = P(v).   ({u_1, . . . , u_n} is an orthonormal basis)

(It is worth comparing this result to Theorem 1.4.5—we are just giving coordinates to P(v) with respect to {u_1, u_2, . . . , u_n}.)

Similarly, to see that P^∗ = P we use sesquilinearity of the inner product to see that

    ⟨w, P(v)⟩ = \Big\langle w, \sum_{i=1}^{n} ⟨u_i, v⟩ u_i \Big\rangle = \sum_{i=1}^{n} ⟨u_i, v⟩ ⟨w, u_i⟩ = \Big\langle \sum_{i=1}^{n} ⟨u_i, w⟩ u_i, v \Big\rangle = ⟨P(w), v⟩

for all v, w ∈ V. (For the second-to-last equality here, recall that inner products are conjugate linear in their first argument, so ⟨u_i, v⟩⟨w, u_i⟩ = ⟨\overline{⟨w, u_i⟩}\, u_i, v⟩ = ⟨⟨u_i, w⟩ u_i, v⟩.) The fact that P^∗ = P follows from the definition of adjoint given in Definition 1.4.3. ∎
While Theorems 1.4.10 and 1.4.12 perhaps look quite different on the surface, the latter simply reduces to the former when V = F^n. To see this, simply notice that in this case Theorem 1.4.12 says that

    P(v) = \sum_{j=1}^{n} ⟨u_j, v⟩ u_j = \sum_{j=1}^{n} (u_j^∗ v) u_j = \sum_{j=1}^{n} u_j (u_j^∗ v) = \Big( \sum_{j=1}^{n} u_j u_j^∗ \Big) v,

which agrees with the formula from Theorem 1.4.10. (Keep in mind that u_j^∗ v is a scalar, so there is no problem commuting it past u_j here.)

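The agreement of the two formulas is easy to see numerically. A throwaway sketch (random data, names ours) that compares Σ_j ⟨u_j, v⟩u_j with (Σ_j u_j u_j^∗)v:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3

# Columns of U form an orthonormal basis of a random m-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, m)))
v = rng.standard_normal(n)

via_inner_products = sum(np.dot(U[:, j], v) * U[:, j] for j in range(m))
via_matrix = (U @ U.T) @ v

print(np.allclose(via_inner_products, via_matrix))  # True
```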
Example 1.4.18 (Finding the Closest Polynomial to a Function). Find the degree-2 polynomial f with the property that the integral

    \int_{-1}^{1} \big( e^x − f(x) \big)^2 \, dx

is as small as possible.
Solution:
The important fact to identify here is that we are being asked to minimize ‖e^x − f(x)‖^2 (which is equivalent to minimizing ‖e^x − f(x)‖) as f ranges over the subspace P^2[−1, 1] of C[−1, 1]. By Theorem 1.4.11, our goal is thus to construct P(e^x), where P is the orthogonal projection from C[−1, 1] onto P^2[−1, 1], which is guaranteed to exist by Theorem 1.4.12 since P^2[−1, 1] is finite-dimensional. To construct P, we need an orthonormal basis of P^2[−1, 1], and one such basis is

    C = { p_1(x), p_2(x), p_3(x) } = \left\{ \tfrac{1}{\sqrt{2}},\ \sqrt{\tfrac{3}{2}}\, x,\ \sqrt{\tfrac{5}{8}}\, (3x^2 − 1) \right\}.

(This basis can be found by applying the Gram–Schmidt process to the standard basis {1, x, x^2}, much like we did in Example 1.4.8.)

Next, we compute the inner product of e^x with each of these basis
polynomials:

    ⟨p_1(x), e^x⟩ = \frac{1}{\sqrt{2}} \int_{-1}^{1} e^x \, dx = \frac{1}{\sqrt{2}} \left( e − \frac{1}{e} \right) ≈ 1.6620,

    ⟨p_2(x), e^x⟩ = \sqrt{\frac{3}{2}} \int_{-1}^{1} x e^x \, dx = \frac{\sqrt{6}}{e} ≈ 0.9011,   and

    ⟨p_3(x), e^x⟩ = \sqrt{\frac{5}{8}} \int_{-1}^{1} (3x^2 − 1) e^x \, dx = \frac{\sqrt{5}(e^2 − 7)}{\sqrt{2}\, e} ≈ 0.2263.

Theorem 1.4.12 then tells us that

    P(e^x) = ⟨p_1(x), e^x⟩ p_1(x) + ⟨p_2(x), e^x⟩ p_2(x) + ⟨p_3(x), e^x⟩ p_3(x)
           = \frac{1}{2}\left( e − \frac{1}{e} \right) + \frac{3}{e} x + \frac{5(e^2 − 7)}{4e} (3x^2 − 1)
           ≈ 0.5367x^2 + 1.1036x + 0.9963.

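The inner products above can also be approximated numerically. The following sketch (our own, using a simple trapezoidal rule rather than exact integration) reproduces the coefficients and the quadratic to four decimal places:

```python
import numpy as np

def integrate(y, x):
    # Trapezoidal rule on a fine grid.
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

x = np.linspace(-1.0, 1.0, 200001)
f = np.exp(x)

# Orthonormal basis of P^2[-1, 1] from the example.
basis = [np.full_like(x, 1 / np.sqrt(2)),
         np.sqrt(3 / 2) * x,
         np.sqrt(5 / 8) * (3 * x**2 - 1)]

coeffs = [integrate(p * f, x) for p in basis]     # <p_j, e^x>
proj = sum(c * p for c, p in zip(coeffs, basis))  # P(e^x) sampled on the grid

print(np.round(coeffs, 4))                  # approx [1.662, 0.9011, 0.2263]
print(np.round(np.polyfit(x, proj, 2), 4))  # approx [0.5367, 1.1036, 0.9963]
```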
It is worth noting that the solution to the previous example is rather close to the degree-2 Taylor polynomial for e^x, which is x^2/2 + x + 1 (for a refresher on Taylor polynomials, see Appendix A.2.2). The reason that these polynomials are close to each other, but not exactly the same as each other, is that the Taylor polynomial is the polynomial that best approximates e^x at x = 0, whereas the one that we constructed in the previous example is the polynomial that best approximates e^x on the whole interval [−1, 1] (see Figure 1.20). In fact, if we similarly use orthogonal projections to find polynomials that approximate e^x on the interval [−c, c] for some scalar c > 0, those polynomials get closer and closer to the Taylor polynomial as c goes to 0 (see Exercise 1.4.32).

Figure 1.20: Orthogonal projections of e^x and its Taylor polynomials each approximate it well, but orthogonal projections provide a better approximation over an entire interval ([−1, 1] in this case) while Taylor polynomials provide a better approximation at a specific point (x = 0 in this case).
Remark 1.4.3 (Projecting Onto Infinite-Dimensional Subspaces). When projecting onto infinite-dimensional subspaces, we must be much more careful, as neither Theorem 1.4.10 nor Theorem 1.4.12 applies in this situation. In fact, there are infinite-dimensional subspaces that cannot be projected onto by any orthogonal projection.

For example, if there existed an orthogonal projection P from C[−1, 1] onto P[−1, 1], then Theorem 1.4.11 (which does hold even in the infinite-dimensional case) would tell us that

    ‖e^x − P(e^x)‖ ≤ ‖e^x − f(x)‖

for all polynomials f ∈ P[−1, 1]. However, this is impossible since we can find polynomials f(x) that make the norm on the right as close to 0 as we like. For example, if we define

    T_p(x) = \sum_{n=0}^{p} \frac{x^n}{n!}

to be the degree-p Taylor polynomial for e^x centered at 0, then

    \lim_{p \to \infty} ‖e^x − T_p(x)‖^2 = \lim_{p \to \infty} \int_{-1}^{1} \big( e^x − T_p(x) \big)^2 \, dx = 0.

(A proof that this limit equals 0 is outside of the scope of this book, but is typically covered in differential calculus courses. The idea is simply that as the degree of the Taylor polynomial increases, it approximates e^x better and better.)

Since ‖e^x − P(e^x)‖ ≤ ‖e^x − T_p(x)‖ for all p, this implies ‖e^x − P(e^x)‖ = 0, so e^x = P(e^x). However, this is impossible since e^x ∉ P[−1, 1], so the orthogonal projection P does not actually exist.

We close this section with one final simple result that says that orthogonal
better and better.
projections never increase the norm of a vector. While this hopefully seems
somewhat intuitive, it is important to keep in mind that it does not hold for
oblique projections. For example, if the sun is straight overhead then shadows
(i.e., projections) are shorter than the objects that cast them, but if the sun is
low in the sky then our shadow may be longer than we are tall.

Theorem 1.4.13 (Orthogonal Projections Never Increase the Norm). Suppose V is an inner product space and let P : V → V be an orthogonal projection onto some subspace of V. Then

    ‖P(v)‖ ≤ ‖v‖   for all v ∈ V.

Furthermore, equality holds if and only if P(v) = v (i.e., v ∈ range(P)).

Proof. We just move things around in the inner product and use the Cauchy–Schwarz inequality (keep in mind that P is an orthogonal projection, so P^∗ = P and P^2 = P here):

    ‖P(v)‖^2 = ⟨P(v), P(v)⟩ = ⟨(P^∗ ◦ P)(v), v⟩ = ⟨P^2(v), v⟩ = ⟨P(v), v⟩ ≤ ‖P(v)‖ ‖v‖.     (1.4.3)

If ‖P(v)‖ = 0 then the desired inequality is trivial, and if ‖P(v)‖ ≠ 0 then we can divide Inequality (1.4.3) through by it to see that ‖P(v)‖ ≤ ‖v‖, as desired.

To verify the claim about the equality condition, suppose v ≠ 0 (otherwise the statement is trivial) and note that equality holds in the Cauchy–Schwarz inequality step of Inequality (1.4.3) if and only if P(v) is a multiple of v (i.e., v is an eigenvector of P). Since Exercise 1.4.31 tells us that all eigenvalues of P are 0 or 1, we conclude that P(v) = v (if P(v) = 0 then ‖P(v)‖ = 0 ≠ ‖v‖). ∎
We show a bit later, in Exercise 2.3.25, that the above result completely characterizes orthogonal projections. That is, if a projection P : V → V is such that ‖P(v)‖ ≤ ‖v‖ for all v ∈ V then P must be an orthogonal projection.

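The contrast between orthogonal and oblique projections is easy to see numerically. In the small illustrative sketch below (our own), both matrices are projections onto the x-axis in R^2, but only the first is orthogonal, and only the first never increases the norm:

```python
import numpy as np

# Orthogonal projection onto the x-axis, and an oblique projection onto the
# x-axis along the direction (1, 1). Both satisfy Q @ Q = Q.
P_orth = np.array([[1.0, 0.0], [0.0, 0.0]])
P_obl = np.array([[1.0, -1.0], [0.0, 0.0]])

v = np.array([-1.0, 1.0])
print(np.linalg.norm(P_orth @ v), np.linalg.norm(v))  # 1.0 <= 1.414...
print(np.linalg.norm(P_obl @ v), np.linalg.norm(v))   # 2.0 >  1.414...
```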
Exercises solutions to starred exercises on page 456

1.4.1 Determine which of the following sets are orthonormal bases of their span in R^n (with respect to the dot product).
∗(a) B = {(0, 1, 1)}
(b) B = {(1, 1)/√2, (1, −1)/√2}
∗(c) B = {(1, 1)/√2, (1, 0), (1, 2)/√5}
(d) B = {(1, 0, 2)/√5, (0, 1, 0), (2, 0, −1)/√5}
∗(e) B = {(1, 1, 1, 1)/2, (0, 1, −1, 0)/√2, (1, 0, 0, −1)/√2}
(f) B = {(1, 2, 2, 1)/√10, (1, 0, 0, −1)/√2, (0, 1, −1, 0)/√2, (2, 1, −1, −2)/√10}

1.4.2 Find an orthonormal basis (with respect to the standard inner product on the indicated inner product space) for the span of the given set of vectors:
(a) {(0, 3, 4)} ⊆ R^3
∗(b) {(3, 4), (7, 32)} ⊆ R^2.
(c) {(1, 0, 1), (2, 1, 3)} ⊆ R^3
∗(d) {sin(x), cos(x)} ⊆ C[−π, π].
(e) {1, x, x^2} ⊆ P[0, 1].
∗(f) The following three matrices in M_2:

    \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, and \begin{bmatrix} 0 & 2 \\ 1 & -1 \end{bmatrix}.

1.4.3 For each of the following linear transformations T, find the adjoint T^∗:
∗(a) T : R^2 → R^3, defined by T(v) = (v_1, v_1 + v_2, v_2).
(b) T : R^n → R, defined by T(v) = v_1.
∗(c) T : R^2 → R^2, defined by T(v) = (v_1 + v_2, v_1 − v_2), where the inner product on R^2 is ⟨v, w⟩ = v_1 w_1 − v_1 w_2 − v_2 w_1 + 2 v_2 w_2.
(d) T : M_2 → M_2, defined by

    T\left( \begin{bmatrix} a & b \\ c & d \end{bmatrix} \right) = \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}.

1.4.4 Construct the orthogonal projection onto the indicated subspace of R^n.
∗(a) The x-axis in R^2.
(b) The plane in R^3 with equation x + 2y + 3z = 0.
∗(c) The range of the matrix A = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 2 \\ 0 & 1 & 1 \end{bmatrix}.

1.4.5 Determine which of the following statements are true and which are false.
∗(a) If B and C are orthonormal bases of finite-dimensional inner product spaces V and W, respectively, and T : V → W is a linear transformation, then [T^∗]_{B←C} = ([T]_{C←B})^∗.
(b) If A, B ∈ M_n(C) are Hermitian matrices then so is A + B.
∗(c) If A, B ∈ M_n(R) are symmetric matrices then so is AB.
(d) There exists a set of 6 mutually orthogonal non-zero vectors in R^4.
∗(e) If U, V ∈ M_n are unitary matrices, then so is U + V.
(f) If U ∈ M_n is a unitary matrix, then U^{−1} exists and is also unitary.
∗(g) The identity transformation is an orthogonal projection.
(h) If P : V → V is a projection then so is I_V − P.

1.4.6 Let D : P^2[−1, 1] → P^2[−1, 1] be the differentiation map, and recall that P^2[−1, 1] is the vector space of polynomials with degree at most 2 with respect to the standard inner product

    ⟨f, g⟩ = \int_{-1}^{1} f(x) g(x) \, dx.

Compute D^∗(2x + 1).

1.4.7 Find a vector x ∈ R^3 that minimizes ‖Ax − b‖, where

    A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 0 \\ 2 & 0 & 1 \end{bmatrix} and b = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.

∗∗1.4.8 Suppose U, V ∈ M_n are unitary matrices. Show that UV is also unitary.

1.4.9 Show that if U ∈ M_n is unitary then ‖U‖_F = √n (recall that ‖U‖_F is the Frobenius norm of U).

∗1.4.10 Show that the eigenvalues of unitary matrices lie on the unit circle in the complex plane. That is, show that if U ∈ M_n is a unitary matrix, and λ is an eigenvalue of U, then |λ| = 1.

∗∗1.4.11 Show that if U ∈ M_n is a unitary matrix then |det(U)| = 1.

∗∗1.4.12 Show that if U ∈ M_n is unitary and upper triangular then it must be diagonal and its diagonal entries must have magnitude 1.

1.4.13 Let {u_1, u_2, . . . , u_n} be any orthonormal basis of F^n (where F = R or F = C). Show that

    \sum_{j=1}^{n} u_j u_j^∗ = I.

1.4.14 Let A ∈ M_n have mutually orthogonal (but not necessarily normalized) columns.
(a) Show that A^∗A is a diagonal matrix.
(b) Give an example to show that AA^∗ might not be a diagonal matrix.

∗1.4.15 Let ω = e^{2πi/n} (which is an n-th root of unity). Show that the Fourier matrix F ∈ M_n(C) defined by

    F = \frac{1}{\sqrt{n}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega & \omega^2 & \cdots & \omega^{n-1} \\ 1 & \omega^2 & \omega^4 & \cdots & \omega^{2n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{n-1} & \omega^{2n-2} & \cdots & \omega^{(n-1)(n-1)} \end{bmatrix}

is unitary.
[Side note: F is, up to scaling, a Vandermonde matrix.]
[Hint: Try to convince yourself that \sum_{k=0}^{n-1} \omega^k = 0.]

1.4.16 Suppose that B and C are bases of a finite-dimensional inner product space V.
(a) Show that if B and C are each orthonormal then [v]_B · [w]_B = [v]_C · [w]_C.
(b) Provide an example that shows that if B and C are not both orthonormal bases, then it might be the case that [v]_B · [w]_B ≠ [v]_C · [w]_C.

∗∗1.4.17 Suppose A ∈ M_{m,n}(F) and B ∈ M_{n,m}(F).
(a) Suppose F = R. Show that (Ax) · y = x · (By) for all x ∈ R^n and y ∈ R^m if and only if B = A^T.
(b) Suppose F = C. Show that (Ax) · y = x · (By) for all x ∈ C^n and y ∈ C^m if and only if B = A^∗.
[Side note: In other words, the adjoint of A is either A^T or A^∗, depending on the ground field.]

∗∗1.4.18 Suppose F = R or F = C, E is the standard basis of F^n, and B is any basis of F^n. Show that a change-of-basis matrix P_{E←B} is unitary if and only if B is an orthonormal basis of F^n.

∗∗1.4.19 Suppose F = R or F = C and B, C ∈ M_{m,n}(F). Show that the following are equivalent:
a) B^∗B = C^∗C,
b) (Bv) · (Bw) = (Cv) · (Cw) for all v, w ∈ F^n, and
c) ‖Bv‖ = ‖Cv‖ for all v ∈ F^n.
[Hint: If C = I then we get some of the characterizations of unitary matrices from Theorem 1.4.9. Mimic that proof.]

∗∗1.4.20 Suppose B is a mutually orthogonal set of unit vectors in a finite-dimensional inner product space V. Show that there is an orthonormal basis C of V with B ⊆ C ⊆ V.
[Side note: The analogous result for non-orthonormal bases was established in Exercise 1.2.26.]

1.4.21 Suppose V and W are inner product spaces and T : V → W is a linear transformation with adjoint T^∗. Show that (T^∗)^∗ = T.

∗∗1.4.22 Suppose V and W are finite-dimensional inner product spaces and T : V → W is a linear transformation with adjoint T^∗. Show that rank(T^∗) = rank(T).

∗∗1.4.23 Show that every linear transformation T : V → W has at most one adjoint map, even when V and W are infinite-dimensional. [Hint: Use Exercise 1.4.27.]

∗∗1.4.24 Suppose F = R or F = C and consider a function ⟨·, ·⟩ : F^n × F^n → F.
(a) Show that ⟨·, ·⟩ is an inner product if and only if there exists an invertible matrix P ∈ M_n(F) such that ⟨v, w⟩ = v^∗(P^∗P)w for all v, w ∈ F^n. [Hint: Change a basis in Theorem 1.4.3.]
(b) Find a matrix P associated with the weird inner product from Example 1.3.18. That is, for that inner product find a matrix P ∈ M_2(R) such that ⟨v, w⟩ = v^∗(P^∗P)w for all v, w ∈ R^2.
(c) Explain why ⟨·, ·⟩ is not an inner product if the matrix P from part (a) is not invertible.

∗∗1.4.25 Suppose V is a finite-dimensional vector space and B ⊂ V is linearly independent. Show that there is an inner product on V with respect to which B is orthonormal.
[Hint: Use the method of Exercise 1.4.24 to construct an inner product.]

1.4.26 Find an inner product on R^2 with respect to which the set {(1, 0), (1, 1)} is an orthonormal basis.

∗∗1.4.27 Suppose V and W are inner product spaces and T : V → W is a linear transformation. Show that

    ⟨T(v), w⟩ = 0 for all v ∈ V and w ∈ W

if and only if T = O.

∗∗1.4.28 Suppose V is an inner product space and T : V → V is a linear transformation.
(a) Suppose the ground field is C. Show that ⟨T(v), v⟩ = 0 for all v ∈ V if and only if T = O. [Hint: Mimic part of the proof of Theorem 1.4.9.]
(b) Suppose the ground field is R. Show that ⟨T(v), v⟩ = 0 for all v ∈ V if and only if T^∗ = −T.

∗∗1.4.29 In this exercise, we show that if V is a finite-dimensional inner product space over C and P : V → V is a projection then P is orthogonal (i.e., ⟨P(v), v − P(v)⟩ = 0 for all v ∈ V) if and only if it is self-adjoint (i.e., P = P^∗).
(a) Show that if P is self-adjoint then it is orthogonal.
(b) Use Exercise 1.4.28 to show that if P is orthogonal then it is self-adjoint.
[Side note: The result of this exercise is still true over R, but it is more difficult to show—see Exercise 2.1.23.]

∗∗1.4.30 Suppose A ∈ M_{m,n} has linearly independent columns.
(a) Show that A^∗A is invertible.
(b) Show that P = A(A^∗A)^{−1}A^∗ is the orthogonal projection onto range(A).
[Side note: This exercise generalizes Theorem 1.4.10 to the case when the columns of A are just a basis of its range, but not necessarily an orthonormal one.]

∗∗1.4.31 Show that if P ∈ M_n is a projection then there exists an invertible matrix Q ∈ M_n such that

    P = Q \begin{bmatrix} I_r & O \\ O & O \end{bmatrix} Q^{-1},

where r = rank(P). In other words, every projection is diagonalizable and has all eigenvalues equal to 0 or 1.
[Hint: What eigen-properties do the vectors in the range and null space of P have?]
[Side note: We prove a stronger decomposition for orthogonal projections in Exercise 2.1.24.]

∗∗1.4.32 Let 0 < c ∈ R be a scalar.
(a) Suppose P_c : C[−c, c] → P^2[−c, c] is an orthogonal projection. Compute P_c(e^x). [Hint: We worked through the c = 1 case in Example 1.4.18.]
(b) Compute the polynomial lim_{c→0^+} P_c(e^x) and notice that it equals the degree-2 Taylor polynomial of e^x at x = 0. Provide a (not necessarily rigorous) explanation for why we would expect these two polynomials to coincide.

1.4.33 Much like we can use polynomials to approximate functions via orthogonal projections, we can also use trigonometric functions to approximate them. Doing so gives us something called the function's Fourier series.
(a) Show that, for each n ≥ 1, the set

    B_n = {1, sin(x), sin(2x), . . . , sin(nx), cos(x), cos(2x), . . . , cos(nx)}

is mutually orthogonal in the usual inner product on C[−π, π].
(b) Rescale the members of B_n so that they have norm equal to 1.
(c) Orthogonally project the function f(x) = x onto span(B_n).
[Side note: These are called the Fourier approximations of f, and letting n → ∞ gives its Fourier series.]
(d) Use computer software to plot the function f(x) = x from part (c), as well as its projection onto span(B_5), on the interval [−π, π].

1.5 Summary and Review

In this chapter, we investigated how the various properties of vectors in Rn


and matrices in Mm,n (R) from introductory linear algebra can be generalized
to many other settings by working with vector spaces instead of Rn , and
linear transformations between them instead of Mm,n (R). In particular, any
property or theorem about Rn that only depends on vector addition and scalar
multiplication (i.e., linear combinations) carries over straightforwardly with
essentially no changes to abstract vector spaces and the linear transformations
acting on them. Examples of such properties include:
• Subspaces, linear (in)dependence, and spanning sets.
• Coordinate vectors, change-of-basis matrices, and standard matrices.
• Invertibility, rank, range, nullity, and null space of linear transformations.
• Eigenvalues, eigenvectors, and diagonalization.
(These concepts all work over any field F.)
However, the dot product (and properties based on it, like the length of a
vector) cannot be described solely in terms of scalar multiplication and vector
addition, so we introduced inner products as their abstract generalization. With
inner products in hand, we were able to extend the following ideas from Rn to
more general inner product spaces:
• The length of a vector (the norm induced by the inner product).
• Orthogonality.
• The transpose of a matrix (the adjoint of a linear transformation).
• Orthogonal projections.
(These concepts require us to be working over one of the fields R or C.)
Having inner products to work with also let us introduce orthonormal bases and unitary matrices, which can be thought of as the "best-behaved" bases and invertible matrices that exist, respectively. (We could have also used inner products to define angles in general vector spaces.)

In the finite-dimensional case, none of the aforementioned topics change much when going from R^n to abstract vector spaces, since every such vector
space is isomorphic to Rn (or Fn , if the vector space is over a field F). In partic-
ular, to check some purely linear-algebraic concept like linear independence
or invertibility, we can simply convert abstract vectors and linear transforma-
tions into vectors in Fn and matrices in Mm,n (F), respectively, and check the
corresponding property there. To check some property that also depends on
an inner product like orthogonality or the length of a vector, we can similarly
convert everything into vectors in Fn and matrices in Mm,n (F) as long as we
are careful to represent the vectors and matrices in orthonormal bases.
For this reason, not much is lost in (finite-dimensional) linear algebra if we
explicitly work with Fn instead of abstract vector spaces, and with matrices
instead of linear transformations. We will often switch back and forth between
these two perspectives, depending on which is more convenient for the topic
at hand. For example, we spend most of Chapter 2 working specifically with
matrices, though we will occasionally make a remark about what our results
say for linear transformations.

Exercises solutions to starred exercises on page 459

1.5.1 Determine which of the following statements are true and which are false.
∗(a) If B, C ⊆ V satisfy span(B) = span(C) then B = C.
(b) If B, C ⊆ V satisfy B ⊆ C then span(B) ⊆ span(C).
∗(c) If B and C are bases of finite-dimensional inner product spaces V and W, respectively, and T : V → W is an invertible linear transformation, then [T^{−1}]_{B←C} = ([T]_{C←B})^{−1}.
(d) If B and C are bases of finite-dimensional inner product spaces V and W, respectively, and T : V → W is a linear transformation, then [T^∗]_{B←C} = ([T]_{C←B})^∗.
∗(e) If V is a vector space over C then every linear transformation T : V → V has an eigenvalue.
(f) If U, V ∈ M_n are unitary matrices, then so is UV.

∗∗1.5.2 Let S_1 and S_2 be finite-dimensional subspaces of a vector space V. Recall from Exercise 1.1.19 that the sum of S_1 and S_2 is defined by

    S_1 + S_2 \overset{\text{def}}{=} \{ v + w : v ∈ S_1, w ∈ S_2 \},

which is also a subspace of V.
(a) Show that dim(S_1 + S_2) = dim(S_1) + dim(S_2) − dim(S_1 ∩ S_2).
(b) Show that dim(S_1 ∩ S_2) ≥ dim(S_1) + dim(S_2) − dim(V).

1.5.3 Show that if f is a degree-3 polynomial then {f, f′, f″, f‴} is a basis of P^3 (where f′ is the first derivative of f, and so on).

∗1.5.4 Let P^2_p denote the vector space of 2-variable polynomials of degree at most p. For example,

    x^2 y^3 − 2x^4 y ∈ P^2_5   and   3x^3 y − x^7 ∈ P^2_7.

Construct a basis of P^2_p and compute dim(P^2_p).
[Side note: We generalize this exercise to P^n_p, the vector space of n-variable polynomials of degree at most p, in Exercise 3.B.4.]

1.5.5 Explain why the function

    ⟨f, g⟩ = \int_{a}^{b} f(x) g(x) \, dx

is an inner product on each of P[a, b], P (and their subspaces like P^p[a, b] and P^p), and C[a, b] (see Example 1.3.17), but not on C.

∗1.5.6 Suppose A, B ∈ M_{m,n} and consider the linear transformation T_{A,B} : M_n → M_m defined by

    T_{A,B}(X) = AXB^∗.

Compute T_{A,B}^∗ (with respect to the standard Frobenius inner product).

1.5.7 How many diagonal unitary matrices are there in M_n(R)? Your answer should be a function of n.
1.A Extra Topic: More About the Trace

Recall that the trace of a square matrix A ∈ Mn is the sum of its diagonal
entries:
tr(A) = a1,1 + a2,2 + · · · + an,n .
While we are already familiar with some nice features of the trace (such as the fact that it is similarity-invariant, and the fact that tr(A) also equals the sum of the eigenvalues of A), it still seems somewhat arbitrary—why does adding up the diagonal entries of a matrix tell us anything interesting? (Recall that a function f is similarity-invariant if f(A) = f(PAP^{−1}) for all invertible P.)
explain where it “comes from”, in the same sense that the determinant can
be thought of as the answer to the question of how to measure how much
a linear transformation expands or contracts space. In brief, the trace is the
“most natural” or “most useful” linear form on Mn —it can be thought of as the
additive counterpart of the determinant, which in some sense is similarly the
“most natural” multiplicative function on Mn .

1.A.1 Algebraic Characterizations of the Trace


Algebraically, what makes the trace interesting is the fact that tr(AB) = tr(BA) for all A ∈ M_{m,n} and B ∈ M_{n,m}, since

    tr(AB) = \sum_{i=1}^{m} [AB]_{i,i} = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j} b_{j,i} = \sum_{j=1}^{n} \sum_{i=1}^{m} b_{j,i} a_{i,j} = \sum_{j=1}^{n} [BA]_{j,j} = tr(BA).

Even though matrix multiplication itself is not commutative, this property lets
us treat it as if it were commutative in some situations. To illustrate what we
mean by this, we note that the following example can be solved very quickly
with the trace, but is quite difficult to solve otherwise.
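The identity tr(AB) = tr(BA) is easy to check numerically, even for rectangular matrices of compatible sizes (a throwaway sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))   # A in M_{3,5}
B = rng.standard_normal((5, 3))   # B in M_{5,3}

# AB is 3x3 and BA is 5x5, yet their traces agree.
print(np.trace(A @ B), np.trace(B @ A))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True
```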

Example 1.A.1 (The Matrix AB − BA). Show that there do not exist matrices A, B ∈ M_n such that AB − BA = I.

Solution:
To see why such matrices cannot exist, simply take the trace of both sides of the equation:

    tr(AB − BA) = tr(AB) − tr(BA) = tr(AB) − tr(AB) = 0,

but tr(I) = n. Since n ≠ 0, no such matrices A and B can exist. (The matrix AB − BA is sometimes called the commutator of A and B, and is denoted by [A, B] \overset{\text{def}}{=} AB − BA.)

Remark 1.A.1 (A Characterization of Trace-Zero Matrices). The previous example can actually be extended into a theorem. Using the exact same logic as in that example, we can see that the matrix equation AB − BA = C can only ever have a solution when tr(C) = 0, since

    tr(C) = tr(AB − BA) = tr(AB) − tr(BA) = tr(AB) − tr(AB) = 0.

Remarkably, the converse of this observation is also true (this fact is proved in [AM57])—for any matrix C with tr(C) = 0, there exist matrices A and B of the same size such that
AB − BA = C. Proving this fact is quite technical and outside of the scope


of this book, but we prove a slightly weaker result in Exercise 1.A.6.

One of the most remarkable facts about the trace is that, not only does it
satisfy this commutativity property, but it is essentially the only linear form
that does so. In particular, the following theorem says that the only linear forms
for which f (AB) = f (BA) are the trace and its scalar multiples:

Theorem 1.A.1 (Commutativity Defines the Trace). Let f : M_n → F be a linear form with the following properties:
a) f(AB) = f(BA) for all A, B ∈ M_n, and
b) f(I) = n.
Then f(A) = tr(A) for all A ∈ M_n.

Proof. Every matrix A ∈ M_n(F) can be written as a linear combination of the standard basis matrices:

    A = \sum_{i,j=1}^{n} a_{i,j} E_{i,j}.

(Recall that E_{i,j} is the matrix with a 1 as its (i, j)-entry and all other entries equal to 0.) Since f is linear, it follows that

    f(A) = f\Big( \sum_{i,j=1}^{n} a_{i,j} E_{i,j} \Big) = \sum_{i,j=1}^{n} a_{i,j} f(E_{i,j}).

Our goal is thus to show that f(E_{i,j}) = 0 whenever i ≠ j and f(E_{i,j}) = 1 whenever i = j, since that would imply

    f(A) = \sum_{i=1}^{n} a_{i,i} = tr(A).

To this end, we first recall that since f is linear it must be the case that f(O) = 0. Next, we notice that E_{i,j} E_{j,j} = E_{i,j} but E_{j,j} E_{i,j} = O whenever i ≠ j, so property (a) of f given in the statement of the theorem implies

    0 = f(O) = f(E_{j,j} E_{i,j}) = f(E_{i,j} E_{j,j}) = f(E_{i,j})   whenever i ≠ j,

which is one of the two facts that we wanted to show.

To similarly prove the other fact (i.e., f(E_{j,j}) = 1 for all 1 ≤ j ≤ n), we notice that E_{1,j} E_{j,1} = E_{1,1}, but E_{j,1} E_{1,j} = E_{j,j} for all 1 ≤ j ≤ n, so

    f(E_{j,j}) = f(E_{j,1} E_{1,j}) = f(E_{1,j} E_{j,1}) = f(E_{1,1})   for all 1 ≤ j ≤ n.

However, since I = E_{1,1} + E_{2,2} + · · · + E_{n,n}, it then follows that

    n = f(I) = f(E_{1,1} + E_{2,2} + · · · + E_{n,n}) = f(E_{1,1}) + f(E_{2,2}) + · · · + f(E_{n,n}) = n f(E_{1,1}),

so f(E_{1,1}) = 1, and similarly f(E_{j,j}) = 1 for all 1 ≤ j ≤ n, which completes the proof. ∎
There are also a few other ways of thinking of the trace as the unique
function with certain properties. For example, there are numerous functions of
matrices that are similarity-invariant (e.g., the rank, trace, and determinant),
but the following corollary says that the trace is (up to scaling) the only one
that is linear.
Corollary 1.A.2 (Similarity-Invariance Defines the Trace). Suppose F = R or F = C, and f : M_n(F) → F is a linear form with the following properties:
a) f(A) = f(PAP^{−1}) for all A, P ∈ M_n(F) with P invertible, and
b) f(I) = n.
Then f(A) = tr(A) for all A ∈ M_n(F).

The idea behind this corollary is that if A or B is invertible, then AB is


similar to BA (see Exercise 1.A.5). It follows that if f is similarity-invariant
then f (AB) = f (BA) as long as at least one of A or B is invertible, which almost
tells us, via Theorem 1.A.1, that f must be the trace (up to scaling).
However, making the jump from f (AB) = f (BA) whenever at least one of
A or B invertible, to f (AB) = f (BA) for all A and B, is somewhat delicate. We
thus return to this corollary in Section 2.D.3 (in particular, see Theorem 2.D.4),
where we prove it via matrix analysis techniques. The idea is that, since every
matrix is arbitrarily close to an invertible matrix, if this property holds for all
invertible matrices then continuity of f tells us that it must in fact hold for
non-invertible matrices too.
As one final way of characterizing the trace, notice that if P ∈ Mn is a
(not necessarily orthogonal) projection (i.e., P2 = P), then tr(P) = rank(P)
(see Exercise 1.A.4). In other words, tr(P) is the dimension of the subspace
projected onto by P. Our final result of this subsection says that this property
characterizes the trace—it is the only linear form with this property. In a sense,
this means that the trace can be thought of as a linearized version of the rank or
dimension-counting function.

Theorem 1.A.3 (Rank of Projections Defines the Trace). Let f : M_n → F be a linear form. The following are equivalent:
a) f(P) = rank(P) for all projections P ∈ M_n.
b) f(A) = tr(A) for all A ∈ M_n.

Proof. To see that (a) =⇒ (b), we proceed much like in the proof of Theorem 1.A.1—our goal is to show that f(E_{j,j}) = 1 and f(E_{i,j}) = 0 for all 1 ≤ i ≠ j ≤ n.

The reason that f(E_{j,j}) = 1 is simply that E_{j,j} is a rank-1 projection for each 1 ≤ j ≤ n. To see that f(E_{i,j}) = 0 when i ≠ j, notice that E_{j,j} + E_{i,j} is also a rank-1 (oblique) projection, so f(E_{j,j} + E_{i,j}) = 1. However, since f(E_{j,j}) = 1, linearity of f tells us that f(E_{i,j}) = 0. (In fact, this shows that if f is a linear function for which f(P) = 1 for all rank-1 projections P ∈ M_n then f = tr.) It follows that for every matrix A ∈ M_n(F) we have

    f(A) = f\Big( \sum_{i,j=1}^{n} a_{i,j} E_{i,j} \Big) = \sum_{i,j=1}^{n} a_{i,j} f(E_{i,j}) = \sum_{j=1}^{n} a_{j,j} = tr(A),

as desired.

The fact that (b) =⇒ (a) is left to Exercise 1.A.4. ∎
If we have an inner product to work with (and are thus working over the
field F = R or F = C), we can ask whether or not the above theorem can be
strengthened to consider only orthogonal projections (i.e., projections P for
which P∗ = P). It turns out that if F = C then this works—if f (P) = rank(P) for
all orthogonal projections P ∈ Mn (C) then f (A) = tr(A) for all A ∈ Mn (C).
However, this is not true if F = R (see Exercise 1.A.7).
1.A.2 Geometric Interpretation of the Trace


Since the trace of a matrix is similarity-invariant, it should have some sort of geometric interpretation that depends only on the underlying linear transformation (after all, similarity-invariance means exactly that it does not depend on the basis that we use to represent that linear transformation). This is a bit more difficult to see than it was for the rank or determinant, but one such geometric interpretation is the best linear approximation of the determinant. (We return to similarity-invariance and its geometric interpretation in Section 2.4.)
The above statement can be made more precise with derivatives. In particu-
lar, the following theorem says that if we start at the identity transformation
(i.e., the linear transformation that does nothing) and move slightly in the di-
rection of A, then space is expanded by an amount proportional to tr(A). More
specifically, the directional derivative of the determinant in the direction of A is
its trace:

Theorem 1.A.4 (Directional Derivative of the Determinant). Suppose F = R or F = C, let A ∈ M_n(F), and define a function f_A : F → F by f_A(x) = det(I + xA). Then f_A′(0) = tr(A).

Before proving this theorem, it is perhaps useful to try to picture it. If
A ∈ Mn has columns a1 , a2 , . . . , an then the linear transformation I + xA adds
xa1 , xa2 , . . . , xan to the side vectors of the unit square (or cube, or hypercube...).
The majority of the change in the determinant of I + xA thus comes from how
much xa1 points in the direction of e1 (i.e., xa1,1 ) plus the amount that xa2 points
in the direction of e2 (i.e., xa2,2 ), and so on (see Figure 1.21), which equals
xtr(A). In other words, tr(A) provides the first-order (or linear) approximation
for how det(I + xA) changes when x is small.

Figure 1.21: The determinant det(I +xA) can be split up into four pieces, as indicated
here at the bottom-right. The blue region has area approximately equal to 1,
the purple region has area proportional to x2 , and the orange region has area
xa1,1 + xa2,2 = xtr(A). When x is close to 0, this orange region is much larger than the
purple region and thus determines the growth rate of det(I + xA).
Proof of Theorem 1.A.4. Recall that the characteristic polynomial p_A of A has the form

    p_A(λ) = det(A − λI) = (−1)^n λ^n + (−1)^{n−1} tr(A) λ^{n−1} + · · · + det(A).

If we let λ = −1/x and then multiply this characteristic polynomial through by x^n, we see that

    f_A(x) = det(I + xA) = x^n det(A + (1/x)I) = x^n p_A(−1/x)
           = x^n \big( (1/x)^n + tr(A)(1/x)^{n−1} + · · · + det(A) \big)
           = 1 + x\,tr(A) + · · · + det(A) x^n.

It follows that

    f_A′(x) = tr(A) + · · · + det(A) n x^{n−1},

where each of the terms hidden in the "· · ·" has at least one power of x in it (recall that the derivative of x^n is nx^{n−1}). We thus conclude that f_A′(0) = tr(A), as claimed. ∎

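Theorem 1.A.4 can also be checked numerically with a finite-difference approximation of the derivative (a rough sketch of our own; the step size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
I = np.eye(4)

h = 1e-6
# Finite-difference approximation of f_A'(0), where f_A(x) = det(I + x A).
derivative = (np.linalg.det(I + h * A) - np.linalg.det(I)) / h

print(derivative, np.trace(A))   # the two values agree to several digits
```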
Exercises solutions to starred exercises on page 459

1.A.1 Determine which of the following statements are true and which are false.
(a) If A, B ∈ M_n then tr(A + B) = tr(A) + tr(B).
∗(b) If A, B ∈ M_n then tr(AB) = tr(A)tr(B).
(c) If A ∈ M_n is invertible, then tr(A^{−1}) = 1/tr(A).
∗(d) If A ∈ M_n then tr(A) = tr(A^T).
(e) If A ∈ M_n has k-dimensional range then tr(A) = k.

∗∗1.A.2 Suppose A, B ∈ M_n are such that A is symmetric (i.e., A = A^T) and B is skew-symmetric (i.e., B = −B^T). Show that tr(AB) = 0.

1.A.3 Suppose A ∈ M_{m,n}(C). Show that tr(A^∗A) ≥ 0.
[Side note: Matrices of the form A^∗A are called positive semidefinite, and we investigate them thoroughly in Section 2.2.]

∗∗1.A.4 Show that if P ∈ M_n is a projection then tr(P) = rank(P).
[Side note: This is the implication (b) =⇒ (a) of Theorem 1.A.3.]

∗∗1.A.5 Suppose A, B ∈ M_n.
(a) Show that if at least one of A or B is invertible then AB and BA are similar.
(b) Provide an example to show that if A and B are not invertible then AB and BA may not be similar.

∗∗1.A.6 Consider the two sets

    Z = \{ C ∈ M_n : tr(C) = 0 \}   and   W = span\{ AB − BA : A, B ∈ M_n \}.

In this exercise, we show that W = Z. [Side note: Refer back to Remark 1.A.1 for some context.]
(a) Show that W is a subspace of Z, which is a subspace of M_n.
(b) Compute dim(Z).
(c) Show that W = Z. [Hint: Find sufficiently many linearly independent matrices in W to show that its dimension coincides with that of Z.]

∗∗1.A.7 Suppose F = R or F = C and let f : M_n(F) → F be a linear form with the property that f(P) = rank(P) for all orthogonal projections P ∈ M_n(F).
(a) Show that if F = C then f(A) = tr(A) for all A ∈ M_n(C). [Hint: Take inspiration from the proof of Theorem 1.A.3, but use some rank-2 projections too.]
(b) Provide an example to show that if F = R then it is not necessarily the case that f(A) = tr(A) for all A ∈ M_n(R).

∗∗1.A.8 Suppose F = R or F = C and let A, B ∈ M_n(F) be such that A is invertible.
(a) Show that if f_{A,B}(x) = det(A + xB) then f_{A,B}′(0) = det(A) tr(A^{−1}B).
(b) Suppose that the entries of A depend in a differentiable way on a parameter t ∈ F (so we denote it by A(t) from now on). Explain why

    \frac{d}{dt} det\big(A(t)\big) = f_{A(t),\, dA/dt}′(0),

where dA/dt refers to the matrix that is obtained by taking the derivative of each entry of A with respect to t. [Hint: Taylor's theorem from Appendix A.2 might help.]
(c) Show that

    \frac{d}{dt} det\big(A(t)\big) = det\big(A(t)\big) \, tr\!\left( A(t)^{-1} \frac{dA}{dt} \right).

[Side note: This is called Jacobi's formula, and we generalize it to non-invertible matrices in Theorem 2.D.6.]
1.B Extra Topic: Direct Sum, Orthogonal Complement

It is often useful to break apart a large vector space into multiple subspaces
that do not intersect each other (except at the zero vector, where intersection
is unavoidable). For example, it is somewhat natural to think of R2 as being
made up of two copies of R, since every vector (x, y) ∈ R2 can be written in
the form (x, 0) + (0, y), and the subspaces {(x, 0) : x ∈ R} and {(0, y) : y ∈ R}
are each isomorphic to R in a natural way.
Similarly, we can think of R3 as being made up of either three copies of R,
or a copy of R2 and a copy of R, as illustrated in Figure 1.22. The direct sum
provides a way of making this idea precise, and we explore it thoroughly in
this section.

Figure 1.22: The direct sum lets us break down R3 (and other vector spaces) into
smaller subspaces that do not intersect each other except at the origin.

1.B.1 The Internal Direct Sum


We now pin down exactly what we mean by the afore-mentioned idea of
splitting up a vector space into two non-intersecting subspaces.

Definition 1.B.1 (The Internal Direct Sum). Let V be a vector space with subspaces S1 and S2. We say that V is the (internal) direct sum of S1 and S2, denoted by V = S1 ⊕ S2, if
a) span(S1 ∪ S2) = V, and
b) S1 ∩ S2 = {0}.

The two defining properties of the direct sum mimic very closely the two defining properties of bases (Definition 1.1.6). Just like bases must span the entire vector space, so too must subspaces in a direct sum, and just like bases must be "small enough" that they are linearly independent, subspaces in a direct sum must be "small enough" that they only contain the zero vector in common. (For now, we just refer to this as a direct sum, without caring about the "internal" part of its name. We will distinguish between this and another type of direct sum later.)

It is also worth noting that the defining property (a) of the direct sum is equivalent to saying that every vector v ∈ V can be written in the form v = v1 + v2 for some v1 ∈ S1 and v2 ∈ S2. The reason for this is simply that in any linear combination of vectors from S1 and S2, we can group the terms
from S1 into v1 and the terms from S2 into v2. (In other words, the direct sum is a special case of the not-necessarily-direct sum V = S1 + S2 from Exercise 1.1.19.) That is, if we write

    v = \underbrace{\big( c_1 x_1 + c_2 x_2 + \cdots + c_k x_k \big)}_{\text{call this } v_1} + \underbrace{\big( d_1 y_1 + d_2 y_2 + \cdots + d_m y_m \big)}_{\text{call this } v_2},     (1.B.1)

where x_1, . . . , x_k ∈ S1, y_1, . . . , y_m ∈ S2, and c_1, . . . , c_k, d_1, . . . , d_m ∈ F, then we can just define v1 and v2 to be the parenthesized terms indicated above.

Example 1.B.1 (Checking Whether or Not Subspaces Make a Direct Sum). Determine whether or not R^3 = S1 ⊕ S2, where S1 and S2 are the given subspaces.
a) S1 is the x-axis and S2 is the y-axis.
b) S1 is the xy-plane and S2 is the yz-plane.
c) S1 is the line through the origin in the direction of the vector (0, 1, 1) and S2 is the xy-plane.

Solutions:
a) To determine whether or not V = S1 ⊕ S2, we must check the two defining properties of the direct sum from Definition 1.B.1. Indeed, property (a) does not hold since span(S1 ∪ S2) is just the xy-plane, not all of R^3, so R^3 ≠ S1 ⊕ S2. (In part (a) of this example, S1 and S2 are "too small".)

b) Again, we check the two defining properties of the direct sum. Property (a) holds since every vector in R^3 can be written as a linear combination of vectors in the xy-plane and the yz-plane. However, property (b) does not hold since S1 ∩ S2 is the y-axis (not just {0}), so R^3 ≠ S1 ⊕ S2. (In part (b) of this example, S1 and S2 are "too big".)

c) To see that span(S1 ∪ S2) = R^3 we must show that we can write every vector (x, y, z) ∈ R^3 as a linear combination of vectors from S1 and S2. One way to do this is to notice that

    (x, y, z) = (0, z, z) + (x, y − z, 0).


Since (0, z, z) ∈ S1 and (x, y − z, 0) ∈ S2, we have (x, y, z) ∈ span(S1 ∪ S2), so span(S1 ∪ S2) = R^3.

To see that S1 ∩ S2 = {0}, suppose (x, y, z) ∈ S1 ∩ S2. Since (x, y, z) ∈ S1, we know that x = 0 and y = z. Since (x, y, z) ∈ S2, we know that z = 0. It follows that (x, y, z) = (0, 0, 0) = 0, so S1 ∩ S2 = {0}, and thus R^3 = S1 ⊕ S2. (Notice that the line S1 is not orthogonal to the plane S2. The intuition here is the same as that of linear independence, not of orthogonality.)
The direct sum extends straightforwardly to three or more subspaces as well. In general, we say that V is the (internal) direct sum of subspaces S1, S2, . . . , Sk if span(S1 ∪ S2 ∪ · · · ∪ Sk) = V and

    S_i \cap \operatorname{span}\Big( \bigcup_{j \neq i} S_j \Big) = \{0\}   for all 1 ≤ i ≤ k.     (1.B.2)

(The notation with the big "∪" union symbol here is analogous to big-Σ notation for sums.)

Equation (1.B.2) looks somewhat complicated on the surface, but it just says that each subspace S_i (1 ≤ i ≤ k) has no non-zero intersection with (the span of) the rest of them. In this case we write either

    V = S_1 ⊕ S_2 ⊕ · · · ⊕ S_k   or   V = \bigoplus_{j=1}^{k} S_j.

For example, if S1, S2, and S3 are the x-, y-, and z-axes in R^3, then R^3 = S1 ⊕ S2 ⊕ S3. More generally, R^n can be written as the direct sum of its n coordinate axes. Even more generally, given any basis {v1, v2, . . . , vk} of a finite-dimensional vector space V, it is the case that

    V = span(v1) ⊕ span(v2) ⊕ · · · ⊕ span(vk).

(This claim is proved in Exercise 1.B.4.)

We thus think of the direct sum as a higher-dimensional generalization of bases:


while bases break vector spaces down into 1-dimensional subspaces (i.e., the
lines in the direction of the basis vectors) that only intersect at the zero vector,
direct sums allow for subspaces of any dimension.
Given this close connection between bases and direct sums, it should not
be surprising that many of our theorems concerning bases from Section 1.2
generalize in a straightforward way to direct sums. Our first such result is
analogous to the fact that there is a unique way to write every vector in a vector
space as a linear combination of basis vectors (Theorem 1.1.4).
Theorem 1.B.1 (Uniqueness of Sums in Direct Sums). Suppose that V is a vector space with subspaces S1, S2 ⊆ V satisfying V = S1 ⊕ S2. For every v ∈ V, there is exactly one way to write v in the form

    v = v1 + v2   with v1 ∈ S1 and v2 ∈ S2.

Proof. We already noted that v can be written in this form back in Equation (1.B.1). To see uniqueness, suppose that there exist v1, w1 ∈ S1 and v2, w2 ∈ S2 such that

    v = v1 + v2 = w1 + w2.

Subtracting v2 + w1 from the above equation shows that v1 − w1 = w2 − v2. Since v1 − w1 ∈ S1 and w2 − v2 ∈ S2, and S1 ∩ S2 = {0}, this implies v1 − w1 = w2 − v2 = 0, so v1 = w1 and v2 = w2, as desired. (Recall that S1 is a subspace, so if v1, w1 ∈ S1 then v1 − w1 ∈ S1 too, and similarly for S2.) ∎
Similarly, we now show that if we combine bases of the subspaces S1 and
S2 then we get a basis of V = S1 ⊕ S2 . This hopefully makes some intuitive
sense—every vector in V can be represented uniquely as a sum of vectors in S1
and S2 , and every vector in those subspaces can be represented uniquely as a
linear combination of their basis vectors.

Theorem 1.B.2 (Bases of Direct Sums). Suppose that V is a vector space and S1, S2 ⊆ V are subspaces with bases B and C, respectively. Then
a) span(B ∪ C) = V if and only if span(S1 ∪ S2) = V, and
b) B ∪ C is linearly independent if and only if S1 ∩ S2 = {0}.
In particular, B ∪ C is a basis of V if and only if V = S1 ⊕ S2. (Here, B ∪ C refers to the union of B and C as a multiset, so that if there is a common vector in each of B and C then we immediately regard B ∪ C as linearly dependent.)

Proof. For part (a), we note that if span(B ∪ C) = V then it must be the case that span(S1 ∪ S2) = V as well, since span(B ∪ C) ⊂ span(S1 ∪ S2). In the other direction, if span(S1 ∪ S2) = V then we can write every v ∈ V in the form v = v1 +
dependent. direction, if span(S1 ∪ S2 ) = V then we can write every v ∈ V in the form v = v1 +
v2 , where v1 ∈ S1 and v2 ∈ S2 . Since B and C are bases of S1 and S2 , respectively,
we can write v1 and v2 as linear combinations of vectors from those sets:
 
v = v1 + v2 = c1 x1 + c2 x2 + · · · + ck xk + d1 y1 + d2 y2 + · · · + dm ym ,
where x1 , x2 , . . ., xk ∈ S1 , y1 , y2 , . . ., ym ∈ S2 , and c1 , c2 , . . ., ck , d1 , d2 , . . .,
dm ∈ F. We have thus written v as a linear combination of vectors from B ∪C,
so span(B ∪C) = V.
For part (b), suppose that S1 ∩ S2 = {0} and consider some linear combination of vectors from B ∪ C that equals the zero vector:

    \underbrace{\big( c_1 x_1 + c_2 x_2 + \cdots + c_k x_k \big)}_{\text{call this } v_1} + \underbrace{\big( d_1 y_1 + d_2 y_2 + \cdots + d_m y_m \big)}_{\text{call this } v_2} = 0,     (1.B.3)

where the vectors and scalars come from the same spaces as they did in part (a). Our goal is to show that c_1 = c_2 = · · · = c_k = 0 and d_1 = d_2 = · · · = d_m = 0, which implies linear independence of B ∪ C. (As was the case with proofs about bases, these proofs are largely uninspiring definition-chasing affairs. The theorems themselves should be somewhat intuitive though.)

To this end, notice that Equation (1.B.3) says that 0 = v1 + v2, where v1 ∈ S1 and v2 ∈ S2. Since 0 = 0 + 0 is another way of writing 0 as a sum of something from S1 and something from S2, Theorem 1.B.1 tells us that v1 = 0 and v2 = 0. It follows that

c1 x1 + c2 x2 + · · · + ck xk = 0 and d1 y1 + d2 y2 + · · · + dm ym = 0,
so linear independence of B implies c1 = c2 = · · · = ck = 0 and linear indepen-


dence of C similarly implies d1 = d2 = · · · = dm = 0. We thus conclude that
B ∪C is linearly independent.
All that remains is to show that B ∪C being linearly independent implies
S1 ∩ S2 = {0}. This proof has gone on long enough already, so we leave this
final implication to Exercise 1.B.5. 
For example, the above theorem implies that if we take any basis of a vector
space V and partition that basis in any way, then the spans of those partitions
form a direct sum decomposition of V. For example, if B = {v1 , v2 , v3 , v4 } is a
basis of a (4-dimensional) vector space V then
V = span(v1 ) ⊕ span(v2 ) ⊕ span(v3 ) ⊕ span(v4 )
= span(v1 ∪ v2 ) ⊕ span(v3 ) ⊕ span(v4 )
= span(v1 ) ⊕ span(v2 ∪ v4 ) ⊕ span(v3 )
= span(v1 ∪ v3 ∪ v4 ) ⊕ span(v2 ),
as well as many other possibilities.

Example 1.B.2 (Even and Odd Polynomials as a Direct Sum). Let P^E and P^O be the subspaces of P consisting of the even and odd polynomials, respectively:

    P^E = { f ∈ P : f(−x) = f(x) }   and   P^O = { f ∈ P : f(−x) = −f(x) }.

Show that P = P^E ⊕ P^O.

Solution:
We could directly show that span(P^E ∪ P^O) = P and P^E ∩ P^O = {0}, but perhaps an easier way is to consider how the bases of these vector spaces relate to each other. Recall from Example 1.1.21 that B = {1, x^2, x^4, . . .} is a basis of P^E and C = {x, x^3, x^5, . . .} is a basis of P^O. Since

    B ∪ C = {1, x, x^2, x^3, . . .}

is a basis of P, we conclude from Theorem 1.B.2 that P = P^E ⊕ P^O. (We introduced P^E and P^O in Example 1.1.21.)

Another direct consequence of Theorem 1.B.2 is the (hopefully intuitive)


fact that we can only write a vector space as a direct sum of subspaces if the
dimensions of those subspaces sum up to the dimension of the original space.
Note that this agrees with our intuition from Example 1.B.1 that subspaces in a
direct sum must not be too big, but also must not be too small.

Corollary 1.B.3 (Dimension of (Internal) Direct Sums). Suppose that V is a finite-dimensional vector space with subspaces S1, S2 ⊆ V satisfying V = S1 ⊕ S2. Then

    dim(V) = dim(S1) + dim(S2).

Proof. We recall from Definition 1.2.2 that the dimension of a finite-dimensional vector space is the number of vectors in any of its bases. If B and C are bases of S1 and S2, respectively, then the fact that S1 ∩ S2 = {0} implies B ∩ C = {}, since 0 cannot be a member of a basis. It follows that |B ∪ C| = |B| + |C| (recall that |B| means the number of vectors in B), so

    dim(V) = |B ∪ C| = |B| + |C| = dim(S1) + dim(S2). ∎
For example, we can now see straight away that the subspaces S1 and S2
from Examples 1.B.1(a) and (b) do not form a direct sum decomposition of R3
since in part (a) we have dim(S1 ) + dim(S2 ) = 1 + 1 = 2 ≠ 3, and in part (b) we have dim(S1 ) + dim(S2 ) = 2 + 2 = 4 ≠ 3. It is perhaps worthwhile to work
through a somewhat more exotic example in a vector space other than Rn .

Example 1.B.3 (The Cartesian Decomposition of Matrices)
Let MSn and MsSn be the subspaces of Mn consisting of the symmetric and skew-symmetric matrices, respectively. Show that

Mn = MSn ⊕ MsSn .

Also compute the dimensions of each of these vector spaces.

Recall that A ∈ Mn is symmetric if AT = A and it is skew-symmetric if AT = −A.

Solution:
To see that property (b) of Definition 1.B.1 holds, suppose that A ∈ Mn is both symmetric and skew-symmetric. Then

A = AT = −A,

from which it follows that A = O, so MSn ∩ MsSn = {O}.

For property (a), we have to show that every matrix A ∈ Mn can be written in the form A = B +C, where BT = B and CT = −C. We can check via direct computation that

A = (A + AT )/2 + (A − AT )/2,

where the first term is symmetric and the second is skew-symmetric, so we can choose B = (A + AT )/2 and C = (A − AT )/2 (and it is straightforward to check that BT = B and CT = −C). We thus conclude that Mn = MSn ⊕ MsSn , as desired.

This example only works if the ground field does not have 1 + 1 = 0 (like in Z2 , for example). Fields with this property are said to have “characteristic 2”, and they often require special attention.

We computed the dimensions of these vector spaces in Exercise 1.2.2:

dim(MSn ) = n(n + 1)/2,   dim(MsSn ) = n(n − 1)/2,   and   dim(Mn ) = n2 .

Note that these quantities are in agreement with Corollary 1.B.3, since

dim(MSn ) + dim(MsSn ) = n(n + 1)/2 + n(n − 1)/2 = n2 = dim(Mn ).

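The symmetric/skew-symmetric splitting of Example 1.B.3 is easy to check numerically. Here is a minimal NumPy sketch (not from the text, using a hypothetical matrix):

```python
# A minimal NumPy sketch (not from the text) of the Cartesian decomposition
# A = B + C with B symmetric and C skew-symmetric, as in Example 1.B.3.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])   # any square matrix works here

B = (A + A.T) / 2   # symmetric part
C = (A - A.T) / 2   # skew-symmetric part

assert np.allclose(B, B.T)        # B^T = B
assert np.allclose(C, -C.T)       # C^T = -C
assert np.allclose(A, B + C)      # the decomposition recovers A
```
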
Remark 1.B.1 (The Complex Cartesian Decomposition)
While the decomposition of Example 1.B.3 works fine for complex matrices (as well as matrices over almost any other field), a slightly different decomposition that makes use of the conjugate transpose is typically used in that setting instead. Indeed, we can write every matrix A ∈ Mn (C) as a sum of a Hermitian and a skew-Hermitian matrix via

A = (A + A∗ )/2 + (A − A∗ )/2,    (1.B.4)

where the first term is Hermitian and the second is skew-Hermitian.

It is thus tempting to write Mn (C) = MHn ⊕ MsHn (where now MHn and MsHn denote the sets of Hermitian and skew-Hermitian matrices, respectively). However, this is only true if we are thinking of Mn (C) as a 2n2 -dimensional vector space over R (not as an n2 -dimensional vector space over C like usual). The reason for this is simply that MHn and MsHn are not even subspaces of Mn (C) over the field C (refer back to Example 1.1.5—the point is that if A is Hermitian or skew-Hermitian then so is cA when c ∈ R, but not necessarily when c ∈ C).

The matrix (A + A∗ )/2 is sometimes called the real part or Hermitian part of A and denoted by Re(A), and similarly (A − A∗ )/(2i) is called its imaginary part or skew-Hermitian part and denoted by Im(A). Then A = Re(A) + iIm(A).

It is also worth noting that, in light of Theorem 1.B.1, Equation (1.B.4) provides the unique way of writing a matrix as a sum of a Hermitian and skew-Hermitian matrix. This is sometimes called the Cartesian decomposition of A (as is the closely-related decomposition of Example 1.B.3), and it is analogous to the fact that every complex number can be written as a sum of a real number and an imaginary number:

[Figure: a complex number a + ib plotted in the complex plane, with real part a and imaginary part b.]

We return to this idea of “matrix versions” of certain subsets of complex numbers in Figure 2.6.

Indeed, Hermitian matrices are often thought of as the “matrix version” of real numbers, and skew-Hermitian matrices as the “matrix version” of imaginary numbers.

Just like we can use dimensionality arguments to make it easier to determine whether or not a set is a basis of a vector space (via Theorem 1.2.1
or Exercise 1.2.27), we can also use dimensionality of subspaces to help us
determine whether or not they form a direct sum decomposition of a vector
space. In particular, if subspaces have the right size, we only need to check one
of the two properties from Definition 1.B.1 that define the direct sum, not both.

Theorem 1.B.4 (Using Dimension to Check a Direct Sum Decomposition)
Suppose V is a finite-dimensional vector space with subspaces S1 , S2 ⊆ V.
a) If dim(V) ≠ dim(S1 ) + dim(S2 ) then V ≠ S1 ⊕ S2 .
b) If dim(V) = dim(S1 ) + dim(S2 ) then the following are equivalent:
   i) span(S1 ∪ S2 ) = V,
   ii) S1 ∩ S2 = {0}, and
   iii) V = S1 ⊕ S2 .

Proof. Part (a) of the theorem follows immediately from Corollary 1.B.3, so
we focus our attention on part (b). Also, condition (iii) immediately implies
conditions (i) and (ii), since those two conditions define exactly what V =
S1 ⊕ S2 means. We thus just need to show that condition (i) implies (iii) and
that (ii) also implies (iii).
To see that condition (i) implies condition (iii), we first note that if B and C

are bases of S1 and S2 , respectively, then

|B ∪C| ≤ |B| + |C| = dim(S1 ) + dim(S2 ) = dim(V).

However, equality must actually be attained since Theorem 1.B.2 tells us that
(since (i) holds) span(B ∪C) = V. It then follows from Exercise 1.2.27 (since
B ∪ C spans V and has dim(V) vectors) that B ∪ C is a basis of V, so using
Theorem 1.B.2 again tells us that V = S1 ⊕ S2 .
The proof that condition (ii) implies condition (iii) is similar, and left as
Exercise 1.B.6. 

1.B.2 The Orthogonal Complement


Just like the direct sum can be thought of as a “subspace version” of bases,
there is also a “subspace version” of orthogonal bases. As is usually the case
when dealing with orthogonality, we need slightly more structure here than a
vector space itself provides—we need an inner product as well.

Definition 1.B.2 (Orthogonal Complement)
Suppose V is an inner product space and B ⊆ V is a set of vectors. The orthogonal complement of B, denoted by B⊥ , is the subspace of V consisting of the vectors that are orthogonal to everything in B:

B⊥ = { v ∈ V : hv, wi = 0 for all w ∈ B }.

B⊥ is read as “B perp”, where “perp” is short for perpendicular.

For example, two lines in R2 that are perpendicular to each other are orthog-
onal complements of each other, as are a plane and a line in R3 that intersect
at right angles (see Figure 1.23). The idea is that orthogonal complements
break an inner product space down into subspaces that only intersect at the zero
vector, much like (internal) direct sums do, but with the added restriction that
those subspaces must be orthogonal to each other.

Figure 1.23: Orthogonal complements in R2 and R3 .

It is straightforward to show that B⊥ is always a subspace of V (even if B


is not), so we leave the proof of this fact to Exercise 1.B.10. Furthermore, if
S is a subspace of V then we will see shortly that (S ⊥ )⊥ = S, at least in the
finite-dimensional case, so orthogonal complement subspaces come in pairs.
For now, we look at some examples.

Example 1.B.4 (Orthogonal Complements in Euclidean Space)
Describe the orthogonal complements of the following subsets of Rn .
a) The line in R2 through the origin and the vector (2, 1).
b) The vector (1, −1, 2) ∈ R3 .

Solutions:
a) We want to determine which vectors are orthogonal to (2, 1). Well, if v = (v1 , v2 ) then we can rearrange the equation

   (v1 , v2 ) · (2, 1) = 2v1 + v2 = 0

   to the form v2 = −2v1 , or v = v1 (1, −2). It follows that the orthogonal complement is the set of scalar multiples of (1, −2):

   [Figure: the line through (2, 1) and its orthogonal complement, the line through (1, −2), in the xy-plane.]

   Actually, we want to find all vectors v orthogonal to every multiple of (2, 1). However, v is orthogonal to a non-zero multiple of (2, 1) if and only if it is orthogonal to (2, 1) itself.

b) We want to determine which vectors are orthogonal to (1, −1, 2):


the vectors v = (v1 , v2 , v3 ) with v1 − v2 + 2v3 = 0. This is a (quite
degenerate) linear system with solutions of the form

v = (v1 , v2 , v3 ) = v2 (1, 1, 0) + v3 (−2, 0, 1),

where v2 and v3 are free variables. It follows that the orthogonal


complement is the plane in R3 with basis {(1, 1, 0), (−2, 0, 1)}:
   [Figure: the vector (1, −1, 2) and its orthogonal complement, the plane through (2, 2, 0) and (−2, 0, 1), in R3 .]

   We replaced the basis vector (1, 1, 0) with (2, 2, 0) just to make this picture a bit prettier. The author is very superficial.

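Computationally, the orthogonal complement of a finite set of vectors in Rn is just the null space of the matrix whose rows are those vectors. A minimal SciPy sketch (not from the text), applied to the set from part (b) above:

```python
# A minimal sketch (not from the text): the orthogonal complement of
# {(1, -1, 2)} in R^3 is the null space of the 1x3 matrix with that row.
import numpy as np
from scipy.linalg import null_space

B = np.array([[1.0, -1.0, 2.0]])   # rows span the set whose complement we want

basis = null_space(B)              # columns form an orthonormal basis of B-perp
print(basis.shape)                 # (3, 2): the complement is a plane in R^3

# every basis vector of the complement is orthogonal to every row of B
assert np.allclose(B @ basis, 0)
```
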
Example 1.B.5 (Orthogonal Complements of Matrices)
Describe the orthogonal complement of the set in Mn consisting of just the identity matrix: B = {I}.

Solution:
Recall from Example 1.3.16 that the standard (Frobenius) inner product on Mn is hX,Y i = tr(X ∗Y ), so X ∈ B⊥ if and only if hX, Ii = tr(X ∗ ) = 0. This is equivalent to simply requiring that tr(X) = 0.

Example 1.B.6 (Orthogonal Complements of Polynomials)
Describe the orthogonal complement of the set B = {x, x3 } ⊆ P 3 [−1, 1].

Solution:
Our goal is to find all polynomials f (x) = ax3 + bx2 + cx + d with the property that

h f (x), xi = ∫_{−1}^{1} x f (x) dx = 0   and   h f (x), x3 i = ∫_{−1}^{1} x3 f (x) dx = 0.

Straightforward calculation shows that

h f (x), xi = ∫_{−1}^{1} (ax4 + bx3 + cx2 + dx) dx = 2a/5 + 2c/3   and
h f (x), x3 i = ∫_{−1}^{1} (ax6 + bx5 + cx4 + dx3 ) dx = 2a/7 + 2c/5.

Setting both of these quantities equal to 0 and then solving for a and c gives a = c = 0, so f ∈ B⊥ if and only if f (x) = bx2 + d.

Similarly, it is the case that the orthogonal complement of the set of even polynomials is the set of odd polynomials (and vice-versa) in P[−1, 1] (see Exercise 1.B.17).
We now start pinning down the various details of orthogonal complements that we have alluded to—that they behave like an orthogonal version of direct sums, that they behave like a “subspace version” of orthonormal bases, and that
orthogonal complements come in pairs. The following theorem does most of the
heavy lifting in this direction, and it is completely analogous to Theorem 1.B.2
for (internal) direct sums.

Theorem 1.B.5 (Orthonormal Bases of Orthogonal Complements)
Suppose V is a finite-dimensional inner product space and S ⊆ V is a subspace. If B is an orthonormal basis of S and C is an orthonormal basis of S ⊥ then B ∪C is an orthonormal basis of V.

Proof. First note that B and C are disjoint since the only vector that S and
S ⊥ have in common is 0, since that is the only vector orthogonal to itself.
With that in mind, write B = {u1 , u2 , . . . , um } and C = {v1 , v2 , . . . , vn }, so
that B ∪ C = {u1 , u2 , . . . , um , v1 , v2 , . . . , vn }. To see that B ∪ C is a mutually
orthogonal set, notice that

hui , u j i = 0 for all 1 ≤ i ≠ j ≤ m, since B is an orthonormal basis,
hvi , v j i = 0 for all 1 ≤ i ≠ j ≤ n, since C is an orthonormal basis, and
hui , v j i = 0 for all 1 ≤ i ≤ m, 1 ≤ j ≤ n, since ui ∈ S and v j ∈ S ⊥ .

We thus just need to show that span(B ∪C) = V. To this end, recall from
Exercise 1.4.20 that we can extend B ∪C to an orthonormal basis of V: we can
find k ≥ 0 unit vectors w1 , w2 , . . . wk such that the set

{u1 , u2 , . . . , um , v1 , v2 , . . . , vn , w1 , w2 , . . . wk }

is an orthonormal basis of V. However, since this is an orthonormal basis, we


know that hwi , u j i = 0 for all 1 ≤ i ≤ k and 1 ≤ j ≤ m, so in fact wi ∈ S ⊥ for
all i. This implies that {v1 , v2 , . . . , vn , w1 , w2 , . . . , wk } is a mutually orthogonal
subset of S ⊥ consisting of n + k vectors. However, since dim(S ⊥ ) = |C| =
n we conclude that the only possibility is that k = 0, so B ∪ C itself is a
basis of V. 

The above theorem has a few immediate (but very useful) corollaries. For
example, we can now show that orthogonal complements really are a stronger
version of direct sums:

Theorem 1.B.6 (Orthogonal Complements are Direct Sums)
Suppose V is a finite-dimensional inner product space with subspaces S1 , S2 ⊆ V. The following are equivalent:
a) S2 = S1⊥ .
b) V = S1 ⊕ S2 and hv, wi = 0 for all v ∈ S1 and w ∈ S2 .

In particular, this theorem tells us that if V is finite-dimensional then for every subspace S ⊆ V we have V = S ⊕ S ⊥ .

Proof. If property (a) holds (i.e., S2 = S1⊥ ) then the fact that hv, wi = 0 for all v ∈ S1 and w ∈ S2 is clear, so we just need to show that V = S1 ⊕ S2 . Well, Theorem 1.B.5 tells us that if B and C are orthonormal bases of S1 and S2 , respectively, then B ∪C is an orthonormal basis of V. Theorem 1.B.2 then tells us that V = S1 ⊕ S2 .
In the other direction, property (b) immediately implies S2 ⊆ S1⊥ , so we
just need to show the opposite inclusion. To that end, suppose w ∈ S1⊥ (i.e.,
hv, wi = 0 for all v ∈ S1 ). Then, since w ∈ V = S1 ⊕ S2 , we can write w =
w1 + w2 for some w1 ∈ S1 and w2 ∈ S2 . It follows that

0 = hv, wi = hv, w1 i + hv, w2 i = hv, w1 i

for all v ∈ S1 , where the final equality follows from the fact that w2 ∈ S2 and
thus hv, w2 i = 0 by property (b). Choosing v = w1 then gives hw1 , w1 i = 0, so
w1 = 0, so w = w2 ∈ S2 , as desired. 

Example 1.B.7 (The Orthogonal Complement of Symmetric Matrices)
Suppose F = R or F = C and let MSn and MsSn be the subspaces of Mn (F) consisting of the symmetric and skew-symmetric matrices, respectively. Show that (MSn )⊥ = MsSn .

Solution:
Recall from Example 1.B.3 that Mn = MSn ⊕ MsSn . Furthermore, we can use basic properties of the trace to see that if A ∈ MSn and B ∈ MsSn then

hA, Bi = tr(A∗ B) = −tr(ABT ) = −tr((BA∗ )T ) = −tr(BA∗ ) = −tr(A∗ B) = −hA, Bi,

so hA, Bi = 0. It then follows from Theorem 1.B.6 that (MSn )⊥ = MsSn .

Compare this orthogonality property with Exercise 1.A.2.

Theorem 1.B.6 tells us that for every subspace S of V, we have V = S ⊕ S ⊥ , so everything that we already know about direct sums also applies to orthogonal complements. For example:
• If V is a finite-dimensional inner product space with subspace S ⊆ V and v ∈ V, then there exist unique vectors v1 ∈ S and v2 ∈ S ⊥ such that v = v1 + v2 (via Theorem 1.B.1).
• If V is a finite-dimensional inner product space with subspace S ⊆ V, then dim(S) + dim(S ⊥ ) = dim(V) (via Corollary 1.B.3).

Most of these connections between the orthogonal complement and direct sum break down in infinite dimensions—see Exercise 1.B.18.
These results also tell us that if S is a subspace of a finite-dimensional inner
product space then (S ⊥ )⊥ = S. Slightly more generally, we have the following
fact:

! If B is any subset (not necessarily a subspace) of a finite-dimensional inner product space then (B⊥ )⊥ = span(B).

This fact is proved in Exercise 1.B.14 and illustrated in Figure 1.24. Note in
particular that after taking the orthogonal complement of any set B once, further
orthogonal complements just bounce back and forth between B⊥ and span(B).

Figure 1.24: The orthogonal complement of B is B⊥ , and the orthogonal com-
plement of B⊥ is span(B). After that point, taking the orthogonal complement of
span(B) results in B⊥ again, and vice-versa.

Example 1.B.8 ((Orthogonal) Projections and Direct Sums)
Suppose V is a vector space and P : V → V is a projection (i.e., P2 = P).
a) Show that V = range(P) ⊕ null(P).
b) Show that if P is an orthogonal projection (i.e., P∗ = P) then range(P)⊥ = null(P).

If P is an orthogonal projection then V must actually be an inner product space, not just a vector space.

Solutions:
a) The fact that range(P) ∩ null(P) = {0} follows from noting that if v ∈ range(P) then P(v) = v, and if v ∈ null(P) then P(v) = 0. Comparing these two equations shows that v = 0.
To see that V = span(range(P) ∪ null(P)) (and thus V = range(P) ⊕
null(P)), notice that every vector v ∈ V can be written in the form
v = P(v) + (v − P(v)). The first vector P(v) is in range(P) and the
second vector v − P(v) satisfies

P(v − P(v)) = P(v) − P2 (v) = P(v) − P(v) = 0,

so it is in null(P). We have thus written v as a sum of vectors from


range(P) and null(P), so we are done.
b) We just observe that w ∈ range(P)⊥ is equivalent to several other
conditions:

w ∈ range(P)⊥ ⇐⇒ hP(v), wi = 0 for all v ∈ V
               ⇐⇒ hv, P∗ (w)i = 0 for all v ∈ V
               ⇐⇒ hv, P(w)i = 0 for all v ∈ V
               ⇐⇒ P(w) = 0
               ⇐⇒ w ∈ null(P).

It follows that range(P)⊥ = null(P), as desired.

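To see Example 1.B.8 in action numerically, here is a minimal NumPy sketch (not from the text) that builds the orthogonal projection onto the column space of a hypothetical matrix and checks the properties used above:

```python
# A minimal NumPy sketch (not from the text): the orthogonal projection onto
# range(A) is P = A (A^T A)^{-1} A^T, and its null space is range(A)-perp.
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])                      # hypothetical 3x2 matrix with independent columns

P = A @ np.linalg.inv(A.T @ A) @ A.T            # orthogonal projection onto range(A)

assert np.allclose(P @ P, P)                    # P is a projection: P^2 = P
assert np.allclose(P, P.T)                      # and it is orthogonal: P* = P

# every vector splits as P(v) + (v - P(v)), with the two pieces orthogonal
v = np.array([1.0, 2.0, 3.0])
assert np.isclose(np.dot(P @ v, v - P @ v), 0.0)
```
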
The above example tells us that every projection breaks space down into two pieces—its range and null space, respectively—one of which it projects onto and one of which it projects along. Furthermore, the projection is orthogonal if and only if these two component subspaces are orthogonal to each other (see Figure 1.25).

Figure 1.25: An (a) oblique projection P1 and an (b) orthogonal projection P2


projecting along their null spaces onto their ranges.

In fact, just like orthogonal projections are completely determined by


their range (refer back to Theorem 1.4.10), oblique projections are uniquely
determined by their range and null space (see Exercise 1.B.7). For this reason,
just like we often talk about the orthogonal projection P onto a particular
subspace (range(P)), we similarly talk about the (not necessarily orthogonal)
projection P onto one subspace (range(P)) along another one (null(P)).
Orthogonality of the Fundamental Subspaces
The range and null space of a linear transformation acting on a finite-
dimensional inner product space, as well as the range and null space of its
adjoint, are sometimes collectively referred to as its fundamental subspaces.
We saw above that the direct sum and orthogonal complement play an important
role when dealing with the fundamental subspaces of projections. Somewhat
surprisingly, they actually play an important role in the fundamental subspaces
of every linear transformation.
For example, by using standard techniques from introductory linear algebra,
we can see that the matrix

        [  1  0  1  0 −1 ]
    A = [  1  1  0  0  1 ]          (1.B.5)
        [ −1  0 −1  1  4 ]
        [  2  1  1 −1 −3 ]

has the following sets as bases of its four fundamental subspaces:

subspaces of R4 :   range(A) : {(1, 1, −1, 2), (0, 1, 0, 1), (0, 0, 1, −1)}
                    null(A∗ ) : {(0, 1, −1, −1)}
subspaces of R5 :   range(A∗ ) : {(1, 0, 1, 0, −1), (0, 1, −1, 0, 2), (0, 0, 0, 1, 3)}
                    null(A) : {(−1, 1, 1, 0, 0), (1, −2, 0, −3, 1)}.

Here, A∗ = AT since A is real.

There is a lot of structure that is suggested by these bases—the dimensions


of range(A) and null(A∗ ) add up to the dimension of the output space R4 that

they live in, and similarly the dimensions of range(A∗ ) and null(A) add up
to the dimension of the input space R5 that they live in. Furthermore, it is
straightforward to check that the vector in this basis of null(A∗ ) is orthogonal
to each of the basis vectors for range(A), and the vectors in this basis for null(A)
are orthogonal to each of the basis vectors for range(A∗ ). All of these facts
can be explained by observing that the fundamental subspaces of any linear
transformation are in fact orthogonal complements of each other:

Theorem 1.B.7 (Orthogonality of the Fundamental Subspaces)
Suppose V and W are finite-dimensional inner product spaces and T : V → W is a linear transformation. Then
a) range(T )⊥ = null(T ∗ ), and
b) null(T )⊥ = range(T ∗ ).

To help remember this theorem, note that each equation in it contains exactly one T , one T ∗ , one range, one null space, and one orthogonal complement.

Proof. The proof of this theorem is surprisingly straightforward. For part (a), we just argue as we did in Example 1.B.8(b)—we observe that w ∈ range(T )⊥ is equivalent to several other conditions:

w ∈ range(T )⊥ ⇐⇒ hT (v), wi = 0 for all v ∈ V
               ⇐⇒ hv, T ∗ (w)i = 0 for all v ∈ V
               ⇐⇒ T ∗ (w) = 0
               ⇐⇒ w ∈ null(T ∗ ).

It follows that range(T )⊥ = null(T ∗ ), as desired. Part (b) of the theorem can
now be proved by making use of part (a), and is left to Exercise 1.B.15. 
The way to think of this theorem is as saying that, for every linear transfor-
mation T : V → W, we can decompose the input space V into an orthogonal
direct sum V = range(T ∗ ) ⊕ null(T ) such that T acts like an invertible map
on one space (range(T ∗ )) and acts like the zero map on the other (null(T )).
Similarly, we can decompose the output space W into an orthogonal direct sum
W = range(T ) ⊕ null(T ∗ ) such that T maps all of V onto one space (range(T ))
and maps nothing into the other (null(T ∗ )). These relationships between the four fundamental subspaces are illustrated in Figure 1.26.

Okay, T maps things to 0 ∈ null(T ∗ ), but that’s it! The rest of null(T ∗ ) is untouched.
null(T ∗ ) is untouched. For example, the matrix A from Equation (1.B.5) acts as a rank 3 linear
transformation that sends R5 to R4 . This means that there is a 3-dimensional
subspace range(A∗ ) ⊆ R5 on which A just “shuffles things around” to an-
other 3-dimensional subspace range(A) ⊆ R4 . The orthogonal complement of
range(A∗ ) is null(A), which accounts for the other 2 dimensions of R5 that are “squashed away”.

We return to the fundamental subspaces in Section 2.3.1.

If we recall from Exercise 1.4.22 that every linear transformation T acting on a finite-dimensional inner product space has rank(T ) = rank(T ∗ ), we im-
mediately get the following corollary that tells us how large the fundamental
subspaces are compared to each other.

Corollary 1.B.8 (Dimensions of the Fundamental Subspaces)
Suppose V and W are finite-dimensional inner product spaces and T : V → W is a linear transformation. Then
a) rank(T ) + nullity(T ) = dim(V), and
b) rank(T ) + nullity(T ∗ ) = dim(W).

We make the idea that T acts “like” an invertible linear transformation on range(T ∗ ) precise in Exercise 1.B.16.
Figure 1.26: Given a linear transformation T : V → W, range(T ∗ ) and null(T ) are
orthogonal complements in V, while range(T ) and null(T ∗ ) are orthogonal comple-
ments in W. These particular orthogonal decompositions of V and W are useful
because T acts like the zero map on null(T ) (i.e., T (vn ) = 0 for each vn ∈ null(T ))
and like an invertible linear transformation on range(T ∗ ) (i.e., for each w ∈ range(T )
there exists a unique vr ∈ range(T ∗ ) such that T (vr ) = w).
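
As a numerical check of Theorem 1.B.7, the following NumPy/SciPy sketch (not from the text) computes orthonormal bases of the four fundamental subspaces of the matrix A from Equation (1.B.5) and verifies the orthogonality relations:

```python
# A minimal NumPy/SciPy sketch (not from the text): verify that
# range(A)-perp = null(A^*) and null(A)-perp = range(A^*) for the matrix (1.B.5).
import numpy as np
from scipy.linalg import orth, null_space

A = np.array([[ 1, 0,  1,  0, -1],
              [ 1, 1,  0,  0,  1],
              [-1, 0, -1,  1,  4],
              [ 2, 1,  1, -1, -3]], dtype=float)

range_A  = orth(A)            # orthonormal basis of range(A), subspace of R^4
null_At  = null_space(A.T)    # orthonormal basis of null(A^*), subspace of R^4
range_At = orth(A.T)          # orthonormal basis of range(A^*), subspace of R^5
null_A   = null_space(A)      # orthonormal basis of null(A), subspace of R^5

# dimensions add up: rank + nullity = dimension of the ambient space
assert range_A.shape[1] + null_At.shape[1] == 4
assert range_At.shape[1] + null_A.shape[1] == 5

# and each pair of subspaces is orthogonal, as Theorem 1.B.7 claims
assert np.allclose(range_A.T @ null_At, 0)
assert np.allclose(range_At.T @ null_A, 0)
```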

1.B.3 The External Direct Sum


The internal direct sum that we saw in Section 1.B.1 works by first defining a
vector space and then “breaking it apart” into two subspaces. We can also flip
the direct sum around so as to start off with two vector spaces and then create
their “external” direct sum, which is a larger vector space that more or less
contains the original two vector spaces as subspaces. The following definition
makes this idea precise.

Definition 1.B.3 (The External Direct Sum)
Let V and W be vector spaces over the same field F. Then the external direct sum of V and W, denoted by V ⊕ W, is the vector space with vectors and operations defined as follows:
Vectors: ordered pairs (v, w), where v ∈ V and w ∈ W.
Vector addition: (v1 , w1 ) + (v2 , w2 ) = (v1 + v2 , w1 + w2 ) for all v1 , v2 ∈ V and w1 , w2 ∈ W.
Scalar mult.: c(v, w) = (cv, cw) for all c ∈ F, v ∈ V, and w ∈ W.

In other words, V ⊕ W is the Cartesian product of V and W, together with the entry-wise addition and scalar multiplication operations.

It is hopefully believable that the external direct sum V ⊕ W really is a vector space, so we leave the proof of that claim to Exercise 1.B.19. For now,
we look at the canonical example that motivates the external direct sum.

Example 1.B.9 (The Direct Sum of Fn)
Show that F ⊕ F2 (where “⊕” here means the external direct sum) is isomorphic to F3 .

Solution:
By definition, F ⊕ F2 consists of vectors of the form (x, (y, z)), where
x ∈ F and (y, z) ∈ F2 (together with the “obvious” vector addition and
scalar multiplication operations). It is straightforward to check that erasing
the inner set of parentheses is an isomorphism (i.e., the linear map T :
F ⊕ F2 → F3 defined by T (x, (y, z)) = (x, y, z) is an isomorphism), so
F ⊕ F2 ∼= F3 .

More generally, it should not be surprising that Fm ⊕ Fn ∼= Fm+n in a natural


way. Furthermore, this is exactly the point of the external direct sum—it lets us
build up large vector spaces out of small ones in much the same way that we
think of Fn as made up of many copies of F.
This is basically the same intuition that we had for the internal direct sum, but in that case we started with Fm+n and broke it down into subspaces that “looked like” (i.e., were isomorphic to) Fm and Fn . This difference in perspective (i.e., whether we start with the large container vector space or with the smaller component vector spaces) is the only appreciable difference between the internal and external direct sums, which is why we use the same notation for each of them.

We will show that the internal and external direct sums are isomorphic shortly.

Theorem 1.B.9 (Bases of the External Direct Sum)
Suppose V and W are vector spaces with bases B and C, respectively, and define the following subsets of the external direct sum V ⊕ W:

B0 = {(v, 0) : v ∈ B} and C0 = {(0, w) : w ∈ C}.

Then B0 ∪C0 is a basis of V ⊕ W.

Proof. To see that span(B0 ∪C0 ) = V ⊕ W, suppose (x, y) ∈ V ⊕ W (i.e., x ∈ V


and y ∈ W). Since B and C are bases of V and W, respectively, we can find v1 ,
v2 , . . ., vk ∈ B, w1 , w2 , . . ., wm ∈ C, and scalars c1 , c2 , . . ., ck and d1 , d2 , . . .,
dm such that
x = ∑_{i=1}^{k} ci vi   and   y = ∑_{j=1}^{m} d j w j .

It follows that

(x, y) = (x, 0) + (0, y) = ( ∑_{i=1}^{k} ci vi , 0 ) + ( 0, ∑_{j=1}^{m} d j w j )
       = ∑_{i=1}^{k} ci (vi , 0) + ∑_{j=1}^{m} d j (0, w j ),

which is a linear combination of vectors from B0 ∪C0 .


The proof of linear independence is similar, so we leave it to
Exercise 1.B.20. 
In the finite-dimensional case, the above result immediately implies the
following corollary, which helps clarify why we refer to the external direct sum
as a “sum” in the first place.

Corollary 1.B.10 (Dimension of (External) Direct Sums)
Suppose V and W are finite-dimensional vector spaces with external direct sum V ⊕ W. Then

dim(V ⊕ W) = dim(V) + dim(W).

Compare this corollary (and its proof) to Corollary 1.B.3.

Proof. Just observe that the sets B0 and C0 from Theorem 1.B.9 have empty intersection (keep in mind that they cannot even have (0, 0) in common, since B and C are bases so 0 ∉ B,C), so

dim(V ⊕ W) = |B0 ∪C0 | = |B0 | + |C0 | = |B| + |C| = dim(V) + dim(W),

as claimed. ∎

We close this section by noting that, in practice, people often just talk about
the direct sum, without specifying whether they mean the internal or external
one (much like we use the same notation for each of them). The reason for this
is two-fold:
• First, it is always clear from context whether a direct sum is internal or
external. If the components in the direct sum were first defined and then
a larger vector space was constructed via their direct sum, it is external.
On the other hand, if a single vector space was first defined and then
it was broken down into two component subspaces, the direct sum is
internal.
• Second, the internal and external direct sums are isomorphic in a natural
way. If V and W are vector spaces with external direct sum V ⊕ W, then
we cannot quite say that V ⊕ W is the internal direct sum of V and W,
since they are not even subspaces of V ⊕ W. However, V ⊕ W is the
internal direct sum of its subspaces

V 0 = {(v, 0) : v ∈ V} and W 0 = {(0, w) : w ∈ W},

which are pretty clearly isomorphic to V and W, respectively (see Exer-


cise 1.B.21).

Exercises solutions to starred exercises on page 460

1.B.1 Find a basis for the orthogonal complement of each of the following sets in the indicated inner product space.
∗(a) {(3, 2)} ⊂ R2
(b) {(3, 2), (1, 2)} ⊂ R2
∗(c) {(0, 0, 0)} ⊂ R3
(d) {(1, 1, 1), (2, 1, 0)} ⊂ R3
∗(e) {(1, 2, 3), (1, 1, 1), (3, 2, 1)} ⊂ R3
(f) {(1, 1, 1, 1), (1, 2, 3, 1)} ⊂ R4
∗(g) MS2 ⊂ M2 (the set of symmetric 2 × 2 matrices)
(h) P 1 [−1, 1] ⊂ P 3 [−1, 1]

1.B.2 Compute a basis of each of the four fundamental subspaces of the following matrices and verify that they satisfy the orthogonality relations of Theorem 1.B.7.
∗(a) [ 1 2 ]
     [ 3 6 ]
(b) [ 2  1 3 1 ]
    [ 4 −1 2 3 ]
∗(c) [ 1 2 3 ]
     [ 4 5 7 ]
     [ 7 8 9 ]
(d) [  1 3 4  1 ]
    [  2 1 2  2 ]
    [ −1 2 2 −1 ]

1.B.3 Determine which of the following statements are (a) Show that if hv, w1 i = hv, w2 i = 0 for all v ∈ S,
true and which are false. w1 ∈ W1 and w2 ∈ W2 then W1 = W2 .
(b) Provide an example to show that W1 may not equal
(a) If S1 , S2 ⊆ V are subspaces such that V = S1 ⊕ S2 W2 if we do not have the orthogonality requirement
then V = S2 ⊕ S1 too. of part (a).
∗(b) If V is an inner product space then V ⊥ = {0}.
(c) If A ∈ Mm,n , v ∈ range(A), and w ∈ null(A) then
v · w = 0. 1.B.13 Show that if V is a finite-dimensional inner product
⊥
∗(d) If A ∈ Mm,n , v ∈ range(A), and w ∈ null(A∗ ) then space and B ⊆ V then B⊥ = span(B) .
v · w = 0.
(e) If a vector space V has subspaces S1 , S2 , . . . , Sk satis- ∗∗1.B.14 Show that if V is a finite-dimensional inner prod-
fying span(S1 ∪S2 ∪· · ·∪Sk ) = V and Si ∩S j = {0} uct space and B ⊆ V then (B⊥ )⊥ = span(B).
for all i 6= j then V = S1 ⊕ · · · ⊕ Sk .
∗(f) The set [Hint: Make use of Exercises 1.B.12 and 1.B.13.]
{(e1 , e1 ), (e1 , e2 ), (e1 , e3 ), (e2 , e1 ), (e2 , e2 ), (e2 , e3 )}
is a basis of the external direct sum R2 ⊕ R3 . ∗∗1.B.15 Prove part (b) of Theorem 1.B.7. That is, show
(g) The external direct sum P 2 ⊕ P 3 has dimension 6. that if V and W are finite-dimensional inner product spaces
and T : V → W is a linear transformation then null(T )⊥ =
range(T ∗ ).
∗∗1.B.4 Suppose V is a finite-dimensional vector space
with basis {v1 , v2 , . . . , vk }. Show that
V = span(v1 ) ⊕ span(v2 ) ⊕ · · · ⊕ span(vk ). ∗∗ 1.B.16 Suppose V and W are finite-dimensional in-
ner product spaces and T : V → W is a linear transforma-
tion. Show that the linear transformation S : range(T ∗ ) →
∗∗ 1.B.5 Suppose V is a vector space with subspaces
range(T ) defined by S(v) = T (v) is invertible.
S1 , S2 ⊆ V that have bases B and C, respectively. Show
that if B ∪ C (as a multiset) is linearly independent then [Side note: S is called the restriction of T to range(T ∗ ).]
S1 ∩ S2 = {0}.
[Side note: This completes the proof of Theorem 1.B.2.] ∗∗ 1.B.17 Let P E [−1, 1] and P O [−1, 1] denote the sub-
spaces of even and odd polynomials, respectively, in
P[−1, 1]. Show that (P E [−1, 1])⊥ = P O [−1, 1].
∗∗1.B.6 Complete the proof of Theorem 1.B.4 by show-
ing that condition (b)(ii) implies condition (b)(iii). That is,
show that if V is a finite-dimensional vector space with sub- ∗∗1.B.18 Let C[0, 1] be the inner product space of continu-
spaces S1 , S2 ⊆ V such that dim(V) = dim(S1 ) + dim(S2 ) ous functions on the real interval [0, 1] and let S ⊂ C[0, 1]
and S1 ∩ S2 = {0}, then V = S1 ⊕ S2 . be the subspace

S = f ∈ C[0, 1] : f (0) = 0 .
∗∗1.B.7 Suppose V is a finite-dimensional vector space
(a) Show that S ⊥ = {0}.
with subspaces S1 , S2 ⊆ V. Show that there is at most one
[Hint: If f ∈ S ⊥ , consider the function g ∈ S defined
projection P : V → V with range(P) = S1 and null(P) = S2 .
by g(x) = x f (x).]
[Side note: As long as S1 ⊕ S2 = V, there is actually exactly (b) Show that (S ⊥ )⊥ 6= S.
one such projection, by Exercise 1.B.8.]
[Side note: This result does not contradict Theorem 1.B.6
or Exercise 1.B.14 since C[0, 1] is not finite-dimensional.]
∗∗1.B.8 Suppose A, B ∈ Mm,n have linearly independent
columns and range(A) ⊕ null(B∗ ) = Fm (where F = R or
∗∗1.B.19 Show that if V and W are vector spaces over the
F = C).
same field then their external direct sum V ⊕ W is a vector
(a) Show that B∗ A is invertible. space.
(b) Show that P = A(B∗ A)−1 B∗ is the projection onto
range(A) along null(B∗ ).
∗∗1.B.20 Complete the proof of Theorem 1.B.9 by show-
[Side note: Compare this exercise with Exercise 1.4.30, ing that if B and C are bases of vector spaces V and W,
which covered orthogonal projections.] respectively, then the set B0 ∪ C0 (where B0 and C0 are as
defined in the statement of that theorem) is linearly indepen-
1.B.9 Let S be a subspace of a finite-dimensional inner dent.
product space V. Show that if P is the orthogonal projection
onto S then I − P is the orthogonal projection onto S ⊥ . ∗∗1.B.21 In this exercise, we pin down the details that
show that the internal and external direct sums are isomor-
∗∗1.B.10 Let B be a set of vectors in an inner product phic. Let V and W be vector spaces over the same field.
space V. Show that B⊥ is a subspace of V. (a) Show that the sets
V 0 = {(v, 0) : v ∈ V} and W 0 = {(0, w) : w ∈ W}
1.B.11 Suppose that B and C are sets of vectors in an inner
product space V such that B ⊆ C. Show that C⊥ ⊆ B⊥ . are subspaces of the external direct sum V ⊕ W.
(b) Show that V 0 ∼
= V and W 0 ∼= W.
(c) Show that V ⊕ W = V 0 ⊕ W 0 , where the direct sum
1.B.12 Suppose V is a finite-dimensional inner prod- on the left is external and the one on the right is
uct space and S, W1 , W2 ⊆ V are subspaces for which internal.
V = S ⊕ W1 = S ⊕ W2 .

1.C Extra Topic: The QR Decomposition

Many linear algebraic algorithms and procedures can be rephrased as a certain


way of decomposing a matrix into a product of simpler matrices. For example,
Gaussian elimination is the standard method that is used to solve systems of
linear equations, and it is essentially equivalent to a matrix decomposition that
the reader may be familiar with: the LU decomposition, which says that most
matrices A ∈ Mm,n can be written in the form

A = LU,

where L ∈ Mm is lower triangular and U ∈ Mm,n is upper triangular. The rough


idea is that L encodes the “forward elimination” portion of Gaussian elimination
that gets A into row echelon form, and U encodes the “backward substitution” step that solves for the variables from row echelon form (for details, see [Joh20, Section 2.D], for example).

Some matrices do not have an LU decomposition, but can be written in the form A = PLU, where P ∈ Mm is a permutation matrix that encodes the swap row operations used in Gaussian elimination.

In this section, we explore a matrix decomposition that is essentially equivalent to the Gram–Schmidt process (Theorem 1.4.6) in a very similar sense. Since the Gram–Schmidt process tells us how to convert any basis of Rn or Cn into an orthonormal basis of the same space, the corresponding matrix decomposition analogously provides us with a way of turning any invertible matrix (i.e., a matrix with columns that form a basis of Rn or Cn ) into a unitary
matrix (i.e., a matrix with columns that form a basis of Rn or Cn ) into a unitary
matrix (i.e., a matrix with columns that form an orthonormal basis of Rn or
Cn ).

1.C.1 Statement and Examples


We now state the main theorem of this section, which says that every matrix
can be written as a product of a unitary matrix and an upper triangular matrix.
As indicated earlier, our proof of this fact, as well as our method of actually
computing this matrix decomposition, both come directly from the Gram–
Schmidt process.

Theorem 1.C.1 (QR Decomposition)
Suppose F = R or F = C, and A ∈ Mm,n (F). There exists a unitary matrix U ∈ Mm (F) and an upper triangular matrix T ∈ Mm,n (F) with non-negative real entries on its diagonal such that

A = UT.

We call such a decomposition of A a QR decomposition.

Before proving this theorem, we clarify that an upper triangular matrix T is


one for which ti, j = 0 whenever i > j, even if T is not square. For example,

[ 1 2 3 ]      [ 1 2 ]            [ 1 2 3 4 ]
[ 0 4 5 ] ,    [ 0 3 ] ,   and    [ 0 5 6 7 ]
[ 0 0 6 ]      [ 0 0 ]            [ 0 0 8 9 ]

are all examples of upper triangular matrices.



In particular, this first argument covers the case when A is square and invertible (see Theorem A.1.1).

Proof of Theorem 1.C.1. We start by proving this result in the special case when m ≥ n and A has linearly independent columns. We partition A as a block matrix according to its columns, which we denote by v1 , v2 , . . ., vn ∈ Fm :

A = [ v1 | v2 | · · · | vn ].

To see that A has a QR decomposition, we recall that the Gram–Schmidt pro-


cess (Theorem 1.4.6) tells us that there is a set of vectors {u1 , u2 , . . . , un } with
the property that, for each 1 ≤ j ≤ n, {u1 , u2 , . . . , u j } is an orthonormal basis
of span(v1 , v2 , . . . , v j ). Specifically, we can write v j as a linear combination

v j = t1, j u1 + t2, j u2 + · · · + t j, j u j ,

where

t j, j = kv j − ∑_{i=1}^{j−1} (ui · v j ) ui k,   and   ti, j = ui · v j if i < j, while ti, j = 0 if i > j.

This formula for ti, j follows from rearranging the formula in Theorem 1.4.6 so as to solve for v j (and choosing the inner product to be the dot product). It also follows from Theorem 1.4.5, which tells us that t j, j = u j · v j too.

We then extend {u1 , u2 , . . . , un } to an orthonormal basis {u1 , u2 , . . . , um } of Fm and define U = [ u1 | u2 | · · · | um ], noting that orthonormality of its columns implies that U is unitary. We also define T ∈ Mm,n to be the upper triangular matrix with ti, j as its (i, j)-entry for all 1 ≤ i ≤ m and 1 ≤ j ≤ n (noting that its diagonal entries t j, j are clearly real and non-negative, as required). Block matrix multiplication then shows that the j-th column of UT is

UT e j = [ u1 | u2 | · · · | um ] (t1, j , . . . , t j, j , 0, . . . , 0)T = t1, j u1 + t2, j u2 + · · · + t j, j u j = v j ,

which is the j-th column of A, for all 1 ≤ j ≤ n. It follows that UT = A, which completes the proof in the case when m ≥ n and A has linearly independent columns.

If n > m then it is not possible for the columns of A to be linearly independent. However, if we write A = [ B | C ] with B ∈ Mm having linearly independent columns (in other words, B is invertible) then the previous argument shows that B has a QR decomposition B = UT . We can then write

A = U [ T | U ∗C ],

which is a QR decomposition of A.

The name “QR” decomposition is mostly just a historical artifact—when it was first introduced, the unitary matrix U was called Q and the upper triangular matrix T was called R (for “right triangular”).

We defer the proof of the case when the leftmost min{m, n} columns of A do not form a linearly independent set to Section 2.D.3 (see Theorem 2.D.5 in particular). The rough idea is that we can approximate a QR decomposition of any matrix A as well as we like via QR decompositions of nearby matrices that have their leftmost min{m, n} columns being linearly independent. ∎

• The proof above shows that if A ∈ Mn is invertible then the diagonal entries of T are in fact strictly positive, not just non-negative (since v j ≠ ∑_{i=1}^{j−1} (ui · v j ) ui ). However, if A is not invertible then some diagonal entries of T will in fact equal 0.
• If A ∈ Mn is invertible then its QR decomposition is unique (see Exer-
cise 1.C.5). However, uniqueness fails for non-invertible matrices.
• If A ∈ Mm,n (R) then we can choose the matrices U and T in the QR
decomposition to be real as well.

 
Example 1.C.1 (Computing a QR Decomposition)
Compute the QR decomposition of the matrix

A = [  1 3  3 ]
    [  2 2 −2 ]
    [ −2 2  1 ].

Solution:
As suggested by the proof of Theorem 1.C.1, we can find the QR decomposition of A by applying the Gram–Schmidt process to the columns v1 , v2 , and v3 of A to recursively construct mutually orthogonal vectors w j = v j − ∑_{i=1}^{j−1} (ui · v j )ui and their normalizations u j = w j /kw j k for j = 1, 2, 3:

 j   wj             uj              u j · v1   u j · v2   u j · v3
 1   (1, 2, −2)     (1, 2, −2)/3    3          1          −1
 2   (8, 4, 8)/3    (2, 1, 2)/3     –          4          2
 3   (2, −2, −1)    (2, −2, −1)/3   –          –          3

The “lower-triangular” inner products like u2 · v1 exist, but are irrelevant for the Gram–Schmidt process and QR decomposition. Also, the “diagonal” inner products come for free since u j · v j = kw j k, and we already computed this norm when computing u j = w j /kw j k.

It follows that A has QR decomposition A = UT , where

U = [ u1 | u2 | u3 ] = (1/3) [  1 2  2 ]
                             [  2 1 −2 ]
                             [ −2 2 −1 ]
and
T = [ u1 · v1  u1 · v2  u1 · v3 ]   [ 3 1 −1 ]
    [ 0        u2 · v2  u2 · v3 ] = [ 0 4  2 ]
    [ 0        0        u3 · v3 ]   [ 0 0  3 ].

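If you just want the numbers, NumPy's built-in QR routine reproduces this decomposition up to a sign convention. The sketch below (not from the text) flips signs so that the diagonal of T is non-negative, as Theorem 1.C.1 requires:

```python
# A minimal NumPy sketch (not from the text): compute a QR decomposition of the
# matrix from Example 1.C.1 and normalize it so T has non-negative diagonal.
import numpy as np

A = np.array([[ 1, 3,  3],
              [ 2, 2, -2],
              [-2, 2,  1]], dtype=float)

U, T = np.linalg.qr(A)          # NumPy calls these Q and R

# np.linalg.qr does not fix signs, so flip columns of U (and rows of T)
# wherever a diagonal entry of T is negative.
signs = np.sign(np.diag(T))
signs[signs == 0] = 1
U, T = U * signs, (T.T * signs).T

assert np.allclose(U @ T, A)                 # A = UT
assert np.allclose(U.T @ U, np.eye(3))       # U is unitary (orthogonal here)
assert np.all(np.diag(T) >= 0)               # non-negative diagonal
```
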
 
Example 1.C.2 (Computing a Rectangular QR Decomposition)
Compute a QR decomposition of the matrix

A = [  3  0  1 2 ]
    [ −2 −1 −3 2 ]
    [ −6 −2 −2 5 ].

Solution:
Since A has more columns than rows, its columns cannot possibly form a linearly independent set. We thus just apply the Gram–Schmidt process to its leftmost 3 columns v1 , v2 , and v3 , while just computing dot products with its 4th column v4 for later use:

 j   wj                 uj                u j · v1   u j · v2   u j · v3   u j · v4
 1   (3, −2, −6)        (3, −2, −6)/7     7          2          3          −4
 2   (−6, −3, −2)/7     (−6, −3, −2)/7    –          1          1          −4
 3   (4, −12, 6)/7      (2, −6, 3)/7      –          –          2          1

We showed in the proof of Theorem 1.C.1 that the 4th column of T is U ∗ v4 , whose entries are exactly the dot products in the final column here: u j · v4 for j = 1, 2, 3.

It follows that A has QR decomposition A = UT , where

U = [ u1 | u2 | u3 ] = (1/7) [  3 −6  2 ]
                             [ −2 −3 −6 ]
                             [ −6 −2  3 ]
and
T = [ u1 · v1  u1 · v2  u1 · v3  u1 · v4 ]   [ 7 2 3 −4 ]
    [ 0        u2 · v2  u2 · v3  u2 · v4 ] = [ 0 1 1 −4 ]
    [ 0        0        u3 · v3  u3 · v4 ]   [ 0 0 2  1 ].

 
Example 1.C.3 (Computing a Tall/Thin QR Decomposition)
Compute a QR decomposition of the matrix

A = [ −1 −3  1 ]
    [ −2  2 −2 ]
    [  2 −2  0 ]
    [  4  0 −2 ].

Solution:
Since A has more rows than columns, we can start by applying the Gram–Schmidt process to its columns, but this will only get us the leftmost 3 columns of the unitary matrix U:

 j   wj                    uj                   u j · v1   u j · v2   u j · v3
 1   (−1, −2, 2, 4)        (−1, −2, 2, 4)/5     5          −1         −1
 2   (−16, 8, −8, 4)/5     (−4, 2, −2, 1)/5     –          4          −2
 3   (−4, −8, −2, −4)/5    (−2, −4, −1, −2)/5   –          –          2

To find its 4th column, we just extend its first 3 columns {u1 , u2 , u3 } to an orthonormal basis of R4 . Up to sign, the unique unit vector u4 that works as the 4th column of U is u4 = (2, −1, −4, 2)/5, so it follows that A has QR decomposition A = UT , where

U = [ u1 | u2 | u3 | u4 ] = (1/5) [ −1 −4 −2  2 ]
                                  [ −2  2 −4 −1 ]
                                  [  2 −2 −1 −4 ]
                                  [  4  1 −2  2 ]
and
T = [ u1 · v1  u1 · v2  u1 · v3 ]   [ 5 −1 −1 ]
    [ 0        u2 · v2  u2 · v3 ] = [ 0  4 −2 ]
    [ 0        0        u3 · v3 ]   [ 0  0  2 ]
    [ 0        0        0       ]   [ 0  0  0 ].

We showed that every mutually orthogonal set of unit vectors can be extended to an orthonormal basis in Exercise 1.4.20. To do so, just add a vector not in the span of the current set, apply Gram–Schmidt, and repeat.

Remark 1.C.1 (Computing QR Decompositions)
The method of computing the QR decomposition that we presented here, based on the Gram–Schmidt process, is typically not actually used in practice. The reason for this is that the Gram–Schmidt process is numer-
ically unstable. If a set of vectors is “close” to linearly dependent then
changing those vectors even slightly can drastically change the resulting
orthonormal basis, and thus small errors in the entries of A can lead to a
wildly incorrect QR decomposition.
Numerically stable methods for computing the QR decomposition (and

numerical methods for linear algebraic tasks in general) are outside of the
scope of this book, so the interested reader is directed to a book like [TB97]
for their treatment.

1.C.2 Consequences and Applications


One of the primary uses of the QR decomposition is as a method for solving
systems of linear equations more quickly than we otherwise could. To see how
this works, suppose we have already computed a QR decomposition A = UT
of the coefficient matrix of the linear system Ax = b. Then UT x = b, which is
a linear system that we can solve via the following two-step procedure:
• First, solve the linear system Uy = b for the vector y by setting y = U ∗ b.
• Next, solve the linear system T x = y for the vector x. This linear system
is straightforward to solve via backward elimination due to the triangular
shape of T .
Once we have obtained the vector x via this procedure, it is the case that

Ax = UT x = U(T x) = Uy = b,

so x is indeed a solution of the original linear system, as desired.
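
A minimal SciPy sketch (not from the text) of this two-step procedure, assuming a QR decomposition A = UT has already been computed for a hypothetical coefficient matrix:

```python
# A minimal SciPy sketch (not from the text): solve Ax = b given a
# precomputed QR decomposition A = UT (U unitary, T upper triangular).
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([5.0, 6.0])

U, T = np.linalg.qr(A)            # pretend this was computed ahead of time

y = U.conj().T @ b                # step 1: y = U* b solves Uy = b
x = solve_triangular(T, y)        # step 2: back-substitution solves Tx = y

assert np.allclose(A @ x, b)
```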

Example 1.C.4 (Solving Linear Systems via a QR Decomposition)
Use the QR decomposition to find all solutions of the linear system

[  3  0  1 2 ]   [ w ]   [ 1 ]
[ −2 −1 −3 2 ] · [ x ] = [ 0 ]
[ −6 −2 −2 5 ]   [ y ]   [ 4 ].
                 [ z ]

Solution:
We constructed the following QR decomposition A = UT of the coefficient matrix A in Example 1.C.2:

U = (1/7) [  3 −6  2 ]          T = [ 7 2 3 −4 ]
          [ −2 −3 −6 ]   and        [ 0 1 1 −4 ]
          [ −6 −2  3 ]              [ 0 0 2  1 ].

If b = (1, 0, 4) then setting y = U ∗ b gives

y = (1/7) [  3 −2 −6 ] [ 1 ]   [ −3 ]
          [ −6 −3 −2 ] [ 0 ] = [ −2 ]
          [  2 −6  3 ] [ 4 ]   [  2 ].

See Appendix A.1.1 if you need a refresher on linear systems.

Next, we solve the upper triangular system T x = y:

[ 7 2 3 −4 | −3 ]   R1 /7     [ 1 2/7 3/7 −4/7 | −3/7 ]
[ 0 1 1 −4 | −2 ]   R3 /2     [ 0  1   1   −4  |  −2  ]
[ 0 0 2  1 |  2 ]   −−−→      [ 0  0   1  1/2  |   1  ]

                R1 − (3/7)R3  [ 1 2/7 0 −11/14 | −6/7 ]
                R2 − R3       [ 0  1  0  −9/2  |  −3  ]
                −−−−−→        [ 0  0  1   1/2  |   1  ]

                R1 − (2/7)R2  [ 1 0 0  1/2 |  0 ]
                −−−−−→        [ 0 1 0 −9/2 | −3 ]
                              [ 0 0 1  1/2 |  1 ] .

It follows that z is a free variable, and w, x, and y are leading variables with w = −z/2, x = −3 + 9z/2, and y = 1 − z/2. The solutions of this linear system (as well as the original linear system) are therefore the vectors of the form (w, x, y, z) = (0, −3, 1, 0) + z(−1, 9, −1, 2)/2.

Remark 1.C.2 (Multiple Methods for Solving Repeated Linear Systems)
While solving a linear system via the QR decomposition is simpler than solving it directly via Gaussian elimination, actually computing the QR decomposition in the first place takes just as long as solving the linear system directly. For this reason, the QR decomposition is typically only used in this context to solve multiple linear systems, each of which has the same coefficient matrix (which we can pre-compute a QR decomposition of) but different right-hand-side vectors.

There are two other standard methods for solving repeated linear systems of the form Ax j = b j ( j = 1, 2, 3, . . .) that it is worth comparing to the QR decomposition:
• We could pre-compute A−1 and then set x j = A−1 b j for each j. This method has the advantage of being conceptually simple, but it is slower and less numerically stable than the QR decomposition.
• We could pre-compute an LU decomposition A = LU, where L is lower triangular and U is upper triangular, and then solve the pair of triangular linear systems Ly j = b j and Ux j = y j for each j. This method is roughly twice as quick as the QR decomposition, but its numerical stability lies somewhere between that of the QR decomposition and the method based on A−1 described above.

If A ∈ Mn then all three of these methods take O(n3 ) operations to do the pre-computation and then O(n2 ) operations to solve a linear system, compared to the O(n3 ) operations needed to solve a linear system directly via Gaussian elimination.

Again, justification of the above claims is outside of the scope of this book, so we direct the interested reader to a book on numerical linear algebra like [TB97].

Once we have the QR decomposition of a (square) matrix, we can also use


it to quickly compute the absolute value of its determinant, since determinants
of unitary and triangular matrices are both easy to deal with.

Theorem 1.C.2 (Determinant via QR Decomposition)
If A ∈ Mn has QR decomposition A = UT with U ∈ Mn unitary and T ∈ Mn upper triangular, then

| det(A)| = t1,1 · t2,2 · · · tn,n .

Proof of Theorem 1.C.2. We just string together three facts about the determinant that we already know:
• The determinant is multiplicative, so | det(A)| = | det(U)| | det(T )|,
• U is unitary, so Exercise 1.4.11 tells us that | det(U)| = 1, and
• T is upper triangular, so | det(T )| = |t1,1 · t2,2 · · · tn,n | = t1,1 · t2,2 · · · tn,n , with the final equality following from the fact that the diagonal entries of T are non-negative. ∎

We review these properties of determinants in Appendix A.1.5.

For example, we saw in Example 1.C.1 that the matrix

A = [  1 3  3 ]
    [  2 2 −2 ]
    [ −2 2  1 ]

has QR decomposition A = UT with

U = (1/3) [  1 2  2 ]          T = [ 3 1 −1 ]
          [  2 1 −2 ]   and        [ 0 4  2 ]
          [ −2 2 −1 ]              [ 0 0  3 ].

It follows that | det(A)| is the product of the diagonal entries of T : | det(A)| = 3 · 4 · 3 = 36.

In fact, det(A) = 36.

Exercises solutions to starred exercises on page 462

1.C.1 Compute the QR Decomposition of each of the


following matrices. 1.C.2 Solve each of the following linear systems Ax = b
" # " # by making use of the provided QR decomposition A = UT :
∗(a) 3 4 (b) 15 −17 −1 " # " # " #
1 1 1 1 2 1
4 2 8 −6 4 ∗(a) b = ,U=√ ,T=
    2 2 −1 1 0 3
∗(c) 6 3 (d) 0 0 −1 " # " # " #
    3 1 3 −4 2 3 1
3 2 4 −1 −2 (b) b = ,U= ,T=
−1 5 4 3 0 1 −1
−6 −2 −3 −3 −1    
   
∗(e) 2 1 1 4 (f) −11 4 −2 −1 1 −3 6 2
    1 

1 1 0 1

 10 −5 −3 0 ∗(c) b = 2 , U = −2 −3 6
  7
4 2 1 −2 −2 −2 −4 −2 0 6 2 3
 
2 2 2 2 3 −1 2
 
T = 0 2 1
0 0 1
   
1 −15 30 10
  1  
(d) b =  1  , U =  18 −1 30 
35
−2 26 18 −15
 
2 1 0 1
 
T = 0 1 3 −2
0 0 2 2

1.C.3 Determine which of the following statements are 1.C.6 Suppose that A ∈ M2 is the (non-invertible) matrix
true and which are false. with QR decomposition A = UT , where
" # " #
∗(a) Every matrix A ∈ Mm,n (R) has a QR decomposition 1 3 4 0 0
A = UT with U and T real. U= and T = .
5 4 −3 0 1
(b) If A ∈ Mn has QR decomposition A = UT then
det(A) = t1,1 · t2,2 · · ·tn,n . Find another QR decomposition of A.
∗(c) If A ∈ Mm,n has QR decomposition A = UT then
range(A) = range(T ). [Side note: Contrast this with the fact from Exercise 1.C.5
(d) If A ∈ Mn is invertible and has QR decomposition that QR decompositions of invertible matrices are unique.]
A = UT then, for each 1 ≤ j ≤ n, the span of the left-
most j columns of A equals the span of the leftmost ∗1.C.7 In this exercise, we investigate when the QR de-
j columns of U. composition of a rectangular matrix A ∈ Mm,n is unique.
(a) Show that if n ≥ m and the left m × m block of A is
1.C.4 Suppose that A ∈ Mn is an upper triangular matrix. invertible then its QR decomposition is unique.
(a) Show that A is invertible if and only if all of its diag- (b) Provide an example that shows that if n < m then the
onal entries are non-zero. QR decomposition of A may not be unique, even if
(b) Show that if A is invertible then A−1 is also upper its top n × n block is invertible.
triangular. [Hint: First show that if b has its last k
entries equal to 0 (for some k) then the solution x to 1.C.8 Suppose A ∈ Mm,n . In this exercise, we demonstrate
Ax = b also has its last k entries equal to 0.] the existence of several variants of the QR decomposition.
(c) Show that if A is invertible then the diagonal entries
(a) Show that there exists a unitary matrix U and a lower
of A−1 are the reciprocals of the diagonal entries of
triangular matrix S such that A = US.
A, in the same order.
[Hint: What happens to a matrix’s QR decomposition
if we swap its rows and/or columns?]
∗∗1.C.5 Show that if A ∈ Mn is invertible then its QR (b) Show that there exists a unitary matrix U and an
decomposition is unique. upper triangular matrix T such that A = TU.
[Hint: Exercise 1.4.12 might help.] (c) Show that there exists a unitary matrix U and a lower
triangular matrix S such that A = SU.

1.D Extra Topic: Norms and Isometries

We now investigate how we can measure the length of vectors in arbitrary


vector spaces. We already know how to do this in Rn —the length of a vector
v ∈ Rn is

kvk = √(v · v) = √(v1^2 + v2^2 + · · · + vn^2 ).
Slightly more generally, we saw in Section 1.3.4 that we can use the norm
induced by the inner product to measure length in any inner product space:

kvk = √(hv, vi).

However, it is sometimes preferable to use a measure of size that does not


rely on us first defining an underlying inner product. We refer to such functions
as norms, and we simply define them to be the functions that satisfy the usual
properties that the length on Rn or the norm induced by an inner product satisfy.

Definition 1.D.1 (Norm of a Vector)
Suppose that F = R or F = C and that V is a vector space over F. Then a norm on V is a function k · k : V → R such that the following three
properties hold for all c ∈ F and all v, w ∈ V:
a) kcvk = |c|kvk (absolute homogeneity)
b) kv + wk ≤ kvk + kwk (triangle inequality)
c) kvk ≥ 0, with equality if and only if v = 0 (positive definiteness)

The motivation for why each of the above defining properties should hold
for any reasonable measure of “length” or “size” is hopefully somewhat clear—
(a) if we multiply a vector by a scalar then its length is scaled by the same amount,
(b) the shortest path between two points is the straight line connecting them,
and (c) we do not want lengths to be negative.
Every norm induced by an inner product is indeed a norm, as was estab-
lished by Theorems 1.3.7 (which established properties (a) and (c)) and 1.3.9
(which established property (b)). However, there are also many different norms
out there that are not induced by any inner product, both on Rn and on other
vector spaces. We now present several examples.

Example 1.D.1 (The 1-Norm (or “Taxicab” Norm))
The 1-norm on Cn is the function defined by

kvk1 = |v1 | + |v2 | + · · · + |vn | for all v ∈ Cn .

Show that the 1-norm is indeed a norm.


Solution:
We have to check the three defining properties of norms
(Definition 1.D.1). If v, w ∈ Cn and c ∈ C then:
a) kcvk1 = |cv1 | + · · · + |cvn | = |c|(|v1 | + · · · + |vn |) = |c|kvk1 .
b) First, we note that |v + w| ≤ |v| + |w| for all v, w ∈ C (this statement
is equivalent to the usual triangle inequality on R2 ). We then have

kv + wk1 = |v1 + w1 | + · · · + |vn + wn |


≤ (|v1 | + |w1 |) + · · · + (|vn | + |wn |)
= (|v1 | + · · · + |vn |) + (|w1 | + · · · + |wn |)
= kvk1 + kwk1 .

c) The fact that kvk1 = |v1 |+· · ·+|vn | ≥ 0 is clear. To see that kvk1 = 0
if and only if v = 0, we similarly just notice that |v1 | + · · · + |vn | = 0
if and only if v1 = · · · = vn = 0 (indeed, if any v j were non-zero
then |v j | > 0, so |v1 | + · · · + |vn | > 0 too).

It is worth observing that the 1-norm measures the amount of distance


that must be traveled to go from a vector’s tail to its head when moving in
the direction of the standard basis vectors. For this reason, it is sometimes
called the taxicab norm: we imagine a square grid on R2 as the streets
along which a taxi can travel to get from the tail to the head of a vector v,
and this norm measures how far the taxi must drive, as illustrated below
for the vector v = (4, 3):
[Figure: the vector v = (4, 3) in the plane, with kvk = √(4^2 + 3^2 ) = 5 and kvk1 = 4 + 3 = 7.]

In analogy with the 1-norm from the above example, we refer to the usual
vector length on Cn , defined by

kvk2 = √(|v1 |^2 + |v2 |^2 + · · · + |vn |^2 ),

as the 2-norm in this section, in reference to the exponent that appears in the
terms being summed. We sometimes denote it by k · k2 instead of k · k to avoid
confusion with the notation used for norms in general.

Example 1.D.2 (The ∞-Norm (or “Max” Norm))
The ∞-norm (or max norm) on Cn is the function defined by

kvk∞ = max_{1≤ j≤n} |v j | for all v ∈ Cn .

Show that the ∞-norm is indeed a norm.


Solution:
Again, we have to check the three properties that define norms. If
v, w ∈ Cn and c ∈ C then:
 
a) kcvk∞ = max1≤ j≤n |cv j | = |c| max1≤ j≤n |v j | = |c|kvk∞ .
b) By again making use of the fact that |v + w| ≤ |v| + |w| for all
v, w ∈ C, we see that

kv + wk∞ = max_{1≤ j≤n} |v j + w j |
         ≤ max_{1≤ j≤n} (|v j | + |w j |)
         ≤ max_{1≤ j≤n} |v j | + max_{1≤ j≤n} |w j |
         = kvk∞ + kwk∞ .

The second inequality comes from two maximizations allowing more freedom (and thus a higher maximum) than one maximization. Equality holds if both maximums are attained at the same subscript j.

c) The fact that kvk∞ ≥ 0 is clear. Similarly, kvk∞ = 0 if and only if the largest entry of v equals zero, if and only if |v j | = 0 for all j, if and only if v = 0.
One useful way of visualizing norms on Rn is to draw their unit ball—the
set of vectors v ∈ Rn satisfying kvk ≤ 1. For the 2-norm, the unit ball is exactly
the circle (or sphere, or hypersphere...) with radius 1 centered at the origin,
together with its interior. For the ∞-norm, the unit ball is the set of vectors
with the property that |v j | ≤ 1 for all j, which is exactly the square (or cube, or
hypercube...) that circumscribes the unit circle. Similarly, the unit ball of the
1-norm is the diamond inscribed within that unit circle. These unit balls are
illustrated in R2 in Figure 1.27.
Even though there are lots of different norms that we can construct on any
vector space, in the finite-dimensional case it turns out that they cannot be “too”
different. What we mean by this is that there is at most a multiplicative constant
by which any two norms differ—a property called equivalence of norms.

Theorem 1.D.1 (Equivalence of Norms)
Let k · ka and k · kb be norms on a finite-dimensional vector space V. There exist real scalars c,C > 0 such that

ckvka ≤ kvkb ≤ Ckvka for all v ∈ V.

It does not matter which of k · ka or k · kb is in the middle in this theorem. The given inequality is equivalent to (1/C)kvkb ≤ kvka ≤ (1/c)kvkb .

The proof of the above theorem is rather technical and involved, so we defer it to Appendix B.1. However, for certain choices of norms it is straightforward
1.D Extra Topic: Norms and Isometries 149

x
k · k1
k · k2
k · k∞

Figure 1.27: The unit balls for the 1-, 2-, and ∞-norms on R2 .

to explicitly find constants c and C that work. For example, we can directly
check that for any vector v ∈ Cn we have

kvk∞ = max |v j | ≤ |v1 | + |v2 | + · · · + |vn | = kvk1 ,
1≤ j≤n

where the inequality


holds because there is some particular index i with |vi | =
max1≤ j≤n |v j | , and the sum on the right contains this |vi | plus other non-
negative terms.
To establish a bound between kvk∞ and kvk1 that goes in the other direction,
we similarly compute
The middle
inequality just says n n   n
that each |vi | is no kvk1 = ∑ |vi | ≤ ∑ max |v j | = ∑ kvk∞ = nkvk∞ .
larger than the i=1 i=1 1≤ j≤n i=1
largest |vi |.
so kvk1 ≤ nkvk∞ (i.e., in the notation of Theorem 1.D.1, if we have k·ka = k·k∞
and k · kb = k · k1 , then we can choose c = 1 and C = n).
Geometrically, the fact that all norms on a finite-dimensional vector space
are equivalent just means that their unit balls can be stretched or shrunk to
contain each other, and the scalars c and C from Theorem 1.D.1 tell us what
factor they must be stretched by to do so. For example, the equivalence of the
1- and ∞-norms is illustrated in Figure 1.28.
However, it is worth observing that Theorem 1.D.1 does not hold in infinite-
dimensional vector spaces. That is, in infinite-dimensional vector spaces, we
can construct norms k · ka and k · kb with the property that the ratio kvkb /kvka
can be made as large as we like by suitably choosing v (see Exercise 1.D.26),
so there does not exist a constant C for which kvkb ≤ Ckvka . We call such
norms inequivalent.

1.D.1 The p-Norms


We now investigate a family of norms that generalize the 1-, 2-, and ∞-norms
on Cn in a fairly straightforward way:
150 Chapter 1. Vector Spaces

Be careful—the unit
ball of the norm
2k · k∞ is half as large
as that of k · k∞ , not
twice as big. After
all, 2k · k∞ ≤ 1 if and
only if k · k∞ ≤ 1/2.

Figure 1.28: An illustration in R2 of the fact that (a) kvk∞ ≤ kvk1 ≤ 2kvk∞ and
(b) 12 kvk1 ≤ kvk∞ ≤ kvk1 . In particular, one norm is lower bounded by another
one if and only if the unit ball of the former norm is contained within the unit ball
of the latter.

Definition 1.D.2 If p ≥ 1 is a real number, then the p-norm on Cn is defined by


The p-Norm !1/p
of a Vector n
p
for all v ∈ Cn .
def
kvk p = ∑ |v j |
j=1

It should be reasonably clear that if p = 1 then the p-norm is exactly the


“1-norm” from Example 1.D.1 (which explains why we referred to it as the
1-norm in the first place), and if p = 2 then it is the standard vector length.
Furthermore, it is also the case that

lim kvk p = max |v j | for all v ∈ Cn ,
p→∞ 1≤ j≤n

which is exactly the ∞-norm of v (and thus explains why we called it the
∞-norm in the first place). To informally see why this limit holds, we just
notice that increasing the exponent p places more and more importance on the
largest component of v compared to the others (this is proved more precisely in
Exercise 1.D.8).
To verify that the p-norm is indeed a norm, we have to check the three
properties of norms from Definition 1.D.1. Properties (a) and (c) (absolute
It is commonly the homogeneity and positive definiteness) are straightforward enough:
case that the
triangle inequality is a) If v ∈ Cn and c ∈ C then
the difficult property
of a norm to verify, kcvk p = |cv1 | p +· · ·+|cvn | p )1/p = |c| |v1 | p +· · ·+|vn | p )1/p = |c|kvk p .
while the other two
properties (absolute c) The fact that kvk p ≥ 0 is clear. To see that kvk p = 0 if and only if v = 0,
homogeneity and
positive definiteness) we just notice that kvk p = 0 if and only if |v1 | p + · · · + |vn | p = 0, if and
are much simpler. only if v1 = · · · = vn = 0, if and only if v = 0.
Proving that the triangle inequality holds for the p-norm is much more
involved, so we state it separately as a theorem. This inequality is impor-
tant enough and useful enough that it is given its own name—it is called
Minkowski’s inequality.
1.D Extra Topic: Norms and Isometries 151

Theorem 1.D.2 If 1 ≤ p ≤ ∞ then kv + wk p ≤ kvk p + kwk p for all v, w ∈ Cn .


Minkowski’s
Inequality
To see that f is Proof. First, consider the function f : [0, ∞) → R defined by f (x) = x p . Stan-
convex, we notice dard calculus techniques show that f is convex (sometimes called “concave up”
that its second in introductory calculus courses) whenever x ≥ 0—any line connecting two
derivative is
f 00 (x) = p(p − 1)x p−2 , points on the graph of f lies above its graph (see Figure 1.29).
which is
non-negative as y
long as p ≥ 1 and
x ≥ 0 (see f (x) = x p
Theorem A.5.2). 5

4
x2p
3
(1 − t)x1p + tx2p ≥
¢p 2
(1 − t)x1 + tx2
1

For an introduction
to convex functions,
x1p x
1 2
see Appendix A.5.2. x1 x2
(1 − t)x1 + tx2

Figure 1.29: The function f (x) = x p (pictured here with p = 2.3) is convex on the
interval [0, ∞), so any line segment between two points on its graph lies above the
graph itself.

Algebraically, this means that


p
(1 − t)x1 + tx2 ≤ (1 − t)x1p + tx2p for all x1 , x2 ≥ 0, 0 ≤ t ≤ 1. (1.D.1)
Now suppose v, w ∈ Cn are non-zero vectors (if either one of them is the zero
vector, Minkowski’s inequality is trivial). Then we can write v = kvk p x and
w = kwk p y, where x and y are scaled so that kxk p = kyk p = 1. Then for any
Writing vectors as 0 ≤ t ≤ 1, we have
scaled unit vectors is
a common n
technique when (1 − t)x + ty p = ∑ |(1 − t)x j + ty j | p (definition of k · k p )
p
proving inequalities j=1
involving norms. n p
≤ ∑ (1 − t)|x j | + t|y j | (triangle inequality on C)
j=1
n 
≤ ∑ (1 − t)|x j | p + t|y j | p (Equation (1.D.1))
j=1

= (1 − t)kxk pp + tkyk pp (definition of k · k p )


= 1. (kxk p = kyk p = 1)
In particular, if we choose t = kwk p /(kvk p + kwk p ) then 1 − t =
kvk p /(kvk p + kwk p ), and the quantity above simplifies to

kvk p x + kwk p y p kv + wk pp
p p
k(1 − t)x + tyk p = p
= .
(kvk p + kwk p ) (kvk p + kwk p ) p
152 Chapter 1. Vector Spaces

Since this quantity is no larger than 1, multiplying through by (kvk p + kwk p ) p


tells us that
p
kv + wk pp ≤ kvk p + kwk p .

Taking the p-th root of both sides of this inequality gives us exactly Minkowski’s
inequality and thus completes the proof. 
Equivalence and Hölder’s Inequality
Now that we know that p-norms are indeed norms, it is instructive to try to draw
their unit balls, much like we did in Figure 1.27 for the 1-, 2-, and ∞-norms.
The following theorem shows that the p-norm can only decrease as p increases,
which will help us draw the unit balls shortly.

Theorem 1.D.3 If 1 ≤ p ≤ q ≤ ∞ then kvkq ≤ kvk p for all v ∈ Cn .


Monotonicity of
p-Norms
Proof. If v = 0 then this inequality is trivial, so suppose v 6= 0 and consider
the vector w = v/kvkq . Then kwkq = 1, so |w j | ≤ 1 for all j. It follows that
Again, notice that |w j | p ≥ |w j |q for all j as well, so
proving this theorem
was made easier by !1/p !1/p
first rescaling the n n
p q q/p
vector to have kwk p = ∑ |w j | ≥ ∑ |w j | = kwkq = 1.
length 1. j=1 j=1

Since w = v/kvkq , this implies that kwk p = kvk p /kvkq ≥ 1, so kvk p ≥ kvkq ,
as desired.
The argument above covers the case when both p and q are finite. The case
when q = ∞ is proved in Exercise 1.D.7. 
By recalling that larger norms have smaller unit balls, we can interpret
Theorem 1.D.3 as saying that the unit ball of the p-norm is contained within
The reason for the the unit ball of the q-norm whenever 1 ≤ p ≤ q ≤ ∞. When we combine this
unit ball inclusion is
with the fact that we already know what the unit balls look like when p = 1, 2,
that kvk p ≤ 1 implies
kvkq ≤ kvk p = 1 too. or ∞ (refer back to Figure 1.27), we get a pretty good idea of what they look
Generally (even like for all values of p (see Figure 1.30). In particular, the sides of the unit ball
beyond just p-norms), gradually “bulge out” as p increases from 1 to ∞.
larger norms have
smaller unit balls.

p=1 p = 1.2 p = 1.5 p=2

p=2 p=3 p=6 p=∞

Figure 1.30: The unit ball of the p-norm on R2 for several values of 1 ≤ p ≤ ∞.
1.D Extra Topic: Norms and Isometries 153

The inequalities of Theorem 1.D.3 can be thought of as one half of the


equivalence (refer back to Theorem 1.D.1) between the norms k · k p and k · kq .
The other inequality in this norm equivalence is trickier to pin down, so we
start by solving this problem just for the 1-, 2-, and ∞-norms.

Theorem 1.D.4 If v ∈ Cn then kvk1 ≤ nkvk2 ≤ nkvk∞ .
1-, 2-, and ∞-Norm
Inequalities
Proof of Theorem 1.C.1. For the left inequality, we start by defining two vec-
tors x, y ∈ Rn :

x = (1, 1, . . . , 1) and y = (|v1 |, |v2 |, . . . , |vn |).

By the Cauchy–Schwarz inequality, we know that |x · y| ≤ kxk2 kyk2 . However,


plugging everything into the relevant definitions then shows that
n
|x · y| = ∑ 1 · |v j | = kvk1 ,
j=1

and s s
n √ n
kxk2 = ∑ 12 = n and kyk2 = ∑ |v j |2 = kvk2 .
j=1 j=1

It follows that kvk1 = |x · y| ≤ kxk2 kyk2 = nkvk2 , as claimed.
To prove the second inequality, we notice that
Again, the inequality s s s
here follows simply n n  2 n √
because each |vi | is
no larger than the
kvk2 = ∑ |vi |2 ≤ ∑ max |v j |
1≤ j≤n
= ∑ kvk2∞ = nkvk∞ . 
i=1 i=1 i=1
largest |vi |.
Before we can prove an analogous result for arbitrary p- and q-norms, we
first need one more technical helper theorem. Just like Minkowski’s inequality
(Theorem 1.D.2) generalizes the triangle inequality from the 2-norm to arbitrary
p-norms, the following theorem generalizes the Cauchy–Schwarz inequality
from the 2-norm to arbitrary p-norms.

Theorem 1.D.5 Let 1 ≤ p, q ≤ ∞ be such that 1/p + 1/q = 1. Then


Hölder’s Inequality
|v · w| ≤ kvk p kwkq for all v, w ∈ Cn .

Before proving this theorem, it is worth focusing a bit on the somewhat


strange relationship between p and q that it requires. First, notice that if p =
q = 2 then 1/p + 1/q = 1, so this is how we get the Cauchy–Schwarz inequality
as a special case of Hölder’s inequality. On the other hand, if one of p or q
is smaller than 2 then the other one must be correspondingly larger than 2 in
order for 1/p + 1/q = 1 to hold—as p decreases from 2 to 1, q increases from
2 to ∞ (see Figure 1.31).
We could explicitly solve for q in terms of p and get q = p/(p − 1), but
doing so makes the symmetry between p and q less clear (e.g., it is also the
case that p = q/(q − 1)), so we do not usually do so. It is also worth noting
that if p = ∞ or q = ∞ then we take the convention that 1/∞ = 0 in this setting,
so that if p = ∞ then q = 1 and if q = ∞ then p = 1.
154 Chapter 1. Vector Spaces
C.–S. ineq.
p=q=2
(p, q) = (3/2, 3)
We call a pair of (p, q) = (5/4, 5)
numbers p and q (p, q) = (10/9, 10)
conjugate if
1/p + 1/q = 1. We
also call the pairs 1 2 3 4 5 6 7 8 9 10
(p, q) = (1, ∞) and
(p, q) = (∞, 1) Figure 1.31: Some pairs of numbers (p, q) for which 1/p + 1/q = 1 and thus Hölder’s
conjugate. inequality (Theorem 1.D.5) applies to them.

Inequality (1.D.2) is Proof of Theorem 1.D.5. Before proving the desired inequality involving vec-
called Young’s
inequality, and it tors, we first prove the following inequality involving real numbers x, y ≥ 0:
depends on the fact
that 1/p + 1/q = 1. x p yq
xy ≤ + . (1.D.2)
p q
To see why this inequality holds, first notice that it is trivial if x = 0 or
y = 0, so we can assume from now on that x, y > 0. Consider the function
f : (0, ∞) → R defined by f (x) = ln(x). Standard calculus techniques show
ln(x) refers to the that f is concave (sometimes called “concave down” in introductory calculus
natural logarithm courses) whenever x > 0—any line connecting two points on the graph of f
(i.e., the base-e
logarithm) of x.
lies below its graph (see Figure 1.32).
y
2
ln(x2 ) f (x) = ln(x)
To see that f is ¢
concave, we notice ln (1 − t)x1 + tx2 ≥
that its second
derivative is
(1 − t) ln(x1 ) + t ln(x2 )
x
f 00 (x) = −1/x2 , which x1 1 2 3 4 x2 5
is negative as long
as x > 0 (see -1 (1 − t)x1 + tx2
Theorem A.5.2). ln(x1 )

-2

Figure 1.32: The function f (x) = ln(x) is concave on the interval (0, ∞), so any line
segment between two points on its graph lies below the graph itself.

Algebraically, this tells us that



ln (1 − t)x1 + tx2 ≥ (1 − t) ln(x1 ) + t ln(x2 ) for all x1 , x2 > 0, 0 ≤ t ≤ 1.
In particular, if we choose x1 = x p , x2 = yq , and t = 1/q then 1 −t = 1 − 1/q =
1/p and the above inequality becomes
Recall that  p 
ln(x p ) = p ln(x), so x yq ln(x p ) ln(yq )
ln + ≥ + = ln(x) + ln(y) = ln(xy).
ln(x p )/p = ln(x). p q p q
Exponentiating both sides of this inequality (i.e., raising e to the power of the
left- and right-hand-sides) then gives us exactly Inequality (1.D.2), which we
were trying to prove.
To now prove Hölder’s inequality, we first observe that it suffices to prove
the case when kvk p = kwkq = 1, since multiplying either v or w by a scalar
has no effect on whether or not the inequality v · w ≤ kvk p kwkq holds (both
sides of it just get multiplied by that same scalar). To see that |v · w| ≤ 1 in this
1.D Extra Topic: Norms and Isometries 155

case, and thus complete the proof, we compute



n n

|v · w| = ∑ v j w j ≤ ∑ |v j w j | (triangle inequality on C)
j=1 j=1
n n
|v j | p |w j |q
≤ ∑ +∑ (Inequality (1.D.2))
j=1 p j=1 q
q
kvk pp kwkq
= + (definition of p−, q-norm)
p q
1 1
= + (since kvk p = kwkq = 1)
p q
= 1.
As one final note, we have only explicitly proved this theorem in the case when
both p and q are finite. The case when p = ∞ or q = ∞ is actually much simpler,
and thus left to Exercise 1.D.9. 
With Hölder’s inequality now at our disposal, we can finally find the multi-
plicative constants that can be used to bound different p-norms in terms of each
other (i.e., we can quantify the equivalence of the p-norms that was guaranteed
by Theorem 1.D.1).

Theorem 1.D.6 If 1 ≤ p ≤ q ≤ ∞ then


Equivalence of
p-Norms kvkq ≤ kvk p ≤ n1/p−1/q kvkq for all v ∈ Cn .

Proof. We already proved the left inequality in Theorem 1.D.3, so we just


need to prove the right inequality. To this end, consider the vectors x =
(|v1 | p , |v2 | p , . . . , |vn | p ) and y = (1, 1, . . . , 1) ∈ Cn and define p̃ = q/p and q̃ =
q/(q − p). It is straightforward to check that 1/ p̃ + 1/q̃ = 1, so Hölder’s in-
equality then tells us that
We introduce p̃ and n
q̃ and apply Hölder’s kvk pp =
inequality to them ∑ |vi | p (definition of kvk p )
j=1
instead of p and q
since it might not = |x · y| ≤ kxk p̃ kykq̃ (Hölder’s inequality)
even be the case !1/ p̃ !1/q̃
that 1/p + 1/q = 1. n  n
p p̃
= ∑ |v j | ∑ 1q̃ (definition of kxk p̃ and kykq̃ )
j=1 j=1
! p/q
n 
= ∑ |v j |q
n(q−p)/q (p p̃ = p(q/p) = q, etc.)
j=1

= n(q−p)/q kvkqp . (definition of kvkq )
Taking the p-th root of both sides of this inequality then shows us that
kvk p ≤ n(q−p)/(pq) kvkq = n1/p−1/q kvkq ,
as desired. 
If p and q are some combination of 1, 2, or ∞ then the above result recovers
earlier inequalities like those of Theorem 1.D.4. For example, if (p, q) = (1, 2)
then this theorem says that

kvk2 ≤ kvk1 ≤ n1−1/2 kvk2 = nkvk2 ,
156 Chapter 1. Vector Spaces

which we already knew. Furthermore, we note that the constants (c = 1 and


C = n1/p−1/q ) given in Theorem 1.D.6 are the best possible—for each inequal-
ity, there are particular choices of vectors v ∈ Cn that attain equality (see
Exercise 1.D.12).

C[a, b] denotes the The p-Norms for Functions


vector space of Before we move onto investigating other properties of norms, it is worth noting
continuous functions that the p-norms can also be defined for continuous functions in a manner
on the interval [a, b].
completely analogous to how we defined them on Cn above.

Definition 1.D.3 If p ≥ 1 is a real number, then the p-norm on C[a, b] is defined by


The p-Norm Z 1/p
of a Function b
| f (x)| p dx
def
k f kp = for all f ∈ C[a, b].
a

Similarly, for the p = ∞ case we simply define k f k p to be the maximum


value that f attains on the interval [a, b]:
The Extreme Value
def 
Theorem guarantees k f k∞ = max | f (x)| .
that this maximum is a≤x≤b
attained, since f is
continuous. Defining norms on functions in this way perhaps seems a bit more natural if
we notice that we can think of a vector v ∈ Cn as a function from {1, 2, . . . , n}
to Cn (in particular, this function sends j to v j ). Definition 1.D.3 can then be
thought of as simply extending the p-norms from functions on discrete sets of
numbers like {1, 2, . . . , n} to functions on continuous intervals like [a, b].

Example 1.D.3 Compute the p-norms of the function f (x) = x + 1 in C[0, 1].
Computing the p-Norm
Solution:
of a Function
For finite values of p, we just directly compute the value of the desired
integral:
Z 1
1/p
We do not have to p
take the absolute
kx + 1k p = (x + 1) dx
0
value of f since 
f (x) ≥ 0 on the
1 1/p  1/p
(x + 1) p+1 2 p+1 − 1
interval [0, 1].  
= = .
p+1 p+1
x=0

For the p = ∞ case, we just notice that the maximum value of f (x) on
the interval [0, 1] is

kx + 1k∞ = max {x + 1} = 1 + 1 = 2.
0≤x≤1

It is worth noting that lim kx + 1k p = 2 as well, as we might hope.


x→∞

One property of Furthermore, most of the previous theorems that we saw concerning p-
p-norms that does
norms carry through with minimal changes to this new setting too, simply by
not carryover is the
rightmost inequality replacing vectors by functions and sums by integrals. To illustrate this fact, we
in Theorem 1.D.6. now prove Minkowski’s inequality for the p-norm of a function. However, we
leave numerous other properties of the p-norm of a function to the exercises.
1.D Extra Topic: Norms and Isometries 157

Theorem 1.D.7 If 1 ≤ p ≤ ∞ then k f + gk p ≤ k f k p + kgk p for all f , g ∈ C[a, b].


Minkowski’s Inequality
for Functions
Proof. Just like in our original proof of Minkowski’s inequality, notice that the
function that sends x to x p is convex, so
p
(1 − t)x1 + tx2 ≤ (1 − t)x1p + tx2p for all x1 , x2 ≥ 0, 0 ≤ t ≤ 1.
This proof is almost Now suppose that f , g ∈ C[a, b] are non-zero functions (if either one of
identical to the
them is the zero function, Minkowski’s inequality is trivial). Then we can write
proof of
Theorem 1.D.2, but f = k f k p fˆ and g = kgk p ĝ, where fˆ and ĝ are unit vectors in the p-norm (i.e.,
with sums replaced k fˆk p = kĝk p = 1). Then for every 0 ≤ t ≤ 1, we have
by integrals, vectors
replaced by Z b
functions, and (1 − t) fˆ + t ĝ p = (1 − t) fˆ(x) + t ĝ(x) p dx (definition of k · k p )
p
coordinates of a
Z b p
vectors (like x j )
replaced by ≤ (1 − t)| fˆ(x)| + t|ĝ(x)| dx (triangle ineq. on C)
a
function values (like Z b
fˆ(x)). 
≤ (1 − t)| fˆ(x)| p + t|ĝ(x)| p dx (since x p is convex)
a
= (1 − t)k fˆk pp + tkĝk pp (definition of k · k p )
= 1. (k fˆk p = kĝk p = 1)
We apologize for In particular, if we choose t = kgk p /(k f k p + kgk p ) then 1 − t =
having norms within
k f k p /(k f k p + kgk p ), and the quantity above simplifies to
norms here. Ugh.
p
p k f k p fˆ + kgk p ĝ p k f + gk pp
(1 − t) fˆ + t ĝ = = .
p (k f k p + kgk p ) p (k f k p + kgk p ) p
p
Since this quantity is no larger than 1, multiplying through by k f k p + kgk p
tells us that
p
k f + gk pp ≤ k f k p + kgk p .

Taking the p-th root of both sides of this inequality gives us exactly Minkowski’s
inequality. 
We prove Hölder’s Minkowski’s inequality establishes the triangle inequality of the p-norm
inequality for
of a function, just like it did for the p-norm of vectors in Cn . The other two
functions in
Exercise 1.D.14. defining properties of norms (absolute homogeneity and positive definiteness)
are much easier to prove, and are left as Exercise 1.D.13.

1.D.2 From Norms Back to Inner Products


We have now seen that there are many different norms out there, but the norms
induced by inner products serve a particularly important role and are a bit easier
to work with (for example, they satisfy the Cauchy–Schwarz inequality). It
thus seems natural to ask how we can determine whether or not a given norm is
induced by an inner product, and if so how we can recover that inner product.
For example, although the 1-norm on R2 is not induced by the standard inner
product (i.e., the dot product), it does not seem obvious whether or not it is
induced by some more exotic inner product.
To answer this question for the 1-norm, suppose for a moment that there
were an inner product h·, ·i on R2 that induces the 1-norm : kvk1 = hv, vi for
158 Chapter 1. Vector Spaces

all v ∈ R2 . By using the fact that ke1 k1 = ke2 k1 = 1 and ke1 + e2 k1 = 2, we


see that

4 = ke1 + e2 k21 = he1 + e2 , e1 + e2 i


The upcoming = ke1 k21 + 2Re(he1 , e2 i) + ke2 k21 = 2 + 2Re(he1 , e2 i).
theorem tells us how
to turn a norm back
By rearranging and simplifying, we thus see that Re(he1 , e2 i) = 1, so |he1 , e2 i| ≥
into an inner
product, so one way 1. However, the Cauchy–Schwarz inequality tells us that |he1 , e2 i| ≤ 1 and in
of interpreting it is as fact |he1 , e2 i| < 1 since e1 and e2 are not scalar multiples of each other. We have
a converse of thus arrived at a contradiction that shows that no such inner product exists—the
Definition 1.3.7,
which gave us a
1-norm is not induced by any inner product.
method of turning The above argument was very specific to the 1-norm, however, and it is
an inner product into not immediately clear whether or not a similar argument can be applied to
a norm.
the ∞-norm, 7-norm, or other even more exotic norms. Before introducing a
theorem that solves this problem, we introduce one additional minor piece
of terminology and notation. We say that a vector space V, together with a
particular norm, is a normed vector space, and we denote the norm by k · kV
rather than just k · k in order to avoid any potential confusion with other norms.

Theorem 1.D.8 Let V be a normed vector space. Then k · kV is induced by an inner product
Jordan–von Neumann if and only if
Theorem
2kvk2V + 2kwk2V = kv + wk2V + kv − wk2V for all v, w ∈ V.

Before proving this theorem, we note that the equation 2kvk2V + 2kwk2V =
kv + wk2V + kv − wk2V is sometimes called the parallelogram law, since it
relates the lengths of the sides of a parallelogram to the lengths of its diagonals,
The parallelogram as illustrated in Figure 1.33. In particular, it says that the sum of squares of
law is a
the side lengths of a parallelogram always equals the sum of squares of the
generalization of the
Pythagorean diagonal lengths of that parallelogram, as long as the way that we measure
theorem, which is “length” comes from an inner product.
what we get if we
apply the y y
parallelogram law to
v+w
rectangles. v

w
w
v−w
v
x x

kvk2V + kwk2V + kvk2V + kwk2V = kv + wk2V + kv − wk2V

Figure 1.33: The Jordan–von Neumann theorem says that a norm is induced by
an inner product if and only if the sum of squares of norms of diagonals of a
parallelogram equals the sum of squares of norms of its sides (i.e., if and only if the
We originally proved parallelogram law holds for that norm).
the ‘’only if”
direction back in
Exercise 1.3.13. Proof of Theorem 1.D.8. To see the “only if” claim, we note that if k · kV is
1.D Extra Topic: Norms and Isometries 159

induced by an inner product h·, ·i then some algebra shows that

kv + wk2V + kv − wk2V = hv + w, v + wi + hv − w, v − wi

= hv, vi + hv, wi + hw, vi + hw, wi

+ hv, vi − hv, wi − hw, vi + hw, wi
= 2hv, vi + 2hw, wi
= 2kvk2V + 2kwk2V .

The “if” direction of the proof is much more involved, and we only prove
the case when the underlying field is F = R (we leave the F = C case to
Exercise 1.D.16). To this end, suppose that the parallelogram law holds and
define the following function h·, ·i : V × V → R (which we will show is an inner
This formula for an product):
inner product in 1 
terms of its induced hv, wi = kv + wk2V − kv − wk2V .
norm is sometimes 4
called the To see that this function really is an inner product, we have to check the
polarization identity.
We first encountered
three defining properties of inner products from Definition 1.3.6. Property (a)
this identity back in is straightforward, since
Exercise 1.3.14.
1  1 
hv, wi = kv + wk2V − kv − wk2V = kw + vk2V − kw − vk2V = hw, vi.
4 4
Similarly, property (c) follows fairly quickly from the definition, since
1  1
hv, vi = kv + vk2V − kv − vk2V = k2vk2V = kvk2V ,
4 4
which is non-negative and equals 0 if and only if v = 0. In fact, this furthermore
Properties (a) shows that the norm induced by this inner product is indeed kvkV .
and (c) do not
actually depend on
All that remains is to prove property (b) of inner products holds (i.e.,
the parallelogram hv, w + cxi = hv, wi + chv, xi for all v, w, x ∈ V and all c ∈ R). This task is
law holding—that is significantly more involved than proving properties (a) and (c) was, so we split
only required for it up into four steps:
property (b).
i) First, we show that hv, w + xi = hv, wi + hv, xi for all v, w, x ∈ V. To this
end we use the parallelogram law (i.e., the hypothesis of this theorem) to
see that

2kv + xk2V + 2kwk2V = kv + w + xk2V + kv − w + xk2V .

Rearranging slightly gives

kv + w + xk2V = 2kv + xk2V + 2kwk2V − kv − w + xk2V .

By repeating this argument with the roles of w and x swapped, we


similarly see that

kv + w + xk2V = 2kv + wk2V + 2kxk2V − kv + w − xk2V .

Averaging these two equations gives

kv + w + xk2V = kwk2V + kxk2V + kv + wk2V + kv + xk2V


(1.D.3)
− 12 kv − w + xk2V − 21 kv + w − xk2V .
160 Chapter 1. Vector Spaces

Inner products By repeating this entire argument with w and x replaced by −w and −x,
provide more
respectively, we similarly see that
structure than just a
norm does, so this
theorem can be kv − w − xk2V = kwk2V + kxk2V + kv − wk2V + kv − xk2V
(1.D.4)
thought of as telling
us exactly when that
− 12 kv + w − xk2V − 21 kv − w + xk2V .
extra structure is
present. Finally, we can use Equations (1.D.3) and (1.D.4) to get what we want:
1 
hv, w + xi = kv + w + xk2V − kv − w − xk2V
4
1 
= kv + wk2V + kv + xk2V − kv − wk2V − kv − xk2V
4
1  1 
= kv + wk2V − kv − wk2V + kv + xk2V − kv − xk2V
4 4
= hv, wi + hv, xi,

as desired. Furthermore, by replacing x by cx above, we see that hv, w +


cxi = hv, wi + hv, cxi for all c ∈ R. Thus, all that remains is to show that
hv, cxi = chv, xi for all c ∈ R, which is what the remaining three steps
of this proof are devoted to demonstrating.
ii) We now show that hv, cxi = chv, xi for all integers c. If c > 0 is an integer,
then this fact follows from (i) and induction (for example, (i) tells us that
hv, 2xi = hv, x + xi = hv, xi + hv, xi = 2hv, xi).
If c = 0 then this fact is trivial, since it just says that
1  1 
hv, 0xi = kv + 0xk2V − kv − 0xk2V = kvk2V − kvk2V = 0 = 0hv, xi.
4 4
If c = −1, then
1 
hv, −xi = kv − xk2V − kv + xk2V
4
−1 
= kv + xk2V − kv − xk2V = −hv, xi.
4
Finally, if c < −1 is an integer, then the result follows by combining the
fact that it holds when c = −1 and when c is a positive integer.
We denote the set of iii) We now show that hv, cxi = chv, xi for all rational c. If c is rational then
rational numbers
we can write it as c = p/q, where p and q are integers and q 6= 0. Then
by Q.

qhv, cxi = qhv, p(x/q)i = qphv, x/qi = phv, qx/qi = phv, xi,

where we used (ii) and the fact that p and q are integers in the second
and third equalities. Dividing the above equation through by q then gives
hv, cxi = (p/q)hv, xi = chv, xi, as desired.
Yes, this proof is still iv) Finally, to see that hv, cxi = chv, xi for all c ∈ R, we use the fact that, for
going on.
each fixed v, w ∈ V, the function f : R → R defined by
1 1 
f (c) = hv, cxi = kv + cwk2V − kv − cwk2V
c 4c
is continuous (this follows from the fact that all norms are continuous
when restricted to a finite-dimensional subspace like span(v, w)—see
Section 2.D and Appendix B.1 for more discussion of this fact), and com-
positions and sums/differences of continuous functions are continuous).
1.D Extra Topic: Norms and Isometries 161

We expand on this Well, we just showed in (iii) that f (c) = hv, wi for all c ∈ Q. When
type of argument
combined with continuity of f , this means that f (c) = hv, wi for all
considerably in
Section 2.D. c ∈ R. This type of fact is typically covered in analysis courses and texts,
but hopefully it is intuitive enough even if you have not taken such a
course—the rational numbers are dense in R, so if there were a real
number c with f (c) 6= hv, wi then the graph of f would have to “jump”
at c (see Figure 1.34) and thus f would not be continuous.

f (c1 ) 6= hv, wi

f (c) = hv, wi
for all c ∈ Q

c
/Q
c1 ∈

Figure 1.34: If a continuous function f (c) is constant on the rationals, then it must
be constant on all of R—otherwise, it would have a discontinuity.

We have thus shown that h·, ·i is indeed an inner product, and that it induces
k · kV , so the proof is complete. 

Example 1.D.4 Determine which of the p-norms (1 ≤ p ≤ ∞) on Cn are induced by an


Which p-Norms inner product.
are Induced by an
Solution:
Inner Product?
We already know that the 2-norm is induced by the standard inner
If we are being super product (the dot product). To see that none of the other p-norms are
technical, this induced by an inner product, we can check that the parallelogram law does
argument only works not hold when v = e1 and w = e2 :
when n ≥ 2. The
p-norms are all the
same (and thus are
2ke1 k2p + 2ke2 k2p = 2 + 2 = 4 and
(
all induced by an
2 2 1+1 = 2 if p = ∞,
inner product) when ke1 + e2 k p + ke1 − e2 k p =
n = 1. 22/p + 22/p = 21+2/p otherwise.

The parallelogram law thus does not hold whenever 4 6= 21+2/p (i.e.,
whenever p 6= 2). It follows from Theorem 1.D.8 that the p-norms are not
induced by any inner product when p 6= 2.

In fact, the only norms that we have investigated so far that are induced
by an inner product are the “obvious” ones: the 2-norm on Fn and C[a, b], the
Frobenius norm on Mm,n (F), and the norms that we can construct from the
inner products described by Corollary 1.4.4.
162 Chapter 1. Vector Spaces

1.D.3 Isometries
Recall from Section 1.4.3 that if F = R or F = C then a unitary matrix U ∈
Mn (F) is one that preserves the usual 2-norm of all vectors: kUvk2 = kvk2
for all v ∈ Fn . Unitary matrices are extraordinary for the fact that they have
so many different, but equivalent, characterizations (see Theorem 1.4.9). In
this section, we generalize this idea and look at what we can say about linear
transformations that preserve any particular norm, including ones that are not
induced by any inner product.

Definition 1.D.4 Let V and W be normed vector spaces and let T : V → W be a linear
Isometries transformation. We say that T is an isometry if

kT (v)kW = kvkV for all v ∈ V.

In other words, a For example, the linear transformation T : R2 → R3 (where we use the
unitary matrix is an
isometry of the
usual 2-norm on R2 and R3 ) defined by T (v1 , v2 ) = (v1 , v2 , 0) is an isometry
2-norm on Fn . since
q q
kT (v)k2 = v21 + v22 + 02 = v21 + v22 = kvk2 for all v ∈ V.

Whether or not a linear transformation is an isometry depends heavily on the


norms that are being used on V and W, so they need to be specified unless they
are clear from context.
 
Example 1.D.5 Consider the matrix U = 1 1 −1 that acts on R2 .

An Isometry in the 2 1 1
2-Norm But Not
a) Show that U is an isometry of the 2-norm.
the 1-Norm
b) Show that U is not an isometry of the 1-norm.
Solutions:
a) It is straightforward to check that U is unitary and thus an isometry
of the 2-norm:
    
1 1 1 1 −1 1 2 0
U ∗U = = = I2 .
2 −1 1 1 1 2 0 2

b) We simply have to find any vector v ∈ R2 with the property that


kUvk1 6= kvk1 . Well, v = e1 works, since

kUe1 k1 = √12 (1, 1) 1 = 2 but ke1 k1 = 1.

Recall that a rotation Geometrically, part (a) makes sense because U acts as a rotation
matrix (by an angle
counter-clockwise by π/4:
θ counter-clockwise)
is a matrix of the " #
form cos(π/4) − sin(π/4)
" # U= ,
cos(θ ) − sin(θ ) sin(π/4) cos(π/4)
.
sin(θ ) cos(θ )
and rotating vectors does not change their 2-norm (see Example 1.4.12).
However, rotating vectors can indeed change their 1-norm, as demonstrated
in part (b):
1.D Extra Topic: Norms and Isometries 163

y

Ue1 = (1, 1)/ 2 kUe1 k2 = 1

kUe1 k1 = √1 + √12 = 2
2
U

1/ 2

π/4
x
√ e1 = (1, 0) ke1 k2 = 1
1/ 2
ke1 k1 = 1

In situations where we have an inner product to work with (like in part (a)
of the previous example), we can characterize isometries in much the same way
that we characterized unitary matrices back in Section 1.4.3. In fact, the proof
of the upcoming theorem is so similar to that of the equivalence of conditions
(d)–(f) in Theorem 1.4.9 that we defer it to Exercise 1.D.22.

Theorem 1.D.9 Let V and W be inner product spaces and let T : V → W be a linear
Characterization transformation. Then the following are equivalent:
of Isometries a) T is an isometry (with respect to the norms induced by the inner
products on V and W),
b) T ∗ ◦ T = IV , and
c) hT (x), T (y)iW = hx, yiV for all x, y ∈ V.

If we take the standard matrix of both sides of part (b) of this theorem with
respect to an orthonormal basis B of V, then we see that T is an isometry if and
only if
The first equality here [T ]∗B [T ]B = [T ∗ ]B [T ]B = [T ∗ ◦ T ]B = [IV ]B = I.
follows from
Theorem 1.4.8 and In the special case when V = W, this observation can be phrased as follows:
relies on B being an
orthonormal basis.
! If B is an orthonormal basis of an inner product space V, then
T : V → V is an isometry of the norm induced by the inner
product on V if and only if [T ]B is unitary.

On the other hand, if a norm is not induced by an inner product then it can
be quite a bit more difficult to get our hands on what its isometries are. The
following theorem provides the answer for the ∞-norm:

Theorem 1.D.10 Suppose F = R or F = C, and P ∈ Mn (F). Then


Isometries of the
∞-Norm kPvk∞ = kvk∞ for all v ∈ Fn

if and only if every row and column of P has a single non-zero entry, and
Real matrices of this
that entry has magnitude (i.e., absolute value) equal to 1.
type are sometimes
called signed Before proving this theorem, it is worth illustrating exactly what it means.
permutation
matrices.
When F = R, it says that isometries of the ∞-norm only have entries 1, −1, and
164 Chapter 1. Vector Spaces

0, and each row and column has exactly one non-zero entry. For example,
 
0 −1 0
P= 0 0 1
−1 0 0

is an isometry of the ∞-norm on R3 , which is straightforward to verify directly:



kPvk∞ = k(−v2 , v3 , −v1 )k∞ = max | − v2 |, |v3 |, | − v1 |

= max |v1 |, |v2 |, |v3 | = kvk∞ for all v ∈ R3 .

When F = C, the idea is similar, except instead of just allowing non-zero


Complex matrices of entries of ±1, the non-zero entries can lie anywhere on the unit circle in the
this type are
complex plane.
sometimes called
complex
permutation
matrices. Proof of Theorem 1.D.10. The “if” direction is not too difficult to prove. We
just note that any P with the specified form just permutes the entries of v
and possibly multiplies them by
 a number
with absolute value 1, and such an
operation leaves kvk∞ = max j |v j | unchanged.
For the “only if” direction, we only prove the F = R case and leave the
F = C case to Exercise 1.D.25. Suppose P = p1 | p2 | · · · | pn is an isometry
of the ∞-norm. Then Pe j = p j for all j, so

max |pi, j | = kp j k∞ = kPe j k∞ = ke j k∞ = 1 for all 1 ≤ j ≤ n.
1≤i≤n

In particular, this implies that every column of P has at least one entry with
absolute value 1. Similarly, P(e j ± ek ) = p j ± pk for all j, k, so
It turns out that the 
1-norm has the exact max |pi, j ± pi,k | = kp j ± pk k∞ = kP(e j ± ek )k∞
1≤i≤n
same isometries (see
Exercise 1.D.24). In = ke j ± ek k∞ = 1 for all 1 ≤ j 6= k ≤ n.
fact, all p-norms
other than the It follows that if |pi, j | = 1 then pi,k = 0 for all k 6= j (i.e., the rest of the i-th
2-norm have these
same isometries row of P equals zero). Since each column p j contains an entry with |pi, j | = 1,
[CL92, LS94], but it follows that every row and column contains exactly one non-zero entry—the
proving this is one with absolute value 1. 
beyond the scope
of this book. In particular, it is worth noting that the above theorem says that all isome-
tries of the ∞-norm are also isometries of the 2-norm, since P∗ P = I for any
matrix of the form described by Theorem 1.D.10.

Exercises solutions to starred exercises on page 463

1.D.1 Determine which of the following functions k · k are


(f) V = P[0, 1],
norms on the specified vector space V.  
kpk = max |p(x)| + max |p0 (x)| ,
∗(a) V= R2 , kvk = |v1 | + 3|v2 |. 0≤x≤1 0≤x≤1
(b) V= R2 , kvk = 2|v1 | − |v2 |. where p0 is the derivative of p.
1/3
∗(c) V= R2 , kvk = v31 + v21 v2 + v1 v22 + v32 .
(d) V = R2 , kvk = |v31 + 3v21 v2 + 3v1 v22 + v32 | + 1.D.2 Determine whether or not each of the following
1/3
|v2 |3 . norms k · k are induced by an inner product on the indicated
∗(e) V = P (the vector space of polynomials), kpk is the vector space V.
degree of p (i.e., the highest power of x appearing in q
p(x)). (a) V = R2 , kvk = v21 + 2v1 v2 + 3v22 .
1/3
∗(b) V = C2 , kvk = |v1 |3 + 3|v2 |3 .
1.D Extra Topic: Norms and Isometries 165

(c) V = C[0, 1], k f k = max | f (x)| . (a) Show that if p, q > 1 are such that 1/p + 1/q then
0≤x≤1
|v·w| = kvk p kwkq if and only if either w = 0 or there
exists a scalar 0 ≤ c ∈ R such that |v j | p = c|w j |q for
1.D.3 Determine which of the following statements are
all 1 ≤ j ≤ n.
true and which are false.
(b) Show that if p = ∞ and q = 1 then |v·w| = kvk p kwkq
∗(a) If k · k is a norm on a vector space V and 0 < c ∈ R if and only if v = c(w1 /|w1 |, w2 /|w2 |, . . . , wn /|wn |)
then ck · k is also a norm on V. for some 0 ≤ c ∈ R (and if w j = 0 for some j then
(b) Every norm k · k on Mn (R) arisesp from some inner v j can be chosen so that |v j | ≤ 1 arbitrarily).
product on Mn (R) via kAk = hA, Ai.
∗(c) If S, T : V → W are isometries then so is S + T .
(d) If S, T : V → V are isometries of the same norm then ∗∗ 1.D.12 In this exercise, we show that the bounds of
so is S ◦ T . Theorem 1.D.6 are tight (i.e., the constants in it are as good
∗(e) If T : V → W is an isometry then T −1 exists and is as possible). Suppose 1 ≤ p ≤ q ≤ ∞.
also an isometry. (a) Find a vector v ∈ Cn such that kvk p = kvkq .
(b) Find a vector v ∈ Cn such that
1.D.4 Which of the three defining properties of norms fails  1 1

if we define the p-norm on Fn as usual in Definition 1.D.2, kvk p = n p q kvkq .
but with 0 < p < 1?
∗∗1.D.13 Show that the p-norm of a function (see Defini-
1.D.5 Show that norms are never linear transformations. tion 1.D.3) is indeed a norm.
That is, show that if V is a non-zero vector space and
k · k : V → R is a norm, then k · k is not a linear transforma- ∗∗1.D.14 Let 1 ≤ p, q ≤ ∞ be such that 1/p + 1/q = 1.
tion. Show that
Z b
f (x)g(x) dx ≤ k f k p kgkq for all f , g ∈ C[a, b].
∗∗ 1.D.6 Suppose V is a normed vector space and let a
v, w ∈ V. Show that kv − wk ≥ kvk − kwk. [Side note: This
[Side note: This is Hölder’s inequality for functions.]
is called the reverse triangle inequality.]

1.D.15 Show that the Jordan–von Neumann theorem (The-


∗∗ 1.D.7 Suppose v ∈ Cn and 1 ≤ p ≤ ∞. Show that
orem 1.D.8) still holds if we replace the parallelogram law
kvk∞ ≤ kvk p .
by the inequality

∗∗1.D.8 In this exercise, we show that 2kvk2V + 2kwk2V ≤ kv + wk2V + kv − wk2V

lim kvk p = kvk∞ for all v ∈ Cn . for all v, w ∈ V. [Hint: Consider vectors of the form v =
p→∞ x + y and w = x − y.]
(a) Briefly explain why lim p→∞ |v| p = 0 whenever v ∈ C
satisfies |v| < 1. ∗∗1.D.16 Prove the “if” direction of the Jordan–von Neu-
(b) Show that if v ∈ Cn is such that kvk∞ = 1 then mann theorem when the field is F = C. In particular, show
lim kvk p = 1. that if the parallelogram law holds on a complex normed
p→∞ vector space V then the function defined by
[Hint: Be slightly careful—there might be multiple
values of j for which |v j | = 1.] 1 3 1
(c) Use absolute homogeneity of the p-norms and the
hv, wi = ∑ ik kv + ik wk2V
4 k=0
∞-norm to show that
is an inner product on V.
lim kvk p = kvk∞ for all v ∈ Cn . [Hint: Show that hv, iwi = ihv, wi and apply the theorem
p→∞
from the real case to the real and imaginary parts of hv, wi.]
∗∗1.D.9 Prove the p = 1, q = ∞ case of Hölder’s inequality.
That is, show that 1.D.17 Suppose A ∈ Mm,n (C), and 1 ≤ p ≤ ∞.
|v · w| ≤ kvk1 kwk∞ for all v, w ∈ Cn . (a) Show that the following quantity k · k p is a norm on
Mm,n (C):
∗∗1.D.10 In this exercise, we determine when equality  
kAvk p
holds in Minkowski’s inequality (Theorem 1.D.2) on Cn . def
kAk p = maxn : v 6= 0 .
v∈C kvk p
(a) Show that if p > 1 or p = ∞ then kv + wk p =
kvk p + kwk p if and only if either w = 0 or v = cw [Side note: This is called the induced p-norm of A.
for some 0 ≤ c ∈ R. In the p = 2 case it is also called the operator norm
(b) Show that if p = 1 then kv + wk p = kvk p + kwk p if of A, and we explore it in Section 2.3.3.]
and only if there exist non-negative scalars {c j } ⊆ R (b) Show that kAvk p ≤ kAk p kvk p for all v ∈ Cn .
such that, for each 1 ≤ j ≤ n, either w j = 0 or (c) Show that kA∗ kq = kAk p , where 1 ≤ q ≤ ∞ is such
v j = c jw j. that 1/p + 1/q = 1.
[Hint: Make use of Exercise 1.D.11.]
(d) Show that this norm is not induced by an inner prod-
1.D.11 In this exercise, we determine when equality holds uct, regardless of the value of p (as long as m, n ≥ 2).
in Hölder’s inequality (Theorem 1.D.5) on Cn .
166 Chapter 1. Vector Spaces

1.D.18 Suppose A ∈ Mm,n (C), and consider the induced ∗∗1.D.24 Show that P ∈ Mn (C) is an isometry of the 1-
p-norms introduced in Exercise 1.D.17. norm if and only if every row and column of P has a single
(a) Show that the induced 1-norm of A is its maximal non-zero entry, and that entry has absolute value 1.
column sum: [Hint: Try mimicking the proof of Theorem 1.D.10. It might
( ) be helpful to use the result of Exercise 1.D.10(b).]
m
kAk1 = max ∑ |ai, j | .
1≤ j≤n i=1
∗∗1.D.25 In the proof of Theorem 1.D.10, we only proved
(b) Show that the induced ∞-norm of A is its maximal the “only if” direction in the case when F = R. Explain what
row sum: changes/additions need to be made to the proof to handle
( ) the F = C case.
n
kAk∞ = max ∑ |ai, j | .
1≤i≤m j=1
∗∗1.D.26 Show that the 1-norm and the ∞-norm on P[0, 1]
1.D.19 Find all isometries of the norm on R2 defined by are inequivalent. That is, show that there do not exist real
kvk = max{|v1 |, 2|v2 |}. scalars c,C > 0 such that
ckpk1 ≤ kpk∞ ≤ Ckpk1 for all p ∈ P[0, 1].
∗1.D.20 Suppose that V and W are finite-dimensional [Hint: Consider the polynomials of the form pn (x) = xn for
normed vector spaces. Show that if T : V → W is an isome- some fixed n ≥ 1.]
try then dim(V) ≤ dim(W).

∗∗ 1.D.27 Suppose that k · ka , k · kb , and k · kc are any


1.D.21 Suppose that V is a normed vector space and three norms on a vector space V for which k · ka and k · kb
T : V → V is an isometry. are equivalent, and so are k · kb and k · kc . Show that k · ka
(a) Show that if V is finite-dimensional then T −1 exists and k · kc are equivalent as well (in other words, show that
and is an isometry. equivalence of norms is transitive).
(b) Provide an example that demonstrates that if V is
infinite-dimensional then T might not be invertible.
∗∗1.D.28 Suppose F = R or F = C and let V be a finite-
dimensional vector space over F with basis B. Show that a
∗∗1.D.22 Prove Theorem 1.D.9. function k · kV : V → F is a norm if and only if there exists
a norm k · kFn : Fn → F such that

1.D.23 Suppose U ∈ Mm,n (C). Show that U is an isom- kvkV = [v]B Fn for all v ∈ V.
etry of the 2-norm if and only if its columns are mutually
orthogonal and all have length 1. In words, this means that every norm on a finite-dimensional
vector space looks like a norm on Fn (compare with Corol-
[Side note: If m = n then this recovers the usual definition lary 1.4.4).
of unitary matrices.]
2. Matrix Decompositions

We think basis-free, we write basis-free, but when


the chips are down we close the office door and
compute with matrices like fury.

Irving Kaplansky

We also explored the Many of the most useful linear algebraic tools are those that involve breaking
QR decomposition matrices down into the product of two or more simpler matrices (or equivalently,
in Section 1.C1.C, breaking linear transformations down into the composition of two or more
which says that simpler linear transformations). The standard example of this technique from
every matrix can be
written as a product introductory linear algebra is diagonalization, which says that it is often the
of a unitary matrix case that we can write a matrix A ∈ Mn in the form A = PDP−1 , where P is
and an upper invertible and D is diagonal.
triangular matrix.
More explicitly, some variant of the following theorem is typically proved
in introductory linear algebra courses:

Theorem 2.0.1 Suppose A ∈ Mn (F). The following are equivalent:


Characterization of a) A is diagonalizable over F.
Diagonalizability
b) There exists a basis of Fn consisting of eigenvectors of A.
Furthermore, in any diagonalization A = PDP−1 , the eigenvalues of A
are the diagonal entries of D and the corresponding eigenvectors are the
By “diagonalizable
over F”, we mean
columns of P in the same order.
that we can choose
D, P ∈ Mn (F). One of the standard examples of why diagonalization is so useful is that we can
use it to compute large powers of matrices quickly. In particular, if A = PDP−1
is a diagonalization of A ∈ Mn , then for every integer k ≥ 1 we have
P−1 P = I P−1 P = I ···
z }| { z }| { z }| { 
Ak = PD P−1 P D P−1 P D P−1 · · · PDP−1 = PDk P−1 ,
| {z }
k times

Another brief review and Dk is trivial to compute (for diagonal matrices, matrix multiplication is
of diagonalization
the same as entrywise multiplication). It follows that after diagonalizing a
is provided in
Appendix A.1.7. matrix, we can compute any power of it via just two matrix multiplications—
pre-multiplication of Dk by P, and post-multiplication of Dk by P−1 . In a sense,
we have off-loaded the difficulty of computing matrix powers into the difficulty
of diagonalizing a matrix.

© Springer Nature Switzerland AG 2021 167


N. Johnston, Advanced Linear and Matrix Algebra,
https://doi.org/10.1007/978-3-030-52815-7_2
168 Chapter 2. Matrix Decompositions

If we think of the matrix A as a linear transformation acting on Fn then


diagonalizing A is equivalent to finding a basis B of Fn in which the standard
matrix of that linear transformation is diagonal. That is, if we define T : Fn → Fn
Here we use the by T (v) = Av then Theorem 1.2.8 tells us that [T ]B = PB←E APE←B , where E
fact that [T ]E = A.
For a refresher on is the standard basis of Fn . Since PB←E = PE←B −1
(by Theorem 1.2.3), a slight
−1
change-of-basis rearrangement shows that A = PE←B [T ]B PE←B , so A is diagonalizable if and
matrices and only if there is a basis B in which [T ]B is diagonal (see Figure 2.1).
standard matrices,
refer back to
Section 1.2.2. y y

Av1

A
v2 v1 Av2
x x

" #
2 1
Figure 2.1: The matrix A = “looks diagonal” (i.e., just stretches space, but
1 2
does not skew or rotate it) when viewed in the basis B = {v1 , v2 } = {(1, 1), (−1, 1)}

We say that two matrices A, B ∈ Mn (F) are similar if there exists an invert-
With this terminology ible matrix P ∈ Mn (F) such that A = PBP−1 (and we call such a transformation
in hand, a matrix is
of A into B a similarity transformation). The argument that we just provided
diagonalizable if
and only if it is similar shows that two matrices are similar if and only if they represent the same linear
to a diagonal matrix. transformation on Fn , just represented in different bases. The main goal of this
chapter is to better understand similarity transformations (and thus change of
bases). More specifically, we investigate the following questions:
1) What if A is not diagonalizable—how “close” to diagonal can we make
D = P−1 AP when P is invertible? That is, how close to diagonal can we
make A via a similarity transformation?
2) How simple can we make A if we multiply by something other than P
and P−1 on the left and right? For example, how simple can we make A
via a unitary similarity transformation—a similarity transformation
in which the invertible matrix P is unitary? Since a change-of-basis
matrix PE←B is unitary if and only if B is an orthonormal basis of Fn
(see Exercise 1.4.18), this is equivalent to asking how simple we can
make the standard matrix of a linear transformation if we represent it in
an orthonormal basis.

2.1 The Schur and Spectral Decompositions

We start this chapter by probing question (2) above—how simple can we can
make a matrix via a unitary similarity? That is, given a matrix A ∈ Mn (C),
how simple can U ∗ AU be if U is a unitary matrix? Note that we need the field
here to be R or C so that it even makes sense to talk about unitary matrices,
and for now we restrict our attention to C since the answer in the real case will
follow as a corollary of the answer in the complex case.
2.1 The Schur and Spectral Decompositions 169

2.1.1 Schur Triangularization

Recall that an upper We know that we cannot hope in general to get a diagonal matrix via unitary
triangular matrix is similarity, since not every matrix is even diagonalizable via any similarity.
one with zeros in However, the following theorem, which is the main workhorse for the rest
every position below
the main diagonal. of this section, says that we can get partway there and always get an upper
triangular matrix.

Theorem 2.1.1 Suppose A ∈ Mn (C). There exists a unitary matrix U ∈ Mn (C) and an
Schur upper triangular matrix T ∈ Mn (C) such that
Triangularization
A = UTU ∗ .

Proof. We prove the result by induction on n (the size of A). For the base case,
we simply notice that the result is trivial if n = 1, since every 1 × 1 matrix is
upper triangular.
For the inductive step, suppose that every (n − 1) × (n − 1) matrix can be
upper triangularized (i.e., can be written in the form described by the theorem).
This step is why Since A is complex, the fundamental theorem of algebra (Theorem A.3.1) tells
we need to work
us that its characteristic polynomial pA has a root, so A has an eigenvalue. If
over C, not R. Not all
polynomials factor we denote one of its corresponding eigenvectors by v ∈ Cn , then we recall that
over R, so not all real eigenvectors are non-zero by definition, so we can scale it so that kvk = 1.
matrices have real
eigenvalues.
We can then extend v to an orthonormal basis of Cn via Exercise 1.4.20.
In other words, we can find a unitary matrix V ∈ Mn (C) with v as its first
column:  
To extend v to an
orthonormal basis, V = v | V2 ,
pick any vector
where V2 ∈ Mn,n−1 (C) satisfies V2∗ v = 0 (since V is unitary so v is orthogonal
not in its span, apply
Gram–Schmidt, and to every column of V2 ). Direct computation then shows that
repeat.  ∗ 

 ∗   v  
V AV = v | V2 A v | V2 = Av | AV2
The final step in this V2∗
computation follows  ∗   
because Av = λ v, v Av v∗ AV2 λ v∗ AV2
= = .
v∗ v = kvk2 = 1, and λV2∗ v V2∗ AV2 0 V2∗ AV2
V2∗ v = 0.
We now apply the inductive hypothesis—since V2∗ AV2 is an (n − 1) × (n − 1)
matrix, there exists a unitary matrix U2 ∈ Mn−1 (C) and an upper triangular
T2 ∈ Mn−1 (C) such that V2∗ AV2 = U2 T2U2∗ . It follows that
     
λ v∗ AV2 1 0T λ v∗ AV2 1 0T
V ∗ AV = = .
0 U2 T2U2∗ 0 U2 0 T2 0 U2∗
By multiplying on the left by V and on the right by V ∗ , we see that A = UTU ∗ ,
where
Recall from    
Exercise 1.4.8 1 0T λ v∗ AV2
U =V and T = .
that the product 0 U2 0 T2
of unitary matrices
is unitary. Since U is unitary and T is upper triangular, this completes the inductive step
and the proof. 
In a Schur triangularization A = UTU ∗ , the diagonal entries of T are
necessarily the same as the eigenvalues of A. To see why this is the case, recall
that (a) A and T must have the same eigenvalues since they are similar, and
170 Chapter 2. Matrix Decompositions

(b) the eigenvalues of a triangular matrix are its diagonal entries. However,
the other pieces of Schur triangularization are highly non-unique—the unitary
matrix U and the entries in the strictly upper triangular portion of T can vary
wildly.

Example 2.1.1 Compute a Schur triangularization of the following matrices:


Computing Schur " #  
Triangularizations a) A = 1 2 b) B = 1 2 2
5 4 2 1 2
3 −3 4

Solutions:
a) We construct the triangularization by mimicking the proof of Theo-
rem 2.1.1, so we start by constructing a unitary matrix whose first
column is an eigenvector of A. It is straightforward to show that its
We computed eigenvalues are −1 and 6, with corresponding eigenvectors (1, −1)
these eigenvalues
and (2, 5), respectively.
and eigenvectors in
Example A.1.1. See If we (arbitrarily)√choose the eigenvalue/eigenvector pair λ = −1
Appendix A.1.6 if you and v = (1, −1)/ 2, then we can extend v to the orthonormal basis
need a refresher on
eigenvalues and n o
eigenvectors. √1 (1, −1), √1 (1, 1)
2 2

of C2 simply by inspection. Placing these vectors as columns into a


unitary matrix then gives us a Schur triangularization A = UTU ∗ as
follows:
  " #
1 1 1 −1 −3
U=√ and T = U ∗ AU = .
2 −1 1 0 6

b) Again, we mimic the proof of Theorem 2.1.1 and thus start by con-
structing a unitary matrix whose first column is an eigenvector of
B. Its eigenvalues are −1, 3, and 4, with corresponding eigenvec-
If we chose a tors (−4, 1, 3), (1, 1, 0), and (2, 2, 1), respectively. We (arbitrarily)
different eigen-
choose the pair λ = 4, v = (2, 2, 1)/3, and then we compute an or-
value/eigenvector
pair, we would still thonormal basis of C3 that contains v (notice that we normalized v
get a valid Schur so that this is possible).
triangularization √
of B, but it would look
To this end, we note that the unit vector w = (1, −1, 0)/ 2 is clearly
completely different orthogonal to v, so we just need to find one more unit vector or-
than the one we thogonal to both v and w. We can either solve a linear system,or
compute here. use the Gram–Schmidt
√ process, or use the cross product to find
x = (−1, −1, 4)/ 18, which does the job. It follows that the set
n o
{v, w, x} = 13 (2, 2, 1), √12 (1, −1, 0), √118 (−1, −1, 4)

is an orthonormal basis of C3 containing v, and if we stick these


vectors as columns into a unitary matrix V1 then we have
 √  
√ √ √ 
2/3 1/ 2 −1/ 18 4 2 2 2
 √ √    
V1 = 
2/3 −1/ 2 −1/ 18 , V1

BV1 = 0 −1 0 .

1/3 0 4/ 18 0 4 3
2.1 The Schur and Spectral Decompositions 171

For larger matrices, We now iterate—we repeat this entire procedure with the bottom-
we have to iterate
right 2 × 2 block of V1∗ BV1 . Well, the eigenvalues of
even more. For
example, to  
compute a Schur −1 0
triangularization of a 4 3
4 × 4 matrix, we must
compute an
eigenvector of a are −1 and 3, with corresponding eigenvectors
 √ (1, −1) and
√ (0, 1),
4 × 4 matrix, then a respectively. The orthonormal basis (1, −1)/ 2, (1, 1)/ 2 con-
3 × 3 matrix, and tains one of its eigenvectors, so we define V2 to be the unitary matrix
then a 2 × 2 matrix.
with these vectors as its columns. Then B has Schur triangularization
B = UTU ∗ via
 
1 0T
U = V1
In these examples,
0 V2
 √ √ 
the eigenvectors 
were simple 2/3 1/ 2 −1/ 18 1 0 0
 √ √  √ √
enough to extend to 
= 2/3 −1/ 2 
−1/ 18 0 1/ √2 1/√2 
orthonormal bases √ 0 −1/ 2 1/ 2
by inspection. In 1/3 0 4/ 18
more complicated  
situations, use the 2 2 1
1
Gram–Schmidt = 2 −1 −2
process. 3
1 −2 2

and
   
2 2 1 1 2 2 2 2 1
1
T = U ∗ BU = 2 −1 −2 2 1 2 2 −1 −2
9
1 −2 2 3 −3 4 1 −2 2
 
4 −1 3
= 0 −1 −4 .
0 0 3

As demonstrated by the previous example, computing a Schur triangular-


ization by hand is quite tedious. If A is n × n then constructing a Schur triangu-
larization requires us to find an eigenvalue and eigenvector of an n × n matrix,
extend it to an orthonormal basis, finding an eigenvector of an (n − 1) × (n − 1)
matrix, extend it it to an orthonormal basis, and so on down to a 2 × 2 matrix.
While there are somewhat faster numerical algorithms that can be used
to compute Schur triangularizations in practice, we do not present them here.
Rather, we are interested in Schur triangularizations for their theoretical prop-
erties and the fact that they will help us establish other, more practical, matrix
decompositions. In other words, we frequently make use of the fact that Schur
triangularizations exist, but we rarely need to actually compute them.

Remark 2.1.1 (Schur Triangularizations are Complex). Schur triangularizations
are one of the few things that really do require us to work with complex
matrices, as opposed to real matrices (or even matrices over some more exotic
field). The reason for this is that the first step in finding a Schur
triangularization is finding an eigenvector, and some real matrices do not
have any real eigenvectors. For example, the matrix

    A = \begin{bmatrix} 1 & -2 \\ 1 & -1 \end{bmatrix}

has no real eigenvalues and thus no real Schur triangularization (since the
diagonal entries of any triangularization T are necessarily the eigenvalues
of A). However, it does have a complex Schur triangularization: A = UTU∗,
where

    U = \frac{1}{\sqrt{6}}\begin{bmatrix} \sqrt{2}(1+i) & 1+i \\ \sqrt{2} & -2 \end{bmatrix}
    \quad\text{and}\quad
    T = \frac{1}{\sqrt{2}}\begin{bmatrix} i\sqrt{2} & 3-i \\ 0 & -i\sqrt{2} \end{bmatrix}.

While it is not quite as powerful as diagonalization, the beauty of Schur
triangularization is that it applies to every square matrix, which makes it
an extremely useful tool when trying to prove theorems about (potentially
non-diagonalizable) matrices. For example, we can use it to provide a short
and simple proof of the following relationship between the determinant,
trace, and eigenvalues of a matrix (the reader may have already seen this
theorem for diagonalizable matrices, or may have already proved it in general
by painstakingly analyzing coefficients of the characteristic polynomial).
Recall that the trace of a matrix (denoted by tr(·)) is the sum of its
diagonal entries.

Theorem 2.1.2 (Determinant and Trace in Terms of Eigenvalues). Let A ∈ Mn(C)
have eigenvalues λ1, λ2, . . . , λn (listed according to algebraic
multiplicity). Then

    det(A) = λ1 λ2 · · · λn   and   tr(A) = λ1 + λ2 + · · · + λn.

Proof. Both formulas are proved by using Schur triangularization to write
A = UTU∗ with U unitary and T upper triangular, and recalling that the
diagonal entries of T are its eigenvalues, which (since T and A are similar)
are also the eigenvalues of A. In particular,

    det(A) = det(UTU∗)                 (Schur triangularization)
           = det(U) det(T) det(U∗)     (multiplicativity of determinant)
           = det(UU∗) det(T)           (multiplicativity of determinant)
           = det(T)                    (det(UU∗) = det(I) = 1)
           = λ1 λ2 · · · λn.           (determinant of a triangular matrix)

For a refresher on properties of the determinant like the ones we use here,
see Appendix A.1.5. The corresponding statement about the trace is proved
analogously:

    tr(A) = tr(UTU∗)                   (Schur triangularization)
          = tr(U∗UT)                   (cyclic commutativity of trace)
          = tr(T)                      (U∗U = I)
          = λ1 + λ2 + · · · + λn.      (definition of trace)  ∎
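The following quick numerical check of Theorem 2.1.2 is a sketch (not part of
the original text): for a random complex matrix, the determinant should equal
the product of the eigenvalues and the trace their sum.

# Sanity-check det(A) = product of eigenvalues and tr(A) = sum of eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
eigs = np.linalg.eigvals(A)

print(np.isclose(np.linalg.det(A), np.prod(eigs)))  # True
print(np.isclose(np.trace(A), np.sum(eigs)))        # True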
As one more immediate application of Schur triangularization, we prove an
important result called the Cayley–Hamilton theorem, which says that every
matrix satisfies its own characteristic polynomial.

Theorem 2.1.3 (Cayley–Hamilton). Suppose A ∈ Mn(C) has characteristic
polynomial p_A(λ) = det(A − λI). Then p_A(A) = O.

Before proving this result, we clarify exactly what we mean by it. Since the
characteristic polynomial p_A really is a polynomial, there are coefficients
a_0, a_1, . . . , a_n ∈ C (in fact, a_n = (−1)^n) such that

    p_A(λ) = a_n λ^n + · · · + a_2 λ^2 + a_1 λ + a_0.

What we mean by p_A(A) is that we plug A into this polynomial in the naïve
way—we just replace each power of λ by the corresponding power of A (recall
that, for every matrix A ∈ Mn, we have A^0 = I):

    p_A(A) = a_n A^n + · · · + a_2 A^2 + a_1 A + a_0 I.

For example, if A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} then

    p_A(λ) = det(A − λI) = det\begin{bmatrix} 1-λ & 2 \\ 3 & 4-λ \end{bmatrix} = (1 − λ)(4 − λ) − 6 = λ^2 − 5λ − 2,

so p_A(A) = A^2 − 5A − 2I. (Be careful with the constant term: −2 becomes −2I
when plugging a matrix into the polynomial.) The Cayley–Hamilton theorem says
that this equals the zero matrix, which we can verify directly:

    A^2 − 5A − 2I = \begin{bmatrix} 7 & 10 \\ 15 & 22 \end{bmatrix} − 5\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} − 2\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.
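The same verification can be done numerically; the short sketch below (not
part of the original text) plugs the 2 × 2 matrix above into its own
characteristic polynomial.

# Verify p_A(A) = O for A = [[1, 2], [3, 4]] with p_A(λ) = λ² − 5λ − 2.
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
pA_of_A = A @ A - 5 * A - 2 * np.eye(2)   # the constant term −2 becomes −2I
print(pA_of_A)                            # [[0. 0.]
                                          #  [0. 0.]]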

We now prove this theorem in full generality.

Proof of Theorem 2.1.3. Because we are working over C, the fundamental


theorem of algebra (Theorem A.3.1) says that the characteristic polynomial of
A has a root and can thus be factored:
pA (λ ) = (λ1 − λ )(λ2 − λ ) · · · (λn − λ ), so
pA (A) = (λ1 I − A)(λ2 I − A) · · · (λn I − A).
To simplify things, we use Schur triangularization to write A = UTU∗, where
U ∈ Mn(C) is unitary and T ∈ Mn(C) is upper triangular. Then by factoring
somewhat cleverly, we see that

    p_A(A) = (λ1 I − UTU∗) · · · (λn I − UTU∗)            (plug in A = UTU∗)
           = (λ1 UU∗ − UTU∗) · · · (λn UU∗ − UTU∗)        (since I = UU∗)
           = (U(λ1 I − T)U∗) · · · (U(λn I − T)U∗)        (factor out U and U∗)
           = U(λ1 I − T) · · · (λn I − T)U∗               (since U∗U = I)
           = U p_A(T) U∗.

This is a fairly standard technique that Schur triangularization gives us—it
lets us prove certain statements for triangular matrices without losing any
generality. It follows that we just need to prove that p_A(T) = O.

To this end, we assume for simplicity that the diagonal entries of T are in
the same order that we have been using for the eigenvalues of A (i.e., the
first diagonal entry of T is λ1, the second is λ2, and so on). Then for all
1 ≤ j ≤ n, λ_j I − T is an upper triangular matrix with diagonal entries
λ_j − λ_1, λ_j − λ_2, . . . , λ_j − λ_n. That is,

    λ_1 I − T = \begin{bmatrix} 0 & * & \cdots & * \\ 0 & λ_1-λ_2 & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & λ_1-λ_n \end{bmatrix},

where we use asterisks (∗) in the place of matrix entries whose values we do
not care about,

and similarly for λ_2 I − T, . . . , λ_n I − T. We now claim that the leftmost
k columns of (λ_1 I − T)(λ_2 I − T) · · · (λ_k I − T) consist entirely of zeros
whenever 1 ≤ k ≤ n, and thus p_A(T) = (λ_1 I − T)(λ_2 I − T) · · · (λ_n I − T) = O.
Proving this claim is a straightforward (albeit rather ugly) matrix
multiplication exercise, which we solve via induction. (Another proof of the
Cayley–Hamilton theorem is provided in Section 2.D.4.)

The base case k = 1 is straightforward, as we wrote down the matrix λ_1 I − T
above, and its leftmost column is indeed the zero vector. For the inductive
step, define T_k = (λ_1 I − T)(λ_2 I − T) · · · (λ_k I − T) and suppose that
the leftmost k columns of T_k consist entirely of zeros. We know that each
T_k is upper triangular (since the product of upper triangular matrices is
again upper triangular), so some rather unpleasant block matrix
multiplication shows that (here O_k is the k × k zero matrix, and O without a
subscript is the zero matrix of whatever size makes sense in that portion of
the block matrix—(n − k) × k in this case)

    T_{k+1} = T_k (λ_{k+1} I − T)
            = \begin{bmatrix} O_k & * \\ O & \begin{bmatrix} * & * & \cdots & * \\ 0 & * & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & * \end{bmatrix} \end{bmatrix}
              \begin{bmatrix} * & * \\ O & \begin{bmatrix} 0 & * & \cdots & * \\ 0 & λ_{k+1}-λ_{k+2} & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & λ_{k+1}-λ_n \end{bmatrix} \end{bmatrix}
            = \begin{bmatrix} O_k & \begin{bmatrix} 0 & * & \cdots & * \end{bmatrix} \\ O & \begin{bmatrix} 0 & * & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & * \end{bmatrix} \end{bmatrix},

which has its leftmost k + 1 columns equal to the zero vector. This completes
the inductive step and the proof. 
For example, because the characteristic polynomial of a 2 × 2 matrix A is
p_A(λ) = λ^2 − tr(A)λ + det(A), in this case the Cayley–Hamilton theorem says
that every 2 × 2 matrix satisfies the equation A^2 = tr(A)A − det(A)I, which
can be verified directly by giving names to the 4 entries of A and computing
all of the indicated quantities:

    tr(A)A − det(A)I = (a + d)\begin{bmatrix} a & b \\ c & d \end{bmatrix} − (ad − bc)\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} a^2+bc & ab+bd \\ ac+cd & bc+d^2 \end{bmatrix} = A^2.

More generally, if A ∈ Mn(C) then the constant coefficient of p_A is det(A)
and its coefficient of λ^{n−1} is (−1)^{n−1} tr(A).

One useful feature of the Cayley–Hamilton theorem is that if A ∈ Mn (C)


then it lets us write every power of A as a linear combination of I, A, A2 , . . . , An−1 .
In other words, it tells us that the powers of A are all contained within some
(at most) n-dimensional subspace of the n2 -dimensional vector space Mn (C).

Example 2.1.2 (Matrix Powers via Cayley–Hamilton). Suppose

    A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}.

Use the Cayley–Hamilton theorem to come up with a formula for A^4 as a linear
combination of A and I.

Solution:
As noted earlier, the characteristic polynomial of A is p_A(λ) = λ^2 − 5λ − 2,
so A^2 − 5A − 2I = O. Rearranging somewhat gives A^2 = 5A + 2I. To get higher
powers of A as linear combinations of A and I, just multiply this equation
through by A:

    A^3 = 5A^2 + 2A = 5(5A + 2I) + 2A = 25A + 10I + 2A = 27A + 10I, and
    A^4 = 27A^2 + 10A = 27(5A + 2I) + 10A = 135A + 54I + 10A = 145A + 54I.

Another way to solve this would be to square both sides of the equation
A^2 = 5A + 2I, which gives A^4 = 25A^2 + 20A + 4I. Substituting in
A^2 = 5A + 2I then gives the answer.

Example 2.1.3 (Matrix Inverses via Cayley–Hamilton). Use the Cayley–Hamilton
theorem to find the inverse of the matrix

    A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}.

Solution:
As before, we know from the Cayley–Hamilton theorem that A^2 − 5A − 2I = O.
If we solve this equation for I, we get I = (A^2 − 5A)/2. Factoring this
equation gives I = A(A − 5I)/2, from which it follows that A is invertible,
with inverse

    A^{-1} = (A − 5I)/2 = \frac{1}{2}\left(\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} − \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix}\right) = \frac{1}{2}\begin{bmatrix} -4 & 2 \\ 3 & -1 \end{bmatrix}.

Alternatively, first check that det(A) ≠ 0 so that we know A^{-1} exists, and
then find it by multiplying both sides of A^2 − 5A − 2I = O by A^{-1}.

Example 2.1.4 (Large Matrix Powers via Cayley–Hamilton). Suppose

    A = \begin{bmatrix} 0 & 2 & 0 \\ 1 & 1 & -1 \\ -1 & 1 & 1 \end{bmatrix}.

Use the Cayley–Hamilton theorem to compute A^{314}.

Solution:
The characteristic polynomial of A is

    p_A(λ) = det(A − λI) = (−λ)(1 − λ)^2 + 2 + 0 − λ − 2(1 − λ) − 0 = −λ^3 + 2λ^2.

The Cayley–Hamilton theorem then says that A^3 = 2A^2. Multiplying that
equation by A repeatedly gives us A^4 = 2A^3 = 4A^2, A^5 = 2A^4 = 4A^3 = 8A^2,
and in general, A^n = 2^{n−2} A^2. It follows that

    A^{314} = 2^{312} A^2 = 2^{312}\begin{bmatrix} 2 & 2 & -2 \\ 2 & 2 & -2 \\ 0 & 0 & 0 \end{bmatrix} = 2^{313}\begin{bmatrix} 1 & 1 & -1 \\ 1 & 1 & -1 \\ 0 & 0 & 0 \end{bmatrix}.
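As a numerical cross-check of this example (a sketch, not part of the
original text), working in floating point keeps 2^312 within the range of
float64, so we can compare the closed form against direct matrix
exponentiation.

# Compare A**314 computed by repeated squaring with the closed form 2**312 * A².
import numpy as np

A = np.array([[0, 2, 0],
              [1, 1, -1],
              [-1, 1, 1]], dtype=float)

direct = np.linalg.matrix_power(A, 314)
via_cayley_hamilton = 2.0**312 * (A @ A)
print(np.allclose(direct, via_cayley_hamilton))   # True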

It is worth noting that the final example above is somewhat contrived, since
it only works out so cleanly because the given matrix has a very simple
characteristic polynomial. For more complicated matrices, large powers are
typically still best computed via diagonalization.

2.1.2 Normal Matrices and the Complex Spectral Decomposition


The reason that Schur triangularization is such a useful theoretical tool is
that it applies to every matrix in Mn (C) (unlike diagonalization, for example,
which has the annoying restriction of only applying to matrices with a basis
of eigenvectors). However, upper triangular matrices can still be somewhat
challenging to work with, as evidenced by how technical the proof of the
Cayley–Hamilton theorem (Theorem 2.1.3) was. We thus now start looking at
when Schur triangularization actually results in a diagonal matrix, rather than
just an upper triangular one.
It turns out that the answer is related to a new class of matrices that we have
not yet encountered, so we start with a definition:

Definition 2.1.1 (Normal Matrix). A matrix A ∈ Mn(C) is called normal if
A∗A = AA∗.
Many of the important families of matrices that we are already familiar with
are normal. For example, every Hermitian matrix is normal, since if A∗ = A
then A∗A = A^2 = AA∗, and a similar argument shows that skew-Hermitian
matrices (i.e., those with A∗ = −A) are normal as well. Similarly, every
unitary matrix U is normal, since U∗U = I = UU∗ in this case, but there are
also normal matrices that are neither Hermitian, nor skew-Hermitian, nor
unitary.

Example 2.1.5 (A Normal Matrix). Show that the matrix

    A = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}

is normal but not Hermitian, skew-Hermitian, or unitary.

Solution:
To see that A is normal, we directly compute

    A∗A = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix} = AA∗.

The fact that A is not unitary follows from the fact that A∗A (as computed
above) does not equal the identity matrix, and the fact that A is neither
Hermitian nor skew-Hermitian can be seen just by inspecting the entries of
the matrix.
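A tiny numerical helper (a sketch, not from the original text; the function
name is ours) makes the same check by comparing A∗A with AA∗ entrywise.

# Check normality of the matrix from Example 2.1.5.
import numpy as np

def is_normal(A, tol=1e-10):
    """Return True if A*A and AA* agree up to the given tolerance."""
    return np.allclose(A.conj().T @ A, A @ A.conj().T, atol=tol)

A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])
print(is_normal(A))                            # True
print(np.allclose(A.conj().T @ A, np.eye(3)))  # False, so A is not unitary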

Our primary interest in normal matrices comes from the following theorem,
which says that normal matrices are exactly those that can be diagonalized
by a unitary matrix. Equivalently, they are exactly the matrices whose Schur
triangularizations are actually diagonal.

Theorem 2.1.4 (Spectral Decomposition (Complex Version)). Suppose A ∈ Mn(C).
Then there exists a unitary matrix U ∈ Mn(C) and a diagonal matrix D ∈ Mn(C)
such that

    A = UDU∗

if and only if A is normal (i.e., A∗A = AA∗).



Proof. To see the "only if" direction, we just compute

    A∗A = (UDU∗)∗(UDU∗) = UD∗U∗UDU∗ = UD∗DU∗,  and
    AA∗ = (UDU∗)(UDU∗)∗ = UDU∗UD∗U∗ = UDD∗U∗.

Since D and D∗ are diagonal, they commute, so UD∗DU∗ = UDD∗U∗. We thus
conclude that A∗A = AA∗, so A is normal.

For the "if" direction, use Schur triangularization to write A = UTU∗, where
U is unitary and T is upper triangular. Since A is normal, we know that
A∗A = AA∗. It follows that

    UTT∗U∗ = (UTU∗)(UTU∗)∗ = AA∗ = A∗A = (UTU∗)∗(UTU∗) = UT∗TU∗.

Multiplying on the left by U∗ and on the right by U then shows that
T∗T = TT∗ (i.e., T is also normal). Our goal now is to show that, since T is
both upper triangular and normal, it must in fact be diagonal.
To this end, we compute the diagonal entries of T∗T and TT∗, starting with
[T∗T]_{1,1} = [TT∗]_{1,1} (the notation [T∗T]_{i,j} means the (i, j)-entry of
the matrix T∗T). It is perhaps useful to write out the matrices T and T∗ to
highlight what this equality tells us:

    [T^*T]_{1,1} = \left[\begin{bmatrix} \overline{t_{1,1}} & 0 & \cdots & 0 \\ \overline{t_{1,2}} & \overline{t_{2,2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \overline{t_{1,n}} & \overline{t_{2,n}} & \cdots & \overline{t_{n,n}} \end{bmatrix}\begin{bmatrix} t_{1,1} & t_{1,2} & \cdots & t_{1,n} \\ 0 & t_{2,2} & \cdots & t_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & t_{n,n} \end{bmatrix}\right]_{1,1} = |t_{1,1}|^2,

and

    [TT^*]_{1,1} = \left[\begin{bmatrix} t_{1,1} & t_{1,2} & \cdots & t_{1,n} \\ 0 & t_{2,2} & \cdots & t_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & t_{n,n} \end{bmatrix}\begin{bmatrix} \overline{t_{1,1}} & 0 & \cdots & 0 \\ \overline{t_{1,2}} & \overline{t_{2,2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \overline{t_{1,n}} & \overline{t_{2,n}} & \cdots & \overline{t_{n,n}} \end{bmatrix}\right]_{1,1} = |t_{1,1}|^2 + |t_{1,2}|^2 + \cdots + |t_{1,n}|^2.

We thus see that [T∗T]_{1,1} = [TT∗]_{1,1} implies
|t_{1,1}|^2 = |t_{1,1}|^2 + |t_{1,2}|^2 + · · · + |t_{1,n}|^2, and since each
term in that sum is non-negative it follows that
|t_{1,2}|^2 = · · · = |t_{1,n}|^2 = 0. In other words, the only non-zero entry
in the first row of T is its (1, 1)-entry.

We now repeat this argument with [T∗T]_{2,2} = [TT∗]_{2,2}. Again, direct
computation shows that (more of the entries of T and T∗ equal zero this time
because we now know that the off-diagonal entries of the first row of T are
all zero)

    [T^*T]_{2,2} = \left[\begin{bmatrix} \overline{t_{1,1}} & 0 & \cdots & 0 \\ 0 & \overline{t_{2,2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \overline{t_{2,n}} & \cdots & \overline{t_{n,n}} \end{bmatrix}\begin{bmatrix} t_{1,1} & 0 & \cdots & 0 \\ 0 & t_{2,2} & \cdots & t_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & t_{n,n} \end{bmatrix}\right]_{2,2} = |t_{2,2}|^2,

and

    [TT^*]_{2,2} = \left[\begin{bmatrix} t_{1,1} & 0 & \cdots & 0 \\ 0 & t_{2,2} & \cdots & t_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & t_{n,n} \end{bmatrix}\begin{bmatrix} \overline{t_{1,1}} & 0 & \cdots & 0 \\ 0 & \overline{t_{2,2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \overline{t_{2,n}} & \cdots & \overline{t_{n,n}} \end{bmatrix}\right]_{2,2} = |t_{2,2}|^2 + |t_{2,3}|^2 + \cdots + |t_{2,n}|^2,

which implies |t2,2 |2 = |t2,2 |2 + |t2,3 |2 + · · · + |t2,n |2 . Since each term in this sum
is non-negative, it follows that |t2,3 |2 = · · · = |t2,n |2 = 0, so the only non-zero
entry in the second row of T is its (2, 2)-entry.
By repeating this argument for [T ∗ T ]k,k = [T T ∗ ]k,k for each 3 ≤ k ≤ n, we
similarly see that all of the off-diagonal entries in T equal 0, so T is diagonal.
We can thus simply choose D = T , and the proof is complete. 
The spectral decomposition is one of the most important theorems in all of
linear algebra, and it can be interpreted in at least three different ways,
so it is worth spending some time on each of these interpretations.
Interpretation 1. A matrix is normal if and only if its Schur triangularizations are actually
diagonalizations. To be completely clear, the following statements about
a matrix A ∈ Mn (C) are all equivalent:
• A is normal.
• At least one Schur triangularization of A is diagonal.
• Every Schur triangularization of A is diagonal.
The equivalence of the final two points above might seem somewhat
strange given how non-unique Schur triangularizations are, but most of
that non-uniqueness comes from the strictly upper triangular portion of
the triangular matrix T in A = UTU ∗ . Setting all of the strictly upper
triangular entries of T to 0 gets rid of most of this non-uniqueness.
Interpretation 2. A matrix is normal if and only if it is diagonalizable (in the usual sense
     of Theorem 2.0.1) via a unitary matrix. In particular, recall that the
     columns of the matrix P in a diagonalization A = PDP^{-1} necessarily
     form a basis (of C^n) of eigenvectors of A. It follows that we can
     compute spectral decompositions simply via the usual diagonalization
     procedure (illustrated in Appendix A.1.7), but making sure to choose the
     eigenvectors to be an orthonormal basis of C^n. This method is much
     quicker and easier to use than the method suggested by Interpretation 1,
     since Schur triangularizations are awful to compute.
 
Example 2.1.6 (A Small Spectral Decomposition). Find a spectral decomposition
of the matrix

    A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}.

Solution:
We start by finding its eigenvalues:

    det(A − λI) = det\begin{bmatrix} 1-λ & 2 \\ 2 & 1-λ \end{bmatrix} = (1 − λ)^2 − 4 = λ^2 − 2λ − 3.

Setting this polynomial equal to 0 and solving (either via factoring or the
quadratic equation) gives λ = −1 and λ = 3 as the eigenvalues of A. To find
the corresponding eigenvectors, we solve the equations (A − λI)v = 0 for each
of λ = −1 and λ = 3. This is just the usual diagonalization procedure, with
some minor care being taken to choose eigenvectors to have length 1 (so that
the matrix we place them in will be unitary).

λ = −1: The system of equations (A + I)v = 0 can be solved as follows:

    \begin{bmatrix} 2 & 2 & 0 \\ 2 & 2 & 0 \end{bmatrix} \xrightarrow{R_2-R_1} \begin{bmatrix} 2 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix} \xrightarrow{\frac{1}{2}R_1} \begin{bmatrix} 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}.

    It follows that v2 = −v1, so any vector of the form (v1, −v1) (with
    v1 ≠ 0) is an eigenvector to which this eigenvalue corresponds. However,
    we want to choose it to have length 1, since we want the eigenvectors to
    form an orthonormal basis of C^n. We thus choose v1 = 1/√2 so that the
    eigenvector is (1, −1)/√2. (We could have also chosen v1 = −1/√2, or even
    something like v1 = i/√2 if we are feeling funky.)

λ = 3: The system of equations (A − 3I)v = 0 can be solved as follows:

    \begin{bmatrix} -2 & 2 & 0 \\ 2 & -2 & 0 \end{bmatrix} \xrightarrow{R_2+R_1} \begin{bmatrix} -2 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix} \xrightarrow{-\frac{1}{2}R_1} \begin{bmatrix} 1 & -1 & 0 \\ 0 & 0 & 0 \end{bmatrix}.

    We conclude that v2 = v1, so we choose (1, 1)/√2 to be our (unit length)
    eigenvector.

To construct a spectral decomposition of A, we then place the eigenvalues as
the diagonal entries in a diagonal matrix D, and we place the eigenvectors to
which they correspond as columns (in the same order) in a matrix U:

    D = \begin{bmatrix} -1 & 0 \\ 0 & 3 \end{bmatrix}
    \quad\text{and}\quad
    U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}.

It is then straightforward to verify that U is indeed unitary, and A = UDU∗,
so we have found a spectral decomposition of A.
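Numerically, a spectral decomposition of a Hermitian matrix like this one can
be obtained from NumPy's eigh routine, which returns the eigenvalues together
with an orthonormal set of eigenvectors. The sketch below is not part of the
original text; note that eigh lists eigenvalues from smallest to largest, so
here D = diag(−1, 3), just as in the example.

# Spectral decomposition of A = [[1, 2], [2, 1]] via numpy.linalg.eigh.
import numpy as np

A = np.array([[1, 2],
              [2, 1]])
eigenvalues, U = np.linalg.eigh(A)
D = np.diag(eigenvalues)

print(eigenvalues)                             # [-1.  3.]
print(np.allclose(U @ D @ U.conj().T, A))      # True: A = U D U*
print(np.allclose(U.conj().T @ U, np.eye(2)))  # True: U is unitary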

Interpretation 3. A matrix is normal if and only if it represents a linear transformation
     that stretches (but does not rotate or skew) some orthonormal basis of
     C^n (i.e., some unit square grid in C^n, as in Figure 2.1). The reason
     for this is that if A = UDU∗ is a spectral decomposition then the
     (mutually orthogonal) columns u1, u2, . . . , un of U are eigenvectors
     of A:

         Au_j = UDU∗u_j = UDe_j = U(d_j e_j) = d_j u_j   for all 1 ≤ j ≤ n.

     (To see that U∗u_j = e_j, recall that Ue_j = u_j and multiply on the
     left by U∗.) Equivalently, normal matrices are the linear
     transformations that look diagonal in some orthonormal basis B of C^n
     (whereas diagonalizable matrices are the linear transformations that
     look diagonal in some not-necessarily-orthonormal basis).
In Example 2.1.6, we chose the eigenvectors so as to make the diagonalizing
matrix unitary, but the fact that they were orthogonal to each other came for
free as a consequence of A being normal. This always happens (for normal
matrices) with eigenvectors corresponding to different eigenvalues—a fact that
we now state formally as a corollary of the spectral decomposition.

Corollary 2.1.5 (Normal Matrices have Orthogonal Eigenspaces). Suppose
A ∈ Mn(C) is normal. If v, w ∈ C^n are eigenvectors of A corresponding to
different eigenvalues then v · w = 0.
Proof. The idea is simply that in a spectral decomposition A = UDU∗, the
columns of U can be partitioned so as to span the eigenspaces of A, and since
those columns are mutually orthogonal, so are all vectors in the eigenspaces.

More precisely, suppose Av = λv and Aw = µw for some λ ≠ µ (i.e., these are
the two different eigenvalues of A described by the theorem). By the spectral
decomposition (Theorem 2.1.4), we can write A = UDU∗, where U is a unitary
matrix and D is diagonal. Furthermore, we can permute the diagonal entries of
D (and the columns of U correspondingly) so as to group all of the
occurrences of λ and µ on the diagonal of D together:

    D = \begin{bmatrix} λ I_k & O & O \\ O & µ I_ℓ & O \\ O & O & D_2 \end{bmatrix},

where D_2 is diagonal and does not contain either λ or µ as any of its
diagonal entries. Here, k and ℓ are the multiplicities of the eigenvalues λ
and µ of A (i.e., they are the dimensions of the corresponding eigenspaces).
If we write U = [ u_1 | u_2 | · · · | u_n ] then {u_1, . . . , u_k} and
{u_{k+1}, . . . , u_{k+ℓ}} are orthonormal bases of the eigenspaces of A
corresponding to the eigenvalues λ and µ, respectively. We can thus write v
and w as linear combinations of these vectors:

    v = c_1 u_1 + · · · + c_k u_k   and   w = d_1 u_{k+1} + · · · + d_ℓ u_{k+ℓ}.

We then simply compute

    v · w = \left(\sum_{i=1}^{k} c_i u_i\right) \cdot \left(\sum_{j=1}^{ℓ} d_j u_{k+j}\right) = \sum_{i=1}^{k}\sum_{j=1}^{ℓ} \overline{c_i}\, d_j (u_i \cdot u_{k+j}) = 0,

since U is unitary so its columns are mutually orthogonal.  ∎

Be slightly careful—the spectral decomposition only says that there exists an
orthonormal basis of C^n consisting of eigenvectors (i.e., the columns of U),
not that every basis of eigenvectors is orthonormal.
The above corollary tells us that, when constructing a spectral decomposition
of a normal matrix, eigenvectors corresponding to different eigenvalues will
be orthogonal "for free", so we just need to make sure to scale them to have
length 1. On the other hand, we have to be slightly careful when constructing
multiple eigenvectors corresponding to the same eigenvalue—we have to choose
them to be orthogonal to each other. This is a non-issue when the eigenvalues
of the matrix are distinct (and thus its eigenspaces are 1-dimensional), but
it requires some attention when some eigenvalues have multiplicity greater
than 1. (For diagonalizable matrices, algebraic and geometric multiplicities
coincide, so here we just call them both "multiplicity".)
 
Example 2.1.7 (A Larger Spectral Decomposition). Find a spectral
decomposition of the matrix

    A = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}.

Solution:
It is straightforward to check that A∗A = AA∗, so this matrix is normal and
thus has a spectral decomposition. Its characteristic polynomial is
λ^2(λ^2 − 4λ + 8), so its eigenvalues are λ = 0, with algebraic multiplicity
2, and λ = 2 ± 2i, each with algebraic multiplicity 1. (No, this
characteristic polynomial is not fun to come up with. Just compute
det(A − λI) like usual though.)

For the eigenvalues λ = 2 ± 2i, we find corresponding normalized eigenvectors
to be v = (1 ± i, 1 ± i, 1 ∓ i, 1 ∓ i)/(2√2), respectively. (Recall that
eigenvalues and eigenvectors of real matrices come in complex conjugate
pairs.) What happens when λ = 0 is a bit more interesting, so we now
explicitly demonstrate the computation of those corresponding eigenvectors by
solving the equation (A − λI)v = 0 when λ = 0:

λ = 0: The system of equations (A − 0I)v = 0 can be solved as follows (these
    row operations are performed sequentially, i.e., one after another, not
    simultaneously):

    \begin{bmatrix} 1 & 1 & -1 & -1 & 0 \\ 1 & 1 & -1 & -1 & 0 \\ 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 0 \end{bmatrix}
    \xrightarrow{\substack{R_2-R_1 \\ R_3-R_1 \\ R_4-R_1}}
    \begin{bmatrix} 1 & 1 & -1 & -1 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 2 & 0 \\ 0 & 0 & 2 & 2 & 0 \end{bmatrix}
    \xrightarrow{\substack{\frac{1}{2}R_3 \\ R_1+R_3 \\ R_4-2R_3 \\ R_2\leftrightarrow R_3}}
    \begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}.

    The first row of the RREF above tells us that the eigenvectors
    v = (v1, v2, v3, v4) satisfy v1 = −v2, and its second row tells us that
    v3 = −v4, so the eigenvectors of A corresponding to the eigenvalue λ = 0
    are exactly those of the form v = v2(−1, 1, 0, 0) + v4(0, 0, −1, 1).

    We want to choose two vectors of this form (since λ = 0 has multiplicity
    2), but we have to be slightly careful to choose them so that not only
    are they normalized, but they are also orthogonal to each other. In this
    case, there is an obvious choice: (−1, 1, 0, 0)/√2 and (0, 0, −1, 1)/√2
    are orthogonal, so we choose them. (We could have also chosen
    (1, −1, 0, 0)/√2 and (0, 0, 1, −1)/√2, or (1, −1, 1, −1)/2 and
    (1, −1, −1, 1)/2, but not (1, −1, 0, 0)/√2 and (1, −1, 1, −1)/2, for
    example.)

To construct a spectral decomposition of A, we then place the eigenvalues as
the diagonal entries in a diagonal matrix D, and we place the normalized and
orthogonal eigenvectors to which they correspond as columns (in the same
order) in a matrix U:

    D = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 2+2i & 0 \\ 0 & 0 & 0 & 2-2i \end{bmatrix},
    \quad
    U = \frac{1}{2\sqrt{2}}\begin{bmatrix} -2 & 0 & 1+i & 1-i \\ 2 & 0 & 1+i & 1-i \\ 0 & -2 & 1-i & 1+i \\ 0 & 2 & 1-i & 1+i \end{bmatrix}.

To double-check our work, we could verify that U is indeed unitary, and
A = UDU∗, so we have indeed found a spectral decomposition of A. (I'm glad
that we did not do this for a 5 × 5 matrix. Maybe we'll save that for the
exercises?)

In the previous example, we were able to find an orthonormal basis of the
exercises? In the previous example, we were able to find an orthonormal basis of the
2-dimensional eigenspace just by inspection. However, it might not always be
so easy—if we cannot just “see” an orthonormal basis of an eigenspace, then
we can construct one by applying the Gram–Schmidt process (Theorem 1.4.6)
to any basis of the space.
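A numerical cross-check of Example 2.1.7 is sketched below (not part of the
original text). Since that matrix is normal but not Hermitian, NumPy's eigh
does not apply; instead we can take a complex Schur triangularization and
observe that the triangular factor is (numerically) diagonal, exactly as
Interpretation 1 above predicts for normal matrices.

# Complex Schur form of the normal 4x4 matrix from Example 2.1.7.
import numpy as np
from scipy.linalg import schur

A = np.array([[1, 1, -1, -1],
              [1, 1, -1, -1],
              [1, 1,  1,  1],
              [1, 1,  1,  1]], dtype=complex)

T, U = schur(A, output='complex')
off_diagonal = T - np.diag(np.diag(T))
print(np.allclose(off_diagonal, 0))        # True: the Schur form of a normal matrix is diagonal
print(np.round(np.diag(T), 6))             # eigenvalues 2±2i and 0 (in some order)
print(np.allclose(U @ T @ U.conj().T, A))  # True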
Now that we know that every normal matrix is diagonalizable, it’s worth
taking a moment to remind ourselves of the relationships between the various
families of matrices that we have introduced so far. See Figure 2.2 for a visual
representation of these relationships and a reminder of which matrix families
contain each other.

The sizes of the sets shown in Figure 2.2 are slightly misleading. For
example, the set of diagonalizable matrices is "dense" in Mn(C) (see
Section 2.D) and thus quite large, whereas the set of normal matrices is
tiny.

Figure 2.2: A visualization of the containments of several important families
of matrices within each other. For example, every unitary matrix is normal,
every normal matrix is diagonalizable, and the only matrix that is both
Hermitian and skew-Hermitian is the zero matrix O.

2.1.3 The Real Spectral Decomposition

Since the spectral decomposition applies to all (square) complex matrices, it
automatically applies to all real matrices in the sense that, if A ∈ Mn(R) is
normal, then we can find a unitary U ∈ Mn(C) and a diagonal D ∈ Mn(C) such
that A = UDU∗. However, U and D might have complex entries, even if A is
real. For example, the eigenvalues of

    A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}

are ±i, with corresponding eigenvectors (1, ±i)/√2. It follows that a
diagonal matrix D and unitary matrix U providing a spectral decomposition of
A are

    D = \begin{bmatrix} i & 0 \\ 0 & -i \end{bmatrix}
    \quad\text{and}\quad
    U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ i & -i \end{bmatrix},

and furthermore there is no spectral decomposition of A that makes use of a
real D and real U. (We saw another example of a real matrix with a complex
spectral decomposition in Example 2.1.7.)

This observation raises a natural question: which real matrices have a real
spectral decomposition (i.e., a spectral decomposition in which both the
diagonal matrix D and unitary matrix U are real)? Certainly any such matrix
must be symmetric since if A = UDU^T for some diagonal D ∈ Mn(R) and unitary
U ∈ Mn(R) then

    A^T = (UDU^T)^T = (U^T)^T D^T U^T = UDU^T = A.

(Keep in mind that if U is real then U^T = U∗, so UDU^T really is a spectral
decomposition.) Remarkably, the converse of the above observation is also
true—every real symmetric matrix has a spectral decomposition consisting of
real matrices:

Theorem 2.1.6 (Spectral Decomposition (Real Version)). Suppose A ∈ Mn(R).
Then there exists a unitary matrix U ∈ Mn(R) and a diagonal matrix D ∈ Mn(R)
such that

    A = UDU^T

if and only if A is symmetric (i.e., A = A^T).

Proof. We already proved the “only if” direction of the theorem, so we jump
straight to proving the “if” direction. To this end, recall that the complex
spectral decomposition (Theorem 2.1.4) says that we can find a complex unitary
matrix U ∈ Mn (C) and diagonal matrix D ∈ Mn (C) such that A = UDU ∗ ,
and the columns of U are eigenvectors of A corresponding to the eigenvalues
along the diagonal of D.
To see that D must in fact be real, we observe that since A is real and
symmetric, we have

    λ‖v‖^2 = λ v∗v = v∗Av = v∗A∗v = (v∗Av)∗ = (λ v∗v)∗ = \overline{λ}\,‖v‖^2,

which implies λ = \overline{λ} (since every eigenvector v is, by definition,
non-zero), so λ is real. In fact, this argument also shows that every
(potentially complex) Hermitian matrix has only real eigenvalues.

To see that we can similarly choose U to be real, we must construct an
orthonormal basis of R^n consisting of eigenvectors of A. To do so, we first
recall from Corollary 2.1.5 that eigenvectors v and w of A corresponding to
different eigenvalues λ ≠ µ, respectively, are necessarily orthogonal to each
other, since real symmetric matrices are normal.

It thus suffices to show that we can find a real orthonormal basis of each
eigenspace of A, and since we can use the Gram–Schmidt process
(Theorem 1.4.6) to find an orthonormal basis of any subspace, it suffices to
just show that each eigenspace of A is the span of a set of real vectors. The
key observation that demonstrates this fact is that if a vector v ∈ C^n is in
the eigenspace of A corresponding to an eigenvalue λ then so is \overline{v},
since taking the complex conjugate of the equation Av = λv gives

    A\overline{v} = \overline{A}\,\overline{v} = \overline{Av} = \overline{λv} = λ\overline{v}.

(Recall that \overline{v} is the complex conjugate of v: if v = x + iy for
some x, y ∈ R^n then \overline{v} = x − iy. Also, Re(v) = (v + \overline{v})/2
and Im(v) = (v − \overline{v})/(2i) are called the real and imaginary parts of
v. Refer to Appendix A.3 for a refresher on complex numbers.)

In particular, this implies that if {v1, v2, . . . , vk} is any basis of the
eigenspace, then the set {Re(v1), Im(v1), Re(v2), Im(v2), . . . , Re(vk),
Im(vk)} has the same span. Since each vector in this set is real, the proof
is now complete.  ∎

Another way of phrasing the real spectral decomposition is as saying that if
V is a real inner product space then a linear transformation T : V → V is
self-adjoint if and only if T looks diagonal in some orthonormal basis of V.
Geometrically, this means that T is self-adjoint if and only if it looks like
a rotation and/or reflection (i.e., a unitary matrix U), composed with a
diagonal scaling, composed with a rotation and/or reflection back (i.e., the
inverse unitary matrix U∗). (The real spectral decomposition can also be
proved "directly", without making use of its complex version. See
Exercise 2.1.20.)

This gives us the exact same geometric interpretation of the real spectral
decomposition that we had for its complex counterpart—symmetric matrices are
exactly those that stretch (but do not rotate or skew) some orthonormal
basis. However, the important distinction here from the case of normal
matrices is that the eigenvalues and eigenvectors of symmetric matrices are
real, so we can really visualize this all happening in R^n rather than in
C^n, as in Figure 2.3.

Recall that [v]_B refers to the coordinate vector of v in the basis B. That
is, [v]_B = U∗v, and U is the change-of-basis matrix from the orthonormal
eigenbasis B to the standard basis E = {e1, e2}: U = P_{E←B}.

Figure 2.3: The matrix A from Example 2.1.6 looks diagonal if we view it in
the orthonormal basis B = {(1, 1)/√2, (1, −1)/√2}. There exists an
orthonormal basis with this property precisely because A is symmetric.

In order to actually compute a real spectral decomposition of a symmetric


matrix, we just do what we always do—we find the eigenvalues and eigenvec-
tors of the matrix and use them to diagonalize it, taking the minor care necessary
to ensure that the eigenvectors are chosen to be real, mutually orthogonal, and
of length 1.

 
Example 2.1.8 (A Real Spectral Decomposition). Find a real spectral
decomposition of the matrix

    A = \begin{bmatrix} 1 & 2 & 2 \\ 2 & 1 & 2 \\ 2 & 2 & 1 \end{bmatrix}.

Solution:
Since this matrix is symmetric, we know that it has a real spectral
decomposition. To compute it, we first find its eigenvalues:

    det(A − λI) = det\begin{bmatrix} 1-λ & 2 & 2 \\ 2 & 1-λ & 2 \\ 2 & 2 & 1-λ \end{bmatrix}
                = (1 − λ)^3 + 8 + 8 − 4(1 − λ) − 4(1 − λ) − 4(1 − λ)
                = (1 + λ)^2 (5 − λ).

Setting the above characteristic polynomial equal to 0 gives λ = −1 (with
multiplicity 2) or λ = 5 (with multiplicity 1). To find the corresponding
eigenvectors, we solve the linear systems (A − λI)v = 0 for each of λ = 5 and
λ = −1.

λ = 5: The system of equations (A − 5I)v = 0 can be solved as follows:

    \begin{bmatrix} -4 & 2 & 2 & 0 \\ 2 & -4 & 2 & 0 \\ 2 & 2 & -4 & 0 \end{bmatrix}
    \xrightarrow{\substack{R_2+\frac{1}{2}R_1 \\ R_3+\frac{1}{2}R_1}}
    \begin{bmatrix} -4 & 2 & 2 & 0 \\ 0 & -3 & 3 & 0 \\ 0 & 3 & -3 & 0 \end{bmatrix}
    \xrightarrow{R_3+R_2}
    \begin{bmatrix} -4 & 2 & 2 & 0 \\ 0 & -3 & 3 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.

    From here we can use back substitution to see that v1 = v2 = v3, so any
    vector of the form (v1, v1, v1) (with v1 ≠ 0) is an eigenvector to which
    this eigenvalue corresponds. However, we want to choose it to have
    length 1, so we choose v1 = 1/√3, so that the eigenvector is (1, 1, 1)/√3.
    (We could have also chosen v1 = −1/√3 so that the eigenvector would be
    (−1, −1, −1)/√3.)

λ = −1: The system of equations (A + I)v = 0 can be solved as follows:

    \begin{bmatrix} 2 & 2 & 2 & 0 \\ 2 & 2 & 2 & 0 \\ 2 & 2 & 2 & 0 \end{bmatrix}
    \xrightarrow{\substack{R_2-R_1 \\ R_3-R_1}}
    \begin{bmatrix} 2 & 2 & 2 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.

    We conclude that v1 = −v2 − v3, so the eigenvectors to which this
    eigenvalue corresponds have the form v2(−1, 1, 0) + v3(−1, 0, 1). Since
    this eigenspace has dimension 2, we have to choose two eigenvectors, and
    we furthermore have to be careful to choose them to form an orthonormal
    basis of their span (i.e., they must be orthogonal to each other, and we
    must scale them each to have length 1).

    One of the easiest pairs of orthogonal eigenvectors to "eyeball" is
    {(−2, 1, 1), (0, 1, −1)} (which correspond to choosing (v2, v3) = (1, 1)
    and (v2, v3) = (1, −1), respectively). After normalizing these
    eigenvectors, we get

        { (−2, 1, 1)/√6, (0, 1, −1)/√2 }

    as the orthonormal basis of this eigenspace. (Alternatively, we could
    construct an orthonormal basis of this eigenspace by applying
    Gram–Schmidt to {(−1, 1, 0), (−1, 0, 1)}.)

To construct a spectral decomposition of A, we then place the eigenvalues as
the diagonal entries in a diagonal matrix D, and we place the eigenvectors to
which they correspond as columns (in the same order) in a matrix U:

    D = \begin{bmatrix} 5 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{bmatrix}
    \quad\text{and}\quad
    U = \frac{1}{\sqrt{6}}\begin{bmatrix} \sqrt{2} & -2 & 0 \\ \sqrt{2} & 1 & \sqrt{3} \\ \sqrt{2} & 1 & -\sqrt{3} \end{bmatrix}.

Note that we pulled a common factor of 1/√6 outside of U, which makes its
columns look slightly different from what we saw when computing the
eigenvectors. To double-check our work, we could verify that U is indeed
unitary, and A = UDU^T, so we have indeed found a real spectral decomposition
of A.
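The real spectral decomposition of this example can be reproduced numerically
with NumPy's eigh; for a real symmetric input it returns real eigenvalues and
a real orthogonal matrix of eigenvectors. This is a sketch, not part of the
original text.

# Real spectral decomposition of the 3x3 symmetric matrix from Example 2.1.8.
import numpy as np

A = np.array([[1.0, 2.0, 2.0],
              [2.0, 1.0, 2.0],
              [2.0, 2.0, 1.0]])
eigenvalues, U = np.linalg.eigh(A)

print(eigenvalues)                                      # [-1. -1.  5.]
print(np.allclose(U @ np.diag(eigenvalues) @ U.T, A))   # True: A = U D U^T
print(np.allclose(U.T @ U, np.eye(3)))                  # True: U is real orthogonal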

Before moving on, it is worth having a brief look at Table 2.1, which sum-
marizes the relationship between the real and complex spectral decompositions,
and what they say about normal, Hermitian, and symmetric matrices.

    Matrix A                    Decomposition A = UDU∗      Eigenvalues, eigenvectors
    -----------------------     ------------------------    -------------------------
    Normal (A∗A = AA∗)          D ∈ Mn(C), U ∈ Mn(C)        Eigenvalues: complex
                                                            Eigenvectors: complex
    Hermitian (A∗ = A)          D ∈ Mn(R), U ∈ Mn(C)        Eigenvalues: real
                                                            Eigenvectors: complex
    Real symmetric (A^T = A)    D ∈ Mn(R), U ∈ Mn(R)        Eigenvalues: real
                                                            Eigenvectors: real

Table 2.1: A summary of which parts of the spectral decomposition are real
and complex for different types of matrices.

Matrices that are complex symmetric (i.e., A ∈ Mn(C) and A^T = A) are not
necessarily normal and thus may not have a spectral decomposition. However,
they have a related decomposition that is presented in Exercise 2.3.26.

Exercises solutions to starred exercises on page 465

2.1.1 For each of the following matrices, say whether it is 2.1.4 Find a spectral decomposition of each of the follow-
(i) unitary, (ii) Hermitian, (iii) skew-Hermitian, (iv) normal. ing normal matrices.
" # " #
It may have multiple properties or even none of the listed ∗ (a) 3 2 (b) 1 1
properties.
" # " # 2 3 −1 1
" #  
∗ (a) 2 2 (b) 1 2 ∗ (c) 0 −i (d) 1 0 1
−2 2 3 4  
" # " # i 0 0 1 0
 
∗ (c) 1 1 2i (d) 1 0 ∗ (e) 1 1 0 1 0 1
√  
5 2i 1 0 1   (f) 2i 0 0
" # " # 0 1 1
 
∗ (e) 0 0 (f) 1 1+i 1 0 1  0 1 + i −1 + i
0 0 1+i 1 0 −1 + i 1 + i
" # " #
∗ (g) 0 −i (h) i 1
i 0 −1 2i 2.1.5 Determine which of the following statements are
true and which are false.
(a) If A = UTU ∗ is a Schur triangularization of A then
2.1.2 Determine which of the following matrices are nor-
the eigenvalues of A are along the diagonal of T .
mal.
" # " # ∗ (b) If A, B ∈ Mn (C) are normal then so is A + B.
∗ (a) 2 −1 (b) 1 1 1 (c) If A, B ∈ Mn (C) are normal then so is AB.
∗ (d) The set of normal matrices is a subspace of Mn (C).
1 3 1 1 1
" #   (e) If A, B ∈ Mn (C) are similar and A is normal, then B
∗ (c) 1 1 (d) 1 2 3 is normal too.
 
−1 1 3 1 2 ∗ (f) If A ∈ Mn (R) is normal then there exists a uni-
  tary matrix U ∈ Mn (R) and a diagonal matrix D ∈
∗ (e) 1 2 −3i 2 3 1
  Mn (R) such that A = UDU T .
  (f) 2 + 3i 0 0
2 2 2  (g) If A = UTU ∗ is a Schur triangularization of A then
 
3i 2 4  0 7i 0 A2 = UT 2U ∗ is a Schur triangularization of A2 .
 
∗ (g) 1 2 0 0 0 18 ∗ (h) If all of the eigenvalues of A ∈ Mn (C) are real, it
√ √ 
  (h) must be Hermitian.
3 4 5 2 2 i
  (i) If A ∈ M3 (C) has eigenvalues 1, 1, and 0 (counted
0 6 7  1 1 1


 via algebraic multiplicity), then A3 = 2A2 − A.
2 i 1
2.1.6 Suppose A ∈ Mn (C). Show that there exists a uni-
2.1.3 Compute a Schur triangularization of the following tary matrix U ∈ Mn (C) and a lower triangular matrix
matrices. L ∈ Mn (C) such that
" # " #
∗ (a) 6 −3 (b) 7 −5 A = ULU ∗ .
2 1 −1 3
∗2.1.7 Suppose A ∈ M2 (C). Use the Cayley–Hamilton
theorem to find an explicit formula for A−1 in terms of the
entries of A (assuming that A is invertible).

 √  (c) Show that A is unitary if and only if its eigenvalues


2.1.8 Compute A2718 if A = 3 2 −3 .
√ lie on the unit circle in the complex plane.
 2 √ 
 0 2
 (d) Explain why the “if” direction of each of the above
√ statements fails if A is not normal.
−3 − 2 3
[Hint: The Cayley–Hamilton theorem might help.]
2.1.18 Show that if A ∈ Mn (C) is Hermitian, then eiA is
  unitary.
∗2.1.9 Suppose A = 4 0 0 .
  [Side note: Recall that eiA is the matrix obtained via expo-
3 1 −3 nentiating the diagonal part of iA’s diagonalization (which
3 −3 1 in this case is a spectral decomposition).]
(a) Use the Cayley–Hamilton theorem to write A−1 as a
linear combination of I, A, and A2 . ∗∗2.1.19 Suppose A ∈ Mn .
(b) Write A−1 as a linear combination of A2 , A3 , and A4 . (a) Show that if A is normal then the number of non-zero
eigenvalues of A (counting algebraic multiplicity)
2.1.10 Suppose A ∈ Mn has characteristic polynomial equals rank(A).
pA (λ ) = det(A − λ I). Explain the problem with the follow- (b) Provide an example to show that the result of part (a)
ing one-line “proof” of the Cayley–Hamilton theorem: is not necessarily true if A is not normal.

pA (A) = det(A − AI) = det(O) = 0.


∗∗2.1.20 In this exercise, we show how to prove the real
spectral decomposition (Theorem 2.1.6) “directly”, with-
2.1.11 A matrix A ∈ Mn (C) is called nilpotent if there out relying on the complex spectral decomposition (Theo-
exists a positive integer k such that Ak = O. rem 2.1.4). Suppose A ∈ Mn (R) is symmetric.
(a) Show that A is nilpotent if and only if all of its eigen- (a) Show that all of the eigenvalues of A are real.
values equal 0. (b) Show that A has a real eigenvector.
(b) Show that if A is nilpotent then Ak = O for some (c) Complete the proof of the real spectral decomposi-
k ≤ n. tion by mimicking the proof of Schur triangulariza-
tion (Theorem 2.1.1).
∗∗ 2.1.12 Suppose that A ∈ Mn (C) has eigenvalues
λ1 , λ2 , . . . , λn (listed according to algebraic multiplicity). 2.1.21 Suppose A ∈ Mn (C) is skew-Hermitian (i.e.,
Show that A is normal if and only if A∗ = −A).
s
n (a) Show that I + A is invertible.
kAkF = ∑ |λ j |2 , (b) Show that UA = (I + A)−1 (I − A) is unitary.
j=1
[Side note: UA is called the Cayley transform of A.]
where kAkF is the Frobenius norm of A. (c) Show that if A is real then det(UA ) = 1.

∗∗2.1.13 Show that A ∈ Mn (C) is normal if and only if 2.1.22 Suppose U ∈ Mn (C) is a skew-symmetric unitary
kAvk = kA∗ vk for all v ∈ Cn . matrix (i.e., U T = −U and U ∗U = I).
[Hint: Make use of Exercise 1.4.19.] (a) Show that n must be even (i.e., show that no such
matrix exists when n is odd).
(b) Show that the eigenvalues of U are ±i, each with
∗∗2.1.14 Show that A ∈ Mn (C) is normal if and only if multiplicity equal to n/2.
A∗ ∈ span I, A, A2 , A3 , . . . . (c) Show that there is a unitary matrix V ∈ Mn (C) such
[Hint: Apply the spectral decomposition to A and think that
about interpolating polynomials.] " #
∗ 0 1
U = V BV , where Y = and
−1 0
2.1.15 Suppose A, B ∈ Mn (C) are unitarily similar (i.e.,  
there is a unitary U ∈ Mn (C) such that B = UAU ∗ ). Show Y O ··· O
that if A is normal then so is B.  
O Y · · · O
 
B=. .. ..  .
. .. 
2.1.16 Suppose T ∈ Mn (C) is upper triangular. Show that . . . .
T is diagonal if and only if it is normal. O O ··· Y
[Hint: We actually showed this fact somewhere in this [Hint: Find a complex spectral decomposition of Y .]
section—just explain where.]
[Side note: This exercise generalizes Exercise 1.4.12.] ∗∗ 2.1.23 In this exercise, we finally show that if V
is a finite-dimensional inner product space over R and
∗∗2.1.17 Suppose A ∈ Mn (C) is normal. P : V → V is a projection then P is orthogonal (i.e.,
hP(v), v − P(v)i = 0 for all v ∈ V) if and only if it is self-
(a) Show that A is Hermitian if and only if its eigenvalues adjoint (i.e., P = P∗ ). Recall that we proved this when the
are real. ground field is C in Exercise 1.4.29.
(b) Show that A is skew-Hermitian if and only if its
eigenvalues are imaginary. (a) Show that if P is self-adjoint then it is orthogonal.

(b) Use Exercise 1.4.28 to show that if P is orthogonal 2.1.27 Suppose A, B ∈ Mn (C) commute (i.e., AB = BA).
then the linear transformation T = P−P∗ ◦P satisfies Show that there is a common unitary matrix U ∈ Mn (C)
T ∗ = −T . that triangularizes them both:
(c) Use part (b) to show that if P is orthogonal then it
is self-adjoint. [Hint: Represent T in an orthonormal A = UT1U ∗ and B = UT2U ∗
basis, take its trace, and use Exercise 2.1.12.]
for some upper triangular T1 , T2 ∈ Mn (C).

2.1.24 Show that if P ∈ Mn (C) is an orthogonal projec- [Hint: Use Exercise 2.1.26 and mimic the proof of Schur
tion (i.e., P = P∗ = P2 ) then there exists a unitary matrix triangularization.]
U ∈ Mn (C) such that
" # 2.1.28 Suppose A, B ∈ Mn (C) are normal. Show that A
Ir O ∗
P=U U , and B commute (i.e., AB = BA) if and only if there is a
O O common unitary matrix U ∈ Mn (C) that diagonalizes them
where r = rank(P). both:
A = UD1U ∗ and B = UD2U ∗

2.1.25 A circulant matrix is a matrix C ∈ Mn (C) of the for some diagonal D1 , D2 ∈ Mn (C).
form [Hint: Leech off of Exercise 2.1.27.]
 
c0 c1 c2 · · · cn−1 [Side note: This result is still true if we replace “normal” by
cn−1 c0 c1 · · · cn−2  “real symmetric” and C by R throughout the exercise.]
 
cn−2 cn−1 c0 · · · cn−3 
C=  ,

 .. .. .. .. ..  2.1.29 Suppose A, B ∈ Mn (C) are diagonalizable. Show
 . . . . . 
that A and B commute (i.e., AB = BA) if and only if there is
c1 c2 c3 · · · c0
a common invertible matrix P ∈ Mn (C) that diagonalizes
where c0 , c1 , . . . , cn−1 are scalars. Show that C can be diago- them both:
nalized by the Fourier matrix F from Exercise 1.4.15. That
is, show that F ∗CF is diagonal. A = PD1 P−1 and B = PD2 P−1
[Hint: It suffices to show that the columns of F are eigen- for some diagonal D1 , D2 ∈ Mn (C).
vectors of C.]
[Hint: When does a diagonal matrix D1 commute with an-
other matrix? Try proving the claim when the eigenvalues
∗2.1.26 Suppose A, B ∈ Mn (C) commute (i.e., AB = BA). of A are distinct first, since that case is much easier. For the
Show that there is a vector v ∈ Cn that is an eigenvector of general case, induction might help you sidestep some of the
each of A and B. ugly details.]
[Hint: This is hard. Maybe just prove it in the case when A
has distinct eigenvalues, which is much easier. The general
case can be proved using techniques like those used in the
proof of Schur triangularization.]

2.2 Positive Semidefiniteness

We have now seen that normal matrices play a particularly important role in
linear algebra, especially when decomposing matrices. We now focus our
attention on one particularly important sub-family of normal matrices that
plays perhaps an even more important role.

Definition 2.2.1 (Positive (Semi)Definite Matrices). Suppose A ∈ Mn(F) is
self-adjoint. Then A is called
    a) positive semidefinite (PSD) if v∗Av ≥ 0 for all v ∈ F^n, and
    b) positive definite (PD) if v∗Av > 0 for all v ≠ 0.

Since A is self-adjoint, F = R or F = C throughout this section (adjoints
only make sense in inner product spaces).

Positive (semi)definiteness is somewhat difficult to eyeball from the entries
of a matrix, and we should emphasize that it does not mean that the entries
of the matrix need to be positive. For example, if

    A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}
    \quad\text{and}\quad
    B = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix},    (2.2.1)

then A is positive semidefinite since

    v^*Av = \begin{bmatrix} \overline{v_1} & \overline{v_2} \end{bmatrix}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix}
          = |v_1|^2 - \overline{v_1}v_2 - \overline{v_2}v_1 + |v_2|^2 = |v_1 - v_2|^2 \geq 0

for all v ∈ C^2. On the other hand, B is not positive semidefinite since if
v = (1, −1) then v∗Bv = −2.

A 1 × 1 matrix (i.e., a scalar) is PSD if and only if it is non-negative. We
often think of PSD matrices as the "matrix version" of the non-negative real
numbers.
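A simple numerical way to test positive semidefiniteness (a sketch, not from
the original text; the function name is ours) is to check that the smallest
eigenvalue of a self-adjoint matrix is non-negative—a characterization that
is made precise in Theorem 2.2.1 below.

# Check positive semidefiniteness of the matrices A and B from Equation (2.2.1).
import numpy as np

def is_psd(A, tol=1e-10):
    """Assumes A is self-adjoint; checks whether all eigenvalues are (numerically) >= 0."""
    return bool(np.min(np.linalg.eigvalsh(A)) >= -tol)

A = np.array([[1, -1], [-1, 1]])
B = np.array([[1, 2], [2, 1]])
print(is_psd(A))   # True  (eigenvalues 0 and 2)
print(is_psd(B))   # False (eigenvalues -1 and 3)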
While the defining property of positive definiteness seems quite strange at
first, it actually has a very natural geometric interpretation. If we recall
that v∗Av = v · (Av) and that the angle θ between v and Av can be computed in
terms of the dot product via

    θ = arccos\left(\frac{v \cdot (Av)}{\|v\|\,\|Av\|}\right),

then we see that the positive definiteness property that v∗Av > 0 for all v
simply says that the angle between v and Av is always acute (i.e.,
0 ≤ θ < π/2). We can thus think of positive (semi)definite matrices as the
Hermitian matrices that do not rotate vectors "too much". In particular, Av
is always in the same half-space as v, as depicted in Figure 2.4. (For
positive semidefinite matrices, the angle between v and Av is always acute or
right, i.e., 0 ≤ θ ≤ π/2.)
For another geometric interpretation of positive (semi)definiteness, see
Section 2.A.1.

Figure 2.4: As a linear transformation, a positive definite matrix is one
that keeps vectors pointing "mostly" in the same direction. In particular,
the angle θ between v and Av never exceeds π/2, so Av is always in the same
half-space as v. Positive semidefiniteness allows for v and Av to be
perpendicular (i.e., orthogonal), so Av can be on the boundary of the
half-space defined by v.

2.2.1 Characterizing Positive (Semi)Definite Matrices


The definition of positive semidefinite matrices that was provided in Defini-
tion 2.2.1 perhaps looks a bit odd at first glance, and showing that a matrix is
(or is not) positive semidefinite seems quite cumbersome so far. The following
theorem characterizes these matrices in several other equivalent ways, some
of which are a bit more illuminating and easier to work with. If we prefer, we
can think of any one of these other characterizations of PSD matrices as their
definition.

Theorem 2.2.1 (Characterization of Positive Semidefinite Matrices). Suppose
A ∈ Mn(F) is self-adjoint. The following are equivalent:
    a) A is positive semidefinite,
    b) All of the eigenvalues of A are non-negative,
    c) There is a matrix B ∈ Mn(F) such that A = B∗B, and
    d) There is a diagonal matrix D ∈ Mn(R) with non-negative diagonal
       entries and a unitary matrix U ∈ Mn(F) such that A = UDU∗.

Recall that, since A is Hermitian, its eigenvalues are all real.

Proof. We prove the theorem by showing that (a) =⇒ (b) =⇒ (d) =⇒ (c)
=⇒ (a).
To see that (a) =⇒ (b), let v be an eigenvector of A with corresponding
eigenvalue λ . Then Av = λ v, and multiplying this equation on the left by v∗
shows that v∗ Av = λ v∗ v = λ kvk2 . Since A is positive semidefinite, we know
that v∗ Av ≥ 0, so it follows that λ ≥ 0 too.
To see that (b) =⇒ (d), we just apply the spectral decomposition theorem
(either the complex Theorem 2.1.4 or the real Theorem 2.1.6, as appropriate)
to A.

To see why (d) =⇒ (c), let √D be the diagonal matrix that is obtained by
taking the (non-negative) square root of each diagonal entry of D, and define
B = √D U∗. Then B∗B = (√D U∗)∗(√D U∗) = U√D√D U∗ = UDU∗ = A. (Another
characterization of positive semidefiniteness is provided in
Exercise 2.2.25.)

Finally, to see that (c) =⇒ (a), we let v ∈ F^n be any vector and we note
that

    v∗Av = v∗B∗Bv = (Bv)∗(Bv) = ‖Bv‖^2 ≥ 0.
It follows that A is positive semidefinite, so the proof is complete. 

Example 2.2.1 (Demonstrating Positive Semidefiniteness). Explicitly show that
all four properties of Theorem 2.2.1 hold for the matrix

    A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}.

Solution:
We already showed that v∗Av = |v1 − v2|^2 ≥ 0 for all v ∈ C^2 at the start of
this section, which shows that A is PSD. We now verify that properties
(b)–(d) of Theorem 2.2.1 are all satisfied as well.

For property (b), we can explicitly compute the eigenvalues of A:

    det\begin{bmatrix} 1-λ & -1 \\ -1 & 1-λ \end{bmatrix} = (1 − λ)^2 − 1 = λ^2 − 2λ = λ(λ − 2) = 0,

so the eigenvalues of A are 0 and 2, which are indeed non-negative. (In
practice, checking non-negativity of its eigenvalues is the simplest of these
methods of checking positive semidefiniteness.)

For property (d), we want to find a unitary matrix U such that A = UDU∗,
where D has 2 and 0 (the eigenvalues of A) along its diagonal. We know from
the spectral decomposition that we can construct U by placing the normalized
eigenvectors of A into U as columns. Eigenvectors corresponding to the
eigenvalues 2 and 0 are v = (1, −1) and v = (1, 1), respectively, so

    U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}
    \quad\text{and}\quad
    D = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix}.

It is then straightforward to verify that A = UDU∗, as desired. (We check
property (d) before (c) so that we can mimic the proof of Theorem 2.2.1.)

For property (c), we let

    B = \sqrt{D}\,U^* = \frac{1}{\sqrt{2}}\begin{bmatrix} \sqrt{2} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 0 & 0 \end{bmatrix}.

Direct computation verifies that it is indeed the case that A = B∗B.
Direct computation verifies that it is indeed the case that A = B∗ B.

The characterization of positive semidefinite matrices provided by Theo-


rem 2.2.1 can be tweaked slightly into a characterization of positive definite
matrices by just making the relevant matrices invertible in each statement. In
particular, we get the following theorem, which just has one or two words
changed in each statement (we have italicized the words that changed from the
previous theorem to make them easier to compare).

Theorem 2.2.2 (Characterization of Positive Definite Matrices). Suppose
A ∈ Mn(F) is self-adjoint. The following are equivalent:
    a) A is positive definite,
    b) All of the eigenvalues of A are strictly positive,
    c) There is an invertible matrix B ∈ Mn(F) such that A = B∗B, and
    d) There is a diagonal matrix D ∈ Mn(R) with strictly positive diagonal
       entries and a unitary matrix U ∈ Mn(F) such that A = UDU∗.

The proof of the above theorem is almost identical to that of Theorem 2.2.1,
so we leave it to Exercise 2.2.23. Instead, we note that there are two additional
characterizations of positive definite matrices that we would like to present, but
we first need some “helper” theorems that make it a bit easier to work with positive
(semi)definite matrices. The first of these results tells us how we can manipulate
positive semidefinite matrices without destroying positive semidefiniteness.

Theorem 2.2.3 (Modifying Positive (Semi)Definite Matrices). Suppose
A, B ∈ Mn are positive (semi)definite, P ∈ Mn,m is any matrix, and c > 0 is a
real scalar. Then
    a) A + B is positive (semi)definite,
    b) cA is positive (semi)definite,
    c) A^T is positive (semi)definite, and
    d) P∗AP is positive semidefinite. Furthermore, if A is positive definite
       then P∗AP is positive definite if and only if rank(P) = m.

Proof. These properties all follow fairly quickly from the definition of
positive semidefiniteness, so we leave the proof of (a), (b), and (c) to
Exercise 2.2.24. (We return to this problem of asking which operations
transform PSD matrices into PSD matrices in Section 3.A.)

To show that property (d) holds, observe that for all v ∈ F^n we have

    0 ≤ (Pv)∗A(Pv) = v∗(P∗AP)v,

so P∗AP is positive semidefinite as well. If A is positive definite then
positive definiteness of P∗AP is equivalent to the requirement that Pv ≠ 0
whenever

v 6= 0, which is equivalent in turn to null(P) = {0} (i.e., nullity(P) = 0). By the


rank-nullity theorem (see Theorem A.1.2), this is equivalent to rank(P) = m,
which completes the proof. 
As a bit of motivation for our next "helper" result, notice that the diagonal
entries of a positive semidefinite matrix A must be non-negative (or strictly
positive, if A is positive definite), since

    0 ≤ e_j^* A e_j = a_{j,j}   for all 1 ≤ j ≤ n,

where e_j is the j-th standard basis vector. The following theorem provides a
natural generalization of this fact to block matrices.

Theorem 2.2.4 (Diagonal Blocks of PSD Block Matrices). Suppose that the
self-adjoint block matrix

    A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{1,2}^* & A_{2,2} & \cdots & A_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{1,n}^* & A_{2,n}^* & \cdots & A_{n,n} \end{bmatrix}

is positive (semi)definite. Then the diagonal blocks A_{1,1}, A_{2,2}, . . .,
A_{n,n} must be positive (semi)definite.

The diagonal blocks here must be square for this block matrix to make sense
(e.g., A_{1,1} is square since it has A_{1,2} to its right and A_{1,2}^*
below it).

Proof. We use property (d) of Theorem 2.2.3. In particular, consider the
block matrices

    P_1 = \begin{bmatrix} I \\ O \\ \vdots \\ O \end{bmatrix}, \quad P_2 = \begin{bmatrix} O \\ I \\ \vdots \\ O \end{bmatrix}, \quad \ldots, \quad P_n = \begin{bmatrix} O \\ O \\ \vdots \\ I \end{bmatrix},

where the sizes of the O and I blocks are such that the matrix multiplication
AP_j makes sense for all 1 ≤ j ≤ n. It is then straightforward to verify that
P_j^* A P_j = A_{j,j} for all 1 ≤ j ≤ n, so A_{j,j} must be positive
semidefinite by Theorem 2.2.3(d). Furthermore, each P_j has rank equal to its
number of columns, so A_{j,j} is positive definite whenever A is.  ∎

Example 2.2.2 (Showing Matrices are Not Positive Semidefinite). Show that the
following matrices are not positive semidefinite.

    a) \begin{bmatrix} 3 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & -1 \end{bmatrix}
    \qquad
    b) \begin{bmatrix} 3 & -1 & 1 & -1 \\ -1 & 1 & 2 & 1 \\ 1 & 2 & 1 & 2 \\ -1 & 1 & 2 & 3 \end{bmatrix}

Solutions:
    a) This matrix is not positive semidefinite since its third diagonal
       entry is −1, and positive semidefinite matrices cannot have negative
       diagonal entries.
    b) This matrix is not positive semidefinite since its central 2 × 2
       diagonal block is

           \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix},

       which is exactly the matrix B from Equation (2.2.1) that we showed is
       not positive semidefinite earlier.

Techniques like these ones for showing that a matrix is (not) PSD are useful
since we often do not want to compute eigenvalues of 4 × 4 (or larger)
matrices.

We are now in a position to introduce the two additional characterizations


of positive definite matrices that we are actually interested in, and that we
introduced the above “helper” theorems for. It is worth noting that both of
these upcoming characterizations really are specific to positive definite ma-
trices. While they can be extended to positive semidefinite matrices, they are
significantly easier to use and interpret in the definite (i.e., invertible) case.
The first of these results says that positive definite matrices exactly charac-
terize inner products on Fn :

Theorem 2.2.5 (Positive Definite Matrices Make Inner Products). A function
⟨·, ·⟩ : F^n × F^n → F is an inner product if and only if there exists a
positive definite matrix A ∈ Mn(F) such that

    ⟨v, w⟩ = v∗Aw   for all v, w ∈ F^n.

Proof. We start by showing that if ⟨·, ·⟩ is an inner product then such a
positive definite matrix A must exist. (If you need a refresher on inner
products, refer back to Section 1.4.) To this end, recall from Theorem 1.4.3
that there exists a basis B of F^n such that

    ⟨v, w⟩ = [v]_B · [w]_B   for all v, w ∈ F^n.

Well, let P_{B←E} be the change-of-basis matrix from the standard basis E to
B. Then [v]_B = P_{B←E}v and [w]_B = P_{B←E}w, so

    ⟨v, w⟩ = [v]_B · [w]_B = (P_{B←E}v) · (P_{B←E}w) = v^*(P_{B←E}^* P_{B←E})w

for all v, w ∈ F^n. Since change-of-basis matrices are invertible, it follows
from Theorem 2.2.2(c) that A = P_{B←E}^* P_{B←E} is positive definite, which
is what we wanted.

In the other direction, we must show that every function ⟨·, ·⟩ of the form
⟨v, w⟩ = v∗Aw is necessarily an inner product when A is positive definite.
We thus have to show that all three defining properties of inner products
from Definition 1.3.6 hold.

For property (a), we note that A = A∗, so

    \overline{⟨w, v⟩} = (w∗Av)∗ = v∗A∗w = v∗Aw = ⟨v, w⟩   for all v, w ∈ F^n.

(Here we used the fact that if c ∈ C is a scalar, then \overline{c} = c∗. If
F = R then these complex conjugations just vanish.)

For property (b), we check that

    ⟨v, w + cx⟩ = v∗A(w + cx) = v∗Aw + c(v∗Ax) = ⟨v, w⟩ + c⟨v, x⟩

for all v, w, x ∈ F^n and all c ∈ F.

Finally, for property (c) we note that A is positive definite, so

    ⟨v, v⟩ = v∗Av ≥ 0   for all v ∈ F^n,

with equality if and only if v = 0. (Notice that we called inner products
"positive definite" way back when we first introduced them in
Definition 1.3.6.) It follows that ⟨·, ·⟩ is indeed an inner product, which
completes the proof.  ∎

Example 2.2.3 (A Weird Inner Product (Again)). Show that the function ⟨·,·⟩ : R^2 × R^2 → R defined by
$$\langle \mathbf{v}, \mathbf{w}\rangle = v_1w_1 + 2v_1w_2 + 2v_2w_1 + 5v_2w_2 \quad \text{for all} \quad \mathbf{v},\mathbf{w}\in\mathbb{R}^2$$
is an inner product on R^2.

Solution:
We already showed this function is an inner product back in Example 1.3.18 in a rather brute-force manner. Now that we understand inner products better, we can be much more slick: we just notice that we can rewrite this function in the form
$$\langle \mathbf{v},\mathbf{w}\rangle = \mathbf{v}^TA\mathbf{w}, \quad \text{where} \quad A = \begin{bmatrix} 1 & 2 \\ 2 & 5 \end{bmatrix}.$$
(To construct this matrix A, just notice that multiplying out v^TAw gives ∑_{i,j} a_{i,j}v_iw_j, so we just let a_{i,j} be the coefficient of v_iw_j.) It is straightforward to check that A is positive definite (its eigenvalues are 3 ± 2√2 > 0, for example), so Theorem 2.2.5 tells us that this function is an inner product.
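A quick numerical check of the rewriting step is also easy. The sketch below (assuming NumPy; the helper name weird_inner is ours, not something defined in the text) confirms that the entrywise formula agrees with v^T A w and that A has positive eigenvalues:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 5.0]])

# The "weird" inner product from Example 2.2.3, written out entrywise.
def weird_inner(v, w):
    return v[0]*w[0] + 2*v[0]*w[1] + 2*v[1]*w[0] + 5*v[1]*w[1]

rng = np.random.default_rng(0)
v, w = rng.standard_normal(2), rng.standard_normal(2)

print(np.isclose(weird_inner(v, w), v @ A @ w))   # True: the two formulas agree
print(np.linalg.eigvalsh(A))                      # both eigenvalues are positive
```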

There is one final characterization of positive definite matrices that we now present. This theorem is useful because it gives us another way of checking positive definiteness that is sometimes easier than computing eigenvalues.

Theorem 2.2.6 (Sylvester's Criterion). Suppose A ∈ M_n is self-adjoint. Then A is positive definite if and only if, for all 1 ≤ k ≤ n, the determinant of the top-left k × k block of A is strictly positive.

Before proving this theorem, it is perhaps useful to present an example that demonstrates exactly what it says and how to use it.

Example 2.2.4 (Applying Sylvester's Criterion). Use Sylvester's criterion to show that the following matrix is positive definite:
$$A = \begin{bmatrix} 2 & -1 & i \\ -1 & 2 & 1 \\ -i & 1 & 2 \end{bmatrix}.$$

Solution:
We have to check that the top-left 1 × 1, 2 × 2, and 3 × 3 blocks of A all have positive determinants:
$$\det\big([2]\big) = 2 > 0, \qquad \det\left(\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}\right) = 4 - 1 = 3 > 0, \quad \text{and}$$
$$\det\left(\begin{bmatrix} 2 & -1 & i \\ -1 & 2 & 1 \\ -i & 1 & 2 \end{bmatrix}\right) = 8 + i - i - 2 - 2 - 2 = 2 > 0.$$
(For 2 × 2 matrices, recall that the determinant of [a b; c d] is ad − bc. For larger matrices, we can use Theorem A.1.4 or many other methods to compute determinants.) It follows from Sylvester's criterion that A is positive definite.
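Sylvester's criterion is also easy to automate. Here is a minimal sketch (assuming NumPy; the helper name sylvester_positive_definite is ours) that checks the leading principal minors of the matrix from Example 2.2.4:

```python
import numpy as np

A = np.array([[2, -1, 1j],
              [-1, 2, 1],
              [-1j, 1, 2]])

def sylvester_positive_definite(A, tol=1e-12):
    """Check positive definiteness of a self-adjoint matrix via its
    leading principal minors (Sylvester's criterion)."""
    n = A.shape[0]
    # Determinants of Hermitian matrices are real, so .real just drops rounding noise.
    minors = [np.linalg.det(A[:k, :k]).real for k in range(1, n + 1)]
    return minors, all(m > tol for m in minors)

print(sylvester_positive_definite(A))   # approximately ([2.0, 3.0, 2.0], True)
```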

Proof of Sylvester's criterion. For the "only if" direction of the proof, recall from Theorem 2.2.4 that if A is positive definite then so is the top-left k × k block of A, which we will call A_k for the remainder of this proof. Since A_k is positive definite, its eigenvalues are positive, so det(A_k) (i.e., the product of those eigenvalues) is positive too, as desired.

The "if" direction is somewhat trickier to pin down, and we prove it by induction on n (the size of A). For the base case, if n = 1 then it is clear that det(A) > 0 implies that A is positive definite since the determinant of a scalar just equals that scalar itself.

For the inductive step, assume that the theorem holds for (n − 1) × (n − 1) matrices. To see that it must then hold for n × n matrices, notice that if A ∈ M_n(F) is as in the statement of the theorem and det(A_k) > 0 for all 1 ≤ k ≤ n, then det(A) > 0 and (by the inductive hypothesis) A_{n−1} is positive definite.

Let λ_i and λ_j be any two eigenvalues of A with corresponding orthogonal eigenvectors v and w, respectively. (We can choose v and w to be orthogonal by the spectral decomposition.) Then define x = w_n v − v_n w and notice that x ≠ 0 (since {v, w} is linearly independent), but x_n = w_n v_n − v_n w_n = 0. (We have to be careful if v_n = w_n = 0, since then x = 0. In this case, we instead define x = v to fix up the proof.) Since x_n = 0 and A_{n−1} is positive definite, it follows that
$$\begin{aligned}
0 < \mathbf{x}^*A\mathbf{x} &= (w_n\mathbf{v} - v_n\mathbf{w})^*A(w_n\mathbf{v} - v_n\mathbf{w}) \\
&= |w_n|^2\mathbf{v}^*A\mathbf{v} - \overline{v_n}w_n\,\mathbf{w}^*A\mathbf{v} - \overline{w_n}v_n\,\mathbf{v}^*A\mathbf{w} + |v_n|^2\mathbf{w}^*A\mathbf{w} \\
&= \lambda_i|w_n|^2\mathbf{v}^*\mathbf{v} - \lambda_i\overline{v_n}w_n\,\mathbf{w}^*\mathbf{v} - \lambda_j\overline{w_n}v_n\,\mathbf{v}^*\mathbf{w} + \lambda_j|v_n|^2\mathbf{w}^*\mathbf{w} \\
&= \lambda_i|w_n|^2\|\mathbf{v}\|^2 - 0 - 0 + \lambda_j|v_n|^2\|\mathbf{w}\|^2.
\end{aligned}$$
(Recall that v and w are orthogonal, so v^*w = v · w = 0.)

It is thus not possible that both λ_i ≤ 0 and λ_j ≤ 0. Since λ_i and λ_j were arbitrary eigenvalues of A, it follows that A must have at most one non-positive eigenvalue. However, if it had exactly one non-positive eigenvalue then it would be the case that det(A) = λ_1λ_2···λ_n ≤ 0, which we know is not the case. It follows that all of A's eigenvalues are strictly positive, so A is positive definite by Theorem 2.2.2, which completes the proof. ∎

It might be tempting to think that Theorem 2.2.6 can be extended to positive semidefinite matrices by just requiring that the determinant of each square top-left block of A is non-negative, rather than strictly positive. However, this does not work, as demonstrated by the matrix
$$A = \begin{bmatrix} 0 & 0 \\ 0 & -1 \end{bmatrix}.$$
For this matrix, the top-left block clearly has determinant 0, and straightforward computation shows that det(A) = 0 as well. However, A is not positive semidefinite, since it has −1 as an eigenvalue.

The following theorem shows that there is indeed some hope though, and Sylvester's criterion does apply to 2 × 2 matrices as long as we add in the requirement that the bottom-right entry is non-negative as well.

Theorem 2.2.7 (Positive (Semi)Definiteness for 2 × 2 Matrices). Suppose A ∈ M_2 is self-adjoint.
a) A is positive semidefinite if and only if a_{1,1}, a_{2,2}, det(A) ≥ 0, and
b) A is positive definite if and only if a_{1,1} > 0 and det(A) > 0.

Proof. Claim (b) is exactly Sylvester's criterion (Theorem 2.2.6) for 2 × 2 matrices, so we only need to prove (a).

For the "only if" direction, the fact that a_{1,1}, a_{2,2} ≥ 0 whenever A is positive semidefinite follows simply from the fact that PSD matrices have non-negative diagonal entries (Theorem 2.2.4). The fact that det(A) ≥ 0 follows from the fact that A has non-negative eigenvalues, and det(A) is the product of them.

For the "if" direction, recall that the characteristic polynomial of A is
$$p_A(\lambda) = \det(A - \lambda I) = \lambda^2 - \mathrm{tr}(A)\lambda + \det(A).$$
It follows from the quadratic formula that the eigenvalues of A are
$$\lambda = \frac{1}{2}\Big(\mathrm{tr}(A) \pm \sqrt{\mathrm{tr}(A)^2 - 4\det(A)}\Big).$$
To see that these eigenvalues are non-negative (and thus A is positive semidefinite), we just observe that tr(A) = a_{1,1} + a_{2,2} ≥ 0 and that if det(A) ≥ 0 then tr(A)^2 − 4det(A) ≤ tr(A)^2, so tr(A) − √(tr(A)^2 − 4det(A)) ≥ 0. ∎

(This proof shows that we can replace the inequalities a_{1,1}, a_{2,2} ≥ 0 in the statement of this theorem with the single inequality tr(A) ≥ 0.)

For example, if we apply this theorem to the matrices
$$A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$$
from Equation (2.2.1), we see immediately that A is positive semidefinite (but not positive definite) because det(A) = 1 − 1 = 0 ≥ 0, but B is not positive semidefinite because det(B) = 1 − 4 = −3 < 0.

Remark 2.2.1 (Sylvester's Criterion for Positive Semidefinite Matrices). To extend Sylvester's criterion (Theorem 2.2.6) to positive semidefinite matrices, we first need a bit of extra terminology. A principal minor of a square matrix A ∈ M_n is the determinant of a submatrix of A that is obtained by deleting some (or none) of its rows as well as the corresponding columns. For example, the principal minors of a 2 × 2 Hermitian matrix
$$A = \begin{bmatrix} a & b \\ \overline{b} & d \end{bmatrix}$$
are
$$\det\big([a]\big) = a, \qquad \det\big([d]\big) = d, \qquad \text{and} \qquad \det\left(\begin{bmatrix} a & b \\ \overline{b} & d \end{bmatrix}\right) = ad - |b|^2.$$
(Notice that these three principal minors are exactly the quantities that determined positive semidefiniteness in Theorem 2.2.7(a).) Similarly, the principal minors of a 3 × 3 Hermitian matrix
$$B = \begin{bmatrix} a & b & c \\ \overline{b} & d & e \\ \overline{c} & \overline{e} & f \end{bmatrix}$$
are a, d, f, det(B) itself, as well as
$$\det\left(\begin{bmatrix} a & b \\ \overline{b} & d \end{bmatrix}\right) = ad - |b|^2, \quad \det\left(\begin{bmatrix} a & c \\ \overline{c} & f \end{bmatrix}\right) = af - |c|^2, \quad \text{and} \quad \det\left(\begin{bmatrix} d & e \\ \overline{e} & f \end{bmatrix}\right) = df - |e|^2.$$
(For example, the middle of these submatrices is obtained by deleting the second row and column of B.)

The correct generalization of Sylvester's criterion to positive semidefinite matrices is that a matrix is positive semidefinite if and only if all of its principal minors are non-negative (whereas in the positive definite version of Sylvester's criterion we only needed to check the principal minors coming from the top-left corners of the matrix).
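To make this extended criterion concrete, here is a small computational sketch (assuming NumPy; the helper name principal_minors is ours) that enumerates every principal minor of a self-adjoint matrix and applies the test to the counterexample matrix from above:

```python
import numpy as np
from itertools import combinations

def principal_minors(A):
    """Determinants of all principal submatrices of A (delete some rows
    and the corresponding columns, keep the rest). Illustrative helper."""
    n = A.shape[0]
    minors = []
    for size in range(1, n + 1):
        for rows in combinations(range(n), size):
            sub = A[np.ix_(rows, rows)]
            minors.append(np.linalg.det(sub).real)
    return minors

# This matrix fools the "top-left corners only" test, but one of its
# principal minors (the entry -1) is negative, so it is not PSD.
A = np.array([[0.0, 0.0],
              [0.0, -1.0]])
print(principal_minors(A))                              # approximately [0.0, -1.0, 0.0]
print(all(m >= -1e-12 for m in principal_minors(A)))    # False
```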

2.2.2 Diagonal Dominance and Gershgorin Discs


Intuitively, Theorem 2.2.7 tells us that a 2 × 2 matrix is positive (semi)definite exactly when its diagonal entries are sufficiently large compared to its off-diagonal entries. After all, if we expand out the inequality det(A) ≥ 0 explicitly in terms of the entries of A, we see that it is equivalent to
$$a_{1,1}a_{2,2} \geq a_{1,2}a_{2,1}.$$
This same intuition is well-founded even for larger matrices. However, to clarify exactly what we mean, we first need the following result that helps us bound the eigenvalues of a matrix based on simple information about its entries.

Theorem 2.2.8 (Gershgorin Disc Theorem). Suppose A ∈ M_n and define the following objects for each 1 ≤ i ≤ n:
• r_i = ∑_{j≠i} |a_{i,j}| (the sum of the absolute values of the off-diagonal entries of the i-th row of A), and
• D(a_{i,i}, r_i), the closed disc in the complex plane centered at a_{i,i} with radius r_i (we call this disc the i-th Gershgorin disc of A).
Then every eigenvalue of A is located in at least one of the D(a_{i,i}, r_i). (A closed disc is just a filled circle.)

This theorem can be thought of as an approximation theorem. For a diagonal


matrix we have ri = 0 for all 1 ≤ i ≤ n, so its Gershgorin discs have radius 0
and thus its eigenvalues are exactly its diagonal entries, which we knew already
(in fact, the same is true of triangular matrices). However, if we were to change
the off-diagonal entries of that matrix then the radii of its Gershgorin discs
would increase, allowing its eigenvalues to wiggle around a bit.
Before proving this result, we illustrate it with an example.

Example 2.2.5 (Gershgorin Discs). Draw the Gershgorin discs for the following matrix, and show that its eigenvalues are contained in these discs:
$$A = \begin{bmatrix} -1 & 2 \\ -i & 1+i \end{bmatrix}.$$

Solution:
The Gershgorin discs are D(−1, 2) and D(1 + i, 1). (The radius of the second disc is 1 because |−i| = 1.) Direct calculation shows that the eigenvalues of A are 1 and −1 + i, which are indeed contained within these discs:

[Figure: the discs D(−1, 2) and D(1 + i, 1) in the complex plane, containing the eigenvalues λ_1 = 1 and λ_2 = −1 + i.]

(In this subsection, we focus on the F = C case since Gershgorin discs live naturally in the complex plane. These same results apply if F = R simply because real matrices are complex.)

Proof of Theorem 2.2.8. Let λ be an eigenvalue of A corresponding to an eigenvector v. Since v ≠ 0, we can scale it so that its largest entry is v_i = 1 and all other entries are no larger than 1 (i.e., |v_j| ≤ 1 for all j ≠ i). (In the notation of Section 1.D, we scale v so that ‖v‖_∞ = 1.) By looking at the i-th entry of the vector Av = λv, we see that
$$\sum_j a_{i,j}v_j = \lambda v_i = \lambda. \qquad (\text{since } v_i = 1)$$
Now notice that the sum on the left can be split up into the form
$$\sum_j a_{i,j}v_j = \sum_{j\neq i} a_{i,j}v_j + a_{i,i}. \qquad (\text{again, since } v_i = 1)$$
By combining and slightly rearranging the two equations above, we get
$$\lambda - a_{i,i} = \sum_{j\neq i} a_{i,j}v_j.$$
Finally, taking the absolute value of both sides of this equation then shows that
$$|\lambda - a_{i,i}| = \Big|\sum_{j\neq i} a_{i,j}v_j\Big| \leq \sum_{j\neq i}|a_{i,j}||v_j| \leq \sum_{j\neq i}|a_{i,j}| = r_i,$$
where the first inequality is the triangle inequality (together with the fact that |wz| = |w||z| for all w, z ∈ C) and the second holds since |v_j| ≤ 1 for all j ≠ i. This means exactly that λ ∈ D(a_{i,i}, r_i). ∎

(In fact, this proof shows that each eigenvalue lies in the Gershgorin disc corresponding to the largest entry of its corresponding eigenvectors.)

In order to improve the bounds provided by the Gershgorin disc theorem,


recall that A and AT have the same eigenvalues, which immediately implies
that the eigenvalues of A also lie within the discs corresponding to the columns
of A (instead of its rows). More specifically, if we define ci = ∑ j6=i |a j,i | (the
sum of the off-diagonal entries of the i-th column of A) then it is also true that
each eigenvalue of A is contained in at least one of the discs D(ai,i , ci ).

Example 2.2.6 (Gershgorin Discs via Columns). Draw the Gershgorin discs based on the columns of the matrix from Example 2.2.5, and show that its eigenvalues are contained in those discs.

Solution:
The Gershgorin discs based on the columns of this matrix are D(−1, 1) and D(1 + i, 2), which do indeed contain the eigenvalues 1 and −1 + i:

[Figure: the column discs D(−1, 1) and D(1 + i, 2) together with the row discs from Example 2.2.5, with the eigenvalues λ_1 = 1 and λ_2 = −1 + i marked.]
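Computing the discs themselves is mechanical, so a short sketch (again assuming NumPy; the helper name gershgorin_discs is ours) can reproduce the row and column discs used in the last two examples:

```python
import numpy as np

A = np.array([[-1, 2],
              [-1j, 1 + 1j]])

def gershgorin_discs(A):
    """Return (center, radius) pairs for the row Gershgorin discs of A."""
    radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    return list(zip(np.diag(A), radii))

print(gershgorin_discs(A))      # row discs: centers -1 and 1+1j, radii 2 and 1
print(gershgorin_discs(A.T))    # column discs: centers -1 and 1+1j, radii 1 and 2
print(np.linalg.eigvals(A))     # approximately 1 and -1+1j
```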

Remark 2.2.2 (Counting Eigenvalues in Gershgorin Discs). Be careful when using the Gershgorin disc theorem. Since every complex matrix has exactly n eigenvalues (counting multiplicities) and n Gershgorin discs, it is tempting to conclude that each disc must contain an eigenvalue, but this is not necessarily true. For example, consider the matrix
$$A = \begin{bmatrix} 1 & 2 \\ -1 & -1 \end{bmatrix},$$
which has Gershgorin discs D(1, 2) and D(−1, 1). However, its eigenvalues are ±i, both of which are contained in D(1, 2) and neither of which are contained in D(−1, 1):

[Figure: the discs D(1, 2) and D(−1, 1) in the complex plane, with both eigenvalues λ_1 = i and λ_2 = −i lying in D(1, 2) only.]

However, in the case when the Gershgorin discs do not overlap, we can indeed conclude that each disc must contain exactly one eigenvalue. Slightly more generally, if we partition the Gershgorin discs into disjoint sets, then each set must contain exactly as many eigenvalues as Gershgorin discs. (Proofs of the statements in this remark are above our pay grade, so we leave them to more specialized books like (HJ12).) For example, the matrix
$$B = \begin{bmatrix} 4 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 0 & 5 \end{bmatrix}$$
has Gershgorin discs D(4, 2), D(−1, 2), and D(5, 1). Since D(4, 2) and D(5, 1) overlap, but D(−1, 2) is disjoint from them, we know that one of B's eigenvalues must be contained in D(−1, 2), and the other two must be contained in D(4, 2) ∪ D(5, 1). Indeed, the eigenvalues of B are approximately λ_1 ≈ −0.8345 and λ_{2,3} ≈ 4.4172 ± 0.9274i:

[Figure: the discs D(−1, 2), D(4, 2), and D(5, 1), with λ_1 ≈ −0.8345 in D(−1, 2) and λ_{2,3} ≈ 4.4172 ± 0.9274i in D(4, 2) ∪ D(5, 1).]

In fact, we could have even further bounded λ_1 to the real interval [−2, 0] by using the Gershgorin discs based on B's columns (one of these discs is D(−1, 1)) and recalling that the eigenvalues of real matrices come in complex conjugate pairs.

Our primary purpose for introducing Gershgorin discs is that they will help
us show that some matrices are positive (semi)definite shortly. First, we need
to introduce one additional family of matrices:

Definition 2.2.2 (Diagonally Dominant Matrices). A matrix A ∈ M_n is called
a) diagonally dominant if |a_{i,i}| ≥ ∑_{j≠i} |a_{i,j}| for all 1 ≤ i ≤ n, and
b) strictly diagonally dominant if |a_{i,i}| > ∑_{j≠i} |a_{i,j}| for all 1 ≤ i ≤ n.

To illustrate the relationship between diagonally dominant matrices, Gershgorin discs, and positive semidefiniteness, consider the diagonally dominant Hermitian matrix
$$A = \begin{bmatrix} 2 & 0 & i \\ 0 & 7 & 1 \\ -i & 1 & 5 \end{bmatrix}. \tag{2.2.2}$$
This matrix has Gershgorin discs D(2, 1), D(7, 1), and D(5, 2). Furthermore, since A is Hermitian we know that its eigenvalues are real and are thus contained in the real interval [1, 8] (see Figure 2.5). In particular, this implies that its eigenvalues are strictly positive, so A is positive definite. This same type of argument works in general, and leads immediately to the following theorem:

Theorem 2.2.9 (Diagonal Dominance Implies PSD). Suppose A ∈ M_n is self-adjoint with non-negative diagonal entries.
a) If A is diagonally dominant then it is positive semidefinite.
b) If A is strictly diagonally dominant then it is positive definite.

Proof. By the Gershgorin disc theorem, we know that the eigenvalues of A are contained in its Gershgorin discs in the complex plane. Since the diagonal entries of A are real and non-negative, the centers of these Gershgorin discs are located on the right half of the real axis. Furthermore, since A is diagonally dominant, the radii of these discs are no larger than the coordinates of their centers, so they do not extend past the imaginary axis. It follows that every eigenvalue of A is non-negative (or strictly positive if A is strictly diagonally dominant), so A is positive semidefinite by Theorem 2.2.1 (or positive definite by Theorem 2.2.2). ∎

[Figure 2.5: The Gershgorin discs of the Hermitian matrix from Equation (2.2.2) all lie in the right half of the complex plane, so it is positive definite. The eigenvalues of this matrix are actually approximately 1.6806, 4.8767, and 7.4427.]

(This is a one-way theorem: diagonally dominant matrices are PSD, but PSD matrices may not be diagonally dominant; see the matrix from Example 2.2.3, for example.)
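Both conditions in the definition of diagonal dominance are easy to check mechanically. The following sketch (assuming NumPy; the helper name is_diagonally_dominant is ours) verifies that the matrix from Equation (2.2.2) is strictly diagonally dominant and, as the theorem predicts, positive definite:

```python
import numpy as np

A = np.array([[2, 0, 1j],
              [0, 7, 1],
              [-1j, 1, 5]])

def is_diagonally_dominant(A, strict=False):
    """Check (strict) diagonal dominance row by row. Illustrative helper."""
    diag = np.abs(np.diag(A))
    off = np.sum(np.abs(A), axis=1) - diag
    return bool(np.all(diag > off)) if strict else bool(np.all(diag >= off))

print(is_diagonally_dominant(A, strict=True))   # True
print(np.linalg.eigvalsh(A))                    # roughly [1.68, 4.88, 7.44], all positive
```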

2.2.3 Unitary Freedom of PSD Decompositions


We saw in Theorem 2.2.1 that we can write every positive semidefinite matrix
A ∈ Mn in the form A = B∗ B, where B ∈ Mn . However, this matrix B is not
unique, since if U ∈ Mn is any unitary matrix and we define C = UB then
C∗C = (UB)∗ (UB) = B∗U ∗UB = B∗ B = A as well. The following theorem
says (among other things) that all such decompositions of A are related to each
other in a similar way:

Theorem 2.2.10 (Unitary Freedom of PSD Decompositions). Suppose B, C ∈ M_{m,n}(F). The following are equivalent:
a) there exists a unitary matrix U ∈ M_m(F) such that C = UB,
b) B^*B = C^*C,
c) (Bv) · (Bw) = (Cv) · (Cw) for all v, w ∈ F^n, and
d) ‖Bv‖ = ‖Cv‖ for all v ∈ F^n.

Proof. We showed directly above the statement of the theorem that (a) implies (b), and we demonstrated the equivalence of conditions (b), (c), and (d) in Exercise 1.4.19. We thus only need to show that the conditions (b), (c), and (d) together imply condition (a). (If C = I then this theorem gives many of the characterizations of unitary matrices that we saw back in Theorem 1.4.9.)
To this end, note that since B∗ B is positive semidefinite, it has a set of
eigenvectors {v1 , . . . , vn } (with corresponding eigenvalues λ1 , . . . , λn , respec-
tively) that form an orthonormal basis of Fn . By Exercise 2.1.19, we know that
r = rank(B∗ B) of these eigenvalues are non-zero, which we arrange so that
λ1 , . . . , λr are the non-zero ones. We now prove some simple facts about these
eigenvalues and eigenvectors:
i) r ≤ m, since rank(XY ) ≤ min{rank(X), rank(Y )} in general, so r =
rank(B∗ B) ≤ rank(B) ≤ min{m, n}.

ii) Bv1 , . . . , Bvr are non-zero and Bvr+1 = · · · = Bvn = 0. These facts follow
from noticing that, for each 1 ≤ j ≤ n, we have
$$\|B\mathbf{v}_j\|^2 = (B\mathbf{v}_j)\cdot(B\mathbf{v}_j) = \mathbf{v}_j^*(B^*B\mathbf{v}_j) = \mathbf{v}_j^*(\lambda_j\mathbf{v}_j) = \lambda_j\|\mathbf{v}_j\|^2,$$
which equals zero if and only if λ_j = 0 (recall that v_j is an eigenvector, and eigenvectors are by definition non-zero).
iii) The set {Bv1 , . . . , Bvr } is mutually orthogonal, since if i 6= j then

(Bvi ) · (Bv j ) = v∗i (B∗ Bv j ) = v∗i (λ j v j ) = λ j v∗i v j = 0.

It follows that if we define vectors w_i = Bv_i/‖Bv_i‖ for 1 ≤ i ≤ r then the set {w_1, ..., w_r} is a mutually orthogonal set of unit vectors. We then know from Exercise 1.4.20 that we can extend this set to an orthonormal basis {w_1, ..., w_m} of F^m. (To extend {w_1, ..., w_r} to an orthonormal basis, add any vector z not in its span, then apply the Gram–Schmidt process, and repeat.) By repeating this same argument with C in place of B, we similarly can construct an orthonormal basis {x_1, ..., x_m} of F^m with the property that x_i = Cv_i/‖Cv_i‖ for 1 ≤ i ≤ r.
Before we reach the home stretch of the proof, we need to establish one
more property of this set of vectors:
iv) B∗ w j = C∗ x j = 0 whenever j ≥ r + 1. To see why this property holds,
recall that w j is orthogonal to each of Bv1 , . . . , Bvn . Since {v1 , . . . , vn } is
a basis of Fn , it follows that w j is orthogonal to everything in range(B).
That is, (B^*w_j) · v = w_j · (Bv) = 0 for all v ∈ F^n, so B^*w_j = 0. The fact that C^*x_j = 0 is proved similarly. (In the notation of Section 1.B.2, w_j ∈ range(B)^⊥ and thus w_j ∈ null(B^*); compare with Theorem 1.B.7.)

With all of these preliminary properties out of the way, we now define the unitary matrices
$$U_1 = \big[\,\mathbf{w}_1 \mid \cdots \mid \mathbf{w}_m\,\big] \quad \text{and} \quad U_2 = \big[\,\mathbf{x}_1 \mid \cdots \mid \mathbf{x}_m\,\big].$$

It then follows that
$$\begin{aligned}
B^*U_1 &= \big[\,B^*\mathbf{w}_1 \mid \cdots \mid B^*\mathbf{w}_m\,\big] \\
&= \left[\,\frac{B^*B\mathbf{v}_1}{\|B\mathbf{v}_1\|} \;\Big|\; \cdots \;\Big|\; \frac{B^*B\mathbf{v}_r}{\|B\mathbf{v}_r\|} \;\Big|\; \mathbf{0} \;\Big|\; \cdots \;\Big|\; \mathbf{0}\,\right] && \text{(property (iv) above)} \\
&= \left[\,\frac{C^*C\mathbf{v}_1}{\|C\mathbf{v}_1\|} \;\Big|\; \cdots \;\Big|\; \frac{C^*C\mathbf{v}_r}{\|C\mathbf{v}_r\|} \;\Big|\; \mathbf{0} \;\Big|\; \cdots \;\Big|\; \mathbf{0}\,\right] && \text{(properties (b) and (d))} \\
&= \big[\,C^*\mathbf{x}_1 \mid \cdots \mid C^*\mathbf{x}_m\,\big] && \text{(property (iv) again)} \\
&= C^*U_2.
\end{aligned}$$

By multiplying the equation B∗U1 = C∗U2 on the right by U2∗ and then taking
the conjugate transpose of both sides, we see that C = (U2U1∗ )B. Since U2U1∗
is unitary, we are done. 
Recall from Theorem 1.4.9 that unitary matrices preserve the norm (induced by the usual dot product) and angles between vectors. The equivalence of conditions (a) and (b) in the above theorem can be thought of as the converse: if two sets of vectors have the same norms and pairwise angles as each other, then there must be a unitary matrix that transforms one set into the other. That is, if {v_1, ..., v_n} ⊆ F^m and {w_1, ..., w_n} ⊆ F^m are such that ‖v_j‖ = ‖w_j‖ and v_i · v_j = w_i · w_j for all i, j then there exists a unitary matrix U ∈ M_m(F) such that w_j = Uv_j for all j. After all, if B = [v_1 | v_2 | ··· | v_n] and C = [w_1 | w_2 | ··· | w_n] then B^*B = C^*C if and only if these norms and dot products agree. (Now is a good time to have a look back at Figure 1.15.)

Theorem 2.2.10 also raises the question of how simple we can make the
matrix B in a positive semidefinite decomposition A = B∗ B. The following
theorem provides one possible answer: we can choose B so that it is also
positive semidefinite.

Theorem 2.2.11 (Principal Square Root of a Matrix). Suppose A ∈ M_n(F) is positive semidefinite. There exists a unique positive semidefinite matrix P ∈ M_n(F), called the principal square root of A, such that A = P^2.

Proof. To see that such a matrix P exists, we use the standard method of applying functions to diagonalizable matrices. Specifically, we use the spectral decomposition to write A = UDU^*, where U ∈ M_n(F) is unitary and D ∈ M_n(R) is diagonal with non-negative real numbers (the eigenvalues of A) as its diagonal entries. If we then define P = U√D U^*, where √D is the diagonal matrix that contains the non-negative square roots of the entries of D, then
$$P^2 = \big(U\sqrt{D}U^*\big)\big(U\sqrt{D}U^*\big) = U\sqrt{D}^{\,2}U^* = UDU^* = A,$$
as desired.

To see that P is unique, suppose that Q ∈ M_n(F) is another positive semidefinite matrix for which Q^2 = A. We can use the spectral decomposition to write Q = VFV^*, where V ∈ M_n(F) is unitary and F ∈ M_n(R) is diagonal with non-negative real numbers as its diagonal entries. Since VF^2V^* = Q^2 = A = UDU^*, we conclude that the eigenvalues of Q^2 equal those of A, and thus the diagonal entries of F^2 equal those of D.

Since we are free in the spectral decomposition to order the eigenvalues along the diagonals of F and D however we like, we can assume without loss of generality that F = √D. It then follows from the fact that P^2 = A = Q^2 that VDV^* = UDU^*, so WD = DW, where W = U^*V. Our goal now is to show that this implies V√D V^* = U√D U^* (i.e., Q = P), which is equivalent to W√D = √D W. (Be careful: it is tempting to try to show that V = U, but this is not true in general. For example, if D = I then U and V can be anything.)

To this end, suppose that P has k distinct eigenvalues (i.e., √D has k distinct diagonal entries), which we denote by λ_1, λ_2, ..., λ_k, and denote the multiplicity of each λ_j by m_j. We can then write √D in block diagonal form as follows, where we have grouped repeated eigenvalues (if any) so as to be next to each other (i.e., in the same block):
$$\sqrt{D} = \begin{bmatrix} \lambda_1 I_{m_1} & O & \cdots & O \\ O & \lambda_2 I_{m_2} & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & \lambda_k I_{m_k} \end{bmatrix}.$$
(Again, the spectral decomposition lets us order the diagonal entries of √D, and thus of D, however we like.)

If we partition W as a block matrix via blocks of the same size and shape, i.e.,
$$W = \begin{bmatrix} W_{1,1} & W_{1,2} & \cdots & W_{1,k} \\ W_{2,1} & W_{2,2} & \cdots & W_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ W_{k,1} & W_{k,2} & \cdots & W_{k,k} \end{bmatrix},$$
then block matrix multiplication shows that the equation WD = DW is equivalent to λ_i^2 W_{i,j} = λ_j^2 W_{i,j} for all 1 ≤ i ≠ j ≤ k. Since λ_i^2 ≠ λ_j^2 when i ≠ j, this implies that W_{i,j} = O when i ≠ j, so W is block diagonal. It then follows that
$$W\sqrt{D} = \begin{bmatrix} \lambda_1 W_{1,1} & O & \cdots & O \\ O & \lambda_2 W_{2,2} & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & \lambda_k W_{k,k} \end{bmatrix} = \sqrt{D}\,W$$
as well, which completes the proof. ∎



The principal square root P of a matrix A is typically denoted by P = √A, and it is directly analogous to the principal square root of a non-negative real number (indeed, for 1 × 1 matrices they are the exact same thing). This provides yet another reason why we typically think of positive semidefinite matrices as the "matrix version" of non-negative real numbers.
Example 2.2.7 (Computing a Principal Square Root). Compute the principal square root of the matrix
$$A = \begin{bmatrix} 8 & 6 \\ 6 & 17 \end{bmatrix}.$$

Solution:
We just compute a spectral decomposition of A and take the principal square root of the diagonal entries of its diagonal part. The eigenvalues of A are 5 and 20, and they have corresponding eigenvectors (2, −1) and (1, 2), respectively. It follows that A has spectral decomposition A = UDU^*, where
$$U = \frac{1}{\sqrt{5}}\begin{bmatrix} 2 & 1 \\ -1 & 2 \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} 5 & 0 \\ 0 & 20 \end{bmatrix}.$$
The principal square root of A is thus
$$\sqrt{A} = U\sqrt{D}U^* = \frac{1}{5}\begin{bmatrix} 2 & 1 \\ -1 & 2 \end{bmatrix}\begin{bmatrix} \sqrt{5} & 0 \\ 0 & \sqrt{20} \end{bmatrix}\begin{bmatrix} 2 & -1 \\ 1 & 2 \end{bmatrix} = \frac{1}{\sqrt{5}}\begin{bmatrix} 6 & 2 \\ 2 & 9 \end{bmatrix}.$$
(There are also three other square roots of A, obtained by taking some negative square roots in √D, but the principal square root is the only one of them that is PSD.)
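For larger matrices it is usually easiest to let a computer do the spectral decomposition. Here is a minimal sketch of the construction from the proof of Theorem 2.2.11 (assuming NumPy; the helper name principal_sqrt is ours):

```python
import numpy as np

def principal_sqrt(A):
    """Principal square root of a positive semidefinite matrix, built from
    a spectral decomposition A = U diag(d) U*. Illustrative helper."""
    d, U = np.linalg.eigh(A)
    d = np.clip(d, 0, None)          # clear tiny negative rounding errors
    return U @ np.diag(np.sqrt(d)) @ U.conj().T

A = np.array([[8.0, 6.0],
              [6.0, 17.0]])
P = principal_sqrt(A)
print(P)                        # approximately (1/sqrt(5)) * [[6, 2], [2, 9]]
print(np.allclose(P @ P, A))    # True
```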
By combining our previous two theorems, we get a new matrix decompo-
sition, which answers the question of how simple we can make a matrix by
multiplying it on the left by a unitary matrix—we can always make it positive
semidefinite.

Theorem 2.2.12 (Polar Decomposition). Suppose A ∈ M_n(F). There exists a unitary matrix U ∈ M_n(F) and a positive semidefinite matrix P ∈ M_n(F) such that
$$A = UP.$$
Furthermore, P is unique and is given by P = √(A^*A), and U is unique if A is invertible.

(This is sometimes called the right polar decomposition of A. A left polar decomposition is A = QV, where Q is PSD and V is unitary. These two decompositions coincide, i.e., Q = P and V = U, if and only if A is normal.)

Proof. Since A^*A is positive semidefinite, we know from Theorem 2.2.11 that we can define P = √(A^*A), so that A^*A = P^2 = P^*P and P is positive semidefinite. We then know from Theorem 2.2.10 that there exists a unitary matrix U ∈ M_n(F) such that A = UP.
To see uniqueness of P, suppose A = U_1P_1 = U_2P_2, where U_1, U_2 ∈ M_n(F) are unitary and P_1, P_2 ∈ M_n(F) are positive semidefinite. Then
$$A^*A = (U_1P_1)^*(U_1P_1) = P_1U_1^*U_1P_1 = P_1^2 \quad \text{and} \quad A^*A = (U_2P_2)^*(U_2P_2) = P_2U_2^*U_2P_2 = P_2^2.$$
Since P_1^2 = P_2^2 is positive semidefinite, it has a unique principal square root, so P_1 = P_2.

If A is invertible then uniqueness of U follows from the fact that we can rearrange the decomposition A = UP to the form U^* = PA^{-1}, so U = (PA^{-1})^*. ∎
The matrix √(A^*A) in the polar decomposition can be thought of as the "matrix version" of the absolute value of a complex number |z| = √(z̄z). In fact, this matrix is sometimes even denoted by |A| = √(A^*A). In a sense, this matrix provides a "regularization" of A that preserves many of its properties (such as rank and Frobenius norm; see Exercise 2.2.22), but also adds the desirable positive semidefiniteness property. (Similarly, the matrices √(A^*A) and √(AA^*) might be called the left and right absolute values of A, respectively, and they are equal if and only if A is normal.)
 
Example 2.2.8 (Computing a Polar Decomposition). Compute the polar decomposition of the matrix
$$A = \begin{bmatrix} 2 & -1 \\ 2 & 4 \end{bmatrix}.$$

Solution:
In order to find the polar decomposition A = UP, our first priority is to compute A^*A and then set P = √(A^*A):
$$A^*A = \begin{bmatrix} 2 & 2 \\ -1 & 4 \end{bmatrix}\begin{bmatrix} 2 & -1 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 8 & 6 \\ 6 & 17 \end{bmatrix},$$
which we computed the principal square root of in Example 2.2.7:
$$P = \sqrt{A^*A} = \frac{1}{\sqrt{5}}\begin{bmatrix} 6 & 2 \\ 2 & 9 \end{bmatrix}.$$
Next, as suggested by the proof of Theorem 2.2.12, we can find U by setting U = (PA^{-1})^*, since this rearranges to exactly A = UP:
$$U = \left(\frac{1}{10\sqrt{5}}\begin{bmatrix} 6 & 2 \\ 2 & 9 \end{bmatrix}\begin{bmatrix} 4 & 1 \\ -2 & 2 \end{bmatrix}\right)^* = \frac{1}{\sqrt{5}}\begin{bmatrix} 2 & -1 \\ 1 & 2 \end{bmatrix}.$$
(The 10 in the denominator here comes from A^{-1}.)
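Numerically, the same steps can be mirrored with a spectral decomposition of A^*A (or, if SciPy is available, with scipy.linalg.polar). The following NumPy-only sketch assumes A is invertible, so that U is unique; the helper name is ours:

```python
import numpy as np

def polar_decomposition(A):
    """Right polar decomposition A = U P with P positive semidefinite,
    built from a spectral decomposition of A*A (A assumed invertible)."""
    d, V = np.linalg.eigh(A.conj().T @ A)
    P = V @ np.diag(np.sqrt(np.clip(d, 0, None))) @ V.conj().T
    U = A @ np.linalg.inv(P)
    return U, P

A = np.array([[2.0, -1.0],
              [2.0, 4.0]])
U, P = polar_decomposition(A)
print(np.allclose(U @ P, A), np.allclose(U.conj().T @ U, np.eye(2)))  # True True
print(P * np.sqrt(5))   # approximately [[6, 2], [2, 9]]
```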

The polar decomposition is directly analogous to the fact that every complex number z ∈ C can be written in the form z = re^{iθ}, where r = |z| is non-negative and e^{iθ} lies on the unit circle in the complex plane. (Have a look at Appendix A.3.3 if you need a refresher on the polar form of a complex number.) Indeed, we have already been thinking about positive semidefinite matrices as analogous to non-negative real numbers, and it similarly makes sense to think of unitary matrices as analogous to numbers on the complex unit circle (indeed, this is exactly what they are in the 1 × 1 case). After all, multiplication by e^{iθ} rotates numbers in the complex plane but does not change their absolute value, just like unitary matrices rotate and/or reflect vectors but do not change their length.

In fact, just like we think of PSD matrices as analogous to non-negative real numbers and unitary matrices as analogous to numbers on the complex unit circle, there are many other sets of matrices that it is useful to think of as analogous to subsets of the complex plane. For example, we think of the sets of Hermitian and skew-Hermitian matrices as analogous to the sets of real and imaginary numbers, respectively (see Figure 2.6). (Thinking about sets of matrices geometrically, even though it is a very high-dimensional space that we cannot really picture properly, is a very useful technique for building intuition.)

[Figure 2.6: Several sets of normal matrices visualized on the complex plane as the sets of complex numbers to which they are analogous: 0 ∼ O, 1 ∼ I, the real numbers ∼ Hermitian matrices, the imaginary numbers ∼ skew-Hermitian matrices, the unit circle ∼ unitary matrices, and the non-negative real numbers ∼ positive semidefinite matrices.]

There are several ways in which these analogies can be justified:

• Each of these sets of matrices have eigenvalues living in the corre-


sponding set of complex numbers. For example, the eigenvalues of
skew-Hermitian matrices are imaginary.
• Furthermore, the converse of the above point is true as long as we assume
that the matrix is normal (see Exercise 2.1.17). For example, a normal
matrix with all eigenvalues imaginary must be skew-Hermitian.
• We can write every matrix in the form A = UP, where U is unitary and P
is PSD, just like we can write every complex number in the form z = reiθ ,
where eiθ is on the unit circle and r is non-negative and real.
• We can write every matrix in the form A = H + K, where H is Hermitian
and K is skew-Hermitian (see Remark 1.B.1), just like we can write
every complex number in the form z = h + k, where h is real and k is
imaginary.
• The only positive semidefinite unitary matrix is I (see Exercise 2.2.7),
just like the only positive real number on the unit circle is 1.
• The only matrix that is both Hermitian and skew-Hermitian is O, just
like the only complex number that is both real and imaginary is 0.

Remark 2.2.3 (Unitary Multiplication and PSD Decompositions). Given a positive semidefinite matrix A ∈ M_n, there are actually multiple different ways to simplify the matrix B in the PSD decomposition A = B^*B, and which one is best to use depends on what we want to do with it.

We showed in Theorem 2.2.11 that we can choose B to be positive semidefinite, and this led naturally to the polar decomposition of a matrix. However, there is another matrix decomposition (the Cholesky decomposition of Section 2.B.2) that says we can instead make B upper triangular with non-negative real numbers on the diagonal. This similarly leads to the QR decomposition of Section 1.C. The relationships between these four matrix decompositions are summarized here:
• If B = UP with P positive semidefinite (the polar decomposition of B), then A = B^*B = P^2, which is the principal square root decomposition of A.
• If B = UT with T upper triangular with non-negative real numbers on the diagonal (the QR decomposition of B), then A = B^*B = T^*T, which is the Cholesky decomposition of A.

Exercises (solutions to starred exercises on page 467)

2.2.1 Which of the following matrices are positive semidef-


inite? Positive definite? 2.2.4 Compute a polar decomposition of each of the fol-
" # " # lowing matrices.
∗ (a) 1 0 (b) 4 −2i " # " #
∗ (a) 1 −1 (b) 1 0
0 1 2i 1
" #   −1 1 3 2
∗ (c) 1 0 0 (d) 0 0 1    
  (c) −1 0 0 ∗ (d) 0 −1 0
0 1 0 0 1 0    
  0 1+i 0 −1 1 1
∗ (e) 1 0 i 1 0 0
 
  (f) 1 1 1 0 0 2i 2 3 1
0 1 0
 
i 0 1 1 2 1
  2.2.5 Determine which of the following statements are
∗ (g) 3 1 1 1 1 2
    true and which are false.
1 3 1 (h) 1 1 1 0
  ∗ (a) If every entry of a matrix is real and non-negative
1 1 3 1 2 1 1
  then it is positive semidefinite.
1 1 2 i
(b) The zero matrix O ∈ Mn is positive semidefinite.
0 1 −i 2 ∗ (c) The identity matrix I ∈ Mn is positive definite.
(d) If A, B ∈ Mn are positive semidefinite matrices, then
2.2.2 Compute the Gershgorin (row) discs of each of the so is A + B.
following matrices. ∗ (e) If A, B ∈ Mn are positive semidefinite matrices, then
" # " # so is AB.
∗ (a) 1 2 (b) 2 −2 (f) If A ∈ Mn is a positive semidefinite matrix, then so
3 −1 −i 3 is A2 .
    ∗ (g) Each Gershgorin disc of a matrix contains at least
∗ (c) 1 0 0 (d) 0 0 1
    one of its eigenvalues.
0 2 0 0 3 0 (h) The identity matrix is its own principal square root.
0 0 3 2 0 0 ∗ (i) The only matrix A ∈ Mn such that A2 = I is A = I.
   
∗ (e) 1 2 i (f) −1 1 1 0 (j) Every matrix A ∈ Mn has a unique polar decompo-
    sition.
−1 3 2 0 1 1 i
 
i −2 1 0 1 0 0
1 0 −1 i 2.2.6 For each of the following matrices, determine which
values of x ∈ R result in the matrix being positive semidefi-
nite.
2.2.3 Compute the principal square root of each of the " # " #
following positive semidefinite matrices. ∗ (a) 1 0 (b) 1 x
" # " # 0 x x 9
∗ (a) 1 −1 (b) 2 3    
∗ (c) 1 x 0 (d) 1 x 0
−1 1 3 5    
    x 1 x x 1 x
∗ (c) 1 0 0 (d) 2 1 1
    0 x 0 0 x 1
0 2 0 1 2 1
0 0 3 1 1 2
    ∗∗2.2.7 Show that the only matrix that is both positive
∗ (e) 1 1 0 (f) 2 2 0 0
    semidefinite and unitary is the identity matrix.
1 2 1 2 2 0 0 
 √ 
0 1 1 0 0 1 3
  2.2.8 Show that if a matrix is positive definite then it is

0 0 3 4 invertible, and its inverse is also positive definite.

2.2.9 Suppose A ∈ Mn is self-adjoint. Show that there


exists a scalar c ∈ R such that A + cI is positive definite.

2.2.10 Let J ∈ Mn be the n × n matrix with all entries ∗∗2.2.19 Suppose that A, B,C ∈ Mn are positive semidef-
equal to 1. Show that J is positive semidefinite. inite.
(a) Show that tr(A) ≥ 0.
∗∗2.2.11 Show that if A ∈ Mn is positive semidefinite and (b) Show that tr(AB) ≥ 0. [Hint: Decompose A and B.]
a j, j = 0 for some 1 ≤ j ≤ n then the entire j-th row and j-th (c) Provide an example to show that it is not necessarily
column of A must consist of zeros. the case that tr(ABC) ≥ 0. [Hint: Finding an example
by hand might be tricky. If you have trouble, try writ-
[Hint: What is v∗ Av if v has just two non-zero entries?]
ing a computer program that searches for an example
by generating A, B, and C randomly.]
2.2.12 Show that every (not necessarily Hermitian) strictly
diagonally dominant matrix is invertible.
∗∗2.2.20 Suppose that A ∈ Mn is self-adjoint.
(a) Show that A is positive semidefinite if and only if
∗∗2.2.13 Show that the block diagonal matrix
tr(AB) ≥ 0 for all positive semidefinite B ∈ Mn .
 
A1 O · · · O [Side note: This provides a converse to the statement
  of Exercise 2.2.19(b).]
 O A2 · · · O 
  (b) Show that A positive definite if and only if tr(AB) > 0
A= . .. .. 
. ..  for all positive semidefinite O 6= B ∈ Mn .
. . . .
O O · · · An
∗∗2.2.21 Suppose that A ∈ Mn is self-adjoint.
is positive (semi)definite if and only if each of A1 , A2 , . . . , An
(a) Show that if there exists a scalar c ∈ R such that
are positive (semi)definite.
tr(AB) ≥ c for all positive semidefinite B ∈ Mn then
A is positive semidefinite and c ≤ 0.
∗∗2.2.14 Show that A ∈ Mn is normal if and only if there (b) Show that if there exists a scalar c ∈ R such that
exists a unitary matrix U ∈ Mn such that A∗ = UA. tr(AB) > c for all positive definite B ∈ Mn then A is
positive semidefinite and c ≤ 0.
2.2.15 A matrix A ∈ Mn (R) with non-negative entries √
is called row stochastic if its rows each add up to 1 (i.e., ∗∗2.2.22 Let |A| = A∗ A be the absolute value of the
ai,1 + ai,2 + · · · + ai,n = 1 for each 1 ≤ i ≤ n). Show that matrix A ∈ Mn (F) that was discussed after Theorem 2.2.12.
each eigenvalue λ of a row stochastic matrix has |λ | ≤ 1. (a) Show that rank(|A|) = rank(A).

(b) Show that |A| F = kAkF .

∗∗2.2.16 Suppose F = R or F = C, and A ∈ Mn (F). (c) Show that |A|v = kAvk for all v ∈ Fn .
(a) Show that A is self-adjoint if and only if there exist
positive semidefinite matrices P, N ∈ Mn (F) such ∗∗2.2.23 Prove Theorem 2.2.2.
that A = P − N. [Hint: Apply the spectral decompo-
[Hint: Mimic the proof of Theorem 2.2.1 and just make
sition to A.]
minor changes where necessary.]
(b) Show that if F = C then A can be written as a linear
combination of 4 or fewer positive semidefinite ma-
trices, even if it is not Hermitian. [Hint: Have a look ∗∗2.2.24 Recall Theorem 2.2.3, which described some
at Remark 1.B.1.] ways in which we can combine PSD matrices to create new
(c) Explain why the result of part (b) does not hold if PSD matrices.
F = R.
(a) Prove part (a) of the theorem.
(b) Prove part (b) of the theorem.
2.2.17 Suppose A ∈ Mn is self-adjoint with p strictly pos- (c) Prove part (c) of the theorem.
itive eigenvalues. Show that the largest integer r with the
property that P∗ AP is positive definite for some P ∈ Mn,r
∗∗2.2.25 Suppose A ∈ Mn (F) is self-adjoint.
is r = p.
(a) Show that A is positive semidefinite if and only if
√  there exists a set of vectors {v1 , v2 , . . . , vn } ⊂ Fn
2.2.18 Suppose B ∈ Mn . Show that tr B∗ B ≥ |tr(B)|. such that
[Side note: In other words, out of all possible PSD decom-
positions of a PSD matrix, its principal square root is the ai, j = vi · v j for all 1 ≤ i, j ≤ n.
one with the largest trace.] [Side note: A is sometimes called the Gram matrix
of {v1 , v2 , . . . , vn }.]
(b) Show that A is positive definite if and only if the set
of vectors from part (a) is linearly independent.

2.2.26 Suppose that A, B ∈ Mn are positive definite. ∗∗2.2.28 In this exercise, we show that  if A ∈ Mn (C) is

(a) Show that all eigenvalues of AB are real and positive. written in terms of its columns as A = a1 | a2 | · · · | an ,
√ −1 then
[Hint: Multiply
√ AB on the left by A and on the det(A) ≤ ka1 kka2 k · · · kan k.
right by A.]
(b) Part (a) does not imply that AB is positive definite. [Side note: This is called Hadamard’s inequality.]
Why not? (a) Explain why it suffices to prove this inequality in
the case when ka j k = 1 for all 1 ≤ j ≤ n. Make this
2.2.27 Let A, B ∈ Mn be positive definite matrices. assumption throughout the rest of this question.
(b) Show that det(A∗ A) ≤ 1. [Hint: Use the AM–GM
(a) Show that det(I + B) ≥ 1 + det(B). inequality (Theorem A.5.3).]
(b) Show that det(A + B) ≥ det(A) + det(B). [Hint: (c) Conclude that | det(A)| ≤ 1 as well, thus completing
√ −1 √ −1
det(A + B) = det(A) det(I + A B A ).] the proof.
[Side note: The stronger inequality (d) Explain under which conditions equality is attained
in Hadamard’s inequality.
(det(A + B))1/n ≥ (det(A))1/n + (det(B))1/n
is also true for positive definite matrices (and is called
Minkowski’s determinant theorem), but proving it is
quite difficult.]

2.3 The Singular Value Decomposition

We are finally in a position to present what is possibly the most important theorem in all of linear algebra: the singular value decomposition (SVD). On its surface, it can be thought of as a generalization of the spectral decomposition that applies to all matrices, rather than just normal matrices. However, we will also see that we can use this decomposition to do several unexpected things like compute the size of a matrix (in a way that typically makes more sense than the Frobenius norm), provide a new geometric interpretation of linear transformations, solve optimization problems, and construct an "almost inverse" for matrices that do not have an inverse. (Some of these applications of the SVD are deferred to Section 2.C.)
Furthermore, many of the results from introductory linear algebra as well
as earlier in this book can be seen as trivial consequences of the singular value
decomposition. The spectral decomposition (Theorems 2.1.2 and 2.1.6), polar
decomposition (Theorem 2.2.12), orthogonality of the fundamental subspaces
(Theorem 1.B.7), and many more results can all be re-derived in a line or two
from the SVD. In a sense, if we know the singular value decomposition then
we know linear algebra.

Theorem 2.3.1 (Singular Value Decomposition (SVD)). Suppose F = R or F = C, and A ∈ M_{m,n}(F). There exist unitary matrices U ∈ M_m(F) and V ∈ M_n(F), and a diagonal matrix Σ ∈ M_{m,n}(R) with non-negative entries, such that
$$A = U\Sigma V^*.$$
Furthermore, in any such decomposition,
• the diagonal entries of Σ (called the singular values of A) are the non-negative square roots of the eigenvalues of A^*A (or equivalently, of AA^*),
• the columns of U (called the left singular vectors of A) are eigenvectors of AA^*, and
• the columns of V (called the right singular vectors of A) are eigenvectors of A^*A.
(Note that Σ might not be square. When we say that it is a "diagonal matrix", we just mean that its (i, j)-entry is 0 whenever i ≠ j.)

Before proving the singular value decomposition, it's worth comparing the ways in which it is "better" and "worse" than the other matrix decompositions that we already know:
• Better: It applies to every single matrix (even rectangular ones). Every other matrix decomposition we have seen so far had at least some restrictions (e.g., diagonalization only applies to matrices with a basis of eigenvectors, the spectral decomposition only applies to normal matrices, and Schur triangularization only applies to square matrices).
• Better: The matrix Σ in the middle of the SVD is diagonal (and even has real non-negative entries). Schur triangularization only results in an upper triangular middle piece, and even diagonalization and the spectral decomposition do not guarantee an entrywise non-negative matrix.
• Worse: It requires two unitary matrices U and V, whereas all of our previous decompositions only required one unitary matrix or invertible matrix.
(If m ≠ n then A^*A and AA^* have different sizes, but they still have essentially the same eigenvalues: whichever one is larger just has some extra 0 eigenvalues. The same is actually true of AB and BA for any A and B; see Exercise 2.B.11.)
Proof of the singular value decomposition. To see that singular value decompositions exist, suppose that m ≥ n and construct a spectral decomposition of the positive semidefinite matrix A^*A = VDV^* (if m < n then we instead use the matrix AA^*, but otherwise the proof is almost identical). Since A^*A is positive semidefinite, its eigenvalues (i.e., the diagonal entries of D) are real and non-negative, so we can define Σ ∈ M_{m,n} by [Σ]_{j,j} = √(d_{j,j}) for all j, and [Σ]_{i,j} = 0 if i ≠ j. (In other words, Σ is the principal square root of D, but with extra zero rows so that it has the same size as A.)

It follows that
$$A^*A = VDV^* = V\Sigma^*\Sigma V^* = (\Sigma V^*)^*(\Sigma V^*),$$
so the equivalence of conditions (a) and (b) in Theorem 2.2.10 tells us that there exists a unitary matrix U ∈ M_m(F) such that A = U(ΣV^*), which is a singular value decomposition of A.

To check the "furthermore" claims we just note that if A = UΣV^* with U and V unitary and Σ diagonal then it must be the case that
$$A^*A = (U\Sigma V^*)^*(U\Sigma V^*) = V(\Sigma^*\Sigma)V^*,$$
which is a diagonalization of A^*A. (We must write Σ^*Σ here, instead of Σ^2, because Σ might not be square.) Since the only way to diagonalize a matrix is via its eigenvalues and eigenvectors (refer back to Theorem 2.0.1, for example), it follows that the columns of V are eigenvectors of A^*A and the diagonal entries of Σ^*Σ (i.e., the squares of the diagonal entries of Σ) are the eigenvalues of A^*A. The statements about eigenvalues and eigenvectors of AA^* are proved in a similar manner. ∎
We typically denote singular values (i.e., the diagonal entries of Σ) by σ_1, σ_2, ..., σ_{min{m,n}}, and we order them so that σ_1 ≥ σ_2 ≥ ··· ≥ σ_{min{m,n}}. Note that exactly rank(A) of A's singular values are non-zero, since rank(UΣV^*) = rank(Σ), and the rank of a diagonal matrix is the number of non-zero diagonal entries. We often denote the rank of A simply by r, so we have σ_1 ≥ ··· ≥ σ_r > 0 and σ_{r+1} = ··· = σ_{min{m,n}} = 0. (For a refresher on these facts about the rank of a matrix, see Exercise 2.1.19 and Theorem A.1.2, for example.)

 
Example 2.3.1 (Computing Singular Values). Compute the singular values of the matrix
$$A = \begin{bmatrix} 3 & 2 \\ -2 & 0 \end{bmatrix}.$$

Solution:
The singular values are the square roots of the eigenvalues of A^*A. Direct calculation shows that
$$A^*A = \begin{bmatrix} 13 & 6 \\ 6 & 4 \end{bmatrix},$$
which has characteristic polynomial
$$p_{A^*A}(\lambda) = \det(A^*A - \lambda I) = \det\left(\begin{bmatrix} 13-\lambda & 6 \\ 6 & 4-\lambda \end{bmatrix}\right) = (13-\lambda)(4-\lambda) - 36 = \lambda^2 - 17\lambda + 16 = (\lambda-1)(\lambda-16).$$
We thus conclude that the eigenvalues of A^*A are λ_1 = 16 and λ_2 = 1, so the singular values of A are σ_1 = √16 = 4 and σ_2 = √1 = 1.
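The same computation can be done numerically; a couple of lines of NumPy (assumed available here) return the singular values directly and also confirm the route through A^*A:

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [-2.0, 0.0]])

print(np.linalg.svd(A, compute_uv=False))          # [4. 1.]
print(np.sqrt(np.linalg.eigvalsh(A.T @ A)[::-1]))  # same values, via the eigenvalues of A*A
```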

To compute a matrix's singular value decomposition (not just its singular values), we could construct spectral decompositions of each of A^*A (for the right singular vectors) and AA^* (for the left singular vectors). However, there is a simpler way of obtaining the left singular vectors that lets us avoid computing a second spectral decomposition. In particular, if we have already computed a spectral decomposition A^*A = V(Σ^*Σ)V^*, where V = [v_1 | v_2 | ··· | v_n] and Σ has non-zero diagonal entries σ_1, σ_2, ..., σ_r, then we can compute the remaining U = [u_1 | u_2 | ··· | u_m] piece of A's singular value decomposition by noting that A = UΣV^* implies
$$\big[\,A\mathbf{v}_1 \mid A\mathbf{v}_2 \mid \cdots \mid A\mathbf{v}_n\,\big] = AV = U\Sigma = \big[\,\sigma_1\mathbf{u}_1 \mid \cdots \mid \sigma_r\mathbf{u}_r \mid \mathbf{0} \mid \cdots \mid \mathbf{0}\,\big].$$
It follows that the first r = rank(A) columns of U can be obtained via
$$\mathbf{u}_j = \frac{1}{\sigma_j}A\mathbf{v}_j \quad \text{for all} \quad 1 \leq j \leq r, \tag{2.3.1}$$
and the remaining m − r columns can be obtained by extending {u_i}_{i=1}^r to an orthonormal basis of F^m (via the Gram–Schmidt process, for example). (The final m − r columns of U do not really matter; they just need to "fill out" U to make it unitary, i.e., they must have length 1 and be mutually orthogonal.)

 
Example 2.3.2 (Computing a Singular Value Decomposition). Compute a singular value decomposition of the matrix
$$A = \begin{bmatrix} 1 & 2 & 3 \\ -1 & 0 & 1 \\ 3 & 2 & 1 \end{bmatrix}.$$

Solution:
As discussed earlier, our first step is to find the "V" and "Σ" pieces of A's singular value decomposition, which we do by constructing a spectral decomposition of A^*A and taking the square roots of its eigenvalues. Well, direct calculation shows that
$$A^*A = \begin{bmatrix} 11 & 8 & 5 \\ 8 & 8 & 8 \\ 5 & 8 & 11 \end{bmatrix},$$
which has characteristic polynomial
$$p_{A^*A}(\lambda) = \det(A^*A - \lambda I) = -\lambda^3 + 30\lambda^2 - 144\lambda = -\lambda(\lambda-6)(\lambda-24).$$
It follows that the eigenvalues of A^*A are λ_1 = 24, λ_2 = 6, and λ_3 = 0, so the singular values of A are σ_1 = √24 = 2√6, σ_2 = √6, and σ_3 = 0. (In particular, A has rank 2 since 2 of its singular values are non-zero.) The normalized eigenvectors corresponding to these eigenvalues are
$$\mathbf{v}_1 = \frac{1}{\sqrt{3}}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad \mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}, \quad \text{and} \quad \mathbf{v}_3 = \frac{1}{\sqrt{6}}\begin{bmatrix} -1 \\ 2 \\ -1 \end{bmatrix},$$
respectively. We then place these eigenvectors into the matrix V as columns (in the same order as the corresponding eigenvalues/singular values) and obtain
$$\Sigma = \begin{bmatrix} 2\sqrt{6} & 0 & 0 \\ 0 & \sqrt{6} & 0 \\ 0 & 0 & 0 \end{bmatrix} \quad \text{and} \quad V = \frac{1}{\sqrt{6}}\begin{bmatrix} \sqrt{2} & -\sqrt{3} & -1 \\ \sqrt{2} & 0 & 2 \\ \sqrt{2} & \sqrt{3} & -1 \end{bmatrix}.$$
(We just pulled a 1/√6 factor out of V to avoid having to write fractional entries; simplifying in another way, e.g., explicitly writing its top-left entry as 1/√3, is fine too.)

Since 2 of the singular values are non-zero, we know that rank(A) = 2. We thus compute the first 2 columns of U via
$$\mathbf{u}_1 = \frac{1}{\sigma_1}A\mathbf{v}_1 = \frac{1}{2\sqrt{6}}\begin{bmatrix} 1 & 2 & 3 \\ -1 & 0 & 1 \\ 3 & 2 & 1 \end{bmatrix}\cdot\frac{1}{\sqrt{3}}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix},$$
$$\mathbf{u}_2 = \frac{1}{\sigma_2}A\mathbf{v}_2 = \frac{1}{\sqrt{6}}\begin{bmatrix} 1 & 2 & 3 \\ -1 & 0 & 1 \\ 3 & 2 & 1 \end{bmatrix}\cdot\frac{1}{\sqrt{2}}\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} = \frac{1}{\sqrt{3}}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}.$$
The third column of U can be found by extending {u_1, u_2} to an orthonormal basis of R^3, which just means that we need to find a unit vector orthogonal to each of u_1 and u_2. We could do this by solving a linear system, using the Gram–Schmidt process, or via the cross product. Any of these methods quickly lead us to the vector u_3 = (1, −2, −1)/√6 (alternatively, u_3 = (−1, 2, 1)/√6 would be fine too), so the singular value decomposition A = UΣV^* is completed by choosing
$$U = \frac{1}{\sqrt{6}}\begin{bmatrix} \sqrt{3} & \sqrt{2} & 1 \\ 0 & \sqrt{2} & -2 \\ \sqrt{3} & -\sqrt{2} & -1 \end{bmatrix}.$$
(Again, we just pulled a 1/√6 factor out of U to avoid fractions.)
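When doing this by hand is impractical, np.linalg.svd produces the same decomposition (possibly with different sign choices for the singular vectors); a small sketch, assuming NumPy:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [-1.0, 0.0, 1.0],
              [3.0, 2.0, 1.0]])

U, s, Vh = np.linalg.svd(A)      # A = U @ diag(s) @ Vh
print(s)                          # approximately [2*sqrt(6), sqrt(6), 0]
print(np.allclose(U @ np.diag(s) @ Vh, A))   # True
```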

Before delving into what makes the singular value decomposition so useful, it is worth noting that if A ∈ M_{m,n}(F) has singular value decomposition A = UΣV^* then A^T and A^* have singular value decompositions
$$A^T = \overline{V}\,\Sigma^T U^T \quad \text{and} \quad A^* = V\Sigma^*U^*.$$
(We use V̄ to mean the entrywise complex conjugate of V; that is, V̄ = (V^*)^T.) In particular, Σ, Σ^T, and Σ^* all have the same diagonal entries, so A, A^T, and A^* all have the same singular values.

2.3.1 Geometric Interpretation and the Fundamental Subspaces


By recalling that unitary matrices correspond exactly to the linear transformations that rotate and/or reflect vectors in F^n, but do not change their length, we see that the singular value decomposition has a very simple geometric interpretation. Specifically, it says that every matrix A = UΣV^* ∈ M_{m,n}(F) acts as a linear transformation from F^n to F^m in the following way:
• First, it rotates and/or reflects F^n (i.e., V^* acts on F^n).
• Then it stretches and/or shrinks F^n along the standard axes (i.e., the diagonal matrix Σ acts on F^n) and then embeds it in F^m (i.e., either adds m − n extra dimensions if m > n or ignores n − m of the dimensions if m < n).
• Finally, it rotates and/or reflects F^m (i.e., U acts on F^m).
This geometric interpretation is illustrated in the m = n = 2 case in Figure 2.7. In particular, it is worth keeping track not only of how the linear transformation changes a unit square grid on R^2 into a parallelogram grid, but also how it transforms the unit circle into an ellipse. Furthermore, the two radii of the ellipse are exactly the singular values σ_1 and σ_2 of the matrix, so we see that singular values provide another way of measuring how much a linear transformation expands space (much like eigenvalues and the determinant). (In fact, the product of a matrix's singular values equals the absolute value of its determinant; see Exercise 2.3.7.)

[Figure 2.7: The singular value decomposition says that every linear transformation (i.e., multiplication by a matrix A) can be thought of as a rotation/reflection V^*, followed by a scaling along the standard axes Σ, followed by another rotation/reflection U. The panels track how the standard basis vectors, a unit square grid, and the unit circle in R^2 are transformed at each stage, with the final ellipse having radii σ_1 and σ_2.]

This picture also extends naturally to higher dimensions. For example, a linear transformation acting on R^3 sends the unit sphere to an ellipsoid whose radii are the 3 singular values of the standard matrix of that linear transformation. For example, we saw in Example 2.3.2 that the matrix
$$A = \begin{bmatrix} 1 & 2 & 3 \\ -1 & 0 & 1 \\ 3 & 2 & 1 \end{bmatrix}$$
has singular values 2√6, √6, and 0, so we expect that A acts as a linear transformation that sends the unit sphere to an ellipsoid with radii 2√6, √6, and 0. Since the third radius is 0, it is actually a 2D ellipse living inside of R^3; one of the dimensions is "squashed" by A, as illustrated in Figure 2.8.

[Figure 2.8: Every linear transformation on R^3 sends the unit sphere to a (possibly degenerate) ellipsoid. The linear transformation displayed here is the one with the standard matrix A from Example 2.3.2.]

The fact that the unit sphere is turned into a 2D ellipse by this matrix
corresponds to the fact that it has rank 2, so its range is 2-dimensional. In fact,
the first two left singular vectors (which point in the directions of the major and
minor axes of the ellipse) form an orthonormal basis of the range. Similarly, the
third right singular vector, v3 has the property that Av3 = UΣV ∗ v3 = UΣe3 =
σ3Ue3 = 0, since σ3 = 0. It follows that v3 is in the null space of A.
This same type of argument works in general and leads to the following the-
orem, which shows that the singular value decomposition provides orthonormal
bases for each of the four fundamental subspaces of a matrix:

Theorem 2.3.2 (Bases of the Fundamental Subspaces). Let A ∈ M_{m,n} be a matrix with rank(A) = r and singular value decomposition A = UΣV^*, where
$$U = \big[\,\mathbf{u}_1 \mid \mathbf{u}_2 \mid \cdots \mid \mathbf{u}_m\,\big] \quad \text{and} \quad V = \big[\,\mathbf{v}_1 \mid \mathbf{v}_2 \mid \cdots \mid \mathbf{v}_n\,\big].$$
Then
a) {u_1, u_2, ..., u_r} is an orthonormal basis of range(A),
b) {u_{r+1}, u_{r+2}, ..., u_m} is an orthonormal basis of null(A^*),
c) {v_1, v_2, ..., v_r} is an orthonormal basis of range(A^*), and
d) {v_{r+1}, v_{r+2}, ..., v_n} is an orthonormal basis of null(A).

Proof. First notice that we have
$$A\mathbf{v}_j = U\Sigma V^*\mathbf{v}_j = U\Sigma\mathbf{e}_j = \sigma_jU\mathbf{e}_j = \sigma_j\mathbf{u}_j \quad \text{for all} \quad 1 \leq j \leq r.$$
Dividing both sides by σ_j then shows that u_1, u_2, ..., u_r are all in the range of A. Since range(A) is (by definition) r-dimensional, and {u_1, u_2, ..., u_r} is a set of r mutually orthogonal unit vectors, it must be an orthonormal basis of range(A).

Similarly, σ_j = 0 whenever j ≥ r + 1, so
$$A\mathbf{v}_j = U\Sigma V^*\mathbf{v}_j = U\Sigma\mathbf{e}_j = \mathbf{0} \quad \text{for all} \quad j \geq r + 1.$$
It follows that v_{r+1}, v_{r+2}, ..., v_n are all in the null space of A. Since the rank-nullity theorem (Theorem A.1.2(e)) tells us that null(A) has dimension n − r, and {v_{r+1}, v_{r+2}, ..., v_n} is a set of n − r mutually orthogonal unit vectors, it must be an orthonormal basis of null(A).

The corresponding facts about range(A^*) and null(A^*) follow by applying these same arguments to A^* instead of A. ∎

Note that this theorem tells us immediately that everything in range(A) is orthogonal to everything in null(A^*), simply because the members of the set {u_1, u_2, ..., u_r} are all orthogonal to the members of {u_{r+1}, u_{r+2}, ..., u_m}. A similar argument shows that everything in null(A) is orthogonal to everything in range(A^*). In the terminology of Section 1.B.2, Theorem 2.3.2 shows that these fundamental subspaces are orthogonal complements of each other (i.e., this provides another proof of Theorem 1.B.7). (Look back at Figure 1.26 for a geometric interpretation of these facts.)

Example 2.3.3 Compute a singular value decomposition of the matrix


Computing Bases  
of the Fundamental 1 1 1 −1
Subspaces via SVD A= 0 1 1 0 ,
−1 1 1 1

and use it to construct bases of the four fundamental subspaces of A.


Solution:
To compute the SVD of A, we could start by computing A∗ A as in the
previous example, but we can also construct an SVD from AA∗ instead.
In general, it Since A∗ A is a 4 × 4 matrix and AA∗ is a 3 × 3 matrix, working with AA∗
is a good idea to
will likely be easier, so that is what we do.
compute the SVD of
A from whichever of Direct calculation shows that
A∗ A or AA∗ is smaller.
 
4 2 0
AA∗ = 2 2 2 ,
0 2 4

which has eigenvalues


√ 1 = 6, λ2 = 4, and λ3 = 0, so the singular values
λ√
of A are σ1 = 6, σ2 = 4 = 2, and σ3 = 0. The normalized eigenvectors
corresponding to these eigenvalues are
     
1 −1 −1
1   1   1  
u1 = √ 1 , u2 = √ 0 , and u3 = √ 2 ,
3 2 6
1 1 −1

respectively. We then place these eigenvectors into the matrix U as


216 Chapter 2. Matrix Decompositions

columns and obtain


√  √ √ 
We put these 6 0 0 0 2 − 3 −1
as columns into   1 
√ 
U instead of V
Σ= 0 2 0 0 and U = √  2 0 2
.
6 √ √
because we worked 0 0 0 0 2 3 −1
with AA∗ instead of
A∗ A. Remember that
U must be 3 × 3 and Since 2 of the singular values are non-zero, we know that rank(A) = 2.
V must be 4 × 4. We thus compute the first 2 columns of V via
   
1 0 −1    0
1
1 ∗ 1  
 1 1 1   1   1  1

Here, since we v1 = A u1 = √   √ 1 = √  ,
are working with AA∗ σ1 6 1 1 1  3 2 1
instead of A∗ A, we 1
just swap the roles of
−1 0 1 0
   
U and V , and A and 1 0 −1    −1
A∗ , in Equation −1
(2.3.1). 1 ∗ 1 1 1 1 
 1 1  
0
v2 = A u2 =   √  0  = √   .
σ2 2 1 1 1  2 2 0 
1
−1 0 1 1

The third and fourth columns of V can be found by extending {v1 , v2 }


to an orthonormal basis of R4 . We could do this via the Gram–Schmidt
process, but in this case it is simple enough to “eyeball” vectors that work: we can choose v3 = (0, 1, −1, 0)/√2 and v4 = (1, 0, 0, 1)/√2. It follows
that this singular value decomposition A = UΣV ∗ can be completed by
choosing
 
        V = (1/√2)  0 −1  0  1
                    1  0  1  0
                    1  0 −1  0 .
                    0  1  0  1

We can then construct orthonormal bases of the four fundamental


subspaces directly from Theorem 2.3.2 (recall that the rank of A is r = 2):
        • range(A): {u1, u2} = {(1, 1, 1)/√3, (−1, 0, 1)/√2},
        • null(A∗): {u3} = {(−1, 2, −1)/√6},
        • range(A∗): {v1, v2} = {(0, 1, 1, 0)/√2, (−1, 0, 0, 1)/√2}, and
        • null(A): {v3, v4} = {(0, 1, −1, 0)/√2, (1, 0, 0, 1)/√2}.
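The construction in Theorem 2.3.2 is easy to reproduce numerically. The following is a minimal sketch using Python with NumPy (an assumption of this illustration—the text itself uses no software): it computes an SVD of the matrix above and slices the columns of U and V into orthonormal bases of the four fundamental subspaces. The particular basis vectors may differ from the ones found by hand, since orthonormal bases are not unique.

    import numpy as np

    A = np.array([[ 1, 1, 1, -1],
                  [ 0, 1, 1,  0],
                  [-1, 1, 1,  1]], dtype=float)

    U, s, Vh = np.linalg.svd(A)        # singular values in s, in decreasing order
    V = Vh.conj().T
    r = int(np.sum(s > 1e-10))         # numerical rank: number of non-zero singular values

    range_A      = U[:, :r]            # orthonormal basis of range(A)
    null_A_star  = U[:, r:]            # orthonormal basis of null(A*)
    range_A_star = V[:, :r]            # orthonormal basis of range(A*)
    null_A       = V[:, r:]            # orthonormal basis of null(A)

    # sanity checks: A kills null(A), and A* kills null(A*)
    assert np.allclose(A @ null_A, 0)
    assert np.allclose(A.conj().T @ null_A_star, 0)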

Remark 2.3.1 Up until now, the transpose of a matrix (and more generally, the adjoint of
A Geometric a linear transformation) has been one of the few linear algebraic concepts
Interpretation that we have not interpreted geometrically. The singular value decomposi-
of the Adjoint tion lets us finally fill this gap.
Notice that if A has singular value decomposition A = UΣV ∗ then A∗
Keep in mind that A∗
exists even if A is not
and A−1 (if it exists) have singular value decompositions
square, in which
case Σ∗ has the A∗ = V Σ∗U ∗ and A−1 = V Σ−1U ∗ ,
same diagonal
entries as Σ, but a where Σ−1 is the diagonal matrix with 1/σ1 , . . ., 1/σn on its diagonal.
different shape.

In particular, each of A∗ and A−1 act by undoing the rotations that A


applies in the opposite order—they first apply U ∗ , which is the inverse of
the unitary (rotation) U, then they apply a stretch, and then they apply V ,
which is the inverse of the unitary (rotation) V ∗ . The difference between
A∗ and A−1 is that A−1 also undoes the “stretch” Σ that A applies, whereas
A∗ simply stretches by the same amount.
We can thus think of A∗ as rotating in the opposite direction as A,
but stretching by the same amount (whereas A−1 rotates in the opposite
direction as A and stretches by a reciprocal amount):
[Figure: the unit circle, with e1 and e2 marked, is shown being mapped by A, by A∗, and by A−1—A∗ rotates in the opposite direction to A but stretches by the same amount, while A−1 also undoes the stretch.]

A might be more complicated than this (it might send the unit circle to an ellipse rather than to a circle), but it is a bit easier to visualize the relationships between these pictures when it sends circles to circles.
2.3.2 Relationship with Other Matrix Decompositions


The singular value decomposition also reduces to and clarifies several other
matrix decompositions that we have seen, when we consider special cases of it.
For example, recall from Theorem A.1.3 that if A ∈ Mm,n (F) has rank(A) = r
then we can find vectors {u j }rj=1 ⊂ Fm and {v j }rj=1 ⊂ Fn so that
        A = ∑_{j=1}^r u_j v_j^T.

(This theorem is stated in Appendix A.1.4 just in the F = R case, but it is true over any field.)
Furthermore, rank(A) is the fewest number of terms possible in any such sum,
and we can choose the sets {u j }rj=1 and {v j }rj=1 to be linearly independent.
One way of rephrasing the singular value decomposition is as saying that,
if F = R or F = C, then we can in fact choose the sets of vectors {u j }rj=1 and
{v j }rj=1 to be mutually orthogonal (not just linearly independent):

Theorem 2.3.3 Suppose F = R or F = C, and A ∈ Mm,n (F) has rank(A) = r. There exist
Orthogonal orthonormal sets of vectors {u j }rj=1 ⊂ Fm and {v j }rj=1 ⊂ Fn such that
Rank-One Sum
Decomposition
        A = ∑_{j=1}^r σ_j u_j v_j∗ ,

where σ1 ≥ σ2 ≥ · · · ≥ σr > 0 are the non-zero singular values of A.

Proof. For simplicity, we again assume that m ≤ n throughout this proof, since
nothing substantial changes if m > n. Use the singular value decomposition to
write A = UΣV ∗ , where U and V are unitary and Σ is diagonal, and then write
U and V in terms of their columns:
   
U = u1 | u2 | · · · | um and V = v1 | v2 | · · · | vn .

Furthermore, it follows from Theorem 2.3.2 that {u_j}_{j=1}^r and {v_j}_{j=1}^r are orthonormal bases of range(A) and range(A∗), respectively.

Then performing block matrix multiplication reveals that

        A = UΣV∗ = [ u1 | u2 | · · · | um ] Σ [ v1 | v2 | · · · | vn ]∗
                 = [ σ1u1 | σ2u2 | · · · | σmum | 0 | · · · | 0 ] [ v1 | v2 | · · · | vn ]∗
                 = ∑_{j=1}^m σ_j u_j v_j∗.

By recalling that σ j = 0 whenever j > r, we see that this is exactly what we


wanted. 

In fact, not only does the orthogonal rank-one sum decomposition follow
from the singular value decomposition, but they are actually equivalent—we
can essentially just follow the above proof backward to retrieve the singular
value decomposition from the orthogonal rank-one sum decomposition. For
this reason, this decomposition is sometimes just referred to as the singular
value decomposition itself.

Example 2.3.4 Compute an orthogonal rank-one sum decomposition of the matrix


Computing an Orthogonal Rank-One Sum Decomposition

        A =  1 1 1 −1
             0 1 1  0 .
            −1 1 1  1

Solution:
This is the same matrix from Example 2.3.3, which has singular value

decomposition
√ √  √ 
2 − 3 −1 6 0 0 0
1 √   
U=√  2 0 2 , Σ =  0 2 0 0 and
6 √ √
2 3 −1 0 0 0 0
 
0 −1 0 1
1 1 0 1 0

V=√  .
2 1 0 −1 0
0 1 0 1

Its orthogonal rank-one sum decomposition can be computed by adding


up the outer products of the columns of U and V , scaled by the diagonal
entries of Σ:

                                        1                  −1
        A = σ1 u1 v1∗ + σ2 u2 v2∗  =    1  [0 1 1 0]   +    0  [−1 0 0 1].
                                        1                   1

(The term “outer product” means a product of a column vector and a row vector, like uv∗, in contrast with an “inner product” like u∗v.)
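As a quick numerical sanity check (a NumPy sketch assumed for illustration, not part of the original example), the rank-one terms σ_j u_j v_j∗ can be summed to recover A:

    import numpy as np

    A = np.array([[ 1, 1, 1, -1],
                  [ 0, 1, 1,  0],
                  [-1, 1, 1,  1]], dtype=float)

    U, s, Vh = np.linalg.svd(A)

    # rebuild A as a sum of rank-one outer products sigma_j * u_j * v_j^*
    A_rebuilt = sum(s[j] * np.outer(U[:, j], Vh[j, :]) for j in range(len(s)))
    assert np.allclose(A, A_rebuilt)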
If A ∈ Mn is positive semidefinite then the singular value decomposition
coincides exactly with the spectral decomposition. Indeed, the spectral decom-
position says that we can write A = UDU ∗ with U unitary and D diagonal (with
non-negative diagonal entries, thanks to A being positive semidefinite). This is
also a singular value decomposition of A, with the added structure of having
the two unitary matrices U and V being equal to each other. This immediately
tells us the following important fact:

! If A ∈ Mn is positive semidefinite then its singular values equal


its eigenvalues.

Slightly more generally, there is also a close relationship between the


singular value decomposition of a normal matrix and its spectral decomposition
(and thus the singular values of a normal matrix and its eigenvalues):

Theorem 2.3.4 If A ∈ Mn is a normal matrix then its singular values are the absolute
Singular Values of values of its eigenvalues.
Normal Matrices

Proof. Since A is normal, we can use the spectral decomposition to write


A = UDU ∗ , where U is unitary and D is diagonal (with the not-necessarily-
positive eigenvalues of A on its diagonal). (Alternatively, we could prove this theorem by recalling that the singular values of A are the square roots of the eigenvalues of A∗A: if A is normal then A = UDU∗, so A∗A = UD∗DU∗, which has eigenvalues |λ_j|².) If we write each λ_j in its polar form λ_j = |λ_j| e^{iθ_j} then

        D = diag(λ1 , λ2 , . . . , λn )
          = diag(|λ1 |, |λ2 |, . . . , |λn |) · diag(e^{iθ1}, e^{iθ2}, . . . , e^{iθn}),
                 (call this Σ)                       (call this Θ)

so that A = UDU ∗ = UΣΘU ∗ = UΣ(UΘ∗ )∗ . Since U and Θ are both unitary,


so is UΘ∗ , so this is a singular value decomposition of A. It follows that the
diagonal entries of Σ (i.e., |λ1 |, |λ2 |, . . . , |λn |) are the singular values of A. 
To see that the above theorem does not hold for non-normal matrices,
consider the matrix  
1 1
A= ,
0 1
which has both eigenvalues equal to 1 (it is upper triangular, so its eigenvalues are its diagonal entries). However, its singular values are (√5 ± 1)/2. In fact,
the conclusion of Theorem 2.3.4 does not hold for any non-normal matrix (see
Exercise 2.3.20).
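A quick numerical check of this counterexample (a NumPy sketch assumed for illustration):

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [0.0, 1.0]])

    print(np.abs(np.linalg.eigvals(A)))                # [1. 1.]
    print(np.linalg.svd(A, compute_uv=False))          # approximately [1.618  0.618]
    print((np.sqrt(5) + 1) / 2, (np.sqrt(5) - 1) / 2)  # the exact values (sqrt(5) +/- 1)/2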
Finally, we note that the singular value decomposition is also “essentially
equivalent” to the polar decomposition:

        ! If A ∈ Mn has singular value decomposition A = UΣV∗ then A = (UV∗)(VΣV∗) is a polar decomposition of A.

The reason that the above fact holds is simply that UV∗ is unitary whenever U and V are, and VΣV∗ is positive semidefinite. (Keep in mind that this argument only works when A is square, but the SVD is more general and also applies to rectangular matrices.) This argument also works in
reverse—if we start with a polar decomposition A = UP and apply the spectral
decomposition to P = V ΣV ∗ , we get A = (UV )ΣV ∗ which is a singular value
decomposition of A.
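Here is a minimal NumPy sketch of this back-and-forth for a small square matrix (the matrix is arbitrary and assumed purely for illustration):

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

    U, s, Vh = np.linalg.svd(A)
    V = Vh.conj().T

    W = U @ Vh                     # the unitary factor U V*
    P = V @ np.diag(s) @ Vh        # the positive semidefinite factor V Sigma V*

    assert np.allclose(A, W @ P)                     # A = (U V*)(V Sigma V*) is a polar decomposition
    assert np.allclose(W.conj().T @ W, np.eye(2))    # U V* is unitary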

2.3.3 The Operator Norm


One of the primary uses of singular values is that they provide us with ways of
measuring “how big” a matrix is. We have already seen two ways of doing this:
See Appendix A.1.5 • The absolute value of the determinant of a matrix measures how much it
for a refresher
expands space when acting as a linear transformation. That is, it is the
on this geometric
interpretation of the area (or volume, or hypervolume, depending on the dimension) of the
determinant. output of the unit square, cube, or hypercube after it is acted upon by the
matrix.
• The Frobenius norm, which is simply the norm induced by the inner
product in Mm,n .
In fact, both of the above quantities come directly from singular values—we
will see in Exercise 2.3.7 that if A ∈ Mn has singular values σ1 , σ2 , . . ., σn
then | det(A)| = σ1 σ2 · · · σn , and we will see in Theorem 2.3.7 that kAk2F =
σ12 + σ22 + · · · + σn2 .
However, the Frobenius norm is in some ways a very silly matrix norm. It
is really “just” a vector norm—it only cares about the vector space structure of
Mm,n , not the fact that we can multiply matrices. There is thus no geometric
interpretation of the Frobenius norm in terms of how the matrix acts as a linear
transformation. For example, we can rearrange the entries of a matrix freely
without changing its Frobenius norm, but doing so completely changes how
it acts geometrically. In practice, the Frobenius norm is often just used for
computational convenience—it is often not the “right” norm to work with, but
it is so dirt easy to compute that we let it slide.
We now introduce another norm (i.e., way of measuring the “size” of a
matrix) that really tells us something fundamental about how that matrix acts

as a linear transformation. Specifically, it tells us the largest factor by which a


matrix can stretch a vector. Also, as suggested earlier, we will see that this norm
can also be phrased in terms of singular values (it simply equals the largest of
them).

Definition 2.3.1 Suppose F = R or F = C, and A ∈ Mm,n (F). The operator norm of A,


Operator Norm denoted by kAk, is any of the following (equivalent) quantities:
 
        kAk := max{ kAvk/kvk : v ∈ Fn , v ≠ 0 }
             = max{ kAvk : v ∈ Fn , kvk ≤ 1 }
             = max{ kAvk : v ∈ Fn , kvk = 1 }.

As its name suggests, the operator norm really is a norm in the sense of Section 1.D (see Exercise 2.3.16).

The fact that the three maximizations above really are equivalent to each
other follows simply from rescaling the vectors v that are being maximized
over. In particular, v is a vector that maximizes kAvk/kvk (i.e., the first maxi-
mization) if and only if v/kvk is a unit vector that maximizes kAvk (i.e., the
second and third maximizations). The operator norm is typically considered
the “default” matrix norm, so if we use the notation kAk without any subscripts
or other indicators, we typically mean the operator norm (just like kvk for a
vector v ∈ Fn typically refers to the norm induced by the dot product if no other
context is provided).
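To get a feel for the maximizations in Definition 2.3.1, here is a rough numerical sketch (NumPy is assumed; this is an illustration, not a computation from the text) that samples many unit directions, records the largest stretch factor, and compares the result with NumPy's built-in operator norm:

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[ 1.0, 2.0, 3.0],
                  [-1.0, 0.0, 1.0],
                  [ 3.0, 2.0, 1.0]])

    best = 0.0
    for _ in range(100_000):
        v = rng.standard_normal(3)
        best = max(best, np.linalg.norm(A @ v) / np.linalg.norm(v))

    print(best)                   # approaches the operator norm from below
    print(np.linalg.norm(A, 2))   # the operator norm itself (see Example 2.3.5 for this matrix)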

Remark 2.3.2 In the definition of the operator norm (Definition 2.3.1), the norm used on
Induced Matrix vectors in Fn is the usual norm induced by the dot product:
Norms

        kvk = √(v · v) = √( |v1|² + |v2|² + · · · + |vn|² ).

However, it is also possible to define matrix norms (or more generally, norms of linear transformations between any two normed vector spaces—see Section 1.D) based on any norms on the input and output spaces. That is, given any normed vector spaces V and W, we define the induced norm of a linear transformation T : V → W by

        kT k := max{ kT(v)kW : v ∈ V, kvkV = 1 },

and the geometric interpretation of this norm is similar to that of the operator norm—kT k measures how much T can stretch a vector, when “stretch” is measured in whatever norms we chose for V and W. (Most of the results from later in this section break down if we use a weird norm on the input and output vector space. For example, induced matrix norms are often very difficult to compute, but the operator norm is easy to compute.)

Notice that a matrix cannot stretch any vector by more than a multiplicative
factor of its operator norm. That is, if A ∈ Mm,n and B ∈ Mn,p then kAwk ≤
Be careful: in kAkkwk and kBvk ≤ kBkkvk for all v ∈ F p and w ∈ Fn . It follows that
expressions like
kAkkwk, the first norm
is the operator norm
k(AB)vk = kA(Bv)k ≤ kAkkBvk ≤ kAkkBkkvk for all v ∈ F p .
(of a matrix) and the
second norm is the Dividing both sides by kvk shows that k(AB)vk/kvk ≤ kAkkBk for all v 6= 0,
norm (of a vector) so kABk ≤ kAkkBk. We thus say that the operator norm is submultiplicative.
induced by the dot It turns out that the Frobenius norm is also submultiplicative, and we state these
product.
two results together as a theorem.

Theorem 2.3.5 Let A ∈ Mm,n and B ∈ Mn,p . Then


Submultiplicativity
kABk ≤ kAkkBk and kABkF ≤ kAkF kBkF .

Proof. We already proved submultiplicativity for the operator norm, and we


leave submultiplicativity of the Frobenius norm to Exercise 2.3.12. 

Unlike the Frobenius norm, it is not completely obvious how to actually


compute the operator norm. It turns out that singular values save us here,
and to see why, recall that every matrix sends the unit circle (or sphere, or
hypersphere...) to an ellipse (or ellipsoid, or hyperellipsoid...). The operator
norm asks for the norm of the longest vector on that ellipse, as illustrated in
Figure 2.9. By comparing this visualization with the one from Figure 2.7, it
should seem believable that the operator norm of a matrix just equals its largest
singular value.
[Figure: the unit circle (with e1 and e2) is mapped by A to an ellipse containing Ae1 and Ae2; the longest vector Av on that ellipse has length kAk = kAvk.]
Figure 2.9: A visual representation of the operator norm. Matrices (linear transforma-
tions) transform the unit circle into an ellipse. The operator norm is the distance of the
farthest point on that ellipse from the origin (i.e., the length of its semi-major axis).
To prove this relationship between the operator norm and singular values
algebraically, we first need the following helper theorem that shows that mul-
tiplying a matrix on the left or right by a unitary matrix does not change its
operator norm. For this reason, we say that the operator norm is unitarily
invariant, and it turns out that the Frobenius norm also has this property:

Theorem 2.3.6 Let A ∈ Mm,n and suppose U ∈ Mm and V ∈ Mn are unitary matrices.
Unitary Invariance Then
kUAV k = kAk and kUAV kF = kAkF .

Proof. For the operator norm, we start by showing that every unitary matrix
The fact that kUk = 1 U ∈ Mm has kUk = 1. To this end, just recall from Theorem 1.4.9 that kUvk =
whenever U is
kvk for all v ∈ Fm , so kUvk/kvk = 1 for all v 6= 0, which implies kUk = 1.
unitary is useful.
Remember it! We then know from submultiplicativity of the operator norm that

kUAV k ≤ kUkkAV k ≤ kUkkAkkV k = kAk.

However, by cleverly using the fact that U ∗U = I and VV ∗ = I, we can also


deduce the opposite inequality:

kAk = k(U ∗U)A(VV ∗ )k = kU ∗ (UAV )V ∗ k ≤ kU ∗ kkUAV kkV ∗ k = kUAV k.



We thus conclude that kAk = kUAV k, as desired.


Unitary invariance for the Frobenius norm is somewhat more straight-
forward—we can directly compute

kUAV k2F = tr (UAV )∗ (UAV ) (definition of k · kF )
= tr(V ∗ A∗U ∗UAV ) (expand parentheses)
∗ ∗
= tr(V A AV ) (U is unitary)
= tr(VV ∗ A∗ A) (cyclic commutativity)
= tr(A∗ A) = kAk2F . (V is unitary) 
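A numerical spot-check of unitary invariance (a NumPy sketch assumed for illustration; the random orthogonal matrices here are obtained from QR factorizations of random matrices):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 4))

    U, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random 3 x 3 orthogonal matrix
    V, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random 4 x 4 orthogonal matrix

    print(np.linalg.norm(U @ A @ V, 2),     np.linalg.norm(A, 2))      # equal operator norms
    print(np.linalg.norm(U @ A @ V, 'fro'), np.linalg.norm(A, 'fro'))  # equal Frobenius norms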
By combining unitary invariance with the singular value decomposition,
we almost immediately confirm our observation that the operator norm should
equal the matrix’s largest singular value, and we also get a new formula for the
Frobenius norm:

Theorem 2.3.7 Suppose A ∈ Mm,n has rank r and singular values σ1 ≥ σ2 ≥ · · · ≥ σr > 0.
Matrix Norms in Terms of Singular Values
Then

        kAk = σ1    and    kAkF = √( ∑_{j=1}^r σ_j² ).

Proof. If we write A in its singular value decomposition A = UΣV ∗ , then


unitary invariance
q tells us that kAk = kΣk and kAkF = kΣkF . The fact that
kΣkF = ∑rj=1 σ 2j then follows immediately from the fact that σ1 , σ2 , . . . , σr
are the non-zero entries of Σ.
To see that kΣk = σ1 , first note that direct matrix multiplication shows that
Recall that
e1 = (1, 0, . . . , 0). kΣe1 k = k(σ1 , 0, . . . , 0)k = σ1 ,

so kΣk ≥ σ1 . For the opposite inequality, note that for every v ∈ Fn we have
        kΣvk = k(σ1v1 , . . . , σrvr , 0, . . . , 0)k = √( ∑_{i=1}^r σ_i²|v_i|² ) ≤ √( ∑_{i=1}^r σ_1²|v_i|² ) ≤ σ1 kvk,

where the first inequality uses σ1 ≥ σ2 ≥ · · · ≥ σr.

By dividing both sides of this inequality by kvk, we see that kΣvk/kvk ≤ σ1


whenever v 6= 0, so kΣk ≤ σ1 . Since we already proved the opposite inequality,
we conclude that kΣk = σ1 , which completes the proof. 

 
Example 2.3.5
Computing Matrix Norms

Compute the operator and Frobenius norms of

        A =  1 2 3
            −1 0 1
             3 2 1 .

Solution:
We saw in Example 2.3.2 that this matrix has non-zero singular values σ1 = 2√6 and σ2 = √6. By Theorem 2.3.7, it follows that

        kAk = σ1 = 2√6    and
        kAkF = √(σ1² + σ2²) = √( (2√6)² + (√6)² ) = √(24 + 6) = √30.

It is worth pointing out, however, that if we had not already pre-


computed the singular values of A, it would be quicker and easier to
compute its Frobenius norm directly from the definition:
        kAkF = √( ∑_{i,j=1}^3 |a_{i,j}|² )
             = √( |1|² + |2|² + |3|² + |−1|² + |0|² + |1|² + |3|² + |2|² + |1|² )
             = √30.
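Both formulas in Theorem 2.3.7 are easy to confirm numerically for this matrix (a NumPy sketch assumed for illustration):

    import numpy as np

    A = np.array([[ 1.0, 2.0, 3.0],
                  [-1.0, 0.0, 1.0],
                  [ 3.0, 2.0, 1.0]])

    s = np.linalg.svd(A, compute_uv=False)

    print(s[0], np.linalg.norm(A, 2))                        # both equal 2*sqrt(6)
    print(np.sqrt(np.sum(s**2)), np.linalg.norm(A, 'fro'))   # both equal sqrt(30)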

The characterization of the operator and Frobenius norms in terms of


singular values are very useful for the fact that they provide us with several
immediate corollaries that are not obvious from their definitions. For example,
if A ∈ Mm,n then kAk = kAT k = kA∗ k since singular values are unchanged by
In fact, this property taking the (conjugate) transpose of a matrix (and similarly, kAkF = kAT kF =
of the Frobenius
kA∗ kF , but this can also be proved directly from the definition of the Frobenius
norm was proved
back in norm).
Exercise 1.3.10. The following property is slightly less obvious, and provides us with one
important condition under which equality is obtained in the submultiplicativity
inequality kABk ≤ kAkkBk:

Corollary 2.3.8 If A ∈ Mm,n then kA∗ Ak = kAk2 .


The C∗ -Property
of
the Operator Norm
Proof. Start by writing A in its singular value decomposition A = UΣV ∗ , where
Σ has largest diagonal entry σ1 (the largest singular value of A). Then

kA∗ Ak = k(UΣV ∗ )∗ (UΣV ∗ )k = kV Σ∗U ∗UΣV ∗ k = kV Σ∗ ΣV ∗ k = kΣ∗ Σk,

with the final equality following from unitary invariance of the operator norm
We use
Theorem 2.3.7 twice (Theorem 2.3.6).
here at the end: Since Σ∗ Σ is a diagonal matrix with largest entry σ12 , it follows that
once to see that
kΣ∗ Σk = σ12 and once kΣ∗ Σk = σ12 = kAk2 , which completes the proof. 
to see that σ1 = kAk.
We close this section by noting that there are actually many matrix norms
out there (just like we saw that there are many vector norms in Section 1.D), and
many of the most useful ones come from singular values just like the operator
and Frobenius norms. We explore another particularly important matrix norm
of this type in Exercises 2.3.17–2.3.19.

Exercises solutions to starred exercises on page 470

2.3.1 Compute a singular value decomposition of each of the following matrices.

    ∗(a)  −1  3         (b)   1 1 −1
           3 −1               −1 1  1

    ∗(c)   0  2         (d)   1 1 −1
           1  1               0 1  1
          −2  0               1 0  1

    ∗(e)   2  2 −2      (f)   1  1  1  1
          −4 −1  4            2  2 −2 −2
          −4  2  4            3 −3  3 −3

2.3.2 Compute the operator norm of each of the matrices from Exercise 2.3.1.

2.3.3 Compute orthonormal bases of the four fundamental subspaces of each of the matrices from Exercise 2.3.1.

2.3.4 Determine which of the following statements are 2.3.14 Let A ∈ Mm,n have singular values σ1 ≥ σ2 ≥
true and which are false. · · · ≥ 0. Show that the block matrix
" #
∗ (a) If λ is an eigenvalue of A ∈ Mn (C) then |λ | is a O A
singular value of A.
(b) If σ is a singular value of A ∈ Mn (C) then σ 2 is a A∗ O
singular value of A2 . has eigenvalues ±σ1 , ±σ2 , . . ., together with |m − n| extra 0
∗ (c) If σ is a singular value of A ∈ Mm,n (C) then σ 2 is eigenvalues.
a singular value of A∗ A.
(d) If A ∈ Mm,n (C) then kA∗ Ak = kAk2 .
∗ (e) If A ∈ Mm,n (C) then kA∗ AkF = kAk2F . ∗∗2.3.15 Suppose A ∈ Mm,n and c ∈ R is a scalar. Show
(f) If A ∈ Mn (C) is a diagonal matrix then its singular that the block matrix
" #
values are its diagonal entries.
cIm A
∗ (g) If A ∈ Mn (C) has singular value decomposition
A = UDV ∗ then A2 = UD2V ∗ . A∗ cIn
(h) If U ∈ Mn is unitary then kUk = 1. is positive semidefinite if and only if kAk ≤ c.
∗ (i) Every matrix has the same singular values as its trans-
pose.
∗∗2.3.16 Show that the operator norm (Definition 2.3.1)
is in fact a norm (i.e., satisfies the three properties of Defini-
∗∗2.3.5 Show that A ∈ Mn is unitary if and only if all of tion 1.D.1).
its singular values equal 1.

∗∗2.3.17 The trace norm of a matrix A ∈ Mm,n is the


∗∗2.3.6 Suppose F = R or F = C, and A ∈ Mn (F). Show sum of its singular values σ1 , σ2 , . . ., σr :
that rank(A) = r if and only if there exists a unitary matrix
def
U ∈ Mn (F) and an invertible matrix Q ∈ Mn (F) such that kAktr = σ1 + σ2 + · · · + σr .
" #
I O (a) Show that
A=U r Q. 
O O kAktr = max |hA, Bi| : kBk ≤ 1 .
B∈Mm,n
∗∗2.3.7 Suppose A ∈ Mn has singular values σ1 , σ2 , . . . ,
[Side note: For this reason, the trace norm and op-
σn . Show that
erator norm are sometimes said to be dual to each
| det(A)| = σ1 σ2 · · · σn . other.]
[Hint: To show the “≥” inequality, cleverly pick B
based on A’s SVD. To show the (harder) “≤” inequal-
2.3.8 Show that if λ is an eigenvalue of A ∈ Mn then
ity, maybe also use Exercise 2.3.13.]
|λ | ≤ kAk.
(b) Use part (a) to show that the trace norm is in fact a
norm in the sense of Definition 1.D.1. That is, show
2.3.9 Show that if A ∈ Mm,n then that the trace norm satisfies the three properties of
p that definition.
kAk ≤ kAkF ≤ rank(A)kAk.

2.3.18 Compute the trace norm (see Exercise 2.3.17) of


∗2.3.10 Let A ∈ Mn be an invertible matrix. Show that
each of the matrices from Exercise 2.3.1.
kA−1 k ≥ 1/kAk, and give an example where equality does
not hold.
∗∗2.3.19 Show that the trace norm from Exercise 2.3.17
is unitarily-invariant. That is, show that if A ∈ Mm,n , and
2.3.11 Suppose a, b ∈ R and
" # U ∈ Mm and V ∈ Mn are unitary matrices, then kUAV ktr =
a −b kAktr .
A= .
b a
√ ∗∗2.3.20 Prove the converse of Theorem 2.3.4. That is,
Show that kAk = a2 + b2 . suppose that A ∈ Mn (C) has eigenvalues λ1 , . . ., λn with
[Side note: Recall from Remark A.3.2 that this matrix rep- |λ1 | ≥ · · · ≥ |λn | and singular values σ1 ≥ · · · ≥ σn . Show
resents the complex number a + bi. This exercise shows that if A is not normal then σ j 6= |λ j | for some 1 ≤ j ≤ n.
that the operator norm gives exactly the magnitude of that
complex number.] 2.3.21 In this exercise, we show that every square matrix
is a linear combination of two unitary matrices. Suppose
∗∗2.3.12 Let A ∈ Mm,n and B ∈ Mn,p . A ∈ Mn (C) has singular value decomposition A = UΣV ∗ .
(a) Show that kABkF ≤ kAkkBkF . (a) Show that if kAk ≤ 1 then I − Σ2 is positive semidef-
[Hint: Apply the singular value decomposition to A.] inite. √ 
(b) Use part (a) to show that kABkF ≤ kAkF kBkF . (b) Show that if kAk ≤ 1 then U Σ ± i I − Σ2 V ∗ are
both unitary matrices.
(c) Use part (b) to show that A can be written as a linear
∗2.3.13 Suppose A ∈ Mm,n . Show that combination of two unitary matrices (regardless of
 ∗
kAk = max |v Aw| : kvk = kwk = 1 . the value of kAk).
m n
v∈F ,w∈F

2.3.22 Suppose that the 2 × 2 block matrix ∗∗2.3.26 A matrix A ∈ Mn (C) is called complex sym-
" # metric if AT = A. For example,
A B " #
B∗ C 1 i
A=
i 2−i
is positive semidefinite. Show that range(B) ⊆ range(A).
is complex symmetric. In this exercise, we show that the sin-
[Hint: Show instead that null(A) ⊆ null(B∗ ).]
gular value decomposition of these matrices can be chosen
to have a special form.
2.3.23 Use the Jordan–von Neumann theorem (Theo-
(a) Provide an example to show that a complex symmet-
rem 1.D.8) to determine whether or not the operator norm
ric matrix may not be normal and thus may not have
is induced by some inner product on Mm,n .
a spectral decomposition.
(b) Suppose A ∈ Mn (C) is complex symmetric. Show
2.3.24 Suppose A ∈ Mm,n has singular values σ1 , σ2 , that there exists a unitary matrix V ∈ Mn (C) such
. . ., σ p (where p = min{m, n}) and QR decomposition (see that, if we define B = V T AV , then B is complex sym-
Section 1.C) A = UT with U ∈ Mm unitary and T ∈ Mm,n metric and B∗ B is real.
upper triangular. Show that the product of the singular val- [Hint: Apply the spectral decomposition to A∗ A.]
ues of A equals the product of the diagonal entries of T : (c) Let B be as in part (b) and define BR = (B + B∗ )/2
and BI = (B − B∗ )/(2i). Show that BR and BI are
σ1 · σ2 · · · σ p = t1,1 · t2,2 · · ·t p,p . real, symmetric, and commute.
[Hint: B = BR + iBI and B∗ B is real.]
∗∗2.3.25 Suppose P ∈ Mn (C) is a non-zero projection [Side note: Here we are using the Cartesian decom-
(i.e., P2 = P). position of B introduced in Remark 1.B.1.]
(d) Let B be as in part (b). Show that there is a unitary
(a) Show that kPk ≥ 1. matrix W ∈ Mn (R) such that W T BW is diagonal.
(b) Show that if P∗ = P (i.e., P is an orthogonal projec- [Hint: Use Exercise 2.1.28—which matrices have we
tion) then kPk = 1. found that commute?]
(c) Show that if kPk = 1 then P∗ = P. (e) Use the unitary matrices V and W from parts (b)
[Hint: Schur triangularize P.] and (d) of this exercise to conclude that there exists
[Side note: This can be seen as the converse of Theo- a unitary matrix U ∈ Mn (C) and a diagonal matrix
rem 1.4.13.] D ∈ Mn (R) with non-negative entries such that
A = UDU T .
[Side note: This is called the Takagi factorization of
A. Be somewhat careful—the entries on the diagonal
of D are the singular values of A, not its eigenvalues.]

2.4 The Jordan Decomposition

All of the decompositions that we have introduced so far in this chapter—Schur


triangularization (Theorem 2.1.1), the spectral decomposition (Theorems 2.1.4
and 2.1.6), the polar decomposition (Theorem 2.2.12), and the singular value
decomposition (Theorem 2.3.1)—have focused on how much simpler we can
make a matrix by either multiplying it by a unitary or applying a unitary
similarity transformation to it.
We now switch gears a bit and return to the setting of diagonalization,
where we allow for arbitrary (not necessarily unitary) similarity transfor-
Recall that mations, and we investigate how “close” to diagonal we can make a non-
diagonalization itself
diagonalizable matrix. Solving this problem and coming up with a way to
was characterized in
Theorem 2.0.1. See “almost-diagonalize” arbitrary matrices is important for at least two reasons:
also Appendix A.1.7 • A linear transformation can be represented by its standard matrix, but by
if you need a
refresher.
changing the basis that we are working in, the entries of that standard
matrix can change considerably. The decomposition in this section tells
us how to answer the question of whether or not two matrices represent
the same linear transformation in different bases.

• The decomposition that we see in this section will provide us with a


method of applying functions like ex and sin(x) to matrices, rather than
just to numbers. Some readers may have already learned how to apply
functions to diagonalizable matrices, but going beyond diagonalizable
matrices requires some extra mathematical machinery.
To get us started toward the decomposition that solves these problems, we
first need a definition that suggests the “almost diagonal” form that we will be
transforming matrices into.

Definition 2.4.1 Given a scalar λ ∈ C and an integer k ≥ 1, the Jordan block of order k
Jordan Blocks corresponding to λ is the matrix Jk (λ ) ∈ Mk (C) of the form
 
We say that Jk(λ) has λ along its diagonal and 1 along its “superdiagonal”.

                 λ 1 0 ··· 0 0
                 0 λ 1 ··· 0 0
        Jk(λ) =  0 0 λ ··· 0 0
                 ⋮ ⋮ ⋮  ⋱  ⋮ ⋮
                 0 0 0 ··· λ 1
                 0 0 0 ··· 0 λ .

For example, the following matrices are all Jordan blocks:

        [7] ,    4 1          −2  1  0
                 0 4 ,  and    0 −2  1
                               0  0 −2 ,

but the following matrices are not:

        4 0      2 1           3 1 0
        0 4 ,    0 3 ,  and    0 3 0
                               0 0 3 .
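Jordan blocks are also easy to generate programmatically; the following is a minimal sketch (Python with NumPy, assumed purely for illustration) of a helper function that builds Jk(λ):

    import numpy as np

    def jordan_block(lam, k):
        # k x k matrix with lam on the diagonal and 1 on the superdiagonal
        return lam * np.eye(k) + np.diag(np.ones(k - 1), 1)

    print(jordan_block(-2, 3))
    # [[-2.  1.  0.]
    #  [ 0. -2.  1.]
    #  [ 0.  0. -2.]]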
The main result of this section says that every matrix is similar to one that is
block diagonal, and whose diagonal blocks are Jordan blocks:

Theorem 2.4.1 If A ∈ Mn (C) then there exists an invertible matrix P ∈ Mn (C) and
Jordan Jordan blocks Jk1 (λ1 ), Jk2 (λ2 ), . . . , Jkm (λm ) such that
Decomposition

                Jk1(λ1)    O     ···     O
                   O    Jk2(λ2)  ···     O
        A = P      ⋮        ⋮      ⋱      ⋮      P−1.
                   O       O     ···  Jkm(λm)

Furthermore, this block diagonal matrix is called the Jordan canonical form of A, and it is unique up to re-ordering the diagonal blocks.

(We will see shortly that the numbers λ1, λ2, . . . , λm are necessarily the eigenvalues of A listed according to geometric multiplicity. Also, we must have k1 + · · · + km = n.)

For example, the following matrices are in Jordan canonical form:

        4 0 0 0        2 1 0 0              3 1 0 0
        0 2 1 0        0 2 0 0              0 3 0 0
        0 0 2 1   ,    0 0 4 0   ,   and    0 0 3 1   .
        0 0 0 2        0 0 0 5              0 0 0 3

Note in particular that there is no need for the eigenvalues corresponding to


different Jordan blocks to be distinct (e.g., the third matrix above has two
Jordan blocks, both of which are 2 × 2 and correspond to λ = 3). However, the
following matrices are not in Jordan canonical form:

        2 1 0 0        2 0 0 0              3 0 1 0
        0 3 0 0        1 2 0 0              0 3 0 0
        0 0 3 0   ,    0 0 1 0   ,   and    0 0 3 0   .
        0 0 0 3        0 0 0 1              0 0 0 2

Diagonal matrices are all in Jordan canonical form, and their Jordan blocks
all have sizes 1 × 1, so the Jordan decomposition really is a generalization of
diagonalization. We now start introducing the tools needed to prove that every
matrix has such a decomposition, to show that it is unique, and to actually
compute it.

2.4.1 Uniqueness and Similarity


Before demonstrating how to compute the entire Jordan decomposition of a
matrix A ∈ Mn (C), we investigate the computation of its Jordan canonical
form. That is, we start by showing how to compute the matrix J, but not P,
in the Jordan decomposition A = PJP−1 , assuming that it exists. As a natural
by-product of this method, we will see that a matrix’s Jordan canonical form is
indeed unique, as stated by Theorem 2.4.1.
Since eigenvalues are unchanged by similarity transformations, and the
eigenvalues of triangular matrices are their diagonal entries, we know that if
the Jordan canonical form of A is

                Jk1(λ1)    O     ···     O
                   O    Jk2(λ2)  ···     O
        J =        ⋮        ⋮      ⋱      ⋮
                   O       O     ···  Jkm(λm)

then the diagonal entries λ1 , λ2 , . . ., λm of J must be the eigenvalues of A.


It is not obvious how to compute the orders k1 , k2 , . . ., km of the Jordan
blocks corresponding to these eigenvalues, so we first introduce a new “helper”
quantity that will get us most of the way there.

Definition 2.4.2 Suppose λ is an eigenvalue of A ∈ Mn and k ≥ 0 is an integer. We say


Geometric that the geometric multiplicity of order k of λ is the quantity
Multiplicity of 
Order k γk = nullity (A − λ I)k .

If k = 0 then (A − λ I)k = I, which has nullity 0, so γ0 = 0 for every eigen-


value of every matrix. More interestingly, if k = 1 then this definition agrees
with the definition of geometric multiplicity that we are already familiar with
(i.e., γ1 is simply the usual geometric multiplicity of the eigenvalue λ ). Further-
more, the chain of inclusions
 
null(A − λ I) ⊆ null (A − λ I)2 ⊆ null (A − λ I)3 ⊆ · · ·

shows that the geometric multiplicities satisfy γ1 ≤ γ2 ≤ γ3 ≤ . . .. Before pre-


senting a theorem that tells us explicitly how to use the geometric multiplicities
to compute the orders of the Jordan blocks in a matrix’s Jordan canonical form,
we present an example that is a bit suggestive of the connection between these
quantities.

Example 2.4.1 Compute the geometric multiplicities of each of the eigenvalues of


Geometric Multiplicities of a Matrix in Jordan Canonical Form

            2 1 0 0 0
            0 2 1 0 0
        A = 0 0 2 0 0 .
            0 0 0 2 1
            0 0 0 0 2

Solution:
Since A is triangular (in fact, it is in Jordan canonical form), we see
immediately that its only eigenvalue is 2 with algebraic multiplicity 5. We
then compute

                 0 1 0 0 0                   0 0 1 0 0
                 0 0 1 0 0                   0 0 0 0 0
        A − 2I = 0 0 0 0 0 ,    (A − 2I)² =  0 0 0 0 0 ,
                 0 0 0 0 1                   0 0 0 0 0
                 0 0 0 0 0                   0 0 0 0 0

and (A − 2I)^k = O whenever k ≥ 3. The geometric multiplicities of the eigenvalue 2 of A are the nullities of these matrices, which are γ1 = 2, γ2 = 4, and γk = 5 whenever k ≥ 3. (We partition A as a block matrix in this way just to better highlight what happens to its Jordan blocks when computing the powers (A − 2I)^k.)
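Numerically, each γk can be computed as n minus the rank of (A − λI)^k. A minimal NumPy sketch (assumed for illustration) that reproduces the multiplicities above:

    import numpy as np

    A = np.array([[2, 1, 0, 0, 0],
                  [0, 2, 1, 0, 0],
                  [0, 0, 2, 0, 0],
                  [0, 0, 0, 2, 1],
                  [0, 0, 0, 0, 2]], dtype=float)

    lam = 2
    n = A.shape[0]
    N = A - lam * np.eye(n)

    for k in range(1, 5):
        gamma_k = n - np.linalg.matrix_rank(np.linalg.matrix_power(N, k))
        print(k, gamma_k)       # prints: 1 2, 2 4, 3 5, 4 5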

In the above example, each Jordan block contributed one dimension to


null(A − 2I), each Jordan
 block of order at least 2 contributed one extra dimen-
sion to null (A − 2I)2 , andthe Jordan block of order 3 contributed one extra
dimension to null (A − 2I)3 . The following theorem says that this relationship
between geometric multiplicities and Jordan blocks holds in general, and we
can thus use geometric multiplicities to determine how many Jordan blocks of
each size a matrix’s Jordan canonical form has.

Theorem 2.4.2 Suppose A ∈ Mn (C) has eigenvalue λ with order-k geometric multiplicity
Jordan Canonical γk . Then for each k ≥ 1, every Jordan canonical form of A has
Form from a) γk − γk−1 Jordan blocks J j (λ ) of order j ≥ k, and
Geometric
Multiplicities b) 2γk − γk+1 − γk−1 Jordan blocks Jk (λ ) of order exactly k.

Before proving this theorem, we note that properties (a) and (b) are ac-
tually equivalent to each other—each one can be derived from the other via
straightforward algebraic manipulations. We just present them both because
property (a) is a bit simpler to work with, but property (b) is what we actually
want.

Proof of Theorem 2.4.2. Suppose A has Jordan decomposition


 A = PJP−1 .
Since γk = nullity((PJP−1 − λI)^k) = nullity((J − λI)^k) for all k ≥ 0, we assume without loss of generality that A = J.


Since J is block diagonal, so is (J − λ I)k for each k ≥ 0. Furthermore, since

the nullity of a block diagonal matrix is just the sum of the nullities of its
diagonal blocks, it suffices to prove the following two claims:

• nullity (J j (µ) − λ I)k = 0 whenever λ 6= µ. To see why this is the case,
simply notice that J j (µ) − λ I has diagonal entries (and thus eigenvalues,
since it is upper triangular) λ − µ 6= 0. It follows that J j (µ) − λ I is
invertible whenever λ 6= µ, so it and its powers all have full rank (and
thus nullity 0).

• nullity (J j (λ ) − λ I)k = k whenever 0 ≤ k ≤ j. To see why this is the
case, notice that

                        0 1 0 ··· 0 0
                        0 0 1 ··· 0 0
        J_j(λ) − λI =   0 0 0 ··· 0 0
                        ⋮ ⋮ ⋮  ⋱  ⋮ ⋮
                        0 0 0 ··· 0 1
                        0 0 0 ··· 0 0 ,

which is a simple enough matrix that we can compute the nullities of its powers fairly directly (this computation is left to Exercise 2.4.17). (That is, J_j(λ) − λI = J_j(0); in Exercise 2.4.17, we call this matrix N1.)
Indeed, if we let mk (1 ≤ k ≤ n) denote the number of occurrences of
the Jordan block Jk (λ ) along the diagonal of J, the above two claims tell us
that γ1 = m1 + m2 + m3 + · · · + mn , γ2 = m1 + 2m2 + 2m3 + · · · + 2mn , γ3 =
m1 + 2m2 + 3m3 + · · · + 3mn , and in general
        γk = ∑_{j=1}^k j·m_j + ∑_{j=k+1}^n k·m_j.

Subtracting these formulas from each other gives γk − γk−1 = ∑_{j=k}^n m_j, which is exactly statement (a) of the theorem. Statement (b) of the theorem then follows from subtracting the formula in statement (a) from a shifted version of itself:

        2γk − γk+1 − γk−1 = (γk − γk−1) − (γk+1 − γk) = ∑_{j=k}^n m_j − ∑_{j=k+1}^n m_j = m_k.

The above theorem has the following immediate corollaries that can some-
times be used to reduce the amount of work needed to construct a matrix’s
Jordan canonical form:
• The geometric multiplicity γ1 of the eigenvalue λ counts the number of
Jordan blocks corresponding to λ .
• If γk = γk+1 for a particular value of k then γk = γk+1 = γk+2 = · · · .
Furthermore, if we make use of the fact that the sum of the orders of the Jordan
blocks corresponding to a particular eigenvalue λ must equal its algebraic
multiplicity (i.e., the number of times that λ appears along the diagonal of the
Jordan canonical form), we get a bound on how many geometric multiplicities we have to compute in order to construct a Jordan canonical form:

        ! If λ is an eigenvalue of a matrix with algebraic multiplicity µ and geometric multiplicities γ1, γ2, γ3, . . ., then γk ≤ µ for each k ≥ 1. Furthermore, γk = µ whenever k ≥ µ.

(Furthermore, A ∈ Mn(C) is diagonalizable if and only if, for each of its eigenvalues, the multiplicities satisfy γ1 = γ2 = · · · = µ.)

As one final corollary of the above theorem, we are now able to show
that Jordan canonical forms are indeed unique (up to re-ordering their Jordan
blocks), assuming that they exist in the first place:

Proof of uniqueness in Theorem 2.4.1. Theorem 2.4.2 tells us exactly what


the Jordan blocks in any Jordan canonical form of a matrix A ∈ Mn (C) must
be. 

 
Example 2.4.2
Our First Jordan Canonical Form

Compute the Jordan canonical form of

        A = −6  0 −2  1
             0 −3 −2  1
             3  0  1  1
            −3  0  2  2 .
Solution:
The eigenvalues of this matrix are (listed according to algebraic multi-
plicity) 3, −3, −3, and −3. Since λ = 3 has algebraic multiplicity 1, we
know that J1 (3) = [3] is one of the Jordan blocks in the Jordan canonical
form J of A. Similarly, since λ = −3 has algebraic multiplicity 3, we know
that the orders of the Jordan blocks for λ = −3 must sum to 3. That is, the
Jordan canonical form of A must be one of
     
3 0 0 0 3 0 0 0 3 0 0 0
 0 −3 1 0   0 −3 0 0   0 −3 0 0 
     
 0 0 −3 1  ,  0 0 −3 1  ,  0 0 −3 0  .
0 0 0 −3 0 0 0 −3 0 0 0 −3

To determine which of these canonical forms is correct, we simply note that λ = −3 has geometric multiplicity γ1 = 1 (its corresponding eigenvectors are all scalar multiples of (0, 1, 0, 0)), so it must have just one corresponding Jordan block. That is, the Jordan canonical form J of A is

        J = 3  0  0  0
            0 −3  1  0
            0  0 −3  1
            0  0  0 −3 .

(We could also compute γ2 = 2 and γk = 3 whenever k ≥ 3, so that A has 2γ1 − γ2 − γ0 = 0 copies of J1(−3), 2γ2 − γ3 − γ1 = 0 copies of J2(−3), and 2γ3 − γ4 − γ2 = 1 copy of J3(−3) as Jordan blocks.)

Example 2.4.3 Compute the Jordan canonical form of


A Hideous Jordan Canonical Form

             1 −1 −1  0  0  1 −1  0
             1  2  1  0  0 −1  1 −1
             2  1  2  0  1 −1  0  0
        A =  1  1  1  1  1 −1  0 −1
            −2  0  0  0  0  0  1  1
             1  1  1  0  0  0  1  0
            −2 −1 −1  0 −1  1  1  1
             0 −1 −1  0  0  1 −1  1 .

The important take-away from this example is how to construct the Jordan canonical form from the geometric multiplicities. Don’t worry about the details of computing those multiplicities.

Solution:
The only eigenvalue of A is λ = 1, which has algebraic multiplicity µ = 8. The geometric multiplicities of this eigenvalue can be computed

via standard techniques, but we omit the details here due to the size
of this matrix. In particular, the geometric multiplicity of λ = 1 is γ1 = 4,
and we furthermore have

        γ2 = nullity((A − I)²)

                          0  0  0  0  0  0  0  0
                          0  0  0  0  0  0  0  0
                          0 −1 −1  0  0  1 −1  0
           = nullity(     0  1  1  0  0 −1  1  0     ) = 7,
                          0  0  0  0  0  0  0  0
                          0 −1 −1  0  0  1 −1  0
                          0  0  0  0  0  0  0  0
                          0  0  0  0  0  0  0  0

and (A − I)3 = O so γk = 8 whenever k ≥ 3. It follows that the Jordan


canonical form J of A has
• 2γ1 − γ2 − γ0 = 8 − 7 − 0 = 1 Jordan block J1 (1) of order 1,
• 2γ2 − γ3 − γ1 = 14 − 8 − 4 = 2 Jordan blocks J2 (1) of order 2,
• 2γ3 − γ4 − γ2 = 16 − 8 − 7 = 1 Jordan block J3 (1) of order 3, and
• 2γk − γk+1 − γk−1 = 16 − 8 − 8 = 0 Jordan blocks Jk (1) of order k
when k ≥ 4.
In other words, the Jordan canonical form J of A is
 
We use dots (·) to denote some of the zero entries, for ease of visualization.

            1 1 0 · · · · ·
            0 1 1 · · · · ·
            0 0 1 · · · · ·
            · · · 1 1 · · ·
        J = · · · 0 1 · · ·
            · · · · · 1 1 ·
            · · · · · 0 1 ·
            · · · · · · · 1 .
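The block counts in this example come straight from Theorem 2.4.2(b), and the bookkeeping is simple enough to automate. A short sketch (plain Python, assumed for illustration) applying the formula 2γk − γk+1 − γk−1 to the multiplicities γ1 = 4, γ2 = 7 and γk = 8 for k ≥ 3 computed above:

    # gamma[k] = order-k geometric multiplicity of lambda = 1, with gamma[0] = 0
    gamma = [0, 4, 7, 8, 8, 8, 8, 8, 8, 8]

    for k in range(1, 9):
        blocks = 2 * gamma[k] - gamma[k + 1] - gamma[k - 1]
        print(k, blocks)    # 1 block of order 1, 2 of order 2, 1 of order 3, none larger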

A Return to Similarity
Recall that two matrices A, B ∈ Mn (C) are called similar if there exists an
invertible matrix P ∈ Mn (C) such that A = PBP−1 . Two matrices are similar if
and only if there is a common linear transformation T between n-dimensional
vector spaces such that A and B are both standard matrices of T (with respect to
different bases). Many tools from introductory linear algebra can help determine
whether or not two matrices are similar in certain special cases (e.g., if two
matrices are similar then they have the same characteristic polynomial, and thus
the same eigenvalues, trace, and determinant), but a complete characterization
of similarity relies on the Jordan canonical form:

Theorem 2.4.3 Suppose A, B ∈ Mn (C). The following are equivalent:


Similarity a) A and B are similar,
via the Jordan
b) the Jordan canonical forms of A and B have the same Jordan blocks,
Decomposition
and
c) A and B have the same eigenvalues, and those eigenvalues have the
same order-k geometric multiplicities for each k.

Proof. The equivalence of conditions (b) and (c) follows immediately from
Theorem 2.4.2, which tells us that we can determine the orders of a matrix’s
Jordan blocks from its geometric multiplicities, and vice-versa. We thus just
focus on demonstrating that conditions (a) and (b) are equivalent.
To see that (a) implies (b), suppose A and B are similar so that we can write
A = QBQ−1 for some invertible Q. If B = PJP−1 is a Jordan decomposition
of B (i.e., the Jordan canonical form of B is J) then A = Q(PJP−1 )Q−1 =
(QP)J(QP)−1 is a Jordan decomposition of A with the same Jordan canonical
form J.
If A = PJP−1
is a Jordan For the reverse implication, suppose that A and B have identical Jordan
decomposition, canonical forms. That is, we can find invertible P and Q such that the Jordan
we can permute the decompositions of A and B are A = PJP−1 and B = QJQ−1 , where J is their
columns of P to put
the diagonal blocks
(shared) Jordan canonical form. Rearranging these equations gives J = P−1 AP
of J in any order and J = Q−1 BQ, so P−1 AP = Q−1 BQ. Rearranging one more time then gives
we like. A = P(Q−1 BQ)P−1 = (PQ−1 )B(PQ−1 )−1 , so A and B are similar. 

Example 2.4.4 Determine whether or not the following matrices are similar:
Checking Similarity

        A = 1 1 0 0        and    B = 1 0 0 0
            0 1 0 0                   0 1 1 0
            0 0 1 1                   0 0 1 1
            0 0 0 1                   0 0 0 1 .

Solution:
This example is tricky if we do not use Theorem 2.4.3, since A and B have the same rank and characteristic polynomial.

These matrices are already in Jordan canonical form. In particular, the Jordan blocks of A are

        1 1        and    1 1
        0 1               0 1 ,

whereas the Jordan blocks of B are

                          1 1 0
        [1]    and        0 1 1
                          0 0 1 .

Since they have different Jordan blocks, A and B are not similar.

Example 2.4.5 Determine whether or not the following matrices are similar:
Checking Similarity (Again)

        A = −6  0 −2  1        and    B = −3  0  3  0
             0 −3 −2  1                   −2  2 −5  2
             3  0  1  1                    2  1 −4 −2
            −3  0  2  2                   −2  2  1 −1 .

Solution:
We already saw in Example 2.4.2 that A has Jordan canonical form
 
3 0 0 0
 0 −3 1 0 
J= 0 0 −3 1  .

0 0 0 −3

It follows that, to see whether or not A and B are similar, we need to check
whether or not B has the same Jordan canonical form J.
To this end, we note that B also has eigenvalues (listed according to
algebraic multiplicity) 3, −3, −3, and −3. Since λ = 3 has algebraic multi-
plicity 1, we know that J1 (3) = [3] is one of the Jordan blocks in the Jordan
canonical form B. Similarly, since λ = −3 has algebraic multiplicity 3,
we know that the orders of the Jordan blocks for λ = −3 sum to 3.
It is straightforward to show that the eigenvalue λ = −3 has geometric
multiplicity γ1 = 1 (its corresponding eigenvectors are all scalar multiples
of (1, 0, 0, 1)), so it must have just one corresponding Jordan block. That
is, the Jordan canonical form of B is indeed the matrix J above, so A and
B are similar.

Example 2.4.6 Determine whether or not the following matrices are similar:
Checking Similarity (Yet Again)

        A = 1 2 3 4        and    B = 1 0 0 4
            2 1 0 3                   0 2 3 0
            3 0 1 2                   0 3 2 0
            4 3 2 1                   4 0 0 1 .

Solution:
These matrices do not even have the same trace (tr(A) = 4, but tr(B) =
6), so they cannot possibly be similar. We could have also computed their
Jordan canonical forms to show that they are not similar, but it saves
a lot of work to do the easier checks based on the trace, determinant,
eigenvalues, and rank first!
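These easier checks are simple to automate before attempting a full Jordan canonical form. A small NumPy sketch (assumed for illustration; the helper name is ours, not the book's) that compares traces, determinants and characteristic polynomials:

    import numpy as np

    def quick_similarity_checks(A, B):
        # necessary (but not sufficient) conditions for A and B to be similar
        same_trace = np.isclose(np.trace(A), np.trace(B))
        same_det = np.isclose(np.linalg.det(A), np.linalg.det(B))
        same_char_poly = np.allclose(np.poly(A), np.poly(B))
        return same_trace and same_det and same_char_poly

    A = np.array([[1, 2, 3, 4], [2, 1, 0, 3], [3, 0, 1, 2], [4, 3, 2, 1]], dtype=float)
    B = np.array([[1, 0, 0, 4], [0, 2, 3, 0], [0, 3, 2, 0], [4, 0, 0, 1]], dtype=float)

    print(quick_similarity_checks(A, B))   # False: the traces already differ (4 versus 6)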

2.4.2 Existence and Computation


The goal of this subsection is to develop a method of computing a matrix’s
Jordan decomposition A = PJP−1 , rather than just its Jordan canonical form J.
One useful by-product of this method will be that it works for every complex
matrix and thus (finally) proves that every matrix does indeed have a Jordan

decomposition—a fact that we have been taking for granted up until now.
In order to get an idea of how we might compute a matrix’s Jordan decom-
position (or even just convince ourselves that it exists), suppose for a moment
that the Jordan canonical form of a matrix A ∈ Mk (C) has just a single Jordan
block. That is, suppose we could write A = PJk (λ )P−1 for some invertible
P ∈ Mk (C) and some scalar λ ∈ C (necessarily equal to the one and only
eigenvalue of A). Our usual block matrix multiplication techniques show that if
v( j) is the j-th column of P then

Recall that Pe_j = v(j), and multiplying both sides on the left by P−1 shows that P−1v(j) = e_j. We use superscripts for v(j) here since we will use subscripts for something else shortly.

        Av(1) = PJk(λ)P−1v(1) = PJk(λ)e1 = λPe1 = λv(1),

and

        Av(j) = PJk(λ)P−1v(j) = PJk(λ)e_j = P(λe_j + e_{j−1}) = λv(j) + v(j−1)    for all 2 ≤ j ≤ k.
Phrased slightly differently, v(1) is an eigenvector of A with eigenvalue λ ,
(i.e., (A − λ I)v(1) = 0) and v(2) , v(3) , . . . , v(k) are vectors that form a “chain”
with the property that

(A − λ I)v(k) = v(k−1) , (A − λ I)v(k−1) = v(k−2) , ··· (A − λ I)v(2) = v(1) .

It is perhaps useful to picture this relationship between the vectors slightly


more diagrammatically:
        v(k) −(A−λI)→ v(k−1) −(A−λI)→ · · · −(A−λI)→ v(2) −(A−λI)→ v(1) −(A−λI)→ 0.

We thus guess that vectors that form a “chain” leading down to an eigenvec-
tor are the key to constructing a matrix’s Jordan decomposition, which leads
naturally to the following definition:

Definition 2.4.3 Suppose A ∈ Mn (C) is a matrix with eigenvalue λ and corresponding


Jordan Chains eigenvector v. A sequence of non-zero vectors v(1) , v(2) , . . . , v(k) is called
a Jordan chain of order k corresponding to v if v(1) = v and

(A − λ I)v( j) = v( j−1) for all 2 ≤ j ≤ k.

By mimicking the block matrix multiplication techniques that we performed


above, we arrive immediately at the following theorem:

Theorem 2.4.4 Suppose v(1) , v(2) , . . . , v(k) is a Jordan chain corresponding to an eigenvalue
Multiplication by a λ of A ∈ Mn (C). If we define
Single Jordan  
Chain P = v(1) | v(2) | · · · | v(k) ∈ Mn,k (C)

then AP = PJk (λ ).

Proof. Just use block matrix multiplication to compute


 
        AP = [ Av(1) | Av(2) | · · · | Av(k) ]
           = [ λv(1) | v(1) + λv(2) | · · · | v(k−1) + λv(k) ]

and

        PJk(λ) = [ v(1) | v(2) | · · · | v(k) ] Jk(λ) = [ λv(1) | v(1) + λv(2) | · · · | v(k−1) + λv(k) ],

so AP = PJk(λ).

(If n = k and the members of the Jordan chain form a linearly independent set then P is invertible and this theorem tells us that A = PJk(λ)P−1, which is a Jordan decomposition.)
With the above theorem in mind, our goal is now to demonstrate how to
construct Jordan chains, which we will then place as columns into the matrix P
in a Jordan decomposition A = PJP−1 .
One Jordan Block per Eigenvalue
Since the details of Jordan chains and the Jordan decomposition simplify considerably in the case when there is just one Jordan block (and thus just one Jordan chain) corresponding to each of the matrix’s eigenvalues, that is the situation that we explore first. (In other words, we are first considering the case when each eigenvalue has geometric multiplicity γ1 = 1.) In this case, we can find a matrix’s Jordan chains just by finding an eigenvector corresponding to each of its eigenvalues, and

Example 2.4.7 Find the Jordan chains of the matrix


Computing Jordan Chains and the Jordan Decomposition

        A = 5 1 −1
            1 3 −1
            2 0  2 ,

and use them to construct its Jordan decomposition.

Solution:
The eigenvalues of this matrix are (listed according to their algebraic multiplicity) 2, 4, and 4. An eigenvector corresponding to λ = 2 is v1 = (0, 1, 1), and an eigenvector corresponding to λ = 4 is v2 = (1, 0, 1). However, the eigenspace corresponding to the eigenvalue λ = 4 is just 1-dimensional (i.e., its geometric multiplicity is γ1 = 1), so it is not possible to find a linearly independent set of two eigenvectors corresponding to λ = 4. At this point, we know that the Jordan canonical form of A must be

        2 0 0
        0 4 1
        0 0 4 .
Instead, we must find a Jordan chain {v2^(1), v2^(2)} with the property that

        (A − 4I)v2^(2) = v2^(1)    and    (A − 4I)v2^(1) = 0.

We choose v2^(1) = v2 = (1, 0, 1) to be the eigenvector corresponding to λ = 4 that we already found, and we can then find v2^(2) by solving the linear system (A − 4I)v2^(2) = v2^(1) for v2^(2):

        [ 1  1 −1 | 1 ]                     [ 1 0 −1 | 1/2 ]
        [ 1 −1 −1 | 0 ]    row reduce →     [ 0 1  0 | 1/2 ] .
        [ 2  0 −2 | 1 ]                     [ 0 0  0 |  0  ]

It follows that the third entry of v2^(2) is free, while its other two entries are not. If we choose the third entry to be 0 then we get v2^(2) = (1/2, 1/2, 0) as one possible solution.

Now that we have a set of 3 vectors {v1, v2^(1), v2^(2)}, we can place them as columns in a 3 × 3 matrix, and that matrix should bring A into its Jordan canonical form:

        P = [ v1 | v2^(1) | v2^(2) ] =  0 1 1/2
                                        1 0 1/2
                                        1 1  0  .

Straightforward calculation then reveals that


Keep in mind that P−1AP = J is equivalent to A = PJP−1, which is why the order of P and P−1 is reversed from that of Theorem 2.4.1.

                  −1/2  1/2  1/2      5 1 −1      0 1 1/2        2 0 0
        P−1AP =    1/2 −1/2  1/2      1 3 −1      1 0 1/2    =   0 4 1 ,
                     1    1   −1      2 0  2      1 1  0         0 0 4

which is indeed in Jordan canonical form.
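The “straightforward calculation” at the end of this example is a one-liner numerically (a NumPy sketch assumed for illustration):

    import numpy as np

    A = np.array([[5.0, 1.0, -1.0],
                  [1.0, 3.0, -1.0],
                  [2.0, 0.0,  2.0]])

    # columns: eigenvector for lambda = 2, then the Jordan chain for lambda = 4
    P = np.array([[0.0, 1.0, 0.5],
                  [1.0, 0.0, 0.5],
                  [1.0, 1.0, 0.0]])

    J = np.linalg.inv(P) @ A @ P
    print(np.round(J, 10))    # [[2. 0. 0.], [0. 4. 1.], [0. 0. 4.]]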

The procedure carried out in the above example works as long as each
eigenspace is 1-dimensional—to find the Jordan decomposition of a matrix A,
we just find an eigenvector for each eigenvalue and then extend it to a Jordan
chain so as to fill up the columns of a square matrix P. Doing so results in P
being invertible (a fact which is not obvious, but we will prove shortly) and
P−1 AP being the Jordan canonical form of A. The following example illustrates
this procedure again with a longer Jordan chain.

Example 2.4.8 Find the Jordan chains of the matrix


Finding Longer Jordan Chains

        A = −6  0 −2  1
             0 −3 −2  1
             3  0  1  1
            −3  0  2  2 ,

and use them to construct its Jordan decomposition.



Solution:
We noted in Example 2.4.2 that this matrix has eigenvalues 3 (with al-
gebraic multiplicity 1) and −3 (with algebraic multiplicity 3 and geometric
multiplicity 1). An eigenvector corresponding to λ = 3 is v1 = (0, 0, 1, 2),
and an eigenvector corresponding to λ = −3 is v2 = (0, 1, 0, 0).
To construct a Jordan decomposition of A, we do what we did in the previous example: we find a Jordan chain {v2^(1), v2^(2), v2^(3)} corresponding to λ = −3. We choose v2^(1) = v2 = (0, 1, 0, 0) to be the eigenvector that we already computed, and then we find v2^(2) by solving the linear system (A + 3I)v2^(2) = v2^(1) for v2^(2):

        [ −3 0 −2 1 | 0 ]                     [ 1 0 0 0 |  1/3 ]
        [  0 0 −2 1 | 1 ]    row reduce →     [ 0 0 1 0 | −1/3 ]
        [  3 0  4 1 | 0 ]                     [ 0 0 0 1 |  1/3 ] .
        [ −3 0  2 5 | 0 ]                     [ 0 0 0 0 |   0  ]

It follows that the second entry of v2^(2) is free, while its other entries are not. If we choose the second entry to be 0 then we get v2^(2) = (1/3, 0, −1/3, 1/3) as one possible solution.

We still need one more vector to complete this Jordan chain, so we just repeat this procedure: we solve the linear system (A + 3I)v2^(3) = v2^(2) for v2^(3):

        [ −3 0 −2 1 |  1/3 ]                     [ 1 0 0 0 | −1/9 ]
        [  0 0 −2 1 |   0  ]    row reduce →     [ 0 0 1 0 |   0  ]
        [  3 0  4 1 | −1/3 ]                     [ 0 0 0 1 |   0  ] .
        [ −3 0  2 5 |  1/3 ]                     [ 0 0 0 0 |   0  ]

(The same sequence of row operations can be used to find v2^(3) as was used to find v2^(2). This happens in general, since the only difference between these linear systems is the right-hand side.)

Similar to before, the second entry of v2^(3) is free, while its other entries are not. If we choose the second entry to be 0, then we get v2^(3) = (−1/9, 0, 0, 0) as one possible solution.

Now that we have a set of 4 vectors {v1, v2^(1), v2^(2), v2^(3)}, we can place them as columns in a 4 × 4 matrix P = [ v1 | v2^(1) | v2^(2) | v2^(3) ], and that matrix should bring A into its Jordan canonical form:

        P = 0 0  1/3 −1/9            P−1 =   0 0  1/3  1/3
            0 1   0    0      and            0 1   0    0
            1 0 −1/3   0                     0 0  −2    1
            2 0  1/3   0                    −9 0  −6    3 .

(Be careful—we might be tempted to normalize v2^(3) to v2^(3) = (1, 0, 0, 0), but we cannot do this! We can choose the free variable (the second entry of v2^(3) in this case), but not the overall scaling in Jordan chains.)
Straightforward calculation then shows that
 
        P−1AP = 3  0  0  0
                0 −3  1  0
                0  0 −3  1
                0  0  0 −3 ,
which is indeed the Jordan canonical form of A.



Multiple Jordan Blocks per Eigenvalue


Unfortunately, the method described above for constructing the Jordan de-
composition of a matrix only works if all of its eigenvalues have geometric
multiplicity equal to 1. To illustrate how this method can fail in more compli-
cated situations, we try to use this method to construct a Jordan decomposition
of the matrix  
−2 −1 1
 
A= 3 2 −1 .
−6 −2 3
The only eigenvalue of this matrix is λ = 1, and one basis of the correspond-
ing eigenspace is {v1, v2} where v1 = (1, −3, 0) and v2 = (1, 0, 3). (In particular, λ = 1 has geometric multiplicity γ1 = 2.) However, if we try to extend either of these eigenvectors v1 or v2 to a Jordan chain of
order larger than 1 then we quickly get stuck: the linear systems (A − I)w = v1
and (A − I)w = v2 each have no solutions.
There are indeed Jordan chains of degree 2 corresponding to λ = 1, but the
eigenvectors that they correspond to are non-trivial linear combinations of v1
and v2 , not v1 or v2 themselves. Furthermore, it is not immediately obvious how
we could find which eigenvectors (i.e., linear combinations) can be extended.
For this reason, we now introduce another method of constructing Jordan chains
that avoids problems like this one.
 
Example 2.4.9 (Jordan Chains with a Repeated Eigenvalue). Compute a Jordan decomposition of the matrix

    A = \begin{bmatrix} -2 & -1 & 1 \\ 3 & 2 & -1 \\ -6 & -2 & 3 \end{bmatrix}.

Solution:
As we noted earlier, the only eigenvalue of this matrix is λ = 1, with geometric multiplicity 2 and {v1, v2} (where v1 = (1, −3, 0) and v2 = (1, 0, 3)) forming a basis of the corresponding eigenspace.
In order to extend some eigenvector w_1^{(1)} of A to a degree-2 Jordan chain, we would like to find a vector w_1^{(2)} with the property that (A − I)w_1^{(2)} is an eigenvector of A corresponding to λ = 1. That is, we want

    (A − I)w_1^{(2)} ≠ 0  but  (A − I)\big((A − I)w_1^{(2)}\big) = (A − I)^2 w_1^{(2)} = 0.

Direct computation shows that (A − I)^2 = O, so null((A − I)^2) = C^3, so we can just pick w_1^{(2)} to be any vector that is not in null(A − I): we arbitrarily choose w_1^{(2)} = (1, 0, 0) to give us the Jordan chain w_1^{(1)}, w_1^{(2)}, where

    w_1^{(2)} = (1, 0, 0)  and  w_1^{(1)} = (A − I)w_1^{(2)} = (−3, 3, −6).

(Notice that w_1^{(1)} = −v1 − 2v2 really is an eigenvector of A corresponding to λ = 1.)
To complete the Jordan decomposition of A, we now just pick any other eigenvector w2 with the property that {w_1^{(1)}, w_1^{(2)}, w2} is linearly independent—we can choose w2 = v2 = (1, 0, 3), for example. (We could have chosen w2 = v1 = (1, −3, 0) too, or any number of other choices.) Then A has Jordan decomposition A = PJP^{-1}, where

    J = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}  and  P = \big[\, w_1^{(1)} \,|\, w_1^{(2)} \,|\, w_2 \,\big] = \begin{bmatrix} -3 & 1 & 1 \\ 3 & 0 & 0 \\ -6 & 0 & 3 \end{bmatrix}.
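Numerically, the "top down" construction used in this example amounts to picking a vector outside null(A − I), multiplying by A − I to fill in the rest of the chain, and appending one more eigenvector. A small sketch (NumPy assumed; the names are ours):

```python
import numpy as np

A = np.array([[-2, -1,  1],
              [ 3,  2, -1],
              [-6, -2,  3]], dtype=float)
N = A - np.eye(3)

print(np.allclose(N @ N, 0))      # True: (A - I)^2 = O, so any vector not in
                                  # null(A - I) is the top of a length-2 chain
w_top = np.array([1.0, 0.0, 0.0]) # w1^(2), chosen outside null(A - I)
w_bot = N @ w_top                 # w1^(1) = (A - I) w1^(2) = (-3, 3, -6)
w2    = np.array([1.0, 0.0, 3.0]) # an eigenvector completing the basis

P = np.column_stack([w_bot, w_top, w2])
print(np.round(np.linalg.inv(P) @ A @ P, 10))  # the Jordan form J from above
```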

The above example suggests that instead of constructing a matrix’s Jordan


decomposition from the “bottom up” (i.e., starting with the eigenvectors, which
are at the “bottom” of a Jordan chain, and working our way “up” the chain by
solving linear systems, as we did earlier), we should instead compute it from
the “top down” (i.e., starting with vectors at the “top” of a Jordan chain and
then multiplying them by (A − λ I) to work our way “down” the chain).
Indeed, this “top down” approach works in general—a fact that we now start proving. To this end, note that if v^{(1)}, v^{(2)}, . . . , v^{(k)} is a Jordan chain of A corresponding to the eigenvalue λ, then multiplying their defining equation (i.e., (A − λI)v^{(j)} = v^{(j−1)} for all 2 ≤ j ≤ k and (A − λI)v^{(1)} = 0) on the left by powers of A − λI shows that

    v^{(j)} ∈ null\big((A − λI)^j\big) \setminus null\big((A − λI)^{j−1}\big)  for all 1 ≤ j ≤ k.

(For any sets X and Y, X\Y is the set of all members of X that are not in Y.)

Indeed, these subspaces null((A − λI)^j) (for 1 ≤ j ≤ k) play a key role in the construction of the Jordan decomposition—after all, their dimensions are the geometric multiplicities from Definition 2.4.2, which we showed determine the sizes of A's Jordan blocks in Theorem 2.4.2. These spaces are called the generalized eigenspaces of A corresponding to λ (and their members are called the generalized eigenvectors of A corresponding to λ). Notice that if j = 1 then they are just the standard eigenspaces and eigenvectors, respectively.
Our first main result that helps us toward a proof of the Jordan decomposi-
tion is the fact that bases of the generalized eigenspaces of a matrix can always
be “stitched together” to form a basis of the entire space Cn .

Theorem 2.4.5 (Generalized Eigenbases). Suppose A ∈ Mn(C) has distinct eigenvalues λ1, λ2, . . ., λm with algebraic multiplicities µ1, µ2, . . ., µm, respectively, and

    B_j is a basis of null\big((A − λ_jI)^{µ_j}\big)  for each  1 ≤ j ≤ m.

Then B1 ∪ B2 ∪ · · · ∪ Bm is a basis of C^n.

Proof. We can use Schur triangularization to write A = UTU^*, where U ∈ Mn(C) is unitary and T ∈ Mn(C) is upper triangular, and we can choose T so that its diagonal entries (i.e., the eigenvalues of A) are arranged so that any repeated entries are next to each other. That is, T can be chosen to have the form

    T = \begin{bmatrix} T_1 & * & \cdots & * \\ O & T_2 & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & T_m \end{bmatrix}

(asterisks (*) denote blocks whose entries are potentially non-zero, but irrelevant—i.e., not important enough to give names to),
where each T j (1 ≤ j ≤ m) is upper triangular and has every diagonal entry
equal to λ j .
Our next goal is to show that if we are allowed to make use of arbitrary
similarity transformations (rather than just unitary similarity transformations,
as in Schur triangularization) then T can be chosen to be even simpler still. To
this end, we make the following claim:
Claim: If two matrices A ∈ M_{d_1}(C) and C ∈ M_{d_2}(C) do not have any eigenvalues in common then, for all B ∈ M_{d_1,d_2}(C), the following block matrices are similar:

    \begin{bmatrix} A & B \\ O & C \end{bmatrix}  and  \begin{bmatrix} A & O \\ O & C \end{bmatrix}.

(This claim fails if A and C share eigenvalues—if it didn't, it would imply that all matrices are diagonalizable.)

We prove this claim as Theorem B.2.1 in Appendix B.2, as it is slightly technical and the proof of this theorem is already long enough without it.
By making use of this claim repeatedly, we see that each of the following matrices are similar to each other:

    \begin{bmatrix} T_1 & * & * & \cdots & * \\ O & T_2 & * & \cdots & * \\ O & O & T_3 & \cdots & * \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ O & O & O & \cdots & T_m \end{bmatrix}, \quad
    \begin{bmatrix} T_1 & O & O & \cdots & O \\ O & T_2 & * & \cdots & * \\ O & O & T_3 & \cdots & * \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ O & O & O & \cdots & T_m \end{bmatrix}, \quad
    \begin{bmatrix} T_1 & O & O & \cdots & O \\ O & T_2 & O & \cdots & O \\ O & O & T_3 & \cdots & * \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ O & O & O & \cdots & T_m \end{bmatrix}, \quad \ldots, \quad
    \begin{bmatrix} T_1 & O & O & \cdots & O \\ O & T_2 & O & \cdots & O \\ O & O & T_3 & \cdots & O \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ O & O & O & \cdots & T_m \end{bmatrix}.

It follows that T, and thus A, is similar to the block-diagonal matrix at the end of this list. That is, there exists an invertible matrix P ∈ Mn(C) such that

    A = P \begin{bmatrix} T_1 & O & \cdots & O \\ O & T_2 & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & T_m \end{bmatrix} P^{-1}.

With this decomposition of A in hand, it is now straightforward to verify that

    (A − λ_1I)^{µ_1} = P \begin{bmatrix} (T_1 − λ_1I)^{µ_1} & O & \cdots & O \\ O & (T_2 − λ_1I)^{µ_1} & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & (T_m − λ_1I)^{µ_1} \end{bmatrix} P^{-1}
                     = P \begin{bmatrix} O & O & \cdots & O \\ O & (T_2 − λ_1I)^{µ_1} & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & (T_m − λ_1I)^{µ_1} \end{bmatrix} P^{-1},

with the second equality following from the fact that T_1 − λ_1I is an upper triangular µ_1 × µ_1 matrix with all diagonal entries equal to 0, so (T_1 − λ_1I)^{µ_1} = O by Exercise 2.4.16. Similarly, for each 2 ≤ j ≤ m the matrix (T_j − λ_1I)^{µ_1} has non-zero diagonal entries and is thus invertible, so we see that the first µ_1 columns of P form a basis of null((A − λ_1I)^{µ_1}). A similar argument shows that the next µ_2 columns of P form a basis of null((A − λ_2I)^{µ_2}), and so on.

(In this theorem and proof, unions like C_1 ∪ · · · ∪ C_m are meant as multisets, so that if there were any vector in multiple C_j's then their union would necessarily be linearly dependent.)

We have thus proved that, for each 1 ≤ j ≤ m, there exists a basis C_j of null((A − λ_jI)^{µ_j}) such that C_1 ∪ C_2 ∪ · · · ∪ C_m is a basis of C^n (in particular,

we can choose C_1 to consist of the first µ_1 columns of P, C_2 to consist of its next µ_2 columns, and so on). To see that the same result holds no matter which basis B_j of null((A − λ_jI)^{µ_j}) is chosen, simply note that since B_j and C_j are bases of the same space, we must have span(B_j) = span(C_j). It follows that

    span(B_1 ∪ B_2 ∪ · · · ∪ B_m) = span(C_1 ∪ C_2 ∪ · · · ∪ C_m) = C^n.

Since we also have |B_j| = |C_j| for all 1 ≤ j ≤ m, it must be the case that |B_1 ∪ B_2 ∪ · · · ∪ B_m| = |C_1 ∪ C_2 ∪ · · · ∪ C_m| = n, which implies that B_1 ∪ B_2 ∪ · · · ∪ B_m is indeed a basis of C^n.  ∎

Remark 2.4.1 (The Jordan Decomposition and Direct Sums). Another way of phrasing Theorem 2.4.5 is as saying that, for every matrix A ∈ Mn(C), we can write C^n as a direct sum (see Section 1.B) of its generalized eigenspaces:

    C^n = \bigoplus_{j=1}^{m} null\big((A − λ_jI)^{µ_j}\big).

Furthermore, A is diagonalizable if and only if we can replace each generalized eigenspace by its non-generalized counterpart:

    C^n = \bigoplus_{j=1}^{m} null(A − λ_jI).

We are now in a position to rigorously prove that every matrix has a Jordan
decomposition. We emphasize that it really is worth reading through this proof
(even if we did not do so for the proof of Theorem 2.4.5), since it describes
an explicit procedure for actually constructing the Jordan decomposition in
general.

Proof of existence in Theorem 2.4.1. Suppose λ is an eigenvalue of A with algebraic multiplicity µ. We start by showing that we can construct a basis B of null((A − λI)^µ) that is made up of Jordan chains. We do this via the following iterative procedure that works by starting at the “tops” of the longest Jordan chains and working its way “down”:

Step 1. Set k = µ and B_{µ+1} = {} (the empty set).

Step 2. Set C_k = B_{k+1} and then add any vector from

    null\big((A − λI)^k\big) \setminus span\big(C_k ∪ null\big((A − λI)^{k−1}\big)\big)

to C_k. Continue adding vectors to C_k in this way until no longer possible, so that

    null\big((A − λI)^k\big) ⊆ span\big(C_k ∪ null\big((A − λI)^{k−1}\big)\big).     (2.4.1)

Notice that C_k is necessarily linearly independent since B_{k+1} is linearly independent, and each vector that we added to C_k was specifically chosen to not be in the span of its other members. (In words, B_{k+1} is the piece of the basis B that contains Jordan chains of order at least k + 1, and the vectors that we add to C_k are the “tops” of the Jordan chains of order exactly k.)

Step 3. Construct all Jordan chains that have “tops” at the members of C_k\B_{k+1}. That is, if we denote the members of C_k\B_{k+1} by v_1^{(k)}, v_2^{(k)}, . . ., then for each 1 ≤ i ≤ |C_k\B_{k+1}| we set

    v_i^{(j−1)} = (A − λI)v_i^{(j)}  for  j = k, k − 1, . . . , 2.
Then define B_k = B_{k+1} ∪ { v_i^{(j)} : 1 ≤ i ≤ |C_k\B_{k+1}|, 1 ≤ j ≤ k }. We claim that B_k is linearly independent, but pinning down this claim is quite technical, so we leave the details to Theorem B.2.2 in Appendix B.2.

Step 4. If k = 1 then stop. Otherwise, decrease k by 1 and then return to Step 2.

After performing the above procedure, we set B = B_1 (which also equals C_1, since Step 3 does nothing in the k = 1 case). We note that B consists of Jordan chains of A corresponding to the eigenvalue λ by construction, and Step 3 tells us that B is linearly independent. Furthermore, the inclusion (2.4.1) from Step 2 tells us that span(C_1) ⊇ null(A − λI). Repeating this argument for C_2, and using the fact that C_2 ⊆ C_1, shows that

    span(C_1) = span(C_2 ∪ C_1) ⊇ span\big(C_2 ∪ null(A − λI)\big) ⊇ null\big((A − λI)^2\big).

Carrying on in this way for C_3, C_4, . . ., C_µ shows that span(B) = span(C_1) ⊇ null((A − λI)^µ). Since B is contained within null((A − λI)^µ), we conclude that B must be a basis of it, as desired.
Since we now know how to construct a basis of each generalized eigenspace null((A − λI)^µ) consisting of Jordan chains of A, Theorem 2.4.5 tells us that we can construct a basis of all of C^n consisting of Jordan chains of A. If there are m Jordan chains in such a basis and the j-th Jordan chain has order k_j and corresponding eigenvalue λ_j for all 1 ≤ j ≤ m, then placing the members of that j-th Jordan chain as columns into a matrix P_j ∈ M_{n,k_j}(C) in the manner suggested by Theorem 2.4.4 tells us that AP_j = P_jJ_{k_j}(λ_j). We can then construct a (necessarily invertible) matrix

    P = \big[\, P_1 \,|\, P_2 \,|\, \cdots \,|\, P_m \,\big] ∈ Mn(C).

Our usual block matrix multiplication techniques show that

    AP = \big[\, AP_1 \,|\, AP_2 \,|\, \cdots \,|\, AP_m \,\big]                                         (block matrix mult.)
       = \big[\, P_1J_{k_1}(λ_1) \,|\, P_2J_{k_2}(λ_2) \,|\, \cdots \,|\, P_mJ_{k_m}(λ_m) \,\big]        (Theorem 2.4.4)
       = P \begin{bmatrix} J_{k_1}(λ_1) & \cdots & O \\ \vdots & \ddots & \vdots \\ O & \cdots & J_{k_m}(λ_m) \end{bmatrix}.                                        (block matrix mult.)

Multiplying on the right by P^{-1} gives a Jordan decomposition of A.  ∎

It is worth noting that the vectors that we add to the set Ck in Step 2 of the
above proof are the “tops” of the Jordan chains of order exactly k. Since each
Jordan chain corresponds to one Jordan block in the Jordan canonical form,
Theorem 2.4.2 tells us that we must add |Ck \Bk+1 | = 2γk − γk+1 − γk−1 such
vectors for all k ≥ 1, where γk is the corresponding geometric multiplicity of
order k.
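In practice the bookkeeping here can be done with ranks, since γ_k = n − rank((A − λI)^k) (what the text calls the geometric multiplicity of order k). The following is a rough numerical sketch of that count, with our own (hypothetical) helper name, tested on the 3 × 3 matrix from Example 2.4.9; rank decisions in floating point are fragile, so this is illustrative rather than a robust Jordan-form routine (NumPy assumed):

```python
import numpy as np

def jordan_block_counts(A, lam, tol=1e-9):
    """Number of Jordan blocks of each size for the eigenvalue lam, using
    gamma_k = n - rank((A - lam*I)^k) and the count 2*gamma_k - gamma_{k+1} - gamma_{k-1}."""
    n = A.shape[0]
    gammas = [0]                         # gamma_0 = 0 by convention
    M = np.eye(n)
    for _ in range(n):
        M = M @ (A - lam * np.eye(n))
        gammas.append(n - np.linalg.matrix_rank(M, tol=tol))
    gammas.append(gammas[-1])            # gamma_{n+1} = gamma_n
    counts = {k: 2*gammas[k] - gammas[k+1] - gammas[k-1] for k in range(1, n + 1)}
    return {k: c for k, c in counts.items() if c > 0}

A = np.array([[-2, -1, 1], [3, 2, -1], [-6, -2, 3]], dtype=float)
print(jordan_block_counts(A, 1))   # {1: 1, 2: 1}: one 1x1 and one 2x2 block
```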
While this procedure for constructing the Jordan decomposition likely
seems quite involved, we emphasize that things “usually” are simple enough
that we can just use our previous worked examples as a guide. For now though,
we work through a rather large example so as to illustrate the full algorithm in
a bit more generality.

Example 2.4.10 (A Hideous Jordan Decomposition). Construct a Jordan decomposition of the 8 × 8 matrix from Example 2.4.3.

Solution:
As we noted in Example 2.4.3, the only eigenvalue of A is λ = 1, which has algebraic multiplicity µ = 8 and geometric multiplicities γ1 = 4, γ2 = 7, and γk = 8 whenever k ≥ 3. In the language of the proof of Theorem 2.4.1, Step 1 thus tells us to set k = 8 and B9 = {}.

In Step 2, nothing happens when k = 8 since we add |C8\B9| = 2γ8 − γ9 − γ7 = 16 − 8 − 8 = 0 vectors. It follows that nothing happens in Step 3 as well, so B8 = C8 = B9 = {}. A similar argument shows that Bk = B_{k+1} = {} for all k ≥ 4, so we proceed directly to the k = 3 case of Step 2. (Just try to follow along with the steps of the algorithm presented here, without worrying about the nasty calculations needed to compute any particular vectors or matrices we present.)

k = 3: In Step 2, we start by setting C3 = B4 = {}, and then we need to add 2γ3 − γ4 − γ2 = 16 − 8 − 7 = 1 vector from

    null\big((A − λI)^3\big) \setminus null\big((A − λI)^2\big) = C^8 \setminus null\big((A − λI)^2\big)

to C3. One vector that works is (0, 0, −1, 0, 0, −1, 1, 0), so we choose C3 to be the set containing this single vector, which we call v_1^{(3)}.

In Step 3, we just extend v_1^{(3)} to a Jordan chain by computing

    v_1^{(2)} = (A − I)v_1^{(3)} = (−1, 1, 0, 0, 1, 1, 0, −1)  and
    v_1^{(1)} = (A − I)v_1^{(2)} = (0, 0, −1, 1, 0, −1, 0, 0),

and then we let B3 be the set containing these three vectors.


k = 2: In Step 2, we start by setting C2 = B3, and then we need to add any 2γ2 − γ3 − γ1 = 14 − 8 − 4 = 2 vectors from null((A − λI)^2) \setminus null(A − λI) to C2 while preserving linear independence. Two vectors that work are

    v_2^{(2)} = (0, 0, 0, −1, 1, 0, 0, 1)  and
    v_3^{(2)} = (−1, 0, 1, 1, 0, 0, −1, 0),

so we add these vectors to C2, which now contains 5 vectors total. (In practice, large computations like this one are done by computers.)

In Step 3, we just extend v_2^{(2)} and v_3^{(2)} to Jordan chains by multiplying by A − I:

    v_2^{(1)} = (A − I)v_2^{(2)} = (0, −1, 1, 0, 0, 0, 0, 0)  and
    v_3^{(1)} = (A − I)v_3^{(2)} = (0, −1, −1, 0, 1, −1, 1, 0),

and let B2 = C2 ∪ { v_2^{(1)}, v_3^{(1)} } be the set containing all 7 vectors discussed so far.

k = 1: In Step 2, we start by setting C1 = B2, and then we need to add any 2γ1 − γ2 − γ0 = 8 − 7 − 0 = 1 vector from null(A − I) to C1 while preserving linear independence. (Recall that γ0 = 0 always.) One vector that works is

    v_4^{(1)} = (0, 1, 0, 0, −1, 0, −1, 0),

so we add this vector to C1, which now contains 8 vectors total (so we are done and can set B1 = C1).

To complete the Jordan decomposition of A, we simply place all of these Jordan chains as columns into a matrix P:

    P = \big[\, v_1^{(1)} \,|\, v_1^{(2)} \,|\, v_1^{(3)} \,|\, v_2^{(1)} \,|\, v_2^{(2)} \,|\, v_3^{(1)} \,|\, v_3^{(2)} \,|\, v_4^{(1)} \,\big]
      = \begin{bmatrix}
          0 & -1 &  0 &  0 &  0 &  0 & -1 &  0 \\
          0 &  1 &  0 & -1 &  0 & -1 &  0 &  1 \\
         -1 &  0 & -1 &  1 &  0 & -1 &  1 &  0 \\
          1 &  0 &  0 &  0 & -1 &  0 &  1 &  0 \\
          0 &  1 &  0 &  0 &  1 &  1 &  0 & -1 \\
         -1 &  1 & -1 &  0 &  0 & -1 &  0 &  0 \\
          0 &  0 &  1 &  0 &  0 &  1 & -1 & -1 \\
          0 & -1 &  0 &  0 &  1 &  0 &  0 &  0
        \end{bmatrix}.

It is straightforward (but arduous) to verify that A = PJP^{-1}, where J is the Jordan canonical form of A that we computed in Example 2.4.3.

2.4.3 Matrix Functions


One of the most useful applications of the Jordan decomposition is that it
provides us with a method of applying functions to arbitrary matrices. We of
course could apply functions to matrices entrywise, but doing so is a bit silly
and does not agree with how we compute matrix powers or polynomials of
matrices.
In order to apply functions to matrices “properly”, we exploit the fact that
many functions equal their Taylor series (functions with this property are called
analytic); see Appendix A.2.2 for a refresher on Taylor series. For example, recall that

    e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots = \sum_{j=0}^{\infty} \frac{x^j}{j!}  for all x ∈ C.

Since this definition of e^x only depends on addition, scalar multiplication, and powers of x, and all of these operations are well-behaved when applied to matrices, it seems reasonable to analogously define e^A, the exponential of a matrix, via

    e^A = \sum_{j=0}^{\infty} \frac{1}{j!}A^j  for all A ∈ Mn(C),

as long as this sum converges. (When we talk about a sum of matrices converging, we just mean that every entry in that sum converges.)
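As a quick sanity check on this definition, one can compare partial sums of the series against a value computed by other means. A minimal sketch (NumPy assumed; the triangular test matrix and the hand-computed reference value are just illustrative choices of ours):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

# Partial sums of sum_j A^j / j!
S = np.zeros_like(A)
term = np.eye(2)              # the j = 0 term
for j in range(1, 25):
    S += term
    term = term @ A / j       # next term, A^j / j!
S += term

# For this triangular A the exponential can be found by diagonalizing by hand:
# eigenvalues 1 and 3, giving diagonal entries e and e^3 and top-right entry e^3 - e.
expected = np.array([[np.e, np.e**3 - np.e],
                     [0.0,  np.e**3]])
print(np.allclose(S, expected))   # True
```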
More generally, we define the “matrix version” of an analytic function as
follows:

Definition 2.4.4 (Matrix Functions). Suppose A ∈ Mn(C) and f : C → C is analytic on an open disc centered at some scalar a ∈ C. Then

    f(A) \;\stackrel{\text{def}}{=}\; \sum_{j=0}^{\infty} \frac{f^{(j)}(a)}{j!}(A − aI)^j,

as long as the sum on the right converges. (An “open” disc is a filled-in circle that does not include its boundary. We need it to be open so that things like derivatives make sense everywhere in it.)

The above definition implicitly makes the (very not obvious) claim that the matrix that the sum on the right converges to does not depend on which value a ∈ C is chosen. That is, the value of a ∈ C might affect whether or not the sum
on the right converges, but not what it converges to. If n = 1 (so A ∈ M1(C) is a scalar, which we relabel as A = x ∈ C) then this fact follows from the sum in Definition 2.4.4 being the usual Taylor series of f centered at a, which we know converges to f(x) (if it converges at all) since f is analytic. However, if n ≥ 2 then it is not so obvious that this sum is so well-behaved, nor is it obvious how we could possibly compute what it converges to.
The next two theorems solve these problems—they tell us when the sum
in Definition 2.4.4 converges, that the scalar a ∈ C has no effect on what it
converges to (though it may affect whether or not it converges in the first place),
and that we can use the Jordan decomposition of A to compute f (A). We start
by showing how to compute f (A) in the special case when A is a Jordan block.

Theorem 2.4.6 (Functions of Jordan Blocks). Suppose J_k(λ) ∈ M_k(C) is a Jordan block and f : C → C is analytic on some open set containing λ. Then

    f\big(J_k(λ)\big) = \begin{bmatrix}
        f(λ) & \frac{f'(λ)}{1!} & \frac{f''(λ)}{2!} & \cdots & \frac{f^{(k-2)}(λ)}{(k-2)!} & \frac{f^{(k-1)}(λ)}{(k-1)!} \\
        0 & f(λ) & \frac{f'(λ)}{1!} & \cdots & \frac{f^{(k-3)}(λ)}{(k-3)!} & \frac{f^{(k-2)}(λ)}{(k-2)!} \\
        0 & 0 & f(λ) & \cdots & \frac{f^{(k-4)}(λ)}{(k-4)!} & \frac{f^{(k-3)}(λ)}{(k-3)!} \\
        \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & 0 & \cdots & f(λ) & \frac{f'(λ)}{1!} \\
        0 & 0 & 0 & \cdots & 0 & f(λ)
    \end{bmatrix}.

(The notation f^{(j)}(λ) means the j-th derivative of f evaluated at λ.)

Proof. For each 1 ≤ n < k, let N_n ∈ M_k(C) denote the matrix with ones on its n-th superdiagonal and zeros elsewhere (i.e., [N_n]_{i,j} = 1 if j − i = n and [N_n]_{i,j} = 0 otherwise). Then J_k(λ) = λI + N_1, and we show in Exercise 2.4.17 that these matrices satisfy N_1^n = N_n for all 1 ≤ n < k, and N_1^n = O when n ≥ k.

We now prove the statement of this theorem when we choose to center the Taylor series from Definition 2.4.4 at a = λ. That is, we write f as a Taylor series centered at λ, so that

    f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(λ)}{n!}(x − λ)^n,  so  f\big(J_k(λ)\big) = \sum_{n=0}^{\infty} \frac{f^{(n)}(λ)}{n!}\big(J_k(λ) − λI\big)^n.

By making use of the fact that J_k(λ) − λI = N_1, together with our earlier observation about powers of N_1 (the sum below becomes finite because N_1^n = O when n ≥ k), we see that

    f\big(J_k(λ)\big) = \sum_{n=0}^{\infty} \frac{f^{(n)}(λ)}{n!}N_1^n = \sum_{n=0}^{k-1} \frac{f^{(n)}(λ)}{n!}N_n,
which is exactly the formula given in the statement of the theorem.
To complete this proof, we must show that Definition 2.4.4 is actually well-
defined (i.e., no matter which value a in the open set on which f is analytic we
center the Taylor series of f at, the formula provided by this theorem still holds).
This extra argument is very similar to the one we just went through, but with
some extra ugly details, so we defer it to Appendix B.2 (see Theorem B.2.3 in
particular). 
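To get a concrete feel for Theorem 2.4.6, the following sketch builds f(J_k(λ)) directly from the stated formula and checks it against a truncated power series for f(x) = e^x (NumPy assumed; the helper name is ours and everything here is illustrative rather than part of the text):

```python
import numpy as np
from math import factorial

def f_of_jordan_block(derivs, k):
    """Theorem 2.4.6: entry (i, j) of f(J_k(lam)) is f^{(j-i)}(lam)/(j-i)!.
    derivs[m] should be the m-th derivative of f evaluated at lam."""
    F = np.zeros((k, k))
    for i in range(k):
        for j in range(i, k):
            F[i, j] = derivs[j - i] / factorial(j - i)
    return F

k, lam = 4, 2.0
J = lam * np.eye(k) + np.diag(np.ones(k - 1), 1)   # the Jordan block J_k(lam)

# For f(x) = e^x every derivative at lam equals e^lam.
F = f_of_jordan_block([np.exp(lam)] * k, k)

# Compare with a truncated series for e^J.
S, term = np.zeros((k, k)), np.eye(k)
for j in range(1, 40):
    S += term
    term = term @ J / j
S += term
print(np.allclose(F, S))   # True
```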

Theorem 2.4.7 (Matrix Functions via the Jordan Decomposition). Suppose A ∈ Mn(C) has Jordan decomposition as in Theorem 2.4.1, and f : C → C is analytic on some open disc containing all of the eigenvalues of A. Then

    f(A) = P \begin{bmatrix} f\big(J_{k_1}(λ_1)\big) & O & \cdots & O \\ O & f\big(J_{k_2}(λ_2)\big) & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & f\big(J_{k_m}(λ_m)\big) \end{bmatrix} P^{-1}.

Proof. We just exploit the fact that matrix powers, and thus Taylor series, behave very well with block diagonal matrices and the Jordan decomposition. For any a in the open disc on which f is analytic, we have (using, in the second equality, the fact that (PJP^{-1})^j = PJ^jP^{-1} for all j)

    f(A) = \sum_{j=0}^{\infty} \frac{f^{(j)}(a)}{j!}(A − aI)^j
         = P \left( \sum_{j=0}^{\infty} \frac{f^{(j)}(a)}{j!} \begin{bmatrix} \big(J_{k_1}(λ_1) − aI\big)^j & \cdots & O \\ \vdots & \ddots & \vdots \\ O & \cdots & \big(J_{k_m}(λ_m) − aI\big)^j \end{bmatrix} \right) P^{-1}
         = P \begin{bmatrix} f\big(J_{k_1}(λ_1)\big) & \cdots & O \\ \vdots & \ddots & \vdots \\ O & \cdots & f\big(J_{k_m}(λ_m)\big) \end{bmatrix} P^{-1},
as claimed. 
By combining the previous two theorems, we can explicitly apply any
analytic function f to any matrix by first constructing that matrix’s Jordan
decomposition, applying f to each of the Jordan blocks in its Jordan canonical
form via the formula of Theorem 2.4.6, and then stitching those computations
together via Theorem 2.4.7.
 
Example 2.4.11 (A Matrix Exponential via the Jordan Decomposition). Compute e^A if

    A = \begin{bmatrix} 5 & 1 & -1 \\ 1 & 3 & -1 \\ 2 & 0 & 2 \end{bmatrix}.

Solution:
We computed the following Jordan decomposition A = PJP^{-1} of this matrix in Example 2.4.7:

    P = \frac{1}{2}\begin{bmatrix} 0 & 2 & 1 \\ 2 & 0 & 1 \\ 2 & 2 & 0 \end{bmatrix}, \quad J = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 4 & 1 \\ 0 & 0 & 4 \end{bmatrix}, \quad P^{-1} = \frac{1}{2}\begin{bmatrix} -1 & 1 & 1 \\ 1 & -1 & 1 \\ 2 & 2 & -2 \end{bmatrix}.

Applying the function f(x) = e^x to the two Jordan blocks in J via the formula of Theorem 2.4.6 gives (here we use the fact that f'(x) = e^x too)

    f(2) = e^2  and  f\left(\begin{bmatrix} 4 & 1 \\ 0 & 4 \end{bmatrix}\right) = \begin{bmatrix} f(4) & f'(4) \\ 0 & f(4) \end{bmatrix} = \begin{bmatrix} e^4 & e^4 \\ 0 & e^4 \end{bmatrix}.

Stitching everything together via Theorem 2.4.7 then tells us that

    f(A) = e^A = \frac{1}{4}\begin{bmatrix} 0 & 2 & 1 \\ 2 & 0 & 1 \\ 2 & 2 & 0 \end{bmatrix}\begin{bmatrix} e^2 & 0 & 0 \\ 0 & e^4 & e^4 \\ 0 & 0 & e^4 \end{bmatrix}\begin{bmatrix} -1 & 1 & 1 \\ 1 & -1 & 1 \\ 2 & 2 & -2 \end{bmatrix}
         = \frac{1}{2}\begin{bmatrix} 4e^4 & 2e^4 & -2e^4 \\ e^4 - e^2 & e^4 + e^2 & e^2 - e^4 \\ 3e^4 - e^2 & e^2 + e^4 & e^2 - e^4 \end{bmatrix}.
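This hand computation can be double-checked numerically. For instance, the sketch below compares it against SciPy's expm (which does not use the Jordan decomposition at all, so agreement is good evidence that nothing went wrong above); SciPy is assumed to be available:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[5, 1, -1], [1, 3, -1], [2, 0, 2]], dtype=float)
e2, e4 = np.exp(2), np.exp(4)
by_hand = 0.5 * np.array([[4*e4,      2*e4,      -2*e4],
                          [e4 - e2,   e4 + e2,   e2 - e4],
                          [3*e4 - e2, e2 + e4,   e2 - e4]])
print(np.allclose(expm(A), by_hand))   # True
```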

When we define matrix functions in this way, they interact with matrix multiplication how we would expect them to. For example, the principal square root function f(x) = √x is analytic everywhere except on the set of non-positive real numbers, so Theorems 2.4.6 and 2.4.7 tell us how to compute the principal square root √A of any matrix A ∈ Mn(C) whose eigenvalues can be placed in an open disc avoiding that strip in the complex plane. As we would hope, this matrix satisfies (√A)^2 = A, and if A is positive definite then this method produces exactly the positive definite principal square root √A described by Theorem 2.2.11.
Example 2.4.12 (The Principal Square Root of a Non-Positive Semidefinite Matrix). Compute √A if

    A = \begin{bmatrix} 3 & 2 & 3 \\ 1 & 2 & -3 \\ -1 & -1 & 4 \end{bmatrix}.

Solution:
We can use the techniques of this section to see that A has Jordan decomposition A = PJP^{-1}, where (try computing a Jordan decomposition of A on your own)

    P = \begin{bmatrix} -2 & -1 & 2 \\ 2 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix}, \quad J = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 4 & 1 \\ 0 & 0 & 4 \end{bmatrix}, \quad P^{-1} = \frac{1}{2}\begin{bmatrix} 0 & 1 & 1 \\ 2 & 2 & -2 \\ 2 & 2 & 0 \end{bmatrix}.

To apply the function f(x) = √x to the two Jordan blocks in J via the formula of Theorem 2.4.6, we note that f'(x) = \frac{1}{2\sqrt{x}}, so that

    f(1) = 1  and  f\left(\begin{bmatrix} 4 & 1 \\ 0 & 4 \end{bmatrix}\right) = \begin{bmatrix} f(4) & f'(4) \\ 0 & f(4) \end{bmatrix} = \begin{bmatrix} 2 & 1/4 \\ 0 & 2 \end{bmatrix}.

Stitching everything together via Theorem 2.4.7 then tells us that

    f(A) = \sqrt{A} = \frac{1}{2}\begin{bmatrix} -2 & -1 & 2 \\ 2 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 1/4 \\ 0 & 0 & 2 \end{bmatrix}\begin{bmatrix} 0 & 1 & 1 \\ 2 & 2 & -2 \\ 2 & 2 & 0 \end{bmatrix}
         = \frac{1}{4}\begin{bmatrix} 7 & 3 & 4 \\ 1 & 5 & -4 \\ -1 & -1 & 8 \end{bmatrix}.

(Double-check on your own that (√A)^2 = A.)
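As the parenthetical suggests, it is worth verifying that this really is a square root of A; numerically (NumPy assumed):

```python
import numpy as np

A = np.array([[3, 2, 3], [1, 2, -3], [-1, -1, 4]], dtype=float)
sqrtA = 0.25 * np.array([[7, 3, 4], [1, 5, -4], [-1, -1, 8]], dtype=float)
print(np.allclose(sqrtA @ sqrtA, A))   # True, so (sqrt(A))^2 = A as claimed
```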

Remark 2.4.2 (Is Analyticity Needed?). If a function f is not analytic on any disc containing a certain scalar λ ∈ C then the results of this section say nothing about how to compute f(A) for matrices with λ as an eigenvalue. Indeed, it may be the case that there is no sensible way to even define f(A) for these matrices, as the sum in Definition 2.4.4 may not converge. For example, if f(x) = √x is the principal square root function, then f is not analytic at x = 0 (despite being defined there), so it is not obvious how to compute (or even define) √A if A has 0 as an eigenvalue.

We can sometimes get around this problem by just applying the formulas of Theorems 2.4.6 and 2.4.7 anyway, as long as the quantities used in those formulas all exist. For example, we could say that if

    A = \begin{bmatrix} 2i & 0 \\ 0 & 0 \end{bmatrix}  then  \sqrt{A} = \begin{bmatrix} \sqrt{2i} & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 1+i & 0 \\ 0 & 0 \end{bmatrix},

even though A has 0 as an eigenvalue. On the other hand, this method says that if

    B = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}  then  \sqrt{B} = \begin{bmatrix} 0 & 1/(2\sqrt{0}) \\ 0 & 0 \end{bmatrix},

which makes no sense (notice the division by 0 in the top-right corner of √B, corresponding to the fact that the square root function is not differentiable at 0). Here we are again using the fact that if f(x) = √x then f'(x) = 1/(2√x). Indeed, this matrix B does not have a square root at all, nor does any matrix with a Jordan block of size 2 × 2 or larger corresponding to the eigenvalue 0.

Similarly, Definition 2.4.4 does not tell us what √C means if

    C = \begin{bmatrix} i & 0 \\ 0 & -i \end{bmatrix},

even though the principal square root function f(x) = √x is analytic on open discs containing the eigenvalues i and −i. The problem here is that there is no common disc D containing i and −i on which f is analytic, since any such disc must contain 0 as well. In practice, we just ignore this problem and apply the formulas of Theorems 2.4.6 and 2.4.7 anyway to get

    \sqrt{C} = \begin{bmatrix} \sqrt{i} & 0 \\ 0 & \sqrt{-i} \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1+i & 0 \\ 0 & 1-i \end{bmatrix}.

Example 2.4.13 (Geometric Series for Matrices). Compute I + A + A^2 + A^3 + · · · if

    A = \frac{1}{2}\begin{bmatrix} 3 & 4 & 0 \\ 0 & 1 & -2 \\ 1 & 2 & -1 \end{bmatrix}.

Solution:
We can use the techniques of this section to see that A has Jordan decomposition A = PJP^{-1}, where

    P = \begin{bmatrix} 2 & 2 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad J = \begin{bmatrix} 1/2 & 1 & 0 \\ 0 & 1/2 & 1 \\ 0 & 0 & 1/2 \end{bmatrix}, \quad P^{-1} = \frac{1}{2}\begin{bmatrix} 1 & 0 & -2 \\ 0 & 0 & 2 \\ 1 & 2 & -2 \end{bmatrix}.

We then recall that

    1 + x + x^2 + x^3 + \cdots = \frac{1}{1-x}  whenever |x| < 1,

so we can compute I + A + A^2 + A^3 + · · · by applying Theorems 2.4.6 and 2.4.7 to the function f(x) = 1/(1 − x), which is analytic on the open disc D = {x ∈ C : |x| < 1} containing the eigenvalue 1/2 of A. (Before doing this computation, it was not even clear that the sum I + A + A^2 + A^3 + · · · converges.) In particular, f'(x) = 1/(1 − x)^2 and f''(x) = 2/(1 − x)^3, so we have

    f(A) = I + A + A^2 + A^3 + \cdots
         = P \begin{bmatrix} f(1/2) & f'(1/2) & \tfrac{1}{2}f''(1/2) \\ 0 & f(1/2) & f'(1/2) \\ 0 & 0 & f(1/2) \end{bmatrix} P^{-1}
         = \frac{1}{2}\begin{bmatrix} 2 & 2 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} 2 & 4 & 8 \\ 0 & 2 & 4 \\ 0 & 0 & 2 \end{bmatrix}\begin{bmatrix} 1 & 0 & -2 \\ 0 & 0 & 2 \\ 1 & 2 & -2 \end{bmatrix} = \begin{bmatrix} 14 & 24 & -16 \\ -4 & -6 & 4 \\ 2 & 4 & -2 \end{bmatrix}.
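Since every eigenvalue of A equals 1/2, the scalar series suggests that this sum should coincide with (I − A)^{-1}, and both that and the partial sums themselves are easy to check numerically (NumPy assumed; this is just a verification sketch):

```python
import numpy as np

A = 0.5 * np.array([[3, 4, 0], [0, 1, -2], [1, 2, -1]], dtype=float)
computed = np.array([[14, 24, -16], [-4, -6, 4], [2, 4, -2]], dtype=float)

# Check against (I - A)^{-1}, which the geometric series should sum to
# when all eigenvalues of A have absolute value less than 1.
print(np.allclose(np.linalg.inv(np.eye(3) - A), computed))   # True

# Or sum the series directly and watch it converge to the same matrix.
S, term = np.zeros((3, 3)), np.eye(3)
for _ in range(200):
    S += term
    term = term @ A
print(np.allclose(S, computed))   # True
```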

We close this section by noting that if A is diagonalizable then each of its


Jordan blocks is 1 × 1, so Theorems 2.4.6 and 2.4.7 tell us that we can compute
matrix functions in the manner that we already knew from introductory linear
algebra:

Corollary 2.4.8 (Functions of Diagonalizable Matrices). If A ∈ Mn(C) is diagonalizable via A = PDP^{-1}, and f : C → C is analytic on some open disc containing the eigenvalues of D, then f(A) = Pf(D)P^{-1}, where f(D) is obtained by applying f to each diagonal entry of D.
In particular, since the spectral decomposition (Theorem 2.1.4) is a special


case of diagonalization, we see that if A ∈ Mn (C) is normal with spectral
decomposition A = UDU ∗ , then f (A) = U f (D)U ∗ .
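For diagonalizable (in particular, normal) matrices this recipe is easy to carry out numerically. A small sketch using a symmetric example (NumPy assumed; the helper name is ours):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # symmetric, hence normal
evals, U = np.linalg.eigh(A)               # spectral decomposition A = U D U^T

def apply_function(f, evals, U):
    """Corollary 2.4.8: f(A) = U f(D) U^*, with f applied entrywise to D."""
    return U @ np.diag(f(evals)) @ U.conj().T

sqrtA = apply_function(np.sqrt, evals, U)  # eigenvalues 1 and 3 are positive
print(np.allclose(sqrtA @ sqrtA, A))       # True
```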
2.4 The Jordan Decomposition 251

Exercises (solutions to starred exercises on page 472)

2.4.1 Compute the Jordan canonical form of each of the following matrices.
∗ (a) [3 −2; 2 −1]    (b) [4 2; −3 −1]
∗ (c) [0 −1; 4 4]     (d) [−1 1; −2 1]
∗ (e) [1 −1 2; −1 −1 4; −1 −2 5]    (f) [3 0 0; 1 3 −1; 1 0 2]
∗ (g) [5 2 −2; −1 2 1; 1 1 2]       (h) [1 2 0; −1 5 2; 1 −1 3]

2.4.2 Compute a Jordan decomposition of each of the matrices from Exercise 2.4.1.

2.4.3 Determine whether or not the given matrices are similar.
∗ (a) [2 1; −4 6] and [3 −1; −5 7]
  (b) [1 −2; 2 −3] and [0 1; −1 −2]
∗ (c) [1 2 0; 0 1 −1; 0 −1 1] and [2 1 1; 0 1 0; 2 0 0]
  (d) [2 1 −1; −2 −1 2; −1 −1 2] and [2 1 3; −1 0 −4; 0 0 1]
∗ (e) [2 −1 1; 0 −1 0; −1 −5 4] and [1 −2 2; 1 6 1; −1 −7 −2]
  (f) [1 1 0 0; 0 1 0 0; 0 0 1 1; 0 0 0 1] and [1 1 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1]

2.4.4 Compute the indicated matrix.
∗ (a) √A, where A = [2 0; 1 2].
  (b) e^A, where A = [1 −1; 1 −1].
  (c) sin(A), where A = (1/3)[2 −1; 4 −2].
  (d) cos(A), where A = (1/3)[2 −1; 4 −2].
∗ (e) e^A, where A = [1 −2 −1; 0 3 1; 0 −4 −1].
  (f) √A, where A = [1 −2 −1; 0 3 1; 0 −4 −1].

2.4.5 Determine which of the following statements are true and which are false.
∗ (a) Every matrix A ∈ Mn(C) can be diagonalized.
  (b) Every matrix A ∈ Mn(C) has a Jordan decomposition.
∗ (c) If A = P1J1P1^{-1} and A = P2J2P2^{-1} are two Jordan decompositions of the same matrix A then J1 = J2.
  (d) Two matrices A, B ∈ Mn(C) are similar if and only if there is a Jordan canonical form J such that A = PJP^{-1} and B = QJQ^{-1}.
∗ (e) Two diagonalizable matrices A, B ∈ Mn(C) are similar if and only if they have the same eigenvalues (counting with algebraic multiplicity).
  (f) There exist matrices A, B ∈ Mn(C) such that A is diagonalizable, B is not diagonalizable, and A and B are similar.
∗ (g) The series e^A = ∑_{j=0}^∞ A^j/j! converges for all A ∈ Mn(C).

2.4.6 Suppose A ∈ M4(C) has all four of its eigenvalues equal to 2. What are its possible Jordan canonical forms (do not list Jordan canonical forms that have the same Jordan blocks as each other in a different order)?

2.4.7 Compute the eigenvalues of sin(A), where

    A = [3 −1 −1; 0 4 0; −1 −1 3].

∗2.4.8 Find bases C and D of R^2, and a linear transformation T : R^2 → R^2, such that A = [T]_C and B = [T]_D, where

    A = [1 2; 5 4]  and  B = [13 7; −14 −8].

2.4.9 Let A ∈ Mn(C).
  (a) Show that if A is Hermitian then e^A is positive definite.
  (b) Show that if A is skew-Hermitian then e^A is unitary.

2.4.10 Show that det(e^A) = e^{tr(A)} for all A ∈ Mn(C).

∗2.4.11 Show that e^A is invertible for all A ∈ Mn(C).

2.4.12 Suppose A, B ∈ Mn(C).
  (a) Show that if AB = BA then e^{A+B} = e^A e^B.
      [Hint: It is probably easier to use the definition of e^{A+B} rather than the Jordan decomposition. You may use the fact that you can rearrange infinite sums arising from analytic functions just like finite sums.]
  (b) Provide an example to show that e^{A+B} may not equal e^A e^B if AB ≠ BA.

2.4.13 Show that a matrix A ∈ Mn(C) is invertible if and only if it can be written in the form A = Ue^X, where U ∈ Mn(C) is unitary and X ∈ Mn(C) is Hermitian.

∗∗2.4.14 Show that sin^2(A) + cos^2(A) = I for all A ∈ Mn(C).

2.4.15 Show that every matrix A ∈ Mn(C) can be written in the form A = D + N, where D ∈ Mn(C) is diagonalizable, N ∈ Mn(C) is nilpotent (i.e., N^k = O for some integer k ≥ 1), and DN = ND.
[Side note: This is called the Jordan–Chevalley decomposition of A.]

∗∗2.4.16 Suppose A ∈ Mn is strictly upper triangular (i.e., it is upper triangular with diagonal entries equal to 0).
  (a) Show that, for each 1 ≤ k ≤ n, the first k superdiagonals of A^k consist entirely of zeros. That is, show that [A^k]_{i,j} = 0 whenever j − i < k.
  (b) Show that A^n = O.

∗∗2.4.17 For each 1 ≤ n < k, let N_n ∈ M_k denote the matrix with ones on its n-th superdiagonal and zeros elsewhere (i.e., [N_n]_{i,j} = 1 if j − i = n and [N_n]_{i,j} = 0 otherwise).
  (a) Show that N_1^n = N_n for all 1 ≤ n < k, and N_1^n = O when n ≥ k.
  (b) Show that nullity(N_n) = min{k, n}.

2.5 Summary and Review

In this chapter, we learned about several new matrix decompositions, and how
they fit in with and generalize the matrix decompositions that we already knew
about, like diagonalization. For example, we learned about a generalization of
diagonalization that applies to all matrices, called the Jordan decomposition
(Theorem 2.4.1), and we learned about a special case of diagonalization called
the spectral decomposition (Theorems 2.1.4 and 2.1.6) that applies to normal
or symmetric matrices (depending on whether the field is C or R, respectively).
See Figure 2.10 for a reminder of which decompositions from this chapter are
special cases of each other.

[Figure 2.10: a hierarchy of decompositions. Top row: Schur triangularization (Theorem 2.1.1), the Jordan decomposition (Theorem 2.4.1), and the singular value decomposition (Theorem 2.3.1). Below them: diagonalization (Theorem 2.0.1), and below that: the spectral decomposition (Theorems 2.1.4 and 2.1.6).]

Figure 2.10: Some matrix decompositions from this chapter. The decompositions in the top row apply to any square complex matrix (and the singular value decomposition even applies to rectangular matrices). Black lines between two decompositions indicate that the lower decomposition is a special case of the one above it that applies to a smaller set of matrices.
One common way of thinking about some matrix decompositions is as
providing us with a canonical form for matrices that is (a) unique, and (b)
captures all of the “important” information about that matrix (where the exact
meaning of “important” depends on what we want to do with the matrix or
what it represents). (We call a property of a matrix A “basis independent” if a change of basis does not affect it, i.e., A and PAP^{-1} share that property for all invertible P.) For example,
• If we are thinking of A ∈ Mm,n as representing a linear system, the relevant canonical form is its reduced row echelon form (RREF), which is unique and contains all information about the solutions of that linear system.
• If we are thinking of A ∈ Mn as representing a linear transformation and are only interested in basis-independent properties of it (e.g., its rank), the relevant canonical form is its Jordan decomposition, which is

unique and contains all basis-independent information about that linear


transformation.
These canonical forms can also be thought of as answering the question of
how simple a matrix can be made upon multiplying it on the left and/or right by
invertible matrices, and they are summarized in Table 2.2 for easy reference.

Type         Decomposition                       Name and notes
One-sided    A = PR                              See Appendix A.1.3.
             • P is invertible
             • R is the RREF of A
Two-sided    A = PDQ                             From Exercise 2.3.6. Either P or Q
             • P, Q are invertible               (but not both) can be chosen to be
             • D = [I_{rank(A)} O; O O]          unitary.
Similarity   A = PJP^{-1}                        Jordan decomposition—see Section 2.4.
             • P is invertible                   J is block diagonal and each block has
             • J is Jordan form                  constant diagonal and superdiagonal
                                                 equal to 1.

(All three of the forms reached by these decompositions are canonical. For example, every matrix can be put into one, and only one, RREF by multiplying it on the left by an invertible matrix.)

Table 2.2: A summary of the matrix decompositions that answer the question of how simple a matrix can be made upon multiplying it by an invertible matrix on the left and/or right. These decompositions are all canonical and apply to every matrix (with the understanding that the matrix must be square for similarity to make sense).

We can similarly ask how simple we can make a matrix upon multiplying it
on the left and/or right by unitary matrices, but it turns out that the answers to
this question are not so straightforward. For example, Schur triangularization
(Theorem 2.1.1) tells us that by applying a unitary similarity to a matrix A ∈ Mn(C), we can make it upper triangular (i.e., we can find a unitary matrix U ∈ Mn(C) such that A = UTU^* for some upper triangular matrix T ∈ Mn(C)). However, this upper triangular form is not canonical, since most matrices have numerous different Schur triangularizations that look nothing like one another. (Since Schur triangularization is not canonical, we cannot use it to answer the question of, given two matrices A, B ∈ Mn(C), whether or not there exists a unitary matrix U ∈ Mn(C) such that A = UBU^*.)

A related phenomenon happens when we consider one-sided multiplication by a unitary matrix. There are two decompositions that give completely different answers for what type of matrix can be reached in this way—the polar decomposition (Theorem 2.2.12) says that every matrix can be made positive
decomposition (Theorem 2.2.12) says that every matrix can be made positive
semidefinite, whereas the QR decomposition (Theorem 1.C.1) says that every
matrix can be made upper triangular. Furthermore, both of these forms are “not
quite canonical”—they are unique as long as the original matrix is invertible,
but they are not unique otherwise.
Of the decompositions that consider multiplication by unitary matrices, only
the singular value decomposition (Theorem 2.3.1) is canonical. We summarize
these observations in Table 2.3 for easy reference.
We also learned about the special role that the set of normal matrices, as well
as its subsets of unitary, Hermitian, skew-Hermitian, and positive semidefinite
matrices play in the realm of matrix decompositions. For example, the spectral

Type         Decomposition             Name and notes
One-sided    A = UT                    QR decomposition—see Section 1.C and note
             • U is unitary            that T can be chosen to have non-negative
             • T is triangular         diagonal entries.
One-sided    A = UP                    Polar decomposition—see Theorem 2.2.12.
             • U is unitary            P is positive definite if and only if A is
             • P is PSD                invertible.
Two-sided    A = UΣV                   Singular value decomposition (SVD)—see
             • U, V are unitary        Section 2.3. The diagonal entries of Σ can be
             • Σ is diagonal           chosen to be non-negative and in
                                       non-increasing order.
Unitary      A = UTU^*                 Schur triangularization—see Section 2.1.1.
similarity   • U is unitary            The spectral decomposition (Theorem 2.1.4)
             • T is triangular         is a special case.

(Of these decompositions, only the SVD is canonical.)

Table 2.3: A summary of the matrix decompositions that answer the question of how simple a matrix can be made upon multiplying it by unitary matrices on the left and/or right. These decompositions all apply to every matrix A (with the understanding that A must be square for unitary similarity to make sense).

decomposition tells us that normal matrices are exactly the matrices that can
not only be diagonalized, but can be diagonalized via a unitary matrix (rather
than just an invertible matrix).
While we have already given characterization theorems for unitary matrices
(Theorem 1.4.9) and positive semidefinite matrices (Theorem 2.2.1) that pro-
vide numerous different equivalent conditions that could be used to define them,
we have not yet done so for normal matrices. For ease of reference, we provide
such a characterization here, though we note that most of these properties were
already proved earlier in various exercises.

Theorem 2.5.1 (Characterization of Normal Matrices). Suppose A ∈ Mn(C) has eigenvalues λ1, . . . , λn. The following are equivalent:
  a) A is normal,
  b) A = UDU^* for some unitary U ∈ Mn(C) and diagonal D ∈ Mn(C),
  c) there is an orthonormal basis of C^n consisting of eigenvectors of A,
  d) ‖A‖_F^2 = |λ1|^2 + · · · + |λn|^2,
  e) the singular values of A are |λ1|, . . . , |λn|,
  f) (Av) · (Aw) = (A^*v) · (A^*w) for all v, w ∈ C^n, and
  g) ‖Av‖ = ‖A^*v‖ for all v ∈ C^n.

(Compare statements (f) and (g) of this theorem to the characterizations of unitary matrices given in Theorem 1.4.9.)

Proof. We have already discussed the equivalence of (a), (b), and (c) extensively, as (b) and (c) are just different statements of the spectral decomposition. The equivalence of (a) and (d) was proved in Exercise 2.1.12, the fact that (a) implies (e) was proved in Theorem 2.3.4 and its converse in Exercise 2.3.20. Finally, the fact that (a), (f), and (g) are equivalent follows from taking B = A^* in Exercise 1.4.19.  ∎
It is worth noting, however, that there are even more equivalent charac-
terizations of normality, though they are somewhat less important than those
discussed above. See Exercises 2.1.14, 2.2.14, and 2.5.4, for example.
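Several of the conditions in Theorem 2.5.1 are also easy to test numerically, which gives a quick way of checking a matrix for normality. A rough sketch (NumPy assumed; the helper names and test matrices are just illustrative choices of ours):

```python
import numpy as np

def is_normal(A, tol=1e-10):
    """Condition (a): A A^* = A^* A."""
    return np.allclose(A @ A.conj().T, A.conj().T @ A, atol=tol)

def frobenius_test(A):
    """Condition (d): ||A||_F^2 equals the sum of |lambda_i|^2."""
    evals = np.linalg.eigvals(A)
    return np.isclose(np.linalg.norm(A, 'fro')**2, np.sum(np.abs(evals)**2))

U = np.array([[0.0, -1.0], [1.0, 0.0]])   # a rotation: unitary, hence normal
N = np.array([[1.0, 1.0], [0.0, 1.0]])    # a Jordan block: not normal
print(is_normal(U), frobenius_test(U))    # True True
print(is_normal(N), frobenius_test(N))    # False False
```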

Exercises (solutions to starred exercises on page 473)

2.5.1 For each of the following matrices, say which of the following matrix decompositions can be applied to it: (i) diagonalization (i.e., A = PDP^{-1} with P invertible and D diagonal), (ii) Schur triangularization, (iii) spectral decomposition, (iv) singular value decomposition, and/or (v) Jordan decomposition.
∗ (a) [1 1; −1 1]        (b) [0 −i; i 0]
∗ (c) [1 0; 1 1]         (d) [1 2; 3 4]
∗ (e) [1 0; 1 1.0001]    (f) [1 2 3; 4 5 6]
∗ (g) [0 0; 0 0]         (h) The 75 × 75 matrix with every entry equal to 1.

2.5.2 Determine which of the following statements are true and which are false.
∗ (a) Any two eigenvectors that come from different eigenspaces of a normal matrix must be orthogonal.
  (b) Any two distinct eigenvectors of a Hermitian matrix must be orthogonal.
∗ (c) If A ∈ Mn has polar decomposition A = UP then P is the principal square root of A^*A.

2.5.3 Suppose A ∈ Mn(C) is normal and B ∈ Mn(C) commutes with A (i.e., AB = BA).
  (a) Show that A^* and B commute as well. [Hint: Compute ‖A^*B − BA^*‖_F^2.] [Side note: This result is called Fuglede's theorem.]
  (b) Show that if B is also normal then so is AB.

∗∗2.5.4 Suppose A ∈ Mn(C) has eigenvalues λ1, . . . , λk (listed according to geometric multiplicity) with corresponding eigenvectors v1, . . . , vk, respectively, that form a linearly independent set.
Show that A is normal if and only if A^* has eigenvalues λ̄1, . . . , λ̄k with corresponding eigenvectors v1, . . . , vk, respectively.
[Hint: One direction is much harder than the other. For the difficult direction, be careful not to assume that k = n without actually proving it—Schur triangularization might help.]

2.5.5 Two Hermitian matrices A, B ∈ M_n^H are called ∗-congruent if there exists an invertible matrix S ∈ Mn(C) such that A = SBS^*.
  (a) Show that for every Hermitian matrix A ∈ M_n^H there exist non-negative integers p and n such that A is ∗-congruent to

          [I_p O O; O −I_n O; O O O].

      [Hint: This follows quickly from a decomposition that we learned about in this chapter.]
  (b) Show that every Hermitian matrix is ∗-congruent to exactly one matrix of the form described by part (a) (i.e., p and n are determined by A).
      [Hint: Exercise 2.2.17 might help here.]
      [Side note: This shows that the form described by part (a) is a canonical form for ∗-congruence, a fact that is called Sylvester's law of inertia.]

2.5.6 Suppose A, B ∈ Mn(C) are diagonalizable and A has distinct eigenvalues.
  (a) Show that A and B commute if and only if B ∈ span{I, A, A^2, A^3, . . .}. [Hint: Use Exercise 2.1.29 and think about interpolating polynomials.]
  (b) Provide an example to show that if A does not have distinct eigenvalues then it may be the case that A and B commute even though B ∉ span{I, A, A^2, A^3, . . .}.

∗2.5.7 Suppose that A ∈ Mn is a positive definite matrix for which each entry is either 0 or 1. In this exercise, we show that the only such matrix is A = I.
  (a) Show that tr(A) ≤ n.
  (b) Show that det(A) ≥ 1. [Hint: First show that det(A) is an integer.]
  (c) Show that det(A) ≤ 1. [Hint: Use the AM–GM inequality (Theorem A.5.3) with the eigenvalues of A.]
  (d) Use parts (a), (b), and (c) to show that every eigenvalue of A equals 1 and thus A = I.

2.A Extra Topic: Quadratic Forms and Conic Sections

One of the most useful applications of the real spectral decomposition is a characterization of quadratic forms, which can be thought of as a generalization of quadratic functions of one input variable (e.g., q(x) = 3x^2 + 2x − 7) to multiple variables, in the same way that linear forms generalize linear functions from one input variable to multiple variables. (Refer back to Section 1.3.2 if you need a refresher on linear or bilinear forms.)

Definition 2.A.1 (Quadratic Forms). Suppose V is a vector space over R. Then a function q : V → R is called a quadratic form if there exists a bilinear form f : V × V → R such that

    q(v) = f(v, v)  for all  v ∈ V.

For example, the function q : R^2 → R defined by q(x, y) = 3x^2 + 2xy + 5y^2 is a quadratic form, since if we let v = (x, y) and

    A = \begin{bmatrix} 3 & 1 \\ 1 & 5 \end{bmatrix}  then  q(v) = v^TAv.

That is, q looks like the “piece” of the bilinear form f(v, w) = v^TAw that we get if we plug v into both inputs. More generally, every polynomial q : R^n → R in which every term has degree exactly equal to 2 is a quadratic form, since this same procedure can be carried out by simply placing the coefficients of the squared terms along the diagonal of the matrix A and half of the coefficients of the cross terms in the corresponding off-diagonal entries. (The degree of a term is the sum of the exponents of the variables being multiplied together; e.g., x^2 and xy each have degree 2.)

Example 2.A.1 (Writing a Degree-2 Polynomial as a Matrix). Suppose q : R^3 → R is the function defined by

    q(x, y, z) = 2x^2 − 2xy + 3y^2 + 2xz + 3z^2.

Find a symmetric matrix A ∈ M3(R) such that if v = (x, y, z) then q has the form q(v) = v^TAv, and thus show that q is a quadratic form.

Solution:
Direct computation shows that if A ∈ M3(R) is symmetric then

    v^TAv = a_{1,1}x^2 + 2a_{1,2}xy + 2a_{1,3}xz + a_{2,2}y^2 + 2a_{2,3}yz + a_{3,3}z^2.

Simply matching up this form with the coefficients of q shows that we can choose

    A = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 3 & 0 \\ 1 & 0 & 3 \end{bmatrix}  so that  q(v) = v^TAv.

(The fact that the (2,3)-entry of A equals 0 is a result of the fact that q has no “yz” term.)

It follows that q is a quadratic form, since q(v) = f(v, v), where f(v, w) = v^TAw is a bilinear form.
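The bookkeeping in this example (squared-term coefficients on the diagonal, half of each cross-term coefficient off the diagonal) is easy to automate. A small sketch with a hypothetical helper name (NumPy assumed):

```python
import numpy as np

def quadratic_form_matrix(sq_coeffs, cross_coeffs):
    """Build the symmetric A with q(v) = v^T A v, given the coefficients of the
    squared terms x_i^2 and of the cross terms x_i x_j (i < j)."""
    A = np.diag(np.array(sq_coeffs, dtype=float))
    for (i, j), c in cross_coeffs.items():
        A[i, j] = A[j, i] = c / 2          # half of each cross coefficient
    return A

# q(x, y, z) = 2x^2 - 2xy + 3y^2 + 2xz + 3z^2, as in the example above
A = quadratic_form_matrix([2, 3, 3], {(0, 1): -2, (0, 2): 2})
v = np.array([1.0, -2.0, 3.0])             # any test vector
x, y, z = v
print(np.isclose(v @ A @ v, 2*x**2 - 2*x*y + 3*y**2 + 2*x*z + 3*z**2))  # True
```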

In fact, the converse holds as well—not only are polynomials with degree-2
terms quadratic forms, but every quadratic form on Rn can be written as a
polynomial with degree-2 terms. We now state and prove this observation.

Theorem 2.A.1 (Characterization of Quadratic Forms). Suppose q : R^n → R is a function. The following are equivalent:
  a) q is a quadratic form,
  b) q is an n-variable polynomial in which every term has degree 2,
  c) there is a matrix A ∈ Mn(R) such that q(v) = v^TAv, and
  d) there is a symmetric matrix A ∈ M_n^S(R) such that q(v) = v^TAv.
Furthermore, the vector space of quadratic forms is isomorphic to the vector space M_n^S of symmetric matrices.

Proof. We have already discussed why conditions (b) and (d) are equiva-
lent, and the equivalence of (c) and (a) follows immediately from applying
2.A Extra Topic: Quadratic Forms and Conic Sections 257

Theorem 1.3.5 to the bilinear form f associated with q (i.e., the bilinear form
with the property that q(v) = f (v, v) for all v ∈ Rn ). Furthermore, the fact that
(d) implies (c) is trivial, so the only remaining implication to prove is that (c)
implies (d).
To this end, simply notice that if q(v) = v^TAv for all v ∈ R^n then

    v^T(A + A^T)v = v^TAv + v^TA^Tv = q(v) + (v^TAv)^T = q(v) + v^TAv = 2q(v),

since v^TAv is a scalar and thus equals its own transpose.
We can thus replace A by the symmetric matrix (A + AT )/2 without changing
the quadratic form q.
For the “furthermore” claim, suppose A ∈ MSn is symmetric and define a
linear transformation T by T (A) = q, where q(v) = vT Av. The equivalence of
conditions (a) and (d) shows that q is a quadratic form, and every quadratic
form is in the range of T . To see that T is invertible (and thus an isomorphism),
we just need to show that T (A) = 0 implies A = O. In other words, we need
to show that if q(v) = vT Av = 0 for all v ∈ Rn then A = O. This fact follows
immediately from Exercise 1.4.28(b), so we are done. 
Since the above theorem tells us that every quadratic form can be written in terms of a symmetric matrix, we can use any tools that we know of for manipulating symmetric matrices to help us better understand quadratic forms. (Recall from Example 1.B.3 that the matrix (A + A^T)/2 from the above proof is the symmetric part of A.) For example, the quadratic form q(x, y) = (3/2)x^2 + xy + (3/2)y^2 can be written in the form q(v) = v^TAv, where

    v = \begin{bmatrix} x \\ y \end{bmatrix}  and  A = \frac{1}{2}\begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}.

Since A has real spectral decomposition A = UDU^T with

    U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}  and  D = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix},

we can multiply out q(v) = v^TAv = (U^Tv)^TD(U^Tv) to see that

    q(v) = \frac{1}{2}\begin{bmatrix} x+y & x-y \end{bmatrix}\begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x+y \\ x-y \end{bmatrix} = (x+y)^2 + \frac{1}{2}(x-y)^2.     (2.A.1)

(We investigate higher-degree generalizations of linear and quadratic forms in Section 3.B.)

In other words, the real spectral decomposition tells us how to write q as a


sum (or, if some of the eigenvalues of A are negative, difference) of squares.
We now state this observation explicitly.

Corollary 2.A.2 (Diagonalization of Quadratic Forms). Suppose q : R^n → R is a quadratic form. Then there exist scalars λ1, . . . , λn ∈ R and an orthonormal basis {u1, u2, . . . , un} of R^n such that

    q(v) = λ1(u1 · v)^2 + λ2(u2 · v)^2 + · · · + λn(un · v)^2  for all v ∈ R^n.

Proof. We know from Theorem 2.A.1 that there exists a symmetric matrix A ∈ Mn(R) such that q(v) = v^TAv for all v ∈ R^n. Applying the real spectral decomposition (Theorem 2.1.6) to A gives us a unitary matrix U ∈ Mn(R) and a diagonal matrix D ∈ Mn(R) such that A = UDU^T, so q(v) = (U^Tv)^TD(U^Tv), as before.

If we write U = [ u1 | u2 | · · · | un ] and let λ1, λ2, . . ., λn denote the eigenvalues of A, listed in the order in which they appear on the diagonal of D, then (using the fact that u_j^Tv = u_j · v for all 1 ≤ j ≤ n)

    q(v) = (U^Tv)^TD(U^Tv)
         = \begin{bmatrix} u_1 · v & u_2 · v & \cdots & u_n · v \end{bmatrix}\begin{bmatrix} λ_1 & 0 & \cdots & 0 \\ 0 & λ_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & λ_n \end{bmatrix}\begin{bmatrix} u_1 · v \\ u_2 · v \\ \vdots \\ u_n · v \end{bmatrix}
         = λ_1(u_1 · v)^2 + λ_2(u_2 · v)^2 + · · · + λ_n(u_n · v)^2

for all v ∈ R^n, as claimed.  ∎
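Corollary 2.A.2 is exactly what a symmetric eigensolver computes: the eigenvalues are the λ_i and the columns of the returned orthogonal matrix are the orthonormal vectors u_i. A quick numerical check on the quadratic form q(x, y) = (3/2)x^2 + xy + (3/2)y^2 from above (NumPy assumed; the helper name is ours):

```python
import numpy as np

A = 0.5 * np.array([[3.0, 1.0], [1.0, 3.0]])   # q(v) = v^T A v
evals, U = np.linalg.eigh(A)                   # columns of U are u_1, u_2

def q_diagonalized(v):
    # Corollary 2.A.2: q(v) = sum_i lambda_i (u_i . v)^2
    return sum(lam * (U[:, i] @ v)**2 for i, lam in enumerate(evals))

v = np.array([0.7, -1.3])
print(np.isclose(q_diagonalized(v), v @ A @ v))   # True
```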


The real magic of the above theorem thus comes from the fact that, not
only does it let us write any quadratic form as a sum or difference of squares,
but it lets us do so through an orthonormal change of variables. To see what
we mean by this, recall from Theorem 1.4.5 that if B= {u1 , u2 , . . . , un } is
an orthonormal basis then [v]B = u1 · v, u2 · v, . . . , un · v . In other words, the
above theorem says that every quadratic form really just looks like a function
Just like of the form
diagonalization of a
matrix gets rid of its
off-diagonal entries, f (x1 , x2 , . . . , xn ) = λ1 x12 + λ2 x22 + · · · + λn xn2 , (2.A.2)
diagonalization of a
quadratic form gets but rotated and/or reflected. We now investigate what effect different types of
rid of its cross terms. eigenvalues λ1 , λ2 , . . ., λn , have on the graph of these functions.

2.A.1 Definiteness, Ellipsoids, and Paraboloids


In the case when all of the eigenvalues of a symmetric matrix have the same sign
(i.e., either all positive or all negative), the associated quadratic form becomes
much easier to analyze.

Positive Eigenvalues
If the symmetric matrix A ∈ MSn (R) corresponding to a quadratic form q(v) =
vT Av is positive semidefinite then, by definition, we must have q(v) ≥ 0 for
all v. For this reason, we say in this case that q itself is positive semidefinite
(PSD), and we say that it is furthermore positive definite (PD) if q(v) > 0
whenever v ≠ 0.
All of our results about positive semidefinite matrices carry over straight-
forwardly to the setting of PSD quadratic forms. In particular, the following
fact follows immediately from Theorem 2.2.1:

Recall that
the scalars in
! A quadratic form q : Rn → R is positive semidefinite if and
Corollary 2.A.2 are only if each of the scalars in Corollary 2.A.2 are non-negative.
the eigenvalues of A. It is positive definite if and only if those scalars are all strictly
positive.
A level set of q is the
set of solutions to
q(v) = c, where c is a If q is positive definite then has it ellipses (or ellipsoids, or hyperellipsoids,
given scalar. They depending on the dimension) as its level sets. The principal radii of the level
are horizontal slices √ √ √
of q’s graph. set q(v) = 1 are equal to 1/ λ1 , 1/ λ2 , . . ., 1/ λn , where λ1 , λ2 , . . ., λn are

the eigenvalues of its associated matrix A, and the orthonormal eigenvectors


{u1 , u2 , . . . , un } described by Corollary 2.A.2 specify the directions of their cor-
responding principal axes. Furthermore, in this case when the level sets of q are
ellipses, its graph must be a paraboloid (which looks like an infinitely-deep bowl).
For example, the level sets of q(x, y) = x^2 + y^2 are circles, the level sets of q(x, y) = x^2 + 2y^2 are ellipses that are squished by a factor of 1/√2 in the y-direction, and the level sets of h(x, y, z) = x^2 + 2y^2 + 3z^2 are ellipsoids in R^3 that are squished by factors of 1/√2 and 1/√3 in the directions of the y- and z-axes, respectively. We now work through an example that is rotated and thus cannot be “eyeballed” so easily.

Example 2.A.2 (A Positive Definite Quadratic Form). Plot the level sets of the quadratic form q(x, y) = (3/2)x^2 + xy + (3/2)y^2 and then graph it.

Solution:
We already diagonalized this quadratic form back in Equation (2.A.1). In particular, if we let v = (x, y) then

    q(v) = 2(u1 · v)^2 + (u2 · v)^2,  where  u1 = (1, 1)/√2,  u2 = (1, −1)/√2.

It follows that the level sets of q are ellipses rotated so that their principal axes point in the directions of u1 = (1, 1)/√2 and u2 = (1, −1)/√2, and the level set q(v) = 1 has corresponding principal radii equal to 1/√2 and 1, respectively. These level sets, as well as the resulting graph of q, are displayed below.

[Figure: the level sets q(v) = 1, 2, 3 (ellipses with principal axes along u1 and u2) are shown on the left; the graph z = q(x, y), a paraboloid, is shown on the right.]
If a quadratic form acts on R3 instead of R2 then its level sets are ellipsoids
(rather than ellipses), and its graph is a “hyperparaboloid” living in R4 that is a
bit difficult to visualize. For example, applying the spectral decomposition to
the quadratic form

q(x, y, z) = 3x2 + 4y2 − 2xz + 3z2

reveals that if we write v = (x, y, z) then

    q(v) = 4(u1 · v)^2 + 4(u2 · v)^2 + 2(u3 · v)^2,  where

    u1 = (1, 0, −1)/√2,  u2 = (0, 1, 0),  and  u3 = (1, 0, 1)/√2.

It follows that the level sets of q are ellipsoids with principal axes pointing in the directions of u1, u2, and u3. Furthermore, the corresponding principal radii for the level set q(v) = 1 are 1/2, 1/2, and 1/√2, respectively, as displayed in Figure 2.11(a).
Negative Eigenvalues
If all of the eigenvalues of a symmetric matrix are strictly negative then the level sets of the associated quadratic form are still ellipses (or ellipsoids, or hyperellipsoids), but its graph is instead a (hyper)paraboloid that opens down (not up). (A symmetric matrix with all-negative eigenvalues is called negative definite.) These observations follow simply from noticing that if q has negative eigenvalues then −q (which has the same level sets as q) has positive eigenvalues and is thus positive definite.

Some Zero Eigenvalues
If a quadratic form is just positive semidefinite (i.e., one or more of the eigenvalues of the associated symmetric matrix equal zero) then its level sets are degenerate—they look like lower-dimensional ellipsoids that are stretched into higher-dimensional space. (The shapes that arise in this section are the conic sections and their higher-dimensional counterparts.) For example, the quadratic form

    q(x, y, z) = 3x^2 + 2xy + 3y^2

does not actually depend on z at all, so its level sets extend arbitrarily far in the z direction, as in Figure 2.11(b).

[Figure: (a) the ellipsoidal level sets q(x, y, z) = 1 and q(x, y, z) = 2 of the positive definite form above; (b) the elliptical-cylinder level set q(x, y, z) = 1 of the positive semidefinite form 3x^2 + 2xy + 3y^2. These elliptical cylinders can be thought of as ellipsoids with one of their radii equal to ∞.]

Figure 2.11: Level sets of positive (semi)definite quadratic forms are (hyper)ellipsoids.

Indeed, we can write q in the matrix form

    q(v) = v^TAv,  where  A = \begin{bmatrix} 3 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 0 \end{bmatrix},

which has eigenvalues 4, 2, and 0. Everything on the z-axis is an eigenvector


corresponding to the 0 eigenvalue, which explains why nothing interesting
happens in that direction.

Example 2.A.3 (A Positive Semidefinite Quadratic Form). Plot the level sets of the quadratic form q(x, y) = x^2 − 2xy + y^2 and then graph it.

Solution:
This quadratic form is simple enough that we can simply eyeball a diagonalization of it:

    q(x, y) = x^2 − 2xy + y^2 = (x − y)^2.

The fact that we just need one term in a sum-of-squares decomposition of q tells us right away that the associated symmetric matrix A has at most one non-zero eigenvalue. We can verify this explicitly by noting that if v = (x, y) then

    q(v) = v^TAv,  where  A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix},

which has eigenvalues 2 and 0. The level sets of q are thus degenerate ellipses (which are just pairs of lines, since “ellipses” in R^1 are just pairs of points). Furthermore, the graph of this quadratic form is a degenerate paraboloid (i.e., a parabolic sheet), as shown below.


[Figure: the level sets q(x, y) = c for c = 1, 2, 3, 4 are pairs of parallel lines (with the level set q(x, y) = 0 being the single line y = x), and the graph z = q(x, y) is a parabolic sheet. We can think of these level sets (i.e., pairs of parallel lines) as ellipses with one of their principal radii equal to ∞.]

2.A.2 Indefiniteness and Hyperboloids


[Side note: If a quadratic form or symmetric matrix is neither positive semidefinite nor negative semidefinite, we say that it is indefinite.]
If the symmetric matrix associated with a quadratic form has both positive and negative eigenvalues then its graph looks like a “saddle”—there are directions along which it opens up (i.e., the directions of its eigenvectors corresponding to positive eigenvalues), and other directions along which it opens down (i.e., the directions of its eigenvectors corresponding to negative eigenvalues).
The level sets of such a shape are hyperbolas (or hyperboloids, depending on
the dimension).

Example 2.A.4 (An Indefinite Quadratic Form). Plot the level sets of the quadratic form q(x, y) = xy and then graph it.
Solution:
Applying the spectral decomposition to the symmetric matrix associ-
ated with q reveals that it has eigenvalues λ1 = 1/2 and λ2 = −1/2, with
corresponding eigenvectors (1, 1) and (1, −1), respectively. It follows that
q can be diagonalized as
q(x, y) = (1/4)(x + y)2 − (1/4)(x − y)2.
Since there are terms being both added and subtracted, we conclude that

the level sets of q are hyperbolas and its graph is a saddle, as shown below:
[Figure: the hyperbolic level sets of q(x, y) = xy (left) and its saddle-shaped graph z = xy (right).]
[Side note: The z = 0 level set is exactly the x- and y-axes, which separate the other two families of hyperbolic level sets.]
As suggested by the image in the above example, if λ1 is a positive eigen-


value of the symmetric matrix associated with a quadratic form q : R2 → R then 1/√λ1 measures the distance from the origin to the closest point on the
hyperbola q(x, y) = 1, and the eigenvectors to which λ1 corresponds specify the
direction between the origin and that closest point. The eigenvectors to which
the other (necessarily negative) eigenvalue λ2 corresponds similarly specify
the “open” direction of this hyperbola. Furthermore, if we look at the hyperbola
q(x, y) = −1 then the roles of λ1 and λ2 swap.
In higher dimensions, the level sets are similarly hyperboloids (i.e., shapes whose 2D cross-sections are ellipses and/or hyperbolas), and a similar interpretation of the eigenvalues and eigenvectors holds. In particular, for the hyperboloid level set q(v) = 1, the positive eigenvalues specify the radii of its elliptical cross-sections, and the eigenvectors corresponding to the negative eigenvalues specify directions along which the hyperboloid is “open” (and these interpretations switch for the level set q(v) = −1).
[Side note: There is also one other possibility: if one eigenvalue is positive, one is negative, and one equals zero, then the level sets look like “hyperbolic sheets”—2D hyperbolas stretched along a third dimension.]
For example, if q : R3 → R is the quadratic form

q(x, y, z) = x2 + y2 − z2
then the level set q(x, y, z) = 1 is a hyperboloid that looks like the circle x2 +
y2 = 1 in the xy-plane, the hyperbola x2 − z2 = 1 in the xz-plane, and the
hyperbola y2 − z2 = 1 in the yz-plane. It is thus open along the z-axis and has
radius 1 along the x- and y-axes (see Figure 2.12(a)). A similar analysis of
the level set q(x, y, z) = −1 reveals that it is open along the x- and y-axes and
thus is not even connected (the xy-plane separates its two halves), as shown in
Figure 2.12(c). For this reason, we call this set a “hyperboloid of two sheets”
(whereas we call the level set q(x, y, z) = 1 a “hyperboloid of one sheet”).

[Figure: the level sets q(x, y, z) = 1 (a hyperboloid of one sheet), q(x, y, z) = 0 (a double cone), and q(x, y, z) = −1 (a hyperboloid of two sheets).]
Figure 2.12: The quadratic form q(x, y, z) = x2 + y2 − z2 is indefinite, so its level sets are hyperboloids. The one exception is the level set q(x, y, z) = 0, which is a double cone that serves as a boundary between the two types of hyperboloids that exist (one-sheeted and two-sheeted).

Exercises    solutions to starred exercises on page 474

2.A.1 Classify each of the following quadratic forms as positive (semi)definite, negative (semi)definite, or indefinite.
∗(a) q(x, y) = x2 + 3y2
(b) q(x, y) = 3y2 − 2x2
∗(c) q(x, y) = x2 + 4xy + 3y2
(d) q(x, y) = x2 + 4xy + 4y2
∗(e) q(x, y) = x2 + 4xy + 5y2
(f) q(x, y) = 2xy − 2x2 − y2
∗(g) q(x, y, z) = 3x2 − xz + z2
(h) q(x, y, z) = 3x2 + 3y2 + 3z2 − 2xy − 2xz − 2yz
∗(i) q(x, y, z) = x2 + 2y2 + 2z2 − 2xy + 2xz − 4yz
(j) q(w, x, y, z) = w2 + 2x2 − y2 − 2xy + xz − 2wx + yz

2.A.2 Determine what type of object the graph of the given equation in R2 is (e.g., ellipse, hyperbola, two lines, or maybe even nothing at all).
∗(a) x2 + 2y2 = 1
(b) x2 − 2y2 = −4
∗(c) x2 + 2xy + 2y2 = 1
(d) x2 + 2xy + 2y2 = −2
∗(e) 2x2 + 4xy + y2 = 3
(f) 2x2 + 4xy + y2 = 0
∗(g) 2x2 + 4xy + y2 = −1
(h) 2x2 + xy + 3y2 = 0

2.A.3 Determine what type of object the graph of the given equation in R3 is (e.g., ellipsoid, hyperboloid of one sheet, hyperboloid of two sheets, two touching cones, or maybe even nothing at all).
∗(a) x2 + 3y2 + 2z2 = 2
(b) x2 + 3y2 + 2z2 = −3
∗(c) 2x2 + 2xy − 2xz + 2yz = 1
(d) 2x2 + 2xy − 2xz + 2yz = −2
∗(e) x2 + y2 + 2z2 − 2xz − 2yz = 3
(f) x2 + y2 + 4z2 − 2xy − 4xz = 2
∗(g) 3x2 + y2 + 2z2 − 2xy − 2xz − yz = 1
(h) 2x2 + y2 + 4z2 − 2xy − 4xz = 0

2.A.4 Determine which of the following statements are true and which are false.
∗(a) Quadratic forms are bilinear forms.
(b) The sum of two quadratic forms is a quadratic form.
∗(c) A quadratic form q(v) = vT Av is positive semidefinite if and only if A is positive semidefinite.
(d) The graph of a non-zero quadratic form q : R2 → R is either an ellipse, a hyperbola, or two lines.
∗(e) The function f : R → R defined by f (x) = x2 is a quadratic form.

2.A.5 Determine which values of a ∈ R make the following quadratic form positive definite:
q(x, y, z) = x2 + y2 + z2 − a(xy + xz + yz).

2.B Extra Topic: Schur Complements and Cholesky

One common technique for making large matrices easier to work with is to
break them up into 2 × 2 block matrices and then try to use properties of the
smaller blocks to determine corresponding properties of the large matrix. For
example, solving a large linear system Qx = b directly might be quite time-
consuming, but we can make it smaller and easier to solve by writing Q as a
2 × 2 block matrix, and similarly writing x and b as “block vectors”, as follows:
" #   " #
A B x1 b
Q= , x= , and b = 1 .
C D x2 b2

Then Qx = b can be written in either of the equivalent forms


" #  " #
A B x1 b Ax1 + Bx2 = b1
= 1 ⇐⇒
C D x2 b2 Cx1 + Dx2 = b2 .

If A is invertible then one way to solve this linear system is to subtract CA−1
times the first equation from the second equation, which puts it into the form
[Side note: This procedure is sometimes called “block Gaussian elimination”, and it really can be thought of as a block matrix version of the Gaussian elimination algorithm that we already know.]

Ax1 + Bx2 = b1
(D − CA−1B)x2 = b2 − CA−1b1.

This version of the linear system perhaps looks uglier at first glance, but it has the useful property of being block upper triangular: we can solve it by first solving for x2 in the smaller linear system (D − CA−1B)x2 = b2 − CA−1b1 and then solving for x1 in the (also smaller) linear system Ax1 + Bx2 = b1. By
already know. using this technique, we can solve a 2n × 2n linear system Qx = b entirely via
matrices that are n × n.
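The following minimal sketch (in Python with NumPy; the helper name block_solve is ours) carries out exactly this block elimination: it only ever solves systems whose coefficient matrices are the blocks A and the Schur complement, never the full 2n × 2n matrix Q.

```python
import numpy as np

def block_solve(A, B, C, D, b1, b2):
    # Solve [A B; C D][x1; x2] = [b1; b2] by block Gaussian elimination,
    # assuming the top-left block A is invertible.
    S = D - C @ np.linalg.solve(A, B)                  # Schur complement D - C A^{-1} B
    x2 = np.linalg.solve(S, b2 - C @ np.linalg.solve(A, b1))
    x1 = np.linalg.solve(A, b1 - B @ x2)               # back-substitute into A x1 + B x2 = b1
    return x1, x2
```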
This type of reasoning can also be applied directly to the block matrix Q
(rather than the corresponding linear system), and doing so leads fairly quickly
to the following very useful block matrix decomposition:

Theorem 2.B.1 (2 × 2 Block Matrix LDU Decomposition). Suppose A ∈ Mn is invertible. Then

[A B; C D] = [I O; CA−1 I] [A O; O D − CA−1B] [I A−1B; O I].

[Side note: We do not require A and D to have the same size in this theorem (and thus B and C need not be square).]

Proof. We simply multiply together the block matrices on the right:

[I O; CA−1 I] [A O; O D − CA−1B] [I A−1B; O I]
   = [I O; CA−1 I] [A A(A−1B); O D − CA−1B]
   = [A B; (CA−1)A (CA−1B) + (D − CA−1B)] = [A B; C D],

as claimed. 
We will use the above decomposition throughout this section to come up
with formulas for things like det(Q) and Q−1 , or find a way to determine
positive semidefiniteness of Q, based only on corresponding properties of its
blocks A, B, C, and D.

2.B.1 The Schur Complement


The beauty of Theorem 2.B.1 is that it lets us write almost any 2 × 2 block
matrix as a product of triangular matrices with ones on their diagonals and a
block-diagonal matrix. We can thus determine many properties of Q just from
the corresponding properties of those diagonal blocks. Since it will be appearing
repeatedly, we now give a name to the ugly diagonal block D −CA−1 B.

Definition 2.B.1 (Schur Complement). If A ∈ Mn is invertible then the Schur complement of A in the 2 × 2 block matrix [A B; C D] is D − CA−1B.

Example 2.B.1 (Computing a Schur Complement). Partition the following matrix Q as a 2 × 2 block matrix of 2 × 2 matrices and compute the Schur complement of the top-left block in Q:

Q = [2 1 1 0; 1 2 1 1; 1 1 2 1; 0 1 1 2].

Solution:
The 2 × 2 blocks of this matrix are

A = [2 1; 1 2],   B = [1 0; 1 1],   C = [1 1; 0 1],   and   D = [2 1; 1 2].

Since A is invertible (its determinant is 3), the Schur complement of A in Q is defined and it equals

D − CA−1B = [2 1; 1 2] − (1/3) [1 1; 0 1] [2 −1; −1 2] [1 0; 1 1] = (1/3) [4 2; 2 4].

[Side note: To help remember the formula for the Schur complement, keep in mind that CA−1B is always defined as long as the block matrix Q makes sense, whereas the incorrect BA−1C will not be if B and C are not square.]
1 2 3 0 1 −1 2 1 1 3 2 4

To illustrate what we can do with the Schur complement and the block
matrix decomposition of Theorem 2.B.1, we now present formulas for the
determinant and inverse of a 2 × 2 block matrix.

Theorem 2.B.2 (Determinant of a 2 × 2 Block Matrix). Suppose A ∈ Mn is invertible and S = D − CA−1B is the Schur complement of A in the 2 × 2 block matrix Q = [A B; C D]. Then det(Q) = det(A) det(S).

[Side note: Notice that if the blocks are all 1 × 1 then this theorem says det(Q) = a(d − c(1/a)b) = ad − bc, which is the formula we already know for 2 × 2 matrices.]

Proof. We just take the determinant of both sides of the block matrix decomposition of Theorem 2.B.1:

" #! " #" #" #!


A B I O A O I A−1 B
Recall that det = det
det(XY ) = det(X) det(Y ) C D CA−1 I O S O I
and that the " #! " #! " #!
determinant of a I O A O I A−1 B
triangular matrix is = det det det
CA −1 I O S O I
the product of its
" #!
diagonal entries.
A O
= 1 · det ·1
O S
= det(A) det(S).

Note that in the final step we used the fact that the determinant of a block diagonal matrix equals the product of the determinants of its blocks (this fact is often covered in introductory linear algebra texts—see the end of Appendix A.1.5). ∎

[Side note: If A is not invertible then these block matrix formulas do not work. For one way to sometimes get around this problem, see Exercise 2.B.8.]

In particular, notice that det(Q) = 0 if and only if det(S) = 0 (since we are assuming that A is invertible, so we know that det(A) ≠ 0). By recalling that a matrix has determinant 0 if and only if it is not invertible, this tells us that Q is invertible if and only if its Schur complement is invertible. This fact is re-stated in the following theorem, along with an explicit formula for its inverse.

Theorem 2.B.3 (Inverse of a 2 × 2 Block Matrix). Suppose A ∈ Mn is invertible and S = D − CA−1B is the Schur complement of A in the 2 × 2 block matrix Q = [A B; C D]. Then Q is invertible if and only if S is invertible, and its inverse is

Q−1 = [I −A−1B; O I] [A−1 O; O S−1] [I O; −CA−1 I].

Proof. We just take the inverse of both sides of the block matrix decomposition of Theorem 2.B.1:

[Side note: Recall that (XYZ)−1 = Z−1Y−1X−1.]

[A B; C D]−1 = ([I O; CA−1 I] [A O; O S] [I A−1B; O I])−1
             = [I A−1B; O I]−1 [A O; O S]−1 [I O; CA−1 I]−1
             = [I −A−1B; O I] [A−1 O; O S−1] [I O; −CA−1 I].

Note that in the final step we used the fact that the inverse of a block diagonal matrix is just the matrix with inverted diagonal blocks and the fact that for any matrix X we have [I X; O I]−1 = [I −X; O I]. ∎

[Side note: If the blocks are all 1 × 1 then the formula provided by this theorem simplifies to the familiar formula Q−1 = (1/det(Q)) [d −b; −c a] that we already know for 2 × 2 matrices.]

If we wanted to, we could explicitly multiply out the formula provided by


Theorem 2.B.3 to see that
" #−1 " #
A B A−1 + A−1 BS−1CA−1 −A−1 BS−1
= .
C D −S−1CA−1 S−1

However, this formula seems rather ugly and cumbersome, so we typically


prefer the factored form provided by the theorem.
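As a numerical sanity check of the factored formula from Theorem 2.B.3, the following sketch (Python with NumPy; the random blocks are our own example, and we implicitly assume they make A and Q invertible, which is essentially always the case for random entries) multiplies the three factors together and compares the result with a directly computed inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.random((3, 3)), rng.random((3, 2))
C, D = rng.random((2, 3)), rng.random((2, 2))
Q = np.block([[A, B], [C, D]])

Ainv = np.linalg.inv(A)
S = D - C @ Ainv @ B                       # Schur complement of A in Q
I3, I2, O = np.eye(3), np.eye(2), np.zeros((3, 2))

Qinv = (np.block([[I3, -Ainv @ B], [O.T, I2]])
        @ np.block([[Ainv, O], [O.T, np.linalg.inv(S)]])
        @ np.block([[I3, O], [-C @ Ainv, I2]]))

print(np.allclose(Qinv, np.linalg.inv(Q)))   # True for this random Q
```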
The previous theorems are useful because they let us compute properties
of large matrices just by computing the corresponding properties of matrices
that are half as large. The following example highlights the power of this
technique—we will be able to compute the determinant of a 4 × 4 matrix just
by doing some computations with some 2 × 2 matrices (a much easier task).

Example 2.B.2 (Using the Schur Complement). Use the Schur complement to compute the determinant of the following matrix:

Q = [2 1 1 0; 1 2 1 1; 1 1 2 1; 0 1 1 2].
Solution:
We recall the top-left A block and the Schur complement S of A in Q
from Example 2.B.1:
   
A = [2 1; 1 2]   and   S = (1/3) [4 2; 2 4].

Since det(A) = 3 and det(S) = 4/3, it follows that

det(Q) = det(A) det(S) = 3 · (4/3) = 4.

Finally, as an application of the Schur complement that is a bit more


relevant to our immediate interests in this chapter, we now show that its positive
(semi)definiteness completely determines positive (semi)definiteness of the full
block matrix as well.

Theorem 2.B.4 (Positive (Semi)definiteness of 2 × 2 Block Matrices). Suppose A ∈ Mn is invertible and S = C − B∗A−1B is the Schur complement of A in the self-adjoint 2 × 2 block matrix Q = [A B; B∗ C]. Then Q is positive (semi)definite if and only if A and S are positive (semi)definite.

Proof. We notice that in this case, the decomposition of Theorem 2.B.1 simpli-
fies slightly to Q = P∗ DP, where
" # " #
I A−1 B A O
P= and D = .
O I O S

It follows from Exercise 2.2.13 that D is positive (semi)definite if and only if


A and S both are. It also follows from Theorem 2.2.3(d) that if D is positive
(semi)definite then so is Q.
On the other hand, we know that P is invertible since all of its eigenvalues
equal 1 (and in particular are thus all non-zero), so we can rearrange the above
decomposition of Q into the form

(P−1 )∗ Q(P−1 ) = D.

By the same logic as above, if Q is positive (semi)definite then so is D, which


is equivalent to A and S being positive (semi)definite. 
For example, if we return to the matrix Q from Example 2.B.2, it is straight-
forward to check that both A and S are positive definite (via Theorem 2.2.7, for
example), so Q is positive definite too.

2.B.2 The Cholesky Decomposition


Recall that one of the central questions that we asked in Section 2.2.3 was
how “simple” we can make the matrix B ∈ Mm,n in a positive semidefinite
decomposition of a matrix A = B∗ B ∈ Mn . One possible answer to this question
was provided by the principal square root (Theorem 2.2.11), which says that
we can always choose B to be positive semidefinite (as long as m = n so that
positive semidefiniteness is a concept that makes sense). We now make use of
Schur complements to show that, alternatively, we can always make B upper
triangular:

Theorem 2.B.5 (Cholesky Decomposition). Suppose F = R or F = C, and A ∈ Mn(F) is positive semidefinite with m = rank(A). There exists a unique matrix T ∈ Mm,n in row echelon form with real strictly positive leading entries such that

A = T∗T.

[Side note: A “leading entry” is the first non-zero entry in a row.]

Before proving this theorem, we recall that T being in row echelon form implies that it is upper triangular, but is actually a slightly stronger requirement than just upper triangularity. For example, the matrices

[0 1; 0 0]   and   [0 1; 0 1]

are both upper triangular, but only the one on the left is in row echelon form.
We also note that the choice of m = rank(A) in this theorem is optimal in some
sense:
[Side note: If A is positive definite then T is square, upper triangular, and its diagonal entries are strictly positive.]
• If m < rank(A) then no such decomposition of A is possible (even if we ignore the upper triangular requirement) since if T ∈ Mm,n then rank(A) = rank(T∗T) = rank(T) ≤ m.
• If m > rank(A) then decompositions of this type exist (for example, we can just pad the matrix T from the m = rank(A) case with extra rows of zeros at the bottom), but they are no longer unique—see Remark 2.B.1.

Proof of Theorem 2.B.5. We prove the result by induction on n (the size of A).
For the base case, the result is clearly true if n = 1 since we can choose T = [√a], which is an upper triangular 1 × 1 matrix with a non-negative

diagonal entry. For the inductive step, suppose that every (n − 1) × (n − 1)


positive semidefinite matrix has a Cholesky decomposition—we want to show
that if A ∈ Mn is positive semidefinite then it has one too. We split into two
cases:
[Side note: Case 1 can only happen if A is positive semidefinite but not positive definite.]

Case 1: a1,1 = 0. We know from Exercise 2.2.11 that the entire first row and column of A must equal 0, so we can write A as the block matrix

A = [0 0T; 0 A2,2],

where A2,2 ∈ Mn−1 is positive semidefinite and has rank(A2,2 ) = rank(A). By


the inductive hypothesis, A2,2 has a Cholesky decomposition A2,2 = T ∗ T , so
" #
0 0T  ∗  
A= = 0|T 0|T
0 A2,2

is a Cholesky decomposition of A.
Case 2: a1,1 ≠ 0. We can write A as the block matrix

A = [a1,1 a∗2,1; a2,1 A2,2],

where A2,2 ∈ Mn−1 and a2,1 ∈ Fn−1 is a column vector. By applying Theorem 2.B.1, we see that if S = A2,2 − a2,1a∗2,1/a1,1 is the Schur complement of a1,1 in A then we can decompose A in the form

A = [1 0T; a2,1/a1,1 I] [a1,1 0T; 0 S] [1 a∗2,1/a1,1; 0 I].

[Side note: The Schur complement exists because a1,1 ≠ 0 in this case, so a1,1 is invertible.]
Since A is positive semidefinite, it follows that S is positive semidefinite too,


so it has a Cholesky decomposition S = T ∗ T by the inductive hypothesis.
Furthermore, since a1,1 ≠ 0 we conclude that rank(A) = rank(S) + 1, and we see that

A = [1 0T; a2,1/a1,1 I] [√a1,1√a1,1 0T; 0 T∗T] [1 a∗2,1/a1,1; 0 I]
  = [√a1,1 a∗2,1/√a1,1; 0 T]∗ [√a1,1 a∗2,1/√a1,1; 0 T],

is a Cholesky decomposition of A. This completes the inductive step and the


proof of the fact that every positive semidefinite matrix has a Cholesky decomposition. We leave the
proof of uniqueness to Exercise 2.B.12. 
[Side note: This argument can be reversed (to derive the QR decomposition from the Cholesky decomposition) via Theorem 2.2.10.]

It is worth noting that the Cholesky decomposition is essentially equivalent to the QR decomposition of Section 1.C, which said that every matrix B ∈ Mm,n can be written in the form B = UT, where U ∈ Mm is unitary and T ∈ Mm,n is upper triangular with non-negative real entries on its diagonal. Indeed, if A is positive semidefinite then we can use the QR decomposition to write
A = B∗ B = (UT )∗ (UT ) = T ∗U ∗UT = T ∗ T,

which is basically a Cholesky decomposition of A.



 
Example 2.B.3 (Finding a Cholesky Decomposition). Find the Cholesky decomposition of the matrix

A = [4 −2 2; −2 2 −2; 2 −2 3].

Solution:
To construct A’s Cholesky decomposition, we mimic the proof of Theorem 2.B.5. We start by writing A as a block matrix with 1 × 1 top-left block:

A = [a1,1 a∗2,1; a2,1 A2,2],   where   a1,1 = 4,   a2,1 = [−2; 2],   A2,2 = [2 −2; −2 3].

The Schur complement of a1,1 in A is then

S = A2,2 − a2,1 a∗2,1 /a1,1


     
  = [2 −2; −2 3] − (1/4) [−2; 2] [−2 2] = [1 −1; −1 2].

It follows from the proof of Theorem 2.B.5 that the Cholesky decomposition of A is

A = [√a1,1 a∗2,1/√a1,1; 0 T]∗ [√a1,1 a∗2,1/√a1,1; 0 T]
  = [2 −1 1; 0 T]∗ [2 −1 1; 0 T],                                (2.B.1)

[Side note: At this point, we have done one step of the induction in the proof that A has a Cholesky decomposition. The next step is to apply the same procedure to S, and so on.]

where S = T ∗ T is a Cholesky decomposition of S.


This is a step in the right direction—we now know the top row of the
triangular matrix in the Cholesky decomposition of A. To proceed from
here, we perform the same procedure on the Schur complement S—we
write S as a block matrix with 1 × 1 top-left block:

S = [s1,1 s∗2,1; s2,1 S2,2],   where   s1,1 = 1,   s2,1 = [−1],   S2,2 = [2].

[Side note: If A was even larger, we would just repeat this procedure. Each iteration finds us one more row in its Cholesky decomposition.]
The Schur complement of s1,1 in S is then

S2,2 − s2,1 s∗2,1 /s1,1 = 2 − (−1)2 = 1.

It follows from the proof of Theorem 2.B.5 that a Cholesky decomposition of S is

S = [√s1,1 s∗2,1/√s1,1; 0 √S2,2]∗ [√s1,1 s∗2,1/√s1,1; 0 √S2,2] = [1 −1; 0 1]∗ [1 −1; 0 1].

[Side note: The term √S2,2 here comes from the fact that √S2,2 √S2,2 is a Cholesky decomposition of the 1 × 1 matrix S2,2.]

If we plug this decomposition of S into Equation (2.B.1), we get the


following Cholesky decomposition of A:
 ∗  
2 −1 1 2 −1 1

A= 0 1 −1  0 1 −1 .
0 0 1 0 0 1
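The recursion used in this example (peel off the top row, then recurse on the Schur complement) is short enough to code directly. Here is a minimal sketch in Python with NumPy; it assumes A is positive definite so that every pivot a1,1 encountered is strictly positive, and the helper name cholesky_T is ours:

```python
import numpy as np

def cholesky_T(A):
    # Return an upper triangular T with A = T^* T for a positive definite A,
    # recursing on the Schur complement as in the proof of Theorem 2.B.5.
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.sqrt(A)
    a11 = A[0, 0]
    a21 = A[1:, :1]                          # first column below the (1,1) entry
    S = A[1:, 1:] - a21 @ a21.T / a11        # Schur complement of a11 in A
    top = np.hstack([np.array([[np.sqrt(a11)]]), a21.T / np.sqrt(a11)])
    bot = np.hstack([np.zeros((n - 1, 1)), cholesky_T(S)])
    return np.vstack([top, bot])

A = [[4, -2, 2], [-2, 2, -2], [2, -2, 3]]
print(cholesky_T(A))                         # [[2, -1, 1], [0, 1, -1], [0, 0, 1]]
```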

Remark 2.B.1 ((Non-)Uniqueness of the Cholesky Decomposition). It is worth emphasizing that decompositions of the form A = T∗T are no longer necessarily unique if we only require that T be upper triangular (rather than in row echelon form) or if m > rank(A). One reason for this is that, in Case 1 of the proof of Theorem 2.B.5, the matrix [0 | T] may be upper triangular even if T is not. Furthermore, instead of writing

A = [0 0T; 0 A2,2] = [0 | T1]∗ [0 | T1]

where A2,2 = T1∗T1 is a Cholesky decomposition of A2,2, we could instead write

[0 0T; 0 A2,2] = [0 x∗; 0 T2]∗ [0 x∗; 0 T2],

where A2,2 = xx∗ + T2∗T2. We could thus just choose x ∈ Fn−1 small enough so that A2,2 − xx∗ is positive semidefinite and thus has a Cholesky decomposition A2,2 − xx∗ = T2∗T2.

[Side note: The Cholesky decomposition is a special case of the LU decomposition A = LU. If A is positive semidefinite, we can choose L = U∗.]

For example, it is straightforward to verify that if

A = [0 0 0; 0 2 3; 0 3 5]

then A = T1∗T1 = T2∗T2 = T3∗T3, where

T1 = [0 √2 3/√2; 0 0 1/√2],   T2 = [0 1 1; 0 1 2],   T3 = [0 √2 3/√2; 0 0 1/√2; 0 0 0].

However, only the decomposition involving T1 is a valid Cholesky decomposition, since T2 is not in row echelon form (despite being upper triangular) and T3 has 3 = m > rank(A) = 2 rows.

Exercises solutions to starred exercises on page 474

2.B.1 Use the Schur complement to help you solve the


following linear system by only ever doing computations
with 2 × 2 matrices:
2w + x + y + z = −3
w − x + 2y − z = 3
−w + x + 2y + z = 1
3w + y + 2z = 2

2.B.2 Compute the Schur complement of the top-left 2 × 2 block in each of the following matrices, and use it to compute the determinant of the given matrix and determine whether or not it is positive (semi)definite.
∗(a) [2 1 1 0; 1 3 1 1; 1 1 2 0; 0 1 0 1]
(b) [3 2 0 −1; 2 2 1 0; 0 1 2 0; −1 0 0 3]

2.B.3 Determine which of the following statements are true and which are false.
∗(a) The Schur complement of the top-left 2 × 2 block of a 5 × 5 matrix is a 3 × 3 matrix.
(b) If a matrix is positive definite then so are the Schur complements of any of its top-left blocks.
∗(c) Every matrix has a Cholesky decomposition.

∗2.B.4 Find infinitely many different decompositions of the matrix
A = [0 0 0; 0 1 1; 0 1 2]
of the form A = T∗T, where T ∈ M3 is upper triangular.

2.B.5 Suppose X ∈ Mm,n and c ∈ R is a scalar. Use the Schur complement to show that the block matrix
Q = [cIm X; X∗ cIn]
is positive semidefinite if and only if ‖X‖ ≤ c.
[Side note: You were asked to prove this directly in Exercise 2.3.15.]

2.B.6 This exercise shows that it is not possible to determine the eigenvalues of a 2 × 2 block matrix from its A block and its Schur complement. Let
Q = [1 2; 2 1]   and   R = [1 3; 3 6].
(a) Compute the Schur complement of the top-left 1 × 1 block in each of Q and R.
[Side note: They are the same.]
(b) Compute the eigenvalues of each of Q and R.
[Side note: They are different.]

∗2.B.7 Suppose A ∈ Mn is invertible and S = D − CA−1B is the Schur complement of A in the 2 × 2 block matrix
Q = [A B; C D].
Show that rank(Q) = rank(A) + rank(S).

∗∗2.B.8 Consider the 2 × 2 block matrix
Q = [A B; C D].
In this exercise, we show how to come up with block matrix formulas for Q if the D ∈ Mn block is invertible (rather than the A block). Suppose that D is invertible and let S = A − BD−1C (S is called the Schur complement of D in Q).
(a) Show how to write
Q = U [S O; O D] L,
where U is an upper triangular block matrix with ones on its diagonal and L is a lower triangular block matrix with ones on its diagonal.
(b) Show that det(Q) = det(D) det(S).
(c) Show that Q is invertible if and only if S is invertible, and find a formula for its inverse.
(d) Show that Q is positive (semi)definite if and only if S is positive (semi)definite.

2.B.9 Suppose A ∈ Mm,n and B ∈ Mn,m. Show that
det(Im + AB) = det(In + BA).
[Side note: This is called Sylvester’s determinant identity.]
[Hint: Compute the determinant of the block matrix
[Im −A; B In]
using both Schur complements (see Exercise 2.B.8).]

2.B.10 Suppose A ∈ Mm,n and B ∈ Mn,m.
(a) Use the result of Exercise 2.B.9 to show that Im + AB is invertible if and only if In + BA is invertible.
(b) Find a formula for (Im + AB)−1 in terms of A, B, and (In + BA)−1. [Hint: Try using Schur complements just like in Exercise 2.B.9.]

∗∗2.B.11 Suppose A ∈ Mm,n, B ∈ Mn,m, and m ≥ n. Use the result of Exercise 2.B.9 to show that the characteristic polynomials of AB and BA satisfy
pAB(λ) = (−λ)m−n pBA(λ).
[Side note: In other words, AB and BA have the same eigenvalues, counting algebraic multiplicity, but with AB having m − n extra zero eigenvalues.]

∗∗2.B.12 Show that the Cholesky decomposition described by Theorem 2.B.5 is unique.
[Hint: Follow along with the given proof of that theorem and show inductively that if Cholesky decompositions in Mn−1 are unique then they are also unique in Mn.]

2.C Extra Topic: Applications of the SVD

In this section, we explore two particularly useful and interesting applications


of the singular value decomposition (Theorem 2.3.1).

2.C.1 The Pseudoinverse and Least Squares


Recall that if a matrix A ∈ Mn is invertible then the linear system Ax = b has
unique solution x = A−1 b. However, that linear system might have a solution
even if A is not invertible (or even square). For example, the linear system
    
[1 2 3; −1 0 1; 3 2 1] [x; y; z] = [6; 0; 6]                     (2.C.1)

has infinitely many solutions, like (x, y, z) = (1, 1, 1) and (x, y, z) = (2, −1, 2),
even though its coefficient matrix has rank 2 (which we showed in Exam-
ple 2.3.2) and is thus not invertible.
With this example in mind, it seems natural to ask whether or not there
exists a matrix A† with the property that we can find a solution to the linear
system Ax = b (when it exists, but even if A is not invertible) by setting x = A† b.

Definition 2.C.1 (Pseudoinverse of a Matrix). Suppose F = R or F = C, and A ∈ Mm,n(F) has orthogonal rank-one sum decomposition

A = ∑_{j=1}^r σj uj vj∗.

Then the pseudoinverse of A, denoted by A† ∈ Mn,m, is the matrix

A† =def ∑_{j=1}^r (1/σj) vj uj∗.

[Side note: The orthogonal rank-one sum decomposition was introduced in Theorem 2.3.3.]
j=1

Equivalently, if A has singular value decomposition A = UΣV ∗ then its


pseudoinverse is the matrix A† = V Σ†U ∗ , where Σ† ∈ Mn,m is the diagonal
matrix whose non-zero entries are the reciprocals of the non-zero entries of Σ
(and its zero entries are unchanged). It is straightforward to see that (A† )† = A
for all A ∈ Mm,n . Furthermore, if A is square and all of its singular values are
non-zero (i.e., it is invertible) then Σ† = Σ−1, so

A†A = VΣ†U∗UΣV∗ = VΣ†ΣV∗ = VV∗ = I.

That is, we have proved that the pseudoinverse really does generalize the inverse:

!   If A ∈ Mn is invertible then A† = A−1.

[Side note: Be careful: some other books (particularly physics books) use A† to mean the conjugate transpose of A instead of its pseudoinverse.]

[Side note: Recall that singular values are unique, but singular vectors are not.]

The advantage of the pseudoinverse over the regular inverse is that every matrix has one. Before we can properly explore the pseudoinverse and see what we can do with it though, we have to prove that it is well-defined. That is, we
have to show that no matter which orthogonal rank-one sum decomposition

(i.e., no matter which singular value decomposition) of A ∈ Mm,n we use,


the formula provided by Definition 2.C.1 results in the same matrix A† . The
following theorem provides us with a first step in this direction.

Theorem 2.C.1 (The Pseudoinverse and Fundamental Subspaces). Suppose A ∈ Mm,n has pseudoinverse A†. Then
a) AA† is the orthogonal projection onto range(A),
b) I − AA† is the orthogonal projection onto null(A∗) = null(A†),
c) A†A is the orthogonal projection onto range(A∗) = range(A†), and
d) I − A†A is the orthogonal projection onto null(A).

Proof. We start by writing A in its orthogonal rank-one sum decomposition


A = ∑_{j=1}^r σj uj vj∗,   so   A† = ∑_{j=1}^r (1/σj) vj uj∗.

To see why part (a) of the theorem is true, we multiply A by A† to get

AA† = (∑_{i=1}^r σi ui vi∗) (∑_{j=1}^r (1/σj) vj uj∗)
    = ∑_{i,j=1}^r (σi/σj) ui (vi∗vj) uj∗        (product of two sums is a double sum)
    = ∑_{j=1}^r (σj/σj) uj uj∗                   (vi∗vj = 1 if i = j, vi∗vj = 0 otherwise)
    = ∑_{j=1}^r uj uj∗.                          (σj/σj = 1)

[Side note: The third equality here is the tricky one—the double sum collapses into a single sum because all of the terms with i ≠ j equal 0.]

The fact that this is the orthogonal projection onto range(A) follows from
the fact that {u1 , u2 , . . . , ur } forms an orthonormal basis of range(A) (Theo-
rem 2.3.2(a)), together with Theorem 1.4.10.
The fact that A† A is the orthogonal projection onto range(A∗ ) follows from
computing A† A = ∑rj=1 v j v∗j in a manner similar to above, and then recalling
from Theorem 2.3.2(c) that {v1 , v2 , . . . , vr } forms an orthonormal basis of
range(A∗ ). On the other hand, the fact that A† A is the orthogonal projection
onto range(A† ) (and thus range(A∗ ) = range(A† )) follows from swapping the
roles of A and A† (and using the fact that (A† )† = A) in part (a).
The proof of parts (b) and (d) of the theorem are all almost identical, so we
leave them to Exercise 2.C.7. 
We now show the converse of the above theorem—the pseudoinverse A† is
the only matrix with the property that A† A and AA† are the claimed orthogonal
projections. In particular, this shows that the pseudoinverse is well-defined and
does not depend on which orthogonal rank-one sum decomposition of A was
used to construct it—each one of them results in a matrix with the properties of
Theorem 2.C.1, and there is only one such matrix.

Theorem 2.C.2 (Well-Definedness of the Pseudoinverse). If A ∈ Mm,n and B ∈ Mn,m are such that AB is the orthogonal projection onto range(A) and BA is the orthogonal projection onto range(B) = range(A∗) then B = A†.

Proof. We know from Theorem 2.C.1 that AA† is the orthogonal projection onto range(A), and we know from Theorem 1.4.10 that orthogonal projections are uniquely determined by their range, so AB = AA†. A similar argument (making
use of the fact that range(A† ) = range(A∗ ) = range(B)) shows that BA = A† A.
Since projections leave everything in their range unchanged, we conclude
that (BA)Bx = Bx for all x ∈ Fn , so BAB = B, and a similar argument shows
that A† AA† = A† . Putting these facts together shows that

B = BAB = (BA)B = (A† A)B = A† (AB) = A† (AA† ) = A† AA† = A† . 

 
Example 2.C.1 (Computing a Pseudoinverse). Compute the pseudoinverse of the matrix

A = [1 2 3; −1 0 1; 3 2 1].
Solution:
We already saw in Example 2.3.2 that the singular value decomposition
of this matrix is A = UΣV ∗ , where
√ √  √ √ 
3 2 1 2 − 3 −1
1  √  1 √ 
U=√   0 2 −2  , V=√   2 0 2
,
6 √ √ 6 √ √
3 − 2 −1 2 3 −1
 √ 
2 6 0 0
 √ 
Σ= 0 6 0 .
0 0 0

[Side note: The pseudoinverse is sometimes called the Moore–Penrose pseudoinverse of A.]

It follows that

A† = VΣ†U∗
   = (1/6) [√2 −√3 −1; √2 0 2; √2 √3 −1] [1/(2√6) 0 0; 0 1/√6 0; 0 0 0] [√3 0 √3; √2 √2 −√2; 1 −2 −1]
   = (1/12) [−1 −2 3; 1 0 1; 3 2 −1].
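Numerically, the pseudoinverse can be obtained from np.linalg.pinv, which NumPy computes via the singular value decomposition. The following sketch (Python with NumPy, our own choice of tools) reproduces the matrix above and checks the identities AA†A = A and A†AA† = A† that appeared in the proof of Theorem 2.C.2:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [-1, 0, 1],
              [3, 2, 1]], dtype=float)

A_dag = np.linalg.pinv(A)
print(np.round(12 * A_dag))        # [[-1, -2, 3], [1, 0, 1], [3, 2, -1]]

print(np.allclose(A @ A_dag @ A, A),          # A A† A = A
      np.allclose(A_dag @ A @ A_dag, A_dag))  # A† A A† = A†
```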

 
Example 2.C.2 (Computing a Rectangular Pseudoinverse). Compute the pseudoinverse of the matrix

A = [1 1 1 −1; 0 1 1 0; −1 1 1 1].

Solution:
This is the same matrix from Example 2.3.3, which has singular value

decomposition A = UΣV ∗ , where


√ √  √ 
2 − 3 −1 6 0 0 0
1 √   
U=√  2 0 2
 , Σ =  0 2 0 0 and
6 √ √
2 3 −1 0 0 0 0
 
0 −1 0 1
1 1 0 1 0

V=√  .
2 1 0 −1 0
0 1 0 1

It follows that

A† = VΣ†U∗
   = (1/√12) [0 −1 0 1; 1 0 1 0; 1 0 −1 0; 0 1 0 1] [1/√6 0 0; 0 1/2 0; 0 0 0; 0 0 0] [√2 √2 √2; −√3 0 √3; −1 2 −1]
   = (1/12) [3 0 −3; 2 2 2; 2 2 2; −3 0 3].

[Side note: Verify this matrix multiplication on your own. It builds character.]

Solving Linear Systems


Now that we know how to construct the pseudoinverse of a matrix, we return to
the linear system Ax = b from Equation (2.C.1). If we (very naïvely for now)
try to use the pseudoinverse to solve this linear system by setting x = A† b, then
we get

x = A† [6; 0; 6] = (1/12) [−1 −2 3; 1 0 1; 3 2 −1] [6; 0; 6] = [1; 1; 1].

[Side note: We computed this pseudoinverse A† in Example 2.C.1.]
This is indeed a solution of the original linear system, as we might hope. The
following theorem shows that this is always the case—if a linear system has a
solution, then the pseudoinverse finds one. Furthermore, if there are multiple
solutions to the linear system, it finds the “best” one:

Theorem 2.C.3 (Pseudoinverses Solve Linear Systems). Suppose F = R or F = C, and A ∈ Mm,n(F) and b ∈ Fm are such that the linear system Ax = b has at least one solution. Then x = A†b is a solution, and furthermore if y ∈ Fn is any other solution then ‖A†b‖ < ‖y‖.
Proof. The linear system Ax = b has a solution if and only if b ∈ range(A). To
see that x = A† b is a solution in this case, we simply notice that
Ax = A(A† b) = (AA† )b = b,
since AA† is the orthogonal projection onto range(A), by Theorem 2.C.1(a).
To see that ‖A†b‖ ≤ ‖y‖ for all solutions y of the linear system (i.e., all y
such that Ay = b), we note that
A† b = A† (Ay) = (A† A)y.

Since A† A is an orthogonal projection, it follows from the fact that orthogonal


projections cannot increase the norm of a vector (Theorem 1.4.11) that ‖A†b‖ = ‖(A†A)y‖ ≤ ‖y‖, and furthermore equality holds if and only if (A†A)y = y (i.e., if and only if y = A†b). ∎
To get a rough idea for why it’s desirable to find the solution with smallest
norm, as in the previous theorem, we again return to the linear system
    
[1 2 3; −1 0 1; 3 2 1] [x; y; z] = [6; 0; 6]

that we originally introduced in Equation (2.C.1). The solution set of this linear
system consists of the vectors of the form (0, 3, 0) + z(1, −2, 1), where z is a
free variable. This set contains some vectors that are hideous (e.g., choosing
z = 341 gives the solution (x, y, z) = (341, −679, 341)), but it also contains
some vectors that are not so hideous (e.g., choosing z = 1 gives the solution
(x, y, z) = (1, 1, 1), which is the solution found by the pseudoinverse). The
guarantee that the pseudoinverse finds the smallest-norm solution means that
we do not have to worry about it returning “large and ugly” solutions like
(341, −679, 341). Geometrically, it means that it finds the solution closest to
the origin (see Figure 2.13).

[Figure: the solution line (0, 3, 0) + z(1, −2, 1) in R3, with the point (1, 1, 1) marked as the closest point on the line to the origin.]
Figure 2.13: Every point on the line (0, 3, 0) + z(1, −2, 1) is a solution of the linear
system from Equation (2.C.1). The pseudoinverse finds the solution (x, y, z) = (1, 1, 1),
which is the point on that line closest to the origin.
[Side note: Think of Ax − b as the error in our solution x: it is the difference between what Ax actually is and what we want it to be (b).]

Not only does the pseudoinverse find the “best” solution when a solution exists, it even finds the “best” non-solution when no solution exists. To make sense of this statement, we again think in terms of norms and distances—if no solution to a linear system Ax = b exists, then it seems reasonable that the “next best thing” to a solution would be the vector that makes Ax as close to b as possible. In other words, we want to find the vector x that minimizes ‖Ax − b‖.
The following theorem shows that choosing x = A† b also solves this problem:

Theorem 2.C.4 (Linear Least Squares). Suppose F = R or F = C, and A ∈ Mm,n(F) and b ∈ Fm. If x = A†b then

‖Ax − b‖ ≤ ‖Ay − b‖   for all y ∈ Fn.

[Side note: Here, Ay is just an arbitrary element of range(A).]

Proof. We know from Theorem 1.4.11 that the closest point to b in range(A) is Pb, where P is the orthogonal projection onto range(A). Well,

Theorem 2.C.1(a) tells us that P = AA† , so

‖Ax − b‖ = ‖(AA†)b − b‖ ≤ ‖Ay − b‖   for all y ∈ Fn,

as desired. 
This method of finding the closest thing to a solution of a linear system is
called linear least squares, and it is particularly useful when trying to fit data
to a model, such as finding a line (or plane, or curve) of best fit for a set of data.
For example, suppose we have many data points (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ),
and we want to find the line that best describes the relationship between x and
y as in Figure 2.14.

Figure 2.14: The line of best fit for a set of points is the line that minimizes the sum of
squares of vertical displacements of the points from the line (highlighted here in
orange).
To find this line, we consider the “ideal” scenario—we try (and typically
fail) to find a line that passes exactly through all n data points by setting up the
corresponding linear system:

[Side note: Keep in mind that x1, . . . , xn and y1, . . . , yn are given to us—the variables that we are trying to solve for in this linear system are m and b.]

y1 = mx1 + b
y2 = mx2 + b
      ⋮
yn = mxn + b.
Since this linear system has n equations, but only 2 variables (m and b), we
do not expect to find an exact solution, but we can find the closest thing to a
solution by using the pseudoinverse, as in the following example.

Example 2.C.3 (Finding a Line of Best Fit). Find the line of best fit for the points (−2, 0), (−1, 1), (0, 0), (1, 2), and (2, 2).
Solution:
To find the line of best fit, we set up the system of linear equations that
we would like to solve. Ideally, we would like to find a line y = mx + b
that goes through all 5 data points. That is, we want to find m and b such
that

0 = −2m + b, 1 = −1m + b, 0 = 0m + b,
2 = 1m + b, 2 = 2m + b.

It’s not difficult to see that this linear system has no solution, but we
can find the closest thing to a solution (i.e., the line of best fit) by using the

pseudoinverse. Specifically, we write the linear system in matrix form as


   
Ax = b,   where   A = [−2 1; −1 1; 0 1; 1 1; 2 1],   x = [m; b],   and   b = [0; 1; 0; 2; 2].
[Side note: Try computing this pseudoinverse yourself: the singular values of A are √10 and √5.]

Well, the pseudoinverse of A is

A† = (1/10) [−2 −1 0 1 2; 2 2 2 2 2],

so the coefficients of the line of best fit are given by

x = [m; b] = A†b = (1/10) [−2 −1 0 1 2; 2 2 2 2 2] [0; 1; 0; 2; 2] = [1/2; 1].
2
In other words, the line of best fit is y = x/2 + 1, as shown below:
[Figure: the five data points (−2, 0), (−1, 1), (0, 0), (1, 2), and (2, 2), together with the line of best fit y = x/2 + 1.]
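For the record, the same line of best fit can be found numerically in a couple of lines (Python with NumPy, our own choice of tools), by building the matrix A above and applying the pseudoinverse to b:

```python
import numpy as np

xs = np.array([-2, -1, 0, 1, 2], dtype=float)
ys = np.array([0, 1, 0, 2, 2], dtype=float)

A = np.column_stack([xs, np.ones_like(xs)])   # each row is (x_i, 1)
m, b = np.linalg.pinv(A) @ ys                 # x = A† b
print(m, b)                                   # 0.5 1.0, i.e. y = x/2 + 1
```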

This exact same method also works for finding the “plane of best fit”
for data points (x1 , y1 , z1 ), (x2 , y2 , z2 ), . . . , (xn , yn , zn ), and so on for higher-
dimensional data as well (see Exercise 2.C.5). We can even do things like
find quadratics of best fit, exponentials of best fit, or other weird functions of
best fit (see Exercise 2.C.6).
By putting together all of the results of this section, we see that the pseu-
doinverse gives the “best solution” to a system of linear equations Ax = b in
all cases:
• If the system has a unique solution, it is x = A† b.
• If the system has infinitely many solutions, then x = A† b is the smallest
solution—it minimizes ‖x‖.
• If the system has no solutions, then x = A†b is the closest thing to a solution—it minimizes ‖Ax − b‖.

2.C.2 Low-Rank Approximation


As one final application of the singular value decomposition, we consider the
problem of approximating a matrix by another matrix with small rank. One of

the primary reasons why we would do this is that it allows us to compress the
data that is represented by a matrix, since a full (real) n × n matrix requires us
to store n2 real numbers, but a rank-k matrix only requires us to store 2kn + k
real numbers. To see this, note that we can store a low-rank matrix via its
orthogonal rank-one sum decomposition
A = ∑_{j=1}^k σj uj vj∗,

which consists of 2k vectors with n entries each, as well as k singular values.


[Side note: In fact, it suffices to store just 2kn − k real numbers and k signs (which each require only a single bit), since each uj and vj has norm 1.]

Since 2kn + k is much smaller than n2 when k is small, it is much less resource-intensive to store low-rank matrices than general matrices. This observation is useful in practice—instead of storing the exact matrix A that contains our data of interest, we can sometimes find a nearby matrix with small rank and store that instead.
has norm 1.
To actually find a nearby low-rank matrix, we use the following theorem,
which says that the singular value decomposition tells us what to do. Specifi-
cally, the closest rank-k matrix to a given matrix A is the one that is obtained
by replacing all except for the k largest singular values of A by 0.

Theorem 2.C.5 (Eckart–Young–Mirsky). Suppose F = R or F = C, and A ∈ Mm,n(F) has singular values σ1 ≥ σ2 ≥ · · · ≥ σr > 0. If 1 ≤ k ≤ r and A has orthogonal rank-one sum decomposition

A = ∑_{j=1}^r σj uj vj∗,   then we define   Ak = ∑_{j=1}^k σj uj vj∗.

Then ‖A − Ak‖ ≤ ‖A − B‖ for all B ∈ Mm,n(F) with rank(B) = k.

[Side note: In particular, notice that rank(Ak) = k, so Ak is the closest rank-k matrix to A.]

The Eckart–Young–Mirsky theorem says that the best way to approximate a
high-rank matrix by a low-rank one is to discard the pieces of its singular value
decomposition corresponding to its smallest singular values. In other words, the
singular value decomposition organizes a matrix into its “most important” and
“least important” pieces—the largest singular values (and their corresponding
singular vectors) describe the broad strokes of the matrix, while the smallest
singular values (and their corresponding singular vectors) just fill in its fine
details.

Proof of Theorem 2.C.5. Pick any matrix B ∈ Mm,n with rank(B) = k, which
necessarily has (n − k)-dimensional null space by the rank-nullity theorem
(Theorem A.1.2(e)). Also, consider the vector space Vk+1 = span{v1 , v2 , . . . ,
vk+1 }, which is (k + 1)-dimensional. Since (n − k) + (k + 1) = n + 1, we know
from Exercise 1.5.2(b) that null(B) ∩ Vk+1 is at least 1-dimensional, so there
exists a unit vector w ∈ null(B) ∩ Vk+1 . Then we have
‖A − B‖ ≥ ‖(A − B)w‖                          (since ‖w‖ = 1)
        = ‖Aw‖                                 (since w ∈ null(B))
        = ‖(∑_{j=1}^r σj uj vj∗) w‖            (orthogonal rank-one sum decomp. of A)
        = ‖∑_{j=1}^{k+1} σj (vj∗w) uj‖.        (w ∈ Vk+1, so vj∗w = 0 when j > k + 1)

At this point, we note that Theorem 1.4.5 tells us that (v∗1 w, . . . , v∗k+1 w) is the
coefficient vector of w in the basis {v1 , v2 , . . . , vk+1 } of Vk+1 . This then implies,
via Corollary 1.4.4, that ‖w‖2 = ∑_{j=1}^{k+1} |vj∗w|2, and similarly that

‖∑_{j=1}^{k+1} σj (vj∗w) uj‖2 = ∑_{j=1}^{k+1} σj2 |vj∗w|2.

With this observation in hand, we now continue the chain of inequalities from
above:
‖A − B‖ ≥ √(∑_{j=1}^{k+1} σj2 |vj∗w|2)        (since {u1, . . . , uk+1} is an orthonormal set)
        ≥ σk+1 √(∑_{j=1}^{k+1} |vj∗w|2)        (since σj ≥ σk+1)
        = σk+1                                  (since ∑_{j=1}^{k+1} |vj∗w|2 = ‖w‖2 = 1)
        = ‖A − Ak‖,                             (since A − Ak = ∑_{j=k+1}^r σj uj vj∗)

as desired. 

Example 2.C.4 (Closest Rank-1 Approximation). Find the closest rank-1 approximation to

A = [1 2 3; −1 0 1; 3 2 1].

That is, find the matrix B with rank(B) = 1 that minimizes ‖A − B‖.
Solution:
Recall from Example 2.3.2 that the singular value decomposition of
this matrix is A = UΣV ∗ , where
√ √  √ √ 
3 2 1 2 − 3 −1
1  √  
 , V = √1 √2

U=√   0 2 −2   0 2,
6 √ √ 6 √ √
3 − 2 −1 2 3 −1
 √ 
2 6 0 0
 √ 
Σ= 0 6 0 .
0 0 0

To get the closest rank-1 approximation of A, we simply replace all except


for the largest singular value by 0, and then multiply the singular value
decomposition back together. This procedure gives us the following closest
rank-1 matrix B:

B = U [2√6 0 0; 0 0 0; 0 0 0] V∗
  = (1/6) [√3 √2 1; 0 √2 −2; √3 −√2 −1] [2√6 0 0; 0 0 0; 0 0 0] [√2 √2 √2; −√3 0 √3; −1 2 −1]
  = [2 2 2; 0 0 0; 2 2 2].

[Side note: As always, multiplying matrices like these together is super fun.]
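This truncate-the-SVD recipe is easy to carry out numerically. The following sketch (Python with NumPy; the helper name best_rank_k is ours) keeps only the k largest singular values and reproduces the matrix B above for k = 1:

```python
import numpy as np

def best_rank_k(A, k):
    # Closest rank-k approximation in the sense of Theorem 2.C.5:
    # keep only the k largest singular values of A.
    U, s, Vh = np.linalg.svd(A)
    return (U[:, :k] * s[:k]) @ Vh[:k, :]

A = np.array([[1, 2, 3], [-1, 0, 1], [3, 2, 1]], dtype=float)
print(np.round(best_rank_k(A, 1)))   # [[2, 2, 2], [0, 0, 0], [2, 2, 2]]
```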

We can use this method to compress pretty much any information that we
can represent with a matrix, but it works best when there is some correlation
between the entries in the rows and columns of the matrix (e.g., this method
does not help much if we just place inherently 1-dimensional data like a text file
into a matrix of some arbitrary shape). For example, we can use it to compress
black-and-white images by representing the brightness of each pixel in the
image by a number, arranging those numbers in a matrix of the same size and
shape as the image, and then applying the Eckart–Young–Mirsky theorem to
that matrix.
Similarly, we can compress color images by using the fact that every color
can be obtained from mixing red, green, and blue, so we can use three matrices:
one for each of those primary colors. Figure 2.15 shows the result of applying
a rank-k approximation of this type to an image for k = 1, 5, 20, and 100.

Remark 2.C.1 (Low-Rank Approximation in other Matrix Norms). It seems natural to ask how low-rank matrix approximation changes if we use a matrix norm other than the operator norm. It turns out that, for a wide variety of matrix norms (including the Frobenius norm), nothing changes at all. For example, one rank-k matrix B that minimizes ‖A − B‖F is exactly the same as the one that minimizes ‖A − B‖. That is, the closest rank-k approximation does not change at all, even if we measure “closeness” in this very different way.
More generally, low-rank approximation works the same way for every matrix norm that is unitarily-invariant (i.e., if Theorem 2.3.6 holds for a particular matrix norm, then so does the Eckart–Young–Mirsky theorem). For example, Exercise 2.3.19 shows that something called the “trace norm” is unitarily-invariant, so the Eckart–Young–Mirsky theorem still works if we replace the operator norm with it.
[Side note: This fact about unitarily-invariant norms is beyond the scope of this book—see (Mir60) for a proof.]

Exercises solutions to starred exercises on page 475

2.C.1 For each of the following linear systems Ax = b, determine whether or not it has a solution. If it does, find the smallest one (i.e., find the solution x that minimizes ‖x‖). If it doesn’t, find the closest thing to a solution (i.e., find the vector x that minimizes ‖Ax − b‖).
∗(a) A = [1 2; 3 6], b = [2; 1]
(b) A = [1 2; 3 6], b = [1; 3]
∗(c) A = [1 2 3; 2 3 4], b = [2; −1]

[Side note: The rank-1 approximation is interesting because we can actually see that it has rank 1: every row and column is a multiple of every other row and column, which is what creates the banding effect in the image. The rank-100 approximation is almost indistinguishable from the original image.]

Figure 2.15: A picture of the author’s cats that has been compressed via the Eckart–Young–Mirsky theorem. The images are (a) uncompressed, (b) use a rank-1 approximation, as well as a (c) rank-5, (d) rank-20, and a (e) rank-100 approximation. The original image is 500 × 700 with full rank 500.
   
(d) A = [1 1 1; 1 0 1; 0 1 0], b = [1; 1; 1]

2.C.2 Find the line of best fit (in the sense of Example 2.C.3) for the following collections of data points.
[Side note: Exercise 2.C.10 provides a way of solving this problem that avoids computing the pseudoinverse.]
∗(a) (1, 1), (2, 4).
(b) (−1, 4), (0, 1), (1, −1).
∗(c) (1, −1), (3, 2), (4, 7).
(d) (−1, 3), (0, 1), (1, −1), (2, 2).

2.C.3 Find the best rank-k approximations of each of the following matrices for the given value of k.
∗(a) A = [3 1; 1 3], k = 1.
(b) A = [1 −2 3; 3 2 1], k = 1.
∗(c) A = [1 0 2; 0 1 1; −2 −1 1], k = 1.
(d) A = [1 0 2; 0 1 1; −2 −1 1], k = 2.

2.C.4 Determine which of the following statements are true and which are false.
∗(a) I† = I.
(b) The function T : M3(R) → M3(R) defined by T(A) = A† (the pseudoinverse of A) is a linear transformation.
∗(c) For all A ∈ M4(C), it is the case that range(A†) = range(A).
(d) If A ∈ Mm,n(R) and b ∈ Rm are such that the linear system Ax = b has a solution then there is a unique solution vector x ∈ Rn that minimizes ‖x‖.
∗(e) For every A ∈ Mm,n(R) and b ∈ Rm there is a unique vector x ∈ Rn that minimizes ‖Ax − b‖.

∗∗2.C.5 Find the plane z = ax + by + c of best fit for the following 4 data points (x, y, z):
(0, −1, −1), (0, 0, 1), (0, 1, 3), (2, 0, 3).

∗∗2.C.6 Find the curve of the form y = c1 sin(x) + c2 cos(x) that best fits the following 3 data points (x, y):
(0, −1), (π/2, 1), (π, 0).

∗∗2.C.7 Prove parts (b) and (d) of Theorem 2.C.1.

2.C.8 Suppose F = R or F = C, and A ∈ Mm,n(F), B ∈ Mn,p(F), and C ∈ Mp,r(F). Explain how to compute ABC if we know AB, B, and BC, but not necessarily A or C themselves.
[Hint: This would be trivial if B were square and invertible.]

∗2.C.9 In this exercise, we derive explicit formulas for the pseudoinverse in some special cases.
(a) Show that if A ∈ Mm,n has linearly independent columns then A† = (A∗A)−1A∗.
(b) Show that if A ∈ Mm,n has linearly independent rows then A† = A∗(AA∗)−1.
[Side note: A∗A and AA∗ are indeed invertible in these cases. See Exercise 1.4.30, for example.]

2.C.10 Show that if x1, x2, . . ., xn are not all the same then the line of best fit y = mx + b for the points (x1, y1), (x2, y2), . . ., (xn, yn) is the unique solution of the linear system
A∗Ax = A∗b,
where
A = [x1 1; x2 1; . . .; xn 1],   b = [y1; y2; . . .; yn],   and   x = [m; b].
[Hint: Use Exercise 2.C.9.]

2.D Extra Topic: Continuity and Matrix Analysis

Much of introductory linear algebra is devoted to defining and investigating


classes of matrices that are easier to work with than general matrices. For
example, invertible matrices are easier to work with than typical matrices since we can do algebra with them like we are used to (e.g., if A is invertible then we can solve the matrix equation AX = B via X = A−1B). Similarly, diagonalizable matrices are easier to work with than typical matrices since we can easily compute powers and analytic functions of them.
[Side note: The pseudoinverse of Section 2.C.1 and the Jordan decomposition of Section 2.4 can also help us cope with the fact that some matrices are not invertible or diagonalizable.]
This section explores some techniques that can be used to treat all matrices as if they were invertible and/or diagonalizable, even though they aren’t. The rough idea is to exploit the fact that every matrix is extremely close to one that

is invertible and diagonalizable, and many linear algebraic properties do not


change much if we change the entries of a matrix by just a tiny bit.

2.D.1 Dense Sets of Matrices


One particularly useful technique from mathematical analysis is to take ad-
vantage of the fact that every continuous function f : R → R is completely
determined by how it acts on the rational numbers (in fact, we made use of
this technique when we proved Theorem 1.D.8). For example, if we know that
f(x) = sin(x) whenever x ∈ Q and we furthermore know that f is continuous, then it must be the case that f(x) = sin(x) for all x ∈ R (see Figure 2.16).

[Side note: Recall that Q denotes the set of rational numbers (i.e., numbers that can be written as a ratio p/q of two integers).]

Intuitively, this follows because f being continuous means that f(x̃) must be close to f(x) whenever x̃ ∈ R is close to x ∈ Q, and we can always find an x ∈ Q that is as close to x̃ as we like. Roughly speaking, Q has no “holes” of
x ∈ Q that is as close to x̃ as we like. Roughly speaking, Q has no “holes” of
width greater than 0 along the real number line, so defining how f behaves on
Q leaves no “wiggle room” for what its values outside of Q can be.
[Figure: the graph of f(x) = sin(x), with f(x) = sin(x) marked at points x ∈ Q and f(x̃) = sin(x̃) marked at a point x̃ ∉ Q.]

Figure 2.16: If f (x) = sin(x) for all x ∈ Q and f is continuous, then it must be the case
that f (x) = sin(x) for all x ∈ R.

Remark 2.D.1 (Defining Exponential Functions). This idea of extending a function from Q to all of R via continuity is actually how some common functions are defined. For example, what does the expression 2^x even mean if x is irrational? We build up to the answer to this question one step at a time:
• If x is a positive integer then 2^x = 2 · 2 · · · 2 (x times). If x is a negative integer then 2^x = 1/2^{−x}. If x = 0 then 2^x = 1.
• If x is rational, so we can write x = p/q for some integers p, q (q ≠ 0), then 2^x = 2^{p/q}, the qth root of 2^p.
• If x is irrational, we define 2^x by requiring that the function f(x) = 2^x be continuous. That is, we set

2^x = lim_{r→x, r rational} 2^r.

[Side note: Recall that “irrational” just means “not rational”. Some well-known irrational numbers include √2, π, and e.]

For example, a number like 2^π is defined as the limit of the sequence of numbers 2^3, 2^3.1, 2^3.14, 2^3.141, . . ., where we take better and better decimal
approximations of π (which are rational) in the exponent.

The property of Q that is required to make this sort of argument work is


the fact that it is dense in R: every x ∈ R can be written as a limit of rational

numbers. We now define the analogous property for sets of matrices.

Definition 2.D.1 (Dense Set of Matrices). Suppose F = R or F = C. A set of matrices B ⊆ Mm,n(F) is called dense in Mm,n(F) if, for every matrix A ∈ Mm,n(F), there exist matrices A1, A2, . . . ∈ B such that

lim_{k→∞} Ak = A.

[Side note: Recall that [Ak]i,j is the (i, j)-entry of Ak.]

Before proceeding, we clarify that limits of matrices are simply meant entrywise: lim_{k→∞} Ak = A means that lim_{k→∞} [Ak]i,j = [A]i,j for all i and j. We illustrate
with an example.
" #   −1

Example 2.D.1 Let Ak = 1 k 2k − 1 and B = 2 1 . Compute lim Ak B .
k→∞
Limits of Matrices k 2k − 1 4k + 1 4 2

Solution:
We start by computing
Recall that the
inverse of a 2 × 2 " #
matrix is k 4k + 1 1 − 2k
" #−1 A−1
k = .
a b
5k − 1 1 − 2k k
=
c d
" # In particular, it is worth noting that lim A−1
k does not exist (the entries of
1 d −b k→∞
.
ad − bc −c a A−1
k get larger and larger as k increases), which should not be surprising
since " #!  
1 k 2k − 1 1 2
lim Ak = lim =
k→∞ k→∞ k 2k − 1 4k + 1 2 4
is not invertible. However, if we multiply on the right by B then
" #  " #
−1 k 4k + 1 1 − 2k 2 1 k 6 3
Ak B = = ,
5k − 1 1 − 2k k 4 2 5k − 1 2 1

so " #! " #
 k 6 3 1 6 3
lim A−1
k B = lim = .
k→∞ k→∞ 5k − 1 2 1 5 2 1
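This behaviour is easy to observe numerically: the entries of Ak−1 blow up as k grows, yet the products Ak−1B settle down. A small sketch in Python with NumPy (our own choice of tools):

```python
import numpy as np

def A(k):
    return np.array([[k, 2*k - 1], [2*k - 1, 4*k + 1]]) / k

B = np.array([[2, 1], [4, 2]])

for k in [10, 1000, 100000]:
    print(k, np.linalg.inv(A(k)) @ B)
# The products approach (1/5) * [[6, 3], [2, 1]], even though
# the entries of inv(A(k)) themselves grow without bound.
```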

Intuitively, a set B ⊆ Mm,n (F) is dense if every matrix is arbitrarily close to


a member of B. As suggested earlier, we are already very familiar with two very
useful dense sets of matrices: the sets of invertible and diagonalizable matrices.
In both cases, the rough reason why these sets are dense is that changing the
entries of a matrix just slightly changes its eigenvalues just slightly, and we can
make this change so as to have its eigenvalues avoid 0 (and thus be invertible)
and avoid repetitions (and thus be diagonalizable). We now make these ideas
precise.

Theorem 2.D.1 (Invertible Matrices are Dense). Suppose F = R or F = C. The set of invertible matrices is dense in Mn(F).

Proof. Suppose A ∈ Mn(F) and then define, for each integer k ≥ 1, the matrix Ak = A + (1/k)I. It is clear that

lim_{k→∞} Ak = A,

and we claim that Ak is invertible whenever k is sufficiently large.


The idea here is that To see why this claim holds, recall that if A has eigenvalues λ1 , . . ., λn
we can wiggle the
then Ak has eigenvalues λ1 + 1/k, . . ., λn + 1/k. Well, if k > 1/|λ j | for each
zero eigenvalues of
A away from zero non-zero eigenvalue λ j of A, then we can show that all of the eigenvalues of Ak
without wiggling any are non-zero by considering two cases:
of the others
into zero.
• If λ j = 0 then λ j + 1/k = 1/k 6= 0, and
• if λ j 6= 0 then |λ j + 1/k| ≥ |λ j | − 1/k > 0, so λ j + 1/k 6= 0.
It follows that each eigenvalue of Ak is non-zero, and thus Ak is invertible,
whenever k is sufficiently large. 
For example, the matrix

    A = [  1   1   2 ]
        [  2   3   5 ]
        [ −1   0  −1 ]

is not invertible, since it has 0 as an eigenvalue (as well as two other eigenvalues equal to (3 ± √13)/2, which are approximately 3.30278 and −0.30278). However, the matrix

    A_k = A + (1/k)I = [ 1 + 1/k      1          2     ]
                       [    2      3 + 1/k       5     ]
                       [   −1         0      −1 + 1/k  ]

is invertible whenever k ≥ 4 since adding a number in the interval (0, 1/4] = (0, 0.25] to each of the eigenvalues of A ensures that none of them equal 0. (In fact, this A_k is invertible when k = 1, 2, 3 too, since none of the eigenvalues of A equal −1, −1/2, or −1/3. However, we just need invertibility when k is large.)
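The same perturbation is easy to check numerically. In the NumPy sketch below (an illustration only), A is the singular matrix above and A + (1/k)I is tested for invertibility via its determinant:

    import numpy as np

    A = np.array([[ 1.0, 1.0,  2.0],
                  [ 2.0, 3.0,  5.0],
                  [-1.0, 0.0, -1.0]])
    print(np.linalg.matrix_rank(A))                # 2, so A itself is not invertible

    for k in [1, 2, 3, 4, 10, 100]:
        Ak = A + np.eye(3) / k
        print(k, abs(np.linalg.det(Ak)) > 1e-12)   # True for every k shown here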

Theorem 2.D.2 (Diagonalizable Matrices are Dense). Suppose F = R or F = C. The set of diagonalizable matrices is dense in M_n(F).

Proof. Suppose A ∈ M_n(F), which we Schur triangularize as A = UTU^*. We then let D be the diagonal matrix with diagonal entries 1, 2, ..., n (in that order) and define, for each integer k ≥ 1, the matrix A_k = A + (1/k)UDU^*. It is clear that

    lim_{k→∞} A_k = A,

and we claim that A_k has distinct eigenvalues, and is thus diagonalizable, whenever k is sufficiently large.
To see why this claim holds, recall that the eigenvalues λ_1, λ_2, ..., λ_n of A are the diagonal entries of T (which we assume are arranged in that order). Similarly, the eigenvalues of

    A_k = UTU^* + (1/k)UDU^* = U(T + (1/k)D)U^*

are the diagonal entries of T + (1/k)D, which are λ_1 + 1/k, λ_2 + 2/k, ..., λ_n + n/k. Well, if k > (n − 1)/|λ_i − λ_j| for each distinct pair of eigenvalues λ_i ≠ λ_j, then we can show that the eigenvalues of A_k are distinct as follows:
• If λ_i = λ_j (but i ≠ j) then λ_i + i/k ≠ λ_j + j/k, and
• if λ_i ≠ λ_j then

    |(λ_i + i/k) − (λ_j + j/k)| = |(λ_i − λ_j) + (i − j)/k|
                                ≥ |λ_i − λ_j| − |i − j|/k
                                ≥ |λ_i − λ_j| − (n − 1)/k > 0.

It follows that the eigenvalues of A_k are distinct, and it is thus diagonalizable, whenever k is sufficiently large. ∎
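Here is a minimal NumPy illustration of this argument (not from the text), using a 2 × 2 Jordan block; it is already upper triangular, so we may take U = I in its Schur triangularization:

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [0.0, 1.0]])     # a Jordan block: one repeated eigenvalue, not diagonalizable
    D = np.diag([1.0, 2.0])        # A is already upper triangular, so U = I works

    for k in [2, 10, 100]:
        print(k, np.linalg.eigvals(A + D / k))   # two distinct eigenvalues: 1 + 1/k and 1 + 2/k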

Remark 2.D.2 (Limits in Normed Vector Spaces). Limits can actually be defined in any normed vector space—we have just restricted attention to M_{m,n} since that is the space where these concepts are particularly useful for us, and the details simplify in this case since matrices have entries that we can latch onto.
In general, as long as a vector space V is finite-dimensional, we can define limits in V by first fixing a basis B of V and then saying that

    lim_{k→∞} v_k = v    whenever    lim_{k→∞} [v_k]_B = [v]_B,

where the limit on the right is just meant entrywise. It turns out that this notion of limit does not depend on which basis B of V we choose (i.e., for any two bases B and C of V, it is the case that lim_{k→∞} [v_k]_B = [v]_B if and only if lim_{k→∞} [v_k]_C = [v]_C).
If V is infinite-dimensional then this approach does not work, since we may not be able to construct a basis of V in the first place, so we may not have any notion of "entrywise" limits to work with. We can get around this problem by picking some norm on V (see Section 1.D) and saying that

    lim_{k→∞} v_k = v    whenever    lim_{k→∞} ‖v_k − v‖ = 0.

In the finite-dimensional case, this definition of limits based on norms turns out to be equivalent to the earlier one based on coordinate vectors, and furthermore it does not matter which norm we use on V (a fact that follows from all norms on finite-dimensional vector spaces being equivalent to each other—see Theorem 1.D.1).
On the other hand, in infinite-dimensional vector spaces, it is no longer necessarily the case that all norms are equivalent (see Exercise 1.D.26), so it is also no longer the case that limits are independent of the norm being used. For example, consider the 1-norm and ∞-norm on P[0, 1] (the space of real-valued polynomials on the interval [0, 1]), which we introduced back in Section 1.D.1:

    ‖f‖_1 = ∫_0^1 |f(x)| dx    and    ‖f‖_∞ = max_{0≤x≤1} |f(x)|.

We claim that the sequence of polynomials {f_k}_{k=1}^∞ ⊂ P[0, 1] defined by f_k(x) = x^k for all k ≥ 1 converges to the zero function in the 1-norm, but does not in the ∞-norm. To see why this is the case, we note that

    ‖f_k − 0‖_1 = ∫_0^1 |x^k| dx = [x^(k+1)/(k + 1)]_{x=0}^1 = 1/(k + 1)    and
    ‖f_k − 0‖_∞ = max_{x∈[0,1]} |x^k| = 1^k = 1    for all k ≥ 1.

If we take limits of these quantities, we see that

    lim_{k→∞} ‖f_k − 0‖_1 = lim_{k→∞} 1/(k + 1) = 0,    so lim_{k→∞} f_k = 0 in the 1-norm, but
    lim_{k→∞} ‖f_k − 0‖_∞ = lim_{k→∞} 1 = 1 ≠ 0,        so lim_{k→∞} f_k ≠ 0 in the ∞-norm.

(A sequence of functions {f_k}_{k=1}^∞ converging in the ∞-norm means that the maximum difference between f_k(x) and the limit function f(x) goes to 0. That is, these functions converge uniformly, which is probably the most intuitive notion of convergence for functions.)
Geometrically, all this is saying is that the area under the graphs of the polynomials f_k(x) = x^k on the interval [0, 1] decreases toward zero, yet the maximum values of these polynomials on that interval do not similarly converge toward zero (the maximum is fixed at 1):

[Figure: the graphs of f_1(x) = x, x^2, x^3, ..., x^20 on [0, 1], illustrating that lim_{k→∞} ‖x^k‖_∞ = 1 while lim_{k→∞} ‖x^k‖_1 = 0.]
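These two norms are also easy to approximate numerically. The short NumPy sketch below (an illustration only) estimates ‖f_k‖_1 by averaging over a fine grid (which approximates the integral over [0, 1]) and ‖f_k‖_∞ by taking the maximum over the same grid:

    import numpy as np

    xs = np.linspace(0.0, 1.0, 100001)
    for k in [1, 5, 20, 100]:
        fk = xs ** k
        one_norm = fk.mean()     # approximates the integral of |x^k| over [0, 1], i.e. 1/(k + 1)
        inf_norm = fk.max()      # the maximum of |x^k| over [0, 1]
        print(k, one_norm, inf_norm)   # the 1-norm tends to 0 while the infinity-norm stays at 1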

2.D.2 Continuity of Matrix Functions


Just as with functions on R, the idea behind continuity of functions acting on
Mm,n is that we should be able to make two outputs of the function as close
together as we like simply by making their inputs sufficiently close together.
We can make this idea precise via limits:

Definition 2.D.2 (Continuous Functions). Suppose F = R or F = C. We say that a function f : M_{m,n}(F) → M_{r,s}(F) is continuous if it is the case that

    lim_{k→∞} f(A_k) = f(A)    whenever    lim_{k→∞} A_k = A.

Phrased slightly differently, a function f : M_{m,n}(F) → M_{r,s}(F) is continuous if we can pull limits in or out of it:

    f( lim_{k→∞} A_k ) = lim_{k→∞} f(A_k)

whenever the limit on the left exists.


Many of the functions of matrices that we have learned about are continuous. Some important examples (which we do not rigorously prove, since we are desperately trying to avoid epsilons and deltas) include:
Linear transformations: Roughly speaking, continuity of linear transformations follows from combining two facts: (i) if f is a linear transformation then the entries of f(A) are linear combinations of the entries of A, and (ii) if each entry of A_k is close to the corresponding entry of A, then any linear combination of the entries of A_k is close to the corresponding linear combination of the entries of A.
Matrix multiplication: The fact that the function f(A) = BAC (where B and C are fixed matrices) is continuous follows from it being a linear transformation.
The trace: Again, the trace is a linear transformation and is thus continuous.
Polynomials: Matrix polynomials like f(A) = A^2 − 3A + 2I, and also polynomials of a matrix's entries like g(A) = a_{1,1}^3 a_{1,2} − a_{2,1}^2 a_{2,2}^2, are all continuous.
The determinant: Continuity of the determinant follows from the fact that Theorem A.1.4 shows that it is a polynomial in the matrix's entries.
Coefficients of the characteristic polynomial: If A has characteristic polynomial p_A(λ) = (−1)^n λ^n + a_{n−1} λ^{n−1} + · · · + a_1 λ + a_0 then the functions f_k(A) = a_k are continuous for all 0 ≤ k < n. Continuity of these coefficients follows from the fact that they are all polynomials in the entries of A, since p_A(λ) = det(A − λI).
Singular values: For all 1 ≤ k ≤ min{m, n} the function f_k(A) = σ_k that outputs the k-th largest singular value of A is continuous. This fact follows from the singular values of A being the square roots of the (necessarily non-negative) eigenvalues of A^*A, and the function A ↦ A^*A is continuous, as are the coefficients of characteristic polynomials and the non-negative square root of a non-negative real number.
Analytic functions: The Jordan decomposition and the formula of Theorem 2.4.6 can be used to show that if f : C → C is analytic on all of C then the corresponding matrix function f : M_n(C) → M_n(C) is continuous. For example, the functions e^A, sin(A), and cos(A) are continuous.
We also note that the sum, difference, and composition of any continuous
functions are again continuous. With all of these examples taken care of, we
now state the theorem that is the main reason that we have introduced dense
sets of matrices and continuity of matrix functions in the first place.

Theorem 2.D.3 (Continuity Plus Density). Suppose F = R or F = C, B ⊆ M_{m,n}(F) is dense in M_{m,n}(F), and f, g : M_{m,n}(F) → M_{r,s}(F) are continuous. If f(A) = g(A) whenever A ∈ B then f(A) = g(A) for all A ∈ M_{m,n}(F).

Proof. Since B is dense in M_{m,n}(F), for any A ∈ M_{m,n}(F) we can find matrices A_1, A_2, ... ∈ B such that lim_{k→∞} A_k = A. Then

    f(A) = lim_{k→∞} f(A_k) = lim_{k→∞} g(A_k) = g(A),

with the middle equality following from the fact that A_k ∈ B, so f(A_k) = g(A_k). ∎

For example, the above theorem tells us that if f : M_n(C) → C is a continuous function for which f(A) = det(A) whenever A is invertible, then f must in fact be the determinant function, and that if g : M_n(C) → M_n(C) is a continuous function for which g(A) = A^2 − 2A + 3I whenever A is diagonalizable, then g(A) = A^2 − 2A + 3I for all A ∈ M_n(C). The remainder of this section is devoted to exploring somewhat more exotic applications of this theorem.

2.D.3 Working with Non-Invertible Matrices


We now start illustrating the utility of the concepts introduced in the previous
subsections by showing how they can help us extend results from the set of
invertible matrices to the set of all matrices. These methods are useful because
it is often much easier to prove that some nice property holds for invertible
matrices, and then use continuity and density arguments to extend the result to
all matrices, than it is to prove that it holds for all matrices directly.
Similarity-Invariance and the Trace
We stated back in Corollary 1.A.2 that the trace is the unique (up to scaling)
linear form on matrices that is similarity invariant. We now use continuity to
finally prove this result, which we re-state here for ease of reference.

Theorem 2.D.4 (Similarity-Invariance Defines the Trace). Suppose F = R or F = C, and f : M_n(F) → F is a linear form with the following properties:
a) f(A) = f(PAP^{−1}) for all A, P ∈ M_n(F) with P invertible, and
b) f(I) = n.
Then f(A) = tr(A) for all A ∈ M_n(F).

Proof. We start by noticing that if A is invertible then AB and BA are similar, since

    AB = A(BA)A^{−1}.

(We originally proved that AB is similar to BA, when A or B is invertible, back in Exercise 1.A.5.) In particular, this tells us that f(AB) = f(A(BA)A^{−1}) = f(BA) whenever A is invertible. If we could show that f(AB) = f(BA) for all A and B then we would be done, since Theorem 1.A.1 would then tell us that f(A) = tr(A) for all A ∈ M_n.
However, this final claim follows immediately from continuity of f and density of the set of invertible matrices. In particular, if we fix any matrix B ∈ M_n and define f_B(A) = f(AB) and g_B(A) = f(BA), then we just showed that f_B(A) = g_B(A) for all invertible A ∈ M_n and all (not necessarily invertible) B ∈ M_n. Since f_B and g_B are continuous, it follows from Theorem 2.D.3 that f_B(A) = g_B(A) (so f(AB) = f(BA)) for all A, B ∈ M_n, which completes the proof. ∎
If the reader is uncomfortable with the introduction of the functions f_B and g_B at the end of the above proof, it can instead be finished a bit more directly by making use of some of the ideas from the proof of Theorem 2.D.1. In particular, for any (potentially non-invertible) matrix A ∈ M_n and integer k ≥ 1, we define A_k = A + (1/k)I and note that lim_{k→∞} A_k = A and A_k is invertible when k is large. We then compute

    f(AB) = f( (lim_{k→∞} A_k) B )    (since lim_{k→∞} A_k = A)
          = lim_{k→∞} f(A_k B)        (since f is linear, thus continuous)
          = lim_{k→∞} f(B A_k)        (since A_k is invertible when k is large)
          = f( B (lim_{k→∞} A_k) )    (since f is linear, thus continuous)
          = f(BA),                    (since lim_{k→∞} A_k = A)

as desired.

Finishing the QR Decomposition


We now use continuity and density arguments to complete the proof that every
matrix A ∈ Mm,n has a QR decomposition, which we originally stated as
Theorem 1.C.1 and only proved in the special case when its leftmost min{m, n}
columns form a linearly independent set. We re-state this theorem here for ease
of reference.

Theorem 2.D.5 (QR Decomposition (Again)). Suppose F = R or F = C, and A ∈ M_{m,n}(F). There exists a unitary matrix U ∈ M_m(F) and an upper triangular matrix T ∈ M_{m,n}(F) with non-negative real entries on its diagonal such that

    A = UT.

In addition to the techniques that we have already presented in this section,


the proof of this theorem also relies on a standard result from analysis called
the Bolzano–Weierstrass theorem, which says that every sequence in a closed
and bounded subset of a finite-dimensional vector space over R or C has a
convergent subsequence. For our purposes, we note that the set of unitary
matrices is a closed and bounded subset of Mn , so any sequence of unitary
matrices has a convergent subsequence.

Proof of Theorem 2.D.5. We assume that n ≥ m and simply note that a completely analogous argument works if n < m. We write A = [ B | C ] where B ∈ M_m and define A_k = [ B + (1/k)I | C ] for each integer k ≥ 1. Since B + (1/k)I is invertible (i.e., its columns are linearly independent) whenever k is sufficiently large, the proof of Theorem 1.C.1 tells us that A_k has a QR decomposition A_k = U_k T_k whenever k is sufficiently large. We now use limit arguments to show that A itself also has such a decomposition.
Since the set of unitary matrices is closed and bounded, the Bolzano–Weierstrass theorem tells us that there is a sequence k_1 < k_2 < k_3 < · · · with the property that U = lim_{j→∞} U_{k_j} exists and is unitary. Similarly, if lim_{j→∞} T_{k_j} exists then it must be upper triangular since each T_{k_j} is upper triangular (recall that limits are taken entrywise, and the limit of the 0 entries in the lower triangular portion of each T_{k_j} is just 0). To see that this limit does exist, we compute

    lim_{j→∞} T_{k_j} = lim_{j→∞} (U_{k_j}^* A_{k_j}) = (lim_{j→∞} U_{k_j})^* (lim_{j→∞} A_{k_j}) = U^* lim_{k→∞} A_k = U^* A.

It follows that A = UT, where T = lim_{j→∞} T_{k_j} is upper triangular, which completes the proof. ∎

If the reader does not like having to make use of the Bolzano–Weierstrass
theorem as we did above, another proof of the QR decomposition is outlined in
Exercise 2.D.7.
From the Inverse to the Adjugate
One final method of extending a property of matrices from those that are
invertible to those that perhaps are not is to make use of something called the
“adjugate” of a matrix:

Definition 2.D.3 (The Adjugate). Suppose that F = R or F = C and A ∈ M_n(F) has characteristic polynomial

    p_A(λ) = det(A − λI) = (−1)^n λ^n + a_{n−1} λ^{n−1} + · · · + a_1 λ + a_0.

Then the adjugate of A, denoted by adj(A), is the matrix

    adj(A) = −( (−1)^n A^{n−1} + a_{n−1} A^{n−2} + · · · + a_2 A + a_1 I ).

While the above definition likely seems completely bizarre at first glance, it is motivated by the following two properties:
• The function adj : M_n → M_n is continuous, since matrix multiplication, addition, and the coefficients of characteristic polynomials are all continuous, and
• The adjugate satisfies A^{−1} = adj(A)/det(A) whenever A ∈ M_n is invertible. To verify this claim, first recall that the constant term of the characteristic polynomial is a_0 = p_A(0) = det(A − 0I) = det(A). Using the Cayley–Hamilton theorem (Theorem 2.1.3) then tells us that

    p_A(A) = (−1)^n A^n + a_{n−1} A^{n−1} + · · · + a_1 A + det(A)I = O,

and multiplying this equation by A^{−1} (if it exists) shows that

    ( (−1)^n A^{n−1} + a_{n−1} A^{n−2} + · · · + a_1 I ) + det(A)A^{−1} = O.

(The adjugate is sometimes defined in terms of the cofactors of A instead; see [Joh20, Section 3.A], for example. These two definitions are equivalent.) Rearranging slightly shows exactly that det(A)A^{−1} = adj(A), as claimed.
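The formula in Definition 2.D.3 is straightforward to implement. The following NumPy sketch (an illustration only; the helper name adjugate is ours, not the text's) builds adj(A) from the characteristic polynomial coefficients and checks that A·adj(A) = det(A)I:

    import numpy as np

    def adjugate(A):
        # Adjugate via the characteristic-polynomial formula of Definition 2.D.3.
        n = A.shape[0]
        c = np.poly(A)                    # coefficients of det(tI - A) = t^n + c[1] t^(n-1) + ...
        a = (-1) ** n * c                 # a[n - k] is the coefficient of t^k in p_A(t) = det(A - tI)
        adj = np.zeros_like(A, dtype=float)
        for k in range(1, n + 1):         # adj(A) = -( a_n A^(n-1) + ... + a_2 A + a_1 I )
            adj -= a[n - k] * np.linalg.matrix_power(A, k - 1)
        return adj

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(adjugate(A))                    # [[ 4, -2], [-3, 1]] for this 2x2 example
    print(A @ adjugate(A))                # equals det(A) * I = -2 * I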


It follows that if we have a formula or property of matrices that involves
A−1 , we can often extend it to an analogous formula that holds for all square
matrices by simply making the substitution A−1 = adj(A)/ det(A) and invoking
continuity. We illustrate this method with an example.

Theorem 2.D.6 (Jacobi's Formula). Suppose that F = R or F = C and A(t) ∈ M_n(F) is a matrix whose entries depend in a continuously differentiable way on a parameter t ∈ F. If we let dA/dt denote the matrix that is obtained by taking the derivative of each entry of A with respect to t, then

    (d/dt) det(A(t)) = tr( adj(A(t)) (dA/dt) ).

Proof. We proved in Exercise 1.A.8 that if A(t) is invertible then

    (d/dt) det(A(t)) = det(A(t)) tr( A(t)^{−1} (dA/dt) ).

If we make the substitution A(t)^{−1} = adj(A(t))/det(A(t)) in the above formula then we see that

    (d/dt) det(A(t)) = tr( adj(A(t)) (dA/dt) )

whenever A(t) is invertible. Since the set of invertible matrices is dense in M_n(F) and this function is continuous, we conclude that it must in fact hold for all A(t) (invertible or not). ∎
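Jacobi's formula can be sanity-checked numerically. In the sketch below (an illustration only; the parametrized matrix A_of_t is an arbitrary choice of ours), the derivative of det(A(t)) is approximated by a central difference and compared with tr(adj(A(t)) dA/dt), using adj(A) = det(A)A^{−1} since A(t) happens to be invertible at the chosen t:

    import numpy as np

    def A_of_t(t):
        return np.array([[np.cos(t), t,    1.0],
                         [t**2,      1.0,  np.sin(t)],
                         [1.0,       t**3, np.exp(t)]])

    t, h = 0.7, 1e-6
    A = A_of_t(t)
    adjA = np.linalg.det(A) * np.linalg.inv(A)       # adj(A) = det(A) A^{-1} here, since A is invertible
    dA = (A_of_t(t + h) - A_of_t(t - h)) / (2 * h)   # entrywise derivative of A(t)
    lhs = (np.linalg.det(A_of_t(t + h)) - np.linalg.det(A_of_t(t - h))) / (2 * h)
    print(lhs, np.trace(adjA @ dA))                  # the two values agree up to finite-difference error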

2.D.4 Working with Non-Diagonalizable Matrices


We can also apply continuity and density arguments to the set of diagonalizable
matrices in much the same way that we applied them to the set of invertible
matrices in the previous subsection. We can often use Schur triangularization
(Theorem 2.1.1) or the Jordan decomposition (Theorem 2.4.1) to achieve the
same end result as the methods of this section, but it is nonetheless desirable
to avoid those decompositions when possible (after all, diagonal matrices are
much simpler to work with than triangular ones).
We start by illustrating how these techniques can simplify arguments in-
volving matrix functions.

Theorem 2.D.7 (Trigonometric Identities for Matrices). For all A ∈ M_n(C) it is the case that sin^2(A) + cos^2(A) = I.

Proof. Recall that sin^2(x) + cos^2(x) = 1 for all x ∈ C. It follows that if A is diagonalizable via A = PDP^{−1}, where D has diagonal entries λ_1, λ_2, ..., λ_n, then

    sin^2(A) + cos^2(A) = P diag( sin^2(λ_1), sin^2(λ_2), ..., sin^2(λ_n) ) P^{−1}
                        + P diag( cos^2(λ_1), cos^2(λ_2), ..., cos^2(λ_n) ) P^{−1}
                        = P diag( sin^2(λ_1) + cos^2(λ_1), ..., sin^2(λ_n) + cos^2(λ_n) ) P^{−1}
                        = PP^{−1} = I.

(We proved this theorem via the Jordan decomposition in Exercise 2.4.14.)
Since f(x) = sin^2(x) + cos^2(x) is analytic on C and thus continuous on M_n(C), it follows from Theorem 2.D.3 that sin^2(A) + cos^2(A) = I for all (not necessarily diagonalizable) A ∈ M_n(C). ∎
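This identity can be verified numerically even for non-diagonalizable inputs. The sketch below (an illustration only) uses SciPy's matrix sine and cosine on a Jordan block:

    import numpy as np
    from scipy.linalg import cosm, sinm

    A = np.array([[2.0, 1.0],
                  [0.0, 2.0]])            # a Jordan block, so A is not diagonalizable
    S, C = sinm(A), cosm(A)
    print(np.round(S @ S + C @ C, 10))    # the 2x2 identity matrix, up to rounding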
As another example to illustrate the utility of this technique, we now
provide an alternate proof of the Cayley–Hamilton theorem (Theorem 2.1.3)
that avoids the technical argument that we originally used to prove it via Schur
triangularization (which also has a messy, technical proof).

Theorem 2.D.8 (Cayley–Hamilton (Again)). Suppose A ∈ M_n(C) has characteristic polynomial p_A(λ) = det(A − λI). Then p_A(A) = O.

Proof. Since A is complex, the fundamental theorem of algebra (Theorem A.3.1) tells us that the characteristic polynomial of A has a root and can thus be factored as

    p_A(λ) = (λ_1 − λ)(λ_2 − λ) · · · (λ_n − λ),

where λ_1, λ_2, ..., λ_n are the eigenvalues of A, listed according to algebraic multiplicity. Our goal is thus to show that (λ_1 I − A)(λ_2 I − A) · · · (λ_n I − A) = O.
To see why this is the case, notice that if A is diagonalizable via A = PDP^{−1} then

    (λ_1 I − PDP^{−1}) · · · (λ_n I − PDP^{−1}) = P(λ_1 I − D) · · · (λ_n I − D)P^{−1},

and it is clear that (λ_1 I − D)(λ_2 I − D) · · · (λ_n I − D) = O, since D is diagonal with diagonal entries λ_1, λ_2, ..., λ_n (the j-th factor has a 0 in its j-th diagonal position, so the product of these diagonal matrices is O). It follows that p_A(A) = O whenever A is diagonalizable.
To see that this same conclusion must hold even when A is not diagonalizable, simply notice that the function f(A) = p_A(A) is continuous (since the coefficients in the characteristic polynomial are continuous) and then apply Theorem 2.D.3 to it. ∎

Exercises    (solutions to starred exercises on page 475)

2.D.1 Compute lim_{k→∞} A^k if A = (1/10) [9 2; 1 8].

2.D.2 Determine which of the following statements are true and which are false.
∗ (a) The Frobenius norm of a matrix is continuous.
  (b) The function T : M_{m,n} → M_{n,m} defined by T(A) = A† (the pseudoinverse of A—see Section 2.C.1) is continuous.

∗∗ 2.D.3 Show that if A_1, A_2, ... ∈ M_{m,n} are matrices with rank(A_k) ≤ r for all k then

    rank( lim_{k→∞} A_k ) ≤ r

too, as long as this limit exists.
[Side note: For this reason, the rank of a matrix is said to be a lower semicontinuous function.]

2.D.4 Show that adj(AB) = adj(B)adj(A) for all A, B ∈ M_n.
[Hint: This is easy to prove if A and B are invertible.]

∗ 2.D.5 Show that every positive semidefinite matrix A ∈ M_n can be written as a limit of positive definite matrices.
[Side note: In other words, the set of positive definite matrices is dense in the set of positive semidefinite matrices.]

2.D.6 Suppose that A, B ∈ M_n are positive semidefinite. Show that all eigenvalues of AB are real and non-negative.
[Hint: We proved this fact for positive definite matrices back in Exercise 2.2.26.]

∗∗ 2.D.7 Provide an alternate proof of the QR decomposition (Theorem 2.D.5) by combining the Cholesky decomposition (Theorem 2.B.5) and Theorem 2.2.10.
3. Tensors and Multilinearity

Do not think about what tensors are, but rather


what the whole construction of a tensor product
space can do for you.

Keith Conrad
Up until now, all of our explorations in linear algebra have focused on vec-
tors, matrices, and linear transformations—all of our matrix decompositions,
change of basis techniques, algorithms for solving linear systems or construct-
ing orthonormal bases, and so on have had the purpose of deepening our
understanding of these objects. We now introduce a common generalization
of all of these objects, called “tensors”, and investigate which of our tools and
techniques do and do not still work in this new setting.
For example, just like we can think of vectors (that is, the type of vectors
that live in Fn ) as 1-dimensional lists of numbers and matrices as 2-dimensional
arrays of numbers, we can think of tensors as any-dimensional arrays of
numbers. Perhaps more usefully though, recall that we can think of vectors
geometrically as arrows in space, and matrices as linear transformations that
move those arrows around. We can similarly think of tensors as functions that
move vectors, matrices, or even other more general tensors themselves around
in a linear way. In fact, they can even move multiple vectors, matrices, or tensors
around (much like bilinear forms and multilinear forms did in Section 1.3.3).
In a sense, this chapter is where we really push linear algebra to its ultimate
limit, and see just how far our techniques can extend. Tensors provide a common
generalization of almost every single linear algebraic object that we have seen—
not only are vectors, matrices, linear transformations, linear forms, and bilinear
forms examples of tensors, but so are more exotic operations like the cross
product, the determinant, and even matrix multiplication itself.

3.1 The Kronecker Product

Before diving into the full power of tensors, we start by considering a new
operation on Fn and Mm,n (F) that contains much of their essence. After all,
tensors themselves are quite abstract and will take some time to get our heads
around, so it will be useful to have this very concrete motivating operation in
our minds when we are introduced to them.


3.1.1 Definition and Basic Properties


The Kronecker product can be thought of simply as a way of multiplying
matrices together (regardless of their sizes) so as to get much larger matrices.

Definition 3.1.1 (Kronecker Product). The Kronecker product of matrices A ∈ M_{m,n} and B ∈ M_{p,q} is the block matrix

    A ⊗ B = [ a_{1,1}B   a_{1,2}B   · · ·   a_{1,n}B ]
            [ a_{2,1}B   a_{2,2}B   · · ·   a_{2,n}B ]
            [    ...        ...     ...        ...   ]
            [ a_{m,1}B   a_{m,2}B   · · ·   a_{m,n}B ]   ∈ M_{mp,nq}.

(Recall that a_{i,j} is the (i, j)-entry of A.)

In particular, notice that A ⊗ B is defined no matter what the sizes of A and B are (i.e., we do not need to make sure that the sizes of A and B are compatible with each other like we do with standard matrix multiplication). As a result of this, we can also apply the Kronecker product to vectors simply by thinking of them as 1 × n or m × 1 matrices. For this reason, we similarly say that if v ∈ F^m and w ∈ F^n then the Kronecker product of v and w is

    v ⊗ w = (v_1 w, v_2 w, ..., v_m w)
          = (v_1 w_1, ..., v_1 w_n, v_2 w_1, ..., v_2 w_n, ..., v_m w_1, ..., v_m w_n).

We now compute a few Kronecker products to make ourselves more comfortable with these ideas.

Example 3.1.1 (Numerical Examples of the Kronecker Product). Suppose A = [1 2; 3 4], B = [2 1; 0 1], v = (1, 2), and w = (3, 4). Compute:
a) A ⊗ B,
b) B ⊗ A, and
c) v ⊗ w.

Solutions:
a) A ⊗ B = [ B   2B ] = [ 2  1  4  2 ]
           [ 3B  4B ]   [ 0  1  0  2 ]
                        [ 6  3  8  4 ]
                        [ 0  3  0  4 ]

b) B ⊗ A = [ 2A  A ] = [ 2  4  1  2 ]
           [ O   A ]   [ 6  8  3  4 ]
                       [ 0  0  1  2 ]
                       [ 0  0  3  4 ]

c) v ⊗ w = (w, 2w) = (3, 4, 6, 8).
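These computations can be reproduced with NumPy's built-in Kronecker product np.kron (a quick illustration, not part of the text):

    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[2, 1], [0, 1]])
    v, w = np.array([1, 2]), np.array([3, 4])

    print(np.kron(A, B))    # the 4x4 matrix from part (a)
    print(np.kron(B, A))    # part (b): not the same matrix as part (a)
    print(np.kron(v, w))    # part (c): [3 4 6 8]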
The above example shows that the Kronecker product is not commutative in general: A ⊗ B might not equal B ⊗ A. (We will see in Theorem 3.1.8 that, even though A ⊗ B and B ⊗ A are typically not equal, they share many of the same properties.) However, it does have most of the other "standard" properties that we would expect a matrix product to have, and in particular it interacts with matrix addition and scalar multiplication exactly how we would hope that it does:

Theorem 3.1.1 (Vector Space Properties of the Kronecker Product). Suppose A, B, and C are matrices with sizes such that the operations below make sense, and let c ∈ F be a scalar. Then
a) (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)   (associativity)
b) A ⊗ (B + C) = A ⊗ B + A ⊗ C   (left distributivity)
c) (A + B) ⊗ C = A ⊗ C + B ⊗ C   (right distributivity)
d) (cA) ⊗ B = A ⊗ (cB) = c(A ⊗ B).
d) (cA) ⊗ B = A ⊗ (cB) = c(A ⊗ B)

Proof. The proofs of all of these statements are quite similar to each other, so we only explicitly prove part (b)—the remaining parts of the theorem are left to Exercise 3.1.20.
To this end, we just fiddle around with block matrices a bit:

    A ⊗ (B + C) = [ a_{1,1}(B + C)   · · ·   a_{1,n}(B + C) ]
                  [      ...         ...          ...       ]
                  [ a_{m,1}(B + C)   · · ·   a_{m,n}(B + C) ]

                = [ a_{1,1}B   · · ·   a_{1,n}B ]     [ a_{1,1}C   · · ·   a_{1,n}C ]
                  [    ...     ...        ...   ]  +  [    ...     ...        ...   ]
                  [ a_{m,1}B   · · ·   a_{m,n}B ]     [ a_{m,1}C   · · ·   a_{m,n}C ]

                = A ⊗ B + A ⊗ C,

as desired. ∎
In particular, associativity of the Kronecker product (i.e., property (a) of the above theorem) tells us that we can unambiguously define Kronecker powers of matrices by taking the Kronecker product of a matrix with itself repeatedly, without having to worry about the exact order in which we perform those products. (Associativity of the Kronecker product also tells us that expressions like A ⊗ B ⊗ C make sense.) That is, for any integer p ≥ 1 we can define

    A^{⊗p} = A ⊗ A ⊗ · · · ⊗ A   (p copies of A).

Kronecker powers increase in size extremely quickly, since increasing the power by 1 multiplies the number of rows and columns in the result by the number of rows and columns in A, respectively.
   
Example 3.1.2 (Numerical Examples of Kronecker Powers). Suppose H = [1 1; 1 −1] and I = [1 0; 0 1]. Compute:
a) I^{⊗2},
b) H^{⊗2}, and
c) H^{⊗3}.

Solutions:
a) I^{⊗2} = I ⊗ I = [ I  O ] = [ 1  0  0  0 ]
                    [ O  I ]   [ 0  1  0  0 ]
                               [ 0  0  1  0 ]
                               [ 0  0  0  1 ]

b) H^{⊗2} = H ⊗ H = [ H   H ] = [ 1   1   1   1 ]
                    [ H  −H ]   [ 1  −1   1  −1 ]
                                [ 1   1  −1  −1 ]
                                [ 1  −1  −1   1 ]

c) H^{⊗3} = H ⊗ (H^{⊗2}) = [ H^{⊗2}   H^{⊗2} ]
                           [ H^{⊗2}  −H^{⊗2} ]

         = [ 1   1   1   1   1   1   1   1 ]
           [ 1  −1   1  −1   1  −1   1  −1 ]
           [ 1   1  −1  −1   1   1  −1  −1 ]
           [ 1  −1  −1   1   1  −1  −1   1 ]
           [ 1   1   1   1  −1  −1  −1  −1 ]
           [ 1  −1   1  −1  −1   1  −1   1 ]
           [ 1   1  −1  −1  −1  −1   1   1 ]
           [ 1  −1  −1   1  −1   1   1  −1 ]

(We could also compute H^{⊗3} = (H^{⊗2}) ⊗ H. We would get the same answer.)
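Kronecker powers are equally easy to build numerically. The sketch below (an illustration only) constructs H^{⊗3} with np.kron and checks the orthogonality of its columns, which is relevant to the remark that follows:

    import numpy as np

    H = np.array([[1, 1], [1, -1]])
    H3 = np.kron(H, np.kron(H, H))                               # the Kronecker power H^(⊗3)
    print(H3.shape)                                              # (8, 8)
    print(np.array_equal(H3 @ H3.T, 8 * np.eye(8, dtype=int)))   # True: the columns are mutually orthogonal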

Remark 3.1.1 (Hadamard Matrices). Notice that the matrices H, H^{⊗2}, and H^{⊗3} from Example 3.1.2 all have the following two properties: their entries are all ±1, and their columns are mutually orthogonal. Matrices with these properties are called Hadamard matrices, and the Kronecker product gives one method of constructing them: H^{⊗k} is a Hadamard matrix for all k ≥ 1. (Entire books have been written about Hadamard matrices and the various ways of constructing them [Aga85, Hor06].)
One of the longest-standing unsolved questions in linear algebra asks which values of n are such that there exists an n × n Hadamard matrix. The above argument shows that they exist whenever n = 2^k for some k ≥ 1 (since H^{⊗k} is a 2^k × 2^k matrix), but it is expected that they exist whenever n is a multiple of 4. For example, here is a 12 × 12 Hadamard matrix that cannot be constructed via the Kronecker product:

    [ 1  −1  −1  −1  −1  −1  −1  −1  −1  −1  −1  −1 ]
    [ 1   1  −1   1  −1  −1  −1   1   1   1  −1   1 ]
    [ 1   1   1  −1   1  −1  −1  −1   1   1   1  −1 ]
    [ 1  −1   1   1  −1   1  −1  −1  −1   1   1   1 ]
    [ 1   1  −1   1   1  −1   1  −1  −1  −1   1   1 ]
    [ 1   1   1  −1   1   1  −1   1  −1  −1  −1   1 ]
    [ 1   1   1   1  −1   1   1  −1   1  −1  −1  −1 ]
    [ 1  −1   1   1   1  −1   1   1  −1   1  −1  −1 ]
    [ 1  −1  −1   1   1   1  −1   1   1  −1   1  −1 ]
    [ 1  −1  −1  −1   1   1   1  −1   1   1  −1   1 ]
    [ 1   1  −1  −1  −1   1   1   1  −1   1   1  −1 ]
    [ 1  −1   1  −1  −1  −1   1   1   1  −1   1   1 ]

Numerous different methods of constructing Hadamard matrices are now known, and currently the smallest n for which it is not known if there exists a Hadamard matrix is n = 668.

Notice that, in part (a) of the above example, the Kronecker product of two identity matrices was simply a larger identity matrix—this happens in general, regardless of the sizes of the identity matrices in the product. Similarly, the Kronecker product of two diagonal matrices is always diagonal. (We look at some other sets of matrices that are preserved by the Kronecker product shortly, in Theorem 3.1.4.)
The Kronecker product also plays well with usual matrix multiplication and other operations like the transpose and inverse. We summarize these additional properties here:

Theorem 3.1.2 (Algebraic Properties of the Kronecker Product). Suppose A, B, C, and D are matrices with sizes such that the operations below make sense. Then
a) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD),
b) (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1}, if either side of this expression exists,
c) (A ⊗ B)^T = A^T ⊗ B^T, and
d) (A ⊗ B)^* = A^* ⊗ B^* if the matrices are complex.

Proof. The proofs of all of these statements are quite similar to each other and follow directly from the definitions of the relevant operations, so we only explicitly prove part (a)—the remaining parts of the theorem are left to Exercise 3.1.21.
To see why part (a) of the theorem holds, we compute (A ⊗ B)(C ⊗ D) via block matrix multiplication (so that these matrix multiplications actually make sense, we are assuming that A ∈ M_{m,n} and C ∈ M_{n,p}):

    (A ⊗ B)(C ⊗ D) = [ a_{1,1}B   · · ·   a_{1,n}B ] [ c_{1,1}D   · · ·   c_{1,p}D ]
                     [    ...     ...        ...   ] [    ...     ...        ...   ]
                     [ a_{m,1}B   · · ·   a_{m,n}B ] [ c_{n,1}D   · · ·   c_{n,p}D ]

                   = [ (Σ_{j=1}^n a_{1,j}c_{j,1})BD   · · ·   (Σ_{j=1}^n a_{1,j}c_{j,p})BD ]
                     [            ...                 ...                ...               ]
                     [ (Σ_{j=1}^n a_{m,j}c_{j,1})BD   · · ·   (Σ_{j=1}^n a_{m,j}c_{j,p})BD ]

                   = [ Σ_{j=1}^n a_{1,j}c_{j,1}   · · ·   Σ_{j=1}^n a_{1,j}c_{j,p} ]
                     [           ...              ...               ...            ]  ⊗ (BD)
                     [ Σ_{j=1}^n a_{m,j}c_{j,1}   · · ·   Σ_{j=1}^n a_{m,j}c_{j,p} ]

                   = (AC) ⊗ (BD),

as desired. (This calculation looks ugly, but really it is just applying the definition of matrix multiplication multiple times.) ∎
It is worth noting that Theorems 3.1.1 and 3.1.2 still work if we replace
all of the matrices by vectors, since we can think of those vectors as 1 × n or
m × 1 matrices. Doing this in parts (a) and (d) of the above theorem shows us
that if v, w ∈ Fn and x, y ∈ Fm are (column) vectors, then
(v ⊗ x) · (w ⊗ y) = (v ⊗ x)∗ (w ⊗ y) = (v∗ w)(x∗ y) = (v · w)(x · y).

In other words, the dot product of two Kronecker products is just the product
of the individual dot products. In particular, this means that v ⊗ x is orthogonal
to w ⊗ y if and only if v is orthogonal to w or x is orthogonal to y (or both).
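Both of these facts are easy to spot-check numerically, as in the following NumPy sketch (an illustration only, with randomly generated matrices and vectors):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((2, 3)), rng.standard_normal((4, 5))
    C, D = rng.standard_normal((3, 2)), rng.standard_normal((5, 4))
    print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))   # True

    v, w = rng.standard_normal(3), rng.standard_normal(3)
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    print(np.isclose(np.dot(np.kron(v, x), np.kron(w, y)),
                     np.dot(v, w) * np.dot(x, y)))                             # True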
Properties Preserved by the Kronecker Product
Because the Kronecker product and Kronecker powers can create such large
matrices so quickly, it is important to understand how properties of A ⊗ B
relate to the corresponding properties of A and B themselves. For example, the
following theorem shows that we can compute the eigenvalues, determinant,
and trace of A ⊗ B directly from A and B themselves.

Theorem 3.1.3 (Eigenvalues, Trace, and Determinant of the Kronecker Product). Suppose A ∈ M_m and B ∈ M_n.
a) If λ and μ are eigenvalues of A and B, respectively, with corresponding eigenvectors v and w, then λμ is an eigenvalue of A ⊗ B with corresponding eigenvector v ⊗ w,
b) tr(A ⊗ B) = tr(A)tr(B), and
c) det(A ⊗ B) = det(A)^n det(B)^m.

Proof. Part (a) of the theorem follows almost immediately from the fact that the Kronecker product plays well with matrix multiplication and scalar multiplication (i.e., Theorems 3.1.1 and 3.1.2):

    (A ⊗ B)(v ⊗ w) = (Av) ⊗ (Bw) = (λv) ⊗ (μw) = λμ(v ⊗ w),

so v ⊗ w is an eigenvector of A ⊗ B corresponding to eigenvalue λμ, as claimed.
Part (b) follows directly from the definition of the Kronecker product:

    tr(A ⊗ B) = tr [ a_{1,1}B   · · ·   a_{1,m}B ]
                   [    ...     ...        ...   ]
                   [ a_{m,1}B   · · ·   a_{m,m}B ]
              = a_{1,1}tr(B) + a_{2,2}tr(B) + · · · + a_{m,m}tr(B) = tr(A)tr(B).

(Note in particular that parts (b) and (c) of this theorem imply that tr(A ⊗ B) = tr(B ⊗ A) and det(A ⊗ B) = det(B ⊗ A).)
Finally, for part (c) we note that

    det(A ⊗ B) = det( (A ⊗ I_n)(I_m ⊗ B) ) = det(A ⊗ I_n) det(I_m ⊗ B).

Since I_m ⊗ B is block diagonal, its determinant just equals the product of the determinants of its diagonal blocks:

    det(I_m ⊗ B) = det [ B  O  · · ·  O ]
                       [ O  B  · · ·  O ]
                       [ ...          ...]
                       [ O  O  · · ·  B ]  = det(B)^m.

A similar argument shows that det(A ⊗ I_n) = det(A)^n, so we get det(A ⊗ B) = det(A)^n det(B)^m, as claimed. (We will see another way to prove this determinant equality, as long as F = R or F = C, in Exercise 3.1.9.) ∎
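A quick numerical spot-check of all three parts (an illustration only, using random matrices; the eigenvalue comparison is up to ordering and rounding):

    import numpy as np

    rng = np.random.default_rng(1)
    A, B = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
    K = np.kron(A, B)

    prods = np.array([lam * mu for lam in np.linalg.eigvals(A)
                               for mu in np.linalg.eigvals(B)])
    print(np.allclose(np.sort_complex(np.linalg.eigvals(K)), np.sort_complex(prods)))  # True
    print(np.isclose(np.trace(K), np.trace(A) * np.trace(B)))                          # True
    print(np.isclose(np.linalg.det(K),
                     np.linalg.det(A)**4 * np.linalg.det(B)**3))                       # True: det(A)^n det(B)^m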
The Kronecker product also preserves several useful families of matrices.
For example, it follows straightforwardly from the definition of the Kronecker
product that if A ∈ Mm and B ∈ Mn are both upper triangular, then so is A ⊗ B.
We summarize some observations of this type in the following theorem:

Theorem 3.1.4 (Matrix Properties Preserved by the Kronecker Product). Suppose A ∈ M_m and B ∈ M_n.
a) If A and B are upper (lower) triangular, so is A ⊗ B,
b) If A and B are diagonal, so is A ⊗ B,
c) If A and B are normal, so is A ⊗ B,
d) If A and B are unitary, so is A ⊗ B,
e) If A and B are symmetric or Hermitian, so is A ⊗ B, and
f) If A and B are positive (semi)definite, so is A ⊗ B.

We leave the proof of the above theorem to Exercise 3.1.22, as none of it is


very difficult or enlightening—the claims can each be proved in a line or two
simply by invoking properties of the Kronecker product that we saw earlier.
In particular, because the matrix properties described by Theorem 3.1.4 are
exactly the ones used in most of the matrix decompositions we have seen, it
follows that the Kronecker product interacts with most matrix decompositions
just the way we would hope for it to. For example, if A ∈ Mm (C) and B ∈
Mn (C) have Schur triangularizations

A = U1 T1U1∗ and B = U2 T2U2∗

(where U1 and U2 are unitary, and T1 and T2 are upper triangular), then to
find a Schur triangularization of A ⊗ B we can simply compute the Kronecker
products U1 ⊗U2 and T1 ⊗ T2 , since

A ⊗ B = (U1 T1U1∗ ) ⊗ (U2 T2U2∗ ) = (U1 ⊗U2 )(T1 ⊗ T2 )(U1 ⊗U2 )∗ .

An analogous argument shows that the Kronecker product also preserves di-
agonalizations of matrices (in the sense of Theorem 2.0.1), as well as spectral
decompositions, QR decompositions, singular value decompositions, and polar
decompositions.
Because these decompositions behave so well under the Kronecker prod-
uct, we can use them to get our hands on any matrix properties that can be
inferred from these decompositions. For example, by looking at how the Kro-
necker product interacts with the singular value decomposition, we arrive at
the following theorem:

Theorem 3.1.5 (Kronecker Product of Singular Value Decompositions). Suppose A ∈ M_{m,n} and B ∈ M_{p,q}.
a) If σ and τ are singular values of A and B, respectively, then στ is a singular value of A ⊗ B,
b) rank(A ⊗ B) = rank(A)rank(B),
c) range(A ⊗ B) = span{ v ⊗ w : v ∈ range(A), w ∈ range(B) },
d) null(A ⊗ B) = span{ v ⊗ w : v ∈ null(A), w ∈ null(B) }, and
e) ‖A ⊗ B‖ = ‖A‖‖B‖ and ‖A ⊗ B‖_F = ‖A‖_F ‖B‖_F.

Again, all of these properties follow fairly quickly from the relevant def-
initions and the fact that if A and B have singular value decompositions
A = U1 Σ1V1∗ and B = U2 Σ2V2∗ , then

A ⊗ B = (U1 Σ1V1∗ ) ⊗ (U2 Σ2V2∗ ) = (U1 ⊗U2 )(Σ1 ⊗ Σ2 )(V1 ⊗V2 )∗

is a singular value decomposition of A ⊗ B. We thus leave the proof of the


above theorem to Exercise 3.1.23.
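Part (a) of Theorem 3.1.5, for example, can be spot-checked as follows (an illustration only, using random matrices):

    import numpy as np

    rng = np.random.default_rng(2)
    A, B = rng.standard_normal((3, 4)), rng.standard_normal((2, 5))
    sv_kron = np.sort(np.linalg.svd(np.kron(A, B), compute_uv=False))
    sv_prod = np.sort(np.outer(np.linalg.svd(A, compute_uv=False),
                               np.linalg.svd(B, compute_uv=False)).ravel())
    print(np.allclose(sv_kron, sv_prod))   # True: the singular values of A ⊗ B are the products sigma*tau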

The one notable matrix decomposition that does not behave quite so cleanly under the Kronecker product is the Jordan decomposition. In particular, if J_1 ∈ M_m(C) and J_2 ∈ M_n(C) are two matrices in Jordan canonical form then J_1 ⊗ J_2 may not be in Jordan canonical form. For example, if

    J_1 = J_2 = [ 1  1 ]        then        J_1 ⊗ J_2 = [ 1  1  1  1 ]
                [ 0  1 ]                                [ 0  1  0  1 ]
                                                        [ 0  0  1  1 ]
                                                        [ 0  0  0  1 ],

which is not in Jordan canonical form.


Bases and Linear Combinations of Kronecker Products
The Kronecker product of vectors in F^m and F^n can only be used to construct a tiny portion of all vectors in F^{mn} (and similarly, most matrices in M_{mp,nq} cannot be written in the form A ⊗ B). For example, there do not exist vectors v, w ∈ C^2 such that v ⊗ w = (1, 0, 0, 1), since that would imply v_1 w = (1, 0) and v_2 w = (0, 1), but (1, 0) and (0, 1) are not scalar multiples of each other. (The fact that some vectors cannot be written in the form v ⊗ w is exactly why we needed the "span" in conditions (c) and (d) of Theorem 3.1.5.)
However, it is the case that every vector in F^{mn} can be written as a linear combination of vectors of the form v ⊗ w with v ∈ F^m and w ∈ F^n. In fact, if B and C are bases of F^m and F^n, respectively, then the set

    B ⊗ C = { v ⊗ w : v ∈ B, w ∈ C }

is a basis of F^{mn} (see Exercise 3.1.17, and note that a similar statement holds for matrices in M_{mp,nq} being written as a linear combination of matrices of the form A ⊗ B). In the special case when B and C are the standard bases of F^m and F^n, respectively, B ⊗ C is also the standard basis of F^{mn}. Furthermore, ordering the basis vectors e_i ⊗ e_j by placing their subscripts in lexicographical order produces exactly the "usual" ordering of the standard basis vectors of F^{mn}. For example, if m = 2 and n = 3 then

    e_1 ⊗ e_1 = (1, 0) ⊗ (1, 0, 0) = (1, 0, 0, 0, 0, 0),
    e_1 ⊗ e_2 = (1, 0) ⊗ (0, 1, 0) = (0, 1, 0, 0, 0, 0),
    e_1 ⊗ e_3 = (1, 0) ⊗ (0, 0, 1) = (0, 0, 1, 0, 0, 0),
    e_2 ⊗ e_1 = (0, 1) ⊗ (1, 0, 0) = (0, 0, 0, 1, 0, 0),
    e_2 ⊗ e_2 = (0, 1) ⊗ (0, 1, 0) = (0, 0, 0, 0, 1, 0),
    e_2 ⊗ e_3 = (0, 1) ⊗ (0, 0, 1) = (0, 0, 0, 0, 0, 1).

(In other words, if we "count" in the subscripts of e_i ⊗ e_j, but with each digit starting at 1 instead of 0, we get the usual ordering of these basis vectors.) In fact, this same observation works when taking the Kronecker product of 3 or more standard basis vectors as well.
When working with vectors that are (linear combinations of) Kronecker products of other vectors, we typically want to know what the dimensions of the different factors of the Kronecker product are. For example, if we say that v ⊗ w ∈ F^6, we might wonder whether v and w live in 2- and 3-dimensional spaces, respectively, or 3- and 2-dimensional spaces. To alleviate this issue, we use the notation F^m ⊗ F^n to mean F^{mn}, but built out of Kronecker products of vectors from F^m and F^n, in that order (and we similarly use the notation M_{m,n} ⊗ M_{p,q} for M_{mp,nq}). When working with the Kronecker product of many vectors, we often use the shorthand notation

    (F^n)^{⊗p} = F^n ⊗ F^n ⊗ · · · ⊗ F^n   (p copies).

Furthermore, we say that any vector v = v_1 ⊗ v_2 ⊗ · · · ⊗ v_p ∈ (F^n)^{⊗p} that can be written as a Kronecker product (rather than as a linear combination of Kronecker products) is an elementary tensor. (We will clarify what the word "tensor" means in the coming sections.)

3.1.2 Vectorization and the Swap Matrix


We now note that the space F^m ⊗ F^n is "essentially the same" as M_{m,n}(F). To clarify what we mean by this, we recall that we can represent matrices in M_{m,n} via their coordinate vectors with respect to a given basis. In particular, if we use the standard basis E = {E_{1,1}, E_{1,2}, ..., E_{m,n}}, then this coordinate vector has a very special form—it just contains the entries of the matrix, read row-by-row (recall that E_{i,j} has a 1 in its (i, j)-entry and zeros elsewhere). This leads to the following definition:

Definition 3.1.2 (Vectorization). Suppose A ∈ M_{m,n}. The vectorization of A, denoted by vec(A), is the vector in F^{mn} that is obtained by reading the entries of A row-by-row.

We typically think of vectorization as the "most natural" isomorphism from M_{m,n}(F) to F^{mn}—we do not do anything "fancy" to transform the matrix into a vector, but rather we just read it as we see it. In the other direction, we define the matricization of a vector v ∈ F^{mn}, denoted by mat(v), to be the matrix whose vectorization is v (i.e., the linear transformation mat : F^{mn} → M_{m,n} is defined by mat = vec^{−1}). In other words, matricization is the operation that places the entries of a vector row-by-row into a matrix. (Some other books define vectorization as reading the entries of the matrix column-by-column instead of row-by-row, but that convention makes some of the upcoming formulas a bit uglier.) For example, in the m = n = 2 case we have

    vec( [ a  b ] ) = (a, b, c, d)        and        mat( (a, b, c, d) ) = [ a  b ]
         [ c  d ]                                                          [ c  d ].
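In NumPy, row-by-row vectorization and matricization are just reshape operations, as the following small illustration (not part of the text) shows:

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])
    v = A.reshape(-1)          # vec(A): read the entries row-by-row -> [1 2 3 4]
    print(v)
    print(v.reshape(2, 2))     # mat(v): put the entries back row-by-row, recovering A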
While vectorization is easy to work with when given an explicit matrix, in this "read a matrix row-by-row" form it is a bit difficult to prove things about it and work with it abstractly. The following theorem provides an alternate characterization of vectorization that is often more convenient.

Theorem 3.1.6 (Vectorization and the Kronecker Product). If v ∈ F^m and w ∈ F^n are column vectors then vec(vw^T) = v ⊗ w.

(Yes, we really want w^T, not w^*, in this theorem, even if F = C. The reason is that vectorization is linear, like transposition, not conjugate linear.)

Proof. One way to see why this holds is to compute vw^T via block matrix multiplication as follows:

    vw^T = [ v_1 w^T ]
           [ v_2 w^T ]
           [   ...   ]
           [ v_m w^T ].

Since vec(vw^T) places the entries of vw^T into a vector one row at a time, we see that its first n entries are the same as those of v_1 w^T (which are the same as those of v_1 w), its next n entries are the same as those of v_2 w, and so on. But this is exactly the definition of v ⊗ w, so we are done. ∎

The above theorem could be flipped around to (equivalently) say that


mat(v ⊗ w) = vwT . From this characterization of the Kronecker product, it
is perhaps clearer why not every vector in Fm ⊗ Fn is an elementary tensor
v ⊗ w: those vectors correspond via matricization to the rank-1 matrices vwT ∈
Mm,n (F), and not every matrix has rank 1.
Vectorization is useful because it lets us think about matrices as vectors, and linear transformations acting on matrices as (larger) matrices themselves. This is nothing new—we can think of a linear transformation on any finite-dimensional vector space in terms of its standard matrix (look back at Section 1.2.1 if you need a refresher on how to represent linear transformations as matrices). However, in this particular case the exact details of the computation are carried out by vectorization and the Kronecker product, which makes them much simpler to actually work with in practice.

Theorem 3.1.7 (Vectorization of a Product). Suppose A ∈ M_{p,m} and B ∈ M_{r,n}. Then

    vec(AXB^T) = (A ⊗ B)vec(X)   for all X ∈ M_{m,n}.

In other words, this theorem tells us about the standard matrix of the linear transformation T_{A,B} : M_{m,n} → M_{p,r} defined by T_{A,B}(X) = AXB^T, where A ∈ M_{p,m} and B ∈ M_{r,n} are fixed. In particular, it says that the standard matrix of T_{A,B} (with respect to the standard basis E = {E_{1,1}, E_{1,2}, ..., E_{m,n}}) is simply [T_{A,B}] = A ⊗ B. (Recall that when working with standard matrices with respect to the standard basis E, we use the shorthand [T] = [T]_E.)
Proof of Theorem 3.1.7. We start by showing that if X = E_{i,j} for some i, j then vec(A E_{i,j} B^T) = (A ⊗ B)vec(E_{i,j}). To see this, note that E_{i,j} = e_i e_j^T, so using Theorem 3.1.6 twice tells us that

    vec(A E_{i,j} B^T) = vec(A e_i e_j^T B^T) = vec((A e_i)(B e_j)^T) = (A e_i) ⊗ (B e_j)
                       = (A ⊗ B)(e_i ⊗ e_j) = (A ⊗ B)vec(e_i e_j^T) = (A ⊗ B)vec(E_{i,j}).

If we then use the fact that we can write X = Σ_{i,j} x_{i,j} E_{i,j}, the result follows from the fact that vectorization is linear:

    vec(AXB^T) = vec( A (Σ_{i,j} x_{i,j} E_{i,j}) B^T ) = Σ_{i,j} x_{i,j} vec(A E_{i,j} B^T)
               = Σ_{i,j} x_{i,j} (A ⊗ B)vec(E_{i,j}) = (A ⊗ B)vec( Σ_{i,j} x_{i,j} E_{i,j} )
               = (A ⊗ B)vec(X),

as desired. (This proof technique is quite common when we want to show that a linear transformation acts in a certain way: we show that it acts that way on a basis, and then use linearity to show that it must do the same thing on the entire vector space.) ∎
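The identity of Theorem 3.1.7 is easy to verify numerically with this row-by-row convention (an illustration only, using random matrices):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((4, 2))    # A in M_{p,m}
    B = rng.standard_normal((5, 3))    # B in M_{r,n}
    X = rng.standard_normal((2, 3))    # X in M_{m,n}

    vec = lambda M: M.reshape(-1)      # row-by-row vectorization
    print(np.allclose(vec(A @ X @ B.T), np.kron(A, B) @ vec(X)))   # True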
The above theorem is nothing revolutionary, but it is useful because it
provides an explicit and concrete way of solving many problems that we have
(at least implicitly) encountered before. For example, suppose we had a fixed
matrix A ∈ Mn and we wanted to find all matrices X ∈ Mn that commute with
it. One way to do this would be to multiply out AX and XA, set entries of those
matrices equal to each other, and solve the resulting linear system. To make
the details of this linear system more explicit, however, we can notice that
AX = XA if and only if AX − XA = O, which (by taking the vectorization of
both sides of the equation and applying Theorem 3.1.7) is equivalent to

(A ⊗ I − I ⊗ AT )vec(X) = 0.

This is a linear system that we can solve “directly”, as we now illustrate with
an example.
 
Example 3.1.3 (Finding Matrices that Commute). Find all matrices X ∈ M_2 that commute with A = [1 1; 0 0].

Solution:
As noted above, one way to tackle this problem is to solve the linear system (A ⊗ I − I ⊗ A^T)vec(X) = 0. The coefficient matrix of this linear system is

    A ⊗ I − I ⊗ A^T = [ 1  0  1  0 ]   [ 1  0  0  0 ]   [  0  0   1  0 ]
                      [ 0  1  0  1 ] − [ 1  0  0  0 ] = [ −1  1   0  1 ]
                      [ 0  0  0  0 ]   [ 0  0  1  0 ]   [  0  0  −1  0 ]
                      [ 0  0  0  0 ]   [ 0  0  1  0 ]   [  0  0  −1  0 ].

To solve the corresponding linear system, we apply Gaussian elimination to it:

    [  0  0   1  0 | 0 ]                 [ 1  −1  0  −1 | 0 ]
    [ −1  1   0  1 | 0 ]  row reduce to  [ 0   0  1   0 | 0 ]
    [  0  0  −1  0 | 0 ]                 [ 0   0  0   0 | 0 ]
    [  0  0  −1  0 | 0 ]                 [ 0   0  0   0 | 0 ].

From here we can see that, if we label the entries of vec(X) as vec(X) = (x_1, x_2, x_3, x_4), then x_3 = 0 and −x_1 + x_2 + x_4 = 0, so x_1 = x_2 + x_4 (x_2 and x_4 are free). It follows that vec(X) = (x_2 + x_4, x_2, 0, x_4), so the matrices X that commute with A are exactly the ones of the form

    X = mat( (x_2 + x_4, x_2, 0, x_4) ) = [ x_2 + x_4   x_2 ]
                                          [     0       x_4 ],

where x_2, x_4 ∈ F are arbitrary.
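The same computation can be automated: the NumPy/SciPy sketch below (an illustration only) builds the coefficient matrix A ⊗ I − I ⊗ A^T and reads off a basis of its null space:

    import numpy as np
    from scipy.linalg import null_space

    A = np.array([[1.0, 1.0],
                  [0.0, 0.0]])
    M = np.kron(A, np.eye(2)) - np.kron(np.eye(2), A.T)
    basis = null_space(M)                 # columns span { vec(X) : AX = XA }
    for col in basis.T:
        X = col.reshape(2, 2)             # matricize each basis vector
        print(np.round(X, 3), np.allclose(A @ X, X @ A))   # each X commutes with A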

As another application of Theorem 3.1.7, we now start building towards


an explanation of exactly what the relationship between A ⊗ B and B ⊗ A is. In
particular, we will show that there is a specific matrix W with the property that
A ⊗ B = W (B ⊗ A)W T :

Definition 3.1.3 (Swap Matrix). Given positive integers m and n, the swap matrix W_{m,n} ∈ M_{mn} is the matrix defined in any of the three following (equivalent) ways:
a) W_{m,n} = [T], the standard matrix of the transposition map T : M_{m,n} → M_{n,m} with respect to the standard basis E,
b) W_{m,n}(e_i ⊗ e_j) = e_j ⊗ e_i for all 1 ≤ i ≤ m, 1 ≤ j ≤ n, and
c) W_{m,n} is the block matrix

    W_{m,n} = [ E_{1,1}   E_{2,1}   · · ·   E_{m,1} ]
              [ E_{1,2}   E_{2,2}   · · ·   E_{m,2} ]
              [   ...       ...     ...       ...   ]
              [ E_{1,n}   E_{2,n}   · · ·   E_{m,n} ].

If the dimensions m and n are clear from context or irrelevant, we denote this matrix simply by W. (The entries of W_{m,n} are all just 0 or 1, and this result works over any field F. If F = R or F = C then W_{m,n} is unitary.)

For example, we already showed in Example 1.2.10 that the standard matrix of the transpose map T : M_2 → M_2 is

    [T] = [ 1  0  0  0 ]
          [ 0  0  1  0 ]
          [ 0  1  0  0 ]
          [ 0  0  0  1 ],

so this is the swap matrix W_{2,2}. We can check that it satisfies definition (b) by directly computing each of W_{2,2}(e_1 ⊗ e_1), W_{2,2}(e_1 ⊗ e_2), W_{2,2}(e_2 ⊗ e_1), and W_{2,2}(e_2 ⊗ e_2). Similarly, it also satisfies definition (c) since we can write it as the block matrix

    W_{2,2} = [ E_{1,1}   E_{2,1} ] = [ 1  0  0  0 ]
              [ E_{1,2}   E_{2,2} ]   [ 0  0  1  0 ]
                                      [ 0  1  0  0 ]
                                      [ 0  0  0  1 ].

(We use W, instead of S, to denote the swap matrix because the letter "S" is going to become overloaded later in this section. We can think of "W" as standing for "sWap".)
To see that these three definitions agree in general (i.e., when m and n do not necessarily both equal 2), we first compute

    [T] = [ [E_{1,1}^T]_E | [E_{1,2}^T]_E | · · · | [E_{m,n}^T]_E ]      (definition of [T])
        = [ vec(E_{1,1}) | vec(E_{2,1}) | · · · | vec(E_{n,m}) ]         (definition of vec)
        = [ vec(e_1 e_1^T) | vec(e_2 e_1^T) | · · · | vec(e_n e_m^T) ]   (E_{i,j} = e_i e_j^T for all i, j)
        = [ e_1 ⊗ e_1 | e_2 ⊗ e_1 | · · · | e_n ⊗ e_m ].                 (Theorem 3.1.6)

The equivalence of definitions (a) and (b) of the swap matrix follows fairly quickly now, since multiplying a matrix by e_i ⊗ e_j just results in one of the columns of that matrix, and in this case we will have [T](e_i ⊗ e_j) = e_j ⊗ e_i. Similarly, the equivalence of definitions (a) and (c) follows just by explicitly writing out what the columns of the block matrix (c) are: they are exactly e_1 ⊗ e_1, e_2 ⊗ e_1, ..., e_n ⊗ e_m, which we just showed are also the columns of [T]. (Some books call the swap matrix the commutation matrix and denote it by K_{m,n}.)

Example 3.1.4 (Constructing Swap Matrices). Construct the swap matrix W_{m,n} ∈ M_{mn} when
a) m = 2 and n = 3, and when
b) m = n = 3.

Solutions:
a) We could use any of the three defining characterizations of the swap matrix to construct it, but the easiest one to use is likely characterization (c):

    W_{2,3} = [ E_{1,1}  E_{2,1} ] = [ 1  ·  ·  ·  ·  · ]
              [ E_{1,2}  E_{2,2} ]   [ ·  ·  ·  1  ·  · ]
              [ E_{1,3}  E_{2,3} ]   [ ·  1  ·  ·  ·  · ]
                                     [ ·  ·  ·  ·  1  · ]
                                     [ ·  ·  1  ·  ·  · ]
                                     [ ·  ·  ·  ·  ·  1 ]

(Note that swap matrices are always square, even if m ≠ n. Here we use dots (·) instead of zeros for ease of visualization.)

b) Again, we use characterization (c) to construct this swap matrix:

    W_{3,3} = [ E_{1,1}  E_{2,1}  E_{3,1} ] = [ 1  ·  ·  ·  ·  ·  ·  ·  · ]
              [ E_{1,2}  E_{2,2}  E_{3,2} ]   [ ·  ·  ·  1  ·  ·  ·  ·  · ]
              [ E_{1,3}  E_{2,3}  E_{3,3} ]   [ ·  ·  ·  ·  ·  ·  1  ·  · ]
                                              [ ·  1  ·  ·  ·  ·  ·  ·  · ]
                                              [ ·  ·  ·  ·  1  ·  ·  ·  · ]
                                              [ ·  ·  ·  ·  ·  ·  ·  1  · ]
                                              [ ·  ·  1  ·  ·  ·  ·  ·  · ]
                                              [ ·  ·  ·  ·  ·  1  ·  ·  · ]
                                              [ ·  ·  ·  ·  ·  ·  ·  ·  1 ]
The swap matrix W_{m,n} has some very nice properties, which we prove in Exercise 3.1.18. In particular, every row and column has a single non-zero entry (equal to 1), if F = R or F = C then it is unitary, and if m = n then it is symmetric. If the dimensions m and n are clear from context or irrelevant, we just denote this matrix by W for simplicity. (Matrices, like the swap matrix, with a single 1 in each row and column, and zeros elsewhere, are called permutation matrices. Compare this with the signed or complex permutation matrices from Theorem 1.D.10.)
The name "swap matrix" comes from the fact that it swaps the two factors in any Kronecker product: W(v ⊗ w) = w ⊗ v for all v ∈ F^m and w ∈ F^n. To see this, just write each of v and w as linear combinations of the standard basis vectors (v = Σ_i v_i e_i and w = Σ_j w_j e_j) and then use characterization (b) of swap matrices:

    W(v ⊗ w) = W( (Σ_i v_i e_i) ⊗ (Σ_j w_j e_j) ) = Σ_{i,j} v_i w_j W(e_i ⊗ e_j)
             = Σ_{i,j} v_i w_j (e_j ⊗ e_i) = (Σ_j w_j e_j) ⊗ (Σ_i v_i e_i) = w ⊗ v.

More generally, the following theorem shows that swap matrices also solve
exactly the problem that we introduced them to solve—they can be used to
transform A ⊗ B into B ⊗ A:

Theorem 3.1.8 (Almost-Commutativity of the Kronecker Product). Suppose A ∈ M_{m,n} and B ∈ M_{p,q}. Then B ⊗ A = W_{m,p}(A ⊗ B)W_{n,q}^T.

Proof. Notice that if we write A and B in terms of their columns as A = [ a_1 | a_2 | · · · | a_n ] and B = [ b_1 | b_2 | · · · | b_q ], respectively, then

    A = Σ_{i=1}^n a_i e_i^T    and    B = Σ_{j=1}^q b_j e_j^T.

(Alternatively, we could use Theorem A.1.3 here to write A and B as a sum of rank-1 matrices.) Then we can use the fact that W(a ⊗ b) = b ⊗ a to see that

    W_{m,p}(A ⊗ B)W_{n,q}^T = W_{m,p}( (Σ_{i=1}^n a_i e_i^T) ⊗ (Σ_{j=1}^q b_j e_j^T) )W_{n,q}^T
                            = Σ_{i=1}^n Σ_{j=1}^q W_{m,p}(a_i ⊗ b_j)(e_i ⊗ e_j)^T W_{n,q}^T
                            = Σ_{i=1}^n Σ_{j=1}^q (b_j ⊗ a_i)(e_j ⊗ e_i)^T
                            = (Σ_{j=1}^q b_j e_j^T) ⊗ (Σ_{i=1}^n a_i e_i^T)
                            = B ⊗ A,

which completes the proof. (Throughout this theorem, W_{m,p} and W_{n,q} are swap matrices.) ∎


In the special case when m = n and p = q (i.e., A and B are each square), we have W_{n,q} = W_{m,p}, and since these matrices are real and unitary we furthermore have W_{n,q}^T = W_{m,p}^{−1}, which establishes the following corollary:

Corollary 3.1.9 (Unitary Similarity of Kronecker Products). Suppose F = R or F = C, and A ∈ M_m(F) and B ∈ M_n(F). Then A ⊗ B and B ⊗ A are unitarily similar.

In particular, this corollary tells us that if A and B are square, then A ⊗ B and B ⊗ A share all similarity-invariant properties, like their rank, trace, determinant, eigenvalues, and characteristic polynomial (though this claim is not true in general if A and B are not square, even if A ⊗ B is—see Exercise 3.1.4).
 
Example 3.1.5 (Swapping a Kronecker Product). Suppose A, B ∈ M_2 satisfy

    A ⊗ B = [ 2  1  4  2 ]
            [ 0  1  0  2 ]
            [ 6  3  8  4 ]
            [ 0  3  0  4 ].

Compute B ⊗ A.

Solution:
We know from Theorem 3.1.8 that B ⊗ A = W(A ⊗ B)W^T, where

    W = [ 1  0  0  0 ]
        [ 0  0  1  0 ]
        [ 0  1  0  0 ]
        [ 0  0  0  1 ]

is the swap matrix. We thus just need to perform the indicated matrix multiplications:

    B ⊗ A = W(A ⊗ B)W^T = [ 1  0  0  0 ] [ 2  1  4  2 ] [ 1  0  0  0 ]   [ 2  4  1  2 ]
                          [ 0  0  1  0 ] [ 0  1  0  2 ] [ 0  0  1  0 ] = [ 6  8  3  4 ]
                          [ 0  1  0  0 ] [ 6  3  8  4 ] [ 0  1  0  0 ]   [ 0  0  1  2 ]
                          [ 0  0  0  1 ] [ 0  3  0  4 ] [ 0  0  0  1 ]   [ 0  0  3  4 ].

Note that this agrees with Example 3.1.1, where we computed each of A ⊗ B and B ⊗ A explicitly.
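Swap matrices are also easy to generate programmatically from characterization (b). The sketch below (an illustration only; the helper name swap_matrix is ours) constructs W_{m,p} and W_{n,q} and verifies Theorem 3.1.8 on random rectangular matrices:

    import numpy as np

    def swap_matrix(m, n):
        # W_{m,n}: the mn x mn permutation matrix with W(e_i ⊗ e_j) = e_j ⊗ e_i.
        W = np.zeros((m * n, m * n), dtype=int)
        for i in range(m):
            for j in range(n):
                W[j * m + i, i * n + j] = 1
        return W

    rng = np.random.default_rng(3)
    A, B = rng.standard_normal((2, 3)), rng.standard_normal((4, 5))   # A in M_{m,n}, B in M_{p,q}
    lhs = swap_matrix(2, 4) @ np.kron(A, B) @ swap_matrix(3, 5).T     # W_{m,p} (A ⊗ B) W_{n,q}^T
    print(np.allclose(lhs, np.kron(B, A)))                            # True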

3.1.3 The Symmetric and Antisymmetric Subspaces


In the previous section, we introduced the swap matrix W ∈ M_{n^2} as the (unique) matrix with the property that W(v ⊗ w) = w ⊗ v for all v, w ∈ F^n. However, since we can take the Kronecker product of three or more vectors as well, we can similarly discuss matrices that permute any combination of Kronecker product factors. For example, there is a unique matrix W_{231} ∈ M_{n^3} with the property that

    W_{231}(v ⊗ w ⊗ x) = w ⊗ x ⊗ v    for all v, w, x ∈ F^n,    (3.1.1)

and it can be constructed explicitly by requiring that W_{231}(e_i ⊗ e_j ⊗ e_k) = e_j ⊗ e_k ⊗ e_i for all 1 ≤ i, j, k ≤ n. (Recall from Theorem 3.1.1(a) that (v ⊗ w) ⊗ x = v ⊗ (w ⊗ x), so it is okay to omit parentheses in expressions like v ⊗ w ⊗ x.) For example, in the n = 2 case the swap matrix W_{231} has the form

    W_{231} = [ 1  ·  ·  ·  ·  ·  ·  · ]
              [ ·  ·  ·  ·  1  ·  ·  · ]
              [ ·  1  ·  ·  ·  ·  ·  · ]
              [ ·  ·  ·  ·  ·  1  ·  · ]
              [ ·  ·  1  ·  ·  ·  ·  · ]
              [ ·  ·  ·  ·  ·  ·  1  · ]
              [ ·  ·  ·  1  ·  ·  ·  · ]
              [ ·  ·  ·  ·  ·  ·  ·  1 ]

(here, as before, dots denote zero entries).
More generally, we can consider swap matrices acting on the Kronecker product of any number
p of vectors. If we fix a permutation σ : {1, 2, . . . , p} → {1, 2, . . . , p} (i.e., a
function for which σ(i) = σ(j) if and only if i = j), then we define W_σ ∈ M_{n^p} to be the
(unique) matrix with the property that

    W_σ(v_1 ⊗ v_2 ⊗ · · · ⊗ v_p) = v_{σ(1)} ⊗ v_{σ(2)} ⊗ · · · ⊗ v_{σ(p)}

for all v_1, v_2, . . . , v_p ∈ F^n. (These swap matrices can also be defined when
v_1, v_2, . . . , v_p have different dimensionalities, but for the sake of simplicity we only
consider the case when they are all n-dimensional here.) From this definition it should be
clear that if σ and τ are any two permutations then W_σ W_τ = W_{σ∘τ}.

We typically denote permutations by their one-line notation, which means we list them simply
by a string of digits such that, for each 1 ≤ j ≤ p, the j-th digit tells us the value of
σ(j). For example, the one-line notation 231 corresponds to the permutation
σ : {1, 2, 3} → {1, 2, 3} for which σ(1) = 2, σ(2) = 3, and σ(3) = 1, and thus W_{231} is
exactly the swap matrix described by Equation (3.1.1).
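
The explicit construction of W_σ on standard basis vectors is easy to automate. The sketch
below (ours, not the book's; the helper name swap_matrix_sigma and the 0-indexed permutation
convention are our own) builds W_σ for an arbitrary permutation and checks the defining
property for σ = 231.

    import numpy as np
    from itertools import product

    def swap_matrix_sigma(sigma, n):
        # W_sigma acting on (F^n)^{⊗p}, where sigma is a 0-indexed permutation of length p.
        p = len(sigma)
        dim = n ** p
        W = np.zeros((dim, dim))
        for idx in product(range(n), repeat=p):
            col = sum(i * n ** (p - 1 - k) for k, i in enumerate(idx))
            # e_{i_1} ⊗ ... ⊗ e_{i_p} is sent to e_{i_sigma(1)} ⊗ ... ⊗ e_{i_sigma(p)}.
            permuted = tuple(idx[sigma[k]] for k in range(p))
            row = sum(i * n ** (p - 1 - k) for k, i in enumerate(permuted))
            W[row, col] = 1
        return W

    # The one-line notation 231 from the text becomes (1, 2, 0) when 0-indexed.
    W231 = swap_matrix_sigma((1, 2, 0), 2)
    v, w, x = np.random.rand(2), np.random.rand(2), np.random.rand(2)
    vwx = np.kron(np.kron(v, w), x)
    print(np.allclose(W231 @ vwx, np.kron(np.kron(w, x), v)))    # True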
In general, there are p! permutations acting on {1, 2, . . . , p}, and we denote the set
consisting of all of these permutations by S_p (this set is typically called the symmetric
group). Recall that p! = p(p − 1) · · · 3 · 2 · 1, so there are exactly p! swap matrices that
permute the Kronecker factors of vectors like v_1 ⊗ v_2 ⊗ · · · ⊗ v_p as well. In the p = 2
case, there are p! = 2 such swap matrices: the matrix that we called "the" swap matrix back
in Section 3.1.2, which corresponds to the permutation σ = 21, and the identity matrix,
which corresponds to the identity permutation σ = 12. (Some background material on
permutations is provided in Appendix A.1.5.)

In the p > 2 case, these more general swap matrices retain some of the nice properties of
"the" swap matrix from the p = 2 case, but lose others. In particular, they are still
permutation matrices and thus unitary, but they are no longer symmetric in general (after
all, even W_{231} is not symmetric).
The Symmetric Subspace
We now introduce one particularly important subspace that arises somewhat
naturally from the Kronecker product—the subspace of vectors that remain
unchanged when their Kronecker factors are permuted.

Definition 3.1.4 (The Symmetric Subspace)
Suppose n, p ≥ 1 are integers. The symmetric subspace S_n^p is the subspace of
(F^n)^{⊗p} consisting of vectors that are unchanged by swap matrices:

    S_n^p := { v ∈ (F^n)^{⊗p} : W_σ v = v for all σ ∈ S_p }.

In the p = 2 case, the symmetric subspace is actually quite familiar. Since


the swap matrix W is the standard matrix of the transpose map, we see that the
equation W v = v is equivalent to mat(v)T = mat(v). That is, the symmetric
subspace Sn2 is isomorphic to the set of n × n symmetric matrices MSn via
matricization.
We can thus think of the symmetric subspace Snp as a natural generalization
of the set MSn of symmetric matrices. We remind ourselves of some of the
properties of MSn here, as well as the corresponding properties of Sn2 that they
imply:
Properties of M_n^S:
    basis:      {E_{j,j} : 1 ≤ j ≤ n} ∪ {E_{i,j} + E_{j,i} : 1 ≤ i < j ≤ n}
    dimension:  \binom{n+1}{2} = n(n + 1)/2

Properties of S_n^2:
    basis:      {e_j ⊗ e_j : 1 ≤ j ≤ n} ∪ {e_i ⊗ e_j + e_j ⊗ e_i : 1 ≤ i < j ≤ n}
    dimension:  \binom{n+1}{2} = n(n + 1)/2

(There are many other bases of M_n^S too, but the one shown here is particularly simple.)
For example, the members of S_2^2 are the vectors of the form (a, b, b, c), which are
isomorphic via matricization to the 2 × 2 (symmetric) matrices of the form

    [ a  b ]
    [ b  c ] .
The following theorem generalizes these properties to higher values of p.

Theorem 3.1.10 (Properties of the Symmetric Subspace)
The symmetric subspace S_n^p ⊆ (F^n)^{⊗p} has the following properties:
  a) one projection onto S_n^p is given by (1/p!) ∑_{σ ∈ S_p} W_σ,
  b) dim(S_n^p) = \binom{n+p-1}{p}, and
  c) the following set is a basis of S_n^p:

     { ∑_{σ ∈ S_p} W_σ(e_{j_1} ⊗ e_{j_2} ⊗ · · · ⊗ e_{j_p}) : 1 ≤ j_1 ≤ j_2 ≤ · · · ≤ j_p ≤ n }.

Furthermore, if F = R or F = C then the projection in part (a) and the basis in part (c)
are each orthogonal. (When we say "orthogonal" here, we mean with respect to the usual dot
product on (F^n)^{⊗p}.)

Proof. We begin by proving property (a). Define P = (1/p!) ∑_{σ ∈ S_p} W_σ to be the
proposed projection onto S_n^p. It is straightforward to check that P^2 = P = P^T, so P is
an (orthogonal, if F = R or F = C) projection onto some subspace of (F^n)^{⊗p}; we leave
the proof of those statements to Exercise 3.1.15.
It thus now suffices to show that range(P) = S_n^p. To this end, first notice that for all
τ ∈ S_p we have

    W_τ P = (1/p!) ∑_{σ ∈ S_p} W_τ W_σ = (1/p!) ∑_{σ ∈ S_p} W_{τ∘σ} = P,

with the final equality following from the fact that every permutation in S_p can be
written in the form τ ∘ σ for some σ ∈ S_p. (The fact that composing all permutations by τ
gives the set of all permutations follows from the fact that S_p is a group, i.e., every
permutation is invertible: to write a particular permutation ρ ∈ S_p as ρ = τ ∘ σ, just
choose σ = τ^{-1} ∘ ρ.) It follows that everything in range(P) is unchanged by W_τ (for
all τ ∈ S_p), so range(P) ⊆ S_n^p.

To prove the opposite inclusion, we just notice that if v ∈ S_n^p then

    Pv = (1/p!) ∑_{σ ∈ S_p} W_σ v = (1/p!) ∑_{σ ∈ S_p} v = v,

so v ∈ range(P) and thus S_n^p ⊆ range(P). Since we already proved the opposite inclusion,
it follows that range(P) = S_n^p, so P is a projection onto S_n^p as claimed.
To prove property (c) (we will prove property (b) shortly), we first notice that the
columns of the projection P from part (a) have the form

    P(e_{j_1} ⊗ e_{j_2} ⊗ · · · ⊗ e_{j_p}) = (1/p!) ∑_{σ ∈ S_p} W_σ(e_{j_1} ⊗ e_{j_2} ⊗ · · · ⊗ e_{j_p}),

where 1 ≤ j1 , j2 , . . . , j p ≤ n. To turn this set of vectors into a basis of range(P) =


Snp , we omit the columns that are equal to each other by only considering the
columns for which 1 ≤ j1 ≤ j2 ≤ · · · ≤ j p ≤ n. If F = R or F = C (so we
have an inner product to work with) then these remaining vectors are mutually
orthogonal and thus form an orthogonal basis of range(P), and otherwise they
are linearly independent (and thus form a basis of range(P)) since the coordi-
nates of their non-zero entries in the standard basis form disjoint subsets of
{1, 2, . . . , n p }. If we multiply these vectors each by p! then they form the basis
described in the statement of the theorem.

To demonstrate property (b), we simply notice that the basis from part (c) of the theorem
contains as many vectors as there are multisets {j_1, j_2, . . . , j_p} ⊆ {1, 2, . . . , n}.
(A multiset is just a set in which repetition is allowed, like {1, 2, 2, 3}. Order does not
matter in multisets, just like in regular sets.) A standard combinatorics result says that
there are exactly

    \binom{n+p-1}{p} = (n + p − 1)! / (p! (n − 1)!)

such multisets (see Remark 3.1.2), which completes the proof. ∎

Remark 3.1.2 (Counting Multisets)
We now illustrate why there are exactly

    \binom{n+p-1}{p} = (n + p − 1)! / (p! (n − 1)!)

p-element multisets with entries chosen from an n-element set (a fact that we made use of
at the end of the proof of Theorem 3.1.10). We represent each multiset graphically via
"stars and bars", where p stars represent the members of a multiset and n − 1 bars separate
the values of those stars. For example, in the n = 5, p = 6 case, the multisets
{1, 2, 3, 3, 5, 5} and {1, 1, 1, 2, 4, 4} would be represented by the stars and bars
arrangements

    ∗ | ∗ | ∗∗ | | ∗∗    and    ∗∗∗ | ∗ | | ∗∗ | ,

respectively. (It is okay for bars to be located at the start or end, and it is also okay
for multiple bars to appear consecutively.)

Notice that there are a total of n + p − 1 positions in such an arrangement of stars and
bars (p positions for the stars and n − 1 positions for the bars), and each arrangement is
completely determined by the positions that we choose for the p stars. It follows that
there are \binom{n+p-1}{p} such configurations of stars and bars, and thus exactly that
many multisets of size p chosen from a set of size n, as claimed.
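
As a numerical sanity check of parts (a) and (b) of Theorem 3.1.10, the following sketch
(ours, not part of the text; the helper name symmetric_projection is our own) builds the
projection P = (1/p!) ∑_σ W_σ column-by-column from its action on standard basis tensors
and confirms that it is an orthogonal projection of rank \binom{n+p-1}{p}.

    import numpy as np
    from itertools import permutations, product
    from math import comb, factorial

    def symmetric_projection(n, p):
        # Build P = (1/p!) * sum over sigma of W_sigma, one column per basis tensor.
        dim = n ** p
        P = np.zeros((dim, dim))
        for idx in product(range(n), repeat=p):                 # column e_{j1} ⊗ ... ⊗ e_{jp}
            col = sum(j * n ** (p - 1 - k) for k, j in enumerate(idx))
            for sigma in permutations(range(p)):                # add W_sigma(e_{j1} ⊗ ... ⊗ e_{jp})
                permuted = tuple(idx[sigma[k]] for k in range(p))
                row = sum(j * n ** (p - 1 - k) for k, j in enumerate(permuted))
                P[row, col] += 1
        return P / factorial(p)

    n, p = 2, 3
    P = symmetric_projection(n, p)
    print(np.allclose(P @ P, P), np.allclose(P, P.T))     # True True (an orthogonal projection)
    print(np.linalg.matrix_rank(P), comb(n + p - 1, p))   # 4 4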

The orthogonal basis vectors from part (c) of Theorem 3.1.10 do not form an orthonormal
basis because they are not properly normalized. For example, in the n = 2, p = 3 case, we
have dim(S_2^3) = \binom{4}{3} = 4 and the basis of S_2^3 described by the theorem consists
of the following 4 vectors (this tells us that, in the n = 2, p = 3 case, S_2^3 consists of
the vectors of the form (a, b, b, c, b, c, c, d)):

    Basis Vector                                                   Tuple (j_1, j_2, j_3)
    6 e_1 ⊗ e_1 ⊗ e_1                                              (1, 1, 1)
    2(e_1 ⊗ e_1 ⊗ e_2 + e_1 ⊗ e_2 ⊗ e_1 + e_2 ⊗ e_1 ⊗ e_1)         (1, 1, 2)
    2(e_1 ⊗ e_2 ⊗ e_2 + e_2 ⊗ e_1 ⊗ e_2 + e_2 ⊗ e_2 ⊗ e_1)         (1, 2, 2)
    6 e_2 ⊗ e_2 ⊗ e_2                                              (2, 2, 2)
In order to turn this basis into an orthonormal one, we must divide each vector in it by
√(p! m_1! m_2! · · · m_n!), where m_j denotes the multiplicity of j in the corresponding
tuple (j_1, j_2, . . . , j_p). For example, for the basis vector
2(e_1 ⊗ e_1 ⊗ e_2 + e_1 ⊗ e_2 ⊗ e_1 + e_2 ⊗ e_1 ⊗ e_1) corresponding to the tuple (1, 1, 2)
above, we have m_1 = 2 and m_2 = 1, so we divide that vector by
√(p! m_1! m_2!) = √(3! 2! 1!) = 2√3 to normalize it:

    (1/√3)(e_1 ⊗ e_1 ⊗ e_2 + e_1 ⊗ e_2 ⊗ e_1 + e_2 ⊗ e_1 ⊗ e_1).
We close our discussion of the symmetric subspace by showing that it could
be defined in another (equivalent) way—as the span of Kronecker powers of
vectors in Fn (as long as F = R or F = C).

Theorem 3.1.11 (Tensor-Power Basis of the Symmetric Subspace)
Suppose F = R or F = C. The symmetric subspace S_n^p ⊆ (F^n)^{⊗p} is the span of
Kronecker powers of vectors:

    S_n^p = span{ v^{⊗p} : v ∈ F^n }.

However, a proof of this theorem requires some technicalities that we have not yet
developed, so we defer it to Section 3.B.1. Furthermore, we need to be careful to keep in
mind that this theorem does not hold over some other fields; see Exercise 3.1.13. (Recall
that v^{⊗p} = v ⊗ · · · ⊗ v, where there are p copies of v on the right-hand side.)

For now, we just note that in the p = 2 case it says (via the usual isomorphism between
S_n^2 and M_n^S) that every symmetric matrix can be written as a linear combination of
rank-1 symmetric matrices. This fact is not completely obvious, as the most natural basis
of M_n^S contains rank-2 matrices. For example, in the n = 3 case the "typical" basis of
M_3^S consists of the following 6 matrices:
           
    [ 1 0 0 ]   [ 0 0 0 ]   [ 0 0 0 ]   [ 0 1 0 ]   [ 0 0 1 ]   [ 0 0 0 ]
    [ 0 0 0 ] , [ 0 1 0 ] , [ 0 0 0 ] , [ 1 0 0 ] , [ 0 0 0 ] , [ 0 0 1 ] ,
    [ 0 0 0 ]   [ 0 0 0 ]   [ 0 0 1 ]   [ 0 0 0 ]   [ 1 0 0 ]   [ 0 1 0 ]

where each of the last three has rank 2. In order to turn this basis into one consisting
only of rank-1 symmetric matrices, we need to make it slightly uglier so that the non-zero
entries of the basis matrices overlap somewhat (refer back to Exercise 1.2.8 to see a
generalization of this basis to larger values of n):

    [ 1 0 0 ]   [ 0 0 0 ]   [ 0 0 0 ]   [ 1 1 0 ]   [ 1 0 1 ]   [ 0 0 0 ]
    [ 0 0 0 ] , [ 0 1 0 ] , [ 0 0 0 ] , [ 1 1 0 ] , [ 0 0 0 ] , [ 0 1 1 ]
    [ 0 0 0 ]   [ 0 0 0 ]   [ 0 0 1 ]   [ 0 0 0 ]   [ 1 0 1 ]   [ 0 1 1 ]

Remark 3.1.3 (The Spectral Decomposition in the Symmetric Subspace)
Another way to see that symmetric matrices can be written as a linear combination of
symmetric rank-1 matrices is to make use of the real spectral decomposition
(Theorem 2.1.6, if F = R) or the Takagi factorization (Exercise 2.3.26, if F = C). In
particular, if F = R and A ∈ M_n^S has {u_1, u_2, . . . , u_n} as an orthonormal basis of
eigenvectors with corresponding eigenvalues λ_1, λ_2, . . . , λ_n, then

    A = ∑_{j=1}^{n} λ_j u_j u_j^T

is one way of writing A in the desired form. If we trace things back through the
isomorphism between S_n^2 and M_n^S then this shows in the p = 2 case that we can write
every vector v ∈ S_n^2 in the form

    v = ∑_{j=1}^{n} λ_j u_j ⊗ u_j,

and a similar argument works when F = C if we use the Takagi factorization instead. (When
F = C, a symmetric (not Hermitian!) matrix might not be normal, so the complex spectral
decomposition might not apply to it, which is why we must use the Takagi factorization.)

Notice that what we have shown here is stronger than the statement of Theorem 3.1.11 (in
the p = 2 case), which does not require the set {u_1, u_2, . . . , u_n} to be orthogonal.
Indeed, when p ≥ 3, this stronger claim

is no longer true—not only do we lose orthogonality, but we even lose lin-


ear independence in general! For example, we will show in Example 3.3.4
that the vector v = (0, 1, 1, 0, 1, 0, 0, 0) ∈ S23 cannot be written in the form

v = λ1 v1 ⊗ v1 ⊗ v1 + λ2 v2 ⊗ v2 ⊗ v2

for any choice of λ1 , λ2 ∈ R, and v1 , v2 ∈ R2 . Instead, at least 3 terms are


needed in such a sum, and since each v j is 2-dimensional, they cannot be
chosen to be linearly independent.

The Antisymmetric Subspace


Just like the symmetric subspace can be thought of as a natural generalization
of the set of symmetric matrices, there is also a natural generalization of the set
of skew-symmetric matrices to higher Kronecker powers.

Definition 3.1.5 (The Antisymmetric Subspace)
Suppose n, p ≥ 1 are integers. The antisymmetric subspace A_n^p is the following subspace
of (F^n)^{⊗p}:

    A_n^p := { v ∈ (F^n)^{⊗p} : W_σ v = sgn(σ)v for all σ ∈ S_p }.

As suggested above, in the p = 2 case the antisymmetric subspace is isomorphic to the set
M_n^{sS} of skew-symmetric matrices via matricization, since Wv = −v if and only if
mat(v)^T = −mat(v). (Recall that the sign sgn(σ) of a permutation σ is (−1) raised to the
number of transpositions needed to generate it; see Appendix A.1.5.) With this in mind, we
now remind ourselves of some of the properties of M_n^{sS} here, as well as the
corresponding properties of A_n^2 that they imply:

Properties of M_n^{sS}:
    basis:      {E_{i,j} − E_{j,i} : 1 ≤ i < j ≤ n}
    dimension:  \binom{n}{2} = n(n − 1)/2

Properties of A_n^2:
    basis:      {e_i ⊗ e_j − e_j ⊗ e_i : 1 ≤ i < j ≤ n}
    dimension:  \binom{n}{2} = n(n − 1)/2

For example, the members of A_2^2 are the vectors of the form (0, a, −a, 0), which are
isomorphic via matricization to the 2 × 2 (skew-symmetric) matrices of the form

    [  0  a ]
    [ −a  0 ] .
The following theorem generalizes these properties to higher values of p.

Theorem 3.1.12 (Properties of the Antisymmetric Subspace)
The antisymmetric subspace A_n^p ⊆ (F^n)^{⊗p} has the following properties:
  a) one projection onto A_n^p is given by (1/p!) ∑_{σ ∈ S_p} sgn(σ)W_σ,
  b) dim(A_n^p) = \binom{n}{p}, and
  c) the following set is a basis of A_n^p:

     { ∑_{σ ∈ S_p} sgn(σ)W_σ(e_{j_1} ⊗ · · · ⊗ e_{j_p}) : 1 ≤ j_1 < · · · < j_p ≤ n }.

Furthermore, if F = R or F = C then the projection in part (a) and the basis in part (c)
are each orthogonal.

We leave the proof of this theorem to Exercise 3.1.24, as it is almost identical to that of
Theorem 3.1.10, except the details work out more cleanly here. For example, to normalize
the basis vectors in part (c) of this theorem to make an orthonormal basis of A_n^p, we
just need to divide each of them by √(p!) (whereas the normalization factor was somewhat
more complicated for the orthogonal basis of the symmetric subspace).

To get a bit of a feel for what the antisymmetric subspace looks like when p > 2, we
consider the p = 3, n = 4 case, where the basis described by the above theorem consists of
the following \binom{4}{3} = 4 vectors:

Basis Vector Tuple (j1 , j2 , j3 )


e1 ⊗ e2 ⊗ e3 + e2 ⊗ e3 ⊗ e1 + e3 ⊗ e1 ⊗ e2 (1, 2, 3)
− e1 ⊗ e3 ⊗ e2 − e2 ⊗ e1 ⊗ e3 − e3 ⊗ e2 ⊗ e1
e1 ⊗ e2 ⊗ e4 + e2 ⊗ e4 ⊗ e1 + e4 ⊗ e1 ⊗ e2 (1, 2, 4)
− e1 ⊗ e4 ⊗ e2 − e2 ⊗ e1 ⊗ e4 − e4 ⊗ e2 ⊗ e1
e1 ⊗ e3 ⊗ e4 + e3 ⊗ e4 ⊗ e1 + e4 ⊗ e1 ⊗ e3 (1, 3, 4)
− e1 ⊗ e4 ⊗ e3 − e3 ⊗ e1 ⊗ e4 − e4 ⊗ e3 ⊗ e1
e2 ⊗ e3 ⊗ e4 + e3 ⊗ e4 ⊗ e2 + e4 ⊗ e2 ⊗ e3 (2, 3, 4)
− e2 ⊗ e4 ⊗ e3 − e3 ⊗ e2 ⊗ e4 − e4 ⊗ e3 ⊗ e2
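
The analogous numerical check for Theorem 3.1.12 (again ours, not part of the text, with
our own helper names) builds the projection (1/p!) ∑_σ sgn(σ)W_σ and confirms that its
rank equals \binom{n}{p}; in particular it is the zero matrix when p > n.

    import numpy as np
    from itertools import permutations, product
    from math import comb, factorial

    def sign(sigma):
        # Sign of a permutation (given as a tuple), computed by counting inversions.
        inv = sum(1 for i in range(len(sigma)) for j in range(i + 1, len(sigma))
                  if sigma[i] > sigma[j])
        return -1 if inv % 2 else 1

    def antisymmetric_projection(n, p):
        dim = n ** p
        P = np.zeros((dim, dim))
        for idx in product(range(n), repeat=p):
            col = sum(j * n ** (p - 1 - k) for k, j in enumerate(idx))
            for sigma in permutations(range(p)):
                permuted = tuple(idx[sigma[k]] for k in range(p))
                row = sum(j * n ** (p - 1 - k) for k, j in enumerate(permuted))
                P[row, col] += sign(sigma)
        return P / factorial(p)

    for n, p in [(4, 3), (2, 3)]:
        P = antisymmetric_projection(n, p)
        print(np.linalg.matrix_rank(P), comb(n, p))    # prints "4 4" and then "0 0"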

The symmetric and antisymmetric subspaces are always orthogonal to each other (as long as
F = R or F = C) in the sense that if v ∈ S_n^p and w ∈ A_n^p then v · w = 0 (see
Exercise 3.1.14). Furthermore, if p = 2 then in fact they are orthogonal complements of
each other:

    (S_n^2)^⊥ = A_n^2,   or equivalently their direct sum is   S_n^2 ⊕ A_n^2 = F^n ⊗ F^n,

which can be verified just by observing that their dimensions satisfy
\binom{n+1}{2} + \binom{n}{2} = n^2. In fact, this direct sum decomposition of F^n ⊗ F^n is
completely analogous (isomorphic?) to the fact that M_n = M_n^S ⊕ M_n^{sS}. However, this
property fails when p > 2, in which case there are vectors in (F^n)^{⊗p} that are
orthogonal to everything in each of S_n^p and A_n^p. (Direct sums and orthogonal
complements were covered in Section 1.B.)
There are certain values of p and n that are worth focusing a bit of attention on. If p > n
then A_n^p is the zero vector space {0}, which can be verified simply by observing that
\binom{n}{p} = 0 in this case. If p = n then A_n^n is \binom{n}{n} = 1-dimensional, so
up to scaling there is only one vector in the antisymmetric subspace, and it is

    ∑_{σ ∈ S_n} sgn(σ)W_σ(e_1 ⊗ e_2 ⊗ · · · ⊗ e_n) = ∑_{σ ∈ S_n} sgn(σ) e_{σ(1)} ⊗ e_{σ(2)} ⊗ · · · ⊗ e_{σ(n)}.

For example, the unique (up to scaling) vectors in A_2^2 and A_3^3 are

    e_1 ⊗ e_2 − e_2 ⊗ e_1    and

    e_1 ⊗ e_2 ⊗ e_3 + e_2 ⊗ e_3 ⊗ e_1 + e_3 ⊗ e_1 ⊗ e_2 − e_1 ⊗ e_3 ⊗ e_2 − e_2 ⊗ e_1 ⊗ e_3 − e_3 ⊗ e_2 ⊗ e_1,

respectively.
It is worth comparing these antisymmetric vectors to the formula for the determinant of a
matrix, which for matrices A ∈ M_2 and B ∈ M_3 takes the forms

    det(A) = a_{1,1}a_{2,2} − a_{1,2}a_{2,1},    and

    det(B) = b_{1,1}b_{2,2}b_{3,3} + b_{1,2}b_{2,3}b_{3,1} + b_{1,3}b_{2,1}b_{3,2}
           − b_{1,1}b_{2,3}b_{3,2} − b_{1,2}b_{2,1}b_{3,3} − b_{1,3}b_{2,2}b_{3,1},

respectively, and which has the following form in general for matrices C ∈ M_n (this
formula is stated as Theorem A.1.4 in Appendix A.1.5):

    det(C) = ∑_{σ ∈ S_n} sgn(σ) c_{1,σ(1)} c_{2,σ(2)} · · · c_{n,σ(n)}.

The fact that the antisymmetric vector in Ann looks so much like the formula for
the determinant of an n × n matrix is no coincidence—we will see in Section 3.2
(Example 3.2.9 in particular) that there is a well-defined sense in which the
determinant “is” this unique antisymmetric vector.

Exercises solutions to starred exercises on page 476

3.1.1 Compute A ⊗ B for the following pairs of matrices:
    ∗(a) A = [ 1  2 ],  B = [ −1   2 ]
             [ 3  0 ]       [  0  −3 ]
     (b) A = [ 3  1  2 ],  B = [  2  −1 ]
             [ 2  0  1 ]       [ −1   3 ]
    ∗(c) A = [ 1 ],  B = [ 2  −3  1 ]
             [ 2 ]
             [ 3 ]

3.1.2 Use the method of Example 3.1.3 to find all matrices that commute with the given
matrix.
    ∗(a) [ 1  0 ]     (b) [ 0  1 ]
         [ 0  0 ]         [ 0  0 ]
    ∗(c) [ 1  0 ]     (d) [  1  2 ]
         [ 0  1 ]         [ −2  1 ]

3.1.3 Determine which of the following statements are true and which are false.
    ∗(a) If A is 3 × 4 and B is 4 × 3 then A ⊗ B is 12 × 12.
     (b) If A, B ∈ M_n are symmetric then so is A ⊗ B.
    ∗(c) If A, B ∈ M_n are skew-symmetric then so is A ⊗ B.
     (d) If A ∈ M_{m,1} and B ∈ M_{1,n} then A ⊗ B = B ⊗ A.

∗∗3.1.4 Construct an example to show that if A ∈ M_{2,3} and B ∈ M_{3,2} then it might be
the case that tr(A ⊗ B) ≠ tr(B ⊗ A). Why does this not contradict Theorem 3.1.3 or
Corollary 3.1.9?

3.1.5 Suppose H ∈ M_n is a matrix with every entry equal to 1 or −1.
    (a) Show that |det(H)| ≤ n^{n/2}.
        [Hint: Make use of Exercise 2.2.28.]
    (b) Show that Hadamard matrices (see Remark 3.1.1) are exactly the ones for which
        equality is attained in part (a).

∗3.1.6 Suppose A ∈ M_{m,n}(C) and B ∈ M_{p,q}(C), and A^† denotes the pseudoinverse of A
(introduced in Section 2.C.1). Show that (A ⊗ B)^† = A^† ⊗ B^†.

3.1.7 Suppose that λ is an eigenvalue of A ∈ M_m(C) and µ is an eigenvalue of B ∈ M_n(C)
with corresponding eigenvectors v and w, respectively.
    (a) Show that λ + µ is an eigenvalue of A ⊗ I_n + I_m ⊗ B by finding a corresponding
        eigenvector.
        [Side note: This matrix A ⊗ I_n + I_m ⊗ B is sometimes called the Kronecker sum
        of A and B.]

    (b) Show that every eigenvalue of A ⊗ I_n + I_m ⊗ B is the sum of an eigenvalue of A
        and an eigenvalue of B.

∗∗3.1.8 A Sylvester equation is a matrix equation of the form

    AX + XB = C,

where A ∈ M_m(C), B ∈ M_n(C), and C ∈ M_{m,n}(C) are given, and the goal is to solve for
X ∈ M_{n,m}(C).
    (a) Show that the equation AX + XB = C is equivalent to
        (A ⊗ I + I ⊗ B^T)vec(X) = vec(C).
    (b) Show that a Sylvester equation has a unique solution if and only if A and −B do
        not share a common eigenvalue. [Hint: Make use of part (a) and the result of
        Exercise 3.1.7.]

∗∗3.1.9 Let A ∈ M_m(C) and B ∈ M_n(C).
    (a) Show that every eigenvalue of A ⊗ B is of the form λµ for some eigenvalues λ of A
        and µ of B.
        [Side note: This exercise is sort of the converse of Theorem 3.1.3(a).]
    (b) Use part (a) to show that det(A ⊗ B) = det(A)^n det(B)^m.

3.1.10 Suppose F = R or F = C.
    (a) Construct the orthogonal projection onto S_2^3. That is, write this projection
        down as an 8 × 8 matrix.
    (b) Construct the orthogonal projection onto A_2^3.

∗∗3.1.11 Suppose x ∈ F^m ⊗ F^n.
    (a) Show that there exist linearly independent sets {v_j} ⊂ F^m and {w_j} ⊂ F^n such
        that
            x = ∑_{j=1}^{min{m,n}} v_j ⊗ w_j.
    (b) Show that if F = R or F = C then the sets {v_j} and {w_j} from part (a) can be
        chosen to be mutually orthogonal.
        [Side note: This is sometimes called the Schmidt decomposition of x.]

3.1.12 Compute a Schmidt decomposition (see Exercise 3.1.11) of
x = (2, 1, 0, 0, 1, −2) ∈ R^2 ⊗ R^3.

∗∗3.1.13 Show that Theorem 3.1.11 does not hold when F = Z_2 is the field with 2 elements
(see Appendix A.4) and n = 2, p = 3.

∗∗3.1.14 Suppose F = R or F = C. Show that if v ∈ S_n^p and w ∈ A_n^p then v · w = 0.

∗∗3.1.15 In this exercise, we complete the proof of Theorem 3.1.10.
Let P = ∑_{σ ∈ S_p} W_σ / p!.
    (a) Show that P^T = P.
    (b) Show that P^2 = P.

∗∗3.1.16 Show that if {w_1, w_2, . . . , w_k} ⊆ F^n is linearly independent and
{v_1, v_2, . . . , v_k} ⊆ F^m is any set then the equation

    ∑_{j=1}^{k} v_j ⊗ w_j = 0

implies v_1 = v_2 = · · · = v_k = 0.

∗∗3.1.17 Show that if B and C are bases of F^m and F^n, respectively, then the set

    B ⊗ C = {v ⊗ w : v ∈ B, w ∈ C}

is a basis of F^{mn}.
[Hint: Use Exercises 1.2.27(a) and 3.1.16.]

∗∗3.1.18 Show that the swap matrix W_{m,n} has the following properties:
    (a) Each row and column of W_{m,n} has exactly one non-zero entry, equal to 1.
    (b) If F = R or F = C then W_{m,n} is unitary.
    (c) If m = n then W_{m,n} is symmetric.

3.1.19 Show that 1 and −1 are the only eigenvalues of the swap matrix W_{n,n}, and the
corresponding eigenspaces are S_n^2 and A_n^2, respectively.

∗∗3.1.20 Recall Theorem 3.1.1, which established some of the basic properties of the
Kronecker product.
    (a) Prove part (a) of the theorem.
    (b) Prove part (c) of the theorem.
    (c) Prove part (d) of the theorem.

∗∗3.1.21 Recall Theorem 3.1.2, which established some of the ways that the Kronecker
product interacts with other matrix operations.
    (a) Prove part (b) of the theorem.
    (b) Prove part (c) of the theorem.
    (c) Prove part (d) of the theorem.

∗∗3.1.22 Prove Theorem 3.1.4.

∗∗3.1.23 Prove Theorem 3.1.5.

∗∗3.1.24 Prove Theorem 3.1.12.

3.1.25 Let 1 ≤ p ≤ ∞ and let ‖·‖_p denote the p-norm from Section 1.D.1. Show that
‖v ⊗ w‖_p = ‖v‖_p ‖w‖_p for all v ∈ C^m, w ∈ C^n.

∗∗3.1.26 Let 1 ≤ p, q ≤ ∞ be such that 1/p + 1/q = 1. We now provide an alternate proof
of Hölder's inequality (Theorem 1.D.5), which says that

    v · w ≤ ‖v‖_p ‖w‖_q    for all v, w ∈ C^n.

    (a) Explain why it suffices to prove this inequality in the case when
        ‖v‖_p = ‖w‖_q = 1. Make this assumption throughout the rest of this exercise.
    (b) Show that, for each 1 ≤ j ≤ n, either |v_j w_j| ≤ |v_j|^p or |v_j w_j| ≤ |w_j|^q.
    (c) Show that
            v · w ≤ ‖v‖_p^p + ‖w‖_q^q = 2.
    (d) This is not quite what we wanted (we wanted to show that v · w ≤ 1, not 2). To
        fix this problem, let k ≥ 1 be an integer and replace v and w in part (c) by
        v^{⊗k} and w^{⊗k}, respectively. What happens as k gets large?

3.2 Multilinear Transformations

When we first encountered matrix multiplication, it seemed like a strange


and arbitrary operation, but it was defined specifically so as to capture how
composition of linear transformations affects standard matrices. Similarly, we
have now introduced the Kronecker product of matrices, and it perhaps seems
like a strange operation to focus so much attention on. However, there is a very
natural reason for introducing and exploring it—it lets us represent multilinear
transformations, which are functions that act on multiple vectors, each in a
linear way. We now explore this more general class of transformations and how
they are related to the Kronecker product.

3.2.1 Definition and Basic Examples


Recall from Section 1.2.3 that a linear transformation is a function that sends
vectors from one vector space to another in a linear way, and from Section 1.3.3
that a multilinear form is a function that sends a collection of vectors to a scalar
in a manner that treats each input vector linearly. Multilinear transformations
provide the natural generalization of both of these concepts—they can be
thought of as the sweet spot in between linear transformations and multilinear
forms where we have lots of input spaces and potentially have a non-trivial
output space as well.

Definition 3.2.1 Suppose V1 , V2 , . . . , V p and W are vector spaces over the same field. A
Multilinear multilinear transformation is a function T : V1 × V2 × · · · × V p → W
Transformations with the property that, if we fix 1 ≤ j ≤ p and any p − 1 vectors vi ∈ Vi
(1 ≤ i ≠ j ≤ p), then the function S : V_j → W defined by

S(v) = T (v1 , . . . , v j−1 , v, v j+1 , . . . , v p ) for all v ∈ V j

is a linear transformation.

The above definition is a bit of a mouthful, but the idea is simply that a
multilinear transformation is a function that looks like a linear transformation
on each of its inputs individually. When there are just p = 2 input spaces
we refer to these functions as bilinear transformations, and we note that
bilinear forms (refer back to Section 1.3.3) are the special case that arises
when the output space is W = F. Similarly, we sometimes call a multilinear
transformation with p input spaces a p-linear transformation (much like we
sometimes called multilinear forms p-linear forms).

Example 3.2.1 (The Cross Product is a Bilinear Transformation)
Consider the function C : R^3 × R^3 → R^3 defined by C(v, w) = v × w for all v, w ∈ R^3,
where v × w is the cross product of v and w:

    v × w = (v_2w_3 − v_3w_2, v_3w_1 − v_1w_3, v_1w_2 − v_2w_1).

Show that C is a bilinear transformation.



Solution:
The following facts about the cross product are equivalent to C being
bilinear:
• (v + w) × x = v × x + w × x,
• v × (w + x) = v × w + v × x, and
• (cv) × w = v × (cw) = c(v × w) for all v, w, x ∈ R3 and c ∈ R.
These properties are all straightforward to prove directly from the definition of the cross
product (we proved them in Section 1.A of [Joh20]), so we just prove the first one here:

    (v + w) × x = ((v_2 + w_2)x_3 − (v_3 + w_3)x_2, (v_3 + w_3)x_1 − (v_1 + w_1)x_3,
                   (v_1 + w_1)x_2 − (v_2 + w_2)x_1)
                = (v_2x_3 − v_3x_2, v_3x_1 − v_1x_3, v_1x_2 − v_2x_1)
                  + (w_2x_3 − w_3x_2, w_3x_1 − w_1x_3, w_1x_2 − w_2x_1)
                = v × x + w × x,

as claimed.
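
These identities are also easy to spot-check numerically; the short sketch below (ours, not
part of the text, using NumPy's built-in cross product) verifies linearity in each argument
for a random choice of vectors and scalar.

    import numpy as np

    rng = np.random.default_rng(0)
    v, w, x = rng.random(3), rng.random(3), rng.random(3)
    c = 2.5

    # Linearity in the first argument, then in the second argument.
    print(np.allclose(np.cross(v + c * w, x), np.cross(v, x) + c * np.cross(w, x)))   # True
    print(np.allclose(np.cross(x, v + c * w), np.cross(x, v) + c * np.cross(x, w)))   # True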

In fact, the cross product being bilinear is exactly why we think of it as a


“product” or “multiplication” in the first place—bilinearity (and more generally,
multilinearity) corresponds to the “multiplication” distributing over vector
addition. We already made note of this fact back in Section 1.3.3—the usual
multiplication operation on R is bilinear, as is the dot product on Rn . We now
present some more examples that all have a similar interpretation.
• Matrix multiplication is a bilinear transformation. That is, if we define
the function T× : Mm,n × Mn,p → Mm,p that multiplies two matrices
together via

T× (A, B) = AB for all A ∈ Mm,n , B ∈ Mn,p ,

then T× is bilinear. To verify this claim, we just have to recall that matrix
multiplication is both left- and right-distributive, so for any matrices A,
B, and C of appropriate size and any scalar c, we have

T× (A + cB,C) = (A + cB)C = AC + cBC = T× (A,C) + cT× (B,C) and


T× (A, B + cC) = A(B + cC) = AB + cAC = T× (A, B) + cT× (A,C).

• The Kronecker product is bilinear. That is, the function T⊗ : Mm,n ×


M p,q → Mmp,nq that is defined by T⊗ (A, B) = A ⊗ B, is a bilinear trans-
formation. This fact follows immediately from Theorem 3.1.1(b–d).
All of these “multiplications” also become multilinear transformations
if we extend them to three or more inputs in the natural way. For example,
the function T× defined on triples of matrices via T× (A, B,C) = ABC is a
multilinear (trilinear?) transformation.
It is often useful to categorize multilinear transformations into groups based
on how many input and output vector spaces they have. With this in mind, we
say that a multilinear transformation T : V1 × V2 × · · · × V p → W is of type
(p, 0) if W = F is the ground field, and we say that it is of type (p, 1) otherwise.
That is, the first number in a multilinear transformation’s type tells us how
many input vector spaces it has, and its second number similarly tells us how
many output vector spaces it has (with W = F being interpreted as 0 output

spaces, since the output space is trivial). The sum of these two numbers (i.e.,
either p or p + 1, depending on whether W = F or not) is called its order.
For example, matrix multiplication (between two matrices) is a multilinear
transformation of type (2, 1) and order 2 + 1 = 3, and the dot product has type
(2, 0) and order 2 + 0 = 2. We will see shortly that the order of a multilinear
transformation tells us the dimensionality of an array of numbers that should
be used to represent that transformation (just like vectors can be represented
via a 1D list of numbers and linear transformations can be represented via a 2D
array of numbers/a matrix). For now though, we spend some time clarifying
what types of multilinear transformations correspond to which sets of linear
algebraic objects that we are already familiar with:
• Transformations of type (1, 1) are functions T : V → W that act linearly
on V. In other words, they are exactly linear transformations, which we
already know and love. Furthermore, the order of a linear transformation
is 1 + 1 = 2, which corresponds to the fact that we can represent them
via matrices, which are 2D arrays of numbers.
• Transformations of type (1, 0) are linear transformations f : V → F,
As a slightly more which are linear forms. The order of a linear form is 1 + 0 = 1, which
trivial special case,
corresponds to the fact that we can represent them via vectors (via
scalars can be
thought of as Theorem 1.3.3), which are 1D arrays of numbers. In particular, recall
multilinear that we think of these as row vectors.
transformations of
type (0, 0).
• Transformations of type (2, 0) are bilinear forms T : V1 × V2 → F, which
have order 2 + 0 = 2. The order once again corresponds to the fact
(Theorem 1.3.5) that they can be represented naturally by matrices (2D
arrays of numbers).
• Slightly more generally, transformations of type (p, 0) are multilinear
forms T : V1 × · · · × V p → F. The fact that they have order p + 0 = p
corresponds to the fact that we can represent them via Theorem 1.3.6 as
p-dimensional arrays of scalars.
• Transformations of type (0, 1) are linear transformations T : F → W,
which are determined completely by the value of T (1). In particular, for
every such linear transformation T , there exists a vector w ∈ W such that
T (c) = cw for all c ∈ F. The fact that the order of these transformations
is 0 + 1 = 1 corresponds to the fact that we can represent them via the
vector w (i.e., a 1D array of numbers) in this way, though this time we
think of it as a column vector.
We summarize the above special cases, as well the earlier examples of bilin-
ear transformations like matrix multiplication, in Figure 3.1 for easy reference.

Remark 3.2.1 Multilinear transformations can be generalized even further to something


Tensors called tensors, but doing so requires some technicalities that we enjoy
avoiding. While multilinear transformations allow multilinearity (rather
than just linearity) on their input, sometimes it is useful to allow their
output to behave in a multilinear (rather than just linear) way as well. The
Recall from simplest way to make this happen is to make use of dual spaces.
Definition 1.3.3 that
W ∗j is the dual space Specifically, if V1 , V2 , . . . , V p and W1 , W2 , . . . , Wq are vector spaces
of W j , which consists over a field F, then a tensor of type (p, q) is a multilinear form
of all linear forms
acting on W j . Dual f : V1 × V2 × · · · × V p × W1∗ × W2∗ × · · · × Wq∗ → F.
spaces were
explored back in
Section 1.3.2. Furthermore, its order is the quantity p + q. The idea is that attaching W ∗j

[Figure 3.1: A summary of the various multilinear transformations, and their types, that we
have encountered so far. The figure groups our examples by type: the trace is a type-(1, 0)
transformation (a linear form, represented by a row vector); the dot product is a
type-(2, 0) transformation (a bilinear form, represented by a matrix); the determinant is a
type-(p, 0) transformation (a p-linear form); and the cross product, matrix multiplication,
and the Kronecker product are type-(2, 1) transformations (bilinear transformations). We do
not give any explicit examples of type-(0, 0), (0, 1), or (1, 1) transformations here
(i.e., scalars, column vectors, or linear transformations) since you are hopefully familiar
enough with these objects by now that you can come up with some yourself.]

as an input vector space sort of mimics having W_j itself as an output vector space, for
the exact same reason that the double-dual W_j^{∗∗} is so naturally isomorphic to W_j. In
particular, if q = 1 then we can consider a tensor
f : V_1 × V_2 × · · · × V_p × W^∗ → F as "the same thing" as a multilinear transformation
T : V_1 × V_2 × · · · × V_p → W. (We do not make use of any tensors of type (p, q) with
q ≥ 2, which is why we focus on multilinear transformations.)

3.2.2 Arrays
We saw back in Theorem 1.3.6 that we can represent multilinear forms (i.e.,
type-(p, 0) multilinear transformations) on finite-dimensional vector spaces as
multi-dimensional arrays of scalars, much like we represent linear transforma-
tions as matrices. We now show that we can similarly do this for multilinear
transformations of type (p, 1). This fact should not be too surprising—the idea
is simply that each multilinear transformation is determined completely by how
it acts on basis vectors in each of the input arguments.

Theorem 3.2.1 (Multilinear Transformations as Arrays)
Suppose V_1, . . . , V_p and W are finite-dimensional vector spaces over the same field and
T : V_1 × · · · × V_p → W is a multilinear transformation. Let v_{1,j}, . . . , v_{p,j}
denote the j-th coordinates of v_1 ∈ V_1, . . . , v_p ∈ V_p with respect to some bases of
V_1, . . . , V_p, respectively, and let {w_1, w_2, . . .} be a basis of W. Then there exists
a unique family of scalars {a_{i; j_1,...,j_p}} such that

    T(v_1, . . . , v_p) = ∑_{i, j_1, ..., j_p} a_{i; j_1,...,j_p} v_{1,j_1} · · · v_{p,j_p} w_i

for all v_1 ∈ V_1, . . . , v_p ∈ V_p.

Proof. In the p = 1 case, this theorem simply says that, once we fix bases of
V and W, we can represent every linear transformation T : V → W via its
standard matrix, whose (i, j)-entry we denote here by ai; j . This is a fact that we
already know to be true from Theorem 1.2.6. We now prove this more general
result for multilinear transformations via induction on p, and note that linear
transformations provide the p = 1 base case.
For the inductive step, we proceed exactly as we did in the proof of The-
orem 1.3.6. Suppose that the result is true for all (p − 1)-linear transforma-
tions acting on V2 × · · · × V p . If we let B = {x1 , x2 , . . . , xm } be a basis of V1
then the inductive hypothesis tells us that the (p − 1)-linear transformations
S_{j_1} : V_2 × · · · × V_p → W defined by

    S_{j_1}(v_2, . . . , v_p) = T(x_{j_1}, v_2, . . . , v_p)

can be written as

    T(x_{j_1}, v_2, . . . , v_p) = S_{j_1}(v_2, . . . , v_p)
                                = ∑_{i, j_2, ..., j_p} a_{i; j_1, j_2,...,j_p} v_{2,j_2} · · · v_{p,j_p} w_i

for some fixed family of scalars {a_{i; j_1, j_2,...,j_p}}. (We use j_1 here instead of just
j since it will be convenient later on; note that the scalar a_{i; j_1, j_2,...,j_p} depends
on j_1, not just on i, j_2, . . . , j_p, since each choice of x_{j_1} gives a different
(p − 1)-linear transformation S_{j_1}.) If we write an arbitrary vector v_1 ∈ V_1 as a
linear combination of the basis vectors x_1, x_2, . . . , x_m (i.e.,
v_1 = v_{1,1}x_1 + v_{1,2}x_2 + · · · + v_{1,m}x_m), it then follows via linearity that

    T(v_1, v_2, . . . , v_p)
      = T( ∑_{j_1=1}^{m} v_{1,j_1} x_{j_1}, v_2, . . . , v_p )              (v_1 = ∑_{j_1} v_{1,j_1} x_{j_1})
      = ∑_{j_1=1}^{m} v_{1,j_1} T(x_{j_1}, v_2, . . . , v_p)                (multilinearity of T)
      = ∑_{j_1=1}^{m} v_{1,j_1} ∑_{i, j_2,...,j_p} a_{i; j_1,...,j_p} v_{2,j_2} · · · v_{p,j_p} w_i    (inductive hypothesis)
      = ∑_{i, j_1,...,j_p} a_{i; j_1,...,j_p} v_{1,j_1} · · · v_{p,j_p} w_i                            (group sums together)

for all v1 ∈ V1 , v2 ∈ V2 , . . ., v p ∈ V p , which completes the inductive step and


shows that the family of scalars {ai; j1 ,..., j p } exists.
To see that the scalars {ai; j1 ,..., j p } are unique, just note that if we choose
v1 to be the j1 -th member of the basis of V1 , v2 to be the j2 -th member of the
basis of V2 , and so on, then we get

T (v1 , . . . , v p ) = ∑ ai; j1 ,..., j p wi .


i

Since the coefficients of T (v1 , . . . , v p ) in the basis {w1 , w2 , . . .} are unique, it


follows that the scalars {ai; j1 ,..., j p }i are as well. 
The above theorem generalizes numerous theorems for representing (multi)-
linear objects that we have seen in the past. In the case of type-(1, 1) transfor-
mations, it gives us exactly the fact (Theorem 1.2.6) that every linear trans-
formation can be represented by a matrix (with the (i, j)-entry denoted by ai; j
This result also instead of ai, j this time). It also generalizes Theorem 1.3.6, which told us that
generalizes
every multilinear form (i.e., type-(p, 0) transformation) can be represented by
Theorem 1.3.5 for
bilinear forms, since an array {a j1 , j2 ,..., j p }.
that was a special Just as was the case when representing linear transformations via standard
case of
Theorem 1.3.6 for
matrices, the “point” of the above theorem is that, once we know the values of
multilinear forms. the scalars {ai; j1 ,..., j p }, we can compute the value of the multilinear transfor-
mation T on any tuple of input vectors just by writing each of those vectors in
terms of the given basis and then using multilinearity. It is perhaps most natural
to arrange those scalars in a multi-dimensional array (with one dimension
for each subscript), which we call the standard array of the transformation,
just as we are used to arranging the scalars {ai; j } into a 2D matrix for linear
transformations. However, the details are somewhat uglier here since it is not
so easy to write these multi-dimensional arrays on two-dimensional paper or
computer screens.
Since the order of a multilinear transformation tells us how many subscripts
are needed to specify the scalars ai; j1 ,..., j p , it also tells us the dimensionality of
the standard array of the multilinear transformation. For example, this is why
both linear transformations (which have type (1, 1) and order 1 + 1 = 2) and
bilinear forms (which have type (2, 0) and order 2 + 0 = 2) can be represented
by 2-dimensional matrices. The size of an array is an indication of how many
rows it has in each of its dimensions. For example, an array {ai; j1 , j2 } for which
1 ≤ i ≤ 2, 1 ≤ j1 ≤ 3, and 1 ≤ j2 ≤ 4 has size 2 × 3 × 4.
We already saw one example of an order 3 multilinear transformation (in
particular, the determinant of a 3 × 3 matrix, which has type (3, 0)) repre-
sented as a 3D array back in Example 1.3.14. We now present two more such
examples, but this time for bilinear transformations (i.e., order-3 multilinear
transformations of type (2, 1)).

Example 3.2.2 (The Cross Product as a 3D Array)
Let C : R^3 × R^3 → R^3 be the cross product (defined in Example 3.2.1). Construct the
standard array of C with respect to the standard basis of R^3.

Solution:
Since C is an order-3 tensor, its standard array is 3-dimensional and
specifically of size 3 × 3 × 3. To compute the scalars {ai; j1 , j2 }, we do
exactly the same thing that we do to compute the standard matrix of a
linear transformation: plug the basis vectors of the input spaces into C
and write the results in terms of the basis vectors of the output space. For
example, direct calculation shows that

C(e1 , e2 ) = e1 × e2 = (1, 0, 0) × (0, 1, 0) = (0, 0, 1) = e3 ,



and the other 8 cross products can be similarly computed to be (some of these computations
can be sped up by recalling that v × v = 0 and v × w = −(w × v) for all v, w ∈ R^3):

    C(e_1, e_1) = 0       C(e_1, e_2) = e_3     C(e_1, e_3) = −e_2
    C(e_2, e_1) = −e_3    C(e_2, e_2) = 0       C(e_2, e_3) = e_1
    C(e_3, e_1) = e_2     C(e_3, e_2) = −e_1    C(e_3, e_3) = 0.

It follows that the scalars {a_{i; j_1, j_2}} are given by

    a_{3;1,2} = 1     a_{2;1,3} = −1
    a_{3;2,1} = −1    a_{1;2,3} = 1
    a_{2;3,1} = 1     a_{1;3,2} = −1,

and a_{i; j_1, j_2} = 0 otherwise. (Recall that the first subscript in a_{i; j_1, j_2}
corresponds to the output of the transformation and the next two subscripts correspond to
the inputs.) We can arrange these scalars into a 3D array, which we display below with i
indexing the rows, j_1 indexing the columns, and j_2 indexing the "layers" as indicated:

    layer j_2 = 1:  [ 0  0  0 ]    layer j_2 = 2:  [ 0  0 −1 ]    layer j_2 = 3:  [ 0  1  0 ]
                    [ 0  0  1 ]                    [ 0  0  0 ]                    [−1  0  0 ]
                    [ 0 −1  0 ]                    [ 1  0  0 ]                    [ 0  0  0 ]
It is worth noting that the standard array of the cross product that we computed in the
above example is exactly the same as the standard array of the determinant (of a 3 × 3
matrix) that we constructed back in Example 1.3.14. In a sense, the determinant and cross
product are the exact same thing, just written in a different way. They are both order-3
multilinear transformations, but the determinant on M_3 is of type (3, 0) whereas the cross
product is of type (2, 1), which explains why they have such similar properties (e.g., they
can both be used to find the area of parallelograms in R^2 and the volume of
parallelepipeds in R^3). (If you ever used a mnemonic like

    v × w = det [ e_1  e_2  e_3 ]
                [ v_1  v_2  v_3 ]
                [ w_1  w_2  w_3 ]

to compute v × w, this is why it works.)
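
The standard array computed in Example 3.2.2 can be stored as an ordinary 3 × 3 × 3 array
and contracted against the two input vectors; the sketch below (ours, not part of the text)
confirms that this contraction reproduces the cross product.

    import numpy as np

    # Standard array of the cross product, indexed as arr[i, j1, j2] (0-indexed).
    arr = np.zeros((3, 3, 3))
    arr[2, 0, 1] = 1;  arr[1, 0, 2] = -1
    arr[2, 1, 0] = -1; arr[0, 1, 2] = 1
    arr[1, 2, 0] = 1;  arr[0, 2, 1] = -1

    v, w = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 4.0])
    # C(v, w)_i = sum over j1, j2 of a_{i; j1, j2} v_{j1} w_{j2}
    print(np.allclose(np.einsum('ijk,j,k->i', arr, v, w), np.cross(v, w)))   # True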

Example 3.2.3 (Matrix Multiplication as a 3D Array)
Construct the standard array of the matrix multiplication map T_× : M_2 × M_2 → M_2 with
respect to the standard basis of M_2.

Solution:
Since T× is an order-3 tensor, its corresponding array is 3-dimensional
and specifically of size 4 × 4 × 4. To compute the scalars {ai; j1 , j2 }, we
again plug the basis vectors (matrices) of the input spaces into T× and
write the results in terms of the basis vectors of the output space. Rather
than compute all 16 of the required matrix products explicitly, we simply
recall that

    T_×(E_{k,a}, E_{b,ℓ}) = E_{k,a} E_{b,ℓ} = E_{k,ℓ}  if a = b,  and  O  otherwise.

We thus have (for example, a_{1;2,3} = 1 because E_{1,2}, the 2nd basis vector, times
E_{2,1}, the 3rd basis vector, equals E_{1,1}, the 1st basis vector):

    a_{1;1,1} = 1    a_{1;2,3} = 1    a_{2;1,2} = 1    a_{2;2,4} = 1
    a_{3;3,1} = 1    a_{3;4,3} = 1    a_{4;3,2} = 1    a_{4;4,4} = 1,

and a_{i; j_1, j_2} = 0 otherwise. We can arrange these scalars into a 3D array, which we
display below with i indexing the rows, j_1 indexing the columns, and j_2 indexing the
"layers" as indicated:

    layer j_2 = 1:  [ 1 0 0 0 ]    layer j_2 = 2:  [ 0 0 0 0 ]
                    [ 0 0 0 0 ]                    [ 1 0 0 0 ]
                    [ 0 0 1 0 ]                    [ 0 0 0 0 ]
                    [ 0 0 0 0 ]                    [ 0 0 1 0 ]

    layer j_2 = 3:  [ 0 1 0 0 ]    layer j_2 = 4:  [ 0 0 0 0 ]
                    [ 0 0 0 0 ]                    [ 0 1 0 0 ]
                    [ 0 0 0 1 ]                    [ 0 0 0 0 ]
                    [ 0 0 0 0 ]                    [ 0 0 0 1 ]

It is worth noting that there is no "standard" order for which of the three subscripts
should represent the rows, columns, and layers of a 3D array, and swapping the roles of the
subscripts can cause the array to look quite different. (Even though there is no standard
order for which dimensions correspond to the subscripts j_1, . . . , j_k, we always use i,
which corresponds to the single output vector space, to index the rows of the standard
array.) For example, if we use i to index the rows, j_2 to index the columns, and j_1 to
index the layers, then this array instead looks like

    layer j_1 = 1:  [ 1 0 0 0 ]    layer j_1 = 2:  [ 0 0 1 0 ]
                    [ 0 1 0 0 ]                    [ 0 0 0 1 ]
                    [ 0 0 0 0 ]                    [ 0 0 0 0 ]
                    [ 0 0 0 0 ]                    [ 0 0 0 0 ]

    layer j_1 = 3:  [ 0 0 0 0 ]    layer j_1 = 4:  [ 0 0 0 0 ]
                    [ 0 0 0 0 ]                    [ 0 0 0 0 ]
                    [ 1 0 0 0 ]                    [ 0 0 1 0 ]
                    [ 0 1 0 0 ]                    [ 0 0 0 1 ]

(One of the author's greatest failings is his inability to draw 4D objects on 2D paper.)
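
The same kind of numerical check works for the array of this example; in the sketch below
(ours, not part of the text), the coordinate vectors [B]_E and [C]_E are obtained by
reading the entries of B and C row-by-row, matching the ordering of the standard basis
{E_{1,1}, E_{1,2}, E_{2,1}, E_{2,2}}.

    import numpy as np

    # Standard array of 2x2 matrix multiplication, indexed as arr[i, j1, j2] (0-indexed).
    arr = np.zeros((4, 4, 4))
    for (i, j1, j2) in [(1, 1, 1), (1, 2, 3), (2, 1, 2), (2, 2, 4),
                        (3, 3, 1), (3, 4, 3), (4, 3, 2), (4, 4, 4)]:
        arr[i - 1, j1 - 1, j2 - 1] = 1

    B, C = np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([[5.0, 6.0], [7.0, 8.0]])
    vecB, vecC = B.reshape(-1), C.reshape(-1)   # coordinates in the basis {E11, E12, E21, E22}
    # (BC coordinates)_i = sum over j1, j2 of a_{i; j1, j2} [B]_{j1} [C]_{j2}
    coords = np.einsum('ijk,j,k->i', arr, vecB, vecC)
    print(np.allclose(coords.reshape(2, 2), B @ C))    # True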

Arrays as Block Matrices


When the order of a multilinear transformation is 4 or greater, it becomes extremely
difficult to effectively visualize its (4- or higher-dimensional) standard array. To get
around this problem, we can instead write an array {a_{i; j_1,...,j_p}} as a block matrix.
To illustrate how this procedure works, suppose for now that p = 2, so the array has the
form {a_{i; j_1, j_2}}. We can represent this array as a block matrix by letting i index
the rows, j_1 index the block columns, and j_2 index the columns within the blocks. For
example, we could write a 3 × 2 × 4 array {a_{i; j_1, j_2}} as a matrix with 3 rows, 2
block columns, and 4 columns in each block:

    A = [ a_{1;1,1} a_{1;1,2} a_{1;1,3} a_{1;1,4} | a_{1;2,1} a_{1;2,2} a_{1;2,3} a_{1;2,4} ]
        [ a_{2;1,1} a_{2;1,2} a_{2;1,3} a_{2;1,4} | a_{2;2,1} a_{2;2,2} a_{2;2,3} a_{2;2,4} ]
        [ a_{3;1,1} a_{3;1,2} a_{3;1,3} a_{3;1,4} | a_{3;2,1} a_{3;2,2} a_{3;2,3} a_{3;2,4} ] .

(Reading the subscripts in this matrix from left-to-right, top-to-bottom, is like counting:
the final digit increases to its maximum value (4 in this case) and then "rolls over" to
increase the next-to-last digit by 1, and so on.)

As a more concrete example, we can write the array corresponding to the matrix
multiplication tensor T_× : M_2 × M_2 → M_2 from Example 3.2.3 as the following 1 × 4
block matrix of 4 × 4 matrices:

    [ 1 0 0 0 | 0 0 1 0 | 0 0 0 0 | 0 0 0 0 ]
    [ 0 1 0 0 | 0 0 0 1 | 0 0 0 0 | 0 0 0 0 ]
    [ 0 0 0 0 | 0 0 0 0 | 1 0 0 0 | 0 0 1 0 ] .
    [ 0 0 0 0 | 0 0 0 0 | 0 1 0 0 | 0 0 0 1 ]
More generally, if k > 2 (i.e., there are more than two "j" subscripts), we can iterate
this procedure to create a block matrix that has blocks within blocks. We call this the
standard block matrix of the corresponding multilinear transformation. For example, we
could write a 3 × 2 × 2 × 2 array {a_{i; j_1, j_2, j_3}} as a block matrix with 3 rows and
2 block columns, each of which contains 2 block columns containing 2 columns, as follows:

    [ a_{1;1,1,1} a_{1;1,1,2} a_{1;1,2,1} a_{1;1,2,2} | a_{1;2,1,1} a_{1;2,1,2} a_{1;2,2,1} a_{1;2,2,2} ]
    [ a_{2;1,1,1} a_{2;1,1,2} a_{2;1,2,1} a_{2;1,2,2} | a_{2;2,1,1} a_{2;2,1,2} a_{2;2,2,1} a_{2;2,2,2} ]
    [ a_{3;1,1,1} a_{3;1,1,2} a_{3;1,2,1} a_{3;1,2,2} | a_{3;2,1,1} a_{3;2,1,2} a_{3;2,2,1} a_{3;2,2,2} ]

In particular, notice that if we arrange an array into a block matrix in this


way, the index j1 indicates the “most significant” block column, j2 indicates the
next most significant block column, and so on. This is completely analogous
to how the digits of a number are arranged left-to-right from most significant
to least significant (e.g., in the number 524, the digit “5” indicates the largest
piece (hundreds), “2” indicates the next largest piece (tens), and “4” indicates
the smallest piece (ones)).

Example 3.2.4 (The Standard Block Matrix of the Cross Product)
Let C : R^3 × R^3 → R^3 be the cross product (defined in Example 3.2.1). Construct the
standard block matrix of C with respect to the standard basis of R^3.

Solution:
We already computed the standard array of C in Example 3.2.2, so we just have to arrange
the 27 scalars from that array into its 3 × 9 standard block matrix as follows (for
example, the leftmost "−1" here comes from the fact that, in the standard array of C,
a_{2;1,3} = −1: the subscripts say row 2, block column 1, column 3):

        [ 0  0  0 | 0  0  1 | 0 −1  0 ]
    A = [ 0  0 −1 | 0  0  0 | 1  0  0 ] .
        [ 0  1  0 |−1  0  0 | 0  0  0 ]
the standard block matrix of a bilinear form f : V1 × V2 → F looks like. Since
bilinear forms are multilinear forms of order 2, we could arrange the entries
of their standard arrays into a matrix of size dim(V1 ) × dim(V2 ), just as we did
back in Theorem 1.3.5. However, if we follow the method described above for
turning a standard array into a standard block matrix, we notice that we instead
get a 1 × dim(V1 ) block matrix (i.e., row vector) whose entries are themselves
1 × dim(V2 ) row vectors. For example, the standard block matrix {a1; j1 , j2 } of
a bilinear form acting on two 3-dimensional vector spaces would have the form
 
a1;1,1 a1;1,2 a1;1,3 a1;2,1 a1;2,2 a1;2,3 a1;3,1 a1;3,2 a1;3,3 .

Example 3.2.5 (The Standard Block Matrix of the Dot Product)
Construct the standard block matrix of the dot product on R^4 with respect to the standard
basis.

Solution:
Recall from Example 1.3.9 that the dot product is a bilinear form and thus a multilinear
transformation of type (2, 0). To compute the entries of its standard block matrix, we
compute its value when all 16 possible combinations of standard basis vectors are plugged
into it:

    e_{j_1} · e_{j_2} = 1  if j_1 = j_2,  and  0  otherwise.

It follows that

    a_{1;1,1} = 1    a_{1;2,2} = 1    a_{1;3,3} = 1    a_{1;4,4} = 1,

and a_{1; j_1, j_2} = 0 otherwise. If we let j_1 index which block column an entry is
placed in and j_2 index which column within that block it is placed in, we arrive at the
following standard block matrix:

    A = [ 1 0 0 0 | 0 1 0 0 | 0 0 1 0 | 0 0 0 1 ] .

Notice that the standard block matrix of the dot product that we constructed
in the previous example is simply the identity matrix, read row-by-row. This is
not a coincidence—it follows from the fact that the matrix representation of the
dot product (in the sense of Theorem 1.3.5) is the identity matrix. We now state
this observation slightly more generally and more prominently:

! If f : V1 × V2 → F is a bilinear form acting on finite-dimensional


vector spaces then its standard block matrix is the vectorization
of its matrix representation from Theorem 1.3.5, as a row vector.

Remark 3.2.2 (Same Order but Different Type)
One of the advantages of the standard block matrix over a standard array is that its shape
tells us its type (rather than just its order). For example, we saw in Examples 3.2.2 and
1.3.14 that the cross product and the determinant have the same standard arrays. However,
their standard block matrices are different: they are matrices of size 3 × 9 and 1 × 27,
respectively.

This distinction can be thought of as telling us that the cross product and determinant of
a 3 × 3 matrix are the same object (e.g., anything that can be done with the determinant of
a 3 × 3 matrix can be done with the cross product, and vice-versa), but the way that they
act on other objects is different (the cross product takes in 2 vectors and outputs 1
vector, whereas the determinant takes in 3 vectors and outputs a scalar). Similarly, while
matrices represent both bilinear forms and linear transformations, the way that they act in
those two settings is different, since their types are different but their orders are not.

We can similarly construct standard block matrices of tensors (see Remark 3.2.1), and the
only additional wrinkle is that they can have block rows too (not just block columns). For
example, the standard block matrices of bilinear forms f : V_1 × V_2 → F of type (2, 0),
linear transformations T : V → W, and type-(0, 2) tensors f : W_1^∗ × W_2^∗ → F have the
following forms, respectively, if all vector spaces involved are 3-dimensional:


 
    [ a_{1;1,1} a_{1;1,2} a_{1;1,3} | a_{1;2,1} a_{1;2,2} a_{1;2,3} | a_{1;3,1} a_{1;3,2} a_{1;3,3} ],

    [ a_{1;1} a_{1;2} a_{1;3} ]             [ a_{1,1;1} ]
    [ a_{2;1} a_{2;2} a_{2;3} ] ,   and     [ a_{1,2;1} ]
    [ a_{3;1} a_{3;2} a_{3;3} ]             [ a_{1,3;1} ]
                                            [ a_{2,1;1} ]
                                            [ a_{2,2;1} ]
                                            [ a_{2,3;1} ]
                                            [ a_{3,1;1} ]
                                            [ a_{3,2;1} ]
                                            [ a_{3,3;1} ] .

In general, just like there are two possible shapes for tensors of order 1
(row vectors and column vectors), there are three possible shapes for
tensors of order 2 (corresponding to types (2, 0), (1, 1), and (0, 2)) and
p + 1 possible shapes for tensors of order p. Each of these different types
corresponds to a different shape of the standard block matrix and a different
way in which the tensor acts on vectors.
Just like the action of a linear transformation on a vector can be represented via
multiplication by that transformation's standard matrix, so too can the action of a
multilinear transformation on multiple vectors. (Refer back to Theorem 1.2.6 for a precise
statement about how linear transformations relate to matrix multiplication.) However, we
need one additional piece of machinery to make this statement actually work (after all,
what does it even mean to multiply a matrix by multiple vectors?), and that machinery is
the Kronecker product.

Theorem 3.2.2 (Standard Block Matrix of a Multilinear Transformation)
Suppose V_1, . . . , V_p and W are finite-dimensional vector spaces with bases
B_1, . . . , B_p and D, respectively, and T : V_1 × · · · × V_p → W is a multilinear
transformation. If A is the standard block matrix of T with respect to those bases, then

    [T(v_1, . . . , v_p)]_D = A([v_1]_{B_1} ⊗ · · · ⊗ [v_p]_{B_p})   for all v_1 ∈ V_1, . . . , v_p ∈ V_p.

Before proving this result, we look at some examples to clarify exactly what it is saying.
(We could use notation like [T]_{D←B_1,...,B_p} to denote the standard block matrix of T
with respect to the bases B_1, . . . , B_p and D, but that seems like a bit much.) Just as
with linear transformations, the main idea here is that we can now represent arbitrary
multilinear transformations (on finite-dimensional vector spaces) via explicit matrix
calculations. Even more amazingly though, this theorem tells us that the Kronecker product
turns multilinear things (the transformation T) into linear things (multiplication by the
matrix A). In a sense, the Kronecker product can be used to absorb all of the
multilinearity of multilinear transformations, turning them into linear transformations
(which are much easier to work with).

Example 3.2.6 (The Cross Product via the Kronecker Product)
Verify Theorem 3.2.2 for the cross product C : R^3 × R^3 → R^3 with respect to the
standard basis of R^3.

Solution:
Our goal is simply to show that if A is the standard block matrix of the cross product
then C(v, w) = A(v ⊗ w) for all v, w ∈ R^3. If we recall this
standard block matrix from Example 3.2.4 then we can simply compute (in this setting, we
interpret v ⊗ w as a column vector, just like we do for v in a matrix equation like
Av = b):

    A(v ⊗ w) = [ 0  0  0  0  0  1  0 −1  0 ]  [ v_1w_1 ]
               [ 0  0 −1  0  0  0  1  0  0 ]  [ v_1w_2 ]
               [ 0  1  0 −1  0  0  0  0  0 ]  [ v_1w_3 ]
                                              [ v_2w_1 ]
                                              [ v_2w_2 ]
                                              [ v_2w_3 ]
                                              [ v_3w_1 ]
                                              [ v_3w_2 ]
                                              [ v_3w_3 ]

             = [ v_2w_3 − v_3w_2 ]
               [ v_3w_1 − v_1w_3 ] ,
               [ v_1w_2 − v_2w_1 ]

which is indeed exactly the cross product C(v, w), as expected.
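
Since the 3 × 9 standard block matrix is just the 3 × 3 × 3 standard array with its two
input indices flattened, it can be produced by a reshape; the sketch below (ours, not part
of the text) verifies C(v, w) = A(v ⊗ w) for an arbitrary pair of vectors.

    import numpy as np

    arr = np.zeros((3, 3, 3))                  # standard array from Example 3.2.2
    arr[2, 0, 1] = 1;  arr[1, 0, 2] = -1
    arr[2, 1, 0] = -1; arr[0, 1, 2] = 1
    arr[1, 2, 0] = 1;  arr[0, 2, 1] = -1

    A = arr.reshape(3, 9)                      # standard block matrix from Example 3.2.4
    v, w = np.array([1.0, -2.0, 0.5]), np.array([3.0, 1.0, 2.0])
    print(np.allclose(A @ np.kron(v, w), np.cross(v, w)))    # True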

Example 3.2.7 (The Dot Product via the Kronecker Product)
Verify Theorem 3.2.2 for the dot product on R^4 with respect to the standard basis.

Solution:
Our goal is to show that if A is the standard block matrix of the dot product (which we
computed in Example 3.2.5) then v · w = A(v ⊗ w) for all v, w ∈ R^4. If we recall that
this standard block matrix is

    A = [ 1 0 0 0 | 0 1 0 0 | 0 0 1 0 | 0 0 0 1 ],

then we see that

    A(v ⊗ w) = v_1w_1 + v_2w_2 + v_3w_3 + v_4w_4,

which is indeed the dot product v · w. (For space reasons, we do not explicitly list the
product A(v ⊗ w); it is a 1 × 16 matrix times a 16 × 1 matrix. The four terms in the dot
product correspond to the four "1"s in A.)

Example 3.2.8 (Matrix Multiplication via the Kronecker Product)
Verify Theorem 3.2.2 for matrix multiplication on M_2 with respect to the standard basis.

Solution:
Once again, our goal is to show that if E = {E_{1,1}, E_{1,2}, E_{2,1}, E_{2,2}} is the
standard basis of M_2 and

        [ 1 0 0 0 | 0 0 1 0 | 0 0 0 0 | 0 0 0 0 ]
    A = [ 0 1 0 0 | 0 0 0 1 | 0 0 0 0 | 0 0 0 0 ]
        [ 0 0 0 0 | 0 0 0 0 | 1 0 0 0 | 0 0 1 0 ]
        [ 0 0 0 0 | 0 0 0 0 | 0 1 0 0 | 0 0 0 1 ]

is the standard block matrix of matrix multiplication on M_2, then
[BC]_E = A([B]_E ⊗ [C]_E) for all B, C ∈ M_2. To this end, we first recall that if b_{i,j}
and c_{i,j} denote the (i, j)-entries of B and C (as usual), respectively, then

            [ b_{1,1} ]                 [ c_{1,1} ]
    [B]_E = [ b_{1,2} ]    and  [C]_E = [ c_{1,2} ] .
            [ b_{2,1} ]                 [ c_{2,1} ]
            [ b_{2,2} ]                 [ c_{2,2} ]

Directly computing the Kronecker product and matrix multiplication then shows that

                           [ b_{1,1}c_{1,1} + b_{1,2}c_{2,1} ]
    A([B]_E ⊗ [C]_E)  =    [ b_{1,1}c_{1,2} + b_{1,2}c_{2,2} ] .
                           [ b_{2,1}c_{1,1} + b_{2,2}c_{2,1} ]
                           [ b_{2,1}c_{1,2} + b_{2,2}c_{2,2} ]

On the other hand, it is similarly straightforward to compute

    BC = [ b_{1,1}c_{1,1} + b_{1,2}c_{2,1}    b_{1,1}c_{1,2} + b_{1,2}c_{2,2} ]
         [ b_{2,1}c_{1,1} + b_{2,2}c_{2,1}    b_{2,1}c_{1,2} + b_{2,2}c_{2,2} ] ,

so [BC]_E = A([B]_E ⊗ [C]_E), as expected.
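
Here too the 4 × 16 standard block matrix is a reshape of the standard array from
Example 3.2.3, and the identity [BC]_E = A([B]_E ⊗ [C]_E) can be confirmed numerically;
the following sketch (ours, not part of the text) does so for a random pair of 2 × 2
matrices.

    import numpy as np

    arr = np.zeros((4, 4, 4))                      # standard array from Example 3.2.3
    for (i, j1, j2) in [(1, 1, 1), (1, 2, 3), (2, 1, 2), (2, 2, 4),
                        (3, 3, 1), (3, 4, 3), (4, 3, 2), (4, 4, 4)]:
        arr[i - 1, j1 - 1, j2 - 1] = 1

    A = arr.reshape(4, 16)                         # standard block matrix
    B, C = np.random.rand(2, 2), np.random.rand(2, 2)
    vecB, vecC = B.reshape(-1), C.reshape(-1)      # coordinates in the basis {E11, E12, E21, E22}
    print(np.allclose(A @ np.kron(vecB, vecC), (B @ C).reshape(-1)))    # True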

Example 3.2.9 (The Standard Block Matrix of the Determinant is the Antisymmetric
Kronecker Product)
Construct the standard block matrix of the determinant (as a multilinear form from
F^n × · · · × F^n to F) with respect to the standard basis of F^n.

Solution:
Recall that if A, B ∈ M_n and B is obtained from A by interchanging two of its columns,
then det(B) = − det(A). As a multilinear form acting on the
columns of a matrix, this means that the determinant is antisymmetric:

    det(v_1, . . . , v_{j_1}, . . . , v_{j_2}, . . . , v_n) = − det(v_1, . . . , v_{j_2}, . . . , v_{j_1}, . . . , v_n)

for all v_1, . . . , v_n and 1 ≤ j_1 ≠ j_2 ≤ n. By applying Theorem 3.2.2 to both sides of
this equation, we see that if A is the standard block matrix of the determinant (which is
1 × n^n and can thus be thought of as a row vector) then

    A(v_1 ⊗ · · · ⊗ v_{j_1} ⊗ · · · ⊗ v_{j_2} ⊗ · · · ⊗ v_n)
        = −A(v_1 ⊗ · · · ⊗ v_{j_2} ⊗ · · · ⊗ v_{j_1} ⊗ · · · ⊗ v_n).

Now we notice that, since interchanging two Kronecker factors multi-


plies this result by −1, it is in fact the case that permuting the Kronecker
Recall that Sn is the factors according to a permutation σ ∈ Sn multiplies the result by sgn(σ ).
symmetric group,
Furthermore, we use the fact that the elementary tensors v1 ⊗ · · · ⊗ vn span
which consists of all
permutations acting all of (Fn )⊗n , so we have
on {1, 2, . . . , n}.  T
Av = sgn(σ ) A(Wσ v) = sgn(σ )WσT AT v

for all σ ∈ Sn and v ∈ (Fn )⊗n . By using the fact that WσT = Wσ −1 , we then
see that AT = sgn(σ )Wσ −1 AT for all σ ∈ Sn , which exactly means that A
lives in the antisymmetric subspace of (Fn )⊗n : A ∈ An,n .
We now recall from Section 3.1.3 that An,n is 1-dimensional, and in
particular A must have the form

A=c ∑ sgn(σ )eσ (1) ⊗ eσ (2) ⊗ · · · ⊗ eσ (n)


σ ∈Sn
3.2 Multilinear Transformations 333

This example also for some c ∈ F. Furthermore, since det(I) = 1, we see that AT (e1 ⊗ · · · ⊗
shows that the
en ) = 1, which tells us that c = 1. We have thus shown that the standard
determinant is the
only function with block matrix of the determinant is
the following three
properties: A= ∑ sgn(σ )eσ (1) ⊗ eσ (2) ⊗ · · · ⊗ eσ (n) ,
multilinearity, σ ∈Sn
antisymmetry, and
det(I) = 1. as a row vector.

Proof of Theorem 3.2.2. There actually is not much that needs to be done
to prove this theorem—all of the necessary bits and pieces just fit together
via naïve (but messy) direct computation. For ease of notation, we let v1, j ,
. . ., v p, j denote the j-th entry of [v1 ]B1 , . . ., [v p ]B p , respectively, just like in
Theorem 3.2.1.
On the one hand, using the definition of the Kroneckerproduct tells us that,
for each integer i, the i-th entry of A [v1 ]B1 ⊗ · · · ⊗ [v p ]B p is
The notation here is h
a bit unfortunate. [·]B i
A [v1 ]B1 ⊗ · · · ⊗ [v p ]B p = ∑ ai; j1 , j2 ,..., j p v1, j1 v2, j2 · · · v p, j p .
refers to a i | {z }
j1 , j2 ,..., j p
coordinate vector | {z } entries of
and [·]i refers to the sum across i-th row of A [v1 ]B1 ⊗···⊗[v p ]B p
i-th entry of a vector.
Onthe other hand, this is exactly what Theorem 3.2.1 tells us that the i-th entry
of T (v1 , . . . , v p ) D is (in the notation of that theorem, this was the coefficient
of wi , the i-th basis vector in D), which is all we needed to observe to complete
the proof. 

3.2.3 Properties of Multilinear Transformations


Most of the usual properties of linear transformations that we are familiar with
can be extended to multilinear transformations, but the details almost always
become significantly uglier and more difficult to work with. We now discuss
how to do this for some of the most basic properties like the operator norm,
null space, range, and rank.

Operator Norm
The operator norm of a linear transformation T : V → W between finite-
A similar definition dimensional inner product spaces is
works if V and W are
infinite-dimensional, 
but we must replace kT k = max kT (v)k : kvk = 1 ,
v∈V
“max” with “sup”, and
even then the value where kvk refers to the norm of v induced by the inner product on V, and
of the supremum
might be ∞. kT (v)k refers to the norm of T (v) induced by the inner product on W. In the
special case when V = W = Fn , this is just the operator norm of a matrix from
Section 2.3.3.
To extend this idea to multilinear transformations, we just optimize over
unit vectors in each input argument. That is, if T : V1 × V2 × · · · × V p → W is
a multilinear transformation between finite-dimensional inner product spaces,
then we define its operator norm as follows:
def 
kT k = max kT (v1 , v2 , . . . , v p )k : kv1 k = kv2 k = · · · = kv p k = 1 .
v j ∈V j
1≤ j≤p
334 Chapter 3. Tensors and Multilinearity

While the operator norm of a linear transformation can be computed quickly


via the singular value decomposition (refer back to Theorem 2.3.7), no quick
or easy method of computing the operator norm of a multilinear transformation
is known. However, we can carry out the computation for certain special
multilinear transformations if we are willing to work hard enough.

Example 3.2.10 Let C : R3 × R3 → R3 be the cross product (defined in Example 3.2.1).


The Operator Norm Compute kCk.
of the Cross Product
Solution:
One of the basic properties of the cross product is that the norm
(induced by the dot product) of its output satisfies
This formula is
proved in [Joh20, kC(v, w)k = kvkkwk sin(θ ) for all v, w ∈ R3 ,
Section 1.A], for
example.
where θ is the angle between v and w. Since | sin(θ )| ≤ 1, this implies
kC(v, w)k ≤ kvkkwk. If we normalize v and w to be unit vectors then we
get kC(v, w)k ≤ 1, with equality if and only if v and w are orthogonal. It
follows that kCk = 1.

Example 3.2.11 Suppose F = R or F = C, and det : Fn × · · · × Fn → F is the determinant.


The Operator Norm Compute k det k.
of the Determinant
Solution:
Recall from Exercise 2.2.28 (Hadamard’s inequality) that

det(v1 , v2 , . . . , vn ) ≤ kv1 kkv2 k · · · kvn k for all v1 , v2 , . . . , vn ∈ Fn .

If we normalize each v1 , v2 , . . . , vn to have norm 1 then this inequality


becomes | det(v1 , v2 , . . . , vn )| ≤ 1. Furthermore, equality is sometimes
attained (e.g., when {v1 , v2 , . . . , vn } is an orthonormal basis of Fn ), so we
conclude that k det k = 1.

It is perhaps worth noting that if a multilinear transformation acts on


finite-dimensional inner product spaces, then Theorem 3.2.2 tells us that we
can express its operator norm in terms of its standard block matrix A (with
respect to orthonormal bases of all of the vector spaces). In particular, if we let
d j = dim(V j ) for all 1 ≤ j ≤ k, then

kT k = max kA(v1 ⊗ v2 ⊗ · · · ⊗ v p )k : kv1 k = kv2 k = · · · = kv p k = 1}.
d
v j ∈F j
1≤ j≤p

While this quantity is difficult to compute in general, there is an easy-to-


compute upper bound of it: if we just optimize kAvk over all unit vectors
Be careful: kT k is the in Fd1 ⊗ · · · ⊗ Fd p (instead of just over elementary tensors, as above), we get
multilinear operator
exactly the “usual” operator norm of A. In other words, we have shown the
norm (which is hard
to compute), while following:
kAk is the operator
norm of a matrix ! If T : V1 × · · · × V p → W is a multilinear transformation with
(which is easy to standard block matrix A then kT k ≤ kAk.
compute).
For example, it is straightforward to verify that the standard block matrix A
of the cross product
√ C (see Example 3.2.4)√ has all three of its non-zero singular
values equal to 2, so kCk ≤ kAk = 2. Similarly, the standard block matrix
3.2 Multilinear Transformations 335

B of the determinant det : Fn × · · · × Fn → F (see Example 3.2.9) is just


√ a row
vector with n! non-zero entries all equal to ±1, so k det k ≤ kBk = n!. Of
course, both of these bounds agree with (but are weaker than) the facts from
Examples 3.2.10 and 3.2.11 that kCk = 1 and k det k = 1.
Range and Kernel
In Section 1.2.4, we introduced the range and null space of a linear transforma-
tion T : V → W to be the following subspaces of W and V, respectively:
 
range(T ) = T (x) : x ∈ V and null(T ) = x ∈ V : T (x) = 0 .

The natural generalizations of these sets to a multilinear transformation T :


V1 × V2 × · · · × V p → W are called its range and kernel, respectively, and they
are defined by
def 
range(T ) = T (v1 , v2 , . . . , v p ) : v1 ∈ V1 , v2 ∈ V2 , . . . , v p ∈ V p and
def 
ker(T ) = (v1 , v2 , . . . , v p ) ∈ V1 × V2 × · · · × V p : T (v1 , v2 , . . . , v p ) = 0 .

In other words, the range of a multilinear transformation contains all of


the possible outputs of that transformation, just as was the case for linear
transformations, and its kernel consists of all inputs that result in an output of 0,
just as was the case for the null space of linear transformations. The reason for
changing terminology from “null space” to “kernel” in this multilinear setting
is that these sets are no longer subspaces if p ≥ 2. We illustrate this fact with
an example.

Example 3.2.12 Describe the range of the bilinear transformation T⊗ : R2 × R2 → R4 that


The Range of the computes the Kronecker product of two vectors, and show that it is not a
Kronecker Product subspace of R4 .
Solution:
The range of T⊗ is the set of all vectors of the form T⊗ (v, w) = v ⊗ w,
which are the elementary tensors in R2 ⊗ R2 ∼ = R4 . We noted near the end
2 2
of Section 3.1.1 that R ⊗ R is spanned by elementary tensors, but not
every vector in this space is an elementary tensor.
Recall from For example, (1, 0)⊗(1, 0) = (1, 0, 0, 0) and (0, 1)⊗(0, 1) = (0, 0, 0, 1)
Theorem 3.1.6 that if
are both in range(T⊗ ), but (1, 0, 0, 0) + (0, 0, 0, 1) = (1, 0, 0, 1) is not (after
v is an elementary
tensor then all, its matricization is the identity matrix I2 , which does not have rank 1),
rank(mat(v)) = 1. so range(T⊗ ) is not a subspace of R4 .

To even talk about whether or not ker(T ) is a subspace, we need to clarify


what it could be a subspace of. The members of ker(T ) come from the set
V1 × V2 × · · · × V p , but this set is not a vector space (so it does not make sense
for it to have subspaces). We can turn it into a vector space by defining a vector
addition and scalar multiplication on it—one way of doing this gives us the
external direct sum from Section 1.B. However, ker(T ) fails to be a subspace
under pretty much any reasonable choice of vector operations.

Example 3.2.13 Describe the kernel of the determinant det : Fn × · · · × Fn → F.


Kernel of the
Solution:
Determinant
Recall that the determinant of a matrix equals 0 if and only if that
matrix is not invertible. Also recall that a matrix is not invertible if and
336 Chapter 3. Tensors and Multilinearity

only if its columns form a linearly dependent set. By combining these two
facts, we see that det(v1 , v2 , . . . , vn ) = 0 if and only if {v1 , v2 , . . . , vn } is
linearly dependent. In other words,

ker(det) = (v1 , v2 , . . . , vn ) ∈ Fn × · · · × Fn :

{v1 , v2 , . . . , vn } is linearly dependent .

To get our hands on range(T ) and ker(T ) in general, we can make use
of the standard block matrix A of T , just like we did for the operator norm.
Specifically, making use of Theorem 3.2.2 immediately gives us the following
result:

Theorem 3.2.3 Suppose V1 , . . . , V p and W are finite-dimensional vector spaces with bases
Range and Kernel B1 , . . . , B p and D, respectively, and T : V1 × · · · × V p → W is a multilinear
in Terms of Standard transformation. If A is the standard block matrix of T with respect to these
Block Matrices bases, then
 
range(T ) = w ∈ W : [w]D = A [v1 ]B1 ⊗ · · · ⊗ [v p ]B p

for some v1 ∈ V1 , . . . , v p ∈ V p , and
 
ker(T ) = (v1 , . . . , v p ) ∈ V1 × · · · × V p : A [v1 ]B1 ⊗ · · · ⊗ [v p ]B p = 0 .

That is, the range and kernel of a multilinear transformation consist of the
elementary tensors in the range and null space of its standard block matrix,
respectively (up to fixing bases of the vector spaces so that this statement
actually makes sense). This fact is quite analogous to our earlier observation that
the operator norm of a multilinear transformation T is obtained by maximizing
kAvk over elementary tensors v = v1 ⊗ · · · ⊗ v p .

Example 3.2.14 Compute the kernel of the cross product C : R3 × R3 → R3 directly, and
Kernel of the also via Theorem 3.2.3.
Cross Product
Solution:
To compute the kernel directly, recall that kC(v, w)k = kvkkwk| sin(θ )|,
where θ is the angle between v and w (we already used this formula in
Example 3.2.10). It follows that C(v, w) = 0 if and only if v = 0, w = 0,
In other words, ker(C) or sin(θ ) = 0. That is, ker(C) consists of all pairs (v, w) for which v and
consists of sets {v, w}
that are linearly w lie on a common line.
dependent To instead arrive at this result via Theorem 3.2.3, we first recall that C
(compare with
Example 3.2.13).
has standard block matrix
 
0 0 0 0 0 1 0 −1 0
A =  0 0 −1 0 0 0 1 0 0  .
0 1 0 −1 0 0 0 0 0

Standard techniques from introductory linear algebra show that the null
space of A is the symmetric subspace: null(A) = S32 . The kernel of C thus
consists of the elementary tensors in S32 (once appropriately re-interpreted
just as pairs of vectors, rather than as elementary tensors).
We know from Theorem 3.1.11 that these elementary tensors are ex-
actly the ones of the form c(v ⊗ v) for some v ∈ R3 . Since c(v ⊗ v) =
(cv) ⊗ v = v ⊗ (cv), it follows that ker(C) consists of all pairs (v, w) for
3.2 Multilinear Transformations 337

which v is a multiple of w or w is a multiple of v (i.e., v and w lie on a


common line).

Rank
Recall that the rank of a linear transformation is the dimension of its range.
Since the range of a multilinear transformation is not necessarily a subspace,
we cannot use this same definition in this more general setting. Instead, we
treat the rank-one sum decomposition (Theorem A.1.3) as the definition of the
rank of a matrix, and we then generalize it to linear transformations and then
multilinear transformations.
Once we do this, we see that one way of expressing the rank of a linear
transformation T : V → W is as the minimal integer r such that there exist linear
forms f1 , f2 , . . . , fr : V → F and vectors w1 , . . . , wr ∈ W with the property that
Notice that
r
Equation (3.2.1)
implies that the
T (v) = ∑ f j (v)w j for all v ∈ V. (3.2.1)
range of T is j=1
contained within the
span of w1 , . . . , wr , so
Indeed, this way of defining the rank of T makes sense if we recall that,
rank(T ) ≤ r. in the case when V = Fn and W = Fm , each f j looks like a row vector: there
exist vectors x1 , . . . , xr ∈ Fn such that f j (v) = xTj v for all 1 ≤ j ≤ r. In this
case, Equation (3.2.1) says that
r r 
T (v) = ∑ (xTj v)w j = ∑ w j xTj v for all v ∈ Fn ,
j=1 j=1

which simply means that the standard matrix of T is ∑rj=1 w j xTj . In other words,
Equation (3.2.1) just extends the idea of a rank-one sum decomposition from
matrices to linear transformations.
To generalize Equation (3.2.1) to multilinear transformations, we just intro-
duce additional linear forms—one for each of the input vector spaces.

Definition 3.2.2 Suppose V1 , . . . , V p and W are vector spaces and T : V1 × · · · × V p → W


The Rank of a is a multilinear transformation. The rank of T , denoted by rank(T ), is
(i) (i) (i)
Multilinear the minimal integer r such that there exist linear forms f1 , f2 , . . . , fr :
Transformation Vi → F (1 ≤ i ≤ p) and vectors w1 , . . . , wr ∈ W with the property that
r
(1) (p)
T (v1 , . . . , v p ) = ∑ fj (v1 ) · · · f j (v p )w j for all v1 ∈ V1 , . . . , v p ∈ V p .
j=1

Just like the previous properties of multilinear transformations that we


looked at (the operator norm, range, and kernel), the rank of a multilinear
transformation is extremely difficult to compute in general. In fact, the rank is
so difficult to compute that we do not even know the rank of many of the most
basic multilinear transformations that we have been working with, like matrix
multiplication and the determinant.

Example 3.2.15 Show that if T× : M2 × M2 → M2 is the bilinear transformation that


The Rank of multiplies two 2 × 2 matrices, then rank(T× ) ≤ 8.
2 × 2 Matrix
Solution:
Multiplication
Our goal is to construct a sum of the form described by Definition 3.2.2
for T× that consists of no more than 8 terms. Fortunately, one such sum
338 Chapter 3. Tensors and Multilinearity

is very easy to construct—we just write down the definition of matrix


multiplication.
More explicitly, recall that the top-left entry in the product T× (A, B) =
AB is a1,1 b1,1 + a1,2 b2,1 , so 2 of the 8 terms in the sum that defines AB are
a1,1 b1,1 E1,1 and a1,2 b2,1 E1,1 . The other entries of AB similarly contribute
2 terms each to the sum, for a total of 8 terms in the sum. Even more
explicitly, if we define
(1) (1) (1) (1)
f1 (A) = a1,1 f2 (A) = a1,2 f3 (A) = a1,1 f4 (A) = a1,2
(1) (1) (1) (1)
f5 (A) = a2,1 f6 (A) = a2,2 f7 (A) = a2,1 f8 (A) = a2,2

(2) (2) (2) (2)


f1 (B) = b1,1 f2 (B) = b2,1 f3 (B) = b1,2 f4 (B) = b2,2
(2) (2) (2) (2)
f5 (B) = b1,1 f6 (B) = b2,1 f7 (B) = b1,2 f8 (B) = b2,2

w1 = E1,1 w2 = E1,1 w3 = E1,2 w4 = E1,2


w5 = E2,1 w6 = E2,1 w7 = E2,2 w8 = E2,2 ,

A similar argument then it is straightforward (but tedious) to check that


shows that matrix
multiplication T× : 8
(1) (2)
Mm,n × Mn,p → Mm,p T× (A, B) = AB = ∑ fj (A) f j (B)w j for all A, B ∈ M2 .
has rank(T× ) ≤ mnp. j=1

After all, this sum can be written more explicitly as


AB = (a1,1 b1,1 + a1,2 b2,1 )E1,1 + (a1,1 b1,2 + a1,2 b2,2 )E1,2
+ (a2,1 b1,1 + a2,2 b2,1 )E2,1 + (a2,1 b1,2 + a2,2 b2,2 )E2,2 .

One of the reasons that the rank of a multilinear transformation is so difficult


to compute is that we do not have a method of reliably detecting whether or
not a sum of the type described by Definition 3.2.2 is optimal.
For example, it seems reasonable to expect that the matrix multiplication
transformation T× : M2 × M2 → M2 has rank(T× ) = 8 (i.e., we might expect
that the rank-one sum decomposition from Example 3.2.15 is optimal). Rather
surprisingly, however, this is not the case—the following decomposition of T×
(1) (2)
as T× (A, B) = AB = ∑7j=1 f j (A) f j (B)w j shows that rank(T× ) ≤ 7:

(1) (1) (1)


It turns out that, for f1 (A) = a1,1 f2 (A) = a2,2 f3 (A) = a1,2 − a2,2
2 × 2 matrix (1) (1) (1)
multiplication, f4 (A) = a2,1 + a2,2 f5 (A) = a1,1 + a1,2 f6 (A) = a2,1 − a1,1
rank(T× ) = 7, but this (1)
is not obvious. That is, f7 (A) = a1,1 + a2,2
it is hard to show that
there is no sum of (2) (2) (2)
the type described
f1 (B) = b1,2 − b2,2 f2 (B) = b2,1 − b1,1 f3 (B) = b2,1 + b2,2
by Definition 3.2.2 (2) (2) (2) (†)
consisting of only 6
f4 (B) = b1,1 f5 (B) = b2,2 f6 (B) = b1,1 + b1,2
terms. (2)
f7 (B) = b1,1 + b2,2

w1 = E1,2 + E2,2 w2 = E1,1 + E2,1 w3 = E1,1


w4 = E2,1 − E2,2 w5 = E1,2 − E1,1 w6 = E2,2
w7 = E1,1 + E2,2 .
3.2 Multilinear Transformations 339

The process of verifying that this decomposition works is just straightfor-


ward (but tedious) algebra. However, actually discovering this decomposition
in the first place is much more difficult, and beyond the scope of this book.

Remark 3.2.3 When multiplying two 2 × 2 matrices A and B together, we typically


Algorithms for perform 8 scalar multiplications:
Faster Matrix
 " #
Multiplication a1,1 a1,2 b1,1 b1,2
AB =
a2,1 a2,2 b2,1 b2,2
" #
a1,1 b1,1 + a1,2 b2,1 a1,1 b1,2 + a1,2 b2,2
= .
a2,1 b1,1 + a2,2 b2,1 a2,1 b1,2 + a2,2 b2,2

However, it is possible to compute this same product a bit more efficiently—


only 7 scalar multiplications are required if we first compute the 7 matrices
This method of
computing AB M1 = a1,1 (b1,2 − b2,2 ) M2 = a2,2 (b2,1 − b1,1 )
requires more
additions, but that is M3 = (a1,2 − a2,2 )(b2,1 + b2,2 ) M4 = (a2,1 + a2,2 )b1,1
OK—the number of M5 = (a1,1 + a1,2 )b2,2 M6 = (a2,1 − a1,1 )(b1,1 + b1,2 )
additions grows
much more slowly M7 = (a1,1 + a2,2 )(b1,1 + b2,2 ).
than the number of
multiplications when It is straightforward to check that if w1 , . . . , w7 are as in the decomposi-
we apply it to larger tion (†), then
matrices.
7  
M2 + M3 − M5 + M7 M1 + M5
AB = ∑ Mj ⊗ wj = .
j=1 M2 + M4 M1 − M4 + M6 + M7

To apply Strassen’s By iterating this procedure, we can compute the product of large
algorithm to
matrices much more quickly than we can via the definition of matrix
matrices of size that
is not a power of 2, multiplication. For example, computing the product of two 4 × 4 matrices
just pad it with rows directly from the definition of matrix multiplication requires 64 scalar
and columns of multiplications. However, if we partition those matrices as 2 × 2 block
zeros as necessary.
matrices and multiply them via the clever method above, we just need
to perform 7 multiplications of 2 × 2 matrices, each of which can be
The expressions like implemented via 7 scalar multiplications in the same way, for a total of
“O(n3 )” here are
72 = 49 scalar multiplications.
examples of big-O
notation. For This faster method of matrix multiplication is called Strassen’s al-
example, “O(n3 )
operations” means
gorithm, and it requires only O(nlog2 (7) ) ≈ O(n2.8074 ) scalar operations,
“no more than Cn3 versus the standard matrix multiplication algorithm’s O(n3 ) scalar opera-
operations, for some tions. The fact that we can multiply two 2 × 2 matrices together via just 7
scalar C”. scalar multiplications (i.e., the fact that Strassen’s algorithm exists) follows
immediately from the 7-term rank sum decomposition (†) for the 2 × 2
Even the rank of the matrix multiplication transformation. After all, the matrices M j (1 ≤ j ≤ 7)
3 × 3 matrix
multiplication are simply the products f1, j (A) f2, j (B).
transformation is Similarly, finding clever rank sum decompositions of larger matrix
currently
multiplication transformations leads to even faster algorithms for ma-
unknown—all we
know is that it is trix multiplication, and these techniques have been used to construct an
between 19 and 23, algorithm that multiplies two n × n matrices in O(n2.3729 ) scalar opera-
inclusive. See [Sto10] tions. One of the most important open questions in all of linear algebra
and references
therein for details.
340 Chapter 3. Tensors and Multilinearity

asks whether or not, for every ε > 0, there exists such an algorithm that
requires only O(n2+ε ) scalar operations.

The rank of a multilinear transformation can also be expressed in terms


of its standard block matrix. By identifying the linear forms { fi, j } with row
vectors in the usual way and making use of Theorem 3.2.2, we get the following
theorem:

Theorem 3.2.4 Suppose V1 , . . . , V p and W are d1 -, . . ., d p -, and dW -dimensional vector


Rank via Standard spaces, respectively, T : V1 ×· · ·×V p → W is a multilinear transformation,
Block Matrices and A is the standard block matrix of T . Then the smallest integer r for
 (1) r
which there exist sets of vectors {w j }rj=1 ⊂ FdW and v j j=1 ⊂ Fd1 ,
 (p) r
. . ., v j j=1 ⊂ Fd p with

r
(1) (2) (p) T
A= ∑ wj vj ⊗vj ⊗···⊗vj
j=1

The sets of vectors in is exactly r = rank(T ).


this theorem cannot
be chosen to be
linearly independent If we recall from Theorem A.1.3 that rank(A) (where now we are just
in general, in thinking of A as a matrix—not as anything to do with multilinearity) equals the
contrast with
Theorem A.1.3
smallest integer r for which we can write
(where they could). r
A= ∑ w j vTj ,
j=1

we see learn immediately the important fact that the rank of a multilinear trans-
formation is bounded from below by the rank of its standard block matrix (after
all, every rank-one sum decomposition of the form described by Theorem 3.2.4
is also of the form described by Theorem A.1.3, but not vice-versa):

! If T : V1 × · · · × V p → W is a multilinear transformation with


standard block matrix A then rank(T ) ≥ rank(A).

For example, it is straightforward to check that if C : R3 × R3 → R3 is the


cross product, with standard block matrix A ∈ M3,9 as in Example 3.2.4, then
rank(C) ≥ rank(A) = 3 (in fact, rank(C) = 5, but this is difficult to show—see
Exercise 3.2.13). Similarly, if T× : M2 × M2 → M2 is the matrix multiplica-
tion transformation with standard matrix A ∈ M4,16 from Example 3.2.8, then
rank(T× ) ≥ rank(A) = 4 (recall that rank(T× ) actually equals 7).

Exercises solutions to starred exercises on page 478

3.2.1 Determine which of the following functions are and 2


∗(c) The function T : Rn × Rn → Rn defined by
are not multilinear transformations.
T (v, w) = v ⊗ w.
∗(a) The function T : R2 ×R2 → R2 defined by T (v, w) = (d) The function T : Mn × Mn → Mn defined by
(v1 + v2 , w1 − w2 ). T (A, B) = A + B.
(b) The function T : R2 ×R2 → R2 defined by T (v, w) = ∗(e) Given a fixed matrix X ∈ Mn , the function TX :
(v1 w2 + v2 w1 , v1 w1 − 2v2 w2 ). Mn → Mn defined by TX (A, B) = AXBT .
3.2 Multilinear Transformations 341

(f) The function per : Fn × · · · × Fn → F (with n copies ∗3.2.9 Suppose T : V1 × · · · × V p → W is a multilinear


of Fn ) defined by transformation. Show that if its standard array has k non-
 (1) (2) (n) zero entries then rank(T ) ≤ k.
per v(1) , v(2) , . . . , v(n) = ∑ vσ (1) vσ (2) · · · vσ (n) .
σ ∈Sn

[Side note: This function is called the permanent, ∗3.2.10 Suppose T× : Mm,n × Mn,p → Mm,p is the bilin-
and its formula is similar to that of the determinant, ear transformation that multiplies two matrices. Show that
but with the signs of permutations ignored.] mp ≤ rank(T× ) ≤ mnp.

3.2.2 Compute the standard block matrix (with respect ∗3.2.11 We motivated many properties of multilinear trans-
to the standard bases of the given spaces) of each of the formations so as to generalize properties of linear transfor-
following multilinear transformations: mations. In this exercise, we show that they also generalize
properties of bilinear forms.
(a) The function T : R2 ×R2 → R2 defined by T (v, w) =
(2v1 w1 − v2 w2 , v1 w1 + 2v2 w1 + 3v2 w2 ). Suppose V and W are finite-dimensional vector spaces over
∗(b) The function T : R3 ×R2 → R3 defined by T (v, w) = a field F, f : V × W → F is a bilinear form, and A is the
(v2 w1 +2v1 w2 +3v3 w1 , v1 w1 +v2 w2 +v3 w1 , v3 w2 − matrix associated with f via Theorem 1.3.5.
v1 w1 ). (a) Show that rank( f ) = rank(A).
∗(c) The Kronecker product T⊗ : R2 × R2 → R4 (i.e., (b) Show that if F = R or F = C, and we choose the
the bilinear transformation defined by T⊗ (v, w) = bases B and C in Theorem 1.3.5 to be orthonormal,
v ⊗ w). then k f k = kAk.
(d) Given a fixed matrix X ∈ M2 , the function TX :
M2 × R2 → R2 defined by TX (A, v) = AXv.
3.2.12 Let D : Rn × Rn → R be the dot product.
(a) Show that kDk = 1.
3.2.3 Determine which of the following statements are (b) Show that rank(D) = n.
true and which are false.
∗(a) The dot product D : Rn × Rn → R is a multilinear ∗∗3.2.13 Let C : R3 × R3 → R3 be the cross product.
transformation of type (n, n, 1).
(b) The multilinear transformation from Exer- (a) Construct the “naïve” rank sum decomposition that
cise 3.2.2(b) has type (2, 1). shows that rank(C) ≤ 6.
∗(c) The standard block matrix of the matrix multiplica- [Hint: Mimic Example 3.2.15.]
tion transformation T× : Mm,n × Mn,p → Mm,p has (b) Show that C(v, w) = ∑5j=1 f1, j (v) f2, j (w)x j for all
size mn2 p × mp. v, w ∈ R3 , and thus rank(C) ≤ 5, where
(d) The standard block matrix of the Kronecker product f1,1 (v) = v1 f1,2 (v) = v1 + v3
T⊗ : Rm × Rn → Rmn (i.e., the bilinear transforma-
tion defined by T⊗ (v, w) = v ⊗ w) is the mn × mn f1,3 (v) = −v2 f1,4 (v) = v2 + v3
identity matrix. f1,5 (v) = v2 − v1

f2,1 (w) = w2 + w3 f2,2 (w) = −w2


3.2.4 Suppose T : V1 × · · · × V p → W is a multilinear
f2,3 (w) = w1 + w3 f2,4 (w) = w1
transformation. Show that if there exists an index j for
which v j = 0 then T (v1 , . . . , v p ) = 0. f2,5 (w) = w3

x1 = e1 + e3 x2 = e1
3.2.5 Suppose T× : Mm,n × Mn,p → Mm,p is the bilinear
x3 = e2 + e3 x4 = e2
transformation that multiplies two matrices. Compute kT× k.
x5 = e1 + e2 + e3 .
[Hint: Keep in mind that the norm induced by the inner
product on Mm,n is the Frobenius norm. What inequalities [Side note: It is actually the case that rank(C) = 5,
do we know involving that norm?] but this is quite difficult to show.]

3.2.6 Show that the operator norm of a multilinear trans- 3.2.14 Use the rank sum decomposition from Exam-
formation is in fact a norm (i.e., satisfies the three properties ple 3.2.13(b) to come up with a formula for the cross product
of Definition 1.D.1). C : R3 × R3 → R3 that involves only 5 real number multi-
plications, rather than 6.

3.2.7 Show that the range of a multilinear form f : V1 × [Hint: Mimic Remark 3.2.3.]
· · ·×V p → F is always a subspace of F (i.e., range( f ) = {0}
or range( f ) = F). ∗3.2.15 Let det : Rn × · · · × Rn → R be the determinant.
[Side note: Recall that this is not true of multilinear trans- (a) Suppose n = 2. Compute rank(det).
formations.] (b) Suppose n = 3. What is rank(det)? You do not need
to rigorously justify your answer—it is given away
∗3.2.8 Describe the range of the cross product (i.e., the in one of the other exercises in this section.
bilinear transformation C : R3 × R3 → R3 from Exam- [Side note: rank(det) is not known in general. For example,
ple 3.2.1). if n = 5 then the best bounds we know are 17 ≤ rank(det) ≤
20.]
342 Chapter 3. Tensors and Multilinearity

3.2.16 Provide an example to show that in the rank sum [Hint: Consider one of the multilinear transformations
decomposition of Theorem 3.2.4, if r = rank(T ) then the whose rank we mentioned in this section.]
sets of vectors {w j }rj=1 ⊂ FdW and {v1, j }rj=1 ⊂ Fd1 , . . .,
{vk, j }rj=1 ⊂ Fdk cannot necessarily be chosen to be linearly
independent (in contrast with Theorem A.1.3, where they
could).

3.3 The Tensor Product

We now introduce an operation, called the tensor product, that combines two
vector spaces into a new vector space. This operation can be thought of as a
generalization of the Kronecker product that not only lets us multiply together
vectors (from Fn ) and matrices of different sizes, but also allows us to multiply
together vectors from any vector spaces. For example, we can take the tensor
product of a polynomial from P 3 with a matrix from M2,7 while retaining a
vector space structure. Perhaps more usefully though, it provides us with a
systematic way of treating multilinear transformations as linear transformations,
without having to explicitly construct a standard block matrix representation as
in Theorem 3.2.2.

3.3.1 Motivation and Definition


Recall that we defined vector spaces in Section 1.1 so as to generalize Fn
to other settings in which we can apply our linear algebraic techniques. We
similarly would like to define the tensor product of two vectors spaces to
behave “like” the Kronecker product does on Fn or Mm,n . More specifically, if
V and W are vector spaces over the same field F, then we would like to define
their tensor product V ⊗ W to consist of vectors that we denote by v ⊗ w and
their linear combinations, so that the following properties hold for all v, x ∈ V,
w, y ∈ W, and c ∈ F:
• v ⊗ (w + y) = (v ⊗ w) + (v ⊗ y)
• (v + x) ⊗ w = (v ⊗ w) + (x ⊗ w), and (3.3.1)
• (cv) ⊗ w = v ⊗ (cw) = c(v ⊗ w).
In other words, we would like the tensor product to be bilinear. Indeed,
Theorem 3.1.1 also if we return to the Kronecker product on Fn or Mm,n , then these are exactly
had the associativity
property the properties from Theorem 3.1.1 that describe how it interacts with vector
(v⊗w)⊗x = v⊗(w⊗x). addition and scalar multiplication—it does so in a bilinear way.
We will see shortly
that associativity
However, this definition of V ⊗W is insufficient, as there are typically many
comes “for free”, so ways of constructing a vector space in a bilinear way out of two vector spaces
we do not worry V and W, and they may look quite different from each other. For example,
about it right now. if V = W = R2 then each of the following vector spaces X and operations
⊗ : R2 × R2 → X satisfy the three bilinearity properties (3.3.1):
1) X = R4 , with v ⊗ w = (v1 w1 , v1 w2 , v2 w1 , v2 w2 ). This is just the usual
Kronecker product.
2) X = R3 , with v ⊗ w = (v1 w1 , v1 w2 , v2 w2 ). This can be thought of as the
orthogonal projection of the usual Kronecker product down onto R3 by
omitting its third coordinate.
3.3 The Tensor Product 343

In cases (2)–(5) here, 3) X = R2 , with v ⊗ w = (v1 w1 + v2 w2 , v1 w2 − v2 w1 ). This is a projected


“⊗” does not refer to
and rotated version of the Kronecker product—it is obtained by multi-
the Kronecker
product, but rather plying the Kronecker product by the matrix
to new operations  
that we are defining 1 0 0 1
so as to mimic its .
0 1 −1 0
properties.

4) X = R, with v ⊗ w = v · w for all v, w ∈ R2 .


5) The zero subspace X = {0} with v ⊗ w = 0 for all v, w ∈ R2 .
Notice that X = R4 with the Kronecker product is the largest of the above
vector spaces, and all of the others can be thought of as projections of the
Kronecker product down onto a smaller space. In fact, it really is the largest
example possible—it is not possible to construct a 5-dimensional vector space
via two copies of R2 in a bilinear way, since every vector must be a linear
combination of e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , and e2 ⊗ e2 , no matter how we define
the bilinear “⊗” operation.
The above examples suggest that the extra property that we need to add in
order to make tensor products unique is the fact that they are as large as possible
while still satisfying the bilinearity properties (3.3.1). We call this feature the
universal property of the tensor product, and the following definition pins all
of these ideas down. In particular, property (d) is this universal property.

Definition 3.3.1 Suppose V and W are vector spaces over a field F. Their tensor product
Tensor Product is the (unique up to isomorphism) vector space V ⊗ W, also over the field
F, with vectors and operations satisfying the following properties:
a) For every pair of vectors v ∈ V and w ∈ W, there is an associated
vector (called an elementary tensor) v ⊗ w ∈ V ⊗ W, and every
Unique “up to vector in V ⊗ W can be written as a linear combination of these
isomorphism” means elementary tensors.
that all vector
b) Vector addition satisfies
spaces satisfying
these properties are
isomorphic to each v ⊗ (w + y) = (v ⊗ w) + (v ⊗ y) and
other. (v + x) ⊗ w = (v ⊗ w) + (x ⊗ w) for all v, x ∈ V, w, y ∈ W.

c) Scalar multiplication satisfies

c(v ⊗ w) = (cv) ⊗ w = v ⊗ (cw) for all c ∈ F, v ∈ V, w ∈ W.

Property (d) is the d) For every vector space X over F and every bilinear transformation
universal property
T : V × W → X , there exists a linear transformation S : V ⊗ W → X
that forces V ⊗ W to
be as large as such that
possible.
T (v, w) = S(v ⊗ w) for all v ∈ V and w ∈ W.

The way to think about property (d) above (the universal property) is that
it forces V ⊗ W to be so large that we can squash it down (via the linear
transformation S) onto any other vector space X that is similarly constructed
from V and W in a bilinear way. In this sense, the tensor product can be thought
of as “containing” (up to isomorphism) every bilinear combination of vectors
from V and W.
To make the tensor product seem more concrete, it is good to keep the
Kronecker product in the back of our minds as the motivating example. For
344 Chapter 3. Tensors and Multilinearity

Throughout this example, Fm ⊗ Fn = Fmn and Mm,n ⊗ M p,q = Mmp,nq contain all vectors of
entire section, every
the form v ⊗ w and all matrices of the form A ⊗ B, respectively, as well as their
time we encounter a
new theorem, ask linear combinations (this is property (a) of the above definition). Properties (b)
yourself what it and (c) of the above definition are then just chosen to be analogous to the
means for the properties that we proved for the Kronecker product in Theorem 3.1.1, and
Kronecker product
of vectors or
property (d) gets us all linear combinations of Kronecker products (rather than
matrices. just some subspace of them). In fact, Theorem 3.2.2 is essentially equivalent
to property (d) in this case: the standard matrix of S equals the standard block
matrix of T .
We now work through another example to try to start building up some
intuition for how the tensor product works more generally.

Example 3.3.1 Show that P ⊗ P = P2 , the space of polynomials in two variables, if we


The Tensor Product define the elementary tensors f ⊗ g by ( f ⊗ g)(x, y) = f (x)g(y).
of Polynomials
Solution:
We just need to show that P2 satisfies all four properties of Defini-
tion 3.3.1 if we define f ⊗ g in the indicated way:
a) If h ∈ P2 is any 2-variable polynomial then we can write it in the
form

h(x, y) = ∑ a j,k x j yk
j,k=1

For example, if where there are only finitely many non-zero terms in the sum. Since
f (x) = 2x2 − 4 and
each term x j yk is an elementary tensor (it is x j ⊗ yk ), it follows
g(x) = x3 + x − 1 then
( f ⊗ g)(x, y) = that every h ∈ P2 is a linear combination of elementary tensors, as
f (x)g(y) = desired.
(2x2 − 4)(y3 + y − 1).
b), c) These properties follow almost immediately from bilinearity of
usual multiplication: if f , g, h ∈ P and c ∈ R then
 
f ⊗ (g + ch) (x, y) = f (x) g(y) + ch(y)
= f (x)g(y) + c f (x)h(y)
= ( f ⊗ g)(x, y) + c( f ⊗ h)(x, y)

for all x, y ∈ R, so f ⊗ (g + ch) = ( f ⊗ g) + c( f ⊗ h). The fact that


( f + cg) ⊗ h = ( f ⊗ h) + c(g ⊗ h) can be proved similarly.
This example relies d) Given any vector space X and bilinear form T : P × P → X , we just
on the fact that
every function in P2 define S(x j yk ) = T (x j , yk ) for all integers j, k ≥ 0 and then extend
can be written as a via linearity (i.e., we are defining how S acts on a basis of P2 and
finite sum of then using linearity to determine how it acts on other members of
polynomials of the P2 ).
form f (x)g(y).
Analogous
statements in other Once we move away from the spaces Fn , Mm,n , and P, there are quite a few
function spaces fail details that need to be explained in order for Definition 3.3.1 to make sense. In
(see Exercise 3.3.6).
fact, that definition appears at first glance to be almost completely content-free.
It does not give us an explicit definition of either the vector addition or scalar
multiplication operations in V ⊗ W, nor has it even told us what its vectors
“look like”—we denote some of them by v ⊗ w, but so what? It is also not clear
that it exists or that it is unique up to isomorphism.
3.3 The Tensor Product 345

3.3.2 Existence and Uniqueness


We now start clearing up the aforementioned issues with the tensor product—
we show that it exists, that it is unique up to isomorphism, and we discuss how
to actually construct it.
Uniqueness
We first show that the tensor product (if it exists) is unique up to isomorphism.
Fortunately, this property follows almost immediately from the universal prop-
erty of the tensor product.
Suppose there are two vector spaces Y and Z satisfying the four defining
properties from Definition 3.3.1, so we are thinking of each of them as the
Our goal here is to tensor product V ⊗ W. In order to distinguish these two tensor products, we
show that Y and Z
denote them by Y = V ⊗Y W and Z = V ⊗Z W, respectively, and similarly
are isomorphic (i.e.,
there is an invertible for the vectors in these spaces (for example, v ⊗Y w refers to the elementary
linear transformation tensor in Y obtained from v and w).
from one to the
other).
Since ⊗Z : V × W → Z is a bilinear transformation, the universal property
(property (d) of Definition 3.3.1, with X = Z and T = ⊗Z ) for Y = V ⊗Y W
tells us that there exists a linear transformation SY →Z : V ⊗Y W → Z such that

SY →Z (v ⊗Y w) = v ⊗Z w for all v ∈ V, w ∈ W.

A similar argument via the universal property of Z = V ⊗Z W shows that there


In fact, the linear exists a linear transformation SZ →Y : V ⊗Z W → Y such that
transformation S in
Definition 3.3.1(d) is
necessarily unique
SZ →Y (v ⊗Z w) = v ⊗Y w for all v ∈ V, w ∈ W.
(see Exercise 3.3.4).
Since every vector in Y or Z is a linear combination of elementary tensors
−1
v ⊗Y w or v ⊗Z w, respectively, we conclude that SY →Z = SZ →Y is an invert-
ible linear transformation from Y to Z. It follows that Y and Z are isomorphic,
as desired.
Existence and Bases
Showing that the tensor product space V ⊗ W actually exists in general (i.e.,
there is a vector space whose vector addition and scalar multiplication opera-
tions satisfy the required properties) requires some abstract algebra machinery
Look back at that we have not developed here. For this reason, we just show how to construct
Remark 1.2.1 for a V ⊗ W in the special case when V and W each have a basis, which is true at
discussion of which least in the case when V and W are each finite-dimensional. In particular, the
vector spaces have
bases.
following theorem tells us that we can construct a basis of V ⊗ W in exactly the
same way we did when working with the Kronecker product—simply tensor
together bases of V and W.

Theorem 3.3.1 Suppose V and W are vector spaces over the same field with bases B
Bases of Tensor and C, respectively. Then their tensor product V ⊗ W exists and has the
Products following set as a basis:
def 
B ⊗C = e ⊗ f | e ∈ B, f ∈ C .

Proof. We start with the trickiest part of this proof to get our heads around—we
simply define B ⊗C to be a linearly independent set and V ⊗ W to be its span.
That is, we are not defining V ⊗ W via Definition 3.3.1, but rather we are
defining it as the span of a collection of vectors, and then we will show that
346 Chapter 3. Tensors and Multilinearity

it has all of the properties of Definition 3.3.1 (thus establishing that a vector
space with those properties really does exist). Before proceeding, we make
some points of clarification:
If you are • Throughout this proof, we are just thinking of “e ⊗ f” as a symbol that
uncomfortable with
means nothing more than “the ordered pair of vectors e and f”.
defining V ⊗ W to be
a vector space • Similarly, we do not yet know that V ⊗ W really is the tensor product
made up of abstract of V and W, but rather we are just thinking of it as the vector space that
symbols, look ahead
to Remark 3.3.1,
has B ⊗C as a basis:
which might clear ( )
things up a bit.
V ⊗W = ∑ ci d j (ei ⊗ f j ) : bi , c j ∈ F, ei ∈ B, f j ∈ C for all i, j ,
i, j

where the sum is finite.


For each v = ∑i ci ei ∈ V and w = ∑ j d j f j ∈ W (where ci , d j ∈ F, ei ∈ B,
and f j ∈ C for all i and j) we then similarly define
This definition of v ⊗ w
implicitly relies on v ⊗ w = ∑ ci d j (ei ⊗ f j ). (3.3.2)
uniqueness of linear i, j
combinations when
representing v and w Our goal is now to show that when we define V ⊗ W and v ⊗ w in this way,
with respect to the each of the properties (a)–(d) of Definition 3.3.1 holds.
bases B and C,
respectively
Property (a) holds trivially since we have defined v ⊗ w ∈ V ⊗ W in general,
(Theorem 1.1.4). and we constructed V ⊗ W so that it is spanned by vectors of the form e ⊗ f,
where e ∈ B and f ∈ C, so it is certainly spanned by vectors of the form v ⊗ w,
where v ∈ V and w ∈ W.
To see why properties (b) and (c) hold, suppose v = ∑i ci ei ∈ V, w =
∑ j d j f j ∈ W, and y = ∑ j b j f j ∈ W. Then
!
v ⊗ (w + y) = v ⊗ ∑ d jf j + ∑ b jf j (expand w and y)
j j
!
= v⊗ ∑(d j + b j )f j (distributivity in W)
j

= ∑ ci (d j + b j )(ei ⊗ f j ) (use Equation (3.3.2))


i, j

= ∑ ci d j (ei ⊗ f j ) + ∑ ci b j (ei ⊗ f j ) (distributivity in V ⊗ W)


i, j i, j

= (v ⊗ w) + (v ⊗ y), (Equation (3.3.2) again)

and a similar argument shows that


! !
v ⊗ (cw) = v ⊗ c ∑ d j f j = v⊗ ∑(cd j )f j
j j

= ∑ ci (cd j )(ei ⊗ f j ) = c ∑ ci d j (ei ⊗ f j ) = c(v ⊗ w).


i, j i, j

The proofs that (v + x) ⊗ w = (v ⊗ w) + (x ⊗ w) and (cw) ⊗ w = c(v ⊗ w) are


almost identical to the ones presented above, and are thus omitted.
Finally, for property (d) we simply define S : V ⊗ W → X by setting
S(e ⊗ f) = T (e, f) for each e ∈ B and f ∈ C, and extending via linearity (recall
that every linear transformation is determined completely by how it acts on a
basis of the input space). 
3.3 The Tensor Product 347

Since the dimension of a vector space is defined to equal the number of


vectors in any of its bases, we immediately get the following corollary that tells
us that the dimension of the tensor product of two finite-dimensional vector
spaces is just the product of their individual dimensions.

Corollary 3.3.2 Suppose V and W are finite-dimensional vector spaces. Then


Dimensionality of
Tensor Products dim(V ⊗ W) = dim(V) dim(W).

Remark 3.3.1 The tensor product is actually quite analogous to the external direct sum
The Direct Sum and from Section 1.B.3. Just like the direct sum V ⊕ W can be thought of
Tensor Product as a way of constructing a new vector space by “adding” two vector
spaces V and W, the tensor product V ⊗ W can be thought of as a way of
constructing a new vector space by “multiplying” V and W.
More specifically, both of these vector spaces V ⊕ W and V ⊗ W are
constructed out of ordered pairs of vectors from V and W, along with
a specification of how to add those ordered pairs and multiply them by
scalars. In V ⊕ W we denoted those ordered pairs by (v, w), whereas in
V ⊗ W we denote those ordered pairs by v ⊗ w. The biggest difference
between these two constructions is that these ordered pairs are the only
members of V ⊕ W, whereas some members of V ⊗ W are just linear
combinations of these ordered pairs.
This analogy between the direct sum and tensor product is even more
explicit if we look at what they do to bases B and C of V and W, respec-
tively: a basis of V ⊕ W is the disjoint union B ∪ C, whereas a basis of
V ⊗ W is the Cartesian product B ×C (up to relabeling the members of
these sets appropriately). For example, if V = R2 and W = R3 have bases
B = {v1 , v2 } and C = {v3 , v4 , v5 }, respectively, then R2 ⊕ R3 ∼ = R5 has
In this basis of basis
R2 ⊕ R3 , each {(v1 , 0), (v2 , 0), (0, v3 ), (0, v4 ), (0, v5 )},
subscript from B and
C appears exactly which is just the disjoint union of B and C, but with each vector turned into
once. In the basis of
an ordered pair so that this statement makes sense. Similarly, R2 ⊗R3 ∼= R6
R2 ⊗ R3 , each
ordered pair of has basis
subscripts from B
and C appears {v1 ⊗ v3 , v1 ⊗ v4 , v1 ⊗ v5 , v2 ⊗ v3 , v2 ⊗ v4 , v2 ⊗ v5 },
exactly once.
which is just the Cartesian product of B and C, but with each ordered pair
written as an elementary tensor so that this statement makes sense.

Even after working through the proof of Theorem 3.3.1, tensor products
still likely feel very abstract—how do we actually construct them? The reason
that they feel this way is that they are only defined up to isomorphism, so it’s
somewhat impossible to say what they look like. We could say that Fm ⊗ Fn =
Fmn and that v ⊗ w is just the Kronecker product of v ∈ Fm and w ∈ Fn , but
we could just as well say that Fm ⊗ Fn = Mm,n (F) and that v ⊗ w = vwT
is the outer product of v and w. After all, these spaces are isomorphic via
vectorization/matricization, so they look the same when we just describe what
linear algebraic properties they satisfy.
For this reason, when we construct “the” tensor product of two vector
spaces, we have some freedom in how we represent it. In the following
348 Chapter 3. Tensors and Multilinearity

example, we pick one particularly natural representation of the tensor product,


but we emphasize that it is not the only one possible.

Example 3.3.2 Describe the vector space M2 (R) ⊗ P 2 and its elementary tensors.
The Tensor Product
Solution:
of Matrices and
If we take inspiration from the Kronecker product, which places a copy
Polynomials
of one vector space on each basis vector (i.e., entry) of the other vector
space, it seems natural to guess that we can represent M2 ⊗ P 2 as the
vector space of 2 × 2 matrices whose entries are polynomials of degree at
Recall that P 2 is the most 2. That is, we guess that
space of
polynomials of (" # )

degree at most 2. 2 f1,1 (x) f1,2 (x) 2
M2 ⊗ P = fi, j ∈ P for 1 ≤ i, j ≤ 2 .
f2,1 (x) f2,2 (x)

The elementary tensors in M2 ⊗ P 2 are those that can be written in the


form A f (x) for some A ∈ M2 and f ∈ P 2 (i.e., the elementary tensors are
the ones for which the polynomials fi, j are all multiples of each other).
Alternatively, by grouping the coefficients of each fi, j , we could think
of the members of M2 ⊗ P 2 as degree-2 polynomials from R to M2 :

M2 ⊗ P 2 =

f : R → M2 f (x) = Ax2 + Bx +C for some A, B,C ∈ M2 .

When viewed in this way, the elementary tensors in M2 ⊗ P 2 are again


those of the form A f (x) for some A ∈ M2 and f ∈ P 2 (i.e., the ones for
which A, B, and C are all multiples of each other).
To verify that this set really is the tensor product M2 ⊗ P 2 , we could
proceed as we did in Example 3.3.1 and check the four defining properties
of Definition 3.3.1. All four of these properties are straightforward and
somewhat unenlightening to check, so we leave them to Exercise 3.3.5.

Higher-Order Tensor Products


Just as was the case with the Kronecker product, one of the main purposes
of the tensor product is that it lets us turn bilinear transformations into linear
ones. This is made explicit by the universal property (property (d) of Defini-
tion 3.3.1)—if we have some bilinear transformation T : V × W → X that we
wish to know more about, we can instead construct the tensor product space
V ⊗ W and investigate the linear transformation S : V ⊗ W → X defined by
S(v ⊗ w) = T (v, w).
Slightly more generally, we can also consider the tensor product of three or
more vector spaces, and doing so lets us represent multilinear transformations
as linear transformations acting on the tensor product space. We now pin down
the details needed to show that this is indeed the case.

Theorem 3.3.3 If V, W, and X are vector spaces over the same field then
Associativity of the
Tensor Product (V ⊗ W) ⊗ X ∼
= V ⊗ (W ⊗ X ).

Recall that “∼
=” Proof. For each x ∈ X , define a bilinear map Tx : V × W → V ⊗ (W ⊗ X )
means “is
by Tx (v, w) = v ⊗ (w ⊗ x). Since Tx is bilinear, the universal property of the
isomorphic to”.
3.3 The Tensor Product 349

tensor product (i.e., property (d) of Definition 3.3.1) says that there exists a
linear transformation Sx : V ⊗ W → V ⊗ (W ⊗ X ) that acts via Sx (v ⊗ w) =
Tx (v, w) = v ⊗ (w ⊗ x).
Next, define a bilinear map Te : (V ⊗W)×X → V ⊗(W ⊗X ) via Te(u, x) =
Sx (u) for all u ∈ V ⊗ W and x ∈ X . By using the universal property again, we
see that there exists a linear transformation Se : (V ⊗ W) ⊗ X → V ⊗ (W ⊗ X )
e ⊗ x) = Te(u, x) = Sx (u). If u = v ⊗ w then this says that
that acts via S(u

Se (v ⊗ w) ⊗ x = Sx (v ⊗ w) = v ⊗ (w ⊗ x).

This argument can also be reversed to find a linear transformation that sends
v ⊗ (w ⊗ x) to (v ⊗ w) ⊗ x, so Se is invertible and thus an isomorphism. 
Now that we know that the tensor product is associative, we can unambigu-
ously refer to the tensor product of 3 vector spaces V, W, and X as V ⊗ W ⊗ X ,
since it does not matter whether we take this to mean (V ⊗ W) ⊗ X or V ⊗
(W ⊗ X ). The same is true when we tensor together 4 or more vector spaces,
and in this setting we say that an elementary tensor in V1 ⊗ V2 ⊗ · · · ⊗ V p is a
vector of the form v1 ⊗ v2 ⊗ · · · ⊗ v p , where v j ∈ V j for each 1 ≤ j ≤ p.
It is similarly the case that the tensor product is commutative in the sense
Keep in mind that that V ⊗ W ∼ = W ⊗ V (see Exercise 3.3.10). For example, if V = Fm and
V ⊗W ∼ = W ⊗ V does n
W = F then the swap operator Wm,n of Definition 3.1.3 is an isomorphism from
not mean that
v ⊗ w = w ⊗ v for all V ⊗ W = Fm ⊗ Fn to W ⊗ V = Fn ⊗ Fm . More generally, higher-order tensor
v ∈ V, w ∈ W. Rather, product spaces V1 ⊗ V2 ⊗ · · · ⊗ V p are also isomorphic to the tensor product of
it just means that the same spaces in any other order. That is, if σ : {1, 2, . . . , p} → {1, 2, . . . , p}
there is an
isomorphism that
is a permutation then
sends each v ⊗ w to
w ⊗ v. V1 ⊗ V 2 ⊗ · · · ⊗ V p ∼
= Vσ (1) ⊗ Vσ (2) ⊗ · · · ⊗ Vσ (p) .

Again, if each of these vector spaces is Fn (and thus the tensor product is the
Kronecker product) then the swap matrix Wσ introduced in Section 3.1.3 is the
standard isomorphism between these spaces.
We motivated the tensor product as a generalization of the Kronecker
product that applies to any vector spaces (as opposed to just Fn and/or Mm,n ).
We now note that if we fix bases of the vector spaces that we are working
with then the tensor product really does look like the Kronecker product of
coordinate vectors, as we would hope:

Theorem 3.3.4 Suppose V1 , V2 , . . ., V p are finite-dimensional vector spaces over the same
Kronecker Product of field with bases B1 , B2 , . . ., B p , respectively. Then
Coordinate Vectors
def 
B1 ⊗ B2 ⊗ · · · ⊗ B p = b(1) ⊗ b(2) ⊗ · · · ⊗ b(p) | b( j) ∈ B j for all 1 ≤ j ≤ p

is a basis of V1 ⊗ V2 ⊗ · · · ⊗ V p , and if we order it lexicographically then


In the concluding  
line of this theorem, v1 ⊗ v2 ⊗ · · · ⊗ v p B ⊗B ⊗···⊗B p = [v1 ]B1 ⊗ [v2 ]B2 ⊗ · · · ⊗ [v p ]B p
1 2
the “⊗” on the left is
the tensor product
and the “⊗” on the for all v1 ∈ V1 , v2 ∈ V2 , . . ., v p ∈ V p .
right is the Kronecker
product. In the above theorem, when we say that we are ordering the basis B1 ⊗
B2 ⊗ · · · ⊗ B p lexicographically, we mean that we order it so as to “count” its
basis vectors in the most natural way, much like we did for bases of Kronecker
product spaces in Section 3.1 and for standard block matrices of multilinear
350 Chapter 3. Tensors and Multilinearity

transformations in Section 3.2.2. For example, if B1 = {v1 , v2 } and B2 =


{w1 , w2 , w3 } then we order B1 ⊗ B2 as

B1 ⊗ B2 = {v1 ⊗ w1 , v1 ⊗ w2 , v1 ⊗ w3 , v2 ⊗ w1 , v2 ⊗ w2 , v2 ⊗ w3 }.

Proof of Theorem 3.3.4. To see that B1 ⊗ B2 ⊗ · · · ⊗ B p is a basis of V1 ⊗ V2 ⊗


· · · ⊗ V p , we note that repeated application of Corollary 3.3.2 tells us that

dim(V1 ⊗ V2 ⊗ · · · ⊗ V p ) = dim(V1 ) dim(V2 ) · · · dim(V p ),

which is exactly the number of vectors in B1 ⊗ B2 ⊗ · · · ⊗ B p . It is straight-


forward to see that every elementary tensor is in span(B1 ⊗ B2 ⊗ · · · ⊗ B p ), so
V1 ⊗ V2 ⊗ · · · ⊗ V p = span(B1 ⊗ B2 ⊗ · · · ⊗ B p ), and thus Exercise 1.2.27(b)
tells us that B1 ⊗ B2 ⊗ · · · ⊗ B p is indeed a basis of V1 ⊗ V2 ⊗ · · · ⊗ V p .
 
The fact that v1 ⊗v2 ⊗· · ·⊗v p B ⊗B ⊗···⊗B p = [v1 ]B1 ⊗[v2 ]B2 ⊗· · ·⊗[v p ]B p
1 2
follows just by matching up all of the relevant definitions: if we write B j =
 ( j) ( j) ( j) kj ( j) ( j) ( j) ( j) ( j) 
b1 , b2 , . . . , bk j and v j = ∑i=1 ci bi (so that [v j ]B j = c1 , c2 , . . . , ck j )
This proof is the for each 1 ≤ j ≤ p, then
epitome of
“straightforward but ! ! kp
!
k1 k2
hideous”. Nothing (1) (1) (2) (2) (p) (p)
that we are doing v1 ⊗ v2 ⊗ · · · ⊗ v p = ∑ ci bi ⊗ ∑ ci bi ⊗···⊗ ∑ ci bi
here is clever—it is all i=1 i=1 i=1
just definition (1) (2) (p) (1) (2) (p) 
chasing, but it is ugly
= ∑ ci1 ci2 · · · ci p bi1 ⊗ bi2 ⊗ · · · ⊗ bi p .
i1 ,i2 ,...,i p
because there are
so many objects to  (1) (2) (p)
keep track of. Since the vectors bi1 ⊗bi2 ⊗· · ·⊗bi p on the right are the members of B1 ⊗
 (1) (2) (p)
B2 ⊗ · · · ⊗ B p and the scalars ci1 ci2 · · · ci p are the entries of the Kronecker
product [v1 ]B1 ⊗ [v2 ]B2 ⊗ · · · ⊗ [v p ]B p , in the same order, the result follows. 

It might sometimes
 be convenient
 to instead arrange the entries of the coor-
dinate vector v1 ⊗ v2 ⊗ · · · ⊗ v p B ⊗B ⊗···⊗B p into a p-dimensional dim(V1 ) ×
1 2
dim(V2 )×· · ·×dim(V p ) array, rather than a long dim(V1 ) dim(V2 ) · · · dim(V p )-
entry vector as we did here. What the most convenient representation of
v1 ⊗ v2 ⊗ · · · ⊗ v p is depends heavily on context—what the tensor product
represents and what we are trying to do with it.

3.3.3 Tensor Rank


Although not every vector in a tensor product space V1 ⊗ V2 ⊗ · · · ⊗ V p is
an elementary tensor, every vector can (by definition) be written as a linear
combination of elementary tensors. It seems natural to ask how many terms
are required in such a linear combination, and we give this minimal number of
terms a name:
3.3 The Tensor Product 351

Definition 3.3.2 Suppose V1 , V2 , . . . , V p are finite-dimensional vector spaces over the same
Tensor Rank field and v ∈ V1 ⊗ V2 ⊗ · · · ⊗ V p . The tensor rank (or simply the rank) of
v, denoted by rank(v), is the minimal integer r such that v can be written
as a sum of r elementary tensors:
r
(1) (2) (p)
v = ∑ vi ⊗ vi ⊗ · · · ⊗ vi ,
i=1

( j)
where vi ∈ V j for each 1 ≤ i ≤ r and 1 ≤ j ≤ p.

The tensor rank generalizes the rank of a matrix in the following sense: if
p = 2 and V1 = Fm and V2 = Fn , then Fm ⊗ Fn ∼ = Mm,n (F), and the tensor rank
in this space really is just the usual matrix rank. After all, when we represent
Fm ⊗ Fn in this way, its elementary tensors are the matrices v ⊗ w = vwT , and
we know from Theorem A.1.3 that the rank of a matrix is the fewest number of
these (rank-1) matrices that are needed to sum to it.
In fact, a similar argument shows that when all of the spaces are finite-
dimensional, the tensor rank is equivalent to the rank of a multilinear transfor-
mation. After all, we showed in Theorem 3.2.4 that the rank of a multilinear
transformation T is the least integer r such that its standard block matrix A can
be written in the form
r
(1) (2) (p) T
A= ∑ wj vj ⊗vj ⊗···⊗vj .
j=1

Well, the vectors in the above sum are exactly the elementary tensors if we
represent Fd1 ⊗ Fd2 ⊗ · · · ⊗ Fd p ⊗ FdW as MdW ,d1 d2 ···d p (F) in the natural way.
Flattenings and Bounds
Since the rank of a multilinear transformation is difficult to compute, so is
tensor rank. However, there are a few bounds that we can use to help narrow
it down somewhat. The simplest of these bounds comes from just forgetting
about part of the tensor product structure of V1 ⊗ V2 ⊗ · · · ⊗ V p . That is, if we
let {S1 , S2 , . . . , Sk } be any partition of {1, 2, . . . , p} (i.e., S1 , S2 , . . ., Sk are sets
such that S1 ∪ S2 ∪ · · · ∪ Sk = {1, 2, . . . , p} and Si ∩ Sj = {} whenever i ≠ j)
then

V1 ⊗ V2 ⊗ · · · ⊗ Vp ≅ ( ⨂_{i∈S1} Vi ) ⊗ ( ⨂_{i∈S2} Vi ) ⊗ · · · ⊗ ( ⨂_{i∈Sk} Vi ).

(The notation ⨂_{i∈Sj} Vi means the tensor product of each Vi where i ∈ Sj. It is analogous to big-Σ notation for sums.) To be clear, we are thinking of the vector space on the right as a tensor product of just k vector spaces—its elementary tensors are the ones of the form v1 ⊗ v2 ⊗ · · · ⊗ vk, where vj ∈ ⨂_{i∈Sj} Vi for each 1 ≤ j ≤ k.
Perhaps the most natural isomorphism between these two spaces comes
from recalling that we can write each v ∈ V1 ⊗ V2 ⊗ · · · ⊗ V p as a sum of
elementary tensors
v = ∑_{j=1}^{r} v_j^{(1)} ⊗ v_j^{(2)} ⊗ · · · ⊗ v_j^{(p)},

and if we simply regroup those products we get the following vector ṽ ∈ ( ⨂_{i∈S1} Vi ) ⊗ ( ⨂_{i∈S2} Vi ) ⊗ · · · ⊗ ( ⨂_{i∈Sk} Vi ):

ṽ = ∑_{j=1}^{r} ( ⨂_{i∈S1} v_j^{(i)} ) ⊗ ( ⨂_{i∈S2} v_j^{(i)} ) ⊗ · · · ⊗ ( ⨂_{i∈Sk} v_j^{(i)} ).

We call ṽ a flattening of v, and we note that the rank of ṽ never exceeds the
rank of v, simply because the procedure that we used to construct ṽ from v
turns a sum of r elementary tensors into another sum of r elementary tensors.
We state this observation as the following theorem:

Theorem 3.3.5 Suppose V1 , V2 , . . . , V p are finite-dimensional vector spaces over the same
Tensor Rank of field and v ∈ V1 ⊗ V2 ⊗ · · · ⊗ V p . If ṽ is a flattening of v then
Flattenings
rank(v) ≥ rank(ṽ).

We emphasize that the opposite inequality does not hold in general: because the space ( ⨂_{i∈S1} Vi ) ⊗ ( ⨂_{i∈S2} Vi ) ⊗ · · · ⊗ ( ⨂_{i∈Sk} Vi ) is a “coarser” tensor product of just k ≤ p spaces, some of its elementary tensors do not correspond to elementary tensors in V1 ⊗ V2 ⊗ · · · ⊗ Vp. (Flattenings are most useful when k = 2, as we can then easily compute their rank by thinking of them as matrices.)
For example, if v ∈ Fd1 ⊗ Fd2 ⊗ Fd3 ⊗ Fd4 then we obtain flattenings of
v by just grouping some of these tensor product factors together. There are
many ways to do this, so these flattenings of v may live in many different
spaces, some of which are listed below along with the partition {S1 , S2 , . . . , Sk }
of {1, 2, 3, 4} that they correspond to:

Partition {S1 , S2 , . . . , Sk } Flattened Space


S1 = {1, 2}, S2 = {3}, S3 = {4} Fd1 d2 ⊗ Fd3 ⊗ Fd4
S1 = {1, 3}, S2 = {2}, S3 = {4} Fd1 d3 ⊗ Fd2 ⊗ Fd4
S1 = {1, 2}, S2 = {3, 4} Fd1 d2 ⊗ Fd3 d4
S1 = {1, 2, 3}, S2 = {4} Fd1 d2 d3 ⊗ Fd4

In fact, if we flatten the space Fd1 ⊗ Fd2 ⊗ · · · ⊗ Fd p ⊗ FdW by choos-


ing S1 = {p + 1} and S2 = {1, 2, . . . , p} then we see that the tensor rank
of v ∈ Fd1 ⊗ Fd2 ⊗ · · · ⊗ Fd p ⊗ FdW is lower-bounded by the tensor rank of
its flattening ṽ ∈ FdW ⊗ Fd1 d2 ···d p , which equals the usual (matrix) rank of
mat(ṽ) ∈ MdW ,d1 d2 ···d p (F). We have thus recovered exactly our observation
from the end of Section 3.2.3 that if T is a multilinear transformation with
standard block matrix A then rank(T ) ≥ rank(A); we can think of standard
block matrices as flattenings of arrays.
However, flattenings are somewhat more general than standard block ma-
trices, as we are free to partition the tensor factors in any way we like, and
some choices of partition may give better lower bounds than others. As long as
we form the partition via k = 2 subsets though, we can interpret the resulting
flattening just as a matrix and compute its rank using standard linear algebra
machinery, thus obtaining an easily-computed lower bound on tensor rank.

Example 3.3.3 Show that the following vector v ∈ (C2 )⊗4 has tensor rank 4:
Computing Tensor
Rank Bounds via v = e1 ⊗ e1 ⊗ e1 ⊗ e1 + e1 ⊗ e2 ⊗ e1 ⊗ e2 +
Flattenings e2 ⊗ e1 ⊗ e2 ⊗ e1 + e2 ⊗ e2 ⊗ e2 ⊗ e2 .

Solution:
It is clear that rank(v) ≤ 4, since v was provided to us as a sum of
4 elementary tensors. To compute lower bounds of rank(v), we could
use any of its many flattenings. For example, if we choose the partition
(C2)⊗4 ≅ C2 ⊗ C8 then we get the following flattening ṽ of v:

ṽ = e1 ⊗ (e1 ⊗ e1 ⊗ e1) + e1 ⊗ (e2 ⊗ e1 ⊗ e2) + e2 ⊗ (e1 ⊗ e2 ⊗ e1) + e2 ⊗ (e2 ⊗ e2 ⊗ e2)
  = e1 ⊗ e1 + e1 ⊗ e6 + e2 ⊗ e3 + e2 ⊗ e8.

(In the final line here, the first factor of terms like e1 ⊗ e6 lives in C2 while the second lives in C8.) To compute rank(ṽ), we note that its matricization is

mat(ṽ) = e1 e1^T + e1 e6^T + e2 e3^T + e2 e8^T = [ 1 0 0 0 0 1 0 0 ]
                                                 [ 0 0 1 0 0 0 0 1 ],

so rank(ṽ) = rank(mat(ṽ)) = 2 and thus rank(v) ≥ 2. (As another way to see that rank(ṽ) = 2, we can rearrange ṽ as ṽ = e1 ⊗ (e1 + e6) + e2 ⊗ (e3 + e8).)
To get a better lower bound, we just try another flattening of v: we instead choose the partition (C2)⊗4 ≅ C4 ⊗ C4, so that we get the following flattening v′ of v:

v′ = (e1 ⊗ e1) ⊗ (e1 ⊗ e1) + (e1 ⊗ e2) ⊗ (e1 ⊗ e2) + (e2 ⊗ e1) ⊗ (e2 ⊗ e1) + (e2 ⊗ e2) ⊗ (e2 ⊗ e2)
   = e1 ⊗ e1 + e2 ⊗ e2 + e3 ⊗ e3 + e4 ⊗ e4.

To compute rank(v′), we note that its matricization is

mat(v′) = e1 e1^T + e2 e2^T + e3 e3^T + e4 e4^T = [ 1 0 0 0 ]
                                                  [ 0 1 0 0 ]
                                                  [ 0 0 1 0 ]
                                                  [ 0 0 0 1 ],

so rank(v′) = rank(mat(v′)) = 4 and thus rank(v) ≥ 4, as desired.
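For readers who want to check computations like this numerically, here is a minimal sketch in Python with NumPy (our choice of tool—nothing in the text depends on it). Because coordinates are ordered according to the Kronecker product, both flattenings used above are literally just reshapes of the coefficient vector; the helper elem below is simply a repeated Kronecker product.

```python
# A numerical check of Example 3.3.3 (sketch only): the ranks of the
# C^2 (x) C^8 and C^4 (x) C^4 flattenings of v are 2 and 4, respectively.
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def elem(*factors):
    """Elementary tensor: the Kronecker product of the given vectors."""
    out = np.array([1.0])
    for f in factors:
        out = np.kron(out, f)
    return out

v = (elem(e1, e1, e1, e1) + elem(e1, e2, e1, e2)
     + elem(e2, e1, e2, e1) + elem(e2, e2, e2, e2))      # vector in C^16

print(np.linalg.matrix_rank(v.reshape(2, 8)))   # 2, the flattening S1={1}, S2={2,3,4}
print(np.linalg.matrix_rank(v.reshape(4, 4)))   # 4, the flattening S1={1,2}, S2={3,4}
```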

There are many situations where none of a vector’s flattenings have rank equal to that of the vector itself, so it may be the case that none of the lower bounds obtained in this way are tight. We thus introduce one more way of bounding tensor rank that is sometimes a bit stronger than these bounds based on flattenings. The rough idea behind this bound is that if f ∈ V1∗ is a linear form then the linear transformation f ⊗ I ⊗ · · · ⊗ I : V1 ⊗ V2 ⊗ · · · ⊗ Vp → V2 ⊗ · · · ⊗ Vp defined by

( f ⊗ I ⊗ · · · ⊗ I)(v1 ⊗ v2 ⊗ · · · ⊗ vp) = f(v1)(v2 ⊗ · · · ⊗ vp)

sends elementary tensors to elementary tensors, and can thus be used to help us investigate tensor rank. (We only specified how f ⊗ I ⊗ · · · ⊗ I acts on elementary tensors here; how it acts on the rest of the space is determined via linearity.) We note that the universal property of the tensor product (Definition 3.3.1(d)) tells us that this function f ⊗ I ⊗ · · · ⊗ I actually exists and is well-defined: existence of f ⊗ I ⊗ · · · ⊗ I follows from first constructing a multilinear transformation g : V1 × V2 × · · · × Vp → V2 ⊗ · · · ⊗ Vp that acts via g(v1, v2, . . . , vp) = f(v1)(v2 ⊗ · · · ⊗ vp) and then using the universal property to see that there exists a linear transformation f ⊗ I ⊗ · · · ⊗ I satisfying ( f ⊗ I ⊗ · · · ⊗ I)(v1 ⊗ v2 ⊗ · · · ⊗ vp) = g(v1, v2, . . . , vp).

Theorem 3.3.6 Suppose V1 , V2 , . . . , V p are finite-dimensional vector spaces over the same
Another Lower Bound field and v ∈ V1 ⊗ V2 ⊗ · · · ⊗ V p . Define
on Tensor Rank 
Sv = { ( f ⊗ I ⊗ · · · ⊗ I)(v) | f ∈ V1∗ },

which is a subspace of V2 ⊗ · · · ⊗ V p . If there does not exist a basis of Sv


consisting entirely of elementary tensors then rank(v) > dim(Sv ).

Proof. We prove the contrapositive of the statement of the theorem. Let r = rank(v), so that we can write v as a sum of r elementary tensors:

v = ∑_{j=1}^{r} v_j^{(1)} ⊗ v_j^{(2)} ⊗ · · · ⊗ v_j^{(p)}.

(This theorem also works if we instead apply a linear form f ∈ Vk∗ to the k-th tensor factor for any 1 ≤ k ≤ p; we just state it in the k = 1 case here for simplicity.) Since

( f ⊗ I ⊗ · · · ⊗ I)(v) = ∑_{j=1}^{r} f( v_j^{(1)} ) ( v_j^{(2)} ⊗ · · · ⊗ v_j^{(p)} ),

for all f ∈ V1∗, it follows that the set B = { v_j^{(2)} ⊗ · · · ⊗ v_j^{(p)} : 1 ≤ j ≤ r } satisfies
span(B) ⊇ Sv . If r ≤ dim(Sv ) then we must actually have r = dim(Sv ) and
span(B) = Sv , as that is the only way for an r-dimensional vector space to
be contained in a vector space of dimension at most r. It then follows from
Exercise 1.2.27(b) that B is a basis of Sv consisting entirely of elementary
tensors. 
In the p = 2 case, the above theorem says nothing at all, since the elementary
tensors in Sv ⊆ V2 are simply all of the members of Sv , and there is of course a
basis of Sv within Sv . However, when p ≥ 3, the above theorem can sometimes
be used to prove bounds on tensor rank that are better than any bound possible
via flattenings.
For example, all non-trivial flattenings of a vector v ∈ (C2)⊗3 live in either
C2 ⊗ C4 or C4 ⊗ C2, so flattenings can never provide a better lower bound than rank(v) ≥ 2
in this case. However, the following example shows that some vectors in (C2)⊗3
have tensor rank 3.

Example 3.3.4 Show that the following vector v ∈ (C2 )⊗3 has tensor rank 3:
Tensor Rank Can
Exceed Local v = e1 ⊗ e1 ⊗ e2 + e1 ⊗ e2 ⊗ e1 + e2 ⊗ e1 ⊗ e1 .
Dimension
Solution:
It is clear that rank(v) ≤ 3, since v was provided to us as a sum of
3 elementary tensors. On the other hand, to see that rank(v) ≥ 3, we
construct the subspace Sv ⊆ C2 ⊗ C2 described by Theorem 3.3.6:

Sv = { (w^T e1)(e1 ⊗ e2 + e2 ⊗ e1) + (w^T e2)(e1 ⊗ e1) | w ∈ C2 }
   = { (a, b, b, 0) | a, b ∈ C }.

(Here we have associated the linear form f with a row vector w^T (via Theorem 1.3.3) and defined a = w^T e2 and b = w^T e1 for simplicity of notation.) It is clear that dim(Sv) = 2. Furthermore, we can see that the only elementary tensors in Sv are those of the form (a, 0, 0, 0), since the matricization of (a, b, b, 0) is

[ a  b ]
[ b  0 ],

which has rank 2 whenever b ≠ 0. It follows that Sv does not have a


basis consisting of elementary tensors, so Theorem 3.3.6 tells us that
rank(v) > dim(Sv ) = 2, so rank(v) = 3.
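Incidentally, a quick computation in the same spirit as the sketch above (Python/NumPy, same coordinate conventions as before) shows that every non-trivial flattening of this vector v has rank only 2, which is why the subspace argument of Theorem 3.3.6 was needed to certify that rank(v) = 3.

```python
# Sketch: all three non-trivial flattenings of v from Example 3.3.4 have rank 2.
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
kron3 = lambda a, b, c: np.kron(np.kron(a, b), c)

v = kron3(e1, e1, e2) + kron3(e1, e2, e1) + kron3(e2, e1, e1)
T = v.reshape(2, 2, 2)          # T[i, j, k] is the coefficient of e_i (x) e_j (x) e_k

for axes, label in [((0, 1, 2), "{1}|{2,3}"),
                    ((1, 0, 2), "{2}|{1,3}"),
                    ((2, 0, 1), "{3}|{1,2}")]:
    M = T.transpose(axes).reshape(2, 4)
    print(label, np.linalg.matrix_rank(M))      # each flattening has rank 2
```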

Tensor Rank is a Nightmare


We close this section by demonstrating two ways in which tensor rank is
less well-behaved than usual matrix rank (besides just being more difficult to
compute), even when we just restrict to the space Fd1 ⊗· · ·⊗ Fd p when F = R or
F = C. The first of these unfortunate aspects of tensor rank is that some vectors
can be approximated arbitrarily well by vectors with strictly smaller rank.
For example, if we define
vk = k(e1 + e2/k) ⊗ (e1 + e2/k) ⊗ (e1 + e2/k) − k e1 ⊗ e1 ⊗ e1
   = e1 ⊗ e1 ⊗ e2 + e1 ⊗ e2 ⊗ e1 + e2 ⊗ e1 ⊗ e1
     + (1/k)( e2 ⊗ e2 ⊗ e1 + e2 ⊗ e1 ⊗ e2 + e1 ⊗ e2 ⊗ e2 )
     + (1/k²)( e2 ⊗ e2 ⊗ e2 )

then it is straightforward to see that rank(vk) = 2 for all k. However,

lim_{k→∞} vk = e1 ⊗ e1 ⊗ e2 + e1 ⊗ e2 ⊗ e1 + e2 ⊗ e1 ⊗ e1

is the vector with tensor rank 3 from Example 3.3.4. To deal with issues like this, we define the border rank of a vector v ∈ Fd1 ⊗ · · · ⊗ Fd p to be the smallest integer r such that v can be written as a limit of vectors with tensor rank no larger than r. We just showed that the vector from Example 3.3.4 has border rank ≤ 2, despite having tensor rank 3 (in fact, its border rank is exactly 2—see Exercise 3.3.12).
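The limit above is easy to witness numerically. The following sketch (Python/NumPy again, same conventions as before) builds each vk as an explicit sum of 2 elementary tensors and shows that its distance to the rank-3 vector v shrinks like 1/k.

```python
# Sketch: rank-2 vectors v_k converging to the rank-3 vector v of Example 3.3.4.
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
kron3 = lambda a, b, c: np.kron(np.kron(a, b), c)

v = kron3(e1, e1, e2) + kron3(e1, e2, e1) + kron3(e2, e1, e1)

for k in [1, 10, 100, 1000]:
    w = e1 + e2 / k
    v_k = k * kron3(w, w, w) - k * kron3(e1, e1, e1)   # a sum of 2 elementary tensors
    print(k, np.linalg.norm(v_k - v))                  # distance is O(1/k)
```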
Nothing like this happens when p = 2 (i.e., when we can think of tensor
rank in Fd1 ⊗ Fd2 as the usual rank of a matrix in Md1 ,d2 (F)): if A1 , A2 , . . . ∈
Md1 ,d2 (F) each have rank(Ak ) ≤ r then
rank( lim_{k→∞} Ak ) ≤ r

too (as long as this limit exists). (In other words, the rank of a matrix can “jump down” in a limit, but it cannot “jump up”; when p ≥ 3, tensor rank can jump either up or down in limits.) This fact can be verified using the techniques of Section 2.D—recall that the singular values of a matrix are continuous in its entries, so if each Ak has at most r non-zero singular values then the same must be true of their limit (in fact, this was exactly Exercise 2.D.3).
The other unfortunate aspect to tensor rank is that it is field-dependent. For
example, if we let e+ = e1 + e2 ∈ R2 and e− = e1 − e2 ∈ R2 then the vector

v = e+ ⊗ e+ ⊗ e1 − e1 ⊗ e1 ⊗ e− − e2 ⊗ e2 ⊗ e+ ∈ (R2 )⊗3 (3.3.3)


has tensor rank 3 (see Exercise 3.3.2). However, direct calculation shows that if we instead interpret v as a member of (C2)⊗3 then it has tensor rank 2 thanks to the decomposition

v = (1/2)( w ⊗ w ⊗ w + w̄ ⊗ w̄ ⊗ w̄ ),   where w = (i, −1).

(The tensor rank over C is always a lower bound of the tensor rank over R, since every real tensor decomposition is also a complex one, but not vice-versa.)
Situations like this do not arise for the rank of a matrix (and thus the tensor
rank when p = 2), since the singular value decomposition tells us that the

singular values (and thus the rank) of a real matrix do not depend on whether
we consider it as a member of Mm,n (R) or Mm,n (C). Another way to see that
matrix rank does not suffer from this problem is to just notice that allowing
complex arithmetic in Gaussian elimination does not increase the number of
zero rows that we can obtain in a row echelon form of a real matrix.

Remark 3.3.2 Tensors are among the most ubiquitous and actively-researched topics in
We Are Out all of science right now. We have provided a brief introduction to the
of Our Depth topic and its basic motivation, but there are entire textbooks devoted to
exploring properties of tensors and what can be done with them, and we
cannot possibly do the subject justice here. See [KB09] and [Lan12], for
example, for a more thorough treatment.

Exercises solutions to starred exercises on page 479

3.3.1 Compute the tensor rank of each of the following vectors.
∗(a) e1 ⊗ e1 + e2 ⊗ e2 ∈ C2 ⊗ C2
(b) e1 ⊗ e1 ⊗ e1 + e2 ⊗ e2 ⊗ e2 ∈ (C2)⊗3
∗(c) e1 ⊗ e1 ⊗ e1 + e2 ⊗ e2 ⊗ e2 + e3 ⊗ e3 ⊗ e3 ∈ (C3)⊗3

∗∗3.3.2 Show that the vector v ∈ (R2)⊗3 from Equation (3.3.3) has tensor rank 3.
[Hint: Mimic Example 3.3.4. Where does this argument break down if we use complex numbers instead of real numbers?]

3.3.3 Determine which of the following statements are true and which are false.
∗(a) C ⊗ C ≅ C.
(b) If we think of C as a 2-dimensional vector space over R (e.g., with basis {1, i}) then C ⊗ C ≅ C.
∗(c) If v ∈ Fm ⊗ Fn then rank(v) ≤ min{m, n}.
(d) If v ∈ Fm ⊗ Fn ⊗ Fp then rank(v) ≤ min{m, n, p}.
∗(e) The tensor rank of v ∈ Rd1 ⊗ Rd2 ⊗ · · · ⊗ Rdp is at least as large as its tensor rank as a member of Cd1 ⊗ Cd2 ⊗ · · · ⊗ Cdp.

∗∗3.3.4 Show that the linear transformation S : V ⊗ W → X from Definition 3.3.1(d) is necessarily unique.

∗∗3.3.5 Verify the claim of Example 3.3.2 that we can represent M2 ⊗ P2 as the set of functions f : F → M2 of the form f(x) = Ax² + Bx + C for some A, B, C ∈ M2, with elementary tensors defined by (A ⊗ f)(x) = A f(x). That is, verify that the four defining properties of Definition 3.3.1 hold for this particular representation of M2 ⊗ P2.

∗∗3.3.6 Show that if C is the vector space of 1-variable continuous functions then C ⊗ C is not the space of 2-variable continuous functions with elementary tensors of the form f(x)g(y) (in contrast with Example 3.3.1).
[Hint: Consider the function f(x, y) = e^{xy} and use Exercise 1.1.22.]

3.3.7 Suppose that V1, . . ., Vp, and W1, . . ., Wp are vector spaces over the same field, and Tj : Vj → Wj is an isomorphism for each 1 ≤ j ≤ p. Let T1 ⊗ · · · ⊗ Tp : V1 ⊗ · · · ⊗ Vp → W1 ⊗ · · · ⊗ Wp be the linear transformation defined on elementary tensors by (T1 ⊗ · · · ⊗ Tp)(v1 ⊗ · · · ⊗ vp) = T1(v1) ⊗ · · · ⊗ Tp(vp). Show that

rank( (T1 ⊗ · · · ⊗ Tp)(v) ) = rank(v)

for all v ∈ V1 ⊗ · · · ⊗ Vp.

3.3.8 In this exercise, we generalize some of the observations that we made about the vector from Example 3.3.4. Suppose {x1, x2}, {y1, y2}, and {z1, z2} are bases of C2 and let

v = x1 ⊗ y1 ⊗ z2 + x1 ⊗ y2 ⊗ z1 + x2 ⊗ y1 ⊗ z1 ∈ (C2)⊗3.

(a) Show that v has tensor rank 3.
(b) Show that v has border rank 2.

3.3.9 Show that the 3-variable polynomial f(x, y, z) = x + y + z cannot be written in the form f(x, y, z) = p1(x)q1(y)r1(z) + p2(x)q2(y)r2(z) for any single-variable polynomials p1, p2, q1, q2, r1, r2.
[Hint: You can prove this directly, but it might be easier to leech off of some vector that we showed has tensor rank 3.]

∗∗3.3.10 Show that if V and W are vector spaces over the same field then V ⊗ W ≅ W ⊗ V.

3.3.11 Show that if V, W, and X are vector spaces over the same field then the tensor product distributes over the external direct sum (see Section 1.B.3) in the sense that

(V ⊕ W) ⊗ X ≅ (V ⊗ X) ⊕ (W ⊗ X).

∗∗3.3.12 Suppose F = R or F = C. Show that if a non-zero vector v ∈ Fd1 ⊗ · · · ⊗ Fdp has border rank 1 then it also has tensor rank 1.
[Hint: The Bolzano–Weierstrass theorem from analysis might help.]

3.4 Summary and Review

In this chapter, we introduced multilinear transformations, which are functions


that act on tuples of vectors in such a way that, if we keep all except for one
of the vectors in that tuple constant, then it looks like a linear transformation.
Multilinear transformations are an extremely wide class of functions that gen-
eralize and contain as special cases many objects that we have seen earlier in
linear algebra:
• Linear transformations are multilinear transformations with just one
input space;
• Multilinear forms (and thus bilinear forms) are multilinear transformations for which the output vector space is just F (we introduced multilinear forms back in Section 1.3.2); and
• Most “multiplications”, including the dot product, cross product, matrix
multiplication, and the Kronecker product, are bilinear transformations
(i.e., multilinear transformations with 2 input spaces).
We also presented a new way of multiplying two vectors (in Fn ) or matrices
called the Kronecker product. While this product has many useful properties,
the “point” of it is that it lets us represent multilinear transformations via
matrices in much the same way that we represent linear transformations via
matrices. In particular, Theorem 3.2.2 tells us that if T : V1 × · · · × V p → W is
a multilinear transformation acting on finite-dimensional vector spaces with
bases B1, . . ., Bp, and D, respectively (that is, B1, . . ., Bp are bases of V1, . . ., Vp, respectively, and D is a basis of W), then there exists a matrix A (called the standard block matrix of T) such that

[ T(v1, . . . , vp) ]_D = A( [v1]_{B1} ⊗ · · · ⊗ [vp]_{Bp} )   for all v1 ∈ V1, . . . , vp ∈ Vp.

In particular, if p = 1 (i.e., T is just a linear transformation) then this theo-


rem reduces to the usual statement that linear transformations have standard
matrices.
Finally, we finished this chapter by introducing the tensor product, which
generalizes the Kronecker product to arbitrary vector spaces. It can roughly
be thought of as a way of “multiplying” two vector spaces together, much like
we think of the external direct sum from Section 1.B.3 as a way of “adding”
two vector spaces together. Just like the Kronecker product lets us represent
multilinear transformations via matrices (once we have chosen bases of the
given vector spaces), the tensor product lets us represent multilinear transfor-
mations via linear transformations on the tensor product space. In particular, if
T : V1 × · · · × V p → W is a multilinear transformation then there exists a linear
transformation S : V1 ⊗ · · · ⊗ V p → W such that
T(v1, . . . , vp) = S(v1 ⊗ · · · ⊗ vp)   for all v1 ∈ V1, . . . , vp ∈ Vp.

(This is the “universal property” (d) from Definition 3.3.1.)
The advantage of the tensor product in this regard is that we do not have to
choose bases of the vector spaces. In particular, this means that we can even
apply this technique to infinite-dimensional vector spaces.

Exercises solutions to starred exercises on page 481

3.4.1 Determine which of the following statements are true and which are false.
∗(a) If A, B ∈ Mn(C) are normal then so is A ⊗ B.
(b) If A ∈ Mm,n, B ∈ Mn,m then tr(A ⊗ B) = tr(B ⊗ A).
∗(c) If A, B ∈ Mn then det(A ⊗ B) = det(B ⊗ A).
(d) The cross product is a multilinear transformation with type (2, 1).
∗(e) If v, w ∈ R7 then v ⊗ w = w ⊗ v.
(f) If V and W are vector spaces over the same field then V ⊗ W ≅ W ⊗ V.
∗(g) The tensor rank of v ∈ Rd1 ⊗ Rd2 ⊗ · · · ⊗ Rdp equals its tensor rank as a member of Cd1 ⊗ Cd2 ⊗ · · · ⊗ Cdp.

∗3.4.2 Let Wn,n ∈ Mn² be the swap matrix. Find a formula (depending on n) for det(Wn,n).

3.4.3 Let V and W be finite-dimensional inner product spaces.
(a) Show that there is an inner product ⟨·, ·⟩_{V⊗W} on V ⊗ W that satisfies

⟨v1 ⊗ w1, v2 ⊗ w2⟩_{V⊗W} = ⟨v1, v2⟩_V ⟨w1, w2⟩_W

for all v1, v2 ∈ V and w1, w2 ∈ W.
(b) Construct an example to show that not all inner products on V ⊗ W arise from inner products on V and W in the manner described by part (a), even in the simple case when V = W = R2.

3.A Extra Topic: Matrix-Valued Linear Maps

We now provide a more thorough treatment of linear transformations that act


on the vector space of matrices. In a sense, there is nothing special about these
linear transformations—we know from way back in Section 1.2 that we can
represent them by their standard matrix and do all of our usual linear algebra
trickery on them. However, many of their interesting properties are most easily
unearthed by investigating how they interact with the Kronecker product, so it
makes sense to revisit them now.

3.A.1 Representations
We start by precisely defining the types of linear transformations that we are now
going to focus our attention on.

Definition 3.A.1 A matrix-valued linear map is a linear transformation


Matrix-Valued
Linear Map Φ : Mm,n → M p,q .

Since matrix-valued linear maps are linear transformations, we can rep-


resent them by their standard matrices. However, there are also several other
ways of representing them that are often much more useful, so it is worthwhile
to make the details explicit and see how these different representations relate to
each other.
The Standard Matrix
We start by considering the most basic representation that a matrix-valued linear map Φ : Mm,n → Mp,q (or any linear transformation) can have—its standard matrix. (Matrix-valued linear maps are sometimes called superoperators, since they can be thought of as operators (functions) acting on operators (matrices).) In particular, we focus on its standard matrix with respect to the standard basis E of Mm,n and Mp,q, which we recall is denoted simply by [Φ].
If we recall from Theorem 1.2.6 that the standard matrix [Φ] is constructed so that its columns are [Φ(Ei,j)]_E (1 ≤ i ≤ m, 1 ≤ j ≤ n) and that [Φ(Ei,j)]_E = vec(Φ(Ei,j)), we immediately arrive at the following formula for the standard matrix of Φ:

[Φ] = [ vec(Φ(E1,1)) | vec(Φ(E1,2)) | · · · | vec(Φ(Em,n)) ].

If we are interested in standard linear-algebraic properties of Φ like its


eigenvalues, rank, invertibility, or finding a matrix square root of it, this standard
matrix is likely the simplest tool to make use of, since it satisfies vec(Φ(X)) =
[Φ]vec(X) for all X ∈ Mm,n . For example, we showed back in Example 1.2.10
that the standard matrix of the transposition map T : M2 → M2 is

      [ 1 0 0 0 ]
[T] = [ 0 0 1 0 ]
      [ 0 1 0 0 ]
      [ 0 0 0 1 ].

(More generally, the standard matrix of the transposition map T : Mm,n → Mn,m is the swap matrix—this is what property (a) of Definition 3.1.3 says.)
We then used this standard matrix in Example 1.2.17 to find the eigenvalues and
eigenvectors of T , and we used it in Example 1.2.18 to find a square root of it.

Example 3.A.1 Construct the standard matrix of the linear map ΦR : Mn → Mn (called
The Reduction Map the reduction map) defined by ΦR (X) = tr(X)I − X.
Solution:
We first compute
ΦR(Ei,j) = tr(Ei,j)I − Ei,j, which equals I − Ej,j if i = j and −Ei,j otherwise.

If we define e+ = vec(I) = ∑_{i=1}^{n} ei ⊗ ei then vec(ΦR(Ei,j)) equals e+ − ej ⊗ ej if i = j and −ei ⊗ ej otherwise.

It follows that the standard matrix of ΦR is

[ΦR] = [ vec(ΦR(E1,1)) | vec(ΦR(E1,2)) | · · · | vec(ΦR(En,n)) ]
     = [ e+ − e1 ⊗ e1 | −e1 ⊗ e2 | · · · | e+ − en ⊗ en ]
     = e+ e+^T − I.

For example, in the n = 3 case this standard matrix has the form

       [ ·  ·  ·  ·  1  ·  ·  ·  1 ]
       [ · −1  ·  ·  ·  ·  ·  ·  · ]
       [ ·  · −1  ·  ·  ·  ·  ·  · ]
       [ ·  ·  · −1  ·  ·  ·  ·  · ]
[ΦR] = [ 1  ·  ·  ·  ·  ·  ·  ·  1 ]
       [ ·  ·  ·  ·  · −1  ·  ·  · ]
       [ ·  ·  ·  ·  ·  · −1  ·  · ]
       [ ·  ·  ·  ·  ·  ·  · −1  · ]
       [ 1  ·  ·  ·  1  ·  ·  ·  · ].

(As usual, we use dots (·) to denote entries equal to 0.)

From the standard matrix of a matrix-valued linear map Φ, it is straightforward to deduce many of its properties. For example, since e+ e+^T has rank 1 and

‖e+‖ = ‖ ∑_{i=1}^{n} ei ⊗ ei ‖ = √n,

we conclude that if ΦR is the reduction map from Example 3.A.1, then [ΦR] = e+ e+^T − I (and thus ΦR itself) has one eigenvalue equal to n − 1 and the other n − 1 eigenvalues (counted according to multiplicity) equal to −1. (For example, the (unique up to scalar multiplication) eigenvector of ΦR with eigenvalue n − 1 is I: ΦR(I) = nI − I = (n − 1)I.)
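These computations are easy to double-check numerically. The sketch below (Python/NumPy, an illustration of ours rather than anything from the text; NumPy's row-major flatten gives vec(Ei,j) = ei ⊗ ej, matching the convention used here) verifies that [ΦR] = e+e+^T − I really does implement ΦR for n = 3, and lists its eigenvalues.

```python
# Sketch: the standard matrix of the reduction map for n = 3 and its eigenvalues.
import numpy as np

n = 3
e_plus = np.eye(n).flatten()                       # vec(I) = sum_i e_i (x) e_i
std = np.outer(e_plus, e_plus) - np.eye(n * n)     # [Phi_R] = e+ e+^T - I

X = np.random.randn(n, n)
print(np.allclose(std @ X.flatten(),
                  (np.trace(X) * np.eye(n) - X).flatten()))   # True: vec(Phi_R(X)) = [Phi_R] vec(X)
print(np.round(np.linalg.eigvalsh(std), 6))        # one eigenvalue 2 = n-1, the other eight equal -1
```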
The Choi Matrix
There is another matrix representation of a matrix-valued linear map Φ that has
some properties that often make it easier to work with than the standard matrix.
The idea here is that Φ : Mm,n → M p,q (like every linear transformation) is
completely determined by how it acts on a basis of the input space, so it is
completely determined by the mn matrices Φ(E1,1 ), Φ(E1,2 ), . . ., Φ(Em,n ), and
it is often convenient to arrange these matrices into a single large block matrix.

Definition 3.A.2 The Choi matrix of a matrix-valued linear map Φ : Mm,n → M p,q is the
Choi Matrix mp × nq matrix
CΦ = ∑_{i=1}^{m} ∑_{j=1}^{n} Φ(Ei,j) ⊗ Ei,j.

We can think of the Choi matrix as a p × q block matrix whose blocks are
m × n. When partitioned in this way, the (i, j)-entry of each block is determined
by the corresponding entry of Φ(Ei, j ). For example, if Φ : M3 → M3 is such
that
 
            [ a  b  c ]
Φ(E1,1)  =  [ d  e  f ] ,   then
            [ g  h  i ]

       [ a ∗ ∗  b ∗ ∗  c ∗ ∗ ]
       [ ∗ ∗ ∗  ∗ ∗ ∗  ∗ ∗ ∗ ]
       [ ∗ ∗ ∗  ∗ ∗ ∗  ∗ ∗ ∗ ]
       [ d ∗ ∗  e ∗ ∗  f ∗ ∗ ]
CΦ  =  [ ∗ ∗ ∗  ∗ ∗ ∗  ∗ ∗ ∗ ]
       [ ∗ ∗ ∗  ∗ ∗ ∗  ∗ ∗ ∗ ]
       [ g ∗ ∗  h ∗ ∗  i ∗ ∗ ]
       [ ∗ ∗ ∗  ∗ ∗ ∗  ∗ ∗ ∗ ]
       [ ∗ ∗ ∗  ∗ ∗ ∗  ∗ ∗ ∗ ].

(We use asterisks (∗) here to denote entries whose values we do not care about right now.)

Equivalently, Theorem 3.1.8 tells us that CΦ is just the swapped version of the
block matrix whose (i, j)-block equals Φ(Ei, j ):
CΦ = Wm,p ( ∑_{i=1}^{m} ∑_{j=1}^{n} Ei,j ⊗ Φ(Ei,j) ) W_{n,q}^T

   = Wm,p [ Φ(E1,1)  Φ(E1,2)  ···  Φ(E1,n)
            Φ(E2,1)  Φ(E2,2)  ···  Φ(E2,n)
               ⋮         ⋮      ⋱      ⋮
            Φ(Em,1)  Φ(Em,2)  ···  Φ(Em,n) ] W_{n,q}^T.

(Notice that Ei,j ⊗ A is a block matrix with A in its (i,j)-block, whereas A ⊗ Ei,j is a block matrix with the entries of A in the (i,j)-entries of each block.)

Example 3.A.2 Construct the Choi matrix of the linear map ΦR : Mn → Mn defined by
The Choi Matrix of ΦR (X) = tr(X)I − X (i.e., the reduction map from Example 3.A.1).
the Reduction Map

Solution:
As we noted in Example 3.A.1,
ΦR(Ei,j) = tr(Ei,j)I − Ei,j equals I − Ej,j if i = j and −Ei,j otherwise. It follows that the Choi matrix of ΦR is

CΦR = Wn,n [ Φ(E1,1)  Φ(E1,2)  ···  Φ(E1,n)
             Φ(E2,1)  Φ(E2,2)  ···  Φ(E2,n)
                ⋮         ⋮      ⋱      ⋮
             Φ(En,1)  Φ(En,2)  ···  Φ(En,n) ] W_{n,n}^T

    = Wn,n [ I − E1,1    −E1,2    ···    −E1,n
              −E2,1    I − E2,2   ···    −E2,n
                ⋮           ⋮      ⋱        ⋮
              −En,1      −En,2    ···   I − En,n ] W_{n,n}^T

    = Wn,n ( I − e+ e+^T ) W_{n,n}^T = I − e+ e+^T,

where e+ = ∑_{i=1}^{n} ei ⊗ ei as before. (The fact that CΦR = −[ΦR] in this case is just a coincidence, but the fact that CΦR and [ΦR] have the same entries as each other in different positions is not—see Theorem 3.A.3.)
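Choi matrices are straightforward to build numerically straight from Definition 3.A.2. The following sketch (Python/NumPy; the helper choi is our own, not something defined in the text) confirms the computation of this example for n = 3.

```python
# Sketch: the Choi matrix of the reduction map equals I - e+ e+^T (here n = 3).
import numpy as np

def choi(Phi, n):
    """C_Phi = sum_{i,j} Phi(E_ij) (x) E_ij for a linear map Phi on M_n."""
    C = 0
    for i in range(n):
        for j in range(n):
            E = np.zeros((n, n)); E[i, j] = 1
            C = C + np.kron(Phi(E), E)
    return C

n = 3
Phi_R = lambda X: np.trace(X) * np.eye(n) - X
e_plus = np.eye(n).flatten()
print(np.allclose(choi(Phi_R, n), np.eye(n * n) - np.outer(e_plus, e_plus)))   # True
```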
(Definition 3.1.3(c) tells us that the Choi matrix of the transpose map is the swap matrix, just like its standard matrix.) We emphasize that the Choi matrix CΦ of a linear map Φ does not act as a linear transformation in the same way that Φ does. That is, it is not the case that vec(Φ(X)) = CΦ vec(X); the Choi matrix CΦ behaves as a linear transformation very differently than Φ does. However, CΦ does make it easier
to identify some important properties of Φ. For example, we say that Φ : Mn →
Mm is transpose-preserving if Φ(X T ) = Φ(X)T for all X ∈ Mn —a property
that is encoded very naturally in CΦ :

Theorem 3.A.1 A linear map Φ : Mn → Mm is transpose-preserving if and only if CΦ is


Transpose-Preserving symmetric.
Linear Maps

Proof. Since Φ is linear, its behavior is determined entirely by how it acts on


the standard basis matrices. In particular, it is transpose-preserving if and only
if Φ(E_{i,j}^T) = Φ(E_{i,j})^T for all 1 ≤ i, j ≤ n. On the other hand, CΦ is symmetric if and only if

∑_{i,j=1}^{n} Φ(Ei,j) ⊗ Ei,j = ( ∑_{i,j=1}^{n} Φ(Ei,j) ⊗ Ei,j )^T = ∑_{i,j=1}^{n} Φ(Ei,j)^T ⊗ Ej,i = ∑_{i,j=1}^{n} Φ(Ej,i)^T ⊗ Ei,j

(in the final equality here, we just swap the names of i and j). Since E_{i,j}^T = E_{j,i}, these two conditions are equivalent to each other. ∎
For example, the reduction map from Example 3.A.2 is transpose-preserving—
a fact that can be verified in a straightforward manner from its definition or by
noticing that its Choi matrix is symmetric.

In the case when the ground field is F = C and we focus on the conjugate
transpose (i.e., adjoint), things work out even more cleanly. We say that Φ :
Mn → Mm is adjoint preserving if Φ(X ∗ ) = Φ(X)∗ for all X ∈ Mn , and
we say that it is Hermiticity-preserving if Φ(X) is Hermitian whenever X ∈
Mn (C) is Hermitian. The following result says that these two families of maps
coincide with each other when F = C and can each be easily identified from
their Choi matrices.

Theorem 3.A.2 Suppose Φ : Mn (C) → Mm (C) is a linear map. The following are equiv-
Hermiticity-Preserving alent:
Linear Maps a) Φ is Hermiticity-preserving,
b) Φ is adjoint-preserving, and
c) CΦ is Hermitian.

Proof. The equivalence of properties (b) and (c) follows from the same argu-
ment as in the proof of Theorem 3.A.1, just with a complex conjugation thrown
on top of the transposes. We thus focus on the equivalence of properties (a)
and (b).
To see that (b) implies (a) we notice that if Φ is adjoint-preserving and
X is Hermitian then Φ(X)∗ = Φ(X ∗ ) = Φ(X), so Φ(X) is Hermitian too. For
the converse, notice that we can write every matrix X ∈ Mn (C) as a linear
combination of Hermitian matrices:
X = (1/2)(X + X∗) + (1/(2i))(iX − iX∗),

where both X + X∗ and iX − iX∗ are Hermitian. (It is worth comparing this with the Cartesian decomposition of Remark 1.B.1.)
It follows that if Φ is Hermiticity-preserving then Φ(X + X ∗ ) and Φ(iX −
iX ∗ ) are each Hermitian, so
Φ(X)∗ = Φ( (1/2)(X + X∗) + (1/(2i))(iX − iX∗) )∗
      = ( (1/2)Φ(X + X∗) + (1/(2i))Φ(iX − iX∗) )∗
      = (1/2)Φ(X + X∗) − (1/(2i))Φ(iX − iX∗)
      = Φ( (1/2)(X + X∗) − (1/(2i))(iX − iX∗) )
      = Φ(X∗)

for all X ∈ Mn (C), so Φ is adjoint-preserving too. 


It is worth emphasizing that the equivalence of conditions (a) and (b) in the above result really is specific to the field C. While it is possible to define a matrix-valued linear map on Mn(F) (regardless of F) to be symmetry-preserving if Φ(X) is symmetric whenever X is symmetric, such a map may not be transpose-preserving (see Exercise 3.A.12). The reason for this difference in behavior is that the set of symmetric matrices M^S_n is a proper subspace of Mn, so specifying how a matrix-valued linear map acts on symmetric matrices does not specify how it acts on all of Mn. In contrast, the set of Hermitian matrices M^H_n spans all of Mn(C), so Hermiticity-preservation of Φ restricts how it acts on all of Mn(C). (Just like standard matrices, Choi matrices are isomorphic to matrix-valued linear maps—that is, the linear transformation that sends Φ to CΦ is invertible.)

It turns out that symmetry-preserving maps do not actually come up in


practice much, nor do they have many nice mathematical properties, so we
instead focus on yet another closely-related family of maps, called bisym-
metric maps. These are the matrix-valued linear maps Φ with the property
that Φ = T ◦ Φ = Φ ◦ T, where T denotes the transpose map (whereas transpose-preserving maps just require T ◦ Φ = Φ ◦ T). The output of such a map is always symmetric, and it only depends on the symmetric part of the input, so we can think of a bisymmetric map Φ as sending M^S_n to M^S_m (much like we can think of Hermiticity-preserving maps as sending M^H_n to M^H_m).
These different families of maps can be thought of as preserving the transpose and/or symmetry in slightly different ways, and they are summarized in Table 3.1 for ease of reference.

                        Transpose-Preserving        Bisymmetric
 Characterizations:     Φ ◦ T = T ◦ Φ               Φ = Φ ◦ T = T ◦ Φ
                        CΦ = CΦ^T                   CΦ = CΦ^T = Γ(CΦ)

Table 3.1: A comparison of transpose-preserving maps and bisymmetric maps. The characterizations in terms of Choi matrices come from Theorem 3.A.1 and Exercise 3.A.14. (Γ refers to the partial transpose, which we introduce in Section 3.A.2.)

Operator-Sum Representations
The third and final representation of matrix-valued linear maps that we will use
is one that is a bit more “direct”—it does not aim to represent Φ : Mm,n → M p,q
as a matrix, but rather provides a formula to clarify how Φ acts on matrices.

Definition 3.A.3 An operator-sum representation of a linear map Φ : Mm,n → M p,q is a


Operator-Sum formula of the form
Representation
Φ(X) = ∑_i Ai X Bi^T   for all X ∈ Mm,n,

where {Ai} ⊆ Mp,m and {Bi} ⊆ Mq,n are fixed families of matrices. (If the input and output spaces are square, i.e., m = n and p = q, then the Ai and Bi matrices have the same size.)

It is perhaps not immediately obvious that every matrix-valued linear map even has an operator-sum representation, so before delving into any examples, we present a theorem that tells us how to convert a standard matrix or Choi matrix into an operator-sum representation (and vice-versa).

Theorem 3.A.3 Suppose Φ : Mm,n → M p,q is a matrix-valued linear map. The following
Converting Between are equivalent:
Representations a) [Φ] = ∑_i Ai ⊗ Bi.
b) CΦ = ∑_i vec(Ai) vec(Bi)^T.
c) Φ(X) = ∑_i Ai X Bi^T for all X ∈ Mm,n.

Proof. The equivalence of the representations (a) and (c) follows immediately
from Theorem 3.1.7, which tells us that
vec( ∑_i Ai X Bi^T ) = ( ∑_i Ai ⊗ Bi ) vec(X)

for all X ∈ Mm,n. Since it is similarly the case that vec(Φ(X)) = [Φ]vec(X) for all X ∈ Mm,n, we conclude that Φ(X) = ∑_i Ai X Bi^T for all X ∈ Mm,n if and only if [Φ] = ∑_i Ai ⊗ Bi (here we use the fact that standard matrices are unique).
To see that an operator-sum representation (c) corresponds to a Choi matrix of the form (b), we first write each Ai and Bi in terms of their columns: Ai = [ ai,1 | ai,2 | · · · | ai,m ] and Bi = [ bi,1 | bi,2 | · · · | bi,n ], so that vec(Ai) = ∑_{j=1}^{m} ai,j ⊗ ej and vec(Bi) = ∑_{j=1}^{n} bi,j ⊗ ej. Then

CΦ = ∑_{j=1}^{m} ∑_{k=1}^{n} Φ(Ej,k) ⊗ Ej,k
   = ∑_i ∑_{j=1}^{m} ∑_{k=1}^{n} ( Ai Ej,k Bi^T ) ⊗ Ej,k
   = ∑_i ∑_{j=1}^{m} ∑_{k=1}^{n} ( ai,j bi,k^T ) ⊗ ( ej ek^T )
   = ∑_i ( ∑_{j=1}^{m} ai,j ⊗ ej ) ( ∑_{k=1}^{n} bi,k ⊗ ek )^T
   = ∑_i vec(Ai) vec(Bi)^T,

as claimed. Furthermore, each of these steps can be reversed to see that (b)
implies (c) as well. 
In particular, part (b) of this theorem tells us that we can convert any
rank-one sum decomposition of the Choi matrix of Φ into an operator-sum
representation. Since every matrix has a rank-one sum decomposition, every
matrix-valued linear map has an operator-sum representation as well. In fact,
by leeching directly off of the many things that we know about rank-one sum
decompositions, we immediately arrive at the following corollary:

Corollary 3.A.4 Every matrix-valued linear map Φ : Mm,n (F) → M p,q (F) has an operator-
The Size of Operator- sum representation of the form
Sum Decompositions
Φ(X) = ∑_{i=1}^{rank(CΦ)} Ai X Bi^T   for all X ∈ Mm,n,

with both sets {Ai} and {Bi} linearly independent. Furthermore, if F = R or F = C then the sets {Ai} and {Bi} can be chosen to be mutually orthogonal with respect to the Frobenius inner product.

(In particular, rank(CΦ) ≤ min{mp, nq} is the minimum number of terms in any operator-sum representation of Φ. This quantity is sometimes called the Choi rank of Φ.)
the Choi rank of Φ.
rank(CΦ )
CΦ = ∑ vi wTi ,
i=1

where {vi }ri=1 ⊂ Fmp and {wi }ri=1 ⊂ Fnq are linearly independent sets of col-
umn vectors. Theorem 3.A.3 then gives us the desired operator-sum repre-
sentation of Φ(X) by choosing Ai = mat(vi ) and Bi = mat(wi ) for all i. To
see that the orthogonality claim holds when F = R or F = C, we instead use
the orthogonal rank-one sum decomposition provided by the singular value
decomposition (Theorem 2.3.3). 

Example 3.A.3 Construct an operator-sum representation of the transposition map T :


Operator-Sum Mm,n → Mn,m .
Representation of the
Solution:
Transpose Map
Recall that the Choi matrix of T is the swap matrix, which satisfies
CT (ei ⊗ e j ) = e j ⊗ ei for all 1 ≤ i ≤ m, 1 ≤ j ≤ n. It follows that CT has
the rank-one sum decomposition
CT = ∑_{i=1}^{m} ∑_{j=1}^{n} (ej ⊗ ei)(ei ⊗ ej)^T.

(This rank-one sum decomposition is just another way of saying that the column of CT corresponding to the basis vector ei ⊗ ej is ej ⊗ ei.) Since ei ⊗ ej = vec(Ei,j), it follows from Theorem 3.A.3 that one operator-sum representation of T is

X^T = T(X) = ∑_{i=1}^{m} ∑_{j=1}^{n} Ej,i X Ei,j^T = ∑_{i=1}^{m} ∑_{j=1}^{n} Ej,i X Ej,i   for all X ∈ Mm,n.
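This operator-sum representation is also easy to confirm numerically; the sketch below (Python/NumPy, with m = 2 and n = 3 chosen only for illustration) just sums the matrices Ej,i X Ej,i and compares the result with X^T.

```python
# Sketch: sum_{i,j} E_{j,i} X E_{j,i} = X^T for the transpose map (m = 2, n = 3).
import numpy as np

m, n = 2, 3
X = np.random.randn(m, n)

out = np.zeros((n, m))
for i in range(m):
    for j in range(n):
        E_ji = np.zeros((n, m)); E_ji[j, i] = 1     # the n x m matrix E_{j,i}
        out += E_ji @ X @ E_ji
print(np.allclose(out, X.T))                        # True
```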

We close this subsection by briefly noting how the various representations


of the adjoint of a matrix-valued linear map Φ : Mm,n → Mp,q (i.e., the linear map Φ∗ : Mp,q → Mm,n defined by ⟨Φ(A), B⟩ = ⟨A, Φ∗(B)⟩ for all A ∈ Mm,n and B ∈ Mp,q, where the inner product is the standard Frobenius inner product) are related to the corresponding representations of the original map. (Adjoints were introduced in Section 1.4.2.)

Corollary 3.A.5 Suppose F = R or F = C. The adjoint of a linear map Φ : Mm,n (F) →


The Adjoint of a M p,q (F) has the following representations:
 
Matrix-Valued a) [Φ∗] = [Φ]∗.
Linear Map b) CΦ∗ = Wp,m C̄Φ W_{q,n}^T.
c) If Φ(X) = ∑_i Ai X Bi^T then Φ∗(Y) = ∑_i Ai∗ Y B̄i.

Proof. Property (a) is just the special case of Theorem 1.4.8 that arises when working with the standard bases of Mm,n(F) and Mp,q(F), which are orthonormal. Property (c) then follows almost immediately from Theorem 3.A.3: if Φ(X) = ∑_i Ai X Bi^T then

[Φ∗] = [Φ]∗ = ( ∑_i Ai ⊗ Bi )∗ = ∑_i Ai∗ ⊗ Bi∗,

so Φ∗(Y) = ∑_i Ai∗ Y (Bi∗)^T = ∑_i Ai∗ Y B̄i, as claimed. (The overline in this theorem just means complex conjugation, which has no effect if F = R.)
Property (b) follows similarly by noticing that if CΦ = ∑_i vec(Ai) vec(Bi)^T then

CΦ∗ = ∑_i vec(Ai∗) vec(Bi∗)^T = ∑_i Wp,m \overline{vec(Ai)} \overline{vec(Bi)}^T W_{q,n}^T = Wp,m C̄Φ W_{q,n}^T,
i i

which completes the proof. 


In particular, if Φ maps between square matrix spaces (i.e., m = n and p = q)
then part (b) of the above corollary tells us that CΦ and CΦ∗ are unitarily similar,
so equalities involving similarity-invariant quantities like det(CΦ∗ ) = det(CΦ )
follow almost immediately.

3.A.2 The Kronecker Product of Matrix-Valued Maps


So far, we have defined the Kronecker product on vectors (i.e., members of
Fn ) and linear transformations on Fn (i.e., members of Mm,n (F)). There is a
natural way to ramp this definition up to matrix-valued linear maps acting on
Mm,n (F) as well:

Definition 3.A.4 Suppose Φ and Ψ are matrix-valued linear maps acting on Mm,n and
Kronecker Product of M p,q , respectively. Then Φ ⊗ Ψ is the matrix-valued linear map defined
Matrix-Valued Linear by
Maps
(Φ ⊗ Ψ)(A ⊗ B) = Φ(A) ⊗ Ψ(B) for all A ∈ Mm,n , B ∈ M p,q .

To apply this map Φ ⊗ Ψ to a matrix C ∈ Mmp,nq that is not of the form


A ⊗ B, we just use linearity and the fact that every such matrix can be written
in the form
C = ∑_i Ai ⊗ Bi,   so   (Φ ⊗ Ψ)(C) = ∑_i Φ(Ai) ⊗ Ψ(Bi).

(To see that C can be written in this form, we can either use Theorem 3.A.3 or recall that {Ei,j ⊗ Ek,ℓ} is a basis of Mmp,nq.) It is then perhaps not completely clear why this linear map Φ ⊗ Ψ is well-defined—we can write C as a sum of Kronecker products in numerous ways,
and why should different decompositions give the same value of (Φ ⊗ Ψ)(C)?
It turns out that this is not actually a problem, which can be seen either by
using the universal property of the tensor product (Definition 3.3.1(d)) or by
considering how Φ ⊗ Ψ acts on standard basis matrices (see Exercise 3.A.16).
In fact, this is completely analogous to the fact that the matrix A ⊗ B is the
unique one for which (A ⊗ B)(v ⊗ w) = (Av) ⊗ (Bw) for all (column) vectors
v and w.
In the special case when Φ = Im,n is the identity map on Mm,n (which we
denote by Im if m = n or simply I if the dimensions are clear from context or
unimportant), the map I ⊗ Ψ just acts independently on each block of a p × q
block matrix:

        [ A1,1  A1,2  ···  A1,q ]     [ Ψ(A1,1)  Ψ(A1,2)  ···  Ψ(A1,q) ]
        [ A2,1  A2,2  ···  A2,q ]     [ Ψ(A2,1)  Ψ(A2,2)  ···  Ψ(A2,q) ]
(I ⊗ Ψ) [   ⋮     ⋮    ⋱     ⋮  ]  =  [    ⋮         ⋮      ⋱      ⋮    ] .
        [ Ap,1  Ap,2  ···  Ap,q ]     [ Ψ(Ap,1)  Ψ(Ap,2)  ···  Ψ(Ap,q) ]

The map Φ ⊗ I is perhaps a bit more difficult to think of in terms of block


matrices, but it works by having Φ act independently on the matrix made up
of the (1, 1)-entries of each block, the matrix made up of the (1, 2)-entries of each
block, and so on.
Two particularly important maps of this type are the partial transpose
maps T ⊗ I and I ⊗ T , which act on block matrices by either transposing the
block structure or transposing the blocks themselves, respectively (where as
3.A Extra Topic: Matrix-Valued Linear Maps 367

the usual “full” transpose T transposes both):


   
A1,1 A1,2 ··· A1,q A1,1 A2,1 ··· A p,1
A A2,2 A2,q   A A p,2 
 2,1 ···   1,2 A2,2 ··· 
(T ⊗ I)  
 .. .. ..
 
..  =  .. .. ..

..  and
 . . . .   . . . . 
A p,1 A p,2 ··· A p,q A1,q A2,q ··· A p,q
 
  AT AT1,2 ··· AT1,q
A1,1 A1,2 ··· A1,q  1,1 
A A2,2 A2,q    T T AT2,q 
 2,1 ···  A2,1 A2,2 ··· 
(I ⊗ T ) 
 .. .. ..
 = 
..   .
.
. .. .. .. 
 . . . .   . . 
 . . 
A p,1 A p,2 ··· A p,q T T
A p,1 A p,2 ··· ATp,q

def def
The notations Γ and For brevity, we often denote these maps by Γ = T ⊗I and Γ = I ⊗T , respectively,
Γ were chosen to
and we think of each of them as one “half” of the transpose of a block matrix—
each look like half of
a T. after all, Γ ◦ Γ = T .
We can similarly define the partial trace maps as follows:

tr1 = tr ⊗ In : Mmn → Mn   and   tr2 = Im ⊗ tr : Mmn → Mm.

(Alternatively, we could denote the partial transposes by T1 = T ⊗ I and T2 = I ⊗ T, just like we use tr1 and tr2 to denote the partial traces. However, we find the Γ and Γ̄ notation too cute to pass up.) Explicitly, these maps act on block matrices via

    [ A1,1  A1,2  ···  A1,m ]
    [ A2,1  A2,2  ···  A2,m ]
tr1 [   ⋮     ⋮    ⋱     ⋮  ]  =  A1,1 + A2,2 + · · · + Am,m,     and
    [ Am,1  Am,2  ···  Am,m ]

    [ A1,1  A1,2  ···  A1,m ]     [ tr(A1,1)  tr(A1,2)  ···  tr(A1,m) ]
    [ A2,1  A2,2  ···  A2,m ]     [ tr(A2,1)  tr(A2,2)  ···  tr(A2,m) ]
tr2 [   ⋮     ⋮    ⋱     ⋮  ]  =  [     ⋮         ⋮      ⋱       ⋮    ] .
    [ Am,1  Am,2  ···  Am,m ]     [ tr(Am,1)  tr(Am,2)  ···  tr(Am,m) ]

 
Example 3.A.4        Compute each of the following matrices if
Partial Transposes,
Traces, and Maps              [  1   2   0  −1 ]
                         A =  [  2  −1   3   2 ] .
                              [ −3   2   2   1 ]
                              [  1   0   2   3 ]

a) Γ̄(A),
b) tr1(A), and
c) (I ⊗ Φ)(A), where Φ(X) = tr(X)I − X is the map from Example 3.A.1.
Solutions:
a) To compute Γ̄(A) = (I ⊗ T)(A), we just transpose each block of A:

           [  1   2   0   3 ]
   Γ̄(A) =  [  2  −1  −1   2 ] .
           [ −3   1   2   2 ]
           [  2   0   1   3 ]

b) tr1(A) is just the sum of the diagonal blocks of A:

   tr1(A) = [ 1   2 ]  +  [ 2  1 ]  =  [ 3  3 ] .
            [ 2  −1 ]     [ 2  3 ]     [ 4  2 ]

c) We just apply the map Φ to each block of A independently, which gives

                  [ −1  −2   2   1 ]
   (I ⊗ Φ)(A)  =  [ −2   1  −3   0 ] .
                  [  0  −2   3  −1 ]
                  [ −1  −3  −2   2 ]
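All of these operations are simple index manipulations once A is viewed as a 4-index array, so they are easy to compute numerically. The sketch below (Python/NumPy; the reshapes assume a matrix in Mm ⊗ Mn is stored as an m × m block matrix with n × n blocks, and the helper is ours) reproduces parts (a) and (b) of this example.

```python
# Sketch: partial transposes and partial traces via reshaping (here m = n = 2).
import numpy as np

def partial_ops(A, m, n):
    T = A.reshape(m, n, m, n)                               # T[i,k,j,l] = A[(i,k),(j,l)]
    gamma     = T.transpose(2, 1, 0, 3).reshape(m*n, m*n)   # (T (x) I)(A): transpose the block structure
    gamma_bar = T.transpose(0, 3, 2, 1).reshape(m*n, m*n)   # (I (x) T)(A): transpose each block
    tr1 = np.einsum('ikil->kl', T)                          # sum of the diagonal blocks
    tr2 = np.einsum('ikjk->ij', T)                          # entrywise traces of the blocks
    return gamma, gamma_bar, tr1, tr2

A = np.array([[ 1,  2,  0, -1],
              [ 2, -1,  3,  2],
              [-3,  2,  2,  1],
              [ 1,  0,  2,  3]], dtype=float)
gamma, gamma_bar, tr1, tr2 = partial_ops(A, 2, 2)
print(gamma_bar)    # matches part (a)
print(tr1)          # matches part (b): [[3, 3], [4, 2]]
```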

Finally, it is worthwhile to notice that if e+ = vec(I) = ∑_i ei ⊗ ei, then

e+ e+^T = ( ∑_i ei ⊗ ei )( ∑_j ej ⊗ ej )^T = ∑_{i,j} (ei ej^T) ⊗ (ei ej^T) = ∑_{i,j} Ei,j ⊗ Ei,j.

(This special vector e+ was originally introduced in Example 3.A.1.) In particular, this means that the Choi matrix of a matrix-valued linear map Φ can be constructed by applying Φ to one half of this special rank-1 matrix e+ e+^T:

(Φ ⊗ I)(e+ e+^T) = ∑_{i,j} Φ(Ei,j) ⊗ Ei,j = CΦ.

3.A.3 Positive and Completely Positive Maps


One of the most interesting things that we can do with a matrix-valued linear
map Φ that we cannot do with an arbitrary linear transformation is ask what
sorts of matrix properties it preserves. For example, we saw in Theorem 3.A.2
that Φ : Mn (C) → Mm (C) is Hermiticity-preserving (i.e., Φ(X) is Hermitian
whenever X is Hermitian) if and only if CΦ is Hermitian.
The problem of characterizing which matrix-valued linear maps preserve a
given matrix property is called a linear preserver problem, and the transpose map T : Mm,n → Mn,m plays a particularly important role in this setting. (Isometries—see Section 1.D.3—are linear preservers that preserve a norm; see Exercise 3.A.5 for an example.) After all, it is straightforward to verify that A^T and A have the same rank and determinant as each other, as well as the same eigenvalues and the same singular values. Furthermore, A^T is normal, unitary, Hermitian, and/or positive
an example. singular values. Furthermore, AT is normal, unitary, Hermitian, and/or positive
semidefinite if and only if A itself has the same property, so we say that the
transpose map is rank-preserving, eigenvalue-preserving, normality-preserving,
and so on.
On the other hand, determining which maps other than the transpose (if
any) preserve these matrix properties is often quite difficult.

Remark 3.A.1 Many standard linear preserver problems have answers that are very similar
Linear Preserver to each other. For example, the linear maps Φ : Mn → Mn that preserve
Problems the determinant (i.e., the maps Φ that satisfy det(Φ(X)) = det(X) for all
X ∈ Mn) are exactly those of the form

Φ(X) = A X B      for all X ∈ Mn,   or
Φ(X) = A X^T B    for all X ∈ Mn,                                    (‡)

where A, B ∈ Mn have det(AB) = 1. (It is straightforward to see that every map of this form preserves the determinant; the hard part is showing that there are no other determinant-preserving linear maps.) Various other types of linear pre-
• Rank-preserving maps have the form (‡) with A and B invertible.
• Singular value preservers have the form (‡) with A and B unitary.
• Eigenvalue preservers have the form (‡) with B = A−1 .
Proving these results is outside of the scope of this book (see survey papers
like [LT92, LP01] instead), but it is good to be familiar with them so that
we see just how special the transpose map is in this setting.

The linear preserver problem that we are most interested in is preservation


of positive semidefiniteness.

Definition 3.A.5 Suppose F = R or F = C. An adjoint-preserving linear map Φ : Mn (F) →


Positive Maps Mm (F) is called positive if Φ(X) is positive semidefinite whenever X is
positive semidefinite.

We note that if F = C then we can omit adjoint-preservation from the above definition, since any map that sends positive semidefinite matrices to positive semidefinite matrices is necessarily adjoint-preserving (see Exercise 3.A.13). However, if F = R then the adjoint- (i.e., transpose-)preservation property is required to make this family of maps have “nice” properties (e.g., we want the Choi matrix of every positive map to be self-adjoint). (This is analogous to the definition of PSD matrices, Definition 2.2.1: we don’t need A to be self-adjoint for the property v∗Av ≥ 0 to make sense—we add it in so that PSD matrices have nice properties.)
Since the transposition map preserves eigenvalues, it necessarily preserves positive semidefiniteness as well, so it is positive (and it is possibly the most important example of such a map). The following example presents another positive linear map.
properties.
positive linear map.

Example 3.A.5 Show that the linear map ΦR : Mn → Mn defined by ΦR (X) = tr(X)I − X
The Reduction Map (i.e., the reduction map from Example 3.A.1) is positive.
is Positive
Solution:
Just notice that if the eigenvalues of X are λ1 , . . ., λn then the eigenval-
ues of ΦR (X) = tr(X)I − X are tr(X) − λn , . . ., tr(X) − λ1 . Since tr(X) =
λ1 + · · · + λn , it follows that the eigenvalues of ΦR (X) are non-negative
(since they are each the sum of n − 1 of the eigenvalues of X), so it is
positive semidefinite whenever X is positive semidefinite. It follows that
ΦR is positive.

We of course would like to be able to characterize positive maps in some


way (e.g., via some easily-testable condition on their Choi matrix or operator-
sum representations), but doing so is extremely difficult. Since we are ill-
equipped to make any substantial progress on this problem at this point, we first
introduce another notation of positivity of a matrix-valued linear map that has
somewhat simpler mathematical properties, which will help us in turn better
understand positive maps.

Definition 3.A.6 A matrix-valued linear map Φ : Mn → Mm is called


k-Positive and a) k-positive if Φ ⊗ Ik is positive, and
Completely Positive
b) completely positive (CP) if Φ is k-positive for all k ≥ 1.
Maps

In particular, 1-positive linear maps are exactly the positive maps themselves, and if a map is (k + 1)-positive then it is necessarily k-positive as well (see Exercise 3.A.18). (It is worth noting that Φ ⊗ Ik is positive if and only if Ik ⊗ Φ is positive.) We now look at some examples of linear maps that are and are not k-positive.

Example 3.A.6 Show that the transposition map T : Mn → Mn (n ≥ 2) is not 2-positive.


The Transposition
Solution:
Map is Not 2-Positive
To see that the map T ⊗ I2 is not positive, let’s consider how it acts in the n = 2 case on e+ e+^T, which is positive semidefinite:

                               [ 1 0 0 1 ]     [ 1 0 0 0 ]
  (T ⊗ I2)(e+ e+^T) = (T ⊗ I2) [ 0 0 0 0 ]  =  [ 0 0 1 0 ] .
                               [ 0 0 0 0 ]     [ 0 1 0 0 ]
                               [ 1 0 0 1 ]     [ 0 0 0 1 ]

The matrix on the right has eigenvalues 1 (with multiplicity 3) and −1 (with multiplicity 1), so it is not positive semidefinite. It follows that T ⊗ I2 is not positive, so T : M2 → M2 is not 2-positive. (Recall that the transposition map is 1-positive.)
To see that T is not 2-positive when n > 2, we can just embed this
same example into higher dimensions by padding the input matrix e+ eT+
with extra rows and columns of zeros.
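Numerically, the same conclusion falls out of a one-line eigenvalue computation; in the sketch below (Python/NumPy, n = 2) the partial transpose is implemented with the same reshaping trick used in the earlier sketch.

```python
# Sketch: (T (x) I_2)(e+ e+^T) has a negative eigenvalue, so T is not 2-positive.
import numpy as np

n = 2
e_plus = np.eye(n).flatten()                   # vec(I)
P = np.outer(e_plus, e_plus)                   # e+ e+^T, positive semidefinite

Q = P.reshape(n, n, n, n).transpose(2, 1, 0, 3).reshape(n*n, n*n)   # (T (x) I)(P)
print(np.linalg.eigvalsh(Q))                   # [-1.  1.  1.  1.]
```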

One useful technique that sometimes makes working with positive maps
simpler is the observation that a linear map Φ : Mn → Mm is positive if and
only if Φ(vv∗ ) is positive semidefinite for all v ∈ Fn . The reason for this is
the spectral decomposition (either of Theorems 2.1.4, if F = C, or 2.1.6, if F = R), which says that every positive semidefinite matrix X ∈ Mn can be written in the form

X = ∑_{j=1}^{n} vj vj∗   for some {vj} ⊂ Fn.

In particular, if Φ(vv∗) is PSD for all v ∈ Fn then Φ(X) = ∑_j Φ(vj vj∗) is PSD whenever X is PSD, so Φ is positive.
The following example makes use of this technique to show that the set of
k-positive linear maps is not just contained within the set of (k + 1)-positive
maps, but in fact this inclusion is strict as long as 1 ≤ k < n.

Example 3.A.7 Suppose k ≥ 1 is an integer and consider the linear map Φ : Mn → Mn


A k-Positive defined by Φ(X) = ktr(X)I − X.
Linear Map a) Show that Φ is k-positive, and
b) show that if k < n then Φ is not (k + 1)-positive.
(When k = 1, this is exactly the reduction map that we already showed is positive in Example 3.A.5; part (b) here shows that it is not 2-positive.)
Solutions:
a) As noted earlier, it suffices to just show that (Φ ⊗ Ik)(vv∗) is positive semidefinite for all v ∈ Fn ⊗ Fk. We can use the Schmidt decomposition (Exercise 3.1.11) to write


k
v = ∑ wi ⊗ x i ,
i=1

where {wi } ⊂ Fn and {xi } ⊂ Fk are mutually orthogonal sets (and


by absorbing a scalar from wi to xi , we can furthermore assume
that kwi k = 1 for all i). Then we just observe that we can write
(Φ ⊗ Ik )(vv∗ ) as a sum of positive semidefinite matrices in the
following way:

k 
Verifying the second
equality here is ugly,
(Φ ⊗ Ik )(vv∗ ) = ∑ k(w∗j wi )I − wi w∗j ⊗ xi x∗j
i, j=1
but routine—we do
not need to be k k  ∗
clever. =∑ ∑ wi ⊗ xi − w j ⊗ x j wi ⊗ xi − w j ⊗ x j
i=1 j=i+1
Since kwi k = 1, the
k
term I − wi w∗i is
positive semidefinite. + k ∑ (I − wi w∗i ) ⊗ xi x∗i .
i=1

b) To see that Φ is not (k + 1)-positive (when k < n), we consider


how Φ ⊗ Ik+1 acts on the positive semidefinite matrix e+ e+∗, where e+ = ∑_{i=1}^{k+1} ei ⊗ ei ∈ Fn ⊗ F^{k+1}. It is straightforward to check that

(Φ ⊗ Ik+1)(e+ e+∗) = k(In ⊗ Ik+1) − e+ e+∗,

which has −1 as an eigenvalue (with corresponding eigenvector e+ )


and is thus not positive semidefinite. It follows that Φ ⊗ Ik+1 is not
positive, so Φ is not (k + 1)-positive.
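Part (b) is also easy to see numerically. The sketch below (Python/NumPy, with n = 3 and k = 1, so that Φ is the reduction map; the helper is ours and applies Φ to the first tensor factor slice-by-slice) applies Φ ⊗ Ik+1 to e+e+∗ and finds the eigenvalue −1 computed above.

```python
# Sketch: Phi(X) = k tr(X) I - X fails to be (k+1)-positive when k < n (n = 3, k = 1).
import numpy as np

n, k = 3, 1

def apply_phi_tensor_id(C, n, kk, k_param):
    """(Phi (x) I_kk)(C) where Phi(X) = k_param*tr(X)*I - X acts on the first factor."""
    T = C.reshape(n, kk, n, kk)
    out = np.zeros_like(T)
    for a in range(kk):
        for b in range(kk):
            M = T[:, a, :, b]                              # first-factor slice
            out[:, a, :, b] = k_param * np.trace(M) * np.eye(n) - M
    return out.reshape(n * kk, n * kk)

kk = k + 1
e_plus = sum(np.kron(np.eye(n)[:, i], np.eye(kk)[:, i]) for i in range(kk))
P = np.outer(e_plus, e_plus)                               # PSD input
print(np.min(np.linalg.eigvalsh(apply_phi_tensor_id(P, n, kk, k))))   # -1.0
```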

The above example raises a natural question—if the sets of k-positive linear
maps are strict subsets of each other when 1 ≤ k ≤ n, what about when k > n?
For example, is there a matrix-valued linear map that is n-positive but not (n + 1)-
positive? The following theorem shows that no, this is not possible—any map
that is n-positive is actually necessarily completely positive (i.e., k-positive for
all k ≥ 1). Furthermore, there is a simple characterization of these maps in terms
of either their Choi matrices or their operator-sum representations.

Theorem 3.A.6 Suppose Φ : Mn (F) → Mm (F) is a linear map. The following are equiva-
Characterization of lent:
Completely Positive a) Φ is completely positive.
Maps b) Φ is min{m, n}-positive.
c) CΦ is positive semidefinite.
d) There exist matrices {Ai} ⊂ Mm,n(F) such that Φ(X) = ∑_i Ai X Ai∗.

Proof. We prove this theorem via the cycle of implications (a) ⇒ (b) ⇒ (c) ⇒ (d) ⇒ (a). The fact that (a) ⇒ (b) is immediate from the relevant definitions, so we start by showing that (b) ⇒ (c). (This result is sometimes called Choi’s theorem.)
To this end, first notice that if n ≤ m then Φ being n-positive tells us that (Φ ⊗ In)(e+ e+∗) is positive semidefinite, where e+ = ∑_{i=1}^{n} ei ⊗ ei as before. However, we noted earlier that (Φ ⊗ In)(e+ e+∗) = CΦ, so CΦ is positive semidefinite. On the other hand, if

n > m then Φ being m-positive tells us that (Φ ⊗ Im )(X) is positive semidefinite


whenever X ∈ Mn ⊗ Mm is positive semidefinite. It follows that

e+∗ (Φ ⊗ Im)(X) e+ ≥ 0,   so   tr( X (Φ∗ ⊗ Im)(e+ e+∗) ) ≥ 0

for all positive semidefinite X (here we are using the fact that v∗Av = tr(Avv∗)). By Exercise 2.2.20, it follows that (Φ∗ ⊗ Im)(e+ e+∗) = CΦ∗ is positive semidefinite, so Corollary 3.A.5 tells us that CΦ = Wm,n CΦ∗ W_{m,n}∗ is positive semidefinite too.

The fact that (c) =⇒ (d) follows from using the spectral decomposition to
write CΦ = ∑i vi v∗i . Theorem 3.A.3 tells us that if we define Ai = mat(vi ) then
Φ has the operator-sum representation Φ(X) = ∑i Ai XA∗i , as desired.
To see that (d) =⇒ (a) and complete the proof, we notice that if k ≥ 1 and
X ∈ Mn ⊗ Mk is positive semidefinite, then so is

∑_i (Ai ⊗ Ik) X (Ai ⊗ Ik)∗ = (Φ ⊗ Ik)(X).

It follows that Φ is k-positive for all k, so it is completely positive. 


The equivalence of properties (a) and (c) in the above theorem is rather
extraordinary. It is trivially the case that (a) implies (c), since if Φ is completely
positive then Φ ⊗ In is positive, so (Φ ⊗ In )(e+ e∗+ ) = CΦ is positive semidefinite.
However, the converse says that in order to conclude that (Φ⊗Ik )(X) is positive
semidefinite for all k ≥ 1 and all positive semidefinite X ∈ Mn ⊗Mk , it suffices
to just check what happens when k = n and X = e+ e∗+ . In this sense, the matrix
e+ e∗+ is extremely special and can roughly be thought of as the “least positive
semidefinite” PSD matrix—if a map of the form Φ ⊗ Ik ever creates a non-PSD
output from a PSD input, it does so at e+ e∗+ .
With completely positive maps taken care of, we now return to the prob-
lem of trying to characterize (not-necessarily-completely) positive maps. It is
straightforward to show that the composition of positive maps is again positive,
as is their sum. It follows that if a linear map Φ : Mn → Mm can be written in
the form
Φ = Ψ1 + T ◦ Ψ2   for some CP maps Ψ1, Ψ2 : Mn → Mm,                (3.A.1)

(again, T is the transposition map here) then it is positive. We say that any linear map Φ of this form is decompos-
able. It is straightforward to see that every CP map is decomposable, and the
above argument shows that every decomposable map is positive. The following
example shows that the reduction map ΦR , which we showed is positive in
Example 3.A.5, is also decomposable.

Example 3.A.8 Show that the linear map Ψ : Mn → Mn defined by Ψ(X) = tr(X)I − X T
The Reduction Map is completely positive.
is Decomposable
Solution:
We just compute the Choi matrix of Ψ. (This map Ψ is sometimes called the Werner–Holevo map.) Since the Choi matrix of the map X ↦ tr(X)I is I ⊗ I and the Choi matrix of the transpose map T is the swap matrix Wn,n, we conclude that CΨ = I ⊗ I − Wn,n. Since Wn,n is Hermitian and unitary, its eigenvalues all equal 1 or −1, so CΨ is positive semidefinite. Theorem 3.A.6 then implies that Ψ is completely positive.
In particular, notice that if ΦR (X) = tr(X)I − X is the reduction map
from Example 3.A.5 then ΦR = T ◦ Ψ has the form (3.A.1) and is thus
decomposable.
3.A Extra Topic: Matrix-Valued Linear Maps 373

In fact, every positive map Φ : Mn → Mm that we have seen so far is


decomposable, so it seems natural to ask whether or not there are positive maps
that are not. Remarkably, it turns out that the answer to this question depends
on the dimensions m and n. We start by showing that the answer is “no” when
m, n ≥ 3—there are ugly positive linear maps out there that cannot be written
in the form (3.A.1).

Theorem 3.A.7 Show that the linear map ΦC : M3 → M3 defined by


A Non-Decomposable  
Positive Linear Map x1,1 + x3,3 −x1,2 −x1,3
ΦC (X) =  −x2,1 x2,2 + x1,1 −x2,3 
−x3,1 −x3,2 x3,3 + x2,2

The linear map ΦC is is positive but not decomposable.


typically called the
Choi map.
It is worth noting that the map described by this theorem is quite similar to
the reduction map ΦR of Example 3.A.1—the diagonal entries of its output are
just shuffled up a bit (e.g., the top-left entry is x1,1 + x3,3 instead of x2,2 + x3,3 ).

Proof of Theorem 3.A.7. We have to prove two distinct claims: we have to


show that ΦC is positive and we have to show that it cannot be written in the
form (3.A.1).
To see that ΦC is positive, recall that it suffices to show that
 
|v1 |2 + |v3 |2 −v1 v2 −v1 v3
∗  
ΦC (vv ) =  −v1 v2 |v2 |2 + |v1 |2 −v2 v3 
−v1 v3 −v2 v3 |v3 |2 + |v2 |2

is positive semidefinite for all v ∈ F3 . To this end, recall that we can prove that
this matrix is positive semidefinite by checking that all of its principal minors
A principal minor of are nonnegative (see Remark 2.2.1).
a square matrix A is
the determinant of a There are three 1 × 1 principal minors of ΦC (vv∗ ):
square submatrix
that is obtained by |v1 |2 + |v3 |2 , |v2 |2 + |v1 |2 , and |v3 |2 + |v2 |2 ,
deleting some rows
and the same which are clearly all nonnegative. Similarly, one of the three 2 × 2 principal
columns of A.
minors of ΦC (vv∗ ) is
" #!
|v1 |2 + |v3 |2 −v1 v2
det = (|v1 |2 + |v3 |2 )(|v2 |2 + |v1 |2 ) − |v1 |2 |v2 |2
−v1 v2 |v2 |2 + |v1 |2
= |v1 |4 + |v1 |2 |v3 |2 + |v2 |2 |v3 |2
≥ 0.

The calculation for the other two 2 × 2 principal minors is almost identical, so
we move right on to the one and only 3 × 3 principal minor of ΦC (vv∗ ):

det ΦC (vv∗ ) = (|v1 |2 + |v3 |2 )(|v2 |2 + |v1 |2 )(|v3 |2 + |v2 |2 ) − 2|v1 |2 |v2 |2 |v3 |2
− (|v1 |2 + |v3 |2 )|v2 |2 |v3 |2 − (|v2 |2 + |v1 |2 )|v1 |2 |v3 |2
− (|v3 |2 + |v2 |2 )|v1 |2 |v2 |2
= |v1 |2 |v3 |4 + |v2 |2 |v1 |4 + |v3 |2 |v2 |4 − 3|v1 |2 |v2 |2 |v3 |2 .
374 Chapter 3. Tensors and Multilinearity

In order to show that this quantity is nonnegative, we recall that the AM–
The AM–GM GM inequality (Theorem A.5.3) tells us that if x, y, z ≥ 0 then (x + y + z)/3 ≥
inequality is √
3 xyz. Choosing x = |v |2 |v |4 , y = |v |2 |v |4 and z = |v |2 |v |4 gives
introduced in 1 3 2 1 3 2
Appendix A.5.2. q
|v1 |2 |v3 |4 + |v2 |2 |v1 |4 + |v3 |2 |v2 |4
≥ 3
|v1 |6 |v2 |6 |v3 |6 = |v1 |2 |v2 |2 |v3 |2 ,
3
which is exactly what we wanted. It follows that ΦC is positive, as claimed.
To see that ΦC cannot be written in the form ΦC = Ψ1 + T ◦ Ψ2 with Ψ1
and Ψ2 completely positive, we first note that constructing the Choi matrix of
both sides of this equation shows thats it is equivalent to show that we cannot
Recall that Y Γ refers write CΦC = X +Y Γ with both X and Y positive semidefinite. Suppose for the
to (T ⊗ I)(Y ); the first sake of establishing a contradiction that CΦC can be written in this form. A
partial transpose
of Y . straightforward computation shows that the Choi matrix of ΦC is
 
Here, X and Y are 1 · · · −1 · · · −1
the Choi matrices of  · · · · · · · · · 
 
Ψ1 and Ψ2 ,  · · 1 · · · · · · 
respectively.
 
 · · · 1 · · · · · 
 
CΦC =
 −1 · · · 1 · · · −1 
.
 · · · · · · · · · 
 
 · · · · · · · · · 
 
 · · · · · · · 1 · 
−1 · · · −1 · · · 1
 
Since CΦC 2,2 = 0 and X and Y are positive semidefinite, it is necessarily
   
the case that y2,2 = 0. The fact that CΦC 6,6 = CΦC 7,7 = 0 similarly implies
y6,6 = y7,7 = 0. If a diagonal entry of a PSD matrix equals 0 then its entire row
and column must equal 0 by Exercise 2.2.11, so we conclude in particular that

y2,4 = y4,2 = y3,7 = y7,3 = y6,8 = y8,6 = 0.

It is also worth noting The reason that we focused on these entries of Y is that, under partial transposi-
that this linear map
tion, they are the entries that are moved to the locations of the “−1” entries of
ΦC is not
2-positive—see CΦC above. That is,
Exercise 3.A.4.    
−1 = CΦC 1,5 = x1,5 + Y Γ 1,5 = x1,5 + y4,2 = x1,5 ,

and a similar argument shows that x1,9 = x5,1 = x5,9 = x9,1 = x9,5 = −1 as well.
Positive
  semidefiniteness
  of Y also tells us that y1,1 , y5,5 , y9,9 ≥ 0, which
(since CΦC 1,1 = CΦC 5,5 = CΦC 9,9 = 1) implies x1,1 , x5,5 , x9,9 ≤ 1. We have
now learned enough about X to see that it is not positive semidefinite: it is
straightforward to check that
Recall that
e+ = ∑3i=1 ei ⊗ ei = eT+ Xe+ = x1,1 + x5,5 + x9,9 − 6 ≤ 3 − 6 = −3,
(1, 0, 0, 0, 1, 0, 0, 0, 1).
so X is not positive semidefinite. This is a contradiction that completes the
proof. 

On the other hand, if the dimensions m and n are both sufficiently small,
then a somewhat deep result (which we now state without proof) says that every
positive map is indeed decomposable.
3.A Extra Topic: Matrix-Valued Linear Maps 375

Theorem 3.A.8 Suppose Φ : Mn → Mm is a linear map and that mn ≤ 6. Then Φ is


Positive Maps in positive if and only if it is decomposable.
Small Dimensions
With all of the results of this section taken care of, it is worthwhile at this
point to remind ourselves of the relationships between the various types of
positivity of linear maps that we have now seen. A visual summary of these
relationships is provided in Figure 3.2.

positive linear maps Φ : Mn → Mn


Choi map (Theorem 3.A.7) tr(X)I − X
In certain 2-positive
dimensions, the
relationship of the 2tr(X)I − X (Example 3.A.7)
decomposable set
with the others (n − 1)-positive
“collapses”. For
example, (n − 1)tr(X)I − X
Theorem 3.A.8 says decomposable
that if n = 2 then the
decomposable set
equals the entire
n-positive = completely positive
positive set. Also, if trace
n = 3 then every
2-positive map is Werner–Holevo (Example 3.A.8)
decomposable
[YYT16].

Figure 3.2: A schematic that depicts the relationships between the sets of positive,
k-positive, completely positive, and decomposable linear maps, as well the lo-
cations of the various positive linear maps that we have seen so far within these
sets.

Remark 3.A.2 The proof of Theorem 3.A.8 is quite technical—it was originally proved in
The Set of Positive the m = n = 2 case in [Stø63] (though there are now some slightly simpler
Linear Maps proofs of this case available [MO15]) and in the {m, n} = {2, 3} case in
[Wor76]. If n = 2 and m ≥ 4 then there are positive linear maps that are not
decomposable (see Exercise 3.A.22), just like we showed in the m, n ≥ 3
case in Theorem 3.A.7.
In higher dimensions, the structure of the set of positive linear maps is
not understood very well, and constructing “strange” positive linear maps
like the one from Theorem 3.A.7 is an active research area. The fact that
this set is so much simpler when mn ≤ 6 can be thought of roughly as a
statement that there just is not enough room in those small dimensions for
the “true nature” of the set of positive maps to become apparent.

Exercises solutions to starred exercises on page 481

3.A.1 Determine which of the following statements are


(b) If Φ : Mn (C) → Mm (C) is Hermiticity-preserving
true and which are false.
then Φ∗ = Φ.
∗(a) If A ∈ Mm,n (C) then the linear map Φ : Mn (C) → ∗(c) If Φ, Ψ : Mn → Mm are positive linear maps then
Mm (C) defined by Φ(X) = AXA∗ is positive. so is Φ + Ψ.
376 Chapter 3. Tensors and Multilinearity

∗∗ 3.A.2 Let Ψ : M3 → M3 be the (completely posi- ∗∗ 3.A.12 A linear map Φ : Mn → Mm is called


tive) Werner–Holevo map Ψ(X) = tr(X)I − X T from Ex- symmetry-preserving if Φ(X) is symmetric whenever X
ample 3.A.8. Find matrices {Ai } ⊂ M3 such that Ψ(X) = is symmetric.
∑i Ai XA∗i for all X ∈ M3 . (a) Show that every transpose-preserving linear map is
symmetry-preserving.
3.A.3 Suppose p ∈ R. Show that the linear map Ψ p : (b) Provide an example to show that a map can
Mn → Mn defined by Ψ p (X) = tr(X)I + pX T is com- be symmetry-preserving without being transpose-
pletely positive if and only if −1 ≤ p ≤ 1. preserving.
[Side note: If p = −1 then this is the Werner–Holevo map
from Example 3.A.8.] ∗∗3.A.13 Recall from Definition 3.A.5 that we required
positive linear maps to be adjoint-preserving.
∗∗3.A.4 Show that the Choi map ΦC from Theorem 3.A.7 (a) Show that if F = C then adjoint-preservation comes
is not 2-positive. for free. That is, show that if Φ(X) is positive
semidefinite whenever X is positive semidefinite,
[Hint: Let x = (e1 + e2 ) ⊗ e1 + e3 ⊗ e2 ∈ F3 ⊗ F2 . What is
then Φ is adjoint-preserving.
(ΦC ⊗ I2 )(xx∗ )?]
(b) Show that if F = R then adjoint-preservation does
not come for free. That is, find a matrix-valued lin-
∗∗ 3.A.5 Show that a matrix-valued linear map Φ : ear map Φ with the property that Φ(X) is positive
Mm,n → Mm,n preserves the Frobenius norm (i.e., semidefinite whenever X is positive semidefinite, but
kΦ(X)kF = kXkF for all X ∈ Mm,n ) if and only if [Φ] is Φ is not adjoint-preserving.
unitary.
[Side note: In the language of Section 1.D.3, we would say ∗∗ 3.A.14 Show that a matrix-valued linear map Φ is
that such a map Φ is an isometry of the Frobenius norm.] T = Γ(C ).
bisymmetric if and only if CΦ = CΦ Φ

3.A.6 Show that every adjoint-preserving linear map ∗∗3.A.15 Show that Φ : Mn (R) → Mm (R) is both bisym-
Φ : Mn → Mm can be written in the form Φ = Ψ1 − Ψ2 , metric and decomposable if and only if there exists a
where Ψ1 , Ψ2 : Mn → Mm are completely positive. completely positive map Ψ : Mn (R) → Mm (R) such that
Φ = Ψ + T ◦ Ψ.
∗ 3.A.7 Show that if Φ : Mn → Mm is positive then
so is Φ∗ . ∗∗3.A.16 Show that the linear map Φ ⊗ Ψ from Defini-
tion 3.A.4 is well-defined. That is, show that it is uniquely
∗3.A.8 A matrix-valued linear map Φ : Mn → Mm is determined by Φ and Ψ, and that the value of (Φ ⊗ Ψ)(C)
called unital if Φ(In ) = Im . Show that the following are does not depend on the particular way in which we write
equivalent: C = ∑ Ai ⊗ Bi .
i) Φ is unital, i

ii) tr2 (CΦ ) = Im , and


3.A.17 Suppose F = R or F = C, and let f : Mn (F) → F
iii) if Φ(X) = ∑ j A j XBTj then ∑ j A j BTj = Im .
be a linear form, which we can think of as a matrix-valued
linear map by noting that F ∼
= M1 (F) in the obvious way.
3.A.9 A matrix-valued linear map Φ : Mn → Mm is Show that the following are equivalent:
called trace-preserving if tr(Φ(X)) = tr(X) for all X ∈ i) f is positive,
Mn . Show that the following are equivalent: ii) f is completely positive, and
i) Φ is trace-preserving, iii) there exists a positive semidefinite matrix A ∈
ii) tr1 (CΦ ) = In , and Mn (F) such that f (X) = tr(AX) for all X ∈ Mn (F).
iii) if Φ(X) = ∑ j A j XBTj then ∑ j BTj A j = In .
[Hint: This can be proved directly, or it can be proved by ∗∗3.A.18 Show that if a linear map Φ : Mn → Mm is
noting that Φ is trace-preserving if and only if Φ∗ is unital (k + 1)-positive for some integer k ≥ 1 then it is also k-
and then invoking Exercise 3.A.8.] positive.

3.A.10 Suppose Φ, Ψ : Mn → Mn are linear maps. ∗∗3.A.19 Suppose a linear map Φ : Mn → Mm is such
(a) Show that CΨ◦Φ = (Ψ ⊗ I)(CΦ ). that Φ(X) is positive definite whenever X is positive defi-
(b) Show that CΦ◦Ψ∗ = (I ⊗ Ψ)(CΦ ), where Ψ is the lin- nite. Use the techniques from Section 2.D to show that Φ is
positive.
ear map defined by Ψ(X) = Ψ(X) for all X ∈ Mn .

3.A.11 Suppose Φ : Mn → Mm is a linear map. Show


that Φ is completely positive if and only if T ◦ Φ ◦ T is
completely positive.
3.A Extra Topic: Matrix-Valued Linear Maps 377

3.A.20 Suppose U ∈ Mn (C) is a skew-symmetric uni- (c) Show that Φ is not completely positive.
tary matrix (i.e., U T = −U and U ∗U = I). Show that
the map ΦU : Mn (C) → Mn (C) defined by ΦU (X) = ∗∗3.A.22 Consider the linear map Φ : M2 → M4 defined
tr(X)I − X −UX T U ∗ is positive. by
[Hint: Show that x · (Ux) = 0 for all x ∈ Cn . Exercise 2.1.22 " #!
a b
might help, but is not necessary.] Φ =
c d
[Side note: ΦU is called a Breuer–Hall map, and it is worth  
comparing it to the (also positive) reduction map of Ex- 4a − 2b − 2c + 3d 2b − 2a 0 0
ample 3.A.1. While the reduction map is decomposable,  
 2c − 2a 2a b 0 
Breuer–Hall maps are not.]  .
 0 c 2d −2b − d 
0 0 −2c − d 4a + 2d
3.A.21 Consider the linear map Φ : M2 → M4 defined
by (a) Show that if X is positive definite then the determi-
  nant of the top-left 1 × 1, 2 × 2, and 3 × 3 blocks of
0 0 x1,2 tr(X)/2 Φ(X) are strictly positive.
 
 0 0 x2,1 x1,2  (b) Show that if X is positive definite then det(Φ(X)) > 0
Φ(X) = tr(X)I4 − 
 x
.
as well.
 2,1 x1,2 0 0  
[Hint: This is hard. Try factoring this determinant as
tr(X)/2 x2,1 0 0 det(X)p(X), where p is a polynomial in the entries
(a) Show that Φ is positive. of X. Computer software might help.]
[Hint: Show that Φ(X) is diagonally dominant.] (c) Use Sylvester’s Criterion (Theorem 2.2.6) and Exer-
[Side note: In fact, Φ is decomposable—see Exer- cise 3.A.19 to show that Φ is positive.
cise 3.C.15.] [Side note: However, Φ is not decomposable—see
(b) Construct CΦ . Exercise 3.C.17.]

3.B Extra Topic: Homogeneous Polynomials


It might be a good Recall from Section 1.3.2 that if F = R or F = C then every linear form
idea to refer to
f : Fn → F can be represented in the form f (x) = v · x, where v ∈ Fn is a fixed
Appendix A.2.1 for
some basic facts vector. If we just expand that expression out via the definition of the dot product,
about multivariable we see that it is equivalent to the statement that every linear form f : Fn → F
polynomials before can be represented as a multivariable polynomial in which each term has degree
reading this section.
equal to exactly 1. That is, there exist scalars a1 , a2 , . . ., an ∈ F such that
n
f (x1 , x2 , . . . , xn ) = a1 x1 + a2 x2 + · · · + an xn = ∑ a jx j.
j=1

Similarly, we saw in Section 2.A (Theorem 2.A.1 in particular) that every


The degree of a term quadratic form f : Rn → R can be written as a multivariable polynomial in
(like ai, j xi x j ) is the
sum of the which each term has degree equal to exactly 2—there exist scalars ai, j ∈ R for
exponents of its 1 ≤ i ≤ j ≤ n such that
variables. For
example, 7x2 yz3 has n n
degree 2 + 1 + 3 = 6. f (x1 , x2 , . . . , xn ) = ∑ ∑ ai, j xi x j . (3.B.1)
i=1 j=i

We then used our various linear algebraic tools like the real spectral decom-
Every multivariable
position (Theorem 2.1.6) to learn more about the structure of these quadratic
polynomial can be forms. Furthermore, analogous results hold for quadratic forms f : Cn → C,
written as a sum of and they can be proved simply by instead making use of the complex spectral
homogeneous decomposition (Theorem 2.1.4).
polynomials (e.g., a
degree-2 polynomial In general, a polynomial for which every term has the exact same degree
is a sum of a is called a homogeneous polynomial. Linear and quadratic forms constitute
quadratic form,
the degree-1 and degree-2 homogeneous polynomials, respectively, and in this
linear form, and a
scalar). section we investigate their higher-degree brethren. In particular, we will see
378 Chapter 3. Tensors and Multilinearity

that homogeneous polynomials in general are intimately connected with tensors


and the Kronecker product, so we can use the tools that we developed in this
chapter to better understand these polynomials, and conversely we can use
these polynomials to better understand the Kronecker product.
We denote the vector space of degree-p homogeneous polynomials in n
The “H” in HP np
variables by HP np (or HP np (F), if we wish to emphasize which field F these
stands for polynomials are acting on and have coefficients from). That is,
“homogeneous”. n

HP np = f : Fn → F f (x1 , x2 , . . . , xn ) = ∑ ak1 ,k2 ,...,kn x1k1 x2k2 · · · xnkn
k1 +k2 +···+kn =p
Degree-3 and o
degree-4
for some scalars {ak1 ,k2 ,...,kn } ⊂ F .
homogeneous
polynomials are
called cubic forms For example, HP 1n is the space of n-variable linear forms, HP 2n is the space of
and quartic forms,
respectively.
n-variable quadratic forms, and HP 32 consists of all 2-variable polynomials in
which every term has degree equal to exactly 3, and thus have the form

a3,0 x13 + a2,1 x12 x2 + a1,2 x1 x22 + a0,3 x1 x23

for some a3,0 , a2,1 , a1,2 , a0,3 ∈ F.


When working with homogeneous polynomials of degree higher than 2, it
seems natural to ask which properties of low-degree polynomials carry over.
We focus in particular on two problems that ask us how we can decompose
these polynomials into powers of simpler ones.

3.B.1 Powers of Linear Forms


Recall from Corollary 2.A.2 that one way of interpreting the spectral decom-
position of a real symmetric matrix is as saying that every quadratic form
f : Rn → R can be written as a linear combination of squares of linear forms:
n
f (x1 , x2 , . . . , xn ) = ∑ λi (ci,1 x1 + ci,2 x2 + · · · + ci,n xn )2 .
i=1

In particular, λ1 , λ2 , . . ., λn are the eigenvalues of the symmetric matrix A that


represents f (in the sense of Equation (3.B.1)), and (ci,1 , ci,2 , . . . , ci,p ) is a unit
eigenvector corresponding to λi .
A natural question then is whether or not a similar technique can be applied
to homogeneous polynomials of higher degree (and to polynomials with com-
plex coefficients). That is, we would like to know whether or not it is possible
to write every degree-p homogeneous polynomial as a linear combination of
p-th powers of linear forms. The following theorem shows that the answer is
yes, this is possible.

Theorem 3.B.1 Suppose F = R or F = C, and f ∈ HP np (F). There exists an integer r and


Powers of scalars λi ∈ F and ci, j ∈ F for 1 ≤ i ≤ r, 1 ≤ j ≤ n such that
Linear Forms
r
f (x1 , x2 , . . . , xn ) = ∑ λi (ci,1 x1 + ci,2 x2 + · · · + ci,n xn ) p .
i=1
3.B Extra Topic: Homogeneous Polynomials 379

By the spectral Proof. In order to help us prove this result, we first define an inner product
decomposition, if
h·, ·i : HP np ×HP np → F by setting h f , gi equal to a certain weighted dot product
p = 2 then we can
choose r = n in this of the vectors of coefficients of f , g ∈ HP np . Specifically, if
theorem. In general,
we can choose r = f (x1 , x2 , . . . , xn ) = ∑ ak1 ,k2 ,...,kn x1k1 x2k2 · · · xnkn and

dim HP np = n+p−1
p
k1 +k2 +···+kn =p
(we will see how this
g(x1 , x2 , . . . , xn ) = ∑ bk1 ,k2 ,...,kn x1k1 x2k2 · · · xnkn ,
dimension is
k1 +k2 +···+kn =p
computed shortly).
then we define
ak1 ,k2 ,...,kn bk1 ,k2 ,...,kn
h f , gi = ∑ p  , (3.B.2)
k1 +···+kn =p k1 ,k2 ,...,kn

Multinomial where k1 ,k2p,...,kn = k1! k2p! ! ··· kn ! is a multinomial coefficient. We show that this
coefficients and the
multinomial theorem function is an inner product in Exercise 3.B.11.
(Theorem A.2.2) are Importantly, notice that if g is the p-th power of a linear form (i.e., there
introduced in
are scalars c1 , c2 , . . . , cn ∈ F such that g(x1 , x2 , . . . , xn ) = (c1 x1 + c2 x2 + · · · +
Appendix A.2.1.
cn xn ) p ) then we can use the multinomial theorem (Theorem A.2.2) to see that
  
p k1 k2
g(x1 , x2 , . . . , xn ) = ∑ c c
1 2 · · · c kn
n x1k1 x2k2 · · · xnkn ,
k +k +···+kn =p k 1 , k 2 , . . . , kn
1 2

so straightforward computation then shows us that


p  k1 k2 kn
k1 ,k2 ,...,kn ak1 ,k2 ,...,kn c1 c2 · · · cn
h f , gi = p  = ak1 ,k2 ,...,kn ck11 ck22 · · · cknn
k1 ,k2 ,...,kn (3.B.3)
= f (c1 , c2 , . . . , cn ).
With these details out of the way, we can now prove the theorem relatively
straightforwardly. Another way of stating the theorem is as saying that
n
HP np = span g : Fn → F g(x1 , x2 , . . . , xn ) = (c1 x1 + c2 x2 + · · · + cn xn ) p
o
for some c1 , c2 , . . . , cn ∈ F .

Our goal is thus to show that this span is not a proper subspace of HP np .
Suppose for the sake of establishing a contradiction that this span were a
proper subspace of HP np . Then there would exist a non-zero homogeneous
For example, we polynomial f ∈ HP np orthogonal to each p-th power of a linear form:
could choose f to
be any non-zero h f , gi = 0 whenever g(x1 , x2 , . . . , xn ) = (c1 x1 + c2 x2 + · · · + cn xn ) p .
vector in the
orthogonal Equation (3.B.3) then tells us that f (c1 , c2 , . . . , cn ) = 0 for all c1 , c2 , . . . , cn ∈ F,
complement (see which implies f is the zero polynomial, which is the desired contradiction. 
Section 1.B.2) of the
span of the g’s. In the case when p is odd, the linear combination of powers described by
Theorem 3.B.1 really is just a sum of powers, since we can absorb any scalar
(positive or negative) inside the linear forms. For example, every cubic form
f : Fn → F can be written in the form
r
f (x1 , x2 , . . . , xn ) = ∑ (ci,1 x1 + ci,2 x2 + · · · + ci,n xn )3 .
i=1
In fact, if F = C then this can be done regardless of whether p is even or odd.
However, if p is even and F = R then the best we can do is absorb the absolute
value of scalars λi into the powers of linear forms (i.e., we can assume without
loss of generality that λi = ±1 for each 1 ≤ i ≤ m).
380 Chapter 3. Tensors and Multilinearity

Example 3.B.1 Write the cubic form f : R2 → R defined by


Decomposing a
Cubic Form f (x, y) = 4x3 + 12xy2 − 6y3
as a sum of cubes of linear forms.
Solution:
Unfortunately, the proof of Theorem 3.B.1 is non-constructive, so even
though we know that such a sum-of-cubes way of writing f exists, we
have not yet seen how to find one. The trick is to simply construct a basis
of HP 32 consisting of powers of linear forms, and then represent f in that
basis.
We do not have a general method or formula for constructing such a
basis, but one method that works in practice is to just choose powers of
randomly-chosen linear forms until we have the right number of them to
This basis is arbitrary. form a basis. For example, since HP 32 is 4-dimensional, we know that
Many other similar any such basis must consist of 4 polynomials, and it is straightforward to
sets like x3 , (x −
verify that the set
y)3 , (x + 2y)3 , (x − 3y)3  3 3
are also bases. x , y , (x + y)3 , (x − y)3
is one such basis.
Now our goal is to simply to represent f in this basis—we want to find
The final line here just c1 , c2 , c3 , c4 such that
comes from
expanding the terms
(x + y)3 and (x − y)3 f (x, y) = 4x3 + 12xy2 − 6y3
via the binomial
= c1 x3 + c2 y3 + c3 (x + y)3 + c4 (x − y)3
theorem
(Theorem A.2.1) and = (c1 + c3 + c4 )x3 + (3c3 − 3c4 )x2 y+
then grouping
coefficients by (3c3 + 3c4 )xy2 + (c2 + c3 − c4 )y3 .
which monomial (x3 ,
x2 y, xy2 , or y3 ) they Matching up coefficients of these polynomials gives us the four linear
multiply.
equations

c1 + c3 + c4 = 4, 3c3 − 3c4 = 0,
3c3 + 3c4 = 12, and c2 + c3 − c4 = −6.

This linear system has (c1 , c2 , c3 , c4 ) = (1, −6, 2, 2) as its unique solution,
which gives us the decomposition

f (x, y) = x3 − 6y3 + 2(x + y)3 + 2(x − y)3 .

To turn this linear combination of cubes into a sum of cubes, we just


absorb constants inside of the cubed terms:
√ 3 √ 3 √ 3
f (x, y) = x3 + − 6y + 2(x + y) + 2(x − y) .
3 3 3

The key observation that connects homogeneous polynomials with the other
topics of this chapter is that HP np is isomorphic to the symmetric subspace
Snp ⊂ (Fn )⊗p that we explored in Section 3.1.3. To see why this is the case,
recall that one basis of Snp consists of the vectors of the form
( )
∑ Wσ (e j1 ⊗ e j2 ⊗ · · · ⊗ e j p ) : 1 ≤ j1 ≤ j2 ≤ · · · ≤ j p ≤ n .
σ ∈S p
3.B Extra Topic: Homogeneous Polynomials 381

In particular, each vector in this basis is determined by how many of the


subscripts j1 , j2 , . . ., j p are equal to each of 1, 2, . . ., n. If we let k1 , k2 , . . ., kn
denote the multiplicity of the subscripts 1, 2, . . ., n in this manner then we can
associate the vector

∑ Wσ (e j1 ⊗ e j2 ⊗ · · · ⊗ e j p )
σ ∈S p

with the tuple (k1 , k2 , . . . , kn ).


The fact that these That is, the linear transformation T : Snp → HP np defined by
spaces are !
isomorphic tellsus
that dim HP np =
  T ∑ Wσ (e j1 ⊗ e j2 ⊗ · · · ⊗ e j p ) = x1k1 x2k2 · · · xnkn ,
dim Snp = n+p−1
p . σ ∈S p

where k1 , k2 , . . ., kn count the number of occurrences of the subscripts 1, 2, . . .,


n (respectively) in the multiset { j1 , j2 , . . . , j p }, is an isomorphism. We have
thus proved the following observation:

! HP np is isomorphic to the symmetric subspace Snp ⊂ (Fn )⊗p .

The particular isomorphism that we constructed here perhaps seems ugly at


first, but it really just associates the members of the “usual” basis of Snp with
the monomials from HP np in the simplest way possible. For example, if we
return to the n = 2, p = 3 case, we get the following pairing of basis vectors
for these two spaces:

Basis Vector in S23 Basis Vector in HP 32


Compare this table 6 e1 ⊗ e1 ⊗ e1  x13
with the one on 2 e1 ⊗ e1 ⊗ e2 + e1 ⊗ e2 ⊗ e1 + e2 ⊗ e1 ⊗ e1  x12 x2
page 314.
2 e1 ⊗ e2 ⊗ e2 + e2 ⊗ e1 ⊗ e2 + e2 ⊗ e2 ⊗ e1 x1 x22
6 e2 ⊗ e2 ⊗ e2 x23

If we trace the statement of Theorem 3.B.1 through this isomorphism, we


learn that every w ∈ Snp can be written in the form
r
w = ∑ λi v⊗p
i
i=1

for some λ1 , λ2 , . . . , λr ∈ F and v1 , v2 , . . . , vr ∈ Fn . In other words, we have


finally proved Theorem 3.1.11.

Example 3.B.2 Write the vector w = (2, 3, 3, 1, 3, 1, 1, 3, 3, 1, 1, 3, 1, 3, 3, 2) ∈ S24 as a linear


Finding a Symmetric combination of vectors of the form v⊗4 .
Kronecker
Solution:
Decomposition
We can solve this problem by mimicking what we did in Exam-
ple 3.B.1. We start by constructing a basis of S24 consisting of vectors of
Again, there is the form v⊗4 . We do not have an explicit method of carrying out this task,
but in practice we can just pick random vectors of the form ⊗4
 v until we
nothing special
about this basis. Pick 4
random tensor
have enough of them to form a basis (so we need dim S2 = 5 of them).
powers of 5 distinct One such basis is
vectors and there’s a
good chance we
n o
will get a basis. (1, 0)⊗4 , (0, 1)⊗4 , (1, 1)⊗4 , (1, 2)⊗4 , (2, 1)⊗4 .
382 Chapter 3. Tensors and Multilinearity

Now our goal is to simply to represent w in this basis—we want to


find c1 , c2 , c3 , c4 , c5 such that

w = c1 (1, 0)⊗4 + c2 (0, 1)⊗4 + c3 (1, 1)⊗4 + c4 (1, 2)⊗4 + c5 (2, 1)⊗4 .

If we explicitly write out what all of these vectors are (they have 16
entries each!), we get a linear system consisting of 16 equations and 5
variables. However, many of those equations are identical to each other,
and after discarding those duplicates we arrive at the following linear
system consisting of just 5 equations:

c1 + c3 + c4 + 16c5 = 2
c2 + c3 + 16c4 + c5 = 2
c3 + 2c4 + 8c5 = 3
c3 + 4c4 + 4c5 = 1
c3 + 8c4 + 2c5 = 3

This linear system has (c1 , c2 , c3 , c4 , c5 ) = (−8, −8, −7, 1, 1) as its unique
solution, which gives us the decomposition

w = (1, 2)⊗4 + (2, 1)⊗4 − 7(1, 1)⊗4 − 8(1, 0)⊗4 − 8(0, 1)⊗4 .

Remark 3.B.1 Recall from Section 3.3.3 that the tensor rank of w ∈ (Fn )⊗p (denoted by
Symmetric rank(w)) is the least integer r such that we can write
Tensor Rank
r
(1) (2) (p)
w = ∑ vi ⊗ vi ⊗ · · · ⊗ vi ,
i=1

( j)
where vi ∈ Fn for each 1 ≤ i ≤ r and 1 ≤ j ≤ p.
Now that we know that Theorem 3.1.11 holds, we could similarly
define (as long as F = R or F = C) the symmetric tensor rank of any
w ∈ Snp (denoted by rankS (w)) to be the least integer r such that we can
If F = C then we can write
omit the λi scalars r
here since
√ we can w = ∑ λi v⊗p
i ,
absorb p λi into vi . If i=1
F = R then we can
similarly assume that where λi ∈ F and vi ∈ Fn for each 1 ≤ i ≤ r.
λi = ±1 for all i.
Perhaps not surprisingly, only a few basic facts are known about the
symmetric tensor rank:
• When p = 2, the symmetric tensor rank just equals the usual rank:
rank(w) = rankS (w) for all w ∈ Sn2 . This is yet another consequence
of the spectral decomposition (see Exercise 3.B.6).
• In general, rank(w) ≤ rankS (w) for all w ∈ Snp . This fact follows
directly from the definitions of these two quantities—when comput-
ing the symmetric rank, we minimize over a strictly smaller set of
sums than for the non-symmetric rank.
• There are cases when rank(w) < rankS (w), even just when p = 3,
but they are not easy to construct or explain [Shi18].
3.B Extra Topic: Homogeneous Polynomials 383

 n+p−1 n+p−1
• Since dim Snp = p , we have rankS (w) ≤ p for all
w ∈ Snp .

3.B.2 Positive Semidefiniteness and Sums of Squares


Recall that if F = C, or F = R and p is odd, then Theorem 3.B.1 says that
every homogeneous polynomial f ∈ HP np can be written as a sum (not just as
a linear combination) of powers of linear forms, since we can absorb scalars
inside those powers. However, in the case when F = R and p is even, this is
not always possible, since any such polynomial necessarily always produces
non-negative output:
r
f (x1 , x2 , . . . , xn ) = ∑ (ci,1 x1 + ci,2 x2 + · · · + ci,n xn ) p ≥ 0 for all x1 , x2 , . . . , xn .
i=1

We investigate polynomials of this type in this section, so we start by giving


them a name:

Definition 3.B.1 We say that a real-valued polynomial f is positive semidefinite (PSD) if


Positive Semidefinite f (x1 , x2 , . . . , xn ) ≥ 0 for all x1 , x2 , . . . , xn ∈ R.
Polynomials
In particular, we are interested in the converse to the observation that we
made above—if a homogeneous polynomial is positive semidefinite, can we
write it as a sum of even powers of linear forms? This question is certainly too
restrictive, since we cannot even write the (clearly PSD) polynomial f (x, y) =
x2 y2 in such a manner (see Exercise 3.B.5). We thus ask the following slightly
weaker question about sums of squares of polynomials, rather than sums of
even powers of linear forms:

? For which values of n and p can the PSD polynomials in


HP np (R) be written as a sum of squares of polynomials?

We already know (again, as a corollary of the spectral decomposition) that this


is possible when p = 2. The main result of this section shows that it is also
possible when n = 2 (i.e., when there are only two variables).
The main technique that goes into proving this upcoming theorem is a
trick called dehomogenization, which is a process for turning a homogeneous
polynomial in n variables into a (not necessarily homogeneous) polynomial in
If you prefer, you n − 1 variables while preserving many of its important properties (like positive
can instead divide
through by x pj for any semidefiniteness). For a polynomial f ∈ HP np acting on variables x1 , x2 , . . . , xn ,
1 ≤ j ≤ n of your this procedure works by dividing f through by xnp and then relabeling the ratios
choosing. x1 /xn , x2 /xn , . . ., xn−1 /xn as the variables. We illustrate with an example.

Example 3.B.3 Dehomogenize the quartic polynomial f ∈ HP 43 defined by


Dehomogenizing
a Quartic f (x1 , x2 , x3 ) = x14 + 2x13 x2 + 3x12 x32 + 4x1 x22 x3 + 5x34 .
384 Chapter 3. Tensors and Multilinearity

Solution:
If we divide f through by x34 , we get

f (x1 , x2 , x3 ) x14 x 3 x2 x2 x1 x2
4
= 4 + 2 1 4 + 3 12 + 4 3 2 + 5,
x3 x3 x3 x3 x3

which is not a polynomial in x1 , x2 , and x3 , but is a polynomial in the


The terms in the variables x = x1 /x3 and y = x2 /x3 . That is, the dehomogenization of f is
dehomogenization
do not necessarily all the two-variable (non-homogeneous) polynomial g defined by
have the same
degree. However, g(x, y) = x4 + 2x3 y + 3x2 + 4xy2 + 5.
they do all have
degree no larger
than that of the It should be reasonably clear that a polynomial is positive semidefinite if
original and only if its dehomogenization is positive semidefinite, since dividing by a
homogeneous positive term like xnp when p is even does not affect positive semidefiniteness.
polynomial.
We now make use of this technique to answer our central question about PSD
homogeneous polynomials in the 2-variable case.

Theorem 3.B.2 Suppose f ∈ HP 2p (R). Then f is positive semidefinite if and only if it can
Two-Variable PSD be written as a sum of squares of polynomials.
Homogeneous
Polynomials
Proof. The “if” direction is trivial (even for polynomials of more than 2 vari-
ables), so we only show the “only if” direction. Also, thanks to dehomogeniza-
tion, it suffices to just prove that every PSD (not necessarily homogeneous)
polynomial in a single variable can be written as a sum of squares.
To this end, we induct on the degree p of the polynomial f . In the p = 2
base case, we can complete the square in order to write f in its vertex form

f (x) = a(x − h)2 + k,

where a > 0 and the vertex of the graph of f is located at the point (h, k) (see
Figure 3.3(a)). Since f (x) ≥ 0 for all x we know that f (h) = k ≥ 0, so this
vertex form is really a sum of squares decomposition of f .

y y = 13 (x − 2)2 + 1 y y = 1 (x + 1)(x − 2)2


3

x = −1 x=2
(order 1) (order 2)
x
(2, 1) -1 1 2 3
x
-1 1 2 3 4 5

Figure 3.3: Illustrations of some basic facts about real-valued polynomials of a


single variable.

For the inductive step, suppose f has even degree p ( f can clearly not be
positive semidefinite if p is odd). Let k ≥ 0 be the minimal value of f (which
exists, since p is even) and let h be a root of the polynomial f (x) − k. Since
3.B Extra Topic: Homogeneous Polynomials 385

f (x) − k ≥ 0 for all x, the multiplicity of h as a root of f (x) − k must be even


(see Figure 3.3(b)). We can thus write

f (x) = (x − h)2q g(x) + k

for some integer q ≥ 1 and some polynomial g. Since k ≥ 0 and g(x) = ( f (x) −
k)/(x − h)2q ≥ 0 is a PSD polynomial of degree p − 2q < p, it follows from the
inductive hypothesis that g can be written as a sum of squares of polynomials,
so f can as well. 
It is also known [Hil88] that every PSD polynomial in HP 43 (R) can be
written as a sum of squares of polynomials, though the proof is rather technical
and outside of the scope of this book. Remarkably, this is the last case where
such a property holds—for all other values of n and p, there exist PSD polyno-
mials that cannot be written as a sum of squares of polynomials. The following
example presents such a polynomial in HP 63 (R).

Example 3.B.4 Consider the real polynomial f (x, y, z) = x2 y4 + y2 z4 + z2 x4 − 3x2 y2 z2 .


A Weird PSD a) Show that f is positive semidefinite.
Polynomial of b) Show that f cannot be written as a sum of squares of polynomials.
Degree 6
Solutions:
a) If we apply the AM–GM inequality (Theorem A.5.3) to the quanti-
ties x2 y4 , y2 z4 , and z2 x4 , we learn that
q
x2 y4 + y2 z4 + z2 x4
≥ 3
(x2 y4 )(y2 z4 )(z2 x4 ) = x2 y2 z2 .
3
Multiplying this inequality through by 3 and rearranging gives us
exactly the inequality f (x, y, z) ≥ 0.
b) If f could be written as a sum of squares of polynomials, then it
would have the form
The terms being r
squared and f (x, y, z) = ∑ ai xy2 + bi yz2 + ci zx2 + di x2 y + ei y2 z
summed here are
i=1
just general cubics. 2
Recall
 that + gi z2 x + hi x3 + ji y3 + ki z3 + `i xyz
dim HP 33 = 10, so
there are 10
for some scalars ai , bi , . . . , `i ∈ R. By matching up terms of f with
monomials in such a
cubic. terms in this hypothetical sum-of-squares representation of f , we
can learn about their coefficients.
For example, since the coefficient of x6 in f is 0, whereas the
coefficient of x6 in this sum-of-squares representation is ∑ri=1 h2i ,
we learn that hi = 0 for all 1 ≤ i ≤ r. A similar argument with the
coefficients of y6 and z6 then shows that ji = ki = 0 for all 1 ≤ i ≤ r
as well.
The coefficient of x4 y2 then similarly tells us that
r
0 = ∑ di2 , so di = 0 for all 1 ≤ i ≤ r,
i=1

and the same argument with the coefficients of y4 z2 and z4 x2 tells


us that ei = gi = 0 for all 1 ≤ i ≤ r as well.
386 Chapter 3. Tensors and Multilinearity

At this point, our hypothetical sum-of-squares representation of f


has been simplified down to the form
r 2
f (x, y, z) = ∑ ai xy2 + bi yz2 + ci zx2 + `i xyz .
i=1

Comparing the coefficient of x2 y2 z2 in f to this decomposition of


f then tells us that −3 = 0, which is (finally!) a contradiction that
shows that no such sum-of-squares decomposition of f is possible
in the first place.

Similar examples show that there are PSD polynomials that cannot be
written as a sum of squares of polynomials in HP np (R) whenever n ≥ 3 and
p ≥ 6 is even. An example that demonstrates the existence of such polynomials
in HP 44 (R) (and thus HP np (R) whenever n ≥ 4 and p ≥ 4 is even) is provided in
Exercise 3.B.7. Finally, a summary of these various results about the connection
between positive semidefiniteness and the ability to write a polynomial as a
sum of squares of polynomials is provided by Table 3.2.

n (number of variables)
p (degree) 1 2 3 4 5
2 X (spectral decomp.)
4 7 (Exer. 3.B.7)
X (Thm. 3.B.2)

X[Hil88]
X (trivial)

6 7 (Exam. 3.B.4)
8
10
12

Table 3.2: A summary of which values of n and p lead to a real n-variable degree-p
PSD homogeneous polynomial necessarily being expressible as a sum of squares
of polynomials.

Remark 3.B.2 Remarkably, even though some positive semidefinite polynomials cannot
Sums of Squares of be written as a sum of squares of polynomials, they can all be written as a
Rational Functions sum of squares of rational functions [Art27]. For example, we showed in
Example 3.B.4 that the PSD homogeneous polynomial

f (x, y, z) = x2 y4 + y2 z4 + z2 x4 − 3x2 y2 z2

cannot be written as a sum of squares of polynomials. However, if we


define

g(x, y, z) = x3 z − xzy2 and h(x, y, z) = x2 (y2 − z2 )

then we can write f in the following form that makes it “obvious” that it
3.B Extra Topic: Homogeneous Polynomials 387

is positive semidefinite:

g(x, y, z)2 + g(y, z, x)2 + g(z, x, y)2


f (x, y, z) =
x2 + y2 + z2
h(x, y, z)2 + h(y, z, x)2 + h(z, x, y)2
+ .
2(x2 + y2 + z2 )

This is not quite a sum of squares of rational functions, but it can be turned
into one straightforwardly (see Exercise 3.B.8).

3.B.3 Biquadratic Forms


Recall from Section 2.A that one way of constructing a quadratic form is
to plug the same vector into both input slots of a bilinear form. That is, if
f : Rn × Rn → R is a bilinear form then the function q : Rn → R defined by
q(x) = f (x, x) is a quadratic form (and the scalars that represent q are exactly
the entries of the symmetric matrix A that represents f ). The following facts
follow via a similar argument:
• If f : Rn × Rn × Rn → R is a trilinear form then the function q : Rn → R
defined by q(x) = f (x, x, x) is a cubic form.
• If f : Rn × Rn × Rn × Rn → R is a 4-linear (quadrilinear?) form then the
function q : Rn → R defined by q(x) = f (x, x, x, x) is a quartic form.
(Rn )×p means the • If f : (Rn )×p → R is a p-linear form then the function q : Rn → R defined
Cartesian product of
by q(x) = f (x, x, . . . , x) is a degree-p homogeneous polynomial.
p copies of Rn with
itself. We now focus on what happens in the p = 4 case if we go halfway between
quadrilinear and quartic forms: we consider functions q : Rm × Rn → R of the
form q(x, y) = f (x, y, x, y), where f : Rm × Rn × Rm × Rn → R is a quadri-
linear form. We call such functions biquadratic forms, and it is reasonably
straightforward to show that they can all be written in the following form (see
Exercise 3.B.9):

Definition 3.B.2 A biquadratic form is a degree-4 homogeneous polynomial q : Rm ×


Biquadratic Forms Rn → R for which there exist scalars {ai, j;k,` } such that
m n
q(x, y) = ∑ ∑ ai, j;k,` xi x j yk y` .
1=i≤ j 1=k≤`

Analogously to how the set of bilinear forms is isomorphic to the set of


matrices, quadratic forms are isomorphic to the set of symmetric matrices,
and PSD quadratic forms are isomorphic to the set of positive semidefinite
matrices, we can also represent quadrilinear and biquadratic forms in terms
of matrix-valued linear maps in a natural way. In particular, notice that every
linear map Φ : Mn (R) → Mm (R) can be thought of as a quadrilinear form
f : Rm × Rn × Rm × Rn → R just by defining

f (v, w, x, y) = vT Φ wyT x.

Indeed, it is straightforward to show that Φ is linear if and only if f is


quadrilinear, and that this relationship between f and Φ is actually an iso-
morphism. It follows that biquadratic forms similarly can be written in the
388 Chapter 3. Tensors and Multilinearity

form 
q(x, y) = f (x, y, x, y) = xT Φ yyT x.
The following theorem pins down what properties of Φ lead to various desirable
properties of the associated biquadratic form q.

Theorem 3.B.3 A function q : Rm × Rn → R is a biquadratic form if and only if there exists


Biquadratic Forms a bisymmetric linear map Φ : Mn (R) → Mm (R) (i.e., a map satisfying
as Matrix-Valued Φ = T ◦ Φ = Φ ◦ T ) such that q(x, y) = xT Φ yyT x for all x ∈ Rm and
Linear Maps y ∈ Rn . Properties of Φ correspond to properties of q as follows:
a) Φ is positive if and only if q is positive semidefinite.
b) Φ is decomposable if and only if q can be written as a sum of squares
of bilinear forms.
Bisymmetric maps Furthermore, this relationship between q and Φ is an isomorphism.
were introduced in
Section 3.A.1.
Decomposable
maps are ones that Proof. We already argued why q is a biquadratic form if and only if we can
can be written in the write q(x, y) = xT Φ yyT x for some linear map Φ. To see why we can choose
form Φ = Ψ1 + T ◦ Ψ2 Φ to be bisymmetric, we just notice that if we define
for some CP Ψ1 , Ψ2 .
1 
Ψ= Φ + (T ◦ Φ) + (Φ ◦ T ) + (T ◦ Φ ◦ T )
4

then Ψ is bisymmetric and q(x, y) = xT Ψ yyT x for all x ∈ Rm and y ∈ Rn as
well.
To see that this association between biquadratic forms q and bisymmetric
linear maps Φ is an isomorphism, we just notice that the dimensions of these
two vector spaces match up. It is evident from Definition 3.B.2 that the space
of biquadratic forms has dimension m+1 2
n+1
2 = mn(m + 1)(n + 1)/4, since
each biquadratic form is determined by that many scalars {ai, j;k,` }. On the
other hand, recall from Exercise 3.A.14 that Φ is bisymmetric if and only
if CΦ = CΦ T = Γ(C ). It follows that the set of bisymmetric maps also has
 n+1
Φ 
dimension m+1 2 2 = mn(m + 1)(n + 1)/4, since every bisymmetric Φ

depends only on the m+1 2 = m(m + 1)/2 entries in the upper-triangular portion
n+1
of each of the 2 = n(n + 1)/2 blocks in the upper-triangular portion of CΦ .
To see why property (a) holds, we note that q is positive semidefinite if and
only if 
q(x, y) = xT Φ yyT x ≥ 0 for all x ∈ Rm , y ∈ Rn .

Positive semidefiniteness of q is thus equivalent to Φ yyT being positive
semidefinite for all y ∈ Rn , which in turn is equivalent to Φ being positive
(since the real spectral decomposition tells us that every positive semidefinite
X ∈ Mn (R) can be written as a sum of terms of the form yyT ).
To demonstrate property (b), we note that if Φ is decomposable then we can
write it in the form Φ = Ψ1 + T ◦ Ψ2 for some completely positive maps Ψ1
and Ψ2 with operator-sum representations Ψ1 (X) = ∑i Ai XATi and Ψ2 (X) =
∑i Bi XBTi . Then
 T
q(x, y) = xT Ψ1 yyT x + xT Ψ2 (yyT ) x
 T
= ∑ xT Ai yyT ATi x + ∑ xT Bi yyT BTi x
i i
= ∑(xT Ai y)2 + ∑(xT Bi y)2 ,
i i
3.B Extra Topic: Homogeneous Polynomials 389

which is a sum of squares of the bilinear forms gi (x, y) = xT Ai y and hi (x, y) =


xT Bi y.
Conversely, if q can be written as a sum of squares of bilinear forms
gi : Rm × Rn → R then we know from Theorem 1.3.5 that we can find matrices
{Ai } such that gi (x, y) = xT Ai y. It follows that
 
q(x, y) = ∑(xT Ai y)2 = ∑ xT Ai yyT ATi x = xT Ψ yyT x,
i i

where Ψ is the (completely positive) map with operator-sum representation


Ψ(X) = ∑i Ai XATi . However, this map Ψ may not be bisymmetric. To see how
we can represent q by a bisymmetric map (at the expense of weakening com-
plete positivity to just decomposability), we introduce the decomposable map
Φ = (Ψ + T ◦ Ψ)/2. It is straightforward to check that q(x, y) = xT Φ yyT x
for all x ∈ Rm and y ∈ Rn as well, and the fact that it is bisymmetric follows
from Exercise 3.A.15. 
It is worth noting that many of the implications of the above theorem still
work even if the linear map Φ is not bisymmetric. For example, if Φ : Mn (R) →
Mm (R) is a positive linear map and we define q(x, y) = xT Φ yyT x then the
This is analogous to proof given above shows that q must be positive semidefinite. However, the
the fact that a
converse is not necessarily true unless Φ is bisymmetric—q might be positive
quadratic form
q(x) = xT Ax being semidefinite even if Φ is not positive (it might not even send symmetric matrices
PSD does not imply A to symmetric matrices). Similarly, the proof of Theorem 3.B.3 shows that a
is PSD unless we biquadratic form can be written as a sum of squares if and only if it can be
choose A to be
symmetric.
represented by a completely positive map, which is sometimes easier to work
with than a decomposable bisymmetric map.
For ease of reference, we summarize the various ways that quadratic forms,
biquadratic forms, and various other closely-related families of homogeneous
polynomials are isomorphic to important sets of matrices and matrix-valued
linear maps in Table 3.3.

Polynomial Matrix(-Valued Map)

The top half of this


bilinear form matrix
table has one fewer quadratic form symmetric matrix
row than the bottom PSD quadratic form PSD matrix
half since a
quadratic form is quadrilinear form matrix-valued linear map
PSD if and only if it is biquadratic form bisymmetric matrix-valued map
a sum of squares of PSD biquadratic form positive bisymmetric linear map
linear forms.
sum of squares of bilin. forms decomposable bisymmetric map

Table 3.3: A summary of various isomorphisms that relate certain families of ho-
mogeneous polynomials to certain families of matrices and matrix-valued linear
maps.

Example 3.B.5 Write the biquadratic form q : R3 × R3 → R defined by


A Biquadratic Form
as a Sum of Squares q(x, y) = x12 (y22 + y23 ) + x22 (y23 + y21 ) + x32 (y21 + y22 )
− 2x1 x2 y1 y2 − 2x2 x3 y2 y3 − 2x3 x1 y3 y1

as a sum of squares of bilinear forms.


390 Chapter 3. Tensors and Multilinearity

Solution: 
It is straightforward to show that q(x, y) = xT Φ yyT x, where Ψ :
We were lucky here M3 → M3 is the (completely positive) Werner–Holevo map defined by
that we could
“eyeball” a CP map
Ψ(X) = tr(X)I − X T (see Example 3.A.8). It follows that if we can find
Ψ for which an operator-sum representation Ψ(X) = ∑i Ai XATi then we will get a sum-
q(x, y) = xT Ψ(yyT )x. In of-squares representation q(x, y) = ∑i (xT Ai y)2 as an immediate corollary.
general, finding a
suitable map Ψ to To construct such an operator-sum representation, we mimic the proof
show that q is a sum of Theorem 3.A.6: we construct CΨ and then let the Ai ’s be the matri-
of squares can be cizations of its scaled eigenvectors. We recall from Example 3.A.8 that
done via
CΨ = I − W3,3 , and applying the spectral decomposition to this matrix
semidefinite
programming (see shows that CΨ = ∑3i=1 vi vTi , where
Exercise 3.C.18).
v1 = (0, 1, 0, −1, 0, 0, 0, 0, 0),
v2 = (0, 0, 1, 0, 0, 0, −1, 0, 0), and
v3 = (0, 0, 0, 0, 0, 1, 0, −1, 0).

It follows that Ψ(X) = ∑3i=1 Ai XATi , where


     
We already 0 1 0 0 0 1 0 0 0
constructed this A1 = −1 0 0 , A2 =  0 0 0 , and A3 = 0 0 1 .
operator-sum
representation back
0 0 0 −1 0 0 0 −1 0
in Exercise 3.A.2.
This then gives us the sum-of-squares representation
3 2
q(x, y) = ∑ xT Ai y
i=1
= (x1 y2 − x2 y1 )2 + (x1 y3 − x3 y1 )2 + (x2 y3 − x3 y2 )2 .

In light of examples like Example 3.B.4 and Exercise 3.B.7, it is perhaps


not surprising that there are PSD biquadratic forms that cannot be written as
a sum of squares of polynomials. What is interesting though is that this fact
is so closely-related to the fact that there are positive linear maps that are not
decomposable (i.e., that cannot be written in the form Φ = Ψ1 + T ◦ Ψ2 for
some completely positive maps Ψ1 , Ψ2 )—a fact that we demonstrated back in
Theorem 3.A.7.

Example 3.B.6 Consider the biquadratic form q : R3 × R3 → R defined by


A PSD Biquadratic
Form That is Not a q(x, y) = x12 (y21 + y22 ) + x22 (y22 + y23 ) + x32 (y23 + y21 )
Sum of Squares − 2x1 x2 y1 y2 − 2x2 x3 y2 y3 − 2x3 x1 y3 y1 .

a) Show that q is positive semidefinite.


Compare this b) Show that q cannot be written as a sum of squares of bilinear forms.
biquadratic form
with the one from Solutions:
Example 3.B.5. a) We could prove that q is positive semidefinite directly in a manner
analogous to that used in Example 3.B.4, but we instead make use of
the relationship with matrix-valued linear maps that was described
by Theorem 3.B.3. Specifically, we notice that if ΦC : M3 → M3
3.B Extra Topic: Homogeneous Polynomials 391

This biquadratic form is the Choi map described by Theorem 3.A.7, then
(as well as the Choi
map from 
Theorem 3.A.7) was yT ΦC xxT y = x12 (y21 + y22 ) + x22 (y22 + y23 ) + x32 (y23 + y21 )
introduced by Choi − 2x1 x2 y1 y2 − 2x2 x3 y2 y3 − 2x3 x1 y3 y1 = q(x, y).
[Cho75].
However, ΦC is not bisymmetric, so we instead consider the bisym-
metric map
1 
Ψ = ΦC + T ◦ ΦC ,
2

which also has the property that q(x, y) = yT Ψ xxT y. Since ΦC is
positive, so is Ψ, so we know from Theorem 3.B.3 that q is positive
We show that q is semidefinite.
PSD “directly” in
Exercise 3.B.10. b) On the other hand, even though we know that ΦC is not decompos-
able, this does not directly imply that the map Ψ from part (a) is not
decomposable, so we cannot directly use Theorem 3.B.3 to see that
q cannot be written as a sum of squares of bilinear forms. We thus
prove this property of q directly.
If q could be written as a sum of squares of bilinear forms, then it
would have the form
r
q(x, y) = ∑ ai x1 y1 + bi x1 y2 + ci x1 y3 + di x2 y1 + ei x2 y2 + fi x2 y3
i=1
2
+ hi x3 y1 + ji x3 y2 + ki x3 y3

for some families of scalars {ai }, {bi }, . . . , {ki } ∈ R and some in-
teger r. By matching up terms of q with terms in this hypothetical
sum-of-squares representation of q, we can learn about their coeffi-
cients.
For example, since the coefficient of x12 y23 in q is 0, whereas the
coefficient of x12 y23 in this sum-of-squares representation is ∑ri=1 c2i ,
we learn that ci = 0 for all 1 ≤ i ≤ r. A similar argument with the
coefficients of x22 y21 and x32 y22 then shows that di = ji = 0 for all
1 ≤ i ≤ r as well. It follows that our hypothetical sum-of-squares
representation of q actually has the somewhat simpler form
r 2
q(x, y) = ∑ ai x1 y1 + bi x1 y2 + ei x2 y2 + fi x2 y3 + hi x3 y1 + ki x3 y3 .
i=1

Comparing the coefficients of x12 y21 , x22 y22 , and x1 x2 y1 y2 in q to this


decomposition of q then tells us that
Specifically, we are
applying the r r r
Cauchy–Schwarz
inequality to the
∑ a2i = 1, ∑ e2i = 1, and ∑ ai ei = −1.
i=1 i=1 i=1
vectors a = (a1 , . . . , ar )
and e = (e1 , . . . , er ) in The equality condition of the Cauchy–Schwarz inequality (Theo-
Rr . It says that
|a · e| = kakkek if and
rem 1.3.8) then tells us that ei = −ai for all 1 ≤ i ≤ r. A similar
only if a and e are argument with the coefficients of x12 y21 , x32 y23 , and x1 x3 y1 y3 shows that
multiples of each ki = −ai for all i, and the coefficients of x22 y22 , x32 y23 , and x2 x3 y2 y3
other. tell us that ki = −ei for all i. In particular, we have shown that
392 Chapter 3. Tensors and Multilinearity

ki = −ai , but also that ki = −ei = −(−ai ) = ai for all 1 ≤ i ≤ r. It


follows that ai = 0 for all 1 ≤ i ≤ r, which contradicts the fact that
∑ri=1 a2i = 1 and shows that no such sum-of-squares decomposition
of q is possible in the first place.

On the other hand, recall from Theorem 3.A.8 that all positive maps Φ :
Mn → Mm are decomposable when (m, n) = (2, 2), (m, n) = (2, 3), or (m, n) =
(3, 2). It follows immediately that all biquadratic forms q : Rm × Rn → R can
be written as a sum of squares of bilinear forms, under the same restrictions on
m and n. We close this section by noting the following strengthening of this
result: PSD biquadratic forms can be written as a sum of squares of bilinear
forms as long as one of m or n equals 2.

Theorem 3.B.4 Suppose q : Rm × Rn → R is a biquadratic form and min{m, n} = 2. Then


PSD Biquadratic Forms q can be written as a sum of squares of bilinear forms if and only if it is
of Few Variables positive semidefinite.

This fact is not true In particular, this result tells us via the isomorphism between biquadratic
when (m, n) = (2, 4) if forms and bisymmetric maps that if Φ : Mn (R) → Mm (R) is bisymmetric with
Φ is not
min{m, n} = 2 then Φ being positive is equivalent to it being decomposable.
bisymmetric—see
Exercise 3.C.17. We do not prove this theorem, however, as it is quite technical—the interested
reader is directed to [Cal73].

Exercises solutions to starred exercises on page 482

3.B.1 Write each of the following homogeneous polyno- ∗(c) Every positive semidefinite polynomial in HP 62 (R)
mials as a linear combination of powers of linear forms, in can be written as a sum of squares of polynomials.
the sense of Theorem 3.B.1. (d) Every positive semidefinite polynomial in HP 63 (R)
can be written as a sum of squares of polynomials.
∗(a) 3x2 + 3y2 − 2xy ∗(e) Every positive semidefinite polynomial in HP 28 (R)
(b) 2x2 + 2y2 − 3z2 − 4xy + 6xz + 6yz can be written as a sum of squares of polynomials.
∗(c) 2x3 − 9x2 y + 3xy2 − y3
(d) 7x3 + 3x2 y + 15xy2
∗(e) x 2 y + y2 z + z 2 x 3.B.4 Compute the dimension of Pnp , the vector space of
(f) 6xyz − x3 − y3 − z3 (non-homogeneous) polynomials of degree at most p in n
∗(g) 2x4 − 8x3 y − 12x2 y2 − 32xy3 − 10y4 variables.
(h) x2 y2
∗∗3.B.5 Show that the polynomial f (x, y) = x2 y2 cannot
3.B.2 Write each of the following vectors from the sym- be written as a sum of 4-th powers of linear forms.
metric subspace Snp as a linear combination of vectors of
the form v⊗p . ∗∗3.B.6 We claimed in Remark 3.B.1 that the symmetric
∗(a) (2, 3, 3, 5) ∈ S22 tensor rank of a vector w ∈ Sn2 equals its usual tensor rank:
(b) (1, 3, −1, 3, −3, 3, −1, 3, 1) ∈ S32 rank(w) = rankS (w).
∗(c) (1, −2, −2, 0, −2, 0, 0, −1) ∈ S23 a) Prove this claim if the ground field is R.
(d) (2, 0, 0, 4, 0, 4, 4, 6, 0, 4, 4, 6, 4, 6, 6, 8) ∈ S24 [Hint: Use the real spectral decomposition.]
b) Prove this claim if the ground field is C.
3.B.3 Determine which of the following statements are [Hint: Use the Takagi factorization of Exer-
true and which are false. cise 2.3.26.]
∗(a) If g is a dehomogenization of a homogeneous poly-
nomial f then the degree of g equals that of f .
(b) If f ∈ HP np (R) is non-zero and positive semidefinite
then p must be even.
3.B Extra Topic: Homogeneous Polynomials 393

∗∗ 3.B.7 In this exercise, we show that the polynomial ∗∗3.B.10 Solve Example 3.B.6(a) “directly”. That is, show
f ∈ HP 44 defined by that the biquadratic form q : R3 × R3 → R defined by
f (w, x, y, z) = w4 + x2 y2 + y2 z2 + z2 x2 − 4wxyz q(x, y) = y21 (x12 + x32 ) + y22 (x22 + x12 ) + y23 (x32 + x22 )
is positive semidefinite, but cannot be written as a sum of − 2x1 x2 y1 y2 − 2x2 x3 y2 y3 − 2x3 x1 y3 y1
squares of polynomials (much like we did for a polynomial
in HP 63 in Example 3.B.4). is positive semidefinite without appealing to Theorem 3.B.3
or the Choi map from Theorem 3.A.7.
a) Show that f is positive semidefinite.
b) Show that f cannot be written as a sum of squares of [Hint: This is hard. One approach is to fix 5 of the 6 variables
polynomials. and use the quadratic formula on the other one.]

∗∗3.B.8 Write the polynomial from Remark 3.B.2 as a ∗∗ 3.B.11 Show that the function defined in Equa-
sum of squares of rational functions. tion (3.B.2) really is an inner product.
[Hint: Multiply and divide the “obvious” PSD decomposi-
tion of f by another copy of x2 + y2 + z2 .] 3.B.12 Let v ∈ (C2 )⊗3 be the vector that we showed has
tensor rank 3 in Example 3.3.4. Show that it also has sym-
metric tensor rank 3.
∗∗3.B.9 Show that a function q : Rm × Rn → R has the
form described by Definition 3.B.2 if and only if there is a [Hint: The decompositions that we provided of the vector
quadrilinear form f : Rm × Rn × Rm × Rn → R such that from Equation (3.3.3) might help.]
q(x, y) = f (x, y, x, y) for all x ∈ Rm , y ∈ Rn .

3.C Extra Topic: Semidefinite Programming

One of the most useful tools in advanced linear algebra is an optimization method called semidefinite programming, which is a generalization of linear programming. Much like linear programming deals with maximizing or minimizing a linear function over a set of inequality constraints, semidefinite programming allows us to maximize or minimize a linear function over a set of positive semidefinite constraints.
Before getting into the details of what semidefinite programs look like or what we can do with them, we briefly introduce one piece of new notation that we use extensively throughout this section. Given Hermitian matrices A, B ∈ M_n^H, we say that A ⪰ B (or equivalently, B ⪯ A) if A − B is positive semidefinite (in which case we could also write A − B ⪰ O). We can similarly write A ≻ B, B ≺ A, or A − B ≻ O if A − B is positive definite. For example, if

\[ A = \begin{bmatrix} 1 & 1-i \\ 1+i & 3 \end{bmatrix} \quad\text{and}\quad B = \begin{bmatrix} 0 & 1 \\ 1 & 2 \end{bmatrix} \]

then A ⪰ B, since

\[ A - B = \begin{bmatrix} 1 & -i \\ i & 1 \end{bmatrix}, \]

which is positive semidefinite (its eigenvalues are 2 and 0).
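If you would like to double-check small computations like this one on a computer, a quick NumPy calculation (a sketch, assuming NumPy is installed) confirms the eigenvalues:

import numpy as np

# The matrix A - B from the example above
diff = np.array([[1, -1j],
                 [1j,  1]])

# eigvalsh is appropriate here because A - B is Hermitian
print(np.linalg.eigvalsh(diff))  # approximately [0., 2.], so A - B is PSD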
This ordering of Hermitian matrices is called the Loewner partial order, and it shares many of the same properties as the usual ordering on R (in fact, for 1 × 1 Hermitian matrices it is the usual ordering on R). For example, if A, B, C ∈ M_n^H then (we prove these basic properties of the Loewner partial order in Exercise 3.C.1):
Reflexive: it is the case that A ⪰ A,
Antisymmetric: if A ⪰ B and B ⪰ A then A = B, and
Transitive: if A ⪰ B and B ⪰ C then A ⪰ C.
In fact, these three properties are exactly the defining properties of a partial order—a relation on a set (not necessarily a set of matrices) that behaves like we would expect something that we call an “ordering” to.

However, there is one important property that the ordering on R has that is missing from the Loewner partial order: if a, b ∈ R then it is necessarily the case that either a ≥ b or b ≥ a (or both), but the analogous statement about the Loewner partial order on M_n^H does not hold. For example, if

\[ A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \quad\text{and}\quad B = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \]

then each of A − B and B − A has −1 as an eigenvalue, so A ⋡ B and B ⋡ A. Two matrices A, B ∈ M_n^H for which A ⋡ B and B ⋡ A are called non-comparable, and their existence is why this is called a “partial” order (as opposed to a “total” order like the one on R).

3.C.1 The Form of a Semidefinite Program


The idea behind semidefinite programming is to construct a matrix-valued version of linear programming. We thus start by recalling that a linear program is an optimization problem that can be put into the following form, where A ∈ M_{m,n}(R), b ∈ R^m, and c ∈ R^n are fixed, x ∈ R^n is a vector of variables that is being optimized over, and inequalities between vectors are meant entrywise (refer to any number of other books, like [Chv83] or [Joh20], for a refresher on linear programming):

    maximize:   c · x
    subject to: Ax ≤ b          (3.C.1)
                x ≥ 0

Recall (from Figure 2.6, for example) that we can think of Hermitian matrices and positive semidefinite matrices as the “matrix versions” of real numbers and non-negative real numbers, and the standard inner product on M_n^H is the Frobenius inner product ⟨C, X⟩ = tr(CX) (just like the standard inner product on R^n is the dot product ⟨c, x⟩ = c · x). Since we are working with Hermitian matrices (i.e., C, X ∈ M_n^H), we do not need the conjugate transpose in the Frobenius inner product tr(C*X). Since we now similarly think of the Loewner partial order as the “matrix version” of the ordering of real numbers, the following definition hopefully seems like a somewhat natural generalization of linear programs—it just involves replacing every operation that is specific to R or R^n with its “natural” matrix-valued generalization.

Definition 3.C.1 (Semidefinite Program, Primal Standard Form) Suppose Φ : M_n^H → M_m^H is a linear transformation, and B ∈ M_m^H and C ∈ M_n^H are Hermitian matrices. (Recall from Section 3.A that such a linear transformation Φ : M_n^H → M_m^H is called Hermiticity-preserving.) The semidefinite program (SDP) associated with Φ, B, and C is the following optimization problem over the matrix variable X ∈ M_n^H:

    maximize:   tr(CX)
    subject to: Φ(X) ⪯ B          (3.C.2)
                X ⪰ O

Furthermore, this is called the primal standard form of the semidefinite program.

Before jumping into specific examples of semidefinite programs, it is worth emphasizing that they really do generalize linear programs. In particular, we can write the linear program (3.C.1) in the form of the semidefinite program (3.C.2) by defining B = diag(b), C = diag(c), and Φ : M_n^H → M_m^H by

\[
  \Phi(X) = \begin{bmatrix}
    \sum_{j=1}^{n} a_{1,j} x_{j,j} & 0 & \cdots & 0 \\
    0 & \sum_{j=1}^{n} a_{2,j} x_{j,j} & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \sum_{j=1}^{n} a_{m,j} x_{j,j}
  \end{bmatrix}.
\]

(Semidefinite programs over real symmetric matrices in M_n^S are fine too—for example, we could just add the linear constraint X = X^T to a complex-valued SDP to make it real-valued.)

A routine calculation then shows that the semidefinite program (3.C.2) is equivalent to the original linear program (3.C.1)—we have just stretched each vector in the original linear program out along the diagonal of a matrix and used the fact that a diagonal matrix is PSD if and only if its entries are non-negative.
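To see this equivalence in action, here is a hedged numerical sketch (not from the text; it assumes NumPy and CVXPY with an SDP-capable solver such as SCS are installed) that solves a small linear program, the same one that appears later in Figure 3.5, both directly and via the diagonal embedding described above. The two optimal values agree.

import numpy as np
import cvxpy as cp

# A small linear program: maximize c.x subject to Ax <= b, x >= 0
A = np.array([[1.0, 1.0], [-1.0, 1.0]])
b = np.array([3.0, 1.0])
c = np.array([1.0, 2.0])

# Solve the LP directly
x = cp.Variable(2, nonneg=True)
lp = cp.Problem(cp.Maximize(c @ x), [A @ x <= b])
lp.solve()

# Solve the same problem as an SDP: stretch the vectors along diagonals
X = cp.Variable((2, 2), symmetric=True)
Phi = cp.diag(A @ cp.diag(X))          # the map Phi(X) defined above
sdp = cp.Problem(cp.Maximize(cp.trace(np.diag(c) @ X)),
                 [np.diag(b) - Phi >> 0,   # Phi(X) <= diag(b) in the Loewner order
                  X >> 0])                 # X >= O
sdp.solve()

print(lp.value, sdp.value)  # both approximately 5.0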
Basic Manipulations into Primal Standard Form
Just as was the case with linear programs, the primal standard form of the
semidefinite program (3.C.2) is not quite as restrictive as it appears at first
glance. For example, we can allow for multiple constraints simply by making
use of block matrices and matrix-valued linear maps that act on block matrices,
as we now demonstrate.

Example 3.C.1 (Turning Multiple Semidefinite Constraints Into One) Suppose C ∈ M_n^H. Write the following optimization problem as a semidefinite program in primal standard form:

    maximize:   tr(CX)
    subject to: tr(X) = 3
                X ⪯ I
                X ⪰ O

(In fact, if n ≥ 3 then this semidefinite program computes the sum of the 3 largest eigenvalues of C—see Exercise 3.C.12.)

Solution:
We first split the constraint tr(X) = 3 into the pair of “≤” constraints tr(X) ≤ 3 and −tr(X) ≤ −3, just like we would if we were trying to write a linear program in primal standard form. We can now rewrite the three inequality constraints tr(X) ≤ 3, −tr(X) ≤ −3, and X ⪯ I as the single matrix constraint Φ(X) ⪯ B (and thus put the SDP into primal standard form) if we define

\[
  \Phi(X) = \begin{bmatrix} X & 0 & 0 \\ 0^T & \mathrm{tr}(X) & 0 \\ 0^T & 0 & -\mathrm{tr}(X) \end{bmatrix}
  \quad\text{and}\quad
  B = \begin{bmatrix} I & 0 & 0 \\ 0^T & 3 & 0 \\ 0^T & 0 & -3 \end{bmatrix}.
\]
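For readers who want to experiment, the following is a hedged CVXPY sketch (not from the text; CVXPY and an SDP-capable solver such as SCS are assumed, and the matrix C below is just a randomly generated example) of the optimization problem in this example, stated before it is massaged into primal standard form. As noted above, for n ≥ 3 its optimal value matches the sum of the 3 largest eigenvalues of C.

import numpy as np
import cvxpy as cp

n = 4
rng = np.random.default_rng(0)
G = rng.standard_normal((n, n))
C = (G + G.T) / 2                       # a random real symmetric C

X = cp.Variable((n, n), symmetric=True)
constraints = [cp.trace(X) == 3,        # tr(X) = 3
               np.eye(n) - X >> 0,      # X <= I in the Loewner order
               X >> 0]                  # X >= O
prob = cp.Problem(cp.Maximize(cp.trace(C @ X)), constraints)
prob.solve()

print(prob.value)                                  # SDP optimal value
print(np.sort(np.linalg.eigvalsh(C))[-3:].sum())   # sum of 3 largest eigenvalues of C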

More generally, multiple positive semidefinite constraints can be merged


into a single positive semidefinite constraint simply by placing matrices along
the diagonal blocks of a larger matrix—this works because a block diagonal
matrix is positive semidefinite if and only if each of its diagonal blocks is
positive semidefinite (see Exercise 2.2.13).
We make use of much of the same terminology when discussing semidefi-
nite programs as we do for linear programs. The objective function of an SDP
is the function being maximized or minimized (i.e., tr(CX) if it is written in

standard form) and its optimal value is the maximal or minimal value that the objective function can attain subject to the constraints (i.e., it is the “solution” of the semidefinite program). A feasible point is a matrix X ∈ M_n^H that satisfies all of the constraints of the SDP (i.e., Φ(X) ⪯ B and X ⪰ O), and the feasible region is the set consisting of all feasible points. One wrinkle that occurs for SDPs that did not occur for linear programs is that the maximum or minimum in a semidefinite program might not be attained (i.e., by “maximum” we really mean “supremum” and by “minimum” we really mean “infimum”)—see Example 3.C.6.

In addition to turning multiple constraints and matrices into a single block-diagonal matrix and constraint, as in the previous example, we can also use techniques similar to those that are used for linear programs to transform a wide variety of optimization problems into the standard form of semidefinite programs. All of these modifications are completely analogous to how we can manipulate inequalities and equalities involving real numbers. For example:
• We can turn a minimization problem into a maximization problem by multiplying the objective function by −1 and then multiplying the resulting optimal value by −1, as illustrated in Figure 3.4.
• We can turn an “=” constraint into a pair of “⪯” and “⪰” constraints (since the Loewner partial order is antisymmetric).
• We can turn a “⪰” constraint into a “⪯” constraint by multiplying it by −1 and flipping the direction of the inequality (see Exercise 3.C.1(e)).
• We can turn an unconstrained (i.e., not necessarily PSD) matrix variable X into a pair of PSD matrix variables by setting X = X^+ − X^− where X^+, X^− ⪰ O (see Exercise 2.2.16).

[Figure: a number line on which −tr(CX) and tr(CX) appear as mirror images of each other.]

Figure 3.4: Minimizing tr(CX) is essentially the same as maximizing −tr(CX); the final answers just differ by a minus sign.

Example 3.C.2 (Writing a Semidefinite Program in Primal Standard Form) Suppose C, D ∈ M_n^H and Φ, Ψ : M_n^H → M_n^H are linear. Write the following semidefinite program (in the variables X, Y ∈ M_n^H) in primal standard form:

    minimize:   tr(CX) + tr(DY)
    subject to: X + Ψ(Y) = I
                Φ(X) + Y ⪰ O
                X ⪰ O

Solution:
Since Y is unconstrained, we replace it by the pair of PSD variables Y^+ and Y^− via Y = Y^+ − Y^− and Y^+, Y^− ⪰ O. Making this change puts the SDP into the form (here we used the fact that Ψ(Y^+ − Y^−) = Ψ(Y^+) − Ψ(Y^−), since Ψ is linear)

    minimize:   tr(CX) + tr(DY^+) − tr(DY^−)
    subject to: X + Ψ(Y^+) − Ψ(Y^−) = I
                Φ(X) + Y^+ − Y^− ⪰ O
                X, Y^+, Y^− ⪰ O

Next, we change the minimization into a maximization by multiplying the objective function by −1 and also placing a minus sign in front of the entire SDP. We also change the equality constraint X + Ψ(Y^+) − Ψ(Y^−) = I into the pair of inequality constraints X + Ψ(Y^+) − Ψ(Y^−) ⪯ I and X + Ψ(Y^+) − Ψ(Y^−) ⪰ I, and then convert both of the “⪰” constraints into “⪯” constraints by multiplying them through by −1. After making these changes, the SDP has the form (converting an SDP into primal standard form perhaps makes it look uglier, but is useful for theoretical reasons):

    − maximize:   −tr(CX) − tr(DY^+) + tr(DY^−)
      subject to: X + Ψ(Y^+) − Ψ(Y^−) ⪯ I
                  −X − Ψ(Y^+) + Ψ(Y^−) ⪯ −I
                  −Φ(X) − Y^+ + Y^− ⪯ O
                  X, Y^+, Y^− ⪰ O

At this point, the SDP is essentially in primal standard form—all that remains is to collect the various pieces of it into block diagonal matrices. In particular, the version of this SDP in primal standard form optimizes over a positive semidefinite matrix variable X̃ ∈ M_{3n}^H, which we partition as a block matrix as follows (we use asterisks (∗) to denote entries whose values we do not care about—they might be non-zero, but they are so unimportant that they do not deserve names):

\[
  \tilde{X} = \begin{bmatrix} X & * & * \\ * & Y^+ & * \\ * & * & Y^- \end{bmatrix}.
\]

We also define B̃, C̃ ∈ M_{3n}^H and Φ̃ : M_{3n}^H → M_{3n}^H as follows, using dots (·) to denote entries equal to 0:

\[
  \tilde{B} = \begin{bmatrix} I & \cdot & \cdot \\ \cdot & -I & \cdot \\ \cdot & \cdot & O \end{bmatrix}, \quad
  \tilde{C} = \begin{bmatrix} -C & \cdot & \cdot \\ \cdot & -D & \cdot \\ \cdot & \cdot & D \end{bmatrix}, \quad\text{and}
\]
\[
  \tilde{\Phi}(\tilde{X}) = \begin{bmatrix} X + \Psi(Y^+ - Y^-) & \cdot & \cdot \\ \cdot & -X - \Psi(Y^+ - Y^-) & \cdot \\ \cdot & \cdot & -\Phi(X) - Y^+ + Y^- \end{bmatrix}.
\]

After making this final substitution, the original semidefinite program can be written in primal standard form as follows:

    − maximize:   tr(C̃X̃)
      subject to: Φ̃(X̃) ⪯ B̃
                  X̃ ⪰ O
Some Less Obvious Conversions of SDPs


None of the transformations of optimization problems into SDPs in primal
standard form that we have seen so far are particularly surprising—they pretty
much just amount to techniques that are carried over from linear program-
ming, together with some facts about block diagonal matrices and a lot of
bookkeeping. However, one of the most remarkable things about semidefinite
programming is that it can be used to compute quantities that at first glance do
not even seem linear.
For example, recall the operator norm of a matrix A ∈ M_{m,n}(F) (where F = R or F = C), which we investigated back in Section 2.3.3 and which is defined by

\[
  \|A\| = \max_{v \in F^n} \{ \|Av\| : \|v\| = 1 \}.
\]

As written, this quantity does not appear to be amenable to semidefinite programming, since the quantity ‖Av‖ that is being maximized is not linear in v. However, recall from Exercise 2.3.15 that if x ∈ R is a scalar then ‖A‖ ≤ x if and only if

\[
  \begin{bmatrix} xI_m & A \\ A^* & xI_n \end{bmatrix} \succeq O.
\]

This leads immediately to the following semidefinite program for computing the operator norm of A:

    minimize:   x
    subject to: \begin{bmatrix} xI_m & A \\ A^* & xI_n \end{bmatrix} ⪰ O          (3.C.3)
                x ≥ 0
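As a quick sanity check, the SDP (3.C.3) can be handed to a solver more or less verbatim. The following sketch (an assumption-laden example, not from the text; it assumes NumPy and CVXPY are installed, and the matrix A is made up) encodes the block-matrix constraint with an auxiliary symmetric variable and compares the optimal value with the largest singular value of A.

import numpy as np
import cvxpy as cp

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
m, n = A.shape

x = cp.Variable(nonneg=True)
Z = cp.Variable((m + n, m + n), symmetric=True)   # plays the role of the block matrix
constraints = [Z[:m, :m] == x * np.eye(m),        # top-left block:   x * I_m
               Z[:m, m:] == A,                    # top-right block:  A
               Z[m:, m:] == x * np.eye(n),        # bottom-right:     x * I_n
               Z >> 0]                            # the block matrix is PSD
prob = cp.Problem(cp.Minimize(x), constraints)
prob.solve()

print(prob.value)            # optimal value of the SDP (3.C.3)
print(np.linalg.norm(A, 2))  # largest singular value of A; the two should agree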

To truly convince ourselves that this is a semidefinite program, we could write it in the primal standard form of Definition 3.C.1 by defining B ∈ M_{m+n}^H, C ∈ M_1^H, and Φ : M_1^H → M_{m+n}^H by

\[
  B = \begin{bmatrix} O & A \\ A^* & O \end{bmatrix}, \quad C = -1, \quad\text{and}\quad
  \Phi(x) = \begin{bmatrix} -xI_m & O \\ O & -xI_n \end{bmatrix}.
\]

(The set M_1^H consists of the 1 × 1 Hermitian matrices. That is, it equals R, the set of real numbers.)
real numbers. A O O −xIn

However, writing the SDP explicitly in standard form like this is perhaps not
terribly useful—once an optimization problem has been written in a form
involving a linear objective function and only linear entrywise and positive
semidefinite constraints, the fact that it is an SDP (i.e., can be converted into
primal standard form) is usually clear.
While computing the operator norm of a matrix via semidefinite programming is not really a wise choice (it is much quicker to just compute it via the fact that ‖A‖ equals the largest singular value of A), this same technique lets us use the operator norm in the objective function or in the constraints of SDPs. For example, if A ∈ M_n then the optimization problem

    minimize:   ‖Y − A‖
    subject to: y_{j,j} = a_{j,j} for all 1 ≤ j ≤ n          (3.C.4)
                Y ⪰ O

(in words, this SDP finds the closest, in the sense of the operator norm, PSD matrix to A that has the same diagonal entries as A) is a semidefinite program, since it can be written in the form

    minimize:   x
    subject to: \begin{bmatrix} xI_n & Y - A \\ (Y - A)^* & xI_n \end{bmatrix} ⪰ O
                y_{j,j} = a_{j,j} for all 1 ≤ j ≤ n
                x ≥ 0, Y ⪰ O,

which in turn could be written in the primal standard form of Definition 3.C.1
if desired.
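In a modeling package, problem (3.C.4) can also be stated almost verbatim, without introducing the auxiliary scalar x or the block matrix by hand. A hedged sketch (the matrix A below is an arbitrary symmetric example; CVXPY's sigma_max atom is used for the operator norm):

import numpy as np
import cvxpy as cp

A = np.array([[ 2.0, -1.0,  0.5],
              [-1.0,  1.0, -2.0],
              [ 0.5, -2.0,  1.0]])    # an arbitrary symmetric test matrix
n = A.shape[0]

Y = cp.Variable((n, n), PSD=True)                 # Y >= O
constraints = [cp.diag(Y) == np.diag(A)]          # matching diagonal entries
prob = cp.Problem(cp.Minimize(cp.sigma_max(Y - A)), constraints)
prob.solve()

print(prob.value)   # distance, in operator norm, from A to the nearest such Y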
However, we have to be slightly more careful here than with our earlier
manipulations of SDPs—the fact that the optimization problem (3.C.4) is a
semidefinite program relies crucially on the fact that we are minimizing the
norm in the objective function. The analogous maximization problem is not

a semidefinite program, since it would involve a maximization over Y and a minimization over x—it would look something like

    maximize:   ( minimize:   x
                  subject to: \begin{bmatrix} xI_n & Y - A \\ (Y - A)^* & xI_n \end{bmatrix} ⪰ O
                              x ≥ 0 )
    subject to: y_{j,j} = a_{j,j} for all 1 ≤ j ≤ n
                Y ⪰ O.

(In the outer maximization here, we maximize over the variable Y, and in the inner minimization, we minimize over the variable x, for each particular choice of fixed Y.)
Optimization problems like this, which combine both maximizations and mini-
mizations, typically cannot be represented as semidefinite programs. Instead,
in an SDP we must maximize or minimize over all variables.

Remark 3.C.1 (Semidefinite Programs and Convexity) One way to think about the fact that we can use semidefinite programming to minimize the operator norm, but not maximize it, is via the fact that it is convex:

\[
  \|(1 - t)A + tB\| \le (1 - t)\|A\| + t\|B\| \quad \text{for all } A, B \in M_{m,n},\ 0 \le t \le 1.
\]

Generally speaking, convex functions are easy to minimize since their graphs “open up” (for an introduction to convex functions, see Appendix A.5.2). It follows that any local minimum that they have is necessarily a global minimum, so we can minimize them simply by following any path on the graph that leads down. We can think of minimizing a convex function just like rolling a marble down the side of a bowl—the marble never does anything clever, but it finds the global minimum every time anyway.

[Figure: the graph of a convex function y = f(x) and of a convex surface z = f(x, y), each of which “opens up”.]

However, finding the maximum of a convex function might be hard, since there can be multiple different local maxima (for example, if we stand at the bottom of the parabola above on the left and can only see nearby, it is not clear if we should walk left or walk right to reach the highest point in the plotted domain). For a similar reason, concave functions (i.e., functions f for which −f is convex) are easy to maximize, since their graphs “open down”. Some standard examples of concave functions include f(x) = √x and g(x) = log(x).

Similarly, a constraint like ‖X‖ ≤ 2 in a semidefinite program is OK, since the operator norm is convex and the upper bound of 2 just chops off the irrelevant top portion of its graph (i.e., it does not interfere with convexity, connectedness of the feasible region, or where its global minimum is located). However, constraints like ‖X‖ ≥ 2 involving lower bounds on convex functions are not allowed in semidefinite programs, since the resulting feasible region might be extremely difficult to search over—it could have holes and many different local minima.

There are similar semidefinite programs for computing, maximizing, minimizing, and bounding many other linear algebraic quantities of interest, such as the Frobenius norm (well, its square anyway—see Exercise 3.C.9), the trace norm (see Exercise 2.3.17 and the upcoming Example 3.C.5), or the maximum or minimum eigenvalue (see Exercise 3.C.3). However, we must also be slightly careful about where we place these quantities in a semidefinite program. Norms are all convex (though not every norm can be computed via semidefinite programming), so they can be placed in a “minimize” objective function or on the smaller half of a “≤” constraint, while the minimum eigenvalue is concave and thus can be placed in a “maximize” objective function or on the larger half of a “≥” constraint. The possible placements of these quantities are summarized in Table 3.4. We use λ_max(X) and λ_min(X) to refer to the maximal and minimal eigenvalues of X, respectively (in order to use these functions, X must be Hermitian so that its eigenvalues are real).

    Function     | Convexity | Objective func. | Constraint type
    -------------|-----------|-----------------|----------------
    ‖X‖          | convex    | min.            | “≤”
    ‖X‖_F^2      | convex    | min.            | “≤”
    ‖X‖_tr       | convex    | min.            | “≤”
    λ_max(X)     | convex    | min.            | “≤”
    λ_min(X)     | concave   | max.            | “≥”

Table 3.4: A summary of the convexity/concavity of some functions of a matrix variable X, as well as what type of objective function they can be placed in and what type of constraint they can be placed on the left-hand-side of.

For example, the two optimization problems (where A ∈ M_n^H is fixed and X ∈ M_n^H is the matrix variable)

    minimize:   λ_max(X)               maximize:   λ_min(A − X)
    subject to: ‖A − X‖_F ≤ 2          subject to: λ_max(X) + ‖X‖_tr ≤ 1
                X ⪰ O                              X ⪰ O

are both semidefinite programs, whereas neither of the following two optimization problems is (the left problem is not an SDP due to the norm equality constraint and the right problem is not an SDP due to maximizing λ_max):

    minimize:   tr(X)                  maximize:   λ_max(A − X)
    subject to: ‖A − X‖_tr = 1         subject to: X ⪯ A
                X ⪰ O                              X ⪰ O.
algebraic quantities, particularly those involving eigenvalues, singular val-
ues, and/or norms, can be incorporated into semidefinite programs—see Exer-
cises 3.C.10 and 3.C.13.

3.C.2 Geometric Interpretation and Solving


Semidefinite programs have a natural geometric interpretation that is com-
pletely analogous to that of linear programs. Recall that the feasible region of a
linear program is a convex polyhedron—a convex shape with flat sides—and
optimizing the objective function can be thought of as moving a line (or plane,
or hyperplane, ...) in one direction as far as possible while still intersecting the
feasible region. Furthermore, the optimal value of a linear program is always
attained at a corner of the feasible region, as indicated in Figure 3.5.
For semidefinite programs, all that changes is that the feasible region might
no longer be a polyhedron. Instead of having flat sides, the edges of the feasible
region might “bow out” (though it must still be convex)—we call these shapes

[Figure: two copies of the feasible region bounded by x + y ≤ 3, y − x ≤ 1, and x, y ≥ 0; the level lines x + 2y = c (for c = −3, −1, 1, 3, 5, 7, 9, 11) sweep across the region and last touch it at the corner (x, y) = (1, 2).]

Figure 3.5: The feasible region of the linear program that asks us to maximize x + 2y
subject to the constraints x + y ≤ 3, y − x ≤ 1, and x, y ≥ 0. Its optimal value is 5, which
is attained at (x, y) = (1, 2).

spectrahedra, and their boundaries are defined by polynomial equations (in


much the same way that the sides of polyhedra are defined by linear equations).
Since the objective function tr(CX) of a semidefinite program is linear, we
can still think of optimizing it as moving a line (or plane, or hyperplane, ...)
in one direction as far as possible while still intersecting the feasible region.
However, because of its bowed out sides, the optimal value might no longer be attained at a corner of the feasible region (in fact, the feasible region of an SDP might not even have any corners—it could be a filled circle or sphere, for example). For example, consider the following semidefinite program in the variables x, y ∈ R:

    maximize:   x
    subject to: \begin{bmatrix} x+1 & y \\ y & 1 \end{bmatrix} ⪰ O,  \begin{bmatrix} x & y-1 \\ y-1 & 3-x \end{bmatrix} ⪰ O          (3.C.5)
                y ≥ 0.

The feasible region of this SDP is displayed in Figure 3.6, and its optimal value
is x = 3, which is attained on one of the curved boundaries of the feasible
region (not at one of its corners).

[Figure: the feasible region of the SDP (3.C.5), bounded by the curves x + 1 ≥ y^2 and x(3 − x) ≥ (y − 1)^2; the vertical level lines x = −1, x = 1, x = 3 sweep across it and last touch it at the point (x, y) = (3, 1).]

Sylvester’s criterion (Theorem 2.2.6) can turn any positive semidefinite constraint into a family of polynomial constraints in the matrix’s entries. For example, Theorem 2.2.7 says that

\[ \begin{bmatrix} x+1 & y \\ y & 1 \end{bmatrix} \succeq O \]

if and only if x + 1 ≥ y^2.

Figure 3.6: The feasible region of the semidefinite program (3.C.5). Its optimal value is 3, which is attained at (x, y) = (3, 1). Notably, this point is not a corner of the feasible region.

This geometric wrinkle that semidefinite programs have over linear programs has important implications when it comes to solving them. Recall that the standard method of solving a linear program is to use the simplex method, which works by jumping from corner to adjacent corner of the feasible region so as to increase the objective function by as much as possible at each step. This method relies crucially on the fact that the optimal value of the linear program occurs at a corner of the feasible region, so it does not generalize straightforwardly to semidefinite programs. (If the coefficients in a linear program are rational then so is its optimal value. This statement is not true of semidefinite programs.)

While the simplex method for linear programs always terminates in a finite number of steps and can produce an exact description of its optimal value, no such algorithm for semidefinite programs is known. There are efficient methods for numerically approximating the optimal value of an SDP to any desired accuracy, but these algorithms are not really practical to run by hand—they are instead implemented by various computer software packages, and we do not explore these methods here. The CVX package [CVX12] for MATLAB and the CVXPY package [DCAB18] for Python can be used to solve semidefinite programs, for example (both packages are free). There are entire books devoted to semidefinite programming (and convex optimization in general), so the interested reader is directed to [BV09] for a more thorough treatment. We are interested in semidefinite programming primarily for its duality theory, which often lets us find the optimal solution of an SDP analytically.
find the optimal solution of an SDP analytically.

3.C.3 Duality
Just as was the case for linear programs, semidefinite programs have a robust
duality theory. However, since we do not have simple methods for explicitly
solving semidefinite programs by hand like we do for linear programs, we will
find that duality plays an even more important role in this setting.

Definition 3.C.2 (Dual of a Semidefinite Program) Suppose B ∈ M_m^H, C ∈ M_n^H, Φ : M_n^H → M_m^H is linear, and X ∈ M_n^H and Y ∈ M_m^H are matrix variables. The dual of a semidefinite program

    maximize:   tr(CX)
    subject to: Φ(X) ⪯ B
                X ⪰ O

is the semidefinite program

    minimize:   tr(BY)
    subject to: Φ*(Y) ⪰ C
                Y ⪰ O

(The dual of the dual of an SDP is itself. Four things “flip” when constructing a dual problem: “maximize” becomes “minimize”, Φ turns into Φ*, the “⪯” constraint becomes “⪰”, and B and C switch spots.)

The original semidefinite program in Definition 3.C.2 is called the primal problem, and the two of them together are called a primal/dual pair. Although constructing the dual of a semidefinite program is a rather routine and mechanical affair, keep in mind that the above definition only applies once the SDP is written in primal standard form (fortunately, we already know how to convert any SDP into that form). Also, even though we can construct the adjoint Φ* of any linear map Φ : M_n^H → M_m^H via Corollary 3.A.5, it is often quicker and easier to just “eyeball” a formula for the adjoint and then check that it is correct.
that it is correct.

Example 3.C.3 (Constructing the Dual of an SDP) Construct the dual of the semidefinite program (3.C.3) that computes the operator norm of a matrix A ∈ M_{m,n}.

Solution:
Our first step is to write this SDP in primal standard form, which we recall can be done by defining

\[
  B = \begin{bmatrix} O & A \\ A^* & O \end{bmatrix}, \quad C = -1, \quad\text{and}\quad
  \Phi(x) = \begin{bmatrix} -xI_m & O \\ O & -xI_n \end{bmatrix}.
\]

The adjoint of Φ is a linear map from M_{m+n}^H to M_1^H, and it must satisfy (again, we use asterisks (∗) to denote entries that are irrelevant for our purposes)

\[
  \mathrm{tr}\!\left( x\,\Phi^*\!\begin{bmatrix} Y & * \\ * & Z \end{bmatrix} \right)
  = \mathrm{tr}\!\left( \Phi(x) \begin{bmatrix} Y & * \\ * & Z \end{bmatrix} \right)
  = \mathrm{tr}\!\left( \begin{bmatrix} -xI_m & O \\ O & -xI_n \end{bmatrix} \begin{bmatrix} Y & * \\ * & Z \end{bmatrix} \right)
  = -x\,\mathrm{tr}(Y) - x\,\mathrm{tr}(Z)
\]

for all Y ∈ M_m^H and Z ∈ M_n^H. We can see from inspection that one map (and thus the map, since adjoints are unique by Theorem 1.4.8) Φ* that works is given by the formula

\[
  \Phi^*\!\begin{bmatrix} Y & * \\ * & Z \end{bmatrix} = -\mathrm{tr}(Y) - \mathrm{tr}(Z).
\]

It follows that the dual of the original semidefinite program has the following form (the minus sign in front of this minimization comes from the fact that we had to convert the original SDP from a minimization to a maximization in order to put it into primal standard form; this is also why C = −1 instead of C = 1):

    − minimize:   tr( \begin{bmatrix} O & A \\ A^* & O \end{bmatrix} \begin{bmatrix} Y & X \\ X^* & Z \end{bmatrix} )
      subject to: \begin{bmatrix} Y & X \\ X^* & Z \end{bmatrix} ⪰ O
                  −tr(Y) − tr(Z) ≥ −1

After simplifying things somewhat, this dual problem can be written in the somewhat prettier (but equivalent) form

    maximize:   Re(tr(AX^*))
    subject to: \begin{bmatrix} Y & -X \\ -X^* & Z \end{bmatrix} ⪰ O
                tr(Y) + tr(Z) ≤ 2

Just as is the case with linear programs, the dual of a semidefinite program
is remarkable for the fact that it can provide us with upper bounds on the
optimal value of the primal problem (and the primal problem similarly provides
lower bounds to the optimal value of the dual problem):

Theorem 3.C.1 (Weak Duality) If X ∈ M_n^H is a feasible point of a semidefinite program in primal standard form, and Y ∈ M_m^H is a feasible point of its dual problem, then tr(CX) ≤ tr(BY).

Proof. Since Y and B − Φ(X) are both positive semidefinite, we know from Exercise 2.2.19 that

\[
  0 \le \mathrm{tr}\big((B - \Phi(X))Y\big) = \mathrm{tr}(BY) - \mathrm{tr}\big(\Phi(X)Y\big) = \mathrm{tr}(BY) - \mathrm{tr}\big(X\Phi^*(Y)\big).
\]

(Recall that tr(Φ(X)Y) = tr(XΦ*(Y)) simply because we are working in the Frobenius inner product and Φ* is, by definition, the adjoint of Φ in this inner product.) It follows that tr(BY) ≥ tr(XΦ*(Y)). Then using the fact that X and Φ*(Y) − C are both positive semidefinite, a similar argument shows that

\[
  0 \le \mathrm{tr}\big(X(\Phi^*(Y) - C)\big) = \mathrm{tr}\big(X\Phi^*(Y)\big) - \mathrm{tr}(XC),
\]

so tr(XΦ*(Y)) ≥ tr(XC). Stringing these two inequalities together shows that

\[
  \mathrm{tr}(BY) \ge \mathrm{tr}\big(X\Phi^*(Y)\big) \ge \mathrm{tr}(XC) = \mathrm{tr}(CX),
\]

as desired. ∎
Weak duality not only provides us with a way of establishing upper bounds
on the optimal value of our semidefinite program, but it often also lets us easily
determine when we have found its optimal value. In particular, if we can find
feasible points of the primal and dual problems that give the same value when
plugged into their respective objective functions, they must be optimal, since
they cannot possibly be increased or decreased past each other (see Figure 3.7).

[Figure: a number line on which values of the primal (maximize) objective tr(CX) sit to the left of values of the dual (minimize) objective tr(BY).]

Figure 3.7: Weak duality says that the objective function of the primal (maxi-
mization) problem cannot be increased past the objective function of the dual
(minimization) problem.

The following example illustrates how we can use this feature of weak
duality to solve semidefinite programs, at least in the case when they are simple
enough that we can spot what we think the solution should be. That is, weak
duality can help us verify that a conjectured optimal solution really is optimal.

Example 3.C.4 (Solving an SDP via Weak Duality) Use weak duality to solve the following SDP in the variable X ∈ M_3^H:

    maximize:   tr( \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} X )
    subject to: O ⪯ X ⪯ I

Solution:
The objective function of this semidefinite program just adds up the off-diagonal entries of X (more precisely, it adds up the real parts of the off-diagonal entries of X), and the constraint O ⪯ X ⪯ I says that every eigenvalue of X is between 0 and 1 (see Exercise 3.C.3). Roughly speaking, we thus want to find a PSD matrix X with small diagonal entries and large off-diagonal entries. One matrix that seems to work fairly well is

\[
  X = \frac{1}{3}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix},
\]

which has eigenvalues 1, 0, and 0 (and is thus feasible), and produces a value of 2 in the objective function.

To show that this choice of X is optimal, we construct the dual of this SDP, which has the form (the map Φ in this SDP is the identity, so Φ* is the identity as well)

    minimize:   tr(Y)
    subject to: Y ⪰ \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
                Y ⪰ O

Since the matrix on the right-hand-side of the constraint in this dual SDP has eigenvalues 2, −1, and −1, we can find a positive semidefinite matrix Y satisfying that constraint simply by increasing the negative eigenvalues to 0 (while keeping the corresponding eigenvectors the same). That is, if we have the spectral decomposition

\[
  \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
  = U \begin{bmatrix} 2 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{bmatrix} U^*,
\]

then we choose

\[
  Y = U \begin{bmatrix} 2 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} U^*,
\]

which has tr(Y) = 2. In other words, we choose Y to be the positive semidefinite part of the constraint matrix, in the sense of Exercise 2.2.16; explicitly,

\[
  Y = \frac{2}{3}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}.
\]

Since we have now found feasible points of both the primal and dual problems that attain the same objective value of 2, we know that this must in fact be the optimal value of both problems.
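If we want extra reassurance, both the primal and dual problems in this example are easy to solve numerically. The following sketch (not from the text; it assumes CVXPY with an SDP-capable solver such as SCS) reports an optimal value of approximately 2 for each problem.

import numpy as np
import cvxpy as cp

C = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

# Primal: maximize tr(CX) subject to O <= X <= I
X = cp.Variable((3, 3), symmetric=True)
primal = cp.Problem(cp.Maximize(cp.trace(C @ X)),
                    [X >> 0, np.eye(3) - X >> 0])
primal.solve()

# Dual: minimize tr(Y) subject to Y >= C and Y >= O
Y = cp.Variable((3, 3), symmetric=True)
dual = cp.Problem(cp.Minimize(cp.trace(Y)),
                  [Y - C >> 0, Y >> 0])
dual.solve()

print(primal.value, dual.value)   # both approximately 2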

As suggested by the previous example, weak duality is useful for the fact
that it can often be used to solve a semidefinite program without actually
performing any optimization at all, as long as the solution is simple enough
that we can eyeball feasible points of each of the primal and dual problems that
attain the same value. We now illustrate how this procedure can be used to give
us new characterizations of linear algebraic objects.

Example 3.C.5 (Using Weak Duality to Understand an SDP for the Trace Norm) Suppose A ∈ M_{m,n}. Use weak duality to show that the following semidefinite program in the variables X ∈ M_m^H and Y ∈ M_n^H computes ‖A‖_tr (the trace norm of A, which was introduced in Exercise 2.3.17):

    minimize:   (tr(X) + tr(Y))/2
    subject to: \begin{bmatrix} X & A \\ A^* & Y \end{bmatrix} ⪰ O
                X, Y ⪰ O

Solution:
To start, we append rows or columns consisting entirely of zeros to A
so as to make it square. Doing so does not affect kAktr or any substantial

details of this SDP, but makes its analysis a bit cleaner.


To show that the optimal value of this SDP is bounded above by ‖A‖_tr, we just need to find a feasible point that attains this quantity in the objective function. To this end, let A = UΣV* be a singular value decomposition of A. We then define X = UΣU* and Y = VΣV*. Then

\[
  \mathrm{tr}(X) = \|U\Sigma U^*\|_{\mathrm{tr}} = \|\Sigma\|_{\mathrm{tr}} = \|A\|_{\mathrm{tr}}, \quad\text{and}\quad
  \mathrm{tr}(Y) = \|V\Sigma V^*\|_{\mathrm{tr}} = \|\Sigma\|_{\mathrm{tr}} = \|A\|_{\mathrm{tr}},
\]

so (tr(X) + tr(Y))/2 = ‖A‖_tr and thus X and Y produce the desired value in the objective function. Furthermore, X and Y are feasible since they are positive semidefinite and so is

\[
  \begin{bmatrix} X & A \\ A^* & Y \end{bmatrix}
  = \begin{bmatrix} U\Sigma U^* & U\Sigma V^* \\ V\Sigma U^* & V\Sigma V^* \end{bmatrix}
  = \begin{bmatrix} U\sqrt{\Sigma} \\ V\sqrt{\Sigma} \end{bmatrix}
    \begin{bmatrix} U\sqrt{\Sigma} \\ V\sqrt{\Sigma} \end{bmatrix}^{\!*},
\]

where √Σ is the entrywise square root of Σ. (If A, and thus Σ, is not square then either U√Σ or V√Σ needs to be padded with some zero columns for this decomposition to make sense.)

To prove the opposite inequality (i.e., to show that this SDP is bounded below by ‖A‖_tr), we first need to construct its dual. To this end, we first explicitly list the components of its primal standard form (3.C.2):

\[
  B = \begin{bmatrix} O & A \\ A^* & O \end{bmatrix}, \quad
  C = \frac{-1}{2}\begin{bmatrix} I_m & O \\ O & I_n \end{bmatrix}, \quad\text{and}\quad
  \Phi\!\begin{bmatrix} X & * \\ * & Y \end{bmatrix} = \begin{bmatrix} -X & O \\ O & -Y \end{bmatrix}.
\]

(Here, C is negative because the given SDP is a minimization problem, so we have to multiply the objective function by −1 to turn it into a maximization problem, i.e., to put it into primal standard form.) It is straightforward to show that Φ* = Φ (recall that the fact that Φ = Φ* means that Φ is called self-adjoint), so the dual SDP (after simplifying a bit) has the form

    maximize:   Re(tr(AZ))
    subject to: \begin{bmatrix} X & Z \\ Z^* & Y \end{bmatrix} ⪰ O
                X, Y ⪯ I

Next, we find a feasible point of this SDP that attains the desired value of ‖A‖_tr in the objective function. If A has the singular value decomposition A = UΣV* then the matrix

\[
  \begin{bmatrix} X & Z \\ Z^* & Y \end{bmatrix} = \begin{bmatrix} I & VU^* \\ UV^* & I \end{bmatrix}
\]

is a feasible point of the dual SDP, since X = I ⪯ I, Y = I ⪯ I, and the fact that VU* is unitary tells us that ‖VU*‖ = 1 (by Exercise 2.3.5), so the given block matrix is positive semidefinite (by Exercise 2.3.15). Then

\[
  \mathrm{Re}(\mathrm{tr}(AZ)) = \mathrm{Re}(\mathrm{tr}(U\Sigma V^* V U^*)) = \mathrm{Re}(\mathrm{tr}(\Sigma)) = \mathrm{tr}(\Sigma) = \|A\|_{\mathrm{tr}},
\]

which shows that ‖A‖_tr is a lower bound on the value of this SDP. Since we have now proved both bounds, we are done.
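A numerical spot-check of this example is straightforward as well (a sketch, not from the text, assuming NumPy and CVXPY with an SDP-capable solver; the matrix A is a made-up example): solving the SDP above should reproduce the trace norm of A, i.e., the sum of its singular values.

import numpy as np
import cvxpy as cp

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
m, n = A.shape

X = cp.Variable((m, m), symmetric=True)
Y = cp.Variable((n, n), symmetric=True)
Z = cp.Variable((m + n, m + n), symmetric=True)   # encodes the block matrix [[X, A], [A^T, Y]]

constraints = [Z[:m, :m] == X, Z[:m, m:] == A, Z[m:, m:] == Y,
               Z >> 0, X >> 0, Y >> 0]
prob = cp.Problem(cp.Minimize((cp.trace(X) + cp.trace(Y)) / 2), constraints)
prob.solve()

print(prob.value)                                 # optimal value of the SDP
print(np.linalg.svd(A, compute_uv=False).sum())   # trace norm of A; should agree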

Remark 3.C.2 (Numerics Combined with Duality Make a Powerful Combination) The previous examples of solving SDPs via duality might seem somewhat “cooked up” at first, especially since in Example 3.C.5 we were told what the optimal value is, and we just had to verify it. In practice, we are typically not told the optimal value of the SDP that we are working with ahead of time. However, this is actually not a huge restriction, since we can use computer software to numerically solve the SDP and then use that solution to help us eyeball the analytic (non-numerical) solution.

To illustrate what we mean by this, consider the following semidefinite program that maximizes over symmetric matrices X ∈ M_2^S:

    maximize:   x_{1,1} − x_{2,2}
    subject to: x_{1,1} − x_{1,2} = 1/2
                tr(X) ≤ 1
                X ⪰ O

It might be difficult to see what the optimal value of this SDP is, but we can get a helpful nudge by using computer software to solve it and find that the optimal value is approximately 0.7071, which is attained at a matrix

\[
  X \approx \begin{bmatrix} 0.8536 & 0.3536 \\ 0.3536 & 0.1464 \end{bmatrix}.
\]
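For concreteness, this is the kind of quick computation being referred to (a sketch assuming CVXPY with an SDP-capable solver such as SCS); it reports an optimal value near 0.7071 and an optimal X close to the matrix displayed above.

import cvxpy as cp

X = cp.Variable((2, 2), symmetric=True)
constraints = [X[0, 0] - X[0, 1] == 0.5,   # x_{1,1} - x_{1,2} = 1/2
               cp.trace(X) <= 1,
               X >> 0]
prob = cp.Problem(cp.Maximize(X[0, 0] - X[1, 1]), constraints)
prob.solve()

print(prob.value)   # approximately 0.7071
print(X.value)      # approximately [[0.8536, 0.3536], [0.3536, 0.1464]]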

The optimal value looks like it is probably 1/√2 = 0.7071..., and to prove it we just need to find feasible matrices that attain this value in the objective functions of the primal and dual problems.

The matrix X that works in the primal problem must have x_{1,1} + x_{2,2} ≤ 1 (and the numerics above suggest that equality holds) and x_{1,1} − x_{2,2} = 1/√2. Solving this 2 × 2 linear system gives x_{1,1} = (2 + √2)/4 and x_{2,2} = (2 − √2)/4, and the constraint x_{1,1} − x_{1,2} = 1/2 then tells us that x_{1,2} = √2/4. (Another way to guess the entries of X would be to plug the decimal approximations of its entries into a tool like the Inverse Symbolic Calculator [BBP95].) We thus guess that the maximum value is obtained at the matrix

\[
  X = \frac{1}{4}\begin{bmatrix} 2 + \sqrt{2} & \sqrt{2} \\ \sqrt{2} & 2 - \sqrt{2} \end{bmatrix},
\]

which indeed satisfies all of the constraints and produces an objective value of 1/√2, so this is a lower bound on the desired optimal value.

To see that we cannot do any better (i.e., to show that 1/√2 is also an upper bound), we construct the dual problem, which has the form (you are encouraged to work through the details of constructing this dual program yourself)

    minimize:   y_1 + y_2/2
    subject to: \begin{bmatrix} y_1 + y_2 & -y_2/2 \\ -y_2/2 & y_1 \end{bmatrix} ⪰ \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
                y_1 ≥ 0

Similar to before, we can solve this SDP numerically to find that its optimal value is also approximately 0.7071, and is attained when

    y_1 ≈ 0.2071 and y_2 ≈ 1.000.



We can turn this into an exact answer by guessing y_2 = 1, which forces y_1 = (√2 − 1)/2 (since we want y_1 + y_2/2 = 1/√2). It is then a routine calculation to verify that this (y_1, y_2) pair satisfies all of the constraints in the dual problem and gives an objective value of 1/√2, so this is indeed the optimal value of the SDP.

In the previous two examples, we saw that not only did the dual problem
serve as an upper bound on the primal problem, but rather we were able to
find particular feasible points of each problem that resulted in their objective
functions taking on the same value, thus proving optimality. The following
theorem shows that this phenomenon occurs with a great deal of generality—
there are simple-to-check conditions that guarantee that there must be feasible
points in each of the primal and dual problems that attain the optimal value in
their objective functions.
Before stating what these conditions are, we need one more piece of terminology: we say that an SDP in primal standard form (3.C.2) is feasible if there exists a matrix X ∈ M_n^H satisfying all of its constraints (i.e., X ⪰ O and Φ(X) ⪯ B), and we say that it is strictly feasible if X can be chosen to make both of those inequalities strict (i.e., X ≻ O and Φ(X) ≺ B). Recall that X ≻ O means that X is positive definite (i.e., PSD and invertible), and Φ(X) ≺ B means that B − Φ(X) is positive definite. Feasibility and strict feasibility for problems written in the dual form of Definition 3.C.2 are defined analogously. Geometrically, strict feasibility of an SDP means that its feasible region has an interior—it is not a degenerate lower-dimensional shape that consists only of boundaries and edges.

Theorem 3.C.2 (Strong Duality) Suppose that both problems in a primal/dual pair of SDPs are feasible, and at least one of them is strictly feasible. Then the optimal values of those SDPs coincide. Furthermore,
a) if the primal problem is strictly feasible then the optimal value is attained in the dual problem, and
b) if the dual problem is strictly feasible then the optimal value is attained in the primal problem.

(The conditions in this theorem are sometimes called the Slater conditions for strong duality. There are other, somewhat more technical, conditions that guarantee that a primal/dual pair share their optimal value as well.)

Since the proof of this result relies on some facts about convex sets that we have not explicitly introduced in the main body of this text, we leave it to Appendix B.3. However, it is worth presenting some examples to illustrate what the various parts of the theorem mean and why they are important. To start, it is worthwhile to demonstrate what we mean when we say that strict feasibility implies that the optimal value is “attained” by the other problem in a primal/dual pair.

Example 3.C.6 (An SDP That Does Not Attain Its Optimal Value) Show that no feasible point of the following SDP attains its optimal value:

    minimize:   x
    subject to: \begin{bmatrix} x & 1 \\ 1 & y \end{bmatrix} ⪰ O
                x, y ≥ 0.

Furthermore, explain why this phenomenon does not contradict strong duality (Theorem 3.C.2).

Solution:
Recall from Theorem 2.2.7 that the matrix in this SDP is positive semidefinite if and only if xy ≥ 1. In particular, this means that (x, y) = (x, 1/x) is a feasible point of this SDP for all x > 0, so certainly the optimal value of this SDP cannot be bigger than 0. However, no feasible point has x = 0, since we would then have xy = 0 ≱ 1. It follows that the optimal value of this SDP is 0, but no feasible point attains that value—they just get arbitrarily close to it.

The fact that the optimal value of this SDP is not attained does not contradict Theorem 3.C.2 since the dual of this SDP must not be strictly feasible. (Be careful: even if the strong duality conditions of Theorem 3.C.2 hold, i.e., the primal/dual SDPs are strictly feasible, and the optimal value is attained, that does not mean that it is attained at a strictly feasible point.) To verify this claim, we compute the dual SDP to have the following form (after simplifying somewhat):

    maximize:   Re(z)
    subject to: O ⪯ \begin{bmatrix} x & -z \\ -\bar{z} & y \end{bmatrix} ⪯ \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix}

The constraint in the above SDP forces y = 0, so there is no positive definite matrix that satisfies it (i.e., this dual SDP is not strictly feasible). Recall from Theorem 2.2.4 that positive definite matrices have strictly positive diagonal entries.

There are also conditions other than those of Theorem 3.C.2 that can be
has the same optimal value. For example, the Extreme Value Theorem from
real analysis says that every continuous function on a closed and bounded
set necessarily attains it maximum and minimum values. Since the objective
function of every SDP is linear (and thus continuous), and the feasible set of
every SDP is closed, we obtain the following criterion:

! If the feasible region of an SDP is non-empty and bounded, its


optimal value is attained.

On the other hand, notice that the feasible region of the SDP from Exam-
ple 3.C.6 is unbounded, since we can make x and y as large as we like in it (see
Figure 3.8).
[Figure: two copies of the unbounded feasible region xy ≥ 1, x, y ≥ 0; the vertical level lines x = 0, x = 2, x = 4 approach the region, with x = 0 never touching it.]

Figure 3.8: The feasible region of the semidefinite program from Example 3.C.6.
The fact that the optimal value of this SDP is not attained corresponds to the fact
that its feasible region gets arbitrarily close to the line x = 0 but does not contain
any point on it.
In the previous example, even though the dual SDP was not strictly feasible,
the primal SDP was, so the optimal values of both problems were still forced

to equal each other by Theorem 3.C.2 (despite the optimal value not actually
being attained in the primal problem). We now present an example that shows
that it is also possible for neither SDP to be strictly feasible, and thus for their
optimal values to differ.

Example 3.C.7 (A Primal/Dual SDP Pair with Unequal Optimal Values) Show that the following primal/dual SDPs, which optimize over the variables X ∈ M_3^S in the primal and y, z ∈ R in the dual, have different optimal values (verify on your own that these problems are indeed duals of each other):

    Primal                                       Dual
    maximize:   −x_{2,2}                         minimize:   y
    subject to: 2x_{1,3} + x_{2,2} = 1           subject to: \begin{bmatrix} 0 & 0 & y \\ 0 & y+1 & 0 \\ y & 0 & z \end{bmatrix} ⪰ O
                x_{3,3} = 0
                X ⪰ O

Furthermore, explain why this phenomenon does not contradict strong duality (Theorem 3.C.2).

Solution:
In the primal problem, the fact that x_{3,3} = 0 and X ⪰ O forces x_{1,3} = 0 as well (recall from Exercise 2.2.11 that if a diagonal entry of a PSD matrix equals 0 then so does its entire row and column). The first constraint then says that x_{2,2} = 1, so the optimal value of the primal problem is −1.

In the dual problem, the fact that the top-left entry of the 3 × 3 PSD matrix equals 0 forces y = 0, so the optimal value of this dual problem equals 0.

Since these two optimal values are not equal to each other, we know from Theorem 3.C.2 that neither problem is strictly feasible. Indeed, the primal problem is not strictly feasible since the constraint x_{3,3} = 0 ensures that the entries in the final row and column of X all equal 0 (so X cannot be invertible), and the dual problem is similarly not strictly feasible since the 3 × 3 PSD matrix must have every entry in its first row and column equal to 0.

In spite of semidefinite programs like those presented in the previous two examples, most real-world SDPs (i.e., ones that are not extremely “cooked up”) are strictly feasible and we can thus make use of strong duality. (Recall that strong duality holds, i.e., the optimal values are attained and equal in a primal/dual pair, for every linear program.) That is, for most SDPs it is the case that Figure 3.7 is somewhat misleading—it is not just the case that the primal and dual programs bound each other, but rather they bound each other “tightly” in the sense that their optimal values coincide (see Figure 3.9).

[Figure: a number line on which the primal (maximize) objective tr(CX) and the dual (minimize) objective tr(BY) meet at a common value.]

Figure 3.9: Strong duality says that, for many semidefinite programs, the objective
function of the primal (maximization) problem can be increased to the exact
same value that the objective function of the dual (minimization) problem can be
decreased to.

Furthermore, showing that strict feasibility holds for an SDP is usually



quite straightforward. For example, to see that the SDP from Example 3.C.5 is
strictly feasible, we just need to observe that we can choose X and Y to each be
suitably large multiples of the identity matrix.

Example 3.C.8 (Using Strong Duality to Solve an SDP) Show that the following semidefinite program in the variables Y ∈ M_m^H, Z ∈ M_n^H, and X ∈ M_{m,n}(C) computes the operator norm of the matrix A ∈ M_{m,n}(C):

    maximize:   Re(tr(AX^*))
    subject to: \begin{bmatrix} Y & -X \\ -X^* & Z \end{bmatrix} ⪰ O
                tr(Y) + tr(Z) ≤ 2

Solution:
Recall from Example 3.C.3 that this is the dual of the SDP (3.C.3) that computes ‖A‖. It follows that the optimal value of this SDP is certainly bounded above by ‖A‖, so there are two ways we could proceed:
• we could find a feasible point of this SDP that attains the conjectured optimal value ‖A‖, and then note that the true optimal value must be ‖A‖ by weak duality (we find a feasible point attaining this optimal value in Exercise 3.C.8), or
• we could show that the primal SDP (3.C.3) is strictly feasible, so this dual SDP must attain its optimal value, which is the same as the optimal value ‖A‖ of the primal SDP by strong duality.
We opt for the latter method—we show that the primal SDP (3.C.3) is strictly feasible. To this end, we just note that if x is really, really large (in particular, larger than ‖A‖) then x > 0 and

\[
  \begin{bmatrix} xI_m & A \\ A^* & xI_n \end{bmatrix} \succ O,
\]

so the primal SDP (3.C.3) is strictly feasible and strong duality holds. (When proving strong duality, it is often useful to just let the variables be very big. Try not to get hung up on how big is big enough to make everything positive definite, as long as it is clear that there is a big enough choice that works.)

We do not need to, but we can also show that the dual SDP that we started with is strictly feasible by choosing

    Y = I_m/m,  Z = I_n/n,  and  X = O.

Remark 3.C.3 (Unbounded and Feasibility SDPs) Just like linear programs, semidefinite programs can be infeasible (i.e., have an empty feasible region) or unbounded (i.e., have an objective function that can be made arbitrarily large). If a maximization problem is unbounded then we say that its optimal value is ∞, and if it is infeasible then we say that its optimal value is −∞ (and of course these two quantities are swapped for minimization problems).

Weak duality immediately tells us that if an SDP is unbounded then its dual must be infeasible, which leads to the following possible infeasible/unbounded/solvable (i.e., finite optimal value) pairings that primal and dual problems can share. (The corresponding table for linear programs is identical, except with a blank in the “infeasible/solvable” cells: there exists an infeasible SDP with solvable dual (see Exercise 3.C.4), but no such pair of LPs exists.)

                            Primal problem
                    Infeasible   Solvable   Unbounded
    Dual Infeasible     X            X          X
         Solvable       X            X          ·
         Unbounded      X            ·          ·

Allowing optimal values of ±∞ like this is particularly useful when considering feasibility SDPs—semidefinite programs in which we are only interested in whether or not a feasible point exists. For example, suppose we wanted to know whether or not there was a way to fill in the missing entries in the matrix

\[
  \begin{bmatrix}
     3 & * & * & -2 & * \\
     * & 3 & * & -2 & * \\
     * & * & 3 & *  & * \\
    -2 & -2 & * & 3 & -2 \\
     * & * & * & -2 & 3
  \end{bmatrix}
\]

so as to make it positive semidefinite (these entries can indeed be filled in to make this matrix PSD—see Exercise 3.C.7). This isn’t really an optimization problem per se, but we can nonetheless write it as the following SDP:

    maximize:   0
    subject to: x_{j,j} = 3 for all 1 ≤ j ≤ 5
                x_{i,j} = −2 for {i, j} = {1, 4}, {2, 4}, {4, 5}
                X ⪰ O

If such a PSD matrix exists, this SDP has optimal value 0, otherwise it is infeasible and thus has optimal value −∞.
If such a PSD matrix exists, this SDP has optimal value 0, otherwise it is
infeasible and thus has optimal value −∞.

Exercises solutions to starred exercises on page 483

∗∗3.C.1 We now prove some of the basic properties of the


(a) Show that A  cI if and only if λmax (A) ≤ c.
Loewner partial order. Suppose A, B,C ∈ MH
n. (b) Use part (a) to construct a semidefinite program that
(a) Show that A  A. computes λmax (A).
(b) Show that if A  B and B  A then A = B. (c) Construct a semidefinite program that computes
(c) Show that if A  B and B  C then A  C. λmin (A).
(d) Show that if A  B then A +C  B +C.
(e) Show that A  B if and only if −A  −B.
∗∗ 3.C.4 In this problem, we show that there are pri-
(f) Show that A  B implies PAP∗  PBP∗ for all
mal/dual pairs of semidefinite programs in which one prob-
P ∈ Mm,n (C).
lem is infeasible and the other is solvable. Consider the
following SDP involving the variables x, y ∈ R:
3.C.2 Determine which of the following statements are
true and which are false. maximize: 0" # " #
∗(a) Every linear program is a semidefinite program. subject to: x+y 0 0 1
(b) Every semidefinite program is a linear program. 
0 x 1 0
∗(c) If A and B are Hermitian matrices for which A  B x+y ≤ 0
then tr(A) ≥ tr(B). x≥0
(d) If an SDP is unbounded then its dual must be infeasi-
ble. (a) Show that this SDP is infeasible.
∗(e) If an SDP is infeasible then its dual must be un- (b) Construct the dual of this semidefinite program and
bounded. show that it is feasible and has an optimal value of 0.

∗∗3.C.3 Suppose A ∈ MH n and c ∈ R. Let λmax (A) and


λmin (A) denote the maximal and minimal eigenvalue, re-
spectively, of A.

∗∗3.C.5 Suppose A, B ∈ MH [Hint: Solve Exercise 3.C.9 first, which computes σ12 + · · · +
n are positive semidefinite
and A  B. σr2 , and maybe make use of Exercise 3.C.5.]

(a) Provide an example to show that it is not necessarily [Side note: A similar method can be used to construct
the case that A2  B2 . semidefinite programs that compute σ1p +· · ·+σrp whenever
(b) Show that, in spite of part (a), it is the case that p is an integer power of 2.]
1/p
tr(A2 ) ≥ tr(B2 ). [Side note: The quantity σ1p + · · · + σrp is sometimes
[Hint: Factor a difference of squares.] called the Schatten p-norm of A.]

3.C.6 Suppose A, B ∈ MH n are positive semidefinite and 3.C.11 Let A ∈ MH n be positive semidefinite. Show
A  B. that the following semidefinite
√ √ √  program in the variable
(a) Show that if A is positive definite then A  B. X ∈ Mn (C) computes tr A (i.e., the sum of the square
√ √ −1 roots of the eigenvalues of A):
[Hint:√ Use the fact that B A and
A−1/4 BA−1/4 are similar to each other, where 
p√ −1 maximize: "Re tr(X)#
A−1/4 = A .] subject to: I X
(b) √Use the√techniques from Section 2.D.3 to show that O
A  B even when A is just positive semidefinite. X∗ A
You may use the fact that the principal square root
function is continuous on the set of positive semidef- [Hint: Use duality or Exercises 2.2.18 and 3.C.6.]
inite matrices.
∗∗3.C.12 Let C ∈ MH n and let k be a positive integer such
∗∗3.C.7 Use computer software to solve the SDP from that 1 ≤ k ≤ n. Consider the following semidefinite program
Remark 3.C.3 and thus fill in the missing entries in the in the variable X ∈ MH n:
matrix  
3 ∗ ∗ −2 ∗ maximize: tr(CX)
∗ 3 ∗ −2 ∗ subject to: tr(X) = k
 
  X I
∗ ∗ 3 ∗ ∗
  X O
−2 −2 ∗ 3 −2
∗ ∗ ∗ −2 3 (a) Construct the dual of this semidefinite program.
(b) Show that the optimal value of this semidefinite pro-
so as to make it positive semidefinite.
gram is the sum of the k largest eigenvalues of C.
[Hint: Try to “eyeball” optimal solutions for the pri-
∗∗3.C.8 Find a feasible point of the semidefinite program mal and dual problems.]
from Example 3.C.8 that attains its optimal value kAk.
3.C.13 Let A ∈ Mm,n (C). Construct a semidefinite pro-
∗∗ 3.C.9 Let A ∈ Mm,n (C). Show that the following gram for computing the sum of the k largest singular values
semidefinite program in the variable X ∈ MH
n computes of A.
kAk2F :
[Side note: This sum is called the Ky Fan k-norm of A.]
minimize: "tr(X) # [Hint: The results of Exercises 2.3.14 and 3.C.12 may be
subject to: Im A helpful.]
O
A∗ X
X O ∗∗ 3.C.14 Recall that a linear map Φ : Mn → Mm is
called decomposable if there exist completely positive lin-
[Hint: Either use duality and mimic Example 3.C.5, or use ear maps Ψ1 , Ψ2 : Mn → Mm such that Φ = Ψ1 + T ◦ Ψ2 ,
the Schur complement from Section 2.B.1.] where T is the transpose map.
Construct a semidefinite program that determines whether
3.C.10 Let A ∈ Mm,n (C). Show that the following or not a given matrix-valued linear map Φ is decomposable.
semidefinite program in the variables X,Y ∈ MH n computes
σ14 + · · · + σr4 , where σ1 , . . ., σr are the non-zero singular ∗∗3.C.15 Use computer software and the semidefinite pro-
values of A: gram from Exercise 3.C.14 to show that the map Φ : M2 →
M4 from Exercise 3.A.21 is decomposable.
minimize: "tr(Y ) # " #
subject to: Im A I X
, n O 3.C.16 Use computer software and the semidefinite pro-
A∗ X X Y
gram from Exercise 3.C.14 to show that the Choi map Φ
X, Y  O from Theorem 3.A.7 is not decomposable.
[Side note: We already demonstrated this claim “directly”
in the proof of that theorem.]

∗∗3.C.17 Use computer software and the semidefinite pro- Construct a semidefinite program that determines whether
gram from Exercise 3.C.14 to show that the map Φ : M2 → or not q can be written as a sum of squares of bilinear forms.
M4 from Exercise 3.A.22 is not decomposable. [Hint: Instead of checking q(x, y) = xT Φ(yyT )x for all x
[Side note: This map Φ is positive, so it serves as an example and y, it suffices to choose finitely many vectors {xi } and
to show that Theorem 3.A.8 does not hold when mn = 8.] {y j } so that the sets {xi xTi } and {y j yTj } span the set of
symmetric matrices.]

∗∗3.C.18 Recall from Section 3.B.3 that a biquadratic


form is a function q : Rm × Rn → R that can be writ-
ten in the form q(x, y) = xT Φ(yyT )x for some linear map
Φ : Mn (R) → Mm (R).
A. Mathematical Preliminaries

In this appendix, we present some of the miscellaneous bits of mathematical


knowledge that are not topics of advanced linear algebra themselves, but are
nevertheless useful and might be missing from the reader’s toolbox. We also
review some basic linear algebra from an introductory course that the reader
may have forgotten.

A.1 Review of Introductory Linear Algebra

Here we review some of the basics of linear algebra that we expect the reader to
be familiar with throughout the main text. We present some of the key results of
introductory linear algebra, but we do not present any proofs or much context.
For a more thorough presentation of these results and concepts, the reader is
directed to an introductory linear algebra textbook like [Joh20].

A.1.1 Systems of Linear Equations


One of the first objects typically explored in introductory linear algebra is a
system of linear equations (or a linear system for short), which is a collection
of 1 or more linear equations (i.e., equations in which variables can only be
added to each other and/or multiplied by scalars) in the same variables, like
y + 3z = 3
2x + y − z = 1 (A.1.1)
x + y + z = 2.
Linear systems can have zero, one, or infinitely many solutions, and one
particularly useful method of finding these solutions (if they exist) starts by
placing the coefficients of the linear system into a rectangular array called a matrix. For example, the matrix associated with the linear system (A.1.1) is

\[
  \begin{bmatrix} 0 & 1 & 3 & 3 \\ 2 & 1 & -1 & 1 \\ 1 & 1 & 1 & 2 \end{bmatrix}.
  \tag{A.1.2}
\]

(The rows of a matrix represent the equations in the linear system, and its columns represent the variables as well as the coefficients on the right-hand side.)

We then use a method called Gaussian elimination or row reduction, which works by using one of the three following elementary row operations
to simplify this matrix as much as possible:
Multiplication. Multiplying row j by a non-zero scalar c ∈ R, which we denote by cR j .


Swap. Swapping rows i and j, which we denote by Ri ↔ R j .


Addition. Replacing row i by (row i) + c(row j), which we denote by Ri + cR j .
In particular, we can use these three elementary row operations to put any
matrix into reduced row echelon form (RREF), which means that it has the
following three properties:
• all rows consisting entirely of zeros are below the non-zero rows,
• in each non-zero row, the first non-zero entry (called the leading entry) is to the left of any leading entries below it, and
• each leading entry equals 1 and is the only non-zero entry in its column.
[Side note: Any matrix that has the first two of these three properties is said to be in (not-necessarily-reduced) row echelon form.]

For example, we can put the matrix (A.1.2) into reduced row echelon form
via the following sequence of elementary row operations:
[Side note: Every matrix can be converted into one, and only one, reduced row echelon form. However, there may be many different sequences of row operations that get there.]

    [ 0  1  3  3 ]             [ 1  1  1  2 ]              [ 1  1  1  2 ]
    [ 2  1 -1  1 ]  R1 ↔ R3    [ 2  1 -1  1 ]  R2 − 2R1    [ 0 -1 -3 -3 ]
    [ 1  1  1  2 ]  -------->  [ 0  1  3  3 ]  --------->  [ 0  1  3  3 ]

                               [ 1  1  1  2 ]  R1 − R2     [ 1  0 -2 -1 ]
                       −R2     [ 0  1  3  3 ]  R3 − R2     [ 0  1  3  3 ]
                    --------->  [ 0  1  3  3 ]  --------->  [ 0  0  0  0 ].

One of the useful features of reduced row echelon form is that the solutions
of the corresponding linear system can be read off from it directly. For example,
if we interpret the reduced row echelon form above as a linear system, the
bottom row simply says 0x + 0y + 0z = 0 (so we ignore it), the second row says
that y + 3z = 3, and the top row says that x − 2z = −1. If we just move the “z”
term in each of these equations over to the other side, we see that every solution
of this linear system has x = 2z − 1 and y = 3 − 3z, where z is arbitrary (we
thus call z a free variable and x and y leading variables).
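To cross-check this computation, here is a short Python sketch of mine (not from the text) that uses SymPy to reproduce the reduced row echelon form and the parametrized solution; Matrix.rref() is the standard SymPy routine for this.

    import sympy as sp

    # Augmented matrix of the linear system (A.1.1)
    M = sp.Matrix([[0, 1, 3, 3],
                   [2, 1, -1, 1],
                   [1, 1, 1, 2]])

    R, pivots = M.rref()   # reduced row echelon form and pivot columns
    print(R)               # Matrix([[1, 0, -2, -1], [0, 1, 3, 3], [0, 0, 0, 0]])
    print(pivots)          # (0, 1): x and y are leading variables, z is free

    # Solving symbolically recovers x = 2z - 1 and y = 3 - 3z
    x, y, z = sp.symbols('x y z')
    print(sp.solve([y + 3*z - 3, 2*x + y - z - 1, x + y + z - 2], [x, y]))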

A.1.2 Matrices as Linear Transformations


One of the central features of linear algebra is that there is a one-to-one correspondence between matrices and linear transformations. That is, every m × n matrix A ∈ Mm,n can be thought of as a function that sends x ∈ Rn to the vector Ax ∈ Rm. Conversely, every linear transformation T : Rn → Rm (i.e., function with the property that T(x + cy) = T(x) + cT(y) for all x, y ∈ Rn and c ∈ R) can be represented by a matrix—there is a unique matrix A ∈ Mm,n with the property that Ax = T(x) for all x ∈ Rn. We thus think of matrices and linear transformations as the “same thing”.
[Side note: In fact, vectors and matrices do not even need to have real entries. Their entries can come from any “field” (see the upcoming Appendix A.4).]

Linear transformations are special for the fact that they are determined completely by how they act on the standard basis vectors e1, e2, . . . , en, which are the vectors with all entries equal to 0, except for a single entry equal to 1 in the location indicated by the subscript (e.g., in R3 there are three standard basis vectors: e1 = (1, 0, 0), e2 = (0, 1, 0), and e3 = (0, 0, 1)). In particular, Ae1, Ae2, . . ., Aen are exactly the n columns of A, and those n vectors form the sides of the parallelogram/parallelepiped/hyperparallelepiped that the unit square/cube/hypercube is mapped to by A (see Figure A.1).
[Side note: A linear transformation is also completely determined by how it acts on any other basis of Rn.]
In particular, linear transformations act “uniformly” in the sense that they
send a unit square/cube/hypercube grid to a parallelogram/parallelepiped/
hyperparallelepiped grid without distorting any particular region of space more
than other regions of space.
[Side note: Squares are mapped to parallelograms in R2, cubes are mapped to parallelepipeds in R3, and so on.]

Figure A.1: A matrix A ∈ M2 acts as a linear transformation on R2 that transforms a square grid with sides e1 and e2 into a grid made up of parallelograms with sides Ae1 and Ae2 (i.e., the columns of A). Importantly, A preserves which cell of the grid each vector is in (in this case, v is in the 2nd square to the right, 3rd up, and Av is similarly in the 2nd parallelogram in the direction of Ae1 and 3rd in the direction of Ae2).
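To make the column picture concrete, the following tiny NumPy sketch (my own addition, not from the book) checks that Ae1 and Ae2 are exactly the columns of A and that A acts linearly; the matrix used is the 2 × 2 matrix that appears later in Figure A.4.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 1.0]])
    e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

    # A e_j picks out the j-th column of A
    assert np.allclose(A @ e1, A[:, 0])
    assert np.allclose(A @ e2, A[:, 1])

    # Linearity: T(x + cy) = T(x) + c T(y)
    x, y, c = np.array([2.0, 3.0]), np.array([-1.0, 4.0]), 5.0
    assert np.allclose(A @ (x + c * y), A @ x + c * (A @ y))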

A.1.3 The Inverse of a Matrix

The inverse of a square matrix A ∈ Mn is a matrix A−1 ∈ Mn for which


AA−1 = A−1A = I (the identity matrix). The inverse of a matrix is unique when it exists, but not all matrices have inverses (even in the n = 1 case, the scalar 0 does not have an inverse). The inverse of a matrix can be computed by using Gaussian elimination to row-reduce the block matrix [ A | I ] into its reduced row echelon form [ I | A−1 ]. Furthermore, if the reduced row echelon form of [ A | I ] has anything other than I in the left block, then A is not invertible.
[Side note: In the n = 1 case, the inverse of a 1 × 1 matrix (i.e., scalar) a is just 1/a.]

A one-sided inverse of a matrix is automatically two-sided (that is, if either of the equations AB = I or BA = I holds then the other necessarily holds as well—we can deduce that B = A−1 based on just one of the two defining equations). We can get some intuition for why this fact is true by thinking geometrically—if AB = I then, as linear transformations, A simply undoes whatever B does to Rn, and it perhaps seems believable that B similarly undoes whatever A does (see Figure A.2).
[Side note: We show in Remark 1.2.2 that one-sided inverses are not necessarily two-sided in the infinite-dimensional case.]

Figure A.2: As a linear transformation, A−1 undoes what A does to vectors in Rn. That is, A−1Av = v for all v ∈ Rn (and AA−1v = v too).

Row reducing a matrix is equivalent to multiplication on the left by an


invertible matrix in the sense that B ∈ Mm,n can be obtained from A ∈ Mm,n
via a sequence of elementary row operations if and only if there is an invertible
matrix P ∈ Mm such that B = PA. The fact that every matrix can be row-
reduced to a (unique) matrix in reduced row echelon form is thus equivalent to
the following fact:

! For every A ∈ Mm,n , there exists an invertible P ∈ Mm such


that A = PR, where R ∈ Mm,n is the RREF of A.
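As an illustration of the block-matrix recipe [ A | I ] → [ I | A−1 ], here is a sketch of mine (not code from the book) carried out in SymPy:

    import sympy as sp

    A = sp.Matrix([[1, 2],
                   [5, 4]])
    n = A.shape[0]

    # Row-reduce the block matrix [ A | I ]; the right block becomes A^{-1}
    block = A.row_join(sp.eye(n))
    R, _ = block.rref()
    A_inv = R[:, n:]

    assert R[:, :n] == sp.eye(n)     # left block is I, so A is invertible
    assert A * A_inv == sp.eye(n)    # AA^{-1} = I
    assert A_inv == A.inv()          # matches SymPy's built-in inverse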

It is also useful to be familiar with the numerous different ways in which


invertible matrices can be characterized:

Theorem A.1.1 (The Invertible Matrix Theorem). Suppose A ∈ Mn. The following are equivalent:
a) A is invertible.
b) AT is invertible.
c) The reduced row echelon form of A is I.
d) The linear system Ax = b has a solution for all b ∈ Rn.
e) The linear system Ax = b has a unique solution for all b ∈ Rn.
f) The linear system Ax = 0 has a unique solution (x = 0).
g) The columns of A are linearly independent.
h) The columns of A span Rn.
i) The columns of A form a basis of Rn.
j) The rows of A are linearly independent.
k) The rows of A span Rn.
l) The rows of A form a basis of Rn.
m) rank(A) = n (i.e., range(A) = Rn).
n) nullity(A) = 0 (i.e., null(A) = {0}).
o) det(A) ≠ 0.
p) All eigenvalues of A are non-zero.
[Side note: The final four characterizations here concern rank, determinants, and eigenvalues, which are the subjects of the next three subsections.]

A.1.4 Range, Rank, Null Space, and Nullity


The range of a matrix A ∈ Mm,n is the subspace of Rm consisting of all possible
output vectors of A (when we think of it as a linear transformation), and its
null space is the subspace of Rn consisting of all vectors that are sent to the
zero vector:

    range(A) = {Av : v ∈ Rn}   and   null(A) = {v ∈ Rn : Av = 0}.

The rank and nullity of A are the dimensions of its range and null space,
respectively. The following theorem summarizes many of the important proper-
ties of the range, null space, rank, and nullity that are typically encountered in
introductory linear algebra courses.

Theorem A.1.2 (Properties of Range, Rank, Null Space and Nullity). Suppose A ∈ Mm,n has columns v1, v2, . . . , vn. Then:
a) range(A) = span(v1, v2, . . . , vn).
b) range(AA∗) = range(A).
c) null(A∗A) = null(A).
d) rank(AA∗) = rank(A∗A) = rank(A∗) = rank(A).
e) rank(A) + nullity(A) = n.
f) If B ∈ Mn,p then range(AB) ⊆ range(A).
g) If B ∈ Mm,n then rank(A + B) ≤ rank(A) + rank(B).
h) If B ∈ Mn,p then rank(AB) ≤ min{rank(A), rank(B)}.
i) If B ∈ Mn,p has rank(B) = n then rank(AB) = rank(A).
j) rank(A) equals the number of non-zero rows in any row echelon form of A.
[Side note: Property (e) is sometimes called the “rank-nullity theorem”.]
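A quick numerical check of the rank–nullity relationship in property (e), using the coefficient matrix of the linear system (A.1.1); this is an illustrative sketch of mine and assumes SciPy's null_space helper is available.

    import numpy as np
    from scipy.linalg import null_space

    # Coefficient matrix of the linear system (A.1.1)
    A = np.array([[0.0, 1.0, 3.0],
                  [2.0, 1.0, -1.0],
                  [1.0, 1.0, 1.0]])

    rank = np.linalg.matrix_rank(A)
    nullity = null_space(A).shape[1]      # number of basis vectors of null(A)

    print(rank, nullity)                  # 2 1
    assert rank + nullity == A.shape[1]   # rank(A) + nullity(A) = n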

We saw in Theorem A.1.1 that square matrices with maximal rank are
exactly the ones that are invertible. Intuitively, we can think of the rank of
a matrix as a rough measure of “how close to invertible” it is, so matrices
with small rank are in some sense the “least invertible” matrices out there. Of

particular interest are matrices with rank 1, which are exactly the ones that can
be written in the form vwT , where v and w are non-zero column vectors. More
generally, a rank-r matrix can be written as a sum of r matrices of this form,
but not fewer:

Theorem A.1.3 (Rank-One Sum Decomposition). Suppose A ∈ Mm,n. Then the smallest integer r for which there exist sets of vectors {v1, . . . , vr} ⊂ Rm and {w1, . . . , wr} ⊂ Rn with

    A = v1w1T + v2w2T + · · · + vrwrT

is exactly r = rank(A). Furthermore, the sets {v1, . . . , vr} and {w1, . . . , wr} can be chosen to be linearly independent.
[Side note: This theorem works just fine if we replace R by an arbitrary field F (see Appendix A.4).]

A.1.5 Determinants and Permutations


The determinant of a matrix A ∈ Mn, which we denote by det(A), is the area (or volume, or hypervolume, depending on the dimension n) of the output of the unit square/cube/hypercube after it is acted upon by A (see Figure A.3). In other words, it measures how much A expands space when acting as a linear transformation—it is the ratio

    (volume of output region) / (volume of input region).

[Side note: Actually, the determinant of a matrix can also be negative—this happens if A flips the orientation of space (i.e., reflects Rn through some (n − 1)-dimensional hyperplane).]

Figure A.3: A 2 × 2 matrix A stretches the unit square (with sides e1 and e2 ) into a
parallelogram with sides Ae1 and Ae2 (the columns of A). The determinant of A is
the area of this parallelogram.

This geometric interpretation of the determinant leads immediately to the


fact that it is multiplicative—stretching space by the product of two matrices
(i.e., the composition of two linear transformations) has the same effect as
stretching by the first one and then stretching by the second one:
[Side note: Similarly, det(I) = 1 since the identity matrix does not stretch or squish space at all.]

! If A, B ∈ Mn then det(AB) = det(A) det(B).

There are several methods for actually computing the determinant of a
matrix, but for our purposes the most useful method requires some brief
background knowledge of permutations. A permutation is a function σ :
{1, 2, . . . , n} → {1, 2, . . . , n} with the property that σ(i) ≠ σ(j) whenever i ≠ j
(i.e., σ can be thought of as shuffling around the integers 1, 2, . . . , n). A transposition is a permutation with two distinct values i, j such that σ(i) = j and σ(j) = i, and σ(k) = k for all k ≠ i, j (i.e., σ swaps two integers and leaves the rest alone). We denote the set of all permutations acting on {1, 2, . . . , n} by Sn.
[Side note: Sn is often called the symmetric group.]
Every permutation can be written as a composition of transpositions. We
define the sign of a permutation σ to be the quantity

sgn(σ ) = (−1)k , if σ can be written as a composition of k transpositions.



Importantly, even though every permutation can be written as a composition of


transpositions in numerous different ways, the number of transpositions used
always has the same parity (i.e., it is either always even or always odd), so the
sign of a permutation is well-defined.
With these details out of the way, we can now present a formula for com-
puting the determinant. Note that we state this formula as a theorem, but many
sources actually use it as the definition of the determinant:

Theorem A.1.4 (Determinants via Permutations). Suppose A ∈ Mn. Then

    det(A) = ∑σ∈Sn sgn(σ) aσ(1),1 aσ(2),2 · · · aσ(n),n.
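For concreteness, here is a deliberately naive Python transcription of this permutation formula (my own sketch, not the book's method of choice), compared against NumPy's determinant:

    import itertools
    import math
    import numpy as np

    def det_by_permutations(A):
        """Compute det(A) via the sum over all permutations in S_n."""
        n = A.shape[0]
        total = 0.0
        for perm in itertools.permutations(range(n)):
            # sign of the permutation: +1 if the number of inversions is even
            inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                             if perm[i] > perm[j])
            sign = (-1) ** inversions
            total += sign * math.prod(A[perm[j], j] for j in range(n))
        return total

    A = np.array([[1.0, 2.0], [5.0, 4.0]])
    print(det_by_permutations(A), np.linalg.det(A))   # both are -6.0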
This theorem is typically not used for actual calculations in practice, as
much faster methods of computing the determinant are known. However, this
formula is useful for the fact that we can use it to quickly derive several
useful properties of the determinant. For example, swapping two columns of a
matrix has the effect of swapping the sign of each permutation in the sum in
Theorem A.1.4, which leads to the following fact:

[Side note: We can replace columns with rows in both of these facts, since det(AT) = det(A) for all A ∈ Mn.]

! If B ∈ Mn is obtained from A ∈ Mn by swapping two of its columns then det(B) = − det(A).

Similarly, the determinant is multilinear in the columns of the matrix it acts
on: it acts linearly on each column individually, as long as all other columns are
fixed. That is, much like we can “split up” linear transformations over vector
addition and scalar multiplication, we can similarly split up the determinant
These properties of over vector addition and scalar multiplication in a single column of a matrix:
the determinant
can be derived
from its geometric ! For all matrices A = [ a1 | · · · | an ] ∈ Mn , all v, w ∈ Rn , and
interpretation as well, all scalars c ∈ R, it is the case that
but it’s somewhat 
messy to do so. det [ a1 | · · · | v + cw | · · · | an ]
 
= det [ a1 | · · · | v | · · · | an ] + c · det [ a1 | · · · | w | · · · | an ] .

Finally, if A ∈ Mn is upper triangular then the only permutation σ for which


aσ(1),1 aσ(2),2 · · · aσ(n),n ≠ 0 is the identity permutation (i.e., the permutation for
which σ ( j) = j for all 1 ≤ j ≤ n), since every other permutation results in at
least one factor in the product aσ (1),1 aσ (2),2 · · · aσ (n),n coming from the strictly
lower triangular portion of A. This observation gives us the following simple
formula for the determinant of an upper triangular matrix:
[Side note: This same result holds for lower triangular matrices as well.]

! If A ∈ Mn is upper triangular then det(A) = a1,1 · a2,2 · · · an,n.

A similar argument shows the slightly more general fact that if A is block upper triangular with diagonal blocks A1, A2, . . ., Ak then det(A) = det(A1) det(A2) · · · det(Ak).

A.1.6 Eigenvalues and Eigenvectors


[Side note: If we allowed v = 0 as an eigenvector then every scalar λ would be an eigenvalue corresponding to it.]
If A ∈ Mn, v ≠ 0 is a vector, λ is a scalar, and Av = λv, then we say that v is an eigenvector of A with corresponding eigenvalue λ. Geometrically, this means that A stretches v by a factor of λ, but does not rotate it at all (see Figure A.4). The set of all eigenvectors corresponding to a particular eigenvalue (together with the zero vector) is always a subspace of Rn, which we call the eigenspace
corresponding to that eigenvalue. The dimension of an eigenspace is called the
geometric multiplicity of the corresponding eigenvalue.

Figure A.4: Matrices do not change the line on which any of their eigenvectors lie, but rather just scale them by the corresponding eigenvalue. The matrix displayed here, A = [ 1 2 ; 2 1 ], has eigenvectors v1 = (−1, 1) and v2 = (1, 1) with corresponding eigenvalues −1 and 3, respectively (Av1 = −v1 and Av2 = 3v2).
[Side note: This matrix does change the direction of any vector that is not on one of the two lines displayed here.]

The standard method of computing a matrix’s eigenvalues is to construct


its characteristic polynomial pA (λ ) = det(A − λ I). This function pA really
is a polynomial in λ by virtue of the permutation formula for the determinant
(i.e., Theorem A.1.4). Furthermore, the degree of pA is n (the size of A), so
it has at most n real roots and exactly n complex roots counting multiplicity
(see the upcoming Theorem A.3.1). These roots are the eigenvalues of A, and
the eigenvectors v that they correspond to can be found by solving the linear
system (A − λ I)v = 0 for each eigenvalue λ .

Example A.1.1 (Computing the Eigenvalues and Eigenvectors of a Matrix). Compute all of the eigenvalues and corresponding eigenvectors of the matrix A = [ 1 2 ; 5 4 ].

Solution:
To find the eigenvalues of A, we first compute the characteristic polynomial pA(λ) = det(A − λI):

    pA(λ) = det(A − λI) = det([ 1−λ  2 ; 5  4−λ ])
          = (1 − λ)(4 − λ) − 10 = λ2 − 5λ − 6.

[Side note: For a 2 × 2 matrix, the determinant is simply det([ a b ; c d ]) = ad − bc.]

Setting this determinant equal to 0 then gives

λ 2 − 5λ − 6 = 0 ⇐⇒ (λ + 1)(λ − 6) = 0

⇐⇒ λ = −1 or λ = 6,

so the eigenvalues of A are λ = −1 and λ = 6. To find the eigenvectors


corresponding to these eigenvalues, we solve the linear systems (A + I)v =
0 and (A − 6I)v = 0, respectively:
λ = −1: In this case, we want to solve the linear system (A−λ I)v =
(A + I)v = 0, which we can write explicitly as follows:

    2v1 + 2v2 = 0
    5v1 + 5v2 = 0.

To solve this linear system, we use Gaussian elimination as usual:

    [ 2  2  0 ]  R2 − (5/2)R1   [ 2  2  0 ]
    [ 5  5  0 ]  ------------>  [ 0  0  0 ].

It follows that v2 is a free variable and v1 is a leading variable with


v1 = −v2 . The eigenvectors corresponding to the eigenvalue λ = −1
are thus the non-zero vectors of the form v = (−v2 , v2 ) = v2 (−1, 1).
λ = 6: Similarly, we now want to solve the linear system (A −
λ I)v = (A − 6I)v = 0, which we can do as follows:
   
    [ -5   2  0 ]  R2 + R1   [ -5  2  0 ]
    [  5  -2  0 ]  ------->  [  0  0  0 ].

It follows that v2 is a free variable and v1 is a leading variable with v1 = 2v2/5. The eigenvectors corresponding to the eigenvalue λ = 6 are thus the non-zero vectors of the form v = (2v2/5, v2) = v2(2/5, 1).
[Side note: By multiplying (2/5, 1) by 5, we could also say that the eigenvectors here are the multiples of (2, 5), which is a slightly cleaner answer.]

The multiplicity of an eigenvalue λ as a root of the characteristic polynomial
is called its algebraic multiplicity, and the sum of all algebraic multiplicities
of eigenvalues of an n × n matrix is no greater than n (and it exactly equals
n if we consider complex eigenvalues). The following remarkable (and non-
obvious) fact guarantees that the sum of geometric multiplicities is similarly
no larger than n:

! The geometric multiplicity of an eigenvalue is never larger than


its algebraic multiplicity.
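As a numerical cross-check of Example A.1.1 (my own sketch, not part of the text), NumPy recovers the same eigenvalues and verifies Av = λv for each eigenpair:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [5.0, 4.0]])

    eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are (unit) eigenvectors
    print(eigvals)                        # -1 and 6, possibly in the other order

    for lam, v in zip(eigvals, eigvecs.T):
        assert np.allclose(A @ v, lam * v)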

A.1.7 Diagonalization
One of the primary reasons that eigenvalues and eigenvectors are of interest
is that they let us diagonalize a matrix. That is, they give us a way of de-
composing a matrix A ∈ Mn into the form A = PDP−1 , where P is invertible
and D is diagonal. If the entries of P and D can be chosen to be real, we say
that A is diagonalizable over R. However, some real matrices A can only be
diagonalized if we allow P and D to have complex entries (see the upcoming
discussion of complex numbers in Appendix A.3). In that case, we say that A
is diagonalizable over C (but not over R).

Theorem A.1.5 (Characterization of Diagonalizability). Suppose A ∈ Mn. The following are equivalent:
a) A is diagonalizable over R (or C).
b) There exists a basis of Rn (or Cn) consisting of eigenvectors of A.
c) The sum of geometric multiplicities of the real (or complex) eigenvalues of A is n.
Furthermore, in any diagonalization A = PDP−1, the eigenvalues of A are the diagonal entries of D and the corresponding eigenvectors are the columns of P in the same order.

To get a feeling for why diagonalizations are useful, notice that computing
a large power of a matrix directly is quite cumbersome, as matrix multiplication
itself is an onerous process, and repeating it numerous times only makes it
worse. However, once we have diagonalized a matrix we can compute an
arbitrary power of it via just two matrix multiplications, since

    Ak = (PDP−1)(PDP−1) · · · (PDP−1)                       (k copies of PDP−1)
       = PD(P−1P)D(P−1P) · · · (P−1P)DP−1 = PDkP−1,         (each internal P−1P = I)

and Dk is trivial to compute (for diagonal matrices, matrix multiplication is the


same as entrywise multiplication).
" # 314
Example A.1.2 Diagonalize the matrix A = 1 2 and then compute A .
Diagonalizing 5 4
a Matrix
Solution:
We showed in Example A.1.1 that this matrix has eigenvalues λ1 = −1
and λ2 = 6 corresponding to the eigenvectors v1 = (−1, 1) and v2 =
We could have also (2, 5), respectively. Following the suggestion of Theorem A.1.5, we stick
chosen v2 = (2/5, 1),
these eigenvalues along the diagonal of a diagonal matrix D, and the
but our choice here
is prettier. Which corresponding eigenvectors as columns into a matrix P in the same order:
multiple of each " # " # " #
eigenvector we λ1 0 −1 0   −1 2
choose does not D= = and P = v1 | v2 = .
matter. 0 λ2 0 6 1 5

It is straightforward to check that P is invertible, so Theorem A.1.5 tells


us that A is diagonalized by this D and P.
The inverse of a 2 × 2 To compute A314 , we first compute P−1 to be
matrix is simply
" #−1 " #
a b −1 1 −5 2
= P = .
c d 7 1 1
" #
1 d −b
det(A) −c a
. We can then compute powers of A via powers of the diagonal matrix D in

this diagonalization:
" #" #" #
314 314 −1 1 −1 2 (−1)314 0 −5 2
Since 314 is even, A = PD P =
7 1 5 0 6314 1 1
(−1)314 = 1.
" #
1 5 + 2 · 6314 −2 + 2 · 6314
= .
7 −5 + 5 · 6314 2 + 5 · 6314
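The same diagonalization can be verified with exact arithmetic in SymPy (an illustrative sketch of mine; Matrix.diagonalize() returns P and D with A = PDP−1):

    import sympy as sp

    A = sp.Matrix([[1, 2], [5, 4]])
    P, D = A.diagonalize()                   # exact rational arithmetic

    assert P * D * P.inv() == A
    assert P * D**314 * P.inv() == A**314    # matches the closed form above

    # e.g. the top-left entry is (5 + 2*6**314)/7
    assert (A**314)[0, 0] == (5 + 2 * 6**314) / sp.Integer(7)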

We close this section with a reminder of a useful connection between


diagonalizability of a matrix and the multiplicities of its eigenvalues:

! If a matrix is diagonalizable then, for each of its eigenvalues,


the algebraic and geometric multiplicities coincide.

A.2 Polynomials and Beyond

A (single-variable) polynomial is a function f : R → R of the form


f (x) = a p x p + a p−1 x p−1 + · · · + a2 x2 + a1 x + a0 ,
where a0, a1, a2, . . . , ap−1, ap ∈ R are constants (called the coefficients of f). The highest power of x appearing in the polynomial is called its degree (so, for example, the polynomial f above has degree p).
[Side note: It’s also OK to consider polynomials f : C → C with coefficients in C, or even polynomials f : F → F, where F is any field (see Appendix A.4).]

More generally, a multivariate polynomial is a function f : Rn → R that can be written as a linear combination of products of powers of the n input variables. That is, there exist scalars {a_{j1,...,jn}}, only finitely many of which are non-zero, such that

    f(x1, . . . , xn) = ∑_{j1,...,jn} a_{j1,...,jn} x1^j1 · · · xn^jn   for all x1, . . . , xn ∈ R.

The degree of a multivariate polynomial is the largest sum j1 + · · · + jn of


exponents in any of its terms. For example, the polynomials
f (x, y) = 3x2 y − 2y2 + xy + x − 3 and g(x, y, z) = 4x3 yz4 + 2xyz2 − 4y
have degrees equal to 2 + 1 = 3 and 3 + 1 + 4 = 8, respectively.

A.2.1 Monomials, Binomials and Multinomials


Two of the simplest types of polynomials are monomials, which consist of just
a single term (like 3x2 or −4x2 y3 z), and binomials, which consist of two terms
added together (like 3x2 + 4x or 2xy2 − 7x3y). If we take a power of a binomial like (x + y)p then it is often useful to know what the coefficient of each term in the resulting polynomial is once everything is expanded out. For example, it is straightforward to check that

    (x + y)2 = x2 + 2xy + y2   and   (x + y)3 = x3 + 3x2y + 3xy2 + y3.

[Side note: In other words, a binomial is a sum of two monomials.]

More generally, we can notice that when we compute


    (x + y)p = (x + y)(x + y) · · · (x + y)        (p times)

by repeatedly applying distributivity of multiplication over addition, we get 2 p


terms in total—we can imagine constructing these terms by choosing either
x or y in each copy of (x + y), so we can make a choice between 2 objects p
times.
However, when we expand (x + y) p in this way, many of the resulting
terms are duplicates of each other due to commutativity of multiplication. For
example, x2 y = xyx = yx2 . To count the number of times that each term is
repeated, we notice that for each integer 0 ≤ k ≤ p, the number of times that
x p−k yk is repeated equals the number of ways that we can choose y from k of
the p factors of the form (x + y) (and thus x from the other p − k factors). This
quantity is denoted by (p choose k) and is given by the formula

    (p choose k) = p! / (k! (p − k)!),   where p! = p · (p − 1) · · · 3 · 2 · 1.

[Side note: p! is called “p factorial”.]

This quantity (p choose k) is called a binomial coefficient (due to this connection
with binomials) and is read as “p choose k”. The following theorem states our
observations about powers of binomials and binomial coefficients a bit more
formally.

Theorem A.2.1 (Binomial Theorem). Suppose p ≥ 0 is an integer and x, y ∈ R. Then

    (x + y)p = ∑_{k=0}^{p} (p choose k) x^(p−k) y^k.

Example A.2.1 (Using the Binomial Theorem). Expand the polynomial (x + 2y)4.

Solution:
We start by computing the binomial coefficients that we will need:

    (4 choose 0) = 1,  (4 choose 1) = 4,  (4 choose 2) = 6,  (4 choose 3) = 4,  (4 choose 4) = 1.

[Side note: Binomial coefficients are symmetric: (p choose k) = (p choose p−k).]

Now that we have these quantities, we can just plug them into the binomial theorem (but we are careful to replace y in the theorem with 2y):

    (x + 2y)4 = ∑_{k=0}^{4} (4 choose k) x^(4−k) (2y)^k
              = x4(2y)0 + 4x3(2y)1 + 6x2(2y)2 + 4x1(2y)3 + x0(2y)4
              = x4 + 8x3y + 24x2y2 + 32xy3 + 16y4.

[Side note: As a strange edge case, we note that 0! = 1.]
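This expansion is easy to confirm symbolically (a short illustrative sketch of mine):

    import math
    import sympy as sp

    # Binomial coefficients (4 choose k) for k = 0, ..., 4
    print([math.comb(4, k) for k in range(5)])   # [1, 4, 6, 4, 1]

    x, y = sp.symbols('x y')
    print(sp.expand((x + 2*y)**4))
    # -> x**4 + 8*x**3*y + 24*x**2*y**2 + 32*x*y**3 + 16*y**4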

More generally, we can use a similar technique to come up with an explicit


formula for the coefficients of powers of any polynomials—not just binomials.
In particular, when expanding out an expression like
    (x1 + x2 + · · · + xn)^p = (x1 + x2 + · · · + xn) · · · (x1 + x2 + · · · + xn),        (p times)
we find that the number of times that a particular term x1^k1 x2^k2 · · · xn^kn occurs equals the number of ways that we can choose x1 from the p factors a total of k1 times, x2 from the p factors a total of k2 times, and so on. This quantity is called a multinomial coefficient, and it is given by

    (p choose k1, k2, . . . , kn) = p! / (k1! k2! · · · kn!).

[Side note: Each of the p factors must have some xj chosen from it, so k1 + k2 + · · · + kn = p.]

These observations lead to the following generalization of the binomial


theorem to multinomials (i.e., polynomials in general).

Theorem A.2.2 (Multinomial Theorem). Suppose p ≥ 0 and n ≥ 1 are integers and x1, x2, . . . , xn ∈ R. Then

    (x1 + x2 + · · · + xn)^p = ∑_{k1+···+kn=p} (p choose k1, k2, . . . , kn) x1^k1 x2^k2 · · · xn^kn.

Example A.2.2 (Using the Multinomial Theorem). Expand the polynomial (x + 2y + 3z)3.

Solution:
We start by computing the multinomial coefficients that we will need:

    (3 choose 1,1,1) = 6,   (3 choose 3,0,0) = (3 choose 0,3,0) = (3 choose 0,0,3) = 1,   and
    (3 choose 0,1,2) = (3 choose 1,0,2) = (3 choose 1,2,0) = (3 choose 0,2,1) = (3 choose 2,0,1) = (3 choose 2,1,0) = 3.

Now that we have these quantities, we can just plug them into the multinomial theorem (but we replace y in the theorem with 2y and z with 3z):

    (x + 2y + 3z)3 = ∑_{k+ℓ+m=3} (3 choose k, ℓ, m) x^k (2y)^ℓ (3z)^m
                   = 6x1(2y)1(3z)1
                     + x3(2y)0(3z)0 + x0(2y)3(3z)0 + x0(2y)0(3z)3
                     + 3x0(2y)1(3z)2 + 3x1(2y)0(3z)2 + 3x1(2y)2(3z)0
                     + 3x0(2y)2(3z)1 + 3x2(2y)0(3z)1 + 3x2(2y)1(3z)0
                   = 36xyz + x3 + 8y3 + 27z3
                     + 54yz2 + 27xz2 + 12xy2 + 36y2z + 9x2z + 6x2y.

[Side note: Expanding out (x1 + · · · + xn)^p results in (n+p−1 choose p) terms in general (this fact can be proved using the method of Remark 3.1.2). In this example we have n = p = 3, so we get (5 choose 3) = 10 terms.]
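The same kind of symbolic check works for the multinomial expansion, and it also confirms the term count mentioned in the side note (again just an illustrative sketch):

    import sympy as sp

    x, y, z = sp.symbols('x y z')
    expanded = sp.expand((x + 2*y + 3*z)**3)
    print(expanded)            # includes 36*x*y*z, 27*z**3, etc.
    print(len(expanded.args))  # 10 terms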

A.2.2 Taylor Polynomials and Taylor Series


Since we know so much about polynomials, it is often easier to approximate a
function via a polynomial and then analyze that polynomial than it is to directly
analyze the function that we are actually interested in. The degree-p polynomial
Tp : R → R that best approximates a function f : R → R near some value x = a
is called its degree-p Taylor polynomial, and it has the form

    Tp(x) = ∑_{n=0}^{p} f^(n)(a)/n! · (x − a)^n
          = f(a) + f′(a)(x − a) + f″(a)/2 · (x − a)2 + · · · + f^(p)(a)/p! · (x − a)^p,

as long as f has p derivatives (i.e., f^(p) exists).

[Side note: We sometimes say that these Taylor polynomials are centered at a.]
[Side note: Be somewhat careful with the n = 0 term: (x − a)0 = 1 for all x.]
For example, the lowest-degree polynomials that best approximate f (x) =
sin(x) near x = 0 are

    T1(x) = T2(x) = x,
    T3(x) = T4(x) = x − x3/3!,
    T5(x) = T6(x) = x − x3/3! + x5/5!,   and
    T7(x) = T8(x) = x − x3/3! + x5/5! − x7/7!,
which are graphed in Figure A.5. In particular, notice that T1 is simply the
tangent line at the point (0, 0), and T3 , T5 , and T7 provide better and better
approximations of f (x) = sin(x).

[Side note: In general, the graph of T0 is a horizontal line going through (a, f(a)) and the graph of T1 is the tangent line going through (a, f(a)).]

Figure A.5: The graphs of the first few Taylor polynomials of f(x) = sin(x). As the degree of the Taylor polynomial increases, the approximation gets better.
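To see the improving approximation numerically, here is a small sketch of mine using SymPy's series expansion; the evaluation point x = 1 is an arbitrary choice:

    import sympy as sp

    x = sp.symbols('x')
    for degree in (1, 3, 5, 7):
        # Taylor polynomial of sin(x) centered at 0, truncated after x**degree
        T = sp.series(sp.sin(x), x, 0, degree + 1).removeO()
        err = abs((sp.sin(x) - T).subs(x, 1).evalf())
        print(degree, T, err)
    # The error at x = 1 shrinks from about 0.16 for T1 to about 2.7e-6 for T7.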

To get a rough feeling for why Taylor polynomials provide the best approxi-
mation of a function near a point, notice that the first p derivatives of Tp (x) and
of f (x) agree with each other at x = a, so the local behavior of these functions
is very similar. The following theorem pins this idea down more precisely and
says that the difference between f (x) and Tp (x) behaves like (x − a) p+1 , which
is smaller than any degree-p polynomial when x is sufficiently close to a:

Theorem A.2.3 (Taylor’s Theorem). Suppose f : R → R is a (p + 1)-times differentiable function. For all x and a, there is a number c between x and a such that

    f(x) − Tp(x) = f^(p+1)(c)/(p + 1)! · (x − a)^(p+1).

A Taylor series is what we get if we consider the limit of the Taylor


polynomials as p goes to infinity. The above theorem tells us that, as long as
f (p+1) (x) does not get too large near x = a, this limit converges to f (x) near
x = a, so we have

    f(x) = lim_{p→∞} Tp(x) = ∑_{n=0}^{∞} f^(n)(a)/n! · (x − a)^n
         = f(a) + f′(a)(x − a) + f″(a)/2 · (x − a)2 + f‴(a)/3! · (x − a)3 + · · ·
For example, ex , cos(x), and sin(x) have the following Taylor series representa-
tions, which converge for all x ∈ R:
    ex = ∑_{n=0}^{∞} x^n/n! = 1 + x + x2/2 + x3/3! + x4/4! + x5/5! + · · · ,
    cos(x) = ∑_{n=0}^{∞} (−1)^n x^(2n)/(2n)! = 1 − x2/2 + x4/4! − x6/6! + x8/8! − · · · ,   and
    sin(x) = ∑_{n=0}^{∞} (−1)^n x^(2n+1)/(2n+1)! = x − x3/3! + x5/5! − x7/7! + x9/9! − · · ·

A.3 Complex Numbers

Many tasks in advanced linear algebra are based on the concept of eigenvalues
and eigenvectors, and thus require us to be able to find roots of polynomials
(after all, the eigenvalues of a matrix are exactly the roots of its characteris-
tic polynomial). Because many polynomials do not have real roots, much of
linear algebra works out much more cleanly if we instead work with “com-
plex” numbers—a more general type of number with the property that every
polynomial has complex roots (see the upcoming Theorem A.3.1).
To construct the complex numbers, we start by letting i be an object with
the property that i2 = −1. It is clear that i cannot be a real number, but we
nonetheless think of it like a number anyway, as we will see that we can
manipulate it much like we manipulate real numbers. We call any real scalar
multiple of i like 2i or −(7/3)i an imaginary number, and they obey the same laws of arithmetic that we might expect them to (e.g., 2i + 3i = 5i and (3i)2 = 32i2 = −9).
[Side note: The term “imaginary” number is absolutely awful. These numbers are no more make-believe than real numbers are—they are both purely mathematical constructions and they are both useful.]

We then let C, the set of complex numbers, be the set

    C = { a + bi : a, b ∈ R }

in which addition and multiplication work exactly as they do for real numbers,
as long as we keep in mind that i2 = −1. We call a the real part of a + bi, and
it is sometimes convenient to denote it by Re(a + bi) = a. We similarly call b
its imaginary part and denote it by Im(a + bi) = b.

Remark A.3.1 (Yes, We Can Do That). It might seem extremely strange at first that we can just define a new number i and start doing arithmetic with it. However, this is perfectly fine, and we do this type of thing all the time—one of the beautiful things about mathematics is that we can define whatever we like. However, for that
mathematics is that we can define whatever we like. However, for that
definition to actually be useful, it should mesh well with other definitions
and objects that we use.
Complex numbers are useful because they let us do certain things that
we cannot do in the real numbers (such as find roots of all polynomials),


and furthermore they do not break any of the usual laws of arithmetic.
That is, we still have all of the properties like ab = ba, a + b = b + a, and
a(b + c) = ab + ac for all a, b, c ∈ C that we would expect “numbers” to
have. That is, C is a “field” (just like R)—see Appendix A.4.
By way of contrast, suppose that we tried to do something similar to
add a new number that lets us divide by zero. If we let ε be a number with the property that ε × 0 = 1 (i.e., we are thinking of ε as 1/0 much like we think of i as √−1), then we have

1 = ε × 0 = ε × (0 + 0) = (ε × 0) + (ε × 0) = 1 + 1 = 2.

We thus cannot work with such a number without breaking at least one of
the usual laws of arithmetic.

As mentioned earlier, some polynomials like p(x) = x2 − 2x + 2 do not


have real roots. However, one of the most remarkable theorems concerning polynomials says that every polynomial, as long as it is not a constant function, has complex roots.
[Side note: For example, p(x) = x2 − 2x + 2 has roots 1 ± i.]

Theorem A.3.1 (Fundamental Theorem of Algebra). Every non-constant polynomial has at least one complex root.

Equivalently, the fundamental theorem of algebra tells us that every poly-
nomial can be factored as a product of linear terms, as long as we allow the
roots/factors to be complex numbers. That is, every degree-n polynomial p can
be written in the form

p(x) = a(x − r1 )(x − r2 ) · · · (x − rn ),

where a is the coefficient of xn in p and r1 , r2 , . . ., rn are the (potentially


complex, and not necessarily distinct) roots of p.
The proof of this theorem is outside the scope of this book, so we direct the
interested reader to a book like [FR97] for various proofs and a discussion of
the types of techniques needed to establish it.

A.3.1 Basic Arithmetic and Geometry


For the most part, arithmetic involving complex numbers works simply how we
might expect it to. For example, to add two complex numbers together we just
add up their real and imaginary parts: (a + bi) + (c + di) = (a + c) + (b + d)i,
which is hopefully not surprising (it is completely analogous to how we can
group and add real numbers and vectors). For example,

(3 + 7i) + (2 − 4i) = (3 + 2) + (7 − 4)i = 5 + 3i.

Similarly, to multiply two complex numbers together we just distribute


parentheses like we do when we multiply real numbers together, and we make
use of the fact that i2 = −1:

    (a + bi)(c + di) = ac + bci + adi + bdi2 = (ac − bd) + (ad + bc)i.

[Side note: Here we are making use of the fact that multiplication distributes over addition.]

For example,

(3 + 7i)(4 + 2i) = (12 − 14) + (6 + 28)i = −2 + 34i.
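Python's built-in complex type mirrors these rules directly (a quick illustrative check of mine):

    z, w = 3 + 7j, 4 + 2j
    print(z + w)                 # (7+9j)
    print(z * w)                 # (-2+34j), matching (ac - bd) + (ad + bc)i
    print((3 + 7j) + (2 - 4j))   # (5+3j), the addition example above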

Much like we think of R as a line, we can think of C as a plane, called the


complex plane. The set of real numbers takes the place of the x-axis (which we
call the real axis) and the set of imaginary numbers takes the place of the y-axis
(which we call the imaginary axis), so that the number a + bi has coordinates
(a, b) on that plane, as in Figure A.6.

Figure A.6: The complex plane is a representation of the set C of complex numbers.

We can thus think of C much like we think of R2 (i.e., we think of the


complex number a + bi ∈ C as the vector (a, b) ∈ R2 ), but with a multiplica-
tion operation that we do not have on R2 . With this in mind, we define the
magnitude of a complex number a + bi to be the quantity
[Side note: The magnitude of a real number is simply its absolute value: |a| = √(a2).]

    |a + bi| = √(a2 + b2),
which is simply the length of the associated vector (a, b) (i.e., it is the distance
between a + bi and the origin in the complex plane).

A.3.2 The Complex Conjugate


One particularly important operation on complex numbers that does not have
any natural analog on the set of real numbers is complex conjugation, which
negates the imaginary part of a complex number and leaves its real part alone.
We denote this operation by putting a horizontal bar over the complex number it is being applied to so that, for example, the conjugates of 3 + 4i, 5 − 2i, 3i, and 7 are 3 − 4i, 5 + 2i, −3i, and 7, respectively.

Geometrically, complex conjugation corresponds to reflecting a number in the


complex plane through the real axis, as in Figure A.7.
Algebraically, complex conjugation is useful since many other common operations involving complex numbers can be expressed in terms of it. For example:
[Side note: Notice that applying the complex conjugate twice simply undoes it (the conjugate of a + bi is a − bi, whose conjugate is a + bi again). After all, reflecting a number or vector twice returns it to where it started.]
• The magnitude of a complex number z = a + bi can be written in terms of the product of z with its complex conjugate: |z|2 = z z̄, since

    z z̄ = (a + bi)(a − bi) = a2 + b2 = |a + bi|2 = |z|2.

Figure A.7: Complex conjugation reflects a complex number through the real axis.

• The previous point tells us that we can multiply any complex number
by another one (its complex conjugate) to get a real number. We can
make use of this trick to come up with a method of dividing by complex
numbers:
[Side note: In the first step here, we just cleverly multiply by 1 so as to make the denominator real.]

    (a + bi)/(c + di) = ((a + bi)/(c + di)) · ((c − di)/(c − di))
                      = ((ac + bd) + (bc − ad)i)/(c2 + d2)
                      = (ac + bd)/(c2 + d2) + ((bc − ad)/(c2 + d2)) i.

• The real and imaginary parts of a complex number z = a + bi can be


computed via
    Re(z) = (z + z̄)/2   and   Im(z) = (z − z̄)/(2i),

since

    (z + z̄)/2 = ((a + bi) + (a − bi))/2 = 2a/2 = a = Re(z)   and
    (z − z̄)/(2i) = ((a + bi) − (a − bi))/(2i) = 2bi/(2i) = b = Im(z).

A.3.3 Euler’s Formula and Polar Form


Since we can think of complex numbers as points in the complex plane, we
can specify them via their length and direction rather than via their real and
imaginary parts. In particular, we can write every complex number in the form
z = |z|u, where |z| is the magnitude of z and u is a number on the unit circle in
the complex plane.
By recalling that every point on the unit circle in R2 has coordinates of
the form (cos(θ), sin(θ)) for some θ ∈ [0, 2π), we see that every point on the unit circle in the complex plane can be written in the form cos(θ) + i sin(θ), as illustrated in Figure A.8.
[Side note: The notation θ ∈ [0, 2π) means that θ is between 0 (inclusive) and 2π (non-inclusive).]

It follows that we can write every complex number in the form z =
|z|(cos(θ ) + i sin(θ )). However, we can simplify this expression somewhat
[Side note: It is worth noting that |cos(θ) + i sin(θ)|2 = cos2(θ) + sin2(θ) = 1, so these numbers really are on the unit circle.]
Figure A.8: Every number on the unit circle in the complex plane can be written in
the form cos(θ ) + i sin(θ ) for some θ ∈ [0, 2π).

by using the following Taylor series that we saw in Appendix A.2.2:

    ex = 1 + x + x2/2 + x3/3! + x4/4! + x5/5! + · · · ,
    cos(x) = 1 − x2/2! + x4/4! − · · · ,   and
    sin(x) = x − x3/3! + x5/5! − · · · .

[Side note: Just like these Taylor series converge for all x ∈ R, they also converge for all x ∈ C.]

In particular, if we plug x = iθ into the Taylor series for ex, then we see that

    eiθ = 1 + iθ − θ2/2 − iθ3/3! + θ4/4! + iθ5/5! − · · ·
        = (1 − θ2/2 + θ4/4! − · · ·) + i(θ − θ3/3! + θ5/5! − · · ·),

which equals cos(θ) + i sin(θ). We have thus proved the remarkable fact, called Euler’s formula, that

    eiθ = cos(θ) + i sin(θ)   for all θ ∈ [0, 2π).

[Side note: In other words, Re(eiθ) = cos(θ) and Im(eiθ) = sin(θ).]

By making use of Euler’s formula, we see that we can write every complex
number z ∈ C in the form z = reiθ , where r is the magnitude of z (i.e., r = |z|)
and θ is the angle that z makes with the positive real axis (see Figure A.9). This
is called the polar form of z, and we can convert back and forth between the
polar form z = reiθ and its Cartesian form z = a + bi via the formulas
    a = r cos(θ)        r = √(a2 + b2)
    b = r sin(θ)        θ = sign(b) arccos(a/√(a2 + b2)).

[Side note: In the formula for θ, sign(b) = ±1, depending on whether b is positive or negative. If b < 0 then we get −π < θ < 0, which we can put in the interval [0, 2π) by adding 2π to it.]

There is no simple way to “directly” add two complex numbers that are in polar form, but multiplication is quite straightforward: (r1eiθ1)(r2eiθ2) = (r1r2)ei(θ1+θ2). We can thus think of complex numbers as stretched rotations—
multiplying by reiθ stretches numbers in the complex plane by a factor of r and
rotates them counter-clockwise by an angle of θ .
[Figure A.9: Every complex number can be written in Cartesian form a + bi and also in polar form reiθ.]

Because the polar form of complex numbers works so well with multiplication, we can use it to easily compute powers and roots. Indeed, repeatedly multiplying a complex number in polar form by itself gives (reiθ)n = rneinθ for all positive integers n ≥ 1. We thus see that every non-zero complex number has at least n distinct n-th roots (and in fact, exactly n distinct n-th roots). In particular, the n roots of z = reiθ are

    r1/n eiθ/n,  r1/n ei(θ+2π)/n,  r1/n ei(θ+4π)/n,  . . . ,  r1/n ei(θ+2(n−1)π)/n.

[Side note: You should try to convince yourself that raising any of these numbers to the n-th power results in reiθ. Use the fact that e2πi = e0i = 1.]

Example A.3.1 (Computing Complex Roots). Compute all 3 cube roots of the complex number z = i.

Solution:
We start by writing z in its polar form z = eπi/2 , from which we see
that its 3 cube roots are

eπi/6 , e5πi/6 , and e9πi/6 .

We can convert these three cube roots into Cartesian coordinates if we


want to, though this is perhaps not necessary:
    eπi/6 = (√3 + i)/2,   e5πi/6 = (−√3 + i)/2,   and   e9πi/6 = −i.

Geometrically, the cube root eπi/6 = (√3 + i)/2 of i lies on the unit
circle and has an angle one-third as large as that of i, and the other two
cube roots are evenly spaced around the unit circle:


[Side note: This definition of principal roots requires us to make sure that θ ∈ [0, 2π).]
Among the n distinct n-th roots of a complex number z = reiθ, we call r1/n eiθ/n its principal n-th root, which we denote by z1/n. The principal root
of a complex number is the one with the smallest angle so that, for example, if
z is a positive real number (i.e., θ = 0) then its principal roots are positive real
numbers as well. Similarly, the principal square root of a complex number is the
one in the upper half of the complex plane (for example, the principal square
root of −1 is i, not −i), and we showed in Example A.3.1 that the principal cube root of z = eπi/2 = i is z1/3 = eπi/6 = (√3 + i)/2.
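The cube roots from Example A.3.1 can be reproduced with Python's cmath module (an illustrative sketch of mine; for z = i the principal branch used by Python agrees with the convention above):

    import cmath
    import math

    z = 1j                                 # z = i = e^{i pi/2}
    r, theta = abs(z), cmath.phase(z)      # polar form: r = 1, theta = pi/2

    # The three cube roots r^{1/3} e^{i(theta + 2k*pi)/3} for k = 0, 1, 2
    roots = [r**(1/3) * cmath.exp(1j * (theta + 2*k*math.pi) / 3) for k in range(3)]
    print(roots)        # approx (0.866+0.5j), (-0.866+0.5j), and -1j

    print(z**(1/3))     # principal cube root, approx (0.866+0.5j) = (sqrt(3)+i)/2
    assert all(abs(w**3 - z) < 1e-12 for w in roots)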

Remark A.3.2 (Complex Numbers as 2 × 2 Matrices). Complex numbers can actually be represented by real matrices. If we define the 2 × 2 matrix

    J = [ 0 −1 ; 1 0 ],

then it is straightforward to verify that J2 = −I (recall that we think of the identity matrix I as the “matrix version” of the number 1). It follows that addition and multiplication of the matrices of the form

    aI + bJ = [ a −b ; b a ]

work exactly the same as they do for complex numbers. We can thus think of the complex number a + bi as “the same thing” as the 2 × 2 matrix aI + bJ.
[Side note: In the language of Section 1.3.1, the set of 2 × 2 matrices of this special form is “isomorphic” to C. However, they are not just isomorphic as vector spaces, but even as fields (i.e., the field operations of Definition A.4.1 are preserved).]

This representation of complex numbers perhaps makes it clearer why they act like rotations when multiplying by them: we can rewrite aI + bJ as

    aI + bJ = [ a −b ; b a ] = √(a2 + b2) [ cos(θ) −sin(θ) ; sin(θ) cos(θ) ],

where we recognize √(a2 + b2) as the magnitude of the complex number a + bi, and the matrix on the right as the one that rotates R2 counter-clockwise by an angle of θ (here, θ is chosen so that cos(θ) = a/√(a2 + b2) and sin(θ) = b/√(a2 + b2)).
and sin(θ ) = b/ a + b ).2
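A tiny NumPy check of this matrix representation (my own illustrative sketch, not from the book):

    import numpy as np

    I = np.eye(2)
    J = np.array([[0.0, -1.0],
                  [1.0, 0.0]])

    assert np.allclose(J @ J, -I)    # J^2 = -I, just like i^2 = -1

    def as_matrix(z):
        """Represent the complex number z = a + bi as the real matrix aI + bJ."""
        return z.real * I + z.imag * J

    z, w = 3 + 7j, 4 + 2j
    assert np.allclose(as_matrix(z) @ as_matrix(w), as_matrix(z * w))
    assert np.allclose(as_matrix(z) + as_matrix(w), as_matrix(z + w))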

A.4 Fields

Introductory linear algebra is usually carried out via vectors and matrices made
up of real numbers. However, most of its results and methods carry over just
fine if we instead make use of complex numbers, for example (in fact, much of
linear algebra works out better when using complex numbers rather than real
numbers).
This raises the question of what sets of “numbers” we can make use of in
linear algebra. In fact, it even raises the question of what a “number” is in the
first place. We saw in Remark A.3.2 that we can think of complex numbers
as certain special 2 × 2 matrices, so it seems natural to wonder why we think
of them as “numbers” even though we do not think of general matrices in the same way.
[Side note: The upcoming definition generalizes R in much the same way that Definition 1.1.1 generalizes Rn.]

This appendix answers these questions. We say that a field is a set in which
addition and multiplication behave in the same way that they do in the set of

real or complex numbers, so it is reasonable to think of it as being made up of


“numbers” or “scalars”.

Definition A.4.1 (Field). Let F be a set with two operations called addition and multiplication. We write the addition of a ∈ F and b ∈ F as a + b, and the multiplication of a and b as ab.
If the following conditions hold for all a, b, c ∈ F then F is called a field:
a) a + b ∈ F (closure under addition)
b) a + b = b + a (commutativity of addition)
c) (a + b) + c = a + (b + c) (associativity of addition)
d) There is a zero element 0 ∈ F such that 0 + a = a.
e) There is an additive inverse −a ∈ F such that a + (−a) = 0.
f) ab ∈ F (closure under multiplication)
g) ab = ba (commutativity of multiplication)
h) (ab)c = a(bc) (associativity of multiplication)
i) There is a unit element 1 ∈ F such that 1a = a.
j) If a ≠ 0, there is a multiplicative inverse 1/a ∈ F such that a(1/a) = 1.
k) a(b + c) = (ab) + (ac) (distributivity)
[Side note: Notice that the first five properties concern addition, the next five properties concern multiplication, and the final property combines them.]

The sets R and C of real and complex numbers are of course fields (we
defined fields specifically so as to mimic R and C, after all). Similarly, it is
straightforward to show that the set Q of rational numbers—real numbers of
the form p/q where p and q ≠ 0 are integers—is a field. After all, if a = p1/q1 and b = p2/q2 are rational then so are a + b = (p1q2 + p2q1)/(q1q2) and ab = (p1p2)/(q1q2), and all other field properties follow immediately from the corresponding facts about real numbers.
[Side note: This is analogous to how we only need to prove closure to show that a subspace is a vector space (Theorem 1.1.2)—all other properties are inherited from the parent structure.]

Another less obvious example of a field is the set Z2 of numbers with arithmetic modulo 2. That is, Z2 is simply the set {0, 1}, but with the understanding that addition and multiplication of these numbers work as follows:

    0 + 0 = 0    0 + 1 = 1    1 + 0 = 1    1 + 1 = 0,   and
    0 × 0 = 0    0 × 1 = 0    1 × 0 = 0    1 × 1 = 1.

In other words, these operations work just as they normally do, with the
exception that 1 + 1 = 0 instead of 1 + 1 = 2 (since there is no “2” in Z2 ).
Phrased differently, we can think of this as a simplified form of arithmetic that
only keeps track of whether or not a number is even or odd (with 0 for even
and 1 for odd). All field properties are straightforward to show for Z2 , as long
as we understand that −1 = 1.
More generally, if p is a prime number then the set Z p = {0, 1, 2, . . . , p − 1}
with arithmetic modulo p is a field. For example, addition and multiplication in
Z5 = {0, 1, 2, 3, 4} work as follows:

     +  | 0  1  2  3  4            ×  | 0  1  2  3  4
    ----+---------------          ----+---------------
     0  | 0  1  2  3  4            0  | 0  0  0  0  0
     1  | 1  2  3  4  0            1  | 0  1  2  3  4
     2  | 2  3  4  0  1            2  | 0  2  4  1  3
     3  | 3  4  0  1  2            3  | 0  3  1  4  2
     4  | 4  0  1  2  3            4  | 0  4  3  2  1

Again, all of the field properties of Z p are straightforward to show, with the
exception of property (j)—the existence of multiplicative inverses. The standard
way to prove this property is to use a theorem called Bézout’s identity, which says that, for all 0 < a < p, we can find integers x and y such that ax + py = 1. We can rearrange this equation to get ax = 1 − py, so that we can choose 1/a = x. For example, in Z5 we have 1/1 = 1, 1/2 = 3, 1/3 = 2, and 1/4 = 4 (with 1/2 = 3, for example, corresponding to the fact that the integer equation 2x + 5y = 1 can be solved when x = 3).
[Side note: Property (j) is why we need p to be prime. For example, Z4 is not a field since 2 does not have a multiplicative inverse.]
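These modular inverses can be computed directly in Python (a small illustrative sketch of mine; pow(a, -1, p) requires Python 3.8 or later):

    p = 5
    inverses = {a: pow(a, -1, p) for a in range(1, p)}
    print(inverses)              # {1: 1, 2: 3, 3: 2, 4: 4}

    # Bezout's identity via the extended Euclidean algorithm: ax + py = 1
    def extended_gcd(a, b):
        if b == 0:
            return a, 1, 0
        g, x, y = extended_gcd(b, a % b)
        return g, y, x - (a // b) * y

    g, x, y = extended_gcd(2, 5)
    print(g, x, y)               # 1 -2 1, and 2*(-2) + 5*1 = 1, so 1/2 = -2 = 3 (mod 5)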

A.5 Convexity

In linear algebra, the sets of vectors that arise most frequently are subspaces
(which are closed under linear combinations) and the functions that arise most
frequently are linear transformations (which preserve linear combinations). In
this appendix, we briefly discuss the concept of convexity, which is a slight
weakening of these linearity properties that can be thought of as requiring the
set or graph of the function to “not bend inward” rather than “be flat”.

A.5.1 Convex Sets


Roughly speaking, a convex subset S of a real vector space V is one for which every line segment between two points in S is also in S. The following definition pins down this idea algebraically:
[Side note: By a “real vector space” we mean a vector space over R.]

Definition A.5.1 (Convex Set). Suppose V is a real vector space. A subset S ⊆ V is called convex if

    (1 − t)v + tw ∈ S   whenever v, w ∈ S and 0 ≤ t ≤ 1.

To convince ourselves that this definition captures the geometric idea involving line segments between two points, we observe that if t = 0 then (1 − t)v + tw = v, if t = 1 then (1 − t)v + tw = w, and as t increases from 0 to 1 the quantity (1 − t)v + tw travels along the line segment from v to w (see Figure A.10).
[Side note: If we required cv + dw ∈ S for all c, d ∈ R then S would be a subspace. The restriction that the coefficients are non-negative and add up to 1 makes convex sets more general.]
Figure A.10: A set is convex (a) if all line segments between two points in that set
are contained within it, and it is non-convex (b) otherwise.

Another way of thinking about convex sets is as ones with no holes or sides
that bend inward—their sides can be flat or bend outward only. Some standard
examples of convex sets include
• Any interval in R.
• Any subspace of a real vector space.
• Any quadrant in R2 or orthant in Rn.
• The set of positive (semi)definite matrices in MHn (see Section 2.2).
[Side note: Recall that even though the entries of Hermitian matrices can be complex, MHn is a real vector space (see Example 1.1.5).]

The following theorem pins down a very intuitive, yet extremely powerful, idea—if two convex sets are disjoint (i.e., do not contain any vectors in common) then we must be able to fit a line/plane/hyperplane (i.e., a flat surface that
cuts V into two halves) between them.
The way that we formalize this idea is based on linear forms (see Sec-
tion 1.3), which are linear transformations from V to R. We also make use of
open sets, which are sets of vectors that do not include any of their boundary
points (standard examples include any subinterval of R that does not include
its endpoints, and the set of positive definite matrices in MHn ).

Theorem A.5.1 (Separating Hyperplane Theorem). Suppose V is a finite-dimensional real vector space and S, T ⊆ V are disjoint convex subsets of V, and T is open. Then there is a linear form f : V → R and a scalar c ∈ R such that

    f(y) > c ≥ f(x)   for all x ∈ S and y ∈ T.

In particular, the line/plane/hyperplane that separates the convex sets S


and T in this theorem is {v ∈ V : f(v) = c}. This separating hyperplane may intersect the boundary of S and/or T, but because T is open we know that it does not intersect T itself, which guarantees the strict inequality f(y) > c (see Figure A.11).
[Side note: If c = 0 then this separating hyperplane is a subspace of V (it is null(f)). Otherwise, it is a “shifted” subspace (which are typically called affine spaces).]

Figure A.11: The separating hyperplane theorem (Theorem A.5.1) tells us (a) that
we can squeeze a hyperplane between any two disjoint convex sets. This is (b) not
necessarily possible if one of the sets is non-convex.

While the separating hyperplane theorem itself perhaps looks a bit abstract,
it can be made more concrete by choosing a particular vector space V and
thinking about how we can represent linear forms acting on V. For example,
if V = Rn then Theorem 1.3.3 tells us that for every linear form f : Rn → R
there exists a vector v ∈ Rn such that f(x) = v · x for all x ∈ Rn. It follows that, in this setting, if S and T are disjoint convex sets and T is open then there exists a vector v ∈ Rn and a scalar c ∈ R such that

    v · y > c ≥ v · x   for all x ∈ S and y ∈ T.

[Side note: In fact, v is exactly the unique (up to scaling) vector that is orthogonal to the separating hyperplane.]

Similarly, we know from Exercise 1.3.7 that for every linear form f : Mn →
R, there exists a matrix A ∈ Mn such that f (X) = tr(AX) for all X ∈ Mn . The
separating hyperplane theorem then tells us (under the usual hypotheses on S
and T ) that
tr(AY ) > c ≥ tr(AX) for all X ∈ S and Y ∈ T.

A.5.2 Convex Functions


Convex functions generalize linear forms in much the same way that convex
sets generalize subspaces—they weaken the linearity (i.e., “flat”) condition in
favor of a convexity (i.e., “bend out”) condition.

Definition A.5.2 (Convex Function). Suppose V is a real vector space and S ⊆ V is a convex set. We say that a function f : S → R is convex if

    f((1 − t)v + tw) ≤ (1 − t)f(v) + t f(w)   whenever v, w ∈ S and 0 ≤ t ≤ 1.

In the case when V = S = R, the above definition can be interpreted very


naturally in terms of the function’s graph: f is convex if and only if every line
segment between two points on its graph lies above the graph itself. Equiva-
lently, f is convex if and only if the region above its graph is a convex set (see
Figure A.12).

[Side note: The region above a function’s graph is called its epigraph.]

Figure A.12: A function is convex if and only if (a) every line segment between two
points on its graph lies above its graph, if and only if (b) the region above its graph
is convex.

Some standard examples of convex functions include:


• The function f (x) = ex on R.
• Given any real p ≥ 1, the function f (x) = x p on [0, ∞).
• Any linear form on a real vector space.
• Any norm (see Section 1.D) on a real vector space, since the triangle
inequality and absolute homogeneity tell us that
    ‖(1 − t)v + tw‖ ≤ ‖(1 − t)v‖ + ‖tw‖ = (1 − t)‖v‖ + t‖w‖.

[Side note: Some introductory calculus courses refer to convex and concave functions as concave up and concave down, respectively.]
If f is a convex function then −f is called a concave function. Geometrically, a concave function is one whose graph bends the opposite way of a convex function—a line segment between two points on its graph is always below the graph, and the region below its graph is a convex set (see Figure A.13). Examples of concave functions include
• Given any real b > 1, the base-b logarithm f(x) = logb(x) on (0, ∞).
• Given any real 0 < p ≤ 1, the function f(x) = x^p on [0, ∞). For example, if p = 1/2 then we see that f(x) = √x is concave.
• Any linear form on a real vector space. In fact, linear forms are exactly the functions that are both convex and concave.

Figure A.13: A function is concave if and only if every line segment between two
points on its graph lies below its graph, if and only if the region below its graph is
convex.
The following theorem provides what is probably the simplest method of
checking whether or not a function of one real variable is convex, concave, or
neither.

Theorem A.5.2 (Convexity of Twice-Differentiable Functions). Suppose f : (a, b) → R is twice differentiable and f″ is continuous. Then:
a) f is convex if and only if f″(x) ≥ 0 for all x ∈ (a, b), and
b) f is concave if and only if f″(x) ≤ 0 for all x ∈ (a, b).

For example, we can see that the function f (x) = x4 is convex on R since
f″(x) = 12x2 ≥ 0, whereas the natural logarithm g(x) = ln(x) is concave on (0, ∞) since g″(x) = −1/x2 ≤ 0. In fact, the second derivative of g(x) = ln(x) is strictly negative, which tells us that any line between two points on its graph is strictly below its graph—it only touches the graph at the endpoints of the line segment. We call such a function strictly concave.
[Side note: Similarly, a function is strictly convex if f((1 − t)v + tw) < (1 − t)f(v) + t f(w) whenever v ≠ w and 0 < t < 1. Notice the strict inequalities.]
One of the most useful applications of the fact that the logarithm is (strictly)
concave is the following inequality that relates the arithmetic and geometric
means of a set of n real numbers:

Theorem A.5.3 (Arithmetic Mean–Geometric Mean (AM–GM) Inequality). If x1, x2, . . ., xn ≥ 0 are real numbers then

    (x1 x2 · · · xn)^(1/n) ≤ (x1 + x2 + · · · + xn)/n.

Furthermore, equality holds if and only if x1 = x2 = · · · = xn.

When n = 2, the AM–GM inequality just says that if x, y ≥ 0 then √(xy) ≤ (x + y)/2, which can be proved “directly” by multiplying out the parenthesized term in the inequality (√x − √y)^2 ≥ 0. To see why it holds when n ≥ 3, we use concavity of the natural logarithm:

    ln((x1 + x2 + · · · + xn)/n) ≥ (1/n) ∑_{j=1}^{n} ln(xj) = ∑_{j=1}^{n} ln(xj^(1/n)) = ln((x1 x2 · · · xn)^(1/n)).

(This argument only works if each xj is strictly positive, but that's okay because the AM–GM inequality is trivial if xj = 0 for some 1 ≤ j ≤ n.)

Exponentiating both sides (i.e., raising e to the power of both sides of this inequality) gives exactly the AM–GM inequality. The equality condition of the AM–GM inequality follows from the fact that the logarithm is strictly concave.
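
As a quick illustration (a numerical addition, not part of the text, assuming Python with NumPy), we can spot-check the AM–GM inequality and the logarithm trick used above on randomly chosen non-negative numbers.

    # Python sketch: spot-check the AM-GM inequality for random non-negative numbers.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=6)                         # six non-negative reals
    geometric_mean = x.prod() ** (1 / len(x))
    arithmetic_mean = x.mean()
    print(geometric_mean <= arithmetic_mean)               # True
    # The proof's logarithm trick: exp of the average of ln(x_j) is the geometric mean.
    print(np.isclose(np.exp(np.log(x).mean()), geometric_mean))   # True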
B. Additional Proofs

In this appendix, we prove some of the technical results that we made use of
throughout the main body of the textbook, but whose proofs are messy enough
(or unenlightening enough, or simply not “linear algebra-y” enough) that they
are hidden away here.

B.1 Equivalence of Norms

We mentioned near the start of Section 1.D that all norms on a finite-dimensional vector space are equivalent to each other. That is, the statement of Theorem 1.D.1 told us that, given any two norms ‖ · ‖a and ‖ · ‖b on V, there exist real scalars c, C > 0 such that

    c‖v‖a ≤ ‖v‖b ≤ C‖v‖a    for all v ∈ V.

We now prove this theorem.

Proof of Theorem 1.D.1. We first note that Exercise 1.D.28 tells us that it suffices to prove this theorem in F^n (where F = R or F = C). Furthermore, transitivity of equivalence of norms (Exercise 1.D.27) tells us that it suffices just to show that every norm on F^n is equivalent to the usual vector length (2-norm) ‖ · ‖, so the remainder of the proof is devoted to showing why this is the case. In particular, we will show that any norm ‖ · ‖a is equivalent to the usual 2-norm ‖ · ‖. (Recall that the 2-norm is ‖v‖ = √(|v1|^2 + · · · + |vn|^2), and that e1, . . ., en denote the standard basis vectors in F^n.)

To see that there is a constant C > 0 such that ‖v‖a ≤ C‖v‖ for all v ∈ F^n, we notice that

    ‖v‖a = ‖v1 e1 + · · · + vn en‖a                                      (v in standard basis)
         ≤ |v1| ‖e1‖a + · · · + |vn| ‖en‖a                               (triangle ineq. for ‖ · ‖a)
         ≤ √(|v1|^2 + · · · + |vn|^2) √(‖e1‖a^2 + · · · + ‖en‖a^2)       (Cauchy–Schwarz ineq.)
         = ‖v‖ √(‖e1‖a^2 + · · · + ‖en‖a^2).                             (formula for ‖v‖)

We can thus choose

    C = √(‖e1‖a^2 + · · · + ‖en‖a^2).

(This might look like a weird choice for C, but all that matters is that it does not depend on the choice of v.)

To see that there is similarly a constant c > 0 such that c‖v‖ ≤ ‖v‖a for all v ∈ F^n, we need to put together three key observations. First, observe that it suffices to show that c ≤ ‖v‖a whenever v is such that ‖v‖ = 1, since otherwise we can just scale v so that this is the case and both sides of the desired inequality scale appropriately.

Second, we notice that the function ‖ · ‖a is continuous on F^n, since the inequality ‖v‖a ≤ C‖v‖ tells us that if lim_{k→∞} vk = v then

    lim_{k→∞} ‖vk − v‖ = 0
      ⟹ lim_{k→∞} ‖vk − v‖a = 0     (lim_{k→∞} ‖vk − v‖a ≤ C lim_{k→∞} ‖vk − v‖ = 0)
      ⟹ lim_{k→∞} ‖vk‖a = ‖v‖a.     (reverse triangle ineq.: Exercise 1.D.6)

(Some details about limits and continuity can be found in Section 2.D.2.)

Third, we define

    Sn = {v ∈ F^n : ‖v‖ = 1} ⊂ F^n,

which we can think of as the unit circle (or sphere, or hypersphere, depending on n), as illustrated in Figure B.14, and we notice that this set is closed and bounded. (A closed and bounded subset of F^n is typically called compact.)

Figure B.14: The set Sn ⊂ Fn of unit vectors can be thought of as the unit circle, or
sphere, or hypersphere, depending on the dimension.

To put all of this together, we recall that the Extreme Value Theorem from analysis says that every continuous function on a closed and bounded subset of F^n attains its minimum value on that set (i.e., it does not just approach some minimum value). (Continuous functions also attain their maximum value on a closed and bounded set, but we only need the minimum value.) It follows that the norm ‖ · ‖a attains a minimum value (which we call c) on Sn, and since each vector in Sn is non-zero we know that c > 0. We thus conclude that c ≤ ‖v‖a whenever v ∈ Sn, which is exactly what we wanted to show. ∎
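
To make the two constants concrete, the following sketch (an illustrative addition, not part of the text, assuming Python with NumPy) takes ‖ · ‖a to be the 1-norm on R^2. The first half of the proof gives C = √(‖e1‖a^2 + ‖e2‖a^2) = √2, and c is the minimum of ‖v‖a over the unit circle S2, which we estimate here by sampling.

    # Python sketch: ||.||_a is the 1-norm on R^2, ||.|| is the usual 2-norm.
    import numpy as np

    theta = np.linspace(0, 2 * np.pi, 10001)
    circle = np.column_stack((np.cos(theta), np.sin(theta)))   # vectors with ||v|| = 1
    one_norms = np.abs(circle).sum(axis=1)                     # ||v||_a on the circle

    c, C = one_norms.min(), np.sqrt(2)
    print(c, one_norms.max(), C)    # c = 1 <= ||v||_a <= sqrt(2) = C on the circle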

B.2 Details of the Jordan Decomposition

We presented the Jordan decomposition of a matrix in Section 2.4 and presented


most of the results needed to demonstrate how to actually construct it, but we
glossed over some of the details that guaranteed that our method of construction
always works. We now fill in those details.

The first result that we present here demonstrates the claim that we made
in the middle of the proof of Theorem 2.4.5.

Theorem B.2.1 (Similarity of Block Triangular Matrices). If two matrices A ∈ Mm(C) and C ∈ Mn(C) do not have any eigenvalues in common then, for all B ∈ Mm,n(C), the following block matrices are similar:

    [ A  B ]         [ A  O ]
    [ O  C ]   and   [ O  C ].

(The m and n in the statement of this theorem are not the same as in Theorem 2.4.5. In the proof of that theorem, they were called d1 and d2, respectively, which are just any positive integers.)

Proof. Notice that if X ∈ Mm,n(C) and

    P = [ I  X ]    then    P^(-1) = [ I  -X ]
        [ O  I ]                     [ O   I ],

and

    P^(-1) [ A  O ] P = [ I  -X ] [ A  O ] [ I  X ] = [ A  AX - XC ]
           [ O  C ]     [ O   I ] [ O  C ] [ O  I ]   [ O      C   ].

It thus suffices to show that there exists a matrix X ∈ Mm,n(C) such that AX − XC = B. (We provide another proof that such an X exists in Exercise 3.1.8.) Since B ∈ Mm,n(C) is arbitrary, this is equivalent to showing that the linear transformation T : Mm,n(C) → Mm,n(C) defined by T(X) = AX − XC is invertible, which we do by showing that T(X) = O implies X = O.
To this end, suppose that T (X) = O, so AX = XC. Simply using associativ-
ity of matrix multiplication then shows that this implies

    A^2 X = A(AX) = A(XC) = (AX)C = (XC)C = XC^2,    so
    A^3 X = A(A^2 X) = A(XC^2) = (AX)C^2 = (XC)C^2 = XC^3,

and so on so that A^k X = XC^k for all k ≥ 1. Taking linear combinations of


these equations then shows that p(A)X = X p(C) for any polynomial, so if
we choose p = pA to be the characteristic polynomial of A then we see that
pA (A)X = X pA (C). Since the Cayley–Hamilton theorem (Theorem 2.1.3) tells
us that pA (A) = O, it follows that X pA (C) = O. Our next goal is to show that
pA (C) is invertible, which will complete the proof since we can then multiply
the equation X pA (C) = O on the right by pA (C)−1 to get X = O.
Well, if we denote the eigenvalues of A (listed according to algebraic
multiplicity) by λ1 , λ2 , . . ., λm , then

pA (C) = (C − λ1 I)(C − λ2 I) · · · (C − λm I).

Notice that, for each 1 ≤ j ≤ m, the factor C − λ j I is invertible since λ j is not


an eigenvalue of C. It follows that their product pA (C) is invertible as well, as
desired. 
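
The equation AX − XC = B that drives this proof is a Sylvester equation, and it can be solved numerically. The sketch below is an illustrative addition (not part of the text) and assumes Python with NumPy and SciPy; SciPy's solve_sylvester routine solves AX + XB = Q, so we pass −C as its second argument, and we then check the block similarity from the proof on a small example.

    # Python sketch: solve AX - XC = B when A and C have disjoint eigenvalues, then
    # check that conjugating the block-diagonal matrix by P = [[I, X], [O, I]]
    # produces the block-triangular matrix [[A, B], [O, C]], as in the proof above.
    import numpy as np
    from scipy.linalg import solve_sylvester

    A = np.array([[1.0, 2.0], [0.0, 3.0]])    # eigenvalues 1 and 3
    C = np.array([[4.0, 1.0], [0.0, 5.0]])    # eigenvalues 4 and 5 (disjoint from A's)
    B = np.array([[1.0, -2.0], [0.5, 7.0]])
    O = np.zeros((2, 2))

    X = solve_sylvester(A, -C, B)             # solves AX + X(-C) = B, i.e. AX - XC = B
    P = np.block([[np.eye(2), X], [O, np.eye(2)]])
    block_diag = np.block([[A, O], [O, C]])
    block_tri = np.block([[A, B], [O, C]])
    print(np.allclose(np.linalg.inv(P) @ block_diag @ P, block_tri))   # True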
The next result that we present fills in a gap in the proof of Theorem 2.4.5
(the Jordan decomposition) itself. In our method of constructing this decompo-
sition, we needed a certain set that we built to be linearly independent. We now
verify that it is.

Theorem B.2.2 (Linear Independence of Jordan Chains). The set Bk from the proof of Theorem 2.4.1 is linearly independent.

Proof. Throughout this proof, we use the same notation as in the proof of Theorem 2.4.1, but before proving anything specific about Bk, we prove the following claim:

Claim: If S1, S2 are subspaces of C^n and x ∈ C^n is a vector for which x ∉ span(S1 ∪ S2), then span({x} ∪ S1) ∩ span(S2) = span(S1) ∩ span(S2).

(Note that span(S1 ∪ S2) = S1 + S2 is the sum of S1 and S2, which was introduced in Exercise 1.1.19.)

To see why this holds, notice that if {y1, . . . , ys} and {z1, . . . , zt} are bases of S1 and S2, respectively, then we can write every member of span({x} ∪ S1) ∩ span(S2) in the form

    bx + ∑_{i=1}^{s} ci yi = ∑_{i=1}^{t} di zi.

Moving the ci yi terms over to the right-hand side shows that bx ∈ span(S1 ∪ S2). Since x ∉ span(S1 ∪ S2), this implies b = 0, so this member of span({x} ∪ S1) ∩ span(S2) is in fact in span(S1) ∩ span(S2).
With that out of the way, we now inductively prove that Bk is linearly
independent, assuming that Bk+1 is linearly independent (with the base case
of Bµ = {} trivially being linearly independent). Recall that Ck is constructed
specifically to be linearly independent. For convenience, we start by giving
names to the members of Ck :

Bk+1 = {v1 , v2 , . . . , v p } and Ck \Bk+1 = {w1 , w2 , . . . , wq }.

Now suppose that some linear combination of the members of Bk equals zero:

    ∑_{i=1}^{p} ci vi + ∑_{i=1}^{q} ∑_{j=0}^{k−1} di,j (A − λI)^j wi = 0.        (B.2.1)

Our goal is to show that this implies ci = di,j = 0 for all i and j. To this end, notice that if we multiply the linear combination (B.2.1) on the left by (A − λI)^{k−1} then we get

    0 = (A − λI)^{k−1} (∑_{i=1}^{p} ci vi) + ∑_{i=1}^{q} ∑_{j=0}^{k−1} di,j (A − λI)^{j+k−1} wi
      = (A − λI)^{k−1} (∑_{i=1}^{p} ci vi + ∑_{i=1}^{q} di,0 wi),                (B.2.2)

where the second equality comes from the fact that (A − λI)^{j+k−1} wi = 0 whenever j ≥ 1, since Ck\Bk+1 ⊆ null((A − λI)^k) ⊆ null((A − λI)^{j+k−1}).

In other words, we have shown that ∑_{i=1}^{p} ci vi + ∑_{i=1}^{q} di,0 wi ∈ null((A − λI)^{k−1}), which we claim implies di,0 = 0 for all 1 ≤ i ≤ q. To see why this is the case, we repeatedly use the Claim from the start of this proof with x ranging over the members of Ck\Bk+1 (i.e., the vectors that we added one at a time in Step 2 of the proof of Theorem 2.4.1), S1 = span(Bk+1), and S2 = null((A − λI)^{k−1}) to see that

    span(Ck) ∩ null((A − λI)^{k−1}) = span(Bk+1) ∩ null((A − λI)^{k−1}) ⊆ span(Bk+1).    (B.2.3)

It follows that ∑_{i=1}^{p} ci vi + ∑_{i=1}^{q} di,0 wi ∈ span(Bk+1). Since Ck is linearly independent, this then tells us that di,0 = 0 for all i.

We now repeat the above argument, but instead of multiplying the linear combination (B.2.1) on the left by (A − λI)^{k−1}, we multiply it on the left by smaller powers of A − λI. For example, if we multiply on the left by (A − λI)^{k−2} then we get

    0 = (A − λI)^{k−2} (∑_{i=1}^{p} ci vi + ∑_{i=1}^{q} di,0 wi) + (A − λI)^{k−1} (∑_{i=1}^{q} di,1 wi)
      = (A − λI)^{k−2} (∑_{i=1}^{p} ci vi) + (A − λI)^{k−1} (∑_{i=1}^{q} di,1 wi),

where the first equality comes from the fact that (A − λI)^{j+k−2} wi = 0 whenever j ≥ 2, and the second equality follows from the fact that we already showed that di,0 = 0 for all 1 ≤ i ≤ q.

In other words, we have shown that ∑_{i=1}^{p} ci vi + (A − λI) ∑_{i=1}^{q} di,1 wi ∈ null((A − λI)^{k−2}). If we notice that null((A − λI)^{k−2}) ⊆ null((A − λI)^{k−1}) then it follows that ∑_{i=1}^{p} ci vi + ∑_{i=1}^{q} di,1 wi ∈ null((A − λI)^{k−1}), so repeating the same argument from earlier shows that di,1 = 0 for all 1 ≤ i ≤ q.

Repeating this argument shows that di,j = 0 for all i and j, so the linear combination (B.2.1) simply says that ∑_{i=1}^{p} ci vi = 0. Since Bk+1 is linearly independent, this immediately implies ci = 0 for all 1 ≤ i ≤ p, which finally shows that Bk is linearly independent as well. ∎
The final proof of this subsection shows that the way we defined analytic
functions of matrices (Definition 2.4.4) is actually well-defined. That is, we
now show that this definition leads to the formula of Theorem 2.4.6 regardless
of which value a we choose to center the Taylor series at, whereas we originally
just proved that theorem in the case when a = λ . Note that in the statement
of this theorem and its proof, we let Nn ∈ Mk (C) denote the matrix with
ones on its n-th superdiagonal and zeros elsewhere, just as in the proof of
Theorem 2.4.6.

Theorem B.2.3 (Functions of Jordan Blocks). Suppose Jk(λ) ∈ Mk(C) is a Jordan block and f : C → C is analytic on some open set D containing λ. Then for all a ∈ D we have

    ∑_{n=0}^{∞} (f^{(n)}(a)/n!) (Jk(λ) − aI)^n = ∑_{j=0}^{k−1} (f^{(j)}(λ)/j!) Nj.

(The left-hand side of the concluding line of this theorem is just f(Jk(λ)), and the right-hand side is just the large matrix formula given by Theorem 2.4.6.)

Proof. Since Jk(λ) = λI + N1, we can rewrite the sum on the left as

    ∑_{n=0}^{∞} (f^{(n)}(a)/n!) (λI + N1 − aI)^n = ∑_{n=0}^{∞} (f^{(n)}(a)/n!) ((λ − a)I + N1)^n.

Using the binomial theorem (Theorem A.2.1) then shows that this sum equals

    ∑_{n=0}^{∞} (f^{(n)}(a)/n!) (∑_{j=0}^{k} (n choose j) (λ − a)^{n−j} Nj).

Swapping the order of these sums and using the fact that (n choose j) = n!/((n − j)! j!) then puts it into the form

    ∑_{j=0}^{k} (1/j!) (∑_{n=0}^{∞} (f^{(n)}(a)/(n − j)!) (λ − a)^{n−j}) Nj.        (B.2.4)

(We can swap the order of summation here because the sum over n converges for each j.)

Since f is analytic we know that f(λ) = ∑_{n=0}^{∞} (f^{(n)}(a)/n!) (λ − a)^n. Furthermore, f^{(j)} must also be analytic, and replacing f by f^{(j)} in this Taylor series shows that f^{(j)}(λ) = ∑_{n=0}^{∞} (f^{(n)}(a)/(n − j)!) (λ − a)^{n−j}. Substituting this expression into Equation (B.2.4) shows that

    ∑_{j=0}^{k} (1/j!) (∑_{n=0}^{∞} (f^{(n)}(a)/(n − j)!) (λ − a)^{n−j}) Nj = ∑_{j=0}^{k} (f^{(j)}(λ)/j!) Nj,

(keep in mind that Nk = O, so we can discard the last term in this sum), which completes the proof. ∎
which completes the proof. 

B.3 Strong Duality for Semidefinite Programs

When discussing the duality properties of semidefinite programs in Section 3.C.3,


one of the main results that we presented gave some conditions under which
strong duality holds (this was Theorem 3.C.2). We now prove this theorem
by making use of the properties of convex sets that were outlined in Ap-
pendix A.5.1.
Throughout this section, we use the same notation and terminology as in Section 3.C.3. In particular, A ⪰ B means that A − B is positive semidefinite, A ≻ B means that A − B is positive definite, and a primal/dual pair of semidefinite programs is a pair of optimization problems of the following form, where B ∈ M_m^H, C ∈ M_n^H, and Φ : M_n^H → M_m^H is linear (here X ∈ M_n^H and Y ∈ M_m^H are the matrix variables):

             Primal                            Dual
    maximize:    tr(CX)              minimize:    tr(BY)
    subject to:  Φ(X) ⪯ B            subject to:  Φ*(Y) ⪰ C
                 X ⪰ O                            Y ⪰ O
X O Y O

We start by proving a helper theorem that does most of the heavy lifting for strong duality. We note that we use the closely-related Exercises 2.2.20 and 2.2.21 in this proof, which tell us that A ⪰ O if and only if tr(AB) ≥ 0 whenever B ⪰ O, as well as some other closely-related variants of this fact involving positive definite matrices.

Theorem B.3.1 (Farkas Lemma). Suppose B ∈ M_m^H and Φ : M_n^H → M_m^H is linear. The following are equivalent:
a) There does not exist O ⪯ X ∈ M_n^H such that Φ(X) ≺ B.
b) There exists a non-zero O ⪯ Y ∈ M_m^H such that Φ*(Y) ⪰ O and tr(BY) ≤ 0.

Proof. To see that (b) implies (a), suppose that Y is as described in (b). Then if a matrix X as described in (a) did exist, we would have

    0 < tr((B − Φ(X))Y)          (since Y ≠ O, Y ⪰ O, B − Φ(X) ≻ O)
      = tr(BY) − tr(XΦ*(Y))      (linearity of trace, definition of adjoint)
      ≤ 0,                       (since X ⪰ O, Φ*(Y) ⪰ O, tr(BY) ≤ 0)

which is impossible, so such an X does not exist. (In particular, the first and second inequalities here use Exercise 2.2.20.)



For the opposite implication, we notice that if (a) holds then the set

    S = {B − Φ(X) : X ∈ M_n^H is positive semidefinite}

is disjoint from the set T of positive definite matrices in M_m^H. Since T is open and both of these sets are convex, the separating hyperplane theorem (Theorem A.5.1) tells us that there is a linear form f : M_m^H → R and a scalar c ∈ R such that

    f(Z) > c ≥ f(B − Φ(X))

for all O ⪯ X ∈ M_n^H and all O ≺ Z ∈ M_m^H. We know from Exercise 1.3.8 that there exists a matrix Y ∈ M_m^H such that f(A) = tr(AY) for all A ∈ M_m^H, so the above string of inequalities can be written in the form

    tr(ZY) > c ≥ tr((B − Φ(X))Y)          (separating hyperplane)
               = tr(BY) − tr(XΦ*(Y))      (linearity of trace, definition of adjoint)

for all O ⪯ X ∈ M_n^H and all O ≺ Z ∈ M_m^H.

It follows that Y ≠ O (otherwise we would have 0 > c ≥ 0, which is impossible). We also know from Exercise 2.2.21(b) that the inequality tr(ZY) > c for all Z ≻ O implies Y ⪰ O, and furthermore that we can choose c = 0. It follows that 0 = c ≥ tr(BY) − tr(XΦ*(Y)), so tr(XΦ*(Y)) ≥ tr(BY) for all X ⪰ O. Using Exercise 2.2.21(a) then shows that Φ*(Y) ⪰ O and that tr(BY) ≤ 0, which completes the proof. ∎
With the Farkas Lemma under our belt, we can now prove the main result
of this section, which we originally stated as Theorem 3.C.2, but we re-state
here for ease of reference.

Theorem B.3.2 (Strong Duality for Semidefinite Programs). Suppose that both problems in a primal/dual pair of SDPs are feasible, and at least one of them is strictly feasible. Then the optimal values of those SDPs coincide. Furthermore,
a) if the primal problem is strictly feasible then the optimal value is attained in the dual problem, and
b) if the dual problem is strictly feasible then the optimal value is attained in the primal problem.

Proof. We just prove part (a) of the theorem, since part (b) then follows from swapping the roles of the primal and dual problems.

Let α be the optimal value of the primal problem (which is not necessarily attained). If we define Φ̃ : M_n^H → M_{m+1}^H and B̃ ∈ M_{m+1}^H by

    Φ̃(X) = [ −tr(CX)    0   ]        and        B̃ = [ −α   0 ]
            [    0     Φ(X) ]                         [  0   B ],

then there does not exist O ⪯ X ∈ M_n^H such that Φ̃(X) ≺ B̃ (since otherwise we would have Φ(X) ≺ B and tr(CX) > α, which would mean that the objective function of the primal problem can be made larger than its optimal value α). Applying Theorem B.3.1 to B̃ and Φ̃ then tells us that there exists a non-zero O ⪯ Ỹ ∈ M_{m+1}^H such that Φ̃*(Ỹ) ⪰ O and tr(B̃Ỹ) ≤ 0. If we write

    Ỹ = [ y  ∗ ]        then        Φ̃*(Ỹ) = Φ*(Y) − yC.
        [ ∗  Y ]

(We use asterisks (∗) to denote entries of Ỹ that are irrelevant to us.)

The inequality tr(B̃Ỹ) ≤ 0 then tells us that tr(BY) ≤ yα, and the constraint Φ̃*(Ỹ) ⪰ O tells us that Φ*(Y) ⪰ yC. (Keep in mind that Ỹ ⪰ O implies Y ⪰ O and y ≥ 0.) If y ≠ 0 then it follows that Y/y is a feasible point of the dual problem that produces a value in the objective function no larger than α (and thus necessarily equal to α, by weak duality), which is exactly what we wanted to find. All that remains is to show that y ≠ 0, so that we know that Y/y exists.

To pin down this final detail, we notice that if y = 0 then Φ*(Y) = Φ̃*(Ỹ) ⪰ O and tr(BY) = tr(B̃Ỹ) ≤ 0. (If y = 0, we know that Y ≠ O since Ỹ ≠ O.) Applying Theorem B.3.1 (to B and Φ this time) shows that the primal problem is not strictly feasible, which is a contradiction that completes the proof. ∎
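
The following sketch (an addition, not part of the text) sets up a tiny primal/dual pair of the above form with the simplest choice Φ(X) = X, whose adjoint is also the identity map, and solves both problems numerically. It assumes the third-party convex optimization package CVXPY, which is not used in the book; with both problems strictly feasible, the two optimal values coincide, as Theorem B.3.2 predicts.

    # Python sketch using the (assumed available) CVXPY package; Phi(X) = X, so Phi*(Y) = Y.
    import numpy as np
    import cvxpy as cp

    B = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive definite
    C = np.array([[1.0, 0.0], [0.0, 2.0]])    # positive semidefinite

    X = cp.Variable((2, 2), symmetric=True)   # primal: maximize tr(CX), X <= B, X >= O
    primal = cp.Problem(cp.Maximize(cp.trace(C @ X)), [B - X >> 0, X >> 0])

    Y = cp.Variable((2, 2), symmetric=True)   # dual: minimize tr(BY), Y >= C, Y >= O
    dual = cp.Problem(cp.Minimize(cp.trace(B @ Y)), [Y - C >> 0, Y >> 0])

    print(primal.solve(), dual.solve())       # both optimal values are approximately 7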
C. Selected Exercise Solutions

Section 1.1: Vector Spaces and Subspaces


1.1.1 (a) Not a subspace. For example, if c = −1 and 1.1.3 (a) Not a basis, since it does not span all of M2 .
v = (1, 1) then cv = (−1, −1) is not in the set. For example,
(c) Not a subspace. For example, I and −I are both " # " # " # " #!
invertible, but I + (−I) = O, which is not invert- 1 0 1 0 1 1 1 0
/ span
∈ , , ,
ible. 0 0 0 1 0 1 1 1
(e) Not a subspace. For example, f (x) = x is in the since any linear combination of the three matri-
set, but 2 f is not since 2 f (2) = 4 6= 2. ces on the right has both diagonal entries equal
(g) Is a subspace. If f and g are even functions to each other.
and c ∈ R then f + g and c f are even functions (c) Is a basis. To see this, we note that
too, since ( f + g)(x) = f (x) + g(x) = f (−x) + " # " #
g(−x) = ( f + g)(−x) and (c f )(x) = c f (x) = 1 0 1 1
c1 + c2
c f (−x) = (c f )(−x). 0 1 0 1
(i) Is a subspace. If f and g are functions in this " # " #
set, then ( f + g)00 (x) − 2( f + g)(x) = ( f 00 (x) − 1 0 1 1
+ c3 + c4
2 f (x)) + (g00 (x) − 2g(x)) = 0 + 0 = 0. Similarly, 1 1 1 0
if c ∈ R then (c f )00 (x) − 2(c f )(x) = c( f 00 (x) −  
c + c2 + c3 + c4 c2 + c4
2 f (x)) = c0 = 0. = 1 .
c3 + c4 c1 + c2 + c3
(k) Is a subspace. If A and B both have trace
0 then tr(cA) = ctr(A) = 0 and tr(A + B) = To make this equal to a given matrix
" #
tr(A) + tr(B) = 0 as well. a b
,
1.1.2 (a) Linearly independent, because every set con- c d
taining exactly one non-zero vector is linearly we solve the linear system to get the unique
independent (if there is only one vector v then solution c1 = 2a − b − c − d, c2 = −a + b + d,
the only way for c1 v = 0 to hold is if c1 = 0). c3 = −a + c + d, c4 = a − d. This shows that
(c) Linearly dependent, since sin2 (x)+ cos2 (x)=1. this set of matrices spans all of M2 . Fur-
(e) Linearly dependent, since the angle-sum thermore, since this same linear system has
trigonometric identities tell us that a unique solution when a = b = c = d = 0, we
see that this set is linearly independent, so it is
sin(x + 1) = cos(1) sin(x) + sin(1) cos(x).
a basis.
We have thus written sin(x + 1) as a linear com- (e) Not a basis, since the set is linearly dependent.
bination of sin(x) and cos(x) (since cos(1) and For example,
sin(1) are just scalars).
x2 − x = (x2 + 1) − (x + 1).
(g) Linearly independent. To see this, consider the
equation (g) Is a basis. We can use the same technique
x x 2 x as in Example 1.1.15 to see that this set is
c1 e + c2 xe + c3 x e = 0. linearly independent. Specifically, we want
This equation has to be true no matter what to know if there exist (not all zero) scalars
value of x we plug in. Plugging in x = 0 tells c0 , c1 , c2 , . . . , cn such that
us that c1 = 0. Plugging in x = 1 (and us- c0 +c1 (x−1)+c2 (x−1)2 +· · ·+cn (x−1)n = 0.
ing c1 = 0) tells us that c2 e + c3 e = 0, so
c2 + c3 = 0. Plugging in x = 2 (and using Plugging in x = 1 shows that c0 = 0, and then
c1 = 0) tells us that 2c2 e2 + 4c3 e2 = 0, so taking derivatives and plugging in x = 1 shows
c2 + 2c3 = 0. This is a system of two linear that c1 = c2 = · · · = cn = 0 as well, so the set
equations in two unknowns, and it is straight- is linearly independent. The fact that this set
forward to check that its only solution is spans P follows either from the fact that we
c2 = c3 = 0. Since we already showed that can build a Taylor series for polynomials cen-
c1 = 0 too, the functions are linearly indepen- tered at x = 1, or from the fact that in a linear
dent. combination of the form
c0 + c1 (x − 1) + c2 (x − 1)2 + · · · + cn (x − 1)n

we can first choose cn to give us whatever coef- 1.1.14 (a) We check the two closure properties from
ficient of xn we want, then choose cn−1 to give Theorem 1.1.2: if f , g ∈ P O then ( f +
us whatever coefficient of xn−1 we want, and g)(−x) = f (−x) + g(−x) = − f (x) − g(x) =
so on. −( f + g)(x), so f + g ∈ P O too, and if c ∈ R
1.1.4 (b) True. Since V must contain a zero vector 0, we then (c f )(−x) = c f (−x) = −c f (x), so c f ∈
can just define W = {0}. P O too.
(d) False. For example, let V = R2 and let B = (b) We first notice that {x, x3 , x5 , . . .} ⊂ P O . This
{(1, 0)},C = {(2, 0)}. set is linearly independent since it is a subset
(f) False. For example, if V = P and B = of the linearly independent set {1, x, x2 , x3 , . . .}
{1, x, x2 , . . .} then span(B) = P, but there is from Example 1.1.16. To see that it spans P O ,
no finite subset of B whose span is all of P. we notice that if
(h) False. The set {0} is linearly dependent (but f (x) = a0 + a1 x + a2 x2 + a3 x3 + · · · ∈ P O
every other single-vector set is indeed linearly
independent). then f (x) − f (−x) = 2 f (x), so
2 f (x) = f (x) − f (−x)
1.1.5 Property (i) of vector spaces tells us that (−c)v =
= (a0 + a1 x + a2 x2 + a3 x3 + · · · )
(−1)(cv), and Theorem 1.1.1(b) tells us that
(−1)(cv) = −(cv). − (a0 − a1 x + a2 x2 − a3 x3 + · · · )
= 2(a1 x + a3 x3 + a5 x5 + · · · ).
1.1.7 All 10 of the defining properties from Definition 1.1.1
are completely straightforward and follow imme- It follows that f (x) = a1 x +a3 x3 +a5 x5 +· · · ∈
diately from the corresponding properties of F. For span(x, x3 , x5 , . . .), so {x, x3 , x5 , . . .} is indeed
example v + w = w + v (property (b)) for all v, w ∈ FN a basis of P O .
because v j + w j = w j + v j for all j ∈ N.
1.1.17 (a) We have to show that the following two proper-
1.1.8 If A, B ∈ MSn and c ∈ F then (A + B)T = AT + BT = ties hold for S1 ∩ S2 : (a) if v, w ∈ S1 ∩ S2 then
A + B, so A + B ∈ MSn and (cA)T = cAT = cA, so v + w ∈ S1 ∩ S2 as well, and (b) if v ∈ S1 ∩ S2
cA ∈ MSn . and c ∈ F then cv ∈ S1 ∩ S2 .
For property (a), we note that v+w ∈ S1 (since
1.1.10 (a) C is a subspace of F because if f , g are con- S1 is a subspace) and v + w ∈ S2 (since S2 is
tinuous and c ∈ R then f + g and c f are also a subspace). It follows that v + w ∈ S1 ∩ S2 as
continuous (both of these facts are typically well.
proved in calculus courses). Property (b) is similar: we note that cv ∈ S1
(b) D is a subspace of F because if f , g are differ- (since S1 is a subspace) and cv ∈ S2 (since S2
entiable and c ∈ R then f + g and c f are also is a subspace). It follows that cv ∈ S1 ∩ S2 as
differentiable. In particular, ( f + g)0 = f 0 + g0 well.
and (c f )0 = c f 0 (both of these facts are typi- (b) If V = R2 and S1 = {(x, 0) : x ∈ R}, S2 =
cally proved in calculus courses). {(0, y) : y ∈ R} then S1 ∪ S2 is the set of points
with at least one of their coordinates equal to 0.
1.1.11 If S is closed under linear combinations then cv ∈ S It is not a subspace since (1, 0), (0, 1) ∈ S1 ∪S2
and v + w ∈ S whenever c ∈ F and v, w ∈ S simply but (1, 0) + (0, 1) = (1, 1) 6∈ S1 ∪ S2 .
because cv and v + w are both linear combinations of
members of S, so S is a subspace of V. 1.1.19 (a) S1 + S2 consists of all vectors of the form
Conversely, if S is a subspace of V and v1 , . . . , vk ∈ S (x, 0, 0) + (0, y, 0) = (x, y, 0), which is the xy-
then c j v j ∈ S for each 1 ≤ j ≤ k. Then repeat- plane.
edly using closure under vector addition gives (b) We claim that MS2 + MsS 2 is all of M2 . To
c1 v1 + c2 v2 ∈ S, so (c1 v1 + c2 v2 ) + c3 v3 ∈ S, and so verify this claim, we need to show that we can
on to c1 v1 + c2 v2 + · · · + ck vk ∈ S. write every 2 × 2 matrix as a sum of a symmet-
ric and a skew-symmetric matrix. This can be
1.1.13 The fact that {E1,1 , E1,2 , . . . , Em,n } spans Mm,n is done as# follows:
" " # " #
clear: every matrix A ∈ Mm,n can be written in the a b
=
1 2a b+c
+
1 0 b−c
form c d 2 b+c 2d 2 c−b 0
m n
A= ∑ [Note: We discuss a generalization of this fact
∑ ai, j Ei, j . in Example 1.B.3 and Remark 1.B.1.]
i=1 j=1
Linear independence of {E1,1 , E1,2 , . . . , Em,n } follows (c) We check that the two properties of Theo-
from the fact that if rem 1.1.2 are satisfied by S1 + S2 . For prop-
m n erty (a), notice that if v, w ∈ S1 + S2 then there
∑ ∑ ci, j Ei, j = O exist v1 , w1 ∈ S1 and v2 , w2 ∈ S2 such that
i=1 j=1 v = v1 + v2 and w = w1 + w2 . Then
then   v + w = (v1 + v2 ) + (w1 + w2 )
c1,1 c1,2 ··· c1,n
 c2,1 c2,2 ··· c2,n  = (v1 + w1 ) + (v2 + w2 ) ∈ S1 + S2 .
 

 .. ..

..  = O, Similarly, if c ∈ F then
..
 . . . .  cv = c(v1 + v2 ) = cv1 + cv2 ∈ S1 + S2 ,
cm,1 cm,2 ··· cm,n
since cv1 ∈ S1 and cv2 ∈ C2 .
so ci, j = 0 for all i, j.

1.1.20 If every vector in V can be written as a linear combina- (c) Notice that the k-th derivative of xk is the scalar
tion of the members of B, then B spans V by definition, function k!, and every higher-order derivative
so we only need to show linear independence of B. To of xk is 0. It follows that W (x) is an upper
this end, suppose that v ∈ V can be written as a linear triangular matrix with 1, 1!, 2!, 3!, . . . on its di-
combination of the members of B in exactly one way: agonal, so det(W (x)) = 1 · 1! · 2! · · · n! 6= 0.
(d) We start by showing that det(W (x)) = 0 for all
v = c1 v1 + c2 v2 + · · · + ck vk
x. First, we note that f20 (x) = 2|x| for all x. To
for some v1 , v2 , . . . , vk ∈ B. If B were linearly depen- see this, split it into three cases: If x > 0 then
dent, then there must be a non-zero linear combination f2 (x) = x2 , which has derivative 2x = 2|x|. If
of the form x < 0 then f2 (x) = −x2 , which has derivative
−2x = 2|x|. If x = 0 then we have to use the
0 = d1 w1 + d2 w2 + · · · + dm wm limit definition of a derivative
for some w1 , w2 , . . . , wm ∈ B. By adding these two f2 (h) − f2 (0)
linear combinations, we see that f20 (0) = lim
h→0 h
v = c1 v1 + · · · + ck vk + d1 w1 + · · · + dm wm . h|h| − 0
= lim = lim |h| = 0,
h→0 h h→0
Since not all of the d j ’s are zero, this is a different
linear combination that gives v, which contradicts which also equals 2|x| (since x = 0). With that
uniqueness. We thus conclude that B must in fact be out of the way, we can compute
linearly independent. " #!
x2 x|x|
det(W (x)) = det
1.1.21 (a) Since f1 , f2 , . . . , fn are linearly dependent, it 2x 2|x|
follows that there exist scalars c1 , c2 , . . . , cn
such that = 2x2 |x| − 2x2 |x| = 0

c1 f1 (x) + c2 f2 (x) + · · · + cn fn (x) = 0 for all x ∈ R, as desired.


Now we will show that f1 and f2 are linearly
for all x ∈ R. Taking the derivative of this equa- independent. To see this, suppose there exist
tion gives constants c1 , c2 such that
c1 f10 (x) + c2 f20 (x) + · · · + cn fn0 (x) = 0, c1 x2 + c2 x|x| = 0
taking the derivative again gives
for all x. Plugging in x = 1 gives the equation
c1 f100 (x) + c2 f200 (x) + · · · + cn fn00 (x) = 0, c1 + c2 = 0, while plugging in x = −1 gives
the equation c1 − c2 = 0. It is straightforward
and so on. We thus see that to solve this system of two equations to get
     
f1 (x) fn (x) c1 = c2 = 0, which implies that x2 and x|x| are
 0  0
 f1 (x)    linearly independent.
   fn0 (x)   0 
 00   
 f1 (x)   f 00 (x)   0

1.1.22 The Wronskian of {ea1 x , ea2 x , . . . , ean x } (see Exer-
c1   + · · · + c 
n
n  =  
  
 ..   ..   ..  cise 1.1.21) is
   .   
 .    .  a x 
e1 ea2 x ... ean x
f
(n−1)
(x) f
(n−1)
(x) 0  a e a1 x
1 n
 1 a2 e 2a x ... an e n 
a x 

 
for all x ∈ R as well. This is equivalent to say-  a2 ea1 x a22 ea2 x ... a2n ean x 
ing that the columns of W (x) are linearly de- det 

 1  =

 .. .. .. .. 
pendent for all x, which is equivalent to W (x) 
. . . .

 
not being invertible for all x, which is equiva-
lent to det(W (x)) = 0 for all x. an−1
1 e
a1 x an−1
2 e
a2 x ... an−1
n e
an x

(b) It is straightforward to compute W (x):  


1 1 ... 1
   a a2 ... an  
x ln(x) sin(x)  1 
 2 

W (x) = 1 1/x

cos(x)  .  a1 a22 ... a2n 
ea1 x ea2 x · · · ean x det 

  ,

0 −1/x2 − sin(x)  .. .. .. .. 
 . . . . 
 
Then
an−1
1 an−1
2 ... n−1
an
det(W (x)) = (− sin(x)) + 0 − sin(x)/x2
with the equality coming from using multilinearity of
+ cos(x)/x + ln(x) sin(x) − 0 the determinant to pull ea1 x out of the first column of
= sin(x)(ln(x) − 1 − 1/x2 ) + cos(x)/x. the matrix, ea2 x out of the second column, and so on.
Since the matrix on the right is the transpose of a Van-
To prove linear independence, we just need to dermonde matrix, we know that it is invertible and thus
find a particular value of x such that the above has non-zero determinant. Since ea1 x ea2 x · · · ean x 6= 0
expression is non-zero. Almost any choice of as well, we conclude that the Wronskian is non-zero,
x works: one possibility is x = π, which gives so this set of functions is linearly independent.
det(W (x)) = −1/π.
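
(A quick numerical sanity check of the Wronskian value computed in Exercise 1.1.21(b), added here for illustration and assuming Python with NumPy, is sketched below.)

    # Python sketch: evaluate the Wronskian of {x, ln(x), sin(x)} at x = pi.
    import numpy as np

    x = np.pi
    W = np.array([[x,    np.log(x),    np.sin(x)],
                  [1.0,  1.0 / x,      np.cos(x)],
                  [0.0, -1.0 / x**2,  -np.sin(x)]])
    print(np.linalg.det(W), -1.0 / np.pi)    # both approximately -0.3183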

Section 1.2: Coordinates and Linear Transformations


 
1.2.1 (a) [v]B = (1, 5)/2 (c) [v]B = (1, 4, 2, 3) 1 1 1 1
 
0 1 0 1
(d) PC←B =  
1.2.2 (b) One basis is {Ei,i : 1 ≤ i ≤ n} ∪ {Ei, j + E j,i : 0 0 1 1
1 ≤ i < j ≤ n}. This set contains n + n(n − 1 1 1 0
1)/2 = n(n + 1)/2 matrices, so that is the di-
1.2.6 (b) Since
mension of V.
(d) One basis is {Ei, j − E j,i : 1 ≤ i < j ≤ n}. This T (ax2 + bx + c) = a(3x + 1)2 + b(3x + 1) + c
set contains n(n − 1)/2 matrices, so that is the = 9ax2 + (6a + 3b)x + (a + b + c),
dimension of V.
(f) One basis is {E j, j : 1 ≤ j ≤ n} ∪ {E j,k + Ek, j : we conclude that
 
1 ≤ k < j ≤ n} ∪ {iE j,k − iEk, j : 1 ≤ j < k ≤ 9 0 0
n}. This set contains n + n(n − 1)/2 + n(n −  
[T ]D←B = 6 3 0 .
1)/2 = n2 matrices, so that is the dimension of
1 1 1
V.
(h) One basis is {(x − 3), (x − 3)2 , (x − 3)3 }, so (d) We return to this problem much later, in Theo-
dim(V) = 3. rem 3.1.7. For now, we can solve it directly to
(j) To find a basis, we plug an arbitrary polyno- get
 
mial f (x) = a0 + a1 x + a2 x2 + a3 x3 into the 0 −2 2 0
equation that defines the vector space. If we do  
−3 −3 0 2
this, we get [T ]D←B =  .
3 0 3 −2
(a0 + a1 x + a2 x2 + a3 x3 ) 0 3 −3 0
− (a1 x + 2a2 x2 + 3a3 x3 ) = 0, 1.2.8 We start by showing that this set is linearly indepen-
dent. To see why this is the case, consider the matrix
so matching up powers of x shows that a0 = 0,
equation
−a2 = 0, and −2a3 = 0. It follows that the n n
only polynomials in P 3 satisfying the indicated ∑ ∑ ci, j Si, j = O.
equation are those of the form a1 x, so a basis i=1 j=i
is {x} and V is 1-dimensional. The top row of the matrix on the left equals
(l) V is 3-dimensional, since a basis is (∑nj=1 c1, j , c1,2 , c1,3 , . . . , c1,n ), so we conclude that
{ex , e2x , e3x } itself. It spans V by definition, c1,2 = c1,3 = · · · = c1,n = 0 and thus c1,1 = 0 as well.
so we just need to show that it is a linearly A similar argument with the second row, third row,
independent set. This can be done by plug- and so on then shows that ci, j = 0 for all i and j, so B
ging in 3 particular x values in the equation is linearly independent.
c1 ex + c2 e2x + c3 e3x and then showing that the We could use a similar argument to show that B spans
only solution is c1 = c2 = c3 = 0, but we just MSn , but instead we just recall from Exercise 1.2.2
note that it follows immediately from Exer- that dim(MSn ) = n(n + 1)/2, which is exactly the
cise 1.1.22. number of matrices in B. It follows from the upcom-
ing Exercise 1.2.27(a) that B is a basis of MSn .
1.2.3 (b) Is a linear transformation, since T (c f (x)) =
c f (2x − 1) = cT ( f (x)) and T ( f (x) + g(x)) = 1.2.10 (b) Let B = {ex sin(2x), ex cos(2x)} be a basis of
f (2x − 1) + g(2x − 1) = T ( f (x)) + T (g(x)). the vector space V = span(B) and let D : V →
(d) Is a linear transformation, since RB (A +C) = V be the derivative map. We start by computing
(A + C)B = AB + CB = RB (A) + RB (C) and [D]B :
RB (cA) = (cA)B = c(AB) = cRB (A). D(ex sin(2x)) = ex sin(2x) + 2ex cos(2x),
(f) Not a linear transformation. For example,
T (iA) = (iA)∗ = −iA 6= iA = iT (A). D(ex cos(2x)) = ex cos(2x) − 2ex sin(2x).
It follows that
" #
1.2.4 (a) True. In fact, this is part of the definition of a
1 −2
vector space being finite-dimensional. [D]B = .
(c) False. P 3 is 4-dimensional (in general, P p is 2 1
(p + 1)-dimensional). The inverse of this matrix is
(e) False. Its only basis is {}, which has 0 vectors, " #
1 1 2
so dim(V) = 0. [D−1 ]B = .
(g) False. It is a basis of P (the vector space of 5 −2 1
polynomials). Since the coordinate vector of ex sin(2x) is
(i) True. In fact, it is its own inverse since (1,R0), we then know that the coordinate vector
(AT )T = A for all A ∈ Mm,n . of ex sin(2x) dx is
  " #" # " #
−1 5 −2 1 1 2 1 1 1
1  = .
1.2.5 (b) PC←B =  4 −2 −1 5 −2 1 0 5 −2
9 R
2 −1 4 It follows that ex sin(2x) dx = ex sin(2x)/5 −
2ex cos(2x)/5 +C.

1.2.12 We prove this in the upcoming Definition 3.1.3, and (b) Suppose v∈V. Then Exercise 1.2.22 tells us that
the text immediately following it.
c1 v1 + c2 v2 + · · · + cm vm = v
1.2.13 If AT = λ A then a j,i = λ ai, j for all i, j, so ai, j = if and only if
λ (λ ai, j ) = λ 2 ai, j for all i, j. If A 6= O then this im-
plies λ = ±1. The fact that the eigenspaces are MSn c1 [v1 ]B + c2 [v2 ]B + · · · + cm [vm ]B
and MsS n follows directly from the defining properties = [c1 v1 + c2 v2 + · · · + cm vm ]B = [v]B .
AT = A and AT = −A of these spaces.
In particular, we can find c1 , c2 , . . . , cm to
1.2.15 (a) range(T ) = P 2 , null(T ) = {0}. solve the first equation if and only if we
(b) The only eigenvalue of T is λ = 1, and the can find c1 , c2 , . . . , cm to solve the second
corresponding eigenspace is P 0 (the constant equation, so v ∈ span(v1 , v2 , . . . , vm ) if and
functions). only if [v]B ∈ span [v1 ]B , [v2 ]B , . . . , [vm ]B and
(c) One square root is the transformation S : P 2 → thus span(v1 , v2 , . . . , vm ) = V if and only if
P 2 given by S( f (x)) = f (x + 1/2). span [v1 ]B , [v2 ]B , . . . , [vm ]B = Fn .
(c) This follows immediately from combining
1.2.18 [D]B can be (complex) diagonalized as [D]B = PDP−1 parts (a) and (b).
via " # " #
1 1 i 0 1.2.24 (a) If v, w ∈ range(T ) and c ∈ F then there ex-
P= , D= . ist x, y ∈ V such that T (x) = v and T (y) =
−i i 0 −i
w. Then T (x + cy) = v + cw, so v + cw ∈
It follows that range(T ) too, so range(T ) is a subspace of w.
1/2
[D]B = PD1/2 P−1 (b) If v, w ∈ null(T ) and c ∈ F then T (v + cw) =
" # T (v) + cT (w) = 0 + c0 = 0, so v + cw ∈
1 1+i 0 null(T ) too, so null(T ) is a subspace of V.
=√ P P−1
2 0 1−i
" # 1.2.25 (a) Wejustnotethatw ∈ range(T )ifandonlyifthere
1 1 −1
=√ , exists v ∈ V such that T (v) = w, if and only if
2 1 1 there exists [v]B such that [T ]D←B [v]B = [w]D ,
which is the same square root we found in Exam- if and only if [w]D ∈ range([T ]D←B ).
ple 1.2.19. (b) Similar to part (a), v ∈ null(T ) if and only if
T (v) = 0, if and only if [T ]D←B [v]B = [0]B = 0,
1.2.21 It then follows from Exercise 1.2.20(c) that if and only if [v]B ∈ null([T ]D←B ).
if {v1 , v2 , . . . , vn } is a basis of V then (c) Using methods like in part (a), we can
{T (v1 ), T (v2 ), . . . , T (vn )} is a basis of W, so show that if w1 , . . . , wn is a basis of
dim(W ) = n = dim(V). range(T ) then [w1 ]D , . . . , [wn ]D is a ba-
sis of range([T ]D←B ), so rank(T ) =
1.2.22 (a) Write B={v1 , v2 , . . ., vn } and give names to the dim(range(T )) = dim(range([T ]D←B )) =
entries of [v]B and [w]B : [v]B = (c1 , c2 , . . . , cn ) rank([T ]D←B ).
and [w]B = (d1 , d2 , . . . , dn ), so that (d) Using methods like in part (b), we can
show that if v1 , . . . , vn is a basis of
v = c1 v1 + c2 v2 + · · · + cn vn null(T ) then [v1 ]B , . . . , [vn ]B is a basis of
w = d1 v1 + d2 v2 + · · · + dn vn . null([T ]D←B ), so nullity(T ) = dim(null(T )) =
By adding these equations we see that dim(null([T ]D←B )) = nullity([T ]D←B ).
v + w = (c1 + d1 )v1 + · · · + (cn + dn )vn , 1.2.27 (a) The “only if” implication is trivial since a ba-
which means that [v+w]B =(c1 +d1 , c2 +d2 , . . . , sis of V must, by definition, be linearly in-
cn +dn ), which is the same as [v]B +[w]B . dependent. Conversely, we note from Exer-
(b) Using the same notation as in part (a), we have cise 1.2.26(a) that, since B is linearly indepen-
cv = (cc1 )v1 + (cc2 )v2 + · · · + (ccn )vn , dent, we can add 0 or more vectors to B to
create a basis of V. However, B already has
which means that [cv]B = (cc1 , cc2 , . . . , cck ), dim(V) vectors, so the only possibility is that
which is the same as c[v]B . B becomes a basis when we add 0 vectors to
(c) It is clear that if v = w then [v]B = [w]B . For it—i.e., B itself is already a basis of V.
the converse, we note that parts (a) and (b) (b) Again, the “only if” implication is trivial since
tell us that if [v]B = [w]B then [v − w]B = 0, so a basis of V must span V. Conversely, Exer-
v − w = 0v1 + · · · + 0vn = 0, so v = w. cise 1.2.26(b) tells us that we can remove 0
or more vectors from B to create a basis of V.
1.2.23 (a) By Exercise 1.2.22, we know that However, we know that all bases of V contain
c1 v1 + c2 v2 + · · · + cm vm = 0 dim(V) vectors, and B already contains exactly
if and only if this many vectors, so the only possibility is that
B becomes a basis when we remove 0 vectors
c1 [v1 ]B + c2 [v2 ]B + · · · + cm [vm ]B from it—i.e., B itself is already a basis of V.
= [c1 v1 + c2 v2 + · · · + cm vm ]B = [0]B = 0.
In particular, this means that {v1 , v2 , . . . , vm } is
linearly independent if and only if c1 = c2 =
· · · = cm = 0 is the unique
 solution to these equa-
tions, if and only if [v1 ]B , [v2 ]B , . . . , [vm ]B ⊂
n
F is linearly independent.

1.2.28 (a) All 10 vector space properties from Defini- 1.2.32 We check the two properties of Definition 1.2.4:
tion 1.1.1 are straightforward, so we do not (a) R (v1 , v2 , . . .) + (w1 , w2 , . . .) = R(v1 + w1 , v2 +
show them all here. Property (a) just says that w2 , . . .) = (0, v1 + w1 , v2 + w2 , . . .) = (0, v1 , v2 , . . .) +
the sum of two linear transformations is again (0, w1 , w2 , . . .) = R(v1 , v2 , . . .) + R(w1 , w2 , . . .),
a linear transformation, for example. and (b) R c(v1 , v2 , . . .) = R(cv1 , cv2 , . . .) =
(b) dim(L(V, W)) = mn, which can be seen by (0, cv1 , cv2 , . . .) = c(0, v1 , v2 , . . .) = cR(v1 , v2 , . . .).
noting that if {v1 , . . . , vn } and {w1 , . . . , wm }
are bases of V and W, respectively, then the 1.2.34 It is clear that dim(P 2 (Z2 )) ≤ 3 since {1, x, x2 } spans
mn linear transformations defined by P 2 (Z2 ) (just like it did in the real case). However, it
( is not linearly independent since the two polynomials
wi if j = k, f (x) = x and f (x) = x2 are the same on Z2 (i.e., they
Ti, j (vk ) =
0 otherwise provide the same output for all inputs). The set {1, x}
is indeed a basis of P 2 (Z2 ), so its dimension is 2.
form a basis of L(V, W). In fact, the standard
matrices of these linear transformations make 1.2.35 Suppose P ∈ Mm,n is any matrix such that P[v]B =
up the standard basis of Mm,n : [Ti, j ]D←B = Ei, j [T (v)]D for all v ∈ V. For every 1 ≤ j ≤ n, if v = v j
for all i, j. then we see that [v]B = e j (the j-th standard basis
vector in Rn ), so P[v]B = Pe j is the j-th column
1.2.30 [T ]B = In if and only if the j-th column of [T ]B equals of P. On the other hand, it is also the case that
e j for each j. If we write B = {v1 , v2 , . . . , vn } then P[v]B = [T (v)]D = [T (v j )]D . It follows that the j-
we see that [T ]B = In if and only if [T (v j )]B = e j for th column of P is [T (v j )]D for each 1 ≤ j ≤ n, so
all j, which is equivalent to T (v j ) = v j for all j (i.e., P = [T ]D←B , which shows uniqueness of [T ]D←B .
T = IV ).
1.2.36 Suppose that a set C has m < n vectors, which we call
1.2.31 (a) If B is a basis of W then by Exercise 1.2.26 v1 , v2 , . . . , vm . To see that C does not span V, we want
that we can extend B to a basis C ⊇ B of V. to show that there exists x ∈ V such that the equation
However, since dim(V) = dim(W) we know c1 v1 + · · · + cm vm = x does not have a solution. This
that B and C have the same number of vectors, equation is equivalent to
so B = C, so V = W.
(b) Let V = c00 from Example 1.1.10 and let W c1 [v1 ]B + · · · + cm [vm ]B = [x]B ,
be the subspace of V with the property that
which is a system of n linear equations in m variables.
the first entry of every member of w equals 0.
Since m < n, this is a “tall and skinny” linear system,
Then dim(V) = dim(W) = ∞ but V 6= W.
so applying Gaussian elimination to the augmented
matrix form [ A | [x]B ] of this linear system results in
a row echelon form [ R | [y]B ] where R has at least
one zero row at the bottom. If x is chosen so that the
bottom entry of [y]B is non-zero then this linear system
has no solution, as desired.
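
(The computation in Exercise 1.2.10(b) above can also be verified numerically; the sketch below is an illustrative addition, not part of the solutions, and assumes Python with NumPy.)

    # Python sketch: invert [D]_B to get the coordinates of an antiderivative of
    # e^x sin(2x) in the basis B = {e^x sin(2x), e^x cos(2x)}, then check by
    # differentiating the result numerically.
    import numpy as np

    D = np.array([[1.0, -2.0],
                  [2.0,  1.0]])                          # [D]_B from the solution above
    c1, c2 = np.linalg.solve(D, np.array([1.0, 0.0]))    # coordinates of the antiderivative
    print(c1, c2)                                        # 0.2 and -0.4, i.e. 1/5 and -2/5

    F = lambda x: c1 * np.exp(x) * np.sin(2 * x) + c2 * np.exp(x) * np.cos(2 * x)
    h = 1e-6
    print((F(0.3 + h) - F(0.3 - h)) / (2 * h), np.exp(0.3) * np.sin(0.6))  # nearly equal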

Section 1.3: Isomorphisms and Linear Forms


1.3.1 (a) Isomorphic, since they are finite-dimensional (d) Not an inner product, since it is not even linear
with the same dimension (6). in its second argument. For example, hI, Ii =
(c) Isomorphic, since they both have dimension 9. tr(I + I) = 2n, but hI, 2Ii = 3n 6= 2hI, Ii.
(e) Not isomorphic, since V is finite-dimensional (f) Is an inner product. All three properties can
but W is infinite-dimensional. be proved in a manner analogous to Exam-
ple 1.3.17.
1.3.2 (a) Not a linear form. For example, f (e1 ) =
f (−e1 ) = 1, but f (e1 − e1 ) = T (0) = 0 6= 2 = 1.3.4 (a) True. This property of isomorphisms was stated
f (e1 ) + f (−e1 ). explicitly in the text, and is proved in Exer-
(c) Not a linear form, since it does not output cise 1.3.6.
scalars. It is a linear transformation, however. (c) False. This statement is only true if V and W
(e) Is a linear form, since g( f1 + f2 ) = ( f1 + are finite-dimensional.
f2 )0 (3) = f10 (3) + f20 (3) = g( f1 ) + g( f2 ) by (e) False. It is conjugate linear, not linear.
linearity of the derivative, and similarly (g) False. For example, if f (x) = x2 then
g(c f ) = cg( f ). E1 ( f ) = 12 = 1 and E2 ( f ) = 22 = 4, but
E3 ( f ) = 32 = 9, so E1 ( f ) + E2 ( f ) 6= E3 ( f ).
1.3.3 (a) Not an inner product, since it is not conju-
gate symmetric. For example, he1 , e2 i = 1 but 1.3.6 (a) This follows immediately from things that we
he2 , e1 i = 0. learned in Section 1.2.4. Linearity of T −1 is
baked right into the definition of its existence,
and it is invertible since T is its inverse.

(b) S ◦ T is linear (even when S and T are any (b) It is still true that alternating implies skew-
not necessarily invertible linear transforma- symmetric, but the converse fails because
tions) since (S ◦ T )(v + cw) = S(T (v + cw)) = f (v, v) = − f (v, v) no longer implies f (v, v) =
S(T (v) + cT (w)) = S(T (v)) + cS(T (w)) = 0 (since 1 + 1 = 0 implies −1 = 1 in this field).
(S ◦ T )(v) + c(S ◦ T )(w). Furthermore, S ◦ T is
invertible since T −1 ◦ S−1 is its inverse. 1.3.18 (a) We could prove this directly by mimicking the
proof of Theorem 1.3.5, replacing the trans-
1.3.7 If we use the standard basis of Mm,n (F), then we poses by conjugate transposes. Instead, we
know from Theorem 1.3.3 that there are scalars prove it by denoting the vectors in the basis
{ai, j } such that f (X) = ∑i, j ai, j xi, j . If we let A be B by B = {v1 , . . . , vm } and defining a new bi-
the matrix with (i, j)-entry equal to ai, j , we get linear function g : V × W → C by
tr(AX) = ∑i, j ai, j xi, j as well.
g(v j , w) = f (v j , w) for all v j ∈ B, w ∈ W
1.3.8 Just repeat the argument of Exercise 1.3.7 with the and extending to all v ∈ V via linearity. That
bases of MSn (F) and MH n that we presented in the is, if v = c1 v1 + · · · + cm vm (i.e., [v]B =
solution to Exercise 1.2.2. (v1 , . . . , vm )) then
1.3.9 Just recall that inner products are linear in their second g(v, w) = g(c1 v1 + · · · + cm vm , w)
entry, so hv, 0i = hv, 0vi = 0hv, vi = 0. = c1 g(v1 , w) + · · · + cm g(vm , wm )
1.3.10 Recall that = c1 f (v1 , w) + · · · + cm f (vm , wm )
s = f (c1 v1 + · · · + cm vm , w).
m n
kAkF = ∑ ∑ |ai, j |2 Then Theorem 1.3.5 tells us that we can write
i=1 j=1
g in the form
which is clearly unchanged if we reorder the entries g(v, w) = [v]TB A[w]C .
of A (i.e., kAkF = kAT kF ) and is also unchanged if we
take the complex conjugate of some or all entries (so When combined with the definition of g above,
kA∗ kF = kAT kF ). we see that
f (v, w) = g(c1 v1 + · · · + cm vm , w)
1.3.12 The result follows from expanding out the norm
in terms of the inner product: kv + wk2 = hv + = ([v]B )T A[w]C = [v]∗B A[w]C ,
w, v + wi = hv, vi + hv, wi + hw, vi + hw, wi = as desired.
kvk2 + 0 + 0 + kwk2 . (b) If A = A∗ (i.e., A is Hermitian) then
1.3.13 We just compute f (v, w) = [v]∗B A[w]B = [v]∗B A∗ [w]B
kv + wk2 + kv − wk2 = ([w]∗B A[v]B )∗ = f (w, v),
= hv + w, v + wi + hv − w, v − wi with the final equality following from the fact
 that [w]∗B A[v]B = f (w, v) is a scalar and thus
= hv, vi + hv, wi + hw, vi + hw, wi
 equal to its own transpose.
+ hv, vi − hv, wi − hw, vi + hw, wi In the other direction, if f (v, w) = f (w, v) for
= 2hv, vi + 2hw, wi all v and w then in particular this holds if
[v]B = ei and [w]B = e j , so
= 2kvk2 + 2kwk2 .
ai, j = e∗i Ae j = [v]∗B A[w]B = f (v, w)
1.3.14 (a) Expanding the norm in terms of the inner prod-
uct gives = f (w, v) = ([w]∗B A[v]B )∗ = (e∗j Aei )∗ = a j,i .
Since this equality holds for all i and j, we
kv + wk2 − kv − wk2
conclude that A is Hermitian.
= hv + w, v + wi − hv − w, v − wi (c) Part (b) showed the equivalence of A being
= hv, vi + 2hv, wi + hw, wi Hermitian and f being conjugate symmetric,
so we just need to show that positive definite-
− hv, vi + 2hv, wi − hw, wi,
ness of f (i.e., f (v, v) ≥ 0 with equality if
= 4hv, wi, and only if f = 0) is equivalent to v∗ Av ≥ 0
from which the desired equality follows. with equality if and only if v = 0. This equiva-
(b) This follows via the same method as in part (a). lence follows immediately from recalling that
All that changes is that the algebra is uglier, f (v, v) = [v]∗B A[v]B .
and we have to be careful to not forget the com-
plex conjugate in the property hw, vi = hv, wi. 1.3.21 gx (y) = 1 + xy + x2 y2 + · · · + x p y p .

1.3.16 (a) If f is alternating then for all v, w ∈ V we 1.3.22 We note that dim((P p )∗ ) = dim(P p ) = p + 1, which
have 0 = f (v + w, v + w) = f (v, v) + f (v, w) + is the size of the proposed basis, so we just need to
f (w, v) + f (w, w) = f (v, w) + f (w, v), so show that it is linearly independent. To this end, we
f (v, w) = − f (w, v), which means that f is note that if
skew-symmetric. Conversely, if f is skew- d0 Ec0 + d1 Ec1 + · · · + d p Ec p = 0
symmetric then choosing v = w tells us that
f (v, v) = − f (v, v), so f (v, v) = 0 for all v ∈ V, then
so f is alternating. d0 f (c0 ) + d1 f (c1 ) + · · · + d p f (c p ) = 0

for all f ∈ P p . However, if we choose f to be the non- 1.3.25 We just check the three defining proper-
zero polynomial with roots at each of c1 , . . ., c p (but ties of inner products, each of which fol-
not c0 , since f has degree p and thus at most p roots) lows from the corresponding properties of
then this tells us that d0 f (c0 ) = 0, so d0 = 0. A similar the inner product on W. (a) hv1 , v2 iV =
argument with polynomials having roots at all except hT (v1 ), T (v2 )iW = hT (v2 ), T (v1 )iW = hv2 , v1 iV .
for one of the c j ’s shows that d0 = d1 = · · · = d p = 0, (b) hv1 , v2 + cv3 iV = hT (v1 ), T (v2 + cv3 )iW =
which gives linear independence. hT (v1 ), T (v2 ) + cT (v3 )iW = hT (v1 ), T (v2 )iW +
chT (v1 , T (v3 )iW = hv1 , v2 iV + chv1 , v3 iV . (c)
1.3.24 Suppose that hv1 , v1 iV = hT (v1 ), T (v1 )iW ≥ 0, with equality if
and only if T (v1 ) = 0, which happens if and only
c1 T (v1 ) · · · + cn T (vn ) = 0.
if v1 = 0. As a side note, the equality condition of
By linearity of T , this implies T (c1 v1 + · · · + cn vn ) = property (c) is the only place where we used the fact
0, so φc1 v1 +···+cn vn = 0, which implies that T is an isomorphism (i.e., invertible).

φc1 v1 +···+cn vn ( f ) = f (c1 v1 + · · · + cn vn ) 1.3.27 If jk = j` then the condition a j1 ,..., jk ,..., j` ,..., j p =
= c1 f (v1 ) + · · · + cn f (vn ) = 0 −a j1 ,..., j` ,..., jk ,..., j p tells us that a j1 ,..., jk ,..., j` ,..., j p = 0
whenever two or more of the subscripts are equal
for all f ∈ V ∗ . Since B is linearly independent, we to each other. If all of the subscripts are dis-
can choose f so that f (v1 ) = 1 and f (v2 ) = · · · = tinct from each other (i.e., there is a permutation
f (vn ) = 0, which implies c1 = 0. A similar argu- σ : {1, 2, . . . , p} → {1, 2, . . . , p} such that σ (k) = jk
ment (involving different choices of f ) shows that for all 1 ≤ k ≤ p) then we see that aσ (1),...,σ (p) =
c2 = · · · = cn = 0 as well, so C is linearly independent. sgn(σ )a1,2,...,p , where sgn(σ ) is the sign of σ (see
Appendix A.1.5). It follows that A is completely de-
termined by the value of a1,2,...,p , so it is unique up to
scaling.
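
(The representation f(X) = tr(AX) from Exercises 1.3.7 and 1.3.8 is easy to verify numerically; the sketch below is an illustrative addition, not part of the solutions, and assumes Python with NumPy. The particular linear form f used here is a hypothetical example.)

    # Python sketch: the linear form f(X) = x11 + 2*x12 - x22 on M_2 equals tr(AX)
    # when A is the transpose of f's coefficient matrix.
    import numpy as np

    f = lambda X: X[0, 0] + 2 * X[0, 1] - X[1, 1]
    A = np.array([[1.0, 2.0],
                  [0.0, -1.0]]).T           # a_{ji} = coefficient of x_{ij} in f

    X = np.random.default_rng(2).standard_normal((2, 2))   # any test matrix
    print(f(X), np.trace(A @ X))                            # the two values agree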

Section 1.4: Orthogonality and Adjoints


1.4.1 (a) Not an orthonormal basis, since (0, 1, 1) does Then
not have length 1.
B2 = A2 − hB1 , A2 iB1
(c) Not an orthonormal basis, since B contains 3 " # " # " #
vectors living in a 2-dimensional subspace, and 1 0 1 1 1 1 1 −1
thus it is necessarily linearly dependent. Alter- = − = .
0 1 2 1 1 2 −1 1
natively, we could check that the vectors are
not mutually orthogonal. Since B2 already has Frobenius norm 1, we do
(e) Is an orthonormal basis, since all three vectors not need to normalize it. Similarly,
have length 1 and they are mutually orthogo-
B3 = A3 − hB1 , A3 iB1 − hB2 , A3 iB2
nal: " # " # " #
0 2 1 1 1 1 1 −1
(1, 1, 1, 1) · (0, 1, −1, 0) = 0 = − − (−2)
1 −1 2 1 1 2 −1 1
(1, 1, 1, 1) · (1, 0, 0, −1) = 0 " #
(0, 1, −1, 0) · (1, 0, 0, −1) = 0. 1 1 1
= .
2 −1 −1
(We ignored some scalars since they do not
affect whether or not the dot product equals 0.) Again, since B3 already has Frobenius norm
1, we do not need to normalize it. We thus
1.4.2 (b) The standard basis {(1, 0), (0, 1)} works (the conclude that {B1 , B2 , B3 } is an orthonormal
Gram–Schmidt process could also be used, but basis of span({A1 , A2 , A3 }).
the resulting basis will be hideous).
(d) If we start with the basis {sin(x), cos(x)} and 1.4.3 (a) T ∗ (w) = (w1 + w2 , w2 + w3 ).
perform the Gram–Schmidt process, we get the (c) T ∗ (w) = (2w1 − w2 , 2w1 − w2 ). Note that this
basis { √1π sin(x), √1π cos(x)}. (In other words, can be found by constructing an orthonormal
we can check that sin and cos are already or- basis of R2 with respect to this weird inner
thogonal in this inner product, and we just need product, but it is likely easier to just construct
to normalize them to have length 1.) T ∗ directly via its definition.
(f) We refer to the three matrices as A1 , A2 , and " #
A3 , respectively, and then we divide A1 by its 1 0
1.4.4 (a)
Frobenius norm to get B1 : 0 0
" #  
1 1 1 1 2 1 −1
B1 = √ A1 = . 1 
2 1 1 (c) 1 2 1
12 + 12 + 12 + 12 3
−1 1 2

1.4.5 (a) True. This follows from Theorem 1.4.8. 1.4.19 We mimic the proof of Theorem 1.4.9. To see that (a)
(c) False. If A = AT and B = BT then (AB)T = implies (c), suppose B∗ B = C∗C. Then for all v ∈ Fn
BT AT = BA, which in general does not equal we have
AB. For an explicit counter-example, you can p √
kBvk = (Bv) · (Bv) = v∗ B∗ Bv
choose √ p
" # " # = v∗C∗Cv = (Cv) · (Cv) = kCvk.
1 0 0 1
A= , B= .
0 −1 1 0 For the implication (c) =⇒ (b), note that if kBvk2 =
kCvk2 for all v ∈ Fn then (Bv) · (Bv) = (Cv) · (Cv).
(e) False. For example, U = V = I are unitary, but If x, y ∈ Fn then  v = x + y)
 this tells us (by choosing 
U +V = 2I is not. that B(x + y) · B(x + y) = C(x + y) · C(x + y) .
(g) True. On any inner product space V, IV satis- Expanding this dot product on both the left and right
2 = I and I ∗ = I .
fies IV V V V then gives

1.4.8 If U and V are unitary then (UV )∗ (UV ) = (Bx) · (Bx) + 2Re (Bx) · (By) + (By) · (By)

V ∗U ∗UV = V ∗V = I, so UV is also unitary. = (Cx) · (Cx) + 2Re (Cx) · (Cy) + (Cy) · (Cy).

1.4.10 If λ is an eigenvalue of U corresponding to an eigen- By then using the facts that (Bx) · (Bx) = (Cx) · (Cx)
vector v then Uv = λ v. Unitary matrices preserve and (By) · (By) = (Cy) · (Cy), we can simplify the
length, so kvk = kUvk = kλ vk = |λ |kvk. Dividing above equation to the form
both sides of this equation by kvk (which is OK since  
Re (Bx) · (By) = Re (Cx) · (Cy) .
eigenvectors are non-zero) gives 1 = |λ |.
If F = R then this implies (Bx) · (By) = (Cx) · (Cy) for
1.4.11 Since U ∗U = I, we have 1 = det(I) = det(U ∗U) = all x, y∈Fn , as desired. If instead F=C then we can
det(U ∗ ) det(U) = det(U) det(U) = | det(U)|2 . Taking repeat the above argument with v = x + iy to see that
the square root of both sides of this equation gives  
Im (Bx) · (By) = Im (Cx) · (Cy) ,
| det(U)| = 1.
so in this case we have (Bx) · (By) = (Cx) · (Cy) for
1.4.12 We know from Exercise 1.4.10 that the eigenvalues all x, y ∈ Fn too, establishing (b).
(diagonal entries) of U must all have magnitude 1. Finally, to see that (b) =⇒ (a), note that if we rear-
Since the columns of U each have norm 1, it follows range (Bv) · (Bw) = (Cv) · (Cw) slightly, we get
that the off-diagonal entries of U must each be 0. 
(B∗ B −C∗C)v · w = 0 for all v, w ∈ Fn .
1.4.15 Direct computation shows that the (i, j)-entry of F ∗ F If we choose w=(B∗ B − C∗C)v then this implies

equals 1n ∑n−1
k=0 ω
( j−i)k . If i = j (i.e., j − i = 0) then (B B − C∗C)v 2 =0 for all v ∈ Fn , so (B∗ B −
ω ( j−i)k = 1, so the (i, i)-entry of F ∗ F equals 1. If C∗C)v = 0 for all v∈Fn . This in turn implies
i 6= j then ω ( j−i)k 6= 1 is an n-th root of unity, so we B∗ B − C∗C=O, so B∗ B = C∗C, which completes the
claim that this sum equals 0 (and thus F ∗ F = I, so F proof.
is unitary).
To see why ∑n−1 k=0 ω
( j−i)k = 0 when i 6= j, we use a 1.4.20 Since B is mutually orthogonal and thus linearly in-
standard formula for summing a geometric series: dependent, Exercise 1.2.26(a) tells us that there is
a (not necessarily orthonormal) basis D of V such
n−1
1 − ω ( j−i)n that B ⊆ D. Applying the Gram–Schmidt process
∑ ω ( j−i)k = 1 − ω j−i
= 0,
to this bases D results in an orthonormal basis C
k=0
of V that also contains B (since the vectors from B
with the final equality following from the fact that are already mutually orthogonal and normalized, the
ω ( j−i)n = 1. Gram–Schmidt process does not affect them).
1.4.17 (a) For the “if” direction we note that (Ax) · y = 1.4.22 Pick orthonormal bases B and C of V and W,
(Ax)T y = xT AT y and x·(AT y) = xT AT y, which respectively, so that rank(T ) = rank([T ]C←B ) =
are the same. For the “only if” direction, we can rank([T ]C←B
∗ ) = rank([T ∗ ]B←C ) = rank(T ∗ ). In the
either let x and y range over all standard basis second equality, we used the fact that rank(A∗ ) =
vectors to see that the (i, j)-entry of A equals rank(A) for all matrices A, which is typically proved
the ( j, i)-entry of B, or use the “if” direction in introductory linear algebra courses.
together with uniqueness of the adjoint (Theo-
rem 1.4.8). 1.4.23 If R and S are two adjoints of T , then
(b) Almost identical to part (a), but just recall that
the complex dot product has a complex con- hv, R(w)i = hT (v), wi = hv, S(w)i for all v, w.
jugate in it, so (Ax) · y = (Ax)∗ y = x∗ A∗ y and
Rearranging slightly gives
x·(A∗ y) = x∗ A∗ y, which are equal to each other.
hv, (R − S)(w)i = 0 for all v ∈ V, w ∈ W.
1.4.18 This follows directly from the fact that if
Exercise 1.4.27 then shows that R − S = O, so R = S,
B = {v1 , . . . , vn } then [v j ]E = v j for all 1 ≤ j ≤ n,
as desired.
so PE←B = [ v1 | v2 | · · · | vn ], which is unitary if
and only if its columns (i.e., the members of B) form
1.4.24 (a) If v, w ∈ Fn , E is the standard basis of Fn , and B
an orthonormal basis of Fn .
is any basis of Fn , then
PB←E v = [v]B and PB←E w = [w]B .

By plugging this fact into Theorem 1.4.3, we see 1.4.29 (a) If P = P∗ then
that h·, ·i is an inner product if and only if it has
the form hP(v), v − P(v)i = hP2 (v), v − P(v)i
= hP(v), P∗ (v − P(v))i
hv, wi = [v]B · [w]B
= hP(v), P(v − P(v))i
= (PB←E v) · (PB←E w)
= hP(v), P(v) − P(v)i = 0,
= v∗ ((PB←E )∗ PB←E )w.
so P is orthogonal.
Recalling that change-of-basis matrices are in- (b) If P is orthogonal (i.e., hP(v), v − P(v)i = 0
vertible and every invertible matrix is a change- for all v ∈ V) then h(P − P∗ ◦ P)(v), vi = 0
of-basis matrix completes the proof. for all v ∈ V, so Exercise 1.4.28 tells us that
(b) The matrix " # P − P∗ ◦ P = O, so P = P∗ ◦ P. Since P∗ ◦ P is
1 2 self-adjoint, it follows that P is as well.
P=
0 1
1.4.30 (a) If the columns of A are linearly independent
works.
then rank(A) = n, and we know in general
(c) If P is not invertible then it may be the case
that rank(A∗ A) = rank(A), so rank(A∗ A) = n
that hv, vi = 0 even if v 6= 0, which violates
as well. Since A∗ A is an n × n matrix, this tells
the third defining property of inner products.
us that it is invertible.
In particular, it will be the case that hv, vi = 0
(b) We first show that P is an orthogonal projec-
whenever v ∈ null(P).
tion:
 
1.4.25 If B = {v1 , . . . , vn } then set Q = v1 | · · · | vn ∈ Mn , P2 = A(A∗ A)−1 (A∗ A)(A∗ A)−1 A∗
which is invertible since B is linearly independent.
Then set P = Q−1 , so that the function = A(A∗ A)−1 A∗ = P,

hv, wi = v∗ (P∗ P)w and


(which is an inner product by Exercise 1.4.24) satisfies P∗ = (A(A∗ A)−1 A∗ )∗ = A(A∗ A)−1 A∗ = P.
( By uniqueness of orthogonal projections, we
1 if i = j
hvi , v j i = v∗i (P∗ P)v j = e∗i e j = thus just need to show that range(P) =
0 if i 6= j, range(A). The key fact that lets us prove this is
the general fact that range(AB) ⊆ range(A) for
with the second-to-last equality following from the
all matrices A and B.
fact that Qe j = v j , so Pv j = Q−1 v j = e j for all j.
The fact that range(A) ⊆ range(P) follows
from noting that PA = A(A∗ A)−1 (A∗ A) = A,
1.4.27 The “if” direction is trivial. For the “only if” di-
so range(A) = range(PA) ⊆ range(P). Con-
rection, choose w = T (v) to see that kT (v)k2 =
versely, P = A (A∗ A)−1 A∗ ) immediately
hT (v), T (v)i = 0 for all v ∈ V. It follows that T (v) = 0
implies range(P) ⊆ range(A) (by choosing
for all v ∈ V, so T = O.
B = (A∗ A)−1 A∗ in the fact we mentioned ear-
lier), so we are done.
1.4.28 (a) The “if” direction is trivial. For the “only
if” direction, choose v = x + y to see that
1.4.31 If v ∈ range(P) then Pv = v (so v is an eigenvector
hT (x) + T (y), x + yi = 0 for all x, y ∈ V, which
with eigenvalue 1), and if v ∈ null(P) then Pv = 0 (so v
can be expanded and simplified to hT (x), yi +
is an eigenvector with eigenvalue 0). It follows that the
hT (y), xi = 0, which can in turn be rearranged
geometric multiplicity of the eigenvalue 1 is at least
to hT (x), yi + hT ∗ (x), yi = 0 for all x, y ∈ V.
rank(P) = dim(range(P)), and the multiplicity of the
If we perform the same calculation with v =
eigenvalue 0 is at least nullity(P) = dim(null(P)).
x + iy instead, then we similarly learn that
Since rank(P) + nullity(P) = n and the sum of mul-
ihT (x), yi − ihT ∗ (x), yi = 0, which can be mul-
tiplicities of eigenvalues of P cannot exceed n, we
tiplied by −i to get hT (x), yi − hT ∗ (x), yi = 0 conclude that actually the multiplicities of the eigen-
for all x, y ∈ V. values equal rank(P) and nullity(P), respectively. It
Adding the equations hT (x), yi+hT ∗ (x), yi = 0 then follows immediately from Theorem A.1.5 that P
and hT (x), yi − hT ∗ (x), yi = 0 together reveals is diagonalizable and has the claimed form.
that hT (x), yi = 0 for all x, y ∈ V, so Exer-
cise 1.4.27 tells us that T = O. 1.4.32 (a) Applying the Gram–Schmidt process to the stan-
(b) For the “if” direction, notice that if T ∗ = −T dard basis {1, x, x2 } of P 2 [−c, c] produces the
then hT (v), vi = hv, T ∗ (v)i = hv, −T (v)i = orthonormal basis
−hv, T (v)i = −hT (v), vi, so hT (v), vi = 0 for
all v. C = { 1/√(2c), √(3/(2c3 )) x, √(5/(8c5 )) (3x2 − c2 ) }.
For the “only if” direction, choose x = v + y
as in part (a) to see that h(T + T ∗ )(x), yi =
hT (x), yi + hT ∗ (x), yi = 0 for all x, y ∈ V. Ex-
ercise 1.4.27 then tells us that T + T ∗ = O, so
T ∗ = −T .

Constructing Pc (ex ) via this basis (as we did (b) Standard techniques from calculus like
in Example 1.4.18) gives the following poly- L’Hôpital’s rule show that as c → 0, the co-
nomial (we use sinh(c) = (ec − e−c )/2 and efficients of x2 , x, and 1 in Pc (ex ) above go to
cosh(c) = (ec + e−c )/2 to simplify things a bit): 1/2, 1, and 1, respectively, so
Pc (ex ) = (15/(2c5 ))((c2 + 3) sinh(c) − 3c cosh(c)) x2
− (3/c3 )(sinh(c) − c cosh(c)) x
− (3/(2c3 ))((c2 + 5) sinh(c) − 5c cosh(c)).
lim c→0+ Pc (ex ) = (1/2)x2 + x + 1,
which we recognize as the degree-2 Taylor
polynomial of ex at x = 0. This makes intuitive
sense since Pc (ex ) is the best approximation of
ex on the interval [−c, c], whereas the Taylor
polynomial is its best approximation at x = 0.
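A quick numerical illustration of the solution to Exercise 1.4.32 (a Python/NumPy sketch that is not part of the original solution; it assumes SciPy is available, and the helper name best_quadratic is ours): solving the normal equations for the projection of ex onto P 2 [−c, c] shows the coefficients approaching those of the Taylor polynomial as c shrinks.

import numpy as np
from scipy.integrate import quad

def best_quadratic(c):
    # Gram matrix of the (non-orthogonal) basis {1, x, x^2} of P^2[-c, c]
    # under the inner product <f, g> = integral of f(x)g(x) over [-c, c].
    G = np.array([[quad(lambda x, i=i, j=j: x**(i + j), -c, c)[0]
                   for j in range(3)] for i in range(3)])
    # Inner products of e^x with 1, x and x^2.
    b = np.array([quad(lambda x, i=i: x**i * np.exp(x), -c, c)[0]
                  for i in range(3)])
    # The normal equations G a = b give the coefficients of P_c(e^x)
    # with respect to the basis {1, x, x^2}.
    return np.linalg.solve(G, b)

for c in (1.0, 0.1, 0.01):
    print(c, best_quadratic(c))   # coefficients of 1, x, x^2
# As c decreases, the printed coefficients approach (1, 1, 1/2),
# i.e. the Taylor polynomial 1 + x + x^2/2.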

Section 1.5: Summary and Review


1.5.1 (a) False. This is not even true if B and C are bases (b) Since dim(S1 + S2 ) ≤ dim(V), rearrang-
of V (vector spaces have many bases). ing part (a) shows that dim(S1 ∩ S2 ) =
(c) True. This follows from Theorem 1.2.9. dim(S1 ) + dim(S2 ) − dim(S1 + S2 ) ≥
(e) False. This is true if V is finite-dimensional, dim(S1 ) + dim(S2 ) − dim(V).
but we showed in Exercise 1.2.33 that the
“right shift” transformation on CN has no eigen- 1.5.4 One basis is {x` yk : `, k ≥ 0, ` + k ≤ p}. To count the
values or eigenvectors. members of this basis, we note that for each choice of
` there are p+1−` possible choices of k, so there are
1.5.2 (a) If {v1 , . . . , vk } is a basis of S1 ∩ S2 then we (p+1)+ p+(p−1)+· · ·+2+1 = (p+1)(p+2)/2
can extend it to bases {v1 , . . . , vk , w1 , . . . , wm } vectors in this basis, so dim(P2p ) = (p+1)(p+2)/2.
and {v1 , . . . , vk , x1 , . . . , xn } of S1 and
S2 , respectively, via Exercise 1.2.26. It 1.5.6 We can solve this directly from the definition of
is then straightforward to check that T ∗ : hT (X),Y i = hAXB∗ ,Y i = tr (AXB∗ )∗Y =
{v1 , . . . , vk , w1 , . . . , wm , x1 , . . . , xn } is a basis tr(BX ∗ A∗Y ) = tr(X ∗ A∗Y B) = hX, A∗Y Bi =
of S1 + S2 , so dim(S1 + S2 ) = k + m + n, hX, T ∗ (Y )i for all X ∈ Mn and Y ∈ Mm . It fol-
and dim(S1 ) + dim(S2 ) − dim(S1 ∩ S2 ) = lows that T ∗ (Y ) = A∗Y B.
(k + m) + (k + n) − k = k + m + n too.
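The adjoint formula in the solution to Exercise 1.5.6 is easy to test numerically. The following NumPy sketch (not part of the original solution; the matrix sizes and random seed are arbitrary choices) checks that hT (X),Y i = hX, T ∗ (Y )i for the Frobenius inner product.

import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
B = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Y = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))

inner = lambda P, Q: np.trace(P.conj().T @ Q)   # Frobenius inner product <P, Q> = tr(P*Q)
T = lambda X: A @ X @ B.conj().T                # T(X) = A X B*
T_adj = lambda Y: A.conj().T @ Y @ B            # claimed adjoint: T*(Y) = A* Y B

print(np.isclose(inner(T(X), Y), inner(X, T_adj(Y))))   # True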

Section 1.A: Extra Topic: More About the Trace


1.A.1 (b) False. For example, if A = B = I ∈ M2 then then
tr(AB) = 2 but tr(A)tr(B) = 4. " # " #
(d) True. A and AT have the same diagonal entries. 0 1 0 0
AB = and BA = ,
0 0 0 0
1.A.2 We just repeatedly use properties of the trace:
which are not similar since they do not have
tr(AB) = −tr(AT BT ) = −tr((BA)T ) = −tr(BA) =
the same rank (rank(A) = 1 and rank(B) = 0).
−tr(AB), so tr(AB) = 0. Note that this argument
only works if F is a field in which 1 + 1 6= 0.
1.A.6 (a) If W ∈ W then W = ∑i ci (Ai Bi − Bi Ai ), which
has tr(W ) = ∑i ci (tr(Ai Bi ) − tr(Bi Ai )) = 0 by
1.A.4 Recall from Exercise 1.4.31 that we can write
" # commutativity of the trace, so W ∈ Z, which
I O −1 implies W ⊆ Z. The fact that W is a subspace
P=Q r Q ,
O O follows from the fact that every span is a sub-
space, and the fact that Z is a subspace follows
where r = rank(P). Since the trace is similarity- from the fact that it is the null space of a linear
invariant, it follows that tr(P) = tr(Ir ) = r = rank(P). transformation (the trace).
(b) We claim that dim(Z) = n2 − 1. One way
1.A.5 (a) If A is invertible then choosing P = A−1 shows to see this is to notice that rank(tr) = 1
that (since its output is just 1-dimensional), so the
P(AB)P−1 = A−1 ABA = BA, rank-nullity theorem tells us that nullity(tr) =
so AB and BA are similar. A similar argument dim(null(tr)) = dim(Z) = n2 − 1.
works if B is invertible.
(b) If
" # " #
1 0 0 1
A= and B =
0 0 0 0

(c) If we let A = E j, j+1 and B = E j+1, j (with 1 ≤ It then follows from Theorem
 1.A.4 that
j < n), then we get AB − BA = E j, j+1 E j+1, j − 0 (0) = det(A)tr A−1 B .
fA,B
E j+1, j E j, j+1 = E j, j − E j+1, j+1 . Similarly, if (b) The definition of the derivative says that
A = Ei,k and B = Ek, j (with i 6= j) then we get  
AB − BA = Ei,k Ek, j − Ek, j Ei,k = Ei, j . It follows d  det A(t + h) − det A(t)
det A(t) = lim .
that W contains each of the n2 − 1 matrices in dt h→0 h
the following set: Using Taylor’s theorem (Theorem A.2.3) tells
{E j, j − E j+1, j+1 } ∪ {Ei, j : i 6= j}. us that
Furthermore, it is straightforward to show that dA
A(t + h) = A(t) + h + hP(h),
this set is linearly independent, so dim(W) ≥ dt
n2 − 1. Since W ⊆ Z and dim(Z) = n2 − 1,
where P(h) is a matrix whose entries depend on
it follows that dim(W) = n2 − 1 as well, so
h in a way so that lim P(h) = O. Substituting
W = Z. h→0
this expression for A(t + h) into the earlier
 ex-
1.A.7 (a) We note that f (E j, j ) = 1 for all 1 ≤ j ≤ n since pression for the derivative of det A(t) shows
E j, j is a rank-1 orthogonal projection. To see that
that f (E j,k ) = 0 whenever 1 ≤ j 6= k ≤ n (and d 
thus show that f is the trace), we notice that det A(t)
dt
E j, j + E j,k + Ek, j + Ek,k is a rank-2 orthogonal pro-  
det A(t) + h dA
dt + hP(h) − det A(t)
jection, so applying f to it gives a value of 2. Lin- = lim
earity of f then says that f (E j,k ) + f (Ek, j ) = 0 h→0 h
 
whenever j 6= k. A similar argument shows that det A(t) + h dA
dt − det A(t)
E j, j + iE j,k − iEk, j + Ek,k is a rank-2 orthogonal = lim
h→0 h
projection, so linearity of f says that f (E j,k ) − 0
= fA(t),dA/dt (0),
f (Ek, j ) = 0 whenever j 6= k. Putting these two
facts together shows that f (E j,k ) = f (Ek, j ) = 0, where the second-to-last equality follows from
which completes the proof. the fact that adding hP(h) inside the determi-
(b) Consider the linear form f : M2 (R) → R defined nant, which is a polynomial in the entries of
by " #! its argument, just adds lots of terms that have
a b h[P(h)]i, j as a factor for various values of i and
f = a + b − c + d. j. Since lim h[P(h)]i, j /h = 0 for all i and j,
c d h→0
This linear form f coincides with the trace on all these terms make no contribution to the value
symmetric matrices (and thus all orthogonal pro- of the limit.
jections P ∈ M2 (R)), but not on all A ∈ M2 (R). (c) This follows immediately from choosing
B = dA/dt in part (a) and then using the result
1.A.8 (a) First use the fact that I = AA−1 and multiplica- that we proved in part (b).
tivity of the determinant to write

fA,B (x) = det(A+xB) = det(A) det I +x(A−1 B) .
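The derivative formula in Exercise 1.A.8 (Jacobi's formula) can also be checked numerically. Below is a small NumPy sketch (not part of the original solution; the particular matrix-valued function A(t) is an arbitrary example) comparing a central difference quotient of det A(t) with det(A(t)) tr(A(t)−1 dA/dt).

import numpy as np

A = lambda t: np.array([[2 + t, np.sin(t)],
                        [t**2,  3 + np.cos(t)]])
dA = lambda t: np.array([[1,     np.cos(t)],
                         [2 * t, -np.sin(t)]])   # entrywise derivative of A(t)

t, h = 0.7, 1e-6
numeric = (np.linalg.det(A(t + h)) - np.linalg.det(A(t - h))) / (2 * h)
jacobi = np.linalg.det(A(t)) * np.trace(np.linalg.inv(A(t)) @ dA(t))
print(numeric, jacobi)   # the two values agree to several decimal places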

Section 1.B: Extra Topic: Direct Sum, Orthogonal Complement


1.B.1 (a) The orthogonal complement is the line go- 1.B.2 In all parts of this solution, we refer to the given
ing through the origin that is perpendicular to matrix as A, and the sets that we list are bases of the
(3, 2). It thus has {(−2, 3)} as a basis. indicated subspaces.
(c) Everything is orthogonal to the zero vector, (a) range(A): {(1, 3)}
so the orthogonal complement is all of R3 . null(A∗ ): {(3, −1)}
One possible basis is just the standard basis range(A∗ ): {(1, 2)}
{e1 , e2 , e3 }. null(A): {(−2, 1)}
(e) Be careful: all three of the vectors in this The fact that range(A)⊥ = null(A∗ ) and
set lie on a common plane, so their orthog- null(A)⊥ = range(A∗ ) can be verified by noting
onal complement is 1-dimensional. The or- that the dimensions of each of the subspace pairs
thogonal complement can be found by solving add up to the correct dimension (2 here) and that
the linear system v · (1, 2, 3) = 0, v · (1, 1, 1) = all vectors in one basis are orthogonal to all vec-
0, v · (3, 2, 1) = 0, which has infinitely many tors in the other basis (e.g., (1, 3) · (3, −1) = 0).
solutions of the form v = v3 (1, −2, 1), so (c) range(A): {(1, 4, 7), (2, 5, 8)}
{(1, −2, 1)} is a basis. null(A∗ ): {(1, −2, 1)}
⊥
(g) We showed in Example 1.B.7 that MSn = range(A∗ ): {(1, 0, −1), (0, 1, 2)}
MsS null(A): {(1, −2, 1)}
n . When n = 2, there is (up to scaling) only
one skew-symmetric matrix, so one basis of
this space is 1.B.3 (b) True. This just says that the only vector in V
(" #) that is orthogonal to everything in V is 0.
0 1
.
−1 0

(d) True. We know from Theorem 1.B.7 that the By uniqueness of projections (Exercise 1.B.7),
range of A and the null space of A∗ are orthog- we now just need to show that range(P) =
onal complements of each other. range(A) and null(P) = null(B∗ ). These facts
(f) False. R2 ⊕ R3 has dimension 5, so it can- both follow quickly from the facts that
not have a basis consisting of 6 vectors. range(AC) ⊆ range(A) and null(A) ⊆ null(CA)
Instead, the standard basis of R2 ⊕ R3 is for all matrices A and C.
{(e1 , 0), (e2 , 0), (0, e1 ), (0, e2 ), (0, e3 )}. More specifically, these facts immediately tell
us that range(P) ⊆ range(A) and null(B∗ ) ⊆
1.B.4 We need to show that span(span(v1 ), . . . , span(vk )) = null(P). To see the opposite inclusions, no-
V and tice that PA = A so range(A) ⊆ range(P), and
[  B∗ P = B∗ so null(P) ⊆ null(B∗ ).
span(vi ) ∩ span v j = {0}
j6=i
1.B.10 To see that the orthogonal complement is indeed
for all 1 ≤ i ≤ k. The first property fol- a subspace, we need to check that it is non-empty
lows immediately from the fact that {v1 , . . . , vk } (which it is, since it contains 0) and verify that the
is a basis of V so span(v1 , . . . , vk ) = V, so two properties of Theorem 1.1.2 hold. For prop-
span(span(v1 ), . . . , span(vk )) = V too. erty (a), suppose v1 , v2 ∈ B⊥ . Then for all w ∈ B we
have hv1 + v2 , wi = hv1 , wi + hv2 , wi = 0 + 0 = 0,
To seeSthe other  property, suppose w ∈ span(vi ) ∩
span j6=i v j for some 1 ≤ i ≤ k. Then there exist so v1 + v2 ∈ B⊥ . For property (b), suppose v ∈ B⊥
scalars c1 , c2 , . . ., ck such that w = ci vi = ∑ j6=i c j v j , and c ∈ F. Then for all w ∈ B we have hcv, wi =
which (since {v1 , v2 , . . . , vk } is a basis and thus lin- chv, wi = c0 = 0, so cv ∈ B⊥ .
early independent) implies c1 = c2 = · · · = ck = 0, ⊥
so w = 0. 1.B.14 Since B⊥ = span(B) (by Exercise 1.B.13), it
suffices to prove that (S ⊥ )⊥ = S when S is a
1.B.5 If v ∈ S1 ∩ S2 when we can write v = c1 v1 + subspace of V. To see this, just recall from The-
· · · + ck vk and v = d1 w1 + · · · + dm wm for orem 1.B.6 that S ⊕ S ⊥ = V. However, it is also
some v1 , . . . , vk ∈ B, w1 , . . . , wm ∈ C, and scalars the case that (S ⊥ )⊥ ⊕ S ⊥ = V, so S = (S ⊥ )⊥ by
c1 , . . . , ck , d1 , . . . , dm . By subtracting these two equa- Exercise 1.B.12.
tions for v from each other, we see that
1.B.15 We already know from part (a) that range(T )⊥ =
c1 v1 + · · · + ck vk − d1 w1 − · · · − dm wm = 0, null(T ∗ ) for all T . If we replace T by T ∗ throughout
which (since B ∪C is linearly independent) implies that equation, we see that range(T ∗ )⊥ = null(T ).
c1 = · · · = ck = 0 and d1 = · · · = dm = 0, so v = 0. Taking the orthogonal complement of both sides
(and using the fact that (S ⊥ )⊥ = S for all sub-
1.B.6 It suffices to show that span(S1 ∪ S2 ) = V. To spaces S, thanks to Exercise 1.B.14) we then see
this end, recall that Theorem 1.B.2 says that if B range(T ∗ ) = null(T )⊥ , as desired.
and C are bases of S1 and S2 , respectively, then
(since (ii) holds) B ∪C is linearly independent. Since 1.B.16 Since rank(T ) = rank(T ∗ ) (i.e., dim(range(T )) =
|B ∪C| = dim(V), Exercise 1.2.27 implies that B ∪C dim(range(T ∗ )), we just need to show that v = 0 is
is a basis of V, so using Theorem 1.B.2 again tells the only solution to S(v) = 0. To this end, notice
us that V = S1 ⊕ S2 . that if v ∈ range(T ∗ ) is such that S(v) = 0 then v ∈
null(T ) too. However, since range(T ∗ ) = null(T )⊥ ,
1.B.7 Notice that if P is any projection with range(P) = S1 we know that the only such v is v = 0, which is
and null(P) = S2 , then we know from Example 1.B.8 exactly what we wanted to show.
that V = S1 ⊕ S2 . If B and C are bases of S1 and S2 ,
respectively, then Theorem 1.B.2 tells us that B ∪C 1.B.17 Unfortunately, we cannot quite use Theorem 1.B.6,
is a basis of V and thus P is completely determined since that result only applies to finite-dimensional
by how it acts on the members of B and C. However, inner product spaces, but we R 1 can use the same idea.
P being a projection tells us exactly how it behaves First notice that h f , gi = −1 f (x)g(x) dx = 0 for
on those two sets: Pv = v for all v ∈ B and Pw = 0 all f ∈ P E [−1, 1] and g ∈ P O [−1, 1]. This follows
for all w ∈ C, so P is completely determined. immediately from the fact that if f is even and g is
odd then f g is odd and thus has integral (across any
1.B.8 (a) Since range(A) ⊕ null(B∗ ) = Fm , we know that interval that is symmetric through the origin) equal
if B∗ Av = 0 then Av = 0 (since Av ∈ range(A) to 0. It follows that (P E [−1, 1])⊥ ⊇ P O [−1, 1]. To
so the only way Av can be in null(B∗ ) too is see that equality holds, suppose g ∈ (P E [−1, 1])⊥
if Av = 0). Since A has linearly independent and write g = gE + gO , where gE ∈ P E [−1, 1] and
columns, this further implies v = 0, so the lin- gO ∈ P O [−1, 1] (which we know we can do thanks to
ear system B∗ Av = 0 in fact has a unique solu- Example 1.B.2). If h f , gi = 0 for all f ∈ P E [−1, 1]
tion. Since B∗ A is square, it must be invertible. then h f , gE i = h f , gO i = 0, so h f , gE i = 0 (since
(b) We first show that P is a projection: h f , gO i = 0 when f is even and gO is odd), which
implies (by choosing f = gE ) that gE = 0, so
P2 = A(B∗ A)−1 (B∗ A)(B∗ A)−1 B∗ g ∈ P O [−1, 1].
= A(B∗ A)−1 B∗ = P.

1.B.18 (a) Suppose f ∈ S ⊥ so that h f , gi = 0 for all for some {(vi , 0)} ⊆ B0 (i.e., {vi } ⊆ B), {(0, w j )} ⊆
g ∈ S. Notice that the function g(x) = x f (x) C0 (i.e., {w j } ⊆ C), and scalars c1 , . . . , ck , d1 , . . . , dm .
satisfies g(0) = 0, so g ∈ S, so 0 = h f , gi = This implies c1 v1 + · · · + ck vk = 0, which (since B
R1 2 2
0 x( f (x)) dx. Since x( f (x)) is continuous
is linearly independent) implies c1 = · · · = ck = 0.
and non-negative on the interval [0, 1], the only A similar argument via linear independence of C
way this integral can equal 0 is if x( f (x))2 = 0 shows that d1 = · · · = dm = 0, which implies B0 ∪C0
for all x, so (by continuity of f ) f (x) = 0 for is linearly independent as well.
all 0 ≤ x ≤ 1 (i.e., f = 0).
(b) It follows from part (a) that (S ⊥ )⊥ = {0}⊥ = 1.B.21 (a) If (v1 , 0), (v2 , 0) ∈ V 0 (i.e., v1 , v2 ∈ V) and
C[0, 1] 6= S. c is a scalar then (v1 , 0) + c(v2 , 0) = (v1 +
cv2 , 0) ∈ V 0 as well, since v1 + cv2 ∈ V. It fol-
1.B.19 Most of the 10 properties from Definition 1.1.1 lows that V 0 is a subspace of V ⊕ W, and a
follow straightforwardly from the corresponding similar argument works for W 0 .
properties of V and W, so we just note that the zero (b) The function T : V → V 0 defined by T (v) =
vector in V ⊕ W is (0, 0), and −(v, w) = (−v, w) for (v, 0) is clearly an isomorphism.
all v ∈ V and w ∈ W. (c) We need to show that V 0 ∩ W 0 = {(0, 0)} and
span(V 0 , W 0 ) = V ⊕ W. For the first property,
1.B.20 Suppose suppose (v, w) ∈ V 0 ∩ W 0 . Then w = 0 (since
(v, w) ∈ V 0 ) and v = 0 (since (v, w) ∈ W 0 ), so
k m
(v, w) = (0, 0), as desired. For the second prop-
∑ ci (vi , 0) + ∑ d j (0, w j ) = (0, 0) erty, just notice that every (v, w) ∈ V ⊕ W can
i=1 j=1
be written in the form (v, w) = (v, 0) + (0, w),
where (v, 0) ∈ V 0 and (0, w) ∈ W 0 .
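The oblique projection constructed in Exercise 1.B.8 is straightforward to verify numerically. The following NumPy sketch (not part of the original solution; the sizes and seed are arbitrary, and it relies on the random B∗ A being invertible, which happens with probability 1) checks the three properties used in that solution.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))      # linearly independent columns (with probability 1)
B = rng.standard_normal((5, 2))      # chosen so that B^T A is (generically) invertible
P = A @ np.linalg.inv(B.T @ A) @ B.T

print(np.allclose(P @ P, P))         # P is a projection
print(np.allclose(P @ A, A))         # PA = A, so range(A) is contained in range(P)
print(np.allclose(B.T @ P, B.T))     # B^T P = B^T, so null(P) is contained in null(B^T)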

Section 1.C: Extra Topic: The QR Decomposition


1.C.1 (a) This matrix has QR decomposition UT , where 1.C.5 Suppose A = U1 T1 = U2 T2 are two QR decompo-
" # " # sitions of A. We then define X = U1∗U2 = T1 T2−1
1 3 4 5 4
U= and T = . and note that since T1 and T2−1 are upper triangular
5 4 −3 0 2 with positive real diagonal entries, so is X (see Exer-
(c) This QR decomposition is not unique, so yours cise 1.C.4). On the other hand, X is both unitary and
may differ: upper triangular, so Exercise 1.4.12 tells us that it
    is diagonal and its diagonal entries have magnitude
2 1 −2 9 4
1    equal to 1. The only positive real number with mag-
U=  1 2 2  and T = 0 1 . nitude 1 is 1 itself, so X = I, which implies U1 = U2
3
−2 2 −1 0 0 and T1 = T2 .
 
2 −1 2 4
1.C.7 (a) If A = [ B | C ] has QR decomposition
11 2 −4 2

(e) U =   and A = U[ T | D ] where B and T are m × m, B
5 4 −2 −1 −2 is invertible, and T is upper triangular, then
2 4 2 −1 B = UT . Since B is invertible, we know from
 
5 3 2 1 Exercise 1.C.5 that this is its unique QR de-
  composition, so the U and T in the QR decom-
0 1 1 2
T = . position of A are unique. To see that D is also
0 0 1 2 unique, we just observe that D = U ∗C.
0 0 0 4 (b) We gave one QR decomposition for a par-
ticular 3 × 2 matrix in the solution to Exer-

1.C.2 (a) x = (−1, 1)/ 2. (c) x = (−2, −1, 2). cise 1.C.1(c), and another one can be obtained
simply by negating the final column of U:
   
2 1 2 9 4
1.C.3 (a) True. This follows immediately from Theo- 1   
U= 1 2 −2 and T = 0 1 .
rem 1.C.1. 3
(c) False. Almost any matrix’s QR decomposition −2 2 1 0 0
provides a counter-example. See the matrix
from Exercise 1.C.6, for example.
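Numerically, a QR decomposition with the normalization of Exercise 1.C.5 (positive real diagonal entries of T, which makes the decomposition of an invertible matrix unique) can be obtained from a library routine by flipping signs. A NumPy sketch (not part of the original solutions; note that np.linalg.qr's own sign convention may differ from the text's):

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))          # invertible with probability 1

Q, R = np.linalg.qr(A)                   # library QR; diag(R) need not be positive
S = np.diag(np.sign(np.diag(R)))         # sign flips, with S^2 = I
U, T = Q @ S, S @ R                      # the unique QR decomposition from Exercise 1.C.5

print(np.allclose(U @ T, A))             # A = UT
print(np.allclose(U.T @ U, np.eye(4)))   # U has orthonormal columns
print(np.all(np.diag(T) > 0))            # T is upper triangular with positive diagonal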

Section 1.D: Extra Topic: Norms and Isometries


1.D.1 (a) Is a norm. This is the 1-norm of Pv, where P (c) We already know that the desired claim holds
is the diagonal matrix with 1 and 3 on its diag- whenever kvk∞ = 1, so for other vectors we
onal. In general, kPvk p is a norm whenever P just note that
is invertible (and it is in this case). lim kcvk p = |c| lim kvk p = |c|kvk∞ = kcvk∞ .
(c) Not a norm. For example, if v = (1, −1) then p→∞ p→∞

1/3 1.D.9 This inequality can be proved “directly”:


kvk = v3 + v21 v2 + v1 v22 + v3
1 2
n

= |1 − 1 + 1 − 1|1/3 = 0. |v · w| = ∑ v j w j
j=1
Since v 6= 0, it follows that k · k is not a norm. n n
(e) Not a norm. For example, if p(x) = x then ≤ ∑ |v j ||w j | ≤ ∑ |v j |kwk∞ = kvk1 kwk∞ ,
kpk = 1 and k2pk = 1 6= 2. j=1 j=1
where the final inequality follows from the fact that
1.D.2 (b) No, not induced by an inner product. For ex- |w j | ≤ kwk∞ for all j (straight from the definition of
ample, that norm).
2ke1 k2 + 2ke2 k2 = 1 + 32/3 , but 1.D.10 (a) For the “if” direction, just notice that kcw + wk p =
2 2
ke1 + e2 k + ke1 − e2 k = 4 2/3
+4 2/3
. k(c+1)wk p = (c+1)kwk p = kcwk p +kwk p . For
the “only if” direction, we refer back to the proof
Since these numbers are not equal, the paral- of Minkowski’s inequality (Theorem 1.D.2). There
lelogram law does not hold, so the norm is not were only two places in that proof where an inequal-
induced by an inner product. ity was introduced: once when using the triangle
inequality on C and once when using convexity of
1.D.3 (a) True. All three of the defining properties of the function f (x) = x p .
norms are straightforward to prove. For exam- From equality in the triangle inequality we see
ple, the triangle inequality for ck · k follows that, for each 1 ≤ j ≤ n, we have (using the nota-
from the triangle inequality for k · k: tion established in that proof) y j = 0 or x j = c j y j

ckv + wk ≤ c kvk + kwk = ckvk + ckwk. for some 0 < c j ∈ R. Furthermore, since f (x) = x p
is strictly convex whenever p > 1, we conclude
(c) False. This is not even true for unitary matrices that the only way the corresponding inequality can
(e.g., I + I is not a unitary matrix/isometry). be equality is if |x j | = |y j | for all 1 ≤ j ≤ n. Since
(e) False. For example, if V and W have different kxk p = kyk p = 1, this implies x = y, so v and w
dimensions then T cannot be invertible. How- are non-negative multiples of x = y and thus of
ever, we show in Exercise 1.D.21 that this state- each other.
ment is true if V = W is finite-dimensional. (b) The “if” direction is straightforward, so we fo-
cus on the “only if” direction. If kv + wk1 =
1.D.6 The triangle inequality tells us that kx + yk ≤ kxk + kvk1 + kwk1 then
kyk for all x, y ∈ V. If we choose x = v − w and 
y = w, then this says that ∑ |v j + w j | = ∑ |v j | + |w j | .
j j
k(v − w) + wk ≤ kv − wk + kwk. Since |v j + w j | ≤ |v j | + |w j | for all j, the above
After simplifying and rearranging, this becomes equality implies |v j + w j | = |v j | + |w j | for all j.
kvk − kwk ≤ kv − wk, as desired. This equation holds if and only if v j and w j lie on
the same closed ray starting at the origin in the
1.D.7 First, choose an index k such that kvk∞ = |vk |. Then complex plane (i.e., either w j = 0 or v j = c j w j
p for some 0 < c j ∈ R).
kvk p = p |v1 | p + · · · + |vn | p
p 1.D.12 (a) If v = e1 is the first standard basis vector then
≥ p |vk | p = |vk | = kvk∞ . kvk p = kvkq = 1.
1.D.8 (a) If |v| < 1 then multiplying |v| by itself de- (b) If v = (1, 1, . . . , 1) then kvk p = n1/p and
creases it. By multiplying it by itself more and kvkq = n1/q , so
more, we can make it decrease to as close to 0  1 1

as we like. kvk p = n p q kvkq ,
[Note: To make this more rigorous, you can
as desired.
note that for every sufficiently small ε > 0,
you can choose p ≥ log(ε)/ log(|v|), which is 1.D.13 We already proved the triangle inequality for this
≥ 1 when 0 < ε ≤ |v|, so that |v| p ≤ ε.] norm as Theorem 1.D.7, so we just need to show the
(b) If kvk∞ = 1 then we know from Exercise 1.D.7 remaining two defining properties of norms. First,
that kvk p ≥ 1. Furthermore, |v j | ≤ 1 for each Z b 1/p
1 ≤ j ≤ n, so |v j | p ≤ 1 for all p ≥ 1 as well. Then kc f k p = |c f (x)| p dx
p a
kvk p = p |v1 | p + · · · + |vn | p  Z b 1/p
√ √ = |c| p | f (x)| p dx
≤ p 1 + · · · + 1 ≤ p n. a
√ Z 1/p
Since lim p→∞ p n = 1, the squeeze theorem for b
limits tells us that lim p→∞ kvk p = 1 too. = |c| | f (x)| p dx = |c|k f k p .
a

Second, k f k p ≥ 0 for all f ∈ C[a, b] because inte- In particular, this means that there is a non-zero vec-
grating a non-negative function gives a non-negative tor (which has non-zero norm) c1 v1 + · · · + cn vn ∈ V
answer. Furthermore, k f k p = 0 implies f is the zero that gets sent to the zero vector (which has norm 0),
function since otherwise f (x) > 0 for some x ∈ [a, b] so T is not an isometry.
and thus f (x) > 0 on some subinterval of [a, b] by
continuity of f , so the integral and thus k f k p would 1.D.22 We prove this theorem by showing the chain of im-
both have to be strictly positive as well. plications (b) =⇒ (a) =⇒ (c) =⇒ (b), and mimic
the proof of Theorem 1.4.9.
1.D.14 We just mimic the proof of Hölder’s inequality To see that (b) implies (a), suppose T ∗ ◦T = IV . Then
for vectors in Cn . Without loss of generality, we for all v ∈ V we have
just need to prove the theorem in the case when
k f k p = kgkq = 1. By Young’s inequality, we know kT (v)k2W = hT (v), T (v)i = hv, (T ∗ ◦ T )(v)i
that = hv, vi = kvk2V ,
| f (x)| p |g(x)|q
| f (x)g(x)| ≤ +
p q so T is an isometry.
for all x ∈ [a, b]. Integrating then gives For (a) =⇒ (c), note that if T is an isometry then
Z b Z b Z b kT (v)k2W = kvk2V for all v ∈ V, so expanding these
| f (x)| p |g(x)|q
| f (x)g(x)| dx ≤ dx + dx quantities in terms of the inner product (like we did
a a p a q above) shows that
q
k f k pp kgkq 1 1
= + = + = 1, hT (v), T (v)i = hv, vi for all v ∈ V.
p q p q
as desired. Well, if x, y ∈ V then this tells us (by choosing
v = x + y) that
1.D.16 First, we compute
hT (x + y), T (x + y)i = hx + y, x + yi.
1 3 1
hv, vi = ∑ ik kv + ik vk2V = kvk2V ,
4 k=0 Expanding the inner product on both sides of the
above equation then gives
which is clearly non-negative with equality if and 
only if v = 0. Similarly, hT (x), T (x)i + 2Re hT (x), T (y)i + hT (y), T (y)i

1 3 1 = hx, xi + 2Re hx, yi + hy, yi.
hv, wi = ∑ ik kv + ik wk2V
4 k=0
By then using the fact that hT (x), T (x)i = hx, xi and
1 3 1 hT (y), T (y)i = hy, yi, we can simplify the above
= ∑ k kik v + wk2V equation to the form
4 k=0 i
 
1 3 k Re hT (x), T (y)i = Re hx, yi .
= ∑ i kw + ik vk2V = hw, vi.
4 k=0 If V is a vector space over R, then this implies
All that remains is to show that hv, w + cxi = hv, wi + hT (x), T (y)i = hx, yi for all x, y ∈ V. If instead V
chv, xi for all v, w, x ∈ V and all c ∈ C. The fact that is a vector space over C then we can repeat the above
hv, w + xi = hv, wi + hv, xi for all v, w, x ∈ V can argument with v = x + iy to see that
be proved in a manner identical to that given in the  
Im hT (x), T (y)i = Im hx, yi ,
proof of Theorem 1.D.8, so we just need to show
that hv, cwi = chv, wi for all v, w ∈ V and c ∈ C. As so in this case we have hT (x), T (y)i = hx, yi for all
suggested by the hint, we first notice that x, y ∈ V too, establishing (c).
1 3 1 Finally, to see that (c) =⇒ (b), note that if we rear-
hv, iwi = ∑ ik kv + ik+1 wk2V
4 k=0
range the equation hT (x), T (y)i = hx, yi slightly, we
get
1 3 1
= ∑ ik−1 kv + iwk2V
4 k=0
hx, (T ∗ ◦ T − IV )(y)i = 0 for all x, y ∈ V.
Well, if we choose x = (T ∗ ◦ T − IV )(y) then this
1 3 1 implies
=i ∑ ik kv + iwk2V = ihv, wi.
4 k=0
k(T ∗ ◦ T − IV )(y)k2V = 0 for all y ∈ V.
When we combine this observation with the fact that
this inner product reduces to exactly the one from This implies (T ∗ ◦T −IV )(y) = 0, so (T ∗ ◦T )(y) = y
the proof of Theorem 1.D.8 when v and w are real, for all y ∈ V, which means exactly that T ∗ ◦ T = IV ,
we see that hv, cwi = chv, wi simply by splitting thus completing the proof.
everything into their real and imaginary parts.
1.D.24 For the “if” direction, we note (as was noted in the
1.D.20 If dim(V) > dim(W) and {v1 , . . . , vn } is a basis of proof of Theorem 1.D.10) that any P with the speci-
V, then {T (v1 ), . . . , T (vn )} is necessarily linearly fied form just permutes the entries of v and possibly
dependent (since it contains more than dim(W) vec- multiplies them by a number with absolute value
tors). There thus exist scalars c1 , . . . , cn , not all equal 1, and such an operation leaves kvk1 = ∑ j |v j | un-
to 0, such that changed.
T (c1 v1 + · · · + cn vn ) = c1 T (v1 ) + · · · + cn T (vn )
= 0.

For the “only if”  direction, suppose P = 1.D.26 Notice that if pn (x) = xn then
p1 | p2 | · · · | pn is an isometry of the 1-norm. Z 1
Then Pe j = p j for all j, so 1
kpn k1 = xn dx = and
0 n+1
kp j k1 = kPe j k1 = ke j k1 = 1 for all 1 ≤ j ≤ n. 
kpn k∞ = max xn = 1.
Similarly, P(e j + ek ) = p j + pk for all j, k, so 0≤x≤1

kp j + pk k1 = kP(e j + ek )k1 = ke j + ek k1 = 2 In particular, this means that there does not exist a
for all j and k. We know from the triangle inequal- constant C > 0 such that 1 = kpn k∞ ≤ Ckpn k1 =
ity (or equivalently from Exercise 1.D.10(b)) that C/(n + 1) for all n ≥ 1, since we would need
the above equality can only hold if there exist non- C ≥ n + 1 for all n ≥ 1.
negative real constants ci, j,k ≥ 0 such that, for each
i, j, k, it is the case that either pi, j = ci, j,k pi,k or 1.D.27 Since k · ka and k · kb are equivalent, there exist
pi,k = 0. scalars c,C > 0 such that
However, we can repeat this argument with the fact
ckvka ≤ kvkb ≤ Ckvka for all v ∈ V.
that P(e j − ek ) = p j − pk for all j, k to see that
kp j − pk k1 = kP(e j − ek )k1 = ke j − ek k1 = 2 Similarly, equivalence of k · kb and k · kc tells us that
there exist scalars d, D > 0 such that
for all j and k as well. Then by using Exer-
cise 1.D.10(b) again, we see that there exist non- dkvkb ≤ kvkc ≤ Dkvkb for all v ∈ V.
negative real constants di, j,k ≥ 0 such that, for each
i, j, k, it is the case that either pi, j = −di, j,k pi,k or Basic algebraic manipulations of these inequalities
pi,k = 0. show that
Since each ci, j,k and di, j,k is non-negative, it follows cdkvka ≤ kvkc ≤ CDkvka for all v ∈ V,
that if pi,k 6= 0 then pi, j = 0 for all j 6= k. In other
words, each row of P contains at most one non-zero so k · ka and k · kc are equivalent too.
entry (and each row must indeed contain at least
one non-zero entry since P is invertible by Exer- 1.D.28 Both directions of this claim follow just from notic-
cise 1.D.21). ing that all three defining properties of k · kV follow
Every row thus has exactly one non-zero entry. By immediately from the three corresponding properties
using (again) the fact that isometries must be invert- of k · kFn (e.g., if we know that k · kF n is a norm then
ible, it follows that each of the non-zero entries must we can argue that kvk = 0 implies [v]B Fn = 0, so
occur in a distinct column (otherwise there would [v]B = 0, so v = 0).
be a zero column). The fact that each non-zero entry More generally, we recall that the function that sends
has absolute value 1 follows from simply noting that a vector v ∈ V to its coordinate vector [v]B ∈ Fn is
P must preserve the 1-norm of each standard basis an isomorphism. It is straightforward to check that
vector e j . if T : V → W is an isomorphism then the function
 defined by kvk = kT (v)kW is a norm on V (com-
1.D.25 Instead of just noting that max |pi, j ± pi,k | = 1 pare with the analogous statement for inner products
1≤i≤n
for all 1 ≤ j, k ≤ n, we need to observe that given in Exercise 1.3.25).
max1≤i≤n |pi, j + zpi,k | = 1 whenever z ∈ C is
such that |z| = 1. The rest of the proof follows with
no extra changes needed.
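The characterization in Exercises 1.D.24 and 1.D.25 can be illustrated numerically: matrices with exactly one entry of absolute value 1 in each row and column preserve the 1-norm, while other 2-norm isometries (such as rotations) do not. A small NumPy sketch (not part of the original solutions; the specific matrices and vector are arbitrary examples):

import numpy as np

rng = np.random.default_rng(5)
P = np.array([[0, -1, 0],
              [0,  0, 1],
              [1,  0, 0]])                           # one entry of absolute value 1 per row and column
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])   # a rotation: an isometry of the 2-norm

v = rng.standard_normal(3)
print(np.isclose(np.linalg.norm(P @ v, 1), np.linalg.norm(v, 1)))   # True
print(np.isclose(np.linalg.norm(Q @ v, 1), np.linalg.norm(v, 1)))   # False for almost every v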

Section 2.1: The Schur and Spectral Decompositions


2.1.1 (a) Normal. 2.1.3 (a) The eigenvalues of this matrix are 3 and √
(c) Unitary and normal. 4, with corresponding
√ eigenvectors (1, 1)/ 2
(e) Hermitian, skew-Hermitian, and normal. and (3, 2)/ 13, respectively (only one eigen-
(g) Unitary, Hermitian, and normal. value/eigenvector pair is needed). If we choose
to Schur triangularize via λ = 3 then we get
2.1.2 (a) Not normal, since " # " #
" # " # 1 1 1 3 5
U=√ and T = ,
5 1 5 −1 2 1 −1 0 4
A∗ A = 6= = AA∗ .
1 10 −1 10
whereas if we Schur triangularize via λ = 4
(c) Is normal, since then we get
" # " # " #
2 0 1 3 2 4 5
A∗ A = = AA∗ . U=√ and T = .
0 2 13 2 −3 0 3
(e) Is normal (all Hermitian matrices are). These are both valid Schur triangularizations
(g) Not normal. of A (and there are others too).

2.1.4 In all parts of this question, we call the given matrix 2.1.12 If A has Schur triangularization A = UTU ∗ , then
A. cyclic commutativity of the trace shows that kAkF =
(a) A = UDU ∗ , where kT kF . Since the diagonal entries of T are the eigen-
" # " # values of A, we have
5 0 1 1 1 s
D= and U = √ . n
0 1 2 1 −1 kT kF = ∑ |λ j |2 + ∑ |ti, j |2 .
j=1 i< j
(c) A = UDU ∗ , where
" # " # It follows that
1 0 1 1 −1 s
D= and U = √ . n
0 −1 2 i i kAkF = ∑ |λ j |2
j=1
(e) A = UDU ∗ , where
  if and only if T can actually be chosen to be diagonal,

1 + 3i 0 0 which we know from the spectral decomposition
1 √ 
happens if and only if A is normal.
D=  
1 − 3i 0 and
2 0
0 0 4 2.1.13 Just apply Exercise 1.4.19 with B = A and C = A∗ . In
 
2 2 −2 particular, the equivalence of conditions (a) and (c)
1  √ √  gives us what we want.
U= √   −1 + 3i −1 − 3i −2 .
2 3 √ √ 
−1 − 3i −1 + 3i −2 2.1.14 If A∗ ∈ span I, A, A2 , A3 , . . . then A∗ commutes
with A (and thus A is normal) since A commutes
2.1.5 (b) False. For example, the matrices with each of its powers.
" # " # On the other hand, if A is normal then it has a spec-
1 0 0 1 tral decomposition A = UDU ∗ , and A∗ = UDU ∗ .
A= and B =
0 0 −1 0 Then let p be the interpolating polynomial with the
property that p(λ j ) = λ j for all 1 ≤ j ≤ n (some of
are normal, but A + B is not. the eigenvalues of A might be repeated, but that is
(d) False. This was shown in part (b) above. OK because the eigenvalues of A∗ are then repeated
(f) False. By the real spectral decomposition, we as well, so we do not run into a problem with trying
know that such a decomposition of A is pos- to set p(λ j ) to equal two
sible if and only if A is symmetric. There are  different values).
Then
p(A) = A∗ , so A∗ ∈ span I, A, A2 , A3 , . . . . In partic-
(real) normal matrices that are not symmetric ular, this tells us
(and they require a complex D and/or U).  that if A has k distinct
eigenvalues
then A∗ ∈ span I, A, A2 , A3 , . . . , Ak−1 .
(h) False. For a counter-example, just pick any
(non-diagonal) upper triangular matrix with 2.1.17 In all cases, we write A in a spectral decomposition
real diagonal entries. However, this becomes A = UDU ∗ .
true if you add in the restriction that A is nor- (a) A is Hermitian if and only if A∗ = (UDU ∗ )∗ =
mal and has real eigenvalues. UD∗U ∗ equals A = UDU ∗ . Multiplying on the
left by U ∗ and the right by U shows that this
2.1.7 Recall that pA (λ ) = λ 2 − tr(A)λ + det(A). By the happens if and only if D∗ = D, if and only if
Cayley–Hamilton theorem, it follows that the entries of D (i.e., the eigenvalues of A) are
pA (A) = A2 − tr(A)A + det(A)I = O. all real.
(b) The same as part (a), but noting that D∗ = −D
Multiplying this equation through by A−1 shows that if and only if its entries (i.e., the eigenvalues of
det(A)A−1 = tr(A)I − A, which we can rearrange as A) are all imaginary.
" # (c) A is unitary if and only if I = A∗ A =
−1 1 d −b (UDU ∗ )∗ (UDU ∗ ) = UD∗ DU ∗ . Multiplying
A = .
det(A) −c a on the left by U ∗ and the right by U shows
that A is unitary if and only if D∗ D = I, which
The reader may have seen this formula when learning (since D is diagonal) is equivalent to |d j, j |2 = 1
introductory linear algebra. for all 1 ≤ j ≤ n.
(d) If A is not normal then we can let it be triangu-
2.1.9 (a) The characteristic polynomial of A is pA (λ ) = lar (but not diagonal) with whatever eigenval-
λ 3 −3λ 2 +4, so the Cayley–Hamilton theorem ues (i.e., eigenvalues) we like. Such a matrix
tells us that A3 − 3A2 + 4I = O. Moving I to is not normal (see Exercise 2.1.16) and thus is
one side and then factoring A out of the other not Hermitian, skew-Hermitian, or unitary.
side then gives I = A( 43 A − 14 A2 ). It follows
that A−1 = 34 A − 41 A2 . 2.1.19 (a) Use the spectral decomposition to write A =
(b) From part (a) we know that A3 − 3A2 + 4I = O. UDU ∗ , where D is diagonal with the eigenval-
Multiplying through by A gives A4 − 3A3 + ues of A along its diagonal. If we recall that
4A = O, which can be solved for A to get rank is similarity-invariant then we see that
A = 14 (3A3 − A4 ). Plugging this into the for- rank(A) = rank(D), and the rank of a diagonal
mula A−1 = 34 A − 41 A2 gives us what we want: matrix equals the number of non-zero entries
A−1 = 16 9 3
A − 43 A4 − 14 A2 . that it has (i.e., the rank(D) equals the number
of non-zero eigenvalues of A).

(b) Any non-zero upper triangular matrix with Since U is unitary and D is diagonal, this com-
all diagonal entries (i.e., eigenvalues) equal pletes the inductive step and the proof.
to 0 works. They have non-zero rank, but no
non-zero eigenvalues. 2.1.23 (a) See solution to Exercise 1.4.29(a).
(b) Orthogonality of P tells us that hP(v), v −
2.1.20 (a) We already proved this in the proof of Theo- P(v)i = 0, so hT (v), vi = 0 for all v ∈ V. Ex-
rem 2.1.6. Since A is real and symmetric, we ercise 1.4.28 then tells us that T ∗ = −T .
have (c) If B is an orthonormal basis of V then [T ]TB =
λ kvk2 = λ v∗ v = v∗ Av = v∗ A∗ v −[T ]B , so tr([T ]B ) = tr([T ]TB ) = −tr([T ]B ). It
follows that tr([P]B − [P]TB [P]B ) = tr([T ]B ) = 0,
= (v∗ Av)∗ = (λ v∗ v)∗ = λ kvk2 , so tr([P]B ) = tr([P]TB [P]B ) = k[P]TB [P]B k2F .
which implies λ = λ (since every eigenvector v Since tr([P]B ) equals the sum of the eigen-
is, by definition, non-zero), so λ is real. values of [P]B , all of which are 0 or 1, it
(b) Let λ be a (necessarily real) eigenvalue of A with follows from Exercise 2.1.12 that [P]B is nor-
corresponding eigenvector v ∈ Cn . Then mal and thus has a spectral decomposition
[P]B = UDU ∗ . The fact that [P]TB = [P]∗B = [P]B
Av = Av = λ v = λ v,
and thus P∗ = P follows from again recalling
so v is also an eigenvector corresponding to λ . that the diagonal entries of D are all 0 or 1 (and
Since linear combinations of eigenvectors (corre- thus real).
sponding to the same eigenvalue) are still eigen-
vectors, we conclude that Re(v) = (v + v)/2 is a 2.1.26 If A has distinct eigenvalues, we can just notice that
real eigenvector of A corresponding to the eigen- if v is an eigenvector of A (i.e., Av = λ v for some λ )
value λ . then ABv = BAv = λ Bv, so Bv is also an eigenvector
(c) We proceed by induction on n (the size of A) and of A corresponding to the same eigenvalue. However,
note that the n = 1 base case is trivial since every the eigenvalues of A being distinct means that its
1 × 1 real symmetric matrix is real diagonal. For eigenspaces are 1-dimensional, so Bv must in fact be
the inductive step, let λ be a real eigenvalue of a multiple of v: Bv = µv for some scalar µ, which
A with corresponding real eigenvector v ∈ Rn . means exactly that v is an eigenvector of B as well.
By using the Gram–Schmidt process we can find If the eigenvalues of A are not distinct, we instead
a unitary matrix V ∈ Mn (R) with v as its first proceed by induction on n (the size of the matrices).
column:   The base case n = 1 is trivial, and for the inductive
V = v | V2 , step we suppose that the result holds for matrices of
where V2 ∈ Mn,n−1 (R) satisfies V2T v = 0 (since size (n−1)×(n−1) and smaller. Let {v1 , . . . , vk } be
V is unitary, v is orthogonal to every column of an orthonormal basis of any one of  the eigenspaces S
V2 ). Then direct computation shows that of A. If we let V1 = v1 | · · · | vk then we can extend
 
 T   V1 to a unitary matrix V = V1 | V2 . Furthermore,
V T AV = v | V2 A v | V2
 T   T  Bv j ∈ S for all 1 ≤ j ≤ k, by the same argument
v   v Av vT AV2 used in the previous paragraph, and straightforward
= T Av | AV2 = T T
V2 λV2 v V2 AV2 calculation shows that
 T
  ∗ 
λ 0 V1 AV1 V1∗ AV2
= . V ∗ AV = and
0 V2T AV2 O ∗
V2 AV2
 ∗ 
We now apply the inductive hypothesis—since V1 BV1 V1∗ BV2
V2T AV2 is an (n − 1) × (n − 1) symmetric matrix, V ∗ BV = .
O V2∗ BV2
there exists a unitary matrix U2 ∈ Mn−1 (R) and
a diagonal D2 ∈ Mn−1 (R) such that V2T AV2 = Since the columns of V1 form an orthonormal basis
U2 D2U2T . It follows that of an eigenspace of A, we have V1∗ AV1 = λ Ik , where
  λ is the corresponding eigenvalue, so V1∗ AV1 and
λ 0T
V T AV = V1∗ BV1 commute. By the inductive hypothesis, V1∗ AV1
0 U2 D2U2T
    and V1∗ BV1 share a common eigenvector x ∈ Ck , and
1 0 T λ 0T 1 0T it follows that V1 x is a common eigenvector of each
= .
0 U2 0 D2 0 U2T of A and B.
By multiplying on the left by V and on the right
by V T , we see that A = UDU T , where
   
1 0T λ 0T
U =V and D = .
0 U2 0 D2
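The real spectral decomposition proved in Exercise 2.1.20 is exactly what np.linalg.eigh computes. A quick NumPy check (not part of the original solution; the random symmetric matrix is an arbitrary example):

import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = B + B.T                                  # a real symmetric matrix

D, U = np.linalg.eigh(A)                     # real eigenvalues and orthonormal eigenvectors
print(np.allclose(U @ np.diag(D) @ U.T, A))  # A = U D U^T
print(np.allclose(U.T @ U, np.eye(4)))       # U is real orthogonal
print(np.all(np.isreal(D)))                  # the eigenvalues are all real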

Section 2.2: Positive Semidefiniteness


2.2.1 (a) Positive definite, since its eigenvalues equal 1. (g) Positive definite, since it is strictly diagonally
(c) Not positive semidefinite, since it is not even dominant.
square. 2.2.2 (a) D(1, 2) and D(−1, 3).
(e) Not positive semidefinite, since it is not even
Hermitian.

(c) D(1, 0), D(2, 0), and D(3, 0). Note that these 2.2.11 First note that a j,i = ai, j since PSD matrices are
discs are really just points located at 1, 2, and 3 Hermitian, so it suffices to show that ai, j = 0 for all
in the complex plane, so the eigenvalues of this i. To this end, let v ∈ Fn be a vector with v j = −1
matrix must be 1, 2, and 3 (which we could see and all entries except for vi and v j equal to 0. Then
directly since it is diagonal). v∗ Av = a j, j − 2Re(vi ai, j ). If it were the case that
(e) D(1, 3), D(3, 3), and another (redundant) copy ai, j 6= 0, then we could choose vi to be a sufficiently
of D(1, 3). large multiple of ai, j so that v∗ Av < 0. Since this
" # contradicts positive semidefiniteness of A, we con-
2.2.3 (a) 1 1 −1 clude that ai, j = 0.

2 −1 1
  2.2.13 The “only if” direction follows immediately from
(c) 1 0 0 Theorem 2.2.4. For the “if” direction, note that for
 √ 
0 2 0  any vector v (of suitable size that we partition in the
 
√ same way as A) we have
0 0 3   
 √ √ 
(e)
1 + 3 2 1 − 3 A1 · · · O v1
1  h i  
. . 
√  2  .. ..   ... 


 2 4  v Av = v1 · · · vn  .
∗ ∗

2 3 √ √ . . 
1− 3 2 1+ 3 O · · · An vn
2.2.4 (a) This matrix is positive semidefinite so we can
just set U = I and choose P to be itself. = v∗1 A1 v1 + · · · + v∗n An vn .
(d) Since this matrix is invertible, its polar decom- Since A1 , A2 , . . . , An are positive (semi)definite it
position is unique: follows that each term in this sum is non-negative
   
1 −2 2 2 1 0 (or strictly positive, as appropriate), so the sum is as
1    well, so A is positive (semi)definite.
U = −2 1 2 , P = 1 3 1 .
3
2 2 1 0 1 1 2.2.14 This follows immediately from Theorem 2.2.10,
2.2.5 (a) False. The matrix B from Equation (2.2.1) is a which says that A∗ A = B∗ B if and only if there exists
counter-example. a unitary matrix U such that B = UA. If we choose
(c) True. The identity matrix is Hermitian and has B = A∗ then we see that A∗ A = AA∗ (i.e., A is normal)
all eigenvalues equal to 1. if and only if A∗ = UA.
(e) False. For example, let
" # " # 2.2.16 (a) For the “if” direction just note that any real
2 1 1/2 1 linear combination of PSD (self-adjoint) ma-
A= and B = ,
1 1/2 1 2 trices is self-adjoint. For the “only” direction,
which are both positive semidefinite. Then note that if A = UDU ∗ is a spectral decompo-
" # sition of A then we can define P = UD+U ∗
2 4 and N = −UD−U ∗ , where D+ and D− are the
AB = , diagonal matrices containing the strictly pos-
1 2
itive and negative entries of D, respectively.
which is not even Hermitian, so it cannot be Then P and N are each positive semidefinite
positive semidefinite. and P − N = UD+U ∗ + UD−U ∗ = U(D+ +
(g) False. A counter-example to this claim was D− )U ∗ = UDU ∗ = A.
provided in Remark 2.2.2. [Side note: The P and N constructed in this way
(i) False. I has many non-PSD square roots (e.g., are called the positive semidefinite part and
any diagonal matrix with ±1) entries on its negative semidefinite part of A, respectively.]
diagonal. (b) We know from Remark 1.B.1 that we can write
A as a linear combination of the two Hermitian
2.2.6 (a) x ≥ 0, since a diagonal matrix is PSD if and matrices A + A∗ and iA − iA∗ :
only if its diagonal entries are non-negative.
(c) x = 0. The matrix (which we will call A) is 1 1
A= (A + A∗ ) + (iA − iA∗ ).
clearly PSD if x = 0, and if x 6= 0 then we note 2 2i
from Exercise 2.2.11 that if A is PSD and has Applying the result of part (a) to these 2 Hermi-
a diagonal entry equal to 0 then every entry in tian matrices writes A as a linear combination
that row and column must equal 0 too. of 4 positive semidefinite matrices.
More explicitly, we can see that A is not (c) If F = R then every PSD matrix is symmetric,
PSD by letting v = (0, −1, v3 ) and computing and the set of symmetric matrices is a vector
v∗ Av = 1 − 2v3 x, which is negative as long as space. It follows that there is no way to write a
we choose v3 large enough. non-symmetric matrix as a (real) linear combi-
nation of (real) PSD matrices.
2.2.7 If A is PSD then we can write A = UDU ∗ , where U
is unitary and D is diagonal with non-negative real 2.2.19 (a) Since A is PSD, we know that a j, j = e∗j Ae j ≥ 0
diagonal entries. However, since A is also unitary, for all 1 ≤ j ≤ n. Adding up these non-negative
its eigenvalues (i.e., the diagonal entries of D) must diagonal entries shows that tr(A) ≥ 0.
lie on the unit circle in the complex plane. The only
non-negative real number on the unit circle is 1, so
D = I, so A = UDU ∗ = UU ∗ = I.

(b) Write A = D∗ D and B = E ∗ E (which we 2.2.23 To see that (a) =⇒ (b), let v be an eigenvector of
can do since A and B are PSD). Then A with corresponding eigenvalue λ . Then Av = λ v,
cyclic commutativity of the trace shows and multiplying this equation on the left by v∗ shows
that tr(AB) = tr(D∗ DE ∗ E) = tr(ED∗ DE ∗ ) = that v∗ Av = λ v∗ v = λ kvk2 . Since A is positive defi-
tr((DE ∗ )∗ (DE ∗ )). Since (DE ∗ )∗ (DE ∗ ) is nite, we know that v∗ Av > 0, so it follows that λ > 0
PSD, it follows from part (a) that this quan- too.
tity is non-negative. To see that (b) =⇒ (d), we just apply the spectral
(c) One example that works is decomposition theorem (either the complex Theo-
" # " # rem 2.1.4 or the real Theorem 2.1.6, as appropriate)
1 1 1 −2 to A.
A= , B= , √
1 1 −2 4 To see why (d) =⇒ (c), let D be the diagonal
" # matrix that is obtained by taking the (strictly posi-
4 −2 tive) square√root of the diagonal entries of D, and
C= .
−2 1 define B = DU ∗ . Then B is invertible since it is
the
√ product √of two invertible
√ ∗ √matrices, and B∗ B =
It is straightforward to verify that each of ( DU ) ( DU ) = U D DU = UDU ∗ = A.
∗ ∗ ∗ ∗
A, B, and C are positive semidefinite, but Finally, to see that (c) =⇒ (a), we let v ∈ Fn be any
tr(ABC) = −4. non-zero and we note that

2.2.20 (a) The “only if” direction is exactly Exer- v∗ Av = v∗ B∗ Bv = (Bv)∗ (Bv) = kBvk2 > 0,
cise 2.2.19(b). For the “if” direction, note that with the final inequality being strict because B is
if A is not positive semidefinite then it has a invertible so Bv 6= 0 whenever v 6= 0.
strictly negative eigenvalue λ < 0 with a cor-
responding eigenvector v. If we let B = vv∗ 2.2.24 In all parts of this question, we prove the statement
then tr(AB) = tr(Avv∗ ) = v∗ Av = v∗ (λ v) = for positive definiteness. For semidefiniteness, just
λ kvk2 < 0. make the inequalities not strict.
(b) In light of part (a), we just need to show (a) If A and B are self-adjoint then so is A + B, and
that if A is positive semidefinite but not pos- if v∗ Av > 0 and v∗ Bv > 0 for all v∈Fn then
itive definite, then there is a PSD B such v∗ (A + B)v=v∗ Av + v∗ Bv>0 for all v ∈ Fn too.
that tr(AB) = 0. To this end, we just let v (b) If A is self-adjoint then so is cA (recall c is real),
be an eigenvector of A corresponding to the and v∗ (cA)v = c(v∗ Av) > 0 for all v ∈ Fn when-
eigenvalue 0 of A and set B = vv∗ . Then ever c > 0 and A is positive definite.
tr(AB) = tr(Avv∗ ) = v∗ Av = v∗ (0v) = 0. (c) (AT )∗ = (A∗ )T , so AT is self-adjoint whenever
A is, and AT always has the same eigenvalues as
2.2.21 (a) The fact that c ≤ 0 follows simply from choos- A, so if A is positive definite (i.e., has positive
ing B = O. If A were not positive semidefinite eigenvalues) then so is AT .
then it has a strictly negative eigenvalue λ < 0
with a corresponding eigenvector v. If we let 2.2.25 (a) We use characterization (c) of positive semidef-
x ≥ 0 be a real number and B = I + xvv∗ then initeness from Theorem 2.2.1. If we let
 {v1 , v2 , . . . , vn } be the columns of B (i.e., B =
tr(AB) = tr A(I + xvv∗ ) = tr(A) + xλ kvk2 , [ v1 | v2 | · · · | vn ]) then we have
which can be made arbitrarily large and neg-  ∗ 
v1
ative (in particular, more negative than c) by  v∗2 
choosing x sufficiently large.  
B∗ B =  .  [ v1 | v2 | · · · | vn ]
(b) To see that c ≤ 0, choose B = εI so that  .. 
tr(AB) = εtr(A), which tends to 0+ as ε → 0+ . v∗n
Positive semidefiniteness of A then follows via  ∗ 
the same argument as in the proof of part (a) v1 v1 v∗1 v2 · · · v∗1 vn
 ∗ 
(notice that the matrix B = I + xvv∗ from that v2 v1 v∗2 v2 · · · v∗2 vn 
 
proof is positive definite). = . .. ..  .
 . .. 
 . . . . 
2.2.22 (a) Recall from Theorem A.1.2 that rank(A∗ A) =
v∗n v1 v∗n v2 ··· v∗n vn
rank(A). Furthermore, if A∗ A has spectral de-
composition A∗ A = UDU ∗ then rank(A∗ A) In particular, it follows that A = B∗ B (i.e., A
equals the number of non-zero diagonal en- is positive semidefinite) if and only if ai, j =
tries of D, which equals the v∗i v j = vi · v j for all 1 ≤ i, j ≤ n.
√ number of non-
zero diagonal entries (b) The set {v1 , v2 , . . . , vn } is linearly indepen-
√ of D, which equals
rank(|A|) = rank( A∗ A). dent if and only if the matrix B from our
p
(b) Just recall that kAkF = tr(A∗ A), so |A| F = proof of part (a) is invertible, if and only if
p p rank(B∗ B) = rank(B) = n, if and only if B∗ B
tr(|A|2 ) = tr(A∗ A) too.
2 is invertible, if and only if B∗ B is positive
(c) We compute |A|v = (|A|v) · (|A|v) = definite.
v∗ |A|2 v = v∗ A∗ Av = (Av) · (Av) = kAvk2 for
all v ∈ Fn . 2.2.28 (a) This follows from multilinearity of the deter-
minant: multiplying one of the columns of a
matrix multiplies its determinant by the same
amount.

(b) Since A∗ A is positive semidefinite, its eigenval- (d) Equality is attained if and only if the columns
ues λ1 , λ2 , . . . , λn are non-negative, so we can of A form a mutually orthogonal set. The
apply the AM–GM inequality to them to get reason for this is that equality is attained
!n in the AM–GM inequality if and only if
1 n

det(A A) = λ1 λ2 · · · λn ≤ ∑ λj λ1 = λ2 = . . . = λn , which happens if and
n j=1 only if A∗ A is the identity matrix, so A must
 n  n be unitary in part (b) above. However, the
1 1
= tr(A∗ A) = kAk2F argument in part (b) relied on having already
n n
!n scaled A so that its columns have length 1—
1 n after “un-scaling” the columns, we see that any
= ∑ ka j k2 = 1n = 1.
n j=1 matrix with orthogonal columns also attains
equality.
(c) We just recall that the determinant is mul-
tiplicative, so det(A∗ A) = det(A∗ ) det(A) =
| det(A)|2 , so | det(A)|2 ≤ 1, so | det(A)| ≤ 1.
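The inequality proved in Exercise 2.2.28 (Hadamard's inequality) and its equality condition from part (d) are easy to test numerically. A NumPy sketch (not part of the original solution; the random matrix is an arbitrary example):

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))

col_norms = np.linalg.norm(A, axis=0)                 # 2-norms of the columns of A
print(abs(np.linalg.det(A)) <= np.prod(col_norms))    # True: |det(A)| <= product of column norms

Q, _ = np.linalg.qr(A)                                # a matrix with orthonormal columns
print(np.isclose(abs(np.linalg.det(Q)),
                 np.prod(np.linalg.norm(Q, axis=0)))) # equality when the columns are orthogonal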

Section 2.3: The Singular Value Decomposition


2.3.1 In all parts of this solution, we refer to the given (g) False. All we can say in general is that A2 =
" its SVD# as A = UΣV
matrix as A and ∗ UDV ∗UDV ∗ . This does not simplify any further,
" . #
1 1 1 4 0 since we cannot cancel out the V ∗U in the mid-
(a) U = √ , Σ = , V = dle.
2 −1 1 0 2
" # (i) True. If A has SVD A = UΣV ∗ then AT =
1 −1 1 V ΣT U T is also an SVD, and Σ and ΣT have

2 1 1 the same diagonal entries.
√ √  √ 
2 3 −1 6 0
1  √    2.3.5 Use the singular value decomposition to write
(c) U = √   2 0 2 , Σ= 0 2, A = UΣV ∗ . If each singular value is 1 then Σ = I,
6 √ √
− 2 3 1 0 0 so A = UV ∗ is unitary. Conversely, there were a
" # singular value σ unequal to 1 then there would be
1 1 −1
V=√ a corresponding normalized left- and right-singular
2 1 1 vectors u and v, respectively, for which Av = σ u, so
   √ 
1 2 2 6 2 0 0 kAvk = kσ uk = |σ |kuk = σ 6= 1 = kvk, so A is not
1    unitary by Theorem 1.4.9.
(e) U = −2 −1 2 , Σ =  0 3 0,
3
−2 2 −1 0 0 0 2.3.6 For the “only if” direction, notice that A has singular
 
1 0 1 value decomposition
1  √  " #
V=√ 0 2 0 D O ∗
2 A=U V ,
−1 0 1 O O
√ √
2.3.2 (a) 4 (c) 6 (e) 6 2 where D is an r × r diagonal matrix with non-zero
diagonal entries (so in particular, D is invertible).
Then
" # " #
2.3.3 (a) range(A): {(1, 0), (0, 1)}, Ir O D O ∗
A=U Q, where Q = V
null(A∗ ): {}, O O O I
range(A∗ ): {(1, 0), (0, 1)}, and
is invertible. For the “if” direction, just use the fact
null(A): {}. √ √ that rank(Ir ) = r and multiplying one the left or right
(c) range(A): {(1, 1, −1)/√ 3, (1, 0, 1)/ 2},
by an invertible matrix does not change rank.
null(A∗ ): {(−1, 2, 1)/ 6},
range(A∗ ): {(1, 0), (0, 1)}, and
2.3.7 If A = UΣV ∗ is a singular value decom-
null(A): {}.
position then | det(A)| = | det(UΣV ∗ )| =
(e) range(A): {(1, −2, −2)/3, (2, −1, 2)/3},
| det(U) det(Σ) det(V ∗ )| = | det(Σ)| = σ1 σ2 · · · σn ,
null(A∗ ): {(2, 2, −1)/3},√ where the second-to-last equality follows from the
range(A∗ ): {(1, 0, −1)/
√ 2, (0, 1, 0)}, and
fact that if U is unitary then | det(U)| = 1 (see Exer-
null(A): {(1, 0, 1)/ 2}.
cise 1.4.11).
2.3.4 (a) False. This statement is true if A is normal by
2.3.10 Submultiplicativity of the operator norm tells us that
Theorem 2.3.4, but is false in general.
kIk = kAA−1 k ≤ kAkkA−1 k, so dividing through by
(c) True. We can write A = UΣV ∗ , so A∗ A =
kAk gives kA−1 k ≥ 1/kAk. To see that equality does
V Σ∗ ΣV ∗ , whose singular values are the diago-
not always hold, consider
nal entries of Σ∗ Σ, which are the squares of the " #
singular values of A. 1 0
(e) False. For example, if A = I then kA∗ AkF = A= ,
√ 0 2
n, but kAk2F = n.
which has kAk = 2 and kA−1 k = 1.
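A quick numerical illustration of Exercise 2.3.10 (my own addition): for the diagonal matrix above, NumPy's matrix 2-norm gives ∥A∥ = 2 and ∥A⁻¹∥ = 1, which is strictly larger than 1/∥A∥ = 1/2.

    import numpy as np

    A = np.diag([1.0, 2.0])
    print(np.linalg.norm(A, 2))                 # 2.0, the operator norm of A
    print(np.linalg.norm(np.linalg.inv(A), 2))  # 1.0, strictly bigger than 1/2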
2.3.12 (a) If A = UΣV∗ is an SVD (with the diagonal entries of Σ being σ1 ≥ σ2 ≥ ···) then ∥AB∥F² = tr((UΣV∗B)∗(UΣV∗B)) = tr(B∗VΣ∗ΣV∗B) = tr(Σ∗ΣV∗BB∗V) ≤ σ1²tr(V∗BB∗V) = σ1²tr(BB∗) = σ1²∥B∥F² = ∥A∥²∥B∥F², where the inequality in the middle comes from the fact that multiplying V∗BB∗V on the left by Σ∗Σ multiplies its j-th diagonal entry by σj², which is no larger than σ1².
(b) This follows from part (a) and the fact that ∥A∥ = σ1 ≤ √(σ1² + ··· + σ²min{m,n}) = ∥A∥F.

2.3.13 The Cauchy–Schwarz inequality tells us that if ∥v∥ = 1 then |v∗Aw| ≤ ∥v∥∥Aw∥ = ∥Aw∥. Furthermore, equality is attained when v is parallel to Aw. It follows that the given maximization over v and w is equal to

    max_{w∈Fn} { ∥Aw∥ : ∥w∥ = 1 },

which (by definition) equals ∥A∥.

2.3.15 If v ∈ Cm and w ∈ Cn are unit vectors then

    [v∗ w∗] [cIm A; A∗ cIn] [v; w] = c(∥v∥² + ∥w∥²) + 2Re(v∗Aw).

This quantity is always non-negative (i.e., the block matrix is positive semidefinite) if and only if Re(v∗Aw) ≥ −c for all such v and w, which is equivalent to |v∗Aw| ≤ c (since we can multiply w by some e^{iθ} so that Re(v∗Aw) = −|v∗Aw|), which is equivalent to ∥A∥ ≤ c by Exercise 2.3.13.

2.3.16 All three properties follow quickly from the definition of ∥A∥ and the corresponding properties of the norm on Fn. Property (a):

    ∥cA∥ = max_{v∈Fn} { ∥cAv∥ : ∥v∥ ≤ 1 } = max_{v∈Fn} { |c|∥Av∥ : ∥v∥ ≤ 1 } = |c| max_{v∈Fn} { ∥Av∥ : ∥v∥ ≤ 1 } = |c|∥A∥.

Property (b):

    ∥A + B∥ = max_{v∈Fn} { ∥(A + B)v∥ : ∥v∥ ≤ 1 }
            = max_{v∈Fn} { ∥Av + Bv∥ : ∥v∥ ≤ 1 }
            ≤ max_{v∈Fn} { ∥Av∥ + ∥Bv∥ : ∥v∥ ≤ 1 }
            ≤ max_{v∈Fn} { ∥Av∥ : ∥v∥ ≤ 1 } + max_{v∈Fn} { ∥Bv∥ : ∥v∥ ≤ 1 }
            = ∥A∥ + ∥B∥,

where the final inequality comes from the fact that there is more freedom in two separate maximizations than there is in a single maximization. For property (c), the fact that ∥A∥ ≥ 0 follows simply from the fact that it involves maximizing a bunch of non-negative quantities, and ∥A∥ = 0 if and only if A = O since ∥A∥ = 0 implies ∥Av∥ = 0 for all v, which implies A = O.

2.3.17 (a) Suppose A has SVD A = UΣV∗. If we let B = UV∗ then B is unitary and thus has ∥UV∗∥ = 1 (by Exercise 2.3.5, for example), so the given maximization is at least as large as ⟨A, B⟩ = tr(VΣU∗UV∗) = tr(Σ) = ∥A∥tr.
To show the opposite inequality, note that if B is any matrix for which ∥B∥ ≤ 1 then |⟨A, B⟩| = |⟨UΣV∗, B⟩| = |⟨Σ, U∗BV⟩| = |∑ⱼ σj[U∗BV]j,j| ≤ ∑ⱼ σj|uj∗Bvj| ≤ ∑ⱼ σj∥B∥ ≤ ∑ⱼ σj = ∥A∥tr, where we referred to the j-th columns of U and V as uj and vj, respectively, and we used Exercise 2.3.13 to see that |uj∗Bvj| ≤ ∥B∥.
(b) Property (a) follows quickly from the fact that if A = UΣV∗ is a singular value decomposition then so is cA = U(cΣ)V∗. Property (c) follows from the fact that singular values are non-negative, and if all singular values equal 0 then the matrix they come from is UOV∗ = O. Finally, for property (b) we use part (a) above:

    ∥A + B∥tr = max_{C∈Mm,n} { |⟨A + B, C⟩| : ∥C∥ ≤ 1 }
              ≤ max_{C∈Mm,n} { |⟨A, C⟩| : ∥C∥ ≤ 1 } + max_{C∈Mm,n} { |⟨B, C⟩| : ∥C∥ ≤ 1 }
              = ∥A∥tr + ∥B∥tr.

2.3.19 Just notice that A and UAV have the same singular values since if A = U2ΣV2∗ is a singular value decomposition then so is UAV = UU2ΣV2∗V = (UU2)Σ(V∗V2)∗.

2.3.20 Recall from Exercise 2.1.12 that A is normal if and only if

    ∥A∥F = √( ∑_{j=1}^{n} |λj|² ).

Since

    ∥A∥F = √( ∑_{j=1}^{n} σj² )

by definition, we conclude that if A is not normal then these two sums do not coincide, so at least one of the terms in the sum must not coincide: σj ≠ |λj| for some 1 ≤ j ≤ n.

2.3.25 (a) Since P ≠ O, there exists some non-zero vector v in its range. Then Pv = v, so ∥Pv∥/∥v∥ = 1, so ∥P∥ ≥ 1.
(b) We know from Theorem 1.4.13 that ∥Pv∥ ≤ ∥v∥ for all v, so ∥P∥ ≤ 1. When combined with part (a), this means that ∥P∥ = 1.
(c) We know from Exercise 1.4.31 that every eigenvalue of P equals 0 or 1, so it has a Schur triangularization P = UTU∗, where

    T = [A B; O I + C],

where A and C are strictly upper triangular (i.e., we are just saying that the diagonal entries of T are all 0 in the top-left block and all 1 in the bottom-right block). If uj is the j-th column of U then ∥uj∥ = 1 and ∥Puj∥ = ∥UTU∗uj∥ = ∥UTej∥, which is the norm of the j-th column of T. Since ∥P∥ = 1, it follows that the j-th column of T cannot have norm bigger than 1, which implies B = O and C = O.
To see that A = O (and thus complete the proof), (c) To see that BR is real we note that symme-
note that P2 = P implies T 2 = T , which in turn try of B ensures that the (i, j)-entry of BR is
implies A2 = A, so Ak = A for all k ≥ 1. However, (bi,j + b̄i,j)/2 = Re(bi,j) (a similar calculation
the diagonal of A consists entirely of zeros, so shows that the (i, j)-entry of BI is Im(bi, j )).
the first diagonal above the main diagonal in A2 Since BR and BI are clearly Hermitian, they
consists of zeros, the diagonal above that one must be symmetric. To see that they commute,
in A3 consists of zeros, and so on. Since these we compute
powers of A all equal A itself, we conclude that
A = O. B∗ B = (BR + iBI )∗ (BR + iBI )
= B2R + B2I + i(BR BI − BI BR ).
2.3.26 (a) One simple example is
" # Since B∗ B, BR , and BI are all real, this implies
0 1+i BR BI − BI BR = O, so BR and BI commute.
A= , (d) Since BR and BI are real symmetric and com-
1+i 1
mute, we know by (the “Side note” underneath
which has of) Exercise 2.1.28 that there exists a unitary
" # matrix W ∈ Mn (R) such that each of W T BRW
0 −2i and W T BIW are diagonal. Since B = BR + iBI ,
A∗ A − AA∗ = ,
2i 0 we conclude that
so A is complex symmetric but not normal. W T BW = W T (BR + iBI )W
(b) Since A∗ A is positive semidefinite, its spectral
= (W T BRW ) + (W T BIW )
decomposition has the form A∗ A = V DV ∗ for
some real diagonal D and unitary V . Then is diagonal too.
∗ T ∗ T ∗ ∗ (e) By part (d), we know that if U = VW then
B B = (V AV ) (V AV ) = V A AV = D,
U T AU = W T (V T AV )W = W T BW
so B is real (and diagonal and entrywise non-
negative, but we do not need those properties). is diagonal. It does not necessarily have non-
Furthermore, B is complex symmetric (for all negative (or even real) entries on its diagonal,
unitary matrices V ) since but this can be fixed by multiplying U on
the right by a suitable diagonal unitary matrix,
BT = (V T AV )T = V T AT V = V T AV = B.
which can be used to adjust the complex phases
of the diagonal matrix as we like.
Section 2.4: The Jordan Decomposition
" # " #
2.4.1 (a) 1 1 (c)2 1 (e) Similar, since they both have
 
0 1 0 2 −1 0 0
   
(e) 1 0 0 (g) 3 0 0  
0 3 1
   
0 2 1 0 3 1 0 0 3
0 0 2 0 0 3
as their Jordan canonical form.

2.4.2 We already computed the matrix J in the Jordan de-


" √ #  
composition A = PJP−1 in Exercise 2.4.1, so here 2.4.4 (a) (e) e −2e −e
we just present a matrix P that completes the decom- 2 0
√ √  
0 3e e
position. Note that this matrix is not unique, so your 2/4 2
answer may differ. 0 −4e −e

" # " #
(a) 2 2 (c) 1 0 2.4.5 (a) False. Any matrix with a Jordan block of size
2 1 −2 −1 2 × 2 or larger (i.e., almost any matrix from
    this section) serves as a counter-example.
(e) 0 1 1 (g) 1 −2 −1
    (c) False. The Jordan canonical forms J1 and J2
2 1 2 1 1 0 are only unique up to re-ordering of their Jor-
1 1 2 2 −1 0 dan blocks (so we can shift the diagonal blocks
of J1 around to get J2 ).
(e) True. If A and B are diagonalizable then their
2.4.3 (a) Not similar, since their traces are not the same Jordan canonical forms are diagonal and so
(8 and 10). Theorem 2.4.3 tells us they are similar if those
(c) Not similar, since their determinants are not 1 × 1 Jordan blocks (i.e., their eigenvalues) co-
the same (0 and −2). incide.
(g) True. Since the function f (x) = ex is analytic 2.4.14 Since sin2 (x) + cos2 (x) = 1 for all x ∈ C, the func-
on all of C, Theorems 2.4.6 and 2.4.7 tell tion f (x) = sin2 (x) + cos2 (x) is certainly analytic (it
us that this sum converges for all A (and fur- is constant). Furthermore, f 0 (x) = 0 for all x ∈ C
thermore we can compute it via the Jordan and more generally f (k) (x) = 0 for all x ∈ C and in-
decomposition). tegers k ≥ 1. It follows from Theorem 2.4.6 that
f Jk (λ ) = I for all Jordan blocks Jk (λ ), and
2.4.8 We can choose C to be the standard basis of R2 Theorem 2.4.7 then tells us that f (A) = I for all
and T (v) = Av. Then it is straightforward to check A ∈ Mn (C).
that [T ]C = A. All that is left to do is find a basis
D such that [T ]D = B. To do so, we find a Jordan 2.4.16 (a) This follows directly from the definition of ma-
decomposition of A and B, which are actually diag- trix multiplication:
onalizations
" in#this case. In
" particular,
# if we define   k
1 2 1 1 A2 = ∑ ai,` a`, j ,
P1 = and P2 = then i, j
−1 5 −1 −2 `=1

" # which equals 0 unless ` ≥ i + 1 and j ≥ ` + 1 ≥


−1 0 i + 2. A similar argument via induction shows
P1−1 AP1 = = P2−1 BP2 .
0 6 that Ak i, j = 0 unless j ≥ i + k.
(b) This is just the k = n case of part (a)—An only
Rearranging gives P2 P1−1 AP1 P2−1 = B. In other has n superdiagonals, and since they are all 0
words, P1 P2−1 is the change-of-basis matrix from D we know that An = O.
to C. Since C is the standard basis, the columns of
P1 P2−1 are the basis vectors of D. Now we can just 2.4.17 (a) This follows directly from the definition of ma-
compute: trix multiplication:
" # (" # " #)
 2 k
3 1 3 1
P1 P2−1 = , so D = , . N1 i, j = ∑ [N1 ]i,` [N1 ]`, j ,
11 6 11 6 `=1

2.4.11 Since etr(A)


6= 0, this follows immediately from Exer- which equals 1 if ` = i + 1 and j = ` + 1 =
cise 2.4.10. i + 2, and equals 0 otherwise. In other words,
N12 = N2 , and a similar argument via induction
shows that N1n = Nn whenever 1 ≤ n < k. The
fact that N1k = O follows from Exercise 2.4.16.
(b) Simply notice that Nn has k − n of the
standard basis vectors as its columns so
its rank is max{k − n, 0}, so its nullity is
k − max{k − n, 0} = min{k, n}.
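The rank and nullity claims in Exercises 2.4.16 and 2.4.17 are easy to confirm numerically. A small NumPy sketch (my own illustration, using k = 5): the k × k nilpotent Jordan block N satisfies rank(N^n) = max{k − n, 0}, and in particular N^k = O.

    import numpy as np

    k = 5
    N = np.diag(np.ones(k - 1), 1)   # k x k Jordan block with eigenvalue 0
    for n in range(1, k + 1):
        r = np.linalg.matrix_rank(np.linalg.matrix_power(N, n))
        print(n, r, r == max(k - n, 0))   # the rank drops by one with each power
    print(np.allclose(np.linalg.matrix_power(N, k), 0))  # True: N^k = O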
Section 2.5: Summary and Review
2.5.1 (a) All five decompositions apply to this matrix. For the “if” direction, note that we can use Schur
(c) Schur triangularization, singular value decom- triangularization to write A = UTU ∗ , where the left-
position, and Jordan decomposition (it does not most column of U is v and we denote the other
have a linearly independent set of eigenvectors, columns of U by u2 , . . ., un . Then Av = t1,1 v (i.e., the
so we cannot diagonalize or apply the spectral eigenvalue of A corresponding to v is t1,1 ), but A∗ v =
decomposition). (UT ∗U)v = UT ∗ e1 = t1,1 v + t1,2 u2 + · · · + t1,n un .
(e) Diagonalization, Schur triangularization, sin- The only way that this equals t1,1 v is if t1,2 = · · · =
gular value decomposition, and Jordan decom- t1,n = 0, so the only non-zero entry in the first row of
position (it is not normal, so we cannot apply T is its (1, 1)-entry.
the spectral decomposition). It follows that the second column of U must also
(g) All five decompositions apply to this matrix. be an eigenvector of A, and then we can repeat the
same argument as in the previous paragraph to show
2.5.2 (a) True. This is the statement of Corollary 2.1.5. that the only non-zero entry in the second row of
(c) True. This fact was stated as part of Theo- T is its (2, 2)-entry. Repeating in this way shows
rem 2.2.12. that k = n and T is diagonal, so A = UTU ∗ is normal.

2.5.4 For the “only if” direction, use the spectral decompo- 2.5.7 (a) tr(A) is the sum of the diagonal entries of A,
sition to write A = UDU ∗ so A∗ = UD∗U ∗ = UDU ∗ . each of which is 0 or 1, so tr(A) ≤ 1 + 1 + · · · +
It follows that A and A∗ have the same eigenspaces 1 = n.
and the corresponding eigenvalues are just complex (b) The fact that det(A) is an integer follows from
conjugates of each other, as claimed. formulas like Theorem A.1.4 and the fact that
each entry of A is an integer. Since det(A) > 0
(by positive definiteness of A), it follows that
det(A) ≥ 1.
(c) The
p AM–GM √ inequality tells us that (d) Parts (b) and (c) tell us that det(A) = 1, so
n
det(A) = n λ1 · · · λn ≤ (λ1 + · · · + λn )/n = equality holds in the AM–GM inequality in
tr(A)/n. Since tr(A) ≤ n by part (a), we con- part (c). By the equality condition of the AM–
clude that det(A) ≤ 1. GM inequality, we conclude that λ1 = . . . = λn ,
so they all equal 1 and thus A has spectral
decomposition A = UDU ∗ with D = I, so
A = UIU ∗ = UU ∗ = I.
Section 2.A: Extra Topic: Quadratic Forms and Conic Sections
2.A.1 (a) Positive definite. (g) Ellipsoid.
(c) Indefinite.
(e) Positive definite. 2.A.4 (a) False. Quadratic forms arise from bilinear
(g) Positive semidefinite. forms in the sense that if f (v, w) is a bilin-
(i) Positive semidefinite. ear form then q(v) = f (v, v) is a quadratic
form, but they are not the same thing. For exam-
2.A.2 (a) Ellipse. ple, bilinear forms act on two vectors, whereas
(c) Ellipse. quadratic forms act on just one.
(e) Hyperbola. (c) True. We stated this fact near the start of Sec-
(g) Hyperbola. tion 2.A.1.
(e) True. This is the quadratic form associated
2.A.3 (a) Ellipsoid. with the 1 × 1 matrix A = [1].
(c) Hyperboloid of one sheet.
(e) Elliptical cylinder.
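For readers who want to reproduce classifications like the ones above, the eigenvalues of the symmetric matrix that represents the quadratic form tell the whole story. The NumPy sketch below uses a made-up example, q(x, y) = 3x² + 3y² − 2xy, rather than one of the exercise's forms:

    import numpy as np

    A = np.array([[3.0, -1.0],
                  [-1.0, 3.0]])       # represents q(x, y) = 3x^2 + 3y^2 - 2xy
    print(np.linalg.eigvalsh(A))      # [2. 4.]: all positive, so q is positive
                                      # definite and its level curves are ellipses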
Section 2.B: Extra Topic: Schur Complements and Cholesky
2.B.2 (a) The Schur complement is 2.B.7 This follows immediately from Theorem 2.B.1 and
" # the following facts: the upper and lower triangular
1 7 −1 matrices in that theorem are invertible due to having
S= ,
5 −1 3 all diagonal entries equal to 1, rank(XY ) = rank(X)
whenever Y is invertible, and the rank of a block di-
which has det(S) = 4/5 and is positive definite. agonal matrix is the sum of the ranks of its diagonal
It follows that the original matrix has deter- blocks.
minant det(A) det(S) = 5(4/5) = 4 and is also
positive definite (since its top-left block is). 2.B.8 (a) The following matrices work:
" # " #
2.B.3 (a) True. More generally, the Schur complement of I BD−1 I O
the top-left n × n block of an (m + n) × (m + n) U= , L= .
O I D−1C I
matrix is an m × m matrix.
(c) False. Only positive semidefinite matrices have (b) The argument is the exact same as in The-
a Cholesky decomposition. orem 2.B.2: det(U) = det(L) = 1, so det(Q)
equals the determinant of the middle block di-
2.B.4 There are many possibilities—we can choose the agonal matrix, which equals det(D) det(S).
(1, 2)- and (1, 3)-entries of the upper triangular ma- (c) Q is invertible if and only if det(Q) 6= 0, if and
trix to be anything sufficiently small that we want, only if det(S) 6= 0, if and only if S is invertible.
and then adjust the (2, 2)- and (2, 3)-entries accord- The inverse of Q can be computed using the
ingly (as suggested by Remark 2.B.1). For example, same method of Theorem 2.B.3 to be
if 0 ≤ a ≤ 1 is a real number and we define " #" #" #
 I O S−1 O I −BD−1
√ √  .
0 a a −D−1C I O D−1 O I
 √ √ 
T = 0 1−a 1 − a
(d) The proof is almost identical to the one pro-
0 0 1 vided for Theorem 2.B.4.
then  
0 0 0
 
T ∗ T = 0 1 1 .
0 1 2
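A quick numerical check of the determinant formula det(Q) = det(A) det(S) that this section uses repeatedly (my own illustration; here A is the top-left block of Q and S is its Schur complement):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((2, 2)) + 2 * np.eye(2)   # (almost surely) invertible top-left block
    B = rng.standard_normal((2, 3))
    C = rng.standard_normal((3, 2))
    D = rng.standard_normal((3, 3))
    Q = np.block([[A, B], [C, D]])
    S = D - C @ np.linalg.inv(A) @ B                  # Schur complement of A in Q
    print(np.isclose(np.linalg.det(Q), np.linalg.det(A) * np.linalg.det(S)))  # True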
2.B.11 We need to transform det(AB − λ Im ) into a form that for x and B reveals that kxk2 = x∗ x = 0, so x = 0, so
lets us use Sylvester’s determinant identity (Exer- the only way for this to be a Cholesky decomposition
cise 2.B.9). Well, of A is if B = T , where A2,2 = T ∗ T is a Cholesky de-
det(AB − λ Im ) = (−1)m det(λ Im − AB) composition of A2,2 (which is unique by the inductive
hypothesis).
= (−λ )m det(Im + (−A/λ )B) On the other hand, if a1,1 6= 0 (i.e., Case 1 of the
= (−λ )m det(In + B(−A/λ )) proof of Theorem 2.B.5) then the Schur complement
= ((−λ )m /λ n ) det(λ In − BA) S has unique Cholesky decomposition S = T ∗ T , so
we know that the Cholesky decomposition of A must
= (−λ )m−n det(BA − λ In ), be of the form
which is what we wanted to show. "√ #∗ "√ # " #
a1,1 x∗ a1,1 x∗ a1,1 a∗2,1
A= = ,
T T a2,1 A2,2
√ if A√= [a] then the
2.B.12 In the 1 × 1 case, it is clear that 0 0
Cholesky decomposition A = [ a]∗ [ a] is unique.
For the inductive step, we assume that the Cholesky where x ∈ Fn−1 is some unknown column vector.
decompositions of (n − 1) × (n − 1) matrices are Just performing the matrix multiplication reveals

unique. If a1,1 = 0 (i.e., Case 1 of the proof of Theo- that it must be the case that x = a2,1 / a1,1 , so the
rem 2.B.5) then solving Cholesky decomposition of A is unique in this case
" # well.
0 0T  ∗  
A= = x|B x|B
0 A2,2
Section 2.C: Extra Topic: Applications of the SVD
2.C.1 In all cases, we simply compute x = A† b. 2.C.6 We plug the 3 given data points into the equation
(a) No solution, closest is x = (1, 2)/10. y = c1 sin(x) + c2 cos(x) to get 3 linear equations
(c) Infinitely many solutions, with the smallest in the 2 variables c1 and c2 . This linear system has
being x = (−5, −1, 3). no solution, but we can compute the least squares
solution x = A† b = (1, −1/2), so the curve of best
fit is y = sin(x) − cos(x)/2.
2.C.2 (a) y = 3x − 2 (c) y = (5/2)x − 4
2.C.7 These parts can both be proved directly in a manner
analogous to the proofs of parts (a) and (c) given in
" #   the text. However, a quicker way is to notice that
2.C.3 (a) 2 2 (c) 4 2 0
1  part (d) follows from part (c), since Exercise 1.B.9
2 2  2 1 0 tells us that I − A† A is the orthogonal projection
5
−10 −5 0 onto range(A∗ )⊥ , which equals null(A) by Theo-
rem 1.B.7.
Part (b) follows from part (a) via a similar argument:
2.C.4 (a) True. More generally, the pseudoinverse of a I − AA† is the orthogonal projection onto range(A)⊥ ,
diagonal matrix is obtained by taking the recip- which equals null(A∗ ) by Theorem 1.B.7. The fact
rocals of its non-zero diagonal entries. that it is also the orthogonal projection onto null(A† )
(c) False. Theorem 2.C.1 shows that range(A† ) = follows from swapping the roles of A and A† (and
range(A∗ ), which typically does not equal using the fact that (A† )† = A) in part (d).
range(A).
(e) False. The pseudoinverse finds one such vector, 2.C.9 (a) If A has linearly independent columns then
but there may be many of them. For example, rank(A) = n, so A has n non-zero singular
if the linear system Ax = b has infinitely many values. It follows that if A = UΣV ∗ is a sin-
solutions then there are infinitely many vectors gular value decomposition then (A∗ A)−1 A∗ =
that minimize kAx − bk (i.e., make it equal to (V Σ∗ ΣV ∗ )−1V Σ∗U ∗ = V (Σ∗ Σ)−1V ∗V Σ∗U ∗ =
0). V (Σ∗ Σ)−1 Σ∗U ∗ . We now note that Σ∗ Σ is in-
deed invertible, since it is an n × n diagonal
2.C.5 We plug the 4 given data points into the equation matrix with n non-zero diagonal entries. Fur-
z = ax + by + c to get 4 linear equations in the 3 thermore, (Σ∗ Σ)−1 Σ∗ = Σ† since the inverse of
variables a, b, and c. This linear system has no a diagonal matrix is the diagonal matrix with
solution, but we can compute the least squares so- the reciprocal diagonal entries. It follows that
lution x = A† b = (1, 2, 1), so the plane of best fit is (A∗ A)−1 A∗ = V Σ†U ∗ = A† , as claimed.
z = x + 2y + 1. (b) Almost identical to part (a), but noting instead
that rank(A) = m so ΣΣ∗ is an m × m invertible
diagonal matrix.
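The least-squares computations in this section can be reproduced with NumPy's pseudoinverse. The data below are made up for illustration (they are not the systems from the exercises):

    import numpy as np

    # overdetermined system Ax = b with no exact solution
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([1.0, 2.0, 2.0, 4.0])

    x_pinv = np.linalg.pinv(A) @ b                   # x = A^+ b
    x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]   # standard least-squares solver
    print(np.allclose(x_pinv, x_lstsq))              # True: both minimize ||Ax - b||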
Section 2.D: Extra Topic: Continuity and Matrix Analysis
p
2.D.2 (a) True. kAkF = tr(A∗ A) is continuous since it 2.D.5 Just notice that if A is positive semidefinite then
can be written as a composition of three con- Ak = A + 1k I is positive definite for all integers k ≥ 1,
tinuous functions (the square root, the trace, and lim Ak = A.
k→∞
and multiplication by A∗ ).
2.D.7 For any A ∈ Mm,n , A∗ A is positive semidefinite with
2.D.3 This follows from continuity of singular values. If m ≥ rank(A) = rank(A∗ A), so the Cholesky decom-
f (A) = σr+1 is the (r + 1)-th largest singular value position (Theorem 2.B.5) tells us that there exists an
of A then f is continuous and f (Ak ) = 0 for all k. It upper triangular matrix T ∈ Mm,n with non-negative
follows that f limk→∞ Ak = limk→∞ f (Ak ) = 0 too, real diagonal entries such that A∗ A = T ∗ T . Applying

so rank limk→∞ Ak ≤ r. Theorem 2.2.10 then tells us that there exists a uni-
tary matrix U ∈ Mm such that A = UT , as desired.
Section 3.1: The Kronecker Product
 
3.1.1 (a) −1 2 −2 4 (b) We know from Exercise 3.1.7 that the eigen-
 0
 −3 0 −6 
 values of A ⊗ I + I ⊗ BT are the sums of the
 −3 6 0 0  eigenvalues of A and BT (which has the same
0 −9 0 0 eigenvalues as B). It follows that the equation
 
(c) 2 −3 1 AX + XB = C has a unique solution if and only
 4 −6 2  if A ⊗ I + I ⊗ BT is invertible (i.e., has no 0
6 −9 3 eigenvalues), if and only if A and −B do not
have any eigenvalues in common.
" # " #
3.1.2 (a) a 0 (c) a b 3.1.9 (a) If A = U1 T1U1∗ and B = U2 T2U2∗ are Schur tri-
0 b c d angularizations then so is
A ⊗ B = (U1 ⊗U2 )(T1 ⊗ T2 )(U1 ⊗U2 )∗ .
3.1.3 (a) True. In general, if A is m × n and B is p × q It is straightforward to check that the diagonal
then A ⊗ B is mp × nq. entries of T1 ⊗ T2 are exactly the products of
(c) False. If AT = −A and BT = −B then the diagonal entries of T1 and T2 , so the eigen-
(A ⊗ B)T = AT ⊗ BT = (−A) ⊗ (−B) = A ⊗ B. values of A ⊗ B are exactly the products of the
That is, A ⊗ B is symmetric, not skew- eigenvalues of A and B.
symmetric. (b) We know from part (a) that if A has eigenval-
ues λ1 , . . ., λm and B has eigenvalues µ1 ,
3.1.4 Most randomly-chosen matrices serve as counter- . . ., µn then A ⊗ B has eigenvalues λ1 µ1 ,
examples here. For example, if . . ., λm µn . Since det(A) = λ1 · · · λm and
  det(B) = µ1 · · · µn , we then have det(A ⊗ B) =
" # 0 0
1 0 0   (λ1 µ1 ) · · · (λm µn ) = λ1n · · · λmn µ1m · · · µnm =
A= and B = 0 1 det(A)n det(B)m .
0 0 0
0 0
3.1.11 (a) Just apply Theorem 3.1.6 to the rank-one sum
then tr(A ⊗ B) = 1 and tr(B ⊗ A) = 0. This does not decomposition of mat(x) (i.e., Theorem A.1.3).
contradict Theorem 3.1.3 or Corollary 3.1.9 since (b) Instead apply Theorem 3.1.6 to the orthogo-
those results require A and B to be square. nal rank-one sum decomposition (i.e., Theo-
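The counter-example from Exercise 3.1.4 is easy to reproduce numerically, as is the fact (for square matrices) that tr(A ⊗ B) = tr(A)tr(B), which it does not contradict:

    import numpy as np

    A = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])          # 2 x 3
    B = np.array([[0.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])               # 3 x 2
    print(np.trace(np.kron(A, B)), np.trace(np.kron(B, A)))   # 1.0 0.0

    # for square matrices the traces do agree:
    C, D = np.diag([1.0, 2.0]), np.diag([3.0, 4.0])
    print(np.isclose(np.trace(np.kron(C, D)), np.trace(C) * np.trace(D)))  # True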
rem 2.3.3).
3.1.6 If A = U1 Σ1V1∗ and B = U2 Σ2V2∗ are singular
value decompositions then A† = V1 Σ†1U1∗ and B† = 3.1.13 We just notice that Z22 contains only 3 non-
V2 Σ†2U2∗ . Then zero vectors: (1, 0), (0, 1), and (1, 1), so ten-
sor powers of these vectors cannot possibly
A† ⊗ B† = (V1 Σ†1U1∗ ) ⊗ (V2 Σ†2U2∗ ) span a space that is larger than 3-dimensional,
= (V1 ⊗V2 )(Σ†1 ⊗ Σ†2 )(U1 ⊗U2 )∗ but S23 is 4-dimensional. Explicitly, the vector
(0, 0, 0, 1, 0, 1, 1, 0) ∈ S23 cannot be written in the
= (V1 ⊗V2 )(Σ1 ⊗ Σ2 )† (U1 ⊗U2 )∗ = (A ⊗ B)† , form c1 (1, 0)⊗3 + c2 (0, 1)⊗3 + c3 (1, 1)⊗3 .
where the second-to-last equality comes from recall-
3.1.14 If σ ∈ S p is any permutation with sgn(σ ) = −1
ing that the pseudoinverse of a diagonal matrix is
then Wσ −1 v = v and Wσ w = −w. It follows that
just obtained by taking the reciprocal of its non-zero
v · w = v · (−Wσ w) = −(Wσ∗ v) · w = −(Wσ −1 v) · w =
entries (and leaving its zero entries alone).
−(v · w), where the second-to-last equality uses the
fact that Wσ is unitary so Wσ∗ = Wσ−1 = Wσ −1 . It
3.1.8 (a) This follows immediately from taking the vec-
follows that v · w = 0.
torization of both sides of the equation AX +
XB = C and using Theorem 3.1.7.
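Exercise 3.1.8(a) can also be illustrated numerically. The sketch below (my own, with made-up 2 × 2 matrices) uses NumPy's row-major flatten as the vectorization, which I believe matches the convention of Theorem 3.1.7; under that assumption, AX + XB = C becomes (A ⊗ I + I ⊗ B^T)vec(X) = vec(C):

    import numpy as np

    A = np.array([[1.0, 2.0], [0.0, 3.0]])
    B = np.array([[4.0, 1.0], [0.0, 5.0]])   # A and -B share no eigenvalues
    C = np.array([[1.0, 0.0], [2.0, 1.0]])

    m, n = A.shape[0], B.shape[0]
    M = np.kron(A, np.eye(n)) + np.kron(np.eye(m), B.T)
    X = np.linalg.solve(M, C.flatten()).reshape(m, n)
    print(np.allclose(A @ X + X @ B, C))     # True: X solves the Sylvester equation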
3.1.15 (a) Just notice that PT = ∑σ ∈S p WσT /p! = 3.1.21 (a) If A−1 and B−1 exist then we just use Theo-
∑σ ∈S p Wσ −1 /p! = ∑σ ∈S p Wσ /p! = P, with the rem 3.1.2(a) to see that (A ⊗ B)(A−1 ⊗ B−1 ) =
second equality following from the fact that (AA−1 ) ⊗ (BB−1 ) = I ⊗ I = I, so (A ⊗ B)−1 =
Wσ is unitary so WσT = Wσ−1 = Wσ −1 , and the A−1 ⊗ B−1 . If (A ⊗ B)−1 exists then we can use
third equality coming from the fact that chang- Theorem 3.1.3(a) to see that A and B have no
ing the order in which we sum over S p does zero eigenvalues (otherwise A ⊗ B would too),
not change the sum itself. so A−1 and B−1 exist.
(b) We compute (b) The ( j, i)-block of (A ⊗ B)T is the (i, j)-block
  of A ⊗ BT , which equals ai, j BT , which is also
P2 = ∑ Wσ /p! ∑ Wτ /p! the ( j, i)-block of AT ⊗ BT .
σ ∈S p τ∈S p
(c) Just apply part (b) above (part (c) of the the-
= ∑ Wσ ◦τ /(p!)2 orem) to the complex conjugated matrices A
σ ,τ∈S p and B.
= ∑ p!Wσ /(p!)2 = P,
σ ∈S p 3.1.22 (a) If A and B are upper triangular then the (i, j)-
block of A ⊗ B is ai, j B. If i > j then this entire
with the third equality following from the fact block equals O since A is upper triangular. If
that summing Wσ ◦τ over all σ and all τ in S p i = j then this block is upper triangular since
just sums each Wσ a total of |S p | = p! times B is. If A and B are lower triangular then just
(once for each value of τ ∈ S p ). apply this result to AT and BT .
(b) Use part (a) and the fact that diagonal matrices
3.1.16 By looking at the blocks in the equation ∑kj=1 v j ⊗ are both upper lower triangular.
w j = 0, we see that ∑kj=1 [v j ]i w j = 0 for all 1 ≤ i ≤ m. (c) If A and B are normal then (A ⊗ B)∗ (A ⊗
By linear independence, this implies [v j ]i = 0 for B) = (A∗ A) ⊗ (B∗ B) = (AA∗ ) ⊗ (BB∗ ) = (A ⊗
each i and j, so v j = 0 for each j. B)(A ⊗ B)∗ .
(d) If A∗ A = B∗ B = I then (A ⊗ B)∗ (A ⊗ B) =
3.1.17 Since B ⊗ C consists of mn vectors, it suffices via (A∗ A) ⊗ (B∗ B) = I ⊗ I = I.
Exercise 1.2.27(a) to show that it is linearly inde- (e) If AT = A and BT = B then (A ⊗ B)T = AT ⊗
pendent. To this end, write B = {v1 , . . . , vm } and BT = A ⊗ B. Similarly, if A∗ = A and B∗ = B
C = {w1 , . . . , wn } and suppose then (A ⊗ B)∗ = A∗ ⊗ B∗ = A ⊗ B.
m n n
!
m
(f) If the eigenvalues of A and B are non-negative
then so are the eigenvalues of A ⊗ B, by Theo-
∑ ∑ ci, j vi ⊗ w j = ∑ ∑ ci, j vi ⊗ w j = 0.
rem 3.1.3.
i=1 j=1 j=1 i=1

We then know from Exercise 3.1.16 that (since C 3.1.23 In all parts of this exercise, we use the fact that if
is linearly independent) ∑m i=1 ci, j vi = 0 for each A = U1 Σ1V1∗ and B = U2 Σ2V2∗ are singular value
1 ≤ j ≤ n. By linear independence of B, this implies decompositions then so is A ⊗ B = (U1 ⊗U2 )(Σ1 ⊗
ci, j = 0 for each i and j, so B ⊗C is linearly indepen- Σ2 )(V1 ⊗V2 )∗ .
dent too. (a) If σ is a diagonal entry of Σ1 and τ is a diagonal
entry of Σ2 then σ τ is a diagonal entry of Σ1 ⊗ Σ2 .
3.1.18 (a) This follows almost immediately from Defini- (b) rank(A ⊗ B) equals the number of non-zero en-
tion 3.1.3(b), which tells us that the columns tries of Σ1 ⊗ Σ2 , which equals the product of the
of Wm,n are (in some order) {e j ⊗ ei }i, j , which number of non-zero entries of Σ1 and Σ2 (i.e.,
are the standard basis vectors in Fmn . In other rank(A)rank(B)).
words, Wm,n is obtained from the identity ma- (c) If Av ∈ range(A) and Bw ∈ range(B) then (Av) ⊗
trix by permuting its columns in some way. (Bw) = (A ⊗ B)(v ⊗ w) ∈ range(A ⊗ B). Since
(b) Since the columns of Wm,n are standard basis range(A ⊗ B) is a subspace, this shows that “⊇”
vectors, the dot products of its columns with inclusion. For the opposite inclusion, recall from
each other equal 1 (the dot product of a column Theorem 2.3.2 that range(A ⊗ B) is spanned by the
with itself) or 0 (the dot product of a column rank(A ⊗ B) columns of U1 ⊗U2 that correspond
with another column). This means exactly that to non-zero diagonal entries of Σ1 ⊗ Σ2 , which are
m,n = I, so Wm,n is unitary.
∗ W
Wm,n all of the form u1 ⊗ u2 for some u1 range(A) and
(c) This follows immediately from Defini- u2 ∈ range(B).
tion 3.1.3(c) and the fact that Ei,∗ j = E j,i . (d) Similar to part (c) (i.e., part (d) of the Theorem),
but using columns of V1 ⊗V2 instead of U1 ⊗U2 .
3.1.20 (a) The (i, j)-block of A ⊗ B is ai, j B, so the (k, `)- (e) kA ⊗ Bk = kAkkBk since we know from part (a)
block within the (i, j)-block of (A ⊗ B) ⊗C is of this theorem that the largest singular value of
ai, j bk,`C. On the other hand, the (k, `)-block of A ⊗ B is the product of the largest singular values
B ⊗ C is bk,`C, so the (k, `)-block within the
(i, j)-block of A ⊗ (B ⊗C) is ai, j bk,`C. p F = kAkF kBkF , just
of A and B. To see that kA⊗Bk
note that kA ⊗ BkF = tr((A ⊗ B)∗ (A ⊗ B)) =
(b) The (i, j)-block of (A + B) ⊗C is (ai, j + bi, j )C, p p p
tr((A∗ A) ⊗ (B∗ B)) = tr(A∗ A) tr(B∗ B) =
which equals the (i, j)-block of A ⊗C + B ⊗C,
kAkF kBkF .
which is ai, j C + bi, j C.
(c) The (i, j)-block of these matrices equal
(cai, j )B, ai, j (cB), c(ai, j B), respectively, which
are all equal.
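The singular value facts used in Exercise 3.1.23 can be spot-checked numerically (this snippet is my own illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    A, B = rng.standard_normal((2, 2)), rng.standard_normal((3, 3))

    s_kron = np.linalg.svd(np.kron(A, B), compute_uv=False)
    s_prod = np.sort(np.outer(np.linalg.svd(A, compute_uv=False),
                              np.linalg.svd(B, compute_uv=False)).ravel())[::-1]
    print(np.allclose(s_kron, s_prod))   # True: singular values of the Kronecker
                                         # product are products of singular values

    print(np.isclose(np.linalg.norm(np.kron(A, B), 'fro'),
                     np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')))  # True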
3.1.24 For property (a), let P = ∑σ ∈S p sgn(σ )Wσ /p! and 3.1.26 (a) We can scale v and w freely, since if we replace
note that P2 = P = PT via an argument almost identi- v and w by cv and dw (where c, d > 0) then
cal to that of Exercise 3.1.15. It thus suffices to show the inequality becomes
that range(P) = Anp . To this end, notice that for all
cd v · w = (cv) · (dw)
τ ∈ S p we have
≤ kcvk p kdwkq = cdkvk p kwkq ,
1
sgn(τ)Wτ P = sgn(τ)sgn(σ )Wτ Wσ which is equivalent to the original inequal-
p! σ∑
∈S p
ity that we want to prove. We thus choose
1 c = 1/kvk p and d = 1/kwkq .
= sgn(τ ◦ σ )Wτ◦σ = P.
p! σ∑
∈S p
(b) We simply notice that we can rearrange the
condition 1/p + 1/q = 1 to the form 1/(q −
It follows that everything in range(P) is unchanged 1) = p − 1. Then dividing the left inequality
by sgn(τ)Wτ (for all τ ∈ S p ), so range(P) ⊆ Anp . To above by |v j | shows that it is equivalent to
prove the opposite inclusion, we just notice that if |w j | ≤ |v j | p−1 , whereas dividing the right in-
v ∈ Anp then equality by |w j | and then raising it to the power
1/(q − 1) = p − 1 shows that it is equivalent to
1 1
Pv = sgn(σ )Wσ v = v = v, |v j | p−1 ≤ |w j |. It is thus clear that at least one
p! σ∑
∈S p p! σ∑
∈S p of these two inequalities must hold (they sim-
ply point in opposite directions of each other),
so v ∈ range(P) and thus Anp ⊆ range(P), so P is a as claimed.
projection onto Anp as claimed. As a minor technicality, we should note that
To prove property (c), we first notice that the columns the above argument only works if p, q > 1 and
of the projection P from part (a) have the form each v j , w j 6= 0. If p = 1 then |v j w j | ≤ |v j | p
P(e j1 ⊗ e j2 ⊗ · · · ⊗ e j p ) = (since kwkq = 1 so |w j | ≤ 1). If q = 1 then
|v j w j | ≤ |w j |q . If v j = 0 or w j = 0 then both
1
sgn(σ )Wσ (e j1 ⊗ e j2 ⊗ · · · ⊗ e j p ), inequalities hold trivially.
p! σ∑
∈S p (c) Part (b) implies that |v j w j | ≤ |v j | p + |w j |q for
each 1 ≤ j ≤ n, so
where 1 ≤ j1 , j2 , . . . , j p ≤ n. To turn this set of
vectors into a basis of range(P) = Anp , we omit n n
the columns that are equal to each other or equal v · w = ∑ v j w j ≤ ∑ |v j w j |
j=1 j=1
to 0 by only considering the columns for which
n n
1 ≤ j1 < j2 < · · · < j p ≤ n. If F = R or F = C (so
we have an inner product to work with) then these
≤ ∑ |v j | p + ∑ |w j |q
j=1 j=1
remaining vectors are mutually orthogonal and thus
form an orthogonal basis of range(P), and otherwise = kvk pp + kwkqq = 2.
they are linearly independent (and thus form a basis (d) Notice that the inequality above does not de-
of range(P)) since the coordinates of their non-zero pend on the dimension n at all. It follows that
entries in the standard basis form disjoint subsets of if we pick a positive integer k, replacing v and
{1, 2, . . . , n p }. If we multiply these vectors each by w by v⊗k and w⊗k respectively shows that
p! then they form the basis described in the statement
k
of the theorem. 2 ≥ v⊗k · w⊗k = v · w ,
To demonstrate property (b), we simply notice that
the basis from part (c) of the theorem contains as with the final equality coming from Exer-
many vectors as there are sets of p distinct numbers cise 3.1.25. Taking the k-th root
√ of this in-

{ j1 , j2 , . . . , j p } ⊆ {1, 2, . . . , n}, which equals np . equality shows us that v · w ≤ k 2 for all

k
integers k ≥ 1. Since lim 2 = 1, it follows
k→∞
that v · w ≤ 1, which completes the proof.
Section 3.2: Multilinear Transformations
 
3.2.1 (a) This is not a multilinear transformation since, 3.2.2 (b) 0 2 1 0 3 0
for example, if w = (1, 0) then the function  1 0 0 1 1 0 
S(v) = T (v, w) = (v1 + v2 , 1) is not a linear −1 0 0 0 0 1
 
transformation (e.g., it does not satisfy S(0) = (c) 1 0 0 0
0).  0 1 0 0 
 
(c) Yes, this is a multilinear transformation (we  0 0 1 0 
mentioned that the Kronecker product is bilin- 0 0 0 1
ear in the main text).
(e) Yes, this is a multilinear transformation. 3.2.3 (a) False. The type of a multilinear transforma-
tion does not say the dimensions of the vector
spaces, but rather how many of them there are.
D has type (2, 0).
(c) False. In the standard block matrix, the rows (b) By definition, k f k = max | f (v, w)| : kvk =

index the output space and the columns in- kwk = 1 . If we represent f via Theorem 1.3.5
dex the different input spaces, so it has size and use the fact that kvk = k[v]B k whenever
mp × mn2 p. B is 
an orthonormal basis, we see that k fk =
max [v]TB A[w]C : k[v]B k = k[w]C k = 1 . It
3.2.8 range(C) = R3 . To see why this is the case, recall follows from Exercise 2.3.13 that this quantity
that C(v, w) is always orthogonal to each of v and w. equals kAk.
It follows that if we are given x ∈ R3 and we want to
find v and w so that C(v, w) = x, we can just choose 3.2.13 (a) One such decomposition is C(v, w) =
v and w to span the plane orthogonal to x and then ∑6j=1 f1, j (v) f2, j (w)x j , where
rescale v and w appropriately.
f1,1 (v) = v2 f1,2 (v) = v3
3.2.9 If the standard array has k non-zero entries then f1,3 (v) = v3 f1,4 (v) = v1
so does the standard block matrix A. We can thus
f1,5 (v) = v1 f1,6 (v) = v2
write A as a sum of k terms of the form ei eTj , where
ei ∈ FdW and e j ∈ Fd1 ⊗ · · · ⊗ Fd p are standard f2,1 (w) = w3 f2,2 (w) = w2
basis vectors. Since each standard basis vector in f2,3 (w) = w1 f2,4 (w) = w3
Fd1 ⊗ · · · ⊗ Fd p is an elementary tensor, it follows f2,5 (w) = w2 f2,6 (w) = w1
from Theorem 3.2.4 that rank(T ) ≤ k.
x1 = e1 x2 = −e1
3.2.10 The bound rank(T× ) ≤ mnp comes from the standard x3 = e2 x4 = −e2
formula for matrix multiplication in the exact same x5 = e3 x6 = −e3 .
way as in Example 3.2.15 (equivalently, the standard
array of T× has exactly mnp non-zero entries, so (b) We just perform the computation directly:
Exercise 3.2.9 gives this bound). 5

The bound rank(T× ) ≥ mp comes from noting that ∑ f1, j (v) f2, j (w)x j
j=1
the standard block matrix A of T× (with respect to
= v1 (w2 + w3 )(e1 + e3 ) − (v1 + v3 )w2 e1
the standard basis) is an mp × mn2 p matrix with full
rank mp (it is straightforward to check that there are − v2 (w1 + w3 )(e2 + e3 )
standard block matrices that multiply to Ei, j for each + (v2 + v3 )w1 e2 + (v2 − v1 )w3 (e1 + e2 + e3 )
i, j). It follows that rank(T× ) ≥ rank(A) = mp. 
= v1 (w2 + w3 ) − (v1 + v3 )w2 + (v2 − v1 )w3 e1

+ − v2 (w1 + w3 ) + (v2 + v3 )w1 + (v2 − v1 )w3 e2
3.2.11 (a) By definition, rank( f ) is the least r such 
that we can write f (v, w) = ∑rj=1 f j (v)g j (w) + v1 (w2 + w3 ) − v2 (w1 + w3 ) + (v2 − v1 )w3 e3
for some linear forms { f j } and {g j }. If = (v2 w3 − v3 w2 )e1 + (v3 w1 − v1 w3 )e2
we represent each of these linear forms as + (v1 w2 − v2 w1 )e3 = C(v, w).
row vectors via f j (v) = xTj [v]B and g j (w) =
yTj [w]C then we can write this sum in the
 3.2.16 Recall from Exercise 3.2.13 that if C : R3 × R3 → R3
form f (v, w) = [v]TB ∑rj=1 x j yTj [w]C . Since
is the cross product then rank(C) = 5. An optimal
f (v, w) = [v]TB A[w]C , Theorem A.1.3 tells us rank sum decomposition of the standard block matrix
that r = rank(A) as well. of C thus makes use of sets consisting of 5 vectors in
R3 , which cannot possibly be linearly independent.
A similar argument works with the matrix multiplica-
tion transformation T× : M2 × M2 → M2 , which
has rank 7 and vectors living in a 4-dimensional
space.
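The five-term decomposition written out in Exercise 3.2.13(b) can also be verified numerically. A short NumPy check (my own addition):

    import numpy as np

    def five_term(v, w):
        e1, e2, e3 = np.eye(3)
        return (v[0] * (w[1] + w[2]) * (e1 + e3)
                - (v[0] + v[2]) * w[1] * e1
                - v[1] * (w[0] + w[2]) * (e2 + e3)
                + (v[1] + v[2]) * w[0] * e2
                + (v[1] - v[0]) * w[2] * (e1 + e2 + e3))

    rng = np.random.default_rng(3)
    v, w = rng.standard_normal(3), rng.standard_normal(3)
    print(np.allclose(five_term(v, w), np.cross(v, w)))   # True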
Section 3.3: The Tensor Product
3.3.1 (a) Just recall that when p = 2 we have rank(v) = 3.3.2 Equation (3.3.3) itself shows that rank(v) ≤ 3, so we
rank(mat(v)) for all v. If v = e1 ⊗ e1 + e2 ⊗ just need to prove the other inequality. If we mimic
e2 then mat(v) = I, which has rank 2, so Example 3.3.4, we find that the set Sv from Theo-
rank(v) = 2. rem 3.3.6 has the form
(c) This vector clearly has rank at most 3, and it 
has rank at least 3 (and thus exactly 3) since it Sv = (a, b, b, −a) | a, b ∈ R .
has e1 ⊗ e1 + e2 ⊗ e5 ⊗ e3 ⊗ e9 as a flattening, It is clear that dim(Sv ) = 2, and we can see that the
which has rank-3 matricization. only elementary tensor in Sv is (0, 0, 0, 0), since the
matricization of (a, b, b, −a) is
" #
a b
,
b −a
which has determinant −a2 − b2 and thus rank 2 We know from Exercise 1.1.22 that the set of func-
whenever (a, b) 6= (0, 0). It follows that Sv does not tions {ex , e2x , . . . , ekx , e(k+1)x } is linearly indepen-
have a basis consisting of elementary tensors, so The- dent. However, this contradicts the above linear
orem 3.3.6 tells us that rank(v) > dim(Sv ) = 2, so system, which says that {ex , e2x , . . . , ekx , e(k+1)x } is
rank(v) = 3. contained within the (k-dimensional) span of the
This argument breaks down if we work over the com- functions f1 , . . . , fk .
plex numbers since the above matrix can be made
rank 1 by choosing a = 1, b = i, for example, and Sv 3.3.10 Let T : V × W → W ⊗ V be the bilinear transfor-
is spanned by the elementary tensors (1, i, i, −1) and mation defined by T (v, w) = w ⊗ v. By the univer-
(1, −i, −i, −1). sal property, there exists a linear transformation
S : V ⊗ W → W ⊗ V with S(v ⊗ w) = w ⊗ v. A
3.3.3 (a) True. This is simply a consequence of similar argument shows that there exists a linear
C being 1-dimensional, so dim(C ⊗ C) = transformation that sends w ⊗ v to v ⊗ w. Since
dim(C) dim(C) = 1. Since all vector spaces of elementary tensors span the entire tensor product
the same (finite) dimension over the same field space, it follows that S is invertible and thus an
are isomorphic, we conclude that C ⊗ C ∼ = C. isomorphism.
(c) True. This follows from the fact that rank(v) =
rank(mat(v)), where mat(v) ∈ Mm,n (F). 3.3.12 Suppose
(e) True. This follows from every real tensor rank  
(1) (2) (p)
decomposition being a complex one, since v = lim vk ⊗ vk ⊗ · · · ⊗ vk ,
k→∞
R ⊆ C.
where we absorb scalars among these Kronecker fac-
( j)
3.3.4 If there were two such linear transformations tors so that kvk k = 1 for all k and all j ≥ 2. The
S1 , S2 : V ⊗ W → X then we would have (S1 − length of a vector (i.e., its norm induced by the dot
S2 )(v ⊗ w) = 0 for all v ∈ V and w ∈ W. Since product) is continuous in its entries, so
V ⊗ W is spanned by elementary tensors, it follows
(1) (2) (p) (1)
that (S1 − S2 )(x) = 0 for all x ∈ V ⊗ W, so S1 − S2 kvk = lim vk ⊗ vk ⊗ · · · ⊗ vk = lim vk .
k→∞ k→∞
is the zero linear transformation and S1 = S2 .
( j)
In particular, this implies that the vectors {vk } have
3.3.5 For property (a), we just note that it is straightfor- bounded norm, so the Bolzano–Weierstrass theorem
ward to see that every function of the form f (x) = tells us that there is a sequence k1 , k2 , k3 , . . . with the
Ax2 + Bx +C can be written as a linear combination property that
of elementary tensors—Ax2 , Bx, and C are them- (1)
lim vk
selves elementary tensors. For (b) and (c), we com- `→∞ `

pute exists. We can then find a subsequence



(A + cB) ⊗ f (x) = (A + cB) f (x) k`1 , k`2 , k`3 , . . . of that with the property that
= A f (x) + cB f (x) (2)
lim v
m→∞ k`m
= (A ⊗ f )(x) + c(B ⊗ f )(x)
for all x, so (A + cB) ⊗ f = (A ⊗ f ) + c(B ⊗ f ) for exist too. Continuing in this way gives us a subse-
all A, B ∈ M2 , f ∈ P 2 , and c ∈ F. The fact that quence with the property that
A ⊗ ( f + cg) = (A ⊗ f ) + c(A ⊗ g) is similar. For (d), ( j)
lim v
we just define S(A) = T (A, 1), S(Ax) = T (A, x), and k→∞ k
S(Ax2 ) = T (A, x2 ) for all A ∈ M2 and extend via
exists for each 1 ≤ j ≤ p (note that we have relabeled
linearity.
this subsequence just in terms of a single subscript k
for simplicity). It follows that
3.3.6 Suppose that we could write f (x, y) = exy as a
 
sum of elementary tensors: exy = f1 (x)g1 (y) + · · · + (1) (2)
v = lim vk ⊗ vk ⊗ · · · ⊗ vk
(p)
=
fk (x)gk (y) for some f1 , g1 , . . . , fk , gk ∈ C. Choosing k→∞
     
y = 1, 2, . . . , k, k + 1 then gives us the system of equa- (1) (2) (p)
tions lim vk ⊗ lim vk ⊗ · · · ⊗ lim vk ,
k→∞ k→∞ k→∞
ex = f1 (x)g1 (1) + · · · + fk (x)gk (1)
since all of these limits exist (after all, the Kronecker
e2x = f1 (x)g1 (2) + · · · + fk (x)gk (2) product is made up of entrywise products of vector
entries and is thus continuous in those entries). The
..
. decomposition on the right shows that v has tensor
rank 1.
ekx = f1 (x)g1 (k) + · · · + fk (x)gk (k)
e(k+1)x = f1 (x)g1 (k + 1) + · · · + fk (x)gk (k + 1).
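To get a concrete feel for why e^{xy} cannot equal a finite sum of products f(x)g(y), one can sample it on a grid: for a handful of distinct points, the matrix with entries e^{x_i x_j} already has full numerical rank. This check is my own illustration, not part of the solution:

    import numpy as np

    x = np.arange(4.0)                  # 0, 1, 2, 3
    M = np.exp(np.outer(x, x))          # M[i, j] = e^{x_i x_j}
    print(np.linalg.matrix_rank(M))     # 4: full rank, so on these points e^{xy}
                                        # cannot be a sum of 3 or fewer products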
Section 3.4: Summary and Review
3.4.1 (a) True. This was part of Theorem 3.1.4. 3.4.2 We need to perform n(n − 1)/2 column swaps to
(c) True. This follows from Theorem 3.1.3, for obtain Wn,n from I (one for each column ei ⊗ e j
example. with i < j), so det(Wn,n ) = (−1)n(n−1)/2 . Equiva-
(e) False. The Kronecker (and tensor) product is lently, det(Wn,n ) = −1 if n ≡ 2 or 3 (mod 4), and
not commutative. det(Wn,n ) = 1 otherwise.
(g) False. We gave an example at the end of Sec-
tion 3.3.3 with real tensor rank 3 but complex
tensor rank 2.
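The sign formula from Exercise 3.4.2 can be checked by building the swap matrix explicitly. In the sketch below (my own construction, which I believe agrees with the Wn,n of Definition 3.1.3 up to ordering conventions), the swap matrix is assembled as the sum of the Kronecker products Eij ⊗ Eji:

    import numpy as np

    def swap_matrix(n):
        W = np.zeros((n * n, n * n))
        for i in range(n):
            for j in range(n):
                E_ij = np.zeros((n, n)); E_ij[i, j] = 1
                E_ji = np.zeros((n, n)); E_ji[j, i] = 1
                W += np.kron(E_ij, E_ji)
        return W

    for n in range(2, 6):
        W = swap_matrix(n)
        a, b = np.arange(1.0, n + 1), np.arange(2.0, n + 2)
        assert np.allclose(W @ np.kron(a, b), np.kron(b, a))           # W swaps tensor factors
        print(n, round(np.linalg.det(W)), (-1) ** (n * (n - 1) // 2))  # the two signs agree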
Section 3.A: Extra Topic: Matrix-Valued Linear Maps
3.A.1 (a) True. In fact, we know from Theorem 3.A.6 3.A.8 If CΦ = ∑i, j Φ(Ei, j ) ⊗ Ei, j then tr2 (CΦ ) =
that it is completely positive. ∑i, j Φ(Ei, j ) ⊗ tr(Ei, j ) = ∑i Φ(Ei,i ) = Φ(I), which
(c) True. If X is positive semidefinite then demonstrates the equivalence of (i) and (ii). The
so are Φ(X) and Ψ(X), so (Φ + Ψ)(X) = equivalence of (i) and (iii) just follows from choos-
Φ(X) + Ψ(X) is as well. ing X = I in (iii).

3.A.2 We mimic the proof of Theorem 3.A.6: we construct 3.A.12 (a) If Φ is transpose-preserving and X is symmet-
CΨ and then let the Ai ’s be the matricizations of its ric then Φ(X)T = Φ(X T ) = Φ(X).
scaled eigenvectors. We recall from Example 3.A.8 (b) For example, consider the linear map Φ :
that CΨ = I −W3,3 , and applying the spectral decom- M2 → M2 defined by
position to this matrix shows that CΨ = ∑3i=1 vi vTi , " #! " #
where a b a b
Φ = .
c d (b + c)/2 d
v1 = (0, 1, 0, −1, 0, 0, 0, 0, 0),
v2 = (0, 0, 1, 0, 0, 0, −1, 0, 0), and If the input is symmetric (i.e., b = c) then so
is the output, but Φ(E1,2 ) 6= Φ(E2,1 )T , so Φ is
v3 = (0, 0, 0, 0, 0, 1, 0, −1, 0).
not transpose-preserving.
It follows that Ψ(X) = ∑3i=1 Ai XATi , where Ai =
mat(vi ) for all i, so 3.A.13 (a) Recall from Exercise 2.2.16 that every self-
    adjoint A ∈ Mn can be written as a linear
0 1 0 0 0 1 combination of two PSD matrices: A = P − N.
   
A1 = −1 0 0 , A2 =  0 0 0 , and Then Φ(A) = Φ(P) − Φ(N), and since P and
0 0 0 −1 0 0 N are PSD, so are Φ(P) and Φ(N), so Φ(A)
  is also self-adjoint (this argument works even
0 0 0 if F = R). If F = C then this means that Φ is
 
A3 = 0 0 1 . Hermiticity-preserving, so it follows from The-
0 −1 0 orem 3.A.2 that Φ is also adjoint-preserving.
(b) The same map from the solution to Exer-
3.A.4 Direct computation shows that (ΦC ⊗I2 )(xx∗ ) equals cise 3.A.12(b) works here. It sends PSD ma-
  trices to PSD matrices (in fact, it acts as the
2 0 −1 0 0 −1
 0 0 0 0 0 0  identity map on all symmetric matrices), but is
  not transpose-preserving.
 −1 0 1 0 0 −1 
 ,
 0 0 0 1 0 0 
  3.A.14 We know from Theorem 3.A.1 that T ◦ Φ = Φ ◦ T
 0 0 0 0 1 0 
−1 0 −1 0 0 1 (i.e., Φ is transpose-preserving) if and only if CΦ =
CΦT . The additional requirement that Φ = Φ ◦ T

which is not positive semidefinite (it has 1 − √3 ≈ (b) The same map from the solution to Exer-
−0.7321 as an eigenvalue), so ΦC is not 2-positive. (I ⊗ T )(CΦ ) = Γ(CΦ ).
As a side note, observe this condition is also equiva-
lent to CΦ = CΦ T = Γ(C ), since Γ(C ) = Γ(CT ) =
3.A.5 Recall that kXkF = kvec(X)k for all matrices Φ Φ Φ
X, so kΦ(X)kF = kXkF for all X if and only if Γ(CΦ ).
k[Φ]vec(X)k = kvec(X)k for all vec(X). It follows
from Theorem 1.4.9 that this is equivalent to [Φ] 3.A.15 If Φ is bisymmetric and decomposable then Φ =
being unitary. T ◦ Φ = Φ ◦ T = Ψ1 + T ◦ Ψ2 for some completely
positive Ψ1 , Ψ2 . Then
3.A.7 By Exercise 2.2.20, Φ is positive if and only if
tr(Φ(X)Y ) ≥ 0 for all positive semidefinite X and Y . 2Φ = Φ + (T ◦ Φ)
Using the definition of the adjoint Φ∗ shows that this = (Ψ1 + Ψ2 ) + T ◦ (Ψ1 + Ψ2 ),
is equivalent to tr(XΦ∗ (Y )) ≥ 0 for all PSD X and
Y , which (also via Exercise 2.2.20) is equivalent to
Φ∗ being positive.
so we can choose Ψ = (Ψ1 + Ψ2 )/2. Conversely, if which is positive semidefinite since each Φ(Xk ) is
Φ = Ψ + T ◦ Ψ for some completely positive Ψ then positive definite (recall that the coefficients of charac-
Φ is clearly decomposable, and it is bisymmetric teristic polynomials are continuous, so the roots of a
because characteristic polynomial cannot jump from positive
to negative in a limit).
T ◦ Φ = T ◦ (Ψ + T ◦ Ψ) = T ◦ Ψ + Ψ = Φ,
and 3.A.22 (a) If X is Hermitian (so c = b and a and d are
Φ ◦ T = (Ψ + T ◦ Ψ) ◦ T = Ψ ◦ T + T ◦ Ψ ◦ T real) then the determinants of the top-left 1 × 1,
2 × 2, and 3 × 3 blocks of Φ(X) are
= T ◦ (T ◦ Ψ ◦ T ) + (T ◦ Ψ ◦ T ) = T ◦ Ψ + Ψ
= Φ, 4a − 4Re(b) + 3d, 4a2 − 4|b|2 + 6ad, and

where we used the fact that T ◦ Ψ ◦ T = Ψ thanks to 8a2 d + 12ad 2 − 4a|b|2 + 4Re(b)|b|2 − 11d|b|2 ,
Ψ being real and CP (and thus transpose-preserving). respectively. If√X is positive definite then
|Re(b)| ≤ |b| < ad ≤ (a+d)/2, which easily
3.A.16 Recall that {Ei, j ⊗ Ek,` } is a basis of Mmp,nq , and shows that the 1 × 1 and 2 × 2 determinants are
Φ ⊗ Ψ must satisfy positive. For the 3 × 3 determinant, after√we
(Φ ⊗ Ψ)(Ei, j ⊗ Ek,` ) = Φ(Ei, j ) ⊗ Ψ(Ek,` ) use |b|2 < ad three times and Re(b) > − ad
once, we see that
for all i, j, k, and `. Since linear maps are determined
by how they act on a basis, this shows that Φ ⊗ Ψ, if 8a2 d + 12ad 2 − 4a|b|2 + 4Re(b)|b|2 − 11d|b|2
it exists, is unique. Well-definedness (i.e., existence
> 4a2 d + ad 2 − 4(ad)3/2 .
of Φ ⊗ Ψ) comes from reversing this argument—
if we define Φ ⊗ Ψ by (Φ ⊗ Ψ)(Ei, j ⊗ Ek,` ) = This quantity is non-negative since the inequal-
Φ(Ei, j ) ⊗ Ψ(Ek,` ) then linearity shows that Φ ⊗ Ψ ity 4a2 d + ad 2 ≥ 4(ad)3/2 is equivalent to
also satisfies (Φ ⊗ Ψ)(A ⊗ B) = Φ(A) ⊗ Ψ(B) for all (4a2 d + ad 2 )2 ≥ 16a3 d 3 , which is equivalent
A and B. to a2 d 2 (d − 4a)2 ≥ 0.
(b) After simplifying and factoring as suggested
3.A.18 If X ∈ Mn ⊗ Mk is positive semidefinite then we by this hint, we find
can construct a PSD matrix Xe ∈ Mn ⊗ Mk+1 sim-
ply by padding it with n rows and columns of zeros det(Φ(X)) = 2 det(X)p(X), where
without affecting positive semidefiniteness. Then p(X) = 16a + 30ad + 9d 2
2
e is positive semidefinite
(Φ ⊗ Ik )(X) = (Φ ⊗ Ik+1 )(X)
by (k + 1)-positivity of Φ, and since X was arbitrary − 8aRe(b) − 12Re(b)d − 8|b|2 .
this shows that Φ is k-positive.
If we use the facts that |b|2 < ad and Re(b) <
(a + d)/2 then we see that
3.A.19 We need to show that Φ(X) is positive semidefinite
whenever X is positive semidefinite. To this end, p(X) > 12a2 + 12ad + 3d 2 > 0,
notice that every PSD matrix X can be written as
a limit of positive definite matrices X1 , X2 , X3 , . . .: so det(Φ(X)) > 0 as well.
X = limk→∞ Xk , where Xk = X + I/k. Linearity (and (c) Sylvester’s Criterion tells us that Φ(X) is pos-
thus continuity) of Φ then tells us that itive definite whenever X is positive definite.
  Exercise 3.A.19 then says that Φ is positive.
Φ(X) = Φ lim Xk = lim Φ(Xk ),
k→∞ k→∞
Section 3.B: Extra Topic: Homogeneous Polynomials
3.B.1 (a) Since this is a quadratic form, we can represent (e) We again mimic Example 3.B.1, but this time
it via a symmetric matrix we are working in a 10-dimensional space
" #  so we use the basis {x3 , y3 , z3 , (x + y)3 , (x −
2 2
  3 −1 x y)3 , (y + z)3 , (y − z)3 , (z + x)3 , (z − x)3 , (x + y +
3x + 3y − 2xy = x y
−1 3 y z)3 }. Solving the resulting linear system gives
the decomposition 6(x2 y + y2 z + z2 x) = (x +
and then apply the spectral decomposition to y)3 − (x − y)3 + (y + z)3 − (y − z)3 + (z + x)3 −
this matrix to get the sum-of-squares form (z − x)3 − 2x3 − 2y3 − 2z3 .
(x + y)2 + 2(x − y)2 . (g) This time we use the basis {x4 , y4 , (x +
(c) We mimic Example 3.B.1: we try to write y)4 , (x − y)4 , (x + 2y)4 }. Solving the result-
2x3 − 9x2 y + 3xy2 − y3 = c1 x3 + c2 y3 + c3 (x + ing linear system gives the decomposition
y)3 + c4 (x − y)3 and solve for c1 , c2 , c3 , c4 x4 + (x + y)4 + (x − y)4 − (x + 2y)4 + 4y4 .
to get the decomposition x3 + 2y3 − (x + y)3 +
2(x − y)3 . 3.B.2 (a) One possibility is (2, 3, 3, 5) = (1, 1)⊗2 +
(1, 2)⊗2 .
(c) (1, 0)⊗3 + (0, 1)⊗3 − (1, 1)⊗3 + (1, −1)⊗3 If we multiply and divide the right-hand side by
x2 + y2 + z2 and then multiply out the numerators,
3.B.3 (a) False. The degree of g might be strictly less we get f (x, y, z) as a sum of 18 squares of rational
than that of f . For example, if f (x1 , x2 ) = functions. For example, the first three terms in this
x1 x2 + x22 then dividing through by x22 gives sum of squares are of the form
the dehomogenization g(x) = x + 1.  
(c) True. This follows from Theorem 3.B.2. (x2 + y2 + z2 )g(x, y, z)2 xg(x, y, z) 2
=
(e) True. This follows from the real spectral de- (x2 + y2 + z2 )2 x2 + y2 + z2
composition (see Corollary 2.A.2).  2  
yg(x, y, z) zg(x, y, z) 2
+ 2 + .
x + y2 + z2 x2 + y2 + z2
3.B.5 We just observe that in any 4th power of a linear form
3.B.9 Recall from Theorem 1.3.6 that we can write every
(ax + by)4 ,
quadrilinear form f in the form
if the coefficient of the x4 and y4 terms equal 0 then
f (x, y, z, w) = ∑ bi, j,k,` xi z j yk w` .
a = b = 0 as well, so no quartic form without x4 or i, j,k,`
y4 terms can be written as a sum of 4th powers of
linear forms. If we plug z = x and w = y into this equation then
we see that
3.B.6 For both parts of this question, we just need to show
f (x, y, x, y) = bi, j,k,` xi x j yk y` .
that rankS (w) ≤ rank(w), since the other inequality ∑
i, j,k,`
is trivial.
(a) w ∈ Sn2 if and only if mat(w) is a symmetric If we define ai, j;k,` = (bi, j,k,` + b j,i,k,` + bi, j,`,k +
matrix. Applying the real spectral decomposi- b j,i,` , k)/4 then we get
tion shows that
r
f (x, y, x, y) = ∑ ai, j;k,` xi x j yk y` = q(x, y),
i≤ j,k≤`
mat(w) = ∑ λi vi vTi ,
i=1 and the converse is similar.
where r = rank(mat(w)) = rank(w). It follows
that w = ∑ri=1 λi vi ⊗ vi , so rankS (w) ≤ r = 3.B.10 We think of q(x, y) as a (quadratic) polynomial in
rank(w). y3 , keeping x1 , x2 , x3 , y1 , and y2 fixed. That is, we
(b) The Takagi factorization (Exercise 2.3.26) says write q(x, y) = ay23 + by3 + c, where a, b, and c are
that we can write mat(w) = UDU T , where complicated expressions that depend on x1 , x2 , x3 , y1 ,
U is unitary and D is diagonal (with the sin- and y2 :
gular values of mat(w) on its diagonal). If
a = x22 + x32 ,
r = rank(mat(w)) = rank(w) and the first r
columns of U are v1 , . . ., vr , then b = −2x1 x3 y1 − 2x2 x3 y2 , and
r c = x12 y21 + x12 y22 + x22 y22 + x32 y21 − 2x1 x2 y1 y2 .
mat(w) = ∑ di,i vi vTi ,
i=1 If a = 0 then x2 = x3 = 0, so q(x, y) = x12 (y21 + y22 ) ≥
0, as desired, so we assume from now on that a > 0.
so w = ∑ri=1 di,i vi
⊗ vi , which tells us that
Our goal is to show that q has at most one real root,
rankS (w) ≤ r = rank(w), as desired.
since that implies it is never negative.
To this end, we note that after performing a laborious
3.B.7 (a) Apply the AM–GM inequality to the 4 quanti-
calculation to expand out the discriminant b2 − 4ac
ties w4 , x2 y2 , y2 z2 , and z2 x2 .
of q, we get the following sum of squares decompo-
(b) If we could write f as a sum of squares of
sition:
quadratic forms, those quadratic forms must
contain no x2 , y2 , or z2 terms (or else f would b2 − 4ac = −4(abx − b2 y)2
have an x4 , y4 , or z4 term, respectively). It then
follows that they also have no wx, wy, or wz − 4(aby − c2 x)2 − 4(acy − bcx)2 .
terms (or else f would have an a w2 x2 , w2 y2 , It follows that b2 − 4ac ≤ 0, so q has at most one
or w2 z2 term, respectively) and thus no way to real root and is thus positive semidefinite.
create the cross term −4wxyz in f , which is a
contradiction. 3.B.11 We check the 3 defining properties of Defini-
tion 1.3.6. Properties (a) and (b) are both immediate,
3.B.8 Recall from Remark 3.B.2 that and property (c) comes from computing
g(x, y, z)2 + g(y, z, x)2 + g(z, x, y)2
f (x, y, z) = |ak1 ,k2 ,...,kn |2
x 2 + y2 + z 2 hf, fi = ∑ p  ,
k1 +···+kn =p k1 ,k2 ,...,kn
h(x, y, z)2 + h(y, z, x)2 + h(z, x, y)2
+ .
2(x2 + y2 + z2 ) which is clearly non-negative and equals zero if and
only if ak1 ,k2 ,...,kn = 0 for all k1 , k2 , . . ., kn (i.e., if
and only if f = 0).
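The AM–GM argument in Exercise 3.B.7(a) can be sanity-checked by random sampling (my own quick illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    w, x, y, z = rng.standard_normal((4, 100000))
    gap = w**4 + x**2 * y**2 + y**2 * z**2 + z**2 * x**2 - 4 * w * x * y * z
    print(gap.min() >= -1e-12)   # True: the quartic form is non-negative
                                 # (up to rounding) on every sample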
Section 3.C: Extra Topic: Semidefinite Programming
3.C.1 (a) This is equivalent to A − A = O being positive semidefinite, which is clear.
    (b) If A − B and B − A = −(A − B) are both positive semidefinite then all eigenvalues of
        A − B must be ≥ 0 and ≤ 0, so they must all equal 0. Since A − B is Hermitian, this
        implies that it equals O (by the spectral decomposition, for example), so A = B.
    (c) If A − B and B − C are both PSD then so is (A − B) + (B − C) = A − C.
    (d) If A − B is PSD then so is (A + C) − (B + C) = A − B.
    (e) This is equivalent to the (obviously true) statement that A − B is PSD if and only if
        −B + A is PSD.
    (f) If A − B is PSD then so is P(A − B)P* = PAP* − PBP*, by Theorem 2.2.3(d).

3.C.2 (a) True. We showed how to represent a linear program as a semidefinite program at the
        start of this section.
    (c) True. If A − B is PSD then tr(A − B) ≥ 0, so tr(A) ≥ tr(B).
    (e) False. This is not even true for linear programs.

3.C.3 (a) Suppose A has spectral decomposition A = UDU*. Then A ⪯ cI if and only if
        cI − A ⪰ O, if and only if U(cI − D)U* ⪰ O, if and only if cI − D ⪰ O, if and only if
        c ≥ λ_max(A).
    (b) There are many possibilities, including the following (where c ∈ R is the variable):

            minimize:   c
            subject to: cI ⪰ A

    (c) By an argument like that from part (a), A ⪰ cI if and only if λ_min(A) ≥ c. It follows
        that the following SDP has optimal value λ_min(A):

            maximize:   c
            subject to: cI ⪯ A

3.C.4 (a) The constraints of this SDP force

            [ −x − y    1 ]
            [    1     −x ]  ⪰ O,

        which is not possible since −x ≤ 0 and PSD matrices cannot have negative diagonal
        entries (furthermore, we showed in Exercise 2.2.11 that if a diagonal entry of a PSD
        matrix equals 0 then so does every entry in that row and column, which rules out the
        x = 0 case).
    (b) The dual SDP has the following form (where X ∈ M_3^H is a matrix variable):

            minimize:   2Re(x_{1,2})
            subject to: x_{1,1} + x_{2,2} + x_{3,3} ≥ 0
                        x_{1,1} + x_{3,3} = 0
                        X ⪰ O

        This SDP is feasible with optimal value 0 since the feasible points are exactly the
        matrices X with x_{2,2} ≥ 0 and all other entries equal to 0.

3.C.5 (a) One possibility is

            A = [ 2  0 ]    and    B = [  1  −1 ].
                [ 0  3 ]               [ −1   2 ]

        Then A and B are both positive semidefinite, and so is

            A − B = [ 1  1 ].
                    [ 1  1 ]

        However,

            A² − B² = [ 2  3 ],
                      [ 3  4 ]

        which is not positive semidefinite (it has determinant −1).
    (c) Notice that both A − B and A + B are positive semidefinite (the former by assumption,
        the latter since A and B are positive semidefinite). By Exercise 2.2.19, it follows that

            tr((A − B)(A + B)) ≥ 0.

        Expanding out this trace then shows that

            0 ≤ tr(A²) − tr(BA) + tr(AB) − tr(B²).

        Since tr(AB) = tr(BA), it follows that tr(A²) − tr(B²) ≥ 0, so tr(A²) ≥ tr(B²), as
        desired.

3.C.7 One matrix that works is

            [  3   2   1  −2   1 ]
            [  2   3  −1  −2   1 ]
            [  1  −1   3   1  −1 ].
            [ −2  −2   1   3  −2 ]
            [  1   1  −1  −2   3 ]

3.C.8 If A has orthogonal rank-one sum decomposition A = ∑_{j=1}^r σ_j u_j v_j* then we choose
    X = u_1 v_1*, Y = u_1 u_1*, and Z = v_1 v_1*. It is straightforward to then check that
    tr(Y) = tr(Z) = 1, Re(tr(AX*)) = Re(tr(σ_1 (v_1* v_1)(u_1* u_1))) = σ_1 = ‖A‖, and

            [  Y   −X ]   [  u_1 u_1*   −u_1 v_1* ]   [  u_1 ] [  u_1 ]*
            [ −X*   Z ] = [ −v_1 u_1*    v_1 v_1* ] = [ −v_1 ] [ −v_1 ]   ⪰ O.

3.C.9 We could use duality and mimic Example 3.C.5, but instead we recall from Theorem 2.B.4 that

            [ I   A ]
            [ A*  X ]  ⪰ O

    if and only if X − A*A ⪰ O. It follows that, for any feasible X, we have
    tr(X) ≥ tr(A*A) = ‖A‖_F², which shows that the optimal value of the semidefinite program is
    at least as large as ‖A‖_F².
    To show that there exists a feasible X that attains the bound of ‖A‖_F², let A = UΣV* be a
    singular value decomposition of A. Then we define X = VΣ*ΣV*, which has
    tr(X) = tr(VΣ*ΣV*) = tr(Σ*Σ) = ‖A‖_F². Furthermore, X is feasible since it is positive
    semidefinite and so is

            [ I   A ]   [ I        UΣV*  ]   [ U   ] [ U   ]*
            [ A*  X ] = [ VΣ*U*   VΣ*ΣV* ] = [ VΣ* ] [ VΣ* ]  .
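    If a numerical solver happens to be available, the conclusion of Exercise 3.C.9 can also be
    sanity-checked directly. The sketch below (an added illustration, not part of the printed
    solutions) uses NumPy and CVXPY [DCAB18] on a small randomly generated real matrix A, and it
    replaces the block-matrix constraint by the equivalent constraint X ⪰ A*A from Theorem 2.B.4,
    exactly as quoted in the solution, so that the constrained expression is a plain symmetric
    affine expression. Up to solver tolerance, the reported optimal value should match ‖A‖_F².

        import numpy as np
        import cvxpy as cp

        np.random.seed(0)
        m, n = 3, 4
        A = np.random.randn(m, n)             # small real test matrix (illustrative choice only)

        X = cp.Variable((n, n), symmetric=True)
        # By Theorem 2.B.4, [I, A; A^T, X] being PSD is equivalent to X - A^T A being PSD.
        constraints = [X >> A.T @ A]
        problem = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
        problem.solve()

        print(problem.value)                   # optimal value of the SDP
        print(np.linalg.norm(A, 'fro')**2)     # squared Frobenius norm of A; should agree

    Whatever conic solver CVXPY selects by default is sufficient here, since this is a very small
    semidefinite program.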

3.C.12 (a) The dual of this SDP can be written in the following form, where Y ∈ M_n^H and z ∈ R
        are variables:

            minimize:   tr(Y) + kz
            subject to: Y + zI ⪰ C
                        Y ⪰ O

    (b) If X is the orthogonal projection onto the span of the eigenspaces corresponding to the
        k largest eigenvalues of C then X is a feasible point of the primal problem of this SDP
        and tr(CX) is the sum of those k largest eigenvalues, as claimed. On the other hand, if
        C has spectral decomposition C = ∑_{j=1}^n λ_j v_j v_j* then we can choose z = λ_{k+1}
        (where λ_1 ≥ ··· ≥ λ_n are the eigenvalues of C) and Y = ∑_{j=1}^k (λ_j − z) v_j v_j*.
        It is straightforward to check that Y ⪰ O and Y + zI ⪰ C, so Y is feasible. Furthermore,
        tr(Y) + kz = ∑_{j=1}^k λ_j − kz + kz = ∑_{j=1}^k λ_j, as desired.

3.C.14 If Φ = Ψ_1 + T ∘ Ψ_2 then C_Φ = C_{Ψ_1} + Γ(C_{Ψ_2}), so Φ is decomposable if and only if
    there exist PSD matrices X and Y such that C_Φ = X + Γ(Y). It follows that the optimal value
    of the following SDP is 0 if Φ is decomposable, and it is −∞ otherwise (this is a
    feasibility SDP—see Remark 3.C.3):

            maximize:   0
            subject to: X + Γ(Y) = C_Φ
                        X, Y ⪰ O

3.C.15 We can write Φ = Ψ_1 + T ∘ Ψ_2, where 2C_{Ψ_1} and 2C_{Ψ_2} are the (PSD) matrices

            [  3   0   0   0   0  −2  −1   0 ]
            [  0   1   0   0   0   0   0  −1 ]
            [  0   0   2   0   0   0   0  −2 ]
            [  0   0   0   2  −2   0   0   0 ]
            [  0   0   0  −2   2   0   0   0 ]
            [ −2   0   0   0   0   2   0   0 ]
            [ −1   0   0   0   0   0   1   0 ]
            [  0  −1  −2   0   0   0   0   3 ]

    and

            [  1   0   0   0   0   0  −1   0 ]
            [  0   3   0   0  −2   0   0  −1 ]
            [  0   0   2   0   0  −2   0   0 ]
            [  0   0   0   2   0   0  −2   0 ]
            [  0  −2   0   0   2   0   0   0 ]
            [  0   0  −2   0   0   2   0   0 ]
            [ −1   0   0  −2   0   0   3   0 ]
            [  0  −1   0   0   0   0   0   1 ],

    respectively.

3.C.17 (a) The Choi matrix of Φ is

            [  4  −2  −2   2   0   0   0   0 ]
            [ −2   3   0   0   0   0   0   0 ]
            [ −2   0   2   0   0   1   0   0 ]
            [  2   0   0   0   0   0   0   0 ]
            [  0   0   0   0   0   0   0  −2 ]
            [  0   0   1   0   0   2   0  −1 ]
            [  0   0   0   0   0   0   4   0 ]
            [  0   0   0   0  −2  −1   0   2 ]

        Using the same dual SDP from the solution to Exercise 3.C.16, we find that the matrix Z
        given by

            [   2    0    0  −28    0    0    0    0 ]
            [   0   16    0    0    0    0    0    0 ]
            [   0    0   49    0    0  −49    0    0 ]
            [ −28    0    0  392    0    0    0    0 ]
            [   0    0    0    0  392    0    0   28 ]
            [   0    0  −49    0    0   49    0    0 ]
            [   0    0    0    0    0    0   16    0 ]
            [   0    0    0    0   28    0    0    2 ]

        is feasible and has tr(C_Φ Z) = −2. Scaling Z up by an arbitrary positive scalar shows
        that the optimal value of this SDP is −∞ and thus the primal problem is infeasible and
        so Φ is not decomposable.

3.C.18 Recall from Theorem 3.B.3 that q is a sum of squares of bilinear forms if and only if we
    can choose Φ to be decomposable and bisymmetric. Let {x_i x_i^T} and {y_j y_j^T} be the
    rank-1 bases of M_m^S and M_n^S, respectively, described by Exercise 1.2.8.
    It follows that the following SDP determines whether or not q is a sum of squares, since it
    determines whether or not there is a decomposable bisymmetric map Φ that represents q:

            maximize:   0
            subject to: x_i^T Φ(y_j y_j^T) x_i = q(x_i, y_j) for all i, j
                        Φ = T ∘ Φ
                        Φ = Φ ∘ T
                        Φ = Ψ_1 + T ∘ Ψ_2
                        C_{Ψ_1}, C_{Ψ_2} ⪰ O

    [Side note: We can actually choose Ψ_1 = Ψ_2 to make this SDP slightly simpler, thanks to
    Exercise 3.A.15.]
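    In the same spirit, the dual SDP from the solution to Exercise 3.C.12(a) can be checked
    numerically. The following sketch (again an added illustration, not part of the printed
    solutions) generates a random real symmetric matrix C, picks the test value k = 2, and
    solves the dual with CVXPY [DCAB18]; up to solver tolerance, the optimal value should equal
    the sum of the k largest eigenvalues of C, as argued in part (b).

        import numpy as np
        import cvxpy as cp

        np.random.seed(1)
        n, k = 5, 2
        M = np.random.randn(n, n)
        C = (M + M.T) / 2                       # random symmetric test matrix

        Y = cp.Variable((n, n), symmetric=True)
        z = cp.Variable()
        constraints = [Y + z * np.eye(n) - C >> 0,   # Y + zI is above C in the Loewner order
                       Y >> 0]                       # Y is positive semidefinite
        problem = cp.Problem(cp.Minimize(cp.trace(Y) + k * z), constraints)
        problem.solve()

        eigenvalues = np.linalg.eigvalsh(C)     # ascending order
        print(problem.value)                    # optimal value of the dual SDP
        print(eigenvalues[-k:].sum())           # sum of the k largest eigenvalues of C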
Bibliography

[Aga85] S.S. Agaian, Hadamard Matrices and Their Applications (Springer, 1985)
[AM57] A.A. Albert, B. Muckenhoupt, On matrices of trace zeros. Mich. Math. J. 4(1), 1–3 (1957)
[Art27] E. Artin, Über die zerlegung definiter funktionen in quadrate. Abhandlungen aus dem Mathe-
matischen Seminar der Universität Hamburg 5(1), 100–115 (1927)

[BBP95] P. Borwein, J. Borwein, S. Plouffe, The inverse symbolic calculator (1995), http://wayback.
cecm.sfu.ca/projects/ISC/
[BV09] S.P. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, 2009)
[Cal73] A.P. Calderón, A note on biquadratic forms. Linear Algebra Appl. 7(2), 175–177 (1973)

[Cho75] M.-D. Choi, Positive semidefinite biquadratic forms. Linear Algebra Appl. 12(2), 95–100
(1975)
[Chv83] V. Chvatal, Linear Programming (W. H. Freeman, 1983)
[CL92] S. Chang, C.-K. Li, Certain isometries on Rn . Linear Algebra Appl. 165, 251–265 (1992)

[CVX12] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version
2.0 (August 2012), http://cvxr.com/cvx
[DCAB18] S. Diamond, E. Chu, A. Agrawal, S. Boyd, CVXPY: A Python-embedded modeling
language for convex optimization (2018), http://www.cvxpy.org
[FR97] B. Fine, G. Rosenberger, The Fundamental Theorem of Algebra. Undergraduate Texts in
Mathematics (Springer, New York, 1997)
[Hil88] D. Hilbert, Ueber die darstellung definiter formen als summe von formenquadraten. Mathema-
tische Annalen 32(3), 342–350 (1888)
[HJ12] R.A. Horn, C.R. Johnson, Matrix Analysis (Cambridge University Press, 2012)

[Hor06] K.J. Horadam, Hadamard Matrices and Their Applications (Princeton University Press, 2006)
[Joh20] N. Johnston, Introduction to Linear and Matrix Algebra. Undergraduate Texts in Mathematics
(Springer, New York, 2020)
[KB09] T.G. Kolda, B.W. Bader, Tensor decompositions and applications. SIAM Rev. 51(3), 455–500
(September 2009)
[Lan12] J.M. Landsberg, Tensors: Geometry and Applications (American Mathematical Society, 2012)
[LP01] C.-K. Li, S. Pierce, Linear preserver problems. Am. Math. Mon. 108, 591–605 (2001)

[LS94] C.-K. Li, W. So, Isometries of ℓ_p norm. Am. Math. Mon. 101, 452–453 (1994)
[LT92] C.-K. Li, N.-K. Tsing, Linear preserver problems: a brief introduction and some special
techniques. Linear Algebra Appl. 162, 217–235 (1992)
[Mir60] L. Mirsky, Symmetric gauge functions and unitarily invariant norms. Q. J. Math. 11, 50–59
(1960)

[MO15] M. Miller, R. Olkiewicz, Topology of the cone of positive maps on qubit systems. J. Phys. A
Math. Theor. 48 (2015)
[Shi18] Y. Shitov, A counterexample to Comon’s conjecture. SIAM J. Appl. Algebra Geom. 2(3),
428–443 (2018)

[Stø63] E. Størmer, Positive linear maps of operator algebras. Acta Mathematica 110(1), 233–278
(1963)
[Sto10] A. Stothers, On the Complexity of Matrix Multiplication. Ph.D. thesis, University of Edinburgh
(2010)

[TB97] L.N. Trefethen, D. Bau, Numerical Linear Algebra (SIAM, 1997)


[Wor76] S.L. Woronowicz, Positive maps of low dimensional matrix algebras. Rep. Math. Phys. 10,
165–183 (1976)
[YYT16] Y. Yang, D.H. Leung, W.-S. Tang, All 2-positive linear maps from M3 (C) to M3 (C) are
decomposable. Linear Algebra Appl. 503, 233–247 (2016)
Index

A Cayley transform . . . . . . . . . . . . . . . . . . . . . 187


change-of-basis matrix . . . . . . . . . . . . . . . . . 30
absolute value . . . . . . . . . . . . . . . . . . . . . . . . 208 characteristic polynomial . . . . . . . . . . . . . . 421
adjoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Choi
adjoint-preserving . . . . . . . . . . . . . . . . . . . . 362 map . . . . . . . . . . . . . . . . . . . 373, 413, 415
adjugate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 360
algebraic multiplicity . . . . . . . . . . . . . . . . . 422 rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
alternating . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Cholesky decomposition . . . . . . . . . . . . . . 268
analytic function . . . . . . . . . . . . . . . . . . . . . . 245 circulant matrix . . . . . . . . . . . . . . . . . . . . . . 188
antisymmetric completely positive . . . . . . . . . . . . . . . . . . . 370
array . . . . . . . . . . . . . . . . . . . . . . . . . 70, 79 complex conjugate . . . . . . . . . . . . . . . . . . . . 430
subspace . . . . . . . . . . . . . . . . . . . . . . . . 316 complex number . . . . . . . . . . . . . . . . . . . . . . 428
automorphism . . . . . . . . . . . . . . . . . . . . . . 57, 98 complex permutation matrix . . . . . . . . . . . 164
complex plane . . . . . . . . . . . . . . . . . . . . . . . . 430
complex symmetric . . . . . . . . . . . . . . 185, 226
B
composition . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 concave . . . . . . . . . . . . . . . . . . . . . . . . . 399, 439
bilinear conjugate transpose . . . . . . . . . . . . . . . . . . . . . 6
form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 continuous . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
transformation . . . . . . . . . . . . . . . . . . . 320 convex . . . . . . . . . . . . . . . . . . . . . 399–401, 436
binomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424 coordinate vector . . . . . . . . . . . . . . . . . . . . . . 25
biquadratic form . . . . . . . . . . . . . 387, 414, 416 CP. see completely positive
bisymmetric map . . . . . . . . . . . . . . . . . . . . . 363 cross product . . . . . . . . . . . . . . . . . . . . . . . . . 320
border rank . . . . . . . . . . . . . . . . . . . . . . . . . . 355 cubic form . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
Breuer–Hall map . . . . . . . . . . . . . . . . . . . . . 377
D
C decomposable map . . . . . . . . . . 372, 413, 415
degree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
Cartesian decomposition . . . . . . . . . . . . . . 126
dense . . . . . . . . . . . . . . . . . . . . . . . . . . . 285, 286


determinant . . . . . . . . . . . . . . . . . . . . . . . . . . 419 geometric multiplicity . . . . . . . . . . . . 228, 421


diagonalize . . . . . . . . . . . . . . . . . . . . . . 167, 422 Gershgorin disc . . . . . . . . . . . . . . . . . . . . . . . 197
diagonally dominant . . . . . . . . . . . . . . . . . . 200 Gram matrix . . . . . . . . . . . . . . . . . . . . . . . . . 208
dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 ground field . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
direct sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
dot product . . . . . . . . . . . . . . . . . . . . . . . . 58, 71 H
double-dual . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
dual Hölder’s inequality . . . . . . . . . . 153, 165, 319
basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Hadamard matrix . . . . . . . . . . . . . . . . . . . . . 300
space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Hadamard’s inequality . . . . . . . . . . . . 209, 334
Hermitian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
E Hermiticity-preserving . . . . . . . . . . . . 362, 394
homogeneous polynomial . . . . . . . . . . 23, 377
eigenspace . . . . . . . . . . . . . . . . . . . . . . . . 47, 421 hyperboloid . . . . . . . . . . . . . . . . . . . . . . . . . . 261
eigenvalue . . . . . . . . . . . . . . . . . . . . . . . . 47, 421
eigenvector . . . . . . . . . . . . . . . . . . . . . . . 47, 421 I
elementary tensor . . . . . . . . . . . . 305, 343, 349
ellipsoid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258 identity
epigraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 transformation . . . . . . . . . . . . . . . . . . . . 37
Euler’s formula . . . . . . . . . . . . . . . . . . . . . . . 432 imaginary number . . . . . . . . . . . . . . . . . . . . 428
evaluation map . . . . . . . . . . . . . . . . . . . . . 59, 79 imaginary part . . . . . . . . . . . . . . . . . . . . . . . . 428
external direct sum. see direct sum indefinite. . . . . . . . . . . . . . . . . . . . . . . . . . . . .261
induced norm . . . . . . . . . . . . . . . . . . . . 165, 221
F inner product . . . . . . . . . . . . . . . . . . . . . . . . . . 71
inner product space . . . . . . . . . . . . . . . . . . . . 74
field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434 internal direct sum. see direct sum
flattening . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 invertible . . . . . . . . . . . . . . . . . . . . . . . . . 44, 417
Fourier isometry . . . . . . . . . . . . . . . . . . . . . . . . 162, 376
matrix . . . . . . . . . . . . . . . . . . . . . . 113, 188 isomorphic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
series . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 isomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . 54
free variable . . . . . . . . . . . . . . . . . . . . . . . . . . 416
Frobenius
J
inner product . . . . . . . . . . . . . . . . . . . . . 72
norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Jacobi’s formula . . . . . . . . . . . . . . . . . 120, 293
Fuglede’s theorem . . . . . . . . . . . . . . . . . . . . 255 Jordan
fundamental subspace . . . . . . . . . . . . 133, 212 block . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
canonical form . . . . . . . . . . . . . . . . . . . 227
G chain . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
decomposition . . . . . . . . . . . . . . . . . . . 227
Gaussian elimination . . . . . . . . . . . . . . . . . . 415 Jordan–Chevalley decomposition . . . . . . . 252
generalized eigenspace . . . . . . . . . . . . . . . . 240
generalized eigenvector . . . . . . . . . . . . . . . 240

K nilpotent . . . . . . . . . . . . . . . . . . . . . . . . 187, 252


norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 norm induced by inner product . . . . . . . . . . 74
Kronecker normal matrix . . . . . . . . . . . . . . . . . . . . . . . . 176
power . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 normed vector space . . . . . . . . . . . . . . . . . . 158
product . . . . . . . . . . . . . . . . . . . . . . . . . 298 null space . . . . . . . . . . . . . . . . . . . . . . . . 46, 418
sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318 nullity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46, 418
Ky Fan norm . . . . . . . . . . . . . . . . . . . . 413, 415
O
L
oblique projection. see projection
leading entry . . . . . . . . . . . . . . . . . . . . . . . . . 416 operator norm . . . . . . . . . . . . . . . 165, 221, 333
linear operator-sum representation . . . . . . . . . . . 363
combination . . . . . . . . . . . . . . . . . . . . . . 12 order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
dependence . . . . . . . . . . . . . . . . . . . . . . . 15 ordered basis . . . . . . . . . . . . . . . . . . . . . . . . . . 25
equation . . . . . . . . . . . . . . . . . . . . . . . . . 415 orthogonal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 complement . . . . . . . . . . . . . . . . . . . . . 128
independence . . . . . . . . . . . . . . . . . . . . . 15 matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 100
least squares . . . . . . . . . . . . . . . . 108, 278 projection . . . . . . . . . . . . . . . . . . . . . . . 101
preserver problem . . . . . . . . . . . . . . . . 368 orthonormal basis . . . . . . . . . . . . . . . . . . . . . . 81
system . . . . . . . . . . . . . . . . . . . . . . . . . . 415
transformation . . . . . . . . . . . . . . . . 36, 416 P
Loewner partial order . . . . . . . . . . . . . . . . . 393
p-norm . . . . . . . . . . . . . . . . . . . . . . . . . . 150, 156
paraboloid . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
M
parallelogram law . . . . . . . . . . . . . . . . . 78, 158
matricization . . . . . . . . . . . . . . . . . . . . . . . . . 305 partial trace . . . . . . . . . . . . . . . . . . . . . . . . . . 367
matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1, 415 partial transpose . . . . . . . . . . . . . . . . . . . . . . 366
matrix-valued map . . . . . . . . . . . . . . . . . . . . 358 Pauli matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
max norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 permanent. . . . . . . . . . . . . . . . . . . . . . . . . . . .341
Minkowski’s inequality . . . . . . . . . . . . . . . . 151 permutation . . . . . . . . . . . . . . . . . . . . . 311, 419
monomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424 permutation matrix . . . . . . . . . . . . . . . . . . . 309
multilinear . . . . . . . . . . . . . . . . . . . . . . . . . . . 420 polar decomposition . . . . . . . . . . . . . . . . . . 204
form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 polarization identity . . . . . . . . . . . . . . . 78, 159
transformation . . . . . . . . . . . . . . . . . . . 320 polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
multinomial coefficient . . . . . . . . . . . . . . . . 426 positive definite . . . . . . . . . . . . . . 79, 188, 258
mutually orthogonal . . . . . . . . . . . . . . . . . . . . 80 positive map . . . . . . . . . . . . . . . . . . . . . . . . . 369
positive semidefinite . . . . . . . . . 188, 258, 383
positive semidefinite part . . . . . 208, 405, 407
N primal problem . . . . . . . . . . . . . . . . . . 402, 404
principal minor . . . . . . . . . . . . . . . . . . 196, 373
negative semidefinite . . . . . . . . . . . . . . . . . . 260
principal square root . . . . . . . . . 203, 248, 433

projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 similarity-invariant . . . . . . . . . . . . . . . . . . . . 116


PSD. see positive semidefinite singular value . . . . . . . . . . . . . . . . . . . . 209, 470
pseudoinverse . . . . . . . . . . . . . . . . . . . . . . . . 273 singular value decomposition . . . . . . . . . . 209
Pythagorean theorem . . . . . . . . . . . . . . . . . . . 78 skew-symmetric
bilinear form . . . . . . . . . . . . . . . . . . . . . . 78
Q matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
span . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
QR decomposition . . . . . . . . . . . . . . . . . . . . 139 spectrahedron . . . . . . . . . . . . . . . . . . . . 401, 402
quadratic form . . . . . . . . . . . . . . . . . . . . . . . 256 spectral decomposition . . . . . . . . . . . 176, 183
quadrilinear form . . . . . . . . . . . . . . . . . . . . . 387 standard
quartic form . . . . . . . . . . . . . . . . . . . . . . . . . . 378 array . . . . . . . . . . . . . . . . . . . . . . . . 68, 325
basis . . . . . . . . . . . . . . . . . . . . . . . . 19, 416
matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
R
stochastic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
range . . . . . . . . . . . . . . . . . . . . . . . . 46, 335, 418 Strassen’s algorithm . . . . . . . . . . . . . . . . . . 339
rank . . . . . . . . . . . . . . . . . . . . 46, 337, 351, 418 submultiplicativity . . . . . . . . . . . . . . . . . . . . 222
rank-one sum. . . . . . . . . . . . . . . . . . . . . . . . .419 subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
real part . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428 sum of subspaces . . . . . . . . . . . . . . . . . . . . . . 23
reduced row echelon form . . . . . . . . . 252, 416 SVD. see singular value decomposition
reduction map . . . . . . . . . . . . . . . . . . . . . . . . 359 swap matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 307
reverse triangle inequality . . . . . . . . . . . . . 165 Sylvester equation . . . . . . . . . . . . . . . . . . . . 319
root of a number . . . . . . . . . . . . . . . . . . . . . . 433 Sylvester’s law of inertia . . . . . . . . . . . . . . 255
row echelon form . . . . . . . . . . . . . . . . . . . . . 416 symmetric
row reduction . . . . . . . . . . . . . . . . . . . . . . . . 415 bilinear form . . . . . . . . . . . . . . . . . . . . . . 78
RREF. see reduced row echelon form group . . . . . . . . . . . . . . . . . . . . . . . 312, 420
matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
subspace . . . . . . . . . . . . . . . . . . . . . . . . 312
S
tensor rank . . . . . . . . . . . . . . . . . . . . . . 382
saddle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 symmetric tensor rank . . . . . . . . . . . . . . . . . 382
Schatten norm . . . . . . . . . . . . . . . . . . . 413, 415 symmetry-preserving . . . . . . . . . . . . . 362, 376
Schmidt decomposition . . . . . . . . . . . . . . . 319
Schur complement . . . . . . . . . . . . . . . . . . . . 265 T
Schur triangularization . . . . . . . . . . . . . . . . 169
SDP. see semidefinite program Takagi factorization . . . . . . . . . . 226, 315, 392
self-adjoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 taxicab norm . . . . . . . . . . . . . . . . . . . . . . . . . 147
semidefinite program . . . . . . . . . . . . . . . . . . 394 Taylor series . . . . . . . . . . . . . . . . . . . . . . 14, 426
sequence space . . . . . . . . . . . . . . . . . . . . . . . . . 4 tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
sesquilinear form . . . . . . . . . . . . . . . . . . . 71, 78 tensor product . . . . . . . . . . . . . . . . . . . . . . . . 343
sign of a permutation . . . . . . . . . . . . . . . . . . 420 tensor rank . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
signed permutation matrix . . . . . . . . . . . . . 164 trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36, 116
similar . . . . . . . . . . . . . . . . . . . . . . . . . . 168, 232 trace norm . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
similarity transformation . . . . . . . . . . . . . . 168 trace-preserving . . . . . . . . . . . . . . . . . . . . . . 376

transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 W
transpose-preserving . . . . . . . . . . . . . . . . . . 361
type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321 Wronskian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

U Y
unit ball . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 Young’s inequality . . . . . . . . . . . . . . . . . . . . 154
unit vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
unital . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 Z
unitary matrix . . . . . . . . . . . . . . . . . . . . . . . . . 96
unitary similarity . . . . . . . . . . . . . . . . . . . . . 168 zero
universal property . . . . . . . . . . . . . . . . . . . . 343 transformation . . . . . . . . . . . . . . . . . . . . 37
vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
V vector space . . . . . . . . . . . . . . . . . . 28, 317

vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1, 2
vector space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
vectorization . . . . . . . . . . . . . . . . . . . . . . . . . 305

Symbol Index
⊥ orthogonal complement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

≅ isomorphic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
⪯, ⪰ Loewner partial order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
⊕ direct sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121, 135
⊗ Kronecker product, tensor product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298, 343
0 zero vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
O zero matrix or zero transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4, 37
[A]_{i,j} (i, j)-entry of the matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
[T]_B standard matrix of T with respect to basis B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
‖·‖ the length/norm of a vector or the operator norm of a matrix . . . . . . . . . . . . . . . . . . 74, 221
‖A‖_F Frobenius norm of the matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
A^T transpose of the matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
A∗ conjugate transpose of the matrix A, adjoint of the linear transformation A . . . . . . . . 6, 92
A_p^n antisymmetric subspace of (F^n)^{⊗p} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
|B| size (i.e., number of members) of the set B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
C complex numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2, 428
C^n vectors/tuples with n complex entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
C vector space of continuous functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
C[a, b] continuous functions on the real interval [a, b] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
c00 infinite real sequences with finitely many non-zero entries . . . . . . . . . . . . . . . . . . . . . . . . . 11
D vector space of differentiable functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
e_j j-th standard basis vector in F^n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
E_{i,j} standard basis matrix with 1 in its (i, j)-entry and 0s elsewhere . . . . . . . . . . . . . . . . 19
F a field (often R or C) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2, 434
F^n vectors/tuples with n entries from F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
F^N vector space of infinite sequences with entries from F . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
F vector space of real-valued functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
HP_p^n homogeneous polynomials in n variables with degree p . . . . . . . . . . . . . . . . . . . . . . 23, 378
I identity matrix or identity transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12, 37
M_{m,n} m × n matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
M_n n × n matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
M_n^{sS} n × n skew-symmetric matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
M_n^{sH} n × n (complex) skew-Hermitian matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
M_n^H n × n (complex) Hermitian matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
M_n^S n × n symmetric matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
N natural numbers {1, 2, 3, . . .} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
P polynomials in one variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
P^E, P^O even and odd polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
P_n polynomials in n variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
P_p polynomials in one variable with degree ≤ p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
P_p^n polynomials in n variables with degree ≤ p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Q rational numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285, 435
R real numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
R^n vectors/tuples with n real entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
S_p^n symmetric subspace of (F^n)^{⊗p} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
V∗ dual of the vector space V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Z integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
