Matrix Decomposition and Applications

Jun Lu
jun.lu.locky@gmail.com
arXiv:2201.00145v1 [math.NA] 1 Jan 2022

Abstract
In 1954, Alston S. Householder published Principles of Numerical Analysis, one of the first
modern treatments on matrix decomposition that favored a (block) LU decomposition: the
factorization of a matrix into the product of lower and upper triangular matrices. And
now, matrix decomposition has become a core technology in machine learning, largely due
to the development of the back propagation algorithm in fitting a neural network. The sole
aim of this survey is to give a self-contained introduction to concepts and mathematical
tools in numerical linear algebra and matrix analysis in order to seamlessly introduce
matrix decomposition techniques and their applications in subsequent sections. However,
we clearly realize our inability to cover all the useful and interesting results concerning matrix decomposition, given the limited scope of this discussion, e.g., the separate analysis of Euclidean spaces, Hermitian spaces, Hilbert spaces, and results in the complex domain. We refer the reader to the literature in the field of linear algebra for a more detailed introduction to the related fields.
This survey is primarily a summary of the purpose and significance of important matrix decomposition methods, e.g., LU, QR, and SVD, and of the origin and complexity of these methods, which sheds light on their modern applications. Most importantly, this article presents improved procedures for most of the calculations in the decomposition algorithms, which can potentially reduce their complexity. Again, this is a decomposition-focused text, so we introduce the related background only when it is needed. In many other textbooks on linear algebra, the principal ideas are discussed and the matrix decomposition methods serve as a “byproduct”. Here, in contrast, we focus on the decomposition methods, and the principal ideas serve as fundamental tools for them. The mathematical prerequisite is a first course in linear algebra. Other than this modest background, the development is self-contained, with rigorous proofs provided throughout.
Keywords: Existence and computing of matrix decompositions, Complexity, Floating
point operations (flops), Low-rank approximation, Pivot, LU decomposition for nonzero
leading principal minors, Data distillation, CR decomposition, CUR/Skeleton decomposi-
tion, Interpolative decomposition, Biconjugate decomposition, Coordinate transformation,
Hessenberg decomposition, ULV decomposition, URV decomposition, Rank decomposition,
Gram-Schmidt process, Householder reflector, Givens rotation, Rank revealing decompo-
sition, Cholesky decomposition and update/downdate, Eigenvalue problems, Alternating
least squares, Randomized algorithm.

©2022 Jun Lu.



Figure 1: Matrix Decomposition World Map (a diagram connecting the decomposition methods covered in this survey, e.g., LU, Cholesky, QR, CPQR/RRQR, LQ, UTV, two-sided orthogonal, biconjugate, CR, skeleton/CUR, interpolative, NMF, ALS, Hessenberg, tridiagonal, bidiagonal, eigenvalue, Schur, spectral, Jordan, SVD, and polar, by their internal relations).



Figure 2: Matrix Decomposition World Map Under Conditions (a diagram organizing the decompositions by the conditions on the matrix A ∈ Rm×n, e.g., square vs. rectangular, symmetric, PD/PSD, full rank, or general, and listing the corresponding factorizations such as LU, Cholesky, QR, SVD, eigenvalue, Schur, spectral, Jordan, Hessenberg, tridiagonal, bidiagonal, UTV, CR, skeleton, interpolative, ALS, and NMF).


Contents

Introduction and Background 8

I Gaussian Elimination 15

1 LU Decomposition 16
1.1 Relation to Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2 Existence of the LU Decomposition without Permutation . . . . . . . . . . 20
1.3 Existence of the LU Decomposition with Permutation . . . . . . . . . . . . 21
1.4 Bandwidth Preserving in the LU Decomposition without Permutation . . . 23
1.5 Block LU Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.6 Application: Linear System via the LU Decomposition . . . . . . . . . . . . 24
1.7 Application: Computing the Inverse of Nonsingular Matrices . . . . . . . . 25
1.8 Application: Computing the Determinant via the LU Decomposition . . . . 25
1.9 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.9.1 Partial Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.9.2 Complete Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.9.3 Rook Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.10 Rank-Revealing LU Decomposition . . . . . . . . . . . . . . . . . . . . . . . 28

2 Cholesky Decomposition 28
2.1 Existence of the Cholesky Decomposition via Recursive Calculation . . . . . 29
2.2 Sylvester’s Criterion: Leading Principal Minors of PD Matrices . . . . . . . 32
2.3 Existence of the Cholesky Decomposition via the LU Decomposition without
Permutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.1 Diagonal Values of the Upper Triangular Matrix . . . . . . . . . . . 35
2.3.2 Block Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . 36
2.4 Existence of the Cholesky Decomposition via Induction . . . . . . . . . . . 37
2.5 Uniqueness of the Cholesky Decomposition . . . . . . . . . . . . . . . . . . 38
2.6 Last Words on Positive Definite Matrices . . . . . . . . . . . . . . . . . . . 38
2.7 Decomposition for Semidefinite Matrices . . . . . . . . . . . . . . . . . . . . 39
2.8 Application: Rank-One Update/Downdate . . . . . . . . . . . . . . . . . . . 40
2.8.1 Rank-One Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.8.2 Rank-One Downdate . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.9 Application: Indefinite Rank Two Update . . . . . . . . . . . . . . . . . . . 43

II Triangularization, Orthogonalization and Gram-Schmidt Process 43

3 QR Decomposition 43
3.1 Project a Vector Onto Another Vector . . . . . . . . . . . . . . . . . . . . . 44
3.2 Project a Vector Onto a Plane . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Existence of the QR Decomposition via the Gram-Schmidt Process . . . . . 45
3.4 Orthogonal vs Orthonormal . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


3.5 Computing the Reduced QR Decomposition via CGS and MGS . . . . . . . 48


3.6 Computing the Full QR Decomposition via the Gram-Schmidt Process . . . 52
3.7 Dependent Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 QR with Column Pivoting: Column-Pivoted QR (CPQR) . . . . . . . . . . 53
3.8.1 A Simple CPQR via CGS . . . . . . . . . . . . . . . . . . . . . . . . 53
3.8.2 A Practical CPQR via CGS . . . . . . . . . . . . . . . . . . . . . . . 54
3.9 QR with Column Pivoting: Revealing Rank One Deficiency . . . . . . . . . 55
3.10 QR with Column Pivoting: Revealing Rank r Deficiency* . . . . . . . . . . 56
3.11 Existence of the QR Decomposition via the Householder Reflector . . . . . 56
3.12 Existence of the QR Decomposition via the Givens Rotation . . . . . . . . . 60
3.13 Uniqueness of the QR Decomposition . . . . . . . . . . . . . . . . . . . . . . 64
3.14 LQ Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.15 Two-Sided Orthogonal Decomposition . . . . . . . . . . . . . . . . . . . . . 65
3.16 Rank-One Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.17 Appending or Deleting a Column . . . . . . . . . . . . . . . . . . . . . . . . 68
3.18 Appending or Deleting a Row . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4 UTV Decomposition: ULV and URV Decomposition 71


4.1 UTV Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Complete Orthogonal Decomposition . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Application: Row Rank equals Column Rank Again via UTV . . . . . . . . 74

III Data Interpretation and Information Distillation 75

5 CR Decomposition 76
5.1 Existence of the CR Decomposition . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Reduced Row Echelon Form (RREF) . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Rank Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Application: Rank and Trace of an Idempotent Matrix . . . . . . . . . . . . 80

6 Skeleton/CUR Decomposition 80
6.1 Existence of the Skeleton Decomposition . . . . . . . . . . . . . . . . . . . . 81

7 Interpolative Decomposition (ID) 83


7.1 Existence of the Column Interpolative Decomposition . . . . . . . . . . . . 85
7.2 Row ID and Two-Sided ID . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

IV Reduction to Hessenberg, Tridiagonal, and Bidiagonal Form 89

8 Hessenberg Decomposition 89
8.1 Similarity Transformation and Orthogonal Similarity Transformation . . . . 90
8.2 Existence of the Hessenberg Decomposition . . . . . . . . . . . . . . . . . . 92
8.3 Properties of the Hessenberg Decomposition . . . . . . . . . . . . . . . . . . 94


9 Tridiagonal Decomposition: Hessenberg in Symmetric Matrices 96


9.1 Properties of the Tridiagonal Decomposition . . . . . . . . . . . . . . . . . . 97

10 Bidiagonal Decomposition 98
10.1 Existence of the Bidiagonal Decomposition: Golub-Kahan Bidiagonalization 99
10.2 Connection to Tridiagonal Decomposition . . . . . . . . . . . . . . . . . . . 105

V Eigenvalue Problem 106

11 Eigenvalue and Jordan Decomposition 106


11.1 Existence of the Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . 106
11.2 Jordan Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

12 Schur Decomposition 110


12.1 Existence of the Schur Decomposition . . . . . . . . . . . . . . . . . . . . . 110
12.2 Other Forms of the Schur Decomposition . . . . . . . . . . . . . . . . . . . 112

13 Spectral Decomposition (Theorem) 113


13.1 Existence of the Spectral Decomposition . . . . . . . . . . . . . . . . . . . . 114
13.2 Uniqueness of Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . 119
13.3 Other Forms, Connecting Eigenvalue Decomposition* . . . . . . . . . . . . 119
13.4 Skew-Symmetric Matrices and its Properties* . . . . . . . . . . . . . . . . . 126
13.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
13.5.1 Application: Eigenvalue of Projection Matrix . . . . . . . . . . . . . 129
13.5.2 Application: An Alternative Definition on PD and PSD of Matrices 131
13.5.3 Proof for Semidefinite Rank-Revealing Decomposition . . . . . . . . 132
13.5.4 Application: Cholesky Decomposition via the QR Decomposition and
the Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . . . 133
13.5.5 Application: Unique Power Decomposition of Positive Definite Matrices . . . 133

14 Singular Value Decomposition (SVD) 134


14.1 Existence of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
14.2 Properties of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
14.2.1 Four Subspaces in SVD . . . . . . . . . . . . . . . . . . . . . . . . . 137
14.2.2 Relationship between Singular Values and Determinant . . . . . . . 139
14.2.3 Orthogonal Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 139
14.2.4 SVD for QR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
14.3 Polar Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
14.4 Application: Least Squares via the Full QR Decomposition, UTV, SVD . . 141
14.5 Application: Principal Component Analysis (PCA) via the Spectral Decom-
position and the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
14.6 Application: Low-Rank Approximation . . . . . . . . . . . . . . . . . . . . . 147

VI Special Topics 148


15 Coordinate Transformation in Matrix Decomposition 148


15.1 An Overview of Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . 148
15.2 Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
15.3 Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.4 SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
15.5 Polar Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

16 Alternating Least Squares 152


16.1 Netflix Recommender and Matrix Factorization . . . . . . . . . . . . . . . . 152
16.2 Regularization: Extension to General Matrices . . . . . . . . . . . . . . . . 157
16.3 Missing Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
16.4 Vector Inner Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
16.5 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
16.6 Regularization: A Geometrical Interpretation . . . . . . . . . . . . . . . . . 162
16.7 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
16.8 Bias Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

17 Nonnegative Matrix Factorization (NMF) 167


17.1 NMF via Multiplicative Update . . . . . . . . . . . . . . . . . . . . . . . . . 167
17.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
17.3 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

18 Biconjugate Decomposition 170


18.1 Existence of the Biconjugate Decomposition . . . . . . . . . . . . . . . . . . 170
18.2 Properties of the Biconjugate Decomposition . . . . . . . . . . . . . . . . . 175
18.3 Connection to Well-Known Decomposition Methods . . . . . . . . . . . . . 175
18.3.1 LDU Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
18.3.2 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 177
18.3.3 QR Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
18.3.4 SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
18.4 Proof General Term Formula of Wedderburn Sequence . . . . . . . . . . . . 179

19 Acknowledgments 180


Introduction and Background

Matrix decomposition has become a core technology in statistics (Banerjee and Roy, 2014;
Gentle, 1998), optimization (Gill et al., 2021), machine learning (Goodfellow et al., 2016;
Bishop, 2006), and deep learning, largely due to the development of the back propagation algorithm for fitting neural networks and of low-rank neural networks for efficient deep learning.
The sole aim of this survey is to give a self-contained introduction to concepts and mathe-
matical tools in numerical linear algebra and matrix analysis in order to seamlessly introduce
matrix decomposition techniques and their applications in subsequent sections. However,
we clearly realize our inability to cover all the useful and interesting results concerning matrix decomposition, given the limited scope of this discussion, e.g., the separate analysis of Euclidean spaces, Hermitian spaces, and Hilbert spaces. We refer
the reader to literature in the field of linear algebra for a more detailed introduction to the
related fields. Some excellent examples include (Householder, 2006; Trefethen and Bau III,
1997; Strang, 2009; Stewart, 2000; Gentle, 2007; Higham, 2002; Quarteroni et al., 2010;
Golub and Van Loan, 2013; Beck, 2017; Gallier and Quaintance, 2017; Boyd and Vanden-
berghe, 2018; Strang, 2019; van de Geijn and Myers, 2020; Strang, 2021). Most importantly,
this survey will only cover the compact proofs of the existence of the matrix decomposition
methods. For more details on how to reduce the calculation complexity, rigorous discussion
in various applications and examples, why each matrix decomposition method is important
in practice, and preliminaries on tensor decomposition, one can refer to (Lu, 2021c).
A matrix decomposition is a way of reducing a complex matrix into constituent parts that have simpler forms. The underlying principle of the decompositional approach to matrix computation is that it is not the business of matrix algorithmists to solve particular problems; rather, the decomposition simplifies more complex matrix operations, which can then be performed on the decomposed parts rather than on the original matrix itself.
At a general level, a matrix decomposition task on matrix A can be cast as

• A = QU : where Q is an orthogonal matrix that contains the same column space as A, and U is a relatively simple and sparse matrix to reconstruct A.
• A = QT Q> : where Q is orthogonal such that A and T are similar matrices that share the same properties, such as eigenvalues and sparsity. Moreover, working on T is an easier task compared to working on A.
• A = U T V : where U , V are orthogonal matrices such that the columns of U and the
rows of V constitute an orthonormal basis of the column space and row space of A
respectively.
• A = BC: where B ∈ Rm×r and C ∈ Rr×n are full-rank matrices that can reduce the memory storage of A ∈ Rm×n . In practice, a low-rank approximation A ≈ DF with D ∈ Rm×k and F ∈ Rk×n can be employed, where k < r is called the numerical rank of the matrix, so that the matrix can be stored much more inexpensively and can be multiplied rapidly with vectors or other matrices. An approximation of the form A ≈ DF is useful for storing the matrix A more frugally (we can store D and F using k(m + n) floats, as opposed to mn numbers for storing A), for efficiently computing a matrix-vector product b = Ax (via c = F x and b = Dc), for data interpretation, and much more (a numerical sketch of these savings is given after this list).


• A matrix decomposition, though usually expensive to compute, can be reused to solve new problems involving the original matrix in different scenarios, e.g., as long as the factorization of A is obtained, it can be reused to solve the set of linear systems {b1 = Ax1 , b2 = Ax2 , . . . , bk = Axk }.
• More generally, a matrix decomposition can help us understand what happens when we multiply by the matrix, since each constituent factor corresponds to a geometrical transformation (see Section 15, p. 148).
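To make the storage and matrix-vector savings above concrete, the following is a minimal NumPy sketch (NumPy is an assumed tool here, not part of the survey). The factors D and F are generated randomly purely for illustration; in practice they would come from one of the decompositions discussed later.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 1000, 800, 10

# Hypothetical low-rank factors (in practice produced by, e.g., a truncated
# SVD or the ALS method discussed later in the survey).
D = rng.standard_normal((m, k))
F = rng.standard_normal((k, n))
A = D @ F                                   # a rank-k matrix

# Storage: k(m + n) floats for the factors versus m*n floats for A itself.
print("floats for A   :", m * n)            # 800000
print("floats for D, F:", k * (m + n))      # 18000

# Matrix-vector product through the factors: c = F x, then b = D c.
x = rng.standard_normal(n)
b_factored = D @ (F @ x)                    # about 2k(m + n) flops
b_direct = A @ x                            # about 2mn flops
print(np.allclose(b_factored, b_direct))    # True
```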
The matrix decomposition algorithms fall into many categories. Nonetheless, six categories hold the center and we sketch them here:
1. Factorizations arising from Gaussian elimination, including the LU decomposition and its positive definite alternative, the Cholesky decomposition;
2. Factorizations obtained by orthogonalizing the columns or the rows of a matrix, such that the data can be explained well in an orthonormal basis;
3. Factorizations where the matrix is skeletoned, such that a subset of the columns or the rows can represent the whole data with a small reconstruction error, while the sparsity and nonnegativity of the matrix are kept as they are;
4. Reduction to Hessenberg, tridiagonal, or bidiagonal form, so that properties of the matrix such as rank and eigenvalues can be explored in these reduced forms;
5. Factorizations resulting from the computation of the eigenvalues of matrices;
6. The rest, which can be cast as a special kind of decomposition involving optimization methods or high-level ideas, where the category may not be straightforward to determine.
The decomposition world maps in Figures 1 and 2 connect the decomposition methods by their internal relations and also separate different methods by their criteria or prerequisites. Readers will get more out of the two pictures after reading the text.

Notation and preliminaries In the rest of this section we will introduce and recap some
basic knowledge about linear algebra. The other important concepts are defined and discussed as needed for clarity. Readers with enough background in matrix analysis can skip this section. In the text, we simplify matters by considering only matrices that are real. Unless stated otherwise, the eigenvalues of the discussed matrices are also real.
We also assume throughout that || · || = || · ||2 .
In all cases, scalars will be denoted in a non-bold font possibly with subscripts (e.g., a, α,
αi ). We will use boldface lower case letters possibly with subscripts to denote vectors (e.g.,
µ, x, xn , z) and boldface upper case letters possibly with subscripts to denote matrices
(e.g., A, Lj ). The i-th element of a vector z will be denoted by zi in bold font (or zi in the
non-bold font). The i-th row and j-th column value of matrix A will be denoted by Aij if
block submatrices are involved, or by aij if block submatrices are not involved. Furthermore, it will be helpful to utilize the Matlab-style notation: the submatrix of A consisting of the i-th to j-th rows and the k-th to m-th columns will be denoted by Ai:j,k:m . When the index set is not continuous, given ordered subindex sets I and J, A[I, J] denotes the submatrix of A


obtained by extracting the rows and columns of A indexed by I and J, respectively; and
A[:, J] denotes the submatrix of A obtained by extracting the columns of A indexed by J.
And in all cases, vectors are formulated in a column rather than in a row. A row vector
will be denoted by a transpose of a column vector such as a> . A specific column vector
with values is split by the symbol “; ”, e.g., x = [1; 2; 3] is a column vector in R3 . Similarly,
a specific row vector with values is split by the symbol “, ”, e.g., y = [1, 2, 3] is a row vector
with 3 values. Further, a column vector can be denoted by the transpose of a row vector
e.g., y = [1, 2, 3]> is a column vector.
The transpose of a matrix A will be denoted by A> and its inverse will be denoted by A−1 . We will denote the p × p identity matrix by Ip . A vector or matrix of all zeros will be denoted by a boldface zero 0 whose size should be clear from context, or we denote 0p to be the vector of all zeros with p entries.

Definition 0.1: (Eigenvalue)


Given any vector space E and any linear map A : E → E, a scalar λ ∈ K is called an
eigenvalue, or proper value, or characteristic value of A if there is some nonzero vector
u ∈ E such that
Au = λu.

Definition 0.2: (Spectrum and Spectral Radius)


The set of all eigenvalues of A is called the spectrum of A and denoted by Λ(A). The
largest magnitude of the eigenvalues is known as the spectral radius ρ(A):

ρ(A) = max{|λ| : λ ∈ Λ(A)}.

Definition 0.3: (Eigenvector)


A vector u ∈ E is called an eigenvector, or proper vector, or characteristic vector of A if
u ≠ 0 and if there is some λ ∈ K such that

Au = λu,

where the scalar λ is then an eigenvalue. And we say that u is an eigenvector associated
with λ.

Moreover, the tuple (λ, u) above is said to be an eigenpair. Intuitively, the above definitions mean that multiplying matrix A by the vector u results in a new vector that is in the same direction as u, only scaled by a factor λ. For any eigenvector u, we can scale it by a scalar s such that su is still an eigenvector of A. That is why we speak of an eigenvector of A associated with eigenvalue λ rather than the eigenvector. To avoid ambiguity, we usually assume that the eigenvector is normalized to have length 1 and that its first entry is positive (or negative), since both u and −u are eigenvectors.
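As a quick numerical companion to Definitions 0.1-0.3, the following NumPy sketch (NumPy and np.linalg.eig are assumed tools, not part of the survey) checks the eigenpair relation Au = λu and computes the spectral radius.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors as columns.
eigvals, eigvecs = np.linalg.eig(A)

# Check A u = lambda u for every eigenpair (lambda, u).
for lam, u in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ u, lam * u))      # True, True

# Spectral radius rho(A) = max |lambda| over the spectrum Lambda(A).
print(max(abs(eigvals)))                    # 5.0 (eigenvalues are 5 and 2)
```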
In this survey, we will make extensive use of the notion of linear independence of a set of vectors. Two equivalent definitions are given as follows.


Definition 0.4: (Linearly Independent)


A set of vectors {a1 , a2 , . . . , am } is called linearly independent if the only combination giving x1 a1 + x2 a2 + . . . + xm am = 0 is the one with all xi ’s equal to zero. An equivalent definition is that a1 ≠ 0, and for every k > 1, the vector ak does not belong to the span of {a1 , a2 , . . . , ak−1 }.

In the study of linear algebra, every vector space has a basis and every vector is a linear
combination of members of the basis. We then define the span and dimension of a subspace
via the basis.

Definition 0.5: (Span)


If every vector v in subspace V can be expressed as a linear combination of {a1 , a2 , . . . ,
am }, then {a1 , a2 , . . . , am } is said to span V.

Definition 0.6: (Subspace)


A nonempty subset V of Rn is called a subspace if xa + yb ∈ V for every a, b ∈ V and every x, y ∈ R.

Definition 0.7: (Basis and Dimension)


A set of vectors {a1 , a2 , . . . , am } is called a basis of V if they are linearly independent
and span V. Every basis of a given subspace has the same number of vectors, and the
number of vectors in any basis is called the dimension of the subspace V. By convention,
the subspace {0} is said to have dimension zero. Furthermore, every subspace of nonzero
dimension has a basis that is orthogonal, i.e., the basis of a subspace can be chosen
orthogonal.

Definition 0.8: (Column Space (Range))


If A is an m × n real matrix, we define the column space (or range) of A to be the set
spanned by its columns:

C(A) = {y ∈ Rm : ∃x ∈ Rn , y = Ax}.

And the row space of A is the set spanned by its rows, which is equal to the column space
of A> :
C(A> ) = {x ∈ Rn : ∃y ∈ Rm , x = A> y}.

Definition 0.9: (Null Space (Nullspace, Kernel))


If A is an m × n real matrix, we define the null space (or kernel, or nullspace) of A to be
the set:
N (A) = {y ∈ Rn : Ay = 0}.


And the null space of A> is defined as

N (A> ) = {x ∈ Rm : A> x = 0}.

Both the column space of A and the null space of A> are subspaces of Rm . In fact, every vector in N (A> ) is perpendicular to C(A) and vice versa.1

1. Every vector in N (A) is also perpendicular to C(A> ) and vice versa.

Definition 0.10: (Rank)


The rank of a matrix A ∈ Rm×n is the dimension of the column space of A. That is,
the rank of A is equal to the maximal number of linearly independent columns of A, and
is also the maximal number of linearly independent rows of A. The matrix A and its
transpose A> have the same rank. We say that A has full rank, if its rank is equal to
min{m, n}. In other words, this is true if and only if either all the columns of A are linearly independent, or all the rows of A are linearly independent. Specifically, given a nonzero vector u ∈ Rm and a nonzero vector v ∈ Rn , the m × n matrix uv > is of rank 1. In short,
the rank of a matrix is equal to:
• number of linearly independent columns;
• number of linearly independent rows;
• and remarkably, these are always the same (see Corollary 0.13, p. 12).
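The rank facts above can be checked numerically; the short NumPy sketch below (NumPy is an assumed tool, not part of the survey) verifies that an outer product uv> has rank 1 and that a matrix and its transpose have the same rank.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)                  # nonzero vector in R^5
v = rng.standard_normal(3)                  # nonzero vector in R^3

# The outer product u v^T is a 5 x 3 matrix of rank 1.
print(np.linalg.matrix_rank(np.outer(u, v)))                   # 1

# Row rank equals column rank: rank(A) == rank(A^T).
A = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 9))  # rank 4
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A.T))    # 4 4
```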

Definition 0.11: (Orthogonal Complement in General)


The orthogonal complement V ⊥ of a subspace V contains every vector that is perpendic-
ular to V. That is,
V ⊥ = {v|v > u = 0, ∀u ∈ V}.
The two subspaces intersect only in the zero vector and together span the entire space. The dimensions of V and V ⊥ add to the dimension of the whole space. Furthermore, (V ⊥ )⊥ = V.

Definition 0.12: (Orthogonal Complement of Column Space)


If A is an m × n real matrix, the orthogonal complement of C(A), C ⊥ (A) is the subspace
defined as:
C ⊥ (A) = {y ∈ Rm : y > Ax = 0, ∀x ∈ Rn }
= {y ∈ Rm : y > v = 0, ∀v ∈ C(A)}.

Then we have the four fundamental spaces for any matrix A ∈ Rm×n with rank r as shown
in Theorem 0.15.

Theorem 0.13: (Row Rank Equals Column Rank)


The dimension of the column space of a matrix A ∈ Rm×n is equal to the dimension of
its row space, i.e., the row rank and the column rank of a matrix A are equal.

Proof [of Theorem 0.13, A First Way] We first notice that the null space of A is
orthogonal complementary to the row space of A: N (A)⊥C(A> ) (where the row space of
A is exactly the column space of A> ), that is, vectors in the null space of A are orthogonal
to vectors in the row space of A. To see this, suppose A has rows a>1 , a>2 , . . . , a>m , i.e., A = [a>1 ; a>2 ; . . . ; a>m ]. For any vector x ∈ N (A), we have Ax = 0, that is, [a>1 x; a>2 x; . . . ; a>m x] = 0. And since the row space of A is spanned by a>1 , a>2 , . . . , a>m , x is perpendicular to any vector from C(A> ), which means N (A)⊥C(A> ).
Now suppose, the dimension of row space of A is r. Let r1 , r2 , . . . , rr be a set of
vectors in Rn and form a basis for the row space. Then the r vectors Ar1 , Ar2 , . . . , Arr
are in the column space of A, which are linearly independent. To see this, suppose we
have a linear combination of the r vectors: x1 Ar1 + x2 Ar2 + . . . + xr Arr = 0, that is,
A(x1 r1 + x2 r2 + . . . + xr rr ) = 0 and the vector v = x1 r1 + x2 r2 + . . . + xr rr is in null space
of A. But since r1 , r2 , . . . , rr is a basis for the row space of A, v is thus also in the row space
of A. We have shown that vectors from the null space of A are perpendicular to vectors from the row space of A, thus v > v = 0 and x1 = x2 = . . . = xr = 0. Then Ar1 , Ar2 , . . . , Arr are in the column space of A and they are linearly independent, which means the dimension of the column space of A is at least r. This result shows that the row rank of A ≤ the column rank of A.
If we apply this process again to A> , we will have the column rank of A ≤ the row rank of A. This completes the proof.

Further information that can be drawn from this proof is that if r1 , r2 , . . . , rr is a set of vectors in Rn that forms a basis for the row space, then Ar1 , Ar2 , . . . , Arr is a basis for
the column space of A. We formulate this finding into the following lemma.

Lemma 0.14: (Column Basis from Row Basis)


For any matrix A ∈ Rm×n , suppose that {r1 , r2 , . . . , rr } is a set of vectors in Rn which
forms a basis for the row space, then {Ar1 , Ar2 , . . . , Arr } is a basis for the column space
of A.

For any matrix A ∈ Rm×n , it can be easily verified that any vector in the row space of A
is perpendicular to any vector in the null space of A. Suppose xn ∈ N (A), then Axn = 0
such that xn is perpendicular to every row of A which agrees with our claim.
Similarly, we can also show that any vector in the column space of A is perpendicular
to any vector in the null space of A> . Further, the column space of A together with the
null space of A> span the whole Rm which is known as the fundamental theorem of linear
algebra.
The fundamental theorem contains two parts, the dimension of the subspaces and the
orthogonality of the subspaces. The orthogonality can be easily verified as shown above.
Moreover, when the row space has dimension r, the null space has dimension n − r. This is less obvious, and we prove it in the following theorem.


Figure 3: Two pairs of orthogonal subspaces in Rn and Rm . dim(C(A> )) + dim(N (A)) = n and dim(N (A> )) + dim(C(A)) = m. The null space component goes to zero as Axn = 0 ∈ Rm . The row space component goes to the column space as Axr = A(xr + xn ) = b ∈ C(A).

Theorem 0.15: (The Fundamental Theorem of Linear Algebra)


Orthogonal Complement and Rank-Nullity Theorem: for any matrix A ∈ Rm×n , we
have
• N (A) is orthogonal complement to the row space C(A> ) in Rn : dim(N (A)) +
dim(C(A> )) = n;
• N (A> ) is orthogonal complement to the column space C(A) in Rm : dim(N (A> ))+
dim(C(A)) = m;
• For rank-r matrix A, dim(C(A> )) = dim(C(A)) = r, that is, dim(N (A)) = n−r
and dim(N (A> )) = m − r.

Proof [of Theorem 0.15] Following the proof of Theorem 0.13, let r1 , r2 , . . . , rr be a set of vectors in Rn that form a basis for the row space; then Ar1 , Ar2 , . . . , Arr is a basis
for the column space of A. Let n1 , n2 , . . . , nk ∈ Rn form a basis for the null space of A.
Following again from the proof of Theorem 0.13, N (A)⊥C(A> ), thus, r1 , r2 , . . . , rr are per-
pendicular to n1 , n2 , . . . , nk . Then, {r1 , r2 , . . . , rr , n1 , n2 , . . . , nk } is linearly independent
in Rn .
For any vector x ∈ Rn , Ax is in the column space of A. Then it can be expressed by a combination of Ar1 , Ar2 , . . . , Arr : Ax = a1 Ar1 + a2 Ar2 + . . . + ar Arr , which states that A(x − a1 r1 − a2 r2 − . . . − ar rr ) = 0, and x − a1 r1 − a2 r2 − . . . − ar rr is thus in N (A). Since {n1 , n2 , . . . , nk } is a basis for the null space of A, x − a1 r1 − . . . − ar rr can be expressed by a combination of n1 , n2 , . . . , nk : x − a1 r1 − . . . − ar rr = b1 n1 + b2 n2 + . . . + bk nk , i.e., x = a1 r1 + . . . + ar rr + b1 n1 + . . . + bk nk . That is, any vector x ∈ Rn can be expressed by {r1 , r2 , . . . , rr , n1 , n2 , . . . , nk } and the set forms a basis for Rn . Thus the dimensions sum to n: r + k = n, i.e., dim(N (A)) + dim(C(A> )) = n. Similarly, we can
prove dim(N (A> )) + dim(C(A)) = m.

Figure 3 demonstrates two pairs of such orthogonal subspaces and shows how A takes
x into the column space. The dimensions of the row space of A and the null space of A
add to n. And the dimensions of the column space of A and the null space of A> add to
m. The null space component goes to zero as Axn = 0 ∈ Rm which is the intersection of
column space of A and null space of A> . The row space component goes to column space
as Axr = A(xr + xn ) = b ∈ Rm .
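The dimension counts in Theorem 0.15 can also be verified numerically. The sketch below assumes SciPy's scipy.linalg.null_space helper (an assumption for illustration; the survey itself does not rely on it).

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
m, n, r = 7, 5, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank r

rank = np.linalg.matrix_rank(A)
dim_null_A = null_space(A).shape[1]         # dim N(A)
dim_null_At = null_space(A.T).shape[1]      # dim N(A^T)

print(rank + dim_null_A == n)               # True: r + (n - r) = n
print(rank + dim_null_At == m)              # True: r + (m - r) = m

# Orthogonality: A maps every null-space vector to zero,
# so null-space vectors are perpendicular to the rows of A.
print(np.allclose(A @ null_space(A), 0))    # True
```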

Definition 0.16: (Orthogonal Matrix)


A real square matrix Q is an orthogonal matrix if the inverse of Q equals the transpose
of Q, that is, Q−1 = Q> and QQ> = Q> Q = I. In other words, suppose Q = [q1 , q2 , . . . , qn ] where qi ∈ Rn for all i ∈ {1, 2, . . . , n}; then qi> qj = δ(i, j) with δ(i, j) being the Kronecker delta function. If Q contains only γ of these columns with γ < n, then Q> Q = Iγ still holds, with Iγ being the γ × γ identity matrix, but QQ> = I will not be true. For any vector x, an orthogonal matrix preserves the length: ||Qx|| = ||x||.
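A small numerical check of Definition 0.16 is given below; it obtains an orthogonal Q from a QR factorization (an assumed construction for illustration, not part of the definition) and verifies the stated properties.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # Q is 4 x 4 orthogonal

print(np.allclose(Q.T @ Q, np.eye(4)))             # True: Q^T Q = I
print(np.allclose(Q @ Q.T, np.eye(4)))             # True: Q Q^T = I

# Length preservation: ||Q x|| = ||x||.
x = rng.standard_normal(4)
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))   # True

# With only gamma < n columns, Q_g^T Q_g = I_gamma but Q_g Q_g^T != I.
Qg = Q[:, :2]
print(np.allclose(Qg.T @ Qg, np.eye(2)))           # True
print(np.allclose(Qg @ Qg.T, np.eye(4)))           # False
```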

Definition 0.17: (Permutation Matrix)


A permutation matrix P is a square binary matrix that has exactly one entry of 1 in each
row and each column and 0’s elsewhere.
Row Point That is, the permutation matrix P has the rows of the identity I in some order, and that order decides the sequence of the row permutation. Suppose we want to permute the rows of matrix A; we just multiply A on the left by P , i.e., P A.
Column Point Or, equivalently, the permutation matrix P has the columns of the identity I in some order, and that order decides the sequence of the column permutation. The column permutation of A is then obtained by multiplying on the right, i.e., AP .

The permutation matrix P can be more efficiently represented via a vector J ∈ Zn+ of indices such that P = I[:, J], where I is the n × n identity matrix and, notably, the elements in vector J sum to 1 + 2 + . . . + n = (n2 + n)/2.
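The index-vector representation P = I[:, J] can be sketched directly in NumPy (an assumed tool; note that NumPy indices are 0-based, so the sum-of-indices property reads 0 + 1 + . . . + (n − 1) here):

```python
import numpy as np

n = 4
J = np.array([2, 0, 3, 1])              # an ordering of the indices {0, 1, 2, 3}
P = np.eye(n)[:, J]                     # permutation matrix represented by J

A = np.arange(16.0).reshape(n, n)

print(np.allclose(A @ P, A[:, J]))      # True: right multiplication permutes columns
print(np.allclose(P.T @ A, A[J, :]))    # True: P^T on the left permutes rows
print(np.allclose(P.T @ P, np.eye(n)))  # True: a permutation matrix is orthogonal
```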


Part I
Gaussian Elimination
1. LU Decomposition
Perhaps the best known and the first matrix decomposition we should know about is the LU decomposition. We state the result in the following theorem; the proof of its existence is deferred to the next sections.

Theorem 1.1: (LU Decomposition with Permutation)


Every nonsingular n × n square matrix A can be factored as

A = P LU ,

where P is a permutation matrix, L is a unit lower triangular matrix (i.e., lower triangular
matrix with all 1’s on the diagonal), and U is a nonsingular upper triangular matrix.

Note that, in the remainder of this text, we will put the decomposition-related results
in the blue box. And other claims will be in a gray box. This rule will be applied for the
rest of the survey without special mention.

Remark 1.2: (Decomposition Notation)


The above decomposition applies to any nonsingular matrix A. We will see that this de-
composition arises from the elimination steps in which case row operations of subtraction
and exchange of two rows are allowed where the subtractions are recorded in matrix L
and the row exchanges are recorded in matrix P . To make this row exchange explicit, the
common form for the above decomposition is QA = LU where Q = P > that records the
exact row exchanges of the rows of A. Otherwise, the P would record the row exchanges
of LU . In our case, we will make the decomposition to be clear for matrix A rather than
for QA. For this reason, we will put the permutation matrix on the right-hand side of
the equation for the remainder of the text without special mention.

Specifically, in some cases, we will not need the permutation matrix. This decomposition
relies on the leading principal minors. We provide the definition which is important for the
illustration.

Definition 1.3: (Leading Principal Minors)


Let A be an n × n square matrix. A k × k submatrix of A obtained by deleting the last
n − k columns and the last n − k rows from A is called a k-th order leading principal
submatrix of A, that is, the k × k submatrix taken from the top left corner of A.
The determinant of the k × k leading principal submatrix is called a k-th order leading
principal minor of A.

Under mild conditions on the leading principal minors of matrix A, the LU decomposi-
tion will not involve the permutation matrix.


Theorem 1.4: (LU Decomposition without Permutation)


For any n × n square matrix A, if all the leading principal minors are nonzero, i.e.,
det(A1:k,1:k ) ≠ 0 for all k ∈ {1, 2, . . . , n}, then A can be factored as

A = LU ,

where L is a unit lower triangular matrix (i.e., lower triangular matrix with all 1’s on the
diagonal), and U is a nonsingular upper triangular matrix.
Specifically, this decomposition is unique. See Corollary 1.8.

Remark 1.5: (Other Forms of the LU Decomposition without Permutation)


That the leading principal minors are nonzero means, in other words, that the leading principal submatrices are nonsingular.
Singular A In the above theorem, we assume A is nonsingular as well. The LU decomposition also exists for a singular matrix A; however, the matrix U will be singular as well in this case. As shown in the following section, if matrix A is singular, some pivots will be zero, and the corresponding diagonal values of U will be zero.
Singular leading principal submatrices Even if we assume matrix A is nonsingular, the leading principal submatrices might be singular. If some of the leading principal minors are zero, an LU decomposition without permutation may still exist, but if so, it is again not unique.

We will discuss where this decomposition comes from in the next section. There are
also generalizations of LU decomposition to non-square or singular matrices, such as rank-
revealing LU decomposition. Please refer to (Pan, 2000; Miranian and Gu, 2003; Dopico et al., 2006); we will also have a short discussion in Section 1.10.

1.1 Relation to Gaussian Elimination


Solving the linear system Ax = b is a basic problem in linear algebra. Gaussian elimination transforms a linear system into an upper triangular one by applying simple elementary row transformations on the left of the linear system in n − 1 stages if A ∈ Rn×n . The resulting system is then much easier to solve by backward substitution. The elementary
transformation is defined rigorously as follows.

Definition 1.6: (Elementary Transformation)


For square matrix A, the following three transformations are referred as elementary
row/column transformations:
1. Interchanging two rows (or columns) of A;
2. Multiplying all elements of a row (or a column) of A by some nonzero number;


3. Adding any row (or column) of A multiplied by a nonzero number to any other row
(or column);

Specifically, the elementary row transformations used in Gaussian elimination correspond to multiplying A on the left by unit lower triangular matrices, and the corresponding elementary column transformations correspond to multiplying A on the right by unit upper triangular matrices.
Gaussian elimination is described by the third type of elementary row transformation above. Suppose the upper triangular matrix obtained by Gaussian elimination is given by
U = En−1 En−2 . . . E1 A, and in the k-th stage, the k-th column of Ek−1 Ek−2 . . . E1 A is
x ∈ Rn . Gaussian elimination will introduce zeros below the diagonal of x by

Ek = I − zk e>
k,

where ek ∈ Rn is the k-th unit basis vector, and zk ∈ Rn is given by

zk = [0, . . . , 0, zk+1 , . . . , zn ]> ,    zi = xi / xk ,    ∀i ∈ {k + 1, . . . , n}.
We realize that Ek is a unit lower triangular matrix (with 1’s on the diagonal) with only the k-th column of the lower submatrix being nonzero,

       [ 1   . . .   0       0   . . .  0 ]
       [ ..   . .    ..      ..   . .   .. ]
       [ 0   . . .   1       0   . . .  0 ]
  Ek = [ 0   . . .  −zk+1    1   . . .  0 ] ,
       [ ..   . .    ..      ..   . .   .. ]
       [ 0   . . .  −zn      0   . . .  1 ]

and multiplying on the left by Ek will introduce zeros below the diagonal:

         [ 1   . . .   0       0   . . .  0 ] [  x1  ]   [ x1 ]
         [ ..   . .    ..      ..   . .   .. ] [  ..  ]   [ ..  ]
         [ 0   . . .   1       0   . . .  0 ] [  xk  ]   [ xk ]
  Ek x = [ 0   . . .  −zk+1    1   . . .  0 ] [ xk+1 ] = [  0  ] .
         [ ..   . .    ..      ..   . .   .. ] [  ..  ]   [ ..  ]
         [ 0   . . .  −zn      0   . . .  1 ] [  xn  ]   [  0  ]
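The construction Ek = I − zk e>k can be sketched directly; the following Python function (a hedged illustration with 0-based indexing, not the survey's code) builds Ek for a given vector x and verifies that it zeroes the entries below position k.

```python
import numpy as np

def gaussian_transform(x, k):
    """Return E_k = I - z_k e_k^T so that (E_k x) has zeros below position k."""
    n = x.shape[0]
    z = np.zeros(n)
    z[k + 1:] = x[k + 1:] / x[k]            # z_i = x_i / x_k for i > k
    e_k = np.zeros(n)
    e_k[k] = 1.0
    return np.eye(n) - np.outer(z, e_k)     # unit lower triangular

x = np.array([3.0, 6.0, 9.0, 12.0])
E0 = gaussian_transform(x, 0)
print(E0 @ x)                               # [3. 0. 0. 0.]
print(np.diag(E0))                          # [1. 1. 1. 1.]: unit diagonal
```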

For example, we write out the Gaussian elimination steps for a 4 × 4 matrix. For simplicity,
we assume there are no row permutations. And in the following matrix,  represents a
value that is not necessarily zero, and boldface indicates the value has just been changed.
A Trivial Gaussian Elimination For a 4 × 4 Matrix:

  [     ]        [     ]        [     ]        [     ]
  [     ]   E1   [ 0     ]   E2   [ 0     ]   E3   [ 0     ]
  [     ]   −→   [ 0     ]   −→   [ 0  0    ]   −→   [ 0  0    ]        (1.1)
  [     ]        [ 0     ]        [ 0  0    ]        [ 0  0  0   ]
       A               E1 A              E2 E1 A             E3 E2 E1 A

where E1 , E2 , E3 are lower triangular matrices. Specifically, as discussed above, Gaus-


sian transformation matrices Ei ’s are unit lower triangular matrices with 1’s on the diag-
onal. This can be explained that for the k-th transformation Ek , working on the matrix
Ek−1 . . . E1 A, the transformation subtracts multiples of the k-th row from rows {k + 1, k +
2, . . . , n} to get zeros below the diagonal in the k-th column of the matrix; it never uses rows {1, 2, . . . , k − 1}.
For the transformation example above, at step 1, we multiply on the left by E1 so that
multiples of the 1-st row are subtracted from rows 2, 3, 4 and the first entries of rows 2, 3, 4
are set to zero. Similar situations for step 2 and step 3. By setting L = E1−1 E2−1 E3−1
and letting the matrix after elimination be U , 2 we get A = LU . Thus we obtain an LU
decomposition for this 4 × 4 matrix A.

Definition 1.7: (Pivot)


The first nonzero entry in a row after each elimination step is called a pivot. For example, the leading nonzero diagonal entries of the intermediate matrices in Equation (1.1) are pivots.

But sometimes, it can happen that the value of A11 is zero. No E1 can make the next
elimination step successful. So we need to interchange the first row and the second row via
a permutation matrix P1 . This is known as the pivoting, or permutation.
Gaussian Elimination With a Permutation In the Beginning:

  [ 0     ]        [     ]        [     ]
  [     ]   P1   [ 0     ]   E1   [ 0     ]
  [     ]   −→   [     ]   −→   [ 0     ]
  [     ]        [     ]        [ 0     ]
       A               P1 A             E1 P1 A

         [     ]        [     ]
   E2    [ 0     ]   E3   [ 0     ]
   −→    [ 0  0    ]   −→   [ 0  0    ]
         [ 0  0    ]        [ 0  0  0   ]
            E2 E1 P1 A         E3 E2 E1 P1 A

By setting L = E1−1 E2−1 E3−1 and P = P1−1 , we get A = P LU . Therefore we obtain a full
LU decomposition with permutation for this 4 × 4 matrix A.
In some situations, other permutation matrices P2 , P3 , . . . will appear in between the
lower triangular Ei ’s. An example is shown as follows.
Gaussian Elimination With a Permutation In Between:

  [     ]        [     ]        [     ]        [     ]
  [     ]   E1   [ 0  0    ]   P1   [ 0     ]   E2   [ 0     ]
  [     ]   −→   [ 0     ]   −→   [ 0  0    ]   −→   [ 0  0    ]
  [     ]        [ 0     ]        [ 0     ]        [ 0  0  0   ]
       A               E1 A              P1 E1 A             E2 P1 E1 A
2. The inverses of unit lower triangular matrices are also unit lower triangular matrices. And the products
of unit lower triangular matrices are also unit lower triangular matrices.


In this case, we find U = E2 P1 E1 A. In Section 1.3 or Section 1.9.1, we will show that the
permutations in-between will result in the same form A = P LU where P takes account of
all the permutations.
The above examples can be easily extended to any n × n matrix if we assume there are no row permutations in the process, and we will have n − 1 such lower triangular transformations. The k-th transformation Ek introduces zeros below the diagonal in the k-th column of A by subtracting multiples of the k-th row from rows {k + 1, k + 2, . . . , n}. Finally, by setting L = E1−1 E2−1 · · · En−1−1 , we obtain the LU decomposition A = LU (without permutation).
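Putting the n − 1 transformations together gives a simple LU routine without pivoting; the sketch below (an illustration assuming all pivots are nonzero, not a robust implementation) accumulates the multipliers into L while eliminating in U.

```python
import numpy as np

def lu_no_pivot(A):
    """Return unit lower triangular L and upper triangular U with A = L U.
    Assumes every pivot encountered is nonzero (nonzero leading minors)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), A.copy()
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]             # multipliers of E_k
        U[k + 1:, :] -= np.outer(L[k + 1:, k], U[k, :])   # eliminate column k
    return L, U

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
L, U = lu_no_pivot(A)
print(np.allclose(L @ U, A))                # True
print(np.allclose(U, np.triu(U)))           # True: U is upper triangular
```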

1.2 Existence of the LU Decomposition without Permutation


The Gaussian elimination or Gaussian transformation shows the origin of the LU decompo-
sition. We then prove Theorem 1.4 rigorously, i.e., the existence of the LU decomposition
without permutation by induction.
Proof [of Theorem 1.4: LU Decomposition without Permutation] We will prove
by induction that every n × n square matrix A with nonzero leading principal minors has a
decomposition A = LU . The 1 × 1 case is trivial by setting L = 1, U = A, thus, A = LU .
Suppose any k × k matrix Ak with all of its leading principal minors nonzero has an LU decomposition without permutation. If we prove any (k + 1) × (k + 1) matrix Ak+1
can also be factored as this LU decomposition without permutation, then we complete the
proof.
For any (k+1)×(k+1) matrix Ak+1 , suppose the k-th order leading principal submatrix
of Ak+1 is Ak with the size of k × k. Then Ak can be factored as Ak = Lk Uk , with Lk being a unit lower triangular matrix and Uk being a nonsingular upper triangular matrix, from the assumption. Write out Ak+1 as Ak+1 = [Ak , b; c> , d]. Then it admits the factorization

Ak+1 = [Ak , b; c> , d] = [Lk , 0; x> , 1] [Uk , y; 0, z] = Lk+1 Uk+1 ,

where b = Lk y, c> = x> Uk , d = x> y + z, Lk+1 = [Lk , 0; x> , 1], and Uk+1 = [Uk , y; 0, z]. From the assumption, Lk and Uk are nonsingular. Therefore

y = Lk−1 b,    x> = c> Uk−1 ,    z = d − x> y.

If further we can prove that z is nonzero, such that Uk+1 is nonsingular, we complete the proof. Since all the leading principal minors of Ak+1 are nonzero, we have det(Ak+1 ) = det(Ak ) · det(d − c> Ak−1 b) = det(Ak ) · (d − c> Ak−1 b) ≠ 0, 3 where d − c> Ak−1 b is a scalar. As det(Ak ) ≠ 0 from the assumption, we obtain d − c> Ak−1 b ≠ 0. Substituting b = Lk y and c> = x> Uk into the formula, we have d − x> Uk Ak−1 Lk y = d − x> Uk (Lk Uk )−1 Lk y = d − x> y ≠ 0, which is exactly the statement that z ≠ 0. Thus we find Lk+1 with all the values on the diagonal being 1, and Uk+1 with all the values on the diagonal being nonzero, which means Lk+1 and Uk+1 are nonsingular, 4 from which the result follows.

3. By the fact that if matrix M has the block formulation M = [A, B; C, D], then det(M ) = det(A) det(D − CA−1 B).

We further prove that if no permutation is involved, the LU decomposition is unique.

Corollary 1.8: (Uniqueness of the LU Decomposition without Permutation)


Suppose the n × n square matrix A has nonzero leading principal minors. Then, the LU
decomposition is unique.

Proof [of Corollary 1.8] Suppose the LU decomposition is not unique; then we can find two decompositions such that A = L1 U1 = L2 U2 , which implies L2−1 L1 = U2 U1−1 . The left-hand side of the equation is a unit lower triangular matrix and the right-hand side is an upper triangular matrix. This implies both sides of the above equation are diagonal matrices. Since the inverse of a unit lower triangular matrix is also a unit lower triangular matrix, and the product of unit lower triangular matrices is also a unit lower triangular matrix, it follows that L2−1 L1 = I. The equality implies that both sides are the identity, such that L1 = L2 and U1 = U2 , which leads to a contradiction.

In the proof of Theorem 1.4, we have shown that the diagonal values of the upper
triangular matrix are all nonzero if the leading principal minors of A are all nonzero. We
can then formulate this decomposition in another form if we divide each row of U by its diagonal value. This is called the LDU decomposition.

Corollary 1.9: (LDU Decomposition)


For any n × n square matrix A, if all the leading principal minors are nonzero, i.e.,
det(A1:k,1:k ) ≠ 0 for all k ∈ {1, 2, . . . , n}, then A can be uniquely factored as

A = LDU ,

where L is a unit lower triangular matrix, U is a unit upper triangular matrix, and D
is a diagonal matrix.

The proof is trivial: from the LU decomposition A = LR, we can find a diagonal matrix D = diag(R11 , R22 , . . . , Rnn ) such that D −1 R = U is a unit upper triangular matrix. The uniqueness comes from the uniqueness of the LU decomposition.
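A small sketch of the LDU construction just described: given A = LR with nonzero diagonal in R, divide each row of R by its diagonal entry. The helper name ldu_from_lu and the example factors are assumptions for illustration only.

```python
import numpy as np

def ldu_from_lu(L, R):
    """Given A = L R with R upper triangular and nonzero diagonal, return
    (L, D, U) with D diagonal and U unit upper triangular, so A = L D U."""
    d = np.diag(R)
    return L, np.diag(d), R / d[:, None]    # divide the i-th row of R by R_ii

L = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [4.0, 3.0, 1.0]])
R = np.array([[2.0, 1.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 2.0]])
A = L @ R

L_, D, U = ldu_from_lu(L, R)
print(np.allclose(L_ @ D @ U, A))           # True
print(np.diag(U))                           # [1. 1. 1.]: unit upper triangular
```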

1.3 Existence of the LU Decomposition with Permutation


In Theorem 1.4, we require that matrix A has nonzero leading principal minors. However, this is not necessary. Even when some leading principal minors are zero, nonsingular matrices still have an LU decomposition, but with an additional permutation. The proof is again by induction.
4. A triangular matrix (upper or lower) is nonsingular if and only if all the entries on its main diagonal are
nonzero.


Proof [of Theorem 1.1: LU Decomposition with Permutation] We note that any
1 × 1 nonsingular matrix has a full LU decomposition A = P LU by simply setting P = 1,
L = 1, U = A. We will show that if every (n − 1) × (n − 1) nonsingular matrix has a full
LU decomposition, then this is also true for every n × n nonsingular matrix. By induction,
we prove that every nonsingular matrix has a full LU decomposition.
We will formulate the proof in the following order. If A is nonsingular, then its row-permuted matrix B is also nonsingular, and the Schur complement of B11 in B is also nonsingular. Finally, we construct the decomposition of A from that of B.
We notice that at least one element in the first column of A must be nonzero otherwise
A would be singular. We can then apply a row permutation that makes the (1, 1) entry nonzero. That is, there exists a permutation P1 such that B = P1 A, in which case B11 ≠ 0. Since A and P1 are both nonsingular and the product of nonsingular matrices
is also nonsingular, then B is also nonsingular.
Schur complement of B11 in B is also nonsingular:
Now consider the Schur complement of B11 in B with size (n − 1) × (n − 1):

B̄ = B2:n,2:n − (1/B11 ) B2:n,1 B1,2:n .

Suppose an (n − 1)-vector x satisfies

B̄x = 0.     (1.2)

Then the scalar y = −(1/B11 ) B1,2:n x and x satisfy

B [y; x] = [B11 , B1,2:n ; B2:n,1 , B2:n,2:n ] [y; x] = [0; 0].

Since B is nonsingular, x and y must be zero. Hence, Equation (1.2) holds only if x = 0, which means that the null space of B̄ has dimension 0 and thus B̄ is nonsingular with size (n − 1) × (n − 1).

By the induction assumption, any (n − 1) × (n − 1) nonsingular matrix can be factorized in the full LU decomposition form, so

B̄ = P2 L2 U2 .

We then factor A as

A = P1> [B11 , B1,2:n ; B2:n,1 , B2:n,2:n ]
  = P1> [1, 0; 0, P2 ] [B11 , B1,2:n ; P2> B2:n,1 , P2> B2:n,2:n ]
  = P1> [1, 0; 0, P2 ] [B11 , B1,2:n ; P2> B2:n,1 , L2 U2 + (1/B11 ) P2> B2:n,1 B1,2:n ]
  = P1> [1, 0; 0, P2 ] [1, 0; (1/B11 ) P2> B2:n,1 , L2 ] [B11 , B1,2:n ; 0, U2 ].

Therefore, we find the full LU decomposition of A = P LU by defining

P = P1> [1, 0; 0, P2 ],    L = [1, 0; (1/B11 ) P2> B2:n,1 , L2 ],    U = [B11 , B1,2:n ; 0, U2 ],

from which the result follows.

1.4 Bandwidth Preserving in the LU Decomposition without Permutation


The bandwidth of a matrix is defined as follows.

Definition 1.10: (Matrix Bandwidth)


For a matrix A ∈ Rn×n with (i, j)-th entry denoted by Aij , A has upper bandwidth q if Aij = 0 for j > i + q, and lower bandwidth p if Aij = 0 for i > j + p. An example of a 6 × 6 matrix with upper bandwidth 2 and lower bandwidth 3 is shown as follows:

  [          0   0   0 ]
  [              0   0 ]
  [                  0 ]
  [                    ]
  [ 0                   ]
  [ 0   0                ] .

Then, we prove that the bandwidth after the LU decomposition without permutation is
preserved.

Lemma 1.11: (Bandwidth Preserving)


For any matrix A ∈ Rn×n with upper bandwidth q and lower bandwidth p. If A has an
LU decomposition A = LU , then U has upper bandwidth q and L has lower bandwidth
p.

Proof [of Lemma 1.11] The LU decomposition without permutation can be obtained as
follows:

A = [A11 , A1,2:n ; A2:n,1 , A2:n,2:n ] = [1, 0; (1/A11 ) A2:n,1 , In−1 ] [A11 , A1,2:n ; 0, S] = L1 U1 ,

where S = A2:n,2:n − (1/A11 ) A2:n,1 A1,2:n is the Schur complement of A11 in A. We can name
this decomposition of A as the s-decomposition of A. The first column of L1 and the first
row of U1 have the required structure (bandwidth p and q, respectively), and the Schur complement S of A11 again has upper bandwidth at most q and lower bandwidth at most p.
The result follows by induction on the s-decomposition of S.
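The band-preserving property can be checked numerically. The sketch below repeats a no-pivot elimination routine so that it is self-contained (an illustration, not the survey's code) and measures the bandwidths of A, L, and U on a small banded example.

```python
import numpy as np

def lu_no_pivot(A):
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), A.copy()
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, :] -= np.outer(L[k + 1:, k], U[k, :])
    return L, U

def bandwidths(M, tol=1e-12):
    i, j = np.nonzero(np.abs(M) > tol)
    return int((i - j).max()), int((j - i).max())   # (lower p, upper q)

# 5 x 5 matrix with lower bandwidth p = 2 and upper bandwidth q = 1.
A = np.array([[4., 1., 0., 0., 0.],
              [1., 4., 1., 0., 0.],
              [2., 1., 4., 1., 0.],
              [0., 2., 1., 4., 1.],
              [0., 0., 2., 1., 4.]])
L, U = lu_no_pivot(A)
print(bandwidths(A))          # (2, 1)
print(bandwidths(L)[0] <= 2)  # True: L keeps lower bandwidth at most p
print(bandwidths(U)[1] <= 1)  # True: U keeps upper bandwidth at most q
```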


1.5 Block LU Decomposition


Another form of the LU decomposition is to factor the matrix into block triangular matrices.

Theorem 1.12: (Block LU Decomposition without Permutation)


For any n × n square matrix A, if the first m leading principal block submatrices are nonsingular, then A can be factored as

             [  I                          ] [ U11  U12  . . .    U1m   ]
             [ L21    I                    ] [      U22  . . .    U2m   ]
  A = LU =   [  ..          ..             ] [             ..   Um−1,m  ] ,
             [ Lm1   . . .  Lm,m−1    I    ] [                    Umm   ]

where the Lij ’s and Uij ’s are some block matrices.

Specifically, this decomposition is unique.

Note that the U in the above theorem is not necessarily upper triangular. An example can be shown as follows:

      [  0   1   1   1 ]   [ 1   0   0   0 ] [  0   1   1   1 ]
  A = [ −1   2  −1   2 ] = [ 0   1   0   0 ] [ −1   2  −1   2 ] .
      [  2   1   4   2 ]   [ 5  −2   1   0 ] [  0   0  −3   1 ]
      [  1   2   3   3 ]   [ 4  −1   0   1 ] [  0   0  −2   1 ]

The trivial non-block LU decomposition fails on A since the entry (1, 1) is zero. However,
the block LU decomposition exists.

1.6 Application: Linear System via the LU Decomposition


Consider the well-determined linear system Ax = b with A of size n × n and nonsingular. To avoid solving the system by computing the inverse of A, we solve the linear system via the LU decomposition. Suppose A admits the LU decomposition A = P LU . The solution is given by the following algorithm.

Algorithm 1 Solving Linear Equations by LU Decomposition


Require: matrix A is nonsingular and square with size n × n, solve Ax = b;
1: LU Decomposition: factor A as A = P LU ; . (2/3)n3 flops
2: Permutation: w = P > b; .0 flops
3: Forward substitution: solve Lv = w; . 1 + 3 + ... + (2n − 1) = n2 flops
4: Backward substitution: solve U x = v; . 1 + 3 + ... + (2n − 1) = n2 flops

The complexity of the decomposition step is (2/3)n3 flops (Lu, 2021c), the backward
and forward substitution steps both cost 1 + 3 + ... + (2n − 1) = n2 flops. Therefore, the
total cost of solving the linear system via the LU factorization is (2/3)n3 + 2n2 flops. If we keep only the leading term, Algorithm 1 costs (2/3)n3 flops, where most of the cost comes from the LU decomposition.
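Algorithm 1 can be sketched with SciPy, whose scipy.linalg.lu returns P, L, U with A = P LU, matching the convention above; solve_triangular performs the forward and backward substitutions. This is an illustration, not the survey's reference implementation.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

def solve_via_lu(A, b):
    P, L, U = lu(A)                              # step 1: A = P L U
    w = P.T @ b                                  # step 2: w = P^T b
    v = solve_triangular(L, w, lower=True)       # step 3: forward substitution
    return solve_triangular(U, v, lower=False)   # step 4: backward substitution

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)
x = solve_via_lu(A, b)
print(np.allclose(A @ x, b))                     # True
```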


Linear system via the block LU decomposition For a block LU decomposition of


A = LU , we need to solve Lv = w and U x = v. But the latter system is not triangular
and requires some extra computations.

1.7 Application: Computing the Inverse of Nonsingular Matrices


By Theorem 1.1, for any nonsingular matrix A ∈ Rn×n , we have a full LU factorization
A = P LU . Then the inverse can be obtained by solving the matrix equation

AX = I,

which amounts to computing n linear systems: Axi = ei for all i ∈ {1, 2, . . . , n}, where xi is the i-th column of X and ei is the i-th column of I (i.e., the i-th unit vector).

Theorem 1.13: (Inverse of Nonsingular Matrix by Linear System)


Computing the inverse of a nonsingular matrix A ∈ Rn×n by n linear systems needs
∼ (2/3)n3 + n(2n2 ) = (8/3)n3 flops where (2/3)n3 comes from the computation of the
LU decomposition of A.

The proof is trivial by using Algorithm 1. However, the complexity can be reduced by
taking the advantage of the structures of U , L. We find that the inverse of the nonsingular
matrix is A−1 = U −1 L−1 P −1 = U −1 L−1 P T . By taking this advantage, the complexity is
reduced from (8/3)n3 to 2n3 flops.

1.8 Application: Computing the Determinant via the LU Decomposition


We can find the determinant easily by using the LU decomposition. If A = LU , then
det(A) = det(LU ) = det(L) det(U ) = U11 U22 . . . Unn where Uii is the i-th diagonal of U
for i ∈ {1, 2, . . . , n}. 5
Further, for the LU decomposition with permutation A = P LU , det(A) = det(P LU ) =
det(P )U11 U22 . . . Unn . The determinant of a permutation matrix is either 1 or –1 because
after changing rows around (which changes the sign of the determinant 6 ) a permutation
matrix becomes identity matrix I, whose determinant is one.

1.9 Pivoting
1.9.1 Partial Pivoting
In practice, it is desirable to pivot even when it is not necessary. When dealing with a linear
system via the LU decomposition as shown in Algorithm 1, if the diagonal entries of U are
small, it can lead to inaccurate solutions for the linear solution. Thus, it is common to pick
the largest entry to be the pivot. This is known as the partial pivoting. For example,

5. The determinant of a lower triangular matrix (or an upper triangular matrix) is the product of the
diagonal entries.
6. The determinant changes sign when two rows are exchanged (sign reversal).

25
Jun Lu

Partial Pivoting For a 4 × 4 Matrix:


       
               
 0 2    −→
    E1   P1  0 7   E2  0 7  
 −→
 0 5   −→  0 0   , (1.3)
    
     0 5  
    0 7   0 2   0 0 0 
A E1 A P1 E1 A E2 P1 E1 A

in which case, we pick 7 as the pivot after transformation by E1 even when it is not necessary.
This interchange permutation can guarantee that no multiplier is greater than 1 in absolute
value during the Gaussian elimination. More generally, the procedure for computing the
LU decomposition with partial pivoting of A ∈ Rn×n is given in Algorithm 2.

Algorithm 2 LU Decomposition with Partial Pivoting


Require: Matrix A with size n × n;
1: Let U = A;
2: for k = 1 to n − 1 do . i.e., get the k-th column of U
3: Find a row permutation Pk that swaps Ukk with the largest element in |Uk:n,k |;
4: U = Pk U ;
5: Determine the Gaussian transformation Ek to introduce zeros below the diagonal of
the k-th column of U ;
6: U = Ek U ;
7: end for
8: Output U ;

The algorithm requires 2/3(n3 ) flops and (n−1)+(n−2)+. . .+1 ∼ O(n2 ) comparisons
resulting from the pivoting procedure. Upon completion, the upper triangular matrix U is
given by
U = En−1 Pn−1 . . . E2 P2 E1 P1 A.

Computing the final L And we here show that Algorithm 2 computes the LU decom-
position in the following form
A = P LU ,

where P = P1 P2 . . . Pn−1 , U is the upper triangular matrix results directly from the algo-
rithm, L is unit lower triangular with |Lij | ≤ 1 for all 1 ≤ i, j ≤ n. Lk+1:n,k is a permuted
version of Ek ’s multipliers. To see this, we notice that the permutation matrices used in the
algorithm fall into a special kind of permutation matrix since we only interchange two rows
of the matrix. This implies the Pk ’s are symmetric and Pk2 = I for k ∈ {1, 2, . . . , n − 1}.
Suppose
Mk = (Pn−1 . . . Pk+1 )Ek (Pk+1 . . . Pn−1 ).

Then, U can be written as

U = Mn−1 . . . M2 M1 P > A.

26
Matrix Decomposition and Applications

To see what Mk is, we realize that Pk+1 is a permutation with the upper left k × k block
being an identity matrix. And thus we have

Mk = (Pn−1 . . . Pk+1 )(In − zk e>


k )(Pk+1 . . . Pn−1 )
= In − (Pn−1 . . . Pk+1 )(zk e>
k )(Pk+1 . . . Pn−1 )
= In − (Pn−1 . . . Pk+1 zk )(e>
k Pk+1 . . . Pn−1 )
= In − (Pn−1 . . . Pk+1 zk )e>
k. (since e> >
k Pk+1 . . . Pn−1 = ek )

This implies that Mk is unit lower triangular with the k-th column being the permuted
version of Ek . And the final lower triangular L is thus given by

L = M1−1 M2−1 . . . Mn−1


−1
.

1.9.2 Complete Pivoting


In partial pivoting, when introducing zeros below the diagonal of the k-th column of U , the
k-th pivot is determined by scanning the current subcolumn Uk:n,k . In complete pivoting,
the largest absolute entry in the current submatrix Uk:n,k:n is interchanged into the entry
(k, k) of U . Therefore, an additional column permutation Qk is needed in each step. The
final upper triangular matrix U is obtained by

U = En−1 Pn−1 . . . (E2 P2 (E1 P1 AQ1 )Q2 ) . . . Qn−1 .

Similarly, the complete pivoting algorithm is formulated in Algorithm 3.

Algorithm 3 LU Decomposition with Complete Pivoting


Require: Matrix A with size n × n;
1: Let U = A;
2: for k = 1 to n − 1 do . the value k is to get the k-th column of U
3: Find a row permutation matrix Pk , and a column permutation Qk that swaps Ukk
with the largest element in |Uk:n,k:n |, say Uu,v = max |Uk:n,k:n |;
4: U = Pk U Qk ;
5: Determine the Gaussian transformation Ek to introduce zeros below the diagonal of
the k-th column of U ;
6: U = Ek U ;
7: end for
8: Output U ;

The algorithm requires 2/3(n3 ) flops and (n2 + (n − 1)2 + . . . + 12 ) ∼ O(n3 ) comparisons
resulting from the pivoting procedure. Again, let P = P1 P2 . . . Pn−1 , Q = Q1 Q2 . . . Qn−1 ,

Mk = (Pn−1 . . . Pk+1 )Ek (Pk+1 . . . Pn−1 ), for all k ∈ {1, 2, . . . , n − 1}

and
L = M1−1 M2−1 . . . Mn−1
−1
.
We have A = P LU Q> or P > AQ = LU as the final decomposition.

27
Jun Lu

1.9.3 Rook Pivoting


The rook pivoting provides an alternative to the partial and complete pivoting. Instead of
choosing the larges value in |Uk:n,k:n | in the k-th step, it searches for an element of Uk:n,k:n
that is maximal in both its row and column. Apparently, the rook pivoting is not unique
such that we could find many entries that satisfy the criteria. For example, for a submatrix
Uk:n,k:n as follows
 
1 2 3 4
2 3 7 3
Uk:n,k:n =  5 2 1 2 ,

2 1 2 1
where the 7 will be chosen by complete pivoting. And one of 5, 4, 7 will be identified as a
rook pivot.

1.10 Rank-Revealing LU Decomposition


In many applications, a factorization produced by Gaussian elimination with pivoting when
A has rank r will reveal rank in the following form
  
L11 0 U11 U12
P AQ = ,
L>
21 I 0 0

where L11 ∈ Rr×r and U11 ∈ Rr×r are nonsingular, L21 , U21 ∈ Rr×(n−r) , and P , Q are
permutations. Gaussian elimination with rook pivoting or complete pivoting can result in
such decomposition (Hwang et al., 1992; Higham, 2002).

2. Cholesky Decomposition

Theorem 2.1: (Cholesky Decomposition)


Every positive definite matrix A ∈ Rn×n can be factored as

A = R> R,

where R ∈ Rn×n is an upper triangular matrix with positive diagonal elements.


This decomposition is known as the Cholesky decomposition of A. R is known as the
Cholesky factor or Cholesky triangle of A.
Alternatively, A can be factored as A = LL> where L = R> is a lower triangular matrix
with positive diagonals.
Specifically, the Cholesky decomposition is unique (Corollary 2.8, p. 38).

The Cholesky decomposition is named after a French military officer and mathemati-
cian, André-Louis Cholesky (1875-1918), who developed the Cholesky decomposition in his
surveying work. Similar to the LU decomposition for solving linear systems, the Cholesky
decomposition is further used primarily to solve positive definite linear systems. The de-
velopment of the solution is similar to that of the LU decomposition in Section 1.6 (p. 24),
and we shall not repeat the details.

28
Matrix Decomposition and Applications

2.1 Existence of the Cholesky Decomposition via Recursive Calculation

In this section, we will prove the existence of the Cholesky decomposition via recursive
calculation. In Section 13.5.4 (p. 133), we will also prove the existence of the Cholesky
decomposition via the QR decomposition and spectral decomposition. Before showing
the existence of Cholesky decomposition, we need the following definitions and lemmas.

Definition 2.2: (Positive Definite and Positive Semidefinite)


A matrix A ∈ Rn×n is positive definite (PD) if x> Ax > 0 for all nonzero x ∈ Rn . And
a matrix A ∈ Rn×n is positive semidefinite (PSD) if x> Ax ≥ 0 for all x ∈ Rn .

One of the prerequisites for the Cholesky decomposition is the definition of the above
positive definiteness of a matrix. We sketch several properties of this PD matrix as follows:

Positive Definite Matrix Property 1 of 5

We will show the equivalent definition on the positive definiteness of a matrix A is


that A only has positive eigenvalues, or on the positive semidefiniteness of a matrix
A is that A only has nonnegative eigenvalues. The proof is provided in Section 13.5.2
(p. 131) as a consequence of the spectral theorem.

Positive Definite Matrix Property 2 of 5

Lemma 2.3: (Positive Diagonals of Positive Definite Matrices)


The diagonal elements of a positive definite matrix A are all positive. And simi-
larly, the diagonal elements of a positive semidefinite matrix B are all nonnega-
tive.

Proof [of Lemma 2.3] From the definition of positive definite matrices, we have x> Ax > 0
for all nonzero x. In particular, let x = ei where ei is the i-th unit vector with the i-th
entry being equal to 1 and other entries being equal to 0. Then,

e>
i Aei = aii > 0, ∀i ∈ {1, 2, . . . , n},

where aii is the i-th diagonal component. The proof for the second part follows similarly.
This completes the proof.

29
Jun Lu

Positive Definite Matrix Property 3 of 5

Lemma 2.4: (Schur Complement of Positive Definite Matrices)


For any positive definite matrix A ∈ Rn×n , its Schur complement of A11 is given
by Sn−1 = A2:n,2:n − A111 A2:n,1 A>
2:n,1 which is also positive definite.

A word on the notation Note that the subscript n − 1 of Sn−1 means it is of


size (n − 1) × (n − 1) and it is a Schur complement of an n × n positive definite
matrix. We will use this notation in the following sections.

Proof [of Lemma 2.4] For any nonzero vector v ∈ Rn−1 , we can construct a vector x ∈ Rn
by the following equation:
 1 > 
− A11 A2:n,1 v
x= ,
v

which is nonzero. Then

 1 >
A>
  
> 1 > > A11 2:n,1 − A11 A2:n,1 v
x Ax = − v A2:n,1 v
A11 A2:n,1 A2:n,2:n v
  
1 > > 0
= − v A2:n,1 v
A11 Sn−1 v
= v > Sn−1 v.

Since A is positive definite, we have x> Ax = v > Sn−1 v > 0 for all nonzero v. Thus, the
Schur complement Sn−1 is positive definite as well.

The above argument can be extended to PSD matrices as well. If A is PSD, then the
Schur complement Sn−1 is also PSD.

A word on the Schur complement In the proof of Theorem 1.1, we have shown
this Schur complement Sn−1 = A2:n,2:n − A111 A2:n,1 A> 2:n,1 is also nonsingular if A is
nonsngular and A11 6= 0. Similarly, the Schur complement of Ann in A is S̄n−1 =
A1:n−1,1:n−1 − A1nn A1:n−1,n A>1:n−1,n which is also positive definite if A is positive defi-
nite. This property can help prove the fact that the leading principal minors of positive
definite matrices are all positive. See Section 2.2 for more details.

We then prove the existence of Cholesky decomposition using these lemmas.

30
Matrix Decomposition and Applications

Proof [of Theorem 2.1: Existence of Cholesky Decomposition Recursively] For


any positive definite matrix A, we can write out (since A11 is positive by Lemma 2.3)

A>
 
A11 2:n,1
A=
A2:n,1 A2:n,2:n
" √ # "√ #
A11 0 A11 √ 1 A>
A 2:n,1
= √1 11
A
A11 2:n,1
I 0 A2:n,2:n − A111 A2:n,1 A> 2:n,1
" √ #  "√ #
A11 0 1 0 A11 √A1 A> 2:n,1
= √1 11
A
A11 2:n,1
I 0 A2:n,2:n − A1 A2:n,1 A> 2:n,1 0 I
11
 
1 0
= R1> R1 .
0 Sn−1

where "√ #
A11 √ 1 A>
R1 = A11 2:n,1 .
0 I
Since we proved the Schur complement Sn−1 is positive definite in Lemma 2.4, then we can
factor it in the same way as
 
> 1 0
Sn−1 = R̂2 R̂2 .
0 Sn−2

Therefore, we have
 
1 0
A = R1> 
 
1 0  R1
0 R̂2> R̂2 .
0 Sn−2
 
  1 0  
> 1 0    1 0
= R1 1 0 R1
0 R̂2> 0

0 R̂2
0 Sn−2
 
1  0
= R1> R2> 

1 0  R2 R1 .
0
0 Sn−2

The same formula can be recursively applied. This process gradually continues down to the
bottom-right corner giving us the decomposition

A = R1> R2> . . . Rn> Rn . . . R2 R1


= R> R,

where R1 , R2 , . . . , Rn are upper triangular matrices with positive diagonal elements and
R = R1 R2 . . . Rn is also an upper triangular matrix with positive diagonal elements from
which the result follows.
The process in the proof can also be used to calculate the Cholesky decomposition and
compute the complexity of the algorithm.

31
Jun Lu

Lemma 2.5: (R> R is PD)


For any upper triangular matrix R with positive diagonal elements, then A = R> R is
positive definite.

Proof [of Lemma 2.5] If an upper triangular matrix R has positive diagonals, it has full
column rank, and the null space of R is of dimension 0 by the fundamental theorem of
linear algebra (Theorem 0.15, p. 14). As a result, Rx 6= 0 for any nonzero vector x. Thus
x> Ax = ||Rx||2 > 0 for any nonzero vector x.
This corollary above works not only for the upper triangular matrices R, but can be ex-
tended to any R with linearly independent columns.

A word on the two claims Combine Theorem 2.1 and Lemma 2.5, we can claim
that matrix A is positive definite if and only if A can be factored as A = R> R where
R is an upper triangular matrix with positive diagonals.

2.2 Sylvester’s Criterion: Leading Principal Minors of PD Matrices


In Lemma 2.4, we proved for any positive definite matrix A ∈ Rn×n , its Schur complement
of A11 is Sn−1 = A2:n,2:n − A111 A2:n,1 A>
2:n,1 and it is also positive definite. This is also true
0
for its Schur complement of Ann , i.e., Sn−1 = A1:n−1,1:n−1 − A1nn A1:n−1,n A> 1:n−1,n is also
positive definite.
We then claim that all the leading principal minors (Definition 1.3, p. 16) of a positive
definite matrix A ∈ Rn×n are positive. This is also known as the Sylvester’s criterion
(Swamy, 1973; Gilbert, 1991). Recall that these positive leading principal minors imply the
existence of the LU decomposition for positive definite matrix A by Theorem 1.4 (p. 17).
To show the Sylvester’s criterion, we need the following lemma.
Positive Definite Matrix Property 4 of 5

Lemma 2.6: (Quadratic PD)


Let E be any invertible matrix. Then A is positive definite if and only if E > AE
is also positive definite.

Proof [of Lemma 2.6] We will prove by forward implication and reverse implication sepa-
rately as follows.
Forward implication Suppose A is positive definite, then for any nonzero vector x, we
have x> E > AEx = y > Ay > 0, since E is invertible such that Ex is nonzero. 7 This
implies E > AE is PD.
Reverse implication Conversely, suppose E > AE is positive definite, for any nonzero x,
we have x> E > AEx > 0. For any nonzero y, there exists a nonzero x such that y = Ex
7. Since the null space of E is of dimension 0 and the only solution for Ex = 0 is the trivial solution x = 0.

32
Matrix Decomposition and Applications

since E is invertible. This implies A is PD as well.

We then provide the rigorous proof for Sylvester’s criterion.

Positive Definite Matrix Property 5 of 5

Theorem 2.7: (Sylvester’s Criterion)


The real symmetric matrix A ∈ Rn×n is positive definite if and only if all the
leading principal minors of A are positive.

Proof [of Theorem 2.7] We will prove by forward implication and reverse implication sep-
arately as follows.

Forward implication: We will prove by induction for the forward implication. Suppose
A is positive definite. Since all the components on the diagonal of positive definite matrices
are all positive (Lemma 2.3, p. 29). The case for n = 1 is trivial that det(A) > 0 if A is a
scalar.
Suppose all the leading principal minors for k × k matrices are all positive. If we could
prove this is also true for (k + 1) × (k + 1) PD matrices, then we complete the proof.
 
A b
For a (k + 1) × (k + 1) matrix with the block form M = > , where A is a k × k
b d
submatrix. Then its Schur complement of d, Sk = A − d1 bb> is also positive definite and its
determinant is positive from the assumption. Therefore, det(M ) = det(d) det(A − d1 bb> )=
8 d · det(A − 1 bb> ) > 0, which completes the proof.
d

Reverse implication: Conversely, suppose all the leading principal minors of A ∈ Rn×n
are positive, i.e., leading principal submatrices are nonsingular. Suppose further the (i, j)-th
entry of A is denoted by aij , we realize that a11 > 0 by the assumption. Subtract multiples
of the first row of A to zero out the entries in the first column of A below the first diagonal
a11 . That is,
   
a11 a12 . . . a1n a11 a12 . . . a1n
 a21 a22 . . . a2n 
 E1 A  0 a22 . . . a2n 
 
A= . −→ .

. .. . . . .. .
 .. .. ..   .. .. .. 
 
. . 
an1 an2 . . . ann 0 an2 . . . ann

This operation preserves the values of the principal minors of A. The E1 might be myste-
rious to the readers. Actually, the E1 contains two steps E1 = Z12 Z11 . The first step Z11

 
A B
8. By the fact that if matrix M has a block formulation: M = , then det(M ) = det(D) det(A −
C D
BD −1 C).

33
Jun Lu

is to subtract the 2-nd row to the n-th row by a multiple of the first row, that is
    
a11 a12 . . . a1n 1 0 ... 0 a11 a12 . . . a1n
 a21 a22 . . . a2n   − a21 1 . . . 0  a21 a22 . . . a2n 
 
 Z11 A  a11
A= . ..  −→  ..

. .. . . .. . . ..   .. .. . .. .
.. 
 . . . .   . . . .  . . 
an1 an2 . . . ann − aan1
11
0 ... 1 an1 an2 . . . ann
 
a11 a12 ... a1n
 0 (a22 − aa21
11
a12 ) . . . (a2n − aa21
11
a1n ) 
= . ,
 
.. .. .
 .. . . .. 
an1 an1
0 (an2 − a11 a12 ) . . . (ann − a11 a1n )

where we subtract the bottom-right (n − 1) × (n − 1) by some terms additionally. Z12 is to


add back these terms.
 
a11 a12 ... a1n
 0 (a22 − a21 a12 ) . . . (a2n − a21 a1n ) 
a11 a11
Z11 A =  .
 
.. .. ..
 ..

. . . 
an1
0 (an2 − a11 a12 ) . . . (ann − aan1
11
a1n )
  
1 0 . . . 0 a11 a12 ... a1n
a21 a21 a21
1   0 (a22 − a11 a12 ) . . . (a2n − a11 a1n ) 
. . . 0

Z12 (Z11 A)  a11
 
−→  . .. . . ..   .. .. . .
 .. .. .. 
. . .  . . 
an1 an1 an1
a11 0 ... 1 0 (an2 − a11 a12 ) . . . (ann − a11 a1n )
 
a11 a12 ... a1n
 0 a22 ... a2n 
= . ..  = E1 A.
 
.. ..
 .. . . . 
0 an2 . . . ann

Now subtract multiples of the first column of E1 A, from the other columns of E1 A to
zero out the entries in the first row of E1 A to the right of the first column. Since A is
symmetric, we can multiply on the right by E1> to get what we want. We then have
     
a11 a12 . . . a1n a11 a12 . . . a1n a11 0 . . . 0
 a21 a22 . . . a2n 
 E1 A  0 a22 . . . a2n  E1 AE1>  0 a22 . . . a2n 
   
A= . −→ −→ .

. .. . . . .. . . . .. .
 .. .. ..   .. .. ..   .. .. .. 
   
. . . 
an1 an2 . . . ann 0 an2 . . . ann 0 an2 . . . ann

This operation also preserves the values of the principal minors of A. The leading principal
minors of E1 AE1> are exactly the same as those of A.
Continue this process, we will transform A into a diagonal matrix En . . . E1 AE1> . . . En>
whose diagonal values are exactly the same as the diagonals of A and are positive. Let
E = En . . . E1 , which is an invertible matrix. Apparently, EAE > is PD, which implies A
is PD as well from Lemma 2.6.

34
Matrix Decomposition and Applications

2.3 Existence of the Cholesky Decomposition via the LU Decomposition


without Permutation
By Theorem 2.7 on Sylvester’s criterion and the existence of LU decomposition without
permutation in Theorem 1.4 (p. 17), there is a unique LU decomposition for positive definite
matrix A = LU0 where L is a unit lower triangular matrix and U0 is an upper triangular
matrix. Since the signs of the pivots of a symmetric matrix are the same as the signs of the
eigenvalues (Strang, 2009):

number of positive pivots = number of positive eigenvalues.

And A = LU0 has the following form


  
1 0 . . . 0 u11 u12 . . . u1n
 l21 1 . . . 0  0 u22 . . . u2n 
 
A = LU0 =  . .

.. .
.. .  . . . .. .
 .. .. .. 
 
. . . . . 
ln1 ln2 ... 1 0 0 . . . unn
This implies that the diagonals of U0 contain the pivots of A. And all the eigenvalues
of PD matrices are positive (see Lemma 13.32, p. 131, which is a consequence of spectral
decomposition). Thus the diagonals of U0 are positive.
Taking the diagonal of U0 out into a diagonal matrix D, we can rewrite U0 = DU as
shown in the following equation
   
1 0 . . . 0 u11 0 . . . 0 1 u12 /u11 . . . u1n /u11
 l21 1 . . . 0  0 u22 . . . 0  0 1 . . . u2n /u22 
A = LU0 =  .  = LDU ,
   
. .. . . ..   .. .. . . ..   .. .. . . ..
 . . . .  . . . .  . . . . 
ln1 ln2 . . . 1 0 0 . . . unn 0 0 ... 1
where U is a unit upper triangular matrix. By the uniqueness of the LU decomposition
without permutation in Corollary 1.8 (p. 21) and the symmetry of A, it follows that U =
L> , and A = LDL> . Since the diagonals of D are positive, we can set R = D 1/2 L> where
√ √ √
D 1/2 = diag( u11 , u22 , . . . , unn ) such that A = R> R is the Cholesky decomposition of
A, and R is upper triangular with positive diagonals.

2.3.1 Diagonal Values of the Upper Triangular Matrix


 
A11 A12
Suppose A is a PD matrix, take A as a block matrix A = where A11 ∈ Rk×k ,
A21 A22
and its block LU decomposition is given by
    
A11 A12 L11 0 U11 U12
A= = LU0 =
A21 A22 L21 L22 0 U22
 
L11 U11 L11 U12
= .
L21 U11 L21 U12 + L22 U22

Then the leading principal minor (Definition 1.3, p. 16), ∆k = det(A1:k,1:k ) = det(A11 ) is
given by
∆k = det(A11 ) = det(L11 U11 ) = det(L11 ) det(U11 ).

35
Jun Lu

We notice that L11 is a unit lower triangular matrix and U11 is an upper triangular matrix.
By the fact that the determinant of a lower triangular matrix (or an upper triangular
matrix) is the product of the diagonal entries, we obtain

∆k = det(U11 ) = u11 u22 . . . ukk ,

i.e., the k-th leading principal minor of A is the determinant of the k × k submatrix of U0 .
That is also the product of the first k diagonals of D (D is the matrix from A = LDL> ).
Let D = diag(d1 , d2 , . . . , dn ), therefore, we have

∆k = d1 d2 . . . dk = ∆k−1 dk .

This gives us an alternative form of D, i.e., the squared diagonal values of R (R is the
Cholesky factor from A = R> R), and it is given by
 
∆2 ∆n
D = diag ∆1 , ,..., ,
∆1 ∆n−1

where ∆k is the k-th leading principal minor of A, for all k ∈ {1, 2, . . . , n}. That is, the
diagonal values of R are given by
r s !
p ∆2 ∆n
diag ∆1 , ,..., .
∆1 ∆n−1

2.3.2 Block Cholesky Decomposition


Following
 from the last section, suppose A is a PD matrix, take A as a block matrix
Ak A12
A= where Ak ∈ Rk×k , and its block LU decomposition is given by
A21 A22
    
Ak A12 Lk 0 Uk U12
A= = LU0 =
A21 A22 L21 L22 0 U22
 
Lk U k L11 U12
= .
L21 U11 L21 U12 + L22 U22

where the k-th leading principal submatrix Ak of A also has its LU decomposition Ak =
Lk Uk . Then, it is trivial that the Cholesky decomposition of an n × n matrix contains n − 1
other Cholesky decompositions within it: Ak = Rk> Rk , for all k ∈ {1, 2, . . . , n − 1}. This
is particularly true that any leading principal submatrix Ak of the positive definite matrix
A is also positive definite. This can be shown that for positive definite matrix Ak+1  ∈
xk
R(k+1)×(k+1) , and any nonzero vector xk ∈ Rk appended by a zero element xk+1 = .
0
It follows that
x> >
k Ak xk = xk+1 Ak+1 xk+1 > 0,

and Ak is positive definite. If we start from A ∈ Rn×n , we will recursively get that An−1
is PD, An−2 is PD, . . .. And all of them admit a Cholesky decomposition.

36
Matrix Decomposition and Applications

2.4 Existence of the Cholesky Decomposition via Induction


In the last section, we proved the existence of the Cholesky decomposition via the LU
decomposition without permutation. Following from the proof of the LU decomposition
in Section 1.2, we realize that the existence of Cholesky decomposition can be a direct
consequence of induction as well.
Proof [of Theorem 2.1: Existence of Cholesky Decomposition by Induction] We
will prove by induction that every n × n positive definite
√ matrix A has a decomposition
A = R> R. The 1 × 1 case is trivial by setting R = A, thus, A = R2 .
Suppose for any k × k PD matrix Ak has a Cholesky decomposition. If we prove any
(k + 1) × (k + 1) PD matrix Ak+1 can also be factored as this Cholesky decomposition, then
we complete the proof.
For any (k + 1) × (k + 1) PD matrix Ak+1 , Write out Ak+1 as
 
Ak b
Ak+1 = > .
b d
We note that Ak is PD from the last section. By the inductive hypothesis, it admits a
Cholesky decomposition Ak = Rk> Rk . We can construct the upper triangular matrix
 
Rk r
R= ,
0 s
such that it follows that
 >
Rk> r

> Rk Rk
Rk+1 Rk+1 = .
r > Rk r > r + s2
> R
Therefore, if we can prove Rk+1 k+1 = Ak+1 is the Cholesky decomposition of Ak+1 (which
requires the value s to be positive), then we complete the proof. That is, we need to prove
b = Rk> r,
d = r > r + s2 .
Since Rk is nonsingular, we have a unique solution for r and s that
r = Rk−> b,
p q
s = d − r > r = d − b> A−1
k b,

since we assume s is nonnegative. However, we need to further prove that s is not only
nonnegative, but also positive. Since Ak is PD, from Sylvester’s criterion, and the fact
A B
that if matrix M has a block formulation: M = , then det(M ) = det(A) det(D −
C D
CA−1 B). We have
det(Ak+1 ) = det(Ak ) det(d − b> A−1 > −1
k b) = det(Ak )(d − b Ak b) > 0.

Since det(Ak ) > 0, we then obtain that (d − b> A−1


k b) > 0 and this implies s > 0. We
complete the proof.

37
Jun Lu

2.5 Uniqueness of the Cholesky Decomposition

Corollary 2.8: (Uniqueness of Cholesky Decomposition)


The Cholesky decomposition A = R> R for any positive definite matrix A ∈ Rn×n is
unique.

The uniqueness of the Cholesky decomposition can be an immediate consequence of the


uniqueness of the LU decomposition without permutation. Or, an alternative rigorous
proof is provided as follows.
Proof [of Corollary 2.8] Suppose the Cholesky decomposition is not unique, then we can
find two decompositions such that A = R1> R1 = R2> R2 which implies

R1 R2−1 = R1−> R2> .

From the fact that the inverse of an upper triangular matrix is also an upper triangular
matrix, and the product of two upper triangular matrices is also an upper triangular matrix,
9 we realize that the left-side of the above equation is an upper triangular matrix and the

right-side of it is a lower triangular matrix. This implies R1 R2−1 = R1−> R2> is a diagonal
matrix, and R1−> R2> = (R1−> R2> )> = R2 R1−1 . Let Λ = R1 R2−1 = R2 R1−1 be the diagonal
matrix. We notice that the diagonal value of Λ is the product of the corresponding diagonal
values of R1 and R2−1 (or R2 and R1−1 ). That is, for
   
r11 r12 . . . r1n s11 s12 . . . s1n
 0 r22 . . . r2n   0 s22 . . . s2n 
R1 =  . , R = ..  ,
   
.
.. . . . . 2  .. .. . .
 .. .. 

 . . . . 
0 0 . . . rnn 0 0 . . . snn

we have,
 r11   s11 
s11 0 ... 0 r11 0 ... 0
r22 s22
 0 s22 ... 0   0 r22 ... 0 
R1 R2−1 = =  = R2 R1−1 .
   
.. .. .. .. .. .. .. ..
 . . . .   . . . . 
rnn snn
0 0 ... snn 0 0 ... rnn

Since both R1 and R2 have positive diagonals, this implies r11 = s11 , r22 = s22 , . . . , rnn =
snn . And Λ = R1 R2−1 = R2 R1−1 = I. That is, R1 = R2 and this leads to a contradiction.
The Cholesky decomposition is thus unique.

2.6 Last Words on Positive Definite Matrices


In Section 13.5.2 (p. 131), we will prove that a matrix A is PD if and only if A can be
factored as A = P > P where P is nonsingular. And in Section 13.5.5 (p. 133), we will
9. Same for lower triangular matrices: the inverse of a lower triangular matrix is also a lower triangular
matrix, and the product of two lower triangular matrices is also a lower triangular matrix.

38
Matrix Decomposition and Applications

prove that PD matrix A can be uniquely factored as A = B 2 where B is also PD. The two
results are both consequences of the spectral decomposition of PD matrices.
To conclude, for PD matrix A, we can factor it into A = R> R where R is an upper
triangular matrix with positive diagonals as shown in Theorem 2.1 by Cholesky decomposi-
tion, A = P > P where P is nonsingular in Theorem 13.33 (p. 131), and A = B 2 where B
is PD in Theorem 13.34 (p. 133). For clarity, the different factorizations of positive definite
matrix A are summarized in Figure 4.

PD Matrix A
LU/ Spectral
Spectral/ Spectral Decomposition
Recursive Decomposition

R> R B2 P >P

Upper
PD B Nonsingular P
Triangular R

Figure 4: Demonstration of different factorizations on positive definite matrix A.

2.7 Decomposition for Semidefinite Matrices


For positive semidefinite matrices, the Cholesky decomposition also exists with slight mod-
ification.

Theorem 2.9: (Semidefinite Decomposition)


Every positive semidefinite matrix A ∈ Rn×n can be factored as

A = R> R,

where R ∈ Rn×n is an upper triangular matrix with possible zero diagonal elements.

For such decomposition, the diagonal of R may not display the rank of A (Higham, 2009).
More generally, a rank-revealing decomposition for semidefinite decomposition is provided
as follows.

Theorem 2.10: (Semidefinite Rank-Revealing Decomposition)


Every positive semidefinite matrix A ∈ Rn×n with rank r can be factored as
 
> > R11 R12
P AP = R R, with R= ∈ Rn×n ,
0 0

39
Jun Lu

where R11 ∈ Rr×r is an upper triangular matrix with positive diagonal elements, and
R12 ∈ Rr×(n−r) .
The proof for the existence of the above rank-revealing decomposition for semidefinite ma-
trices is delayed in Section 13.5.3 (p. 132) as a consequence of the spectral decomposition
(Theorem 13.1, p. 113) and the column-pivoted QR decomposition (Theorem 3.2, p. 53).

2.8 Application: Rank-One Update/Downdate


Updating linear systems after low-rank modifications of the system matrix is widespread
in machine learning, statistics, and many other fields. However, it is well known that this
update can lead to serious instabilities in the presence of round-off error (Seeger, 2004). If
the system matrix is positive definite, it is almost always possible to use a representation
based on the Cholesky decomposition which is much more numerically stable. We will
shortly provide the proof for this rank one update/downdate via Cholesky decomposition
in this section.

2.8.1 Rank-One Update


A rank-one update A0 of matrix A by vector x is of the form:
A0 = A + vv >
R0> R0 = R> R + vv > .

If we have already calculated the Cholesky factor R of A ∈ Rn×n , then the Cholesky factor
R0 of A0 can be calculated efficiently. Note that A0 differs from A only via the symmetric
rank-one matrix. Hence we can compute R0 from R using the rank-one Cholesky update,
which takes O(n2 ) operations each saving from O(n3 ) if we do know R, the Cholesky
decomposition of A up front, i.e., we want to compute the Cholesky decomposition of A0
via that of A. To see this, suppose there is a set of orthogonal matrices Qn Qn−1 . . . Q1
such that that  >  
v 0
Qn Qn−1 . . . Q1 = .
R R0
Then we find out the expression for the Cholesky factor of A0 by R0 . Specifically, multiply
the left-hand side (l.h.s.,) of above equation by its transpose,
 >
v
> = R> R + vv > .
 
v R Q1 . . . Qn−1 Qn Qn Qn−1 . . . Q1
R
And multiply the right-hand side (r.h.s.,) by its transpose,
 
 0
0> = R0> R0 ,

0 R
R0
which agrees with the l.h.s., equation. Givens rotations are such orthogonal matrices that
can transfer R, v into R0 . We will discuss the intrinsic meaning of Givens rotation shortly
to prove the existence of QR decomposition in Section 3.12 (p. 60). Here, we only introduce
the definition of it and write out the results directly. Feel free to skip this section for a first
reading.

40
Matrix Decomposition and Applications

Definition 2.11: (n-th Order Givens Rotation)


A Givens rotation is represented by a matrix of the following form
 
1
 ..
.

 
 

 1 


 c s 

 1 
Gkl =  ,
 
.. 

 . 


 1 


 −s c 


 1 

..
.
n×n

where the (k, k), (k, l), (l, k), (l, l) entries are c, s, −s, c respectively, and s = cos θ and
c = cos θ for some θ.
Let δ k ∈ Rn be the zero vector except that the k-th entry is 1. Then mathematically,
the Givens rotation defined above can be denoted by

Gkl = I + (c − 1)(δ k δ > > > >


k + δ l δ l ) + s(δ k δ l − δ l δ k ).

where the subscripts k, l indicate the rotation is in plane k and l. Specifically, one
can also define the n-th order Givens rotation where (k, k), (k, l), (l, k), (l, l) entries are
c, −s, s, c respectively. The ideas are the same.

It can be easily verified that the n-th order Givens rotation is an orthogonal matrix and
its determinant is 1. For any vector x = [x1 , x2 , . . . , xn ]> ∈ Rn , we have y = Gkl x, where

yk = c · xk + s · xl ,

yl = −s · xk + c · xl ,

yj = x j , (j 6= k, l)

That is, a Givens rotation applied to x rotates two components of x by some angle θ and
leaves all other components the same.
Now, suppose we have an (n + 1)-th order Givens rotation indexed from 0 to n, and it
is given by
Gk = I + (ck − 1)(δ 0 δ > > > >
0 + δ k δ k ) + sk (δ 0 δ k − δ k δ 0 ),
where ck = cos θk , sk = sin θk for some θk , Gk ∈ R(n+1)×(n+1) , δk ∈ Rn+1 is a zero vector
except that the (k + 1)-th entry is 1.
Taking out the k-th column of the following equation
 >  
v 0
= ,
R R0

41
Jun Lu

whereqwe let the k-th element of v be vk , and the k-th diagonal of R be rkk . We realize
2 6= 0, let c = √ rkk
that vk2 + rkk , sk = − √ 2vk 2 . Then,
k 2 2
vk +rkk vk +rkk

vk → ck vk + sk rkk = 0;
q
rkk → −sk vk + ck rkk = 2 = r0 .
vk2 + rkk kk

That is, Gk will introduce zero value to the k-th element to v and nonzero value to rkk .

This finding above is essential for the rank-one update. And we obtain
 >  
v 0
Gn Gn−1 . . . G1 = .
R R0

For each Givens rotation, it takes 6n flops. And there are n such rotations, which requires
6n2 flops if keeping only the leading term. The complexity to calculate the Cholesky factor
of A0 is thus reduced from 31 n3 to 6n2 flops if we already know the Cholesky factor of A
by the rank-one update. The above algorithm is essential to reduce the complexity of the
posterior calculation in the Bayesian inference for Gaussian mixture model (Lu, 2021a).

2.8.2 Rank-One Downdate


Now suppose we have calculated the Cholesky factor of A, and the A0 is the downdate of
A as follows:
A0 = A − vv >
R0> R0 = R> R − vv > .
The algorithm is similar by proceeding as follows:
   >
0 v
G1 G2 . . . Gn = . (2.1)
R R0

Again, Gk = I + (ck − 1)(δ 0 δ > > > >


0 + δ k δ k ) + sk (δ 0 δ k − δ k δ 0 ), can be constructed as follows:

Taking out the k-th column of the following equation


   >
0 v
= .
R R0
√ 2 −v 2
rkk vk
We realize that rkk 6= 0, let ck = rkk
k
, sk = rkk . Then,

0 → sk rkk = vk ;
q
rkk → ck rkk = r2 − v 2 = r0 .
kk k kk

2 > v 2 to make A0 to be positive definite. Otherwise, c above will not


This requires rkk k k
exist.

42
Matrix Decomposition and Applications

Again, one can check that, multiply the l.h.s., of Equation (2.1) by its transpose, we have
 
0
0 R> Gn . . . G2 G1 G1 G2 . . . Gn = R> R.
 
R

And multiply the r.h.s., by its transpose, we have


 v>
 
0> = vv > + R0> R0 .

v R
R0

This results in R0> R0 = R> R − vv > .

2.9 Application: Indefinite Rank Two Update


Let A = R> R be the Cholesky decomposition of A, (Goldfarb, 1976; Seeger, 2004) give a
stable method for the indefinite rank-two update of the form

A0 = (I + vu> )A(I + uv > ).

Let
z = R−> v, v = R> z,
 

w = Ru, u = R−1 w.
And suppose the LQ decomposition 10 of I + zw> is given by I + zw> = LQ, where L is
lower triangular and Q is orthogonal. Thus, we have

A0 = (I + vu> )A(I + uv > )


= (I + R> zw> R−> )A(I + R−1 wz > R)
= R> (I + zw> )(I + wz > )R
= R> LQQ> L> R
= R> LL> R.

Let R0 = R> L which is lower triangular, we find the Cholesky decomposition of A0 .

Part II
Triangularization, Orthogonalization and
Gram-Schmidt Process
3. QR Decomposition
In many applications, we are interested in the column space of a matrix A = [a1 , a2 , ..., an ] ∈
Rm×n . The successive spaces spanned by the columns a1 , a2 , . . . of A are

C([a1 ]) ⊆ C([a1 , a2 ]) ⊆ C([a1 , a2 , a3 ]) ⊆ . . . ,


10. We will shortly introduce in Theorem 3.11 (p. 65).

43
Jun Lu

where C([. . .]) is the subspace spanned by the vectors included in the brackets. The idea of
QR decomposition is the construction of a sequence of orthonormal vectors q1 , q2 , . . . that
span the same successive subspaces.

     
C([q1 ]) = C([a1 ]) ⊆ C([q1 , q2 ]) = C([a1 , a2 ]) ⊆ C([q1 , q2 , q3 ]) = C([a1 , a2 , a3 ]) ⊆ . . . ,

We provide the result of QR decomposition in the following theorem and we delay the
discussion of its existence in the next sections.

Theorem 3.1: (QR Decomposition)


Every m × n matrix A = [a1 , a2 , ..., an ] (whether linearly independent or dependent
columns) with m ≥ n can be factored as

A = QR,

where
1. Reduced: Q is m×n with orthonormal columns and R is an n×n upper triangular
matrix which is known as the reduced QR decomposition;
2. Full: Q is m × m with orthonormal columns and R is an m × n upper triangular
matrix which is known as the full QR decomposition. If further restrict the
upper triangular matrix to be a square matrix, the full QR decomposition can be
denoted as  
R0
A=Q ,
0
where R0 is an m × m upper triangular matrix.
Specifically, when A has full rank, i.e., has linearly independent columns, R also has
linearly independent columns, and R is nonsingular for the reduced case. This implies
diagonals of R are nonzero. Under this condition, when we further restrict elements on
the diagonal of R are positive, the reduced QR decomposition is unique. The full QR
decomposition is normally not unique since the right-most (m − n) columns in Q can be
in any order.

3.1 Project a Vector Onto Another Vector


Project a vector a to a vector b is to find the vector closest to a on the line of b. The
projection vector a
b is some multiple of b. Let a
b=x bb and a − ab is perpendicular to b as
shown in Figure 5(a). We then get the following result:

Project Vector a Onto Vector b

a> b a> b bb>


a⊥ = a − a bb)> b = 0: x
b is perpendicular to b, so (a − x b= b> b
and a
b= b> b
b = b> b
a.

44
Matrix Decomposition and Applications

a
a a - aˆ
a - aˆ
C ([b1 , b2 ,..., bn ])
b
aˆ  xˆb â
(a) Project onto a line (b) Project onto a space

Figure 5: Project a vector onto a line and a space.

3.2 Project a Vector Onto a Plane


Project a vector a to a space spanned by b1 , b2 , . . . , bn is to find the vector closest to a on the
column space of [b1 , b2 , . . . , bn ]. The projection vector a b is a combination of b1 , b2 , . . . , bn :
a
b = x b1 b1 + x b2 b2 + . . . + x bn bn . This is actually a least squares problem. To find the
projection, we just solve the normal equation B > B x b = B > a where B = [b1 , b2 , . . . , bn ]
and xb = [bx1 , x
b2 , . . . , x
bn ]. We refer the details of this projection view in the least squares to
(Strang, 2009; Trefethen and Bau III, 1997; Yang, 2000; Golub and Van Loan, 2013; Lu,
2021e) as it is not the main interest of this survey. For each vector bi , the projection of a
in the direction of bi can be analogously obtained by

bi b>
i
a
bi = a, ∀i ∈ {1, 2, . . . , n}.
b>
i bi

Pn
Let a
b= i=1 a
bi , this results in

a⊥ = (a − a
b ) ⊥ C(B),

i.e., (a − a
b ) is perpendicular to the column space of B = [b1 , b2 , . . . , bn ] as shown in
Figure 5(b).

3.3 Existence of the QR Decomposition via the Gram-Schmidt Process


For three linearly independent vectors {a1 , a2 , a3 } and the space spanned by the three
linearly independent vectors C([a1 , a2 , a3 ]), i.e., the column space of the matrix [a1 , a2 , a3 ].
We intend to construct three orthogonal vectors {b1 , b2 , b3 } in which case C([b1 , b2 , b3 ]) =
C([a1 , a2 , a3 ]). Then we divide the orthogonal vectors by their length to normalize. This
process produces three mutually orthonormal vectors q1 = ||bb11 || , q2 = ||bb22 || , q2 = ||bb22 || .
For the first vector, we choose b1 = a1 directly. The second vector b2 must be per-
pendicular to the first one. This is actually the vector a2 subtracting its projection along

45
Jun Lu

b1 :

b1 b>
1 b1 b>
1
b2 = a2 − a2 = (I − )a2 (Projection view)
b>
1 b 1 b >b
1 1
b>
1 a2
= a2 − b1 , (Combination view)
b>
1 b1
| {z }
a
b2

1 1 b b>
where the first equation shows b2 is a multiplication of the matrix (I − b> b
) and the vector
1 1
a2 , i.e., project a2 onto the orthogonal complement space of C([b1 ]). The second equality in
the above equation shows a2 is a combination of b1 and b2 . Clearly, the space spanned by
b1 , b2 is the same space spanned by a1 , a2 . The situation is shown in Figure 6(a) in which
we choose the direction of b1 as the x-axis in the Cartesian coordinate system. a b2
is the projection of a2 onto line b1 . It can be clearly shown that the part of a2 perpendicular
to b1 is b2 = a2 − a b 2 from the figure.

For the third vector b3 , it must be perpendicular to both the b1 and b2 which is actually
the vector a3 subtracting its projection along the plane spanned by b1 and b2

b1 b>
1 b2 b>
2 b1 b>
1 b2 b>
2
b3 = a3 − a3 − a3 = (I − − )a3 (Projection view)
b>
1 b 1 b >b
2 2 b >b
1 1 b >b
2 2
b>
1 a3 b>
2 a3 (3.1)
= a3 − >
b 1 − b2 , (Combination view)
b1 b1 b>
2 b2
| {z } | {z }
a
b3 ā3

b1 b> b2 b>
where the first equation shows b3 is a multiplication of the matrix (I − b>
1
− b>
2
) and
1 b1 2 b2
the vector a3 , i.e., project a3 onto the orthogonal complement space of C([b1 , b2 ]). The
second equality in the above equation shows a3 is a combination of b1 , b2 , b3 . We will see
this property is essential in the idea of the QR decomposition. Again, it can be shown
that the space spanned by b1 , b2 , b3 is the same space spanned by a1 , a2 , a3 . The situation
is shown in Figure 6(b), in which we choose the direction of b2 as the y-axis of the
Cartesian coordinate system. a b 3 is the projection of a3 onto line b1 , ā3 is the projection
of a3 onto line b2 . It can be shown that the part of a3 perpendicular to both b1 and b2 is
b3 = a3 − ab 3 − ā3 from the figure.

Finally, we normalize each vector by dividing their length which produces three or-
thonormal vectors q1 = ||bb11 || , q2 = ||bb22 || , q2 = ||bb22 || .

46
Matrix Decomposition and Applications

z z a3
b3

y a3 y
â3
â2 a2 â2 a2
x b2 x b2
b1  a1 b1  a1
(a) Project a2 onto the space perpendicular to b1 . (b) Project a3 onto the space perpendicular to
b1 , b2 .

Figure 6: The Gram-Schmidt process.

This idea can be extended to a set of vectors rather than only three. And we call this
process as Gram-Schmidt process. After this process, matrix A will be triangularized. The
method is named after Jørgen Pedersen Gram and Erhard Schmidt, but it appeared earlier
in the work of Pierre-Simon Laplace in the theory of Lie group decomposition.
As we mentioned previously, the idea of the QR decomposition is the construction of a
sequence of orthonormal vectors q1 , q2 , . . . that span the same successive subspaces.
     
C([q1 ]) = C([a1 ]) ⊆ C([q1 , q2 ]) = C([a1 , a2 ]) ⊆ C([q1 , q2 , q3 ]) = C([a1 , a2 , a3 ]) ⊆ . . . ,

This implies any ak is in the space spanned by C([q1 , q2 , . . . , qk ]). 11 As long as we have
found these orthonormal vectors, to reconstruct ai ’s from the orthonormal matrix Q =
[q1 , q2 , . . . , qn ], an upper triangular matrix R is needed such that A = QR.
The Gram–Schmidt process is not the only algorithm for finding the QR decompo-
sition. Several other QR decomposition algorithms exist such as Householder reflections
and Givens rotations which are more reliable in the presence of round-off errors. These QR
decomposition methods may also change the order in which the columns of A are processed.

3.4 Orthogonal vs Orthonormal


The vectors q1 , q2 , . . . , qn ∈ Rm are mutually orthogonal when their dot products qi> qj
are zero whenever i 6= j. When each vector is divided by its length, the vectors become
orthogonal unit vectors. Then the vectors q1 , q2 , . . . , qn are mutually orthonormal. We put
the orthonormal vectors into a matrix Q.
When m 6= n: the matrix Q is easy to work with because Q> Q = I ∈ Rn×n . Such Q
with m 6= n is sometimes referred to as a semi-orthogonal matrix.
When m = n: the matrix Q is square, Q> Q = I means that Q> = Q−1 , i.e., the
transpose of Q is also the inverse of Q. Then we also have QQ> = I, i.e., Q> is the

11. And also, any qk is in the space spanned by C([a1 , a2 , . . . , ak ]).

47
Jun Lu

two-sided inverse of Q. We call this Q an orthogonal matrix. 12 To see this, we have


 >  
q1 1
q >     1

 2
 ..  q1 q2 . . . qn =  .

..
 .   . 
qn> 1

In other words, qi> qj = δij where δij is the Kronecker delta. The columns of an orthogonal
matrix Q ∈ Rn×n form an orthonormal basis of Rn .
Orthogonal matrices can be viewed as matrices that change the basis of other ma-
trices. Hence they preserve the angle (inner product) between the vectors: u> v =
>
(Qu) (Qv). This invariance of the inner products of angles between the vectors is pre-
served, which also relies on the invariance of their lengths: ||Qu|| = ||u||. In real cases,
multiplied by a orthogonal matrix Q will rotate (if det(Q) = 1) or reflect (if det(Q) = −1)
the original vector space. Many decomposition algorithms will result in two orthogonal
matrices, thus such rotations or reflections will happen twice.

3.5 Computing the Reduced QR Decomposition via CGS and MGS


We write out this form of the reduced QR Decomposition such that A = QR where Q ∈
Rm×n and R ∈ Rn×n :
 
r11 r12 . . . r1n
    r22 . . . r2n 
A = a1 a2 ... an = q1 q2 ... qn  ..  .
 
..
 .
0 . 
rnn
The orthonormal matrix Q can be easily calculated by the Gram-Schmidt process. To see
why we have the upper triangular matrix R, we write out these equations
1
X
a1 = r11 q1 = ri1 q1 ,
i=1
..
.
k
X
ak = r1k q1 + r2k q2 + . . . + rkk qk = rik qk ,
i=1
..
.

which coincides with the second equation of Equation (3.1) and conforms to the form of an
upper triangular matrix R. And if we extend the idea of Equation (3.1) into the k-th term,
we will get
k−1
X k−1
X
> ⊥
ak = (qi ak )qi + ak = (qi> ak )qi + ||a⊥
k || · qk ,
i=1 i=1

12. Note here we use the term orthogonal matrix to mean the matrix Q has orthonormal columns. The term
orthonormal matrix is not used for historical reasons.

48
Matrix Decomposition and Applications

where a⊥
k is such bk in Equation (3.1) that we emphasize the “perpendicular” property here.
This implies we can gradually orthonormalize A to an orthonormal set Q = [q1 , q2 , . . . , qn ]
by
rik = qi> ak , ∀i ∈ {1, 2, . . . , k − 1};






 k−1
X
 a⊥ = a −

rik qi ;
k k
i=1
(3.2)

rkk = ||a⊥

k ||;





qk = a⊥

k /rkk .

Orthogonal Projection
We notice again from Equation (3.2), the first two equality imply that
rik = qi> ak , ∀i ∈ {1, 2, . . . , k − 1}



k−1
X → a⊥ > >
k = ak − Qk−1 Qk−1 ak = (I − Qk−1 Qk−1 )ak ,
a⊥
k = a k − r q
ik i



i=1
(3.3)
where Qk−1 = [q1 , q2 , . . . , qk−1 ]. This implies qk can be obtained by

a⊥
k
(I − Qk−1 Q>
k−1 )ak
qk = ⊥
= .
||ak || ||(I − Qk−1 Q>
k−1 )ak ||

The matrix (I − Qk−1 Q> k−1 ) in above equation is known as an orthogonal projection matrix
that will project ak along the column space of Qk−1 , i.e., project a vector so that the vector
is perpendicular to the column space of Qk−1 . The net result is that the a⊥ k or qk calculated
in this way will be orthogonal to the C(Qk−1 ), i.e., in the null space of Q> >
k−1 : N (Qk−1 ) by
the fundamental theorem of linear algebra (Theorem 0.15, p. 14).
Let P1 = (I − Qk−1 Q> >
k−1 ) and we claimed above P1 = (I − Qk−1 Qk−1 ) is an orthogonal
projection matrix such that P1 v will project the v onto the null space of Qk−1 . And
actually, let P2 = Qk−1 Q> k−1 , then P2 is also an orthogonal projection matrix such that
P2 v will project the v onto the column space of Qk−1 .
But why do the matrix P1 , P2 can magically project a vector onto the corresponding
subspaces? It can be easily shown that the column space of Qk−1 is equal to the column
space of Qk−1 Q> k−1 :
C(Qk−1 ) = C(Qk−1 Q> k−1 ) = C(P2 ).
Therefore, the result of P2 v is a linear combination of the columns of P2 , which is in the
column space of P2 or the column space of Qk−1 . The formal definition of a projection
matrix P is that it is idempotent P 2 = P such that projecting twice is equal to projecting
once. What makes the above P2 = Qk−1 Q> k−1 different is that the projection v
b of any
vector v is perpendicular to v − vb:
v = P2 v) ⊥ (v − v
(b b).
This goes to the original definition we gave above: the orthogonal projection matrix. To
avoid confusion, one may use the term oblique projection matrix in the nonorthogonal case.

49
Jun Lu

When P2 is an orthogonal projection matrix, P1 = I − P2 is also an orthogonal projection


matrix that will project any vector onto the space perpendicular to the C(Qk−1 ), i.e.,
N (Q>
k−1 ). Therefore, we conclude the two orthogonal projections:
(
P1 : project onto N (Q>k−1 );
P2 : project onto C(Qk−1 ).
The further result that is important to notice is when the columns of Qk−1 are mutually
orthonormal, we have the following decomposition:

P1 = I − Qk−1 Q> > > >


k−1 = (I − q1 q1 )(I − q2 q2 ) . . . (I − qk−1 qk−1 ), (3.4)

where Qk−1 = [q1 , q2 , . . . , qk−1 ] and each (I − qi qi> ) is to project a vector into the per-
pendicular space of qi . This finding is important to make a step further to a modified
Gram-Schmidt process (MGS) where we project and subtract on the fly. To avoid confu-
sion, the previous Gram-Schmidt is called the classical Gram-Schmidt process (CGS). The
difference between the CGS and MGS is, in the CGS, we project the same vector onto the
orthogonormal ones and subtract afterwards. However, in the MGS, the projection and
subtraction are done in an interleaved manner. A three-column example A = [a1 , a2 , a3 ]
is shown in Figure 7 where each step is denoted in a different color. We summarize the
difference between the CGS and MGS processes for obtaining qk via the k-th column ak of
A and the orthonormalized vectors {q1 , q2 , . . . , qk−1 }:
(CGS) : obtain qk by normalizing a⊥ >
k = (I − Qk−1 Qk−1 )ak ;
n h  io
(MGS) : obtain qk by normalizing a⊥
k = (I − q k−1 q >
k−1 ) . . . (I − q 2 q >
2 ) (I − q 1 q >
1 )ak ,

where the parentheses of the MGS indicate the order of the computation.
𝑎3 -(𝑞1 𝑞1𝑇 )𝑎3 − (𝑞2 𝑞2𝑇 )𝑎3 (𝐼 − 𝑞2 𝑞2𝑇 )(𝐼 − 𝑞1 𝑞1𝑇 )𝑎3
(𝐼 − 𝑞1 𝑞1𝑇 )𝑎3
(𝑞1 𝑞1𝑇 )𝑎3 𝑞1 𝑞1
𝑎3 𝑎3

𝑞2’ 𝑞2’

𝑞2 𝑞2
(𝑞2 𝑞2𝑇 )𝑎3
(a) CGS, step 1: blue vector; step 2: green vector; (b) MGS, step 1: blue vector; step 2: purple vector.
step 3: purple vector.

Figure 7: CGS vs MGS in 3-dimensional space where q20 is parallel to q2 so that projecting
on q2 is equivalent to projecting on q20 .

What’s the difference? Taking the three-column matrix A = [a1 , a2 , a3 ] as an example.


Suppose we have computed {q1 , q2 } such that span{q1 , q2 } = span{a1 , a2 }. And we want
to proceed to compute the q3 .

50
Matrix Decomposition and Applications

In the CGS, the orthogonalization of column an against column {q1 , q2 } is performed


by projecting the original column a3 of A onto {q1 , q2 } respectively and subtracting at
once:  ⊥

 a3 = a3 − (q1> a3 )q1 − (q2> a3 )q2

= a3 − (q1 q1> )a3 − (q2 q2> )a3


(3.5)

 ⊥
a3

 q3 =
 ,
||a⊥
3 ||
as shown in Figure 7(a).
In the MGS, on the other hand, the components along each {q1 , q2 } are immediately
subtracted out of rest of the column a3 as soon as the {q1 , q2 } are computed. Therefore the
orthogonalization of column a3 against {q1 , q2 } is not performed by projecting the original
column a3 against {q1 , q2 } as it is in the CGS, but rather against a vector obtained by
subtracting from that column a3 of A the components in the direction of q1 , q2 successively.
This is important because the error components of qi in span{q1 , q2 } will be smaller (we
will further discuss in the next paragraphs).
More precisely, in the MGS the orthogonalization of column a3 against q1 is performed
by subtracting the component of q1 from the vector a3 :
(1)
a3 = (I − q1 q1> )a3 = a3 − (q1 q1> )a3 ,
(1)
where a3 is the component of a3 lies in a space perpendicular to q1 . And further step is
performed by
(2) (1) (1) (1)
a3 = (I − q2 q2> )a3 = a3 − (q2 q2> )a3
(3.6)
(1)
= a3 − (q1 q1> )a3 − (q2 q2> )a3
(2) (1)
where a3 is the component of a3 lies in a space perpendicular to q2 and we highlight
(2)
the difference to the CGS in Equation (3.5) by blue text. This net result is that a3 is the
component of a3 lies in the space perpendicular to {q1 , q2 } as shown in Figure 7(b).
Main difference and catastrophic cancellation The key difference is that the a3 can
in general have large components in span{q1 , q2 } in which case one starts with large values
and ends up with small values with large relative errors in them. This is known as the
(1)
problem of catastrophic cancellation. Whereas a3 is in the direction perpendicular to q1
and has only a small “error” component in the direction of q1 . Compare the boxed text in
(1)
Equation (3.5) and (3.6), it is not hard to see (q2 q2> )a3 in Equation (3.6) is more accurate
by the above argument. And thus, because of the much smaller error in this projection
factor, the MGS introduces less orthogonalization error at each subtraction step than that
is in the CGS. In fact, it can be shown that the final Q obtained in the CGS satisfies
||I − QQ> || ≤ O(κ2 (A)),
where κ(A) is a value larger than 1 determined by A. Whereas, in the MGS, the error
satisfies
||I − QQ> || ≤ O(κ(A)).
That is, the Q obtained in the MGS is more orthogonal.

51
Jun Lu

More to go, preliminaries for Householder and Givens methods Although, we


claimed here the MGS usually works better than the CGS in practice. The MGS can
still fall victim to the catastrophic cancellation problem. Suppose in iteration k of the
MGS algorithm, ak is almost in the span of {q1 , q2 , . . . , qk−1 }. This will result in that a⊥ k
has only a small component that is perpendicular to span{q1 , q2 , . . . , qk−1 }, whereas the
“error” component in the span{q1 , q2 , . . . , qk−1 } will be amplified and the net result is Q
will be less orthonormal. In this case, if we can find a successive set of orthogonal matrices
{Q1 , Q2 , . . . , Ql } such that Ql . . . Q2 Q1 A is triangularized, then Q = (Ql . . . Q2 Q1 )> will
be “more” orthogonal than the CGS or the MGS. We will discuss this method in Section 3.11
and 3.12 via Householder reflectors and Givens rotations.

3.6 Computing the Full QR Decomposition via the Gram-Schmidt Process


A full QR decomposition of an m×n matrix with linearly independent columns goes further
by appending additional m − n orthonormal columns to Q so that it becomes an m × m
orthogonal matrix. In addition, rows of zeros are appended to R so that it becomes an
m × n upper triangular matrix. We call the additional columns in Q silent columns and
additional rows in R silent rows. The comparison between the reduced QR decomposition
and the full QR decomposition is shown in Figure 8 where silent columns in Q are denoted
in gray, blank entries are zero and blue entries are elements that are not necessarily zero.

   

Amn Qmn Rnn Amn Qmm Rmn


(a) Reduced QR decomposition (b) Full QR decomposition

Figure 8: Comparison between the reduced and full QR decompositions.

3.7 Dependent Columns


Previously, we assumed matrix A has linearly independent columns. However, this is not
always necessary. Suppose in step k of CGS or MGS, ak is in the plane spanned by
q1 , q2 , . . . , qk−1 which is equivalent to the space spanned by a1 , a2 , . . . , ak−1 , i.e., vectors
a1 , a2 , . . . , ak are dependent. Then rkk will be zero and qk does not exist because of the
zero division. At this moment, we simply pick qk arbitrarily to be any normalized vector
that is orthogonal to C([q1 , q2 , . . . , qk−1 ]) and continue the Gram-Schmidt process. Again,
for matrix A with dependent columns, we have both reduced and full QR decomposition
algorithms.
This idea can be further extended that when qk does not exist, we just skip the current
steps. And add the silent columns in the end. In this sense, QR decomposition for a matrix
with dependent columns is not unique. However, as long as you stick to a systematic process,
QR decomposition for any matrix is unique. This finding can also help to decide whether
a set of vectors are linearly independent or not. Whenever rkk in CGS or MGS is zero, we

52
Matrix Decomposition and Applications

report the vectors a1 , a2 , . . . , ak are dependent and stop the algorithm for “independent
checking”.

3.8 QR with Column Pivoting: Column-Pivoted QR (CPQR)

Suppose A has dependent columns, a column-pivoted QR (CPQR) decomposition can be


found as follows.

Theorem 3.2: (Column-Pivoted QR Decomposition)


Every m × n matrix A = [a1 , a2 , ..., an ] with m ≥ n and rank r can be factored as
 
R11 R12
AP = Q ,
0 0

where R11 ∈ Rr×r is upper triangular, R12 ∈ Rr×(n−r) , Q ∈ Rm×m is an orthogonal ma-
trix, and P is a permutation matrix. This is also known as the full CPQR decomposition.
Similarly, the reduced version is given by
 
AP = Qr R11 R12 ,

where R11 ∈ Rr×r is upper triangular, R12 ∈ Rr×(n−r) , Qr ∈ Rm×r contains orthonormal
columns, and P is a permutation matrix.

3.8.1 A Simple CPQR via CGS

A Simple CPQR via CGS The classical Gram-Schmidt process can compute this
CPQR decomposition. Following from the QR decomposition for dependent columns that
when rkk = 0, the column k of A is dependent on the previous k − 1 columns. Whenever
this happens, we permute this column into the last column and continue the Gram-Schmidt
process. We notice that P is the permutation matrix that interchanges the dependent
columns into the last n − r columns. Suppose the first r columns of AP are [b
a1 , a
b2, . . . , a
b r ],
and the span of them is just the same as the span of Qr (in the reduced version), or the
span of Q:,:r (in the full version)

C([b
a1 , a b r ]) = C(Qr ) = C(Q:,:r ).
b2, . . . , a

And R12 is a matrix that recovers the dependent n − r columns from the column space of
Qr or column space of Q:,:r . The comparison of reduced and full CPQR decomposition is
shown in Figure 9 where silent columns in Q are denoted in grey, blank entries are zero
and blue/orange entries are elements that are not necessarily zero.

53
Jun Lu

r r
  r   r

APmn Qmr Rrn APmn Qmm Rmn


(a) Reduced CPQR decomposition (b) Full CPQR decomposition

Figure 9: Comparison between the reduced and full CPQR decompositions.

3.8.2 A Practical CPQR via CGS


A Practical CPQR via CGS We notice that the simple CPQR algorithm pivot the
first r independent columns into the first r columns of AP . Let A1 be the first r columns
of AP , and A2 be the rest. Then, from the full CPQR, we have
      
R11 R12 R11 R12
[A1 , A2 ] = Q = Q ,Q .
0 0 0 0

It is not easy to see that


   
R12 R12
||A2 || = Q = = kR12 k ,
0 0

where the penultimate equality comes from the orthogonal equivalence under the matrix
norm. Therefore, the norm of R12 is decided by the norm of A2 . When favoring well-
conditioned CPQR, R12 should be small in norm. And a practical CPQR decomposition is
to permute columns of the matrix A firstly such that the columns are ordered decreasingly
in vector norm:
A
e = AP0 = [aj , aj , . . . , ajn ],
1 2

where {j1 , j2 , . . . , jn } is a permuted index set of {1, 2, . . . , n} and

||aj1 || ≥ ||aj2 || ≥ . . . ≥ ||ajn ||.

Then apply the “simple” reduced CPQR decomposition on Ae such that AP


e 1 = Qr [R11 , R12 ].
The “practical” reduced CPQR of A is then recovered as

A P0 P1 = Qr [R11 , R12 ].
| {z }
P

The further optimization on the CPQR algorithm is via the MGS where the extra bonus
is to stop at a point when the factorization works on a rank deficient submatrix and the
CPQR via this MGS can find the numerical rank (Lu, 2021c). This is known as the partial
factorization and we shall not give the details.

54
Matrix Decomposition and Applications

3.9 QR with Column Pivoting: Revealing Rank One Deficiency


We notice that column-pivoted QR is just one method to find the column permutation
where A is rank deficient and we interchange the first linearly independent r columns of
A into the first r columns of the AP . If A is nearly rank-one deficient and we would
like to find a column permutation of A such that the resulting pivotal element rnn of the
QR decomposition is small. This is known as the revealing rank-one deficiency problem.

Theorem 3.3: (Revealing Rank One Deficiency, (Chan, 1987))


If A ∈ Rm×n and v ∈ Rn is a unit 2-norm vector (i.e., ||v|| = 1), then there exists a
permutation P such that the reduced QR decomposition

AP = QR

satisfies rnn ≤ n where  = ||Av|| and rnn is the n-th diagonal of R. Note that
Q ∈ Rm×n and R ∈ Rn×n in the reduced QR decomposition.

Proof [of Theorem 3.3] Suppose P ∈ Rn×n is a permutation matrix such that if w = P > v
where
|wn | = max |vi |, ∀i ∈ {1, 2, . . . , n}.
That is, the last component of w is equal to the max component of v in absolute value.

Then we have |wn | ≥ 1/ n. Suppose the QR decomposition of AP is AP = QR, then
..
" #
. √
 = ||Av|| = ||(Q> AP )(P > v)|| = ||Rw|| = ≥ |rnn wn | ≥ |rnn |/ n,
rnn wn

This completes the proof.


The following discussion is based on the existence of the singular value decomposition
(SVD) which will be introduced in Section Pn14 (p. 134). Feel free to skip at a first reading.
>
Suppose the SVD of A is given by A = i=1 σi ui vi , where σi ’s are singular values with
σ1 ≥ σ2 ≥ . . . ≥ σn , i.e., σn is the smallest singular value, and ui ’s, vi ’s are the left and
right singular vectors respectively. Then, if we let v = vn such that Avn = σn un , 13 we
have
||Av|| = σn .
By constructing a permutation matrix P such that

|P > v|n = max |vi |, ∀i ∈ {1, 2, . . . , n},



we will find a QR decomposition of A = QR with a pivot rnn smaller than nσn . If A is
rank-one deficient, then σn will be close to 0 and rnn is thus bounded to a small value in
magnitude which is close to 0.
13. We will prove that the right singular vector of A is equal to the right singular vector of R if the A has
QR decomposition A = QR in Lemma 14.11 (p. 140). The claim can also be applied to the singular
values. So vn here is also the right singular vector of R.

55
Jun Lu

3.10 QR with Column Pivoting: Revealing Rank r Deficiency*


Following from the last section, suppose now we want to compute the reduced QR decom-
position where A ∈ Rm×n is nearly rank r deficient with r > 1. Our goal now is to find a
permutation P such that  
L M
AP = QR = Q , (3.7)
0 N
where N ∈ Rr×r and ||N || is small in some norm. A recursive algorithm can be applied
to do so. Suppose we have already isolated a small k × k block Nk , based on which, if we
can isolate a small (k + 1) × (k + 1) block Nk+1 , then we can find the permutation matrix
recursively. To repeat, suppose we have the permutation Pk such that the Nk ∈ Rk×k has
a small norm,  
Lk Mk
APk = Qk Rk = Qk .
0 Nk
We want to find a permutation Pk+1 , such that Nk+1 ∈ R(k+1)×(k+1) also has a small norm,
 
Lk+1 Mk+1
APk+1 = Qk+1 Rk+1 = Qk+1 .
0 Nk+1

From the algorithm introduced in the last section, there is an (n − k) × (n − k) permutation


matrix Pek+1 such that Lk ∈ R(n−k)×(n−k) has the QR decomposition Lk Pek+1 = Q e k+1 L
ek
such that the entry (n − k, n − k) of L
e k is small. By constructing
   
Pek+1 0 Q
e k+1 0
Pk+1 = Pk , Qk+1 = Qk ,
0 I 0 I

we have
e > Mk
 
L
ek Q
k+1
APk+1 = Qk+1 .
0 Nk

We know that entry (n − k, n − k) of L e > Mk


e k is small, if we can prove the last row of Q
k+1
is small in norm, then we find the QR decomposition revealing rank k + 1 deficiency (see
(Chan, 1987) for a proof).

3.11 Existence of the QR Decomposition via the Householder Reflector


We first give the formal definition of a Householder reflector and we will take a look at its
properties.

Definition 3.4: (Householder Reflector)


Let u ∈ Rn be a vector of unit length (i.e., ||u|| = 1). Then H = I − 2uu> is said
to be a Householder reflector, a.k.a., a Householder transformation. We call this H the
Householder reflector associated with the unit vector u where the unit vector u is also
known as Householder vector. If a vector x is multiplied by H, then it is reflected in the
hyperplane span{u}⊥ .

56
Matrix Decomposition and Applications

xx Figure 10: Demonstration of the House-


u holder reflector. The Householder reflector
obtained by H = I − 2uu> where ||u|| = 1
u
xv
will reflect vector x along the plane perpen-
dicular to u: x = xv + xu → xv − xu .

- xu
Hx
Plane perpendicular to u

>
6 1, we can define H = I − 2 uu
Note that if ||u|| = u> u
as the Householder reflector.

Then we have the following corollary from this definition.

Corollary 3.5: (Unreflected by Householder)


Any vector v that is perpendicular to u is left unchanged by the Householder transfor-
mation, that is, Hv = v if u> v = 0.

The proof is trivial that (I − 2uu> )v = v − 2uu> v = v.


Suppose u is a unit vector with ||u|| = 1, and a vector v is perpendicular to u. Then
any vector x on the plane can be decomposed into two parts x = xv + xu : the first
one xu is parallel to u and the second one xv is perpendicular to u (i.e., parallel to v).
From Section 3.1 on the projection of a vector onto another one, xu can be computed by
uu> >
xu = u > u x = uu x. We then transform this x by the Householder reflector associated

with u, Hx = (I − 2uu> )(xv + xu ) = xv − uu> x = xv − xu . That is, the space


perpendicular to u acts as a mirror and any vector x is reflected by the Householder
reflector associated with u (i.e., reflected in the hyperplane span{u}⊥ ). The situation is
shown in Figure 10.
If we know two vectors are reflected to each other, the next corollary tells us how to
find the corresponding Householder reflector.

Corollary 3.6: (Finding the Householder Reflector)


Suppose x is reflected to y by a Householder reflector with ||x|| = ||y||, then the House-
holder reflector is obtained by
x−y
H = I − 2uu> , where u = .
||x − y||

Proof [of Corollary 3.6] Write out the equation, we have

(x − y)(x> − y > )
Hx = x − 2uu> x = x − 2 x
(x − y)> (x − y)
= x − (x − y) = y.

57
Jun Lu

Note that the condition ||x|| = ||y|| is required to prove the result.
The Householder reflectors are useful to set a block of components of a given vector to
zero. Particularly, we usually would like to set the vector a ∈ Rn to be zero except the i-th
element. Then the Householder vector can be chosen to be
a − rei
u= , where r = ±||a||
||a − rei ||

which is a reasonable Householder vector since ||a|| = ||rei || = |r|. We carefully notice that
when r = ||a||, a is reflected to ||a||ei via the Householder reflector H = I − 2uu> ;
otherwise when r = −||a||, ||a|| is reflected to −||a||ei via the Householder reflector.

Remark 3.7: (Householder Properties)


If H is a Householder reflector, then it has the following properties:
• HH = I;
• H = H >;
• H > H = HH > = I such that Householder reflector is an orthogonal matrix;
• Hu = −u, if H = I − 2uu> .

We see in the Gram-Schmidt section that QR decomposition is to use a triangular


matrix to orthogonalize a matrix A. The further idea is that, if we have a set of orthogonal
matrices that can make A to be triangular step by step, then we can also recover the QR
decomposition. Specifically, if we have an orthogonal matrix Q1 that can introduce zeros
to the 1-st column of A except the entry (1,1); and an orthogonal matrix Q2 that can
introduce zeros to the 2-nd column except the entries (2,1), (2,2); . . .. Then, we can also
find the QR decomposition. For the way to introduce zeros, we could reflect the columns
of the matrix to a basis vector e1 whose entries are all zero except the first entry.
Let A = [a1 , a2 , . . . , an ] ∈ Rm×n be the column partition of A, and let
a1 − r1 e1
r1 = ||a1 ||, u1 = , and H1 = I − 2u1 u>
1, (3.8)
||a1 − r1 e1 ||

where e1 here is the first basis for Rm , i.e., e1 = [1; 0; 0; . . . ; 0] ∈ Rm . Then


 
r1 R1,2:n
H1 A = [H1 a1 , H1 a2 , . . . , H1 an ] = , (3.9)
0 B2

which reflects a1 to r1 e1 and introduces zeros below the diagonal in the 1-st column. We
observe that the entries below r1 are all zero now under this specific reflection. Notice that
we reflect a1 to ||a1 ||e1 the two of which have same length, rather than reflect a1 to e1
directly. This is for the purpose of numerical stability.
Choice of r1 : moreover, the choice of r1 is not unique. For numerical stability, it
is also desirable to choose r1 = −sign(a11 )||a1 ||, where a11 is the first component of a1 . Or
even, r1 = sign(a11 )||a1 || is also possible as long as ||a1 || is equal to ||r1 e1 ||. However, we
will not cover this topic here.

58
Matrix Decomposition and Applications

We can then apply this process to B2 in Equation (3.9) to make the entries below the
entry (2,2) to be all zeros. Note that, we do not apply this process to the entire H1 A but
rather the submatrix B2 in it because we have already introduced zeros in the first column,
and reflecting again will introduce nonzero values back.
Suppose B2 = [b2 , b3 , . . . , bn ] is the column partition of B2 , and let
 
b2 − r2 e1 f2 = I − 2u2 u> , 1 0
r2 = ||b2 ||, u2 = , H 2 and H2 = f2 ,
||b2 − r2 e1 || 0 H

where now e1 here is the first basis for Rm−1 and H2 is also an orthogonal matrix since H
f2
is an orthogonal matrix. Then it follows that
 
r1 R12 R1,3:n
H2 H1 A = [H2 H1 a1 , H2 H1 a2 , . . . , H2 H1 an ] =  0 r2 R2,3:n  .
0 0 C3
The same process can go on, and we will finally triangularize A = (Hn Hn−1 . . . H1 )−1 R =
QR. And since the Hi ’s are symmetric and orthogonal, we have Q = (Hn Hn−1 . . . H1 )−1 =
H1 H2 . . . Hn .
An example of a 5 × 4 matrix is shown as follows where  represents a value that is not
necessarily zero, and boldface indicates the value has just been changed.
     
           
 1  0     H2  0    
       
    H

 →  0    →  0 0  
   

     0     0 0  
    0    0 0  
A H1 A H2 H1 A
   
       
 0   
 H4  0   
 
H3 
→  0 0    →  0 0  
  
 0 0 0   0 0 0 
0 0 0  0 0 0 0
H3 H2 H1 A H4 H3 H2 H1 A
A closer look at the QR factorization The Householder algorithm is a process that
makes a matrix triangular by a sequence of orthogonal matrix operations. In the Gram-
Schmidt process (both CGS and MGS), we use a triangular matrix to orthogonalize the
matrix. However, in the Householder algorithm, we use orthogonal matrices to triangularize.
The difference between the two approaches is summarized as follows:
• Gram-Schmidt: triangular orthogonalization;
• Householder: orthogonal triangularization.
We further notice that, in the Householder algorithm or the Givens algorithm that we
will shortly see, a set of orthogonal matrices are applied so that the QR decomposition
obtained is a full QR decomposition. Whereas, the direct QR decomposition obtained by
CGS or MGS is a reduced one (although the silent columns or rows can be further added
to find the full one).

59
Jun Lu

3.12 Existence of the QR Decomposition via the Givens Rotation


We have defined the Givens rotation in Definition 2.11 (p. 41) to find the rank-one up-
date/downdate of the Cholesky decomposition. Now consider the following 2 × 2 orthogonal
matrices      
−c s c −s c s
F = , J= , G= ,
s c s c −s c
where s = sin θ and c = cos θ for some θ. The first matrix has det(F ) = −1 and is a
special case of a Householder reflector in dimension 2 such that F = I − 2uu> where u =
hq q i> h q q i>
1+c 1−c or u = 1+c 1−c . The latter two matrices have det(J ) =
2 , 2 − 2 , − 2
det(G) = 1 and effects a rotation instead of a reflection. Such a matrix is called a Givens
rotation.

y  [ y1 , y2 ]
x  [ x1 , x2 ]

x  [ x1 , x2 ]

  y  [ y1 , y2 ]
 
(a) y = Jx, counter-clockwise rotation. (b) y = Gx, clockwise rotation.

Figure 11: Demonstration of two Givens rotations.

Figure 11 demonstrate a rotation of x under J , where y = J x such that


(
y1 = c · x 1 − s · x 2 ,
y2 = s · x1 + c · x2 .

We want to verify the angle between x and y is actually θ (and counter-clockwise rotation)
after the Givens rotation J as shown in Figure 11(a). Firstly, we have
 x1
 cos(α) = px2 + x2 ,

 (
1 2 cos(θ) = c,
x2 and
 sin(α) = p 2

 . sin(θ) = s.
x1 + x22
This implies cos(θ + α) = cos(θ) cos(α) − sin(θ) sin(α). If we can show cos(θ + α) =
cos(θ) cos(α) − sin(θ) sin(α) is equal to √ y21 2 , then we complete the proof.
y1 +y2
√1 −s·x
For the former one, cos(θ + α) = cos(θ) cos(α) − sin(θ) sin(α) = c·x 2
. For the latter
x21 +x22
√1 −s·x
one, it can be verified that y12 + y22 = x21 + x22 , and √ y21 2 = c·x
p p
2
2
2
. This completes
y1 +y2 x1 +x2
the proof. Similarly, we can also show that the angle between y = Gx and x is also θ in
Figure 11(b) and the rotation is clockwise.

60
Matrix Decomposition and Applications

It can be easily verified that the n-th order Givens rotation (Definition 2.11, p. 41) is
an orthogonal matrix and its determinant is 1. For any vector x = [x1 , x2 , . . . , xn ]> ∈ Rn ,
we have y = Gkl x, where

yk = c · xk + s · xl ,

yl = −s · xk + c · xl ,

yj = x j , (j 6= k, l)

That is, a Givens rotation applied to x rotates


qtwo components of x by some angle θ and
leaves all other components the same. When x2k + x2l 6= 0, let c = √ x2k 2 , s = √ x2 l 2 .
xk +xl xk +xl
Then,  q
y
 k

 = x2k + x2l ,

yl = 0,

yj = x j . (j 6= k, l)

This finding above is essential for the QR decomposition via the Givens rotation.

Corollary 3.8: (Basis From Givens Rotations Forwards)


For any vector x ∈ Rn , there exists a set of Givens rotations {G12 , G13 , . . . , G1n } such
that G1n . . . G13 G12 x = ||x||e1 where e1 ∈ Rn is the first unit basis in Rn .

Proof [of Corollary 3.8] From the finding above, we can find a G12 , G13 , G14 such that
q >
G12 x = 2 2
x1 + x2 , 0, x3 , . . . , xn ,

q >
G13 G12 x = x21 + x22 + x23 , 0, 0, x4 , . . . , xn ,

and q >
G14 G13 G12 x = 2 2 2 2
x1 + x2 + x3 + x4 , 0, 0, 0, x5 , . . . , xn .

Continue this process, we will obtain G1n . . . G13 G12 = ||x||e1 .

Remark 3.9: (Basis From Givens Rotations Backwards)


In Corollary 3.8, we find the Givens rotation that introduces zeros from the 2-nd entry to
th n-th entry (i.e., forward). Sometimes we want the reverse order, i.e., introduce zeros
from the n-th entry to the 2-nd entrysuch that G12 G13 . . . G1n x = ||x||e1 where e1 ∈ Rn
is the first unit basis in Rn .
The procedure is similar, we can find a G1n , G1,(n−1) , G1,(n−2) such that
q >
G1n x = 2 2
x1 + xn , x2 , x3 , . . . , xn−1 , 0 ,

61
Jun Lu

q >
G1,(n−1) G1n x = 2 2 2
x1 + xn−1 + xn , x2 , x3 , . . . , xn−2 , 0, 0 ,

and
q >
G1,(n−2) G1,(n−1) G1n x = 2 2 2 2
x1 + xn−2 + xn−1 + xn , x2 , x3 , . . . , xn−3 , 0, 0, 0 .

Continue this process, we will obtain G12 G13 . . . G1n x = ||x||e1 .


An alternative form Alternatively, there are rotations {G12 , G23 , . . . , G(n−1),n } such
that G12 G23 . . . G(n−1),n x = ||x||e1 where e1 ∈ Rn is the first unit basis in Rn with
 q >
G(n−1),n x = x1 , x2 , . . . , xn−2 , x2n−1 + x2n , 0 ,

 q >
2 2 2
G(n−2),(n−1) G(n−1),n x = x1 , x2 , . . . , xn−3 , xn−2 + xn−1 + xn , 0, 0 ,

and
G(n−3),(n−2) G(n−2),(n−1) G(n−1),n x =
 q >
2 2 2 2
x1 , x2 , . . . , xn−4 , xn−3 + xn−2 + xn−1 + xn , 0, 0, 0 .

Continue this process, we will obtain G12 G23 . . . G(n−1),n x = ||x||e1 .

From the Corollary 3.8 above, for the way to introduce zeros, we could rotate the
columns of the matrix to a basis vector e1 whose entries are all zero except the first entry.
Let A = [a1 , a2 , . . . , an ] ∈ Rm×n be the column partition of A, and let

G1 = G1m . . . G13 G12 , (3.10)

where e1 here is the first basis for Rm , i.e., e1 = [1; 0; 0; . . . ; 0] ∈ Rm . Then


 
||a1 || R1,2:n
G1 A = [G1 a1 , G1 a2 , . . . , G1 an ] = , (3.11)
0 B2

which rotates a1 to ||a1 ||e1 and introduces zeros below the diagonal in the 1-st column.
We can then apply this process to B2 in Equation (3.11) to make the entries below the
(2,2)-th entry to be all zeros. Suppose B2 = [b2 , b3 , . . . , bn ], and let

G2 = G2m . . . G24 G23 ,

where G2n , . . . , G24 , G23 can be implied from context. Then


 
||a1 || R12 R1,3:n
G2 G1 A = [G2 G1 a1 , G2 G1 a2 , . . . , G2 G1 an ] =  0 ||b2 || R2,3:n  .
0 0 C3

62
Matrix Decomposition and Applications

The same process can go on, and we will finally triangularize A = (Gn Gn−1 . . . G1 )−1 R =
QR. And since Gi ’s are orthogonal, we have Q = (Gn Gn−1 . . . G1 )−1 = G> > >
1 G2 . . . Gn ,
and

G> > >


1 G2 . . . Gn = (Gn . . . G2 G1 )
>
 > (3.12)
= (Gnm . . . Gn,(n+1) ) . . . (G2m . . . G23 )(G1m . . . G12 ) .

The Givens rotation algorithm works better when A already has a lot of zeros below
the main diagonal. An example of a 5 × 4 matrix is shown as follows where  represents a
value that is not necessarily zero, and boldface indicates the value has just been changed.

Givens rotations in G1 For a 5 × 4 example, we realize that G1 = G15 G14 G13 G12 .
And the process is shown as follows:
     
        
  
    0    0
  
  G12   G13  
 →  →
        0
  
   
        
  
        
  
A G12 A G13 G12 A
   
       
 0   
 G15  0   
 
G14 
→  0    →  0   
  

 0     0   
    0   
G14 G13 G12 A G15 G14 G13 G12 A

Givens rotation as a big picture Take G1 , G2 , G3 , G4 as a single matrix, we have


     
           
    0     0   
  G1   G2
 
→ →
    0     0 0  
   
    0     0 0  
    0    0 0  
A G1 A G2 G1 A
   
       
 0   
 G4  0    
 
G3 
→ 0 0
    →  0 0  
  
 0 0 0   0 0 0 
0 0 0  0 0 0 0
G3 G2 G1 A G4 G3 G2 G1 A

Orders to introduce the zeros With the Givens rotations for the QR decomposition,
it is flexible to choose different orders to introduce the zeros of R. In our case, we introduce
zeros column by column. It is also possible to introduce zeros row by row.

63
Jun Lu

3.13 Uniqueness of the QR Decomposition


The results of the QR decomposition from the Gram-Schmidt process , the Householder
algorithm, and the Givens algorithms are different. Even in the Householder algorithm, we
have different methods to choose the sign of r1 in Equation (3.8). Thus, from this point,
QR decomposition is not unique.
However, if we use just the procedure described in the Gram-Schmidt process, or system-
atically choose the sign in the Householder algorithm, then the decomposition is unique. The
uniqueness of the reduced QR decomposition for full column rank matrix A is assured when
R has positive diagonals by inductive analysis (Lu, 2021c). We here provide another proof
for the uniqueness of the reduced QR decomposition for matrices if the diagonal values of R
are positive which will shed light on the implicit Q theorem in Hessenberg decomposition
(Section 8.3, p. 94) or tridiagonal decomposition (Theorem 9.1, p. 97).

Corollary 3.10: (Uniqueness of the reduced QR Decomposition)


Suppose matrix A is an m × n matrix with full column rank n and m ≥ n. Then, the
reduced QR decomposition is unique if the main diagonal values of R are positive.

Proof [of Corollary 3.10] Suppose the reduced QR decomposition is not unique, we can
complete it into a full QR decomposition, then we can find two such full decompositions so
that A = Q1 R1 = Q2 R2 which implies R1 = Q−1 −1
1 Q2 R2 = V R2 where V = Q1 Q2 is an
orthogonal matrix. Write out the equation, we have
   
r11 r12 . . . r1n   s11 s12 . . . s1n
v 11 v 12 . . . v1m 

 r22 . . . r2n 
  v21 v22 . . . v2m   s22 . . . s2n 

R1 =  .. ..  =  .. ..  = V R ,
0 0
 
. .   .
  . .
.. .. .
..  
  . .  2
 . . 
 rnn   snn 
vm1 vm2 . . . vmm
0 0 ... 0 0 0 ... 0

This implies
r11 = v11 s11 , v21 = v31 = v41 = . . . = vm1 = 0.

Since V contains mutually orthonormal columns and the first column of V is of norm 1.
Thus, v11 = ±1. We notice that rii > 0 and sii > 0 for i ∈ {1, 2, . . . , n} by assumption such
that r11 > 0 and s11 > 0 and v11 can only be positive 1. Since V is an orthogonal matrix,
we also have
v12 = v13 = v14 = . . . = v1m = 0.

Applying this process to the submatrices of R1 , V , R2 , we will find the upper-left sub-
matrix of V is an identity: V [1 : n, 1 : n] = In such that R1 = R2 . This implies
Q1 [:, 1 : n] = Q2 [:, 1 : n] and leads to a contradiction such that the reduced QR decompo-
sition is unique.

64
Matrix Decomposition and Applications

3.14 LQ Decomposition
We previously proved the existence of the QR decomposition via the Gram-Schmidt process
in which case we are interested in the column space of a matrix A = [a1 , a2 , ..., an ] ∈ Rm×n .
However, in many applications (see (Schilders, 2009)), we are also interested in the row space
of a matrix B = [b> > >
1 ; b2 ; ...; bm ] ∈ R
m×n , where b is the i-th row of B. The successive
i
spaces spanned by the rows b1 , b2 , . . . of B are

C([b1 ]) ⊆ C([b1 , b2 ]) ⊆ C([b1 , b2 , b3 ]) ⊆ . . . .

The QR decomposition thus has its sibling which finds the orthogonal row space. By
applying QR decomposition on B > = Q0 R, we recover the LQ decomposition of the matrix
B = LQ where Q = Q> 0 and L = R .
>

Theorem 3.11: (LQ Decomposition)


Every m × n matrix B (whether linearly independent or dependent rows) with n ≥ m
can be factored as
B = LQ,
where
1. Reduced: L is an m×m lower triangular matrix and Q is m×n with orthonormal
rows which is known as the reduced LQ decomposition;
2. Full: L is an m × n lower triangular matrix and Q is n × n with orthonormal rows
which is known as the full LQ decomposition. If further restrict the lower triangular
matrix to be a square matrix, the full LQ decomposition can be denoted as
 
B = L0 0 Q,

where L0 is an m × m square lower triangular matrix.

Row-pivoted LQ (RPLQ) Similar to the column-pivoted QR in Section 3.8, there exists


a row-pivoted LQ decomposition:
 
L11


 Reduced RPLQ: PB = Qr ;


 L21 |{z}


 | {z } r×n
m×r
 
 L11 0

 Full RPLQ: P B = Q ,


 L21 0 |{z}

 | {z } m×n
m×m

where L11 ∈ Rr×r is lower triangular, Qr or Q1:r,: spans the same row space as B, and P
is a permutation matrix that interchange independent rows into the upper-most rows.

3.15 Two-Sided Orthogonal Decomposition

65
Jun Lu

Theorem 3.12: (Two-Sided Orthogonal Decomposition)


When square matrix A ∈ Rn×n with rank r, the full CPQR, RPLQ of A are given by
   
R11 R12 L11 0
AP1 = Q1 , P2 A = Q2
0 0 L21 0

respectively. Then we would find out


 
R11 L11 + R12 L21 0
AP A = Q1 Q2 ,
0 0
| {z }
rank r

where the first r columns of Q1 span the same column space of A, first r rows of Q2 span
the same row space of A, and P is a permutation matrix. We name this decomposition
as two-sided orthogonal decomposition.

This decomposition is very similar to the property of SVD: A = U ΣV > that the first r
columns of U span the column space of A and the first r columns of V span the row space of
A (we shall see in Lemma 14.8, p. 138). Therefore, the two-sided orthogonal decomposition
can be regarded as an inexpensive alternative in this sense.

Lemma 3.13: (Four Orthonormal Basis)


Given the two-sided orthogonal decomposition of matrix A ∈ Rn×n with rank r: AP A =
U F V > , where U = [u1 , u2 , . . . , un ] and V = [v1 , v2 , . . . , vn ] are the column partitions
of U and V . Then, we have the following property:
• {v1 , v2 , . . . , vr } is an orthonormal basis of C(A> );
• {vr+1 , vr+2 , . . . , vn } is an orthonormal basis of N (A);
• {u1 , u2 , . . . , ur } is an orthonormal basis of C(A);
• {ur+1 , ur+2 , . . . , un } is an orthonormal basis of N (A> ).

3.16 Rank-One Changes


We previously discussed the rank-one update/downdate of the Cholesky decomposition in
Section 2.8 (p. 40). The rank-one change A0 of matrix A in the QR decomposition is defined
in a similar form:
A0 = A + uv > ,
↓ ↓
Q R = QR + uv > ,
0 0

where if we set A0 = A − (−u)v > , we recover the downdate form such that the update or
downdate in the QR decomposition are the same. Let w = Q> u, we have

A0 = Q(R + wv > ).

66
Matrix Decomposition and Applications

From the second form in Remark 3.9 on introducing zeros backwards, there exists a set of
Givens rotations G12 G23 . . . G(n−1),n such that

G12 G23 . . . G(n−1),n w = ±||w||e1 ,

where G(k−1),k is the Givens rotation in plane k − 1 and k that introduces zero in the k-th
entry of w. Apply this rotation to R, we have

G12 G23 . . . G(n−1),n R = H0 ,

where the Givens rotations in this reverse order are useful to transform the upper trian-
gular R into a “simple” upper Hessenberg which is close to upper triangular matrices (see
Definition 8.1 that we will introduce in the Hessenberg decomposition). If the rotations
are transforming w into ±||w||e1 from forward order as in Corollary 3.8, we will not have
this upper Hessenberg H0 . To see this, suppose R ∈ R5×5 , an example is shown as follows
where  represents a value that is not necessarily zero, and boldface indicates the value
has just been changed. The backwards rotations result in the upper Hessenberg H0 which
is relatively simple to handle:
     
              
 0      0    
 34  0    
 
 G45
 0 0    G→
 
 0 0    →
Backwards:    0 0   
   
 0 0 0    0 0 0    0 0   
0 0 0 0  0 0 0   0 0 0  
R G45 R G34 G45 R
   
         
 0    
 G12      
 
G23 
→   0      →  0     .
  
 0 0     0 0   
0 0 0   0 0 0  
G23 G34 G45 R G12 G23 G34 G45 R

And the forward rotations result in a full matrix:


     
              
 G12      G23
0           
  
Forwards: 0 0    →  0 0    
 
 →

    
0 0 0    0 0 0   0 0 0  
0 0 0 0  0 0 0 0  0 0 0 0 
R G12 R G23 G12 R
   
         
         
G34   G45  
→        →
      .
 
           
0 0 0 0      
G34 G23 G12 R G45 G34 G23 G12 R

67
Jun Lu

Generally, the backward rotation results in,


G12 G23 . . . G(n−1),n (R + wv > ) = H0 ± ||w||e1 v > = H,
which is also upper Hessenberg. Similar to triangularization via the Givens rotation in
Section 3.12, there exists a set of rotations J12 , J23 , . . . , J(n−1),n such that
J(n−1),n . . . J23 J12 H = R0 ,
is upper triangular. Following from the 5 × 5 example, the triangularization is shown as
follows
     
              
     J12
0      0    
 J23
H0 ± ||w||e1 v > = 
   
0   → →
  0      0 0   
}  0 0 

0
 
  0     0 0   
| {z
H
0 0 0   0 0 0   0 0 0  
H J12 H J23 J12 H
   
         
0      J45  0 0   
 
J34 
→ 0 0     →  0 0    .
 
0 0 0     0 0 0  
0 0 0   0 0 0 0 
J34 J23 J12 H J45 J34 J23 J12 H
And the QR decomposition of A0 is thus given by
A 0 = Q0 R 0 ,
where ( 0
R = (J(n−1),n . . . J23 J12 )(G12 G23 . . . G(n−1),n )(R + wv > );
>
(3.13)
Q0 = Q (J(n−1),n . . . J23 J12 )(G12 G23 . . . G(n−1),n ) .


3.17 Appending or Deleting a Column


Deleting a column Suppose the QR decomposition of A ∈ Rm×n is given by A = QR
where the column partition of A is A = [a1 , a2 , . . . , an ]. Now, if we delete the k-th column
of A such that A0 = [a1 , . . . , ak−1 , ak+1 , . . . , an ] ∈ Rm×(n−1) . We want to find the QR
decomposition of A0 efficiently. Suppose further R has the following form
"R
11 a R # k−1
12
R= 0 rkk b> 1
0 0 R22 m−k .
k−1 1 n−k

Apparently,  
R11 R12
Q> A0 =  0 b>  = H
0 R22

68
Matrix Decomposition and Applications

is upper Hessenberg. A 6 × 5 example is shown as follows where k = 3:


   
        
 0      0   
   
 0 0     0 0  
 0 0 0   −→  0 0   .
   
   
 0 0 0 0   0 0 0 
0 0 0 0 0 0 0 0 0
R=Q A > H = Q> A 0

Again, for columns k to n − 1 of H, there exists a set of rotations Gk,k+1 , Gk+1,k+2 , . . .,


Gn−1,n that could introduce zeros for the elements hk+1,k , hk+2,k+1 , . . ., hn,n−1 of H. The
the triangular matrix R0 is given by

R0 = Gn−1,n . . . Gk+1,k+2 Gk,k+1 Q> A0 .

And the orthogonal matrix

Q0 = (Gn−1,n . . . Gk+1,k+2 Gk,k+1 Q> )> = QG> > >


k,k+1 Gk+1,k+2 . . . Gn−1,n , (3.14)

such that A0 = Q0 R0 .
Appending a column Similarly, suppose A e = [a1 , ak , w, ak+1 , . . . , an ] where we append
w into the (k + 1)-th column of A. We can obtain

Q> A
e = [Q> a1 , . . . , Q> ak , Q> w, Q> ak+1 , . . . , Q> an ] = H.
f

A set of Givens rotations Jm−1,m , Jm−2,m−1 , . . . , Jk+1,k+2 can introduce zeros for the e
hm,k+1 ,
hm−1,k+1 , . . ., e
e hk+2,k+1 elements of H
f such that

e = Jk+1,k+2 . . . Jm−2,m−1 Jm−1,m Q> A.


R e

Suppose Hf is of size 6 × 5 and k = 2, an example is shown as follows where  represents a


value that is not necessarily zero, and boldface indicates the value has just been changed.
       
                   
 0      0      0      0    
       
 0 0    J56  0 0    J45  0 0    J34  0 0    
→ → → 
 0 0 0   .
 
 0 0  0   0 0  0   0 0  0 
       
0 0  0 0 0 0  0 0  0 0 0 0   0 0 0 0 
0 0  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H
f f→e
J56 H h63 = 0 f→e
J45 J56 H h53 = 0 f→e
J34 J45 J56 H h43 = 0

And finally, the orthogonal matrix


e = (Jk+1,k+2 . . . Jm−2,m−1 Jm−1,m Q> )> = QJ >
Q > >
m−1,m Jm−2,m−1 . . . Jk+1,k+2 , (3.15)

such that A
e=Q
e R.
e

69
Jun Lu

Real world application The method introduced above is useful for the efficient variable
selection in the least squares problem via the QR decomposition. At each time we delete
a column of the data matrix A, and apply an F -test to see if the variable is significant or
not. If not, we will delete the variable and favor a simpler model (Lu, 2021e).

3.18 Appending or Deleting a Row


Appending
  a row Suppose the full QR decomposition of A ∈ Rm×n is given by A =
A1
= QR where A1 ∈ Rk×n and A2 ∈ R(m−k)×n . Now, if we add a row such that
A2
 
A1
A0 = w>  ∈ R(m+1)×n . We want to find the full QR decomposition of A0 efficiently.
A2
Construct a permutation matrix
     >
0 1 0 A1 w
P = Ik 0 0  −→ P w>  =  A1  .
0 0 Im−k A2 A2
Then,    >
1 0 0 w
> PA = =H
0 Q R
is upper Hessenberg. Similarly, a set of rotations G12 , G23 , . . . , Gn,n+1 can be applied to
introduce zeros for the elements h21 , h32 , . . ., hn+1,n of H. The triangular matrix R0 is
given by  
0 1 0
R = Gn,n+1 . . . G23 G12 P A0 .
0 Q>
And the orthogonal matrix
   >  
0 1 0 > 1 0
Q = Gn,n+1 . . . G23 G12 P =P G> > >
12 G23 . . . Gn,n+1 ,
0 Q> 0 Q
such that A0 = Q0 R0 .


A1
Deleting a row Suppose A = w>  ∈ Rm×n where A1 ∈ Rk×n , A2 ∈ R(m−k−1)×n
A2
m×m , R ∈ Rm×n . We
with the full QR decomposition given by A = QR where  Q ∈ R
want to compute the full QR decomposition of Ae = A1 efficiently (assume m − 1 ≥ n).
A2
Analogously, we can construct a permutation matrix
 
0 1 0
P = Ik 0 0 
0 0 Im−k−1
such that     >
0 1 0 A1 w
P A = Ik 0 0  w>  =  A1  = P QR = M R,
0 0 Im−k−1 A2 A2

70
Matrix Decomposition and Applications

where M = P Q is an orthogonal matrix. Let m> be the first row of M , and a set of givens
rotations Gm−1,m , Gm−2,m−1 , . . . , G1,2 introducing zeros for elements mm , mm−1 , . . . , m2 of
m respectively such that G1,2 . . . Gm−2,m−1 Gm−1,m m = αe1 where α = ±1. Therefore, we
have  >
v 1
G1,2 . . . Gm−2,m−1 Gm−1,m R = R1 m − 1 ,

which is upper Hessenberg with R1 ∈ R(m−1)×n being upper triangular. And


 
> > > α 0
M Gm−1,m Gm−2,m−1 . . . G1,2 = ,
0 Q1

where Q1 ∈ R(m−1)×(m−1) is an orthogonal matrix. The bottom-left block of the above ma-
trix is a zero vector since α = ±1 and M is orthogonal. To see this, let G = G> >
m−1,m Gm−2,m−1
. . . G> > > > >
1,2 with the first column being g and M = [m ; m2 ; m3 ; . . . , mm ] being the row par-
tition of M . We have
m> g = ±1 → g = ±m,
m>
i m = 0, ∀i ∈ {2, 3, . . . , m}.
This results in
P A = MR
= (M G> >
m−1,m Gm−2,m−1 . . . G1,2 >)(G1,2 . . . Gm−2,m−1 Gm−1,m R)
  >    >
αv >

α 0 v w
= = = e .
0 Q1 R1 Q1 R1 A
 
A1
This implies Q1 R1 is the full QR decomposition of A =
e .
A2

4. UTV Decomposition: ULV and URV Decomposition


4.1 UTV Decomposition
The UTV decomposition goes further by factoring the matrix into two orthogonal matrices
A = U T V , where U , V are orthogonal, whilst T is (upper/lower) triangular.14 The result-
ing T supports rank estimation. The matrix T can be lower triangular which results in the
ULV decomposition, or it can be upper triangular which results in the URV decomposition.
The UTV framework shares a similar form as the singular value decomposition (SVD, see
Section 14, p. 134) and can be regarded as inexpensive alternatives to the SVD.

Theorem 4.1: (Full ULV Decomposition)

14. These decompositions fall into a category known as the double-sided orthogonal decomposition. We will
see, the UTV decomposition, complete orthogonal decomposition, and singular value decomposition are
all in this notion.

71
Jun Lu

Every m × n matrix A with rank r can be factored as


 
L 0
A=U V,
0 0

where U ∈ Rm×m and V ∈ Rn×n are two orthogonal matrices, and L ∈ Rr×r is a lower
triangular matrix.
The existence of the ULV decomposition is from the QR and LQ decomposition.
Proof [of Theorem 4.1] For any rank r matrix A = [a1 , a2 , . . . , an ], we can use a column
permutation matrix P (Definition 0.17, p. 15) such that the linearly independent columns of
A appear in the first r columns of AP . Without loss of generality, we assume b1 , b2 , . . . , br
are the r linearly independent columns of A and
AP = [b1 , b2 , . . . , br , br+1 , . . . , bn ].
Let Z = [b1 , b2 , . . . , br ] ∈ Rm×r . Since any bi is in the column space of Z, we can find a
E ∈ Rr×(n−r) such that
[br+1 , br+2 , . . . , bn ] = ZE.
That is,  
AP = [b1 , b2 , . . . , br , br+1 , . . . , bn ] = Z Ir E ,
where Ir is an r × r identity matrix. Moreover, m×r has full column rank such that
 Z ∈R
R
its full QR decomposition is given by Z = U , where R ∈ Rr×r is an upper triangular
0
matrix with full rank and U is an orthogonal matrix. This implies
   
  R   R RE
AP = Z Ir E = U Ir E = U . (4.1)
0 0 0
 
 this means R RE also
Since R has full rank, has full rank such that its full LQ decom-
position is given by L 0 V0 where L ∈ Rr×r is a lower triangular matrix and V0 is an
orthogonal matrix. Substitute into Equation (4.1), we have
 
L 0
A=U V P −1 .
0 0 0
Let V = V0 P −1 which is a product of two orthogonal matrices, and is also an orthogonal
matrix. This completes the proof.
A second way to see the proof of the ULV decomposition will be discussed in the proof of
Theorem 4.3 shortly via the rank-revealing QR decomposition and trivial QR decomposi-
tion. Now suppose the ULV decomposition of matrix A is
 
L 0
A=U V.
0 0
Let U0 = U:,1:r and V0 = V1:r,: , i.e., U0 contains only the first r columns of U , and V0
contains only the first r rows of V . Then, we still have A = U0 LV0 . This is known as the
reduced ULV decomposition. Similarly, we can also claim the URV decomposition as
follows.

72
Matrix Decomposition and Applications

Theorem 4.2: (Full URV Decomposition)


Every m × n matrix A with rank r can be factored as
 
R 0
A=U V,
0 0

where U ∈ Rm×m and V ∈ Rn×n are two orthogonal matrices, and R ∈ Rr×r is an upper
triangular matrix.

The proof is just similar to that of ULV decomposition and we shall not give the details.
Again, there is a version of reduced URV decomposition and the difference between the
full and reduced URV can be implied from the context. The ULV and URV sometimes
are referred to as the UTV decomposition framework (Fierro and Hansen, 1997; Golub and
Van Loan, 2013).
We will shortly see that the forms of ULV and URV are very close to the singular value
decomposition (SVD). All of the three factor the matrix A into two orthogonal matrices.
Specially, there exists a set of basis for the four subspaces of A in the fundamental theorem
of linear algebra via the ULV and the URV. Taking ULV as an example, the first r columns
of U form an orthonormal basis of C(A), and the last (m − r) columns of U form an
orthonormal basis of N (A> ). The first r rows of V form an orthonormal basis for the row
space C(A> ), and the last (n − r) rows form an orthonormal basis for N (A) (similar to the
two-sided orthogonal decomposition):


 C(A) = span{u1 , u2 , . . . , ur },

 N (A) = span{vr+1 , vr+2 , . . . , vn },



 C(A> ) = span{v1 , v2 , . . . , vr },
N (A> ) = span{ur+1 , ur+2 , . . . , um }.

The SVD goes further that there is a connection between the two pairs of orthonormal
basis, i.e., transforming from column basis into row basis, or left null space basis into right
null space basis. We will get more details in the SVD section.

4.2 Complete Orthogonal Decomposition


What is related to the UTV decomposition is called the complete orthogonal decomposition
which factors into two orthogonal matrices as well.

Theorem 4.3: (Complete Orthogonal Decomposition)


Every m × n matrix A with rank r can be factored as
 
T 0
A=U V,
0 0

73
Jun Lu

where U ∈ Rm×m and V ∈ Rn×n are two orthogonal matrices, and T ∈ Rr×r is an rank-r
matrix.

Proof [of Theorem 4.3] By rank-revealing QR decomposition (Theorem 3.2, p. 53), A can
be factored as  
> R11 R12
Q1 AP = ,
0 0

where R11 ∈ Rr×r is upper triangular, R12 ∈ Rr×(n−r) , Q1 ∈ Rm×m is an orthogonal


matrix, and P is a permutation matrix.

Then, it is not hard to find a decomposition such that


 >  
R11 S
> = Q2 0 ,
R12
(4.2)

where Q2 is an orthogonal matrix,


 >S is an rank-r matrix. The decomposition is rea-
R11 n×r has rank r of which the columns stay in
sonable in the sense the matrix > ∈R
R12
a subspace of Rn . Nevertheless, the columns of Q2 span the whole space of n
 R> ,where
R11
we can assume the first r columns of Q2 span the same space as that of > . The
R12
   >
S R11
matrix is to transform Q2 into > .
0 R12

Then, it follows that  > 


S 0
Q>
1 AP Q2 = .
0 0

Let U = Q1 , V = Q>
2P
> and T = S > , we complete the proof.

We
 >can
 find that when Equation (4.2) is taken to be the reduced QR decomposition of
R11
> , then the complete orthogonal decomposition reduces to the ULV decomposition.
R12

4.3 Application: Row Rank equals Column Rank Again via UTV
As mentioned above, the UTV framework can prove the important theorem in linear algebra
that the row rank and column rank of a matrix are equal. Notice that to apply the UTV
in the proof, a slight modification on the claim of the existence of the UTV decomposition
needs to be taken care of. For example, in Theorem 4.1, the assumption of the matrix A
is to have rank r. Since rank r already admits the fact that row rank equals column rank.
A better claim here to this aim is to say matrix A has column rank r in Theorem 4.1. See
(Lu, 2021b) for a detailed discussion.
Proof [of Theorem 0.13, p. 12, A Second Way] Any m × n matrix A with rank r can
be factored as  
L 0
A = U0 V ,
0 0 0

74
Matrix Decomposition and Applications

where U0 ∈ Rm×m and V0 ∈ Rn×n are  two orthogonal matrices, and L ∈ R


r×r is a lower

L 0
triangular matrix 15 . Let D = , the row rank and column rank of D are apparently
0 0
the same. If we could prove the column rank of A equals the column rank of D, and the
row rank of A equals the row rank of D, then we complete the proof.
Let U = U0> , V = V0> , then D = U AV . Decompose the above idea into two steps, a
moment of reflexion reveals that, if we could first prove the row rank and column rank of
A are equal to those of U A, and then, if we further prove the row rank and column rank
of U A are equal to those of U AV , we could also complete the proof.

Row rank and column rank of A are equal to those of U A Let B = U A, and let
further A = [a1 , a2 , . . . , an ] and B = [b1 , b2 , . . . , bn ] be the column partitions of A and B.
Therefore, [b1 , b2 , . . . , bn ] = [U a1 , U a2 , . . . , U an ]. If x1 a1 + x2 a2 + . . . + xn an = 0, then
we also have

U (x1 a1 + x2 a2 + . . . + xn an ) = x1 b1 + x2 b2 + . . . + xn bn = 0.

Let j1 , j2 , . . . , jr be distinct indices between 1 and n, if the set {aj1 , aj2 , . . . , ajr } is inde-
pendent, the set {bj1 , bj2 , . . . , bjr } must also be linearly independent. This implies

dim(C(B)) ≤ dim(C(A)).

Similarly, by A = U > B, it follows that

dim(C(A)) ≤ dim(C(B)).

This implies
dim(C(B)) = dim(C(A)).

Apply the process onto B > and A> , we have

dim(C(B > )) = dim(C(A> )).

This implies the row rank and column rank of A and B = U A are the same. Similarly, we
can also show that the row rank and column rank of U A and U AV are the same. This
completes the proof.

15. Instead of using the ULV decomposition,


 in some texts, the authors use elementary transformations
Ir 0
E1 , E2 such that A = E1 E2 , to prove the result.
0 0

75
Jun Lu

Part III
Data Interpretation and Information
Distillation
5. CR Decomposition
CR decomposition is proposed in (Strang, 2021; Strang and Moler, 2021). As usual, we
firstly give the result and we will discuss the existence and the origin of this decomposition
in the following sections.

Theorem 5.1: (CR Decomposition)


Any rank-r matrix A ∈ Rm×n can be factored as

A = C R
m×n m×r r×n

where C is the first r linearly independent columns of A, and R is an r × n matrix to


reconstruct the columns of A from columns of C. In particular, R is the row reduced
echelon form (RREF) of A without the zero rows.
The storage for the decomposition is then reduced or potentially increased from mn to
r(m + n).

5.1 Existence of the CR Decomposition


Since matrix A is of rank r, there are some r linearly independent columns in A. We then
choose linearly independent columns from A and put them into C:
Find r linearly Independent Columns From A

1. If column 1 of A is not zero, put it into the column of C;


2. If column 2 of A is not a multiple of column 1, put it into the column of C;
3. If column 3 of A is not a combination of columns 1 and 2, put it into the column
of C;
4. Continue this process until we find r linearly independent columns (or all the
linearly independent columns if we do not know the rank r beforehand).

When we have the r linearly independent columns from A, we can prove the existence
of CR decomposition by the column space view of matrix multiplication.

Column space view of matrix multiplication A multiplication of two matrices D ∈


Rm×k , E ∈ Rk×n is A = DE = D[e1 , e2 , . . . , en ] = [De1 , De2 , . . . , Den ], i.e., each column
of A is a combination of columns from D.
Proof [of Theorem 5.1] As the rank of matrix A is r and C contains r linearly independent
columns from A, the column space of C is equivalent to the column space of A. If we take
any other column ai of A, ai can be represented as a linear combination of the columns of

76
Matrix Decomposition and Applications

C, i.e., there exists a vector ri such that ai = Cri , ∀i ∈ {1, 2, . . . , n}. Put these ri ’s into
the columns of matrix R, we obtain

A = [a1 , a2 , . . . , an ] = [Cr1 , Cr2 , . . . , Crn ] = CR,

from which the result follows.

5.2 Reduced Row Echelon Form (RREF)


In Gaussian elimination Section 1.1, we introduced the elimination matrix (a lower trian-
gular matrix) and permutation matrix to transform A into an upper triangular form. We
rewrite the Gaussian elimination for a 4 × 4 square matrix, where  represents a value that
is not necessarily zero, and boldface indicates the value has just been changed:
Gaussian Elimination for a Square Matrix

       
               

    E 
1 
 −→ 0 0    P
 −→1 0
     0    .
E2 
 −→

     0     0 0     0 0  
    0    0    0 0 0 
A E1 A P1 E1 A E2 P1 E1 A

Furthermore, the Gaussian elimination can also be applied on a rectangular matrix, we


give an example for a 4 × 5 matrix as follows:
Gaussian Elimination for a Rectangular Matrix
     
2  10 9  2  10 9  2  10 9 
     E1 0 0 5 6
   0 0 5 6 ,
E2  
 −→   −→
     0 0 2   0 0 0 3  
     0 0    0 0 0 0 0
A E1 A E2 E1 A

where the blue-colored numbers are pivots as we defined previously and we call the last
matrix above row echelon form. Note that we get the 4-th row as a zero row in this
specific example. Going further, if we subtract each row by a multiple of the next row to
make the entries above the pivots to be zero:
Reduced Row Echelon Form: Get Zero Above Pivots
     
2  10 9  2  0 −3  2  0 0 
0 0 5 6
0 0 5 6  −→
 E3   E4 0 0 5 0  
 −→
0 0 0 3 ,
  
0 0 0 3   0 0 0 3 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E2 E1 A E3 E2 E1 A E4 E3 E2 E1 A

77
Jun Lu

where E3 subtracts 2 times the 2-nd row from the 1-st row, and E4 adds the 3-rd row to
the 1-st row and subtracts 2 times the 3-rd row from the 2-nd row. Finally, we get the full
row reduced echelon form by making the pivots to be 1:
Reduced Row Echelon Form: Make The Pivots To Be 1
   
2  0 0  1  0 0 
0 0 5 0  E5 0 0 1 0  
0 0 0 3  −→ 0 0 0 1  ,
   

0 0 0 0 0 0 0 0 0 0
E4 E3 E2 E1 A E5 E4 E3 E2 E1 A

where E5 makes the pivots to be 1. Note here, the transformation matrix E1 , E2 , . . . , E5


are not necessarily to be lower triangular matrices as they are in LU decomposition. They
can also be permutation matrices or other matrices. We call this final matrix the reduced
row echelon form of A where it has 1’s as pivots and zeros above the pivots.

Lemma 5.2: (Rank and Pivots)


The rank of A is equal to the number of pivots.

Lemma 5.3: (RREF in CR)


The reduced row echelon form of the matrix A without zero rows is the matrix R in the
CR decomposition.

In short, we first compute the reduced row echelon form of matrix A by rref (A), Then
C is obtained by removing from A all the non-pivot columns (which can be determined
by looking for columns in rref (A) which do not contain a pivot). And R is obtained by
eliminating zero rows of rref (A). And this is actually a special case of rank decompo-
sition of matrix A. However, CR decomposition is so special that it involves the reduced
row echelon form so that we introduce it here particularly.
R has a remarkable form whose r columns containing the pivots form an r × r identity
matrix. Note again that we can just remove the zero rows from the row reduced echelon
form to obtain this matrix R. In (Strang, 2021), the authors give a specific notation for
the row reduced echelon form without removing the zero rows as R0 :
   
R Ir F
R0 = rref (A) = = P , 16
0 0 0

where the n × n permutation matrix P puts the columns of r × r identity matrix Ir into the
correct positions, matching the first r linearly independent columns of the original matrix
A.
The CR decomposition reveals a great theorem of linear algebra that the row rank equals
the column rank of any matrix.
16. Permutation matrix P in the right side of a matrix is to permute the column of that matrix.

78
Matrix Decomposition and Applications

Proof [of Theorem 0.13, A Third Way] For CR decomposition of matrix A = CR,
we have R = [Ir , F ]P , where P is an n × n permutation to put the columns of the r × r
identity matrix Ir into the correct positions as shown above. It can be easily verified that
the r rows of R are linearly independent of the submatrix of Ir (since Ir is nonsingular)
such that the row rank of R is r.
Firstly, from the definition of the CR decomposition, the r columns of C are from r
linearly independent columns of A, the column rank of A is r. Further,
• Since A = CR, all rows of A are combinations of the rows of R. That is, the row
rank of A is no larger than the row rank of R;
• From A = CR, we also have (C > C)−1 C > CR = (C > C)−1 C > A, that is R =
(C > C)−1 C > A. C > C is nonsingular since it has full column rank r. Then all rows of
R are also combinations of the rows of A. That is, the row rank of R is no larger than the
row rank of A;
• By “sandwiching”, the row rank of A is equal to the row rank of R which is r.
Therefore, both the row rank and column rank of A are equal to r from which the result
follows.

5.3 Rank Decomposition


We previously mentioned that the CR decomposition is a special case of rank decomposition.
Formally, we prove the existence of the rank decomposition rigorously in the following
theorem.

Theorem 5.4: (Rank Decomposition)


Any rank-r matrix A ∈ Rm×n can be factored as

A = D F ,
m×n m×r r×n

where D ∈ Rm×r has rank r, and F ∈ Rr×n also has rank r, i.e., D, F have full rank r.
The storage for the decomposition is then reduced or potentially increased from mn
to r(m + n).

Proof [of Theorem 5.4] By ULV decomposition in Theorem 4.1 (p. 71), we can decompose
A by  
L 0
A=U V.
0 0
Let U0 = U:,1:r and V0 = V1:r,: , i.e., U0 contains only the first r columns of U , and V0
contains only the first r rows of V . Then, we still have A = U0 LV0 where U0 ∈ Rm×r and
V0 ∈ Rr×n . This is also known as the reduced ULV decomposition. Let {D = U0 L and
F = V0 }, or {D = U0 and F = LV0 }, we find such rank decomposition.
The rank decomposition is not unique. Even by elementary transformations, we have
 
Z 0
A = E1 E2 ,
0 0

79
Jun Lu

where E1 ∈ Rm×m , E2 ∈ Rn×n represent elementary row and column operations, Z ∈ Rr×r .
The transformation is rather general, and there are dozens of these E1 , E2 , Z. Similar
construction on this decomposition as shown in the above proof, we can recover another
rank decomposition.
Analogously, we can find such D, F by SVD, URV, CR, CUR, and many other decom-
positional algorithms. However, we may connect the different rank decompositions by the
following lemma.

Lemma 5.5: (Connection Between Rank Decompositions)


For any two rank decompositions of A = D1 F1 = D2 F2 , there exists a nonsingular matrix
P such that
D1 = D2 P and F1 = P −1 F2 .

Proof [of Lemma 5.5] Since D1 F1 = D2 F2 , we have D1 F1 F1> = D2 F2 F1> . It is trivial


that rank(F1 F1> ) = rank(F1 ) = r such that F1 F1> is a square matrix with full rank and
thus is nonsingular. This implies D1 = D2 F2 F1> (F1 F1> )−1 . Let P = F2 F1> (F1 F1> )−1 , we
have D1 = D2 P and F1 = P −1 F2 .

5.4 Application: Rank and Trace of an Idempotent Matrix


The CR decomposition is quite useful to prove the rank of an idempotent matrix. See also
how it works in the orthogonal projection in (Lu, 2021c,e).

Lemma 5.6: (Rank and Trace of an Idempotent Matrix)


For any n × n idempotent matrix A (i.e., A2 = A), the rank of A equals the trace of A.

Proof [of Lemma 5.6] Any n × n rank-r matrix A has CR decomposition A = CR, where
C ∈ Rn×r and R ∈ Rr×n with C, R having full rank r. Then,
A2 = A,
CRCR = CR,
RCR = R,
RC = Ir ,
where Ir is an r × r identity matrix. Thus

trace(A) = trace(CR) = trace(RC) = trace(Ir ) = r,

which equals the rank of A. The equality above is from the invariant of cyclic permutation
of trace.

6. Skeleton/CUR Decomposition

80
Matrix Decomposition and Applications

Theorem 6.1: (Skeleton Decomposition)


Any rank-r matrix A ∈ Rm×n can be factored as

A = C U −1 R,
m×n m×r r×r r×n

where C is some r linearly independent columns of A, R is some r linearly independent


rows of A and U is the nonsingular submatrix on the intersection.
• The storage for the decomposition is then reduced or potentially increased from mn
floats to r(m + n) + r2 floats.
• Or further, if we only record the position of the indices, it requires mr, nr floats for
storing C, R respectively and extra 2r integers to remember the position of each
column of C in that of A and each row of R in that of A (i.e., construct U from
C, R).

Skeleton decomposition is also known as the CUR decomposition follows from the nota-
tion in the decomposition. The illustration of skeleton decomposition is shown in Figure 12
where the yellow vectors denote the linearly independent columns of A and green vectors
denote the linearly independent rows of A. In case A is square and invertible, we have
skeleton decomposition A = CU −1 R where C = R = U = A such that the decomposition
reduces to A = AA−1 A. Specifically, if I, J index vectors both with size r that contain the
indices of rows and columns selected from A into R and C respectively, U can be denoted
as U = A[I, J].

1
  

Amn Cmr U rr1 Rrn


Figure 12: Demonstration of skeleton decomposition of a matrix.

6.1 Existence of the Skeleton Decomposition


In Corollary 0.13, we proved the row rank and the column rank of a matrix are equal. In
another word, we can also claim that the dimension of the column space and the dimension
of the row space are equal. This property is essential for the existence of the skeleton
decomposition.
We are then ready to prove the existence of the skeleton decomposition. The proof is
rather elementary.

81
Jun Lu

Proof [of Theorem 6.1] The proof relies on the existence of such nonsingular matrix U
which is central to this decomposition method.
Existence of such nonsingular matrix U Since matrix A is rank-r, we can pick r
columns from A so that they are linearly independent. Suppose we put the specific r inde-
pendent columns ai1 , ai2 , . . . , air into the columns of an m×r matrix N = [ai1 , ai2 , . . . , air ] ∈
Rm×r . The dimension of the column space of N is r so that the dimension of the row
space of N is also r by Corollary 0.13. Again, we can pick r linearly independent rows
n> > >
j1 , nj2 , . . . , njr from N and put the specific r rows into rows of an r × r matrix U =
[n> > >
j1 ; nj2 ; . . . ; njr ] ∈ R
r×r . Using Corollary 0.13 again, the dimension of the column space

of U is also r which means there are the r linearly independent columns from U . So U is
such a nonsingular matrix with size r × r.
Main proof As long as we find the nonsingular r × r matrix U inside A, we can find the
existence of the skeleton decomposition as follows.
Suppose U = A[I, J] where I, J are index vectors of size r. Since U is a nonsingular
matrix, the columns of U are linearly independent. Thus the columns of matrix C based on
the columns of U are also linearly independent (i.e., select the r columns of A with the same
entries of the matrix U . Here C is equal to the N we construct above and C = A[:, J]).
As the rank of the matrix A is r, if we take any other column ai of A, ai can be
represented as a linear combination of the columns of C, i.e., there exists a vector x such
that ai = Cx, for all i ∈ {1, 2, . . . , n}. Let r rows of ai corresponding to the row entries of
U be ri ∈ Rr for all i ∈ {1, 2, . . . , n} (i.e., ri contains r entries of ai ). That is, select the r
entries of ai ’s corresponding to the entries of U as follows:
A = [a1 , a2 , . . . , an ] ∈ Rm×n −→ A[I, :] = [r1 , r2 , . . . , rn ] ∈ Rr×n .
Since ai = Cx, U is a submatrix inside C, and ri is a subvector inside ai , we have ri = U x
which is equivalent to x = U −1 ri . Thus for every i, we have ai = CU −1 ri . Combining the
n columns of such ri into R = [r1 , r2 , . . . , rn ], we obtain
A = [a1 , a2 , . . . , an ] = CU −1 R,
from which the result follows.
In short, we first find r linearly independent columns of A into C ∈ Rm×r . From
C, we find an r × r nonsingular submatrix U . The r rows of A corresponding to entries
of U can help to reconstruct the columns of A. Again, the situation is shown in Figure 12.
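To make the construction above concrete, the following is a minimal NumPy/SciPy sketch (an illustration, not a reference implementation) that selects r independent columns and rows by column-pivoted QR — one practical heuristic among several — and verifies A = CU −1 R on a small random rank-r matrix. The helper name skeleton and the random test matrix are our own assumptions.

    import numpy as np
    from scipy.linalg import qr

    def skeleton(A, r):
        """Illustrative CUR/skeleton factorization of a rank-r matrix A."""
        # Pick r linearly independent columns via column-pivoted QR of A.
        _, _, col_piv = qr(A, pivoting=True)
        J = np.sort(col_piv[:r])               # column indices -> C = A[:, J]
        # Pick r linearly independent rows via column-pivoted QR of A^T.
        _, _, row_piv = qr(A.T, pivoting=True)
        I = np.sort(row_piv[:r])               # row indices    -> R = A[I, :]
        C, R, U = A[:, J], A[I, :], A[np.ix_(I, J)]
        return C, U, R

    # Small test on a random rank-3 matrix.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((7, 3)) @ rng.standard_normal((3, 5))    # rank 3
    C, U, R = skeleton(A, r=3)
    print(np.allclose(A, C @ np.linalg.inv(U) @ R))                  # True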

CR decomposition vs skeleton decomposition We note that the CR decomposition and the
skeleton decomposition share a similar form; even the symbols used are alike: A = CR for the
CR decomposition and A = CU −1 R for the skeleton decomposition.
Both in the CR decomposition and the skeleton decomposition, we can select the first
r independent columns to obtain the matrix C (the symbol for both the CR decomposition
and the skeleton decomposition). So C’s in the CR decomposition and the skeleton decom-
position are exactly the same. In contrast, R in the CR decomposition is the reduced
row echelon form without the zero rows, whereas R in the skeleton decomposition is exactly
some rows from A so that R’s have different meanings in the two decompositional methods.


A word on the uniqueness of CR decomposition and skeleton decomposition


As mentioned above, both in the CR decomposition and the skeleton decomposition, we
select the first r linearly independent columns to obtain the matrix C. In this sense, the CR
and skeleton decompositions have a unique form. However, if we select the last r linearly
independent columns, we will get a different CR decomposition or skeleton decomposition.
We will not discuss this situation here as it is not the main interest of this text.
To repeat, in the above proof for the existence of the skeleton decomposition, we first
find the r linearly independent columns of A into the matrix C. From C, we find an r × r
nonsingular submatrix U . From the submatrix U , we finally find the final row submatrix
R ∈ Rr×n . A further question can be posed that if matrix A has rank r, matrix C contains
r linearly independent columns, and matrix R contains r linearly independent rows, then
whether the r × r “intersection” of C and R is invertible or not 17 .

Corollary 6.2: (Nonsingular Intersection)


If matrix A ∈ Rm×n has rank r, matrix C contains r linearly independent columns, and
matrix R contains r linearly independent rows, then the r × r “intersection” matrix U of
C and R is invertible.

Proof [of Corollary 6.2] If I, J are the indices of rows and columns selected from A into
R and C respectively, then, R can be denoted as R = A[I, :], C can be represented as
C = A[:, J], and U can be denoted as U = A[I, J].
Since C contains r linearly independent columns of A, any column ai of A can be
represented as ai = Cxi = A[:, J]xi for all i ∈ {1, 2, . . . , n}. This implies the r entries
of ai corresponding to the I indices can be represented by the columns of U such that
ai [I] = U xi ∈ Rr for all i ∈ {1, 2, . . . , n}, i.e.,

ai = Cxi = A[:, J]xi ∈ Rm −→ ai [I] = A[I, J]xi = U xi ∈ Rr .

Since R contains r linearly independent rows of A, the row rank and column rank of R are
equal to r. Combining the facts above, the r columns of R corresponding to indices J (i.e.,
the r columns of U ) are linearly independent.
Again, by applying Corollary 0.13, the dimension of the row space of U is also equal to
r which means there are the r linearly independent rows from U , and U is invertible.

7. Interpolative Decomposition (ID)


Column interpolative decomposition (ID) factors a matrix as the product of two matrices,
one of which contains selected columns from the original matrix, and the other of which
contains a subset of columns that forms an identity matrix and whose entries are no greater
than 1 in absolute value. Formally, we have the following theorem describing the details of
the column ID.

17. We thank Gilbert Strang for raising this interesting question.


Theorem 7.1: (Column Interpolative Decomposition)


Any rank-r matrix A ∈ Rm×n can be factored as

A = C W,    with sizes (m×n) = (m×r)(r×n),

where C ∈ Rm×r is some r linearly independent columns of A, W ∈ Rr×n is the ma-


trix to reconstruct A which contains an r × r identity submatrix (under a mild column
permutation). Specifically, entries in W have values no larger than 1 in magnitude:

max |wij | ≤ 1, ∀ i ∈ [1, r], j ∈ [1, n].

The storage for the decomposition is then reduced or potentially increased from mn floats
to mr, (n − r)r floats for storing C, W respectively and extra r integers are required to
remember the position of each column of C in that of A.

 

Amn Cmr Wrn


Figure 13: Demonstration of the column ID of a matrix where the yellow vector denotes
the linearly independent columns of A, white entries denote zero, and purple entries denote
one.

The illustration of the column ID is shown in Figure 13 where the yellow vectors denote
the linearly independent columns of A and the purple vectors in W form an r × r identity
submatrix. The positions of the purple vectors inside W are exactly the same as the
positions of the corresponding yellow vectors inside A. The column ID is very similar to
the CR decomposition (Theorem 5.1, p. 76), both select r linearly independent columns into
the first factor and the second factor contains an r × r identity submatrix. The difference is
in that the CR decomposition will exactly choose the first r linearly independent columns
into the first factor and the identity submatrix appears in the pivots (Definition 1.7, p. 19).
And more importantly, the second factor in the CR decomposition comes from the RREF
(Lemma 5.3, p. 78). Therefore, the column ID can also be utilized in the applications of
the CR decomposition, say proving the fact of rank equals trace in idempotent matrices
(Lemma 5.6, p. 80), and proving the elementary theorem in linear algebra that column
rank equals row rank of a matrix (Corollary 0.13, p. 12). Moreover, the column ID is also a
special case of rank decomposition (Theorem 5.4, p. 79) and is apparently not unique. The
connection between different column IDs is given by Lemma 5.5 (p. 80).


Notations that will be extensively used in the sequel Following again the Matlab-
style notation, if Js is an index vector with size r that contains the indices of columns
selected from A into C, then C can be denoted as C = A[:, Js ]. The matrix C contains
“skeleton” columns of A, hence the subscript s in Js . From the “skeleton” index vector Js ,
the r × r identity matrix inside W can be recovered by

W [:, Js ] = Ir ∈ Rr×r .

Suppose further we put the remaining indices of A into an index vector Jr where

Js ∩ Jr = ∅ and Js ∪ Jr = {1, 2, . . . , n}.

The remaining n − r columns in W form an r × (n − r) expansion matrix since this
matrix contains the expansion coefficients needed to reconstruct the columns of A from C:

E = W [:, Jr ] ∈ Rr×(n−r) ,

where the entries of E are known as the expansion coefficients. Moreover, let P ∈ Rn×n
be a (column) permutation matrix (Definition 0.17, p. 15) defined by P = In [:, (Js , Jr )] so
that
AP = A[:, (Js , Jr )] = [C, A[:, Jr ]] ,
and
W P = W [:, (Js , Jr )] = [Ir , E],   which leads to   W = [Ir , E] P > . (7.1)
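As a quick illustration of this index bookkeeping, the following NumPy snippet builds P and W from a toy choice of Js and E (both chosen arbitrarily here, not derived from any particular A) and checks that W [:, Js ] recovers the identity.

    import numpy as np

    # Toy setting: suppose the column ID has already selected columns Js of A.
    n, r = 5, 2
    Js = np.array([1, 3])                         # "skeleton" column indices (example)
    Jr = np.array([j for j in range(n) if j not in Js])

    E = np.array([[0.5, -0.2, 1.0],               # example expansion coefficients,
                  [0.3,  0.7, -1.0]])             # |E_ij| <= 1 for a well-chosen ID

    # P = I_n[:, (Js, Jr)] and W = [I_r, E] P^T, so that W[:, Js] = I_r.
    P = np.eye(n)[:, np.concatenate([Js, Jr])]
    W = np.hstack([np.eye(r), E]) @ P.T
    print(np.allclose(W[:, Js], np.eye(r)))       # True: identity submatrix recovered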

7.1 Existence of the Column Interpolative Decomposition


Cramer’s rule The proof of the existence of the column ID relies on Cramer’s rule,
which we shall briefly discuss here. Consider a system of n linear equations for n unknowns,
represented in matrix multiplication form as follows :

M x = l,

where M ∈ Rn×n is nonsingular and x, l ∈ Rn . Then the theorem states that in this case,
the system has a unique solution, whose individual values for the unknowns are given by:
xi = det(Mi ) / det(M ),   for all i ∈ {1, 2, . . . , n},
where Mi is the matrix formed by replacing the i-th column of M with the column vector
l. In full generality, the Cramer’s rule considers the matrix equation

M X = L,

where M ∈ Rn×n is nonsingular and X, L ∈ Rn×m . Let I = [i1 , i2 , . . . , ik ] and J =


[j1 , j2 , . . . , jk ] be two index vectors where 1 ≤ i1 ≤ i2 ≤ . . . ≤ ik ≤ n and 1 ≤ j1 ≤ j2 ≤
. . . ≤ jk ≤ n. Then X[I, J] is a k × k submatrix of X. Let further ML (I, J) be the n × n
matrix formed by replacing the is -th column of M by the js -th column of L for all s ∈ {1, 2, . . . , k}.
Then
det(X[I, J]) = det (ML (I, J)) / det(M ).


When I, J are of size 1, it follows that


xij = det (ML (i, j)) / det(M ). (7.2)
Now we are ready to prove the existence of the column ID.
Proof [of Theorem 7.1] We have mentioned above the proof relies on the Cramer’s rule.
If we can show the entries of W can be denoted by the Cramer’s rule equality in Equa-
tion (7.2) and the numerator is smaller than the denominator, then we can complete the
proof. However, we notice that the matrix in the denominator of Equation (7.2) is a square
matrix. Here comes the trick.
Step 1: column ID for full row rank matrix For a start, we first consider the full
row rank matrix A (which implies r = m, m ≤ n, and A ∈ Rr×n such that the matrix
C ∈ Rr×r is a square matrix in the column ID A = CW that we want). Determine the
“skeleton” index vector Js by

Js = arg maxJ {| det(A[:, J])| : J is a subset of {1, 2, . . . , n} with size r = m} , (7.3)

i.e., Js is the index vector that is determined by maximizing the magnitude of the determi-
nant of A[:, J]. As we have discussed in the last section, there exists a (column) permutation
matrix such that
AP = [A[:, Js ], A[:, Jr ]] .
Since C = A[:, Js ] has full column rank r = m, it is then nonsingular. The above equation
can be rewritten as
A = [A[:, Js ], A[:, Jr ]] P >
  = A[:, Js ] [Ir , A[:, Js ]−1 A[:, Jr ]] P >
  = C [Ir , C −1 A[:, Jr ]] P > ,
where the matrix W is given by W = [Ir , C −1 A[:, Jr ]] P > = [Ir , E] P > by Equation (7.1).
To prove the claim that the magnitude of W is no larger than 1 is equivalent to proving
that entries in E = C −1 A[:, Jr ] ∈ Rr×(n−r) are no greater than 1 in absolute value.
Define the index vector [j1 , j2 , . . . , jn ] as a permutation of [1, 2, . . . , n] such that
[j1 , j2 , . . . , jn ] = [1, 2, . . . , n]P = [Js , Jr ].18
Thus, it follows from CE = A[:, Jr ] that
[aj1 , aj2 , . . . , ajr ] E = [ajr+1 , ajr+2 , . . . , ajn ],
where ai is the i-th column of A, the left matrix is C = A[:, Js ], and the right matrix is
denoted B := A[:, Jr ]. Therefore, by Cramer’s rule in
Equation (7.2), we have
Ekl = det (CB (k, l)) / det (C), (7.4)
18. Note here [j1 , j2 , . . . , jn ], [1, 2, . . . , n], Js , and Jr are row vectors.


where Ekl is the entry (k, l) of E and CB (k, l) is the r × r matrix formed by replacing the
k-th column of C by the l-th column of B. For example,
E11 = det([ajr+1 , aj2 , . . . , ajr ]) / det([aj1 , aj2 , . . . , ajr ]),   E12 = det([ajr+2 , aj2 , . . . , ajr ]) / det([aj1 , aj2 , . . . , ajr ]),
E21 = det([aj1 , ajr+1 , . . . , ajr ]) / det([aj1 , aj2 , . . . , ajr ]),   E22 = det([aj1 , ajr+2 , . . . , ajr ]) / det([aj1 , aj2 , . . . , ajr ]).

Since Js is chosen to maximize the magnitude of det(C) in Equation (7.3), it follows that

|Ekl | ≤ 1, for all k ∈ {1, 2, . . . , r}, l ∈ {1, 2, . . . , n − r}.

Step 2: apply to general matrices To summarize what we have proved above (abusing
notation slightly): for any matrix F ∈ Rr×n with full row rank r ≤ n, the column ID exists
such that F = C0 W where the values in W are no greater than 1 in absolute value.
Applying this finding to a general matrix A ∈ Rm×n with rank r ≤ min{m, n}, it is trivial
that the matrix A admits a rank decomposition (Theorem 5.4, p. 79):

A = D F ,    with sizes (m×n) = (m×r)(r×n),

where D, F have full column rank r and full row rank r, respectively. Consider the column ID
of F = C0 W where C0 = F [:, Js ] contains r linearly independent columns of F . We notice
from A = DF that
A[:, Js ] = DF [:, Js ],
i.e., the columns of (DF ) indexed by Js can be obtained by DF [:, Js ], which in turn are
the columns of A indexed by Js . This makes C = A[:, Js ] equal to DC0 = DF [:, Js ]. And

A = DF = DC0 W = DF [:, Js ] W = CW .

This completes the proof.

The above proof reveals an intuitive way to compute the optimal column ID of matrix
A. However, any algorithm that is guaranteed to find such an optimally-conditioned fac-
torization must have combinatorial complexity (Martinsson, 2019). Therefore, randomized
algorithms, approximation by column-pivoted QR (Section 3.8, p. 53) and rank-revealing
QR (Section 3.10, p. 56) are applied to find a relatively well-conditioned decomposition for
the column ID where W is small in norm rather than having entries all smaller than 1 in
magnitude. See (Lu, 2021c) for more details.
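As a sketch of the practical route just mentioned, the following uses SciPy's column-pivoted QR to form an (approximately well-conditioned) column ID. The function name column_id and the random test matrix are our own; note that with pivoted QR the entries of W are only guaranteed to be modest in size, not strictly bounded by 1.

    import numpy as np
    from scipy.linalg import qr

    def column_id(A, r):
        """Approximate column ID A ~= C @ W via column-pivoted QR (rank-r A)."""
        Q, R, piv = qr(A, mode='economic', pivoting=True)
        Js = piv[:r]                       # skeleton column indices
        C = A[:, Js]                       # C = A[:, Js]
        # Solve R[:r,:r] T = R[:r,:] for the coefficients in pivoted order,
        # then undo the pivoting so that W[:, Js] = I_r.
        T = np.linalg.solve(R[:r, :r], R[:r, :])
        W = np.empty_like(T)
        W[:, piv] = T
        return C, W, Js

    rng = np.random.default_rng(1)
    A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))    # rank 3
    C, W, Js = column_id(A, r=3)
    print(np.allclose(A, C @ W), np.allclose(W[:, Js], np.eye(3)))   # True True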


7.2 Row ID and Two-Sided ID


We term the decomposition above as column ID. This is no coincidence since it has its
siblings:

Theorem 7.2: (The Whole Interpolative Decomposition)


Any rank-r matrix A ∈ Rm×n can be factored as

Column ID:     A = C W ,    with sizes (m×n) = (m×r)(r×n);
Row ID:        A = Z R ,    with sizes (m×n) = (m×r)(r×n);
Two-Sided ID:  A = Z U W ,  with sizes (m×n) = (m×r)(r×r)(r×n),

where
• C = A[:, Js ] ∈ Rm×r is some r linearly independent columns of A, W ∈ Rr×n is
the matrix to reconstruct A which contains an r × r identity submatrix (under a
mild column permutation): W [:, Js ] = Ir ;
• R = A[Is , :] ∈ Rr×n is some r linearly independent rows of A, Z ∈ Rm×r is the
matrix to reconstruct A which contains an r × r identity submatrix (under a mild
row permutation): Z[Is , :] = Ir ;
• Entries in W , Z have values no larger than 1 in magnitude: max |wij | ≤ 1 and
max |zij | ≤ 1;
• U = A[Is , Js ] ∈ Rr×r is the nonsingular submatrix on the intersection of C, R;
• The three matrices C, R, U in the boxed texts share same notation as the skeleton
decomposition (Theorem 6.1, p. 81) where they even have same meanings such that
the three matrices make the skeleton decomposition of A: A = CU −1 R.

The proof of the row ID is just similar to that of the column ID. Suppose the column ID
of A> is given by A> = C0 W0 where C0 contains r linearly independent columns of A>
(i.e., r linearly independent rows of A). Let R = C0 , Z = W0 , the row ID is obtained by
A = ZR.
For the two-sided ID, recall from the skeleton decomposition (Theorem 6.1, p. 81). When
U is the intersection of C, R, it follows that A = CU −1 R. Thus CU −1 = Z by the row
ID. And this implies C = ZU . By column ID, it follows that A = CW = ZU W which
proves the existence of the two-sided ID.
Data storage For the data storage of each ID, we summarize as follows
• Column ID. It requires mr and (n − r)r floats to store C and W respectively , and
r integers to store the indices of the selected columns in A;
• Row ID. It requires nr and (m − r)r floats to store R and Z respectively, and r
integers to store the indices of the selected rows in A;
• Two-Sided ID. It requires (m − r)r, (n − r)r, and r2 floats to store Z, W , and U
respectively. And extra 2r integers are required to store the indices of the selected
rows and columns in A.


Further reduction on the storage for two-sided ID for sparse matrix A Suppose
the column ID of A is A = CW where C = A[:, Js ], and a good spanning row index set Is of
C could be found:
A[Is , :] = C[Is , :]W .
We observe that C[Is , :] = A[Is , Js ] ∈ Rr×r , which is nonsingular (since it has full rank r in
the sense of both row rank and column rank). It follows that

W = (A[Is , Js ])−1 A[Is , :].

Therefore, there is no need to store the matrix W explicitly. We only need to store A[Is , :]
and (A[Is , Js ])−1 . Or when we can compute the inverse of A[Is , Js ] on the fly, it only
requires r integers to store Js and recover A[Is , Js ] from A[Is , :]. The storage of A[Is , :] is
cheap if A is sparse.
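A small numerical illustration of this storage trick follows (a sketch; the index selection by pivoted QR is only an example choice, and any valid Is , Js work): W is recovered on the fly from A[Is , Js ] and A[Is , :].

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))    # rank r = 3
    r = 3

    # Skeleton column and row indices (here chosen by pivoted QR, as an example).
    _, _, cp = qr(A, pivoting=True);   Js = cp[:r]
    _, _, rp = qr(A.T, pivoting=True); Is = rp[:r]

    # W need not be stored: W = (A[Is, Js])^{-1} A[Is, :], recovered on the fly.
    W = np.linalg.solve(A[np.ix_(Is, Js)], A[Is, :])
    print(np.allclose(A, A[:, Js] @ W))           # column ID A = C W reconstructed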

Part IV
Reduction to Hessenberg, Tridiagonal,
and Bidiagonal Form
8. Hessenberg Decomposition
We firstly give the rigorous definition of the upper Hessenberg matrix.

Definition 8.1: (Upper Hessenberg Matrix)


An upper Hessenberg matrix is a square matrix where all the entries below the first
subdiagonal (i.e., the diagonal just below the main diagonal, a.k.a. the lower subdiagonal) are zeros.
Similarly, a lower Hessenberg matrix is a square matrix where all the entries above the
first superdiagonal (i.e., the diagonal just above the main diagonal) are zeros.
The definition of the upper Hessenberg can also be extended to rectangular matrices,
and the form can be implied from the context.
In matrix language, for any matrix H ∈ Rn×n , and the entry (i, j) denoted by hij for
all i, j ∈ {1, 2, . . . , n}. Then H with hij = 0 for all i ≥ j + 2 is known as an Hessenberg
matrix.
Let i denote the smallest positive integer for which hi+1,i = 0 where i ∈ {1, 2, . . . , n −
1}, then H is unreduced if i = n − 1.

Take a 5 × 5 matrix as an example; the lower triangular part below the lower sub-diagonal
is zero in the upper Hessenberg matrix (∗ denotes a value that is not necessarily zero):

    [ ∗ ∗ ∗ ∗ ∗ ]        [ ∗ ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ ∗ ∗ ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]   or   [ 0 ∗ ∗ ∗ ∗ ] .
    [ 0 0 ∗ ∗ ∗ ]        [ 0 0 0 ∗ ∗ ]
    [ 0 0 0 ∗ ∗ ]        [ 0 0 0 ∗ ∗ ]
    possibly unreduced        reduced


Then we have the following Hessenberg decomposition:

Theorem 8.2: (Hessenberg Decomposition)


Every n × n square matrix A can be factored as

A = QHQ> or H = Q> AQ,

where H is an upper Hessenberg matrix, and Q is an orthogonal matrix.

It is not hard to see that a lower Hessenberg decomposition of A> is given by A> =
QH > Q> if A has the Hessenberg decomposition A = QHQ> . The Hessenberg decom-
position shares a similar form with the QR decomposition in that both reduce a matrix
into a sparse form whose lower part is zero.

Remark 8.3: (Why Hessenberg Decomposition)


We will see that the zeros introduced into H from A are accomplished by the left orthogonal
matrix Q (same as in the QR decomposition), while the right orthogonal matrix Q> here does
not transform the matrix into any better or simpler form. Then why do we want the
Hessenberg decomposition rather than just a QR decomposition, which has a simpler
structure in that it even has zeros in the lower sub-diagonal? The answer is that the
Hessenberg decomposition is usually used by other algorithms as a phase-1 step to find
a decomposition that factors the matrix using two orthogonal matrices, e.g., SVD, UTV,
and so on. And if we employ an aggressive algorithm that even favors zeros in the lower
sub-diagonal (again, as in the QR decomposition), the right orthogonal transform Q>
will destroy those zeros, as can be seen very shortly.
On the other hand, the form A = QHQ> on H is known as the orthogonal similarity
transformation (Definition 8.4, p. 90) on A such that the eigenvalues, rank and trace of
A and H are the same (Lemma 8.5, p. 91). Then if we want to study the properties of
A, exploration on H can be a relatively simpler task.

8.1 Similarity Transformation and Orthogonal Similarity Transformation


As mentioned previously, the Hessenberg decomposition introduced in this section, the
tridiagonal decomposition in the next section, the Schur decomposition (Theorem 12.1,
p. 110), and the spectral decomposition (Theorem 13.1, p. 113) share a similar form that
transforms the matrix into a similar matrix. We now give the rigorous definition of similar
matrices and similarity transformations.

Definition 8.4: (Similar Matrices and Similarity Transformation)


A and B are called similar matrices if there exists a nonsingular matrix P such that
B = P AP −1 .
In words, for any nonsingular matrix P , the matrices A and P AP −1 are similar
matrices. And in this sense, given the nonsingular matrix P , P AP −1 is called a similarity
transformation applied to matrix A.


Moreover, when P is orthogonal, then P AP > is also known as the orthogonal simi-
larity transformation of A.
The difference between the similarity transformation and orthogonal similarity transforma-
tion is partly explained in the sense of coordinate transformation (Section 15, p. 148). Now
we prove the important properties of similar matrices that will be proved very useful in the
sequel.

Lemma 8.5: (Eigenvalue, Trace and Rank of Similar Matrices)


Any eigenvalue of A is also an eigenvalue of P AP −1 . The converse is also true that any
eigenvalue of P AP −1 is also an eigenvalue of A. I.e., Λ(A) = Λ(B), where Λ(X) is the
spectrum of matrix X (Definition 0.2, p. 10).
And also the trace and rank of A are equal to those of matrix P AP −1 for any
nonsingular matrix P .

Proof [of Lemma 8.5] For any eigenvalue λ of A, we have Ax = λx. Then λP x =
P AP −1 P x such that P x is an eigenvector of P AP −1 corresponding to λ.
Similarly, for any eigenvalue λ of P AP −1 , we have P AP −1 x = λx. Then AP −1 x =
λP −1 x such that P −1 x is an eigenvector of A corresponding to λ.
For the trace of P AP −1 , we have trace(P AP −1 ) = trace(AP −1 P ) = trace(A), where
the first equality comes from the fact that trace of a product is invariant under cyclical
permutations of the factors:

trace(ABC) = trace(BCA) = trace(CAB),

if all ABC, BCA, and CAB exist.


For the rank of P AP −1 , we separate it into two claims as follows.
Rank claim 1: rank(ZA) = rank(A) if Z is nonsingular We will first show that
rank(ZA) = rank(A) if Z is nonsingular. For any vector n in the null space of A, that
is An = 0. Thus, ZAn = 0, that is, n is also in the null space of ZA. And this implies
N (A) ⊆ N (ZA).
Conversely, for any vector m in the null space of ZA, that is ZAm = 0, we have Am =
Z −1 0 = 0. That is, m is also in the null space of A. And this indicates N (ZA) ⊆ N (A).
By “sandwiching”, the above two arguments imply

N (A) = N (ZA) −→ rank(ZA) = rank(A).

Rank claim 2: rank(AZ) = rank(A) if Z is nonsingular We notice that the row


rank is equal to the column rank of any matrix (Corollary 0.13, p. 12). Then rank(AZ) =
rank(Z > A> ). Since Z > is nonsingular, by claim 1, we have rank(Z > A> ) = rank(A> ) =
rank(A) where the last equality is again from the fact that the row rank is equal to the
column rank of any matrix. This results in rank(AZ) = rank(A) as claimed.
Since P , P −1 are nonsingular, we then have rank(P AP −1 ) = rank(AP −1 ) = rank(A)
where the first equality is from claim 1 and the second equality is from claim 2. We complete
the proof.
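A quick numerical sanity check of Lemma 8.5 follows (a sketch with a random 4 × 4 example; for a generic random P the eigenvalues are distinct, so sorting them is a fair comparison).

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((4, 4))
    P = rng.standard_normal((4, 4))            # a generic random P is nonsingular
    B = P @ A @ np.linalg.inv(P)               # similarity transformation of A

    # Eigenvalues (as multisets), trace, and rank all agree.
    print(np.allclose(np.sort(np.linalg.eigvals(A)), np.sort(np.linalg.eigvals(B))))
    print(np.isclose(np.trace(A), np.trace(B)))
    print(np.linalg.matrix_rank(A) == np.linalg.matrix_rank(B))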


8.2 Existence of the Hessenberg Decomposition

We will prove that any n × n matrix can be reduced to Hessenberg form via a sequence
of Householder transformations that are applied from the left and the right to the matrix.
Previously, we utilized a Householder reflector to triangularize matrices and introduce zeros
below the diagonal to obtain the QR decomposition. A similar approach can be applied to
introduce zeros below the subdiagonal.
Before introducing the mathematical construction of such decomposition, we empha-
size the following remark which will be very useful in the finding of the decomposition.

Remark 8.6: (Left and Right Multiplied by a Matrix with Block Identity)
For square matrix A ∈ Rn×n , and a matrix
 
B = [ Ik  0 ; 0  Bn−k ] ,

where Ik is a k × k identity matrix. Then BA will not change the first k rows of A, and
AB will not change the first k columns of A.

The proof of this remark is trivial.

First Step: Introduce Zeros for the First Column

Let A = [a1 , a2 , . . . , an ] be the column partitions of A, and each ai ∈ Rn . Suppose
ā1 , ā2 , . . . , ān ∈ Rn−1 are the vectors obtained by removing the first component of each ai . Let
r1 = ||ā1 ||,   u1 = (ā1 − r1 e1 ) / ||ā1 − r1 e1 || ,   and   H̃1 = I − 2u1 u1> ∈ R(n−1)×(n−1) ,

where e1 here is the first basis for Rn−1 , i.e., e1 = [1; 0; 0; . . . ; 0] ∈ Rn−1 . To introduce zeros
below the sub-diagonal and operate on the submatrix A2:n,1:n , we append the Householder
reflector into
 
H1 = [ 1  0 ; 0  H̃1 ] ,

in which case, H1 A will introduce zeros in the first column of A below entry (2,1). The
first row of A will not be affected at all and kept unchanged by Remark 8.6. And we can
easily verify that both H1 and H̃1 are orthogonal matrices and they are symmetric (from
the definition of Householder reflector). To have the form in Theorem 8.2, we multiply
H1 A on the right by H1> which results in H1 AH1> . The H1> on the right will not change
the first column of H1 A and thus keep the zeros introduced in the first column.


An example of a 5 × 5 matrix is shown as follows, where ∗ represents a value that is not
necessarily zero:

    [ ∗ ∗ ∗ ∗ ∗ ]        [ ∗ ∗ ∗ ∗ ∗ ]         [ ∗ ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  H1 ×  [ ∗ ∗ ∗ ∗ ∗ ]  ×H1>   [ ∗ ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  --->  [ 0 ∗ ∗ ∗ ∗ ]  --->   [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ]
          A                   H1 A                H1 AH1>

Second Step: Introduce Zeros for the Second Column


Let B = H1 AH1> , where the entries in the first column below entry (2,1) are all zeros.
And the goal is to introduce zeros in the second column below entry (3,2). Let B2 =
B2:n,2:n = [b1 , b2 , . . . , bn−1 ]. Suppose again b̄1 , b̄2 , . . . , b̄n−1 ∈ Rn−2 are vectors removing
the first component in bi ’s. We can again construct a Householder reflector
r1 = ||b̄1 ||,   u2 = (b̄1 − r1 e1 ) / ||b̄1 − r1 e1 || ,   and   H̃2 = I − 2u2 u2> ∈ R(n−2)×(n−2) ,   (8.1)

where e1 now is the first basis for Rn−2 . To introduce zeros below the sub-diagonal and
operate on the submatrix B3:n,1:n , we append the Householder reflector into
 
H2 = [ I2  0 ; 0  H̃2 ] ,

where I2 is a 2 × 2 identity matrix. We can see that H2 H1 AH1> will not change the first
two rows of H1 AH1> ; and since the Householder reflector maps a zero vector to itself, the
zeros in the first column will be kept. Again, putting H2> on the right of H2 H1 AH1> will
not change the first 2 columns so that the zeros will be kept.
Following the example of the 5 × 5 matrix, the second step is shown as follows, where ∗
represents a value that is not necessarily zero:

    [ ∗ ∗ ∗ ∗ ∗ ]        [ ∗ ∗ ∗ ∗ ∗ ]         [ ∗ ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  H2 ×  [ ∗ ∗ ∗ ∗ ∗ ]  ×H2>   [ ∗ ∗ ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]  --->  [ 0 ∗ ∗ ∗ ∗ ]  --->   [ 0 ∗ ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]        [ 0 0 ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]        [ 0 0 ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
       H1 AH1>             H2 H1 AH1>           H2 H1 AH1> H2>

The same process can go on, and there are n − 2 such steps. We finally obtain the upper
Hessenberg form
H = Hn−2 Hn−3 . . . H1 A H1> H2> . . . Hn−2> .
And since Hi ’s are symmetric and orthogonal, the above equation can be simply reduced
to
H = Hn−2 Hn−3 . . . H1 AH1 H2 . . . Hn−2 .


Note here only n − 2 such stages exist rather than n − 1 or n. We will verify this number of
steps by the example below. The example of a 5 × 5 matrix as a whole is shown as follows,
where again ∗ represents a value that is not necessarily zero.

A Complete Example of Hessenberg Decomposition

    [ ∗ ∗ ∗ ∗ ∗ ]        [ ∗ ∗ ∗ ∗ ∗ ]         [ ∗ ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  H1 ×  [ ∗ ∗ ∗ ∗ ∗ ]  ×H1>   [ ∗ ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  --->  [ 0 ∗ ∗ ∗ ∗ ]  --->   [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ]
          A                   H1 A                H1 AH1>

           [ ∗ ∗ ∗ ∗ ∗ ]         [ ∗ ∗ ∗ ∗ ∗ ]
     H2 ×  [ ∗ ∗ ∗ ∗ ∗ ]  ×H2>   [ ∗ ∗ ∗ ∗ ∗ ]
     --->  [ 0 ∗ ∗ ∗ ∗ ]  --->   [ 0 ∗ ∗ ∗ ∗ ]
           [ 0 0 ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
           [ 0 0 ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
            H2 H1 AH1>            H2 H1 AH1> H2>

           [ ∗ ∗ ∗ ∗ ∗ ]         [ ∗ ∗ ∗ ∗ ∗ ]
     H3 ×  [ ∗ ∗ ∗ ∗ ∗ ]  ×H3>   [ ∗ ∗ ∗ ∗ ∗ ]
     --->  [ 0 ∗ ∗ ∗ ∗ ]  --->   [ 0 ∗ ∗ ∗ ∗ ]
           [ 0 0 ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
           [ 0 0 0 ∗ ∗ ]         [ 0 0 0 ∗ ∗ ]
        H3 H2 H1 AH1> H2>     H3 H2 H1 AH1> H2> H3>

8.3 Properties of the Hessenberg Decomposition

The Hessenberg decomposition is not unique since there are different ways to construct the
Householder reflectors (see, e.g., Equation (8.1), p. 93). However, under mild conditions, we can
claim a similar structure across different decompositions.

Theorem 8.7: (Implicit Q Theorem for Hessenberg Decomposition)


Suppose two Hessenberg decompositions of matrix A ∈ Rn×n are given by A = U HU > =
V GV > where U = [u1 , u2 , . . . , un ] and V = [v1 , v2 , . . . , vn ] are the column partitions of
U , V . Suppose further that k is the smallest positive integer for which hk+1,k = 0 where
hij is the entry (i, j) of H. Then
• If u1 = v1 , then ui = ±vi and |hi,i−1 | = |gi,i−1 | for i ∈ {2, 3, . . . , k}.
• When k = n − 1, the Hessenberg matrix H is known as unreduced. However, if
k < n − 1, then gk+1,k = 0.


Proof [of Theorem 8.7] Define the orthogonal matrix Q = V > U and we have
GQ = V > AV V > U = V > AU   and   QH = V > U U > AU = V > AU ,
which leads to GQ = QH. The (i − 1)-th column of each side can be represented as
Gqi−1 = Qhi−1 ,
where qi−1 and hi−1 are the (i − 1)-th column of Q and H respectively. Since hl,i−1 = 0
for l ≥ i + 1 (by the definition of upper Hessenberg matrices), Qhi−1 can be represented as
Qhi−1 = Σ_{j=1}^{i} hj,i−1 qj = hi,i−1 qi + Σ_{j=1}^{i−1} hj,i−1 qj .

Combining the two findings above, it follows that
hi,i−1 qi = Gqi−1 − Σ_{j=1}^{i−1} hj,i−1 qj .

A moment of reflection reveals that [q1 , q2 , . . . , qk ] is upper triangular. And since Q is
orthogonal, this submatrix must be diagonal with each diagonal value in {−1, 1}. Then,
q1 = e1 and qi = ±ei for i ∈ {2, . . . , k}. Further, since qi = V > ui and
hi,i−1 = qi> (Gqi−1 − Σ_{j=1}^{i−1} hj,i−1 qj ) = qi> Gqi−1 , and for i ∈ {2, . . . , k}, qi> Gqi−1 is just ±gi,i−1 ,
it follows that
|hi,i−1 | = |gi,i−1 |, ∀i ∈ {2, . . . , k},
ui = ±vi , ∀i ∈ {2, . . . , k}.
This proves the first part. For the second part, if k < n − 1,
gk+1,k = ek+1> Gek = ±ek+1> GQek = ±ek+1> QHek = ±ek+1> Qhk
       = ±ek+1> Σ_{j=1}^{k+1} hjk qj = ±ek+1> Σ_{j=1}^{k} hjk qj = 0,

where the penultimate equality is from the assumption that hk+1,k = 0. This completes the
proof.
We observe from the above theorem that when two Hessenberg decompositions of matrix A
are both unreduced and have the same first column in their orthogonal matrices, then
the Hessenberg matrices H, G are similar matrices such that H = DGD −1 where D =
diag(±1, ±1, . . . , ±1). Moreover, and most importantly, if we restrict the elements in the
lower sub-diagonal of the Hessenberg matrix H to be positive (if possible), then the Hes-
senberg decomposition A = QHQ> is uniquely determined by A and the first column of
Q. This is similar to what we have claimed on the uniqueness of the QR decomposition
(Corollary 3.10, p. 64) and it is important to reduce the complexity of the QR algorithm
for computing the singular value decomposition or eigenvalues of a matrix in general (Lu,
2021c).
The next finding involves a Krylov matrix defined as follows:


Definition 8.8: (Krylov Matrix)


Given matrix A ∈ Rn×n , a vector q ∈ Rn , and a scalar k, the Krylov matrix is defined to
be
K(A, q, k) = [q, Aq, . . . , Ak−1 q] ∈ Rn×k .

Theorem 8.9: (Reduced Hessenberg)


Suppose there exists an orthogonal matrix Q such that A ∈ Rn×n can be factored as
A = QHQ> . Then Q> AQ = H is an unreduced upper Hessenberg matrix if and only
if R = Q> K(A, q1 , n) is nonsingular and upper triangular where q1 is the first column
of Q.
If R is singular and k is the smallest index so that rkk = 0, then k is also the smallest
index that hk,k−1 = 0.

Proof [of Theorem 8.9] We prove by forward implication and converse implication sepa-
rately as follows:
Forward implication Suppose H is unreduced, write out the following matrix
R = Q> K(A, q1 , n) = [e1 , He1 , . . . , H n−1 e1 ],
where R is upper triangular with r11 = 1 obviously. Observe that rii = h21 h32 . . . hi,i−1 for
i ∈ {2, 3, . . . , n}. When H is unreduced, R is nonsingular as well.
Converse implication Now suppose R is upper triangular and nonsingular, we observe
that rk+1 = Hrk such that the (k + 2 : n)-th rows of H are zero and hk+1,k ≠ 0 for
k ∈ {1, 2, . . . , n − 1}. Then H is unreduced.
If R is singular and k is the smallest index so that rkk = 0, then
rk−1,k−1 = h21 h32 . . . hk−1,k−2 ≠ 0   and   rkk = h21 h32 . . . hk−1,k−2 hk,k−1 = 0   lead to   hk,k−1 = 0,
from which the result follows.
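A small numerical check of Theorem 8.9 follows (a sketch; for a generic random A the computed Hessenberg form is unreduced, so R should come out nonsingular and upper triangular). The helper krylov is ours.

    import numpy as np
    from scipy.linalg import hessenberg

    def krylov(A, q, k):
        """Krylov matrix K(A, q, k) = [q, Aq, ..., A^{k-1} q]."""
        cols = [q]
        for _ in range(k - 1):
            cols.append(A @ cols[-1])
        return np.stack(cols, axis=1)

    rng = np.random.default_rng(5)
    A = rng.standard_normal((5, 5))
    H, Q = hessenberg(A, calc_q=True)            # A = Q H Q^T, H upper Hessenberg
    R = Q.T @ krylov(A, Q[:, 0], A.shape[0])
    # For an unreduced H (the generic case), R is nonsingular and upper triangular.
    print(np.allclose(np.tril(R, -1), 0), abs(np.linalg.det(R)) > 1e-10)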

9. Tridiagonal Decomposition: Hessenberg in Symmetric Matrices


We firstly give the formal definition of the tridiagonal matrix.

Definition 9.1: (Tridiagonal Matrix)


A tridiagonal matrix is a square matrix where all the entries below the lower sub-diagonal
and the entries above the upper sub-diagonal are zeros. I.e., the tridiagonal matrix is a
band matrix.
The definition of the tridiagonal matrix can also be extended to rectangular matrices,
and the form can be implied from the context.


In matrix language, for any matrix T ∈ Rn×n , and the entry (i, j) denoted by tij for
all i, j ∈ {1, 2, . . . , n}. Then T with tij = 0 for all i ≥ j + 2 and i ≤ j − 2 is known as a
tridiagonal matrix.
Let i denote the smallest positive integer for which ti+1,i = 0 where i ∈ {1, 2, . . . , n −
1}, then T is unreduced if i = n − 1.

Take a 5 × 5 matrix as an example; the lower triangular part below the lower sub-diagonal
and the upper triangular part above the upper sub-diagonal are zero in the tridiagonal matrix:

    [ ∗ ∗ 0 0 0 ]        [ ∗ ∗ 0 0 0 ]
    [ ∗ ∗ ∗ 0 0 ]        [ ∗ ∗ ∗ 0 0 ]
    [ 0 ∗ ∗ ∗ 0 ]   or   [ 0 ∗ ∗ ∗ 0 ] .
    [ 0 0 ∗ ∗ ∗ ]        [ 0 0 0 ∗ ∗ ]
    [ 0 0 0 ∗ ∗ ]        [ 0 0 0 ∗ ∗ ]
    possibly unreduced        reduced

Obviously, a tridiagonal matrix is a special case of an upper Hessenberg matrix. Then we


have the following tridiagonal decomposition:

Theorem 9.2: (Tridiagonal Decomposition)


Every n × n symmetric matrix A can be factored as

A = QT Q> or T = Q> AQ,

where T is a symmetric tridiagonal matrix, and Q is an orthogonal matrix.

The existence of the tridiagonal matrix is trivial by applying the Hessenberg decomposition
to symmetric matrix A.
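A quick check of this observation using SciPy's Hessenberg routine follows (a sketch on a random symmetric matrix; the test matrix is our own example).

    import numpy as np
    from scipy.linalg import hessenberg

    rng = np.random.default_rng(6)
    S = rng.standard_normal((5, 5))
    A = S + S.T                                   # symmetric test matrix
    T, Q = hessenberg(A, calc_q=True)             # A = Q T Q^T

    # For symmetric A the Hessenberg form is symmetric, hence tridiagonal.
    is_tridiagonal = np.allclose(np.tril(T, -2), 0) and np.allclose(np.triu(T, 2), 0)
    print(is_tridiagonal, np.allclose(Q @ T @ Q.T, A))   # True True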

9.1 Properties of the Tridiagonal Decomposition


Similarly, the tridiagonal decomposition is not unique. However, and most importantly, if
we restrict the elements in the lower sub-diagonal of the tridiagonal matrix T to be positive
(if possible), then the tridiagonal decomposition A = QT Q> is uniquely determined by A
and the first column of Q.

Theorem 9.3: (Implicit Q Theorem for Tridiagonal)


Suppose two Tridiagonal decompositions of symmetric matrix A ∈ Rn×n are given by
A = U T U > = V GV > where U = [u1 , u2 , . . . , un ] and V = [v1 , v2 , . . . , vn ] are the
column partitions of U , V . Suppose further that k is the smallest positive integer for
which tk+1,k = 0 where tij is the entry (i, j) of T . Then
• If u1 = v1 , then ui = ±vi and |ti,i−1 | = |gi,i−1 | for i ∈ {2, 3, . . . , k}.
• When k = n − 1, the tridiagonal matrix T is known as unreduced. However, if
k < n − 1, then gk+1,k = 0.


From the above theorem, we observe that if we restrict the elements in the lower sub-
diagonal of the tridiagonal matrix T to be positive (if possible), i.e., unreduced, then the
tridiagonal decomposition A = QT Q> is uniquely determined by A and the first column of
Q. This again is similar to what we have claimed on the uniqueness of the QR decomposition
(Corollary 3.10, p. 64).
Similarly, a reduced tridiagonal decomposition can be obtained from the implication of
the Krylov matrix (Definition 8.8, p. 96).

Theorem 9.4: (Reduced Tridiagonal)


Suppose there exists an orthogonal matrix Q such that A ∈ Rn×n can be factored as
A = QT Q> . Then Q> AQ = T is an unreduced tridiagonal matrix if and only if
R = Q> K(A, q1 , n) is nonsingular and upper triangular where q1 is the first column of
Q.
If R is singular and k is the smallest index so that rkk = 0, then k is also the smallest
index that tk,k−1 = 0.

The proofs of the above two theorems are the same as those in Theorem 8.7 and 8.9.

10. Bidiagonal Decomposition


We firstly give the rigorous definition of the upper Bidiagonal matrix as follows:

Definition 10.1: (Upper Bidiagonal Matrix)


An upper bidiagonal matrix is a square matrix which is a banded matrix with non-zero
entries along the main diagonal and the upper subdiagonal (i.e., the ones above the main
diagonal). This means there are exactly two nonzero diagonals in the matrix.
Furthermore, when the diagonal below the main diagonal has the non-zero entries,
the matrix is lower bidiagonal.
The definition of bidigonal matrices can also be extended to rectangular matrices, and
the form can be implied from the context.

Take a 7 × 5 matrix as an example; the lower triangular part below the main diagonal and
the upper triangular part above the upper subdiagonal are zero in the upper bidiagonal matrix:

    [ ∗ ∗ 0 0 0 ]
    [ 0 ∗ ∗ 0 0 ]
    [ 0 0 ∗ ∗ 0 ]
    [ 0 0 0 ∗ ∗ ]
    [ 0 0 0 0 ∗ ]
    [ 0 0 0 0 0 ]
    [ 0 0 0 0 0 ]

Then we have the following bidiagonal decomposition:


Theorem 10.2: (Bidiagonal Decomposition)


Every m × n matrix A can be factored as

A = U BV > or B = U > AV ,

where B is an upper bidiagonal matrix, and U , V are orthogonal matrices.

We will see that the bidiagonalization resembles the form of a singular value decomposition;
the only difference is that B in bidiagonal form has nonzero entries on the upper sub-diagonal.
The bidiagonal decomposition will be shown to play an important role in the calculation of the
singular value decomposition.

10.1 Existence of the Bidiagonal Decomposition: Golub-Kahan


Bidiagonalization
Previously, we utilized a Householder reflector to triangularize matrices and introduce zeros
below the diagonal to obtain the QR decomposition, and introduce zeros below the sub-
diagonal to obtain the Hessenberg decomposition. A similar approach can be employed to
find the bidiagonal decomposition.

First Step 1.1: Introduce Zeros for the First Column


Let A = [a1 , a2 , . . . , an ] be the column partitions of A, and each ai ∈ Rm . We can construct
the Householder reflector as follows:
r1 = ||a1 ||,   u1 = (a1 − r1 e1 ) / ||a1 − r1 e1 || ,   and   H1 = I − 2u1 u1> ∈ Rm×m ,

where e1 here is the first basis for Rm , i.e., e1 = [1; 0; 0; . . . ; 0] ∈ Rm . In this case, H1 A
will introduce zeros in the first column of A below entry (1,1), i.e., reflect a1 to r1 e1 . We
can easily verify that both H1 is a symmetric and orthogonal matrix (from the definition
of Householder reflector).
An example of a 7 × 5 matrix is shown as follows, where ∗ represents a value that is not
necessarily zero:

    [ ∗ ∗ ∗ ∗ ∗ ]        [ ∗ ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  H1 ×  [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  --->  [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]
          A                  H1 A

Till now, this is exactly what we have done in the QR decomposition via the Householder
reflector in Section 3.11 (p. 56). Going further, to introduce zeros above the upper sub-
diagonal of H1 A is equivalent to introducing zeros below the lower subdiagonal of (H1 A)> .


First Step 1.2: Introduce Zeros for the First Row


Now suppose we are looking at the transpose of H1 A, that is (H1 A)> = A> H1> ∈ Rn×m
and the column partition is given by A> H1> = [z1 , z2 , . . . , zm ] where each zi ∈ Rn . Suppose
z̄1 , z̄2 , . . . , z̄m ∈ Rn−1 are the vectors obtained by removing the first component of each zi . Let
r1 = ||z̄1 ||,   v1 = (z̄1 − r1 e1 ) / ||z̄1 − r1 e1 || ,   and   L̃1 = I − 2v1 v1> ∈ R(n−1)×(n−1) ,

where e1 now is the first basis for Rn−1 , i.e., e1 = [1; 0; 0; . . . ; 0] ∈ Rn−1 . To introduce
zeros below the sub-diagonal and operate on the submatrix (A> H1> )2:n,1:m , we append the
Householder reflector into
L1 = [ 1  0 ; 0  L̃1 ] ,
in which case, L1 (A> H1> ) will introduce zeros in the first column of (A> H1> ) below entry
(2,1), i.e., reflect z̄1 to r1 e1 . The first row of (A> H1> ) will not be affected at all and
kept unchanged by Remark 8.6 (p. 92) such that the zeros introduced in Step 1.1) will be
kept. And we can easily verify that both L1 and L e 1 are orthogonal matrices and they are
symmetric (from the definition of Householder reflector).
Coming back to the original untransposed matrix H1 A, multiplying on the right by L1> serves
to introduce zeros in the first row to the right of entry (1,2). Again, following the example
above, a 7 × 5 matrix is shown as follows, where ∗ represents a value that is not necessarily
zero:

    [ ∗ 0 0 0 0 0 0 ]         [ ∗ 0 0 0 0 0 0 ]          [ ∗ ∗ 0 0 0 ]
    [ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]   L1 ×  [ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]   (·)>   [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]   --->  [ 0 ∗ ∗ ∗ ∗ ∗ ∗ ]   --->   [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ∗ ∗ ]          [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ∗ ∗ ]          [ 0 ∗ ∗ ∗ ∗ ]
        A> H1>                    L1 A> H1>              [ 0 ∗ ∗ ∗ ∗ ]
                                                         [ 0 ∗ ∗ ∗ ∗ ]
                                                           H1 AL1>

In short, H1 AL1> finishes the first step to introduce zeros for the first column and the
first row of A.

Second Step 2.1: Introduce Zeros for the Second Column


Let B = H1 AL> 1 , where the entries in the first column below entry (1,1) are all zeros
and the entries in the first row to the right of entry (1,2) are all zeros as well. And the
goal is to introduce zeros in the second column below entry (2,2). Let B2 = B2:m,2:n =
[b1 , b2 , . . . , bn−1 ] ∈ R(m−1)×(n−1) . We can again construct a Householder reflector
r1 = ||b1 ||,   u2 = (b1 − r1 e1 ) / ||b1 − r1 e1 || ,   and   H̃2 = I − 2u2 u2> ∈ R(m−1)×(m−1) ,

where e1 now is the first basis for Rm−1 i.e., e1 = [1; 0; 0; . . . ; 0] ∈ Rm−1 . To introduce zeros
below the main diagonal and operate on the submatrix B2:m,2:n , we append the Householder


reflector into
 
H2 = [ 1  0 ; 0  H̃2 ] ,

in which case, we can see that H2 (H1 AL1> ) will not change the first row of (H1 AL1> ) by
Remark 8.6 (p. 92); and since the Householder reflector maps a zero vector to itself, the zeros
in the first column will be kept as well.
Following the example above, a 7 × 5 matrix is shown as follows, where ∗ represents a
value that is not necessarily zero:

    [ ∗ ∗ 0 0 0 ]        [ ∗ ∗ 0 0 0 ]
    [ 0 ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]  H2 ×  [ 0 0 ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]  --->  [ 0 0 ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]        [ 0 0 ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]        [ 0 0 ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ]        [ 0 0 ∗ ∗ ∗ ]
       H1 AL1>             H2 H1 AL1>
Second Step 2.2: Introduce Zeros for the Second Row


Same as step 1.2, now suppose we are looking at the transpose of H2 H1 AL1> , that is
(H2 H1 AL1> )> = L1 A> H1> H2> ∈ Rn×m , and the column partition is given by L1 A> H1> H2> =
[x1 , x2 , . . . , xm ] where each xi ∈ Rn . Suppose x̄1 , x̄2 , . . . , x̄m ∈ Rn−2 are vectors removing
the first two components in xi ’s. Construct the Householder reflector as follows:
r1 = ||x̄1 ||,   v2 = (x̄1 − r1 e1 ) / ||x̄1 − r1 e1 || ,   and   L̃2 = I − 2v2 v2> ∈ R(n−2)×(n−2) ,

where e1 now is the first basis for Rn−2 , i.e., e1 = [1; 0; 0; . . . ; 0] ∈ Rn−2 . To introduce zeros
below the sub-diagonal and operate on the submatrix (L1 A> H1 H2 )3:n,1:m , we append the
Householder reflector into
 
I2 0
L1 = e2 ,
0 L

where I2 is a 2×2 identity matrix. In this case, L2 (L1 A> H1> H2> ) will introduce zeros in the
second column of (L1 A> H1> H2> ) below entry (3,2). The first two rows of (L1 A> H1> H2> )
will not be affected at all and kept unchanged by Remark 8.6 (p. 92). Further, the first
column of it will be kept unchanged as well. And we can easily verify that both L2
and L̃2 are orthogonal matrices and they are symmetric (from the definition of Householder
reflector).
Coming back to the original untransposed matrix H2 H1 AL1> , multiplying on the right by
L2> serves to introduce zeros in the second row to the right of entry (2,3). Following the example
above, a 7 × 5 matrix is shown as follows, where ∗ represents a value that is not necessarily
zero:

    [ ∗ 0 0 0 0 0 0 ]        [ ∗ 0 0 0 0 0 0 ]         [ ∗ ∗ 0 0 0 ]
    [ ∗ ∗ 0 0 0 0 0 ]  L2 ×  [ ∗ ∗ 0 0 0 0 0 ]  (·)>   [ 0 ∗ ∗ 0 0 ]
    [ 0 ∗ ∗ ∗ ∗ ∗ ∗ ]  --->  [ 0 ∗ ∗ ∗ ∗ ∗ ∗ ]  --->   [ 0 0 ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ∗ ∗ ]        [ 0 0 ∗ ∗ ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
    [ 0 ∗ ∗ ∗ ∗ ∗ ∗ ]        [ 0 0 ∗ ∗ ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
      L1 A> H1> H2>            L2 L1 A> H1> H2>        [ 0 0 ∗ ∗ ∗ ]
                                                       [ 0 0 ∗ ∗ ∗ ]
                                                      H2 H1 AL1> L2>

In short, H2 (H1 AL1> )L2> finishes the second step to introduce zeros for the second column
and the second row of A. The same process can go on, and we shall notice that there are n
such Hi Householder reflectors on the left and n − 2 such Li Householder reflectors on the
right (suppose m > n for simplicity). The interleaved Householder factorization is known as
the Golub-Kahan Bidiagonalization (Golub and Kahan, 1965). We will finally bidiagonalize

B = Hn Hn−1 . . . H1 A L1> L2> . . . Ln−2> .

And since the Hi ’s and Li ’s are symmetric and orthogonal, we have

B = Hn Hn−1 . . . H1 AL1 L2 . . . Ln−2 .

A full example of a 7 × 5 matrix is shown as follows, where again ∗ represents a value that
is not necessarily zero.

A Complete Example of Golub-Kahan Bidiagonalization

    [ ∗ ∗ ∗ ∗ ∗ ]        [ ∗ ∗ ∗ ∗ ∗ ]         [ ∗ ∗ 0 0 0 ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  H1 ×  [ 0 ∗ ∗ ∗ ∗ ]  ×L1>   [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]  --->  [ 0 ∗ ∗ ∗ ∗ ]  --->   [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ]
    [ ∗ ∗ ∗ ∗ ∗ ]        [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ ∗ ∗ ]
          A                  H1 A                 H1 AL1>

           [ ∗ ∗ 0 0 0 ]         [ ∗ ∗ 0 0 0 ]
           [ 0 ∗ ∗ ∗ ∗ ]         [ 0 ∗ ∗ 0 0 ]
     H2 ×  [ 0 0 ∗ ∗ ∗ ]  ×L2>   [ 0 0 ∗ ∗ ∗ ]
     --->  [ 0 0 ∗ ∗ ∗ ]  --->   [ 0 0 ∗ ∗ ∗ ]
           [ 0 0 ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
           [ 0 0 ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
           [ 0 0 ∗ ∗ ∗ ]         [ 0 0 ∗ ∗ ∗ ]
            H2 H1 AL1>           H2 H1 AL1> L2>

           [ ∗ ∗ 0 0 0 ]         [ ∗ ∗ 0 0 0 ]
           [ 0 ∗ ∗ 0 0 ]         [ 0 ∗ ∗ 0 0 ]
     H3 ×  [ 0 0 ∗ ∗ ∗ ]  ×L3>   [ 0 0 ∗ ∗ 0 ]
     --->  [ 0 0 0 ∗ ∗ ]  --->   [ 0 0 0 ∗ ∗ ]
           [ 0 0 0 ∗ ∗ ]         [ 0 0 0 ∗ ∗ ]
           [ 0 0 0 ∗ ∗ ]         [ 0 0 0 ∗ ∗ ]
           [ 0 0 0 ∗ ∗ ]         [ 0 0 0 ∗ ∗ ]
        H3 H2 H1 AL1> L2>     H3 H2 H1 AL1> L2> L3>

           [ ∗ ∗ 0 0 0 ]         [ ∗ ∗ 0 0 0 ]
           [ 0 ∗ ∗ 0 0 ]         [ 0 ∗ ∗ 0 0 ]
     H4 ×  [ 0 0 ∗ ∗ 0 ]  H5 ×   [ 0 0 ∗ ∗ 0 ]
     --->  [ 0 0 0 ∗ ∗ ]  --->   [ 0 0 0 ∗ ∗ ]
           [ 0 0 0 0 ∗ ]         [ 0 0 0 0 ∗ ]
           [ 0 0 0 0 ∗ ]         [ 0 0 0 0 0 ]
           [ 0 0 0 0 ∗ ]         [ 0 0 0 0 0 ]
    H4 H3 H2 H1 AL1> L2> L3>   H5 H4 H3 H2 H1 AL1> L2> L3>

We present the procedure in a way where a right Householder reflector Li follows a left one Hi .
A tempting shortcut would be to apply all the left reflectors first and let the right ones follow;
that is, to treat the bidiagonal decomposition as a combination of a QR decomposition and a
Hessenberg decomposition. Nevertheless, this is problematic: the right Householder reflector L1
would destroy the zeros introduced by the left ones. Therefore, the left and right reflectors need
to be applied in an interleaved manner so that the zeros are preserved.
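The interleaved procedure can be sketched compactly in NumPy as follows; the helper names house and golub_kahan are ours, and this is an illustration rather than an optimized implementation.

    import numpy as np

    def house(x):
        """Householder unit vector v with (I - 2 v v^T) x parallel to e1."""
        v = x.astype(float).copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        nv = np.linalg.norm(v)
        return v / nv if nv > 0 else v

    def golub_kahan(A):
        """Bidiagonalize A (m >= n): returns U, B, V with A = U B V^T, B upper bidiagonal."""
        B = A.astype(float).copy()
        m, n = B.shape
        U, V = np.eye(m), np.eye(n)
        for k in range(n):
            # Left reflector: zero out column k below the diagonal.
            v = house(B[k:, k])
            B[k:, k:] -= 2.0 * np.outer(v, v @ B[k:, k:])
            U[:, k:]  -= 2.0 * np.outer(U[:, k:] @ v, v)
            if k < n - 2:
                # Right reflector: zero out row k to the right of the superdiagonal.
                w = house(B[k, k + 1:])
                B[k:, k + 1:] -= 2.0 * np.outer(B[k:, k + 1:] @ w, w)
                V[:, k + 1:]  -= 2.0 * np.outer(V[:, k + 1:] @ w, w)
        return U, B, V

    rng = np.random.default_rng(7)
    A = rng.standard_normal((7, 5))
    U, B, V = golub_kahan(A)
    print(np.allclose(U @ B @ V.T, A),
          np.allclose(np.tril(B, -1), 0), np.allclose(np.triu(B, 2), 0))   # True True True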
The Golub-Kahan bidiagonalization is not the most efficient way to calculate the bidi-
agonal decomposition. It requires ∼ 4mn2 − (4/3)n3 flops to compute a bidiagonal decompo-
sition of an m × n matrix with m > n. Further, if U , V are needed explicitly, additional
∼ 4m2 n − 2mn2 + 2n3 flops are required.
LHC Bidiagonalization Nevertheless, when m ≫ n, we can extract the square triangular
matrix (i.e., via the QR decomposition) and apply the Golub-Kahan bidiagonalization to the
square n × n matrix. This is known as the Lawson-Hanson-Chan (LHC) bidiagonalization
(Lawson and Hanson, 1995; Chan, 1982) and the procedure is shown in Figure 14. The
LHC bidiagonalization starts by computing the QR decomposition A = QR. This is followed
by applying the Golub-Kahan process such that R̃ = Ũ B̃ Ṽ> , where R̃ is the square n × n
triangular submatrix inside R. Append Ũ into
U0 = [ Ũ  0 ; 0  Im−n ] ,
which results in R = U0 B Ṽ> and A = QU0 B Ṽ> , where B ∈ Rm×n stacks B̃ on top of
zero rows. Letting U = QU0 and V = Ṽ , we obtain the bidiagonal
decomposition. The QR decomposition requires 2mn2 − (2/3)n3 flops and the Golub-Kahan
process now requires ∼ (8/3)n3 flops (operating on an n × n submatrix). Thus the total complexity to
obtain the bidiagonal matrix B is then reduced to
LHC bidiagonalization: ∼ 2mn2 + 2n3 flops.


Amn ~ ~
Rmn Rnn Bnn Bmn

Figure 14: Demonstration of LHC-bidiagonalization of a matrix

The LHC process creates zeros and then destroys them again in the lower triangle of the
upper n × n square of R, but the zeros in the lower (m − n) × n rectangular matrix of R
will be kept. Thus when m − n is large enough (or m ≫ n), there is a net gain. Simple
calculations will show the LHC bidiagonalization costs less when m > (5/3)n compared to the
Golub-Kahan bidiagonalization.

Three-Step Bidiagonalization The LHC procedure is advantageous only when m > (5/3)n.


A further trick is to apply the QR decomposition not at the beginning of the computation,
but at a suitable point in the middle (Trefethen and Bau III, 1997). In particular, the
procedure is shown in Figure 15 where we apply the first k steps of left and right Householder
reflectors as in the Golub-Kahan process leaving the bottom-right (m−k)×(n−k) submatrix
“unreflected”. Then following the same LHC process on the submatrix to obtain the final
bidiagonal decomposition. By doing so, the complexity reduces when n < m < 2n.


Figure 15: Demonstration of Three-Step bidiagonalization of a matrix

To conclude, the costs of the three methods are shown as follows:

    Golub-Kahan:  ∼ 4mn2 − (4/3)n3 flops,
    LHC:          ∼ 2mn2 + 2n3 flops,
    Three-Step:   ∼ 2mn2 + 2m2 n − (2/3)m3 − (2/3)n3 flops.


[Figure 16 (plot omitted, flops/n3 versus m/n): Comparison of the complexity among the three
bidiagonalization methods (Golub-Kahan, LHC, Three-Step). When m > 2n, LHC is preferred;
when n < m < 2n, the Three-Step method is preferred though the improvement is small.]

When m > 2n, LHC is preferred; when n < m < 2n, the Three-Step method is preferred
though the improvement is small enough as shown in Figure 16 where the operation counts
for the three methods are plotted as a function of m/n. Notice that the complexity discussed
here does not involve the extra computation of U , V . We shall not discuss the issue for
simplicity.

10.2 Connection to Tridiagonal Decomposition


We first illustrate the connection by the following lemma that reveals how to construct a
tridiagonal matrix from a bidiagonal one.

Lemma 10.3: (Construct Tridiagonal From Bidiagonal)


Suppose B ∈ Rn×n is upper bidiagonal, then T1 = B > B and T2 = BB > are symmetric
tridiagonal matrices.

The lemma above reveals an important property. Suppose A = U BV > is the bidiagonal
decomposition of A, then the symmetric matrix AA> has a tridiagonal decomposition

AA> = U BV > V B > U > = U BB > U > .

And the symmetric matrix A> A has a tridiagonal decomposition

A> A = V B > U > U BV > = V B > BV > .
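A small numerical illustration of Lemma 10.3 follows (a sketch with an arbitrary upper bidiagonal B chosen for the example).

    import numpy as np

    # An upper bidiagonal B: main diagonal d, superdiagonal f.
    d = np.array([4.0, 3.0, 2.0, 1.0])
    f = np.array([0.5, 0.7, 0.9])
    B = np.diag(d) + np.diag(f, k=1)

    for T in (B.T @ B, B @ B.T):
        banded = np.allclose(np.tril(T, -2), 0) and np.allclose(np.triu(T, 2), 0)
        print(np.allclose(T, T.T), banded)        # symmetric and tridiagonal: True True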

As a final result in this section, we state a theorem giving the tridiagonal decomposition of
a symmetric matrix with special eigenvalues.

Theorem 10.4: (Tridiagonal Decomposition for Nonnegative Eigenvalues)


Suppose n×n symmetric matrix A has nonnegative eigenvalues, then there exists a matrix
Z such that
A = ZZ > .
Moreover, the tridiagonal decomposition of A can be reduced to a problem to find the
bidiagonal decomposition of Z = U BV > such that the tridiagonal decomposition of A
is given by
A = ZZ > = U BB > U > .
Proof [of Theorem 10.4] The eigenvectors of symmetric matrices can be chosen to be or-
thogonal (Lemma 13.3, p. 114) such that symmetric matrix A can be decomposed into
A = QΛQ> (spectral theorem 13.1, p. 113) where Λ is a diagonal matrix containing the
eigenvalues of A. When eigenvalues are nonnegative, Λ can be factored as Λ = Λ1/2 Λ1/2 .
Let Z = QΛ1/2 , A can be factored as A = ZZ > . Thus, combining our findings yields the
result.

Part V
Eigenvalue Problem
11. Eigenvalue and Jordan Decomposition

Theorem 11.1: (Eigenvalue Decomposition)


Any square matrix A ∈ Rn×n with linearly independent eigenvectors can be factored as

A = XΛX −1 ,

where X contains the eigenvectors of A as columns, and Λ is a diagonal matrix diag(λ1 , λ2 ,


. . . , λn ) and λ1 , λ2 , . . . , λn are eigenvalues of A.

Eigenvalue decomposition is also known as diagonalization of the matrix A. When no


eigenvalues of A are repeated, the eigenvectors are sure to be linearly independent. Then
A can be diagonalized. Note here without n linearly independent eigenvectors, we cannot
diagonalize. In Section 13.3 (p. 119), we will further discuss conditions under which the
matrix has linearly independent eigenvectors.

11.1 Existence of the Eigenvalue Decomposition


Proof [of Theorem 11.1] Let X = [x1 , x2 , . . . , xn ] as the linearly independent eigenvectors
of A. Clearly, we have

Ax1 = λ1 x1 , Ax2 = λ2 x2 , ..., Axn = λn xn .

In the matrix form,

AX = [Ax1 , Ax2 , . . . , Axn ] = [λ1 x1 , λ2 x2 , . . . , λn xn ] = XΛ.


Since we assume the eigenvectors are linearly independent, then X has full rank and is
invertible. We obtain
A = XΛX −1 .
This completes the proof.

We will discuss some similar forms of eigenvalue decomposition in the spectral decom-
position section, where the matrix A is required to be symmetric, and the X is not only
nonsingular but also orthogonal. Or, the matrix A is required to be a simple matrix, that
is, the algebraic multiplicity and geometric multiplicity are the same for A, and X will be a
trivial nonsingular matrix that may not contain the eigenvectors of A. The decomposition
also has a geometric meaning, which we will discuss in Section 15 (p. 148).
A matrix decomposition in the form of A = XΛX −1 has a nice property that we can
compute the m-th power efficiently.

Remark 11.2: (m-th Power)


The m-th power of A is Am = XΛm X −1 if the matrix A can be factored as A =
XΛX −1 .
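A minimal NumPy sketch of this remark follows (the random 4 × 4 test matrix is generically diagonalizable; this is an illustration only).

    import numpy as np

    rng = np.random.default_rng(8)
    A = rng.standard_normal((4, 4))               # generically diagonalizable
    lam, X = np.linalg.eig(A)                     # A = X diag(lam) X^{-1}

    m = 5
    A_pow = X @ np.diag(lam ** m) @ np.linalg.inv(X)
    print(np.allclose(A_pow, np.linalg.matrix_power(A, m)))   # True (up to round-off)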

We notice that we require A to have linearly independent eigenvectors to prove the exis-
tence of the eigenvalue decomposition. Under specific conditions, the requirement is intrin-
sically satisfied.

Lemma 11.3: (Different Eigenvalues)


Suppose the eigenvalues λ1 , λ2 , . . . , λn of A ∈ Rn×n are all different. Then the correspond-
ing eigenvectors are automatically independent. In other words, any square matrix with
different eigenvalues can be diagonalized.

Proof [of Lemma 11.3] Suppose the eigenvalues λ1 , λ2 , . . . , λn are all different, and the
eigenvectors x1 , x2 , . . . , xn are dependent. That is, there exists a nonzero vector c =
[c1 , c2 , . . . , cn−1 ]> such that
xn = Σ_{i=1}^{n−1} ci xi .
Then we have
Axn = A( Σ_{i=1}^{n−1} ci xi ) = c1 λ1 x1 + c2 λ2 x2 + . . . + cn−1 λn−1 xn−1 ,
and
Axn = λn xn
= λn (c1 x1 + c2 x2 + . . . + cn−1 xn−1 ).
Combining the above two equations, we have
Σ_{i=1}^{n−1} (λn − λi )ci xi = 0.


This leads to a contradiction since λn ≠ λi for all i ∈ {1, 2, . . . , n − 1}, from which the result
follows.

Remark 11.4: (Limitation of Eigenvalue Decomposition)


The limitation of eigenvalue decomposition is that:
• The eigenvectors in X are usually not orthogonal and there are not always enough
eigenvectors (i.e., some eigenvalues are equal).
• To compute the eigenvalues and eigenvectors Ax = λx requires A to be square.
Rectangular matrices cannot be diagonalized by eigenvalue decomposition.

11.2 Jordan Decomposition


In eigenvalue decomposition, we suppose matrix A has n linearly independent eigenvectors.
However, this is not necessarily true for all square matrices. We introduce further a gener-
alized version of eigenvalue decomposition which is called the Jordan decomposition named
after Camille Jordan (Jordan, 1870).
We first introduce the definition of Jordan blocks and Jordan form for the further de-
scription of Jordan decomposition.

Definition 11.5: (Jordan Block)


An m × m upper triangular matrix B(λ, m) is called a Jordan block provided all m
diagonal elements are the same eigenvalue λ and all upper sub-diagonal elements are ones:

              [ λ 1 0 ... 0 0 0 ]
              [ 0 λ 1 ... 0 0 0 ]
              [ 0 0 λ ... 0 0 0 ]
    B(λ, m) = [ ... ... ...     ]
              [ 0 0 0 ... λ 1 0 ]
              [ 0 0 0 ... 0 λ 1 ]
              [ 0 0 0 ... 0 0 λ ]   (m × m)

Definition 11.6: (Jordan Form)


Given an n × n matrix A, a Jordan form J for A is a block diagonal matrix defined as

J = diag(B(λ1 , m1 ), B(λ2 , m2 ), . . . B(λk , mk ))

where λ1 , λ2 , . . . , λk are eigenvalues of A (duplicates possible) and m1 +m2 +. . .+mk = n.

Then, the Jordan decomposition follows:


Theorem 11.7: (Jordan Decomposition)


Any square matrix A ∈ Rn×n can be factored as

A = XJ X −1 ,

where X is a nonsingular matrix containing the generalized eigenvectors of A as columns,


and J is a Jordan form matrix diag(J1 , J2 , . . . , Jk ) where
 
         [ λi 1  0  ... 0  0  0  ]
         [ 0  λi 1  ... 0  0  0  ]
         [ 0  0  λi ... 0  0  0  ]
    Ji = [ ...           ...     ]
         [ 0  0  0  ... λi 1  0  ]
         [ 0  0  0  ... 0  λi 1  ]
         [ 0  0  0  ... 0  0  λi ]   (mi × mi )

is an mi × mi square matrix with mi being the number of repetitions of eigenvalue λi and


m1 + m2 + . . . + mk = n. Ji ’s are referred to as Jordan blocks.
Further, nonsingular matrix X is called the matrix of generalized eigenvectors
of A.

As an example, a Jordan form can have the following structure:

J = diag(B(λ1 , m1 ), B(λ2 , m2 ), . . . , B(λk , mk ))

        [ λ1 1  0                        ]
        [ 0  λ1 1                        ]
        [ 0  0  λ1                       ]
      = [           λ2                   ] .
        [              λ3 1              ]
        [              0  λ3             ]
        [                     ...        ]
        [                         λk 1   ]
        [                         0  λk  ]

Decoding a Jordan Decomposition: Note that zeros can appear on the upper sub-
diagonal of J and in each block, the first column is always a diagonal containing only
eigenvalues of A. Take out one block to decode, without loss of generality, we take out
the first block J1 . We shall show the columns 1, 2, . . . , m1 of AX = XJ with X =
[x1 , x2 , . . . , xn ]:
Ax_1 = \lambda_1 x_1,
Ax_2 = \lambda_1 x_2 + x_1,
\vdots
Ax_{m_1} = \lambda_1 x_{m_1} + x_{m_1-1}.


For more details about Jordan decomposition, please refer to (Gohberg and Goldberg, 1996;
Hales and Passi, 1999).
The Jordan decomposition is of limited use in practice because it is extremely sensitive to perturbation: even the smallest random change to a matrix can make it diagonalizable (van de Geijn and Myers, 2020). As a result, practical numerical software libraries do not compute it. Moreover, the proof takes dozens of pages to develop, so we leave it to interested readers.

12. Schur Decomposition

Theorem 12.1: (Schur Decomposition)


Any square matrix A ∈ Rn×n with real eigenvalues can be factored as

A = QU Q> ,

where Q is an orthogonal matrix, and U is an upper triangular matrix. That is, all square
matrix A with real eigenvalues can be triangularized.

A close look at Schur decomposition  The first columns of AQ and QU are Aq_1 and U_{11}q_1, respectively; hence U_{11} and q_1 are an eigenvalue and a corresponding eigenvector of A. But the other columns of Q need not be eigenvectors of A.

Schur decomposition for symmetric matrices  A symmetric matrix A = A^⊤ gives QUQ^⊤ = QU^⊤Q^⊤, hence U = U^⊤; an upper triangular matrix equal to its transpose must be diagonal. This diagonal matrix actually contains the eigenvalues of A, and all the columns of Q are eigenvectors of A. We conclude that all symmetric matrices are diagonalizable, even with repeated eigenvalues.
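To make the factorization concrete, here is a minimal numerical sketch (NumPy/SciPy; the example matrix is our own, not one from the text, and scipy.linalg.schur is assumed available) verifying A = QUQ^⊤ for a matrix with real eigenvalues.

```python
import numpy as np
from scipy.linalg import schur   # assumed available from SciPy

# Hypothetical example: a non-symmetric matrix with real eigenvalues 1, 2, 3, 4.
rng = np.random.default_rng(0)
P = rng.standard_normal((4, 4))
A = P @ np.diag([1.0, 2.0, 3.0, 4.0]) @ np.linalg.inv(P)

U, Q = schur(A, output='real')   # real Schur form: A = Q U Q^T

print(np.allclose(A, Q @ U @ Q.T))         # reconstruction holds
print(np.allclose(Q.T @ Q, np.eye(4)))     # Q is orthogonal
print(np.allclose(np.tril(U, k=-1), 0))    # U is upper triangular
print(np.sort(np.diag(U)))                 # diagonal of U: the eigenvalues 1, 2, 3, 4
```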

12.1 Existence of the Schur Decomposition


To prove Theorem 12.1, we need to use the following lemmas.

Lemma 12.2: (Determinant Intermezzo)


We have the following properties for determinant of matrices:
• The determinant of multiplication of two matrices is det(AB) = det(A) det(B);
• The determinant of the transpose is det(A> ) = det(A);
• Suppose matrix A has eigenvalue λ, then det(A − λI) = 0;
• Determinant of any identity matrix is 1;
• Determinant of an orthogonal matrix Q:

det(Q) = det(Q> ) = ±1, since det(Q> ) det(Q) = det(Q> Q) = det(I) = 1;

• For any square matrix A and any orthogonal matrix Q:

det(A) = det(Q> ) det(A) det(Q) = det(Q> AQ);


Lemma 12.3: (Submatrix with Same Eigenvalue)


Suppose square matrix Ak+1 ∈ R(k+1)×(k+1) has real eigenvalues λ1 , λ2 , . . . , λk+1 . Then
we can construct a k × k matrix Ak with eigenvalues λ2 , λ3 , . . . , λk+1 by

A_k = \begin{bmatrix} p_2^\top \\ p_3^\top \\ \vdots \\ p_{k+1}^\top \end{bmatrix} A_{k+1} \begin{bmatrix} p_2 & p_3 & \ldots & p_{k+1} \end{bmatrix},

where p_1 is an eigenvector of A_{k+1} with norm 1 corresponding to eigenvalue λ_1, and


p2 , p3 , . . . , pk+1 are any orthonormal vectors orthogonal to p1 .

Proof [of Lemma 12.3] Let P_{k+1} = [p_1, p_2, . . . , p_{k+1}]. Then P_{k+1}^⊤ P_{k+1} = I, and

P_{k+1}^\top A_{k+1} P_{k+1} = \begin{bmatrix} \lambda_1 & b^\top \\ 0 & A_k \end{bmatrix},

where b^⊤ = p_1^⊤ A_{k+1}[p_2, . . . , p_{k+1}] is a row vector that need not be zero; the first column is [λ_1; 0] since A_{k+1}p_1 = λ_1 p_1 and p_2, . . . , p_{k+1} are orthogonal to p_1.

For any eigenvalue λ ∈ {λ_2, λ_3, . . . , λ_{k+1}}, by Lemma 12.2, we have

\begin{aligned}
\det(A_{k+1} - \lambda I) &= \det(P_{k+1}^\top (A_{k+1} - \lambda I) P_{k+1}) \\
&= \det(P_{k+1}^\top A_{k+1} P_{k+1} - \lambda P_{k+1}^\top P_{k+1}) \\
&= \det\begin{bmatrix} \lambda_1 - \lambda & b^\top \\ 0 & A_k - \lambda I \end{bmatrix} \\
&= (\lambda_1 - \lambda)\det(A_k - \lambda I),
\end{aligned}

where the last equality follows from the fact that if a matrix M has the block formulation M = \begin{bmatrix} E & F \\ G & H \end{bmatrix} with E invertible, then det(M) = det(E) det(H − GE^{-1}F); here G = 0. Since λ is an eigenvalue of A_{k+1} and λ ≠ λ_1, the identity det(A_{k+1} − λI) = (λ_1 − λ) det(A_k − λI) = 0 implies that λ is also an eigenvalue of A_k.

We then prove the existence of the Schur decomposition by induction.


Proof [of Theorem 12.1: Existence of Schur Decomposition] We note that the
theorem is trivial when n = 1 by setting Q = 1 and U = A. Suppose the theorem is true for
n = k for some k ≥ 1. If we prove the theorem is also true for n = k + 1, then we complete
the proof.
Suppose for n = k, the theorem is true for Ak = Qk Uk Q> k.
Suppose further Pk+1 contains orthogonal vectors Pk+1 = [p1 , p2 , . . . , pk+1 ] as con-
structed in Lemma 12.3 where p1 is an eigenvector of Ak+1 corresponding to eigenvalue λ1
and its norm is 1, p2 , . . . , pk+1 are orthonormal to p1 . Let the other k eigenvalues of Ak+1
be λ_2, λ_3, . . . , λ_{k+1}. By Lemma 12.3, we can construct a k × k matrix A_k with real eigenvalues λ_2, λ_3, . . . , λ_{k+1} satisfying

P_{k+1}^\top A_{k+1} P_{k+1} = \begin{bmatrix} \lambda_1 & b^\top \\ 0 & A_k \end{bmatrix} \qquad\text{and}\qquad A_{k+1} P_{k+1} = P_{k+1}\begin{bmatrix} \lambda_1 & b^\top \\ 0 & A_k \end{bmatrix},

where b^⊤ is the row vector defined in the proof of Lemma 12.3.


 
Let Q_{k+1} = P_{k+1}\begin{bmatrix} 1 & 0 \\ 0 & Q_k \end{bmatrix}. Then, it follows that

\begin{aligned}
A_{k+1} Q_{k+1} &= A_{k+1} P_{k+1}\begin{bmatrix} 1 & 0 \\ 0 & Q_k \end{bmatrix} \\
&= P_{k+1}\begin{bmatrix} \lambda_1 & b^\top \\ 0 & A_k \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & Q_k \end{bmatrix} \\
&= P_{k+1}\begin{bmatrix} \lambda_1 & b^\top Q_k \\ 0 & A_k Q_k \end{bmatrix} \\
&= P_{k+1}\begin{bmatrix} \lambda_1 & b^\top Q_k \\ 0 & Q_k U_k \end{bmatrix} \qquad \text{(by the assumption for } n = k\text{)} \\
&= P_{k+1}\begin{bmatrix} 1 & 0 \\ 0 & Q_k \end{bmatrix}\begin{bmatrix} \lambda_1 & b^\top Q_k \\ 0 & U_k \end{bmatrix} \\
&= Q_{k+1} U_{k+1}, \qquad \Big(U_{k+1} = \begin{bmatrix} \lambda_1 & b^\top Q_k \\ 0 & U_k \end{bmatrix}\Big)
\end{aligned}

where U_{k+1} is upper triangular since U_k is. We then have A_{k+1} = Q_{k+1} U_{k+1} Q_{k+1}^⊤, where Q_{k+1} is an orthogonal matrix since P_{k+1} and \begin{bmatrix} 1 & 0 \\ 0 & Q_k \end{bmatrix} are both orthogonal matrices.

12.2 Other Forms of the Schur Decomposition


From the proof of the Schur decomposition, we obtain the upper triangular matrix Uk+1
by appending the eigenvalue λ1 to Uk . From this process, the values on the diagonal are
always eigenvalues. Therefore, we can decompose the upper triangular into two parts.

Corollary 12.4: (Form 2 of Schur Decomposition)


Any square matrix A ∈ Rn×n with real eigenvalues can be factored as

Q> AQ = Λ + T , or A = Q(Λ + T )Q> ,

where Q is an orthogonal matrix, Λ = diag(λ1 , λ2 , . . . , λn ) is a diagonal matrix containing


the eigenvalues of A, and T is a strictly upper triangular matrix.

A strictly upper triangular matrix is an upper triangular matrix having 0’s along the diag-
onal as well as the lower portion. Another proof for this decomposition is that A and
U (where U = Q> AQ) are similar matrices so that they have the same eigenvalues
(Lemma 8.5, p. 91), and the eigenvalues of any upper triangular matrix lie on its diagonal. To see this, note that for an upper triangular matrix R ∈ R^{n×n} with diagonal values r_{ii}, i ∈ {1, 2, . . . , n}, the matrix R − λI is again upper triangular, so

\det(R - \lambda I) = \prod_{i=1}^{n} (r_{ii} - \lambda),

and the roots of this characteristic polynomial, i.e., the eigenvalues of R, are exactly the diagonal values r_{ii}. So we can decompose U into Λ and T.


A final observation on the second form of the Schur decomposition is shown as follows.
From AQ = Q(Λ + T ), it follows that
Aq_k = \lambda_k q_k + \sum_{i=1}^{k-1} t_{ik} q_i,

where t_{ik} is the (i, k)-th entry of T. The form is quite close to the eigenvalue decomposition Aq_k = λ_k q_k; the difference is that the columns q_k are now orthonormal, and each Aq_k involves the preceding columns q_1, . . . , q_{k−1} as well.

13. Spectral Decomposition (Theorem)

Theorem 13.1: (Spectral Decomposition)


A real matrix A ∈ Rn×n is symmetric if and only if there exists an orthogonal matrix Q
and a diagonal matrix Λ such that

A = QΛQ> ,

where the columns of Q = [q1 , q2 , . . . , qn ] are eigenvectors of A and are mutually or-
thonormal, and the entries of Λ = diag(λ1 , λ2 , . . . , λn ) are the corresponding eigenvalues
of A, which are real. And the rank of A is the number of nonzero eigenvalues. This is
known as the spectral decomposition or spectral theorem of real symmetric matrix
A. Specifically, we have the following properties:
1. A symmetric matrix has only real eigenvalues;
2. The eigenvectors are orthogonal such that they can be chosen orthonormal by
normalization;
3. The rank of A is the number of nonzero eigenvalues;
4. If the eigenvalues are distinct, the eigenvectors are unique as well.

The above decomposition is called the spectral decomposition for real symmetric matri-
ces and is often known as the spectral theorem.
Spectral theorem vs eigenvalue decomposition In the eigenvalue decomposition, we
require the matrix A to be square and the eigenvectors to be linearly independent. Whereas
in the spectral theorem, any symmetric matrix can be diagonalized, and the eigenvectors
are chosen to be orthonormal.
A word on the spectral decomposition In Lemma 8.5 (p. 91), we proved that the
eigenvalues of similar matrices are the same. From the spectral decomposition, we notice
that A and Λ are similar matrices such that their eigenvalues are the same. For any diagonal
matrices, the eigenvalues are the diagonal components. 19 To see this, we realize that
Λei = λi ei ,
where ei is the i-th basis vector. Therefore, the matrix Λ contains the eigenvalues of A.
19. Actually, we have shown in the last section that the diagonal values for triangular matrices are the
eigenvalues of it.
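As a quick numerical illustration of Theorem 13.1 (a minimal NumPy sketch; the example matrix below is our own, not from the text):

```python
import numpy as np

# Spectral decomposition A = Q Lambda Q^T of a real symmetric matrix via eigh.
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2.0                              # symmetrize

lam, Q = np.linalg.eigh(A)                       # real eigenvalues, orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))    # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(5)))           # Q is orthogonal
print(np.isrealobj(lam))                         # eigenvalues are real (Lemma 13.2)
```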


13.1 Existence of the Spectral Decomposition


We prove the theorem in several steps.
Symmetric Matrix Property 1 of 4

Lemma 13.2: (Real Eigenvalues)


The eigenvalues of any symmetric matrix are all real.

Proof [of Lemma 13.2] Suppose eigenvalue λ is a complex number λ = a + ib where a, b


are real. Its complex conjugate is λ̄ = a − ib. Same for complex eigenvector x = c + id and
its complex conjugate x̄ = c − id where c, d are real vectors. We then have the following
property

Ax = \lambda x \;\Longrightarrow\; A\bar{x} = \bar{\lambda}\bar{x} \;\Longrightarrow\; \bar{x}^\top A^\top = \bar{x}^\top A = \bar{\lambda}\bar{x}^\top,

where the first step takes complex conjugates (A is real) and the last step uses A^⊤ = A. We take the dot product of the first equation with x̄ and the last equation with x:

\bar{x}^\top A x = \lambda \bar{x}^\top x, \qquad\text{and}\qquad \bar{x}^\top A x = \bar{\lambda}\bar{x}^\top x.

Then we have the equality λ x̄^⊤x = λ̄ x̄^⊤x. Since x̄^⊤x = (c − id)^⊤(c + id) = c^⊤c + d^⊤d is a positive real number (x ≠ 0), it follows that λ = λ̄, i.e., the imaginary part of λ is zero and λ is real.

Symmetric Matrix Property 2 of 4

Lemma 13.3: (Orthogonal Eigenvectors)


The eigenvectors corresponding to distinct eigenvalues of any symmetric matrix
are orthogonal, so that we can normalize the eigenvectors to make them orthonormal, since Ax = λx implies A\frac{x}{\|x\|} = \lambda\frac{x}{\|x\|}, i.e., the normalized vector corresponds to the same eigenvalue.

Proof [of Lemma 13.3] Suppose eigenvalues λ_1, λ_2 correspond to eigenvectors x_1, x_2 so that Ax_1 = λ_1 x_1 and Ax_2 = λ_2 x_2. Using A^⊤ = A, we have the following equalities:

Ax_1 = \lambda_1 x_1 \;\Longrightarrow\; x_1^\top A = \lambda_1 x_1^\top \;\Longrightarrow\; x_1^\top A x_2 = \lambda_1 x_1^\top x_2,

and

Ax_2 = \lambda_2 x_2 \;\Longrightarrow\; x_1^\top A x_2 = \lambda_2 x_1^\top x_2,

which implies λ_1 x_1^⊤x_2 = λ_2 x_1^⊤x_2. Since λ_1 ≠ λ_2, we must have x_1^⊤x_2 = 0, i.e., the eigenvectors are orthogonal.

In the above Lemma 13.3, we prove that the eigenvectors corresponding to distinct
eigenvalues of symmetric matrices are orthogonal. More generally, we prove the important
theorem that eigenvectors corresponding to distinct eigenvalues of any matrix are linearly
independent.


Theorem 13.4: (Independent Eigenvector Theorem)


If a matrix A ∈ Rn×n has k distinct eigenvalues, then any set of k corresponding eigen-
vectors are linearly independent.

Proof [of Theorem 13.4] We will prove by induction. Firstly, we will prove that any two
eigenvectors corresponding to distinct eigenvalues are linearly independent. Suppose v1 , v2
correspond to distinct eigenvalues λ1 and λ2 respectively. Suppose further there exists a
nonzero vector x = [x1 , x2 ] 6= 0 that

x1 v1 + x2 v2 = 0. (13.1)

That is, v_1, v_2 are linearly dependent. Multiplying Equation (13.1) on the left by A, we get

x1 λ1 v1 + x2 λ2 v2 = 0. (13.2)

Multiply Equation (13.1) on the left by λ2 , we get

x1 λ2 v1 + x2 λ2 v2 = 0. (13.3)

Subtract Equation (13.2) from Equation (13.3) to find

x1 (λ2 − λ1 )v1 = 0.

Since λ2 6= λ1 , v1 6= 0, we must have x1 = 0. From Equation (13.1), v2 6= 0, we must also


have x2 = 0 which arrives at a contradiction. Thus v1 , v2 are linearly independent.
Now, suppose any j < k eigenvectors are linearly independent, if we could prove that any
j + 1 eigenvectors are also linearly independent, we finish the proof. Suppose v1 , v2 , . . . , vj
are linearly independent and vj+1 is dependent on the first j eigenvectors. That is, there
exists a nonzero vector x = [x1 , x2 , . . . , xj ] 6= 0 that

vj+1 = x1 v1 + x2 v2 + . . . + xj vj . (13.4)

Suppose the j + 1 eigenvectors correspond to distinct eigenvalues λ1 , λ2 , . . . , λj , λj+1 . Mul-


tiply Equation (13.4) on the left by A, we get

λj+1 vj+1 = x1 λ1 v1 + x2 λ2 v2 + . . . + xj λj vj . (13.5)

Multiply Equation (13.4) on the left by λj+1 , we get

λj+1 vj+1 = x1 λj+1 v1 + x2 λj+1 v2 + . . . + xj λj+1 vj . (13.6)

Subtract Equation (13.6) from Equation (13.5), we find

x1 (λj+1 − λ1 )v1 + x2 (λj+1 − λ2 )v2 + . . . + xj (λj+1 − λj )vj = 0.

From assumption, λj+1 6= λi for all i ∈ {1, 2, . . . , j}, and vi 6= 0 for all i ∈ {1, 2, . . . , j}. We
must have x1 = x2 = . . . = xj = 0 which leads to a contradiction. Then v1 , v2 , . . . , vj , vj+1
are linearly independent. This completes the proof.


Corollary 13.5: (Independent Eigenvector Theorem, CNT.)


If a matrix A ∈ Rn×n has n distinct eigenvalues, then any set of n corresponding eigen-
vectors form a basis for Rn .

Symmetric Matrix Property 3 of 4

Lemma 13.6: (Orthonormal Eigenvectors for Duplicate Eigenvalue)


If A has a duplicate eigenvalue λi with multiplicity k ≥ 2, then there exist k
orthonormal eigenvectors corresponding to λi .

Proof [of Lemma 13.6] We note that there is at least one eigenvector xi1 corresponding
to λi . And for such eigenvector xi1 , we can always find additional n − 1 orthonormal
vectors y2 , y3 , . . . , yn so that {xi1 , y2 , y3 , . . . , yn } forms an orthonormal basis in Rn . Put
the y2 , y3 , . . . , yn into matrix Y1 and {xi1 , y2 , y3 , . . . , yn } into matrix P1

Y1 = [y2 , y3 , . . . , yn ] and P1 = [xi1 , Y1 ].

We then have

P_1^\top A P_1 = \begin{bmatrix} \lambda_i & 0 \\ 0 & Y_1^\top A Y_1 \end{bmatrix}.

As a result, A and P1> AP1 are similar matrices such that they have the same eigenvalues
since P1 is nonsingular (even orthogonal here, see Lemma 8.5, p. 91). We obtain

det(P_1^⊤AP_1 − λI_n) = (λ_i − λ) det(Y_1^⊤AY_1 − λI_{n−1}).^{20}

If λi has multiplicity k ≥ 2, then the term (λi − λ) occurs k times in the polynomial from
the determinant det(P1> AP1 −λIn ), i.e., the term occurs k −1 times in the polynomial from
det(Y1> AY1 − λIn−1 ). In another word, det(Y1> AY1 − λi In−1 ) = 0 and λi is an eigenvalue
of Y1> AY1 .
Let B = Y1> AY1 . Since det(B − λi In−1 ) = 0, the null space of B − λi In−1 is not none.
Suppose (B − λi In−1 )n = 0, i.e., Bn = λi n and n is an eigenvector of B.
      
From P_1^\top A P_1 = \begin{bmatrix} \lambda_i & 0 \\ 0 & B \end{bmatrix}, we have A P_1 \begin{bmatrix} z \\ n \end{bmatrix} = P_1 \begin{bmatrix} \lambda_i & 0 \\ 0 & B \end{bmatrix}\begin{bmatrix} z \\ n \end{bmatrix}, where z is any scalar.
From the left side of this equation, we have

A P_1 \begin{bmatrix} z \\ n \end{bmatrix} = \begin{bmatrix} \lambda_i x_{i1} & A Y_1 \end{bmatrix}\begin{bmatrix} z \\ n \end{bmatrix} = \lambda_i z x_{i1} + A Y_1 n. \qquad (13.7)

20. By the fact that if a matrix M has the block formulation M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, then det(M) = det(A) det(D − CA^{-1}B).


And from the right side of the equation, we have

P_1 \begin{bmatrix} \lambda_i & 0 \\ 0 & B \end{bmatrix}\begin{bmatrix} z \\ n \end{bmatrix}
= \begin{bmatrix} x_{i1} & Y_1 \end{bmatrix}\begin{bmatrix} \lambda_i & 0 \\ 0 & B \end{bmatrix}\begin{bmatrix} z \\ n \end{bmatrix}
= \begin{bmatrix} \lambda_i x_{i1} & Y_1 B \end{bmatrix}\begin{bmatrix} z \\ n \end{bmatrix}
= \lambda_i z x_{i1} + Y_1 B n
= \lambda_i z x_{i1} + \lambda_i Y_1 n, \qquad (13.8)

where the last equality uses Bn = λ_i n.

Combine Equation (13.8) and Equation (13.7), we obtain

AY1 n = λi Y1 n,

which means Y1 n is an eigenvector of A corresponding to the eigenvalue λi (same eigenvalue


corresponding to xi1 ). Since Y1 n is a combination of y2 , y3 , . . . , yn which are orthonormal
to xi1 , the Y1 n can be chosen to be orthonormal to xi1 .
To conclude, if we have one eigenvector xi1 corresponding to λi whose multiplicity is
k ≥ 2, we could construct the second eigenvector by choosing one vector from the null space
of (B − λi In−1 ) constructed above. Suppose now, we have constructed the second eigen-
vector xi2 which is orthonormal to xi1 . For such eigenvectors xi1 , xi2 , we can always find
additional n−2 orthonormal vectors y3 , y4 , . . . , yn so that {xi1 , xi2 , y3 , y4 , . . . , yn } forms an
orthonormal basis in Rn . Put the y3 , y4 , . . . , yn into matrix Y2 and {xi1 , xi2 , y3 , y4 , . . . , yn }
into matrix P2 :

Y_2 = [y_3, y_4, . . . , y_n] \qquad\text{and}\qquad P_2 = [x_{i1}, x_{i2}, Y_2].

We then have

P_2^\top A P_2 = \begin{bmatrix} \lambda_i & 0 & 0 \\ 0 & \lambda_i & 0 \\ 0 & 0 & Y_2^\top A Y_2 \end{bmatrix} = \begin{bmatrix} \lambda_i & 0 & 0 \\ 0 & \lambda_i & 0 \\ 0 & 0 & C \end{bmatrix},

where C = Y_2^⊤AY_2 such that det(P_2^⊤AP_2 − λI_n) = (λ_i − λ)^2 det(C − λI_{n−2}). If the multiplicity of λ_i is k ≥ 3, then det(C − λ_i I_{n−2}) = 0 and the null space of C − λ_i I_{n−2} is nonempty, so that we can still find a vector n from the null space of C − λ_i I_{n−2} with Cn = λ_i n. Now we can construct a vector \begin{bmatrix} z_1 \\ z_2 \\ n \end{bmatrix} ∈ R^n, where z_1, z_2 are any scalar values, such that

A P_2 \begin{bmatrix} z_1 \\ z_2 \\ n \end{bmatrix} = P_2 \begin{bmatrix} \lambda_i & 0 & 0 \\ 0 & \lambda_i & 0 \\ 0 & 0 & C \end{bmatrix}\begin{bmatrix} z_1 \\ z_2 \\ n \end{bmatrix}.

Similarly, from the left side of the above equation, we will get λi z1 xi1 + λi z2 xi2 + AY2 n.
From the right side of the above equation, we will get λi z1 xi1 + λi z2 xi2 + λi Y2 n. As a
result,
AY2 n = λi Y2 n,


where Y2 n is an eigenvector of A and orthogonal to xi1 , xi2 . And it is easy to construct


the eigenvector to be orthonormal to the first two.
The process can go on, and finally, we will find k orthonormal eigenvectors corresponding
to λi .
Actually, the dimension of the null space of P1> AP1 − λi In is equal to the multiplicity
k. It also follows that if the multiplicity of λi is k, there cannot be more than k orthogonal
eigenvectors corresponding to λi . Otherwise, it will come to the conclusion that we could
find more than n orthogonal eigenvectors which leads to a contradiction.

The proof of the existence of the spectral decomposition is trivial from the lemmas
above. Also, we can use Schur decomposition to prove the existence of it.
Proof [of Theorem 13.1: Existence of Spectral Decomposition] From the Schur decomposition in Theorem 12.1 (p. 110), a symmetric matrix A = A^⊤ gives QUQ^⊤ = QU^⊤Q^⊤, hence U = U^⊤; an upper triangular matrix equal to its transpose must be diagonal. This diagonal matrix contains the eigenvalues of A, and all the columns of Q are eigenvectors of A. We conclude that all symmetric matrices are diagonalizable, even with repeated eigenvalues.

For any matrix multiplication, we have the rank of the multiplication result no larger
than the rank of the inputs. However, the symmetric matrix A> A is rather special in that
the rank of A> A is equal to that of A which will be used in the proof of singular value
decomposition in the next section.

Lemma 13.7: (Rank of AB)


For any matrix A ∈ Rm×n , B ∈ Rn×k , then the matrix multiplication AB ∈ Rm×k has
rank(AB)≤min(rank(A), rank(B)).

Proof [of Lemma 13.7] For matrix multiplication AB, we have


• All rows of AB are combinations of rows of B, the row space of AB is a subset of
the row space of B. Thus rank(AB)≤rank(B).
• All columns of AB are combinations of columns of A, the column space of AB is a
subset of the column space of A. Thus rank(AB)≤rank(A).
Therefore, rank(AB)≤min(rank(A), rank(B)).

Symmetric Matrix Property 4 of 4

Lemma 13.8: (Rank of Symmetric Matrices)


If A is an n×n real symmetric matrix, then rank(A) = the total number of nonzero
eigenvalues of A. In particular, A has full rank if and only if A is nonsingular.
Further, C(A) is the linear space spanned by the eigenvectors of A that correspond
to nonzero eigenvalues.


Proof [of Lemma 13.8] For any symmetric matrix A, we have A, in spectral form, as
A = QΛQ> and also Λ = Q> AQ. Since we have shown in Lemma 13.7 that the rank of
the multiplication rank(AB)≤min(rank(A), rank(B)).
• From A = QΛQ> , we have rank(A) ≤ rank(QΛ) ≤ rank(Λ);
• From Λ = Q> AQ, we have rank(Λ) ≤ rank(Q> A) ≤ rank(A),
Combining the two inequalities gives rank(A) = rank(Λ), which is the total number of nonzero eigenvalues.
Since A is nonsingular if and only if all of its eigenvalues are nonzero, A has full rank
if and only if A is nonsingular.

Similar to the eigenvalue decomposition, we can compute the m-th power of matrix A
via the spectral decomposition more efficiently.

Remark 13.9: (m-th Power)


The m-th power of A is Am = QΛm Q> if the matrix A can be factored as A = QΛQ> .
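A minimal sketch of this remark (NumPy; the example matrix is our own):

```python
import numpy as np

# Compute A^m through the spectral decomposition: A^m = Q Lambda^m Q^T.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2.0
m = 5

lam, Q = np.linalg.eigh(A)
A_m = Q @ np.diag(lam**m) @ Q.T                 # only the n eigenvalues are powered

print(np.allclose(A_m, np.linalg.matrix_power(A, m)))   # matches repeated multiplication
```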

13.2 Uniqueness of Spectral Decomposition


Clearly, the spectral decomposition is not unique essentially because of the multiplicity of
eigenvalues. One can imagine that eigenvalue λi and λj are the same for some 1 ≤ i, j ≤ n,
and interchange the corresponding eigenvectors in Q will have the same results but the
decompositions are different. But the eigenspaces (i.e., the null space N (A − λi I) for
eigenvalue λi ) corresponding to each eigenvalue are fixed. So there is a unique decomposition
in terms of eigenspaces and then any orthonormal basis of these eigenspaces can be chosen.

13.3 Other Forms, Connecting Eigenvalue Decomposition*


In this section, we discuss other forms of the spectral decomposition under different condi-
tions.

Definition 13.10: (Characteristic Polynomial)


For any square matrix A ∈ Rn×n , the characteristic polynomial det(λI − A) is given
by
det(λI − A) = λn − γn−1 λn−1 + . . . + γ1 λ + γ0
= (λ − λ1 )k1 (λ − λ2 )k2 . . . (λ − λm )km ,
where λ1 , λ2 , . . . , λm are the distinct roots of det(λI − A) and also the eigenvalues of A,
and k1 + k2 + . . . + km = n, i.e., det(λI − A) is a polynomial of degree n for any matrix
A ∈ Rn×n (see proof of Lemma 13.6, p. 116).

An important multiplicity arises from the characteristic polynomial of a matrix is then


defined as follows:


Definition 13.11: (Algebraic Multiplicity and Geometric Multiplicity)


Given the characteristic polynomial of matrix A ∈ Rn×n :

det(λI − A) = (λ − λ1 )k1 (λ − λ2 )k2 . . . (λ − λm )km .

The integer ki is called the algebraic multiplicity of the eigenvalue λi , i.e., the algebraic
multiplicity of eigenvalue λi is equal to the multiplicity of the corresponding root of the
characteristic polynomial.
The eigenspace associated to eigenvalue λi is defined by the null space of (A −
λi I), i.e., N (A − λi I).
And the dimension of the eigenspace associated to λi , N (A − λi I), is called the
geometric multiplicity of λi .
In short, we denote the algebraic multiplicity of λi by alg(λi ), and its geometric
multiplicity by geo(λi ).

Remark 13.12: (Geometric Multiplicity)


Note that for matrix A and the eigenspace N (A−λi I), the dimension of the eigenspace is
also the number of linearly independent eigenvectors of A associated to λi , namely a basis
for the eigenspace. This implies that while there are an infinite number of eigenvectors
associated with each eigenvalue λi , the fact that they form a subspace (provided the zero
vector is added) means that they can be described by a finite number of vectors.

By definition, the sum of the algebraic multiplicities is equal to n, but the sum of the
geometric multiplicities can be strictly smaller.

Corollary 13.13: (Multiplicity in Similar Matrices)


Similar matrices have same algebraic multiplicities and geometric multiplicities.

Proof [of Corollary 13.13] In Lemma 8.5 (p. 91), we proved that the eigenvalues of similar
matrices are the same, therefore, the algebraic multiplicities of similar matrices are the same
as well.
Suppose A and B = P AP −1 are similar matrices where P is nonsingular. And the
geometric multiplicity of an eigenvalue of A, say λ, is k. Then there exists a set of orthogonal
vectors v1 , v2 , . . . , vk that are the basis for the eigenspace N (A−λI) such that Avi = λvi for
all i ∈ {1, 2, . . . , k}. Then, wi = P vi ’s are the eigenvectors of B associated with eigenvalue
λ. Further, wi ’s are linearly independent since P is nonsingular. Thus, the dimension of
the eigenspace N (B − λi I) is at least k, that is, dim(N (A − λI)) ≤ dim(N (B − λI)).
Similarly, there exists a set of orthogonal vectors w1 , w2 , . . . , wk that are the bases for
the eigenspace N (B − λI), then vi = P −1 wi for all i ∈ {1, 2, . . . , k} are the eigenvectors of
A associated to λ. This will result in dim(N (B − λI)) ≤ dim(N (A − λI)).
Therefore, by “sandwiching”, we get dim(N (A − λI)) = dim(N (B − λI)), which is the
equality of the geometric multiplicities, and the claim follows.


Lemma 13.14: (Bounded Geometric Multiplicity)


For any matrix A ∈ Rn×n , its geometric multiplicity is bounded by algebraic multiplicity
for any eigenvalue λi :
geo(λi ) ≤ alg(λi ).

Proof [of Lemma 13.14] If we can find a similar matrix B of A that has a specific form of
the characteristic polynomial, then we complete the proof.
Suppose P1 = [v1 , v2 , . . . , vk ] contains the eigenvectors of A associated with λi which
are linearly independent. That is, the k vectors are bases for the eigenspace N (A − λI)
and the geometric multiplicity associated with λi is k. We can expand it to n linearly
independent vectors such that

P = [P1 , P2 ] = [v1 , v2 , . . . , vk , vk+1 , . . . , vn ],

where P is nonsingular. Then AP = [λ_i P_1, AP_2].

Construct a matrix B = \begin{bmatrix} \lambda_i I_k & C \\ 0 & D \end{bmatrix} where AP_2 = P_1 C + P_2 D; then P^{-1}AP = B, such that A and B are similar matrices. We can always find such C, D that satisfy the above condition, since the v_i's are linearly independent and span the whole space R^n, and any column of AP_2 is in the column space of P = [P_1, P_2]. Therefore,

\begin{aligned}
\det(A - \lambda I) &= \det(P^{-1})\det(A - \lambda I)\det(P) \qquad (\det(P^{-1}) = 1/\det(P)) \\
&= \det(P^{-1}(A - \lambda I)P) \qquad (\det(A)\det(B) = \det(AB)) \\
&= \det(B - \lambda I) \\
&= \det\begin{bmatrix} (\lambda_i - \lambda)I_k & C \\ 0 & D - \lambda I \end{bmatrix} \\
&= (\lambda_i - \lambda)^k \det(D - \lambda I),
\end{aligned}

where the last equality is from the fact that if a matrix M has the block formulation M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, then det(M) = det(A) det(D − CA^{-1}B). This implies
C D

geo(λi ) ≤ alg(λi ).

And we complete the proof.

Following from the proof of Lemma 13.6, we notice that the algebraic multiplicity and
geometric multiplicity are the same for symmetric matrices. We call these matrices simple
matrices.

Definition 13.15: (Simple Matrix)


When the algebraic multiplicity and geometric multiplicity are the same for a matrix, we
call it a simple matrix.


Definition 13.16: (Diagonalizable)


A matrix A is diagonalizable if there exists a nonsingular matrix P and a diagonal matrix
D such that A = P DP −1 .

The matrices in the eigenvalue decomposition of Theorem 11.1 and the spectral decomposition of Theorem 13.1 are precisely such diagonalizable matrices.

Lemma 13.17: (Simple Matrices are Diagonalizable)


A matrix is a simple matrix if and only if it is diagonalizable.

Proof [of Lemma 13.17] We will show by forward implication and backward implication
separately as follows.
Forward implication Suppose that A ∈ Rn×n is a simple matrix, such that the algebraic
and geometric multiplicities for each eigenvalue are equal. For a specific eigenvalue λi , let
{v1i , v2i , . . . , vki i } be a basis for the eigenspace N (A − λi I), that is, {v1i , v2i , . . . , vki i } is a
set of linearly independent eigenvectors of A associated to λi , where ki is the algebraic
or geometric multiplicity associated to λi : alg(λi ) = geo(λi ) = ki . Suppose there are m
distinct eigenvalues, since k1 + k2 + . . . + km = n, the set of eigenvectors consists of the
union of n vectors. Suppose there is a set of xj ’s such that
z = \sum_{j=1}^{k_1} x_j^1 v_j^1 + \sum_{j=1}^{k_2} x_j^2 v_j^2 + \ldots + \sum_{j=1}^{k_m} x_j^m v_j^m = 0. \qquad (13.9)

Let w^i = \sum_{j=1}^{k_i} x_j^i v_j^i. Then each w^i is either an eigenvector associated with λ_i or the zero vector, and z = \sum_{i=1}^{m} w^i is a sum of such terms associated with different eigenvalues of A. Since eigenvectors associated with different eigenvalues are linearly independent (Theorem 13.4), we must have w^i = 0 for all i ∈ {1, 2, . . . , m}. That is,

w^i = \sum_{j=1}^{k_i} x_j^i v_j^i = 0, \qquad \text{for all } i \in \{1, 2, \ldots, m\}.

Since we assume the eigenvectors vji ’s associated to λi are linearly independent, we must
have xij = 0 for all i ∈ {1, 2, . . . , m}, j ∈ {1, 2, . . . , ki }. Thus, the n vectors are linearly
independent:
\{v_1^1, v_2^1, \ldots, v_{k_1}^1\}, \{v_1^2, v_2^2, \ldots, v_{k_2}^2\}, \ldots, \{v_1^m, v_2^m, \ldots, v_{k_m}^m\}.
By the eigenvalue decomposition in Theorem 11.1, matrix A is diagonalizable.
Backward implication Suppose A is diagonalizable. That is, there exists a nonsingular
matrix P and a diagonal matrix D such that A = P DP −1 . A and D are similar matrices
such that they have the same eigenvalues (Lemma 8.5, p. 91), same algebraic multiplicities,
and geometric multiplicities (Corollary 13.13, p. 120). It can be easily verified that a di-
agonal matrix has equal algebraic multiplicity and geometric multiplicity such that A is a
simple matrix.


Remark 13.18: (Equivalence on Diagonalization)


Theorem 13.4 states that eigenvectors corresponding to different eigenvalues are linearly independent, and Remark 13.12 states that the geometric multiplicity is the dimension of the eigenspace. Hence, if the geometric multiplicity equals the algebraic multiplicity for every eigenvalue, the eigenspaces together span the whole space R^n for A ∈ R^{n×n}. So the above lemma is equivalent to the claim that A is diagonalizable if and only if its eigenspaces span the whole space R^n.

Corollary 13.19
A square matrix A ∈ R^{n×n} with n linearly independent eigenvectors is a simple matrix. If A is symmetric, it is also a simple matrix.

From the eigenvalue decomposition in Theorem 11.1 and the spectral decomposition in
Theorem 13.1, the proof is trivial for the corollary.
Now we are ready to show the second form of the spectral decomposition.

Theorem 13.20: (Spectral Decomposition: The Second Form)


A simple matrix A ∈ Rn×n can be factored as a sum of a set of idempotent matrices
n
X
A= λi Ai ,
i=1

where λi for all i ∈ {1, 2, . . . , n} are eigenvalues of A (duplicate possible), and also known
as the spectral values of A. Specifically, we have the following properties:
1. Idempotent: A2i = Ai for all i ∈ {1, 2, . . . , n};
2. Orthogonal:P Ai Aj = 0 for all i 6= j;
n
3. Additivity: i=1 Ai = In ;
4. Rank-Additivity: rank(A1 ) + rank(A2 ) + . . . + rank(An ) = n.

Proof [of Theorem 13.20] Since A is a simple matrix, from Lemma 13.17, there exists
a nonsingular matrix P and a diagonal matrix Λ such that A = P ΛP −1 where Λ =
diag(λ1 , λ2 , . . . , λn ), and λi ’s are eigenvalues of A and columns of P are eigenvectors of A.
Suppose

P = \begin{bmatrix} v_1 & v_2 & \ldots & v_n \end{bmatrix} \qquad\text{and}\qquad P^{-1} = \begin{bmatrix} w_1^\top \\ w_2^\top \\ \vdots \\ w_n^\top \end{bmatrix}

are the column and row partitions of P and P^{-1}, respectively. Then, we have

A = P\Lambda P^{-1} = \begin{bmatrix} v_1 & v_2 & \ldots & v_n \end{bmatrix} \Lambda \begin{bmatrix} w_1^\top \\ w_2^\top \\ \vdots \\ w_n^\top \end{bmatrix} = \sum_{i=1}^{n} \lambda_i v_i w_i^\top.


Let A_i = v_i w_i^⊤, so that A = \sum_{i=1}^{n} \lambda_i A_i. We realize that P^{-1}P = I such that

w_i^\top v_j = \begin{cases} 1, & \text{if } i = j; \\ 0, & \text{if } i \neq j. \end{cases}

Therefore,

A_i A_j = v_i w_i^\top v_j w_j^\top = \begin{cases} v_i w_i^\top = A_i, & \text{if } i = j; \\ 0, & \text{if } i \neq j. \end{cases}

This implies the idempotency and orthogonality of the A_i's. We also notice that \sum_{i=1}^{n} A_i = PP^{-1} = I, that is, the additivity of the A_i's. The rank-additivity of the A_i's is trivial since rank(A_i) = 1 for all i ∈ {1, 2, . . . , n}.
The decomposition is highly related to Cochran's theorem and its application in the
distribution theory of linear models (Lu, 2021c,e).
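The construction in the proof is easy to check numerically. A minimal sketch (NumPy; our own example, using a symmetric matrix so that the eigenvector matrix is orthogonal):

```python
import numpy as np

# Second form: A = sum_i lambda_i A_i with A_i = v_i w_i^T, where v_i are the
# columns of P and w_i^T are the rows of P^{-1}.
rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2.0                      # symmetric, hence a simple matrix

lam, P = np.linalg.eigh(A)
W = np.linalg.inv(P)                     # rows are w_i^T
A_parts = [np.outer(P[:, i], W[i, :]) for i in range(4)]

print(np.allclose(sum(l * Ai for l, Ai in zip(lam, A_parts)), A))   # A = sum lambda_i A_i
print(np.allclose(A_parts[0] @ A_parts[0], A_parts[0]))             # idempotent
print(np.allclose(A_parts[0] @ A_parts[1], np.zeros((4, 4))))       # mutually orthogonal
print(np.allclose(sum(A_parts), np.eye(4)))                         # sum A_i = I
```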

Theorem 13.21: (Spectral Decomposition: The Third Form)


A simple matrix A ∈ Rn×n with k distinct eigenvalues can be factored as a sum of a
set of idempotent matrices
A = \sum_{i=1}^{k} \lambda_i A_i,

where λi for all i ∈ {1, 2, . . . , k} are the distinct eigenvalues of A, and also known as the
spectral values of A. Specifically, we have the following properties:
1. Idempotent: A2i = Ai for all i ∈ {1, 2, . . . , k};
2. Orthogonal: Ai Aj = 0 for all i 6= j;
3. Additivity: \sum_{i=1}^{k} A_i = I_n;
4. Rank-Additivity: rank(A1 ) + rank(A2 ) + . . . + rank(Ak ) = n.

Proof [of Theorem 13.21] From Theorem 13.20, we can decompose A by A = \sum_{j=1}^{n} \beta_j B_j. Without loss of generality, the eigenvalues β_j's are ordered such that β_1 ≤ β_2 ≤ . . . ≤ β_n, where duplicates are possible. Let the λ_i's be the distinct eigenvalues, and let A_i be the sum of the B_j's associated with λ_i. Suppose the multiplicity of λ_i is m_i, and the B_j's associated with λ_i are denoted as {B_1^i, B_2^i, . . . , B_{m_i}^i}. Then A_i can be denoted as A_i = \sum_{j=1}^{m_i} B_j^i. Apparently A = \sum_{i=1}^{k} \lambda_i A_i.

Idempotency  A_i^2 = (B_1^i + B_2^i + \ldots + B_{m_i}^i)(B_1^i + B_2^i + \ldots + B_{m_i}^i) = B_1^i + B_2^i + \ldots + B_{m_i}^i = A_i from the idempotency and orthogonality of the B_j^i's.

Orthogonality  A_i A_j = (B_1^i + B_2^i + \ldots + B_{m_i}^i)(B_1^j + B_2^j + \ldots + B_{m_j}^j) = 0 from the orthogonality of the B_j^i's.

Additivity  It is trivial that \sum_{i=1}^{k} A_i = I_n.

Rank-Additivity  rank(A_i) = rank(\sum_{j=1}^{m_i} B_j^i) = m_i such that rank(A_1) + rank(A_2) + \ldots + rank(A_k) = m_1 + m_2 + \ldots + m_k = n.
. . . + rank(Ak ) = m1 + m2 + . . . + mk = n.


Theorem 13.22: (Spectral Decomposition: Backward Implication)


If a matrix A ∈ Rn×n with k distinct eigenvalues can be factored as a sum of a set of
idempotent matrices
A = \sum_{i=1}^{k} \lambda_i A_i,

where λi for all i ∈ {1, 2, . . . , k} are the distinct eigenvalues of A, and


1. Idempotent: A2i = Ai for all i ∈ {1, 2, . . . , k};
2. Orthogonal: Ai Aj = 0 for all i 6= j;
3. Additivity: \sum_{i=1}^{k} A_i = I_n;
4. Rank-Additivity: rank(A1 ) + rank(A2 ) + . . . + rank(Ak ) = n.
Then, the matrix A is a simple matrix.

Proof [of Theorem 13.22] Suppose rank(A_i) = r_i for all i ∈ {1, 2, . . . , k}. By ULV decomposition in Theorem 4.1, A_i can be factored as

A_i = U_i \begin{bmatrix} L_i & 0 \\ 0 & 0 \end{bmatrix} V_i,

where L_i ∈ R^{r_i×r_i}, and U_i ∈ R^{n×n}, V_i ∈ R^{n×n} are orthogonal matrices. Let

X_i = U_i \begin{bmatrix} L_i \\ 0 \end{bmatrix} \qquad\text{and}\qquad V_i = \begin{bmatrix} Y_i \\ Z_i \end{bmatrix},

where X_i ∈ R^{n×r_i}, and Y_i ∈ R^{r_i×n} is the first r_i rows of V_i. Then, we have

A_i = X_i Y_i.

This can be seen as a reduced ULV decomposition of A_i. Appending the X_i's and Y_i's into X and Y,

X = [X_1, X_2, \ldots, X_k], \qquad Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{bmatrix},

where X ∈ R^{n×n} and Y ∈ R^{n×n} (from rank-additivity). By block matrix multiplication and the additivity of the A_i's, we have

XY = \sum_{i=1}^{k} X_i Y_i = \sum_{i=1}^{k} A_i = I.

Therefore Y is the inverse of X, and

YX = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{bmatrix}[X_1, X_2, \ldots, X_k] = \begin{bmatrix} Y_1X_1 & Y_1X_2 & \ldots & Y_1X_k \\ Y_2X_1 & Y_2X_2 & \ldots & Y_2X_k \\ \vdots & \vdots & \ddots & \vdots \\ Y_kX_1 & Y_kX_2 & \ldots & Y_kX_k \end{bmatrix} = I,


such that

Y_i X_j = \begin{cases} I_{r_i}, & \text{if } i = j; \\ 0, & \text{if } i \neq j. \end{cases}

This implies

A_i X_j = \begin{cases} X_i, & \text{if } i = j; \\ 0, & \text{if } i \neq j, \end{cases} \qquad\text{and}\qquad AX_i = \lambda_i X_i.

Finally, we have

AX = A[X_1, X_2, \ldots, X_k] = [\lambda_1 X_1, \lambda_2 X_2, \ldots, \lambda_k X_k] = X\Lambda,

where

\Lambda = \begin{bmatrix} \lambda_1 I_{r_1} & 0 & \ldots & 0 \\ 0 & \lambda_2 I_{r_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_k I_{r_k} \end{bmatrix}

is a diagonal matrix. This implies A can be diagonalized, and from Lemma 13.17, A is a simple matrix.

Corollary 13.23: (Forward and Backward Spectral)


Combining Theorem 13.21 and Theorem 13.22, we can claim that a matrix A ∈ Rn×n is a
simple matrix with k distinct eigenvalues if and only if it can be factored as a sum of a
set of idempotent matrices
A = \sum_{i=1}^{k} \lambda_i A_i,

where λi for all i ∈ {1, 2, . . . , k} are the distinct eigenvalues of A, and


1. Idempotent: A2i = Ai for all i ∈ {1, 2, . . . , k};
2. Orthogonal: Ai Aj = 0 for all i 6= j;
3. Additivity: \sum_{i=1}^{k} A_i = I_n;
4. Rank-Additivity: rank(A1 ) + rank(A2 ) + . . . + rank(Ak ) = n.

13.4 Skew-Symmetric Matrices and its Properties*


We have introduced the spectral decomposition for symmetric matrices. A special kind of matrix related to symmetric matrices is the skew-symmetric matrix.

Definition 13.24: (Skew-Symmetric Matrix)


If matrix A ∈ Rn×n have the following property, then it is known as a skew-symmetric
matrix:
A> = −A.


Note that under this definition, for the diagonal values aii for all i ∈ {1, 2, . . . , n}, we
have aii = −aii which implies all the diagonal components are 0.
We have proved in Lemma 13.2 that all the eigenvalues of symmetric matrices are real.
Similarly, we could show that all the eigenvalues of skew-symmetric matrices are imaginary.

Lemma 13.25: (Imaginary Eigenvalues)


The eigenvalues of any skew-symmetric matrix are all imaginary or zero.

Proof [of Lemma 13.25] Suppose eigenvalue λ is a complex number λ = a + ib where a, b


are real. Its complex conjugate is λ̄ = a − ib. Same for complex eigenvector x = c + id and
its complex conjugate x̄ = c − id where c, d are real vectors. We then have the following
property

Ax = \lambda x \;\Longrightarrow\; A\bar{x} = \bar{\lambda}\bar{x} \;\Longrightarrow\; \bar{x}^\top A^\top = \bar{\lambda}\bar{x}^\top,

where the first step takes complex conjugates (A is real). We take the dot product of the first equation with x̄ and the last equation with x:

\bar{x}^\top A x = \lambda \bar{x}^\top x, \qquad\text{and}\qquad \bar{x}^\top A^\top x = \bar{\lambda}\bar{x}^\top x.

Then we have the equality −λ x̄^⊤x = λ̄ x̄^⊤x (since A^⊤ = −A). Since x̄^⊤x = (c − id)^⊤(c + id) = c^⊤c + d^⊤d is a positive real number, the real part of λ is zero and λ is either purely imaginary or zero.
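A quick numerical check of this lemma (NumPy; the example matrix is our own):

```python
import numpy as np

# Eigenvalues of a real skew-symmetric matrix are purely imaginary or zero.
rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = B - B.T                              # skew-symmetric: A^T = -A

eigvals = np.linalg.eigvals(A)
print(np.allclose(eigvals.real, 0))      # real parts vanish (up to round-off)
print(np.round(eigvals.imag, 4))         # imaginary parts come in +/- pairs; one is ~0 since n is odd
```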

Lemma 13.26: (Odd Skew-Symmetric Determinant)


For skew-symmetric matrix A ∈ Rn×n , if n is odd, then det(A) = 0.

Proof [of Lemma 13.26] When n is odd, we have


det(A) = det(A> ) = det(−A) = (−1)n det(A) = − det(A).
This implies det(A) = 0.

Theorem 13.27: (Block-Diagonalization of Skew-Symmetric Matrices)


A real skew-symmetric matrix A ∈ Rn×n can be factored as

A = ZDZ > ,

where Z is an n × n nonsingular matrix, and D is a block-diagonal matrix with the


following form

D = \operatorname{diag}\left(\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, \ldots, \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, 0, \ldots, 0\right).

Proof [of Theorem 13.27] We will prove by recursive calculation. As usual, we will denote
the entry (i, j) of matrix A by Aij .


Case 1). Suppose the first row of A is nonzero. We notice that EAE^⊤ is skew-symmetric if A is skew-symmetric, for any matrix E, so that the diagonals of both A and EAE^⊤ are zero, and the upper-left 2 × 2 submatrix of EAE^⊤ has the form

(EAE^\top)_{1:2,1:2} = \begin{bmatrix} 0 & x \\ -x & 0 \end{bmatrix}.

Since we suppose the first row of A is nonzero, there exists a permutation matrix P (Definition 0.17, p. 15) that moves a nonzero value, say a, of the first row into the (1, 2) entry of PAP^⊤. As discussed above, the upper-left 2 × 2 submatrix of PAP^⊤ has the form

(PAP^\top)_{1:2,1:2} = \begin{bmatrix} 0 & a \\ -a & 0 \end{bmatrix}.

Construct a nonsingular matrix M = \begin{bmatrix} 1/a & 0 \\ 0 & I_{n-1} \end{bmatrix} such that the upper-left 2 × 2 submatrix of MPAP^⊤M^⊤ has the form

(MPAP^\top M^\top)_{1:2,1:2} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}.

Now we finish diagonalizing the upper-left 2 × 2 block. Suppose now (M P AP > M > ) above
has a nonzero value, say b, in the first row at entry (1, j) for some j > 2, we can construct a nonsingular matrix L = I − b · E_{j2}, where E_{j2} is an all-zero matrix except that the entry (j, 2) is 1, such that LMPAP^⊤M^⊤L^⊤ introduces a 0 in the entry that held the value b.

A Trivial Example

For example, suppose MPAP^⊤M^⊤ is a 3 × 3 matrix with the following value

MPAP^\top M^\top = \begin{bmatrix} 0 & 1 & b \\ -1 & 0 & \times \\ \times & \times & 0 \end{bmatrix}, \qquad\text{and}\qquad L = I - b\cdot E_{j2} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -b & 1 \end{bmatrix},

where j = 3 for this specific example. This results in

LMPAP^\top M^\top L^\top = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -b & 1 \end{bmatrix}\begin{bmatrix} 0 & 1 & b \\ -1 & 0 & \times \\ \times & \times & 0 \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -b \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & \times \\ \times & \times & 0 \end{bmatrix}.

Similarly, if the second row of LM P AP > M > L> contains a nonzero value, say c, we could
construct a nonsingular matrix K = I + c · Ej1 such that KLM P AP > M > L> K > will
introduce 0 for the entry with value c.

A Trivial Example


For example, suppose LMPAP^⊤M^⊤L^⊤ is a 3 × 3 matrix with the following value

LMPAP^\top M^\top L^\top = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & c \\ \times & \times & 0 \end{bmatrix}, \qquad\text{and}\qquad K = I + c\cdot E_{j1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ c & 0 & 1 \end{bmatrix},

where j = 3 for this specific example. This results in

KLMPAP^\top M^\top L^\top K^\top = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ c & 0 & 1 \end{bmatrix}\begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & c \\ \times & \times & 0 \end{bmatrix}\begin{bmatrix} 1 & 0 & c \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ \times & \times & 0 \end{bmatrix}.

Since we have shown that KLMPAP^⊤M^⊤L^⊤K^⊤ is also skew-symmetric, it is actually

KLMPAP^\top M^\top L^\top K^\top = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},

so that we do not need to tackle the first two rows and columns any further.

Applying this process to the bottom-right (n − 2) × (n − 2) submatrix recursively completes the construction.
Case 2). Suppose the first row of A is zero. Apply a permutation matrix to move the first row (and column) to the last position and apply the process of Case 1 to finish the proof.

From the block-diagonalization of skew-symmetric matrices above, we could easily find


that the rank of a skew-symmetric matrix is even. And we could prove the determinant of
skew-symmetric matrices of even order is nonnegative as follows.

Lemma 13.28: (Even Skew-Symmetric Determinant)


For skew-symmetric matrix A ∈ Rn×n , if n is even, then det(A) ≥ 0.

Proof [of Lemma 13.28] By Theorem 13.27, we could block-diagonalize A = ZDZ > such
that
det(A) = det(ZDZ > ) = det(Z)2 det(D) ≥ 0.
This completes the proof.

13.5 Applications
13.5.1 Application: Eigenvalue of Projection Matrix
In Section 14.4 (p. 141), we will introduce how the QR decomposition can be applied to solve the least squares problem, where we consider the overdetermined system Ax = b with A ∈ R^{m×n} being the data matrix and b ∈ R^m with m > n being the observation vector. Normally A will have full column rank, since data drawn from real applications are very likely to be unrelated. The least squares solution is given by x_{LS} = (A^⊤A)^{-1}A^⊤b, which minimizes ||Ax − b||^2, where A^⊤A is invertible since A has full column rank and rank(A^⊤A) = rank(A). The recovered observation vector is then b̂ = Ax_{LS} = A(A^⊤A)^{-1}A^⊤b. The vector b may not be in the column space of A, but the recovered b̂ is in this column space. We then define such a matrix H = A(A^⊤A)^{-1}A^⊤ to be a projection matrix, i.e., it projects b onto the column space of A. It is also known as the hat matrix, since it puts a hat on b. It can be easily verified that the projection matrix is symmetric and idempotent (i.e., H^2 = H).

Remark 13.29: (Column Space of Projection Matrix)


We notice that the hat matrix H = A(A> A)−1 A> is to project any vector in Rm into
the column space of A. That is, Hy ∈ C(A). Notice again that Hy is nothing but a combination of the columns of H; thus C(H) = C(A).
In general, for any projection matrix H that projects vectors onto a subspace V, we have C(H) = V. More formally, this property can be proved via the SVD.

We now show that for any projection matrix, it has specific eigenvalues.

Proposition 13.30: (Eigenvalue of Projection Matrix)


The only possible eigenvalues of a projection matrix are 0 and 1.

Proof [of Proposition 13.30] Since H is symmetric, we have spectral decomposition H =


QΛQ> . From the idempotent property, we have
(QΛQ> )2 = QΛQ>
QΛ2 Q> = QΛQ>
Λ2 = Λ
λ2i = λi ,
Therefore, the only possible eigenvalues for H are 0 and 1.
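A minimal numerical sketch of this proposition (NumPy; the data matrix below is a hypothetical example):

```python
import numpy as np

# The hat matrix H = A (A^T A)^{-1} A^T is symmetric, idempotent, and its
# eigenvalues are only 0 and 1 (with n ones for a full-column-rank A).
rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))                  # full column rank with probability 1
H = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(H, H.T))                       # symmetric
print(np.allclose(H @ H, H))                     # idempotent
print(np.round(np.linalg.eigvalsh(H), 10))       # three 0's and three 1's
```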

This property of the projection matrix is important for the analysis of distribution
theory for linear models. See (Lu, 2021e) for more details. Following from the eigen-
value of the projection matrix, it can also give rise to the perpendicular projection I − H.

Proposition 13.31: (Project onto V ⊥ )


Let V be a subspace and H be a projection onto V. Then I − H is the projection matrix
onto V ⊥ .

Proof [of Proposition 13.31] First, (I − H) is symmetric, (I − H)> = I − H > = I − H


since H is symmetric. And
(I − H)2 = I 2 − IH − HI + H 2 = I − H.


Thus I − H is a projection matrix. By spectral theorem again, let H = QΛQ> . Then


I − H = QQ> − QΛQ> = Q(I − Λ)Q> . Hence the column space of I − H is spanned
by the eigenvectors of H corresponding to the zero eigenvalues of H (by Proposition 13.30,
p. 130), which coincides with V ⊥ .

Again, for a detailed analysis of the origin of the projection matrix and the results behind it, we highly recommend that the reader refer to (Lu, 2021c), although it is not the main interest of this survey on matrix decomposition.

13.5.2 Application: An Alternative Definition on PD and PSD of Matrices


In Definition 2.2 (p. 29), we defined the positive definite matrices and positive semidefinite
matrices by the quadratic form of the matrices. We here prove that a symmetric matrix is
positive definite if and only if all eigenvalues are positive.

Lemma 13.32: (Eigenvalues of PD and PSD Matrices)


A matrix A ∈ Rn×n is positive definite (PD) if and only if A has only positive eigenval-
ues. And a matrix A ∈ Rn×n is positive semidefinite (PSD) if and only if A has only
nonnegative eigenvalues.

Proof [of Lemma 13.32] We will prove by forward implication and reverse implication
separately as follows.
Forward implication: Suppose A is PD, then for any eigenvalue λ and its corresponding
eigenvector v of A, we have Av = λv. Thus

v > Av = λ||v||2 > 0.

This implies λ > 0.


Reverse implication: Conversely, suppose the eigenvalues are positive. By spectral
decomposition of A = QΛQ> . If x is a nonzero vector, let y = Q> x, we have
x^\top A x = x^\top(Q\Lambda Q^\top)x = (x^\top Q)\Lambda(Q^\top x) = y^\top \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2 > 0.

That is, A is PD.


Analogously, we can prove the second part of the claim.
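A small numerical sketch of the lemma (NumPy; the example matrices are our own):

```python
import numpy as np

# Positive definiteness / semidefiniteness read off from the eigenvalues.
rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
A_pd  = B @ B.T + 0.1 * np.eye(4)        # PD: all eigenvalues positive
A_psd = B[:, :2] @ B[:, :2].T            # PSD of rank 2: some eigenvalues are zero

print(np.all(np.linalg.eigvalsh(A_pd) > 0))          # True
print(np.all(np.linalg.eigvalsh(A_psd) >= -1e-12))   # True up to round-off
print(np.sum(np.linalg.eigvalsh(A_psd) > 1e-12))     # 2 nonzero eigenvalues = rank
```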

Theorem 13.33: (Nonsingular Factor of PSD and PD Matrices)


A real symmetric matrix A is PSD if and only if A can be factored as A = P > P , and is
PD if and only if P is nonsingular.

Proof [of Theorem 13.33] For the first part, we will prove by forward implication and
reverse implication separately as follows.


Forward implication: Suppose A is PSD, its spectral decomposition is given by


A = QΛQ> . Since eigenvalues of PSD matrices are nonnegative, we can decompose
Λ = Λ1/2 Λ1/2 . Let P = Λ1/2 Q> , we can decompose A by A = P > P .
Reverse implication: If A can be factored as A = P > P , then all eigenvalues of A are
nonnegative since for any eigenvalues λ and its corresponding eigenvector v of A, we have
\lambda = \frac{v^\top A v}{v^\top v} = \frac{v^\top P^\top P v}{v^\top v} = \frac{\|Pv\|^2}{\|v\|^2} \geq 0.
This implies A is PSD by Lemma 13.32.
Similarly, we can prove the second part for PD matrices where the positive definiteness
will result in the nonsingular P and the nonsingular P will result in the positiveness of the
eigenvalues. 21

13.5.3 Proof for Semidefinite Rank-Revealing Decomposition


In this section, we provide a proof for Theorem 2.10 (p. 39), the existence of the rank-
revealing decomposition for positive semidefinite matrix.
Proof [of Theorem 2.10] The proof is a consequence of the nonsingular factor of PSD
matrices (Theorem 13.33, p. 131) and the existence of column-pivoted QR decomposition
(Theorem 3.2, p. 53).
By Theorem 13.33, the nonsingular factor of PSD matrix A is given by A = Z > Z,
where Z = Λ1/2 Q> and A = QΛQ> is the spectral decomposition of A.
By Lemma 13.8, the rank of matrix A is the number of nonzero eigenvalues (here the
number of positive eigenvalues since A is PSD). Therefore only r components in the diagonal
of Λ1/2 are nonzero, and Z = Λ1/2 Q> contains only r independent columns, i.e., Z is of
rank r. By column-pivoted QR decomposition, we have
 
ZP = Q\begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix},
where P is a permutation matrix, R11 ∈ Rr×r is upper triangular with positive diagonals,
and R12 ∈ Rr×(n−r) . Therefore
 >  
P^\top A P = P^\top Z^\top Z P = \begin{bmatrix} R_{11}^\top & 0 \\ R_{12}^\top & 0 \end{bmatrix}\begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix}.

Let

R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix},
we find the rank-revealing decomposition for semidefinite matrix P > AP = R> R.

This decomposition is produced by using complete pivoting, which at each stage per-
mutes the largest diagonal element in the active submatrix into the pivot position. The
procedure is similar to the partial pivoting discussed in Section 1.9.1 (p. 25).
21. See also wiki page: https://en.wikipedia.org/wiki/Sylvester’s criterion.


13.5.4 Application: Cholesky Decomposition via the QR Decomposition and the Spectral Decomposition
In this section, we provide another proof for the existence of Cholesky decomposition.
Proof [of Theorem 2.1] From Theorem 13.33, the PD matrix A can be factored as A =
P > P where P is a nonsingular matrix. Then, the QR decomposition of P is given by
P = QR. This implies
A = P > P = R> Q> QR = R> R,

where we notice that the form is very similar to the Cholesky decomposition except that we
do not claim the R has only positive diagonal values. From the CGS algorithm to compute
the QR decomposition, we realize that the diagonals of R are nonnegative, and if P is
nonsingular, the diagonals of R are also positive.
The proof for the above theorem is a consequence of the existence of both the QR decom-
position and the spectral decomposition. Thus, the existence of Cholesky decomposition
can be proved via the QR decomposition and the spectral decomposition in this sense.
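The construction above can be reproduced numerically. A minimal sketch (NumPy; our own example; the sign normalization of R is one way to enforce positive diagonals):

```python
import numpy as np

# Build the Cholesky factor of a PD matrix A from a nonsingular factor
# A = P^T P (via the spectral decomposition) followed by a QR decomposition of P.
rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                      # positive definite

lam, Q = np.linalg.eigh(A)
P = np.diag(np.sqrt(lam)) @ Q.T              # A = P^T P with P nonsingular
_, R = np.linalg.qr(P)                       # P = Q_r R
R = np.sign(np.diag(R))[:, None] * R         # flip row signs so that diag(R) > 0

print(np.allclose(R.T @ R, A))                    # A = R^T R
print(np.allclose(R, np.linalg.cholesky(A).T))    # equals the (upper) Cholesky factor
```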

13.5.5 Application: Unique Power Decomposition of Positive Definite Matrices

Theorem 13.34: (Unique Power Decomposition of PD Matrices)


Any n × n positive definite matrix A can be uniquely factored as A = B^2, where B is a positive definite matrix.

Proof [of Theorem 13.34] We first prove that there exists such positive definite matrix B
so that A = B 2 .

Existence Since A is PD which is also symmetric, the spectral decomposition of A is


given by A = QΛQ> . Since eigenvalues of PD matrices are positive by Lemma 13.32, the
square root of Λ exists. We can define B = QΛ1/2 Q> such that A = B 2 where B is
apparently PD.

Uniqueness  Suppose such a factorization is not unique; then there exist two such decompositions

A = B_1^2 = B_2^2,

where B1 and B2 are both PD. The spectral decompositions of them are given by

B_1 = Q_1\Lambda_1 Q_1^\top, \qquad\text{and}\qquad B_2 = Q_2\Lambda_2 Q_2^\top.

We notice that Λ21 and Λ22 contains the eigenvalues of A, and both eigenvalues of B1
and B2 contained in Λ1 and Λ2 are positive (since B1 and B2 are both PD). Without
loss of generality, we suppose Λ1 = Λ2 = Λ1/2 , and Λ = diag(λ1 , λ2 , . . . , λn ) such that
λ1 ≥ λ2 ≥ . . . ≥ λn . By B12 = B22 , we have

Q_1\Lambda Q_1^\top = Q_2\Lambda Q_2^\top \;\Longrightarrow\; Q_2^\top Q_1\Lambda = \Lambda Q_2^\top Q_1.


Let Z = Q_2^⊤Q_1. This implies Λ and Z commute, and Z must be a block diagonal matrix whose partitioning conforms to the block structure of Λ. This results in Λ^{1/2} = ZΛ^{1/2}Z^⊤ and

B_2 = Q_2\Lambda^{1/2}Q_2^\top = Q_2 Z\Lambda^{1/2}Z^\top Q_2^\top = Q_2 Q_2^\top Q_1\Lambda^{1/2}Q_1^\top Q_2 Q_2^\top = Q_1\Lambda^{1/2}Q_1^\top = B_1.
This completes the proof.
Similarly, we could prove the unique decomposition of PSD matrix A = B 2 where B is
PSD. A more detailed discussion of this topic can be found in (Koeber and Schäfer, 2006).
Decomposition for PD matrices To conclude, for PD matrix A, we can factor it into
A = R> R where R is an upper triangular matrix with positive diagonals as shown in The-
orem 2.1 by Cholesky decomposition, A = P > P where P is nonsingular in Theorem 13.33,
and A = B 2 where B is PD in Theorem 13.34.
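A minimal sketch of Theorem 13.34 (NumPy; our own example):

```python
import numpy as np

# The unique PD square root B = Q Lambda^{1/2} Q^T of a PD matrix A, so A = B^2.
rng = np.random.default_rng(8)
C = rng.standard_normal((4, 4))
A = C @ C.T + np.eye(4)                     # positive definite

lam, Q = np.linalg.eigh(A)
B = Q @ np.diag(np.sqrt(lam)) @ Q.T

print(np.allclose(B @ B, A))                # A = B^2
print(np.all(np.linalg.eigvalsh(B) > 0))    # B is positive definite
```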

14. Singular Value Decomposition (SVD)


In the eigenvalue decomposition, we factor the matrix into a diagonal matrix. However, this is not always possible. If A does not have linearly independent eigenvectors, such a diagonalization does not exist. The singular value decomposition (SVD) fills this gap. Instead of
factoring the matrix into an eigenvector matrix, SVD gives rise to two orthogonal matrices.
We provide the result of SVD in the following theorem and we will discuss the existence of
SVD in the next sections.

Theorem 14.1: (Reduced SVD for Rectangular Matrices)


For every real m × n matrix A with rank r, the matrix A can be factored as

A = U ΣV > ,

where Σ ∈ Rr×r is a diagonal matrix Σ = diag(σ1 , σ2 . . . , σr ) with σ1 ≥ σ2 ≥ . . . ≥ σr


and
• σi ’s are the nonzero singular values of A, in the meantime, they are the (positive)
square roots of the nonzero eigenvalues of A> A and AA> .
• Columns of U ∈ Rm×r contain the r eigenvectors of AA> corresponding to the r
nonzero eigenvalues of AA> .
• Columns of V ∈ Rn×r contain the r eigenvectors of A> A corresponding to the r
nonzero eigenvalues of A> A.
• Moreover, the columns of U and V are called the left and right singular vectors
of A, respectively.
• Further, the columns of U and V are orthonormal (by Spectral Theorem 13.1,
p. 113).
In particular, we can write out the matrix decomposition as a sum of outer products of vectors, A = U\Sigma V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top, which is a sum of r rank-one matrices.

If we append additional m − r silent columns that are orthonormal to the r eigenvectors


of AA> , just like the silent columns in the QR decomposition, we will have an orthogonal


matrix U ∈ R^{m×m}. The situation is similar for the columns of V. The comparison between the reduced and the full SVD is shown in Figure 17, where white entries are zero and blue entries are not necessarily zero.

[Figure 17: Comparison between the reduced and full SVD. (a) Reduced SVD: A_{m×n} = U_{m×r} Σ_{r×r} V_{n×r}^⊤. (b) Full SVD: A_{m×n} = U_{m×m} Σ_{m×n} V_{n×n}^⊤.]
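A minimal numerical sketch of Theorem 14.1 (NumPy; the example matrix is our own):

```python
import numpy as np

# Reduced SVD of a rank-r matrix and the link to the eigenvalues of A^T A.
rng = np.random.default_rng(9)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))   # rank 3, size 6 x 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD
r = int(np.sum(s > 1e-10))                         # numerical rank
Ur, Sr, Vr = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

print(r)                                           # 3
print(np.allclose(Ur @ Sr @ Vr.T, A))              # A = U Sigma V^T (reduced form)
print(np.allclose(s[:r]**2,                        # sigma_i^2 = eigenvalues of A^T A
                  np.sort(np.linalg.eigvalsh(A.T @ A))[::-1][:r]))
```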

14.1 Existence of the SVD


To prove the existence of the SVD, we need to use the following lemmas. We mentioned that
the singular values are the square roots of the eigenvalues of A^⊤A. Since negative values do not have (real) square roots, we must first show that these eigenvalues are nonnegative.

Lemma 14.2: (Nonnegative Eigenvalues of A> A)


For any matrix A ∈ Rm×n , A> A has nonnegative eigenvalues.

Proof [of Lemma 14.2] For any eigenvalue λ of A^⊤A and its corresponding eigenvector x, we
have
A^\top A x = \lambda x \;\Longrightarrow\; x^\top A^\top A x = \lambda x^\top x.
Since x^⊤A^⊤Ax = ||Ax||^2 ≥ 0 and x^⊤x > 0 (an eigenvector is nonzero), we then have λ ≥ 0.

Since A> A has nonnegative eigenvalues, we then can define the singular value σ ≥ 0
of A such that σ 2 is the eigenvalue of A> A, i.e., A> Av = σ 2 v . This is essential to the
existence of the SVD.
We have shown in Lemma 13.7 (p. 118) that rank(AB)≤min{rank(A), rank(B)}.
However, the symmetric matrix A> A is rather special in that the rank of A> A is equal to
rank(A). And we now prove it.

Lemma 14.3: (Rank of A> A)


A> A and A have same rank.

Proof [of Lemma 14.3] Let x ∈ N (A), we have

Ax = 0 \;\Longrightarrow\; A^\top A x = 0,

i.e., x ∈ N(A) implies x ∈ N(A^⊤A); therefore N(A) ⊆ N(A^⊤A).

135
Jun Lu

Further, let x ∈ N (A> A), we have


A^\top A x = 0 \;\Longrightarrow\; x^\top A^\top A x = 0 \;\Longrightarrow\; \|Ax\|^2 = 0 \;\Longrightarrow\; Ax = 0,

i.e., x ∈ N (A> A) leads to x ∈ N (A), therefore N (A> A) ⊆ N (A).


−−−−−→
As a result, by “sandwiching”, it follows that
N (A) = N (A> A) and dim(N (A)) = dim(N (A> A)).
By the fundamental theorem of linear algebra (Theorem 0.15, p. 14), A> A and A have the
same rank.
Apply the observation to A> , we can also prove that AA> and A have the same rank:
rank(A) = rank(A> A) = rank(AA> ).
In the form of the SVD, we claimed the matrix A is a sum of r rank-one matrices where
r is the number of nonzero singular values. And the number of nonzero singular values is
actually the rank of the matrix.

Lemma 14.4: (The Number of Nonzero Singular Values Equals the Rank)
The number of nonzero singular values of matrix A equals the rank of A.

Proof [of Lemma 14.4] The rank of any symmetric matrix (here A> A) equals the number
of nonzero eigenvalues (with repetitions) by Lemma 13.8. So the number of nonzero singular
values equals the rank of A> A. By Lemma 14.3, the number of nonzero singular values
equals the rank of A.

We are now ready to prove the existence of the SVD.


Proof [of Theorem 14.1: Existence of the SVD] Since A> A is a symmetric matrix,
by Spectral Theorem 13.1 (p. 113) and Lemma 14.2, there exists an orthogonal matrix V
such that
A> A = V Σ2 V > ,
where Σ is a diagonal matrix containing the singular values of A, i.e., Σ2 contains the
eigenvalues of A> A. Specifically, Σ = diag(σ1 , σ2 , . . . , σr ) and {σ12 , σ22 , . . . , σr2 } are the
nonzero eigenvalues of A> A with r being the rank of A. I.e., {σ1 , . . . , σr } are the singular
values of A. In this case, V ∈ Rn×r . Now we are into the central part.

Start from A> Avi = σi2 vi , ∀i ∈ {1, 2, . . . , r}, i.e., the eigenvector vi of A> A corre-
sponding to σi2 :
1. Multiply both sides by vi> :

v_i^\top A^\top A v_i = \sigma_i^2 v_i^\top v_i \;\Longrightarrow\; \|Av_i\|^2 = \sigma_i^2 \;\Longrightarrow\; \|Av_i\| = \sigma_i.

2. Multiply both sides by A:

AA^\top A v_i = \sigma_i^2 A v_i \;\Longrightarrow\; AA^\top \frac{Av_i}{\sigma_i} = \sigma_i^2 \frac{Av_i}{\sigma_i} \;\Longrightarrow\; AA^\top u_i = \sigma_i^2 u_i,

where we notice that this form identifies an eigenvector of AA^⊤ corresponding to σ_i^2, namely Av_i. Since the length of Av_i is σ_i, we define u_i = \frac{Av_i}{\sigma_i}, which has norm 1.

These u_i's are orthogonal because, for i ≠ j, (Av_i)^⊤(Av_j) = v_i^⊤A^⊤Av_j = σ_j^2 v_i^⊤v_j = 0. That is,

AA> = U Σ2 U > .

Since Avi = σi ui , we have


[Av_1, Av_2, \ldots, Av_r] = [\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_r u_r] \;\Longrightarrow\; AV = U\Sigma,
which completes the proof.
By appending silent columns in U and V , we can easily find the full SVD. A byproduct
of the above proof is that the spectral decomposition of A> A = V Σ2 V > will result in the
spectral decomposition of AA> = U Σ2 U > with the same eigenvalues.

Corollary 14.5: (Eigenvalues of A> A and AA> )


The nonzero eigenvalues of A> A and AA> are the same.

We have shown in Lemma 14.2 that the eigenvalues of A> A are nonnegative, such that the
eigenvalues of AA> are nonnegative as well.

Corollary 14.6: (Nonnegative Eigenvalues of A> A and AA> )


The eigenvalues of A> A and AA> are nonnegative.

The existence of the SVD is important for defining the effective rank of a matrix.

Definition 14.7: (Effective Rank vs Exact Rank)


The effective rank is also known as the numerical rank. Following from Lemma 14.4, the number of nonzero singular values is equal to the rank of a matrix. Assume the i-th largest singular value of A is denoted as σ_i(A). Then if σ_r(A) ≫ σ_{r+1}(A) ≈ 0, r is known as the numerical rank of A. Whereas, when σ_r(A) > σ_{r+1}(A) = 0, the matrix is known as having exact rank r, as we have used in most of our discussions.

14.2 Properties of the SVD


14.2.1 Four Subspaces in SVD
For any matrix A ∈ Rm×n , we have the following property:
• N (A) is the orthogonal complement of the row space C(A> ) in Rn : dim(N (A)) +
dim(C(A> )) = n;
• N (A> ) is the orthogonal complement of the column space C(A) in Rm : dim(N (A> ))+
dim(C(A)) = m;
This is called the fundamental theorem of linear algebra and is also known as the rank-
nullity theorem. From the SVD, we can find an orthonormal basis for each subspace.


[Figure 18: Orthonormal bases that diagonalize A from the SVD. The right singular vectors v_1, . . . , v_r span the row space of A and v_{r+1}, . . . , v_n span the nullspace of A (dimensions r and n − r in R^n); the left singular vectors u_1, . . . , u_r span the column space of A and u_{r+1}, . . . , u_m span the nullspace of A^⊤ (dimensions r and m − r in R^m); A maps each v_i to Av_i = σ_i u_i.]

Lemma 14.8: (Four Orthonormal Basis)


Given the full SVD of matrix A = U ΣV > , where U = [u1 , u2 , . . . , um ] and V =
[v1 , v2 , . . . , vn ] are the column partitions of U and V . Then, we have the following
property:
• {v1 , v2 , . . . , vr } is an orthonormal basis of C(A> );
• {vr+1 , vr+2 , . . . , vn } is an orthonormal basis of N (A);
• {u1 , u2 , . . . , ur } is an orthonormal basis of C(A);
• {ur+1 , ur+2 , . . . , um } is an orthonormal basis of N (A> ).
The relationship of the four subspaces is demonstrated in Figure 18 where A transfer
the row basis vi into column basis ui by σi ui = Avi for all i ∈ {1, 2, . . . , r}.

Proof [of Lemma 14.8] From Lemma 13.8, for symmetric matrix A> A, C(A> A) is spanned
by the eigenvectors, thus {v1 , v2 , . . . , vr } is an orthonormal basis of C(A> A).
Since,
1. A> A is symmetric, then the row space of A> A equals the column space of A> A.
2. All rows of A> A are the combinations of the rows of A, so the row space of A> A ⊆
the row space of A, i.e., C(A> A) ⊆ C(A> ).
3. Since rank(A> A) = rank(A) by Lemma 14.3, we then have
The row space of A> A = the column space of A> A = the row space of A, i.e.,
C(A> A) = C(A> ). Thus {v1 , v2 , . . . , vr } is an orthonormal basis of C(A> ).
Further, the space spanned by {vr+1 , vr+2 , . . . , vn } is an orthogonal complement to the
space spanned by {v1 , v2 , . . . , vr }, so {vr+1 , vr+2 , . . . , vn } is an orthonormal basis of N (A).

138
Matrix Decomposition and Applications

If we apply this process to AA> , we will prove the rest claims in the lemma. Also, we
can see that {u1 , u2 , . . . , ur } is a basis for the column space of A by Lemma 0.14 22 , since
ui = Av
σi , ∀i ∈ {1, 2, . . . , r}.
i

14.2.2 Relationship between Singular Values and Determinant


Let A ∈ Rn×n be a square matrix and the singular value decomposition of A is given by
A = U ΣV > , it follows that

| det(A)| = | det(U ΣV > )| = | det(Σ)| = σ1 σ2 . . . σn .

If all the singular values σi are nonzero, then det(A) 6= 0. That is, A is nonsingular. If
there is at least one singular value such that σi = 0, then det(A) = 0, and A does not have
full rank, and is not invertible. Then the matrix is called singular. This is why σi ’s are
known as the singular values.

14.2.3 Orthogonal Equivalence


We have defined in Definition 8.4 (p. 90) that A and P AP −1 are similar matrices for any
nonsingular matrix P . The orthogonal equivalence is defined in a similar way.

Definition 14.9: (Orthogonal Equivalent Matrices)


For any orthogonal matrices U and V , the matrices A and U AV are called orthogonal
equivalent matrices. Or unitary equivalent in complex domain when U and V are unitary.

Then, we have the following property for orthogonal equivalent matrices.

Lemma 14.10: (Orthogonal Equivalent Matrices)


For any orthogonal equivalent matrices A and B, then singular values are the same.

Proof [of Lemma 14.10] Since A and B are orthogonal equivalent, there exist orthogonal
matrices that B = U AV . We then have

BB > = (U AV )(V > A> U > ) = U AA> U > .

This implies BB > and AA> are similar matrices. By Lemma 8.5 (p. 91), the eigenvalues
of similar matrices are the same, which proves the singular values of A and B are the same.

14.2.4 SVD for QR

22. For any matrix A, let {r1 , r2 , . . . , rr } be a set of vectors in Rn which forms a basis for the row space,
then {Ar1 , Ar2 , . . . , Arr } is a basis for the column space of A.

139
Jun Lu

Lemma 14.11: (Orthogonal Equivalent Matrices)


Suppose the full QR decomposition for matrix A ∈ Rm×n with m ≥ n is given by A = QR
where Q ∈ Rm×m is orthogonal and R ∈ Rm×n is upper triangular. Then A and R have
the same singular values and right singular vectors.

Proof [of Lemma 14.11] We notice that A> A = R> R such that A> A and R> R have the
same eigenvalues and eigenvectors, i.e., A and R have the same singular values and right
singular vectors (i.e., the eigenvectors of A> A or R> R).

The above lemma implies that an SVD of a matrix can be constructed by the QR
decomposition of itself. Suppose the QR decomposition of A is given by A = QR and the
SVD of R is given by R = U0 ΣV > . Therefore, the SVD of A can be obtained by
A = QU0 ΣV > .
| {z }
U

14.3 Polar Decomposition

Theorem 14.12: (Polar Decomposition)


For every real n × n square matrix A with rank r, then matrix A can be factored as

A = Ql S,

where Ql is an orthogonal matrix, and S is a positive semidefinite matrix. And this form
is called the left polar decomposition. Also matrix A can be factored as

A = SQr ,

where Qr is an orthogonal matrix, and S is a positive semidefinite matrix. And this form
is called the right polar decomposition.
Specially, the left and right polar decomposition of a square matrix A is unique.

Since every n × n square matrix A has full SVD A = U ΣV > , where both U and
V are n × n orthogonal matrix. We then have A = (U V > )(V ΣV > ) = Ql S where it
can be easily verified that Ql = U V > is an orthogonal matrix and S = V ΣV > is a
symmetric matrix. We notice that the singular values in Σ are nonnegative, such that
S = V ΣV > = V Σ1/2 Σ1/2> V > showing that S is PSD.
Similarly, we have A = U ΣU > U V > = (U ΣU > )(U V > ) = SQr . And S = U ΣU > =
U Σ1/2 Σ1/2> U > such that S is PSD as well.
For the uniqueness of the right polar decomposition, we suppose the decomposition is
not unique, and two of the decompositions are given by
A = S1 Q1 = S2 Q2 ,
such that
S1 = S2 Q 2 Q >
1.

140
Matrix Decomposition and Applications

Since S1 and S2 are symmetric, we have


S12 = S1 S1> = S2 Q2 Q> > 2
1 Q 1 Q 2 S2 = S2 .

This implies S1 = S2 , and the decomposition is unique (Theorem 13.34, p. 133). Similarly,
the uniqueness of the left polar decomposition can be implied from the context.

Corollary 14.13: (Full Rank Polar Decomposition)


When A ∈ Rn×n has full rank, then the S in both the left and right polar decomposition
above is a symmetric positive definite matrix.

14.4 Application: Least Squares via the Full QR Decomposition, UTV, SVD
Least Squares via the Full QR Decomposition
Let’s consider the overdetermined system Ax = b, where A ∈ Rm×n is the data matrix,
b ∈ Rm with m > n is the observation matrix. Normally A will have full column rank
since the data from real work has a large chance to be unrelated. And the least squares
(LS) solution is given by xLS = (A> A)−1 A> b for minimizing ||Ax − b||2 , where A> A is
invertible since A has full column rank and rank(AT A) = rank(A).
However, the inverse of a matrix is not easy to compute, we can then use QR de-
composition to find the least squares solution as illustrated in the following theorem.

Theorem 14.14: (LS via QR for Full Column Rank Matrix)


Let A ∈ Rm×n and A = QR is its full QR decomposition with Q ∈ Rm×m being an
orthogonal matrix, R ∈ Rm×n being an upper triangular matrix appended by  additional

R1
m − n zero rows, and A has full column rank with m ≥ n. Suppose R = , where
0
R1 ∈ Rn×n is the square upper triangular in R, b ∈ Rm , then the LS solution to Ax = b
is given by
xLS = R1−1 c,
where c is the first n components of Q> b.

Proof [of Theorem 14.14] Since A = QR is the full QR decomposition of A and m ≥ n,


the last m − n rows of R are zero as shown n×n is the square
 in Figure 8. Then R1 ∈ R
R1
upper triangular in R and Q> A = R = . Thus,
0

||Ax − b||2 = (Ax − b)> (Ax − b)


= (Ax − b)> QQ> (Ax − b) (Since Q is an orthogonal matrix)
= ||Q> Ax − Q> b||2 (Invariant under orthogonal)
  2
R1
= x − Q> b
0
= ||R1 x − c||2 + ||d||2 ,

141
Jun Lu

where c is the first n components of Q> b and d is the last m − n components of Q> b.
And the LS solution can be calculated by back substitution of the upper triangular system
R1 x = c, i.e., xLS = R1−1 c.

To verify Theorem 14.14, for the full QR decomposition of A = QR where Q ∈ Rm×m


and R ∈ Rm×n . Together with the LS solution xLS = (A> A)−1 A> b, we obtain

xLS = (A> A)−1 A> b


= (R> Q> QR)−1 R> Q> b
= (R> R)−1 R> Q> b
= (R1> R1 )−1 R> Q> b (14.1)
= R1−1 R1−> R> Q> b
= R1−1 R1−> R1> Q>
1b
= R1−1 Q>
1 b,
 
R1
where R = and R1 ∈ Rn×n is an upper triangular matrix, and Q1 = Q1:m,1:n ∈ Rm×n
0
is the first n columns of Q (i.e., Q1 R1 is the reduced QR decomposition of A). Then the
result of Equation (14.1) agrees with Theorem 14.14.
To conclude, using the QR decomposition, we first derive directly the least squares result
which results in the argument in Theorem 14.14. Moreover, we verify the result of LS from
calculus indirectly by the QR decomposition as well. The two results coincide with each
other. For those who are interested in LS in linear algebra, a pictorial view of least squares
for full column rank A in the fundamental theorem of linear algebra is provided in (Lu,
2021d).

Least Squares via ULV/URV for Rank Deficient Matrices


In the above section, we introduced the LS via the full QR decomposition for full rank
matrices. However, if often happens that the matrix may be rank-deficient. If A does not
have full column rank, A> A is not invertible. We can then use the ULV/URV decomposition
to find the least squares solution as illustrated in the following theorem.

Theorem 14.15: (LS via ULV/URV for Rank Definient Matrix)


Let A ∈ Rm×n with rank r and m ≥ n. Suppose A = U T V is its full ULV/URV
decomposition with U ∈ Rm×m , V ∈ Rn×n being orthogonal matrix matrices, and
 
T11 0
T =
0 0

142
Matrix Decomposition and Applications

where T11 ∈ Rr×r is a lower triangular matrix or an upper triangular matrix. Suppose
b ∈ Rm , then the LS solution with the minimal 2-norm to Ax = b is given by
 −1 
> T11 c
xLS = V ,
0

where c is the first r components of U > b.


Proof [of Theorem 14.15] Since A = QR is the full QR decomposition of A and m > n,
the last m − n rows of R are zero as shown n×n is the square
 in Figure 8. Then R1 ∈ R
R1
upper triangular in R and Q> A = R = . Thus,
0

||Ax − b||2 = (Ax − b)> (Ax − b)


= (Ax − b)> U U > (Ax − b) (Since U is an orthogonal matrix)
= ||U > Ax − U > b||2 (Invariant under orthogonal)
> > 2
= ||U U T V x − U b||
= ||T V x − U > b||2
= ||T11 e − c||2 + ||d||2 ,

where c is the first r components of U > b and d is the last m − r components of U > b, e is
the first r components of V x and f is the last n − r components of V x:
   
> c e
U b= , Vx=
d f

And the LS solution can be calculated by back/forward substitution of the upper/lower


−1
triangular system T11 e = c, i.e., e = T11 c. For x to have a minimal 2-norm, f must be
zero. That is  −1 
> T11 c
xLS = V .
0
This completes the proof.

A word on the minimal 2-norm LS solution For the least squares problem, the set
of all minimizers
X = {x ∈ Rn : ||Ax − b|| = min}
is convex (Golub and Van Loan, 2013). And if x1 , x2 ∈ X and λ ∈ [0, 1], then

||A(λx1 + (1 − λ)x2 ) − b|| ≤ λ||Ax1 − b|| + (1 − λ)||Ax2 − b|| = minn ||Ax − b||.
x∈R

Thus λx1 + (1 − λ)x2 ∈ X . In above proof, if we do not set f = 0, we will find more
least squares solutions. However, the minimal 2-norm least squares solution is unique. For
full-rank case in the previous section, the least squares solution is unique and it must have
a minimal 2-norm. See also (Foster, 2003; Golub and Van Loan, 2013) for a more detailed
discussion on this topic.

143
Jun Lu

Least Squares via SVD for Rank Deficient Matrices


Apart form the UTV decomposition for rank-deficient least squares solution, SVD serves as
an alternative.

Theorem 14.16: (LS via SVD for Rank Deficient Matrix)


Let A ∈ Rm×n and A = U ΣV > is its full SVD decomposition with U ∈ Rm×m and
V ∈ Rn×n being orthogonal matrices and rank(A) = r. Suppose U = [u1 , u2 , . . . , um ],
V = [v1 , v2 , . . . , vn ] and b ∈ Rm , then the LS solution with minimal 2-norm to Ax = b
is given by
r
X u>
i b
xLS = vi = V Σ+ U > b, (14.2)
σi
i=1
 + 
+ n×m + Σ1 0
where the upper-left side of Σ ∈ R is a diagonal matrix, Σ = where
0 0
Σ+ 1 1 1
1 = diag( σ1 , σ2 , . . . , σr ).

Proof [of Theorem 14.16] Write out the loss to be minimized

||Ax − b||2 = (Ax − b)> (Ax − b)


= (Ax − b)> U U > (Ax − b) (Since U is an orthogonal matrix)
= ||U > Ax − U > b||2 (Invariant under orthogonal)
> > > 2
= ||U AV V x − U b|| (Since V is an orthogonal matrix)
= ||Σα − U > b||2 (Let α = V > x)
X r m
X
= (σi αi − u>
i b) 2
+ (u> 2
i b) . (Since σr+1 = σr+2 = . . . = σm = 0)
i=1 i=r+1

u> b
Since x only appears in α, we just need to set αi = σi i for all i ∈ {1, 2, . . . , r} to minimize
the above equation. For any value of αr+1 , αr+2 , . . . , αn , it won’t change the result. From
the regularization point of view (or here, we want the minimal 2-norm) we can set them to
be 0. This gives us the LS solution via SVD:
r
X u> b
xLS = i
vi = V Σ+ U > b = A+ b,
σi
i=1

where A+ = V Σ+ U > ∈ Rn×m is known as the pseudo-inverse of A.

14.5 Application: Principal Component Analysis (PCA) via the Spectral


Decomposition and the SVD
Given a data set of n observations {x1 , x2 , . . . , xn } where xi ∈ Rp for all i ∈ {1, 2, . . . , n}.
Our goal is to project the data onto a low-dimensional space, say m < p. Define the sample

144
Matrix Decomposition and Applications

mean vector and sample covariance matrix


n n
1X 1 X
x= xi and S= (xi − x)(xi − x)> .
n n−1
i=1 i=1

where the n − 1 term in the covariance matrix is to make it to be an unbiased consistent


estimator of the covariance (Lu, 2021e). Or the covariance matrix can also be defined as
1 Pn
S = n i=1 (xi − x)(xi − x)> which is also a consistent estimator of covariance matrix 23 .
Each data point xi is then projected onto a scalar value by u1 such that u> 1 xi . The
> >
mean of the projected data is obtained by E[u1 xi ] = u1 x, and the variance of the projected
data is given by
n n
1 X > 1 X >
Cov[u>
1 xi ] = (u1 xi − u> 2
1 x) = u1 (xi − x)(xi − x)> u1
n−1 n−1
i=1 i=1
>
= u1 Su1 .

We want to maximize the projected variance u> 1 Su1 with respect to u1 where we must
constrain ||u1 || to prevent ||u1 || → ∞ by setting it to be u>
1 u1 = 1. By Lagrange multiplier
(see (Bishop, 2006; Boyd et al., 2004)), we have

u> >
1 Su1 + λ1 (1 − u1 u1 ).

Trivial calculation will lead to

Su1 = λ1 u1 leads
−−−−−→to u>
1 Su1 = λ1 .

That is, u1 is an eigenvector of S corresponding to eigenvalue λ1 . And the maximum


variance projection u1 is corresponding to the largest eigenvalues of S. The eigenvector is
known as the first principal axis.
Define the other principal axes by decremental eigenvalues until we have m such prin-
cipal components bring about the dimension reduction. This is known as the maximum
variance formulation of PCA (Hotelling, 1933; Bishop, 2006; Shlens, 2014). A minimum-
error formulation of PCA is discussed in (Pearson, 1901; Bishop, 2006).
PCA via the spectral decomposition Now let’s assume the data are centered such
that x is zero, or we can set xi = xi − x to centralize the data. Let the data matrix
X ∈ Rn×p contain the data observations as rows. The covariance matrix is given by

X >X
S= ,
n−1
which is a symmetric matrix, and its spectral decomposition is given by

S = U ΛU > , (14.3)
23. Consistency: An estimator θn of θ constructed on the basis of a sample of size n is said to be consistent
p
if θn → θ as n → ∞.

145
Jun Lu

where U is an orthogonal matrix of eigenvectors (columns of U are eigenvectors of S),


and Λ = diag(λ1 , λ2 , . . . , λp ) is a diagonal matrix with eigenvalues (ordered such that
λ1 ≥ λ2 ≥ . . . ≥ λp ). The eigenvectors are called principal axes of the data, and they
decorrelate the the covariance matrix. Projections of the data on the principal axes are
called the principal components. The i-th principal component is given by the i-th column
of XU . If we want to reduce the dimension from p to m, we just select the first m columns
of XU .
PCA via the SVD If the SVD of X is given by X = P ΣQ> , then the covariance matrix
can be written as
X >X Σ2
S= =Q Q> , (14.4)
n−1 n−1
where Q ∈ Rp×p is an orthogonal matrix and contains the right singular vectors of X, and
the upper left of Σ is a diagonal matrix containing the singular values diag(σ1 , σ2 , . . .) with
σ1 ≥ σ2 ≥ . . .. The number of singular values is equal to min{n, p} which will not be larger
than p and some of which are zeros.
The above Equation (14.4) compared with Equation (14.3) implies Equation (14.4) is
also a spectral decomposition of S, since the eigenvalues in Λ and singular values in Σ are
ordered in a descending way and the uniqueness of the spectral decomposition in terms of
the eigenspaces (Section 13.2, p. 119).
This results in the right singular vectors Q are also the principal axes which decorrelate
the covariance matrix, and the singular values are related to the eigenvalues of the covariance
σi2
matrix via λi = n−1 . To reduce the dimensionality of the data from p to m, we should
select largest m singular values and the corresponding
Pm right singular vectors. This is also
related to the truncated SVD (TSVD) Xm = i=1 σi pi qi> as will be shown in the next
section, where pi ’s and qi ’s are the columns of P and Q.
A byproduct of PCA via the SVD for high-dimensional data For a principle axis
>
ui of S = Xn−1X , we have
X >X
ui = λi ui .
n−1
Left multiply by X, we obtain
XX >
(Xui ) = λi (Xui ),
n−1
>
which implies λi is also an eigenvalue of XX
n−1 ∈ R
n×n , and the corresponding eigenvector

is Xui . This is also stated in the proof of Theorem 14.1, the existence of the SVD. If
p  n, instead of finding the eigenvector of S, i.e., the principle axes of S, we can find the
>
eigenvector of XX 3 3
n−1 . This reduces the complexity from O(p ) to O(n ). Suppose now, the
XX >
eigenvector of n−1 is vi corresponding to nonzero eigenvalue λi ,
XX >
vi = λi vi .
n−1
Left multiply by X > , we obtain
X >X
(X > vi ) = S(X > vi ) = λi (X > vi ),
n−1

146
Matrix Decomposition and Applications

>
i.e., the eigenvector ui of S, is proportional to X > vi , where vi is the eigenvector of XX
n−1
corresponding to the same eigenvalue λi . A further normalization step is needed to make
||ui || = 1.

14.6 Application: Low-Rank Approximation


For a low-rank approximation problem, there are basically two types related due to the
interplay of rank and error: fixed-precision approximation problem and fixed-rank approxi-
mation problem. In the fixed-precision approximation problem, for a given matrix A and a
given tolerance , one wants to find a matrix B with rank r = r() such that ||A − B|| ≤ 
in an appropriate matrix norm. On the contrary, in the fixed-rank approximation problem,
one looks for a matrix B with fixed rank k and an error ||A − B|| as small as possible.
In this section, we will consider the latter. Some excellent examples can also be found in
(Kishore Kumar and Schneider, 2017; Martinsson, 2019).
Suppose we want to approximate matrix A ∈ Rm×n with rank r by a rank k < r matrix
B. The approximation is measured by spectral norm:

B = arg min ||A − B||2 ,


B

where the spectral norm is defined as follows:

Definition 14.17: (Spectral Norm)


The spectral norm of a matrix A ∈ Rm×n is defined as

||Ax||2
||A||2 = max = max ||Ax||2 ,
x6=0 ||x||2 u∈Rn :||u||2 =1

which is also the maximal singular value of A, i.e., ||A||2 = σ1 (A).

Then, we can recover the best rank-k approximation by the following theorem.

Theorem 14.18: (Eckart-Young-Misky Theorem w.r.t. Spectral Norm)


Given matrix A ∈ Rm×n and 1 ≤ k ≤ rank(A) = r, and Plet Ak be the truncated
k >
SVD P (TSVD) of A with the largest k terms, i.e., Ak = i=1 i ui vi from SVD of
σ
r
A = i=1 σi ui vi> by zeroing out the r − k trailing singular values of A. Then Ak is the
best rank-k approximation to A in terms of the spectral norm.

Proof [of Theorem 14.18] We need to show for any matrix B, if rank(B) = k, then
||A − B||2 ≥ ||A − Ak ||2 .
Since rank(B) = k, then dim(N (B)) = n − k. As a result, any k + 1 basis in Rn
intersects N (B). As shown in Lemma 14.8, {v1 , v2 , . . . , vr } is an orthonormal basis of
C(A> ) ⊂ Rn , so that we can choose the first k + 1 vi ’s as a k + 1 basis for Rn . Let
Vk+1 = [v1 , v2 , . . . , vk+1 ], then there is a vector x that

x ∈ N (B) ∩ C(Vk+1 ), s.t. ||x||2 = 1.

147
Jun Lu

Pk+1 Pk+1 Pk+1


That is x = i=1 ai vi , and || i=1 ai vi ||2 = i=1 a2i = 1. Thus,

||A − B||22 ≥ ||(A − B)x||22 · ||x||22 , (From defintion of spectral norm)


= ||Ax||22 , (x in null space of B)
k+1
X
= σi2 (vi> x)2 , (x orthogonal to vk+2 , . . . , vr )
i=1
k+1
X
2
≥ σk+1 (vi> x)2 , (σk+1 ≤ σk ≤ . . . ≤ σ1 )
i=1
k+1
X
2
≥ σk+1 a2i , (vi> x = ai )
i=1
2
= σk+1 .
Pr > 2
It is trivial that ||A − Ak ||22 = || i=k+1 σi ui vi ||2
2 . Thus, ||A − A || ≤ ||A − B|| ,
= σk+1 k 2 2
which completes the proof.

Moreover, readers can prove that Ak is the best rank-k approximation to A in terms
of the Frobenius norm. The minimal error is given by the Euclidean
q norm of the singular
2
values that have been zeroed out in the process ||A − Ak ||F = σk+1 + σk+22 + . . . + σr2 .
SVD gives the best approximation of a matrix. As mentioned in (Stewart, 1998;
Kishore Kumar and Schneider, 2017), the singular value decomposition is the creme de
la creme of rank-reducing decompositions — the decomposition that all others try to beat.
And also The SVD is the climax of this linear algebra course in (Strang, 2009).

Part VI
Special Topics
15. Coordinate Transformation in Matrix Decomposition
Suppose a vector v ∈ R3 and it has elements v = [3; 7; 2]. But what do these values 3, 7,
and 2 mean? In the Cartesian coordinate system, it means it has a component of 3 on the
x-axis, a component of 7 on the y-axis, and a component of 2 on the z-axis.

15.1 An Overview of Matrix Multiplication


Coordinate defined by a nonsingular matrix Suppose further a 3 × 3 nonsingular
matrix B which means B is invertible and columns of B are linearly independent. Thus
the 3 columns of B form a basis for the space R3 . One step forward, we can take the
3 columns of B as a basis for a new coordinate system, which we call the B coordinate
system. Going back to the Cartesian coordinate system, we also have three vectors as a
basis, e1 , e2 , e3 . If we put the three vectors into columns of a matrix, the matrix will be
an identity matrix. So Iv = v means transfer v from the Cartesian coordinate system into

148
Matrix Decomposition and Applications

the Cartesian coordinate system, the same coordinate. Similarly, Bv = u is to transfer v


from the Cartesian coordinate system into the B system. Specifically, for v = [3; 7; 2] and
B = [b1 , b2 , b3 ], we have u = Bv = 3b1 + 7b2 + 2b3 , i.e., u contains 3 of the first basis b1
of B, 7 of the second basis b2 of B, and 2 of the third basis b3 of B. If again, we want to
transfer the vector u from B coordinate system back to the Cartesian coordinate system,
we just need to multiply by B −1 u = v.

Coordinate defined by an orthogonal matrix A 3 × 3 orthogonal matrix Q defines


a “better” coordinate system since the three columns (i.e., basis) are orthonormal to each
other. Qv is to transfer v from the Cartesian to the coordinate system defined by the
orthogonal matrix. Since the basis vectors from the orthogonal matrix are orthonormal,
just like the three vectors e1 , e2 , e3 in the Cartesian coordinate system, the transformation
defined by the orthogonal matrix just rotates or reflects the Cartesian system. Q> can help
transfer back to the Cartesian coordinate system.

X 1  v2''
v2' X v2'''
1 v2

1
v1 v1'
v1'' v1'''

Figure 19: Eigenvalue Decomposition: X −1 transforms to a different coordinate system.


Λ stretches and X transforms back. X −1 and X are nonsingular, which will change the
basis of the system, and the angle between the vectors v1 and v2 will not be preserved,
that is, the angle between v1 and v2 is different from the angle between v10 and v20 . The
6 ||v10 || and ||v2 || =
length of v1 and v2 are also not preserved, that is, ||v1 || = 6 ||v20 ||.

15.2 Eigenvalue Decomposition

A square matrix A with linearly independent eigenvectors can be factored as A = XΛX −1


where X and X −1 are nonsingular so that they define a system transformation intrinsically.
Au = XΛX −1 u firstly transfers u into the system defined by X −1 . Let’s call this system
the eigen coordinate system. Λ is to stretch each component of the vector in the
eigen system by the length of the eigenvalue. And then X helps to transfer the resulting
vector back to the Cartesian coordinate system. A demonstration of how the eigenvalue
decomposition transforms between coordinate systems is shown in Figure 19 where v1 , v2
are two linearly independent eigenvectors of A such that they form a basis for R2 .

149
Jun Lu

2
QT  Q
2 q 2
1
q2

1
1
q1
 1q1
Q

Figure 20: Spectral Decomposition QΛQ> : Q> rotates or reflects, Λ stretches cycle to
ellipse, and Q rotates or reflects back. Orthogonal matrices Q> and Q only change the
basis of the system. However, they preserve the angle between the vectors q1 and q2 , and
the lengths of them.

15.3 Spectral Decomposition

A symmetric matrix A can be factored as A = QΛQ> where Q and Q> are orthogonal
so that they define a system transformation intrinsically. Au = QΛQ> u firstly rotates or
reflects u into the system defined by Q> . Let’s call this system the spectral coordinate
system. Λ is to stretch each component of the vector in the spectral system by the
length of eigenvalue. And then Q helps to rotate or reflect the resulting vector back to the
original coordinate system. A demonstration of how the spectral decomposition transforms
between coordinate systems is shown in Figure 20 where q1 , q2 are two linearly independent
eigenvectors of A such that they form a basis for R2 . The coordinate transformation in the
spectral decomposition is similar to that of the eigenvalue decomposition. Except that in
the spectral decomposition, the orthogonal vectors transferred by Q> are still orthogonal.
This is also a property of orthogonal matrices. That is, orthogonal matrices can be viewed
as matrices which change the basis of other matrices. Hence they preserve the angle (inner
product) between the vectors

u> v = (Qu)> (Qv).

The above invariance of the inner products of angles between the vectors are preserved,
which also relies on the invariance of their lengths:

||Qu|| = ||u||.

150
Matrix Decomposition and Applications

15.4 SVD

2
VT  U
1 2
v2

1
1 1
v1

Figure 21: SVD: V > and U rotate or reflect, Σ stretches the circle to an ellipse. Orthog-
onal matrices V > and U only change the basis of the system. However, they preserve the
angle between the vectors v1 and v2 , and the lengths of them.

2
VT  V  2 v2
1
v2

1
1
v1
 1v1
V

Figure 22: V ΣV > from SVD or Polar decomposition: V > rotates or reflects, Σ stretches
cycle to ellipse, and V rotates or reflects back. Orthogonal matrices V > and V only change
the basis of the system. However, they preserve the angle between the vectors v1 and v2 ,
and the lengths of them.

Any m × n matrix can be factored as A = U ΣV > . Au = U ΣV > u then firstly rotates


or reflects u into the system defined by V > , which we call the V coordinate system. Σ
stretches the first r components of the resulted vector in the V system by the lengths of
the singular values. If n ≥ m, then Σ only keeps additional m − r components which are
stretched to zero while removing the final n − m components. If m > n, the Σ stretches
n − r components to zero, and also adds additional m − n zero components. Finally, U
rotates or reflects the resulting vector into the U coordinate system defined by U . A
demonstration of how the SVD transforms in a 2×2 example is shown in Figure 21. Further,
Figure 22 demonstrates the transformation of V ΣV > in a 2 × 2 example. Similar to the
spectral decomposition, orthogonal matrices V > and U only change the basis of the system.
However, they preserve the angle between the vectors v1 and v2 .

151
Jun Lu

2
VT  V
1
v2

1
1
v1

Ql  2 v2
 2 v2

 1v1

 1v1

Figure 23: Polar decomposition: V > rotates or reflects, Σ stretches cycle to ellipse, and
V rotates or reflects back. Orthogonal matrices V > , V , Ql only change the basis of the
system. However, they preserve the angle between the vectors v1 and v2 , and the lengths
of them.

15.5 Polar Decomposition


Any n×n square matrix A can be factored as left polar decomposition A = (U V > )(V ΣV > ) =
Ql S. Similarly, Av = Ql (V ΣV > )u is to transfer u into the system defined by V > and
stretch each component by the lengths of the singular values. Then the resulted vector
is transferred back into the Cartesian coordinate system by V . Finally, Ql will rotate or
reflect the resulting vector from the Cartesian coordinate system into the Q system defined
by Ql . The meaning of right polar decomposition shares a similar description. Similar to
the spectral decomposition, orthogonal matrices V > and V only change the basis of the
system. However, they preserve the angle between the vectors v1 and v2 .

16. Alternating Least Squares


16.1 Netflix Recommender and Matrix Factorization
In the Netflix prize (Bennett et al., 2007), the goal was to predict ratings of users for
different movies, given the existing ratings of those users for other movies. We index M
movies with m = 1, 2, . . . , M and N users by n = 1, 2, . . . , N . We denote the rating of the
n-th user for the m-th movie by amn . Define A to be an M × N rating matrix with columns
an ∈ RM containing ratings of the n-th user. Note that many ratings amn are missing and
our goal is to predict those missing ratings accurately.

152
Matrix Decomposition and Applications

We formally consider algorithms for solving the following problem: The matrix A is
approximately factorized into an M × K matrix W and a K × N matrix Z. Usually K
is chosen to be smaller than M or N , so that W and Z are smaller than the original
matrix A. This results in a compressed version of the original data matrix. An appropriate
decision on the value of K is critical in practice, but the choice of K is very often problem
dependent. The factorization is significant in the sense, suppose A = [a1 , a2 , . . . , aN ] and
Z = [z1 , z2 , . . . , zN ] are the column partitions of A, Z respectively, then an = W zn , i.e.,
each column an is approximated by a linear combination of the columns of W weighted by
the components in zn . Therefore, columns of W can be thought of containing column basis
of A. This is similar to the factorization in the data interpretation part (Part III, p. 76).
What’s different is that we are not restricting W to be exact columns from A.
To find the approximation A ≈ W Z, we need to define a loss function such that
the distance between A and W Z can be measured. The loss function is selected to be
the Frobenius norm between two matrices which vanishes to zero if A = W Z where the
advantage will be seen shortly.
To simplify the problem, let us assume that there are no missing ratings firstly. Project
data vectors an to a smaller dimension zn ∈ RK with K < M , such that the reconstruction
error measured by Frobenius norm is minimized (assume K is known):
N X
X M  2
>
min amn − wm zn , (16.1)
W ,Z
n=1 m=1

where W = [w1> ; w2> ; . . . ; wM


> ] ∈ RM ×K and Z = [z , z , . . . , z ] ∈ RK×N containing w ’s
1 2 N m
and zn ’s as rows and columns respectively. The loss form in Equation (16.1) is known as
the per-example loss. It can be equivalently written as
N X
X M  2
>
L(W , Z) = amn − wm zn = ||W Z − A||2 .
n=1 m=1
PN PM >

Moreover, the loss L(W , Z) = n=1 m=1 amn − wm zn is convex with respect to Z
given W and vice versa. Therefore, we can first minimize with respect to Z given W and
then minimize with respect to W given Z:

 Z ← arg min L(W , Z);
 (ALS1)
Z

W ← arg min L(W , Z).


 (ALS2)
W

This is known as the coordinate descent algorithm in which case we employ the least squares,
it is also called the alternating least squares (ALS) (Comon et al., 2009; Takács and Tikk,
2012; Giampouras et al., 2018). The convergence is guaranteed if the loss function L(W , Z)
decreases at each iteration and we shall discuss more on this in the sequel.

Remark 16.1: (Convexity and Global Minimum)

153
Jun Lu

Although the loss function defined by Frobenius norm ||W Z − A||2 is convex in W given
Z or vice versa, it is not convex in both variables together. Therefore we are not able to
find the global minimum. However, the convergence is assured to find local minima.
Given W , Optimizing Z Now, let’s see what is in the problem of Z ← arg minZ L(W , Z).
When there exists a unique minimum of the loss function L(W , Z) with respect to Z, we
speak of the least squares minimizer of arg minZ L(W , Z). Given W , L(W , Z) can be
written as L(Z|W ) to emphasize on the variable of Z:
  2
W z1 − a1
 W z2 − a2 
L(Z|W ) = ||W Z − A||2 = kW [z1 , z2 , . . . , zN ] − [a1 , a2 , . . . , aN ]k2 =   .
24
 
..
 . 
W zN − aN
Now, if we define
     
W 0 ... 0 z1 a1
0 W ... 0   z2   a2 
W
f =  ∈ RM N ×KN , ze =  .  ∈ RKN , e =  .  ∈ RM N ,
a
    
 .. .. .. ..
 . . . .   ..   .. 
0 0 ... W zN aN

then the (ALS1) problem can be reduced to the normal least squares for minimizing ||W
f ze −
2
e || with respect to ze. And the solution is given by
a
f >W
ze = (W f )−1 W
f >a
e.

The construction may seem reasonable at first glance. But since rank(W f ) = min{M, K},
(W >
f W f ) is not invertible. A direct way to solve (ALS1) is to find the differential of L(Z|W )
with respect to Z:
∂ tr (W Z − A)(W Z − A)>

∂L(Z|W )
=
∂Z ∂Z
∂ tr (W Z − A)(W Z − A)> ∂(W Z − A)

(16.2)
=
∂(W Z − A) ∂Z
?
= 2W > (W Z − A) ∈ RK×N ,
qP
m,n 2
where the first equality is from the definition of Frobenius such that ||A|| = i=1,j=1 (Aij ) =
>)
tr(AA> ), and equality (?) comes from the fact that ∂tr(AA
p
∂A = 2A. When the loss
function is a differentiable function of Z, we may determine the least squares solution by
differential calculus, and a minimum of the function L(Z|W ) must be a root of the equation:
∂L(Z|W )
= 0.
∂Z
qP
m,n m×n
24. The matrix norm used here is the Frobenius norm such that ||A|| = i=1,j=1 (Aij ) if A ∈ R
2 .
pPn 2 n
And the vector norm used here is the l2 norm such that ||x||2 = i=1 xi if x ∈ R .

154
Matrix Decomposition and Applications

By finding the root of the above equation, we have the “candidate” update on Z that find
the minimizer of L(Z|W )

Z = (W > W )−1 W > A ← arg min L(Z|W ). (16.3)


Z

Before we declare a root of the above equation is actually a minimizer rather than a max-
imizer (that’s why we call the update a “candidate” update above), we need to verify the
function is convex such that if the function is twice differentiable, this can be equivalently
done by verifying
∂ 2 L(Z|W )
> 0,
∂Z 2
i.e., the Hessian matrix is positive definite (recall the definition of positive definiteness,
Definition 2.2, p. 29). To see this, we write out the twice differential

∂ 2 L(Z|W )
= 2W > W ∈ RK×K ,
∂Z 2
which has full rank if W ∈ RM ×K has full rank (Lemma 14.3, p. 135) and K < M . We
2
here claim that if W has full rank, then ∂ L(Z|W
∂Z 2
)
is positive definite. This can be done by
checking that when W has full rank, W x = 0 only when x = 0 since the null space of W
is of dimension 0. Therefore,

x> (2W > W )x > 0, for any nonzero vector x ∈ RK .

Now, the thing is that we need to check if W has full rank so that the Hessian of L(Z|W )
is positive definiteness, otherwise, we cannot claim the update of Z in Equation (16.3)
decreases the loss so that the matrix decomposition is going into the right way to better
approximate the original matrix A by W Z. We will shortly come back to the positive defi-
niteness of the Hessian matrix in the sequel which relies on the following lemma

Lemma 16.2: (Rank of Z after Updating)


Suppose A ∈ RM ×N has full rank with M ≤ N and W ∈ RM ×K has full rank with
K < M , then the update of Z = (W > W )−1 W > A ∈ RK×N in Equation (16.3) has full
rank.

Proof [of Lemma 16.2] Since W > W ∈ RK×K has full rank if W has full rank (Lemma 14.3,
p. 135) such that (W > W )−1 has full rank.
Suppose W > x = 0, this implies (W > W )−1 W > x = 0. Thus
 
N (W > ) ⊆ N (W > W )−1 W > .

Moreover, suppose (W > W )−1 W > x = 0, and since (W > W )−1 is invertible. This implies
W > x = (W > W )0 = 0, and
 
N (W > W )−1 W > ⊆ N (W > ).

155
Jun Lu

As a result, by “sandwiching”, it follows that


 
N (W > ) = N (W > W )−1 W > . (16.4)

Therefore, (W > W )−1 W > has full rank K. Let T = (W > W )−1 W > ∈ RK×M , and suppose
T > x = 0. This implies A> T > x = 0, and

N (T > ) ⊆ N (A> T > ).

Similarly, suppose A> (T > x)  = 0. Since A has full rank with the dimension of the null
space being 0: dim N (A> ) = 0, (T > x) must be zero. The claim follows from that since
A has full rank M with the row space of A> being equal to the column space of A where
dim (C(A)) = M and the dim N (A> ) = M − dim (C(A)) = 0. Therefore, x is in the null
space of T > if x is in the null space of A> T > :

N (A> T > ) ⊆ N (T > ).

By “sandwiching” again,
N (T > ) = N (A> T > ). (16.5)
> > > >
 
Since T has full rank K < M < N , dim N (T ) = dim N (A T ) = 0. Therefore,
Z > = A> T > has full rank K. We complete the proof.

Given Z, Optimizing W Given Z, L(W , Z) can be written as L(W |Z) to emphasize


on the variable of W :
L(W |Z) = ||W Z − A||2 .
A direct way to solve (ALS2) is to find the differential of L(W |Z) with respect to W :

∂ tr (W Z − A)(W Z − A)>

∂L(W |Z)
=
∂W ∂W
∂ tr (W Z − A)(W Z − A)> ∂(W Z − A)

=
∂(W Z − A) ∂W
= 2(W Z − A)Z > ∈ RM ×K .
∂L(W |Z)
The “candidate” update on W is similarly to find the root of the differential ∂W :

W > = (ZZ > )−1 ZA> ← arg min L(W |Z). (16.6)
W

Again, we emphasize that the update is only a “candidate” update. We need to further
check whether the Hessian is positive definite or not. The Hessian matrix is given by

∂ 2 L(W |Z)
= 2ZZ > ∈ RK×K .
∂W 2
Therefore, by analogous analysis, if Z has full rank with K < N , the Hessian matrix is
positive definite.

156
Matrix Decomposition and Applications

Lemma 16.3: (Rank of W after Updating)


Suppose A ∈ RM ×N has full rank with M ≤ N and Z ∈ RK×N has full rank with K < N ,
then the update of W > = (ZZ > )−1 ZA> in Equation (16.6) has full rank.

Proof [of Lemma 16.3] The proof is slightly different to that of Lemma 16.2. Since
Z ∈ RK×N and A> ∈ RN ×M have full rank, i.e., det(Z) > 0 and det(A> ) > 0. The
determinant of their product det(ZA> ) = det(Z) det(A> ) > 0 such that ZA> has full
rank (rank K). Similarly argument can find W > also has full rank.

Combine the observations in Lemma 16.2 and Lemma 16.3, as long as we initialize Z, W
to have full rank, the updates in Equation (16.3) and Equation (16.6) are reasonable. The
requirement on the M ≤ N is reasonable in that there are always more users than the
number of movies. We conclude the process in Algorithm 4.

Algorithm 4 Alternating Least Squares


Require: A ∈ RM ×N with M ≤ N ;
1: initialize W ∈ RM ×K , Z ∈ RK×N with full rank and K < M ≤ N ;
2: choose a stop criterion on the approximation error δ;
3: choose maximal number of iterations C;
4: iter = 0;
5: while ||A − W Z|| > δ and iter < C do
6: iter = iter + 1;
7: Z = (W > W )−1 W > A ← arg minZ L(Z|W );
8: W > = (ZZ > )−1 ZA> ← arg minW L(W |Z);
9: end while
10: Output W , Z;

16.2 Regularization: Extension to General Matrices


We can add a regularization to minimize the following loss:

L(W , Z) = ||W Z − A||2 + λw ||W ||2 + λz ||Z||2 , λw > 0, λz > 0, (16.7)

where the differential with respect to Z, W are given respectively by


∂L(W , Z)


 = 2W > (W Z − A) + 2λz Z ∈ RK×N ;
∂Z (16.8)
 ∂L(W , Z) = 2(W Z − A)Z > + 2λ W ∈ RM ×K .

w
∂W
The Hessian matrices are given respectively by
 2
∂ L(W , Z)
= 2W > W + 2λz I ∈ RK×K ;


∂Z 2

2
 ∂ L(W , Z) = 2ZZ > + 2λw I ∈ RK×K ,


∂W 2

157
Jun Lu

which are positive definite due to the perturbation by the regularization. To see this,


 x> (2W > W + 2λz I)x = |2x> W > 2
{z W x} +2λz ||x|| > 0, for nonzero x;

≥0



 x> (2ZZ > + 2λw I)x = |2x> ZZ > 2
{z x} +2λw ||x|| > 0, for nonzero x.

≥0

The regularization makes the Hessian matrices positive definite even if W , Z are rank
deficient. And now the matrix decomposition can be extended to any matrix even when
M > N . In rare cases, K can be chosen as K > max{M, N } such that a high-rank
approximation of A is obtained. However, in most scenarios, we want to find the low-rank
approximation of A such that K < min{M, N }. For example, the ALS can be utilized
to find the low-rank neural networks to reduce the memory of the neural networks whilst
increase the performance (Lu, 2021c).
Therefore, the minimizers are given by finding the roots of the differential:
Z = (W > W + λz I)−1 W > A;
(
(16.9)
W > = (ZZ > + λw I)−1 ZA> .
The regularization parameters λz , λ2 ∈ R are used to balance the trade-off between the
accuracy of the approximation and the smoothness of the computed solution. The choice
on the selection of the parameters is typically problem dependent and can be obtained by
cross-validation.

16.3 Missing Entries


Since the matrix decomposition via the ALS is extensively used in the Netflix recommender
data, where many entries are missing since many users have not watched some movies or
they will not rate the movies for some reasons. We can make an additional mask matrix
M ∈ RM ×N where Mmn ∈ {0, 1} means if the user n has rated the movie m or not.
Therefore, the loss function can be defined as
L(W , Z) = ||M A−M (W Z)||2 ,
where is the Hadamard product between matrices. For example, the Hadamard product
for a 3 × 3 matrix A with a 3 × 3 matrix B is
     
a11 a12 a13 b11 b12 b13 a11 b11 a12 b12 a13 b13
A B = a21 a22 a23  b21 b22 b23  = a21 b21 a22 b22 a23 b23  .
a31 a32 a33 b31 b32 b33 a31 b31 a32 b32 a33 b33
To find the solution of the problem, let’s decompose the updates in Equation (16.9) into:
zn = (W > W + λz I)−1 W > an ,
(
for n ∈ {1, 2, . . . , N };
(16.10)
wm = (ZZ > + λw I)−1 Zbm , for m ∈ {1, 2, . . . , M },
where Z = [z1 , z2 , . . . , zN ], A = [a1 , a2 , . . . , aN ] are the column partitions of Z, A respec-
tively. And W > = [w1 , w2 , . . . , wM ], A> = [b1 , b2 , . . . , bM ] are the column partitions of
W > , A> respectively. The factorization of the updates indicates the update can be done
via a column by column fashion.

158
Matrix Decomposition and Applications

Given W Let on ∈ RM denote the movies rated by user n where onm = 1 if user n
has rated movie m, and onm = 1 otherwise. Then the n-th column of A without missing
entries can be denoted as the matlab style notation an [on ]. And we want to approximate
the existing n-th column by an [on ] ≈ W [on , :]zn which is actually a rank-one least squares
problem:
 −1
zn = W [on , :]> W [on , :] + λz I W [on , :]> an [on ], for n ∈ {1, 2, . . . , N }. (16.11)

Moreover, the loss function with respect to zn :


X  2
>
L(zn |W ) = amn − wm zn
m∈on

and if we are concerned about the loss for all users:


N
X X  2
>
L(Z|W ) = amn − wm zn
n=1 m∈on

Given Z Similarly, if pm ∈ RN denotes the users that have rated the movie m with
pdn = 1 if the movie m has been rated by user n. Then the m-th row of A without missing
entries can be denoted as the matlab style notation bm [pm ]. And we want to approximate
the existing m-th row by bm [pm ] ≈ Z[:, pm ]> wm , 25 which again is a rank-one least squares
problem:

wm = (Z[:, pm ]Z[:, pm ]> + λw I)−1 Z[:, pm ]bm [pm ], for m ∈ {1, 2, . . . , M }. (16.12)

Moreover, the loss function with respect to wn :


X  2
>
L(wn |Z) = amn − wm zn
n∈pm

and if we are concerned about the loss for all users:


M X 
X 2
>
L(W |Z) = amn − wm zn
d=1 n∈pm

16.4 Vector Inner Product


We have seen the ALS is to find matrices W , Z such that W Z can approximate A ≈ W Z
in terms of minimum least squared loss:
N X
X M  2
>
min amn − wm zn ,
W ,Z
n=1 d=1

25. Note that Z[:, pm ]> is the transpose of Z[:, pm ], which is equal to Z > [pm , :], i.e., transposing first and
then selecting.

159
Jun Lu

that is, each entry amn in A can be approximated by the inner product between the two
vectors wm > z . The geometric definition of vector inner product is given by
n

>
wm zn = ||w|| · ||z|| cos θ,

where θ is the angle between w and z. So if the vector norms of w, z are determined, the
smaller the angle, the larger the inner product.
Come back to the Netflix data, where the rating are ranging from 0 to 5 and the larger
the better. If wm and zn fall “close” enough, then w> z will have a larger value. This
reveals the meaning behind the ALS where wm represents the features of movie m, whilst
zn contains the features of user n. And each element in wm and zn represents a same
feature. For example, it could be that the second feature wm2 26 represents if the movie
is an action movie or not, and zn2 denotes if the user n likes action movies or not. If it
happens the case, then wm > z will be large and approximates a
n mn well.
Note that, in the decomposition A ≈ W Z, we know the rows of W contain the hidden
features of the movies and the columns of Z contain the hidden features of the users.
However, we cannot identify what are the meanings of the rows of W or the columns of
Z. We know they could be something like categories or genres of the movies, that provide
some underlying connections between the users and the movies, but we cannot be sure what
exactly they are. This is where the terminology “hidden” comes from.

16.5 Gradient Descent


In Equation (16.10), we obtain the column-by-column update directly from the full matrix
way in Equation (16.9) (with regularization considered). Now let’s see what’s behind the
idea. Following from Equation (16.7), the loss under the regularization:

L(W , Z) = ||W Z − A||2 + λw ||W ||2 + λz ||Z||2 , λw > 0, λz > 0, (16.13)

Since we are now considering the minimization of above loss with respect to zn , we can
decompose the loss into

L(zn ) = ||W Z − A||2 + λw ||W ||2 + λz ||Z||2


X X
= ||W zn − an ||2 + λz ||zn ||2 + ||W zi − ai ||2 + λz ||zi ||2 + λw ||W ||2 ,
i6=n i6=n
| {z }
Czn
(16.14)
where Czn is a constant with respect to zn , and Z = [z1 , z2 , . . . , zN ], A = [a1 , a2 , . . . , aN ]
are the column partitions of Z, A respectively. Taking the differential
∂L(zn )
= 2W > W zn − 2W > an + 2λz zn ,
∂zn
under which the root is exactly the first update of the column fashion in Equation (16.10):

zn = (W > W + λz I)−1 W > an , for n ∈ {1, 2, . . . , N }.


26. wm2 is the second element of vector wm2

160
Matrix Decomposition and Applications

Similarly, we can decompose the loss with respect to wm ,


L(wm ) = ||W Z − A||2 + λw ||W ||2 + λz ||Z||2
= ||Z > W − A> ||2 + λw ||W > ||2 + λz ||Z||2
X X
= ||Z > wm − bn ||2 + λw ||wm ||2 + ||Z > wi − bi ||2 + λw ||wi ||2 + λz ||Z||2 ,
i6=m i6=m
| {z }
Cwm
(16.15)
where Cwm is a constant with respect to wm , and W>
= [w1 , w2 , . . . , wM ], A>
= [b1 , b2 ,
. . . , bM ] are the column partitions of W > , A> respectively. Analogously, taking the differ-
ential with respect to wm , it follows that
∂L(wm )
= 2ZZ > wm − 2Zbn + 2λw wm ,
∂wm
under which the root is exactly the second update of the column fashion in Equation (16.10):

wm = (ZZ > + λw I)−1 Zbm , for m ∈ {1, 2, . . . , M }.

Now suppose we write out the iteration as the superscript and we want to find the
(k+1) (k+1)
updates {zn , wm } base on {Z (k) , W (k) }:
 (k+1)
 zn
 ← arg min L(zn(k) );
 (k)
zn
(k+1) (k)
wm

 ← arg min L(wm ).
(k)
wm

(k+1) (k)
For simplicity, we will be looking at zn ← arg minz(k) L(zn |−), and the derivation for
n
(k+1) (k+1)
the update on wm will be the same. Suppose we want to approximate zn by a linear
(k)
update on zn :
zn(k+1) = zn(k) + ηv.
The problem now turns to the solution of v such that

v = arg min L(zn(k) + ηv).


v

(k)
By Taylor’s formula, L(zn + ηv) can be approximated by

L(zn(k) + ηv) ≈ L(zn(k) ) + ηv > ∇L(zn(k) ),

when η is small enough. Then an search under the condition ||v|| = 1 given positive η is as
follows: n o
v = arg min L(zn(k) + ηv) ≈ arg min L(zn(k) ) + ηv > ∇L(zn(k) ) .
||v||=1 ||v||=1

This is known as the greedy search. The optimal v can be obtained by


(k)
∇L(zn )
v=− (k)
,
||∇L(zn )||

161
Jun Lu

(k) (k+1)
i.e., v is in the opposite direction of ∇L(zn ). Therefore, the update of zn is reasonable
to be taken as
(k)
(k+1) (k) (k) ∇L(zn )
zn = zn + ηv = zn − η (k)
,
||∇L(zn )||
(k+1)
which usually called the gradient descent. Similarly, the gradient descent of wm is given
by
(k)
(k+1) (k) (k) ∇L(wm )
wm = wm + ηv = wm − η (k)
.
||∇L(wm )||
Geometrical Interpretation of Gradient Descent

Lemma 16.4: (Direction of Gradients)


An important fact is that gradients are orthogonal to level curves (a.k.a., level surface).

See (Lu, 2021c) for a proof.


The lemma above reveals the geometrical interpretation of gradient descent. For finding
a solution to minimize a convex function L(z), gradient descent goes to the negative gradient
direction that can decrease the loss. Figure 24 depicts a 2-dimensional case, where −∇L(z)
pushes the loss to decrease for the convex function L(z).

30
25
20
L(z)
L(z)

15
z2

10
5
0
1.01.5 4.0
2.02.5 2.53.03.5
z13.03.54.04.5 2.0
1.01.5 z 2
5.0 0.00.5 z1
(a) A 2-dimensional convex function L(z) (b) L(z) = c is a constant

Figure 24: Figure 24(a) shows a function “density” and a contour plot (blue=low, yel-
low=high) where the upper graph is the “density”, and the lower one is the projection of it
(i.e., contour). Figure 24(b): −∇L(z) pushes the loss to decrease for the convex function
L(z).

16.6 Regularization: A Geometrical Interpretation


We have seen in Section 16.2 that the regularization can extend the ALS to general matrices.
The gradient descent can reveal the geometric meaning of the regularization. To avoid
confusion, we denote the loss function without regularization by l(z) and the loss with
regularization by L(z) = l(z) + λz ||z||2 where l(z) : Rn → R. When minimizing l(z),

162
Matrix Decomposition and Applications

𝑙(𝑧) = 𝑐1
𝑙(𝑧) = 𝑐2

𝑧∗ 𝑧∗

𝑣2 -𝛻𝑙(𝑧2 )
𝑤 𝑣1
-𝛻𝑙(𝑧1 )
𝑧2
𝑧1
-𝛻𝑙(𝑧1 )
0 𝑤 𝑧1 0

𝑧 𝑇 𝑧=C −2𝜆𝑧1 𝑧 𝑇 𝑧=C

Figure 25: Constrained gradient descent with z > z ≤ C. The green vector w is the
projection of v1 into z > z ≤ C where v1 is the component of −∇l(z) perpendicular to z1 .
The right picture is the next step after the update in the left picture. z ? denotes the optimal
solution of {min l(z)}.

descent method will search in Rn for a solution. However, in machine learning, searching
in the whole space can cause overfitting. A partial solution is to search in a subset of the
vector space, e.g., searching in z > z < C for some constant C. That is

arg min l(z), s.t., z > z ≤ C.


z

As shown above, a trivial gradient descent method will go further in the direction of −∇l(z),
i.e., update z by z ← z −η∇l(z) for small step size η. When the level curve is l(z) = c1 and
the current position of z = z1 where z1 is the intersection of z > z = C and l(z) = c1 , the
descent direction −∇l(z1 ) will be perpendicular to the level curve of l(z1 ) = c1 as shown
in the left picture of Figure 25. However, if we further restrict the optimal value can only
be in z > z ≤ C, the trivial descent direction −∇l(z1 ) will lead z2 = z1 − η∇l(z1 ) outside
of z > z ≤ C. A solution is to decompose the step −∇l(z1 ) into

−∇l(z1 ) = az1 + v1 ,

where az1 is the component perpendicular to the curve of z > z = C, and v1 is the component
parallel to the curve of z > z = C. Keep only the step v1 , then the update
 

z2 = project(z1 + ηv1 ) = project z1 + η (−∇l(z1 ) − az1 ) 27


| {z }
v1

will lead to a smaller loss from l(z1 ) to l(z2 ) and still match z > z ≤ C. This is known as the
projection gradient descent. It is not hard to see that the update z2 = project(z1 + ηv1 ) is
equivalent to finding a vector w (shown by the green vector in the left picture of Figure 25)
27. where the project(x) will project the vector x to the closest point inside z > z ≤ C. Notice here the
direct update z2 = z1 + ηv1 can still make z2 outside the curve of z > z ≤ C.

163
Jun Lu

such that z2 = z1 + w is inside the curve of z > z ≤ C. Mathematically, the w can be


obtained by −∇l(z1 ) − 2λz1 for some λ as shown in the middle picture of Figure 25. This
is exactly the negative gradient of L(z) = l(z) + λ||z||2 such that

∇L(z) = ∇l(z) + 2λz,

and
w = −∇L(z) leads
−−−−−→to z2 = z1 + w = z1 − ∇L(z).
And in practice, a small step size η can avoid going outside the curve of z > z ≤ C:

z2 = z1 − η∇L(z),

which is exactly what we have discussed in Section 16.2, the regularization term.
Sparsity In rare cases, we want to find sparse solution z such that l(z) is minimized.
Constrained in ||z||1 ≤ C exists to this purpose where || · ||1 is the l1 norm of a vector or
a matrix. Similar to the previous case, the l1 constrained optimization pushes the gradient
descent towards the border of the level of ||z||1 = C. The situation in the 2-dimensional
case is shown in Figure 26. In a high-dimensional case, many elements in z will be pushed
into the breakpoint of ||z||1 = C as shown in the right picture of Figure 26.

𝑙(𝑧) = 𝑐1 𝑙(𝑧) = 𝑐2

𝑧∗ 𝑧∗
breakpoint -𝛻𝑙(𝑧2 )
-𝛻𝑙(𝑧1 ) 𝑧1 𝑧2
𝑣2
𝑣1

0 0

||𝑧||1 =C ||𝑧||1 =C

Figure 26: Constrained gradient descent with ||z||1 ≤ C, where the red dot denotes the
breakpoint in l1 norm. The right picture is the next step after the update in the left picture.
z ? denotes the optimal solution of {min l(z)}.

16.7 Stochastic Gradient Descent


Now suppose we come back to the per-example loss:
N X
X M  2
>
L(W , Z) = amn − wm zn + λw ||wm ||2 + λz ||zn ||2 .
n=1 m=1

>z 2

And when we iteratively decrease the per-example loss term l(wm , zn ) = amn − wm n
for all m ∈ [0, M ], n ∈ [1, N ], the full loss L(W , Z) can also be decreased. This is known as

164
Matrix Decomposition and Applications

the stochastic coordinate descent. The differentials with respect to wm , zn , and their roots
are given by

∂l(wm , zn ) >
∇l(zn ) = = 2wm wm zn + 2λw wm − 2amn wm


∂z

n



>
+ λz I)−1 wm ;

leads to zn = amn (wm wm


−−−−−→
∂l(wm , zn )
= 2zn zn> wm + 2λz zn − 2amn zn


∇l(w m ) =



 ∂w m
to wm = amn (zn zn> + λw I)−1 wn .

 leads
−−−−−→

or analogously, the update can be done by gradient descent, and since we update by per-
example loss, it is also known as the stochastic gradient descent

∇l(zn )

 zn = zn − ηz ||∇l(z )|| ;


n
 ∇l(w m)
wm = wm − ηw
 .
||∇l(wm )||

In practice, the update for each m, n in the algorithm can be randomly produced, that’s
where the name “stochastic” comes from.

16.8 Bias Term

   

~ ~
AM  N WM  K Z K N WM ( K  2) Z ( K  2) N
Figure 27: Bias terms in alternating least squares where the yellow entries denote ones
(which are fixed) and cyan entries denote the added features to fit the bias terms. The
dotted boxes give an example on how the bias terms work.

In ordinary least squares, a bias term is added to the raw matrix. A similar idea can be
applied to the ALS problem. We can add a fixed column with all 1’s to the last column of
W , thus an extra row should be added to the last row of Z to fit the features introduced
by the bias term in W . Analogously, a fixed row with all 1’s can be added to the first row
of Z, and an extra column in the first column of W to fit the features. The situation is
shown in Figure 27.

165
Jun Lu

Following
  from the loss with respect to the columns of Z in Equation (16.14), suppose
1
zen = is the n-th column of Z,
e we have
zn

L(zn ) = ||W
fZe − A||2 + λw ||W
f ||2 + λz ||Z||
e 2
  2
f 1 − an + λz ||e
X X
= W zn ||2 + f zei − ai ||2 + λz
||W zi ||2 + λw ||W
||e f ||2
zn | {z }
i6=n i6=n
=λz ||zn ||2 +λz
2
  2
 1
+ λz ||zn ||2 + Czn = W zn − (an − w0 ) + λz ||zn ||2 + Czn ,

= w0 W − an
zn | {z }
an
(16.16)
where w0 is the first column of W f and Czn is a constant with respect to zn . Let an =
an − w0 , the update of zn is just like the one in Equation (16.14) where the differential is
given by:
∂L(zn ) > >
= 2W W zn − 2W an + 2λz zn .
∂zn
Therefore the update on zn is given by the root of the above differential:
> >

−1
zn = (W W + λz I) W an , for n ∈ {1, 2, . . . , N };

update on zen =  
1
z
 n
 = .
zn
e

Similarly,
 follow
 from the loss with respect to each row of W in Equation (16.15), suppose
wm f > ), we have
w
em = is the m-th row of Wf (or m-th column of W
1

e>W
L(wm ) = ||Z f − A> ||2 + λw ||W
f > ||2 + λz ||Z||
e 2
X X
e>w
= ||Z em − bm ||2 + λw ||w
em ||2 + e>w
||Z ei − bi ||2 + λw ei ||2 + λz ||Z||
||w e 2
| {z }
i6=m i6=m
=λw ||wm ||2 +λw
h i w  2
> m
= Z z0 − bm + λw ||wm ||2 + Cwm
1
> 2
= Z wm − (bm − z 0 ) + λw ||wm ||2 + Cwm ,
 (16.17)

e > and Z > is the left columns of it: Z
where z 0 is the last column of Z e > = wm , Cwm is
1
> >
a constant with respect to wm , and W = [w1 , w2 , . . . , wM ], A = [b1 , b2 , . . . , bM ] are the
column partitions of W > , A> respectively. Let bm = bm − z 0 , the update of wm is again
just like the on in Equation (16.15) where the differential is given by:

∂L(wm d) >
= 2Z · Z wm − 2Z · bm + 2λw wm .
∂wm

166
Matrix Decomposition and Applications

Therefore the update on wm is given by the root of the above differential


\text{update on } \widetilde{w}_m = \begin{cases} w_m = (Z Z^\top + \lambda_w I)^{-1} Z \overline{b}_m, & \text{for } m \in \{1, 2, \ldots, M\}; \\[4pt] \widetilde{w}_m = \begin{bmatrix} w_m \\ 1 \end{bmatrix}. \end{cases}
Similar updates by gradient descent under the bias terms, or with the treatment of missing entries, can be deduced, and we shall not repeat the details (see Sections 16.5 and 16.3 for reference).
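As a rough numpy sketch of the bias-aware update of \widetilde{Z} (the helper name is hypothetical; A is assumed fully observed, and only the convention above with a leading row of ones in \widetilde{Z} and the column w_0 of \widetilde{W} fitting it is handled):

```python
import numpy as np

def update_Z_tilde(A, W_tilde, Z_tilde, lam_z=0.1):
    """One full pass over the columns of Z~, with the first row of Z~ fixed to ones
    and the first column w_0 of W~ fitting that bias row (as in Figure 27)."""
    M, N = A.shape
    w0, W = W_tilde[:, 0], W_tilde[:, 1:]             # W~ = [w_0, W]
    K = W.shape[1]
    G = W.T @ W + lam_z * np.eye(K)                   # shared Gram matrix for all columns
    for n in range(N):
        a_bar = A[:, n] - w0                          # a_bar_n = a_n - w_0
        z_n = np.linalg.solve(G, W.T @ a_bar)         # (W^T W + lam_z I)^{-1} W^T a_bar_n
        Z_tilde[:, n] = np.concatenate(([1.0], z_n))  # z~_n = [1; z_n]
    return Z_tilde
```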

17. Nonnegative Matrix Factorization (NMF)


Following from the matrix factorization via the ALS, we now consider algorithms for solving
the nonnegative matrix factorization (NMF) problem:
• Given a nonnegative matrix A ∈ RM ×N , find nonnegative matrix factors W ∈ RM ×K
and Z ∈ RK×N such that:
A ≈ W Z.
To measure the approximation, the loss is still the squared Frobenius norm of the difference between the two matrices:

L(W , Z) = ||W Z − A||2 .

17.1 NMF via Multiplicative Update


Following from Section 16.1, given W ∈ RM ×K , we want to update Z ∈ RK×N , the gradient
with respect to Z is given by Equation (16.2):
∂L(Z|W )
= 2W > (W Z − A) ∈ RK×N .
∂Z
Applying the gradient descent idea in Section 16.5, the trivial update on Z can be done by
 
(\text{GD on } Z) \qquad Z \leftarrow Z - \eta \frac{\partial L(Z|W)}{\partial Z} = Z - \eta\left(2W^\top W Z - 2W^\top A\right),
where η is a small positive step size. Now if we suppose a different step size for each entry
of Z and incorporate the constant 2 into the step size, the update can be obtained by
 
(\text{GD}' \text{ on } Z) \qquad Z_{kn} \leftarrow Z_{kn} - \frac{\eta_{kn}}{2} \frac{\partial L(Z|W)}{\partial Z_{kn}} = Z_{kn} - \eta_{kn}\left(W^\top W Z - W^\top A\right)_{kn}, \qquad k \in [1, K], n \in [1, N],

where Zkn is the (k, n)-th entry of Z. Now if we rescale the step size:
\eta_{kn} = \frac{Z_{kn}}{(W^\top W Z)_{kn}},

then we obtain the update rule:


(\text{Multiplicative update on } Z) \qquad Z_{kn} \leftarrow Z_{kn} \frac{(W^\top A)_{kn}}{(W^\top W Z)_{kn}}, \qquad k \in [1, K], n \in [1, N],


which is known as the multiplicative update; it was first developed in (Lee and Seung, 2001) and further discussed in (Pauca et al., 2006). Analogously, the multiplicative update on W can be obtained by

(\text{Multiplicative update on } W) \qquad W_{mk} \leftarrow W_{mk} \frac{(A Z^\top)_{mk}}{(W Z Z^\top)_{mk}}, \qquad m \in [1, M], k \in [1, K].

Theorem 17.1: (Convergence of Multiplicative Update)


The loss L(W , Z) = ||W Z −A||2 is non-increasing under the multiplicative update rules:

\begin{cases} Z_{kn} \leftarrow Z_{kn} \dfrac{(W^\top A)_{kn}}{(W^\top W Z)_{kn}}, & k \in [1, K], n \in [1, N]; \\[8pt] W_{mk} \leftarrow W_{mk} \dfrac{(A Z^\top)_{mk}}{(W Z Z^\top)_{mk}}, & m \in [1, M], k \in [1, K]. \end{cases}

We refer the reader to (Lee and Seung, 2001) for the proof of the above theorem. Clearly, the approximations W and Z remain nonnegative during the updates. It is generally best to update W and Z “simultaneously”, instead of updating each matrix fully before the other; in this case, after updating a row of Z, we update the corresponding column of W. In the implementation, a small positive quantity ε, say the square root of the machine precision, should be added to the denominators in the updates of W and Z at each iteration step; a small value such as ε = 10^{-9} can do the job. The full procedure is shown in Algorithm 5.

Algorithm 5 NMF via Multiplicative Updates

Require: A ∈ R^{M×N};
1: initialize W ∈ R^{M×K}, Z ∈ R^{K×N} randomly with nonnegative entries;
2: choose a stopping criterion on the approximation error δ;
3: choose a maximal number of iterations C;
4: iter = 0;
5: while ||A − W Z||^2 > δ and iter < C do
6:     iter = iter + 1;
7:     for k = 1 to K do
8:         for n = 1 to N do                              ▷ update the k-th row of Z
9:             Z_{kn} ← Z_{kn} (W^⊤A)_{kn} / ((W^⊤W Z)_{kn} + ε);
10:        end for
11:        for m = 1 to M do                              ▷ update the k-th column of W
12:            W_{mk} ← W_{mk} (A Z^⊤)_{mk} / ((W Z Z^⊤)_{mk} + ε);
13:        end for
14:    end for
15: end while
16: Output W, Z;
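For readers who prefer runnable code, the following numpy sketch implements Algorithm 5 in vectorized form (updating the whole of Z and then the whole of W per iteration rather than interleaving row/column updates as the pseudocode does); the function and parameter names are illustrative only.

```python
import numpy as np

def nmf_multiplicative(A, K, max_iter=500, delta=1e-6, eps=1e-9, seed=0):
    """Vectorized multiplicative updates for A ~= W Z with nonnegative factors."""
    rng = np.random.default_rng(seed)
    M, N = A.shape
    W = rng.random((M, K))
    Z = rng.random((K, N))
    for _ in range(max_iter):
        Z *= (W.T @ A) / (W.T @ W @ Z + eps)        # multiplicative update on Z
        W *= (A @ Z.T) / (W @ Z @ Z.T + eps)        # multiplicative update on W
        if np.linalg.norm(A - W @ Z) ** 2 <= delta:
            break
    return W, Z

# Example usage on a small random nonnegative matrix:
# A = np.abs(np.random.rand(20, 12)); W, Z = nmf_multiplicative(A, K=4)
```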


17.2 Regularization
Similar to the ALS with regularization in Section 16.2, recall that regularization helps extend the ALS to general matrices. We can also add a regularization term in the context of NMF:

L(W , Z) = ||W Z − A||2 + λw ||W ||2 + λz ||Z||2 , λw > 0, λz > 0,

where the matrix norm is still the Frobenius norm. The gradient with respect to Z given W is the same as that in Equation (16.8):

\frac{\partial L(Z|W)}{\partial Z} = 2W^\top (W Z - A) + 2\lambda_z Z \in \mathbb{R}^{K\times N}.
The trivial gradient descent update can be obtained by
 
(\text{GD on } Z) \qquad Z \leftarrow Z - \eta \frac{\partial L(Z|W)}{\partial Z} = Z - \eta\left(2W^\top W Z - 2W^\top A + 2\lambda_z Z\right).

Analogously, if we suppose a different step size for each entry of Z and incorporate the
constant 2 into the step size, the update can be obtained by
 
(\text{GD}' \text{ on } Z) \qquad Z_{kn} \leftarrow Z_{kn} - \frac{\eta_{kn}}{2} \frac{\partial L(Z|W)}{\partial Z_{kn}} = Z_{kn} - \eta_{kn}\left(W^\top W Z - W^\top A + \lambda_z Z\right)_{kn}, \qquad k \in [1, K], n \in [1, N].

Now if we rescale the step size:


\eta_{kn} = \frac{Z_{kn}}{(W^\top W Z)_{kn}},

then we obtain the update rule:

(\text{Multiplicative update on } Z) \qquad Z_{kn} \leftarrow Z_{kn} \frac{(W^\top A)_{kn} - \lambda_z Z_{kn}}{(W^\top W Z)_{kn}}, \qquad k \in [1, K], n \in [1, N].
Similarly, the multiplicative update on W can be obtained by

(\text{Multiplicative update on } W) \qquad W_{mk} \leftarrow W_{mk} \frac{(A Z^\top)_{mk} - \lambda_w W_{mk}}{(W Z Z^\top)_{mk}}, \qquad m \in [1, M], k \in [1, K].
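A hedged numpy sketch of these regularized updates is given below; note that clipping the numerators at a small positive value is an assumption on my part (the updates above can otherwise turn negative for large λ), not something prescribed by the derivation.

```python
import numpy as np

def nmf_regularized_step(A, W, Z, lam_w=0.01, lam_z=0.01, eps=1e-9):
    """One pass of the regularized multiplicative updates; numerators are clipped
    so the factors stay nonnegative when the subtracted regularization dominates."""
    Z *= np.maximum((W.T @ A) - lam_z * Z, eps) / (W.T @ W @ Z + eps)
    W *= np.maximum((A @ Z.T) - lam_w * W, eps) / (W @ Z @ Z.T + eps)
    return W, Z
```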

17.3 Initialization
In the above discussion, we initialize W and Z randomly. However, there are also alternative strategies designed to obtain better initial estimates, in the hope of converging more rapidly to a good solution (Boutsidis and Gallopoulos, 2008; Gillis, 2014). We sketch the methods as follows:
• Clustering techniques. Run a clustering method on the columns of A, take the cluster means of the top K clusters as the columns of W, and initialize Z as a proper scaling of the cluster indicator matrix (that is, Z_{kn} ≠ 0 indicates that a_n belongs to the k-th cluster);


• Subset selection. Pick K columns of A and set those as the initial columns for W ,
and analogously, K rows of A are selected to form the rows of Z;
• SVD-based. Suppose the SVD of A is A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top, where each factor \sigma_i u_i v_i^\top is a rank-one matrix with possibly negative values in u_i, v_i, and nonnegative \sigma_i. Denoting [x]_+ = \max(x, 0) elementwise, we notice

  u_i v_i^\top = [u_i]_+ [v_i]_+^\top + [-u_i]_+ [-v_i]_+^\top - [-u_i]_+ [v_i]_+^\top - [u_i]_+ [-v_i]_+^\top.

  Either [u_i]_+ [v_i]_+^\top or [-u_i]_+ [-v_i]_+^\top can be selected as a column and a row in W, Z (a small sketch of this initialization is given after the list).
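A rough numpy sketch of this SVD-based initialization (in the spirit of Boutsidis and Gallopoulos (2008), though the scaling by \sqrt{\sigma_i} below is my own simplification; it assumes K ≤ min(M, N)):

```python
import numpy as np

def svd_init(A, K):
    """Nonnegative initialization of W, Z from the leading K singular triplets."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    M, N = A.shape
    W, Z = np.zeros((M, K)), np.zeros((K, N))
    pos = lambda x: np.maximum(x, 0.0)              # [x]_+ = max(x, 0), elementwise
    for i in range(K):
        u, v = U[:, i], Vt[i]
        up, vp = pos(u), pos(v)                     # [u]_+, [v]_+
        un, vn = pos(-u), pos(-v)                   # [-u]_+, [-v]_+
        # keep whichever of [u]_+[v]_+^T or [-u]_+[-v]_+^T carries more mass
        if np.linalg.norm(up) * np.linalg.norm(vp) >= np.linalg.norm(un) * np.linalg.norm(vn):
            W[:, i], Z[i] = np.sqrt(S[i]) * up, np.sqrt(S[i]) * vp
        else:
            W[:, i], Z[i] = np.sqrt(S[i]) * un, np.sqrt(S[i]) * vn
    return W, Z
```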

18. Biconjugate Decomposition


18.1 Existence of the Biconjugate Decomposition
The biconjugate decomposition was proposed in (Chu et al., 1995) and discussed in (Yang,
2000). The existence of the biconjugate decomposition relies on the rank-one reduction
theorem shown below. And a variety of matrix decomposition methods can be unified via
this biconjugate decomposition.

Theorem 18.1: (Rank-One Reduction)


Let A ∈ R^{m×n} be an m × n matrix with rank r, and let x ∈ R^n, y ∈ R^m be a pair of vectors such that w = y^\top A x ≠ 0. Then the matrix B = A - w^{-1} A x y^\top A has rank r - 1, exactly one less than the rank of A.

Proof [of Theorem 18.1] It suffices to show that the dimension of N(B) is exactly one larger than that of N(A), which implies that B has rank exactly one less than the rank of A.
For any vector n ∈ N(A), i.e., An = 0, we have Bn = An - w^{-1} A x y^\top A n = 0, which means N(A) ⊆ N(B).
Now, for any vector m ∈ N(B), we have Bm = Am - w^{-1} A x y^\top A m = 0. Let k = w^{-1} y^\top A m, which is a scalar; then A(m - kx) = 0, i.e., m - kx ∈ N(A) for every m ∈ N(B). Note that Ax ≠ 0 from the definition of w, so x ∉ N(A), while x ∈ N(B) since Bx = Ax - w^{-1} A x (y^\top A x) = Ax - Ax = 0. Thus the null space of B is obtained from the null space of A by adding x to its basis, which increases the dimension of the space by 1. Hence the dimension of N(A) is smaller than the dimension of N(B) by exactly 1, which completes the proof.

Suppose the matrix A ∈ R^{m×n} has rank r. We can define a rank-reducing process to generate a sequence of Wedderburn matrices {A_k}:

A_1 = A, \qquad \text{and} \qquad A_{k+1} = A_k - w_k^{-1} A_k x_k y_k^\top A_k,

where x_k ∈ R^n and y_k ∈ R^m are any vectors satisfying w_k = y_k^\top A_k x_k ≠ 0. The sequence
will terminate in r steps since the rank of Ak decreases by exactly one at each step. Write


out the sequence:


A_1 = A,
A_1 - A_2 = w_1^{-1} A_1 x_1 y_1^\top A_1,
A_2 - A_3 = w_2^{-1} A_2 x_2 y_2^\top A_2,
A_3 - A_4 = w_3^{-1} A_3 x_3 y_3^\top A_3,
\vdots
A_{r-1} - A_r = w_{r-1}^{-1} A_{r-1} x_{r-1} y_{r-1}^\top A_{r-1},
A_r - 0 = w_r^{-1} A_r x_r y_r^\top A_r.

By adding up the sequence, we get

(A_1 - A_2) + (A_2 - A_3) + \ldots + (A_{r-1} - A_r) + (A_r - 0) = A = \sum_{i=1}^{r} w_i^{-1} A_i x_i y_i^\top A_i.

Theorem 18.2: (Biconjugate Decomposition: Form 1)


This equality from the rank-reducing process implies the following matrix decomposition

A = \Phi \Omega^{-1} \Psi^\top,

where \Omega = diag(w_1, w_2, \ldots, w_r), \Phi = [\phi_1, \phi_2, \ldots, \phi_r] ∈ R^{m×r}, and \Psi = [\psi_1, \psi_2, \ldots, \psi_r] ∈ R^{n×r} with

\phi_k = A_k x_k, \qquad \text{and} \qquad \psi_k = A_k^\top y_k.

Obviously, different choices of xk ’s and yk ’s will result in different factorizations. So this


factorization is rather general and we will show its connection to some well-known decom-
position methods.
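The rank-reducing process is easy to experiment with numerically. The numpy sketch below (an illustration only, not code from the cited references; random x_k, y_k are redrawn until w_k ≠ 0) runs the Wedderburn sequence and checks Form 1.

```python
import numpy as np

def wedderburn_form1(A, rng=np.random.default_rng(0)):
    """Run the rank-reducing process with random x_k, y_k and return (Phi, Omega, Psi)
    such that A = Phi Omega^{-1} Psi^T (Form 1)."""
    Ak = A.astype(float).copy()
    r = np.linalg.matrix_rank(A)
    Phi, Psi, w = [], [], []
    for _ in range(r):
        while True:                                   # redraw until w_k = y_k^T A_k x_k != 0
            x = rng.standard_normal(A.shape[1])
            y = rng.standard_normal(A.shape[0])
            wk = y @ Ak @ x
            if abs(wk) > 1e-10:
                break
        Phi.append(Ak @ x)                            # phi_k = A_k x_k
        Psi.append(Ak.T @ y)                          # psi_k = A_k^T y_k
        w.append(wk)
        Ak = Ak - np.outer(Ak @ x, y @ Ak) / wk       # A_{k+1} = A_k - w_k^{-1} A_k x_k y_k^T A_k
    return np.column_stack(Phi), np.diag(w), np.column_stack(Psi)

A = np.random.default_rng(1).standard_normal((5, 4))
Phi, Omega, Psi = wedderburn_form1(A)
print(np.allclose(A, Phi @ np.linalg.inv(Omega) @ Psi.T))   # True
```

In exact arithmetic each step removes exactly one dimension from the column space, so the loop runs precisely rank(A) times.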

Remark 18.3
For the vectors xk , yk in the Wedderburn sequence, we have the following property

xk ∈ N (Ak+1 )⊥C(A>
k+1 ),
yk ∈ N (A>
k+1 )⊥C(Ak+1 ).

Lemma 18.4: (General Term Formula of Wedderburn Sequence: V1)


For each matrix A_{k+1} = A_k - w_k^{-1} A_k x_k y_k^\top A_k in the Wedderburn sequence, A_{k+1} can be written as

A_{k+1} = A - \sum_{i=1}^{k} w_i^{-1} A u_i v_i^\top A,

where

u_k = x_k - \sum_{i=1}^{k-1} \frac{v_i^\top A x_k}{w_i} u_i, \qquad \text{and} \qquad v_k = y_k - \sum_{i=1}^{k-1} \frac{y_k^\top A u_i}{w_i} v_i.

The proof of this lemma is deferred to Section 18.4. We notice that w_i = y_i^\top A_i x_i in the general term formula still depends on A_i, so this is not yet a genuine general term formula; we will express the w_i's in terms of A rather than A_i later. From the general term formula of the Wedderburn sequence, we have

A_{k+1} = A - \sum_{i=1}^{k} w_i^{-1} A u_i v_i^\top A, \qquad A_k = A - \sum_{i=1}^{k-1} w_i^{-1} A u_i v_i^\top A.

Thus, A_{k+1} - A_k = -w_k^{-1} A u_k v_k^\top A. Since we defined the sequence by A_{k+1} = A_k - w_k^{-1} A_k x_k y_k^\top A_k, we then find w_k^{-1} A u_k v_k^\top A = w_k^{-1} A_k x_k y_k^\top A_k. It is trivial to see

A u_k = A_k x_k, \qquad v_k^\top A = y_k^\top A_k. \qquad (18.1)

Let z_{k,i} = \frac{v_i^\top A x_k}{w_i}, which is a scalar. From the definition of u_k and v_k in the above lemma, we then have:
• u1 = x1 ;
• u2 = x2 − z2,1 u1 ;
• u3 = x3 − z3,1 u1 − z3,2 u2 ;
• . . ..
This process is similar to the Gram-Schmidt process, except that we do not project x_2 onto x_1 so as to minimize the distance; the component of x_2 along x_1 is instead determined by z_{2,1}. The process is shown in Figure 28. In Figure 28(a), u_2 is not perpendicular to u_1, but u_2 does not lie on the same line as u_1, so u_1, u_2 can still span a 2-dimensional subspace. Similarly, in Figure 28(b), u_3 = x_3 - z_{3,1}u_1 - z_{3,2}u_2 does not lie in the space spanned by u_1, u_2, so u_1, u_2, u_3 can still span a 3-dimensional subspace.
A moment of reflection reveals that the span of x_1, x_2 is the same as the span of u_1, u_2, and similarly for the v_i's. We have the following property:

\begin{cases} \text{span}\{x_1, x_2, \ldots, x_j\} = \text{span}\{u_1, u_2, \ldots, u_j\}; \\ \text{span}\{y_1, y_2, \ldots, y_j\} = \text{span}\{v_1, v_2, \ldots, v_j\}. \end{cases} \qquad (18.2)


(a) “Project” onto a line (b) “Project” onto a space

Figure 28: “Project” a vector onto a line and onto a space.

Further, from the rank-reducing property in the Wedderburn sequence, we have


\begin{cases} C(A_1) \supset C(A_2) \supset C(A_3) \supset \ldots; \\ N(A_1^\top) \subset N(A_2^\top) \subset N(A_3^\top) \subset \ldots. \end{cases}

Since y_k ∈ N(A_{k+1}^\top), it then follows that y_j ∈ N(A_{k+1}^\top) for all j < k + 1, i.e., A_{k+1}^\top y_j = 0 for all j < k + 1. In particular, x_{k+1}^\top A_{k+1}^\top y_j = 0 for all j < k + 1. From Equation (18.1), we also have u_{k+1}^\top A^\top y_j = 0 for all j < k + 1. Following from Equation (18.2), we obtain

v_j^\top A u_{k+1} = 0 \quad \text{for all } j < k + 1.

Similarly, we can prove

v_{k+1}^\top A u_j = 0 \quad \text{for all } j < k + 1.
Moreover, we defined w_k = y_k^\top A_k x_k. By Equation (18.1), we can write w_k as

w_k = y_k^\top A_k x_k = v_k^\top A x_k = v_k^\top A \Big( u_k + \sum_{i=1}^{k-1} \frac{v_i^\top A x_k}{w_i} u_i \Big) \qquad \text{(by the definition of } u_k \text{ in Lemma 18.4)}
    = v_k^\top A u_k, \qquad \text{(by } v_k^\top A u_j = 0 \text{ for all } j < k\text{)}

which can be used to substitute the w_k in Lemma 18.4. We then have the full version of the general term formula of the Wedderburn sequence, in which the formula no longer depends on the A_k's (through the w_k's):

u_k = x_k - \sum_{i=1}^{k-1} \frac{v_i^\top A x_k}{v_i^\top A u_i} u_i, \qquad \text{and} \qquad v_k = y_k - \sum_{i=1}^{k-1} \frac{y_k^\top A u_i}{v_i^\top A u_i} v_i. \qquad (18.3)

Gram-Schmidt Process from the Wedderburn Sequence: Suppose X = [x_1, x_2, \ldots, x_r] ∈ R^{n×r} and Y = [y_1, y_2, \ldots, y_r] ∈ R^{m×r} effect a rank-reducing process for A. If A is the identity matrix and X = Y contains the vectors for which an orthogonal basis is desired, then U (= V) gives the resulting orthogonal basis.
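A quick numpy check of this observation (a sketch assuming A = I and X = Y with linearly independent columns; the function name is illustrative):

```python
import numpy as np

def wedderburn_gram_schmidt(X):
    """Apply the u_k recursion of Equation (18.3) with A = I and x_k = y_k = X[:, k]:
    the recursion then collapses to classical (unnormalized) Gram-Schmidt."""
    n, r = X.shape
    U = np.zeros((n, r))
    for k in range(r):
        u = X[:, k].copy()
        for i in range(k):
            # with A = I and v_i = u_i: coefficient = (u_i^T x_k) / (u_i^T u_i)
            u -= (U[:, i] @ X[:, k]) / (U[:, i] @ U[:, i]) * U[:, i]
        U[:, k] = u
    return U

X = np.random.default_rng(2).standard_normal((6, 4))
U = wedderburn_gram_schmidt(X)
print(np.allclose(U.T @ U, np.diag(np.diag(U.T @ U))))   # True: columns are mutually orthogonal
```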
This form of u_k and v_k in Equation (18.3) is very close to the projection onto the perpendicular space in the Gram-Schmidt process of Equation (3.1). We therefore define \langle x, y\rangle := y^\top A x to explicitly mimic the form of the projection in Equation (3.1). We formulate the results so far in the following lemma, which gives a clearer view of what we have been working towards; we will use these results extensively in the sequel:

Lemma 18.5: (Properties of Wedderburn Sequence)


For each matrix A_{k+1} = A_k - w_k^{-1} A_k x_k y_k^\top A_k in the Wedderburn sequence, A_{k+1} can be written as

A_{k+1} = A - \sum_{i=1}^{k} w_i^{-1} A u_i v_i^\top A,

where

u_k = x_k - \sum_{i=1}^{k-1} \frac{\langle x_k, v_i\rangle}{\langle u_i, v_i\rangle} u_i, \qquad \text{and} \qquad v_k = y_k - \sum_{i=1}^{k-1} \frac{\langle u_i, y_k\rangle}{\langle u_i, v_i\rangle} v_i. \qquad (18.4)

Further, we have the following properties:

A u_k = A_k x_k, \qquad v_k^\top A = y_k^\top A_k; \qquad (18.5)

\langle u_k, v_j\rangle = \langle u_j, v_k\rangle = 0 \quad \text{for all } j < k; \qquad (18.6)

w_k = y_k^\top A_k x_k = \langle u_k, v_k\rangle. \qquad (18.7)

By substituting Equation (18.5) into Form 1 of biconjugate decomposition, and using


Equation (18.7) which implies wk = vk> Auk , we have the Form 2 and Form 3 of this
decomposition:

Theorem 18.6: (Biconjugate Decomposition: Form 2 and Form 3)


The equality from the rank-reducing process implies the following matrix decomposition

A = A U_r \Omega_r^{-1} V_r^\top A,

where \Omega_r = diag(w_1, w_2, \ldots, w_r), U_r = [u_1, u_2, \ldots, u_r] ∈ R^{n×r} and V_r = [v_1, v_2, \ldots, v_r] ∈ R^{m×r} with

u_k = x_k - \sum_{i=1}^{k-1} \frac{\langle x_k, v_i\rangle}{\langle u_i, v_i\rangle} u_i, \qquad \text{and} \qquad v_k = y_k - \sum_{i=1}^{k-1} \frac{\langle u_i, y_k\rangle}{\langle u_i, v_i\rangle} v_i. \qquad (18.8)

And also the following decomposition

V_\gamma^\top A U_\gamma = \Omega_\gamma, \qquad (18.9)

where \Omega_\gamma = diag(w_1, w_2, \ldots, w_\gamma), U_\gamma = [u_1, u_2, \ldots, u_\gamma] ∈ R^{n×\gamma} and V_\gamma = [v_1, v_2, \ldots, v_\gamma] ∈ R^{m×\gamma}. Note the difference between the subscripts r and γ used here, with γ ≤ r.

We notice that these two forms of the biconjugate decomposition are independent of the Wedderburn matrices {A_k}.
A word on the notation: we will use the subscript to indicate the dimension of a matrix, avoiding confusion in the sequel, e.g., the r, γ in the above theorem.

18.2 Properties of the Biconjugate Decomposition

Corollary 18.7: (Connection of Uγ and Xγ )


If (X_\gamma, Y_\gamma) ∈ R^{n×\gamma} × R^{m×\gamma} effects a rank-reducing process for A, then there are unique unit upper triangular matrices R_\gamma^{(x)} ∈ R^{\gamma×\gamma} and R_\gamma^{(y)} ∈ R^{\gamma×\gamma} such that

X_\gamma = U_\gamma R_\gamma^{(x)}, \qquad \text{and} \qquad Y_\gamma = V_\gamma R_\gamma^{(y)},

where Uγ and Vγ are matrices with columns resulting from the Wedderburn sequence as
in Equation (18.9).

Proof [of Corollary 18.7] The proof is trivial from the definition of u_k and v_k in Equation (18.4) or Equation (18.8) by setting the j-th columns of R_\gamma^{(x)} and R_\gamma^{(y)} as

\Big[ \frac{\langle x_j, v_1\rangle}{\langle u_1, v_1\rangle}, \frac{\langle x_j, v_2\rangle}{\langle u_2, v_2\rangle}, \ldots, \frac{\langle x_j, v_{j-1}\rangle}{\langle u_{j-1}, v_{j-1}\rangle}, 1, 0, 0, \ldots, 0 \Big]^\top,

and

\Big[ \frac{\langle u_1, y_j\rangle}{\langle u_1, v_1\rangle}, \frac{\langle u_2, y_j\rangle}{\langle u_2, v_2\rangle}, \ldots, \frac{\langle u_{j-1}, y_j\rangle}{\langle u_{j-1}, v_{j-1}\rangle}, 1, 0, 0, \ldots, 0 \Big]^\top,

respectively. This completes the proof.

The pair (U_\gamma, V_\gamma) ∈ R^{n×\gamma} × R^{m×\gamma} in Theorem 18.6 is called a biconjugate pair with respect to A if \Omega_\gamma is nonsingular and diagonal. And let (X_\gamma, Y_\gamma) ∈ R^{n×\gamma} × R^{m×\gamma} effect a rank-reducing process for A; then (X_\gamma, Y_\gamma) is said to be biconjugatable and biconjugated into a biconjugate pair of matrices (U_\gamma, V_\gamma) if there exist unit upper triangular matrices R_\gamma^{(x)}, R_\gamma^{(y)} such that X_\gamma = U_\gamma R_\gamma^{(x)} and Y_\gamma = V_\gamma R_\gamma^{(y)}.

18.3 Connection to Well-Known Decomposition Methods


18.3.1 LDU Decomposition


Theorem 18.8: (LDU, Chu et al. (1995) Theorem 2.4)


Let (X_\gamma, Y_\gamma) ∈ R^{n×\gamma} × R^{m×\gamma} and A ∈ R^{m×n} with γ ∈ {1, 2, \ldots, r}. Then (X_\gamma, Y_\gamma) can be biconjugated if and only if Y_\gamma^\top A X_\gamma has an LDU decomposition.

Proof [of Theorem 18.8] Suppose X_\gamma and Y_\gamma are biconjugatable; then there exist unit upper triangular matrices R_\gamma^{(x)} and R_\gamma^{(y)} such that X_\gamma = U_\gamma R_\gamma^{(x)}, Y_\gamma = V_\gamma R_\gamma^{(y)}, and V_\gamma^\top A U_\gamma = \Omega_\gamma is a nonsingular diagonal matrix. It then follows that

Y_\gamma^\top A X_\gamma = R_\gamma^{(y)\top} V_\gamma^\top A U_\gamma R_\gamma^{(x)} = R_\gamma^{(y)\top} \Omega_\gamma R_\gamma^{(x)}

is the unique unit triangular LDU decomposition of Y_\gamma^\top A X_\gamma. The form above can be seen as the fourth form of the biconjugate decomposition, thus we put the proof into a gray box.
Conversely, suppose Y_\gamma^\top A X_\gamma = R_2^\top D R_1 is an LDU decomposition with both R_1 and R_2 being unit upper triangular matrices. Since R_1^{-1} and R_2^{-1} are also unit upper triangular, (X_\gamma, Y_\gamma) biconjugates into (X_\gamma R_1^{-1}, Y_\gamma R_2^{-1}).

Corollary 18.9: (Determinant)


Suppose (Xγ , Yγ ) ∈ Rn×γ × Rm×γ are biconjugatable. Then
\det(Y_\gamma^\top A X_\gamma) = \prod_{i=1}^{\gamma} w_i.

Proof [of Corollary 18.9] By Theorem 18.8, since (X_\gamma, Y_\gamma) are biconjugatable, there are unit upper triangular matrices R_\gamma^{(x)} and R_\gamma^{(y)} such that Y_\gamma^\top A X_\gamma = R_\gamma^{(y)\top} \Omega_\gamma R_\gamma^{(x)}. The determinant is then just the product of the diagonal entries of \Omega_\gamma, since the unit triangular factors have determinant 1.

Lemma 18.10: (Biconjugatable in Principal Minors)


Let r = rank(A) ≥ γ with A ∈ R^{m×n}. In the Wedderburn sequence, take x_i as the i-th standard basis vector in R^n for i ∈ {1, 2, \ldots, γ} (i.e., x_i = e_i ∈ R^n) and y_i as the i-th standard basis vector in R^m for i ∈ {1, 2, \ldots, γ} (i.e., y_i = e_i ∈ R^m). That is, Y_\gamma^\top A X_\gamma is the leading principal submatrix of A, i.e., Y_\gamma^\top A X_\gamma = A_{1:\gamma, 1:\gamma}. Then (X_\gamma, Y_\gamma) is biconjugatable if and only if the γ-th leading principal minor of A is nonzero. In this case, the γ-th leading principal minor of A is given by \prod_{i=1}^{\gamma} w_i.

Proof [of Lemma 18.10] The proof is trivial: the γ-th leading principal minor of A being nonzero implies that w_i ≠ 0 for all i ≤ γ, so the Wedderburn sequence can be successfully carried out. The converse holds since Corollary 18.9 implies det(Y_\gamma^\top A X_\gamma) is nonzero.

We thus finally come to the LDU decomposition for square matrices.


Theorem 18.11: (LDU: Biconjugate Decomposition for Square Matrices)


For any matrix A ∈ R^{n×n}, (I_n, I_n) is biconjugatable if and only if all the leading principal minors of A are nonzero. In this case, A can be factored as

A = V_n^{-\top} \Omega_n U_n^{-1} = LDU,

where \Omega_n = D is a diagonal matrix with nonzero values on the diagonal, V_n^{-\top} = L is a unit lower triangular matrix, and U_n^{-1} = U is a unit upper triangular matrix.

Proof [of Theorem 18.11] From Lemma 18.10, it is trivial that (I_n, I_n) is biconjugatable. From Corollary 18.7, we have U_n R_n^{(x)} = I_n and V_n R_n^{(y)} = I_n; thus R_n^{(x)} = U_n^{-1} and R_n^{(y)} = V_n^{-1} are well defined, and we complete the proof.
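As a small numerical illustration (a sketch assuming all leading principal minors are nonzero; not code from Chu et al. (1995)), applying the Wedderburn sequence with x_k = y_k = e_k reproduces the LDU factors:

```python
import numpy as np

def ldu_via_wedderburn(A):
    """LDU of a square A with nonzero leading principal minors, obtained from the
    Wedderburn sequence with x_k = y_k = e_k (so w_k is the current pivot)."""
    Ak = A.astype(float)
    n = A.shape[0]
    L, U, d = np.zeros((n, n)), np.zeros((n, n)), np.zeros(n)
    for k in range(n):
        w = Ak[k, k]                                  # w_k = e_k^T A_k e_k, nonzero by assumption
        d[k] = w
        L[:, k] = Ak[:, k] / w                        # k-th column of L (unit diagonal)
        U[k, :] = Ak[k, :] / w                        # k-th row of U (unit diagonal)
        Ak = Ak - np.outer(Ak[:, k], Ak[k, :]) / w    # rank-one reduction = elimination step
    return L, np.diag(d), U

A = np.array([[4., 3., 2.], [6., 3., 1.], [2., 1., 5.]])
L, D, U = ldu_via_wedderburn(A)
print(np.allclose(A, L @ D @ U))    # True
```

For a symmetric positive definite A, symmetry forces L = U^\top, and D^{1/2} U is then the Cholesky factor of the next subsection.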

18.3.2 Cholesky Decomposition


For symmetric positive definite matrices, the leading principal minors are guaranteed to be positive; the proof is provided in Section 2.2.

Theorem 18.12: (Cholesky: Biconjugate Decomposition for PD Matrices)


For any symmetric and positive definite matrix A ∈ Rn×n , the Cholesky decomposition
of A can be obtained from the Wedderburn sequence applied to (In , In ) as (Xn , Yn ). In
this case, A can be factored as

A = U_n^{-\top} \Omega_n U_n^{-1} = (U_n^{-\top} \Omega_n^{1/2})(\Omega_n^{1/2} U_n^{-1}) = R^\top R,

where \Omega_n is a diagonal matrix with positive values on the diagonal, and U_n^{-1} is a unit upper triangular matrix.

Proof [of Theorem 18.12] Since the leading principal minors of positive definite matrices are positive, w_i > 0 for all i ∈ {1, 2, \ldots, n}. It can be easily verified via the LDU from the biconjugate decomposition and the symmetry of A that A = U_n^{-\top} \Omega_n U_n^{-1}. And since the w_i's are positive, \Omega_n is positive definite and can be factored as \Omega_n = \Omega_n^{1/2} \Omega_n^{1/2}, which implies \Omega_n^{1/2} U_n^{-1} is the Cholesky factor R.

18.3.3 QR Decomposition
Without loss of generality, we shall assume that A ∈ R^{n×n} has full column rank, so that A can be factored as A = QR with Q, R ∈ R^{n×n}.

Theorem 18.13: (QR: Biconjugate Decomposition for Nonsingular Matrices)


For any nonsingular matrix A ∈ R^{n×n}, the QR decomposition of A can be obtained from the Wedderburn sequence applied to (I_n, A) as (X_n, Y_n). In this case, A can be factored as

A = QR,

where Q = V_n \Omega_n^{-1/2} is an orthogonal matrix and R = \Omega_n^{1/2} R_n^{(x)} is an upper triangular matrix, using Form 4 in Theorem 18.8 with γ = n:

Y_n^\top A X_n = R_n^{(y)\top} V_n^\top A U_n R_n^{(x)} = R_n^{(y)\top} \Omega_n R_n^{(x)},

where we may set γ = n since γ can be any value with γ ≤ r and the rank is r = n.


Proof [of Theorem 18.13] Since (X_n, Y_n) = (I_n, A), by Theorem 18.8 we have the decomposition

Y_n^\top A X_n = R_n^{(y)\top} V_n^\top A U_n R_n^{(x)} = R_n^{(y)\top} \Omega_n R_n^{(x)}.

Substituting (I_n, A) into the above decomposition, we have

A^\top A = R_n^{(y)\top} \Omega_n R_n^{(x)}
         = R_1^\top \Omega_n R_1                                  (A^\top A is symmetric; let R_1 = R_n^{(x)} = R_n^{(y)})   (18.10)
         = (R_1^\top \Omega_n^{1/2\top})(\Omega_n^{1/2} R_1)
         = R^\top R.                                              (let R = \Omega_n^{1/2} R_1)

To see why \Omega_n can be factored as \Omega_n = \Omega_n^{1/2\top} \Omega_n^{1/2}, suppose A = [a_1, a_2, \ldots, a_n]. We obtain w_i = a_i^\top a_i > 0 since A is nonsingular. Thus \Omega_n = diag(w_1, w_2, \ldots, w_n) is positive definite and can be factored as

\Omega_n = \Omega_n^{1/2} \Omega_n^{1/2} = \Omega_n^{1/2\top} \Omega_n^{1/2}.   (18.11)

By X_\gamma = U_\gamma R_\gamma^{(x)} in Theorem 18.8 for all γ ∈ {1, 2, \ldots, n}, we have

X_n = U_n R_1 \quad \longrightarrow \quad I_n = U_n R_1 \quad \text{(since } X_n = I_n\text{)} \quad \longrightarrow \quad U_n = R_1^{-1}.

By Y_\gamma = V_\gamma R_\gamma^{(y)} in Theorem 18.8 for all γ ∈ {1, 2, \ldots, n}, we have Y_n = V_n R_1, i.e., A = V_n R_1 (since A = Y_n). Therefore,

A^\top A = R_1^\top V_n^\top V_n R_1,
R_1^\top \Omega_n R_1 = R_1^\top V_n^\top V_n R_1,                                      (from Equation (18.10))
(R_1^\top \Omega_n^{1/2\top})(\Omega_n^{1/2} R_1) = (R_1^\top \Omega_n^{1/2\top} \Omega_n^{-1/2\top}) V_n^\top V_n (\Omega_n^{-1/2} \Omega_n^{1/2} R_1),   (from Equation (18.11))
R^\top R = R^\top (\Omega_n^{-1/2\top} V_n^\top)(V_n \Omega_n^{-1/2}) R.                 (18.12)

Thus, Q = V_n \Omega_n^{-1/2} is an orthogonal matrix.


18.3.4 SVD
To differentiate the notation, let A = U^{svd} \Sigma^{svd} V^{svd\top} be the SVD of A, where U^{svd} = [u_1^{svd}, u_2^{svd}, \ldots, u_n^{svd}], V^{svd} = [v_1^{svd}, v_2^{svd}, \ldots, v_n^{svd}], and \Sigma^{svd} = diag(\sigma_1, \sigma_2, \ldots, \sigma_n). Without loss of generality, we assume A ∈ R^{n×n} and rank(A) = n; readers can prove the equivalence for A ∈ R^{m×n}.
Let X_n = V^{svd}, Y_n = U^{svd} effect a rank-reducing process for A. From the definition of u_k and v_k in Equation (18.4) or Equation (18.8), we have

u_k = v_k^{svd}, \qquad v_k = u_k^{svd}, \qquad \text{and} \qquad w_k = y_k^\top A x_k = \sigma_k.

That is, V_n = U^{svd}, U_n = V^{svd}, and \Omega_n = \Sigma^{svd}, where we set γ = n since γ can be any value with γ ≤ r and the rank is r = n.
By X_n = U_n R_n^{(x)} in Theorem 18.8, we have

X_n = U_n R_n^{(x)} \quad \overset{\text{leads to}}{\longrightarrow} \quad V^{svd} = V^{svd} R_n^{(x)} \quad \overset{\text{leads to}}{\longrightarrow} \quad I_n = R_n^{(x)}.

By Y_n = V_n R_n^{(y)} in Theorem 18.8, we have

Y_n = V_n R_n^{(y)} \quad \overset{\text{leads to}}{\longrightarrow} \quad U^{svd} = U^{svd} R_n^{(y)} \quad \overset{\text{leads to}}{\longrightarrow} \quad I_n = R_n^{(y)}.

Again, from Theorem 18.8 with γ = n, we have

Y_n^\top A X_n = R_n^{(y)\top} V_n^\top A U_n R_n^{(x)} = R_n^{(y)\top} \Omega_n R_n^{(x)}.
That is,

U^{svd\top} A V^{svd} = \Sigma^{svd},

which is exactly the form of the SVD, and we have proved the equivalence of the SVD and the biconjugate decomposition when the Wedderburn sequence is applied to (V^{svd}, U^{svd}) as (X_n, Y_n).

18.4 Proof of the General Term Formula of the Wedderburn Sequence


We define the Wedderburn sequence of A by Ak+1 = Ak − wk−1 Ak xk yk> Ak and A1 = A.
The proof of the general term formula of this sequence is then:
Proof [of Lemma 18.4] For A_2, we have:

A_2 = A_1 - w_1^{-1} A_1 x_1 y_1^\top A_1 = A - w_1^{-1} A u_1 v_1^\top A, \qquad \text{where } u_1 = x_1, \; v_1 = y_1.

For A_3, we can write out the equation:

A_3 = A_2 - w_2^{-1} A_2 x_2 y_2^\top A_2
    = (A - w_1^{-1} A u_1 v_1^\top A) - w_2^{-1} (A - w_1^{-1} A u_1 v_1^\top A) x_2 y_2^\top (A - w_1^{-1} A u_1 v_1^\top A)    (substitute A_2)
    = (A - w_1^{-1} A u_1 v_1^\top A) - w_2^{-1} A (x_2 - w_1^{-1} u_1 v_1^\top A x_2)(y_2^\top - w_1^{-1} y_2^\top A u_1 v_1^\top) A    (take out A)
    = A - w_1^{-1} A u_1 v_1^\top A - w_2^{-1} A u_2 v_2^\top A
    = A - \sum_{i=1}^{2} w_i^{-1} A u_i v_i^\top A,

where u_2 = x_2 - w_1^{-1} u_1 v_1^\top A x_2 = x_2 - \frac{v_1^\top A x_2}{w_1} u_1 and v_2 = y_2 - w_1^{-1} y_2^\top A u_1 v_1 = y_2 - \frac{y_2^\top A u_1}{w_1} v_1.
Similarly, we can find the expression of A_4 in terms of A:

A_4 = A_3 - w_3^{-1} A_3 x_3 y_3^\top A_3
    = A - \sum_{i=1}^{2} w_i^{-1} A u_i v_i^\top A - w_3^{-1} \Big(A - \sum_{i=1}^{2} w_i^{-1} A u_i v_i^\top A\Big) x_3 y_3^\top \Big(A - \sum_{i=1}^{2} w_i^{-1} A u_i v_i^\top A\Big)    (substitute A_3)
    = A - \sum_{i=1}^{2} w_i^{-1} A u_i v_i^\top A - w_3^{-1} A \Big(x_3 - \sum_{i=1}^{2} w_i^{-1} u_i v_i^\top A x_3\Big)\Big(y_3^\top - \sum_{i=1}^{2} w_i^{-1} y_3^\top A u_i v_i^\top\Big) A    (take out A)
    = A - \sum_{i=1}^{2} w_i^{-1} A u_i v_i^\top A - w_3^{-1} A u_3 v_3^\top A
    = A - \sum_{i=1}^{3} w_i^{-1} A u_i v_i^\top A,

where u_3 = x_3 - \sum_{i=1}^{2} \frac{v_i^\top A x_3}{w_i} u_i and v_3 = y_3 - \sum_{i=1}^{2} \frac{y_3^\top A u_i}{w_i} v_i.
Continuing this process, we can define

u_k = x_k - \sum_{i=1}^{k-1} \frac{v_i^\top A x_k}{w_i} u_i, \qquad \text{and} \qquad v_k = y_k - \sum_{i=1}^{k-1} \frac{y_k^\top A u_i}{w_i} v_i,

and obtain the general term formula of the Wedderburn sequence.
and find the general term of Wedderburn sequence.

19. Acknowledgments
We thank Gilbert Strang for raising the question formulated in Corollary 6.2, checking the
writing of the survey, for a stream of ideas and references about the three factorizations
from the steps of elimination, and for the generous sharing of the manuscript of (Strang
and Drucker, 2021).


References
Sudipto Banerjee and Anindya Roy. Linear algebra and matrix analysis for statistics, volume
181. Crc Press Boca Raton, FL, USA:, 2014.

Amir Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.

James Bennett, Stan Lanning, et al. The netflix prize. In Proceedings of KDD cup and
workshop, volume 2007, page 35. New York, NY, USA., 2007.

Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

Christos Boutsidis and Efstratios Gallopoulos. SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4):1350–1362, 2008.

Stephen Boyd and Lieven Vandenberghe. Introduction to applied linear algebra: vectors,
matrices, and least squares. Cambridge university press, 2018.

Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cam-
bridge university press, 2004.

Tony F Chan. An improved algorithm for computing the singular value decomposition.
ACM Transactions on Mathematical Software (TOMS), 8(1):72–83, 1982.

Tony F Chan. Rank revealing QR factorizations. Linear algebra and its applications, 88:
67–82, 1987.

Moody T Chu, Robert E Funderlic, and Gene H Golub. A rank–one reduction formula and
its applications to matrix factorizations. SIAM review, 37(4):512–530, 1995.

Pierre Comon, Xavier Luciani, and André LF De Almeida. Tensor decompositions, al-
ternating least squares and other tales. Journal of Chemometrics: A Journal of the
Chemometrics Society, 23(7-8):393–405, 2009.

Froilán M Dopico, Charles R Johnson, and Juan M Molera. Multiple LU factorizations of


a singular matrix. Linear algebra and its applications, 419(1):24–36, 2006.

Ricardo D Fierro and Per Christian Hansen. Low-rank revealing UTV decompositions.
Numerical Algorithms, 15(1):37–55, 1997.

Leslie V Foster. Solving rank-deficient and ill-posed problems using UTV and QR factor-
izations. SIAM journal on matrix analysis and applications, 25(2):582–600, 2003.

Jean Gallier and Jocelyn Quaintance. Fundamentals of Linear Algebra and Optimization.
Department of Computer and Information Science, University of Pennsylvania, 2017.

James E Gentle. Numerical linear algebra for applications in statistics. Springer Science &
Business Media, 1998.

James E Gentle. Matrix algebra. Springer Texts in Statistics. Springer, New York, NY, 2007.


Paris V Giampouras, Athanasios A Rontogiannis, and Konstantinos D Koutroumbas. Alter-


nating iteratively reweighted least squares minimization for low-rank matrix factorization.
IEEE Transactions on Signal Processing, 67(2):490–503, 2018.

George T Gilbert. Positive definite matrices and Sylvester’s criterion. The American Math-
ematical Monthly, 98(1):44–46, 1991.

Philip E Gill, Walter Murray, and Margaret H Wright. Numerical linear algebra and opti-
mization. SIAM, 2021.

Nicolas Gillis. The why and how of nonnegative matrix factorization. Connections, 12:2–2,
2014.

Israel Gohberg and Seymour Goldberg. A simple proof of the jordan decomposition theorem
for matrices. The American Mathematical Monthly, 103(2):157–159, 1996.

Donald Goldfarb. Factorized variable metric methods for unconstrained optimization. Math-
ematics of Computation, 30(136):796–811, 1976.

Gene Golub and William Kahan. Calculating the singular values and pseudo-inverse of a
matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numer-
ical Analysis, 2(2):205–224, 1965.

Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU press, 2013.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

AW Hales and IBS Passi. Jordan decomposition. In Algebra, pages 75–87. Springer, 1999.

Nicholas J Higham. Accuracy and stability of numerical algorithms. SIAM, 2002.

Nicholas J Higham. Cholesky factorization. Wiley Interdisciplinary Reviews: Computational


Statistics, 1(2):251–254, 2009.

Harold Hotelling. Analysis of a complex of statistical variables into principal components.


Journal of educational psychology, 24(6):417, 1933.

Alston S Householder. Principles of numerical analysis. Courier Corporation, 2006.

Tsung-Min Hwang, Wen-Wei Lin, and Eugene K Yang. Rank revealing LU factorizations.
Linear algebra and its applications, 175:115–141, 1992.

Camille Jordan. Traité des substitutions et des équations algébriques. Gauthier-Villars,


1870.

N Kishore Kumar and Jan Schneider. Literature survey on low rank approximation of
matrices. Linear and Multilinear Algebra, 65(11):2212–2244, 2017.

Martin Koeber and Uwe Schäfer. The unique square root of a positive semidefinite matrix.
International Journal of Mathematical Education in Science and Technology, 37(8):990–
992, 2006.


Charles L Lawson and Richard J Hanson. Solving least squares problems. SIAM, 1995.

Daniel D Lee and Hyunjune Sebastian Seung. Algorithms for non-negative matrix factor-
ization. In 14th Annual Neural Information Processing Systems Conference, NIPS 2000.
Neural information processing systems foundation, 2001.

Jun Lu. A survey on Bayesian inference for Gaussian mixture model. arXiv preprint
arXiv:2108.11753, 2021a.

Jun Lu. On the column and row ranks of a matrix. arXiv preprint arXiv:2112.06638, 2021b.

Jun Lu. Numerical matrix decomposition and its modern applications: A rigorous first
course. arXiv preprint arXiv:2107.02579, 2021c.

Jun Lu. Revisit the fundamental theorem of linear algebra. arXiv preprint
arXiv:2108.04432, 2021d.

Jun Lu. A rigorous introduction for linear models. arXiv preprint arXiv:2105.04240, 2021e.

Per-Gunnar Martinsson. Randomized methods for matrix computations. The Mathematics


of Data, 25(4):187–231, 2019.

L Miranian and Ming Gu. Strong rank revealing LU factorizations. Linear algebra and its
applications, 367:1–16, 2003.

C-T Pan. On the existence and computation of rank-revealing LU factorizations. Linear


Algebra and its Applications, 316(1-3):199–222, 2000.

V Paul Pauca, Jon Piper, and Robert J Plemmons. Nonnegative matrix factorization for
spectral data analysis. Linear algebra and its applications, 416(1):29–47, 2006.

Karl Pearson. Liii. on lines and planes of closest fit to systems of points in space. The
London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11):
559–572, 1901.

Alfio Quarteroni, Riccardo Sacco, and Fausto Saleri. Numerical mathematics, volume 37.
Springer Science & Business Media, 2010.

Wil HA Schilders. Solution of indefinite linear systems using an LQ decomposition for the
linear constraints. Linear algebra and its applications, 431(3-4):381–395, 2009.

Matthias Seeger. Low rank updates for the Cholesky decomposition. Technical report, 2004.

Jonathon Shlens. A tutorial on principal component analysis. arXiv preprint


arXiv:1404.1100, 2014.

Gilbert W Stewart. Matrix Algorithms: Volume 1: Basic Decompositions. SIAM, 1998.

GW Stewart. The decompositional approach to matrix computation. Computing in Science


& Engineering, 2(1):50–59, 2000.


Gilbert Strang. Introduction to linear algebra. Wellesley-Cambridge Press Wellesley, 4th


edition, 2009.

Gilbert Strang. Linear algebra and learning from data. Wellesley-Cambridge Press Cam-
bridge, 2019.

Gilbert Strang. Linear algebra for everyone. Wellesley-Cambridge Press Wellesley, 2021.

Gilbert Strang and Daniel Drucker. Three matrix factorizations from the steps of elimina-
tion. 2021.

Gilbert Strang and Cleve Moler. LU and CR elimination. 2021.

Kuduvally Swamy. On Sylvester’s criterion for positive-semidefinite matrices. IEEE Trans-


actions on Automatic Control, 18(3):306–306, 1973.

Gábor Takács and Domonkos Tikk. Alternating least squares for personalized ranking. In
Proceedings of the sixth ACM conference on Recommender systems, pages 83–90, 2012.

Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam, 1997.

Robert van de Geijn and Margaret Myers. Advanced linear algebra: Foundations to fron-
tiers. Creative Commons NonCommercial (CC BY-NC), 2020.

Ming Yang. Matrix decomposition. Northwestern University, Class Notes, 2000.
