
Fundamentals of Linear Algebra for Signal Processing

©
James P. Reilly
Professor Emeritus
Department of Electrical and Computer Engineering
McMaster University

DRAFT

Not for public distribution

September 22, 2022


Contents

1 Fundamental Concepts  1
  1.0.1 Notation  1
  1.1 Fundamental Linear Algebra  2
    1.1.1 Linear Vector Spaces  3
    1.1.2 Linear Independence  7
    1.1.3 The Orthogonal Complement Subspace  10
    1.1.4 Rank  11
    1.1.5 Null Space of A  12
  1.2 The Four Fundamental Subspaces of a Matrix  13
  1.3 Further Interpretations of Matrix Multiplication  14
    1.3.1 Bigger–Block Interpretations of Matrix Multiplication  14
  1.4 Vector Norms  18
  1.5 Determinants  19
  1.6 Problems  24

2 Eigenvalues and Eigenvectors  29
  2.1 Eigenvalues and Eigenvectors  29
    2.1.1 Orthonormal Matrices  35
    2.1.2 The Eigendecomposition (ED) of a Square Symmetric Matrix  37
    2.1.3 Conventional Notation on Eigenvalue Indexing  38
    2.1.4 The Eigendecomposition in Relation to the Fundamental Matrix Subspaces  39
  2.2 An Alternate Interpretation of Eigenvectors  42
  2.3 Covariance and Covariance Matrices  44
    2.3.1 Covariance Matrices  48
  2.4 Covariance Matrices of Stationary Time Series  50
  2.5 Examples of Eigen–Analysis with Covariance Matrices  54
    2.5.1 Array Processing  54
  2.6 Principal Component Analysis (PCA)  59
    2.6.1 Examples of PCA Analysis  64
    2.6.2 PCA vs. Wavelet Analysis  76
  2.7 Matrix Norms  76
  2.8 Differentiation of a Quadratic Form  79
  2.9 Problems  80

3 The Singular Value Decomposition (SVD)  83
  3.1 Development of the SVD  83
    3.1.1 Relationship between SVD and ED  86
    3.1.2 Partitioning the SVD  87
    3.1.3 Properties and Interpretations of the SVD  87
  3.2 Orthogonal Projections  96
    3.2.1 The Orthogonal Complement Projector  101
    3.2.2 Orthogonal Projections and the SVD  101
  3.3 Alternative Proof of the SVD  103
  3.4 Problems  105

4 The Quadratic Form  107
  4.1 The Quadratic Form and Positive Definiteness  107
  4.2 The Locus of Points {z | z^T Λ z = 1}  109
  4.3 The Gaussian Multi-Variate Probability Density Function  111
  4.4 The Rayleigh Quotient  115
  4.5 Methods for Computing a Single Eigen–Pair  115
    4.5.1 The Rayleigh Quotient Method  115
    4.5.2 The Power Method  116
  4.6 Alternate Differentiation of the Quadratic Form  119
  4.7 Problems  121

5 Gaussian Elimination and Associated Numerical Issues  123
  5.1 Floating Point Arithmetic Systems  123
    5.1.1 Catastrophic Cancellation  127
  5.2 Gaussian Elimination  129
    5.2.1 The LU Decomposition  131
    5.2.2 Gauss Transforms  132
    5.2.3 Recovery of the LU factors from Gaussian Elimination  134
  5.3 Numerical Properties of Gaussian Elimination  140
    5.3.1 Pivoting  143
    5.3.2 The Cholesky Decomposition [1]  144
    5.3.3 Application of the Cholesky Decomposition  147
  5.4 The Sensitivity of Linear Systems  149
  5.5 The Interlacing Theorem and Condition Numbers [1]  155
  5.6 Iterative Solutions  157
  5.7 Alternate Derivation of the Condition Number  157
  5.8 Condition Number and Power Spectral Density  159
  5.9 Problems  162

6 The QR Decomposition  165
  6.1 Classical Gram-Schmidt  167
    6.1.1 Modified G-S Method for QR Decomposition  169
  6.2 Householder Transformations  171
    6.2.1 Description of the Householder Algorithm  171
    6.2.2 Example of Householder Elimination  174
    6.2.3 Selective Elimination  175
    6.2.4 Householder Numerical Properties  176
  6.3 The QR Method for Computing the Eigendecomposition  177
    6.3.1 Enhancements to the QR Method  178
  6.4 Givens Rotations  182
  6.5 "Fast" Givens Method for QR Decomposition  185
    6.5.1 Flop Counts  189
  6.6 Problems  190

7 Linear Least Squares Estimation  193
  7.1 Examples of least squares analysis  194
    7.1.1 Example 1: An Equalizer in a Communications System  194
    7.1.2 Example 2: Autoregressive Modelling [2]  196
    7.1.3 Example 3: Hurricane prediction using machine learning  198
  7.2 The Least-Squares Solution  199
    7.2.1 Interpretation of the Normal Equations  204
  7.3 Properties of the LS Estimate  206
    7.3.1 xLS is an unbiased estimate of xo, the true value  207
    7.3.2 Covariance Matrix of xLS  208
    7.3.3 Variance of a Predicted Value of b  209
    7.3.4 xLS is a BLUE (aka The Gauss–Markov Theorem)  211
  7.4 Least Squares Estimation from a Probabilistic Approach  213
    7.4.1 Least Squares Estimation and the Cramer–Rao Lower Bound  217
    7.4.2 Least-Squares Estimation and the CRLB for Gaussian Coloured Noise  218
    7.4.3 Maximum–Likelihood Property  220
  7.5 Other Constructs Related to Whitening  221
    7.5.1 Mahalanobis distance  221
    7.5.2 Generalized Eigenvalues and Eigenvectors  222
  7.6 Solving Least Squares Using the QR Decomposition  225
  7.7 A Short Section on Adaptive Filters  227
  7.8 Problems  228

8 The Rank Deficient Least Squares Problem  231
  8.1 The Pseudo–Inverse  231
  8.2 Interpretation of the Pseudo-Inverse  235
    8.2.1 Geometrical Interpretation  235
    8.2.2 Relationship of the Pseudo-Inverse Solution to the Normal Equations  236
    8.2.3 The Pseudo–Inverse as a Generalized Linear System Solver  236
  8.3 Principal Component Analysis (PCA)  237
  8.4 The Rank–Deficient QR Method  241
    8.4.1 Computation of the Rank-Deficient QR Decomposition  241
  8.5 The Rank-Deficient LS Problem with QR  244
  8.6 Problems  246

9 Model Building Using Latent Variable Methods  249
  9.1 Design of X  252
  9.2 Principal Component Analysis Revisited  254
  9.3 Partial Least Squares (PLS) and Canonical Correlation Analysis (CCA)  256
    9.3.1 Prediction for the PLS and CCA cases  261
    9.3.2 Simulation Example  262
  9.4 Problems  267

10 Regularization  269
  10.1 Ridge Regression  270
  10.2 Regularization using a smoothness penalty  273
  10.3 Sparsity regularization  275
  10.4 Problems  281

11 Toeplitz Systems  283
  11.1 Toeplitz Systems [3]  283
    11.1.1 Autoregressive Processes  284
    11.1.2 The Levinson-Durbin Recursion (LDR)  291
    11.1.3 Further Analysis on Toeplitz Systems  295
    11.1.4 The Burg Recursion  297
    11.1.5 Lattice Filters  301
    11.1.6 Application of AR Analysis to Speech Coding  303
    11.1.7 Toeplitz Factorizations  303
Preface
This book is intended for graduate students who require a background in
linear algebra, or for practitioners who are entering a new field which requires
some familiarity with this topic. The book will give the reader familiarity
with the basic linear algebra toolkit that is required in many disciplines
of modern engineering and science relating to signal processing, including
machine learning, control theory, process control, applied
statistics, robotics, etc. Above all, this is a teaching text, where the emphasis
is placed on understanding and interpretation of the material.

Pre-requisites: We assume the reader has an equivalent background to


a freshman course in linear algebra, so that they are familiar with the basic
concepts of matrices and vectors, and arithmetic operations on these quan-
tities, as well as transposes, inverses and other elementary knowledge. Some
knowledge of probability and statistics, and a basic knowledge of the Fourier
transform is helpful in certain sections. Also, familiarity with a high–level
programming language such as matlab, python or R is assumed.

In the first chapter, some fundamental ideas required for the remaining por-
tion of the book are established. First, we look at some fundamental ideas
of linear algebra such as linear independence, subspaces, rank, nullspace,
range, etc., and how these concepts are interrelated. A review of matrix
multiplication from a more advanced perspective is introduced.

In chapter 2, the most basic matrix decomposition, the so–called eigende-


composition, is presented. The focus of the presentation is to give an in-
terpretive insight into what this decomposition accomplishes. We illustrate
how the eigendecomposition can be applied through principal component
analysis (PCA) aka the Karhunen-Loeve transform. In this way, the reader
is made familiar with the important properties of this decomposition. The
ideas of autocorrelation, and the covariance matrix of a signal, are discussed
and interpreted.

In chapter 3, we develop the singular value decomposition (SVD), which is


closely related to the eigendecomposition of a matrix. We develop the rela-
tionships between these two decompositions and explore various properties
of the SVD. Orthogonal projectors are also discussed.

Chapter 4 deals with the quadratic form and its relation to the eigende-
composition. The multi–variate Gaussian probability function is discussed,
and the concept of joint confidence regions is presented. In Chapter 5, a
brief introduction to numerical issues encountered when dealing with floating
point number systems is presented. Then Gaussian elimination is discussed
in some length. The Gaussian elimination process is described through a
bigger–block matrix approach, that leads to other useful decompositions,
such as the Cholesky decomposition of a square symmetric positive definite
matrix. The condition number of a matrix, which is a critical part in de-
termining a lower bound on the relative error in the solution of a system of
linear equations, is also developed.

The QR decomposition is developed in Chapter 6, along with the QR algo-


rithm for computing the eigendecomposition.

Chapters 7 – 10 deal with solving least–squares problems. The standard least


squares problem, its solution and its properties are developed in Chapter 7.
In Chapter 8 we discuss a variety of methods for addressing the poorly
conditioned least–squares problem. Chapter 9 deals with least squares from
the perspective of model building, where the objective is to estimate a set
of output values of a system in response to a new set of input values, using
latent variable methods.

Chapter 10 deals with regularization, which is a means of introducing prior


information into a least–squares problem and improving the robustness of
the LS solution.

Chapter 1

Fundamental Concepts

The purpose of this chapter is to review important fundamental concepts


in linear algebra, as a foundation for the remaining portion of this text.
We first discuss the fundamental building blocks, such as an overview of
matrix multiplication from a “big block” perspective, linear independence,
subspaces and related ideas, rank, etc., upon which the rigor of linear algebra
rests. We then discuss vector norms, and various interpretations of the
matrix multiplication operation. We close the chapter with a discussion on
determinants.

1.0.1 Notation

Throughout this book we indicate that a matrix A is of dimension m × n,


and whose elements are taken from the set of real numbers, by the notation
A ∈ Rm×n . This means that the matrix A belongs to the Cartesian product
of the real numbers, taken m × n times, one for each element of A. In a
similar way, the notation A ∈ Cm×n means the matrix is of dimension m×n,
and the elements are taken from the set of complex numbers. By the matrix
dimension “m × n”, we mean A consists of m rows and n columns. Matrices
with m > n (more rows than columns) are referred to as tall matrices,
whereas matrices where m < n are referred to as short. Matrices where
m = n are called square. A square matrix which has non–zero elements only
on its main diagonal is referred to as a diagonal matrix.

Similarly, the notation a ∈ Rm (Cm) implies a vector of m elements which
are taken from the set of real (complex) numbers; i.e., a belongs to the
corresponding Cartesian product taken m times. When referring to a single vector, we use the term
dimension or length to denote the number of elements.

Also, we indicate that a scalar a is from the set of real (complex) numbers
by the notation a ∈ R(C). Thus, an upper case bold character denotes
a matrix, a lower case bold character denotes a vector, and a lower case
non-bold character denotes a scalar.

By convention, a vector by default is taken to be a column vector. Further,


for a matrix A, we denote its ith column as ai . We also imply that its
jth row is aTj , even though this notation may be ambiguous, since it may
also be taken to mean the transpose of the jth column. The context of
the discussion will help to resolve the ambiguity. In the hand–written case,
vectors and matrices are denoted by underlining the respective character.

1.1 Fundamental Linear Algebra

In this section, we introduce fundamental concepts such as vector spaces,


linear independence, bases, rank and related ideas. These concepts form the
foundation for most of what follows in this book. First we introduce a few
preliminaries:

The Matrix Transpose: Consider an m × n matrix A. Its transpose,


which is n × m and is denoted AT , is formed by converting each column
of A into the corresponding row of AT . The transpose operation can also
apply to vectors.

The Identity Matrix: We define a set of elementary vectors ei , i =


1, . . . , m as a vector of m elements having all zeros, except for a 1 in the

ith position. The m × m identity matrix I is defined as I = [e1, e2, . . . , em].
As may be seen, it is a diagonal matrix consisting of 1's along the main
diagonal. This matrix is called the identity because of its property Ix = x
for any x. The inverse of I is I itself.

The Matrix Inverse: The inverse A^{-1} of a matrix A is defined such that
A A^{-1} = A^{-1} A = I. To be “invertible”, A must be square and full rank.
More on this later.

Trace: The trace of a matrix A, denoted as tr(A), is the sum of its diagonal
elements.

Inner and Outer Products: Given two vectors a, b ∈ R^n, their inner
product p is a scalar defined as p = \sum_{i=1}^{n} a_i b_i. This is equivalent to
the expression p = a^T b. Recall that if a, b are orthogonal, then p = 0. If
a ∈ R^m and b ∈ R^n, then the outer product P is the m × n matrix defined
as P = a b^T.
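As a quick numerical illustration (not part of the original text; variable names are arbitrary), the inner and outer products can be formed in Python/NumPy as follows:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, -1.0, 2.0])

    # inner product: a scalar, equal to a^T b
    p = a @ b                    # same as np.dot(a, b); here 1*4 + 2*(-1) + 3*2 = 8

    # outer product: an m x n matrix a b^T
    P = np.outer(a, b)

    print(p)           # 8.0
    print(P.shape)     # (3, 3)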

1.1.1 Linear Vector Spaces

In this section, vectors are assumed to be length m. A linear vector space


S is a set of vectors, which satisfy two properties:

1. If x, y ∈ S, then x + y ∈ S, where “+” denotes vector addition.

2. If x ∈ S, then cx ∈ S, where c ∈ R, where vector multiplication by a


scalar has been indicated.

A more mathematically rigorous presentation is given e.g., in [4].

Given a set of vectors [a1 , . . . , an ], where ai ∈ Rm , i = 1, . . . , n, and a set of


n scalars ci ∈ R, then the vector y ∈ Rm defined by
y = \sum_{i=1}^{n} c_i a_i                (1.1)

is referred to as a linear combination of the vectors ai . Then according to


the properties above, if the ai ∈ S, the linear combination y is also in S.

Eq. (1.1) can be represented more compactly as the matrix–vector product


given as:
y = Ac (1.2)

where A ∈ R^{m×n} = [a_1, . . . , a_n]. To see this, we can depict the product Ac
in the following form for the 3 × 3 case:

        [ a  d  g ] [ 1 ]
    y = [ b  e  h ] [ 2 ] .                (1.3)
        [ c  f  i ] [ 3 ]

Then, from the conventional rules of matrix–vector multiplication, we have

        [ 1a + 2d + 3g ]
    y = [ 1b + 2e + 3h ] .
        [ 1c + 2f + 3i ]

Note in this example, that all elements of the first column of A are multiplied
only by the coefficient c1 = 1, that the entire second column is multiplied
only by c2 = 2, and all elements of the third column of A are multiplied only
by c3 = 3. Therefore (1.2) can be written in the form y = c1 a1 +c2 a2 +c3 a3 ,
which is identical to (1.1). If C were a matrix with e.g., two columns instead
of one, then the resulting Y would be a matrix with two columns. In this
case, the second column of Y would also be a linear combination of the
columns of A, but in this case the coefficients are from the second column
of C.

It is very important that the reader understand the concept that y in (1.2) is
a linear combination of the columns of A, since it is one of the fundamental
ideas of linear algebra. In this vein it helps greatly to visualize an entire
column as being a single entity, rather than treating each element individ-
ually. We present an additional example to illustrate the concept further.
Consider the following depiction of matrix–vector multiplication:

        [  |    |         |  ] [ c_1 ]
    y = [ a_1  a_2  ...  a_n ] [ c_2 ]
        [  |    |         |  ] [ ... ]
                               [ c_n ]

Each vertical strip in the long brackets represents an entire column a_i of A.
In a manner similar to (1.1), from the rules of matrix–vector multiplication,
we see from the above diagram that each element c_i of c multiplies only the
corresponding column a_i; i.e., coefficient c_i interacts only with the column
a_i. Thus, y = \sum_i c_i a_i, which is the same result as (1.2).
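To make the column interpretation concrete, the following minimal NumPy sketch (illustrative only) verifies that Ac equals the corresponding linear combination of the columns of A, using the 3 × 3 example above:

    import numpy as np

    A = np.array([[1.0, 4.0, 7.0],
                  [2.0, 5.0, 8.0],
                  [3.0, 6.0, 9.0]])
    c = np.array([1.0, 2.0, 3.0])

    # ordinary matrix-vector product
    y1 = A @ c

    # explicit linear combination of the columns of A with coefficients c_i
    y2 = c[0] * A[:, 0] + c[1] * A[:, 1] + c[2] * A[:, 2]

    print(np.allclose(y1, y2))    # True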

We now transpose (1.3) to obtain

    y^T = c^T A^T

                        [ a  b  c ]
        = [ 1  2  3 ]   [ d  e  f ]
                        [ g  h  i ]

          [ 1a   1b   1c  ]
        = [ +2d  +2e  +2f ]
          [ +3g  +3h  +3i ]

        = \sum_{i=1}^{n} c_i a_i^T                (1.4)

where aTi is the ith row of AT in this case. With respect to the middle
equation above, note that y T is a 1 × 3 row vector, where the summation
corresponding to each element is represented in a column format for clarity
of presentation. In a transposed manner corresponding to (1.3), note the ith
element of c interacts only with the ith row aTi of AT , i = 1, . . . , 3. Thus,
using similar logic as the column case, we see that the row vector y T in this
case is a linear combination of the rows of AT , whose coefficients are the
elements of cT .

Thus if we pre–multiply a matrix AT by a row vector cT , then each row of


the product is a linear combination of the rows of AT . Likewise, when a
matrix A is post–multiplied by a column vector c, then each column of the
product is a linear combination of the columns of A.

Instead of using (1.2) to define a single vector y, we can extend it to define


a vector space S:
 
S = { y ∈ R^m | y = Ac,  c ∈ R^n }                (1.5)

Here it is implied that c takes on the infinite set of all possible values within
Rn , and consequently {y} is the set of all possible linear combinations of
the columns ai . The dimension of S denoted as dim(S) is the number
of independent directions that span the space; e.g., the dimension of the
universe we live in is 3. In the case where n = 2 and if a1 and a2 are

linearly independent (to be defined), then S is a two–dimensional plane
which contains a1 and a2 .

Dim(S) is not necessarily n, the number of vectors or columns of A. In


fact, dim(S) ≤ n. The quantity dim(S) depends on the characteristics of
the vectors ai . For example, the vector space defined by the vectors a1 and
a2 in Fig. 1.1 below is the infinite extension of the plane of the paper. The
dimension of this vector space is 2: If a third vector a3 which is orthogonal

Figure 1.1. A vector set containing two linearly independent vectors. The dimension of
the corresponding vector space S is 2.

to the plane of the paper were added to the set, then the resulting vector
space would be the three–dimensional universe. A third example is shown
in Figure 1.2. Here, since none of the vectors a1 . . . , a3 have a component
which is orthogonal to the plane of the paper, all linear combinations of this
vector set, and hence the corresponding vector space, lies in the plane of the
paper. Thus, in this case, dim(S) = 2, even though there are three vectors
in the set.

Figure 1.2. A vector set containing three linearly dependent vectors. The dimension of the
corresponding vector space S is 2, even though there are three vectors in the set. This is
because the vectors all lie in the plane of the paper.

1.1.2 Linear Independence

A vector set [a_1, . . . , a_n], a_i ∈ R^m, is linearly independent under the condition

    y = \sum_{j=1}^{n} c_j a_j = Ac = 0   if and only if   c_1, . . . , c_n = 0        (1.6)
i.e., the only way to make a linear combination of a set of linearly indepen-
dent vectors to be zero is to make all the coefficients [c1 , . . . , cn ] = 0. We
can also say that the set [a1 , . . . , an ] is linearly independent if and only if
dim(S) = n, where S is the vector space corresponding to the set. Loosely
speaking, a vector set is linearly independent if and only if (iff) the corre-
sponding vector space “fills up” n dimensions. Notice that if m < n (recall
m is the length of the vectors) then the vectors must be linearly dependent,
since a set of vectors of length m can only span at most an m–dimensional
space. Further, a linearly dependent vector set can be made independent by
removing appropriate vectors from the set.

Example 1:
                           [ 1  2   1 ]
    A = [a_1 a_2 a_3]  =   [ 0  3  -1 ]                (1.7)
                           [ 0  0   1 ]

This set is linearly independent. On the other hand, the set

                           [ 1  2  -3 ]
    B = [b_1 b_2 b_3]  =   [ 0  3  -3 ]                (1.8)
                           [ 1  1  -2 ]
is not. This follows because the third column is a linear combination of the
first two. (−1 times the first column plus −1 times the second equals the
third column). Thus, the coefficients of the vector c in (1.6) which results
in zero are any scalar multiple of (1, 1, 1). We will see later that this vector
defines the null space of B.
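A quick numerical check of Example 1 can be done in Python (an illustrative sketch, not from the text): the matrix rank reveals linear (in)dependence, and the null space of B recovers the coefficient vector (1, 1, 1) up to a scale factor.

    import numpy as np
    from scipy.linalg import null_space

    A = np.array([[1, 2,  1],
                  [0, 3, -1],
                  [0, 0,  1]], dtype=float)
    B = np.array([[1, 2, -3],
                  [0, 3, -3],
                  [1, 1, -2]], dtype=float)

    print(np.linalg.matrix_rank(A))   # 3 -> columns linearly independent
    print(np.linalg.matrix_rank(B))   # 2 -> columns linearly dependent

    # null space of B: spanned by a scalar multiple of (1, 1, 1)
    print(null_space(B).ravel())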

Span, Subspaces and Range

In this section, we explore these three closely-related ideas. In fact, their


mathematical definitions are similar, but the interpretation is different for
each case.

Span: The span of a vector set [a1 , . . . , an ], written as span[a1 , . . . , an ], is
the vector space S corresponding to this set; i.e.,

    S = span[a_1, . . . , a_n] = { y ∈ R^m | y = \sum_{j=1}^{n} c_j a_j,  c_j ∈ R },

where in this case the coefficients cj are each assumed to take on the infi-
nite range of real numbers. In the above, we have dim(S) ≤ n, where the
equality is satisfied iff the vectors ai are linearly independent. Note that
the argument of span(·) is a vector set.

Subspaces: A subspace is a subset of a vector space. Formally speaking, a


subspace U of S = span[a1 , . . . , an ] is determined by U = span[ai1 , . . . aik ],
where the indices satisfy {i1 , . . . , ik } ⊂ {1, . . . , n}. In other words, a sub-
space is a vector space formed from a subset of the vectors [a1 . . . an ]. The
subspace is of dimension k if the [ai1 , . . . aik ] are linearly independent.

Note that [ai1 , . . . aik ] is not necessarily a basis for the subspace S. This
set is a basis iff it is a maximally independent set. This idea is discussed
shortly. The set {ai } need not be linearly independent to define the span
or subspace.

For example, the vectors [a1 , a2 ] in Fig. 1 define a subspace (the plane of
the paper) which is a subset of the three–dimensional universe R3 .

∗ What is the span of the vectors [b1 , . . . , b3 ] in example 1?

Range: The range of a matrix A ∈ Rm×n , denoted R(A), is the vector


space satisfying

R(A) = {y ∈ Rm | y = Ax, for x ∈ Rn } .

Thus, we see that R(A) is the vector space consisting of all linear combi-
nations of the columns ai of A, whose coefficients are the elements xi of
x. Therefore, R(A) ≡ span[a1 , . . . , an ]. The distinction between range and
span is that the argument of range is a matrix, whereas we have seen that
the argument of span is a vector set. We have dim[R(A)] ≤ n, where the
equality is satisfied iff the columns are linearly independent. Any vector
y ∈ R(A) is of dimension (length) m.

Example 3:

        [ 1  5  3 ]
    A = [ 2  4  3 ]    (the last column is the arithmetic average of the first two)
        [ 3  3  3 ]

R(A) is the set of all linear combinations of any two columns of A.

In the case when m > n (i.e., A is a tall matrix), it is important to note


that R(A) is indeed a subspace of the m-dimensional “universe” Rm . In this
case, the dimension of R(A) is less than or equal to n. Thus, R(A) does
not span the whole m–dimensional universe, and therefore is a subspace of
it.

Bases

A maximally independent set is a vector set which cannot be made larger


without losing independence, and smaller without remaining maximal; i.e.
it is a set containing the maximum number of linearly independent vectors
that span the space. A basis for a subspace is any maximally independent
set within the subspace. It is not unique.

A commonly used basis is I. Another commonly used form of bases are


columns from an orthonormal matrix; these matrices have mutually orthog-
onal columns with unit norm. More on this later in Chapter 2. Given a
basis, any vector in the corresponding subspace is uniquely represented by
that basis.

Example 4. A basis for the subspace U spanning the first 2 columns of


 
        [ 1  2   3 ]
    A = [ 0  3  -3 ]
        [ 0  0   3 ]

is

    e_1 = (1, 0, 0)^T
    e_2 = (0, 1, 0)^T.
Note that any linearly independent set in span[e1 , e2 ] is also a basis.

Change of Basis: We are given a basis formed by A which we use to express


a vector y as y = Ac1 , where c1 are the respective coefficients. Suppose
we wish to represent y using the basis B instead of A. To determine the
coefficients c2 to represent y in the basis B, we solve the system of linear
equations
y = Bc2 ; (1.9)
i.e., c2 = B −1 y. (The reader is invited to explain why this inverse always
exists in this case.) It is interesting to note that if the basis B = I, then
the respective coefficients c2 = y. This fact greatly simplifies the underlying
algebra and explains why the basis I = [e1 , . . . , en ] is so commonly used.
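The change of basis can be carried out numerically by solving (1.9). A minimal NumPy sketch follows (illustrative only; the particular bases are arbitrary, and the system is solved rather than explicitly inverting B):

    import numpy as np

    # a vector y expressed in the basis A with coefficients c1
    A = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
    c1 = np.array([2.0, 3.0])
    y = A @ c1

    # new (invertible) basis B; solve B c2 = y for the new coefficients
    B = np.array([[1.0, -1.0],
                  [1.0,  1.0]])
    c2 = np.linalg.solve(B, y)

    print(np.allclose(B @ c2, y))   # True: same vector, new coefficients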

1.1.3 The Orthogonal Complement Subspace

Recall for any pair of vectors x, y ∈ R^m, their dot product, or inner product,
c is defined as c = \sum_{i=1}^{m} x_i y_i = x^T y, where (·)^T denotes transpose. Further,
recall that two vectors are orthogonal iff their inner product is zero. Now
suppose we have a subspace S of dimension r corresponding to the vectors
[a_1, . . . , a_n], for r ≤ n ≤ m; i.e., the respective matrix A is tall and the
a_i's are not necessarily linearly independent. With this background, the
orthogonal complement subspace S⊥ of S, of dimension m − r, is defined as

    S⊥ = { y ∈ R^m | y^T x = 0 for all x ∈ S }                (1.10)

i.e., any vector in S⊥ is orthogonal to any vector in S. The quantity S⊥ is


pronounced “S–perp”.

Example 5: Take the vector set defining S from Example 4:

         [ 1  2 ]
    S ≡  [ 0  3 ]
         [ 0  0 ]

then, a basis for S⊥ is

    [ 0 ]
    [ 0 ]
    [ 1 ]
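Numerically, a basis for S⊥ can be obtained as the null space of the transposed vector set, since S⊥ coincides with N(A^T) (this connection is formalized in Section 1.2). A short SciPy sketch for Example 5 (illustrative only):

    import numpy as np
    from scipy.linalg import null_space

    # columns of A span S (Example 5)
    A = np.array([[1.0, 2.0],
                  [0.0, 3.0],
                  [0.0, 0.0]])

    # S-perp: vectors orthogonal to every column of A
    S_perp = null_space(A.T)
    print(S_perp.ravel())          # proportional to (0, 0, 1)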
1.1.4 Rank

Rank is a fundamental concept of linear algebra. Its definition and properties


follow:

1. The rank of a matrix A (denoted rank(A)), is the number of linearly


independent rows or columns in A. Thus, it is the dimension of R(A),
which is the subspace formed from the columns of A. The symbol r
is commonly used to denote rank; i.e., r = rank(A).

2. A matrix A ∈ Rm×n is said to be rank deficient if r < min(m, n).


Otherwise, it is said to be full rank. The columns (if A is tall), or
rows (if A is short) of a full rank matrix must be linearly independent.
This follows directly from the definition of rank. For column matrices,
full column rank is sometimes used in place of full rank, and for short
matrices, the term full row rank applies.

3. As indicated earlier, a matrix is invertible if and only if it is square


and full rank.

4. The columns (rows) of a rank deficient matrix are linearly dependent.

5. If A is square and rank deficient, then det(A) = 0.

6. If A = BC, and r1 = rank(B), r2 = rank(C), then rank(A) ≤


min(r1 , r2 ).

7. It can be shown that rank(A) = rank(AT ). More is said on this point


later.

Example 6: The rank of A in Example 4 is 3, whereas the rank of A in


Example 3 is 2.

Example 7: Consider the matrix multiplication C ∈ Rm×n = AB, where

A ∈ R^{m×2} and B ∈ R^{2×n}, depicted by the following diagram:

        [  |    |  ]
    C = [ a_1  a_2 ]  [ x  x  x  x ]
        [  |    |  ]  [ x  x  x  x ]
             A              B

where the symbol x represents the respective element of B. Then, the rank of
C is at most two. To see this, we realize from our discussion on representing
a linear combination of vectors by matrix multiplication, that the ith column
of C is a linear combination of the two columns of A whose coefficients are
the ith column of B. Thus, all columns of C reside in the vector space
R(A). If the columns of A and the rows of B are linearly independent, then
the dimension of this vector space is two, and hence rank(C) = 2. If the
columns of A or the rows of B are linearly dependent and non–zero, then
rank(C) = 1. This example can be extended in an obvious way to matrices
of arbitrary size.
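The rank bound in Example 7 is easy to confirm numerically; the following NumPy sketch uses arbitrary random factors and is illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 2))   # full column rank with probability 1
    B = rng.standard_normal((2, 5))   # full row rank with probability 1
    C = A @ B

    print(np.linalg.matrix_rank(C))   # 2: rank(C) <= min(rank(A), rank(B))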

1.1.5 Null Space of A

The null space N (A) of A is defined as

    N(A) = { x ∈ R^n | Ax = 0 },                (1.11)

where the trivial value x = 0 is normally excluded from the space. From
previous discussions, the product Ax is a linear combination of the columns
ai of A, where the elements xi of x are the corresponding coefficients. Thus,
from (1.11), N (A) is the set of non–zero coefficients of all zero linear combi-
nations of the columns of A. Therefore if N (A) is non-empty, then A must
have linearly dependent columns and thus be column rank deficient.

If the columns of A are linearly independent, then N (A) = ∅ by definition,


because there can be no coefficients except zero which result in a zero linear
combination. In this case, the dimension of the null space is zero (i.e., N (A)
is empty), and A is full column rank. Note that any vector in N (A) is of

dimension n. Any vector in N (A) is orthogonal to the rows of A, and is
thus in the orthogonal complement subspace of the rows of A.

Example 8: Let A be as before in Example 3. Then N (A) = c(1, 1, −2)T ,


where c ∈ R.

Example 9: Consider a matrix A ∈ R3×3 whose columns are constrained


to lie in a 2–dimensional plane. Then there exists a zero linear combination
of these vectors. The coefficients of this linear combination define a vector
x which is in the nullspace of A. In this case, A must be rank deficient.

Another important characterization of a matrix is its nullity. The nullity of


A is the dimension of the nullspace of A. In Example 6 above, the nullity of
A is one. We then have the following interesting property, which is proved
later in Ch.2:
rank(A) + nullity(A) = n. (1.12)
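Equation (1.12) can be checked numerically for the matrix of Example 3; the sketch below (illustrative only) computes the rank and the nullity directly:

    import numpy as np
    from scipy.linalg import null_space

    A = np.array([[1.0, 5.0, 3.0],
                  [2.0, 4.0, 3.0],
                  [3.0, 3.0, 3.0]])   # third column = average of first two

    rank = np.linalg.matrix_rank(A)
    nullity = null_space(A).shape[1]

    print(rank, nullity, rank + nullity == A.shape[1])   # 2 1 True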

1.2 The Four Fundamental Subspaces of a Matrix

The four matrix subspaces of concern are: the column space, the row space,
and their respective orthogonal complements. The development of these four
subspaces is closely linked to N (A) and R(A). We assume for this section
that A ∈ Rm×n , r ≤ min(m, n), where r = rank(A).

The Column Space: This is simply R(A). Its dimension is r. It is the


set of all linear combinations of the columns of A. Any vector in R(A) is of
dimension m.

The Orthogonal Complement of the Column Space: This may be


expressed as R(A)⊥ , with dimension m−r. It may be shown to be equivalent
to N(A^T), as follows. By definition, N(A^T) is the set x satisfying

           [ x_1 ]
    A^T    [ ... ]  = 0,
           [ x_m ]

where the columns of A are the rows of A^T.
is the set of x ∈ Rm which is orthogonal to all columns of A (rows of AT ).
This by definition is the orthogonal complement of R(A). Any vector in
R(A)⊥ is of dimension m.

The Row Space The row space is defined simply as R(AT ), with dimension
r. The row space is the span of the rows of A. Any vector in R(AT ) is of
dimension n.

The Orthogonal Complement of the Row Space This may be denoted


as R(AT )⊥ . Its dimension is n−r. This set must be that which is orthogonal
to all rows of A; i.e., for x to be in this space, x must satisfy

           [ x_1 ]
     A     [ ... ]  = 0.
           [ x_n ]

Thus it is apparent that the set x satisfying the above is N (A). Any vector
in R(AT )⊥ is of dimension n.

We have noted before that rank(A) = rank(AT ). Thus, the dimension


of the row and column subspaces are equal. This is surprising, because it
implies the number of linearly independent rows of a matrix is the same as
the number of linearly independent columns. This holds regardless of the
size or rank of the matrix. It is not an intuitively obvious fact and there
is no immediately obvious reason why this should be so. Nevertheless, the
rank of a matrix is the number of independent rows or columns.
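The equality of row rank and column rank is easy to observe numerically, even for a randomly chosen rectangular matrix; the NumPy sketch below is illustrative only:

    import numpy as np

    rng = np.random.default_rng(1)
    # a 7 x 4 matrix of rank 3: product of 7x3 and 3x4 factors
    A = rng.standard_normal((7, 3)) @ rng.standard_normal((3, 4))

    print(np.linalg.matrix_rank(A))      # 3
    print(np.linalg.matrix_rank(A.T))    # 3: same as rank(A)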

1.3 Further Interpretations of Matrix Multiplication

1.3.1 Bigger–Block Interpretations of Matrix Multiplication

In this section, we generalize on the discussion of Sect. 1.1.1 for matrix–


vector multiplication. To standardize our discussion, we define the matrix

product C as

      C     =     A       B                (1.13)
    m × n       m × k   k × n

Note that the matrix multiplication operation requires the inner dimensions
of A and B to be equal. Such matrices are said to have conformable dimen-
sions. Four interpretations of the matrix multiplication operation follow:

1. Inner-Product Representation

If we define aTi ∈ Rk as the ith row of A and bj ∈ Rk as the jth column of


B, then the element cij of C is defined as the inner product aTi bj . This is
the conventional small-block representation of matrix multiplication.

2. Column Representation

This is the next bigger–block view of matrix multiplication. Here we form


the product one column at a time. We have seen this idea before in Sect.
1.1.1 – the jth column cj of C may be expressed as a linear combination of
columns ai of A with coefficients which are the elements of the jth column
of B. Thus,

    c_j = \sum_{i=1}^{k} a_i b_{ij},   j = 1, . . . , n.                (1.14)

For example, if we evaluate only the pth element of the jth column c_j (i.e.,
element c_pj), we see that (1.14) degenerates into c_pj = \sum_{i=1}^{k} a_{pi} b_{ij}. This is
the inner product of the pth row and jth column of A and B respectively.
Even though the column representation is more compact from the algebraic
viewpoint, its computer execution is identical to that of the inner-product
representation. From this column representation, it is straightforward to
show that R(C) = R(A) (see problem 2 at the end of this Chapter).

3. Row Representation

This is the transpose operation of the column representation above. The ith
row cTi of C can be written as a linear combination of the rows bTj of B,

whose coefficients are given as the ith row of A, i.e.,

    c_i^T = \sum_{j=1}^{k} a_{ij} b_j^T,   i = 1, . . . , m.

It is possible to show that R(C T ) = R(B T ) (Problem 2).

4. Outer–Product Representation

This is the largest–block representation. Let a_i and b_i^T be the ith column
and row of A and B respectively. Then the product C may also be expressed
as

    C = \sum_{i=1}^{k} a_i b_i^T.                (1.15)

By looking at this operation one column at a time, we see this form of
matrix multiplication performs exactly the same operations as the column
representation above. For example, the jth column c_j of the product is
determined from (1.15) to be c_j = \sum_{i=1}^{k} a_i b_{ij}, which is identical to (1.14)
above.
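All of these representations produce the same product. The following NumPy sketch (illustrative only, with random data) compares the built-in product with the outer-product sum (1.15) and the column form (1.14):

    import numpy as np

    rng = np.random.default_rng(2)
    m, k, n = 4, 3, 5
    A = rng.standard_normal((m, k))
    B = rng.standard_normal((k, n))

    C_ref = A @ B

    # outer-product representation (1.15): sum of k rank-one matrices
    C_outer = sum(np.outer(A[:, i], B[i, :]) for i in range(k))

    # column representation (1.14): each column of C is A times a column of B
    C_col = np.column_stack([A @ B[:, j] for j in range(n)])

    print(np.allclose(C_ref, C_outer), np.allclose(C_ref, C_col))   # True True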

Multiplication by Diagonal Matrices We consider a matrix A premulti-


plied by D of conformable dimensions, i.e., C = DA, where D is diagonal.
In this case, the ith row cTi of C can be written as cTi = dii aTi ; i.e., pre–
multiplication by a diagonal matrix has the effect of scaling each row of the
product by the corresponding diagonal element of D. Similarly, if we take a
matrix B postmultiplied by a diagonal D, i.e., C = BD then each column
ci = dii bi ; i.e., post–multiplication by a diagonal matrix scales each column
by the corresponding diagonal element of D.
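A short NumPy sketch (illustrative only) showing the row and column scaling effect of pre- and post-multiplication by diagonal matrices:

    import numpy as np

    A = np.arange(1.0, 7.0).reshape(2, 3)      # a 2 x 3 matrix
    d_left = np.array([10.0, 100.0])           # scales the rows
    d_right = np.array([1.0, 2.0, 3.0])        # scales the columns

    C_rows = np.diag(d_left) @ A               # row i multiplied by d_left[i]
    C_cols = A @ np.diag(d_right)              # column j multiplied by d_right[j]

    print(np.allclose(C_rows, d_left[:, None] * A))   # True
    print(np.allclose(C_cols, A * d_right))           # True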

Multiplication with Block Matrices In many cases matrix analysis be-


comes much easier if we partition a matrix into blocks. This happens ex-
tensively in Chapter 2 for example. Manipulating block matrices is a very
straightforward process, since the blocks can be treated just as regular el-
ements are when manipulating ordinary matrices, provided the dimensions
of the blocks of the two matrices being operated upon are conformable. We
partition two matrices A and B into blocks as shown, where the dimensions
of each block are as indicated [1]:

        [ A_11 ... A_1p ]  m_1
    A = [  ...      ... ]  ...
        [ A_q1 ... A_qp ]  m_q
          s_1  ...  s_p

        [ B_11 ... B_1r ]  s_1
    B = [  ...      ... ]  ...
        [ B_p1 ... B_pr ]  s_p
          n_1  ...  n_r

Then the product C = AB can be formed by treating each block as a


regular element when performing ordinary matrix multiplication. For ex-
ample, block C ij can be written following the conventional rules of matrix
multiplication as
    C_ij = \sum_{k=1}^{p} A_ik B_kj.                (1.16)

Notice that for each term in the above, the number of columns of the kth
A–block is equal to the number of rows in the kth B–block, which is the
dimension sk , k = 1, . . . , p in the above equations. This way, the blocks
have conformable dimensions with regard to matrix multiplication. Also
the number of blocks p in a row of A must equal the number of blocks
in a column of B. Eq. (1.16) can be proved by verifying that for any
element cij , (1.16) performs exactly the same operations as ordinary matrix
multiplication performs in evaluating the same element.

A relevant example of block matrix multiplication, which is related to the


eigendecomposition in the following Chapter is the following, where all the
blocks are assumed to have conformable dimensions:

    [ V_1  V_2 ] [ Λ  0 ] [ V_1^T ]   =   [ V_1  V_2 ] [ Λ V_1^T ]
                 [ 0  0 ] [ V_2^T ]                    [    0    ]

                                       =   V_1 Λ V_1^T.
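Block multiplication according to (1.16) can be checked numerically; the partition sizes in the NumPy sketch below are arbitrary and the example is illustrative only:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((5, 6))
    B = rng.standard_normal((6, 4))

    # partition A as [[A11, A12], [A21, A22]] and B conformably
    A11, A12 = A[:2, :3], A[:2, 3:]
    A21, A22 = A[2:, :3], A[2:, 3:]
    B11, B12 = B[:3, :2], B[:3, 2:]
    B21, B22 = B[3:, :2], B[3:, 2:]

    # form C block-by-block using (1.16)
    C_blocked = np.block([[A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
                          [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22]])

    print(np.allclose(C_blocked, A @ B))   # True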

1.4 Vector Norms

A vector norm is a means of expressing the length or distance associated


with a vector. A norm on a vector space Rm is a function f , which maps
a point in Rm into a point in R. Formally, this is stated mathematically as
f : Rm → R. The norm has the following properties:

1. f(x) ≥ 0 for all x ∈ R^m.

2. f (x) = 0 if and only if x = 0.

3. f (x + y) ≤ f (x) + f (y) for x, y ∈ Rm .

4. f (ax) = |a|f (x) for a ∈ R, x ∈ Rm .

We denote the function f (x) as ||x||.

The p-norms: This is a useful class of norms, generalizing on the idea of


the Euclidean norm. They are defined by

    ||x||_p = (|x_1|^p + |x_2|^p + . . . + |x_m|^p)^{1/p},                (1.17)

where p can assume any positive value. Below we discuss commonly used
values for p:

p = 1:

    ||x||_1 = \sum_i |x_i|,

which is simply the sum of absolute values of the elements.

p = 2:

    ||x||_2 = ( \sum_i x_i^2 )^{1/2} = (x^T x)^{1/2},

which is the familiar Euclidean norm. As implied from the above, we have
the important identity ||x||_2^2 = x^T x.

p = ∞:

    ||x||_∞ = max_i |x_i|,

which is the element of x with the largest magnitude. This may be shown
in the following way. As p → ∞, the largest term within the round brackets
in (1.17) dominates all others in the summation. Therefore (1.17) may be
written as

    ||x||_∞ = lim_{p→∞} [ \sum_{i=1}^{m} |x_i|^p ]^{1/p} → [ |x_k|^p ]^{1/p} = |x_k|,

where k is the index corresponding to the element of x with the largest
absolute value.

Note that the p = 2 norm has many useful properties, but is expensive to
compute. The 1– and ∞–norms are easier to compute, but are more difficult
to deal with algebraically. All the p–norms obey all the properties of a vector
norm.

Figure 1.3 shows the locus of points of the set {x | ||x||p = 1} for p = 1, 2, ∞.
We now consider the relation between ||x||1 and ||x||2 for some point x,
(assumed not to be on a coordinate axis, for the sake of argument). Let x
be a point which lies on the ||x||2 = 1 locus. Because the p = 1 locus lies
inside the p = 2 locus, the p = 1 locus must expand outwards (i.e., ||x||1
must assume a larger value), to intersect the p = 2 locus at the point x.
Therefore we have ||x||1 ≥ ||x||2 . The same reasoning can be used to show
the same relation holds for ||x||1 and ||x||2 vs. ||x||∞ . Even though we
have considered only the 2-dimensional case, the same argument is readily
extended to vectors of arbitrary dimension. Therefore we have the following
generalization: for any vector x, we have

||x||1 ≥ ||x||2 ≥ ||x||∞ .
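These norms, and the ordering above, are easy to check numerically; the short NumPy sketch below is illustrative only:

    import numpy as np

    x = np.array([3.0, -4.0, 1.0])

    n1 = np.linalg.norm(x, 1)          # sum of absolute values: 8
    n2 = np.linalg.norm(x, 2)          # Euclidean norm: sqrt(26)
    ninf = np.linalg.norm(x, np.inf)   # largest magnitude: 4

    print(n1 >= n2 >= ninf)            # True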

1.5 Determinants

Consider a square matrix A ∈ Rm×m . We can define the matrix Aij as


the submatrix obtained from A by deleting the ith row and jth column

Figure 1.3. Locus of points of the set {x | ||x||_p = 1} for p = 1, 2, ∞.

of A. The scalar number det(A_ij) (where det(·) denotes determinant) is
called the minor associated with the element a_ij of A. The signed minor
c_ij = (−1)^{i+j} det(A_ij) is called the cofactor of a_ij.

The determinant of A is the m-dimensional volume contained within the


columns (rows) of A. This interpretation of determinant is very useful
as we see shortly. The determinant of a matrix may be evaluated by the
expression

    det(A) = \sum_{j=1}^{m} a_{ij} c_{ij},   for any i ∈ (1 . . . m),                (1.18)

or

    det(A) = \sum_{i=1}^{m} a_{ij} c_{ij},   for any j ∈ (1 . . . m).                (1.19)

Both the above are referred to as the cofactor expansion of the determinant.
Eq. (1.18) is along the ith row of A, whereas (1.19) is along the jth column.

It is indeed interesting to note that both versions above give exactly the
same number, regardless of the value of i or j.

Eqs. (1.18) and (1.19) express the m × m determinant detA in terms of the
cofactors cij of A, which are themselves (m − 1) × (m − 1) determinants.
Thus, m − 1 recursions of (1.18) or (1.19) will finally yield the determinant
of the m × m matrix A.

From (1.18) it is evident that if A is triangular, then det(A) is the product


of the main diagonal elements. Since a diagonal matrix is a special case of
a triangular matrix, the determinant of a diagonal matrix is also the product
of its diagonal elements.
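For completeness, here is a direct (and deliberately naive) recursive implementation of the cofactor expansion (1.18) in Python; it is illustrative only, since its cost grows factorially with m:

    import numpy as np

    def cofactor_det(A):
        """Determinant by cofactor expansion along the first row (eq. 1.18)."""
        A = np.asarray(A, dtype=float)
        m = A.shape[0]
        if m == 1:
            return A[0, 0]
        det = 0.0
        for j in range(m):
            # minor: delete row 0 and column j
            minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
            det += (-1) ** j * A[0, j] * cofactor_det(minor)
        return det

    A = np.array([[4.0, 1.0], [1.0, 4.0]])
    print(cofactor_det(A), np.linalg.det(A))   # both 15.0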

Properties of Determinants

Before we begin this discussion, let us define the volume of a parallelopiped


defined by the set of column vectors comprising a matrix as the principal
volume of that matrix.

We have the following properties of determinants, which are stated without


proof:

1. det(AB) = det(A) det(B) A, B ∈ Rm×m .


The principal volume of the product of matrices is the product of
principal volumes of each matrix.
2. det(A) = det(A^T).
   This property shows that the characteristic polynomials (defined in
   Chapter 2) of A and A^T are identical. Consequently, as we see later,
   the eigenvalues of A^T and A are identical.
3. det(cA) = cm det(A) c ∈ R, A ∈ Rm×m .
This is a reflection of the fact that if each vector defining the principal
volume is multiplied by c, then the resulting volume is multiplied by
cm .
4. det(A) = 0 if and only if A is singular.
   This implies that at least one dimension of the principal volume of the
   corresponding matrix has collapsed to zero length.

5. det(A) = \prod_{i=1}^{m} λ_i, where λ_i are the eigen (singular) values of A.
   This means the parallelopiped defined by the column or row vectors
   of a matrix may be transformed into a regular rectangular solid of the
   same m–dimensional volume whose edges have lengths corresponding
   to the eigen (singular) values of the matrix.

6. The determinant of an orthonormal matrix (orthonormal matrices are defined in Chapter 2) is ±1.


This is easy to see, because the vectors of an orthonormal matrix are
all unit length and mutually orthogonal. Therefore the corresponding
principal volume is ±1.

7. If A is nonsingular, then det(A−1 ) = [det(A)]−1 .

8. If B is nonsingular, then det(B −1 AB) = det(A).

9. If B is obtained from A by interchanging any two rows (or columns),


then det(B) = − det(A).

10. If B is obtained from A by adding a scalar multiple of one row to


another (or a scalar multiple of one column to another), then det(B) =
det(A).

A further property of determinants allows us to compute the inverse of A.


Define the matrix Ã as the adjoint of A:

         [ c_11 ... c_1m ]T
    Ã =  [  ...      ... ]                (1.20)
         [ c_m1 ... c_mm ]

where the c_ij are the cofactors of A. According to (1.18) or (1.19), the ith
row ã_i^T of Ã times the ith column a_i is det(A); i.e.,

    ã_i^T a_i = det(A),   i = 1, . . . , m.                (1.21)

It can also be shown that

    ã_i^T a_j = 0,   i ≠ j.                (1.22)

Then, combining (1.21) and (1.22) for i, j ∈ {1, . . . , m} we have the following
interesting property:
ÃA = det(A)I, (1.23)
where I is the m × m identity matrix. It then follows from (1.23) that the
inverse A−1 of A is given as

A−1 = [det(A)]−1 Ã. (1.24)

Neither (1.19) nor (1.24) are computationally efficient ways of calculating


a determinant or an inverse respectively. Better methods which exploit the
properties of various matrix decompositions are made evident later in the
course.
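As a numerical illustration only (this is not the computational method the text recommends), the following sketch builds the adjoint from cofactors and confirms (1.23) and (1.24) for the small matrix used earlier:

    import numpy as np

    def adjoint(A):
        """Adjoint (adjugate) of A: transpose of the matrix of cofactors."""
        A = np.asarray(A, dtype=float)
        m = A.shape[0]
        C = np.empty((m, m))
        for i in range(m):
            for j in range(m):
                minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
                C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
        return C.T

    A = np.array([[4.0, 1.0], [1.0, 4.0]])
    A_tilde = adjoint(A)
    d = np.linalg.det(A)

    print(np.allclose(A_tilde @ A, d * np.eye(2)))       # eq. (1.23)
    print(np.allclose(np.linalg.inv(A), A_tilde / d))    # eq. (1.24)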

1.6 Problems

1. Explain how to construct a short m × n matrix with rank r < m.

2. We are given a matrix A ∈ R10×2 and a matrix B ∈ R2×5 . Consider


the product C = AB. The columns and rows of A and B respectively
are linearly independent.

(a) What is rank(C)?


(b) Express a basis for the rowspace of C that does not involve C itself.
Justify your response.
(c) Likewise, for the column space of C.

3. On Avenue 2 Learn for this course you will find a matlab file Ch1Q3.mat
which contains a short matrix A. Determine the orthogonal comple-
ment of the row space, using matlab. What can you infer about the
rank of A from your results? Also find, using matlab, an orthonormal
basis for R(A) as well as its orthogonal complement subspace.

4. You will also find the .mat file Ch1Q4.mat, which contains a 6 × 3
matrix A1 along with vectors b1 and b2 .

(a) Given that A1 is tall, explain how to determine whether b1 ∈


R(A1 ). Hint: The Matlab command that solves Ax = b yields a
solution such that the quantity ||Ax − b||_2^2 is minimized. Thus for
a tall matrix, this command always provides a solution regardless
of whether b1 ∈ R(A1 ) or not.
(b) Find the coefficients of the linear combination of the columns of
A1 that give the vector b1 .
(c) Now repeat using the vector b2 . Does a set of coefficients exist
that yield exactly b2 ? If not, why not? Explain the difference for
the cases b1 and b2 .
(d) This problem suggests we can determine whether or not a vector
b ∈ R(A) by solving the system of equations Ax = b. Suggest
an alternative approach.

5. This question has to do with blind recovery of a source signal that


has been transmitted through a pair of linear finite impulse response
(FIR) systems, as shown in Fig. 1.4(a). This situation is relevant to
the situation where speech is recorded within a reverberant room using

a pair of microphones. The impulse responses f1 and f2 model the
reverberation effect of the room. Under certain conditions, the source
signal x[n] can be recovered without error, and without knowledge of
f1 and f2 , using the concepts developed in this chapter. The sequence
x[n] is of length m and f1 [n] and f2 [n] are FIR sequences of length
n ≪ m. The outputs y_1[n] and y_2[n] are the convolution of x[n] with
f_1[n] and f_2[n] respectively; i.e.,

    y_i[n] = \sum_k f_i[k] x[n − k],   i ∈ [1, 2],

as shown in Fig. 1.4(a). We observe only the sequences y1 [n] and


y2 [n].

(a) Show how to express the convolution of two sequences as a matrix–


vector multiplication.
(b) Show how f1 [n], f2 [n] can be determined, based ONLY on the
observations y1 [n] and y2 [n]. Hint: Consider the configuration
of Fig. 1.4 (b). What are g1 and g2 so that the output z[n] =
0? This condition can be used to identify f1 and f2 . Note that
the sequences g2 and g1 are the recovered versions of f1 and f2
respectively. The g2 and g1 can be computed to represent f1 and
f2 with insignificant error.
(c) Compute g1 and g2 . Are there any required conditions on x[n],
or on f1 [n], f2 [n] ?

6. The speech signal used to generate the sequences y1 [n] and y2 [n] in
Problem 5 is SPF1.mat, which can be found on Avenue. Also available
are the signals y1 [n] and y2 [n] in the file Ch1Q6.mat.
It is possible to recover the source signal x[n] from the observations
y1[n] and y2[n] knowing f1 and f2 or equivalently g1 and g2. If the
sequences f are of length n, then there exist sequences w1 [n], w2 [n] of
length n − 1 that satisfy the expression

    f_1[n] ∗ w_1[n] + f_2[n] ∗ w_2[n] = δ[n]

where δ[n] is the delta-function sequence and ∗ denotes the convolution
operation. This configuration is shown in Fig. 1.4(c). See the reference
by Miyoshi and Kaneda on Avenue for more details. Note that if the
structure of Fig. 1.4(a) is cascaded with that of Fig. 1.4 (c), then
the impulse response of the overall structure becomes a zero–delay

impulse function and thus the original speech x[n] is recovered. Find
the sequences w1 [n], w2 [n] and recover the source x[n].
The true impulse responses f1 and f2 may be found in file Ch1Q6.mat
so you can compare your responses with the true values. You can play
the speech file through your computer sound system by issuing the
command “soundsc(vector )” within Matlab.

7. A matrix A ∈ Rm×n may be decomposed according to the singular


value decomposition, to be discussed in Chapter 3, as A = U ΣV T ,
where U is m × m, Σ is m × n diagonal (i.e., a zero block either below
or to the right of the diagonal block is appended in order to maintain
dimensional consistency), and V is n × n. Express A using a modified
form of the outer product rule for matrix multiplication.

Figure 1.4. Configuration for blind deconvolution.
Chapter 2

Eigenvalues and Eigenvectors

This chapter discusses eigenvalues and eigenvectors in the context of prin-


cipal component analysis of a random process. First, we discuss the funda-
mentals of eigenvalues and eigenvectors, then go on to covariance matrices.
These two topics are then combined into PCA analysis. Examples from
array signal processing and EEG analysis are given as applications of the
algebraic concepts.

A major aim of this presentation is an attempt to de-mystify the concepts


of eigenvalues and eigenvectors by showing their usefulness in the field of
signal processing.

2.1 Eigenvalues and Eigenvectors

We first discuss this subject from the classical mathematical viewpoint, and
then when the requisite background is in place we will apply eigenvalue and
eigenvectors in a signal processing context. We investigate the underlying
ideas of this topic using the matrix A as an example:

 
        [ 4  1 ]
    A = [ 1  4 ]                (2.1)

Figure 2.1. Matrix–vector multiplication for various vectors: Ax_1 is rotated counter-clockwise
relative to x_1, Ax_2 is rotated clockwise relative to x_2, and Ax_3 is not rotated.

The product Ax_1, where x_1 = [1, 0]^T, is shown in Fig. 2.1. Then,

    Ax_1 = [4, 1]^T.                (2.2)
By comparing the vectors x1 and Ax1 we see that the product vector is
scaled and rotated counter–clockwise with respect to x1 .

Now consider the case where x2 = [0, 1]T . Then Ax2 = [1, 4]T . Here, we
note a clockwise rotation of Ax2 with respect to x2 .

We now let x_3 = [1, 1]^T. Then Ax_3 = [5, 5]^T. Now the product vector
points in the same direction as x3 ; i.e., Ax3 ∈ span(x3 ) and Ax3 = λx3 .
Because of this property, x3 = [1, 1]T is an eigenvector of A. The scale
factor (which in this case is 5) is given the symbol λ and is referred to as an
eigenvalue.

Note that x = [1, −1]T is also an eigenvector, because in this case, Ax =


[3, −3]T = 3x. The corresponding eigenvalue is 3.

Thus if x is an eigenvector of A ∈ Rn×n we have,

Ax = λx (2.3)

i.e., the vector Ax is in the same direction as x but scaled by a factor λ.

Now that we have an understanding of the fundamental idea of an eigen-


vector, we proceed to develop the idea further. Eq. (2.3) may be written in

the form Ax = λIx, or
(A − λI)x = 0 (2.4)
where I is the n × n identity matrix. Thus to be an eigenvector, x must lie
in the nullspace of A − λI. We know that a nontrivial solution to (2.4)
exists if and only if N (A − λI) is non–empty, which implies that

det(A − λI) = 0 (2.5)

where det(·) denotes determinant.[1] Eq. (2.5), when evaluated, becomes a


polynomial in λ of degree n. For example, for the matrix A above we have
   
    det( [ 4  1 ]  −  λ [ 1  0 ] )  =  0
         [ 1  4 ]       [ 0  1 ]

    det( [ 4−λ    1  ] )  =  (4 − λ)^2 − 1
         [  1    4−λ ]
                          =  λ^2 − 8λ + 15 = 0.                (2.6)

It is easily verified that the roots of this polynomial are (5,3), which corre-
spond to the eigenvalues indicated above.
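In practice the roots of the characteristic polynomial are not computed explicitly; library routines are used instead. The minimal NumPy sketch below (illustrative only) confirms the eigenvalues and eigenvectors found above for the matrix A of (2.1):

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [1.0, 4.0]])

    evals, evecs = np.linalg.eig(A)
    print(evals)                              # 5 and 3 (order may vary)

    # each column of evecs satisfies A v = lambda v
    for lam, v in zip(evals, evecs.T):
        print(np.allclose(A @ v, lam * v))    # True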

Eq. (2.6) is referred to as the characteristic equation of A, and the cor-


responding polynomial is the characteristic polynomial. The characteristic
polynomial is of degree n.

More generally, if A is n×n, then there are n solutions of (2.5), or n roots of


the characteristic polynomial. Thus there are n eigenvalues of A satisfying
(2.3); i.e.,
Axi = λi xi , i = 1, . . . , n. (2.7)
Note that the eigenvalues of a diagonal matrix are the diagonal elements
themselves, with corresponding eigenvector given by the respective elemen-
tary vector e1 , . . . , en . If the matrix is triangular, it is straightforward to
show that the eigenvalues are also the diagonal elements, but the eigenvec-
tors are no longer the elementary vectors.

If the eigenvalues are all distinct, there are n associated linearly–independent


eigenvectors, whose directions are unique, which span an n–dimensional Eu-
clidean space.
[1] It turns out that determining the null space by evaluating the determinant is very
inefficient. Nevertheless, the method presented here is what appears in most undergraduate
texts, and leads to the useful concept of the characteristic equation relating to a matrix.
In the case where there are r ≤ n repeated eigenvalues, then a linearly
independent set of n eigenvectors still exists (provided rank(A − λI) = n − r).
However, their directions are not unique in this case. In fact, if [v 1 . . . v r ]
are a set of r linearly independent eigenvectors associated with a repeated
eigenvalue, then any vector in span[v 1 . . . v r ] is also an eigenvector. The
proof is left as an exercise.

Example 1: Consider the matrix given by


 
    [ 1  0  0 ]
    [ 0  0  0 ] .
    [ 0  0  0 ]
Here the eigenvalues are [1, 0, 0]T , and a corresponding linearly independent
eigenvector set is [e1 , e2 , e3 ]. Then it may be verified that any vector in
span[e2 , e3 ] is also an eigenvector associated with the zero repeated eigen-
value.

Example 2 : Consider the n×n identity matrix. It has n repeated eigenvalues


equal to one. In this case, any n–dimensional vector is an eigenvector with a
corresponding eigenvalue of 1, and the eigenvectors span an n–dimensional
space.

—————–

Eq. (2.5) gives us a clue how to compute eigenvalues. We can formulate


the characteristic polynomial and evaluate its roots to give the λi . Once
the eigenvalues are available, it is possible to compute the corresponding
eigenvectors v i by evaluating the nullspace of the quantity A − λi I, for i
= 1, . . . , n. This approach is adequate for small systems, but for those of
appreciable size, this method is slow and prone to appreciable numerical
error. Later, we consider various orthogonal transformations which lead to
much more effective techniques for finding the eigenvalues.
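As a quick numerical illustration of these two routes (a minimal sketch in Python with numpy, which is an assumed tool and not part of the original development; any linear–algebra package would serve equally well), we can form the characteristic polynomial of the matrix A used above and compare its roots with the output of a library eigensolver:

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [1.0, 4.0]])

    # Coefficients of the characteristic polynomial det(A - lambda*I) = lambda^2 - 8*lambda + 15
    coeffs = np.poly(A)
    print(coeffs)               # [ 1. -8. 15.]
    print(np.roots(coeffs))     # roots 5 and 3 -- the eigenvalues found above

    # The preferred route in practice: a library eigensolver
    lam, V = np.linalg.eig(A)
    print(lam)                  # eigenvalues 5 and 3 (the ordering may differ)
    print(V)                    # columns proportional to [1, 1]^T and [1, -1]^T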

We now present some very interesting properties of eigenvalues and eigen-


vectors, to aid in our understanding.

Property 1 If the eigenvalues of a (Hermitian) 2 symmetric matrix are


distinct, then the eigenvectors are orthogonal.
2
A symmetric matrix is one where A = AT , where the superscript T means transpose; i.e., for a symmetric matrix, an element aij = aji . A Hermitian symmetric (or just Hermitian) matrix is relevant only for the complex case, and is one where A = AH , where superscript H denotes the Hermitian transpose. This means the matrix is transposed and complex conjugated. Thus for a Hermitian matrix, an element aij = a∗ji . In this book we generally consider only real matrices. However, when complex matrices are considered, Hermitian symmetric is implied instead of symmetric.

Proof. Let {v i } and {λi }, i = 1, . . . , n be the eigenvectors and correspond-
ing eigenvalues of A ∈ Rn×n . Choose any i, j ∈ [1, . . . , n], i ≠ j. Then

Av i = λi v i (2.8)

and
Av j = λj v j . (2.9)

Premultiply (2.8) by v Tj and (2.9) by v Ti :

v Tj Av i = λi v Tj v i (2.10)
v Ti Av j = λj v Ti v j (2.11)

The quantities on the left are equal when A is symmetric. We show this as
follows. Since the left-hand side of (2.10) is a scalar, its transpose is equal
to itself. Therefore, we get v Tj Av i = v Ti AT v j 3 . But, since A is symmetric,
AT = A. Thus, v Tj Av i = v Ti AT v j = v Ti Av j , which was to be shown.

Subtracting (2.10) from (2.11), we have

(λi − λj )v Tj v i = 0 (2.12)

where we have used the fact v Tj v i = v Ti v j . But by hypothesis, λi − λj ≠ 0.


Therefore, (2.12) is satisfied only if v Tj v i = 0, which means the vectors are
orthogonal. 

Here we have considered only the case where the eigenvalues are distinct.
If an eigenvalue λ̃ is repeated r times, and rank(A − λ̃I) = n − r, then a
mutually orthogonal set of n eigenvectors can still be found.

Another useful property of eigenvalues of symmetric matrices is as follows:

3
Here, we have used the property that for matrices or vectors A and B of conformable
size, (AB)T = BT AT .

Property 2 The eigenvalues of a (Hermitian) symmetric matrix are real.

Proof: from [5]. (By contradiction): First, we consider the case where A
is real. Let λ be a complex eigenvalue (i.e., one with non–zero imaginary part) of a symmetric matrix A. Then, since the elements of A are real, λ∗ , the complex–conjugate of λ, must also be an eigenvalue of A, because the roots of the characteristic polynomial must occur in complex conjugate pairs. Also, if v is a nonzero eigenvector corresponding to λ, then an eigenvector corresponding to λ∗ must be v∗ , the
complex conjugate of v. But Property 1 requires that the eigenvectors be
orthogonal; therefore, vT v∗ = 0. But vT v∗ = (vH v)∗ , which is by definition the complex conjugate of the squared norm of v. The squared norm of a vector is a pure real number; hence, vT v∗ must be greater than zero, since v is by hypothesis
nonzero. We therefore have a contradiction. It follows that the eigenvalues
of a symmetric matrix cannot be complex; i.e., they are real.

While this proof considers only the real symmetric case, it is easily extended
to the case where A is Hermitian symmetric. 

Property 3 Let A be a matrix with eigenvalues λi , i = 1, . . . , n and eigen-


vectors vi . Then the eigenvalues of the matrix A + sI are λi + s, with
corresponding eigenvectors vi , where s is any real number.

Proof: From the definition of an eigenvector, we have Av = λv. Further,


we have sIv = sv. Adding, we have (A + sI)v = (λ + s)v. This new eigen-
vector relation on the matrix (A+sI) shows the eigenvectors are unchanged,
while the eigenvalues are displaced by s. This property has very useful con-
sequences with regard to regularization of poorly conditioned systems of
equations. We discuss this point further in Chapter 10. 
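A small numerical check of Property 3 is given below (a sketch assuming Python/numpy; the test matrix and the shift s = 2.5 are arbitrary choices made only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    A = B + B.T                      # an arbitrary symmetric matrix
    s = 2.5

    lam, V = np.linalg.eigh(A)                          # eigenvalues of A (ascending order)
    lam_shifted = np.linalg.eigvalsh(A + s * np.eye(4))

    print(np.allclose(lam_shifted, lam + s))            # True: eigenvalues are displaced by s
    v = V[:, 0]                                         # any eigenvector of A ...
    print(np.allclose((A + s * np.eye(4)) @ v, (lam[0] + s) * v))   # ... satisfies (A + sI)v = (lambda + s)v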

Property 4 Let A be an n × n matrix with eigenvalues λi , i = 1, . . . , n.


Then

• The determinant det(A) = λ1 λ2 · · · λn , i.e., the product of the eigenvalues.

• The trace4 tr(A) = λ1 + λ2 + · · · + λn , i.e., the sum of the eigenvalues.

4
The trace, denoted tr(·), of a square matrix is the sum of its diagonal elements.

The proof is straightforward, but because it is easier using concepts pre-
sented later in the course, it is not given here.
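Property 4 is nevertheless easy to verify numerically (again a minimal sketch assuming numpy; the test matrix is random and need not be symmetric):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 5))
    lam = np.linalg.eigvals(A)       # eigenvalues (possibly complex for a non-symmetric A)

    print(np.allclose(np.prod(lam), np.linalg.det(A)))   # True: det(A) = product of eigenvalues
    print(np.allclose(np.sum(lam), np.trace(A)))         # True: tr(A) = sum of eigenvalues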

Property 5 If v is an eigenvector of a matrix A, then cv is also an eigen-


vector, where c is any real or complex constant.

The proof follows directly by substituting cv for v in Av = λv. This means


that only the direction of an eigenvector can be unique; its norm is not
unique.

Property 6 If the eigenpairs of A are {λ, v}, then the eigenpairs of cA


are {cλ, v}, where c ∈ R.

The proof follows from direct multiplication of cA with any v.

2.1.1 Orthonormal Matrices

Before proceeding with the eigendecomposition of a matrix, we must develop


the concept of an orthonormal matrix. This form of matrix has mutually
orthogonal columns, each of unit 2–norm. This implies that

q Ti q j = δij , (2.13)

where δij is the Kronecker delta, and q i and q j are columns of the orthonor-
mal matrix Q. When i = j, the quantity q Ti q i defines the squared 2–norm
of q i , which has been defined as unity. When i ≠ j, q Ti q j = 0, due to the orthogonality of the q i . We therefore have

QT Q = I. (2.14)

Thus, for an orthonormal matrix, (2.14) implies Q−1 = QT . Thus the


inverse may be computed simply by taking the transpose of the matrix, an
operation which requires almost no computational effort.

Eq. (2.14) follows directly from the fact Q has orthonormal columns. It is
not so clear that the quantity QQT should also equal the identity. We can

resolve this question in the following way. Suppose that A and B are any
two square invertible matrices such that AB = I. Then, BAB = B. By
parsing this last expression, we have

(BA) · B = B. (2.15)

Clearly, if (2.15) is to hold, then the quantity BA must be the identity;


hence, if AB = I, then BA = I. Therefore, if QT Q = I, then also
QQT = I. From this fact, it follows that if a matrix has orthonormal
columns, then it also must have orthonormal rows. We now develop a further useful property of orthonormal matrices:

An orthonormal matrix is sometimes referred to as a unitary matrix. This


follows because the determinant of an orthonormal matrix is ±1.

Property 7 The vector 2-norm is invariant under an orthonormal trans-


formation.

If Q is orthonormal, then for any x we have

||Qx||22 = xT QT Qx = xT x = ||x||22 .

Thus, because the norm does not change, an orthonormal transformation


performs a rotation operation on a vector. We use this norm–invariance
property later in our study of the least–squares problem.

Orthonormal Matrices as a Basis Set: To represent an m-length vector


x in an m × m orthonormal basis represented by Q (as per the discussion
in Sect. 1.2.4), we form the coefficients c by taking c = QT x. Then x is
given as x = Qc. Because of the simplicity of these operations, orthonormal
matrices are convenient to use as a basis. Later in this chapter, we use the
eigenvector set of a covariance matrix as a basis for principal component
analysis.
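The sketch below (Python/numpy assumed; the QR factorization is used here only as a convenient way of manufacturing an orthonormal Q) illustrates (2.13)–(2.14), the norm invariance of Property 7, and the use of Q as a basis:

    import numpy as np

    rng = np.random.default_rng(2)
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # Q has orthonormal columns

    print(np.allclose(Q.T @ Q, np.eye(5)))   # True:  Q^T Q = I
    print(np.allclose(Q @ Q.T, np.eye(5)))   # True:  Q Q^T = I as well

    x = rng.standard_normal(5)
    c = Q.T @ x                              # coefficients of x in the basis Q
    print(np.allclose(Q @ c, x))             # True:  x is recovered exactly
    print(np.linalg.norm(Q @ x), np.linalg.norm(x))    # equal 2-norms (Property 7)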

Consider the case where we have a tall matrix U ∈ Rm×n , where m > n,
whose columns are orthonormal. U can be formed by extracting only the
first n columns of an arbitrary orthonormal matrix. (We reserve the term
orthonormal matrix to refer to a complete m × m matrix). Because U has
orthonormal columns, it follows that the quantity U T U = I n×n . However,

it is important to realize that the quantity U U T ≠ I m×m in this case, in contrast to the situation when U is square. This fact is easily verified, since rank(U U T ) = n, which is less than m, and so U U T cannot be the identity.

However, the matrix U U T has interesting consequences when we interpret


the tall U as an (incomplete) orthonormal basis. For an m-length vector
x, we form the coefficients of x in the basis U as c = U T x, where c is
length n < m. Then the representation x̃ of x in the new incomplete basis
is x̃ = U c = U U T x. Notice however that R(U ) is a subspace of dimension
n of Rm , and that x̃ is in this subspace, even though x itself is in the full m–
dimensional universe. So the operation U U T x projects x into the subspace
R(U ). The matrix U U T is referred to as a projector. We will study projector
matrices in more detail later in Ch. 3.
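A brief numerical illustration of these statements about a tall U (a sketch assuming numpy; the sizes m = 6 and n = 2 are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 6, 2

    Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    U = Q[:, :n]                             # first n columns of an orthonormal matrix

    print(np.allclose(U.T @ U, np.eye(n)))   # True:  U^T U = I (n x n)
    print(np.allclose(U @ U.T, np.eye(m)))   # False: U U^T is not the m x m identity

    x = rng.standard_normal(m)
    x_tilde = U @ (U.T @ x)                  # projection of x into R(U)
    # Projecting a second time changes nothing -- the hallmark of a projector
    print(np.allclose(U @ (U.T @ x_tilde), x_tilde))   # True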

2.1.2 The Eigendecomposition (ED) of a Square Symmetric


Matrix

The eigendecomposition is a very useful tool in many branches of engineering


analysis. While it may only be applied to square symmetric matrices, that is not too limiting a restriction in practice, since many matrices that are of interest in signal processing and related disciplines fit into this category.

Let A ∈ Rn×n be symmetric. Then, for eigenvalues λi and eigenvectors vi ,


we have

Avi = λi vi , i = 1, . . . , n. (2.16)

Let v i be an eigenvector of A having arbitrary 2–norm. Then, using Prop-


erty 5, we can normalize v i to unit 2-norm by replacing it with the quantity
v i /c, where c = ||v i ||2 . We therefore assume all the eigenvectors have been
normalized in this manner.

Then these n equations can be combined, or stacked side–by–side together,


and represented in the following compact form:

AV = VΛ (2.17)

where V = [v1 , v2 , . . . , vn ] (i.e., each column of V is an eigenvector), and
 
Λ = diag(λ1 , λ2 , . . . , λn ). (2.18)

Eq. (2.17) may be verified by realizing that corresponding columns from


each side of this equation correspond to one specific value of the index i
in (2.16). Because we have assumed A is symmetric, from Property 1, the
v i are orthogonal. Furthermore, since we have chosen ||v i ||2 = 1, V is an
orthonormal matrix. Thus, post-multiplying both sides of (2.17) by VT ,
and using V V T = I we get

A = VΛVT . (2.19)

Eq. (2.19) is called the eigendecomposition (ED) of A. The columns of V


are eigenvectors of A, and the diagonal elements of Λ are the corresponding
eigenvalues. Any square symmetric matrix may be decomposed in this way.
This form of decomposition, with Λ being diagonal, is of extreme inter-
est and has many interesting consequences. It is this decomposition which
leads directly to the concepts behind principal component analysis, which
we discuss shortly.

Note that from (2.19), knowledge of the eigenvalues and eigenvectors of A


is sufficient to completely specify A. Note further that if the eigenvalues are
distinct, then the ED is unique.

Eq. (2.19) can also be written as

VT AV = Λ.

Since Λ is diagonal, we say that the orthonormal matrix V of eigenvectors


diagonalizes A. No other orthonormal matrix can diagonalize A. The fact
that only V diagonalizes A is a very important property of eigenvectors.
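In practice the eigendecomposition is obtained from a symmetric eigensolver rather than from the characteristic polynomial. The sketch below (numpy assumed; the test matrix is random) verifies (2.17)–(2.19) numerically:

    import numpy as np

    rng = np.random.default_rng(4)
    B = rng.standard_normal((4, 4))
    A = B + B.T                              # a square symmetric matrix

    lam, V = np.linalg.eigh(A)               # symmetric eigensolver; V has orthonormal columns
    Lam = np.diag(lam)

    print(np.allclose(V @ Lam @ V.T, A))     # True:  A = V Lambda V^T
    print(np.allclose(V.T @ A @ V, Lam))     # True:  V diagonalizes A
    print(np.allclose(V.T @ V, np.eye(4)))   # True:  V is orthonormal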

2.1.3 Conventional Notation on Eigenvalue Indexing

Let A ∈ Rn×n be symmetric and have rank r ≤ n. Then, as we see in the next section, A has r non-zero eigenvalues and n − r zero eigenvalues. It

is common convention to order the eigenvalues so that

|λ1 | ≥ |λ2 | ≥ . . . ≥ |λr | > λr+1 = . . . = λn = 0, (2.20)

with r nonzero eigenvalues followed by n − r zero eigenvalues;

i.e., we re–order the columns of V in (2.19) so that λ1 is the largest in abso-


lute value, with the remaining nonzero eigenvalues arranged in descending
order, followed by n − r zero eigenvalues. Note that if A is full rank, then
r = n and there are no zero eigenvalues. The quantity λn is the eigenvalue
with the smallest absolute value.

The eigenvectors are reordered to correspond with the ordering of the eigen-
values. For notational convenience, we refer to the eigenvector corresponding
to the largest eigenvalue as the “largest eigenvector” or “principal eigenvec-
tor”. The “smallest eigenvector” is then the eigenvector corresponding to
the smallest eigenvalue.

2.1.4 The Eigendecomposition in Relation to the Fundamen-


tal Matrix Subspaces

In this section, we develop relationships between the eigendecomposition of


a matrix and its range, null space and rank.

Here we discuss only the case where A is square and symmetric, and we
entertain the possibility that the matrix A may be rank deficient; i.e., we
are given that rank(A) = r ≤ n. We write the eigendecomposition of A as
A = V ΛV T . We partition V and Λ in the following block formats:

V = [ V 1 V 2 ],

where

V 1 = [v 1 , v 2 , . . . , v r ] ∈ Rn×r ,
V 2 = [v r+1 , . . . , v n ] ∈ Rn×(n−r) .

The columns of V 1 are eigenvectors corresponding to the first r larger (in


magnitude) eigenvalues of A, and the columns of V 2 are eigenvectors cor-

responding to the n − r smallest eigenvalues. We also have
 
Λ = [ Λ1 0 ; 0 Λ2 ],

where Λ1 is an r × r diagonal block containing r non-zero eigenvalues, and Λ2 is an (n − r) × (n − r) diagonal block containing the remaining eigenvalues. The eigendecomposition of A may therefore be written in block form as:

A = [ V 1 V 2 ] [ Λ1 0 ; 0 Λ2 ] [ V T1 ; V T2 ]. (2.21)

In the notation used above, the explicit absence of a matrix element in an


off-diagonal position implies that element is zero. In the next section we
show the partitioning of A in the form of (2.21) reveals a great deal about
the structure of the matrix.

Nullspace and Range

Recall from Ch. 1 that the null space N (A) of A is defined as

N (A) = {x ∈ Rn | Ax = 0}


for nonzero x5 . We therefore investigate what values of x result in Ax =


0. We substitute (2.21) for A to obtain

[ V 1 V 2 ] [ Λ1 0 ; 0 Λ2 ] [ V T1 ; V T2 ] x = 0. (2.22)

Define

c = [ c1 ; c2 ] = V T x = [ V T1 ; V T2 ] x, (2.23)

where c1 ∈ Rr and c2 ∈ Rn−r . We rewrite Ax = 0 in the form

[ V 1 V 2 ] [ Λ1 0 ; 0 Λ2 ] [ c1 ; c2 ] = 0. (2.24)
5
With regards to this discussion on nullspace, the value of r in (2.21) is not necessarily
taken to be rank(A). The value of r being equal to rank is established in the discussion
on range, to follow.

Note that the expression Ax = 0 can also be written as Ax = 0x, which implies that A has at least one zero eigenvalue. Thus this expression can only be satisfied for x ≠ 0 (and therefore also c ≠ 0) iff some of the diagonal elements of Λ are zero. Since by definition Λ1 contains the eigenvalues of largest magnitude, we put all the non–zero eigenvalues in Λ1 and the zeros in Λ2 . Then (2.24) is satisfied for nonzero x if c1 = 0 and c2 ≠ 0. From (2.23), this implies that x ∈ R(V 2 ). Thus, V 2 is an
orthonormal basis for N (A).

We now turn our attention to range. Recall from (1.9) that R(A) is defined
as
R(A) = {y ∈ Rm | y = Ax, for x ∈ Rn } .
From the fact that Λ2 = 0 and using (2.24), the expression y = Ax becomes
y = V 1 Λ1 c1 . (2.25)
The vector y spans an r–dimensional space as c1 varies throughout its re-
spective universe iff V 1 consists of r columns and Λ1 contains r non–zero
diagonal elements. Since we have defined rank as the dimension of R(A),
V 1 must have exactly r columns and Λ1 must contain r non–zero values.
Further, it is seen that V 1 is an orthonormal basis for R(A).

From the partition of (2.21), if V 1 has r columns then V 2 must have n − r


columns. Since V 2 is an orthonormal basis for N (A), the nullity of A is
n − r. This is the justification for the relation (1.12) from Ch.1, repeated
here for convenience:
rank(A) + nullity(A) = n.
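These relationships are easy to check numerically. In the sketch below (numpy assumed), a symmetric matrix of rank r = 2 is constructed deliberately; the eigenvectors associated with the (numerically) zero eigenvalues span N (A), and any product Ax has no component along them:

    import numpy as np

    rng = np.random.default_rng(5)
    n, r = 5, 2

    B = rng.standard_normal((n, r))
    A = B @ B.T                              # symmetric, rank r

    lam, V = np.linalg.eigh(A)
    idx = np.argsort(-np.abs(lam))           # order as in (2.20): largest magnitude first
    lam, V = lam[idx], V[:, idx]

    V1, V2 = V[:, :r], V[:, r:]
    print(np.round(lam, 10))                 # r nonzero eigenvalues, n - r (numerically) zero
    print(np.allclose(A @ V2, 0, atol=1e-10))        # True: V2 is a basis for the nullspace

    x = rng.standard_normal(n)
    y = A @ x
    print(np.allclose(V2.T @ y, 0, atol=1e-10))      # True: y = Ax lies in span(V1) = R(A)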

Diagonalizing a system of equations

We now show that transforming the variables involved in a linear system


of equations into the eigenvector basis diagonalizes the system of equations.
We are given a system of equations Ax = b, where A is assumed square and
symmetric, and hence can be decomposed using the eigendecomposition as
A = V ΛV T . Thus our system can be written in the form
Λc = d, (2.26)
where c = V T x and d = V T b. Since Λ is diagonal, (2.26) has a much
simpler structure than the original form.
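A minimal sketch of this idea (numpy assumed; the matrix is an arbitrary symmetric one, shifted by a multiple of the identity so that it is comfortably nonsingular):

    import numpy as np

    rng = np.random.default_rng(6)
    B = rng.standard_normal((4, 4))
    A = B + B.T + 10 * np.eye(4)     # symmetric and (here) nonsingular
    b = rng.standard_normal(4)

    lam, V = np.linalg.eigh(A)

    d = V.T @ b                      # transform the right-hand side:  Lambda c = d
    c = d / lam                      # trivial to solve because Lambda is diagonal
    x = V @ c                        # transform the solution back:  x = V c

    print(np.allclose(A @ x, b))     # True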

2.2 An Alternate Interpretation of Eigenvectors

Now that we have the requisite background in place, we can present an alter-
native to the classical interpretation of eigenvectors that we have previously
discussed. This alternate interpretation is very useful in the science and en-
gineering contexts, since it sets the stage for principal component analysis,
which is a widely used tool in many applications in signal processing.

Consider the m × n matrix X as above, whose ith row is xTi and where we
have assumed zero mean, as before. Let θi = xTi q, where q is a vector with
unit 2–norm to be determined. We address the question “What is the q so
that the variance E(θi )2 = E(xTi q)2 is maximum when taken over all values
of i?” The quantity θi = xTi q is the projection of the ith observation xi
onto the unit–norm vector q. The problem at hand is therefore equivalent
to determining the direction for which these projections have maximum
variation, on average. For example, with reference to Fig. 2.2, it is apparent
that the direction along the [1 1]T axis corresponds to the solution in this
case.

We can state the problem more formally as


q ∗ = arg maxq E(xTi q)2
= arg maxq E(q T xi xTi q)
= arg maxq q T E(xi xTi )q
= arg maxq q T Rx q subject to ||q||22 = 1. (2.27)

where Rx = E(xi xTi ) is the covariance matrix of x. The constraint ||q||22 = 1 is necessary to prevent ||q|| from growing to infinity.

Eq.(2.27) is a constrained optimization problem which may be solved by


the method of Lagrange multipliers6 . Briefly, this method may be stated
as follows. Given some objective function G(x) which we wish to minimize
or maximize with respect to x, subject to an equality constraint function
f (x) = 0, we formulate the Lagrangian function L(x) which is defined as
L(x) = G(x) + λf (x),
6
There is a good description of the method of Lagrange multipliers in Wikipedia.

where λ is referred to as a Lagrange multiplier, whose value is to be deter-
mined. The constrained solution x∗ then satisfies

dL(x∗ )/dx = 0.
Applying the Lagrange multiplier method to (2.27), the Lagrangian is given
by
L(q) = q T Rx q + λ(1 − q T q).
As shown in the appendix of this chapter, and from [6]7 , the derivative of
the first term (when the associated matrix is symmetric) is 2Rx q. It is
straightforward to show that with respect to the second term,
d/dq (1 − q T q) = −2q.

Therefore the stationarity condition dL(q)/dq = 0 in this case becomes

2Rx q − 2λq = 0

or
Rx q = λq. (2.28)
We therefore have the important result that the stationary points of the
constrained optimization problem (2.27) are given by the eigenvectors of
Rx . Thus, the vector q ∗ onto which the observations should be projected
for maximum variance is given as the principal eigenvector v 1 of the matrix
Rx . This direction coincides with the major axis of the scatterplot ellipse.
The direction which results in minimum variance of θ is the smallest eigen-
vector v n . Each eigenvector aligns itself along one of the major axes of the
corresponding scatterplot ellipse. In the practical case where only a finite
quantity of data is available, we replace the expected value Rx in (2.27)
with its finite sample approximation given by (2.33).

As an example, the principal eigenvector of R1 in the example of Fig. 2.2


is (1/√2) [1, 1]T . We see from Fig. 2.2 that this is indeed the direction of
maximum variation of θ.
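The following simulation sketch (numpy assumed; the covariance matrix R1 and the sample size are taken to match the scatterplot example discussed in the next section) confirms that the principal eigenvector of the sample covariance matrix is the direction of maximum variance, and that the variance of the projections equals the largest eigenvalue:

    import numpy as np

    rng = np.random.default_rng(7)
    m = 100000

    R1 = np.array([[1.0, 0.8],
                   [0.8, 1.0]])
    # Mean-centred observations with covariance R1; rows of X are the x_i^T
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=R1, size=m)

    Rhat = X.T @ X / m               # finite-sample estimate of Rx, as in (2.33)
    lam, V = np.linalg.eigh(Rhat)
    q = V[:, np.argmax(lam)]         # principal eigenvector

    print(q)                         # approximately +-[1, 1]/sqrt(2)
    theta = X @ q                    # projections of the observations onto q
    print(theta.var(), lam.max())    # the two variances agree (about 1.8 here)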

7
There is a link to this document on the course website.

2.3 Covariance and Covariance Matrices

Here we “change gears” for a while and discuss the idea of covariance, which
is a very important topic in any form of statistical analysis and signal pro-
cessing. In Section 2.5, we combine the topics of eigen–analysis and covari-
ance matrices. The definitions of covariances vary somewhat across various
books and articles, but the fundamental idea remains unchanged. We start
the discussion with a review of some fundamental definitions.

We are given two scalar random variables x1 and x2 . Recall that the mean
µ and variance σ 2 of a random variable are defined as

µ = E(x) and
σ 2 = E(x − µ)2 ,

where E is the expectation operator. The covariance σ12 and correlation ρ12
between the variables x1 and x2 are defined respectively as

σ12 = E(x1 − µ1 )(x2 − µ2 ) and
ρ12 = E(x1 − µ1 )(x2 − µ2 ) / (σ1 σ2 ),
where subscripts 1 and 2 refer to the random processes x1 and x2 respectively
and σ denotes standard deviation. It is straightforward to show that −1 ≤
ρ12 ≤ 1. Expectations are only an abstract concept, since they require an
infinite sample of data to evaluate. In the practical case where only a finite
sample of length n is available, we replace the expectation operator with an
average over the available samples to obtain an estimate of the respective
quantity. The fact it is an estimate is denoted by placing a hat; i.e., (ˆ·) over
the respective quantity. Thus we have8 :
µ̂ = (1/n) Σ_{i=1}^{n} x(i)

σ̂ 2 = (1/n) Σ_{i=1}^{n} (x(i) − µ̂)2

σ̂12 = (1/n) Σ_{i=1}^{n} (x1 (i) − µ̂1 )(x2 (i) − µ̂2 ).

8
In some cases in the literature, the 1/n terms are replaced with 1/(n − 1).

We often dispense with the hat notation since in most cases the context is
clear. The hats are used only if necessary to avoid ambiguity. The above
equations can be written in a more compact manner if we assemble the available samples into a vector x ∈ Rn . In this case we can write

σj2 = (1/n) (xj − µj )T (xj − µj ), j ∈ [1, 2] (2.29)

σ12 = (1/n) (x1 − µ1 )T (x2 − µ2 ). (2.30)
where in the above we define subtraction of a scalar from a vector as the
scalar being subtracted from each element of the vector.

Because the presentation is easier in the case where the means are zero, and
because most of the real–life signals we deal with do have zero mean, from
this point onwards, we assume either that the means of the variables are
zero, or that we have subtracted the means as a pre–processing step – i.e.,
xj ← xj − µj . Variables of this sort are referred to as mean centered data.

We offer an example to help in the interpretation of covariances and related


topics. Let x1 be a person’s mean–centered height and x2 be their corre-
sponding mean–centered weight. Then with respect to a particular person,
if x1 is positive (above the mean value), then it is more likely that x2 is
also positive, and vice–versa. Thus it is likely that x1 and x2 both have
the same sign and therefore the product x1 x2 is most often positive when
evaluated over many distinct people. Occasionally we encounter a short,
stout individual, or a tall, slender person, and in this case the product x1 x2
would be negative. By taking the expectation σ12 = E(x1 x2 ) over all pos-
sible people, on average we will obtain a positive value, i.e., in this case
the covariance σ12 is positive. We express this idea by saying that these
variables are “positively correlated”.

Another example is where we let x1 be the final mark a person receives in


a specific course, and x2 remains as their weight. Then, since the variables
appear to be unrelated we would expect that the product x1 x2 is positive as
often as it is negative, with equal average magnitudes. In this case, σ12 → 0.
Here we say the variables are uncorrelated.

The final example is a bit contrived, but it nevertheless illustrates the point.
Here we assume x1 is the maximum speed at which a person can run and
again x2 remains as the person’s corresponding weight. Then generally,


Figure 2.2. Figures 2.2 – 2.4: Scatter plots for different cases of random vectors [x1 x2 ]T
for different values of covariance, for mean–centered data. Fig. 2.2: covariance σ12 =
+0.8. Fig. 2.3: covariance σ12 = 0, and Fig. 2.4: covariance σ12 = −0.8. The axes
are normalized to zero mean and standard deviation = 1. Each point in each figure
represents one observation [x1 , x2 ]T of the random vector x. In each figure there are 1000
observations.

the greater the person’s weight, the slower they run. So in this case, the
variables most often have opposite signs, so the covariance σ12 is negative. A
non-zero covariance between two random variables implies that one variable
affects the other, whereas a zero covariance implies there is no (first–order)
relationship between them.

The situations corresponding to these three examples are depicted in what


is referred to as a scatter plot. Each point in the scatterplots correspond to
a single measurement of the two variables x = [x1 , x2 ]. The scatterplot is
characteristic of the underlying probability density function (pdf), which in
this example is a multi–variate, Gaussian distribution. The variables have
been normalized to zero mean and unit standard deviation. We see that
when the covariance is positive as in Fig. 2.2, (here the covariance has the
value +0.8), the effect of the positive covariance is to cluster the samples
within an ellipse which is oriented along the direction [1, 1]T . When the
covariance is zero (Fig. 2.3), there is no directional discrimination when the
variances are equal. When the covariance is negative, the samples are again
clustered within an ellipse, but oriented along the direction [1, −1]T .

In the example of Fig. 2.2, the effect of a positive covariance between the
variables is to cause the respective scatterplot to become elongated along
the major axis, which in this case is along the direction [1, 1]T , where the


Figure 2.3. Covariance σ12 = 0


Figure 2.4. Covariance σ12 = −0.8

elements have the same sign. Note that the direction [1, 1]T coincides with
that of the first eigenvector – see Sect. 2.2. In this case, due to the pos-
itive correlation, mean–centered observations where the height and weight
are simultaneously either larger or smaller than the means (i.e., height and
weight have the same sign) are relatively common, and therefore observa-
tions relatively far from the mean along this direction have a relatively high
probability, and so the variance of the observations along this direction is
relatively high. On the other hand, again due to the positive correlation,
mean–centered observations along the direction [1, −1]T (i.e., where height
and weight have opposite signs) that are far from the mean have a lower
probability, with the result the variance in this direction is smaller (i.e., tall
and skinny people occur more rarely than tall and correspondingly heavier
people). In cases where the governing distribution is not Gaussian, similar
behaviour persists, although the scatterplots will not be elliptical in shape.

As a further example, take the limiting case in Fig. 2.2 where σ12 → 1.
Then, the knowledge of one variable completely specifies the other, and the
scatterplot devolves into a straight line. In summary, we see that as the value
of the covariance increases from zero, the scatterplot ellipse transitions from being circular (when the variances of the variables are equal) to becoming
elliptical, with the eccentricity of the ellipse increasing with covariance until
eventually the ellipse collapses into a line as σ12 → 1.

2.3.1 Covariance Matrices

The covariance structure of a vector x ∈ Rn of random variables can be


conveniently encapsulated into a covariance matrix, Rx ∈ Rn×n . In the
more general case where the data are not mean–centered, the covariance
matrix corresponding to the (column) vector random variable x is defined
as:

Rx = E[(x − µ)(x − µ)T ] (2.31)

= E [ (x1 − µ1 )(x1 − µ1 )  (x1 − µ1 )(x2 − µ2 )  · · ·  (x1 − µ1 )(xn − µn ) ;
      (x2 − µ2 )(x1 − µ1 )  (x2 − µ2 )(x2 − µ2 )  · · ·  (x2 − µ2 )(xn − µn ) ;
      · · · ;
      (xn − µn )(x1 − µ1 )  (xn − µn )(x2 − µ2 )  · · ·  (xn − µn )(xn − µn ) ] ,

where µ is the vector mean.

We recognize the diagonal elements as the variances σ12 . . . σn2 of the elements
x1 . . . xn respectively, and the (i, j)th off–diagonal element as the covariance
σij between xi and xj . Since multiplication is commutative, σij = σji and
so Rx is symmetric. It is also apparent that covariance matrices are square.
It therefore follows that its eigenvectors are orthogonal.

Let's say we have m available samples of the vector xi , i = 1, . . . , m. To


form a finite–sample estimate R̂x of Rx , we formulate a matrix X ∈ Rm×n .
The ith row xTi of X contains all n variables (height and weight in our
previous example) corresponding to the ith observation (person), whereas
the jth column xj of X contains all m values of the jth variable. (In our
previous examples, m = 1000 and n = 2). The matrix X formulated in
this way is the standard format for organizing data in machine learning and
regression problems.

By replacing expectations by arithmetic averages, it follows from (2.31) that


R̂ can be evaluated by forming
R̂ = (1/m) Σ_{i=1}^{m} (xi − µ)(xi − µ)T . (2.32)

Eq. (2.32) is a sum of outer products. Using the outer-product representation of matrix multiplication for the mean–centered case, we also have

R̂ = (1/m) Σ_{i=1}^{m} xi xiT = (1/m) X T X. (2.33)

Note that the rows xTi of X in the last term (2.33) are the transpose of the
xi ’s in the middle term. It would perhaps be more straightforward if we
represented X in its transposed version where each xi forms a column of
X. Then X would be an n × m matrix where each column is an observation
and there would be a more direct correspondence between the summation
term in the middle and the outer product term on the right. However the
formulation in (2.33) is the only way to abide by the common conventions
that vectors are columns and that X is formulated so that its rows consist
of observations.

From (2.33) it is apparent that every element of R̂ is an instance of either


(2.29) or (2.30), depending on whether the respective element is on the main
diagonal or not.
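As a concrete sketch of these conventions (numpy assumed; the mean vector and covariance used to generate the synthetic "height and weight" data are invented purely for illustration):

    import numpy as np

    rng = np.random.default_rng(8)
    m, n = 1000, 2

    # Hypothetical raw data: each row is one observation (one person's height and weight)
    X_raw = rng.multivariate_normal([170.0, 70.0], [[80.0, 40.0], [40.0, 60.0]], size=m)

    mu = X_raw.mean(axis=0)          # vector mean: one entry per variable (column)
    X = X_raw - mu                   # mean-centred data matrix

    R_hat = X.T @ X / m              # finite-sample covariance estimate, eq. (2.33)
    print(R_hat)
    print(np.cov(X_raw, rowvar=False, bias=True))   # the same result from numpy's built-in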

Note that m ≥ n for R̂ to be full rank. The covariance matrices for each of
our three examples discussed above are given as
     
R1 = [ 1 +0.8 ; +0.8 1 ], R2 = [ 1 0 ; 0 1 ], and R3 = [ 1 −0.8 ; −0.8 1 ].

Note that 1’s on each diagonal result because the variances of x1 and x2
have been normalized to unity.

2.4 Covariance Matrices of Stationary Time Series

Here, we extend the concept of covariance matrices to a discrete–time, one-


dimensional, random stationary time series, a.k.a. a random process, rep-
resented by x[k]. The idea in this case is similar to that of the previous
discussion, except the variables involved are sequential samples from a time
series. The objective is to characterize the covariance structure of the time
series over an interval (window) of n samples. In this vein we form the row
vectors xTi ∈ Rn , which consist of n sequential samples of the time series
within a window of length n, as shown in Fig. 2.5. The value of n is to
be determined by the application at hand. The windows generally overlap;
in fact, they are typically displaced from one another by only one sample.
Hence, the vector corresponding to each window is a vector sample from the
random process x[k]. Examples of random processes are a speech signal, or
the signal received from an electroencephalogram (EEG) electrode, noise, or
the signal transmitted from a cell phone, etc.

The word stationary as used above means the random process is one for
which the corresponding joint n–dimensional probability density function
describing the distribution of the vector sample xT does not change with
time. This means that all moments of the distribution (i.e., quantities such
as the mean, the variance, and all covariances, (as well as all other higher–
order statistical characterizations) are invariant with time. Here however,
we deal with a weaker form of stationarity referred to as wide–sense sta-
tionarity (WSS). With these processes, only the first two moments (mean,
variances and covariances) need be invariant with time. Strictly, the idea of
a covariance matrix is only relevant for stationary or WSS processes, since
expectations only have meaning if the underlying process is stationary. How-
ever, we see later that this condition can be relaxed in an approximate sense


Figure 2.5. The received signal x[k] is decomposed into windows of length n. The samples
in the ith window comprise the vector xi , i = 1, 2, . . . m.


Figure 2.6. A sample of a white Gaussian discrete–time process, with mean µ = 1 and
variance σ 2 = 1.

in the case of a slowly–varying non–stationary signal, if the expectation is
replaced with a time average over an interval over which the signal does not
vary significantly.

Here, as in (2.31), Rx corresponding to a mean–centered, stationary or WSS


process x[k] is defined as
 
Rx ∈ Rn×n = E(xi xiT ) = E [ x1 x1  x1 x2  · · ·  x1 xn ;
                             x2 x1  x2 x2  · · ·  x2 xn ;
                             · · · ;
                             xn x1  xn x2  · · ·  xn xn ] . (2.34)

Taking the expectation over all windows, eq. (2.34) tells us that the element
r(1, 1) of Rx is by definition E(x21 ), which is the variance of the first element
x1 over all possible vector samples xi of the process. But because of station-
arity, r(1, 1) = r(2, 2) = . . . , = r(n, n) which are all equal to σx2 . Thus all
main diagonal elements of Rx are equal to the variance of the process. The
element r(1, 2) = E(x1 x2 ) is the covariance between the first element of xi
and its second element. Taken over all possible windows, we see this quantity
is the covariance of the process and itself delayed by one sample. Because of
stationarity, the elements r(1, 2) = r(2, 3) = . . . = r(n − 1, n) and hence all
elements on the first upper diagonal are equal to the covariance for a time-
lag of one sample. Since multiplication is commutative, r(2, 1) = r(1, 2),
and therefore all elements on the first lower diagonal are also all equal to
this same cross-correlation value. Using similar reasoning, all elements on
the jth upper or lower diagonal are all equal to the covariance value of the
process for a time lag of j samples. A matrix with equal elements along any
diagonal is referred to as Toeplitz.

If we compare the process shown in Fig. 2.5 with that shown in Fig. 2.6, we
see that in the former case the process is relatively slowly varying. Because
we have assumed x[k] to be mean–centered, adjacent samples of the process
in Fig. 2.5 will have the same sign most of the time, and hence E(xi xi+1 )
will be a positive number, coming close to the value E(x2i ). The same can
be said for E(xi xi+2 ), except it is not quite so close to E(x2i ). Thus, we see
that for the process of Fig. 2.5, the diagonals decay fairly slowly away from
the main diagonal value.

However, for the process shown in Fig. 2.6, adjacent samples are uncorre-
lated with each other. This means that adjacent samples are just as likely

to have opposite signs as they are to have the same signs. On average, the
terms with positive values have the same magnitude as those with negative
values. Thus, when the expectations E(xi xi+1 ), E(xi xi+2 ) . . . are taken, the
resulting averages approach zero. In this case, we see the covariance matrix
concentrates around the main diagonal, and becomes equal to σx2 I. We note
that all the eigenvalues of Rx are equal to the value σx2 . Because of this
property, such processes are referred to as “white”, in analogy to white light,
whose spectral components are all of equal magnitude.

When x[k] is stationary, the sequence {r(1, 1), r(1, 2), . . . , r(1, n)} is the au-
tocorrelation function of the process, for lags 0 to n−1. In the Gaussian case,
the process is completely characterized9 by the autocorrelation function. In
fact, it may be shown [7] that the Fourier transform of the autocorrelation
function is the power spectral density of the process. Further discussion on
this aspect of random processes is beyond the scope of this treatment; the
interested reader is referred to the reference.

In practice where only a finite number of observations are available, we form


the matrix X ∈ Rm×n as in (2.33), whose rows are the xTi formed from the
windows of data as in Fig. 2.5. Then, an estimate R̂x of Rx which represents
the stationary random process is formed in an exactly analogous manner to
(2.33):
1
R̂x = X T X. (2.35)
m
A method [8] for modifying R̂ to track mildly non–stationary environments
is outlined in Problem 10.
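The sketch below (numpy assumed; a short moving–average filter is used simply to manufacture a correlated stationary process) forms the windowed data matrix of Fig. 2.5 and estimates Rx as in (2.35); the estimate is seen to be approximately Toeplitz:

    import numpy as np

    rng = np.random.default_rng(9)
    N, n = 50000, 4

    w = rng.standard_normal(N)
    x = np.convolve(w, np.ones(5) / 5.0, mode="valid")   # a coloured (correlated) process

    # Rows of X are length-n windows displaced by one sample, as in Fig. 2.5
    X = np.lib.stride_tricks.sliding_window_view(x, n)
    m = X.shape[0]

    R_hat = X.T @ X / m              # eq. (2.35)
    print(np.round(R_hat, 3))        # nearly equal values down each diagonal (Toeplitz)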

Some Properties of Rx :

1. Rx is square and symmetric. In the complex case, it is Hermitian symmetric; i.e., rij = r∗ji , where ∗ denotes complex conjugation.
2. If the process x[k] is stationary or wide-sense stationary, then Rx is
Toeplitz. This means that all the elements on a given diagonal of the
matrix are equal. If you understand this property, then you have a
good understanding of the nature of covariance matrices.
3. If Rx is diagonal with equal elements, then the elements of x are
uncorrelated and x is said to be a white process. If the magnitudes of
9
By “characterizing” a random process, we mean that its joint probability density
function is completely specified.

[Figure 2.7 depicts K incident plane waves of wavelength λ, arriving at angles θ1 , . . . , θK measured from the array normal, impinging on a linear array of sensors (antennas, etc.) with inter–element spacing d.]

Figure 2.7. Physical description of incident signals onto an array of sensors.

the off-diagonal elements of Rx are significant with respect to those


on the main diagonal, the process is said to be correlated or coloured.
4. Rx is positive semi–definite. This implies that all the eigenvalues are
greater than or equal to zero. We will discuss positive definiteness and
positive semi–definiteness in Ch. 4.

2.5 Examples of Eigen–Analysis with Covariance


Matrices

So far in this chapter, we have discussed eigenvalues, eigenvectors and covari-


ance matrices in some detail. We now present two practical examples from
signal processing where eigen–analysis is performed on covariance matrices
to considerable advantage. These examples are i) Array Signal Processing,
and ii) Principal Component Analysis (PCA).

2.5.1 Array Processing

A major objective of array signal processing is to estimate directions of


arrival of plane waves onto arrays of sensors. Here we present the “MUSIC”

(MUltiple SIgnal Classification) algorithm [9] for this purpose. A broader
treatment of the array signal processing field is given in [10].

Consider a linear array of M sensors (e.g., antennas or microphones) as


shown in Fig. 2.7. Let there be K < M plane waves incident onto the array
as shown. Assume for the moment that the amplitude of the first incident
wave at the first sensor (which we can assume receives the earliest signal) is
unity. Then, from the physics shown in Fig. 2.7, each element of the signal
vector x, obtained by sampling each element of the array simultaneously, is a
time–delayed verison of the signal received at the first element. If we assume
the incident waves are narrow band, then the signal received from successive
elements are phase–shifted versions of each other, where the phase shift may
be expressed by multiplying the signal from the adjacent “earlier” sensor
by the factor ejφ . The received vector x may then be described as x =
[1, ejφ , ej2φ , . . . , ej(M −1)φ ]T , where φ is the electrical phase–shift between
adjacent elements of the array, due to the first incident wave10 , and j =

−1. When there are K incident signals, with corresponding amplitudes
ak , k = 1, . . . , K, and distinct electrical angles φ1 , . . . , φK , the effects of
the K incident signals each add linearly together, each weighted by the
corresponding amplitude ak , to form the received signal vector x. The
resulting received signal vector at time n, including the noise can then be
written in the form

xn = S an + wn , n = 1, . . . , N, (2.36)

with dimensions xn , wn ∈ CM ×1 , S ∈ CM ×K , and an ∈ CK×1 ,

where

wn = M -length noise vector at time n whose elements are uncorrelated random variables with zero mean and variance σ 2 , i.e., cov(wn ) = σ 2 I. The vector wn is assumed uncorrelated with the signal.

S = [s1 . . . sK ]

sk = [1, ejφk , ej2φk , . . . , ej(M −1)φk ]T , k = 1, . . . , K, are referred to as steer-


ing vectors.

10
It may be shown that if d ≤ λ/2, then there is a one–to–one relationship between the electrical angle φ and the corresponding physical angle θ. In fact, φ = (2πd/λ) sin θ. If d ≤ λ/2, then θ can be inferred from φ.
φk , k = 1, . . . , K are the electrical phase–shift angles corresponding to the
incident signals. The φk are assumed to be distinct.

an = [a1 . . . aK ]T is a vector of uncorrelated random variables, describing


the amplitudes of each of the incident signals at time n.

The MUSIC algorithm requires that K < M . Before we discuss the imple-
mentation of the MUSIC algorithm per se, we analyze the covariance matrix
R of the received signal x:

R = E(xxH ) = E[(Sa + w)(aH SH + wH )]
= SE(aaH )SH + σ 2 I
= SAS H + σ 2 I (2.37)

where A = E(aaH ). The second line follows because the noise is uncorre-
lated with the signal, thus forcing the cross–terms to be zero. In the last
line of (2.37) we have also used that fact that the covariance matrix of the
noise contribution (second term) is σ 2 I. This follows because the noise is
assumed white. We refer to the first term of (2.37) as Ro , which is the con-
tribution to the covariance matrix due only to the signal component. The
matrix A ∈ CK×K is full rank if the incident signals are not fully correlated.
In this case, Ro ∈ CM ×M is rank K < M . Therefore Ro has K non-zero
eigenvalues and M − K zero eigenvalues.

From the definition of an eigenvector corresponding to a zero eigenvalue, we have

Ro vi = 0, or SAS H vi = 0, i = K + 1, . . . , M.

Since A and S are assumed full rank, we must have

S H V N = 0, (2.38)

where V N = [v K+1 , . . . , v M ]. (These eigenvectors are referred to as the


noise eigenvectors. More on this later). It follows that if φo ∈ [φ1 , . . . , φK ],
then h i
1, ejφo , ej2φo , . . . , ej(M −1)φo V N = 0. (2.39)
Up to now, we have considered only the noise–free case. What happens
when the noise component σ 2 I is added to Ro to give R in (2.37)? From

Property 3 of this chapter, we see that if the eigenvalues of Ro are λi , then
because here we dealing with expectations and because the noise is white,
those of R are λi + σ 2 . The eigenvectors remain unchanged with the noise
contribution, and therefore (2.39) still holds when noise is present, under
the current assumptions.

With this background in place we can now discuss the implementation of


the MUSIC algorithm for estimating directions of arrival φ1 . . . φK of plane
waves incident onto arrays of sensors, given a finite number N of snapshots.
We simultaneously sample all array elements at N distinct points in time
to obtain vector samples xn ∈ CM ×1 , n = 1, . . . , N . Our objective is to
estimate the directions of arrival φk of the plane waves relative to the array,
by observing only the received signal. The basic idea follows from (2.39),
where s(φ) ⊥ V N iff φ = φo . We form an estimate R̂ of R based on the
available snapshots as follows:
R̂ = (1/N ) Σ_{n=1}^{N} xn xnH ,

and then extract the M − K noise eigenvectors V N , which are those associ-
ated with the smallest M − K eigenvalues of R̂. Because of the finite N and
the presence of noise, (2.39) only holds approximately for the true φo . Thus,
a reasonable estimate of the desired directions of arrival may be obtained
by finding values of the variable φ for which the expression on the left of
(2.39) is small instead of exactly zero. Thus, we determine K estimates φ̂
which locally satisfy
φ̂ = arg minφ ||sH (φ)V̂ N ||22 . (2.40)

By convention, it is desirable to express (2.40) as a spectrum–like function,


where a peak instead of a null represents a desired signal. Thus, the MUSIC
“spectrum” P (φ) is defined as:

P (φ) = 1 / ( sH (φ) V̂ N V̂ H N s(φ) ).

It will look something like what is shown in Fig. 2.8 for the case of K = 2 incident
signals. Estimates of the directions of arrival φk are then taken as the peaks
of the MUSIC spectrum.
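The following is a small end–to–end simulation sketch of the procedure just described (numpy assumed; the number of sensors, the electrical angles, the noise level and the number of snapshots are invented illustration values, not taken from the text):

    import numpy as np

    rng = np.random.default_rng(10)
    M, K, N = 8, 2, 2000                     # sensors, incident signals, snapshots
    phis = np.array([0.6, 1.4])              # true electrical angles (radians)
    sigma = 0.1                              # noise standard deviation

    def steering(phi, M):
        return np.exp(1j * phi * np.arange(M))

    S = np.column_stack([steering(p, M) for p in phis])                 # M x K
    amps = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
    noise = sigma * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    X = S @ amps + noise                     # columns are the snapshots x_n

    R_hat = X @ X.conj().T / N
    lam, V = np.linalg.eigh(R_hat)           # ascending eigenvalues
    V_N = V[:, :M - K]                       # noise eigenvectors (smallest M - K eigenvalues)

    phi_grid = np.linspace(-np.pi, np.pi, 2048)
    P = np.array([1.0 / np.linalg.norm(steering(p, M).conj() @ V_N) ** 2 for p in phi_grid])

    # Crude peak picking: local maxima of the spectrum, keep the two largest
    peaks = np.where((P[1:-1] > P[:-2]) & (P[1:-1] > P[2:]))[0] + 1
    top = peaks[np.argsort(P[peaks])[-2:]]
    print(np.sort(phi_grid[top]))            # estimates close to 0.6 and 1.4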


Figure 2.8. MUSIC spectrum P (φ) for the case K = 2 signals.

Signal and noise subspaces: The MUSIC algorithm opens up some in-
sight into the use of the eigendecomposition that will be of use later on. Let
us define the so–called signal subspace SS as

SS = span [v 1 , . . . , v K ] (2.41)

and the noise subspace SN as

SN = span [v K+1 , . . . , v M ] . (2.42)

From (2.37), all columns of Ro are linear combinations of S. Therefore

R(Ro ) = R(S). (2.43)

We have seen earlier in this chapter that the eigenvectors associated with the non–zero eigenvalues form a basis for R(Ro ). Therefore

R(Ro ) = SS . (2.44)

Comparing (2.43) and (2.44), we see that the columns of S lie in SS . From (2.36) we see that
any received signal vector x, in the absence of noise, is a linear combination
of the columns of S. Thus, any noise–free signal resides completely in SS .
This is the origin of the term “signal subspace”. Further, in the presence of
noise, provided N → ∞, (2.41) and (2.42) still hold and then any component
of the received signal residing in SN must be entirely due to the noise,

although noise can also reside in the signal subspace. This is the origin of
the term “noise subspace”. In the case where N is finite, the eigenvectors are
only approximations to their ideal values, and (2.41) and (2.42) hold only
approximately. This results in some leakage between the two subspaces and
(2.38) holding only approximately, resulting in some error in the estimates of
φ, as one would expect under these circumstances. We note that the signal
and noise subspaces are orthogonal complement subspaces of each other.
The ideas surrounding this example lead to the ability to de–noise a signal
in some situations, as we see in a subsequent example.

2.6 Principal Component Analysis (PCA)

PCA is the second example of how covariance matrices and eigen–analysis can be applied to real–life problems. The basic idea behind PCA is, given a set of observations xi ∈ Rn , i = 1, . . . , m, to transform each observation xi into a new basis so that as much variance as possible is concentrated into as few coefficients as possible. The motivation for this objective, as we see
shortly, is that it provides the means for data compression, can be used to
denoise a signal and also provides useful features for classification problems
in a machine learning context. In this section, we assume the process x
is slowly varying, so that a significant degree of correlation exists between
consecutive elements of xi . In this case, the n–dimensional scatterplot is a
hyperellipse, ideally with significant variation along only a few of the axes.
PCA is sometimes referred to as the Karhunen Loeve transform.

The sample covariance matrix R̂x is then evaluated in the manner discussed previously, as R̂x = (1/m) Σ_{i=1}^{m} xi xiT . The eigenvector matrix V is then extracted, where
the column ordering corresponds to the descending order of the eigenvalues.

Given this framework, the PCA idea is straightforward; we simply transform


each observation xi into the basis V . That is, the coefficients θ of the PCA
transformation are given as

θ i = V T xi . (2.45)

The motivation for using the eigenvector basis to represent x follows from
Sect. 2.2, where we have seen that the principal coefficients (i.e., the projec-
tions of xi onto the principal eigenvectors) have maximum variance. When

the elements of x are correlated, we have the result that
E(θ1 )2 ≥ E(θ2 )2 ≥ . . . ≥ E(θr )2 ≥ E(θr+1 )2 ≥ . . . ≥ E(θn )2 , (2.46)
where the [θ1 , . . . , θn ]T are the elements of θ i . This phenomenon is the key to
PCA compression. A justification of this behaviour is given in Property 10,
following. In a typical practical situation, the variances of the θ–elements
fall off quite quickly. We therefore determine a value of r such that the
coefficients θr+1 , . . . , θn are deemed negligible and so are neglected, and we
retain only the first r significant values of θ. Thus the n–length vector θ is
represented by a truncated version θ̂ ∈ Rr given by
 
θ̂ = [θ1 , . . . , θr ]T .
This set of r coefficients are the principal components. The reconstructed
version x̂ of x is then given as
x̂ = V r θ̂. (2.47)
where V r consists of the first r columns of V . We note that the entire length-n sequence x can be represented by only r ≤ n coefficients. The resulting error ε∗r = E||x − x̂||22 can be shown to be the smallest possible value with respect to the choice of basis (see Property 12, following). Thus data compression is achieved with low error. Typically in highly correlated systems, r can be significantly less than n and the compression ratios (i.e., r/n) attainable with the PCA method can be substantial.

Generally there is no consistent method for determining a value for r. The


larger the value of r, the lower the compression ratio and the lower the error,
and vice–versa. In many practical problems, the value of r is determined by
trial-and-error. It is interesting to note that as the elements of x become
more correlated, the corresponding scatterplot ellipse becomes more elon-
gated, with the result that the PCA coefficients become more concentrated
in the first few values. This has the effect either of being able to reduce r for
the same reconstruction error, or reduce the error for the same r. This makes
intuitive sense, since as the correlations between the elements increase, the
process becomes more predictable and thus easier to compress.

If we substitute the first r values of θ from (2.45) into (2.47), we have the
result
x̂ = V r V Tr x.

Since x̂ may be viewed as a projection of x into the principal component
subspace, it is apparent from the above that this projection operation is
accomplished by pre–multiplication of x by the matrix V r V Tr . This matrix
is referred to as a projector. Projectors are explored in more detail in Ch.3.
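The steps (2.45)–(2.47) are collected in the sketch below (numpy assumed; the correlated test process is manufactured with a simple moving–average filter). The empirical reconstruction error is also compared with the truncated eigenvalue sum of Property 11, which appears shortly:

    import numpy as np

    rng = np.random.default_rng(11)
    N, n, r = 20000, 8, 2

    x_series = np.convolve(rng.standard_normal(N), np.ones(8) / 8.0, mode="valid")
    X = np.lib.stride_tricks.sliding_window_view(x_series, n)    # rows are observations x_i
    m = X.shape[0]

    R_hat = X.T @ X / m
    lam, V = np.linalg.eigh(R_hat)
    idx = np.argsort(lam)[::-1]              # descending eigenvalue order
    lam, V = lam[idx], V[:, idx]

    V_r = V[:, :r]
    Theta_r = X @ V_r                        # truncated PCA coefficients, as in (2.45)
    X_hat = Theta_r @ V_r.T                  # reconstruction, eq. (2.47)

    err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    print(err, lam[r:].sum())                # essentially equal: empirical error vs. truncated eigenvalue sum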

Properties of the PCA Representation

Here we introduce several properties of PCA analysis. These properties lead


to a more rigorous framework for the understanding of this topic.

Property 8 The coefficients θ are uncorrelated.

To prove this, we evaluate the covariance matrix Rθθ of θ, using the defini-
tion (2.45) as follows:

Rθθ = E(θθ T ) = E(V T xxT V ) = V T E(xxT )V = V T Rx V = Λ. (2.48)

Since Rθθ is equal to the diagonal eigenvalue matrix Λ of Rx , the PCA


coefficients are uncorrelated.

Property 9 The variance of the ith PCA coefficient θi is equal to the ith
eigenvalue λi of Rx .

The proof follows directly from prop. (8) and the fact that the ith diagonal
element of Rθθ is the variance of θi .

From this property, we can infer that the length of the ith semi–axis of
the scatterplot ellipse is directly proportional to the square root of the ith
eigenvalue, which is equal to the variance of θi . The next property shows
that the eigenvalues indeed become smaller, and therefore so do the variances
of the θi , with increasing index i.

Property 10 The variances of the θ coefficients decrease with index as in
(2.46).

To justify this property, we consider a vector random sample x ∈ Rn with


significant correlation structure, as would occur e.g., if x were taken from
a sequence of samples of a slowly–varying random process, as previously
described. In this discussion, we assume for ease of presentation that all
correlations are positive, although the argument still holds for any combi-
nation of positive and negative correlations as long as the directions of the
scatterplot axes are adjusted accordingly.

Consider the scatterplot of Fig. 2.2, which shows variation of the samples in
the x1 − x2 axis. Due to the positive correlation between x1 and x2 , samples
where these variables have the same sign (after removal of the mean) are
more likely than the case where they have different signs. Therefore samples
farther from the mean along the principal eigenvector direction [1, 1]T have a
higher probability of occurring than those the same distance from the mean
along the axis [1, −1]T , which is the direction of the second eigenvector.
The result of this behaviour is that E(θ1 )2 ≥ E(θ2 )2 , as is evident from the
Figure. Now consider the variation in the x2 − x3 plane, where we assume
n ≥ 3. Because of stationarity, we can assume that x2 and x3 have the same
correlation structure as that of x1 and x2 . We can therefore apply the same
argument as that above to show that E(θ2 )2 ≥ E(θ3 )2 . By continuing to
apply the same argument in all n dimensions, (2.46) is justified.

We can also present an alternative argument to show that the θ–elements


must have a diversity of variance when x is a correlated process. Let us
denote the covariance matrix associated with the coloured process shown in
Fig. 2.5 as Rc , and the white process shown in Fig.2.6 as Rw . We assume
both processes are stationary, and without loss of generality we assume the
processes have equal powers. Let αi be the eigenvalues of Rc and βi be the
eigenvalues of Rw . Because Rw is diagonal with equal diagonal elements, all
the βi are equal. Our assumptions imply that the main diagonal elements of
Rc are equal to the main diagonal elements of Rw , and hence from Property
4, the trace and the eigenvalue sum of each covariance matrix are equal.

To obtain further insight into the behavior of the two sets of eigenvalues, we
consider Hadamard’s inequality [11] which may be stated as:

Consider a square matrix A ∈ Rm×m . Then, det A ≤ ∏_{i=1}^{m} aii , with equality if and only if A is diagonal.

From Hadamard's inequality, det Rc < det Rw , and so also from Property 4, ∏_{i=1}^{n} αi < ∏_{i=1}^{n} βi . Under the constraint Σi αi = Σi βi with the βi all equal, it follows that α1 > αn ; i.e., the eigenvalues of Rc are not equal. (We say the
eigenvalues become disparate). Thus, according to prop.(9), the variances
in the first few PCA coefficients of a correlated process are larger than those
in the later PCA coefficients. In practice, when x[k] is highly correlated, the
variances in the later coefficients become negligible.

Property 11 The mean–squared error ε∗r = Ei ||xi − x̂i ||22 in the PCA representation x̂i of xi using r components is given by

ε∗r = Ei ||xi − x̂i ||22 = Σ_{i=r+1}^{n} λi , (2.49)

which corresponds to the sum of the truncated eigenvalues11 .

Proof:

ε∗r = Ei ||xi − x̂i ||22 = Ei ||V θ − V θ̂||22
= E||θ − θ̂||22
= Σ_{i=r+1}^{n} E(θi )2
= Σ_{i=r+1}^{n} λi , (2.50)

where in the last line we have used prop (9) and the second line follows due
to the fact that the 2–norm is invariant to multiplication by an orthonormal
matrix.

Property 12 The eigenvector basis provides the minimum ε∗r for a given value of r.
11
We see later in Ch.4 that Rx is positive definite and so all the eigenvalues are positive.
Thus each truncated term can only increase the error.

Proof: Recall θ = V T x = [V r V 2 ]T x = [θ Tr θ T2 ]T , where θ 2 contains the coefficients which become truncated. Then from our discussion in Sect. 2.2, there is no other basis for which ||θ r ||22 is greater. Because V is an orthonormal transformation, ||θ r ||22 + ||θ 2 ||22 = ||x||22 . Since ||x||22 is invariant to the choice of basis and ||θ r ||22 is maximum, ||θ 2 ||22 = ε∗r must be minimum with respect to the choice of basis. 

We can offer an additional example illustrating the compression phenomenon.


Consider the extreme case where the process becomes so correlated that all
elements of its covariance matrix approach the same value. (This will happen
if the process x[k] is non–zero and does not vary with time. In this case, we
must consider the fact that the mean is non–zero). Then all columns of the
covariance matrix are equal, and the rank of Rx is one, and therefore only
one eigenvalue is nonzero. Then all the power of the process is concentrated
into only the first PCA coefficient and therefore all the later coefficients are
small (zero in this specific case) in comparison to the first coefficients, and
the process is highly compressible.

In contrast, in the opposite extreme when x[k] is white, the corresponding covariance matrix is Rx = σx²I. Thus all the eigenvalues are equal to σx², and none of them is small compared to the others. In this case there can be no compression (truncation) without inducing significant distortion in (2.50). Also, since we have m repeated eigenvalues, the eigenvectors are not unique. In fact, any orthonormal set of m vectors is an eigenvector basis for a white signal.
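
The eigenvalue behaviour described above is easy to verify numerically. The short Matlab sketch below (the colouring filter, window length and data sizes are arbitrary choices for illustration, and the windows are taken to be non–overlapping) builds data matrices for a white and a coloured process and compares the eigenvalues of the two sample covariance matrices: the traces agree closely, while the coloured eigenvalues become disparate.

    % Compare the eigenvalue spread of the sample covariance of a white
    % process with that of a coloured (low-pass filtered) process.
    m = 2000; n = 10;
    w = randn(m*n, 1);                    % white, zero mean, unit variance
    x = filter(ones(5,1)/sqrt(5), 1, w);  % simple moving-average colouring filter
    Xw = reshape(w, n, []).';             % each row is one length-n window
    Xc = reshape(x, n, []).';
    Rw = (Xw.'*Xw)/size(Xw,1);            % sample covariance matrices
    Rc = (Xc.'*Xc)/size(Xc,1);
    disp(sort(eig(Rw),'descend').');      % nearly equal eigenvalues
    disp(sort(eig(Rc),'descend').');      % disparate eigenvalues, similar trace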

2.6.1 Examples of PCA Analysis

Compression of a low–pass random process

Here we present a simulation example using PCA analysis, to illustrate the effectiveness of this technique. A process x[k] was simulated by passing a unit-variance, zero–mean white noise sequence w[k] through a 3rd-order low-pass digital Butterworth filter with a relatively low normalized cutoff frequency of 0.1. A white noise sample and its corresponding filtered version are shown in Fig. 2.9. One can observe that there is no relationship between successive samples of the white sequence; however, the filtered version is considerably smoother and thus successive samples are correlated.

Figure 2.9. A sample of a white noise sequence (red) and its corresponding filtered version
(blue). The white noise sequence is a Gaussian random process with µ = 0 and σ 2 = 1,
generated using the matlab command “randn”.

Vector samples xi ∈ Rn are extracted from the sequence x[k] in the manner shown in Fig. 2.5 and assembled into the rows of the data matrix X. The filter removes the high-frequency components from the input, so the resulting output process x[k] varies more slowly in time and therefore exhibits a significant covariance structure. As a result, we expect to be able to accurately represent the original signal using only a few principal eigenvector components, and so be able to achieve significant compression gains.

The sample covariance matrix R̂x was then computed from X as in (2.35)
for the value n = 10. Listed below are the 10 eigenvalues of R̂x :

Eigenvalues:
0.5468
0.1975
0.1243 × 10−1
0.5112 × 10−3
0.2617 × 10−4
0.1077 × 10−5
0.6437 × 10−7
0.3895 × 10−8
0.2069 × 10−9
0.5761 × 10−11
Inspection of the eigenvalues above indicates that a large part of the total variance is contained in the first two eigenvalues. We therefore choose r = 2. The error ε∗r for r = 2 is thus evaluated from the above data as Σ_{i=3}^{10} λi = 0.0130, which may be compared to the total eigenvalue sum of 0.7573. The normalized error is 0.0130/0.7573 = 0.0171. Because this error is low enough, only the first r = 2 components are considered significant. In this case, we have a compression gain of 10/2 = 5.
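
The entire compression experiment is easy to reproduce in a few lines of Matlab. In the sketch below, the Butterworth filter of the example is replaced by a simple moving–average low–pass filter (so the numerical eigenvalues will differ from those listed above), and the windows are taken to be non–overlapping; these are illustrative assumptions only.

    n  = 10;                                 % window length
    w  = randn(50000, 1);                    % unit-variance white noise
    x  = filter(ones(8,1)/8, 1, w);          % stand-in low-pass filter
    X  = reshape(x, n, []).';                % rows are length-n windows
    Rx = (X.'*X)/size(X,1);                  % sample covariance, as in (2.35)
    [V, D]   = eig(Rx);
    [lam, k] = sort(diag(D), 'descend');     % eigenvalues, largest first
    V        = V(:, k);
    r     = 2;
    err   = sum(lam(r+1:end))/sum(lam);      % normalized truncation error, cf. (2.49)
    Theta = X*V(:,1:r);                      % PCA coefficients for every window
    Xhat  = Theta*V(:,1:r).';                % rank-r reconstruction, as in (2.47)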

Since x̂i = V r θ̂i, i.e., x̂i is a linear combination of the columns of V r, and since x̂i is a function of time over the interval spanning the ith window, the eigenvectors themselves must represent waveforms in time. As such, Fig. 2.10 shows plots of the eigenvectors as discrete–time waveforms, with the elements of the respective v i acting as the samples. In this case, we would expect that any observation xi can be expressed accurately as a linear combination of only the first two eigenvector waveforms shown in Fig. 2.10, whose coefficients θ̂ are given by (2.45). In Fig. 2.11 we show samples of the true observation xi shown as a waveform in time, compared with the reconstruction x̂i formed from (2.47) using only the first r = 2 eigenvectors. It is seen that the difference between the true and reconstructed vector samples is small, as expected.

Example: Denoising Signals

We now present an example showing how the PCA process can be used to
denoise a signal. If the signal of interest has significant correlation structure,
we can take advantage of the fact that the eigenvalues of the covariance


Figure 2.10. First two eigenvector components as functions of time, for Butterworth low-
pass filtered noise example.


Figure 2.11. Original vector samples of x as functions of time (solid), compared with their
reconstruction using only the first two eigenvector components (dotted). Three vector
samples are shown.

matrix are typically concentrated into only a few (i.e., r) significant values,
just as they are in the previous case where we were interested in compression.
Thus, in a manner similar to the compression case, we see that in the present
case the signal component is concentrated into a subspace whose basis is the
eigenvector matrix V r = [v 1 , . . . , v r ]. This subspace is exactly analogous to
the signal subspace associated with the MUSIC algorithm of Sect. 2.5.1. The
noise typically has significant contributions over all n eigen–components, so if r is appreciably less than n, which is the case when the signal has significant correlation structure, reconstructing the signal using only the
signal subspace components has the effect of suppressing a large part of the
noise. The process we follow for this example is identical to the previous
compression example, except here the effect we are interested in is denoising
rather than compression.

We illustrate the concept through a simulation experiment. We consider a Gaussian pulse whose samples are encapsulated into a vector xi ∈ Rn, as shown in Fig. 2.12. We observe m = 1000 such pulses, where each pulse is subject to timing jitter, amplitude variation and additive coloured noise. The duration n of the pulse was chosen to be n = 100 samples. The observations are collected into a 1000 × 100 matrix X, where the ith row xi^T represents the 100 samples from the ith distinct pulse. 50 of the 1000 pulses are shown superimposed in Fig. 2.13.


Figure 2.12. The prototype Gaussian pulse.


Figure 2.13. 50 superimposed, simulated pulses corrupted by timing jitter, amplitude variation and additive coloured noise.

The covariance matrix estimate R̂ = (1/m) X^T X ∈ Rn×n was then calculated, from which the eigenvector matrix V r ∈ Rn×r was extracted according to the conventional PCA process as discussed. It was empirically determined that the best value for r in this case is 3.

We denoise the waveforms using the same procedure as in the compression example. The PCA coefficients θi ∈ Rr corresponding to the ith sample xi can be determined using (2.45) as θi = V r^T xi. The denoised waveforms x̂i can then be reconstructed using (2.47) as x̂i = V r θi.
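
A compact Matlab sketch of this denoising procedure is given below. The pulse generation is a simplified stand–in for the experiment described above (the jitter and noise parameters are assumed values, and white rather than coloured noise is used for brevity); the denoising step itself follows (2.45) and (2.47).

    n = 100;  m = 1000;  t = (1:n);
    X = zeros(m, n);
    for i = 1:m
        mu  = 50 + 3*randn;                  % timing jitter (assumed spread)
        amp = 1 + 0.1*randn;                 % amplitude variation
        X(i,:) = amp*exp(-(t - mu).^2/(2*5^2)) + 0.1*randn(1, n);
    end
    R      = (X.'*X)/m;                      % sample covariance estimate
    [V, D] = eig(R);
    [~, k] = sort(diag(D), 'descend');
    Vr     = V(:, k(1:3));                   % r = 3 principal eigenvectors
    Xhat   = (X*Vr)*Vr.';                    % denoised pulses, one per row
    plot(t, X(1,:), ':', t, Xhat(1,:), '-'); % compare noisy and denoised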

The first r = 3 principal eigenvectors are shown superimposed in Fig. 2.14, where each eigenvector may again be interpreted as a function of time (i.e., a waveform) over the 100–sample interval. A typical denoised waveform corresponding to an arbitrarily–chosen row is shown in Fig. 2.15, which shows the original waveform subjected to timing jitter (dotted, red), the same waveform corrupted by additive coloured noise (blue, dash–dot), and the corresponding restored, denoised version (black, dashed). It may be seen that the quality of the recovered signal is quite remarkable, in the presence of substantial timing jitter and noise, using only r = 3 eigenvector components. The denoising effect works most effectively when the signal is highly correlated (thus increasing the concentration of the eigenvalues in the first few coefficients) and the noise components are concentrated elsewhere.

In the finite data case, the principal eigenvectors of R̂ are only an approxi-
mation to the true signal subspace basis, with the result that there will be
some degree of noise leaking into the estimated signal subspace. Thus the
denoising process we have described in this section is not exact; however, in
most cases in practice the level of noise is suppressed considerably.


Figure 2.14. The 3 principal eigenvectors, shown as functions of time.


Figure 2.15. A comparison between the original (jittered) noise–free waveform (dotted, red), the waveform corrupted by coloured noise (blue, dash–dot), and the denoised version (black, dashed).

Example: Classification Using PCA Coefficients

Here we introduce a simple machine learning example where we wish to classify between two types of signals. The first signal is a white noise process passed through a low–pass filter, while the other is a distinct white noise process passed through a high–pass filter. Examples of the respective waveforms are shown in Figs. 2.16 and 2.17.


Figure 2.16. Examples of two low–pass waveforms for the classification example.


Figure 2.17. Examples of two high–pass waveforms for the classification example.

A data matrix X lo was formed from the low–pass data in a manner similar to that in Sect. 2.6.1. Each row of X lo represents a window of data from the low–pass process. The matrix X lo consists of m = 200 rows, each n = 50 samples long. The covariance matrix Rlo ∈ Rn×n was formed in the usual manner as Rlo = X lo^T X lo. Two principal eigenvectors were then extracted from Rlo to form the matrix V lo ∈ R50×2. The same procedure was applied to the high–pass data to generate X hi and V hi. The two principal eigenvector waveforms corresponding to the two classes (low–pass and high–pass) are shown in Figs. 2.18 and 2.19 respectively. It may be observed that the low–pass eigenvectors shown in Fig. 2.18 vary smoothly from one sample to the next, as is characteristic of low–pass waveforms; i.e., adjacent samples are positively correlated, whereas the high–pass eigenvector waveforms tend to change sign between adjacent samples; i.e., adjacent samples are negatively correlated in this case.

Figure 2.18. The two principal eigenvector waveforms for the low–pass data.


Figure 2.19. The two principal eigenvector waveforms for the high–pass data.

There are several different techniques we can apply to classify a test sample into one of these two classes. The approach used here is to use the PCA coefficients formed from the low–pass eigenvectors V lo as features for the classification process. In this respect, we form two 200 × 2 matrices θi, i ∈ {lo, hi}, one for each class, where each row (consisting of 2 elements) contains the PCA coefficients obtained using the first two principal eigenvectors V lo of the low–pass process. The θ–matrices are generated in the following manner:
θ lo = X lo V lo and (2.51)
θ hi = X hi V lo . (2.52)
Note that the low–pass eigenvectors are used in each case. These coefficients
are then used as features for a random forest classifier, which is implemented
using the Matlab Classification Learner toolbox. A scatterplot of the θ lo and
θ hi coefficients, where each 2–dimensional row of θ represents a single point,
is shown in Fig. 2.20. The red and blue points correspond to the low–
pass and high–pass processes respectively. It is seen the classes separate
very cleanly, with the high–pass points concentrating near the origin, and
the low–pass points scattered throughout the feature space. The overall
training accuracy for this experiment is 98.3%. By inspection of the high–
pass waveforms of Fig. 2.17, we see the samples alternate sign between
adjacent samples, whereas the low–pass eigenvector waveforms of Fig. 2.18
vary smoothly. Eq. (2.52) evaluates the sample covariance between the
high–pass process and the low–pass eigenvectors. It may be observed that
the characteristics of the waveforms involved lead to the covariances in this
case being low in value, and therefore the high–pass points in Fig. 2.20
concentrate near the origin. On the other hand, in (2.51), because both
waveforms are slowly varying and mutually similar, the covariance values
are larger.
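
The feature construction of (2.51) and (2.52) can be sketched in Matlab as follows. The low–pass and high–pass processes are generated here with simple stand–in filters rather than the filters used in the example, so the resulting scatterplot will only qualitatively resemble Fig. 2.20.

    m = 200;  n = 50;
    xlo = filter(ones(4,1)/4, 1, randn(m*n,1));      % low-pass stand-in process
    xhi = filter([1 -1]/sqrt(2), 1, randn(m*n,1));   % high-pass stand-in process
    Xlo = reshape(xlo, n, []).';                     % m-by-n data matrices
    Xhi = reshape(xhi, n, []).';
    [V, D]  = eig(Xlo.'*Xlo);
    [~, k]  = sort(diag(D), 'descend');
    Vlo     = V(:, k(1:2));                  % two principal low-pass eigenvectors
    ThetaLo = Xlo*Vlo;                       % (2.51): low-pass class features
    ThetaHi = Xhi*Vlo;                       % (2.52): high-pass class features
    plot(ThetaLo(:,1), ThetaLo(:,2), 'r.', ThetaHi(:,1), ThetaHi(:,2), 'b.');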

Note that we could have conducted the same experiment by using the high–
pass eigenvectors in (2.51) and (2.52) instead of the low–pass eigenvectors,
and the results would be similar, except the low–pass and high–pass samples
in this case would be reversed in role. The reader is invited to explain why
the procedure described in (2.51) and (2.52) for evaluating the features is
similar in many respects to passing samples from both classes through a
low–pass (or high–pass) filter and evaluating the variances at the output.
In this case, the classes would cluster in a similar manner to that shown
in Fig. 2.20. While this is another suitable method for discriminating the
two classes, it has little pedagogical value for the present purposes since it
doesn’t use eigenvectors.


Figure 2.20. The scatterplot of the PCA coefficients θ corresponding to the low–pass eigenvectors, which are used as features for the classifier. The coefficients from the low–pass samples are shown in red, whereas those from the high–pass samples are in blue. There are 200 samples from each class.

2.6.2 PCA vs. Wavelet Analysis:

One of the practical difficulties in using PCA for compression is that the eigenvector set V is usually not available at the reconstruction stage in practical cases when the observed signal is mildly or severely nonstationary, as is the case with speech or video signals. In this case, the covariance matrix estimate R̂x changes with time; hence so do the eigenvectors. Provision of the eigenvector set for reconstruction is expensive in terms of information storage and so is undesirable. Wavelet functions, which can be regarded as another form of orthonormal basis, can replace the eigenvector basis in many cases. While not optimal, the wavelet transform still displays an ability to concentrate coefficients, and so performs reasonably well in compression situations. The advantage is that the wavelet basis, unlike the eigenvector basis, is constant and so does not vary with time. The current MPEG standard for audio and video signals uses the wavelet transform for compression.

On the other hand, in many instances where denoising is the objective, the
PCA basis may be more effective than wavelets. In these cases where real-
time performance is not required and a large sample of data is available, the
covariance and eigenvector matrices are readily computed and so denoising
with the PCA basis is straightforward to implement. Also, because of the
optimality of the eigenvector basis, a cleaner denoised signal is likely to
result.

2.7 Matrix Norms

Now that we have some understanding of eigenvectors and eigenvalues, we can present the matrix norm. The matrix norm is related to the vector norm: it is a function which maps Rm×n into R. A matrix norm must obey the same properties as a vector norm. Since a norm is only strictly defined for a vector quantity, a matrix norm is defined by mapping a matrix into a vector and evaluating the vector norm. This is accomplished by post–multiplying the matrix by a suitable vector x. However, this quantity varies in norm as x changes direction (as well as norm). Since the norm is a measure of how large a quantity can be, we choose the direction of x so that ||Ax||p is maximum. This idea can be expressed more rigorously as follows:

Matrix p-Norms: A matrix p-norm is defined in terms of a vector p-norm.
The matrix p-norm of an arbitrary matrix A, denoted ||A||p , is defined as

||A||p = sup_{x ≠ 0} ||Ax||p / ||x||p    (2.53)

where “sup” means supremum; i.e., the largest value of the argument over all values of x ≠ 0. Since a property of a vector norm is ||cx||p = |c| ||x||p for any scalar c, we can choose c in (2.53) so that ||x||p = 1. Then, an equivalent statement to (2.53) is

||A||p = max_{||x||p = 1} ||Ax||p.    (2.54)

For the specific case where p = 2 and A is square and symmetric, it follows from (2.54) and Sect. 2.2 that ||A||2 = λ1. More generally, it is shown in the next chapter for an arbitrary matrix A that

||A||2 = σ1 (2.55)

where σ1 is the largest singular value of A. This quantity results from the
singular value decomposition, to be discussed in the next chapter.

Matrix norms for other values of p, for arbitrary A, are given as

||A||1 = max_{1≤j≤n} Σ_{i=1}^m |aij|    (maximum column sum)    (2.56)

and

||A||∞ = max_{1≤i≤m} Σ_{j=1}^n |aij|    (maximum row sum).    (2.57)

Frobenius Norm: The Frobenius norm is the 2-norm of the vector ob-
tained by concatenating all the rows (or columns) of the matrix A:
 1/2
Xm X
n
||A||F =  |aij |2 
i=1 j=1
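
These definitions are easy to check numerically. The short Matlab sketch below compares each built–in norm against the corresponding formula; the particular matrix is arbitrary and of no significance.

    A = randn(4, 3);
    [norm(A,2),     max(svd(A))]             % 2-norm equals sigma_1, cf. (2.55)
    [norm(A,1),     max(sum(abs(A),1))]      % maximum column sum, (2.56)
    [norm(A,Inf),   max(sum(abs(A),2))]      % maximum row sum, (2.57)
    [norm(A,'fro'), sqrt(trace(A.'*A))]      % Frobenius norm (see Property 3 below)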

Properties of Matrix Norms

1. Consider the matrix A ∈ Rm×n and the vector x ∈ Rn . Then,

||Ax||p ≤ ||A||p ||x||p

This property follows by dividing both sides of the above by ||x||p , and
applying (2.53).

2. If Q and Z are orthonormal matrices of appropriate size, then

||QAZ||2 = ||A||2

and
||QAZ||F = ||A||F
Thus, we see that the matrix 2–norm and Frobenius norm are invariant
to pre– and post– multiplication by an orthonormal matrix.

3. Further,
||A||2F = tr(A^T A)


where tr(·) denotes the trace of a matrix, which is the sum of its diag-
onal elements. While we are considering trace, an important property
of the trace operator is

tr (AB) = tr (BA) (2.58)

for any pair of matrices A, B whose dimensions are n × k and k × n respectively, k, n ∈ Z.¹²

¹²Z is the set of positive integers, excluding zero.

Appendix

2.8 Differentiation of a Quadratic Form

Consider a square matrix A ∈ Rn×n. Then the quantity x^T Ax is referred to as a quadratic form, which is a scalar quantity which we discuss further in Ch. 4. To differentiate a scalar function f(x) by a vector or a matrix x, we differentiate f(x) by each element of x in turn, and then assemble the results back into a vector (or matrix, as the case may be). Thus the derivative in this case has the same dimensions as x.

To differentiate the quadratic form f(x) = x^T Ax with respect to xi, we use the vector version of the product rule.¹³ If f(x) = a(x)b(x), then

df(x)/dxi = a(x) db(x)/dxi + (da(x)/dxi) b(x).    (2.59)

For the problem at hand, we assign a(x) = x^T, and b(x) = Ax. Then it is readily verified that the first term of (2.59) is x^T ai, while the second is ai^T x, where ai^T is the ith row of A. Combining the results for i = 1, . . . , n into a vector, we have

df(x)/dx = x^T A + Ax.

To combine these two terms into a more convenient form, we are at liberty to transpose the first term, since the values of the derivatives remain unchanged. We then end up with the result

df(x)/dx = A^T x + Ax = (A^T + A)x.

In the case when A is symmetric, then

df(x)/dx = 2Ax.

This result is loosely analogous to the scalar case where the derivative d(ax²)/dx = 2ax, where a ∈ R.

13
This may be proved in an analogous manner to the scalar case.
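
As a quick numerical sanity check of the result above, the gradient formula can be compared against a central finite–difference approximation (the matrix, vector and step size below are arbitrary choices):

    n = 4;  A = randn(n);  x = randn(n,1);  h = 1e-6;
    g_analytic = (A.' + A)*x;
    g_numeric  = zeros(n,1);
    for i = 1:n
        e = zeros(n,1);  e(i) = h;
        g_numeric(i) = ((x+e).'*A*(x+e) - (x-e).'*A*(x-e))/(2*h);
    end
    max(abs(g_analytic - g_numeric))         % small (rounding error only)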

2.9 Problems

1. Consider a real skew–symmetric matrix (i.e., one for which A^T = −A). Prove its eigenvalues are pure imaginary, and the eigenvectors are mutually orthogonal.

2. If the eigendecomposition of a square, symmetric matrix A is given by


A = V ΛV T , what is the eigendecomposition of A−1 ?

3. Consider 2 matrices A and B. Under what conditions is AB = BA?

4. We are given a matrix A whose eigendecomposition is A = V ΛV T .


Find the eigenvalues and eigenvectors of the matrix C = BAB −1 in
terms of those of A, where B is any invertible matrix. C is referred
to as a similarity transform of A.

5. Consider a non–white random process x[k] of duration m. The sequence f[k] of duration n ≤ m operates on x[k] to give an output sequence y[k] according to

   y[k] = Σ_i x[k − i] f[i]

   where x[k] = 0 for k > m or k < 0. Using the x-data in file assig2Q5 2019.mat on the website, find f[k] of length n = 10 so that ||y||22 is minimized, subject to ||f||22 = 1.

6. On the course website you will find a file assig1 Ch2 Q6 2020.mat,
which contains a matrix X ∈ Rm×n of data corresponding to the
example of Sect. 2.6.1. Each row is a time–jittered Gaussian pulse
corrupted by coloured noise. Here, m = 1000 and n = 100, as per
the example. Using your preferred programming language, produce a
denoised version of the signal represented by the first row of X.

7. On the course website you will find a .mat file assig2Q7 2019.mat. It
contains a matrix X whose columns contain two superimposed Gaus-
sian pulses with additive noise. Using methods discussed in the course,
estimate the position of the peaks of the Gaussian pulses.

8. We are given a sequence of vector samples xi = ai g, i = 1, . . . , N, each of length K samples, where ai ∈ R is a zero–mean random variable with variance σa² and the elements of g ∈ RK constitute a Gaussian pulse waveform in time with unit norm, similar to that shown in Fig. 2.12. The width (σ) and position (µ) of the pulse are invariant with i. We form the sample covariance matrix R̂ over the N observations as R̂ = (1/N) Σ_{i=1}^N xi xi^T.

(a) What is the rank of R̂?


(b) What is the first eigenvector of R̂, as N → ∞? Describe the
remaining eigenvectors.
(c) White noise is added to xi so that xi = ai g + σw, where w ∈
RK is a zero–mean noise vector with uncorrelated elements; i.e.,
E(wwT ) = σ 2 I. The vector w is uncorrelated with the signal
component g. Address the two questions above for this present
case.

9. Prove that m must be greater than or equal to n for R̂ in (2.33) to be


full rank.

10. Consider the following alternative formulation for R̂ that enables it to


adapt to a changing environment:

R̂(k) = (1 − λ)x(k)xT (k) + λR̂(k − 1),

where k is the time index and 0 < λ < 1 is a parameter that controls
the adaptation rate. Explain how the method operates, and what is
the effect of varying λ? What happens to the observation x(ko ), where
ko is constant, as time increases?

11. Prove (2.58).

12. The random process x1 [k] is narrow–band, mean–centered, whose spec-


trum is centered at 0 Hz. The process x2 [k] is narrow band, mean–
centered, whose spectrum is centered at a normalized frequency of 1/2
Hz. What is the sign of the covariances between adjacent samples of
each process?

Chapter 3

The Singular Value Decomposition (SVD)

In this chapter we learn about one of the most fundamental and important
matrix decompositions of linear algebra: the SVD. It bears some similarity
with the eigendecomposition (ED), but is more general. Usually, the ED is
of interest only on symmetric square matrices, but the SVD may be applied
to any matrix. The SVD gives us important information about the rank,
the column and row spaces of the matrix, and leads to very useful solutions
and interpretations of least squares problems. We also discuss the concept
of matrix projectors, and their relationship with the SVD.

3.1 Development of the SVD

We have found so far that the eigendecomposition is a useful analytic tool. However, it is only applicable to square symmetric matrices. We now consider the SVD, which may be considered a generalization of the ED to arbitrary matrices. Thus, with the SVD, all the analytical uses of the ED which before were restricted to symmetric matrices may now be applied to any form of matrix, regardless of size, whether it is symmetric or nonsymmetric, rank deficient, etc.

Theorem 1 Let A ∈ Rm×n be a rank r matrix (r ≤ p = min(m, n)). Then A can be decomposed according to the singular value decomposition as

A = U Σ V^T    (3.1)

where U ∈ Rm×m and V ∈ Rn×n are orthonormal, and

Σ = [ Σ̃  0 ;  0  0 ],

where the blocks of Σ have r and m − r rows, and r and n − r columns, respectively, Σ̃ = diag(σ1, σ2, . . . , σr), and

σ1 ≥ σ2 ≥ σ3 ≥ · · · ≥ σp ≥ 0.

The matrix Σ must be of dimension Rm×n (i.e., the same size as A), to
maintain dimensional consistency of the product in (3.1). It is therefore
padded with appropriately–sized zero blocks to augment it to the required
size.

Since U and V are orthonormal, we may also write (3.1) in the form

U^T A V = Σ,    (3.2)

where the factors have dimensions m × m, m × n, n × n and m × n respectively, and Σ is a diagonal matrix. The values σi, which are defined to be positive, are referred to as the singular values of A. The columns ui and vi of U and V are respectively called the left and right singular vectors of A.

Proof: Consider the square symmetric positive semi–definite matrix A^T A.¹ Let the eigenvalues greater than zero be σ1², σ2², . . . , σr², r ≤ min(m, n). Then, from our knowledge of the eigendecomposition, there exists an orthonormal matrix V ∈ Rn×n such that

V^T A^T A V = [ Σ̃²  0 ;  0  0 ].    (3.3)

¹The concept of positive semi–definiteness is discussed in the next chapter. It means all the eigenvalues are greater than or equal to zero.

where Σ̃² = diag[σ1², . . . , σr²]. We now partition V as [V 1 V 2], where V 1 ∈ Rn×r. Then (3.3) has the form

[ V 1^T ; V 2^T ] A^T A [ V 1  V 2 ] = [ Σ̃²  0 ;  0  0 ].    (3.4)

Then by equating corresponding blocks in (3.4) we have

V 1^T A^T A V 1 = Σ̃²    (r × r)    (3.5)
V 2^T A^T A V 2 = 0    ((n − r) × (n − r)).    (3.6)

From (3.5), we can write

Σ̃⁻¹ V 1^T A^T A V 1 Σ̃⁻¹ = I.    (3.7)

Then, we define the matrix U 1 ∈ Rm×r from (3.7) as

U 1 = A V 1 Σ̃⁻¹.    (3.8)

Then, noting that the product of the first three terms in (3.7) is the transpose of the product of the latter three terms, we have U 1^T U 1 = I, and it follows that

U 1^T A V 1 = Σ̃.    (3.9)

From (3.6), since V 2^T A^T A V 2 = (A V 2)^T (A V 2) = 0, we also have

A V 2 = 0.    (3.10)

We now choose a matrix U 2 so that U = [U 1 U 2] ∈ Rm×m is orthonormal. Then from (3.8) and because U 1 ⊥ U 2, we have

U 2^T U 1 = U 2^T A V 1 Σ̃⁻¹ = 0.    (3.11)

Therefore

U 2^T A V 1 = 0.    (3.12)

Combining (3.9), (3.10) and (3.12), we have

U^T A V = [ U 1^T A V 1   U 1^T A V 2 ;  U 2^T A V 1   U 2^T A V 2 ] = [ Σ̃  0 ;  0  0 ]    (3.13)

which was to be shown. 

The proof can be repeated using an eigendecomposition on the matrix AA^T ∈ Rm×m instead of on A^T A. In this case, the roles of the orthonormal matrices V and U are interchanged, and the non–zero eigenvalues remain unchanged.

3.1.1 Relationship between SVD and ED

It is clear that the eigendecomposition and the singular value decomposition share many properties in common. The price we pay for being able to perform a diagonal decomposition on an arbitrary matrix is that we need two orthonormal matrices instead of just one, as is the case for square symmetric matrices. In this section, we explore further relationships between the ED and the SVD.

From the definition of the SVD and partitioning V according to

V = [ V 1  V 2 ] ∈ Rn×n,  V 1 ∈ Rn×r,  V 2 ∈ Rn×(n−r),    (3.14)

as before, we can write

A^T A = [ V 1  V 2 ] [ Σ̃  0 ;  0  0 ] U^T U [ Σ̃  0 ;  0  0 ] [ V 1^T ; V 2^T ]
      = V 1 Σ̃² V 1^T
      = V [ Σ̃²  0 ;  0  0 ] V^T

where Σ̃ ∈ Rr×r is diagonal. Thus it is apparent that the eigenvectors V of the matrix A^T A are the right singular vectors of A, and that the squares of the singular values of A are the corresponding nonzero eigenvalues. Note that if A is short (m < n) and rank m, the eigenvalue matrix of A^T A will contain n − m additional zero eigenvalues that are not included as singular values of A. If rank(A) = r, regardless of its shape (tall, square or short), and if n − r ≥ 2, there are repeated zero eigenvalues and V 2 is not unique in this case.

Further, using the form AA^T instead of A^T A, it is straightforward to show that

AA^T = [ U 1  U 2 ] [ Σ̃²  0 ;  0  0 ] [ U 1^T ; U 2^T ] = U 1 Σ̃² U 1^T,

which indicates that the eigenvectors of AA^T are the left singular vectors U of A, and the squared singular values of A are the nonzero eigenvalues of AA^T. Notice that in this case, if A is tall and full rank, the eigenvalue matrix corresponding to AA^T will contain m − n additional zero eigenvalues that are not included as singular values of A. If rank(A) = r and if m − r ≥ 2, then there are repeated zero eigenvalues and U 2 is not unique in this case.
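
These relationships are easy to confirm numerically. In the Matlab sketch below (the matrix is arbitrary), the right singular vectors of A agree with the eigenvectors of A^T A up to ordering and sign, and the squared singular values agree with the eigenvalues.

    A = randn(5, 3);
    [U, S, V] = svd(A);
    [Q, L]    = eig(A.'*A);
    [lam, k]  = sort(diag(L), 'descend');
    [diag(S).^2, lam]                        % squared singular values vs. eigenvalues
    abs(V.'*Q(:,k))                          % identity up to sign, columnwise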

3.1.2 Partitioning the SVD

Following the convention used for ordering eigenvalues, for convenience of notation, we arrange the singular values as

σ1 ≥ · · · ≥ σr > σr+1 = · · · = σp = 0,

where σ1 is the largest singular value, σr is the smallest non–zero singular value, and the remaining p − r singular values are zero.
We also partition U and V as in the previous section. We can then write the SVD of A in the form

A = [ U 1  U 2 ] [ Σ̃  0 ;  0  0 ] [ V 1^T ; V 2^T ]    (3.15)

where Σ̃ = diag(σ1, . . . , σr) ∈ Rr×r, and U is partitioned as

U = [ U 1  U 2 ],  U 1 ∈ Rm×r,  U 2 ∈ Rm×(m−r).

V is partitioned in an analogous manner:

V = [ V 1  V 2 ],  V 1 ∈ Rn×r,  V 2 ∈ Rn×(n−r).

3.1.3 Properties and Interpretations of the SVD

The above partition reveals many interesting properties of the SVD:

rank(A) = r

From (3.15), by multiplying matrices from the right–hand side, we have A = U 1 B, where B = Σ̃ V 1^T. It is clear that all columns of A are linear combinations of the columns of U 1 ∈ Rm×r. Therefore, since B is full rank, R(A) spans r dimensions and so rank(A) = r. If r < p = min(m, n), then there are p − r zero singular values.

Determination of rank when σ1, . . . , σr are distinctly greater than zero, and when σr+1, . . . , σp are exactly zero, is easy. But often in practice, due to finite precision arithmetic and fuzzy data, σr may be very small, and σr+1 may be not quite zero. Hence, in practice, determination of rank is not so easy. A common method is to declare rank(A) = r if σr+1 ≤ ε, where ε is a small number specific to the problem considered.
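
The following short Matlab sketch illustrates numerical rank determination; the perturbation level and the tolerance ε are arbitrary choices made here for illustration.

    A  = randn(6,2)*randn(2,4);              % exactly rank 2 by construction
    An = A + 1e-10*randn(6,4);               % perturbed version
    s  = svd(An);                            % singular values, largest first
    tol = 1e-8;
    r  = sum(s > tol);                       % declared (numerical) rank, here 2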

N (A) = R(V 2 )

The proof is analogous to the eigenvector case discussed in Chapter 2. Recall


the nullspace N (A) = {x | Ax = 0}. From (3.15) it is clear that Ax = 0
for non–zero x iff x ∈ R(V 2 ).

R(A) = R(U1 )

From the paragraph on rank above, we have A = U 1 B and B is full rank.


Hence the ranges of A and U 1 are equivalent and U 1 is a basis for the
column space (range) of A.

R(AT ) = R(V1 )

Recall that R(AT ) is the set of all linear combinations of rows of A. If we


transpose the expression for A in (3.15) and apply the same argument that
justifies R(A) = R(U 1 ) above, we get the desired result. Hence V 1 is a
basis for the row space of A.

R(A)⊥ = R(U 2 )

We have seen that R(A) = R(U1 ). Since from (3.15), U 1 ⊥ U 2 , then U 2 is


a basis for the orthogonal complement of R(A). Hence the result.

||A||2 = σ1 = σmax

This is straightforward to see from the definition of the 2-norm and the
ellipsoid example to follow in Section 3.1.3.

Inverse of A

If the SVD of a square invertible matrix A is given, it is easy to find the inverse. In this case we have σ1 ≥ · · · ≥ σn > 0. The inverse of A is given from the SVD, using the familiar rules, as

A⁻¹ = V Σ⁻¹ U^T.

The evaluation of Σ⁻¹ is simple because it is square and diagonal. Note that this treatment indicates that the singular values of A⁻¹ are [1/σn, 1/σn−1, . . . , 1/σ1], in that order. The only difficulty with this approach is that in general, finding the SVD is more costly in computational terms than finding the inverse by more conventional means.

The SVD diagonalizes any system of equations

Consider the system of equations Ax = b, for an arbitrary matrix A. Using


the SVD of A, we have
U ΣV T x = b. (3.16)

Let us now represent b in the basis U, and x in the basis V. We therefore have

c = [ c1 ; c2 ] = [ U 1^T ; U 2^T ] b,  c1 ∈ Rr, c2 ∈ Rm−r,    (3.17)

and

d = [ d1 ; d2 ] = [ V 1^T ; V 2^T ] x,  d1 ∈ Rr, d2 ∈ Rn−r.    (3.18)
Substituting the above into (3.16), the system of equations becomes

Σd = c. (3.19)

This shows that as long as we choose the correct bases, any system of equa-
tions can become diagonal. This property represents the power of the SVD;
it allows us to transform arbitrary algebraic structures into their simplest
forms.

Eq. (3.19) can be expanded as

[ Σ̃  0 ;  0  0 ] [ d1 ; d2 ] = [ c1 ; c2 ].    (3.20)

The above equation reveals several interesting facts about the solution of the system of equations. First, if m > n (A is tall) and A is full rank, then the right blocks of zeros in Σ, as well as the quantity d2, are both empty. In this case, the system of equations can be satisfied exactly only if c2 = 0. This implies that U 2^T b = 0, or that b ∈ R(U 1) = R(A), for an exact solution to exist. This result makes sense, since in this case the quantity Ax is a linear combination of the columns of A, and therefore the equation Ax = b can be satisfied only if b ∈ R(A).

If m < n (A is short) and full rank, then the bottom blocks of zeros in Σ, as well as c2 in (3.20), are both empty. In this case we have d1 = Σ̃⁻¹ c1 from the top row, and d2 arbitrary from the bottom row. We can write these relationships in the form

[ d1 ; d2 ] = [ V 1^T x ; V 2^T x ] = [ Σ̃⁻¹ c1 ; d2 ].

Multiplying both sides of the right-hand equality by V = [V 1 V 2] we have

x = [ V 1  V 2 ] [ Σ̃⁻¹ U 1^T b ; d2 ]
  = V 1 Σ̃⁻¹ U 1^T b + V 2 d2    (3.21)

where in the top line we have substituted (3.17) for c1 and (3.18) for d.
Thus we see that the solution consists of a “basic” component, which is the

first term above. This term is closely related to the pseudo–inverse, which
we discuss in some detail in Ch. 8. Since V 2 is a basis for N (A) and d2 is
an arbitrary n − r vector, the second term above contributes an arbitrary
component in the nullspace of A to the solution. Thus x is not unique. It
is straightforward to verify that the quantity AV 2 d2 = 0, so the addition
of the second term does not affect the fact we have an exact solution.

If A is not full rank, then none of the zero blocks in (3.20) are empty. This implies that both scenarios above apply in this case.
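
A minimal Matlab sketch of the solution (3.21) for a short, full–rank system is given below; the matrix and right–hand side are arbitrary, and the nullspace component d2 is chosen at random to show that the solution is not unique.

    A = randn(3, 5);  b = randn(3, 1);           % m < n, rank 3
    [U, S, V] = svd(A);
    r  = rank(A);
    St = S(1:r, 1:r);                            % the block Sigma-tilde
    x_basic = V(:,1:r)*(St\(U(:,1:r).'*b));      % first term of (3.21), d2 = 0
    x_other = x_basic + V(:,r+1:end)*randn(5-r,1);   % add a nullspace component
    [norm(A*x_basic - b), norm(A*x_other - b)]   % both are exact solutions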

The “rotation” interpretation of the SVD

From the SVD relation A = U ΣV T , we have

AV = U Σ.

Note that since Σ is diagonal, the matrix UΣ on the right has orthogonal columns, whose 2–norms are equal to the corresponding singular values. We can therefore interpret the matrix V as an orthonormal matrix which rotates the rows of A so that the result is a matrix with orthogonal columns.
Likewise, we have
U T A = ΣV T .
The matrix ΣV T on the right has orthogonal rows with 2–norm equal to the
corresponding singular value. Thus, the orthonormal matrix U T operates
(rotates) the columns of A to produce a matrix with orthogonal rows.

Defining relationships for the ED and the SVD

For the ED, if A is symmetric, we have:

A = QΛQT → AQ = QΛ,

where Q is the matrix of eigenvectors, and Λ is the diagonal matrix of


eigenvalues. Writing this relation column-by-column, we have the familiar
eigenvector/eigenvalue relationship:

Aq i = λi q i i = 1, . . . , n. ∗ (3.22)

For the SVD, we have
A = UΣVT → AV = UΣ
or
Av i = σi ui i = 1, . . . , p, ∗ (3.23)
where p = min(m, n). Also, since AT = V ΣU T → AT U = V Σ, we have
AT ui = σi v i i = 1, . . . , p. ∗ (3.24)

Thus, by comparing (3.22), (3.23), and (3.24), we see that the singular vectors and singular values obey a relation which is similar to that which defines the eigenvectors and eigenvalues. However, we note that in the SVD case, the fundamental relationship expresses the left singular vectors in terms of the right singular vectors, and vice-versa, whereas the eigenvectors are expressed in terms of themselves. These SVD relations are used in Chapter 9 to develop the partial least squares regression method.

Exercise: compare the ED and the SVD on a square symmetric matrix,


when i) A is positive definite, and ii) when A has some positive and some
negative eigenvalues.

Ellipsoidal Interpretation of the SVD

The singular values of A, where A ∈ Rm×n are the lengths of the semi-axes
of the hyperellipsoid E given by:
E = {y | y = Ax, ||x||2 = 1} .
That is, E is the set of points mapped out as x takes on all possible values
such that ||x||2 = 1, as shown in Fig. 3.1. To appreciate this point, we look
at the set of y corresponding to {x | ||x||2 = 1}. We take
y = Ax = U Σ V^T x.    (3.25)
We change bases for both x and y. Define
c = UT y
d = V T x.


Figure 3.1. The ellipsoidal interpretation of the SVD. The locus of points E = {y | y =
Ax, ||x||2 = 1} defines an ellipse. The principal axes of the ellipse are aligned along the
left singular vectors ui , with lengths equal to the corresponding singular value.

Then (3.25) becomes


c = Σd, or Σ−1 c = d, (3.26)

where A is assumed full rank.

Due to the orthonormal transformation, we note that ||d||2 = 1 if ||x||2 = 1.


Thus, our problem is transformed into observing the set {c} corresponding
to the set {d | ||d||2 = 1}. The set {c} can be determined by evaluating
2-norms on each side of (3.26):

Σ_{i=1}^p (ci/σi)² = Σ_{i=1}^p (di)² = 1.

We see that the set {c} is indeed the canonical form of an ellipse in the basis
U . Thus, the principal axes of the ellipse are aligned along the columns
ui of U , with lengths equal to the corresponding singular value σi . This
interpretation of the SVD is useful later in our study of condition numbers.

A Useful Theorem [1]

First, we realize that the SVD of A provides a “sum of outer-products” representation:

A = U Σ V^T = Σ_{i=1}^r σi ui vi^T,    (3.27)

where r ≤ p = min(m, n). Given A ∈ Rm×n with rank r, then what is the
matrix B ∈ Rm×n with rank k < r closest to A in 2-norm? What is this
2-norm distance? This question is answered in the following theorem:

Theorem 2 Define a truncated version Ak of A in the following way:

Ak = Σ_{i=1}^k σi ui vi^T,  k < r,    (3.28)

then

min_{rank(B)=k} ||A − B||2 = ||A − Ak||2 = σk+1.

In words, this says the closest rank k < r matrix B to A in the 2–norm sense is given by Ak. Ak is formed from A by truncating contributions in (3.27) associated with the smallest singular values. This idea may be seen as a generalization of PCA, where here we construct low–rank approximations to matrices instead of vectors.

Proof:

Since U^T Ak V = diag(σ1, . . . , σk, 0, . . . , 0), it follows that rank(Ak) = k, and that

||A − Ak||2 = ||U^T (A − Ak) V||2 = ||diag(0, . . . , 0, σk+1, . . . , σr, 0, . . . , 0)||2 = σk+1,    (3.29)

where the first line follows from the fact that the 2-norm of a matrix is invariant to pre– and post–multiplication by an orthonormal matrix (properties of matrix p-norms, Chapter 2). Further, it may be shown [1] that, for any matrix B ∈ Rm×n of rank k < r,

||A − B||2 ≥ σk+1.    (3.30)

Comparing (3.29) and (3.30), we see the closest rank k matrix to A is Ak given by (3.28). 
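
A quick numerical check of this theorem, for an arbitrary matrix, is as follows:

    A = randn(8, 6);
    [U, S, V] = svd(A);
    k  = 2;
    Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k).';     % truncated expansion (3.28)
    [norm(A - Ak, 2), S(k+1,k+1)]            % the two values agree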

As an example, recall the PCA representation of a random process as discussed in Chapter 2. We express a sample x̂i ∈ Rm as

x̂i = V θ̂i    (3.31)

where the columns of V are the eigenvectors of the covariance matrix R and θ̂ is the sequence of PCA coefficients truncated to r non-zero coefficients. Then the covariance matrix R̂ corresponding to x̂i is given as

R̂ = E(x̂ x̂^T) = E(V θ̂ θ̂^T V^T) = V E(θ̂ θ̂^T) V^T = V Λ̂ V^T,    (3.32)

where Λ̂ = diag[λ1, . . . , λr, 0, . . . , 0], and where Property 8 of Chapter 2 has been used. Since R̂ is positive semi–definite, square and symmetric, its eigendecomposition and singular value decomposition are identical, with λi = σi², i = 1, . . . , r, where the σi are the singular values of the corresponding X.² Thus from this theorem, and (3.32), we know that the covariance matrix R̂ formed from truncating the PCA coefficients is the closest rank–r matrix to the true covariance matrix R in the 2–norm sense.

As a further example, we apply the above results to approximate an image with another of lower rank, which is the closest to the true version. Here we perform a partial SVD on the 512 × 512 Lena image, by treating the image as a matrix. Let the image matrix be represented by I, which is given by the SVD as

I = Σ_{i=1}^k σi ui vi^T,    (3.33)

where ui and vi are the columns of U and V of the true image respectively, and the σi are the singular values. The true image is represented by k = 512;
2
The proof is left as an exercise.


Figure 3.2. Lena image reconstructed with only k = 1 component.

however, here we reconstruct the image for a sequence of lower values of k, in order to obtain the closest rank-k matrix to the original matrix. Figures 3.2 through 3.6 show the reconstructed images for a sequence of k–values ranging from 1 to 50.

3.2 Orthogonal Projections

Consider an orthonormal Q ∈ Rm×m which we partition as [Q1 Q2 ], where


Q1 ∈ Rm×n , m > n, so Q1 is tall. Let a subspace S be defined as S =
R(Q1 ). We wish to formulate a matrix P so that the vector P y ∈ S, where
y ∈ Rm is arbitrary; i.e., P projects y into S.

We can express any vector y ∈ Rm in the complete m × m orthonormal basis Q according to the usual rules for changing bases, as y = Qc, where c = Q^T y. Therefore

y = Qc = [ Q1  Q2 ] [ c1 ; c2 ] = yS + yC,

where c1 ∈ Rn, c2 ∈ Rm−n, yS = Q1 c1 ∈ S and yC = Q2 c2 ∈ S⊥ (because Q2 ⊥ Q1). It is clear the vector yS is given as yS = Q1 c1 = Q1 Q1^T y. Therefore the matrix P for
Figure 3.3. Lena image reconstructed with k = 5 components.

Figure 3.4. Lena image reconstructed with k = 10 components.

Figure 3.5. Lena image reconstructed with k = 20 components.

Figure 3.6. Lena image reconstructed with k = 50 components.

Figure 3.7. The geometry of the orthogonal projection operation.

which y S = P y is given as
P = Q1 QT1 . (3.34)

We can represent the vectors y, yS and yC in a right–angled triangle configuration as shown in Fig. 3.7, where y is the hypotenuse. Thus we can interpret the vector yS as the result of dropping a perpendicular from y into S. Therefore, according to the principle of orthogonality [7], which states that the shortest distance in the 2–norm sense from a point p to a subspace is obtained by dropping a perpendicular from p onto the subspace, yS is the point in S closest to y in the 2–norm sense.
This is the origin of the term orthogonal projection.

We can extend the definition of a projector in the following way. Let X


be such that R(X) = S, where X ∈ Rm×n is full rank, m > n. Now take
the n × n matrix R = X T X. It is possible to compute a full–rank n × n
square–root matrix C such that C T C = R.3

Theorem 3 The matrix W ∈ Rm×n = XC −1 has orthonormal columns.

Proof : The matrix W has orthonormal columns if W T W = I. To show

3
The Cholesky decomposition, which we will study in Ch. 5, is one such example.

this, we take

WTW = C −T X T XC −1
= C −T RC −1 = C −T C T CC −1
= I.

We note that R(W ) = S because the columns of W are linear combinations


of the columns of X, which is assumed full rank. Since W has orthonor-
mal columns, we can replace Q1 with W in (3.34), to obtain the alternate
definition for a projector:

P = WWT
= XC −1 C −T X T
= XR−1 X T
= X(X T X)−1 X T . (3.35)

Eq. (3.35) is the usual definition of a projector.
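
The following short Matlab sketch checks the definition (3.35) and the idempotence and symmetry properties discussed below, for an arbitrary full–rank tall matrix X:

    X = randn(6, 2);
    P = X/(X.'*X)*X.';            % P = X (X'X)^{-1} X'
    norm(P*P - P)                 % idempotent:  P^2 = P
    norm(P - P.')                 % symmetric:   P' = P
    y  = randn(6,1);
    yS = P*y;  yC = (eye(6) - P)*y;
    yS.'*yC                       % the two components are orthogonal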

It is straightforward to show that P is the unique projector onto S, regardless of the basis used to form it, provided the basis spans S. To show this, we substitute Y = XC into (3.35), where C is an n × n invertible matrix. It is apparent that if the columns of X span S, then so do the columns of Y. Straightforward manipulation reveals that the result for P is unchanged. Therefore P is unique, regardless of the basis used to form it.

It is shown [1] that if a matrix P satisfies these conditions:

1. R(P) = S
2. P2 = P
3. PT = P

then P is a projection matrix onto S. The above are sufficient conditions for
a projector. This means that while these conditions are enough to specify
a projector, there may be other conditions which also specify a projector.
But since we have now proved the projector is unique, these conditions are
also necessary.

It is readily verified that both definitions for the projector [i.e., P = Q1 Q1^T and P = X(X^T X)⁻¹ X^T] satisfy the above properties.

Property 1 is necessary if the projector is to project into S. Property 2 implies that if we take the projection operation yS = P y, and then take P yS = P P y, the result of the second operation should still be yS. Hence we require P² = P. A matrix satisfying this property is called an idempotent matrix, and idempotency is the fundamental property of a projector.

Property 3 above is included since we want the projector to apply to row


vectors as well as column vectors, both of length m; i.e., a subspace can
be defined using row vectors as well as column vectors. In this case, we
want both quantities (y T P )T and P y to be vectors in S, and so therefore
P should be symmetric.

3.2.1 The Orthogonal Complement Projector

Consider the vectors y, y S and y C as defined previously. Then

y = ys + yc
= P y + yc.

Therefore we have

y − P y = yc
(I − P ) y = y c .

It follows that if P is a projector onto S, then the matrix (I − P) is a projector onto S⊥. It is easily verified that this matrix satisfies all the required properties for this projector.

3.2.2 Orthogonal Projections and the SVD

Suppose we have a matrix A ∈ Rm×n of rank r. Then, using the SVD


partitions of (3.15), we have these useful relations:

1. V 1 V 1^T is the orthogonal projector onto [N(A)]⊥ = R(A^T).
2. V 2 V T2 is the orthogonal projector onto N (A)

3. U 1 U T1 is the orthogonal projector onto R(A)

4. U 2 U T2 is the orthogonal projector onto [R(A)]⊥ = N (AT )

The above may be justified since it is straightforward to show that each of these matrices is indeed a projector, and that each of the respective SVD partitions is an orthonormal basis for the subspace onto which it projects.
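
A numerical illustration of these four projectors, using an arbitrary rank–deficient matrix, is given below:

    A = randn(6,2)*randn(2,4);                 % rank r = 2
    r = 2;
    [U, S, V] = svd(A);
    P_row  = V(:,1:r)*V(:,1:r).';              % projector onto R(A')
    P_null = V(:,r+1:end)*V(:,r+1:end).';      % projector onto N(A)
    P_col  = U(:,1:r)*U(:,1:r).';              % projector onto R(A)
    P_left = U(:,r+1:end)*U(:,r+1:end).';      % projector onto the complement of R(A)
    norm(A*P_null)                             % ~0: A annihilates its nullspace
    norm(P_col*A - A)                          % ~0: the columns of A lie in R(A)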

Appendices

3.3 Alternative Proof of the SVD

Consider two vectors x and y where ||x||2 = ||y||2 = 1, s.t. Ax = σy, where
σ = ||A||2 . The fact that such vectors x and y can exist follows from the
definition of the matrix 2-norm. We define orthonormal matrices U and V
so that x and y form their first columns, as follows:

U = [y, U1 ]
V = [x, V1 ]

That is, U1 consists of a (non–unique) set of orthonormal columns which are mutually orthogonal and also orthogonal to y; similarly for V1 with respect to x.

We then define a matrix A1 as

A1 = U^T A V = [ y^T ; U1^T ] A [ x  V1 ].    (3.36)

The matrix A1 has the following structure:

A1 = [ y^T ; U1^T ] [ Ax  AV1 ] = [ y^T ; U1^T ] [ σy  AV1 ]
   = [ σ y^T y   y^T A V1 ;  σ U1^T y   U1^T A V1 ]
   = [ σ   w^T ;  0   B ],    (3.37)

where w^T ≜ y^T A V1, B = U1^T A V1, and the blocks have 1 and m − 1 rows, and 1 and n − 1 columns. The (1,1) entry is σ because y^T y = 1, and the 0 in the (2,1) block above follows from the fact that U1 ⊥ y, because U is orthonormal.

Now, we post-multiply both sides of (3.37) by the vector [σ ; w] and take 2-norms:

|| A1 [σ ; w] ||22 = || [ σ  w^T ;  0  B ] [σ ; w] ||22 ≥ (σ² + w^T w)².    (3.38)
This follows because the term on the extreme right is only the first element
of the vector product of the middle term. But, as we have seen, matrix
p-norms obey the following property:

||Ax||2 ≤ ||A||2 ||x||2 . (3.39)

Therefore using (3.38) and (3.39), we have

||A1||22 · || [σ ; w] ||22 ≥ || A1 [σ ; w] ||22 ≥ (σ² + w^T w)².    (3.40)

Note that || [σ ; w] ||22 = σ² + w^T w. Dividing (3.40) by this quantity, we obtain

||A1||22 ≥ σ² + w^T w.    (3.41)

But, we defined σ = ||A||2. Therefore, the following must hold:

σ = ||A||2 = ||U^T A V||2 = ||A1||2    (3.42)

where the equality on the right follows because the matrix 2-norm is invari-
ant to matrix pre- and post-multiplication by an orthonormal matrix. By
comparing (3.41) and (3.42), we have the result w = 0.

Substituting this result back into (3.37), we now have

A1 = [ σ  0 ;  0  B ].    (3.43)

The whole process repeats using only the component B, until An becomes
diagonal. 

3.4 Problems

1. Prove that the trace of a projector is equal to its rank.

2. Using Matlab or other suitable language, construct a 6 × 4 matrix A


with rank = 2.

(a) Construct the projectors for each of the four fundamental sub-
spaces of A.
(b) Explain how to test whether a vector y is in a specified subspace
S.
(c) Construct a random vector x and project it onto the four sub-
spaces, using the respective projectors from part a, to yield the
result y.
(d) Verify that each y is indeed in the subspace corresponding to the
respective P .

3. Consider a tall matrix A ∈ Rm×n with column rank r < n − 1, whose


SVD is given by A = U ΣV T .

(a) Explain how the SVD can be changed without changing A.


(b) Repeat the above for the case where A is short and r < m − 1.

4. Given the (tall) matrix A above, explain how to determine a vector x


so that x ∈ R(A)⊥ . Illustrate with an example.

5. We are given an m × k matrix A and a k × n matrix B, where m >


n ≤ k; i.e., A is tall and B is short. We form C = AB.

(a) Prove that the row spaces of B and C are equivalent.


(b) Prove that the column spaces of A and C are equivalent.
(c) Find a basis for R(B T ) in terms of the SVD of C.
(d) Using matlab, provide an example showing that R(C T ) ≡ R(B T ).
Also demonstrate that these are both equivalent to the range of
the basis determined in step 5c above.

6. We are given a very tall matrix A where m  n and where m is


very large. We are interested in determining only the matrix V of the
SVD. Solving this directly by computation of the SVD is expensive,
due to the fact that U is very large and therefore requires a lot of

time and memory to calculate. Suggest a more practical approach for
determining V . Also show how to find the first r columns of U , given
V . Using similar ideas, extend your method to the case where A is
large and very short, and we are interested only in U . Also show how
to determine the first r columns of V . Hint: Consider the matrices
AT A and AAT , respectively.

7. We are given a coloured noise process generated by filtering white


noise. From this process, we generate a matrix X ∈ Rm×n , where
m > n, in the manner described in Sect. 2.4. The covariance matrix R̂1 may be written as R̂1 = (1/m) X^T X. Compare the eigenvalues and eigenvectors of R̂1 with those of R̂2 = (1/n) X X^T. What happens to the eigenvector waveforms as both m and n become large?

Chapter 4

The Quadratic Form

4.1 The Quadratic Form and Positive Definiteness

We introduce the quadratic form by considering the idea of positive definiteness of a matrix A. A square matrix A ∈ Rn×n is positive definite if and only if, for any 0 ≠ x ∈ Rn,

x^T A x > 0.    (4.1)

The matrix A is positive semi–definite if and only if, for any x ≠ 0, we have

x^T A x ≥ 0,    (4.2)

which, as we see later, includes the possibility that A is rank deficient. The quantity on the left in (4.1) is referred to as a quadratic form of A. It may be verified by direct multiplication that the quadratic form can also be expressed in the form

x^T A x = Σ_{i=1}^n Σ_{j=1}^n aij xi xj.    (4.3)

It is only the symmetric part of A which is relevant in a quadratic form expression. To see this, we define the symmetric part T of A as T ≜ ½[A + A^T], and the asymmetric part S of A as S ≜ ½[A − A^T]. Then we have the desired properties that T^T = T, S = −S^T, and A = T + S. Note the diagonal elements of S must be zero.

We can express (4.3) as

x^T A x = Σ_{i=1}^n Σ_{j=1}^n tij xi xj + Σ_{i=1}^n Σ_{j=1}^n sij xi xj.    (4.4)

We now consider only the second term on the right in (4.4):

Σ_{i=1}^n Σ_{j=1}^n sij xi xj.    (4.5)

Since S = −S^T, the quantity sij = −sji, j ≠ i, and sij = 0, i = j. Therefore, the sum in (4.5) is zero for any x. Thus, when considering quadratic forms, it suffices to consider only the symmetric part T of the matrix; i.e., we have the result x^T A x = x^T T x. This is useful, since now we can substitute the matrix A by its symmetric part T, which is square and symmetric, meaning we can represent it by its eigendecomposition. As we will see, this substitution offers new interpretations and insight on the quadratic form.

This result generalizes to the case where A is complex. It is left as an exercise to show that i) x^H A x = x^H T x, where T = ½[A + A^H], and ii) that the quantity x^H A x is pure real.

Quadratic forms on positive definite matrices are used very frequently in


least-squares and adaptive filtering applications. Also as we see later, quadratic
forms play a fundamental role in defining the multivariate Gaussian proba-
bility density function.

Theorem 4 A matrix A is positive definite if and only if all eigenvalues of


the symmetric part of A are positive.

Proof: Let the eigendecomposition of the symmetric part T of A be represented as T = V Λ V^T. Since only the symmetric part of A is relevant, the quadratic form on A may be expressed as x^T A x = x^T T x = x^T V Λ V^T x. Let us define the variable z as z = V^T x. As we have seen previously in Chapters 1 and 2, z is a rotation of x due to the fact V is orthonormal. Thus we have

x^T A x = z^T Λ z = Σ_{i=1}^n λi zi².    (4.6)

Thus (4.6) is greater than zero for arbitrary x if and only if λi > 0, i = 1, . . . , n. 

We also see from (4.6) that if the equality in the quadratic form is satisfied (x^T A x = 0 for some x ≠ 0 and corresponding z), then at least one eigenvalue of T must be zero. Hence, if A is symmetric and positive semidefinite but not positive definite, then at least one eigenvalue of A must be zero, which means that A is rank deficient.
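
A small Matlab check of Theorem 4, using an arbitrary nonsymmetric example matrix, is given below; only the symmetric part T influences the quadratic form, and its eigenvalues decide positive definiteness.

    A = [2 1; -1 2];                 % nonsymmetric example
    T = (A + A.')/2;                 % symmetric part
    S = (A - A.')/2;                 % asymmetric part
    x = randn(2,1);
    [x.'*A*x, x.'*T*x]               % identical: x'Sx = 0
    eig(T)                           % all positive, so A is positive definite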

4.2 The Locus of Points {z|z T Λz = 1}

Let A be positive definite. Here we investigate the set of points satisfying {z | z^T Λ z = 1}, where z is defined as above. We can write

z^T Λ z = Σ_{i=1}^n λi zi² = Σ_{i=1}^n zi² / (1/λi).    (4.7)

When this quantity is equated to 1, we have the canonical form of an ellipse in the variables zi, with principal axis lengths √(1/λi). The principal axes are aligned along the corresponding elementary basis directions e1, e2, . . . , en.

Since z = V T x where V is orthonormal, the locus of points {x|xT Ax = 1}


is a rotated version of the ellipse in (4.7). This ellipse has the same principal
axes lengths as before, but the ith principal axis now lines up along the ith
eigenvector v i of A.

The locus of points {x | x^T A x = k, k > 0} defines a scaled version of the ellipse above. In this case, the ith principal axis length is given by the quantity √(k/λi).


Figure 4.1. Three-dimensional plot of quadratic form.

Example: A three–dimensional plot of y = x^T A x is shown in Fig. 4.1 for A given by

A = [ 2  1 ;  1  2 ].    (4.8)

Note that this curve is elliptical in cross-section in a plane y = k, as discussed above. A quick calculation verifies the eigenvalues of A are 3, 1 with corresponding eigenvectors [1, 1]^T and [1, −1]^T. For y = k = 1, the lengths of the principal axes of the ellipse are then 1/√3 and 1. It can be verified from the figure that the principal axis lengths are indeed the lengths indicated, and are lined up along the directions of the eigenvectors as required.

Positive definiteness of A in the quadratic form x^T A x is the matrix analog to the scalar a being positive in the scalar expression ax². The scalar equation y = ax² is a parabola which faces upwards if a is positive. Likewise, the equation y = x^T A x = Σ_{i=1}^n λi zi², where z = V^T x as before, is a multi-dimensional parabola. The parabola faces upwards in all directions if A is positive definite. If A is not positive (semi) definite, then some eigenvalues are negative and the curve faces down in the orientations corresponding to the negative eigenvalues, and up in those corresponding to the positive eigenvalues.

Theorem 5 A (square) symmetric matrix A can be decomposed into the
form A = BB T if and only if A is positive definite or positive semi–definite.

The matrix B is referred to as a square root factor of A.

Proof: (Necessary condition; i.e., if A = BB T , then A is positive definite.)


Let us define z as B T x. Then for any x,

xT Ax = xT BB T x
= zT z
≥ 0. (4.9)

Conversely (sufficient condition): Since A is square and symmetric, we can write A as A = V Λ V^T. Since A is positive definite by hypothesis, all the eigenvalues are positive and so we can write A = (V Λ^{1/2})(V Λ^{1/2})^T. Let us define B = V Λ^{1/2} Q^T, where Q is any matrix of appropriate size whose columns are orthonormal, such that Q^T Q = I. Then A = V Λ^{1/2} Q^T Q Λ^{1/2} V^T = B B^T. 

Recall that A is n × n; thus, Q can be of size m × n, where m ≥ n. It is


therefore clear that Q is not unique, and therefore it follows that the square
root factor B of A is unique only up to an orthogonal ambiguity.

The fact that A can be factored into the product of a square root factor and
its transpose in this way is the fundamental idea behind the Cholesky factorization,
which is a major topic of the following chapter.

4.3 The Gaussian Multi-Variate Probability Den-


sity Function

Here, we very briefly introduce this topic so we can use this material for
an example of the application of the Cholesky decomposition later in this
course, and also in least-squares analysis to follow shortly. This topic is a
good application of quadratic forms. More detail is provided in several books
[12, 13]. First we consider the uni–variate case of the Gaussian probability

distribution function (pdf). The pdf p(x) of a Gaussian-distributed random
variable x with mean µ and variance σ^2 is given as
 
    p(x) = ( 1/√(2πσ^2) ) exp( −(x − µ)^2 / (2σ^2) ).               (4.10)
This is the familiar bell-shaped curve. It is completely specified by two
parameters– the mean µ which determines the position of the peak, and the
variance σ 2 which determines the width or spread of the curve.

Now consider the more interesting multi-dimensional case. Consider a Gaussian-


distributed random vector x ∈ Rn with mean µ and covariance Σ. This form
of random variable is denoted as x ∼ N (µ, Σ), the N (·, ·) denoting a normal
distribution, which is an alternative designation for the Gaussian distribution. The
symbol ∼ indicates “is distributed as”.

The multivariate pdf describing x is


 
    p(x) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ).    (4.11)

We can see that the multi-variate case collapses to the uni-variate case when
the number of variables reduces to one. A plot of p(x) vs. x is shown in

Fig. 4.2, for a mean µ = 0 and covariance matrix Σ = Σ1 defined as

    Σ1 = [ 2  1
           1  2 ].                                                   (4.12)

Figure 4.2. A Gaussian probability density function with covariance matrix [2 1; 1 2].

Because the exponent in (4.11) is a quadratic form, the set of points satisfying
the equation (1/2)(x − µ)^T Σ^{−1}(x − µ) = k, where k is a constant, is an
ellipse. Therefore this ellipse defines a contour of equal probability density.


The interior of this ellipse defines a region into which an observation will
fall with a specified probability α which is dependent on k. This probability
level α is given as
    α = ∫_R (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2)(x − µ)^T Σ^{−1}(x − µ) ) dx,    (4.13)

where R is the interior of the ellipse. Stated another way, an ellipse is the
region in which any observation governed by the probability distribution
(4.11) will fall with a specified probability level α. As k increases, the
ellipse gets larger, and α increases. These ellipses are referred to as joint
confidence regions (JCRs) at probability level α.

The covariance matrix Σ controls the shape of the ellipse. Because the
quadratic form in this case involves Σ^{−1}, the length of the ith principal axis
is √(2kλ_i) instead of √(2k/λ_i), as it would be if the quadratic form were in
Σ. Therefore as the eigenvalues of Σ increase, the size of the JCRs increases
(i.e., the variances of the distribution increase) for a given value of k.

We now investigate the relationship of the covariances between the vari-


ables (i.e., off-diagonal terms of the covariance matrix) and the shape of
the Gaussian pdf. We have seen previously in Chapter 2 that covariance
is a measure of dependence between individual random variables. We have
also seen that as the off-diagonal covariance terms become larger, there is a
larger disparity between the largest and smallest eigenvalues of the covari-
ance matrix. Thus, as the covariances increase, the eigenvalues, and thus
the lengths of the semi-axes of the JCRs, become more disparate; i.e., the
JCRs of the Gaussian pdf become elongated. This behaviour is illustrated
in Fig. 4.3, which shows a multi–variate Gaussian pdf for a mean µ = 0
and for a covariance matrix Σ = Σ2 given as
 
    Σ2 = [ 2    1.9
           1.9  2   ].                                               (4.14)

Figure 4.3. A Gaussian pdf with larger covariance elements. The covariance matrix is
[2 1.9; 1.9 2].

Note that in this case, the covariance elements of Σ2 have increased substan-
tially relative to those of Σ1 in Fig. 4.2, although the variances themselves
(the main diagonal elements) have remained unchanged. By examining the
pdf of Figure 4.3, we see that the joint confidence ellipsoid has become
elongated, as expected. (For Σ1 of Fig. 4.2 the eigenvalues are (3, 1), and
for Σ2 of Fig. 4.3, the eigenvalues are (3.9, 0.1)). This elongation results
in the conditional probability p(x1 |x2 ) for Fig. 4.3 having a much smaller
variance (spread) than that for Fig. 4.2; i.e., when the covariances are
larger, knowledge of one variable tells us more about the other. This is how
the probability density function incorporates the information contained in
the covariances between the variables. With regard to Gaussian probabil-
ity density functions, the following concepts: 1) larger correlations between
the variables, 2) larger disparity between the eigenvalues, 3) elongated joint
confidence regions, and 4) lower variances of the conditional probabilities,
are all closely inter–related and are effectively different manifestations of
correlations between the variables.
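To make the link between larger covariances, eigenvalue disparity and elongated JCRs concrete, the following minimal NumPy sketch computes the eigenvalues of Σ1 and Σ2:

```python
# Sketch: compare the eigenvalue spread of the two covariance matrices above.
import numpy as np

Sigma1 = np.array([[2.0, 1.0], [1.0, 2.0]])
Sigma2 = np.array([[2.0, 1.9], [1.9, 2.0]])

for name, S in [("Sigma1", Sigma1), ("Sigma2", Sigma2)]:
    lam = np.linalg.eigvalsh(S)      # eigenvalues in ascending order
    print(name, "eigenvalues:", lam, " ratio:", lam[-1] / lam[0])
# Sigma1 -> (1, 3); Sigma2 -> (0.1, 3.9): a far more elongated joint confidence region.
```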

4.4 The Rayleigh Quotient

The Rayleigh quotient is a simple mathematical structure that has a great


many interesting uses. The Rayleigh quotient r(x) is defined as

    r(x) = (x^T A x) / (x^T x).                                      (4.15)
It is easily verified that if x is the ith eigenvector v_i of A (not necessarily
normalized to unit norm), then r(x) = λ_i:

    r(v_i) = (v_i^T A v_i) / (v_i^T v_i) = λ_i (v_i^T v_i) / (v_i^T v_i) = λ_i.      (4.16)

In fact, it can be shown by differentiating r(x) with respect to x, that x = v i


is a stationary point of r(x).

4.5 Methods for Computing a Single Eigen– Pair

4.5.1 The Rayleigh Quotient Method:

The Rayleigh quotient discussed above leads naturally to an iterative method


for computing an eigenvalue/eigenvector pair of a square symmetric matrix
A. If x is an approximate eigenvector, then r(x) gives us a reasonable
approximation to the corresponding eigenvalue. Further, the inverse pertur-
bation theory of Golub and Van Loan [1] says that if µ is an approximate eigenvalue,
then the solution to (A − µI)z = b, where b is an approximate eigenvector, gives
us a better estimate of the eigenvector. These two ideas lead to the following
Rayleigh Quotient technique for calculating an eigenvector/eigenvalue pair:

initialize x0 to an appropriate value; set ||x0 ||2 = 1.


for k = 0, 1, . . . ,
µk = r(xk )
Solve (A − µk I)z k+1 = xk for z k+1
xk+1 = z k+1 /||z k+1 ||2

This procedure exhibits cubic convergence to the eigenvector. At conver-
gence, µ is an eigenvalue, and z is the corresponding eigenvector. Therefore
the matrix (A − µI) is singular and z is in its nullspace. The solution z be-
comes extremely large and the system of equations (A−µI)z = x is satisfied
only because of numerical error, since x should normally be 0. Nevertheless,
accurate values of the eigenvalue and eigenvector are obtained.
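A minimal NumPy sketch of this Rayleigh quotient technique is given below; the function name, test matrix and iteration count are illustrative choices, and the linear solve is guarded because (A − µI) becomes numerically singular at convergence, as just discussed.

```python
# Rayleigh quotient iteration for a symmetric matrix: a minimal sketch of the schema above.
import numpy as np

def rayleigh_quotient_iteration(A, x0, n_iter=10):
    x = x0 / np.linalg.norm(x0)
    mu = float(x @ A @ x)                 # r(x) = x^T A x / x^T x, with ||x||_2 = 1
    for _ in range(n_iter):
        try:
            z = np.linalg.solve(A - mu * np.eye(A.shape[0]), x)
        except np.linalg.LinAlgError:
            break                         # (A - mu I) exactly singular: converged
        x = z / np.linalg.norm(z)
        mu = float(x @ A @ x)
    return mu, x

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
mu, v = rayleigh_quotient_iteration(A, x0=np.array([1.0, 0.0, 0.0]))
print(mu, v)                              # one eigen-pair of A (which one depends on x0)
```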

4.5.2 The Power Method:

This is a technique for computing the dominant eigenvalue λ1 (i.e., the


largest in absolute value) and its corresponding eigenvector. With this
method, the dominant eigenvalue must be unique. If A is not symmetric,
we must also assume its eigenvectors are linearly independent.

We start with an initial guess x(0) of the eigenvector. Because the eigenvec-
tors are linearly independent, we can express x(0) in the eigenvector basis
V = [v 1 , v 2 . . . , v n ] as
    x(0) = Σ_{j=1}^n c_j v_j.                                        (4.17)

where the cj are the coefficients of x(0) in the basis V . We postmultiply A


by x(0) to get x(1) = Ax(0). Substituting (4.17) for x(0) we have
    x(1) = Σ_{j=1}^n c_j A v_j = Σ_{j=1}^n c_j λ_j v_j.

We continue post–multiplying in this manner, and at the kth iteration we


have x(k) = Ax(k − 1) or
    x(k) = Σ_{j=1}^n c_j (λ_j)^k v_j.                                (4.18)

We multiply the above by (λ_1/λ_1)^k = 1 to obtain

    x(k) = λ_1^k Σ_{j=1}^n c_j (λ_j/λ_1)^k v_j  −→  λ_1^k c_1 v_1,   (4.19)

as k becomes large. The result on the right follows because the terms
(λ_j/λ_1)^k → 0, j ≠ 1, as k becomes large, since λ_1 is the largest eigenvalue
in absolute value. When (4.19) is satisfied, the method has converged, and we have
x(k + 1) = λ_1 x(k). At this point, λ_1 is revealed, and v_1 = x(k + 1).

There is a practical matter remaining, and that is from (4.18) we see that
x(k) can become very large or very small as k increases, depending on whether
the λ’s are greater than or less than 1 in magnitude, leading to floating point over– or
underflow. This situation is easily remedied by replacing x(k) with x(k)/||x(k)||_2
at each iteration. Other scaling options are possible.

The schema for the power method is therefore given as follows:

1. Initialize x(0) to some suitable value. Often setting all the elements
to one is a good choice. Initialize k = 0.

2. x(k+1)=Ax(k).

3. set µ(k + 1) = ||x(k + 1)||_2 and replace x(k + 1) with x(k + 1)/µ(k + 1).

4. if |µ(k + 1) − µ(k)| < ε, where ε is some suitably small threshold,

(a) λ1 = µ(k + 1)
(b) v 1 = x(k + 1).
(c) return.

5. k = k + 1

6. go to step 2.
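The schema above can be transcribed almost line for line; the following is a minimal NumPy sketch with an illustrative test matrix and tolerance.

```python
# Power method sketch following the schema above.
import numpy as np

def power_method(A, eps=1e-10, max_iter=1000):
    x = np.ones(A.shape[0])                 # step 1: initialize x(0) to all ones
    mu_prev = 0.0
    for _ in range(max_iter):
        x = A @ x                           # step 2: x(k+1) = A x(k)
        mu = np.linalg.norm(x)              # step 3: mu(k+1) = ||x(k+1)||_2
        x = x / mu                          #         normalize to avoid over/underflow
        if abs(mu - mu_prev) < eps:         # step 4: convergence test
            break
        mu_prev = mu
    return mu, x                            # lambda_1 (in absolute value) and v_1

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam1, v1 = power_method(A)
print(lam1, v1)
```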

Deflation: It turns out the power method can be applied sequentially to


calculate the eigen–pairs λ2 , v 2 , . . . using a straightforward deflation proce-
dure. We rewrite the relation A = V ΛV T using the outer product rule for
matrix multiplication and using the fact that Λ is diagonal, in the form
    A = Σ_{i=1}^n λ_i v_i v_i^T.

The power method as described above allows us to specify the first term in
this expansion. We can therefore define a deflated version Adef as

Adef = A − λ1 v 1 v T1 .

It is noted that the largest eigenvalue of Adef is now λ2 . Therefore the


power method described above may be applied to Adef to yield λ2 and v 2 .
This process as described may then be applied sequentially to determine as
many eigen–pairs of A as desired.
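The deflation step can be sketched as follows (a self–contained illustration; power_method is the same routine as in the previous sketch, and the test matrix is arbitrary):

```python
# Sketch of the deflation procedure described above.
import numpy as np

def power_method(A, eps=1e-10, max_iter=1000):
    x = np.ones(A.shape[0])
    mu_prev = 0.0
    for _ in range(max_iter):
        x = A @ x
        mu = np.linalg.norm(x)
        x = x / mu
        if abs(mu - mu_prev) < eps:
            break
        mu_prev = mu
    return mu, x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam1, v1 = power_method(A)
A_def = A - lam1 * np.outer(v1, v1)     # deflated matrix  A - lambda_1 v_1 v_1^T
lam2, v2 = power_method(A_def)          # its dominant eigen-pair approximates (lambda_2, v_2)
print(lam1, lam2)
```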

One caveat regarding this sequential power method is that if the dimension
of A is large, then small errors due to floating point error etc. in the early
eigen–pair estimates can compound, leaving the eigen–pair estimates of the
later stages inaccurate. If the complete eigendecomposition is desired, then
there are better methods which use the QR decomposition (to be described
later) for finding the complete eigendecomposition.

It is straightforward to show that the smallest eigen–pair may be determined


using the power method on the matrix A^{−1} instead of A. The proof is left
as an exercise. Hint: What are the eigenvalues of A^{−1} relative to those of A?

Appendix

4.6 Alternate Differentiation of the Quadratic Form

We see that the quadratic form is a scalar. To differentiate a scalar with


respect to the vector x, we differentiate with respect to each element of x
in turn, and then assemble all the results back into a vector. We proceed as
follows:

We write the quadratic form as


    x^T A x = Σ_{i=1}^n Σ_{j=1}^n x_i x_j a_{ij}.                    (4.20)

When differentiating the above with respect to a particular element xk , we


need only consider the terms when either index i or j equals k. Therefore:
 
    d/dx_k (x^T A x) = d/dx_k [ Σ_{j≠k} x_k x_j a_{kj} + Σ_{i≠k} x_i x_k a_{ik} + x_k^2 a_{kk} ]      (4.21)

where the first term of (4.21) corresponds to holding i constant at the value
k, and the second corresponds to holding j constant at k. Care must be
taken to include the term x2k akk corresponding to i = j = k only once;
therefore, it is excluded in the first two terms and added in separately. Eq.
(4.21) evaluates to

    d/dx_k (x^T A x) = Σ_{j≠k} x_j a_{kj} + Σ_{i≠k} x_i a_{ik} + 2 x_k a_{kk}
                     = Σ_j x_j a_{kj} + Σ_i x_i a_{ik}
                     = [A x]_k + [A^T x]_k
                     = [(A + A^T) x]_k
                     = [2 A x]_k        (since A is symmetric)

where (·)k denotes k th element of the argument. By assembling these indi-


vidual terms corresponding to k = 1, . . . , n back into a vector, we have the

result that

    d/dx (x^T A x) = 2 A x.

It is interesting to find the stationary points of the quadratic form subject


to a norm constraint; i.e., we seek the solution to

    max_{||x||_2^2 = 1} x^T A x.

To solve this, we form the Lagrangian

    x^T A x + λ (1 − x^T x).

Differentiating, and setting the result to zero (realizing that d/dx (x^T x) = 2x), gives

    A x = λ x.
Thus, the eigenvectors are stationary points of the quadratic form, and the
x which gives the maximum (or minimum), subject to a norm constraint, is
the maximum (minimum) eigenvector of A.

4.7 Problems

1. Consider the following generalization of the power method, which


determines the vectors u1 , v 1 and the quantity σ1 of the SVD of a
matrix A:

(a) Initialize x(0) to some suitable value, assign i = 0:


(b) Repeat until convergence
    • y^(i+1) = A x^(i)
    • x^(i+1) = A^T y^(i+1)
    • Normalize: x^(i+1) ← x^(i+1) / ||x^(i+1)||_2
    • i ← i + 1

After convergence, u_1 = y, v_1 = x and σ_1 = sqrt( ||x^(i+1)||_2 / ||x^(i)||_2 )
(values taken before normalization).

(a) Prove the method works.


(b) Explain how to modify the method to find the 2nd–largest SVD
component.
(c) Why is the normalization step necessary?
(d) What happens if σ1 = σ2 ?

2. Consider the inverse power method for computing the smallest eigen–
pair of a matrix A. Show that convergence can be significantly ac-
celerated by replacing A with A − γI, where γ is an estimate of the
smallest eigenvalue, before inversion of A.

3. Consider the multi–variate quadratic equation given as

    (1/2) x^T A x + b^T x + c = 0,
where x ∈ Rn and A is positive definite.

(a) Develop a closed–form solution for the value x which minimizes


this equation.
(b) Explain what happens if A is mixed–definite or negative definite.
(c) Explain what happens when A is positive semi–definite?

4. Prove that the diagonal elements of a positive definite matrix must be
positive.

5. The χ^2 distribution with n degrees of freedom is the probability den-
   sity function of the quantity Σ_{i=1}^n x_i^2, where the x_i are independent,
   identically distributed (iid ) Gaussian random variables with zero mean
   and unit variance. Prove that the quantity (y − µ)^T Σ^{−1} (y − µ) is χ^2
   distributed with n degrees of freedom, where y ∈ R^n is a Gaussian
   random vector with mean µ and covariance matrix Σ.

6. Sketch the joint confidence region of the Gaussian probability distri-


bution for µ = [1, 1]T and Σ given by (4.12) at the α = 95% level,
showing all relevant values.

Chapter 5

Gaussian Elimination and


Associated Numerical Issues

In this chapter we briefly discuss floating point number systems in comput-


ers, and the effect of errors due to floating point representation on algebraic
systems. We look at Gaussian elimination with a view towards the effect of
floating point error, including the important idea of the condition number
of a matrix.

5.1 Floating Point Arithmetic Systems

A real number x can be represented in floating point form (denoted f l(x))


as
    fl(x) = s · f · b^k                                              (5.1)

where

s = sign bit = ±1
f = fractional part of x of length t bits
b = machine base = 2 for binary systems
k = exponent

Note that the operation fl(x)(i.e., conversion from a real number x to its
floating point representation) maps a real number x into a set of discrete
points on the real number line. These points are determined by (5.1). This
mapping has the property that the separation between points is proportional
to |x|. Because the operation fl(x) maps a continuous range of numbers into
a discrete set, there is error associated with the representation fl(x).

In the conversion process, the exponent is adjusted so that the most signif-
icant bit (msb) of the fractional part is 1, and so that the binary point is
immediately to the right of the msb. For example, the binary number

x = .0000100111101011011 (5.2)

could be represented as a floating point number with t = 9 bits as:

1.00111101 × 2−5 .

Since it is known that the msb of the fractional part is a one, it does not need
to be physically present in the actual floating-point number representation.
This way, we get an extra bit, “for free”. This means the number x in (5.2)
may be represented as
    00111101 × 2^{−5}
    |___ f ___|        ↑ leading 1 assumed present

This form takes only 8 bits instead of 9 to represent fl(x) with the
same precision.

The range of possible real numbers which can be mapped into the represen-
tation |fl(x)| is:

    1.00 . . . 00 × 2^L  ≤  |fl(x)|  ≤  1.11 . . . 11 × 2^U        (fractional parts of t bits)

where L and U are the minimum and maximum values of the exponent,
respectively. Note that any arithmetic operation which produces a result
outside of these bounds results in a floating point overflow or underflow error.

Note that because the leading one in the most significant bit position is
absent, it is now impossible to represent the number zero. Thus, a special
convention is needed. This is usually done by reserving a special value of
the exponent field.

Machine Epsilon u

Since the operation f l(x) maps the set of real numbers into a discrete set,
the quantity f l(x) involves error. The quantity machine epsilon, represented
by the symbol u is the maximum relative error possible in f l(x).

If we have a real number y and an approximation ŷ to it, then the relative


error r in the representation of y is given as
    r = |y − ŷ| / |y|.
We see that r is the normalized version of the absolute error given by |y− ŷ|.
This normalization is important, since, e.g., if y is a time measurement of
say one millisecond, then an error of 1/2 a millisecond would usually be
considered a sizeable error. On the other hand, if y were measured in years,
an error of 1/2 a millisecond would usually be considered very small indeed,
yet the absolute error is the same in each case. Relative error, due to its
normalization, removes this difficulty.

The relative error r in the quantity f l(x) is given by


    r = |fl(x) − x| / |x|.                                           (5.3)

But u is the maximum relative error. Therefore


    u = max r = ( max |fl(x) − x| ) / ( min |x| )
              = ( 0.00 . . . 0 1111 . . . ) / 1       (the leading t bits of the numerator are zero)
              → 2^{1−t}

if the machine chops. By “chopping”, we mean the machine constructs the
fractional part of fl(x) by retaining only the most significant t bits, and
truncating the rest. If the machine rounds, then the relative error is one
half that due to chopping; hence
    u = 2^{−t}

if the machine rounds. Thus, the number fl(x) may be represented as
fl(x) = x(1 + ε), where |ε| ≤ u.
It is also noteworthy that if we perform operations on a sequence of n floating
point numbers, then the worst–case error accumulates; i.e., if we evaluate
Σ_{i=1}^n x_i, then it is possible in the worst case that each of the x_i is subject
to the maximum relative error of the same sign, in which case the maximum
relative error of the sum becomes nu. This result holds for both addition
and subtraction operations. It may also be shown that the same result also
holds (to a first order approximation) for both multiplication and division
operations.

In an actual computer implementation, s is a single bit (usually 0 to indi-


cate a positive number, and 1 to represent a negative number). In single
precision, the total length of a floating point number is typically 32 bits. Of
these, 8 are used for the exponent k, one for s, leaving 23 for the fractional
part f (t = 24 bits effective precision). This means for single precision
arithmetic with chopping, u = 2^{−23} ≈ 1.19 × 10^{−7}.
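As an illustrative check (assuming NumPy), the machine epsilon of the IEEE single– and double–precision formats can be queried directly; np.finfo(...).eps reports the spacing between 1.0 and the next representable number, i.e. 2^{1−t}, which matches the chopping value of u above.

```python
# Sketch: machine epsilon of IEEE single and double precision floating point.
import numpy as np

print(np.finfo(np.float32).eps)    # 2**-23 ≈ 1.19e-07  (t = 24 effective bits)
print(np.finfo(np.float64).eps)    # 2**-52 ≈ 2.22e-16  (t = 53 effective bits)

# The spacing between representable numbers grows in proportion to |x|:
x = np.float32(1000.0)
print(np.nextafter(x, np.float32(np.inf)) - x)   # much larger than eps near x = 1000
```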

Absolute Value Notation

It turns out that in order to perform error analysis on floating- point matrix
computations, we need the absolute value notation:

If A and B are in R^{m×n}, then

    B = |A|  ⇒  b_ij = |a_ij|,   i = 1 : m,  j = 1 : n;
    also B ≤ A  ⇒  b_ij ≤ a_ij,   i = 1 : m,  j = 1 : n.
This notation is used often throughout this chapter.

From our discussion on floating point numbers, we then have


|f l(A) − A| ≤ u|A|. (5.4)

5.1.1 Catastrophic Cancellation

Significant reduction in precision may result when subtracting two nearly


equal floating-point numbers. If the fractional part of two numbers A and
B are identical in their first r digits (r ≤ t), then fl(A − B) has only t − r
bits significance; i.e., we have lost r bits of significance in representing the
difference. As r approaches t, the difference has very few significant bits in
its fractional part. This reduction in precision is referred to as catastrophic
cancellation.

We can demonstrate this phenomenon by example as follows: Let A and


B be two numbers whose fractional parts are identical in their first r = 7
digits. Then for the case t = 10 and b = 2 (binary arithmetic)

              |← r bits →|
    frac(A) =  1011011     101
    frac(B) =  1011011     001

where frac(·) is the fractional part of the number. Because the numbers
are nearly equal, it may be assumed that their exponents have the same
value. Then, we see that the difference frac(A − B) is (100)2 , which has only
t − r = 3 bits significance. We have lost 7 bits of significance in representing
the difference, which results in a drastic increase in u. Thus the difference
can be in significant error.

Another example of catastrophic cancellation is as follows: Find roots of the


quadratic equation
x2 + 1958.63x + 0.00253 = 0

Solution:

    x = ( −b ± √(b^2 − 4ac) ) / (2a)                                 (5.5)

computed roots: x1 = −1958.62998, x2 = −0.00000150


true roots: x1 = −1958.6299, x2 = −0.0000012917

There are obviously serious problems with the accuracy of x2, which corre-
sponds to the “+” sign in (5.5) above. In this case, since b^2 ≫ 4ac, we have
√(b^2 − 4ac) ≈ b. Hence, we are subtracting two nearly equal numbers when calculating x2,
which results in catastrophic cancellation.
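A standard remedy, sketched below in Python (the function name is illustrative), is to compute the larger–magnitude root from (5.5) with the sign that avoids cancellation, and to recover the other root from the product of the roots, x1 x2 = c/a.

```python
# Sketch: avoiding catastrophic cancellation when solving a x^2 + b x + c = 0.
import math

def quadratic_roots(a, b, c):
    d = math.sqrt(b * b - 4.0 * a * c)
    # Choose the sign that adds magnitudes (no cancellation) for the first root...
    x1 = (-b - d) / (2.0 * a) if b >= 0 else (-b + d) / (2.0 * a)
    # ...and obtain the second root from the product of the roots, x1 * x2 = c / a.
    x2 = c / (a * x1)
    return x1, x2

print(quadratic_roots(1.0, 1958.63, 0.00253))

# Naive formula for comparison: the "+" root loses roughly nine significant
# digits to cancellation of -b against the nearly equal square root.
a, b, c = 1.0, 1958.63, 0.00253
d = math.sqrt(b * b - 4 * a * c)
print((-b + d) / (2 * a))
```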

Another example of catastrophic cancellation is in evaluating the inner


product of two nearly orthogonal vectors. In this case the inner product
x^T y = Σ_{i=1}^n x_i y_i ≈ 0. So we can express the inner product as

    x^T y = Σ_{i∈P} x_i y_i + Σ_{i∈N} x_i y_i ≈ 0,                   (5.6)

where P, (N ) is the set of indexes representing positive (negative) terms.


Thus all products in the first sum are positive, and those in the second sum
are negative. Since each product xi yi can be arbitrarily larger in absolute
value than the overall result which is ≈ 0, the absolute value of each sum-
mation term above can be large, yet the final result is close to zero. Thus
(5.6) inherently involves the subtraction of two nearly equal numbers.

In the case where the vectors are not close to orthogonal, many of the
products in (5.6) have the same sign and the effect of catastrophic cancel-
lation is suppressed, and so there is little if any reduction in the number of
effective significant bits. In this case, the relative error in the inner product
can be expressed in the form
    | fl(x^T y) − x^T y |  ≤  nu | x^T y |.                          (5.7)

We see that by dividing each side by |xT y|, we get exactly the relative
error we would expect when representing a sum of n numbers in floating
point format. However, when the vectors become close to orthogonality, the
number of effective significant bits becomes reduced and so the bound of
(5.7) no longer applies as is. It is shown [1] that in this case,
    | fl(x^T y) − x^T y |  ≤  nu |x|^T |y| + O(u^2),                 (5.8)

where the absolute value notation of Sect.5.1 has been applied; i.e., we
consider |x| and |y|, which denotes the absolute value of the elements of the
vectors. The notation O(u2 ), read “order u squared”, indicates the presence
of terms in u2 and higher, which can be ignored due to the fact they may be
considered small in comparison to the first-order term in u. Hence (5.8) tells
us that if |xT y|  |x|T |y|, which happens when x is nearly orthogonal to y,
then the relative error in f l(xT y) may be much larger than the anticipated
result, which is that the error is upper bounded by nu. This is due to the
catastrophic cancellation implicitly expressed in the form of (5.6).

Fix: If the partial products are accumulated in a double precision register
(length of fractional part = 2t), little error results. This is because multipli-
cation of two t-digit numbers can be stored exactly in a 2t digit mantissa.
Hence, roundoff only occurs when converting to single precision, and the
result is significant to approximately t bits significance in single precision.

5.2 Gaussian Elimination

In this section, we discuss the concept of Gaussian elimination in some detail.


But first, we present a very quick review, by example, of the elementary
approach to Gaussian elimination. Given the system of equations

Ax = b

where A ∈ R3×3 is nonsingular. The above system can be expanded into


the form

    [ a11  a12  a13 ] [ x1 ]   [ b1 ]
    [ a21  a22  a23 ] [ x2 ] = [ b2 ] .
    [ a31  a32  a33 ] [ x3 ]   [ b3 ]

To solve the system, we transform this system into the following upper
triangular system by Gaussian elimination:
    
    [ a11  a12   a13  ] [ x1 ]   [ b1  ]
    [      a'22  a'23 ] [ x2 ] = [ b'2 ]     →   U x = b'            (5.9)
    [            a''33] [ x3 ]   [ b''3]

using a sequence of elementary row operations, as follows:


    row 2' := row 2 − (a21/a11) · row 1,
    row 3' := row 3 − (a31/a11) · row 1,

and

    row 3'' := row 3' − (a'32/a'22) · row 2'.                        (5.10)
The prime indicates the respective quantity has been changed. Each elemen-
tary operation preserves the solution of the original system of equations and

is designed to place a zero in the appropriate place below the main diagonal
of A.

Once A has been triangularized to yield the upper triangular matrix U ,


the solution x is obtained by applying backward substitution to the system
U x = b. With this procedure, xn is first determined from the last equation
of (5.9). Then xn−1 may be determined from the second-last row, etc. The
algorithm may be summarized by the following schema:

    for i = n, . . . , 1
        x_i := b_i
        for j = i + 1, . . . , n
            x_i := x_i − u_ij x_j
        end
        x_i := x_i / u_ii
    end
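A direct transcription of this schema (an illustrative sketch assuming NumPy, using zero–based indexing):

```python
# Back substitution for an upper triangular system U x = b (sketch of the schema above).
import numpy as np

def back_substitution(U, b):
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):       # i = n, ..., 1 in the schema
        x[i] = b[i]
        for j in range(i + 1, n):        # subtract the already-known unknowns
            x[i] -= U[i, j] * x[j]
        x[i] /= U[i, i]
    return x

U = np.array([[2.0, -1.0, 0.0],
              [0.0, -1.0, 1.0],
              [0.0,  0.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
x = back_substitution(U, b)
print(x, U @ x - b)                      # residual should be ~0
```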

What About the Accuracy of Back Substitution?

With operations on floating point numbers, we must be concerned about the


accuracy of the result, since the floating point numbers themselves contain
error. We want to know if it is possible that the small errors in the floating
point representation of real numbers can lead to large errors in the computed
result. In this vein, we can show [1] that the computed solution x̂ obtained
by back substitution satisfies the expression

(U + E)x̂ = b0

where |E| ≤ nu|U | + O(u2 ), and u is machine epsilon. The above equation
says that x̂ is the exact solution to a perturbed system. We see that all
elements of E are of O(nu), which is exactly the error expected in U due
to floating point error alone, with operations over n floating point numbers.
This is the best that can be done with floating point systems. It is worthy
of note that if elements of E have a larger magnitude, then the error in the
solution can be large, such as in the case with Gaussian elimination without
pivoting, as we see later. However in the case at hand, we can conclude that
back substitution is stable. By a numerically stable algorithm, we mean one
that produces relatively small errors in its output values for small errors in
the input values.

The total number of flops required for Gaussian elimination of a matrix
A ∈ R^{n×n} may be shown to be O(2n^3/3) (one “flop” is one floating point
operation; i.e., a floating point add, subtract, multiply, or divide). It is
operation; i.e., a floating point add, subtract, multiply, or divide). It is
easily shown that backward substitution requires O(n2 ) flops. Thus, the
number of operations required to solve Ax = b is dominated by the Gaussian
elimination process for moderate n.

5.2.1 The LU Decomposition

Suppose we can find lower and upper n × n triangular matrices L (with ones
along the main diagonal), and U respectively such that:

A = LU .

This decomposition of A is referred to as the LU decomposition. To solve


the system Ax = b, or LU x = b we define the variable z as z = U x and
then

solve Lz = b for z
and then U x = z for x.

Since both systems are triangular, they are easy to solve. The first system
requires only forward elimination; and the second only back-substitution.
Forward elimination is the analogous process to backward substitution, but
since it is performed on a lower triangular system, the unknowns are solved
in ascending order for forward elimination ( i.e., x1 , x2 , . . . , xn ) instead of
descending order (xn , xn−1 , . . . , x1 ) as in backward substitution. Forward
substitution requires an equal number of flops as back substitution and is
just as stable. Thus, once the LU factorization is complete, the solution of
the system is easy: the total number of flops required to solve Ax = b is
2n2 . The details of the computation of the LU factorization and the number
of flops required is discussed later.
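As an illustrative sketch of this two–step solve (assuming SciPy; scipy.linalg.lu also applies the row pivoting discussed later in this chapter):

```python
# Sketch: solving A x = b via the LU decomposition and two triangular solves.
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[ 2.0, -1.0, 0.0],
              [ 2.0, -2.0, 1.0],
              [-2.0, -1.0, 5.0]])
b = np.array([1.0, 2.0, 3.0])

P, L, U = lu(A)                                  # A = P L U (P carries the row interchanges)
z = solve_triangular(L, P.T @ b, lower=True)     # forward elimination:  L z = P^T b
x = solve_triangular(U, z, lower=False)          # back substitution:    U x = z
print(x, A @ x - b)                              # residual should be ~0
```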

We are led to several interesting questions:

1. How does one perform the LU decomposition?

2. How much computational effort is required to perform the LU decom-


position?

3. What is the relationship of LU decomposition, if any, to Gaussian
elimination?

4. Is the LU decomposition process numerically stable?

The answers to these questions are provided in the following sections.

5.2.2 Gauss Transforms

While the algorithmic description described above adequately describes the


Gaussian elimination process, it is preferable to describe the process using a
sequence of matrix operations. Not only is the resulting matrix description
more compact, but it leads to theoretical insights that are not possible oth-
erwise. In this vein, Gaussian elimination may be described as a sequence
of Gauss transformations M 1 . . . M n−1 ∈ Rn×n operating sequentially on
A ∈ Rn×n such that
M n−1 . . . M 2 M 1 A = U (5.11)
where U is the n × n upper triangular matrix yielded by Gaussian elimina-
tion process. The matrix M k introduces zeros below the main diagonal in
the (k − 1)th column of the version A(k−1) of A, which results after k − 1
previous transformations. Thus, after n − 1 such transformations, the re-
sulting product is upper triangular and the Gaussian elimination procedure
is complete. Now we look at the structure of M k in (5.11).

Suppose for k < n we have already determined Gauss transformations


M_{k−1} · · · M_1 so that the resulting matrix A^{(k−1)} has the form

    A^{(k−1)} = M_{k−1} · · · M_1 A = [ A11^{(k−1)}   A12^{(k−1)} ]  } k−1
                                      [      0        A22^{(k−1)} ]  } n−k+1        (5.12)
                                            k−1          n−k+1

where A11^{(k−1)} is upper triangular.

The fact that A11^{(k−1)} is upper triangular means that the decomposition of
(5.12) has already progressed (k − 1) stages, as indicated by the superscript
(k − 1). The next stage of Gaussian elimination proceeds one step to make
the first column of A22^{(k−1)} zero below the main diagonal element.

Let's see how this can be done by pre-multiplication of A^{(k−1)} by a matrix
M_k.

Define

    M_k = I − α^{(k)} e_k^T                                          (5.13)

where I is the n × n identity matrix and e_k is the kth column of I; i.e.,

    e_k^T = (0, . . . , 0, 1, 0, . . . , 0)        (1 in the kth position),

and

    α^{(k)} = (0, . . . , 0, l_{k+1,k}, . . . , l_{n,k})^T,          (5.14)

where

    l_{ik} = a_{ik}^{(k−1)} / a_{kk}^{(k−1)},     i = k + 1, . . . , n.      (5.15)

Note that the terms l_{ik} = a_{ik}^{(k−1)} / a_{kk}^{(k−1)} above are precisely the multipliers
required to introduce the required zeros, as in (5.10).

The pivot element: The quantity a_{kk}^{(k−1)}, which is the upper left–hand
element of A22^{(k−1)}, is the pivot element for the kth stage. This element
plays a strategically significant role in the Gaussian elimination process,
due to the fact it appears in the denominator of (5.15). We will see that
small pivot values lead to large elements in U and L and therefore have the
potential to lead to large errors in the solution x.

By evaluating (5.13), we see that M k has the following structure:

          [  1                                     ]
          [     . .                                ]
          [          1                             ]   ← kth row
    M_k = [        −l_{k+1,k}   1                  ]                 (5.16)
          [           ..             . .           ]
          [        −l_{n,k}    · · ·        1      ]
                      ↑
                  kth column
We can visualize the multiplication A^{(k)} = M_k A^{(k−1)} with the aid of Fig.
5.1. We assume the pivot element a_{kk}^{(k−1)} ≠ 0. It may be verified by in-
spection that the first k rows of the matrix product A(k) are unchanged
relative to those of A(k−1) , as is the lower left block of zeros. We may
gain appreciation for the operation of the Gauss transform, by considering
the most relevant part, which is in forming the kth column of the product
A(k) below the main diagonal. Here we take the inner product of the jth
row (j = k + 1, . . . , n) of M k with the kth column of A(k−1) . Here, the
−lj,k term of M k multiplies the pivot element of A(k−1) , which according to
(5.15) yields the term −aj,k . Due to the ”one” in the jth diagonal position
of M k , this result is then added to the element aj,k . The result over values
j = k + 1, . . . , n is that the kth column of A(k) is replaced with zeros below
the main diagonal, as desired. These arithmetic operations are identical to
those expressed by (5.10), except now we have been able to describe the process
using matrix multiplication.

Figure 5.1. Depiction of the multiplication M_k A^{(k−1)}, to advance the Gaussian elimination
process from step k − 1 to step k. The dotted lines in the upper right matrix A^{(k−1)}
show how the partitions advance from the (k − 1)th to the kth stage. The multiplication
process replaces the ×’s in the first column of A22^{(k−1)} with zeros, except for the pivot
element.

5.2.3 Recovery of the LU factors from Gaussian Elimination

We now discuss the relationship between Gaussian elimination and the LU
decomposition. Specifically, we investigate how to determine the L and U
factors of A in an efficient manner from the Gaussian elimination process.
We note that the Gaussian elimination process produces

    M_{n−1} · · · M_1 A = U                                          (5.17)

where U is the upper triangular matrix resulting from the Gaussian elimination
process. Each M_i is unit lower triangular (ULT) (ULT means lower
triangular with ones on the main diagonal), and it is easily verified that the
product of ULT matrices is also ULT. Therefore, we define a ULT matrix
L−1 as
M n−1 . . . M 1 = L−1 (5.18)
From (5.17), we then have L−1 A = U . But since the inverse of a ULT
matrix is also ULT, then
A = LU , (5.19)
which is the product of lower and upper triangular factors as desired. We
have therefore completed the relationship between LU decomposition and
Gaussian elimination. U is simply the upper triangular matrix resulting
from Gaussian elimination, and L is the inverse of the product of the M i ’s.

Efficient recovery of L: L can be recovered from the M_k's in a very


efficient manner without having to perform any explicit computation. The
reason is that the M k ’s have a very simple structure which can be exploited
to our advantage. We note from (5.18) that

L = M −1 −1
1 . . . M n−1 . (5.20)

Therefore, we formulate L efficiently in two steps: we first examine the


relationship between M_k^{−1} and M_k, k = 1, . . . , (n − 1), and then investigate
the structure of ∏_{k=1}^{n−1} M_k^{−1}.

The structure of M_k^{−1}: We note that

M k A(k−1) = A(k) (5.21)

The matrix A(k) is formed from A(k−1) by implicitly performing a subtrac-


tion operation as indicated by the structure of M k :

M k = I − α(k) eTk . (5.22)

The matrix M_k^{−1} must operate on A^{(k)} to restore A^{(k−1)}. Do you suppose
this could be achieved by implicitly performing an addition operation on
A^{(k)}? This insight is in fact correct. Consider M_k^{−1} of the following form:

    M_k^{−1} = I + α^{(k)} e_k^T.                                    (5.23)

We may prove this form is indeed the desired inverse, as follows. Using the
definition of M_k^{−1} from (5.23), we have

    M_k^{−1} M_k = (I + α^{(k)} e_k^T)(I − α^{(k)} e_k^T)
                 = I − α^{(k)} e_k^T + α^{(k)} e_k^T − α^{(k)} (e_k^T α^{(k)}) e_k^T      (5.24)
                 = I.

From (5.14), α^{(k)} has non-zero elements only for those indices which are
greater than k (i.e., below the main diagonal position). The only nonzero
element of eTk is in the k th position. Therefore, eTk α(k) = 0 as indicated.
Thus M −1k is given by (5.23). We therefore see, that by looking at the
structure of M k carefully, we can perform the inversion operation simply
by inverting a set of signs!

Structure of L = ∏_k M_k^{−1}: From (5.18) we have

    L = (M_{n−1} · · · M_1)^{−1}
      = M_1^{−1} · · · M_{n−1}^{−1}
      = ∏_{i=1}^{n−1} (I + α^{(i)} e_i^T)                            (5.25)

where the last line follows from (5.23). Eq. (5.25) may be expressed as
    L = I + Σ_{k=1}^{n−1} α^{(k)} e_k^T + cross-products of the form α^{(i)} e_i^T · α^{(j)} e_j^T      (5.26)

Using similar reasoning to that used in (5.24), it may be shown that the
cross-product terms in (5.26) are all zero. Therefore

    L = I + Σ_{i=1}^{n−1} α^{(i)} e_i^T.                             (5.27)

Each term α(k) eTk in (5.27) is a square matrix of zeros except below the main
diagonal of the k th column. Thus the addition operation in (5.27) in effect
inserts the elements of α(k) in the kth column below the main diagonal of L,
for k = 1, . . . n − 1, without performing any explicit arithmetic operations.

The addition of I in (5.27) puts 1’s on the main diagonal to complete the
formulation of L.

As an example, we note from (5.15) that L has the following structure, for
n = 4:

        [ 1                                                          ]
        [ a21^(0)/a11^(0)    1                                       ]
    L = [ a31^(0)/a11^(0)    a32^(1)/a22^(1)    1                    ]
        [ a41^(0)/a11^(0)    a42^(1)/a22^(1)    a43^(2)/a33^(2)    1 ]

Thus, given the sequence of Gauss transformations M 1 . . . M n−1 , we can


form the factor L without any explicit computations. The inverses are
accomplished simply by inverting a set of signs, and the multiplication is
performed by placing the nonzero elements of the α(k) ’s into their respective
positions in L. With this simple formulation of L, and the matrix U given by
(5.11), the relationship between Gaussian elimination and LU decomposition
is complete.

Discussion and examples

1. Note that in performing the sequence of Gauss transformations, we are


performing exactly the same arithmetic operations as with elementary
Gaussian elimination.

2. LU decomposition is a “high-level” description of Gaussian elimina-


tion. Matrix-level descriptions highlight connections between algo-
rithms that may appear quite different at the scalar level.
3. The Gaussian elimination process requires O(2n^3/3) flops. This is the
lowest number of any triangularization technique for square matrices
with no specific structure.

Example 1:

Let

        [  2  −1  0 ]
    A = [  2  −2  1 ]
        [ −2  −1  5 ]

We will apply Gauss transforms to effect the LU decomposition of A.

By inspection,

          [  1       ]
    M_1 = [ −1  1    ] = I − α^(1) e_1^T
          [  1  0  1 ]

            [ 2  −1  0 ]
    M_1 A = [ 0  −1  1 ] = A^(2)
            [ 0  −2  5 ]

Thus,

          [ 1        ]
    M_2 = [ 0   1    ]
          [ 0  −2  1 ]

and

                [ 2  −1  0 ]
    M_2 A^(2) = [ 0  −1  1 ] = U.
                [ 0   0  3 ]

What is L = M^{−1}?

    L = M_1^{−1} M_2^{−1} = I + Σ_{i=1}^{2} α^(i) e_i^T

Thus,

        [  1        ]
    L = [  1  1     ]
        [ −1  2  1  ]
           ↑  ↑
        α^(1) α^(2)

Note that LU does in fact = A.

Evaluation of Determinants: This example provides an avenue to il-


lustrate an efficient means of evaluating determinants. Recall det(AB) =
det(A) det(B). Here we have det(L) = 1 (since the determinant of a trian-
gular matrix is the product of its diagonal elements); also note that U is
triangular. Therefore det(A) = det(U) = ∏_i u_ii = −6. This is a far more
efficient method for evaluating determinants than the cofactor expansion
method of Chapter 1. The number of flops required for the cofactor expan-
sion method is proportional to n!, whereas this technique founded on the
Gaussian elimination process requires only ≈ 2n^3/3 operations.
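The example, and the determinant evaluation, can be reproduced with the following NumPy sketch, which forms each Gauss transform from the multipliers in (5.15) and accumulates L as in (5.27):

```python
# Sketch: LU decomposition of the example matrix via explicit Gauss transforms.
import numpy as np

A = np.array([[ 2.0, -1.0, 0.0],
              [ 2.0, -2.0, 1.0],
              [-2.0, -1.0, 5.0]])
n = A.shape[0]

Ak = A.copy()
L = np.eye(n)
for k in range(n - 1):
    e_k = np.eye(n)[k]
    alpha = np.zeros(n)
    alpha[k + 1:] = Ak[k + 1:, k] / Ak[k, k]   # multipliers l_{ik}, eq. (5.15)
    Mk = np.eye(n) - np.outer(alpha, e_k)      # M_k = I - alpha^(k) e_k^T
    Ak = Mk @ Ak                               # advance one elimination stage
    L += np.outer(alpha, e_k)                  # L = I + sum_k alpha^(k) e_k^T, eq. (5.27)

U = Ak
print(L)                        # [[1,0,0],[1,1,0],[-1,2,1]]
print(U)                        # [[2,-1,0],[0,-1,1],[0,0,3]]
print(np.allclose(L @ U, A))    # True
print(np.prod(np.diag(U)))      # det(A) = -6
```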

5.3 Numerical Properties of Gaussian Elimination

The numerical properties of Gaussian elimination can be described by the


following statement [1] :

Let L̂Û be the computed LU decomposition of A ∈ Rn×n . Then


ŷ is the computed solution to L̂ŷ = b, and x̂ the computed
solution to Û x̂ = ŷ. Then,

(A + E)x̂ = b,

where
|E| ≤ nu [3|A| + 5|L||U |] + O(u2 ). (5.28)

This analysis, as in the back substitution case, shows that x̂ exactly satisfies
a perturbed system. The question is whether the perturbation |E| is always
small. If |E| is of the order induced by floating point representation alone
(i.e., O(nu)), we may conclude that Gaussian elimination yields a solution
which is as accurate as possible in the face of floating point error. But unlike
the back substitution case, further inspection reveals that (5.28) does not
allow such an optimistic outlook. It may happen during the course of the
Gaussian elimination procedure that the term |L||U | may become large, if
small pivot elements are encountered, causing |E| to become large, as we
consider in the following:

By referring to (5.16), we can see that if any pivot a_{kk}^{(k−1)} is small in mag-
nitude, then the kth column of M_k is large in magnitude. Because M_k
premultiplies A^{(k−1)}, large elements in M_k will result in large elements in
the block A22^{(k)} of (5.12). The result is that both U and L will have large
elements as k varies over its range from 1, . . . n − 1. Hence, | E | in (5.28) is
“large”, resulting in an inaccurate solution.

The fact that large | L | and | U | lead to an unstable solution can also be
explained in a different way as follows. Consider two different LU decom-
positions on the same matrix A:

1. A = LU (large pivots)
2. A = ΛR (small pivots)

Two different LU decompositions on the same matrix can exist, because


in cases which are prone to numerical error it is possible to interchange
rows and columns of the A22^{(k−1)} block to place elements with either large or
small magnitude as desired into the pivot position, using row and column
interchanges, as described in more detail later. Generally, the elements lij
and uij of L and U respectively are small, whereas the elements λij and rij
of Λ and R are large in magnitude. Consider the (i, j)th element aij of A
computed according to the two different decompositions. We have
 T
T li = ith row of L
aij = li uj (5.29)
uj = jth column of U
and
aij = λi T rj

likewise. (5.30)

Let us assume that the pivots in the second case are small enough so that

    |λ_ij| and |r_ij| ≫ |a_ij|,    (i, j) ∈ [1, . . . , n].                        (5.31)

Thus (5.30) can be written in the form

aij = P + N (5.32)

where P , (N ) is the sum of all terms in (5.30) which are positive (negative).
Using arguments similar to those surrounding (5.8), we see that (5.31) im-
plies that both |P|, |N| ≫ |a_ij|. Thus when the pivots are sufficiently small

in magnitude, from (5.32) we see that two nearly equal numbers are being
subtracted, which leads to catastrophic cancellation, and ensuing numerical
instability.

We note however, that the P and N terms corresponding to (5.29) do not


satisfy (5.31), and as a result, little or no catastrophic cancellation arises
from the computation of (5.32). In this case the resulting system is stable.

Thus, for stability, large pivots are required. Otherwise, even well-conditioned
systems can have large error in the solution, when computed using Gaussian
elimination.

As an example of the effects of small pivots, consider the matrix A given by

 
        [ −0.2725  −2.0518   0.5080   1.1275 ]
    A = [  1.0984  −0.3538   0.2820   0.3502 ]
        [ −0.2779  −0.8236   0.0335  −0.2991 ]
        [  0.7015  −1.5771  −1.3337   0.0229 ],

which has been designed so that a single pivot element of very small mag-
nitude on the order of 10−13 appears in the (2, 2) position after the first
stage of Gaussian elimination. The L and U matrices which result after
completing the Gaussian elimination process without pivoting contain el-
ements with very large magnitude, on the order of 1012 . The computed
solution x obtained using the LU decomposition without pivoting, for b =
[−0.6888, 10.0022, −1.3670, −2.1863]T is given as

 
    x = [ 1.0826,  0.9882,  1.0622,  0.9704 ]^T,

whereas the true solution is [1, 1, 1, 1]T . The relative error in this computed
solution is 0.1082, which may be regarded as significant, depending on the
application. On the other hand, the solution obtained using the Matlab lin-
ear equation solver, which does use pivoting, yields the true solution within
a relative error of approximately u, which is 2.2204 × 10^{−16} on the Matlab
platform used to obtain these results.

5.3.1 Pivoting

We can greatly improve the numerical stability of Gaussian elimination us-


ing a pivoting process, the objective of which is to place the element with the
largest magnitude in the A22 block at the kth stage into the pivot position.
In this way, the Gaussian elimination is as stable as possible. This largest
element can be moved into the pivot position through a row– and column–
interchanges. In a manner similar to the Gaussian elimination description,
we wish to express these row– and column–interchanges using matrix opera-
tions. It is readily verified that row interchanges can be accomplished using
the following matrix multiplication:
A(ij) = P (ij) A
where A(ij) is the matrix A with rows i and j interchanged, and P (ij)
(referred to as a permutation matrix) is the identity matrix with rows i
and j interchanged. Likewise, to interchange columns k and l, the following
matrix multiplication is performed:
A(kl) = AΠ(kl)
where the permutation matrix Π(kl) is the identity matrix with columns k
and l interchanged.

Full pivoting, where both row and column interchanges are performed, is
stable yet expensive, since arithmetic comparisons are almost as costly as
flops, and many comparisons are required to search through the entire A22
block at each stage to search for the element with the largest magnitude.
Note that both row and column permutations take place to swap the re-
spective element into the pivot position. The number of comparisons can
be drastically reduced if only row permutations take place. That is, the
element with the largest magnitude in the leading column of the A22 block
is permuted into the pivot position using only row interchanges. The result,
which is known as partial pivoting, is almost as stable.

Note that the row– and column–interchange operations will destroy the in-
tegrity of the system of the original system of equations Ax = b. In ef-
fect, the matrix A has been replaced by the quantity P A Π, where P =
P_{n−1} · · · P_1, and Π = Π_1 · · · Π_{n−1}, where P_i (Π_i) is the respective row
(column) permutation matrix at the ith stage of the decomposition. There-
fore, a system of equations which is equivalent to the original can be written

in the form

    (P A Π)(Π^T x) = P b,

where we have made use of the fact that Π Π^T = I.^1 Thus, for every row
interchange we also exchange corresponding elements of b, and for every
column interchange we exchange corresponding elements of x.
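As an illustrative check (assuming SciPy), scipy.linalg.lu performs Gaussian elimination with partial pivoting and returns the row permutation explicitly; with partial pivoting all multipliers, and hence all elements of L, are bounded by 1 in magnitude.

```python
# Sketch: LU decomposition with partial pivoting (row interchanges only).
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

P, L, U = lu(A)                       # A = P L U, with row pivoting built in
print(np.allclose(P @ L @ U, A))      # True
print(np.max(np.abs(L)))              # <= 1: multipliers are bounded by the pivoting
```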

5.3.2 The Cholesky Decomposition [1]

We now consider several modifications to the LU decomposition, which ul-


timately lead up to the Cholesky decomposition. These modifications are
1) the LDM decomposition, 2) the LDL decomposition on symmetric matri-
ces, and 3) the LDL decomposition on positive definite symmetric matrices.
The Cholesky decomposition is relevant only for square symmetric positive–
definite matrices and is an important concept in its own right. Several
examples of the use of the Cholesky decomposition are provided at the end
of the section.

The LDM Factorization: If no zero pivots are encountered during the


Gaussian elimination process, then we can factor U so that

U = DM T

where D is diagonal and M is unit lower triangular (ULT). It is then ap-


parent that
A = LDM T .
The matrix M is calculated simply by dividing each row of U by its diag-
onal element dii , and then taking the transpose of the result. As expected,
the errors involved in solving a system of equations according to this factor-
ization behave in the same manner as with ordinary Gaussian elimination,
as in (5.28). Thus, pivoting is required for this case, to prevent growth
in computed versions of |L|, |D| or |M|. It is straightforward to show that
for a symmetric non-singular matrix A ∈ Rn×n , the factors L and M are
identical. This means that for a symmetric matrix A, the LU factorization
requires only n^3/3 flops, instead of 2n^3/3 as for the general case. This is because
only the lower factor need be computed.
1
The proof is left as an exercise.

Now consider the case where A is positive definite. Define the symmetric
part T and the asymmetric part S of A respectively as:
    T = (A + A^T)/2,        S = (A − A^T)/2.
It is shown [1] that the computed solution x̂ to a positive definite system of
equations satisfies
(A + E)x̂ = b
where
  
    ||E||_F ≤ u ( 3n ||A||_F + 5cn^2 ( ||T||_2 + || S T^{−1} S ||_2 ) ) + O(u^2)      (5.33)

where c is a constant of modest size. Eq. (5.33) is a significant result, since
it implies that when A is symmetric and positive definite, || S T^{−1} S ||_2 is
zero and the E matrix for the bound (5.33) is close to the error introduced

zero and the E matrix for the bound (5.33) is close to the error introduced
by floating–point representation alone. Also, since it is independent of the
factors L, D or M , the bound (5.33) is stable without pivoting. This is
because positive definite matrices tend to have larger magnitude elements
along the main diagonal, and hence the element with the largest magnitude
is already in the pivot position during the Gaussian elimination process.

Incorporating the discussion for the symmetric and positive–definite cases


together, we have the following:

The Cholesky Decomposition Itself: For A ∈ Rn×n symmetric and


positive definite, there exists a lower triangular matrix G ∈ Rn×n with
positive diagonal entries, such that A = GGT .

Proof: Consider A which is positive definite and symmetric. (Note that


covariance matrices fall into this class.) Therefore, x^T A x > 0 for 0 ≠ x ∈
R^n, and hence x^T L D L^T x > 0. If A is positive definite, then L is full
rank; let y = L^T x. Then, y^T D y > 0, if and only if all elements of D are
positive. Therefore, if A is positive definite, then d_ii > 0, i = 1, . . . , n.

Because A is symmetric, then A = L D L^T. Because the d_ii are positive,
we can take G = L · diag(√d_11, . . . , √d_nn). Then G G^T = A as desired.

Therefore, in solving the system Ax = b, where A is symmetric and positive


definite (e.g., for the case where A is a sample covariance matrix), the

Cholesky decomposition requires fewer flops than regular LU decomposition,
since a properly–designed algorithm can take advantage of the fact the two
factors are transposes of each other. Further, the factorization does not
require pivoting. Both these points result in significantly reduced execution
times.

Computation of the Cholesky Decomposition: An algorithm for com-


puting the Cholesky decomposition, which offers a faster computation time
over the conventional LU decomposition, is developed simply by direct com-
parison: e.g., in the 3 × 3 case we have:

    [ g11           ] [ g11  g21  g31 ]   [ a11  a12  a13 ]
    [ g21  g22      ] [      g22  g32 ] = [ a12  a22  a23 ]
    [ g31  g32  g33 ] [           g33 ]   [ a13  a23  a33 ]
                                          (symmetric, positive definite)

By following a proper order, each element of G may be determined in se-


quence, simply by comparing a particular element aij of A with the inner
product g Ti g j , where g Ti is taken to be the ith row of G. First, we may
determine g11 by comparison with a11 . Then, all remaining elements of the
first column of G may be determined once g11 is known. Then, g22 can be
determined, and the process repeats. For example,

    g_11^2 = a_11   →   g_11 = √a_11

Also,
    g_i1 = a_i1 / g_11,     i = 2, . . . , n.

Thus, all elements in first column of G can be solved. Now, consider the
second column. First, we solve g22 :
    g_21^2 + g_22^2 = a_22.

Thus,

    g_22 = ( a_22 − g_21^2 )^{1/2}
where the term in the round brackets is positive if A is positive definite.
Once g22 is determined, all remaining elements in the second column may

be found by comparison with corresponding element in the second column
of A. The third and remaining columns are solved in a similar way. If
the process works its way in turn through columns 1, . . . , n, each element
in G is found by solving a single equation in one unknown. Determining
each diagonal element involves finding a square root of a particular quantity.
This quantity is always positive if A is positive definite.
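The column–by–column procedure just described can be transcribed directly; the following is an illustrative sketch (in practice np.linalg.cholesky performs the same computation):

```python
# Sketch: Cholesky factorization A = G G^T computed column by column, as described above.
import numpy as np

def cholesky(A):
    n = A.shape[0]
    G = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal element: compare a_jj with the inner product of row j of G with itself
        G[j, j] = np.sqrt(A[j, j] - G[j, :j] @ G[j, :j])
        # remaining elements of column j: compare a_ij with g_i^T g_j
        for i in range(j + 1, n):
            G[i, j] = (A[i, j] - G[i, :j] @ G[j, :j]) / G[j, j]
    return G

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])
G = cholesky(A)
print(np.allclose(G @ G.T, A))                  # True
print(np.allclose(G, np.linalg.cholesky(A)))    # matches the library's lower factor
```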

If A = GGT , then to solve the system Ax = b we first solve


Gz = b for z
then
GT x = z for x
If the positive square root is always taken in the computation of the Cholesky
factorization, then the Cholesky factorization is unique. The Cholesky de-
composition A = GGT is a matrix analog of a scalar square-root operation.
Note however that a square root factor of a positive definite matrix is not
unique. The matrix GQ, where Q is any orthonormal matrix of dimension
n × k, where k ≥ n is also a square root factor. Another square root matrix
of A is given as V ·diag(λ1 , λ2 , . . . , λn )1/2 , where the v i and λi are the eigen-
vectors and eigenvalues of A, respectively. The uniqueness of the Cholesky
factor is a result of it being lower triangular with positive diagonal elements.
The advantage of the Cholesky factorization is that it is easier to compute
compared to other square root factors.

5.3.3 Application of the Cholesky Decomposition

Generating a vector process with a prescribed covariance: We


may use the Cholesky decomposition to generate a random vector process
x ∈ Rn with a desired covariance matrix Σ ∈ Rn×n . Since Σ must be
symmetric and positive definite, let
Σ = GGT
be the Cholesky factorization of Σ. Let w ∈ Rn be a white random vector
such that E(wwT ) = I. Such w’s are easily generated by random number
generators on the computer, such as the command “randn” in Matlab.

Then, define x as:


x = Gw

The vector process x has the desired covariance matrix because

E(xxT ) = E(GwwT GT )
= GE(wwT )GT
= GGT
= Σ.

This procedure is particularly useful for computer simulations when it is


desired to create a random vector process with a specified covariance matrix.
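A minimal NumPy sketch of this procedure (the covariance matrix, sample size and seed are arbitrary):

```python
# Sketch: generating a random vector process with a prescribed covariance matrix.
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
G = np.linalg.cholesky(Sigma)            # Sigma = G G^T, G lower triangular

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 100000))     # white vectors w with E(w w^T) = I, as columns
X = G @ W                                # x = G w has covariance G I G^T = Sigma

print(np.cov(X))                         # sample covariance, close to Sigma
```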

Whitening a noise process: Consider the MUSIC example discussed in


Chapter 2. In this case we observe the vector process

xi = S(θ)ai + ni (5.34)

where in this case we assume the noise covariance matrix is Σ, which is


assumed known. As a general rule, and as illustrated in Chapter 6 on
least squares analysis, estimation of parameters in non–white noise results
in increased variances, relative to the case where the noise is white. We
therefore wish to whiten the noise before the estimation process begins. In
this vein, let G be the Cholesky factorization of Σ such that GGT = Σ.
Premultiply both sides of (5.34) above by G−1 :

G−1 xi = G−1 S(θ)ai + G−1 ni (5.35)

The noise component is now G−1 ni . The corresponding noise covariance


matrix is

E(G−1 ni nTi G−T ) = G−1 E(nnT )G−T


= G−1 ΣG−T
= G−1 GGT G−T
= I (5.36)

Thus, by premultiplying the original signal x by the inverse Cholesky factor


of the noise, the resulting noise becomes white with unit variance. Since the
covariance matrix is diagonal, the elements of n become uncorrelated due
to the whitening operation.

We note that as a consequence of this whitening process, the signal compo-
nent has also been transformed by G−1 . Therefore, in the specific case of
the MUSIC algorithm, we must therefore substitute G−1 S for S to achieve
correct results.

Note that in both the above cases, any square root matrix B such that
BB T = Σ will achieve the same effects. However, the Cholesky factor is
typically the easiest one to compute.
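A minimal sketch of the whitening operation (only the noise transformation is shown; the covariance values and sample size are illustrative, and G^{−1} is applied via a triangular solve rather than by forming the inverse explicitly):

```python
# Sketch: whitening noise with a known covariance Sigma using its Cholesky factor.
import numpy as np
from scipy.linalg import solve_triangular

Sigma = np.array([[2.0, 1.9],
                  [1.9, 2.0]])
G = np.linalg.cholesky(Sigma)                    # Sigma = G G^T

rng = np.random.default_rng(1)
N = G @ rng.standard_normal((2, 100000))         # coloured noise with covariance Sigma

N_white = solve_triangular(G, N, lower=True)     # apply G^{-1} to each noise vector
print(np.cov(N_white))                           # close to the identity matrix
```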

5.4 The Sensitivity of Linear Systems

Up to now, we have quantified the effective error in the matrix A due to


the effects of potential catastrophic cancellation and to the inherent error in
floating point number systems. Examples of these effects are the E–matrices
as specified in (5.33) and (5.28).

But the capability to quantify error does not address the complete problem.
What we also need to know is how sensitive is the solution x to error in the
quantities A and b. In this respect, in this section, we develop the idea of
the matrix condition number κ(A) of a matrix A.

Consider the system of linear equations

Ax = b (5.37)

where A ∈ Rn×n is nonsingular, and b ∈ Rn . How do perturbations in A or


b affect the solution x?

To gain insight, we consider several situations where small perturbations


can induce large errors in x. For the first example, we perform the singular
value decomposition on A:

A = U ΣV T . (5.38)

Let us now consider a perturbed version Ã, (following the method of (5.28)
or (5.33)), where à = A + E, and as before E is an error matrix and
 controls the magnitude of error. In this example let E be taken as the
outer product E = un v Tn . Then, the singular value decomposition of à is

149
identical to that for A, except the transformed σn , denoted σ˜n , is replaced
with σn + .

Because x = A−1 b, we have

x = V Σ−1 U T b,

or, using the outer product representation for matrix multiplication we have
n
X ui T b
x= vi . (5.39)
σi
i=1

If we assume σn to be small in comparison to σ1 , and that σn and epsilon


are of the same order of magnitude, then σ̃n contains large relative error.
Further, since the term for i = n in (5.39) (i.e., the one which has been
significantly perturbed) contributes strongly to x, the computed version x̂
of x is strongly perturbed. Therefore a small change in A can result in large
changes in the solution x.

The “Useful Theorem” of Sect. 3.6 provides an additional viewpoint. Here,


we see that the smallest singular value σn is the 2-norm distance of A from
the set of singular matrices. Consider the matrix An−1 defined in the the-
orem. Then An−1 is the closest singular matrix in the 2–norm sense to A,
and this 2–norm distance is σn . Thus, if σn is small, then A is close to sin-
gularity, in which case the solution can vary arbitrarily. Thus as σn becomes
smaller, the computed x becomes more sensitive to changes in either A or
b.

These examples indicate that a small σn can cause large errors in x. But
we don’t have a precise idea of what “small” means in this context. “Small”
relative to what? The following section addresses this question.

Derivation of Condition Number:

Consider the perturbed system where there are errors in both A and b. Here,
the notation is simpler if we denote the errors as δA and δb1 respectively.
The perturbed system becomes

(A + δA)(x + δx) = b + δb1

We can write the above as

A(x + δx) = b + δb1 − δb2

where δb2 = δA(x + δx) ≈ δAx to a first–order approximation. The error


δb2 is the error in A transformed to appear as an error in b. Defining

δb = δb1 − δb2 we have

A(x + δx) = b + δb.

We therefore have the following two equations:

Ax = b    =⇒  x = A^{-1} b                                         (5.40)
Aδx = δb  =⇒  δx = A^{-1} δb.                                      (5.41)

We now consider the worst possible relative error ||δx||_2 / ||x||_2 in the solution x in the 2–norm sense. This occurs when the direction of δb from (5.41) is such that ||δx||_2 is maximum, and simultaneously, when b from (5.40) is in the direction such that the corresponding ||x||_2 is minimum.

Note the largest singular value of A^{-1} is 1/σ_n, and the smallest is 1/σ_1. Likewise, the u–vector associated with the largest singular value of A^{-1} is u_n, and the one associated with the smallest is u_1. With this in mind, it is straightforward to show from the ellipsoidal interpretation of the SVD of A^{-1}, as shown in Fig. 1, Sect. 3.5, that the maximum of ||δx||_2 = ||A^{-1} δb||_2 with respect to δb, for ||δb||_2 held constant, occurs when δb aligns with the vector u_n; i.e.,

max_{δb} ||A^{-1} δb||_2 = max_{δb} ||V \Sigma^{-1} U^T δb||_2
                         = max_{δb} ||\Sigma^{-1} U^T δb||_2
                         = ||\Sigma^{-1} U^T u_n||_2 \, ||δb||_2
                         = \left\| \Sigma^{-1} [0, \ldots, 0, ||δb||_2]^T \right\|_2
                         = \frac{1}{\sigma_n} ||δb||_2.             (5.42)

In the second line, we have used the fact that the 2–norm is invariant to multiplication by the orthonormal matrix V. The third line follows from the fact that the maximum occurs when δb = ||δb||_2 u_n, and so from the orthonormality of U, the quantity U^T δb is a vector of zeros except for the nth (last) element, which is equal to ||δb||_2.

Using analogous logic, we see that the minimum of ||A−1 b||2 in (5.40) for
fixed ||b||2 occurs when the direction of b aligns with u1 . In this case,
following the same process as in (5.42), except replacing the maximum with
minimum, we have
min_b ||A^{-1} b||_2 = \frac{1}{\sigma_1} ||b||_2.                 (5.43)
We can now use (5.43) and (5.42) as worst–case values in (5.40) and (5.41) respectively to evaluate the worst–case upper bound on the relative error ||δx||_2 / ||x||_2 in x. We have

\frac{||δx||_2}{||x||_2} \leq \frac{\sigma_1}{\sigma_n} \frac{||δb||_2}{||b||_2}.    (5.44)
The quantity ||δb||_2 / ||b||_2 in (5.44) may be interpreted as the relative error in A and b. This relative error is magnified by the factor σ_1/σ_n to give the relative error in the solution x. The ratio σ_1/σ_n is an important quantity in matrix analysis and is referred to as the condition number of the matrix A, and is given the symbol κ_2(A). The subscript 2 refers to the 2–norm used in the derivation in this case. In fact, the condition number may be derived using any suitable norm, as discussed in the Appendix of this chapter.

The analysis for this section gives an interpretation of the meaning of the
condition number κ2 (A). It also indicates in what directions b and δb must
point to result in the maximum relative error in x. We see for worst error
performance, δb points along the direction of un , and b points along u1 .
If the “SVD ellipsoid” is elongated, then there is a large disparity in the
relative growth factors in δx and x, and large relative error in x can result.

Properties of the Condition Number

1. κ(A) ≥ 1.
2. If κ(A) ∼ 1, we say the system is well-conditioned, and the error in
the solution is of the same magnitude as that of A and b.

3. If κ(A) is large, then the system is poorly conditioned, and small errors
in A or b could result in large errors in x. In the practical case, the
errors can be treated as random variables and hence are likely to have
components along all the vectors ui , including un . Thus in a practical
situation with poor conditioning, error growth in the solution is almost
certain to occur.

4. It is interesting to note that the best–case error growth factor is less


than unity. This could happen if the directions for δb and b are re-
versed; i.e., δb aligns with u1 and b aligns with un . However, in view
of item 3 above, this favourable scenario is unlikely to happen.

5. κ(A) is norm–dependent. In the Appendix, an alternative derivation


of condition number is presented, which is better suited to showing
that any suitable norm can be used to evaluate it. The type of norm
used is typically indicated by a subscript on the κ.

We still must consider how bad the condition number can be before it starts
to seriously affect the accuracy of the solution for a given floating–point
precision. In ordinary numerical systems, the errors in A or b result from
the floating point representation of the numbers. The maximum relative
error in the floating point number is u. The condition number κ(A) is
the worst-case factor by which this floating–point error is magnified in the
solution. Thus, the relative error in the solution x is bounded from above
by the quantity O(uκ(A)). Therefore, if κ(A) ∼ 1/u, then the relative error in the solution can approach unity, which means the result is meaningless. If κ(A) ∼ 10^{-r}/u, then the relative error in the solution can be taken as 10^{-r}, and the solution is approximately correct to r decimal places.

Questions:

What is the condition number of an orthonormal matrix?

What is the condition number of a singular matrix?

What happens if δb ∈ span(u1 ) and b ∈ span(un ) ?


Figure 5.2. A poorly conditioned system of equations. The value for b is [1, 1]T , which
is along the same direction as u1 , whereas the perturbations δb in b are [−0.01, 0.01]T ,
which is along the direction of u2 . Visual inspection shows that this arrangement results
in a relatively large shift in the point of intersection of the two lines, which corresponds
to the solution of the system of equations.

Example: Here we consider a poorly conditioned 2×2 system of equations, given as

\begin{bmatrix} 1 & 1 \\ 0.92 & 1.08 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} =
\begin{bmatrix} 1 \\ 1 \end{bmatrix}                                (5.45)

The two equations are shown plotted together in Fig. 5.2, where it may
be seen they are close to being co-linear. The singular values of A are
[2.0016, 0.0799], to give a value of κ(A) = 25.0401. The solution to the
unperturbed system is x = [0.5, 0.5]^T. The U–matrix from the SVD of A is given as

U = \begin{bmatrix} -0.7060 & -0.7082 \\ -0.7082 & 0.7060 \end{bmatrix}.
We now perturb the b–vector in (5.45) in the direction corresponding to the
worst–case error in x, which is along the u2 –axis. The perturbed value of
b is given as bp = b + 0.01u2 , where the value 0.01 was chosen to be the
magnitude of the perturbation. The resulting directions in which b(1) and
b(2) are perturbed are indicated in Fig. 5.2. We note that the unperturbed b
already points along the u1 –direction, which is the direction corresponding

to the worst–case error in the solution.

We now solve the perturbed system and compare the corresponding relative
error in the solution with the upper bound given by (5.44), repeated here
for convenience:
\frac{||δx||_2}{||x||_2} \leq \frac{\sigma_1}{\sigma_n} \frac{||δb||_2}{||b||_2}.

The value of ||δb||_2 / ||b||_2 on the right, using the current values, is 0.0070711. κ(A) = 25.0401, so the worst–case relative error in x predicted by (5.44) is 0.17706. The perturbed solution is x = [0.40807, 0.58485]^T, which results in an actual relative error of 0.17692, which is seen to be very close to the worst–case upper bound predicted by (5.44), as it should be in this case.
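This example is easily reproduced numerically. The sketch below (Python/NumPy; our own illustration, and the sign of the computed singular vectors may differ from the U–matrix printed above) perturbs b along u_2 and compares the resulting relative error in x with the bound (5.44):

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [0.92, 1.08]])
    b = np.array([1.0, 1.0])

    U, s, Vt = np.linalg.svd(A)
    kappa = s[0] / s[-1]                   # condition number sigma_1 / sigma_n

    x = np.linalg.solve(A, b)              # unperturbed solution, [0.5, 0.5]

    db = 0.01 * U[:, -1]                   # perturb b along u_n (worst case)
    x_p = np.linalg.solve(A, b + db)

    rel_err = np.linalg.norm(x_p - x) / np.linalg.norm(x)
    bound = kappa * np.linalg.norm(db) / np.linalg.norm(b)
    print(kappa, rel_err, bound)           # rel_err is close to, and below, the bound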

An example of a well–conditioned system of equations is shown in Fig. 5.3,


where it is seen in this case the two equations are almost orthogonal. The
same experiment as above was conducted for this system of equations. In
this case, κ(A) = 1.23721, and ||δb||_2 / ||b||_2 = 0.0070711. The actual relative error in x is 0.0071661, whereas the worst–case upper bound on the relative error is 0.0087479. Thus we can see that in this case, due to the improved condition number, little growth in the error of the solution is evident, and the actual relative error in x is still close to the value predicted by the upper bound.

5.5 The Interlacing Theorem and Condition Num-


bers [1]

Here we discuss a theorem which is useful in the context of condition numbers. We have a symmetric matrix B_n ∈ R^{n×n}. Let B_r = B(1:r, 1:r).
Then

λr+1 (B r+1 ) ≤ λr (B r ) ≤ λr (B r+1 ) ≤ . . . ≤ λ2 (B r+1 ) ≤ λ1 (B r ) ≤ λ1 (B r+1 )

for r = 1 : n − 1. Here λi (B) indicates the ith–largest eigenvalue of B.


To look at how this theorem is useful in the condition number context, we
consider a square, symmetric (n − 1) × (n − 1) matrix B n−1 . Its largest and
smallest eigenvalues respectively are λ1 (B n−1 ) and λn−1 (B n−1 ). We can


Figure 5.3. A well–conditioned system of equations.

denote the respective condition number as κ(B n−1 ). Now we add a column
and row to form B n (in such a way so that B n remains symmetric). The
largest and smallest eigenvalues of B n are now λ1 (B n ) and λn (B n ). We
can infer from the interlacing theorem that

λ1 (B n ) ≥ λ1 (B n−1 ), and
λn (B n ) ≤ λn−1 (B n−1 )

where we have set the value of r above equal to n − 1 in each case. These
equations imply that κ(B_n) ≥ κ(B_{n−1}). This means that when the size of a square symmetric matrix is increased, the condition number does not decrease; only under special conditions does it remain unchanged.
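This behaviour is easy to check numerically. The following sketch (our own test matrix, not taken from the text) prints the condition numbers of the leading principal submatrices of a symmetric positive definite matrix; they are non-decreasing, as the interlacing theorem predicts:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 6))
    B = X.T @ X                            # symmetric positive definite test matrix

    for r in range(2, 7):
        w = np.linalg.eigvalsh(B[:r, :r])  # eigenvalues of B_r = B(1:r, 1:r), ascending
        print(r, w[-1] / w[0])             # kappa(B_r); non-decreasing in r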

This treatment has special relevance when B is a covariance matrix, of the


form X^T X. Adding an extra column to X increases the dimension of B
by 1, in which case the condition number is most likely to increase. This
fact has significant consequences in linear least squares estimation, which
we discuss in Chapters 7 - 9. Adding a column to X has the likely effect of
making its columns more linearly dependent, thus increasing the condition
number.

In Chapter 10, we discuss the concept of regularization, which is a means of

mitigating the effect of a poor condition number when solving a system of
equations.

5.6 Iterative Solutions

Appendices

5.7 Alternate Derivation of condition number

We now develop the idea of the condition number, which gives us a precise
definition of the sensitivity of x to changes in A or b in eq. (5.37). Now
consider the perturbed system

(A + εF)x(ε) = b + εf                                              (5.46)

where

    ε is a small scalar,
    F ∈ R^{n×n} and f ∈ R^n are errors,
    x(ε) is the perturbed solution, such that x(0) = x.

We wish to place an upper bound on the relative error in x due to the perturbations. Since A is nonsingular, we can differentiate (5.46) implicitly with respect to ε:

(A + εF)ẋ(ε) + Fx(ε) = f                                           (5.47)

For ε = 0 we get

ẋ(0) = A^{-1}(f − Fx).                                             (5.48)
The Taylor series expansion for x(ε) about ε = 0 has the form:

x(ε) = x + εẋ(0) + O(ε²).                                          (5.49)

Substituting (5.48) into (5.49), we get

x(ε) − x = εA^{-1}(f − Fx) + O(ε²)                                 (5.50)

Hence by taking norms, we have

||x(ε) − x|| = ||εA^{-1}(f − Fx) + O(ε²)||
             ≤ ε ||A^{-1}(f − Fx)|| + O(ε²)

where the triangle inequality has been used; i.e., ||a + b|| ≤ ||a|| + ||b||. Using the property of p–norms, ||Ab|| ≤ ||A|| ||b||, we have

||x(ε) − x|| ≤ ε ||A^{-1}|| \, ||f − Fx|| + O(ε²)
             ≤ ε ||A^{-1}|| \{ ||f|| + ||Fx|| \} + O(ε²)
             ≤ ε ||A^{-1}|| \{ ||f|| + ||F|| \, ||x|| \} + O(ε²).


Therefore the relative error in x(ε) can be expressed as

\frac{||x(ε) − x||}{||x||} ≤ ε ||A^{-1}|| \left( \frac{||f||}{||x||} + ||F|| \right) + O(ε²)
                           = ε ||A^{-1}|| \, ||A|| \left( \frac{||f||}{||A|| \, ||x||} + \frac{||F||}{||A||} \right) + O(ε²).

But since Ax = b, then ||b|| ≤ ||A|| ||x||, and we have

\frac{||x(ε) − x||}{||x||} ≤ ||A^{-1}|| \, ||A|| \left( \frac{ε||f||}{||b||} + \frac{ε||F||}{||A||} \right) + O(ε²).    (5.51)

There are many interesting things about (5.51):

1. The left–hand side ||x(ε) − x|| / ||x|| is the relative error in x due to the perturbation.

2. ε||f|| / ||b|| is the relative error in b, denoted ρ_b.

3. ε||F|| / ||A|| is the relative error in A, denoted ρ_A.

4. ||A^{-1}|| \, ||A|| is defined as the condition number κ(A) of A.


From (5.51) we write

\frac{||x(ε) − x||}{||x||} ≤ κ(A)(ρ_A + ρ_b) + O(ε²)               (5.52)
Thus we have the important result: Eq. (5.52) says that, to a first-order approximation, the relative error in the computed solution x is bounded by the expression κ(A) × (relative error in A + relative error in b). This is a rather intuitively satisfying result: the condition number κ(A) is the maximum factor by which the relative errors in A and b are magnified to give the relative error in the solution x.

The condition number κ(A) is norm-dependent. The most common norm is the 2-norm. In this case, ||A||_2 = σ_1. Further, since the singular values of A^{-1} are the reciprocals of those of A, it is easy to verify that ||A^{-1}||_2 = 1/σ_n. Therefore, from the definition of condition number, we have

κ_2(A) = \frac{σ_1}{σ_n}.                                          (5.53)

5.8 Condition Number and Power Spectral Den-


sity

Theorem 6 The condition number of a covariance matrix representing a


random process is bounded from above by the ratio of the maximum to min-
imum value of the corresponding power spectrum of the process.

Proof (from [8]): Let R ∈ R^{n×n} be the covariance matrix of a stationary or wide–sense stationary random process x, with corresponding eigenvectors v_i and eigenvalues λ_i. In this treatment, the eigenvectors do not necessarily have unit 2-norm. Consider the Rayleigh quotient discussed in Sect. 5.3:

λ_i = \frac{v_i^T R v_i}{v_i^T v_i}.                               (5.54)

The quadratic form in the numerator may be expressed in an expanded form as

v_i^T R v_i = \sum_{k=1}^{n} \sum_{m=1}^{n} v_{ik} \, r(k − m) \, v_{im}    (5.55)

where v_{ik} denotes the kth element of the ith eigenvector v_i, and r(k − m) is the (k, m)th element of R. Using the Wiener–Khintchine relation² we may write

r(k − m) = \frac{1}{2π} \int_{−π}^{π} S(ω) e^{jω(k−m)} dω,          (5.56)

where S(ω) is the power spectral density of the process. Substituting (5.56) into (5.55) we have

v_i^T R v_i = \frac{1}{2π} \sum_{k=1}^{n} \sum_{m=1}^{n} v_{ik} v_{im} \int_{−π}^{π} S(ω) e^{jω(k−m)} dω
            = \frac{1}{2π} \int_{−π}^{π} S(ω) \sum_{k=1}^{n} v_{ik} e^{jωk} \sum_{m=1}^{n} v_{im} e^{−jωm} \, dω.    (5.57)

At this point, we interpret the eigenvector v i as a waveform in time. Let its


corresponding Fourier transform V_i(e^{jω}) be given as

V_i(e^{jω}) = \sum_{k=1}^{n} v_{ik} e^{−jωk}.                       (5.58)

We may therefore express (5.57) as

v_i^T R v_i = \frac{1}{2π} \int_{−π}^{π} |V_i(e^{jω})|^2 S(ω) \, dω.    (5.59)

It may also be shown that

v_i^T v_i = \frac{1}{2π} \int_{−π}^{π} |V_i(e^{jω})|^2 \, dω.       (5.60)

Substituting (5.59) and (5.60) into (5.54) we have

λ_i = \frac{\int_{−π}^{π} |V_i(e^{jω})|^2 S(ω) \, dω}{\int_{−π}^{π} |V_i(e^{jω})|^2 \, dω}.    (5.61)

As an aside, (5.61) has an interesting interpretation in itself. The numerator


may be regarded as the integral of the output power spectral density of a
filter with coefficients v i , driven by the input process x. The ith eigenvalue
is this quantity normalized by the squared norm of v i .

2
This relation states that the autocorrelation sequence r(·) and the power spectral
density S(ω) are a Fourier transform pair [8].

Let Smin and Smax be the absolute minimum and maximum values of S(ω)
respectively. Then it follows that
\int_{−π}^{π} |V_i(e^{jω})|^2 S(ω) \, dω ≥ S_{min} \int_{−π}^{π} |V_i(e^{jω})|^2 \, dω    (5.62)

and

\int_{−π}^{π} |V_i(e^{jω})|^2 S(ω) \, dω ≤ S_{max} \int_{−π}^{π} |V_i(e^{jω})|^2 \, dω.    (5.63)

Hence, from (5.61) we can say that the eigenvalues λ_i are bounded by the maximum and minimum values of the spectrum S(ω) as follows:

S_{min} ≤ λ_i ≤ S_{max},    i = 1, . . . , n.                       (5.64)

Further, the condition number κ(R) is bounded as

κ(R) ≤ \frac{S_{max}}{S_{min}}.                                     (5.65)

A consequence of (5.65) is that if a covariance matrix R is rank deficient,


then there exist values of ω ∈ [−π, π] such that the power spectrum is zero.
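The bound (5.65) is readily checked for a process whose autocorrelation and spectrum are known in closed form. The sketch below (our own illustration) uses a first-order autoregressive process driven by unit-variance white noise, for which r(k) = a^{|k|}/(1 − a²) and S(ω) = 1/|1 − a e^{−jω}|²:

    import numpy as np

    a, n = 0.7, 20
    r = a ** np.abs(np.arange(n)) / (1 - a ** 2)
    R = np.array([[r[abs(k - m)] for m in range(n)] for k in range(n)])  # Toeplitz covariance

    kappa = np.linalg.cond(R)

    omega = np.linspace(-np.pi, np.pi, 4001)
    S = 1.0 / np.abs(1 - a * np.exp(-1j * omega)) ** 2
    print(kappa, S.max() / S.min())        # kappa(R) does not exceed S_max / S_min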

5.9 Problems

1. Suggest a sequence of modified Gauss transforms N i , i = n, n−1 . . . , 2,


so that
D = NU

where D is diagonal, U is full–rank upper triangular, and N =


N 2 , . . . , N n−1 N n . All matrices are n×n. Let U i−1 = N i−1 . . . N n U .
Show that U i differs from U i−1 only in the ith column. Use this fact
to propose an efficient means of calculating N from the N i . Hint:
The N i ’s are calculated in the order n, n − 1, . . . , 2.

2. Apply the regular Gaussian elimination process to the augmented matrix A_1 = [A I], where A is invertible, to give the result
A2 = [U B 1 ]. What is B 1 ?
Apply a sequence of the modified Gaussian transforms as above on
A2 so that the U –partition becomes diagonal, to give A3 = [D B 2 ].
Then form A4 = D −1 A3 = [I B 3 ]. What is B 3 ?
 
3. Let A ∈ R^{n×n} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, where A_{11} ∈ R^{k×k} is non–singular. Then S = A_{22} − A_{21} A_{11}^{-1} A_{12} is called the Schur complement of A. Show that after k steps of the Gaussian elimination algorithm without pivoting, A_{22} has been replaced by S.

4. We have an observed data matrix X whose covariance matrix R =


X T X. Suggest a transform B on X so that the covariance matrix
corresponding to XB equals I.

5. A white noise process is fed through a low–pass filter with cutoff fre-
quency of fo and a monotonic rolloff characteristic. The process is
sampled in accordance with Nyquist’s criterion at a frequency fs only
slightly larger than 2fo . The covariance matrix R1 of this process is
evaluated. Then, the sampling frequency is increased well above the
value 2fo and the resulting covariance matrix R2 is again evaluated.
Compare the condition number of R1 with that of R2 and explain
your reasoning carefully. Hint: Consider Sect. 5.8.

6. Give bases for the row and column subspaces of A in terms of its L
and U factors.

7. On the course website, you will find a .mat file named Ch5Prob7.mat.
It contains a very poorly conditioned matrix A and a vector b.

(a) What is the condition number of A in comparison to the mat-


lab machine epsilon? (These may be determined by the matlab
commands “cond” and “eps” respectively).
(b) Solve the system Ax = b and compare the computed result x
with the true solution xo = [1, 1, 1, 1, 1]T . How does the relative
error in x compare with the condition number bound?
(c) Calculate the relative residual errors ||Ax − b|| / ||A|| and ||Ax_o − b|| / ||A||, and explain your findings in comparison with the relative error in the computed solution x.
(d) The matrix A has an approximate nullspace N(A). What is it? Add a vector δx ∈ N(A) to x_o to give the vector x_1 (where δx has moderate to small norm), and again evaluate the relative residual error ||Ax_1 − b|| / ||A||. Explain your result.

8. For the system of equations given by (5.45), determine the directions for both b and δb so that ||δx||_2 / ||x||_2 is minimum. Compare the actual value of the relative error obtained with these choices of b and δb to that corresponding to the best–case bound.

Chapter 6

The QR Decomposition

In this chapter we look at the QR decomposition of a matrix. Any matrix A


can be factored into the product A = QR, where Q is orthonormal and R is
upper triangular. The QR decomposition is a very useful concept because it
provides an orthonormal basis for R(A). In a way, it is like a “poor-man’s”
SVD. As we see later, the QR decomposition provides an efficient means
of solving both the rank-deficient and the full-rank least-squares problem
and also is the main computational step in the QR algorithm for computing
eigenvalues.

Given A ∈ Rm×n then the QR decomposition may be defined as

A = QR

where Q ∈ Rm×m is orthonormal and R ∈ Rm×n is upper triangular.

Unless A is square, R must be padded with a block of zeros to maintain


dimensional consistency. If A is tall, then the triangular portion of R is
padded with an (m − n) × n block of zeros from below as in (6.1) below.
When A is short, then an m × (n − m) block of zeros is padded to the right. In most
practical cases of interest, m > n, and our following discussion assumes this
fact.

For m ≥ n, we can partition the QR decomposition in the following manner:

A = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix}    (6.1)

where A ∈ R^{m×n}, Q_1 ∈ R^{m×n}, Q_2 ∈ R^{m×(m−n)}, R_1 ∈ R^{n×n}, and the zero block is (m − n) × n.

Then we have the following properties for the QR decomposition:

1. If A = QR is a QR factorization of a tall, full rank matrix A as


defined above, then

span(a1 , a2 , . . . , ak ) = span(q 1 , q 2 , . . . , q k ), k = 1, . . . , n, (6.2)

This follows from the fact that since R is upper triangular, the column
ak is a linear combination of the columns [q 1 , . . . , q k ] for k = 1, . . . , n.

2. Further to the above, if A is tall, then1

R(A) = R(Q1 )
R(A)⊥ = R(Q2 )

and A = Q1 R1 . If R1 has positive diagonal entries then it is unique.


Furthermore, R1 = GT , where G is the Cholesky factor of AT A.

3. If A = Q_1 R_1 and R_1 is nonsingular, then

   A R_1^{-1} = Q_1.                                                (6.3)

   Hence, R_1^{-1} is a matrix which orthonormalizes A. In fact, this property holds not only for R_1^{-1}, but also for any inverse square root factor of the matrix A^T A.

We note the QR decomposition does not possess the optimality property of


the eigenvectors, as explained in Sect. 2.2.
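These properties are easy to verify numerically. In the sketch below (Python/NumPy; the matrix is a random example of ours), np.linalg.qr with mode='reduced' returns the factors Q_1 and R_1 of (6.1):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 4))             # tall, full rank (with probability 1)

    Q1, R1 = np.linalg.qr(A, mode='reduced')    # A = Q1 R1; Q1 is 8x4, R1 is 4x4 upper triangular

    print(np.allclose(Q1.T @ Q1, np.eye(4)))    # orthonormal columns
    print(np.allclose(Q1 @ (Q1.T @ A), A))      # R(A) = R(Q1): projecting A onto R(Q1) changes nothing

    # Up to the signs of its rows, R1 equals G^T, where G is the Cholesky factor of A^T A
    G = np.linalg.cholesky(A.T @ A)
    print(np.allclose(np.abs(R1), np.abs(G.T)))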

We now consider the Gram-Schmidt procedure which justifies the existence


of the QR decomposition, and provides a means of computing it.
1
Don’t confuse the notation in this section: R(·) denotes range, and R denotes an
upper triangular matrix.


Figure 6.1. Geometry of the Gram–Schmidt method for QR decomposition.

6.1 Classical Gram-Schmidt

We specify “classical” because later we consider other forms of the GS proce-


dure. In this procedure, we successively form orthonormal columns Q from
the columns of A ( assumed to be full rank), beginning at the first column.
The first column q_1 is defined as shown in Fig. 6.1:

q_1 = \frac{a_1}{||a_1||_2}.

Thus, we see q_1 is a vector of unit norm. The element r_{11} of R is given as ||a_1||_2. Now consider the formation of the second column q_2. The columns a_1 and a_2 of A are represented as shown in Fig. 6.1.

Because R1 is upper triangular, the column a2 is a linear combination of q 1


and q 2 :
a2 ∈ span(q 1 , q 2 ). (6.4)
Since Q is to be orthonormal, q 2 must also satisfy

||q 2 ||2 = 1 (6.5)

q2 ⊥ q1. (6.6)
From Fig. 6.1, we may satisfy (6.4)–(6.6) by considering a vector p_2, which is the projection of a_2 onto the orthogonal complement subspace of q_1. The vector p_2 is thus defined as

p_2 = P_2^⊥ a_2 = (I − q_1 q_1^T) a_2.

Then, the vector q_2 is determined by normalizing p_2 to unit 2–norm:

q_2 = \frac{p_2}{||p_2||_2}.

We define the matrix Q^{(k)} = [q_1, . . . , q_k]. Then the second column r_2 of R_1 contains the coefficients of a_2 relative to the basis Q^{(2)}. Thus,

r_2 = Q^{(2)T} a_2.

To generalize, at the kth stage, q_k may be determined by finding the vector p_k which is the projection of a_k onto the orthogonal complement subspace R(Q^{(k−1)})^⊥. Thus,

p_k = (P_k)^⊥ a_k = \left( I − Q^{(k−1)} Q^{(k−1)T} \right) a_k

and

q_k = \frac{p_k}{||p_k||_2}.

We now define Q^{(k)} = [Q^{(k−1)}, q_k]. The column r_k is then defined as

r_k = Q^{(k)T} a_k.

We then increment k by one, and iterate the process until all n columns have been processed. At that point, the matrices Q and R are completely determined.
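The procedure above translates directly into code. The following is a bare sketch of classical Gram-Schmidt (our own transcription; it is not recommended as a production routine, for the stability reasons discussed below):

    import numpy as np

    def classical_gram_schmidt(A):
        # QR decomposition of a full-rank tall matrix by classical Gram-Schmidt
        m, n = A.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for k in range(n):
            R[:k, k] = Q[:, :k].T @ A[:, k]       # coefficients of a_k w.r.t. q_1 .. q_{k-1}
            p = A[:, k] - Q[:, :k] @ R[:k, k]     # projection onto the orthogonal complement
            R[k, k] = np.linalg.norm(p)
            Q[:, k] = p / R[k, k]
        return Q, R

    A = np.random.default_rng(3).standard_normal((6, 4))
    Q, R = classical_gram_schmidt(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))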

We note that the matrix Q ∈ Rm×n generated from the Gram-Schmidt


process is not a full orthonormal matrix in the case when m > n. There
are only n columns, which however are enough to determine an orthonormal
basis for R(A), and to be useful in solving least squares problems. This is
in contrast to other methods, to be discussed later, which give the complete
orthonormal matrix Q ∈ Rm×m .

Unfortunately, the classical GS method is not numerically stable. This is


because if columns of A are close to linear dependence, catastrophic can-
cellation occurs in computing the pk . This error is quickly compounded,
because the resulting inaccurate q_k are used in successive stages, and quickly
lose orthogonality. Nevertheless, the GS process is useful as an analytical
tool, and for geometric interpretation.

6.1.1 Modified G-S Method for QR Decomposition

In this section we investigate an alternative method for computing the QR


decomposition by the Gram-Schmidt process. It requires the same number
of flops as classical GS, but is stable. Thus, this modified method offers an
effective method of computing the QR decomposition.

We are given a matrix A ∈ Rm×n , m > n. In this case, (as with classical
Gram-Schmidt) the matrix Q which is obtained contains only n columns.

Because R is upper triangular, we can write successive columns of A as


a1 = r11 q 1
a2 = r12 q 1 + r22 q 2
a3 = r13 q 1 + r23 q 2 + r33 q 3 (6.7)
.. .. .. .. ..
. . . . .
an = r1n q 1 + r2n q 2 + r3n q 3 . . . rnn q n

With classical G-S, the matrix R is computed column–wise. However, with


this modified G-S procedure, we note R is computed row-by-row.

To describe the method, we note that the first column q 1 of Q is given as


q 1 = a1 /r11 and r11 = ||a1 ||2 . Since QT A = R, the elements rij of R equal
q_i^T a_j, for i < j. The remaining elements [r_{12}, . . . , r_{1n}] in the first row r_1^T of R are therefore given as

[r12 , . . . , r1n ] = q T1 A(:, 2 : n).

where matlab notation has been used. We see that the first column on
the right in (6.7) is now completely determined. We can proceed to the
second stage of the algorithm by forming a matrix B by subtracting this
first column from both sides of (6.7):

B (1) = A − q 1 r T1 .

Since a1 = r11 q 1 , the first column of B (1) is zero. We then have from (6.7):

b_2 = r_{22} q_2
b_3 = r_{23} q_2 + r_{33} q_3                                      (6.8)
 ⋮
b_n = r_{2n} q_2 + r_{3n} q_3 + · · · + r_{nn} q_n

From (6.8) it is evident that the column q 2 and row r T2 may be formed from
B (1) in exactly the same manner as q 1 and r T1 were from A. The method
proceeds n steps in this way until completion.

To formalize the process, assume we are at the kth stage of the decomposition. At this stage we determine the kth column q_k of Q and the kth row r_k^T of R. We define the matrix A^{(k)} in the following way:

A − \sum_{i=1}^{k−1} q_i r_i^T = [0, A^{(k)}],

where the sum is a sum of outer products, the zero block has k − 1 columns, and A^{(k)} has n − k + 1 columns.

We partition A^{(k)} as

A^{(k)} = [z \;\; B^{(k)}],

where z ∈ R^m is its first column and B^{(k)} ∈ R^{m×(n−k)}.

The situation above corresponds to having just subtracted out the (k − 1)th column in (6.7). Then,

r_{kk} = ||z||_2

and

q_k = \frac{z}{r_{kk}}.

The kth row of R may then be calculated as

[r_{k,k+1}, . . . , r_{k,n}] = q_k^T B^{(k)}.

We now proceed to the (k + 1)th stage by removing the component along q_k from each column in B^{(k)}:

A^{(k+1)} = B^{(k)} − q_k [r_{k,k+1}, . . . , r_{k,n}].

Then, increment k and repeat the above steps.
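The row-by-row procedure translates into the following sketch (our own code; the working matrix W below plays the role of the matrices B^{(k)} in the description above):

    import numpy as np

    def modified_gram_schmidt(A):
        # QR decomposition of a full-rank tall matrix by modified Gram-Schmidt
        m, n = A.shape
        W = A.astype(float).copy()          # working copy; its columns are successively reduced
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for k in range(n):
            R[k, k] = np.linalg.norm(W[:, k])
            Q[:, k] = W[:, k] / R[k, k]
            R[k, k+1:] = Q[:, k] @ W[:, k+1:]            # kth row of R
            W[:, k+1:] -= np.outer(Q[:, k], R[k, k+1:])  # remove the q_k component
        return Q, R

    A = np.random.default_rng(4).standard_normal((6, 4))
    Q, R = modified_gram_schmidt(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))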

This method, unlike the classical Gram Schmidt, is very stable. The nu-
merical stability results from the fact that errors in q k at the k th stage are
not compounded into succeeding stages. It also requires the same number
of flops as classical G-S. It may therefore be observed that modified G-S
is a very attractive method for computing the QR decomposition, since it
has excellent stability properties, coupled with relatively few flops for its
computation.


Figure 6.2. Geometry of a reflector matrix.

6.2 Householder Transformations

Householder transforms are generally the preferred method for performing


the QR decomposition. They are fast to implement and have stable numer-
ical properties.

6.2.1 Description of the Householder Algorithm

We have seen previously that the vector xS = P x is the projection of x


onto the range of P , and that

x⊥ = (I − P )x

is the projection of x onto the orthogonal complement subspace of R(P ). We


now have a new variation of the projector matrix. Specifically, the matrix

H = I − 2P (6.9)

is a reflection matrix. The vector xr = Hx is a reflection of x in the


orthogonal complement subspace of R(P ). This fact may be justified with
the aid of Fig. 6.2. In this figure and in the sequel, we assume the matrix
P is defined in terms of a single vector v as P = v(v T v)−1 v T . It is easily
verified that the matrix H defined by (6.9) is orthonormal and symmetric.

The H matrices may be used to zero out selected components of a vector.
For example, by choosing the vector v in the appropriate fashion, all ele-
ments of a vector x may be zeroed, except the first, x1 . This is done by
choosing v so that the reflection of x in span(v)⊥ lines up with the x1 –axis.
Thus, in this manner, all elements of x are eliminated except the first.

We can use this property to perform a QR decomposition on a matrix A.


The method proceeds in a manner similar to Gaussian elimination, in that we
eliminate all elements below the main diagonal in each column successively.
We can find a vector v 1 so that

A1 = H 1 A

where H 1 is defined from v 1 according to (6.9). The matrix A1 has zeros


below the main diagonal in the first column, as desired. Then, we can find
a new v 2 so that
A2 = H 2 A1

has zeros below the main diagonal in both the first and second columns.
This may be done by designing H 2 so that the first column of H 2 A1 is the
same as that of A1 , and so that the second column of H 2 A1 is zero below
the main diagonal.

The process continues for n − 1 stages. At that stage we have


R = An−1 = H n−1 . . . H 1 A (6.10)

where R is upper triangular.


Qn
Because the H’s are orthonormal, i=1 H i is also orthonormal. Thus, from
(6.10), we have
A = QR
Q1
where QT = i=n H i , and thus the QR decomposition is complete.

Let us now consider the first stage of the Householder process. Extension
to other stages is done later. How do we choose P (or more specifically v)
so that y = (I − 2P )x has zeros in every position except the first, for any
x ∈ Rn ? That is, how do we define v so that y = Hx is a multiple of e1 ?

Here goes:

Hx = (I − 2P)x
   = \left( I − 2v(v^T v)^{-1} v^T \right) x
   = x − \frac{2v^T x}{v^T v} v.                                    (6.11)

Householder made the observation that if v is to reflect the vector x onto the e_1-axis, then v must be in the same plane as that defined by [x, e_1], or in other words, v ∈ span(x, e_1). Accordingly, we set v = x + αe_1, where α is a scalar to be determined. At this stage, this assignment may appear to be rather arbitrary, but as we see later, it leads to a simple and elegant result.

Substituting this definition for v into (6.11), where x_1 is the first element of x, we get

v^T x = x^T x + αx_1
v^T v = x^T x + 2αx_1 + α².

Thus,

Hx = x − \frac{2v^T x}{v^T v} [x + αe_1]
   = \left( 1 − \frac{2(x^T x + αx_1)}{x^T x + 2αx_1 + α²} \right) x − 2α \frac{v^T x}{v^T v} e_1.    (6.12)

To make Hx have zeros everywhere except in the first component, the first term above is forced to zero. If we set α = ||x||_2, then the coefficient of x is

1 − \frac{2(||x||_2^2 + ||x||_2 x_1)}{||x||_2^2 + 2||x||_2 x_1 + ||x||_2^2} = 0

as desired. By using this choice of α, (6.12) becomes

Hx = −||x||_2 e_1.                                                  (6.13)

Thus, we see that by defining v = x + ||x||_2 e_1, then Hx has zeros everywhere except in the first position.

Note that we could also have achieved the same effect by setting α = −||x||_2 in (6.12). The choice of sign of α affects the numerical stability of the algorithm. If x is close to a multiple of e_1, then v = x − sign(x_1)||x||_2 e_1 has small norm; hence large relative error can exist in the factor β = 2/(v^T v). This difficulty can be avoided if the sign of α is chosen as the sign of x_1 (the first component of x); i.e.²,

v = x + sign(x_1)||x||_2 e_1.                                       (6.14)

The corresponding matrix H is given from the second line of (6.11) as

H = I − \frac{2vv^T}{v^T v}.                                        (6.15)

6.2.2 Example of Householder Elimination

Suppose x = (1, 1, 1)T .

What is H such that the only non–zero element of Hx is in the first posi-
tion? That is, Hx ∈ span {e1 }. The process is very simple.

Since H is uniquely determined by v, we must find v. From (6.14),


v = x + αe1 where α = +||x||2

in this case. Thus, since ||x||_2 = \sqrt{3},

v = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + \begin{bmatrix} \sqrt{3} \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 + \sqrt{3} \\ 1 \\ 1 \end{bmatrix}

and H is formed from (6.15) as

H = I − \frac{2}{v^T v} vv^T
  = I − 0.21132 \begin{bmatrix} (1+\sqrt{3})^2 & 1+\sqrt{3} & 1+\sqrt{3} \\ 1+\sqrt{3} & 1 & 1 \\ 1+\sqrt{3} & 1 & 1 \end{bmatrix}
  = \begin{bmatrix} −0.57734 & −0.57734 & −0.57734 \\ −0.57734 & 0.78868 & −0.21132 \\ −0.57734 & −0.21132 & 0.78868 \end{bmatrix}
2
sign(x) = +1 if x is positive, and -1 if x is negative.

We see that

Hx = \begin{bmatrix} −1.73202 \\ 0 \\ 0 \end{bmatrix}
which is exactly the way it is supposed to be. Note from this example, Hx
has the same 2–norm as x. This is a consequence of (6.13), which itself
follows from the orthonormality of H.
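The example is reproduced by the following few lines (a sketch of ours, using (6.14) and (6.15) directly):

    import numpy as np

    x = np.array([1.0, 1.0, 1.0])
    e1 = np.array([1.0, 0.0, 0.0])

    v = x + np.sign(x[0]) * np.linalg.norm(x) * e1      # (6.14)
    H = np.eye(3) - 2.0 * np.outer(v, v) / (v @ v)      # (6.15)

    print(np.round(H, 5))
    print(np.round(H @ x, 5))                # approximately [-1.73205, 0, 0] = -||x||_2 e1
    print(np.allclose(H @ H.T, np.eye(3)))   # H is orthonormal (and symmetric)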

6.2.3 Selective Elimination

[1]

We have discussed the Householder procedure for annihilating all elements


of a vector except the first. We now consider how the Householder procedure
may be generalized to eliminate any contiguous block of vector components.
This procedure is necessary in computing a QR decomposition, since only
the elements below the main diagonal in a specific column are annihilated,
while the remaining elements must remain intact.

Suppose we wish to annihilate the elements x_{k+1}, . . . , x_j of any x ∈ R^n, where 1 < k < j ≤ n, while leaving x_1, . . . , x_{k−1} and x_{j+1}, . . . , x_n unchanged.

Then, the vector v for this case has the form:

v^T = [0, . . . , 0, x_k + sign(x_k)α, x_{k+1}, . . . , x_j, 0, . . . , 0],

where the non-trivial portion has the same structure as a (j − k + 1)–dimensional Householder vector as in (6.14), and where α² = x_k² + · · · + x_j².

In this case, if we define H to have the form

H = diag\left( I_{k−1}, \bar{H}, I_{n−j} \right)

where \bar{H} = I − 2\bar{v}\bar{v}^T/(\bar{v}^T\bar{v}) is the Householder matrix formed from the non-trivial portion \bar{v} of v, then we have in this case

Hx = [x_1, . . . , x_{k−1}, −sign(x_k)α, 0, . . . , 0, x_{j+1}, . . . , x_n]^T;

that is, elements 1 through k−1 and j+1 through n are unchanged, and zeros appear in the desired positions.

By using Householder matrices H constructed in this way, the complete QR


decomposition may be effected, by choosing the block to be eliminated as
that below the main diagonal in the respective column of A.

6.2.4 Householder Numerical Properties

Let β = 2/(v^T v), and let v̂ and β̂ be the computed versions of v and β respectively. Then,

Ĥ = I − β̂ v̂ v̂^T

and it is shown [14] that

||H − Ĥ|| ≤ 10u,    u = machine epsilon.

The matrix HA has a block of zeros in a desired location. The floating point matrix fl[ĤA] satisfies

fl[ĤA] = H(A + E)

where

||E||_2 ≤ c p² u ||A||_2,

c is a constant of order 1, and p is the number of elements which are zeroed.

Conclusion: The computed Householder transformation process on a ma-


trix A is an exact Householder transformation on a matrix close to A. Thus,
the Householder procedure is stable.

6.3 The QR Method for Computing the Eigende-
composition

In this section we use the QR decomposition to compute all eigenvalues and


eigenvectors of a square matrix. While the method is applicable to both
symmetric and non–symmetric matrices, in this treatment we consider only
the symmetric case, since that is the most relevant for the field of signal
processing. A more general description is given in [15].

We first consider the similarity transform of a matrix A. Let A be a square symmetric matrix, and let C = BAB^{-1}, where B is an invertible matrix of compatible size; C is called a similarity transform of A. Let (λ, v) be any eigenpair of C, so that

BAB^{-1} v = λv.

Define u = B^{-1}v and we have

Au = λu.

Thus we see that C has the same eigenvalues as A, and its eigenvectors are transformed versions (v = Bu) of the eigenvectors u of A.

To execute the QR method itself, we perform a QR decomposition on the


square symmetric matrix A(0) = A:

A(0) = Q(0)R(0),

and form A(1) = R(0)Q(0), simply by reversing the factors. Substituting R(0) = Q(0)^T A(0), we have A(1) = Q(0)^T A(0) Q(0), so A(1) is a similarity transform of A(0) and so they have the same eigenvalues. We then perform a QR decomposition on A(1) to obtain
QR decomposition on A(1) to obtain

A(1) = Q(1)R(1)

and again reverse the factors to form A(2), and continue iterating in this
fashion. The eigenvalues at each stage are identical to those of A(0). At
each successive stage of this process, A(k) becomes increasingly diagonal,
and eventually, A(k) becomes completely diagonal for large enough k, thus

revealing the eigenvalues. An extensive discussion on the convergence char-
acteristics of the QR procedure is presented in Golub and Van Loan [1].

At convergence, we can write, for k sufficiently large,

A(k) = Q(k)^T Q(k−1)^T · · · Q(0)^T \, A \, Q(0) Q(1) · · · Q(k) = Λ

where Λ is diagonal. It follows that

A = \underbrace{Q(0) Q(1) · · · Q(k)}_{V} \; Λ \; \underbrace{Q(k)^T Q(k−1)^T · · · Q(0)^T}_{V^T}.

By comparison with the general form of the eigendecomposition A = V ΛV T


for a symmetric matrix, it is clear that the eigenvector matrix is given by
the product of the Q(k) as shown above.
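A bare-bones version of the (unshifted) iteration is sketched below, using the library QR factorization at each step. This is an illustration only (practical implementations tridiagonalize first and use shifts, as discussed in the next subsection), and the test matrix is one of our own with well-separated eigenvalues:

    import numpy as np

    def qr_eig_symmetric(A, iters=200):
        # Unshifted QR iteration on a symmetric matrix
        Ak = A.astype(float).copy()
        V = np.eye(A.shape[0])
        for _ in range(iters):
            Q, R = np.linalg.qr(Ak)
            Ak = R @ Q                 # similarity transform Q^T Ak Q: eigenvalues preserved
            V = V @ Q                  # accumulate the product of the Q(k)
        return np.diag(Ak), V

    rng = np.random.default_rng(7)
    Qr, _ = np.linalg.qr(rng.standard_normal((5, 5)))
    A = Qr @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5]) @ Qr.T   # symmetric, known eigenvalues
    lam, V = qr_eig_symmetric(A)
    print(np.sort(lam))                                  # approx [0.5, 1, 2, 3, 5]
    print(np.allclose(V @ np.diag(lam) @ V.T, A))        # A = V Lambda V^T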

The use of orthogonal matrices for the similarity transformations is desirable,


since then the 2–norm of the matrix or vector is preserved. Otherwise,
there is a potential for matrix or vector products to become large in specific
directions, as shown in Fig. 1, Ch.3. Large growth in elements can introduce
the possibility for catastrophic cancellation to occur, in the manner discussed
in Sect. 5.3.

6.3.1 Enhancements to the QR Method

We discuss two techniques that significantly reduce the computation time


of the QR method as described above. The first is conversion of A to
tridiagonal form, which is applicable only in the symmetric case, and the
second is the shifted QR method which is a modification of the basic method
to accelerate convergence.

Tridiagonalization: The motivation of the tridiagonalization approach


is to introduce as many zeros into a transformed version of A as possible.
Assuming these zeros persist from one iteration to the next, the QR decom-
position at each stage is easier because fewer elements need to be zeroed out
at each step, and further because the algorithm exhibits faster convergence.

178
A square symmetric matrix can be converted to tridiagonal form with one
similarity transform. ( Tridiagonal means that only the main and first upper
and lower diagonals are non–zero. ) The process is quite straightforward.
The original matrix A is replaced with A(0) as follows:

A(0) = Qo AQTo (6.16)

where Qo is the product of Householder transforms required to eliminate


all elements below the first lower diagonal in each column. It is left as an
exercise to show that if premultiplication of a symmetric A by Qo eliminates
elements below the first lower diagonal in each column, then postmultipli-
cation by QTo eliminates elements to the right of the first upper diagonal in
each row, thus maintaining symmetry. The QR method then proceeds on
the transformed matrix A(0) instead of A and is much faster as a result.

One might well ask “If we can tridiagonalize A with one similarity transform,
then why can’t we completely diagonalize A in one step?” The answer lies in
the fact that we can indeed find a Qo so that Qo A has zeros in all positions
below the main diagonal. The problem is that the post–multiplication by
QTo as in (6.16) overwrites the zeros that result from the pre–multiplication.
As an example, we take the following Toeplitz matrix for A:

A = \begin{bmatrix} 4 & 3 & 2 & 1 \\ 3 & 4 & 3 & 2 \\ 2 & 3 & 4 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix}    (6.17)

We eliminate the first column using a Householder matrix H given by:

H = \begin{bmatrix} −0.7303 & −0.5477 & −0.3651 & −0.1826 \\ −0.5477 & 0.8266 & −0.1156 & −0.0578 \\ −0.3651 & −0.1156 & 0.9229 & −0.0385 \\ −0.1826 & −0.0578 & −0.0385 & 0.9807 \end{bmatrix}.

The matrix HA is then:

HA = \begin{bmatrix} −5.4772 & −5.8424 & −5.1121 & −3.6515 \\ 0.0000 & 1.2010 & 0.7487 & 0.5276 \\ 0.0000 & 1.1340 & 2.4991 & 2.0184 \\ 0.0000 & 1.0670 & 2.2496 & 3.5092 \end{bmatrix},    (6.18)

which has zeros below the main diagonal in the first column as desired.
However, when we post–multiply by H^T, we get

HAH^T = \begin{bmatrix} 9.7333 & −1.0275 & −1.9022 & −2.0465 \\ −1.0275 & 0.8757 & 0.5318 & 0.4192 \\ −1.9022 & 0.5318 & 2.0977 & 1.8177 \\ −2.0465 & 0.4192 & 1.8177 & 3.2933 \end{bmatrix},

and thus it is apparent that the zeros introduced in (6.18) have been over-
written by the later post–multiplication by H T and the procedure has not
accomplished our objective to introduce as many zeros as possible into A(0).
So now we accept the fact that the best we can do is to tridiagonalize. As
a first step in this respect, we formulate a Householder matrix H to wipe
out the elements below the first lower diagonal in the first column. This is
given by

H = \begin{bmatrix} 1.0000 & 0 & 0 & 0 \\ 0 & −0.8018 & −0.5345 & −0.2673 \\ 0 & −0.5345 & 0.8414 & −0.0793 \\ 0 & −0.2673 & −0.0793 & 0.9604 \end{bmatrix}.
Notice that the selective elimination procedure as in Sect. 6.2.3 has been
used, since we wish to keep the element (1,1) intact. This explains the
identity structure in the first row and column of H. The second column
below the first lower diagonal is eliminated in a corresponding fashion. The
overall result of the tridiagonalization procedure is then given by

Q_o A Q_o^T = \begin{bmatrix} 4.0000 & −3.7417 & −0.0000 & −0.0000 \\ −3.7417 & 8.2857 & −2.6030 & −0.0000 \\ −0.0000 & −2.6030 & 3.0396 & −0.2254 \\ −0.0000 & −0.0000 & −0.2254 & 0.6747 \end{bmatrix}.

We see that the only non–zero elements are indeed located along the three
main diagonals, as desired. After tridiagonalization, regular QR iterations
as described above are applied, but now only one element in each column
requires elimination, a fact that greatly speeds up both convergence and
execution of the algorithm. The tridiagonal structure is maintained at each
QR iteration.

The QR method may also be applied to non–symmetric square matrices


using exactly the same procedure as for the symmetric case. The only
difference is that the initial matrix A(0) cannot be made tridiagonal, since it
is not symmetric. In this case we make do with just zeroing out the elements

below the first lower diagonal. This is the so–called upper Hessenberg form
of the matrix. The overall process requires somewhat more computation
than the symmetric case.

Shifted QR: We can significantly improve the convergence rate of the


QR algorithm using a shifting procedure. We replace the basic QR iteration
described earlier with the following procedure:

A(k) − αk I = Q(k)R(k), and then A(k + 1) = R(k)Q(k) + αk I

where αk is chosen to be close to the smallest eigenvalue. This variation


of the QR iteration may be justified, because A(k) is similar to A(k + 1),
which we can show as follows:

Q(k)^T A(k) Q(k) = Q(k)^T \left( Q(k)R(k) + α_k I \right) Q(k)
                 = R(k)Q(k) + α_k I
                 = A(k + 1).

Therefore the shift is justified in the QR procedure because the eigenvalues


are not altered. In practice, an estimate of λmin is provided by the element
in the lower right corner of A(k). As indicated in [15], if αk is sufficiently
close to λmin , the shifted QR algorithm can exhibit cubic convergence3 to
the eigenvalue.
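A sketch of the shifted iteration is given below (our own illustration, without the deflation and Wilkinson-shift refinements used in production eigensolvers; the test matrix is constructed with known, well-separated eigenvalues):

    import numpy as np

    def shifted_qr_eigvals(A, iters=100):
        # Shifted QR iteration on a symmetric matrix (no deflation)
        Ak = A.astype(float).copy()
        I = np.eye(Ak.shape[0])
        for _ in range(iters):
            alpha = Ak[-1, -1]                  # shift: corner element estimates an eigenvalue
            Q, R = np.linalg.qr(Ak - alpha * I)
            Ak = R @ Q + alpha * I              # similar to Ak, so eigenvalues are unchanged
        return np.sort(np.diag(Ak))

    rng = np.random.default_rng(8)
    Qr, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    A = Qr @ np.diag([6.0, 3.0, 1.5, 0.5]) @ Qr.T
    print(shifted_qr_eigvals(A))                # approx [0.5, 1.5, 3, 6]
    print(np.sort(np.linalg.eigvalsh(A)))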

The Jacobi method is yet another approach to computing the complete


eigendecomposition using orthonormal similarity transforms, specifically for
symmetric matrices. However it is not as commonly used as the QR method.
The interested reader is referred to [5].

³By cubic convergence, we mean that if the error at iteration k is ε_k for suitably large k (where ε_k may be assumed small), then the error is O(ε_k³) at iteration k + 1. Thus the convergence is very fast.

Appendices
In this Appendix, we examine two additional forms of QR decomposition
– the Givens rotation and fast Givens rotation methods. They are both
useful methods, particularly when only specific elements of A need to be
eliminated.

6.4 Givens Rotations

We have seen so far in this lecture that the QR decomposition may be ex-
ecuted by the Gram Schmidt and Householder procedures. We now discuss
the QR decomposition by Givens rotations. A Givens transformation (rotation) is capable of annihilating (i.e., zeroing) a single element in any position of interest.
Givens rotations require a larger number of flops compared to Householder
to compute a complete QR decomposition on a matrix A. Nevertheless,
they are very useful in some circumstances, because they can be used to
eliminate only specific elements.

In this presentation we consider a Givens transformation J (i, k, θ) to anni-


hilate the (i, k)th element of the product J A below the main diagonal; i.e.,
for i > k. The matrix J has the form:
k i
 
1

 1 

 c s 
k
J= (6.19)
 
 .. 
 . 
i
 

 −s c 

 1 
1
where s = sin(θ) c = cos(θ), and θ is an angle to be determined. J
has the form of an identity matrix except for the (c, s) entries. These c, s
entries occupy positions involving all combinations of the indeces (i, k). The
transformation J on x rotates x by θ radians in the i − k plane.

A sequence of Givens rotations may be used to effect the QR decomposi-


tion. One rotation (premultiplication by J ) exists for every element to be

eliminated; thus,

\underbrace{J_{n,n−1} \cdots J_{n2} \cdots J_{32} J_{n1} \cdots J_{21}}_{Q^T} \, A = R \quad \text{(upper triangular)}.

We therefore see that the QR decomposition may be effected by a sequence


of Givens rotations. The resulting upper triangular product is R and the
product of all the Givens transformations is QT .

There are two conditions which must exist on each J :

1. If Q is to be orthonormal, then each J must be orthonormal. (Because


the product of orthonormal matrices is orthonormal).

2. The (i, k)th element of J A must be zero.

The first condition is satisfied by the structure of J from (6.19): since c² + s² = 1, direct multiplication gives

J^T J = I,

which holds for any θ. Condition 2 is satisfied by examining the product JA. Only rows k and i of A are affected by the premultiplication, and the (i, k)th element of JA for i > k is given as

−s a_{kk} + c a_{ik} = 0.

Thus,

\frac{s}{c} = \tan θ = \frac{a_{ik}}{a_{kk}}.

We therefore have, with the aid of Fig. 3,

s = \frac{a_{ik}}{\sqrt{a_{kk}^2 + a_{ik}^2}},
c = \frac{a_{kk}}{\sqrt{a_{kk}^2 + a_{ik}^2}}.

Notice that θ is not explicitly computed . The matrix J (i, k, θ) is now com-
pletely specified.

The following algorithm computes c and s in a numerically stable fashion and ensures that J(i, k)x has a 0 in the ith position:

If x_i = 0 then c := 1 and s := 0, else
    if |x_i| > |x_k|
        then t := x_k / x_i;  s := 1/(1 + t²)^{1/2};  c := st
        else t := x_i / x_k;  c := 1/(1 + t²)^{1/2};  s := ct
This algorithm assures that |t| ≤ 1. If |t| becomes large, we may run into
stability problems in calculating c and s.

It is easily verified that the following facts hold true when evaluating the
product J (i, k)A:

1. The kth row of JA ← c a_k^T + s a_i^T

2. The ith row of JA ← −s a_k^T + c a_i^T

3. All other rows of A are unchanged,

where a_i^T, a_k^T are the ith and kth rows of A respectively. Thus, only the ith and kth rows of the product JA are actually relevant in the Givens analysis.
The order in which elements are annihilated in the QR decomposition is
critical. Elements a_{21}, . . . , a_{m1} in the first column are annihilated first, by linear combination of each row with the first row (k = 1, i = 2, . . . , m). Next, elements a_{32}, . . . , a_{m2} are annihilated by linear combination with the second row (k = 2, i = 3, . . . , m), then a_{43}, . . . , a_{m3} using the third row, and so on. If this ordering is not followed, then previously written zeros may be overwritten by non-zero values in later stages.

Numerical Stability of QR Decomposition by Givens

QR decomposition by Givens rotation is of the same degree of stability as for


Householder. Both are very stable, and more so than Gaussian elimination
for triangularization.

6.5 “Fast” Givens Method for QR Decomposition

Even though the ordinary Givens method is stable, it is expensive to com-


pute. In this section we discuss a modified Givens method which is almost
as stable yet is considerably faster.

First, we present a quick review of “slow” Givens: Suppose we have a vector


x = [x1 . . . xi . . . xn ] and we wish to annihilate the element xi for the value

k = 1. This is done using Givens rotations in the following way:

J(i, 1, θ) x = [x_1^{(1)}, x_2, . . . , x_{i−1}, 0, x_{i+1}, . . . , x_n]^T,

where the bracketed superscript indicates the corresponding element has been changed once, and

c = \frac{x_1}{\sqrt{x_1^2 + x_i^2}},    s = \frac{x_i}{\sqrt{x_1^2 + x_i^2}}.

The relevant portion of this process may be represented at the 2 × 2 level


as:

\begin{bmatrix} c & s \\ −s & c \end{bmatrix} \begin{bmatrix} x_1 \\ x_i \end{bmatrix} = \begin{bmatrix} x_1' \\ 0 \end{bmatrix}.    (6.20)
Each element is eliminated in turn, using an appropriate Givens matrix J ,
in the order of Gaussian Elimination, until an upper triangular matrix is
obtained. Note that each element is eliminated by an orthonormal matrix.

The difficulty with the original Givens method is that generally, none of the
elements of the J –matrix at the 2 × 2 level in (6.20) are 0 or 1. Thus, the
update of a given element from (6.20) involves 2 multiplications and one add
for each element in rows k and i. We now consider a faster form of Givens
where the off– diagonal elements of the transformation matrix are replaced
by ones. This reduces the number of explicit multiplications required for
the evaluation of each altered element of the product from two to one.

In this vein, let us consider fast Givens: the idea here is to eliminate each element of A using a simplified transformation matrix, denoted as M, to
reduce the number of flops required over ordinary Givens. The result is that
the M used for fast Givens is orthogonal but not orthonormal.

We can therefore speculate that we can triangularize A as

M^T A = \begin{bmatrix} S \\ 0 \end{bmatrix}                        (6.21)

where A ∈ R^{m×n}, m > n, S ∈ R^{n×n} is upper triangular, and M ∈ R^{m×m} has orthogonal but not orthonormal columns. Hence

M^T M = D = diag(d_1, . . . , d_m),                                 (6.22)

and M D^{−1/2} is orthonormal.

We deal with the fast Givens problem at the 2 × 2 level. Let x = [x1 x2 ]T ,
and we define the matrix M_1 as

M_1 = \begin{bmatrix} β_1 & 1 \\ 1 & α_1 \end{bmatrix}.             (6.23)

As with regular Givens, there are two conditions on M_1 if the appropriate element of M^T A is to be annihilated and M_1 is to be orthogonal:

1)  M_1 x = \begin{bmatrix} x_1' \\ 0 \end{bmatrix}
2)  M_1^T M_1 = diag(d_1, d_2).

To satisfy the first condition, we note using (6.23) that

M_1 x = \begin{bmatrix} β_1 x_1 + x_2 \\ x_1 + α_1 x_2 \end{bmatrix}.

Therefore, for the second element to be zero, we must have

α_1 = \frac{−x_1}{x_2}.                                             (6.24)

To satisfy the second condition, we note

M_1^T M_1 = \begin{bmatrix} β_1 & 1 \\ 1 & α_1 \end{bmatrix} \begin{bmatrix} β_1 & 1 \\ 1 & α_1 \end{bmatrix} = \begin{bmatrix} β_1^2 + 1 & β_1 + α_1 \\ β_1 + α_1 & α_1^2 + 1 \end{bmatrix}.    (6.25)

Hence, we must have β_1 = −α_1. This completely defines the matrix M_1.
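At the 2 × 2 level these two conditions are easy to check numerically (a small sketch of ours, with an arbitrary test vector):

    import numpy as np

    x = np.array([0.8, 2.0])

    alpha1 = -x[0] / x[1]                  # (6.24)
    beta1 = -alpha1
    M1 = np.array([[beta1, 1.0],
                   [1.0, alpha1]])

    print(M1 @ x)                          # second element is zero
    print(M1.T @ M1)                       # diagonal: M1 is orthogonal but not orthonormal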

At the m × m level, the matrix M_1 has a form analogous to slow Givens:

M_1(i, k) = \begin{bmatrix}
      1 &        &        &        &   \\
        & β_1    & \cdots & 1      &   \\
        & \vdots &        & \vdots &   \\
        & 1      & \cdots & β_1    &   \\
        &        &        &        & 1
    \end{bmatrix}                                                   (6.26)

where the β_1 entries appear at positions (k, k) and (i, i), and the unit entries at positions (k, i) and (i, k).

Each element of A is eliminated in turn, just as with slow Givens. We note from (6.25), with β_1 = −α_1, that

M_1^T M_1 = \begin{bmatrix} 1 + β_1^2 & 0 \\ 0 & 1 + β_1^2 \end{bmatrix};

hence, with each elimination, the rows of M^T A grow by a factor of 1 + β_1^2 = 1 + α_1^2. But we see from (6.24) that if x_1 ≫ x_2 this growth factor can become large, and lead to the potential of floating point overflow.

To control this growth, we consider a different form of M:

M_2 = \begin{bmatrix} 1 & β_2 \\ α_2 & 1 \end{bmatrix}.

In this case, to satisfy the two conditions, it is easily verified that

α_2 = \frac{−x_2}{x_1}

and β_2 = −α_2. Hence, if |x_1| > |x_2| choose form M_2, else choose M_1. This way, the growth with each elimination is controlled to within a factor of 2. (With many eliminations, this still can be a problem.)

For the sake of interest, let us see how the fast Givens decomposition may be used to solve the LS problem. From (6.21) and (6.22), and using the fact that M D^{−1/2} is orthonormal, we can write

||Ax − b||_2 = \left\| D^{−1/2} M^T A x − D^{−1/2} M^T b \right\|_2
             = \left\| D^{−1/2} \left( \begin{bmatrix} S \\ 0 \end{bmatrix} x − \begin{bmatrix} c \\ d \end{bmatrix} \right) \right\|_2    (6.27)

where

M^T b = \begin{bmatrix} c \\ d \end{bmatrix},    c ∈ R^n,  d ∈ R^{m−n}.

Thus, x_{LS} is the solution to

S x = c                                                             (6.28)

and

ρ_{LS} = \left\| D_2^{−1/2} d \right\|_2,                           (6.29)

where D_2 denotes the trailing (m − n) × (m − n) block of D.

The great advantage to the fast Givens approach is that the triangularization
may be accomplished using half the number of multiplications compared to
slow Givens, and may be done without square roots, which is good for VLSI
implementations.

6.5.1 Flop Counts

The following table presents a flop count for various methods of QR decom-
position of a matrix A ∈ Rm×n :

Householder:           2n²(m − n/3)
slow Givens:           3n²(m − n/3)
fast Givens:           2n²(m − n/3)
Gram–Schmidt:          2mn²
by comparison, Gauss:  (2/3)n³

(1 flop = 1 floating-point operation: add, multiply, divide or subtract.)

where of course, the Gaussian elimination entry applies only to a non-


orthogonal transformation of a square matrix. Thus, even though House-
holder and both forms of Givens are very stable and furthermore yield an or-
thonormal decomposition, they are significantly slower than Gaussian elim-
ination.

6.6 Problems

1. Consider a QR decomposition on a matrix Am×n , m > n, which has


proceeded k steps, k < n, (i.e., k columns of A have been eliminated).
A Householder method has been used. The situation may be depicted
in the following equation:

\begin{bmatrix} A_1 & A_2 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}

where Q1 ∈ Rm×k , R11 ∈ Rk×k is upper triangular, and R22 ∈


R(m−k)×(n−k) . The matrix A may be partitioned as A = [A1 A2 ],
where A1 has k columns.

(a) The matrix Q1 is an orthonormal basis for R(A1 ). In terms of


the R–matrix, how can the quantity ||ai ||2 , i ≤ k be determined?
(b) Let si , i > k be the projection of ai ∈ A2 onto R(A1 ). Determine
the quantity ||si ||2 using only the R–matrix.
(c) What is an orthonormal basis for R(A1 )⊥ ? Denote pi as the
projection of ai , i > k onto R(A1 )⊥ . Determine the quantity
||pi ||2 using only the R–matrix. Explain your reasoning carefully.

2. We are given a tall matrix Q1 with orthonormal colummns. Explain


how to find a Q2 so that Q = [Q1 Q2 ] is a complete orthonormal
matrix.

3. Consider the QR decomposition on a tall rank–deficient matrix. Ex-


plain the characteristics of R in this case.

4. (a) Given that P is a projector onto a subspace S, we have seen that


the operation (I − 2P )x reflects a vector x in the orthogonal
complement subspace S⊥ . What is the matrix which reflects x
about S?
(b) What are the eigenvalues of each of these reflectors?

5. (From Strang): Show that for any two different vectors x and y of the
same length, the choice v = x − y leads to a Householder transforma-
tion such that Hx = y and Hy = x.

6. Updating the QR decomposition with time: At a certain time t, we
have available m row vectors aTi ∈ Rn and their corresponding desired
values bi , for i = 1, . . . , m, to form the matrix At ∈ Rm×n and bt ∈ Rm .
The QR decomposition QTt At = Rt is available at time t to aid in the
computation of the LS problem minx ||Ax − b||. At time t + 1 a new
(m + 1)th row aTm+1 of A and a new (m + 1)th element of b become
available. Explain in detail how to update Qt and Rt to get Qt+1 and
Rt+1 . Hint: At+1 can be decomposed as
  
Qt v Rt
At+1 =
zT 1 aTm+1

where z and v are vectors of zeros of appropriate length. What is the


order of the FLOP count for this estimate? This process is the basis
for an adaptive filtering method, used for tracking the least–squares
solution in non–stationary environments throughout time with a view
to being as computationally efficient as possible. Hint 2 : Consider
Givens rotations.

7. Write a program to convert the Toeplitz matrix of (6.17) into a tridi-


agonal form.

Chapter 7

Linear Least Squares


Estimation

In this chapter, we discuss the idea of linear least-squares estimation of


parameters. Least-squares (LS) analysis is one of the foundations of signal
processing and is the fundamental concept in adaptive systems, linear pre-
diction/signal encoding, system identification, machine learning and many
other applications.

We start off with a quick look at a few applications of least squares, and go
on to develop the LS model. We then develop the so-called normal equations
for solving the LS problem. We discuss several statistical properties of the
LS solution including the Cramer–Rao lower bound (CRLB). We look at
the performance of the LS estimates relative to the CRLB in the presence
of white and coloured noise. We show that in the coloured noise case, per-
formance is degraded, and so we consider various methods for whitening the
noise, which restore the performance of the LS estimator.

Because least squares is such an important topic, it is the focus of the next
four chapters. In Chapter 8, we discuss LS estimation when the matrix
A is poorly conditioned or rank deficient. Then we extend this treatment
in Chapter 9 to discuss latent variable methods, which are useful for mod-
elling poorly conditioned linear systems. Specifically, we deal with the case
where we wish to predict system responses given new input values. Then

in Chapter 10, we discuss the important concept of regularization, which is
an additional method for mitigating the effects of poor conditioning when
modelling linear systems.

7.1 Examples of least squares analysis

7.1.1 Example 1: An Equalizer in a Communications System

In a digital communications system such as a cell phone, symbols y(iT ), i =


1, 2, . . . , are generated at the transmitter every T seconds and received as
the quantities x(iT ). The discrete-time impulse response of the channel is
denoted h(iT ). Ideally, this function should be a delta–function δ(iT ). How-
ever due to several causes, in a practical system h(iT ) extends over several
symbol periods. Since noise is always present in the receiver, the received
symbols x(iT ) may then be expressed as x(iT ) = h(iT )∗y(iT )+n(iT ), where
n(iT ) is the noise sequence and ∗ denotes the convolution operation. There-
fore as a result of the convolution operation, the received symbol at time i
is a weighted combination of a number of past input symbols, plus noise.
The fact that the current symbol x(iT ) contains contributions from symbols
from other time periods reduces the immunity of the receiver to noise, and is
referred to as intersymbol interference, or ISI. An equalizer is incorporated
into the communication system’s receiver to alleviate the effects of ISI. A
block diagram of the structure is shown in Fig. 7.1. Here, received symbols
xi , xi−1 , . . . , xi−n+1 1 are fed into a finite impulse response (FIR) filter as
shown. These samples are multiplied by a set of weights a1 , a2 , . . . , an and
added together to give a set of output symbols zi , as shown. In the fre-
quency domain, the equalizer acts as a filter which attempts to invert the
frequency response of the channel. From a time domain perspective, the
purpose of the equalizer is to disentangle the effect of the previous samples
on the current sample and suppress ISI. If this operation is successful, then
the combined response of the channel plus equalizer in the frequency domain
is flat, equal to a constant value in frequency. The corresponding impulse
response is thus a delta-function, with the result that ISI is suppressed as

1
Here and in the sequel, for simplicity of notation, we use subscript notation to imply
the quantity x(iT ).

Figure 7.1. A block diagram of an equalizer in a communications system.

far as possible. Thus ideally, the symbols zi at the output of the equalizer
are equal to the corresponding transmitted symbols yi plus noise. For more
details on this topic, there are several good references on equalizers and
digital communications systems at large, e.g., [16].

In the communications system, it is a relatively straightforward procedure


to produce a signal di which most of the time is a good guess of what zi
should be in the absence of ISI and noise. The idea of the equalizer is then to
generate the set of weights ai such that the output zi is as close as possible
to di for i = 1, 2, . . . Let us define the signal ei as the difference between zi
and di . Then we have

d_i = z_i + e_i
d_i = \sum_{k=1}^{n} a_k x(i-k) + e_i.     (7.1)

where in the last line we have made use of the fact that the sequence z[n]
is the convolution of the input sequence x[n] with the sequence a[n], as is
evident from Figure 7.1. If we observe (7.1) over m sample periods we obtain
a new equation in the form of (7.1) for every value of the index i = 1, . . . , m,
where m > n. We can combine these resulting m equations into a single

matrix equation:

\underset{(m\times 1)}{d} \;=\; \underset{(m\times n)}{X}\,\underset{(n\times 1)}{a} \;+\; \underset{(m\times 1)}{e}     (7.2)

where d = (d1 , d2 , . . . , dm )T ; similarly for e. The matrices X and a are


given respectively as
 
X = \begin{bmatrix} x_n & x_{n-1} & \cdots & x_1 \\ x_{n+1} & x_n & \cdots & x_2 \\ \vdots & & \ddots & \vdots \\ x_{m+n-1} & \cdots & \cdots & x_m \end{bmatrix}, \qquad a = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}.

A reasonable and tractable method of choosing a is to find that value of a


which minimizes the 2-norm-squared difference ||e||22 between the equalizer
outputs z = [zi , zi+1 , . . . , zn+i−1 ]T = Xa, and d = [di , di+1 , . . . , di+n−1 ]T .
Thus, we choose the optimum value a0 to satisfy

a_0 = \arg\min_a ||e||_2^2 = \arg\min_a ||Xa - d||_2^2
    = \arg\min_a (Xa - d)^T (Xa - d).     (7.3)

The fact that we determine a0 by minimizing ||e||22 (squared 2-norm) is the


origin of the term “least squares”. The method of determining a to satisfy
(7.3) is discussed later. Even though this example pertains specifically to
an equalizer, the mathematical descriptions for other types of systems, e.g.,
adaptive antenna arrays, adaptive echo cancellors, adaptive filters, etc., are
all identical. Basically, the mathematical framework of this section applies
to virtually any type of adaptive system.
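To make the above concrete, the following MATLAB sketch builds the data matrix X of (7.2) from a simulated received sequence and solves (7.3). The channel, the training symbols, the equalizer length and the alignment of d with the rows of X are all hypothetical choices made only for this illustration.

% Sketch of an LS equalizer design (hypothetical channel and training data).
m = 200; n = 8;                            % number of equations, equalizer length
y = sign(randn(m+n-1,1));                  % transmitted +/-1 training symbols
h = [1; 0.5; 0.2];                         % assumed channel impulse response
x = filter(h,1,y) + 0.05*randn(m+n-1,1);   % received symbols with additive noise

X = zeros(m,n);                            % build X as in (7.2)
for i = 1:m
    X(i,:) = x(n+i-1:-1:i).';              % row i holds x_{n+i-1}, ..., x_i
end
d = y(n:n+m-1);                            % desired symbols (alignment chosen for this sketch)

a0 = X\d;                                  % backslash returns the LS solution of (7.3)
z  = X*a0;                                 % equalizer outputs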

7.1.2 Example 2: Autoregressive Modelling [2]

An autoregressive (AR) process is a (stationary) random process which is the


output of an all-pole discrete–time filter when excited by either white noise
or a pulse train. The reason for this terminology is made apparent later.
AR modelling is used extensively in signal processing applications, since in
many cases it is a useful means of signal compression, or a parsimonious
representation of an entire data sequence. For example, the human voice is

an excellent example of a time varying AR process, where the vocal tract acts
as a time–varying, highly resonant all–pole acoustic filter whose input is a
pulse train generated by the vocal cords, or a white noise sequence generated
by a restriction somewhere in the vocal tract. During the production of a
single phoneme, which represents an interval of about 20 msec, the voice
may be considered approximately stationary and is therefore amenable to
AR modelling. At an 8 KHz sampling rate, which is a typical value in
telephone systems, there are 160 samples in this 20 msec interval, and an
AR model representing this sequence typically consists of about 10 ∼ 15
parameters. Therefore the sequence of 160 samples can be compressed into
this range of parameters by AR modelling, thus achieving a significant degree
of compression. The AR model must be updated in roughly 20 msec intervals
to track the variation in phoneme production of the voice signal.

From fundamental linear system theory, an all-pole discrete–time filter has


a transfer function H(z) given by the expression
H(z) = \frac{1}{\prod_{i=1}^{n}\left(1 - z_i z^{-1}\right)} \equiv \frac{1}{1 - \sum_{i=1}^{n} h_i z^{-i}},     (7.4)
where zi are the poles of the filter, and hi are the coefficients of the corre-
sponding polynomial in z.

Let W (z) and Y (z) denote the z-transforms of the input and output se-
quences, respectively. Then
H(z) = \frac{Y(z)}{W(z)} = \frac{1}{1 - \sum_i h_i z^{-i}}

or

Y(z)\left[1 - \sum_{i=1}^{n} h_i z^{-i}\right] = W(z).
We note the expression on the left is a product of z–transforms, so the corre-
sponding time–domain expression involves the convolution of the sequence
[1, −h1 , −h2 , . . . , −hn ] with [y1 , y2 , . . .]. The equivalent of the above expression in the
time domain is therefore
y_i - \sum_{k=1}^{n} h_k y_{i-k} = w_i

or

y_i = \sum_{k=1}^{n} h_k y_{i-k} + w_i,     (7.5)

where the variance of the sequence w[i] is σ 2 . From (7.5), we see that the
output of an all-pole filter when driven by white noise may be given the
interpretation that the present value of the output is a linear combination
of past outputs weighted by the denominator coefficients, plus a random
disturbance. The closer the poles of the filter are to the unit circle, the
more resonant is the filter, and the more predictable is the present output
from its past values.

Repeating (7.5) for i = 1, . . . , m, we have

y =Yh+w (7.6)

where y = [y1 , y2 , . . . , ym ]T , w is defined in a corresponding way, h =


[h1 , h2 , . . . , hn ]T and the ith row of Y ∈ Rm×n contains the n past values of
y required for the prediction of yi according to (7.5).

The mathematical model corresponding to (7.2) and (7.6) is sometimes re-


ferred to as a regression model . In (7.6), the variables y are “regressed” onto
themselves, and hence the name “autoregressive”.

Eq. (7.6) is of the same form as (7.2). So again, it makes sense to choose
the h’s in (7.6) so that the predicting term Y h is as close as possible to the
true values y in the 2-norm sense. Hence, as before, we choose the optimal
h0 as the solution to

h0 = arg min(Y h − y)T (Y h − y). (7.7)


h

Notice that if the parameters h and the variance σ 2 are known, the autore-
gressive process is completely characterized.
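A minimal sketch of estimating the AR parameters via (7.7) from a synthetic record is shown below; the pole locations, model order and record length are assumed values used only for illustration.

% Sketch: LS estimation of AR coefficients from a synthetic AR process.
n = 4; m = 500;
p = 0.95*exp(1j*[0.5 -0.5 1.5 -1.5]);      % assumed resonant pole locations
aden = real(poly(p));                      % denominator [1, -h1, ..., -hn] as in (7.4)
h_true = -aden(2:end).';
w = randn(m+n,1);                          % white driving noise
y = filter(1, aden, w);                    % AR process realization

Y = zeros(m,n);                            % ith row holds the n past values of y
for k = 1:n
    Y(:,k) = y(n-k+1 : n-k+m);
end
b = y(n+1 : n+m);                          % samples being predicted
h0 = Y\b;                                  % solves (7.7)
disp([h_true h0])                          % estimated vs. true coefficients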

7.1.3 Example 3: Hurricane prediction using machine learning

A further example of the use of least squares is in the machine learning


context. Suppose we have a training set of m observations each consisting
of n meteorological features or variables such as water temperatures, air
temperatures, atmospheric pressures, cloud conditions, wind velocities, etc.,
(all taken at the same time), pertaining to various tropical areas over a range

of years. We also have information on whether the respective observation led to
a hurricane developing or not (let’s say +1 for developed, -1 for not). Here,
the set of conditions for the ith observation, i = 1, . . . m forms a row of a
matrix A, and bi is the corresponding response [±1]. We can formulate this
situation into a linear mathematical model for this problem as follows:

b = Ax + e (7.8)

where e is the error between the linear model Ax and the observations b.
We can expand each of the matrix/vector quantities for clarity as follows:
    
b = \begin{bmatrix} \pm 1 \\ \pm 1 \\ \vdots \\ \pm 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + e,

where the aij elements are the jth variable of the ith observation. As in the
previous examples, we wish to determine a set x of predictors (or weights
for each of the variables), which give the best fit between the model Ax and
the observation b. The predictors x? may be determined as the solution to

x^{\star} = \arg\min_x ||Ax - b||_2^2 .
Once we have solved for the optimal values x? of x, we can predict whether
a hurricane will occur given a new set of previously unseen meteorological
data (in the form of a new row aTnew of A) by evaluating

b̂ = aTnew x?

where “hat” denotes an estimated value. Typically the value of b̂ will not
be ±1 as it would be in the ideal case. But in practice a good prediction
could be made by declaring a hurricane will develop if b̂ ≥ T , and not
develop otherwise, where T is some suitably–chosen threshold such as, e.g.,
the value zero.
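A sketch of the fit–and–threshold procedure on synthetic data follows; the feature matrix, the underlying weights and the threshold T = 0 are hypothetical stand-ins, since no particular data set is specified here.

% Sketch: LS "training" followed by thresholded prediction (synthetic data).
m = 300; n = 6;
A = randn(m,n);                            % synthetic meteorological features
x_true = randn(n,1);                       % hypothetical underlying weights
b = sign(A*x_true + 0.3*randn(m,1));       % +/-1 labels (developed / did not develop)

x_star = A\b;                              % LS predictor weights

a_new = randn(1,n);                        % a new, previously unseen observation
b_hat = a_new*x_star;                      % predicted response
T = 0;                                     % threshold suggested in the text
if b_hat >= T
    disp('predict: hurricane develops')
else
    disp('predict: no hurricane')
end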

7.2 The Least-Squares Solution

It is now apparent that these examples all have the same mathematical
structure. Let us now provide a standardized notation. We define our

regression model , corresponding to (7.2), (7.6) or (7.8) as:

b = Ax + n (7.9)

and we wish to determine the value xLS which solves

x_{LS} = \arg\min_x ||Ax - b||_2^2     (7.10)
where A ∈ Rm×n , m > n, b ∈ Rm . For this treatment, the matrix A is
assumed full rank. Extension to the rank deficient case is covered in the
following chapters. We first consider the case where the intercept in (7.9)
is zero; i.e., E(b) = 0 when x = 0, and then generalize to the non–zero case
later in this section.

A geometric interpretation for the LS objective function (7.10) is given in


Fig. 7.2 for the one–dimensional case as determining the slope x of the line
b = xa such that the sum of the squared distances from the observed points
to the line is minimized. In multi–dimensions we would have a hyperplane
b = Ax instead of a line, but we would still be minimizing the sum of
squared distances from the observed points to the hyperplane.

In this general context, we note that b is a vector of observations, which


correspond to a linear model of the form Ax, contaminated by a noise con-
tribution, n. The matrix A is a constant. In determining xLS , we find that
value of x which provides the best fit of the observations to the model, in
the 2–norm sense.

We now discuss a few relevant points concerning the LS problem:

• The system (7.10) is overdetermined and hence no solution exists in the general case for which Ax = b exactly.
• Of all commonly used values of p for the norm || · ||p in (7.3) or (7.7),
p = 2 is the only one for which the norm is differentiable for all values of
x. Thus, for any other value of p, the optimal solution is not available
in closed form.
• Note that for Q orthonormal, we have (only for p = 2)

||Ax - b||_2^2 = \left\| Q^T Ax - Q^T b \right\|_2^2 .     (7.11)

This fact is used to advantage later on.

Figure 7.2. A geometric interpretation of the LS problem for the one-dimensional case.
The sum of the squared vertical distances from the observed points to the line b = xa is
to be minimized with respect to the variable (slope) x.

• We define the minimum sum of squares of the residual ||AxLS − b||22
as ρ2LS .
• If r = rank(A) < n, then there is no unique xLS which minimizes ||Ax − b||2 . However, the solution can be made unique by considering only that element of the set \{x_{LS} \in R^n \mid ||Ax_{LS} - b||_2 = \min\} which itself has minimum norm.

We wish to estimate the parameters x by solving (7.10). The method we


choose to solve (7.10) is to differentiate the quantity ||Ax − b||22 with respect
to x and set the result to zero. Thus, the remaining portion of this section
is devoted to this differentiation. The result is a closed-form expression for
the solution of (7.10).

The expression ||Ax − b||22 can be written as


||Ax − b||22 = (Ax − b)T (Ax − b)
= bT b − xT AT b − bT Ax + xT AT Ax (7.12)
The solution xLS is that value of x which satisfies

\frac{d}{dx}\left[ b^T b - x^T A^T b - b^T Ax + x^T A^T Ax \right] = 0.     (7.13)
Define each term in the square brackets above sequentially as t1 (x), . . . , t4 (x)
respectively. Therefore we solve
\frac{d}{dx}\left[ t_2(x) + t_3(x) + t_4(x) \right] = 0     (7.14)

where we have noted that the derivative \frac{d}{dx} t_1 = 0, since b is independent of
x.

We see that every term of (7.14) is a scalar. To differentiate (7.14) with


respect to the vector x, we differentiate each term of (7.14) with respect to
each element xi of x, and then assemble all the results back into a vector.
We now discuss the differentiation of each term of (7.14):

Differentiation of t2 (x) and t3 (x) with respect to x


Let us define the quantity c = AT b. This implies that the component ck of
c is aTk b, k = 1, . . . , n, where aTk is the transpose of the kth column of A.

202
Thus t2 (x) = −xT c. Therefore,
\frac{d}{dx_k} t_2(x) = \frac{d}{dx_k}(-x^T c) = -c_k = -a_k^T b, \quad k = 1, \ldots, n.     (7.15)
Combining these results for k = 1, . . . , n back into a column vector, we get
\frac{d}{dx} t_2(x) = \frac{d}{dx}(-x^T A^T b) = -A^T b.     (7.16)
Since Term 3 of (7.14) is the transpose of term 2 and both are scalars, the
terms are equal. Hence,
\frac{d}{dx} t_3(x) = -A^T b.     (7.17)

Differentiation of t4 (x) with respect to x

The differentiation of the quadratic form t4 is covered in Appendix A of


Chapter 2. Using the fact that AT A is symmetric, the result is:
\frac{d}{dx} t_4(x) = \frac{d}{dx}(x^T A^T A x) = 2A^T Ax.     (7.18)
Substituting (7.16), (7.17) and (7.18) into (7.13) we get the important de-
sired result:
AT Ax = AT b. (7.19)
The value xLS of x, which solves (7.19) is the least-squares solution corre-
sponding to (7.10). Eqs. (7.19) are called the normal equations. The reason
for this terminology is discussed in the next section. Note that since AT A
in (7.19) is square, symmetric and positive definite, the Cholesky decompo-
sition is the preferred method for solving the normal equations.
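A sketch of the Cholesky route for solving (7.19) is given below; the data are random placeholders, and the backslash applied to a triangular matrix simply performs the forward or back substitution.

% Sketch: solving the normal equations (7.19) via the Cholesky decomposition.
m = 100; n = 5;
A = randn(m,n); b = randn(m,1);            % placeholder data

R = chol(A.'*A);                           % A'A = R'*R, with R upper triangular
xLS = R \ (R.' \ (A.'*b));                 % forward substitution, then back substitution

xCheck = A\b;                              % MATLAB's QR-based solver, for comparison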

Extension to the non–zero intercept case: It is straightforward to deal


with the situation where the intercept is known to be non–zero. An m × 1
column of ones (i.e., 1) is appended to the matrix A as follows:
A ← [A 1] . (7.20)
Then n ← n + 1 and the LS method proceeds exactly as before. The last
element of xLS is then the value of the intercept. As we have seen from Sect.
5.5, adding more columns to A has the effect of increasing the variance of the
LS estimates. It is therefore expected that the variances of the LS estimates
in the non–zero intercept case may be degraded relative to the zero intercept
case.

7.2.1 Interpretation of the Normal Equations

Eq. (7.19) can be written in the form

AT (b − AxLS ) = 0 (7.21)

or
AT r LS = 0 (7.22)
where

r LS = b − AxLS (7.23)
is the least–squares residual vector between AxLS and b. Thus, r LS must
be orthogonal to R(A) for the LS solution, xLS . Hence, the name “normal
equations”. This fact gives an important interpretation to least-squares
estimation, which we now illustrate for the 3 × 2 case. Eq. (7.9) may be
expressed as

b = \begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + n.
The above vector relation is illustrated in Fig. 7.3. We see from (7.22) that
the point AxLS is at the foot of a perpendicular dropped from b into R(A).
The solution xLS are the coefficients of the linear combination of columns
of A which equal the “foot vector”, AxLS .

This interpretation may be augmented as follows. From (7.19) we see that


xLS is given by
x_{LS} = (A^T A)^{-1} A^T b     (7.24)

Hence, the point Ax_{LS} which is in R(A) is given by

Ax_{LS} = A (A^T A)^{-1} A^T b \equiv P b     (7.25)

where P is the projector onto R(A). Thus, we see from another point of view
that the least-squares solution is the result of projecting b (the observation)
onto R(A).

It is seen from (7.9) that in the noise-free case, the vector b is equal to the
vector AxLS . The fact that AxLS should be at the foot of a perpendicular
from b into R(A) makes intuitive sense, because a perpendicular is the
shortest distance from b into R(A). This, after all, is the objective of the
LS problem as expressed by eq. (7.10).

Figure 7.3. A geometric interpretation of the LS problem for the 3 × 2 case. The red
cross-hatched region represents a portion of R(A). According to (7.21), the point AxLS
is at the foot of a perpendicular dropped from b into R(A).

There is a further point we wish to address in the interpretation of the
normal equations. Substituting (7.25) into (7.23) we have

r LS = b − A(AT A)−1 AT b
= (I − P )b
= P ⊥ b. (7.26)

Thus, r LS is the projection of b onto R(A)⊥ , as expected from Fig. 7.3.

We can now determine the value ρ2LS , which is the squared 2–norm of the
LS residual:

ρ2LS = ||r LS ||22 = ||P ⊥ b||22 . (7.27)

The fact that r LS is orthogonal to R(A) is of fundamental importance. In


fact, it is easy to show that choosing x so that r LS ⊥ R(A) is a sufficient
condition for the least–squares solution. Often in analysis, xLS is determined
this way, instead of through the normal equations. This concept is referred
to as the principle of orthogonality [7].
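The principle of orthogonality is easily verified numerically, as in the following short sketch with arbitrary data:

% Numerical check of the orthogonality principle: A'*r_LS = 0.
m = 50; n = 4;
A = randn(m,n); b = randn(m,1);
xLS = A\b;                                 % LS solution
rLS = b - A*xLS;                           % LS residual
disp(norm(A.'*rLS))                        % essentially zero (round-off level)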

7.3 Properties of the LS Estimate

In many problems relating to the field of signal processing, we have available


a set of measurements (observations) of some physical process for which some
mathematical model, described in terms of a set of parameters, is available.
In our case, we have assumed a linear model for the observations b of the
form
b = Ax + n (7.28)

where the noise n is a random variable with a prescribed probability density


function, A is known and x are the parameters we wish to estimate. Eq.
(9.1) is referred to as the regression equation. Given the observations in
the presence of noise, we wish to determine as accurate an estimate of the
parameters x as possible. But because of the noise, the parameter estimates
are random variables and hence are subject to error. We will be looking
at the probability density function of the parameter estimates later in the
section.

We naturally want to reduce the error in the parameter estimates as far
as possible, but to do this, we need to quantify the error itself. Two such
measures for this purpose are bias and covariance. The bias of an estimated
parameter vector θ is defined as E(θ̂ − θ o ), where the expectation is taken
over all possible values of the parameter estimate θ̂, and θ o is the true value
of the parameter. The covariance matrix of the parameter estimate is given
as

cov(\hat{\theta}) = E\left[ \left(\hat{\theta} - E(\hat{\theta})\right) \left(\hat{\theta} - E(\hat{\theta})\right)^T \right].     (7.29)

From (7.24), we see that xLS is a linear transform of b; therefore, xLS is


also a random variable. We now study its properties.

In order to discuss useful and interesting properties of the LS estimate, we


make the following assumptions:

A1 n is a zero-mean random vector with identically distributed, uncorre-


lated elements; i.e., E(nnT ) = σ 2 I.
A2 A is a constant matrix, which is known with negligible error. That is,
there is no uncertainty in A.

7.3.1 xLS is an unbiased estimate of xo , the true value

To show this, we have from (7.24)


x_{LS} = (A^T A)^{-1} A^T b.     (7.30)

We realize that the observed data b are generated from the true values xo of x. Hence from (7.28)

x_{LS} = (A^T A)^{-1} A^T (Ax_o + n)
       = x_o + (A^T A)^{-1} A^T n.     (7.31)

Therefore, E(xLS ) is given as

E(x_{LS}) = x_o + E\left[ (A^T A)^{-1} A^T n \right]
          = x_o,     (7.32)
which follows because n is zero mean from assumption A1. Therefore the
expectation of xLS is its true value, and xLS is unbiased.

7.3.2 Covariance Matrix of xLS

From (7.31) and (7.32), x_{LS} - E(x_{LS}) = (A^T A)^{-1} A^T n. Substituting these
values into (7.29), we have

cov(x_{LS}) = E\left[ (A^T A)^{-1} A^T \, n n^T \, A (A^T A)^{-1} \right]     (7.33)

From assumption A2, we can move the expectation operator inside. There-
fore,
 
cov(x_{LS}) = (A^T A)^{-1} A^T \underbrace{E\left[n n^T\right]}_{\sigma^2 I} A (A^T A)^{-1}
            = (A^T A)^{-1} A^T (\sigma^2 I) A (A^T A)^{-1}
            = \sigma^2 (A^T A)^{-1}     (7.34)

where we have used the result that cov(n) = σ 2 I from A1.

It is desirable for the variances of the estimates xLS to be as small as possible.


How small does (7.34) say they are? We see that if σ 2 is large, then the
variances (which are the diagonal elements of cov(xLS )) are also large. This
makes sense because if the variances of the elements of n are large, then the
variances of the elements of xLS could also be expected to be large. But
more importantly, (7.34) also says that if AT A is “big” in some norm sense,
then cov(xLS ) is “small”, which is desirable. We see later that if any of the
eigenvalues of AT A are small, then the variances of xLS can become large.
We can also infer that if A is rank deficient, then AT A is rank deficient,
and the variances of each component of x approach infinity, which implies
the results are meaningless.
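A small Monte Carlo experiment, sketched below, confirms both the unbiasedness shown in Sect. 7.3.1 and the covariance expression (7.34); the particular A, x_o and σ are arbitrary choices made for illustration.

% Monte Carlo check of E(xLS) = xo and cov(xLS) = sigma^2*inv(A'A).
m = 60; n = 3; sigma = 0.5; trials = 20000;
A = randn(m,n); xo = [1; -2; 0.5];
Xest = zeros(n,trials);
for t = 1:trials
    b = A*xo + sigma*randn(m,1);
    Xest(:,t) = A\b;
end
bias_est = mean(Xest,2) - xo               % close to the zero vector
cov_est  = cov(Xest.')                     % close to the theoretical covariance
cov_thry = sigma^2*inv(A.'*A)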

The geometry relating to LS variance is shown in Fig. 7.4 for the one–
dimensional case. Here, the normal equations (7.19) devolve into the form x_{LS} = (a^T b)/(a^T a), and the variance expression (7.34) for var(x_{LS}) becomes σ²/(a^T a).
(Here a is denoted in lower case since it is a vector, and x in unbolded format,
since it is a scalar). For part (a) of the figure, we see the range of a–values
is relatively compressed, in which case aT a is small, whereupon var(xLS ) is
large. This fact is evident from the figure, in that the slope estimates xLS
will vary considerably over different sample sets of the observations (bi , ai ),

having the same noise variance and the same spread ∆a. On the other hand,
we see from part (b) that, due to the larger spread ∆a in this case, the slope
estimate is more stable over different samples of observations with the same
noise variance.

7.3.3 Variance of a Predicted Value of b

One of the main applications of LS analysis is to be able to predict a scalar


value b̂ of b that corresponds to a new observation aTN of variables in the form
of a new row of A. This procedure requires the availability of a training set
of data; i.e., a matrix A and the corresponding set of values b. The A and
corresponding b constitute the training set. The availability of the training
set enables us to calculate xLS through the normal equations (7.19). Once
xLS is calculated, the predicted value b̂ that corresponds to new observations
aTN may be determined as
b̂ = aTN xLS , (7.35)

which may be ascertained to be the expected value of b given xLS and the
new data aTN . The problem of predicting responses to new observations is
treated at length in Chapter 9.

The question arises, “How good is this estimate of b̂”? To address this issue,
we evaluate the variance of the prediction b̂. We define bo as bo = aTN xo ,
where xo = ExLS . The variance σb2 of b̂ is calculated as follows:

\sigma_b^2 = E\left[(b_o - \hat{b})(b_o - \hat{b})\right]
           = E\left[a_N^T (x_o - x_{LS})(x_o - x_{LS})^T a_N\right],

where we have used (7.35) in the second line. Since aTN are measured and
not random variables, the expectation operator can be moved to the inside
set of brackets. This expectation is given by (7.34). Therefore we can write

σb2 = σ 2 aTN (AT A)−1 aN . (7.36)

We note that the predicted value b̂ is also dependent on the quantity (AT A)−1 ,
and therefore as we have seen, one small eigenvalue can result in large vari-
ances in the estimate b̂. Various latent variable approaches for mitigating
this effect are discussed in Chapter 9.
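For a given training matrix A and a new observation a_N, (7.36) can be evaluated directly, as in the short sketch below (the data are placeholders):

% Sketch: variance of a predicted response b_hat, per (7.36).
m = 60; n = 3; sigma = 0.5;
A = randn(m,n);                            % training data (illustrative)
aN = randn(n,1);                           % new, previously unseen observation
var_bhat = sigma^2 * (aN.' * ((A.'*A)\aN))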

Figure 7.4. The geometry of LS variances. In the top figure, the observations (dots) are
spread over a narrow range of a–values, giving rise to large variation in slope estimates
over different samples of the observations. In the lower figure, the spread of a–values
is larger, giving rise to lower variances in the slope estimates over different samples of
observations. The magnitude of the slope variation is depicted by the magnitude of the
arc’d arrows.

7.3.4 xLS is a BLUE (aka The Gauss–Markov Theorem)

According to (7.24), we see that xLS is a linear estimate, since it is a linear


transformation of b, where the transformation matrix is (AT A)−1 AT . Fur-
ther from (7.32), we see that xLS is unbiased. We show that, amongst the
class of linear unbiased estimators, xLS has the smallest variances; i.e., xLS
is the best linear unbiased estimator (BLUE).

Theorem 7 Consider any linear unbiased estimate x̃ of x, defined by

x̃ = Bb (7.37)

where B ∈ Rn×m is an estimator, or transformation matrix. Then under


A1 and A2, xLS is a BLUE.

Proof: from [8]. Substituting (7.28) into (7.37) we have

x̃ = BAxo + Bn. (7.38)

Because n has zero mean (A1),

E(x̃) = BAxo .

For x̃ to be unbiased, we therefore require

BA = I. (7.39)

We can now write (7.38) as

x̃ = xo + Bn.

The covariance matrix of x̃ is then

cov(\tilde{x}) = E\left[(\tilde{x} - x_o)(\tilde{x} - x_o)^T\right]
               = E\left[B n n^T B^T\right]
               = \sigma^2 B B^T,     (7.40)

where we have used A1 in the last line.

We now consider a matrix Ψ defined as the difference of the estimator matrix
B and the least–squares estimator matrix (AT A)−1 AT :

Ψ = B − (AT A)−1 AT

Now using (7.39) we form the matrix product ΨΨT :

\Psi\Psi^T = \left[B - (A^T A)^{-1} A^T\right]\left[B^T - A(A^T A)^{-1}\right]
           = BB^T - BA(A^T A)^{-1} - (A^T A)^{-1} A^T B^T + (A^T A)^{-1}
           = BB^T - (A^T A)^{-1}.     (7.41)

where we have used BA = I. We note that the ith diagonal element of


ΨΨT is the squared 2–norm of the ith row of Ψ; hence (ΨΨT )ii ≥ 02 .
Hence from (7.41) we have

\sigma^2 (BB^T)_{ii} \geq \sigma^2 \left[(A^T A)^{-1}\right]_{ii}, \quad i = 1, \ldots, n.     (7.42)

We note that the diagonal elements of a covariance matrix are the vari-
ances of the individual elements. But from (7.40) and (7.34) we see that
σ 2 BB T and σ 2 (AT A)−1 are the covariance matrices of x̃ and xLS respec-
tively. Therefore, (7.42) tells us that the variances of the elements of x̃ are
never better than those of xLS . Thus, within the class of linear unbiased es-
timators, and under assumptions A1 and A2, no other estimator has smaller
variance than the L–S estimate.

We see later in this chapter that at least one small eigenvalue of the matrix
AT A can cause the variances of xLS to become large. This undesirable
situation can be mitigated by using the pseudo–inverse method discussed
in the following chapter. However, the pseudo–inverse introduces bias into
the estimate. In many cases, the overall error (a combination of bias and
variance) is considerably reduced with the pseudo–inverse approach. Thus,
the idea of an unbiased estimator is not always desirable and there may
be biased estimators which perform better on average than their unbiased
counterparts.

2
The notation (·)ij means the (i, j)th element of the matrix argument.

7.4 Least Squares Estimation from a Probabilistic
Approach

In this section, we investigate the properties of the LS estimator in a proba-


bilistic sense, when the underlying probability density function (pdf) of the
noise (and hence of xls ) is known. In order to conduct the analysis, we make
the additional assumption A3:

A3: For the following properties, we further assume n is jointly


Gaussian distributed.

Let us reconsider the LS linear regression model, which we reproduce here:

b = Axo + n. (7.43)

Here we assume the more general case where the covariance of the noise is Σ.
Given the observation b, and if A and xo and Σ are known, then under the
current assumptions b is a Gaussian random variable with mean Axo and
covariance matrix Σ. (Recall this distribution is denoted as N (Axo , Σ)).
Since the multivariate Gaussian pdf is completely described in terms of its
mean and covariance, then
p(b|A, x_o, \Sigma) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left[ -\frac{1}{2}\left(b - Ax_o\right)^T \Sigma^{-1} \left(b - Ax_o\right) \right]     (7.44)

We also investigate an additional pdf, which is that of xLS given all the
parameters. It is a fundamental property of Gaussian–distributed random
variables that any linear transformation of a Gaussian–distributed quantity
is also Gaussian. From (7.24) we see that xLS is a linear transformation of
b, which is Gaussian by hypothesis. Since we have seen that the mean of
xLS is xo and the covariance specifically for the white noise case from (7.34)
is \sigma^2 (A^T A)^{-1}, then x_{LS} has the Gaussian pdf given by

p(x_{LS}|x_o, A, \sigma) = (2\pi)^{-\frac{n}{2}} \left|\sigma^{-2} A^T A\right|^{\frac{1}{2}} \exp\left[ -\frac{1}{2\sigma^2} (x_{LS} - x_o)^T A^T A (x_{LS} - x_o) \right].     (7.45)
Recall from the discussion of Sect. 4.3 that the joint confidence region (JCR)
of xLS is defined as the locus of points ψ where the pdf has a constant value
with respect to variation in xLS . These JCR’s are elliptical in shape. The

probability level α of an observation falling within the JCR is the integral
of the interior of the ellipse. Since the variable xLS appears only in the
exponent, the set ψ is defined as the set of points xLS such that the quadratic
form in the exponent (and hence the pdf itself) is equal to a constant – that
is, the JCR ψ is defined as
\psi = \left\{ x_{LS} \;\Big|\; \frac{1}{2\sigma^2} (x_{LS} - x_o)^T A^T A (x_{LS} - x_o) = k \right\}     (7.46)


where the value k is determined from the probability level α.

Let us rewrite the quadratic form in the exponent of (7.46) as

-\frac{1}{2\sigma^2} z^T \Lambda z

where z = V T (xLS − xo ) and V ΛV T is the eigendecomposition of AT A.
The length of the ith principal axis of the associated ellipse is then proportional to 1/\sqrt{\lambda_i}. This means that if a particular eigenvalue is small, then
the length of the corresponding axis is large, and z has large variance in the
direction of the corresponding eigenvector v i , as shown in Fig. 7.5. It may
be observed that if v i has significant components along any component of
xLS , then these components of xLS have large variances too. From Fig. 7.5,
it is seen that λ2 is smaller than λ1 , which causes large variation along the
v 2 –axis, which in turn causes large variances on both the x1 and x2 axes.

On the other hand, if all the eigenvalues are larger, then the variances of z,
and hence xLS , are lower in all directions. This situation is shown in Fig.
7.6, where it is seen in this case the eigenvalues are well–conditioned and all
relatively large. In this case, the variation along the co–ordinate axes has
been considerably reduced.

We see that the variances of both the x1 and x2 components of xLS are large
due to only one of the eigenvalues being small. Generalizing to multiple
dimensions, we see that if all components of xLS are to have small variance,
then all eigenvalues of AT A must be large. Thus, for desirable variance
properties of xLS , the matrix AT A must be well– conditioned; i.e., the
condition number of AT A, (as discussed in Sect. 5.4) should be as close to
unity as possible, and the eigenvalues be as large as possible. This is the
“sense” referred to earlier in which the matrix AT A must be “big” in order
for the variances to be small.

Figure 7.5. The blue ellipse represents a joint confidence region at some probability level
α, where the semi–axes have lengths proportional to 1/\sqrt{\lambda_i} as shown. The fact that λ2 is
relatively small in this case causes large variation along the v 2 –axis, which in turn causes
large variation along each of the co–ordinate axes, leading to large variances of the LS
estimates.

Figure 7.6. A joint confidence region similar to that in Fig. 7.5, but where the eigenvalues
are better conditioned and relatively large. In this case we see that the variation along
the co–ordinate axes is considerably reduced.

From the above, we see that one small eigenvalue has the ability to make
the variances of all components of xLS large. In the following chapters,
we present various methods for mitigating the effect of a small eigenvalue
destroying the desirable variance properties of xLS .

Another way of interpreting the fact that eigenvalues of AT A with large disparity


can cause large variances in xLS is from a condition number perspective.
From Sect. 6.5.1, we see that the condition number in this case
is given by λ1 /λn . The condition number is the factor by which errors in the
system of equations are magnified to give the error in the solution. Thus
large disparity in the eigenvalues results in the normal equations having
a large condition number, which means that errors in b due to noise are
magnified by a large amount, causing large variation in the solution xLS .
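The effect of eigenvalue disparity in AT A can be seen directly in a short experiment such as the following, where the two columns of A are made nearly collinear; the numbers are arbitrary.

% Sketch: a nearly rank-deficient A inflates the LS variances.
m = 100; sigma = 0.1; trials = 5000;
a1 = randn(m,1);
A_good = [a1, randn(m,1)];                 % well-conditioned columns
A_bad  = [a1, a1 + 1e-3*randn(m,1)];       % nearly collinear columns
xo = [1; 1];
est_good = zeros(2,trials); est_bad = zeros(2,trials);
for t = 1:trials
    nse = sigma*randn(m,1);
    est_good(:,t) = A_good\(A_good*xo + nse);
    est_bad(:,t)  = A_bad \(A_bad *xo + nse);
end
var_good = var(est_good,0,2)               % small variances
var_bad  = var(est_bad,0,2)                % much larger variances
disp(eig(A_bad.'*A_bad))                   % one eigenvalue is very small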

7.4.1 Least Squares Estimation and the Cramer–Rao Lower Bound

In this section, we discuss the relationship between the Cramer-Rao lower


bound (CRLB) and the linear least-squares estimate in white and coloured
noise.

The CRLB is a straightforward method for determining the minimum possi-


ble covariance matrix of a parameter estimate in the presence of noise, based
on the conditional probability function p(b|A, Σ, x) of the observed data b
given the parameters A, Σ and x. The CRLB we consider here applies
only to unbiased estimators. In order to define the CRLB, we consider the
so–called Hessian matrix J of second derivatives, where

(J)_{ij} = -E\left[ \frac{\partial^2 \ln p(b|A, \Sigma, x)}{\partial x_i \, \partial x_j} \right].     (7.47)

The matrix J defined by (7.47) is referred to as the Fisher information


matrix. Now consider a matrix U which is the covariance matrix of the
parameter estimates x obtained by some unbiased estimation process; i.e.,
cov(x̃) = U , where x̃ is some estimate of x obtained by some other esti-
mator. Then J −1 is a lower bound on U , in the sense that U − J −1 is positive semi–definite; in particular,

u_{ii} \geq j^{ii}, \quad i = 1, \ldots, n,     (7.48)

where j ii denotes the (i, i)th element of J −1 . Because the diagonal elements
of a covariance matrix are the variances of the individual elements, (7.48)
tells us that the individual variances of the estimates x̃i obtained by some
arbitrary estimator are greater than or equal to the corresponding diagonal
term of J −1 . The CRLB thus puts a lower bound on how small the variances
can be, regardless of how good the estimation procedure is.

For the regression model given by b = Ax + n in the white noise case, J is


obtained by substituting (7.45) into (7.47). The constant terms preceding
the exponent in (7.45) are not functions of x, and so are not relevant with
regard to the differentiation. Thus we need to consider only the exponential
term of (7.45). Because of the ln(·) operation, (7.47) reduces to the second
derivative matrix of the quadratic form in the exponent. The expectation
operator of (7.47) is redundant in our specific case because all the second
derivative quantities are constant. Thus,
(J)_{ij} \equiv \frac{\partial^2}{\partial x_i \, \partial x_j}\left[ \frac{1}{2\sigma^2} (x - x_o)^T (A^T A)(x - x_o) \right].
Using the treatment of Appendix A, Ch. 2, it is straightforward to show
that
J = \frac{1}{\sigma^2}(A^T A).     (7.49)
The CRLB value J −1 corresponding to (7.49) is precisely the least squares
covariance matrix given by (7.34) for the white noise case. Thus, for the
model given by b = Ax + n when n is Gaussian distributed white noise,
the LS estimator satisfies the CRLB and therefore has the lowest covariance
possible given observations governed by (7.43) .

7.4.2 Least-Squares Estimation and the CRLB for Gaussian Coloured Noise

In this case, we consider Σ to be an arbitrary covariance matrix, i.e.,


E(nnT ) = Σ. By substituting (7.44) into (7.47) and evaluating, it is
straightforward to show that the Fisher information matrix J for this case
is given by
J = AT Σ−1 A. (7.50)
Now suppose we use the ordinary normal equations (7.19) (which assumes
the background noise is white), to produce the estimate xLS when the actual

noise n is coloured, with covariance matrix Σ. Using the same analysis as in
Sect. 7.3.2, except replacing E(b − Axo )(b − Axo )T with Σ, the covariance
matrix of xLS becomes

cov(xLS ) = (AT A)−1 AT ΣA(AT A)−1 . (7.51)

Note that this covariance matrix is not equal to J −1 from (7.50). Therefore
the variances on the elements of xLS in this case are necessarily larger than
the minimum possible given by the bound3 . Therefore using the ordinary
normal equations when the noise is not white results in estimates with sub–
optimal variances.

We now show however, that if Σ is known or can be estimated, we may


improve the situation by pre-whitening the noise. Let Σ = GGT , where G
is the Cholesky factor. Then, multiplying both sides of (7.9) by G−1 , the
noise is whitened, and we have

G−1 b = G−1 Ax + G−1 n. (7.52)

Using the above as the regression model, and substituting G−1 A for A and
G−1 b for b in (7.19), we get:

xLS = (AT Σ−1 A)−1 AT Σ−1 b (7.53)

The covariance matrix corresponding to this estimate is found as follows.


We can write
E(xLS ) = (AT Σ−1 A)−1 AT Σ−1 Axo . (7.54)
Substituting (7.54) and (7.53) into (7.29) we get

cov(x_{LS}) = E\left[(A^T \Sigma^{-1} A)^{-1} A^T \Sigma^{-1} (b - Ax_o)(b - Ax_o)^T \Sigma^{-1} A (A^T \Sigma^{-1} A)^{-1}\right]
            = (A^T \Sigma^{-1} A)^{-1} A^T \Sigma^{-1} \underbrace{E\left[(b - Ax_o)(b - Ax_o)^T\right]}_{\Sigma} \Sigma^{-1} A (A^T \Sigma^{-1} A)^{-1}
            = (A^T \Sigma^{-1} A)^{-1}.     (7.55)

Notice that in the coloured noise case when the noise is pre–whitened as in
(7.52), the resulting matrix cov(xLS ) is equivalent to J −1 in (7.50), which
is the corresponding form of the CRLB; i.e., the equality of the bound is
now satisfied.
3
However, it may be shown that xLS obtained in this way in coloured noise is at least
unbiased.

Hence, in the presence of coloured noise with a covariance matrix that is
either known or can be estimated, pre–whitening the noise before applying
the linear least–squares estimation procedure also results in a minimum
variance unbiased estimator of x. We have seen this is not the case when
the noise is not prewhitened.
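A sketch of the pre–whitening procedure of (7.52)–(7.53) follows; the coloured–noise covariance Σ used here is a hypothetical example constructed only so that it is positive definite.

% Sketch: LS estimation in coloured noise with pre-whitening, per (7.52)-(7.53).
m = 80; n = 4;
A = randn(m,n); xo = randn(n,1);
C = randn(m); Sigma = C*C.' + 0.1*eye(m);  % an example coloured-noise covariance
G = chol(Sigma,'lower');                   % Sigma = G*G'
b = A*xo + G*randn(m,1);                   % coloured noise n = G*(white noise)

Aw = G\A;  bw = G\b;                       % whitened regression model (7.52)
xLS = (Aw.'*Aw)\(Aw.'*bw);                 % equivalent to (7.53)
cov_xLS = inv(A.'*(Sigma\A));              % attains the CRLB, per (7.55)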

7.4.3 Maximum–Likelihood Property

The maximum likelihood method is a very powerful technique for estimating


parameters from observed data. In this vein, we show that the least–squares
estimate xLS is the maximum likelihood estimate of xo . We first investigate
the probability density function of n = Ax − b in the coloured noise case,
given by (7.44) which is repeated here for convenience. Throughout this
section, we assume that A and Σ are known constants.
 
p(n) = p(b|x_o, A, \Sigma) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left[ -\frac{1}{2} (Ax_o - b)^T \Sigma^{-1} (Ax_o - b) \right].     (7.56)
The conditional pdf (7.56) describes the variation in the observation b as
a result of the noise, given that x is assigned its true value xo . Thus, with
regard to the synthesis process which generates b, the quantity xo is a
known constant, whereas the actual values of b and n are random variables,
governed by (7.56). But now consider the analysis process which observes,
or analyzes, b. Here the interpretation is flipped around – the observation
b (and hence n) are now constant since b is a measured quantity, but xo
is not known and is considered to be a random variable with a probability
distribution associated with it.

In order to estimate the value of x based on an observation b, we use a simple


but very elegant trick based on the analysis process. We choose the value of
x which is most likely to have given rise to the observation b based on the
distribution (7.56). This is the value of x for which (7.56) is maximum with
respect to variation in x, with b held constant at the value which was observed.
This value of x is referred to as the maximum likelihood estimate of x. It
is a very powerful estimation technique and has many desirable properties,
discussed in several texts [12, 13].

Note from (7.56) that the value x which maximizes the conditional proba-
bility p(b|x) is precisely xLS . This follows because xLS is by definition that

value of x which minimizes the quadratic form of the exponent in (7.56).
Thus, xLS is also the maximum likelihood estimate of x. Variances of max-
imum likelihood estimates asymptotically approach the Cramer–Rao lower
bound as the number of observations m → ∞. However, specifically for the
linear LS case, the variances satisfy the CRLB for finite m, as we have seen
from (7.55).

7.5 Other Constructs Related to Whitening

We have seen that whitening the noise is necessary for LS estimates to


satisfy the equality of the CRLB and thus attain the best possible covariance
structure. It turns out that the requirement of whitening applies not only
to LS problems, but to parameter estimation problems in general. In this
section, we discuss various algebraic structures that aid us in analyzing the
coloured noise case.

7.5.1 Mahalanobis distance

In the following treatment, we assume x is a zero–mean random vector, or


one in which the mean has been removed before processing. The Maha-
lanobis distance is a generalization of the Euclidean distance. We consider
the usual definition of the squared 2–norm for quantifying distances:

||x||22 = xT x.

We may write this in the form xT Ix. The squared Mahalanobis distance is
given by replacing the I with a full–rank, positive–definite matrix Σ−1 , to
get
||x||2Σ−1 = xT Σ−1 x. (7.57)
We have seen in Sect. 4.2 that the set of values \{x \mid x^T \Sigma^{-1} x = 1\} (for
which the Mahalanobis distance is constant) is an ellipse. From this, we
may suspect that the Mahalanobis distance varies with the direction of x.
To confirm this idea, we may write (7.57) in the form

||x||2Σ−1 = xT V Λ−1 V T x,


where V \Lambda V^T is the eigendecomposition of \Sigma. If we define z = V^T x, then

||x||_{\Sigma^{-1}}^2 = z^T \Lambda^{-1} z = \sum_{i=1}^{n} \frac{z_i^2}{\lambda_i},
where z are the coefficients of x in the basis V . The squared Mahalanobis
distance may therefore be interpretted as measuring distances along the
eigenvector directions, in units of the respective eigenvalue. In the coloured
noise case the eigenvalues of Σ are not equal, so distances are measured
differently along each eigenvector direction.

Further, let the covariance matrix of x be Σ. If Σ−1 = G−T G−1 , where


G is the Cholesky factor of Σ, then ||x||2Σ−1 = xT G−T G−1 x = y T y, where
y = G−1 x. We have seen previously in Sect. 5.3.3 that the covariance matrix
of y is I. Thus we see that the Mahalanobis distance in effect subjects x to a
whitening linear transformation before determining its Euclidean distance.
As we have seen, if x is a Gaussian–distributed random vector, then the
joint confidence region corresponding to x is elliptical in shape, whereas
that for y is spherical. It is for this reason that the whitening operation is
also referred to as “sphering” the data.
We note that the exponent \frac{1}{2}(x - x_o)^T \Sigma^{-1} (x - x_o) of the multivariate Gaussian distribution in x is in effect a Mahalanobis distance. The quantity x − xo is a zero–mean Gaussian random variable which is subjected to the
whitening transformation G−1 as discussed above, to give a vector of zero–
mean, uncorrelated, unit–variance Gaussian random variables. Thus the
form of the Gaussian exponent transforms an arbitrary Gaussian–distributed
random vector into a standard form with zero–mean and covariance I.
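A minimal sketch of computing the squared Mahalanobis distance by whitening first, as described above (the covariance matrix is an arbitrary example):

% Sketch: squared Mahalanobis distance via the Cholesky factor of Sigma.
Sigma = [2 0.8; 0.8 1];                    % an example covariance matrix
x = [1; -1];
G = chol(Sigma,'lower');                   % Sigma = G*G'
y = G\x;                                   % whitening ("sphering") transformation
d2 = y.'*y                                 % equals x'*inv(Sigma)*x
d2_check = x.'*(Sigma\x)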

7.5.2 Generalized Eigenvalues and Eigenvectors

We have studied the ordinary eigen–problem Av = λv at some length. We


now consider the so-called generalized eigen–problem given by
Av = λBv, (7.58)
where A and B are square, positive definite matrices.

We illustrate using covariance matrices as an example. Let RX = X T X


correspond to a set of measurements with an additive noise component, uncorrelated with the signal, whose data matrix is Y and which we assume to be coloured. We
substitute the covariance matrices RX and RY = Y T Y for A and B in
(7.58) respectively. We assume RY is full rank. Then (7.58) becomes

RX v = λRY v
= λGGT v

where the Cholesky factorization has been applied to RY . We define u =


GT v, or v = G−T u, to get

G−1 RX G−T u = λu. (7.59)

Thus the generalized eigen–problem is equivalent to an ordinary eigen–


problem on a transformed matrix. As we have seen from Sect. 5.3.3, the
transformation G−T whitens the additive noise component Y that is present
in the observations X – i.e., the columns of Y G−T are orthonormal with
corresponding covariance matrix equal to I. Therefore the effect of the
generalized eigen–problem is to apply a transformation on X to whiten the
noise component, thereby obtaining parameter estimates that can satisfy the
equality of the Cramer–Rao lower bound, as discussed earlier in this chapter.
Note that to apply this technique, either we must have samples of the noise–
only signal available, or RY must be estimated by some means. An example
of the use of the generalized eigendecomposition is the MUSIC algorithm
discussed in Ch.2 in the presence of coloured noise, where we may apply a
generalized eigendecomposition instead of the ordinary eigendecomposition
to obtain optimal estimates.
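The reduction of (7.58) to the ordinary eigen–problem (7.59) can be sketched as follows; the data matrices are synthetic, and MATLAB's eig(RX,RY) gives the same generalized eigenvalues directly.

% Sketch: generalized eigen-problem RX*v = lambda*RY*v via Cholesky reduction.
X = randn(100,4);                          % synthetic measurement data
Y = randn(100,4)*triu(ones(4));            % synthetic coloured noise data
RX = X.'*X;  RY = Y.'*Y;
G = chol(RY,'lower');                      % RY = G*G'
[U,L] = eig(G\RX/(G.'));                   % ordinary eigen-problem (7.59)
V = (G.')\U;                               % generalized eigenvectors v = G^{-T} u
[Vg,Lg] = eig(RX,RY);                      % direct generalized solver, for comparison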

Common Spatial Patterns e.g., [17]. We present an additional appli-


cation of the generalized eigen–problem that is relevant to the electroen-
cephalogram (EEG) [18]. The EEG records brain activity by placing an
array of electrodes on the scalp in pre-set positions. The signals from the
electrodes are passed through a high–gain amplifier and filtered. The signals
on each channel are sampled simultaneously (typically with a sample rate
of less than 1000 Hz), digitized and then stored in memory. A depiction is
given in Fig. 7.7.

In this example, we wish to use the EEG to discriminate between healthy brain
activity and that which represents some form of pathological brain activity,
arising from, e.g., coma, concussion, epilepsy or others. In this vein, we

Figure 7.7. An illustration of the EEG.

denote xH (t) ∈ Rn , where n is the number of electrodes, as the time–varying


EEG signal corresponding to a subject in the healthy state. Similarly, we
denote xP (t) as the EEG data measured from a subject in a pathological
state. We can form a data matrix X ∈ Rm×n , for either case, where m is
the number of available vector samples. Each row of X corresponds to a
snapshot from all electrodes at a distinct point in time.

We wish to discriminate between the healthy and pathological states using


the EEG. The strategy is to apply a scalar weight wi to the signal received
at the ith electrode, i = 1, . . . , n and add the weighted results. That is,
we evaluate the inner product xTj w over j = 1, . . . , m sample points. In
this vein, we determine the weight vector w ∈ Rn so that the ratio of the
quantities ||X H w||22 and ||X P w||22 is maximum. Finding such a value for
w gives us a scalar output from the EEG which maximally discriminates
between the two conditions. Expressed mathematically, we wish to find the
value of w which solves the following problem:

w^{\ast} = \arg\max_{w} \frac{w^T X_H^T X_H w}{w^T X_P^T X_P w}.

We solve this optimization problem in the usual manner by differentiating


and setting the result to zero. Note that both the numerator and denomi-
nator are scalars. Hence we can use the regular form of the quotient rule for
differentiation of each element of w. Differentiating, and setting the result

to zero we obtain
\frac{\left(w^T R_P w\right) 2R_H w - \left(w^T R_H w\right) 2R_P w}{(\cdot)} = 0,

where RH = X TH X H ; similarly for RP . The denominator of the above is not written out, since setting the fraction to zero requires only that the numerator vanish. Solving the above, we get
RH w = λRP w, (7.60)
where \lambda = \frac{w^T R_H w}{w^T R_P w} is the generalized eigenvalue.

The result y(t) = x(t)T w gives us a scalar, time–varying waveform whose


variance provides the maximum discrimination between the two conditions,
when w is the generalized eigenvector associated with the largest generalized eigenvalue. In practice, we would set
some suitably–defined threshold for the variance estimate of y(t), and if it
exceeds the threshold, declare the subject healthy, otherwise pathological.

We give a practical interpretation for the epilepsy case. During an epileptic


seizure, the EEG typically records high intensity spike activity over spe-
cific localized regions of the scalp. In this case therefore, we would expect
the weights wi corresponding to the spiking region to have larger values,
and those for the remaining regions to have lower values. In this manner,
the spiking activity is most strongly discriminated and therefore detected
relative to the healthy case which has no spiking activity.

An interesting interpretation of (7.60) can be obtained through the formulation of the generalized eigen–problem discussed in this section with regard
to whitening. That is, we choose w so that the power in yH is maximum,
when measured in the Mahalanobis metric xT RP x.
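A sketch of the common spatial patterns computation on synthetic healthy and pathological records is given below; the data generation (extra power on one channel in the pathological state) is purely illustrative.

% Sketch: common spatial patterns via the generalized eigen-problem (7.60).
m = 2000; nch = 8;                         % samples and number of electrodes
XH = randn(m,nch);                         % synthetic "healthy" EEG snapshots
XP = randn(m,nch); XP(:,3) = 5*XP(:,3);    % "pathological": extra power on channel 3
RH = XH.'*XH;  RP = XP.'*XP;

[W,L] = eig(RH,RP);                        % generalized eigenvectors and eigenvalues
[~,imax] = max(diag(L));
w = W(:,imax);                             % weight vector with the largest H-to-P power ratio

ratio = (w.'*RH*w)/(w.'*RP*w)              % equals the largest generalized eigenvalue
yH = XH*w;  yP = XP*w;                     % discriminating scalar outputs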

7.6 Solving Least Squares Using the QR Decomposition

In this section, we employ the QR decomposition to solve the LS problem


for the full rank case. We have A ∈ Rm×n , b ∈ Rm , m > n, rank(A) = n,
and we wish to solve:
x_{LS} = \arg\min_x ||Ax - b||_2^2 .

Let the QR decomposition of A be expressed as

Q^T A = R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix},     (7.61)

where Q is m × m orthonormal, R_1 \in R^{n\times n} is upper triangular, and the zero block has m − n rows. We partition Q as

Q = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix},

where Q_1 \in R^{m\times n} and Q_2 \in R^{m\times(m-n)}.

From our previous discussion, and from the structure of the QR decompo-
sition A = QR, we note that Q1 is an orthonormal basis for R(A), and Q2
is an orthonormal basis for R(A)⊥ . We now define the quantities c and d
as

Q^T b = \begin{bmatrix} Q_1^T \\ Q_2^T \end{bmatrix} b = \begin{bmatrix} c \\ d \end{bmatrix},     (7.62)

where c \in R^n and d \in R^{m-n}.

Then, we may write:


\min_x ||Ax - b||_2^2 = \left\| Q^T Ax - Q^T b \right\|_2^2
                     = \left\| \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x - \begin{bmatrix} c \\ d \end{bmatrix} \right\|_2^2 .     (7.63)

It is clear that x does not affect the “lower half” of the above equation. Eq.
(7.63) may be written as

||Ax − b||22 = ||R1 x − c||22 + ||d||22 .

Because A is full rank, R1 is invertible, and the above is minimum when

x_{LS} = R_1^{-1} c.

The LS residual ρLS is given directly as

ρLS = ||d||2 . (7.64)

Thus the LS problem is solved. Note that if a Gram-Schmidt procedure


is used to compute the QR decomposition on A, then there is not enough
information to represent the “lower half” in (7.63). This is because this
procedure only gives the partition Q1 of Q, and thus d and the quantity ρLS
cannot be computed; however, the solution x_{LS} = R_1^{-1} c is still achievable.
In contrast, the Householder or Givens procedure yields a complete m × m

 
orthonormal matrix Q = [Q1 Q2 ], allowing a complete solution to the
LS problem.

The use of the QR decomposition in solving the LS problem leads to a useful


interpretation. We define
b = b1 + b2
where b1 ∈ R(A) and b2 ∈ R(A)⊥ . The projectors onto these two subspaces
are Q1 QT1 and Q2 QT2 respectively. Therefore b2 = Q2 QT2 b. Substituting
(7.62) for QT2 b, we have b2 = Q2 d. Taking 2-norms of this last expression
yields ||b2 ||22 = dT QT2 Q2 d = ||d||22 . Thus from (7.64), ρLS is the norm of
the projection of b onto R(A)⊥ , which makes intuitive sense.
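The full–QR solution just described might be implemented as in the following sketch (arbitrary data):

% Sketch: solving the LS problem via the full QR decomposition (7.61)-(7.64).
m = 30; n = 5;
A = randn(m,n); b = randn(m,1);

[Q,R] = qr(A);                             % full m-by-m orthonormal Q
R1 = R(1:n,1:n);
c  = Q(:,1:n).'*b;                         % c = Q1'*b
d  = Q(:,n+1:m).'*b;                       % d = Q2'*b

xLS   = R1\c;                              % back substitution
rhoLS = norm(d);                           % LS residual norm, per (7.64)
check = norm(A*xLS - b)                    % equals rhoLS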

The use of the QR decomposition in the case when A is rank deficient is discussed in Ch. 8.

7.7 A Short Section on Adaptive Filters

7.8 Problems

1. Here we consider an unsupervised clustering problem. The file A4Q3 2019 contains a matrix Xn which contains 100 independent measure-
ments of 5 different variables. Find a subspace so that these data
cluster well into 2 distinct regions (i.e., classes). Explain how you
obtained this subspace. Plot each data point in this 2–dimensional
subspace to illustrate the resulting clustering behaviour. To which
class do the additional data samples xtst1 and xtst2 in the file belong?
Explain your methodology carefully. Hint: Do NOT use a clustering
algorithm such as k-means.

2. On the website you will find a file A4Q5.mat which contains 2 variables
A and B. Each column bi of B is generated according to bi = Axo +
ni , i = 1, . . . m, where m in this case is 1000. For this problem, the
ni are coloured. Write a matlab program to estimate the xLS (i), i =
1, . . . m so that the estimates have the minimum possible variance.
Also jointly estimate the noise covariance matrix Σ.
Hint: This will require an iterative procedure as follows:

(a) Initialize the iteration index k to zero, and the noise covariance
matrix estimate Σ̂o to some value (e.g., I).
(b) Using the current Σ̂k , use the appropriate form of normal equa-
tions to calculate xLS (i), i = 1, . . . , m.
(c) For a more stable estimate, calculate the mean x̄ over all the LS
estimates.
(d) The noise vectors ni can then be estimated using x̄, from which
an updated Σ̂k can be determined.
(e) Increment k and go to (b) until convergence.

Also calculate the initial cov(xLS ) which assumes white noise, and also
the final covariance estimate obtained after convergence. Comment on
the differences.

3. Consider the over–determined set of equations

b = Ax + n (7.65)

where cov(n) = σ 2 I. Suppose in a given experiment, we had control


over the values of the matrix A, so that ||ai ||22 = k, i = 1, . . . , n,

where k is an arbitrary constant > 0, and ai is the ith column of A.
Explain how to choose A so that the variance of each element of the
LS estimate xLS of x is minimum.
Hint: use the Hadamard inequality: For a positive definite square
symmetric matrix X ∈ Rn×n ,
\det(X) \leq \prod_{i=1}^{n} x_{ii},     (7.66)
with equality iff X is diagonal.

4. (a) Given A ∈ Rm×n , and a set xi ∈ Rm , y i ∈ Rn , i = 1, . . . , k, find


a set of coefficients ai so that

\left\| A - \sum_{i=1}^{k} a_i x_i y_i^T \right\|_F^2     (7.67)

is minimized.
(b) What are the set xi , y i that minimize the minimum in (7.67)?
(c) What constraint is there on k so that the solution is unique?
5. Let A ∈ Rm×n , and b ∈ Rn . Find x so that ||A−xbT ||F is minimized.
Hint: For matrices A and B of compatible dimension, trace(AB) =
trace(BA).
6. Here we look at evaluating the spectrum of the vocal tract for a speech
signal. Using a least–squares approach, determine the frequency re-
sponse H(z) of the vocal tract used to generate the speech sample
found in file SPF2.mat. Use samples 5600:6200 from the speech signal
for your analysis. Experiment with prediction orders between 8 and
12.
7. Assuming the noise is Gaussian and the assumptions A1 and A2 of
Sect. 7.3 hold, calculate a 95% confidence interval for a specified
element of xLS .
8. Given that we are provided a table of values for ti and corresponding
yi in the following equation, devise a method for determining the time
constant τ and the scalar value a:
y_i = a \exp\left(-\frac{t_i}{\tau}\right)
where a is a real constant.

9. With regard to the common spatial patterns method, assume RH and
RP are given respectively by

R_H = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \qquad R_P = \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}.

What is the optimum weight vector w in this case? What is the ratio
of the variance of yH (t) to that of yP (t) for this choice of w?

10. Consider a signal x(t) given by


x(t) = \sum_{i=1}^{n} a_i g(t - \tau_i) + \epsilon(t),
i=1

where g(t) ∈ Rm is a basic Gaussian pulse of length m samples, simi-


lar to that shown in Fig. 2.12, whose peak value is unity and delay is
zero. The quantities ai and τi are the respective amplitudes and delays
of the pulse components comprising x(t). It is possible to estimate
the τi using a variation of the MUSIC algorithm. However, it is also
possible to accomplish this task using an LS procedure, by develop-
ing an estimator which is a function of the τi only. In this vein, we
eliminate the ai by treating the above as a regression equation for an
assumed set of τi –values, and substitute the LS estimate of the ai ’s for
the actual values. The number of components n is assumed known.

(a) Develop this estimator and explain how to estimate the delays τi
when the noise ε(t) is white.
(b) As above, when the noise has an arbitrary covariance Σ.
(c) Once the τi have been estimated, explain how to estimate the ai .

Chapter 8

The Rank Deficient Least Squares Problem

In the previous chapters we considered only the case where A is full rank and
tall. In practical problems, this may not always be the case. Here we present
the pseudo–inverse as an effective means of solving the LS problem in the
rank deficient case when the rank r is known. We also show that the pseudo–
inverse (aka principal component analysis in this context) is also effective
in the near rank deficient case at controlling the large variances of the LS
solution that occur in this situation. Then we discuss an alternative method
of solving the rank–deficient LS problem using the QR decomposition. We
show that the pseudo–inverse is a generalized approach for solving any type
of linear system of equations under specified conditions.

8.1 The Pseudo–Inverse

Previously, we have seen that the LS problem determines the xLS which
solves the minimization problem given by
xLS = arg min ||Ax − b||22 (8.1)
x
where the observation b is generated from the regression model b = Axo +n.
The solution xLS is the one which gives us the best fit between the linear

model Ax and the observations b. For the case where A is full rank we saw
that the solution xLS which solves (8.1) is given by the normal equations
AT Ax = AT b. (8.2)

We have seen previously in Sect. 7.4, that even one small eigenvalue of the
matrix AT A destroys the desirable variance properties of the LS estimate
and introduces the potential for all elements of xLS to have large variance.
One small eigenvalue of AT A implies the matrix is poorly conditioned and
close to rank deficiency. The pseudo–inverse is a means of remedying this
adverse situation and can be very effective in reducing the error in the LS
solution.

Further, if the matrix A (and consequently AT A) is actually rank defi-


cient (instead of being close to rank deficient), then a unique solution to the
normal equations does not exist. There are an infinity of solutions which
minimize (8.1) with respect to x. However, we can generate a unique so-
lution if, amongst the set of x satisfying (8.2), we choose that value of x
which itself has minimum norm. The pseudo–inverse fulfills this goal. In this
case, the pseudo–inverse solution xLS is the result of two 2–norm minimiz-
ing procedures – the first determines a set {x} which minimizes ||Ax − b||22 ,
and the second determines xLS as that element of {x} for which ||x||2 is
minimum.

We are given A ∈ Rm×n , m > n, and rank(A) = r ≤ n. If the SVD of A is


given as U \Sigma V^T, then the pseudo-inverse A^+ of A is defined by

A^+ = V \Sigma^+ U^T. \qquad (8.3)

The matrix \Sigma^+ is related to \Sigma in the following way. If

\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r, 0, \ldots, 0)

then

\Sigma^+ = \mathrm{diag}(\sigma_1^{-1}, \sigma_2^{-1}, \ldots, \sigma_r^{-1}, 0, \ldots, 0), \qquad (8.4)

where \Sigma and \Sigma^+ are padded with zeros in an appropriate manner to maintain dimensional consistency.
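As a concrete illustration of (8.3) and (8.4), the following NumPy sketch forms A^+ by inverting only the r non-zero singular values. The function name, the test matrix and the tolerance used to estimate r when it is not supplied are illustrative choices, not part of the development above; the result coincides with numpy's built-in pinv.

\begin{verbatim}
import numpy as np

def pseudo_inverse(A, r=None, tol=1e-12):
    """Form A+ = V Sigma+ U^T as in (8.3)-(8.4), inverting only the r non-zero
    singular values.  If r is not given, it is estimated from a tolerance."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    if r is None:
        r = int(np.sum(s > tol * s[0]))
    s_plus = np.zeros_like(s)
    s_plus[:r] = 1.0 / s[:r]
    return Vt.T @ np.diag(s_plus) @ U.T      # V Sigma+ U^T

# Example: a 5 x 3 matrix of rank 2 (third column = sum of the first two)
A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 0., 1.],
              [0., 1., 1.],
              [3., 2., 5.]])
print(np.allclose(pseudo_inverse(A, r=2), np.linalg.pinv(A)))   # True
\end{verbatim}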

Theorem 8 When A is rank deficient, the unique solution xLS minimizing


(8.1) such that ||x||2 is minimum is given by
xLS = A+ b (8.5)

where A^+ is defined by (8.3). Further, the squared norm \rho_{LS}^2 of the LS residual r_{LS} is given as

\rho_{LS}^2 = \sum_{i=r+1}^{m} (u_i^T b)^2. \qquad (8.6)

Proof: For any x \in R^n we have

\|Ax - b\|_2^2 = \left\| U^T A V (V^T x) - U^T b \right\|_2^2 \qquad (8.7)

= \left\| \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \right\|_2^2 \qquad (8.8)

where

w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x = V^T x, \qquad w_1 \in R^r, \; w_2 \in R^{n-r}, \qquad (8.9)

and

\begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix} b = U^T b, \qquad c_1 \in R^r, \; c_2 \in R^{m-r}, \qquad (8.10)

and

\Sigma_r = \mathrm{diag}[\sigma_1, \ldots, \sigma_r].
Note that we can write the quantity ||Ax − b||22 in the form of (8.7), since
the 2-norm is invariant to the orthonormal transformation U T , and the
quantity V V T which is inserted between A and x is identical to I.

From (8.8) we can make several immediate conclusions, as follows:

1. Because of the zero blocks in the right-hand block column of the matrix in (8.8), the value of the objective is independent of w_2. Therefore, w_2 is arbitrary.

2. Note that for any partitioned vector y = [y_1 \; y_2]^T, \|y\|_2^2 = \|y_1\|_2^2 + \|y_2\|_2^2. Since the argument of the norm in (8.8) is such a partitioned vector, it may therefore be expressed as

\|Ax - b\|_2^2 = \|\Sigma_r w_1 - c_1\|_2^2 + \|c_2\|_2^2. \qquad (8.11)

Therefore, (8.8) is minimized by choosing w1 to satisfy

Σr w1 = c1 .

Note that this fact is immediately apparent, without any differenti-


ations as was the case when deriving the normal equations. This is
because the SVD reveals so much about the structure of the underlying
problem.

3. From (8.9) we have x = V w. Therefore,

\|x\|_2^2 = \|w\|_2^2 = \|w_1\|_2^2 + \|w_2\|_2^2.

Clearly \|x\|_2^2 is minimum when w_2 = 0.

4. We therefore take w_1 = \Sigma_r^{-1} c_1 and w_2 = 0, where the inverse exists because \Sigma_r consists only of the non-zero singular values. Combining our definitions for w_1 and w_2 together, we have

w = \begin{bmatrix} w_1 \\ 0 \end{bmatrix} = \begin{bmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{bmatrix} c = \Sigma^+ c \qquad (8.12)

Using (8.9) and (8.10), this can be written as

V^T x_{LS} = \Sigma^+ U^T b

or

x_{LS} = V \Sigma^+ U^T b = A^+ b \qquad (8.13)

which was to be shown. Furthermore, we can say from (8.11) that

\rho_{LS}^2 = \|c_2\|_2^2 = \left\| [u_{r+1}, \ldots, u_m]^T b \right\|_2^2 = \sum_{i=r+1}^{m} (u_i^T b)^2.

Note that A+ is always defined even if A is singular.

8.2 Interpretation of the Pseudo-Inverse

8.2.1 Geometrical Interpretation

In the pseudo–inverse case, xLS is again the solution which corresponds to


projecting b onto R(A). Substituting (8.13) into the expression AxLS , we
get
AxLS = AA+ b (8.14)

But, for the specific case where m > n, we know from our previous discussion
on linear least squares, that

AxLS = P b (8.15)

where P is the projector onto R(A). Comparing (8.14) and (8.15), and
noting the projector is unique, we have

P = AA+ . (8.16)

Thus, the matrix AA+ is a projector onto R(A).

This may also be seen in a different way as follows: Using the definition of
A+ , we have

AA^+ = U \Sigma V^T V \Sigma^+ U^T = U \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} U^T = U_r U_r^T \qquad (8.17)

where I r is the r × r identity and U r = [u1 , . . . , ur ]. In the above, care


must be taken to ensure the zero blocks are arranged appropriately. From
our discussion on projectors, we know U r U Tr is also a projector onto R(A)
which is the same as the column space of A.

We also note that it is just as easy to show that for the case m < n, the
matrix A+ A is a projector onto the row space of A.
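The projector interpretation in (8.16) and (8.17) is easy to verify numerically. The short sketch below is an illustration only, reusing the rank-2 test matrix from the previous sketch: it checks that AA^+ is symmetric, idempotent, equal to U_r U_r^T, and leaves the columns of A unchanged.

\begin{verbatim}
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 0., 1.],
              [0., 1., 1.],
              [3., 2., 5.]])
P = A @ np.linalg.pinv(A)              # AA+, eq. (8.16)

U, s, _ = np.linalg.svd(A)
r = int(np.sum(s > 1e-12 * s[0]))      # numerical rank (here r = 2)
Ur = U[:, :r]

print(np.allclose(P, P @ P))           # idempotent
print(np.allclose(P, P.T))             # symmetric
print(np.allclose(P, Ur @ Ur.T))       # equals U_r U_r^T, eq. (8.17)
print(np.allclose(P @ A, A))           # projects onto R(A): columns of A unchanged
\end{verbatim}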

8.2.2 Relationship of the Pseudo-Inverse Solution to the Nor-
mal Equations

Suppose A \in R^{m \times n}, m > n, and rank(A) = n (full rank). The normal equations give us

x_{LS} = (A^T A)^{-1} A^T b,

but the pseudo-inverse gives

x_{LS} = A^+ b.

In the full-rank case, these two quantities must be equal. We can indeed show this is the case, as follows. We let

A^T A = V \Sigma^2 V^T

be the ED of A^T A, and we let the SVD of A^T be defined as

A^T = V \Sigma U^T.

Using these relations, we have

(A^T A)^{-1} A^T = (V \Sigma^{-2} V^T) V \Sigma U^T = V \Sigma^{-1} U^T = A^+ \qquad (8.18)

as desired, where the last line follows from the definition (8.3). Thus, for the full-rank case with m > n, A^+ = (A^T A)^{-1} A^T. In a similar way, we can also show
that A^+ = A^T (A A^T)^{-1} for the case m < n.

8.2.3 The Pseudo–Inverse as a Generalized Linear System


Solver

It is generally understood that there is no solution to an over–determined


system of equations (except when b ∈ R(A)). Likewise, the solution is
not unique when the system is under–determined, or A is rank deficient.
However, if we are willing to accept the least–squares solution in the over–
determined case, and accept the definition of a unique solution as that for
which ||x||2 is minimum, then x = A+ b solves the system Ax = b under
any conditions.

8.3 Principal Component Analysis (PCA)

We have already discussed Principal Component Analysis in Chapter 2 in


the context of data compression and noise suppression. PCA is a commonly
used technique in the statistical and signal processing communities [19]. We
introduce the PCA concept through the pseudo–inverse, which is a concept
not normally associated with PCA. Later in the following chapter we show
that the pseudo–inverse and the regular PCA methods are identical, even
though their derivations are quite different.

We have seen previously in Ch. 7 that the covariance matrix cov(xLS ) of


the estimates xLS obtained by the ordinary normal equations in the white
noise case is given by the expression

cov(xLS ) = σ 2 (AT A)−1 .

If AT A is poorly conditioned, then it has at least one small eigenvalue, and


is therefore close to rank deficiency. We have seen in this case that the
variances of xLS become large. In this section, we show that use of the
pseudo–inverse instead of the ordinary normal equations is very effective
in restoring the variances to reasonable values, even though A may not be
completely rank deficient.

We denote the principal component LS solution as xP C . In solving for xP C ,


we replace the ordinary normal equation solution with the pseudo–inverse
solution, even though the smaller singular values are not exactly zero. That
is, if Σ in the SVD of A is given as Σ = diag[σ1 , σ2 , . . . , σr , σr+1 . . . , σn ],
then xP C in this case is given as

x_{PC} = V \Sigma^+ U^T b, \qquad (8.19)

where

\Sigma^+ = \mathrm{diag}\!\left(\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0\right),

where σr+1 . . . , σn are assumed small enough to cause trouble and are there-
fore truncated. In practice, the value of r is usually determined empirically
by trial–and–error methods, cross–validation, or through the use of some
form of prior knowledge.

The only difficulty with this principal component approach is that it introduces a bias in x_{PC}, whereas we have seen previously that the ordinary normal-equation solution x_{LS} is unbiased. To see this bias, we let the singular value decomposition of A be expressed as A = U \Sigma V^T, and write

x_{PC} = A^+ b = V \Sigma_r^+ U^T (A x_o + n). \qquad (8.20)

Thus, because the noise has zero mean, the expected value of x_{PC} may be expressed as

E(x_{PC}) = V \Sigma_r^+ U^T (A x_o) \qquad (8.21)
          = V \Sigma_r^+ U^T (U \Sigma V^T x_o)
          = V \Sigma_r^+ \Sigma V^T x_o
          = V \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} V^T x_o \qquad (8.22)
          \neq x_o,

and hence xP C obtained from the pseudo-inverse is biased.

However, we now look at the covariance matrix of xP C . Similar to the


treatment in Sect. 10, we have

\mathrm{cov}(x_{PC}) = E\left[(x_{PC} - E(x_{PC}))(x_{PC} - E(x_{PC}))^T\right] \qquad (8.23)

Substituting (8.21) for E(x_{PC}), using (8.20) for x_{PC}, and assuming that E(nn^T) = \sigma^2 I, we get

\mathrm{cov}(x_{PC}) = E(V \Sigma_r^+ U^T n n^T U \Sigma_r^+ V^T)
                     = \sigma^2 V \Sigma_r^+ U^T I U \Sigma_r^+ V^T
                     = \sigma^2 V (\Sigma_r^+)^2 V^T. \qquad (8.24)

This expression for covariance is similar to that for xLS , except that it
excludes the inverses of the smallest singular values and the corresponding
directions which have large variation. Thus, the elements of cov(xP C ) can
be significantly smaller than those for xLS , as desired.

Thus, we see that principal component analysis (PCA) is a tradeoff between
reduced variance on the one hand, and increased bias on the other. The
objective of any estimation problem is to reduce the overall error, which
is a combination of both bias and variance, to a minimum. In fact, it is
readily verified that the mean–squared error E(x̂ − xo )2 of an estimate x̂ of
a quantity whose true value is xo is given by
E(x̂ − xo )2 = b2 + σx2 ,
where b is the bias and σx2 is the variance of the estimate. If A is poorly
enough conditioned, then the improvement in the variance of xP C over that
of xLS is large, and the bias introduced is small, so the overall effect of PCA
is positive. However, as A becomes better conditioned, then the two effects
tend to balance each other off, and the technique becomes less favourable.

The choice of the parameter r controls the tradeoff between bias and vari-
ance. The smaller the value of r, the fewer the number of components in
A+ ; hence, the lower the variance and the higher the bias.

Simulation Example: We show by simulation how the pseudo-inverse


solution xP C can improve the variances of the estimates. A 5 × 3 matrix A
and a vector xo were chosen as shown below. Note that the third column of
A is almost equal to the average of the first two columns, which will make
the matrix poorly conditioned. The singular values of A are 17.1830, 1.0040,
and 0.0142. 500 observations b were generated using the regression equation
b = Axo + n, where in each observation, n is an independently-distributed
Gaussian random vector with zero mean and covariance σ 2 I.
 
A = \begin{bmatrix} 2.0000 & 1.0000 & 1.4738 \\ 4.0000 & 1.0000 & 2.4913 \\ 6.0000 & 1.0000 & 3.5069 \\ 8.0000 & 1.0000 & 4.5716 \\ 10.0000 & 1.0000 & 5.5554 \end{bmatrix}, \qquad x_o = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
Fig. 8.1 shows the scatter-plot of the first and second elements of the esti-
mates xLS (red) and xP C (blue), obtained from each of the 500 observations
of b. In this case, we see a dramatic contraction of the scatter diagram for
xP C compared to that for xLS , indicating that the variances have drastically
reduced. To illustrate the point further, the covariance matrices in each case
were calculated using

\mathrm{cov}(x) = \frac{1}{500} \sum_{i=1}^{500} (x_i - \bar{x})(x_i - \bar{x})^T,

where \bar{x} denotes the sample mean of the 500 estimates.

Figure 8.1. Scatter plots for the simulation example using both the normal equation solu-
tion and the principal component solution when A is poorly conditioned. The shrinkage
in the variation for the PC case is strongly evident.

The result for the normal equation case is

\mathrm{cov}(x_{LS}) = \begin{bmatrix} 2.0915 & 1.8141 & -4.0824 \\ 1.8141 & 1.5880 & -3.5445 \\ -4.0824 & -3.5445 & 7.9695 \end{bmatrix},

whereas that for the pseudo–inverse case is

\mathrm{cov}(x_{PC}) = \begin{bmatrix} 0.0011 & -0.0026 & -0.0012 \\ -0.0026 & 0.0093 & 0.0023 \\ -0.0012 & 0.0023 & 0.0016 \end{bmatrix}.

The means from the simulation for x_{LS} and x_{PC} are given respectively by

\begin{bmatrix} 1.0126 \\ 1.0102 \\ 0.9757 \end{bmatrix} \quad \mathrm{and} \quad \begin{bmatrix} 1.0145 \\ 1.0119 \\ 0.9720 \end{bmatrix},
which are both close to the true values, as expected. Thus, we see that the
pseudo-inverse technique has significantly improved the variance in this case
when A is poorly conditioned. We also see that the error in the ordinary
LS estimate of the means is approximately equivalent to that of the PC
estimate. Thus it appears that in this example, the bias in the PC estimate
may be considered negligible, especially in view of the significant reduction
in variance.
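A NumPy sketch along the lines of this simulation is given below. The noise level σ is not stated in the text and is set here to an assumed value of 0.1, and the truncation rank is taken as r = 2 (an assumption consistent with the quoted singular values), so the covariance entries produced will not match the numbers above exactly; the qualitative contrast between cov(x_LS) and cov(x_PC) is what the sketch is meant to reproduce.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[ 2.0, 1.0, 1.4738],
              [ 4.0, 1.0, 2.4913],
              [ 6.0, 1.0, 3.5069],
              [ 8.0, 1.0, 4.5716],
              [10.0, 1.0, 5.5554]])
x_o = np.ones(3)
sigma, n_trials, r = 0.1, 500, 2          # sigma and r are assumed values

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_ls = np.zeros((n_trials, 3))
x_pc = np.zeros((n_trials, 3))
for i in range(n_trials):
    b = A @ x_o + sigma * rng.standard_normal(5)
    x_ls[i] = np.linalg.solve(A.T @ A, A.T @ b)     # ordinary normal equations
    c = U.T @ b
    x_pc[i] = Vt[:r].T @ (c[:r] / s[:r])            # x_PC = V_r Sigma_r^{-1} U_r^T b

print(np.cov(x_ls, rowvar=False))   # large entries: A is poorly conditioned
print(np.cov(x_pc, rowvar=False))   # much smaller entries
print(x_ls.mean(axis=0), x_pc.mean(axis=0))
\end{verbatim}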

8.4 The Rank–deficient QR Method

8.4.1 Computation of the Rank-Deficient QR Decomposition

Before investigating the use of the QR decomposition in the rank–deficient


LS problem, we must first examine the structure of the QR decomposition
when the matrix A is rank deficient. If A ∈ Rm×n , m > n, rank(A) = r < n,
then for the QR decomposition to be of value in solving the LS problem, it
is important that the relation

R(A) = span [q 1 , . . . , q r ] (8.25)

always holds. Only in this case can Qr = [q 1 , . . . , q r ] act as an orthonormal


basis for R(A).

We construct an example to show this is not always true. Suppose the rank
2 matrix A is defined as follows:

A = \begin{bmatrix} -0.4437 & 0.1500 & -0.4119 \\ 0.4836 & -0.1635 & -1.5977 \\ 0.6345 & -0.2146 & 0.4580 \\ -0.2555 & 0.0864 & 0.5244 \end{bmatrix}.

Then the QR decomposition of A degenerates as follows:

A = QR

= \begin{bmatrix} -0.468 & 0.849 & 0.047 & 0.239 \\ 0.510 & 0.237 & -0.767 & 0.308 \\ 0.669 & 0.253 & 0.638 & 0.286 \\ -0.269 & -0.398 & 0.048 & 0.875 \end{bmatrix} \begin{bmatrix} 0.948 & -0.321 & -0.457 \\ 0 & 0 & -0.822 \\ 0 & 0 & 1.524 \\ 0 & 0 & 0 \end{bmatrix}. \qquad (8.26)

Because of the zero in the R(2, 2) position, we see that R(A) \neq span[q_1, q_2]
as desired. Further, this QR decomposition is of no value in solving the LS
problem, because R is not full rank. The problem in (8.26) is that there are
no r columns (in this case 2 columns) of Q that can act as an orthonormal
basis for R(A).

We now show that column–permutation matrices Π can be applied at every


stage so that the QR decomposition on A may be expressed in the form

Q^T A \Pi = \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix}, \qquad (8.27)

where R_{11} \in R^{r \times r} is upper triangular and non-singular, R_{12} \in R^{r \times (n-r)} is a rectangular matrix, and the zero blocks have m - r rows. In this case it is clear that the rank-deficient QR decomposition in the form of (8.27) indeed satisfies (8.25), where R(A) = \mathrm{span}[q_1, \ldots, q_r]. The permutation matrix \Pi is determined in such a way that at each stage i, i = 1, \ldots, r, the diagonal elements r_{ii} of R_{11} are as large in magnitude as possible, thus avoiding the degenerate form of (8.26). But what is the procedure to determine \Pi?

To answer this, consider the ith stage, i < r of the QR decomposition with
column pivoting. Here, the first i columns have been annihilated below
the main diagonal by an appropriate QR decomposition procedure, such as

Householder. There exist an orthonormal Q and a permutation matrix Π
so that

Q^T A \Pi = \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}, \qquad (8.28)

with R_{11} of dimension i \times i and R_{22} of dimension (m-i) \times (n-i),

where R11 is upper triangular.

The (i + 1)th stage of the decomposition proceeds by first post-multiplying


both sides of (8.28) by a permutation matrix Π(i+1) (to swap the desired
column into the leading position of R22 , as discussed shortly), and then pre-
multiplying both sides by an orthonormal matrix Q(i+1) such that the first
column r 22 (1) of the R22 partition is annihilated below the first element.
The superscript (·)(i+1) on both Q and Π denotes the specific matrix at stage
i + 1. The Q in (8.28) at the (i + 1)th stage is given by Q(i+1) . . . Q(2) Q(1)
and the Π in (8.28) is Π(1) Π(2) . . . Π(i+1) .

Since at the (i + 1)th stage we wish to eliminate only the elements i + 1 : n


of r 22 (1), then according to the selective elimination procedure discussed in
Sect. 8.3.3 of the Ch.5 notes, the orthonormal matrix Q(i+1) to execute the
(i + 1)th stage has the form

Q^{(i+1)} = \begin{bmatrix} I & 0 \\ 0 & \tilde{Q} \end{bmatrix}, \qquad (8.29)

in which the identity block is i \times i and \tilde{Q} is (m-i) \times (m-i),

where the Q̃ above is the matrix which eliminates the desired elements
of r 22 (1). Since Q̃ is orthonormal, the element r(i + 1, i + 1) in the top
left position of R22 after the multiplication is complete is therefore equal to
||r 22 (1)||2 . It is then clear that to place the elements with the largest possible
magnitudes along the diagonal of R, we must choose the permutation matrix
Π(i+1) at the (i + 1)th stage so that the column of R22 in (8.28) with
maximum 2-norm is swapped into the lead column position of R22 . This
procedure ensures that the resulting QR decomposition will have the form
of (8.27) as desired. Effectively, this procedure ensures that no zeros are
introduced along the diagonal of R until after stage r.
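The column-pivoting strategy described above is available, for example, through scipy.linalg.qr with the pivoting option. The sketch below is illustrative only: it builds a small rank-2 matrix whose first two columns are parallel, so that plain QR produces the degenerate form of (8.26), and then shows that the pivoted factorization yields the form (8.27) with span[q_1, q_2] = R(A).

\begin{verbatim}
import numpy as np
from scipy.linalg import qr

c1 = np.array([1.0, 2.0, -1.0, 0.5])
c3 = np.array([0.0, 1.0,  1.0, 2.0])
A = np.column_stack([c1, -0.5 * c1, c3])   # rank 2; second column parallel to first

Q0, R0 = qr(A)
print(np.round(R0, 4))        # R(2,2) is (numerically) zero but R(3,3) is not

Q, R, piv = qr(A, pivoting=True)           # A[:, piv] = Q R
print(np.round(R, 4), piv)    # now the first r = 2 diagonal entries are non-zero

r = 2
P = Q[:, :r] @ Q[:, :r].T                  # projector onto span[q_1, q_2]
print(np.allclose(P @ A, A))               # True: span[q_1, q_2] = R(A), as in (8.25)
\end{verbatim}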

We can write (8.28) at the completion of the ith stage in the form

\begin{bmatrix} \tilde{A}_1 & \tilde{A}_2 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}, \qquad (8.30)

where \tilde{A}_1 and Q_1 have i columns, \tilde{A}_2 has n - i columns, and Q_2 has m - i columns.

Here the tilde over the A–blocks indicates that the columns of A have been permuted as prescribed by \Pi^{(i)}. From our previous discussions, Q_1 is an orthonormal basis for R(\tilde{A}_1) and Q_2 is an orthonormal basis for R(\tilde{A}_1)^\perp. It follows directly from the block multiplication in (8.30) that the elements of the column r_{22}(k) of R_{22} are the coefficients of the column \tilde{a}_2(k) in the basis Q_2. Thus, the column r_{22}(1) which is annihilated after the permutation step at the (i + 1)th stage corresponds to the column of \tilde{A}_2 which has the largest component in R(\tilde{A}_1)^\perp.

After the (i + 1)th stage of the decomposition is complete, Π(i+1) post–


multiplies the previously accumulated product Π, and Q(i+1) pre–multiplies
its previous accumulated product. After r stages, the QR decomposition in
the form of (8.27) is complete. Given that the QR decomposition now has
the correct structure, we solve:

8.5 The Rank-Deficient LS Problem with QR:

Given A \in R^{m \times n}, m > n, rank(A) = r < n, b \in R^m, then

\|Ax - b\|_2^2 = \left\| (Q^T A \Pi)\, \Pi^T x - Q^T b \right\|_2^2. \qquad (8.31)

Note that in the rank deficient case, there is no unique solution for (8.31).
Hence, unless an extra constraint is imposed on x, the LS solution obtained
by a particular algorithm can wander throughout the set of possible solu-
tions, and very large variances can result. As in the pseudo–inverse case,
the constraint of minimum norm is a convenient one to apply in this case, in
order to specify a unique solution. However, unlike the development of the
pseudo-inverse solution, we will see that the direct use of the QR decompo-
sition does not lead directly to the minimum norm solution xLS . However,
it is still possible to derive an elegant solution to the LS problem using only
the QR decomposition procedure. We now discuss how this is achieved.

Let

\Pi^T x = \begin{bmatrix} y \\ z \end{bmatrix}, \qquad y \in R^r, \; z \in R^{n-r}. \qquad (8.32)

Substituting (8.27), (8.32) and (8.10) into (8.31) we have

\|Ax - b\|_2^2 = \left\| \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix} \begin{bmatrix} y \\ z \end{bmatrix} - \begin{bmatrix} c \\ d \end{bmatrix} \right\|_2^2.

The minimum residual, of norm \|d\|_2, is obtained when

R_{11} y + R_{12} z = c.

By solving for y above, we have \Pi^T x = [y, z]^T, which may be written in the form

\Pi^T x = \begin{bmatrix} R_{11}^{-1}(c - R_{12} z) \\ z \end{bmatrix}. \qquad (8.33)
We see from (8.33) that the vector z is arbitrary. It may seem that a
reasonable approach to determine the xLS with minimum norm is to choose
z = 0. However, it is shown [1] that this is not the case unless R12 = 0.

We therefore seek a more efficient means of determining xLS in this case.


We note that the desired solution xLS to (8.31) is that solution which min-
imizes ||Ax − b||2 with respect to x and simultaneously minimizes ||x||2 .
Hence, this solution would possess the same properties as the pseudo–inverse
solution, and because of uniqueness, this solution would be identical to the
pseudo–inverse solution.

To address this goal, we consider the complete orthogonal decomposition (COD). Consider the matrix decomposition resulting from (8.27):

Q^T A \Pi = \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix}.

The idea is to eliminate R_{12}; then finding the x_{LS} with minimum norm is straightforward. There exists an orthonormal Z \in R^{n \times n} such that

\begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix} Z = \begin{bmatrix} T_{11} & 0 \\ 0 & 0 \end{bmatrix} \qquad (8.34)

where T_{11} is nonsingular and upper triangular of dimension r \times r. Therefore,

Q^T A \Pi Z = \begin{bmatrix} T_{11} & 0 \\ 0 & 0 \end{bmatrix} \qquad (8.35)

Eq. (8.35) is called the complete orthogonal decomposition of the matrix A.

The fact that an orthonormal matrix Z can exist may be understood by taking the transpose of both sides of (8.34). Then (8.34) becomes an ordinary QR decomposition on \begin{bmatrix} R_{11}^T \\ R_{12}^T \end{bmatrix}, with the exception that the result is T_{11}^T, which is lower triangular instead of upper triangular, as expected. However, it is easy to modify the ordinary QR decomposition procedure to yield a lower instead of an upper triangular matrix.

Now solving the LS problem is straightforward:

\|Ax - b\|_2^2 = \left\| (Q^T A \Pi Z)(Z^T \Pi^T x) - Q^T b \right\|_2^2

= \left\| \begin{bmatrix} T_{11} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} w \\ y \end{bmatrix} - \begin{bmatrix} c \\ d \end{bmatrix} \right\|_2^2 \qquad (8.36)

where

Z^T \Pi^T x = \begin{bmatrix} w \\ y \end{bmatrix}, \qquad w \in R^r, \; y \in R^{n-r},

and c, d are defined in (7.62) as before. Clearly, y is arbitrary, and d is independent of both w and y. We can write (8.36) in the form

\|Ax - b\|_2^2 = \|T_{11} w - c\|_2^2 + \|d\|_2^2,

which is minimum when w = T_{11}^{-1} c. We also have

x_{LS} = \Pi Z \begin{bmatrix} w \\ y \end{bmatrix}, \qquad (8.37)

which clearly has minimum norm when y = 0.

The xLS calculated in this way is identical to the pseudo–inverse solution.


However, the computational cost with the COD is significantly less. The
COD requires only two QR decompositions; the SVD is computed using an
iterative procedure involving one QR decomposition per iteration.
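A sketch of this two-QR procedure is given below, again as an illustration rather than a production implementation: the rank r is assumed known, and in this arrangement T_11 comes out lower triangular (the transposed form discussed above), which is equally convenient for solving T_11 w = c. The result agrees with the pseudo-inverse solution.

\begin{verbatim}
import numpy as np
from scipy.linalg import qr

def min_norm_ls_cod(A, b, r):
    """Minimum-norm LS solution via pivoted QR followed by a second QR (the COD)."""
    m, n = A.shape
    Q, R, piv = qr(A, pivoting=True)        # A[:, piv] = Q R, as in (8.27)
    R1 = R[:r, :]                           # [R11 R12], r x n
    Z1, Tt = qr(R1.T, mode='economic')      # R1^T = Z1 Tt, so R1 Z1 = Tt^T =: T11
    T11 = Tt.T                              # r x r, lower triangular here
    c = (Q.T @ b)[:r]
    w = np.linalg.solve(T11, c)             # T11 w = c, cf. (8.36)
    x = np.zeros(n)
    x[piv] = Z1 @ w                         # x = Pi Z [w; 0]; y = 0 gives minimum norm
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))   # 6 x 4, rank 2
b = rng.standard_normal(6)
print(np.allclose(min_norm_ls_cod(A, b, r=2), np.linalg.pinv(A) @ b))   # True
\end{verbatim}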

8.6 Problems
1. We have seen that the pseudo-inverse is very effective in solving least
squares problems when the X–matrix is poorly conditioned. Another

method of dealing with poorly–conditioned systems is regularization,
which we discuss in Chapter 10. Yet another method is to use principal
component analysis (PCA). With this approach, we replace X ∈ Rm×n
with a rank-r approximation X r , defined as

X Tr = V r C, (8.38)

where V r ∈ Rn×r consists of the principal r eigenvectors of the ma-


trix X T X and C ∈ Rr×m is a matrix of coefficients. This approach
represents the rows of X in terms of its principal components.

(a) Describe how to find the matrix C.


(b) Substituting the definition of C from above into (8.38), use the
respective X r in place of X in the normal equations to give
a closed–form solution for β in the regression equation Y =
Xβ + n. Show that this solution is identical to that given by
the pseudo–inverse solution.

2. Explain how the method of problem 6 of chapter 6 can be used to


recursively solve the LS problem at every time step using fewer FLOPS
than the solution offered by the normal equations. This is the objective
of adaptive filtering.

Chapter 9

Model Building Using Latent


Variable Methods

In this chapter, we discuss various practical issues that arise when solving
real problems using LS methods and in particular, the more recent latent
variable (LV) methods. The primary objective of model building we consider
in this chapter is the prediction of response values ŷ T corresponding to
new values xTN of our observations. We investigate three types of latent
variable methods, which are PCA (revisited), partial least squares (PLS)
and canonical correlation analysis (CCA).

The LV approach is well–developed with a vast literature e.g. [20–24], with


an alternative set of notation to what we have been using thus far. In this
section we adopt this alternative notation. Here we take on a modelling
context, where we have a set of independent input variables X ∈ Rm×n and
a corresponding set of output response values Y ∈ Rm×k . The new variable
X is equivalent to the previous A. In the previous context, we had only one
column b of observation (response) variables, but in the current context b
is replaced by multiple response variables, represented by the m × k matrix
Y . The regression equation is now expressed as

Y = Xβ + E, (9.1)

where β ∈ Rn×k is the new notation for x and E is the error matrix, which is
the same size as Y . Eq. (9.1) is again referred to as the regression equation,

where we adopt the terminology that Y is regressed onto X through (9.1).
The notation we adopt here is standard in the statistical literature, where
latent variable methods are prevalent. On the other hand, the previous
notation of Chapters 7 and 8 is the most commonly used in the algebraic
literature.

Latent variable methods are founded on the idea that X (and often Y )
are expressed in a basis of dimension r, which is typically small relative
to n or k. Doing so alleviates the conditioning problem which, as we have seen previously, results in large variances of the parameter estimates. “Latent” in this context implies “unseen”; the latent variables are in fact basis vectors which are used to represent X and/or Y . In the PCA
case, the latent variables are the principal eigenvectors of X T X; however,
for the partial least squares (PLS) and canonical correlation analysis (CCA)
methods which we discuss later, the latent variables are derived differently.
Since r is small relative to n or k, the latent variable basis is referred to
as being “incomplete”; i.e., X or Y cannot be represented in the LV basis
without error. However, by careful choice of the LV basis, this error can be
controlled while at the same time the error in the prediction of Y values
corresponding to new values of X can be considerably reduced.

The ith row of X is the ith observation of a set of n variables, and the
jth column contains the values of the jth variable over the set of m obser-
vations. For example, in a chemical reactor environment, each row of X
corresponds to a set of controllable (independent) inputs such as tempera-
ture, pressure, flow rates, etc. Each corresponding row of Y represents the
corresponding response values (outputs, or dependent variables) from the
reactor; i.e., output parameters containing concentrations of desired prod-
ucts, etc. Each row represents one of m different settings of the various
inputs and corresponding outputs.

The over–arching objective with the LV methods is to develop a system


model which gives a relationship between the input variables (X) and the
corresponding outputs or responses (Y ). In this vein we assume a set of
X and corresponding Y values are available from the system we wish to
model. In the machine learning context, these data are referred to as the
training set. We train the model (which is equivalent to estimating β from
(9.1) and determining the structure of X) based on the training data. To
evaluate the performance of the system model we have developed, we also

need an accompanying test set, which is an independent set of X and Y
values used solely for evaluating the performance of the model. The model
must be trained using only the training set, and then evaluated using only
the test set. If the test set is included in the training procedure, then the
model will be trained using data it will be tested on, with the result that es-
timated performance is inflated upwards. The evaluation of the performance
of a model is a process which must be undertaken with care, and involves
implementation of a cross–validation process [25, 26]. Cross–validation is not
discussed in this volume, but is well–described in the references.

The objective we consider in this section is to predict or estimate the un-


known response ŷ T corresponding to a new, previously unseen set of input
variables xTN in the form of a new row of X. From the prediction perspec-
tive, predicted values Ŷ of Y are determined by taking E(Y ) in (9.1), giving
Ŷ = Xβ. In the case of a new row xTN , the predicted value of a scalar ŷ is
given as ŷ = xTN β. In this case, it is important to consider the variance of
ŷ. From Sect. 7.3.3, the variance σy2 of a predicted value ŷ corresponding to
new data xTN is given from (7.36) (repeated here for convenience) as
\sigma_y^2 = \sigma^2 x_N^T (X^T X)^{-1} x_N.
Notice that this form, and the expression for the variance of xLS in (7.34),
both involve the term (X T X)−1 . Thus, as X becomes poorly conditioned,
both the above forms of variance degrade. The objective of LV methods is
to mitigate this effect by choosing an appropriate r–dimensional basis which
eliminates as many noise components as possible.

In most applications of latent variable methods, it is common practice to


normalize the data before the modelling process begins. In many cases,
the variables involved may have significantly different scales, or a difference
in units, e.g., some variables may be measured in microvolts and others in
degrees Celsius. Also the variables may have significantly different means.
To alleviate these disparities, each variable (column) is typically converted to
their corresponding z-score, where all values in the jth column are subjected
to the transformation
xij−µj
xij ← , i = 1, . . . , m,
σj
where µj and σj are the mean and standard deviation over the jth column.
Thus each variable is transformed so that it has zero mean and unit standard
deviation.
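A minimal sketch of this column-wise z-score transformation is shown below; the data are synthetic and the function name is illustrative. Note that a new observation x_N must be scaled with the means and standard deviations computed from the training data.

\begin{verbatim}
import numpy as np

def zscore_columns(X):
    """Scale each column (variable) of X to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

rng = np.random.default_rng(0)
X = np.column_stack([3e-6 + 1e-6 * rng.standard_normal(50),    # e.g. microvolts
                     20.0 + 5.0 * rng.standard_normal(50)])    # e.g. degrees Celsius
Xz, mu, sigma = zscore_columns(X)
print(Xz.mean(axis=0).round(12), Xz.std(axis=0).round(12))      # approximately 0 and 1

x_new = np.array([2.5e-6, 24.0])
x_new_z = (x_new - mu) / sigma       # new data scaled with the training statistics
\end{verbatim}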

9.1 Design of X

Up to now, we have assumed that X is determined before we solve the


underlying LS problem. However determination of the structure of X in
a practical scenario is not necessarily a straightforward procedure. A case
in point is which variables should be included in X. With regard to the
hurricane example discussed in the previous chapter, we could have a large
assortment of variables to consider – these may be air temperature, wa-
ter temperature, barometric pressure, wind direction and velocity, ocean
currents, tides, presence of clouds, as well as many other possible choices.
Questions arise whether all these variables are in fact equally predictive of a
hurricane event, whether they are correlated (thus leading to linear depen-
dencies of the columns of X and thus poor conditioning), and what is the
optimal number of variables to include in the model?

If we do not choose a sufficient number of variables (i.e., n is too small),


then the model may not have enough degrees of freedom or flexibility to fit
the observations Y . This condition is called underfitting. Choosing n too
large results in overfitting. That is, due to the large number of variables the model is overly flexible and may over–adapt to the available training set, but fail to predict new data samples that are not included in the training set.
Also increasing n causes the condition number of X T X to increase, giving
rise to a degradation of the variances of the LS estimates, as we discuss
below.

We show that as n increases while m remains constant, cond(X) never improves, and typically becomes worse. Because LS variances are intimately associated with the condition number, increases in n tend to degrade performance, all else being equal. We cite the modified version of the Interlacing Property [1], originally discussed in Chapter 5, repeated here for convenience:

Property 13 Let B n+1 ∈ R(n+1)×(n+1) be a symmetric matrix. Let B n =


B(1 : n, 1 : n). Then
λn+1 (B n+1 ) ≤ λn (B n )
and
λ1 (B n+1 ) ≥ λ1 (B n ).

where λi (B) indicates the ith largest eigenvalue of B. If X T X is of size n ×

n, then adding a column to X adds an additional variable to the LS problem
and increases X T X to size (n + 1) × (n + 1). From the Interlacing Property,
we have that λ1 (n + 1) ≥ λ1 (n) and λn+1 (n + 1) ≤ λn (n), where the number
in round brackets indicates the size of the matrix the respective λ is associated
with. Therefore the condition number (i.e., λ1 /λn ) never improves when
a column is added to X. In fact, the equality only holds under special
conditions, so in the general case, the condition number of X T X increases
by adding a column, and hence the variances of the LS estimates will degrade
under these circumstances. This behaviour is a manifestation of a general
principle in estimation theory that as the number of parameters estimated
from a given quantity of data increases, the variances of the estimates also
increase. This principle applies in particular to least–squares estimation.

There are two methods commonly employed for controlling the number of
variables. “Variables” are also referred to as “features” in the machine learn-
ing context. These methods are feature selection and feature extraction re-
spectively. With variable selection, variables are selected to be included in X
based on their statistical dependency with the responses of Y . For example,
the minimum redundancy maximum relevance (mRMR) method [27] selects
features iteratively. On the first iteration, the feature with the strongest
statistical dependence on Y is chosen. Then in subsequent iterations, the feature with the best combination of maximum statistical dependence with Y (relevance) and minimum statistical dependence (redundancy) with the
features chosen in previous iterations, is chosen. The process repeats un-
til the number of prescribed features is selected. This process produces a
set of features that are maximally predictive of the response variable and
as mutually independent as possible, meaning that the columns of X are
“discouraged” from being linearly dependent, thus resulting in a favourable
condition number.

The second method for controlling the number of variables is feature extrac-
tion, which is equivalent to the latent variable methods as discussed in this
section. Here, a prescribed number r of latent variables, each of which is
some form of optimal linear combination of all the available variables, are
calculated from the data. With this approach, the irrelevant variables would
be given small weights and contribute little to the latent variables. Thus
with the feature extraction method, all the variables are optimally combined
into a set of r latent vectors.

The choice of m, which is the number of observations, is more straightforward than choosing n. Generally speaking, the more observations the better,
so it is desirable to choose m as large as possible. The larger the value of
m, the larger the elements of X T X become, and consequently the smaller
are the resulting variances (see (7.34)). In fact, the variances decrease as
1/m. However, in most applications, collecting data is an expensive and
time consuming proposition, and so often we must make do with whatever
quantity of data is available. In the following, our prime motivation with
regard to LV methods is to predict responses ŷ^T from new data samples x_N^T.

9.2 Principal Component Analysis Revisited

We have already developed the PCA approach in Section 8.3 from the per-
spective of the pseudo–inverse, and also in Ch. 2 from the perspective of
data compression and denoising signals. Here we present PCA in the latent
variable context, which, as stated directly above, has to do with prediction.

To start, we repeat some material from Chapter 2 here for convenience.


Consider a vector t ∈ Rm whose elements are samples of a random vari-
able. Then recall from (2.30) that the sample variance estimate σ̂ 2 that is
obtainable from this sample is given as
\hat{\sigma}^2 = \mathrm{var}(t) = \frac{1}{m}\|t\|_2^2 = \frac{1}{m} t^T t. \qquad (9.2)
In this PCA case we choose an r–dimensional orthonormal latent vector
basis [s1 . . . sr ], so that the variances of the ti = Xsi , i = 1, . . . , r are
maximum. According to (9.2), we wish to find the set of s which are the
solution to the following maximization problem:
s_i^\star = \arg\max_{s_i} \frac{1}{m} t_i^T t_i = \arg\max_{s_i} \frac{1}{m} s_i^T X^T X s_i,

subject to

\|s_i\|_2^2 = 1.
From Sect. 2.2, the solution for the [s1 . . . sr ] are the r principal eigenvec-
tors [v 1 . . . , v r ] of X T X. We denote the PCA latent variables as V r =
[v 1 . . . , v r ]. Let us define the quantity T r ∈ Rm×r as
T r = XV r . (9.3)

The quantity T r is the latent variable representation of X, since by design,
projections of X along the vectors ti have maximum variance. Since we
assume there is a linear relationship between X and Y (otherwise Y cannot
be predicted from X using the methods discussed here), then it follows there
must also be a linear relationship between Y and T_r, which can be expressed
in the form of a regression equation as follows

Y = T r β T + E, (9.4)

where β T can be solved through the normal equations as

β T = (T Tr Tr )−1 T Tr Y . (9.5)

The determination of β T constitutes the training process for the PCA method.

Once β T is determined, we can calculate predicted values ŷ T corresponding


to a new observation (row) xTN of X. Inspired by (9.3), we first form the
latent variable representation of xTN as the r–length row tTr = xTN V r . Then
the corresponding row of predicted values ŷ T is given through (9.4) as

ŷ T = tTr β T .

Note that if we define the quantity X_r = T_r V_r^T = X V_r V_r^T, which is the projection of the row space of X onto the latent variable subspace, then according to Property 12 of Chapter 2, there is no other r-dimensional basis for which the quantity \|X - X_r\|_2 is smaller. This is the motivation for choosing the eigenvectors as the latent variables.
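The PCA training and prediction steps just described ((9.3), (9.5) and ŷ^T = t_r^T β_T) can be sketched as follows; the synthetic data, the choice r = 5 and the function names are illustrative.

\begin{verbatim}
import numpy as np

def pca_lv_fit(X, Y, r):
    """V_r = principal eigenvectors of X^T X, T_r = X V_r (9.3), beta_T from (9.5)."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)      # eigenvalues in ascending order
    Vr = eigvecs[:, ::-1][:, :r]                    # r principal eigenvectors
    Tr = X @ Vr
    beta_T = np.linalg.solve(Tr.T @ Tr, Tr.T @ Y)
    return Vr, beta_T

def pca_lv_predict(x_new, Vr, beta_T):
    """Prediction for a new row: t_r = x_new V_r, then y_hat = t_r beta_T."""
    return (x_new @ Vr) @ beta_T

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 8))
Y = X @ rng.standard_normal((8, 3)) + 0.1 * rng.standard_normal((25, 3))
Vr, beta_T = pca_lv_fit(X, Y, r=5)
print(pca_lv_predict(rng.standard_normal(8), Vr, beta_T))   # predicted response row
\end{verbatim}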

It is interesting to check the consistency of the predictions offered by the


pseudo–inverse method given by (8.13) (after accounting for the change in
notation) and the PCA method described here. To illustrate, we investigate
the prediction of a set of Y values given corresponding values in X. In the
pseudo–inverse case, we assume X and Y are related through the regression
equation Y = Xβ + E, where in this case, β = X + Y . Thus the predictions
Ŷ are given as
Ŷ = Xβ = XX + Y . (9.6)
We have seen from (8.16) that XX^+ is a projector onto the principal subspace (PS)^1 of R(X), so the predicted Ŷ is the projection of Y onto the PS of R(X).

^1 By this we mean the incomplete subspace formed from the PCA basis [v_1 . . . v_r]. We refer to [v_1 . . . v_r] as the principal subspace (PS).

On the other hand, for the present PCA approach, X and Y are
related through (9.4), where we have replaced X with T r , which is valid since
T r is a principal basis for R(X) through (9.3). The quantity Ŷ = Xβ in
this case is given through (9.4) and (9.5) as Ŷ = Xβ T = T r (T Tr T r )−1 T Tr Y .
The quantity T r (T Tr T r )−1 T Tr is also a projector onto the PS of R(X). We
have seen in Sect. 3.2 that the projector is unique, regardless of its formu-
lation. Thus, Ŷ in the PCA case is also the projection of Y onto the PS of
R(X) and so the pseudo–inverse and PCA both give identical predictions.

9.3 Partial Least Squares (PLS) and Canonical Cor-


relation Analysis (CCA)

The PCA latent variables are determined solely from X and are independent
of the Y variables, and capture the directions of major variation in X
only. For the PLS and CCA methods on the other hand, we form a set of
latent variables, one in the X–space and another in the Y –space that have
maximum covariance in the PLS case, or correlation in the CCA case. By
forming latent variable sets in this manner that incorporates both datasets,
we expect that the PLS and CCA methods might be better at predicting Y
corresponding to a new set of X values.

As a preliminary, we recall the defining relationships for the SVD, as outlined


in Sect. 3.1.3, which we repeat here for convenience. Consider a matrix
A = U ΣV T . From this definition of the SVD, it follows that

A v_i = \sigma_i u_i \qquad (9.7)
A^T u_i = \sigma_i v_i. \qquad (9.8)

We use these relations later in this section.

Consider random vectors x and y ∈ Rm , which contain m samples of the


random variables X and Y respectively. The sample covariance estimate rxy
between these vectors is given from (2.30) as

r_{xy} = \frac{1}{m} x^T y, \qquad (9.9)

whereas the sample correlation estimate \rho_{xy} is given as

\rho_{xy} = \frac{x^T y}{\|x\|_2 \|y\|_2}. \qquad (9.10)

The inner product in both cases can be written in the form

xT y = ||x||2 ||y||2 cos(θ), (9.11)

where θ is the angle between the two vectors. Thus from (9.9) we note that
the covariance depends on ||x||2 , ||y||2 and θ. Comparing (9.11) with (9.10),
we have
ρxy = cos(θ).
and so ρxy , unlike rxy , depends only on the angle between the vectors and
is independent of the norms. Thus ρxy lies in the range −1 ≤ ρxy ≤ 1 and
gives an idea of how closely the random variables X and Y agree with each other on average.

The idea of covariances and correlations can be generalized to the multidi-


mensional case where we have matrices X ∈ Rm×n and Y ∈ Rm×k instead
of vectors x and y. The covariance matrix RXY ∈ Rn×k for the multidi-
mensional case is given as

RXY = X T Y . (9.12)

To define a multi–dimensional version of correlation, we must first define


matrices GX and GY which are square–root factors (e.g., Cholesky fac-
tors) of the covariance matrices X T X and Y T Y respectively. Then it is
straightforward to show that the matrices \tilde{X} = X G_X^{-1} and \tilde{Y} = Y G_Y^{-1} have orthonormal columns. (The proof is left as an exercise.) The presence of the tilde indicates the respective quantity has been orthonormalized. The correlation matrix \tilde{R}_{XY} is then defined as

\tilde{R}_{XY} = \tilde{X}^T \tilde{Y}. \qquad (9.13)

In loose terms, the normalizing or orthonormalizing factors G−1 in the


multi–dimensional case play the same role as the norm expressions in the
denominator of (9.10).
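The covariance and correlation matrices (9.12) and (9.13) can be computed as in the sketch below, using Cholesky factors as the square-root factors G_X and G_Y; the data and the function name are illustrative. The singular values of the correlation matrix obtained this way all lie between 0 and 1.

\begin{verbatim}
import numpy as np

def whiten(X):
    """Return X~ = X G^{-1}, where X^T X = G^T G with G^T the Cholesky factor."""
    L = np.linalg.cholesky(X.T @ X)        # X^T X = L L^T, i.e. G = L^T
    return X @ np.linalg.inv(L).T          # X~ has orthonormal columns

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Y = rng.standard_normal((50, 3))

Rxy = X.T @ Y                              # covariance matrix, eq. (9.12)
Xt, Yt = whiten(X), whiten(Y)
Rxy_tilde = Xt.T @ Yt                      # correlation matrix, eq. (9.13)

print(np.allclose(Xt.T @ Xt, np.eye(4)))                 # orthonormal columns
print(np.linalg.svd(Rxy_tilde, compute_uv=False))        # values lie in [0, 1]
\end{verbatim}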

With the PLS and CCA methods, we create a set of r orthogonal basis
vectors for each of the X and Y datasets. We refer to the respective

subspaces formed by these bases as S_X and S_Y. Each is of dimension r ≤ min(n, k). The latent variable basis vectors for X and Y are denoted t_i and p_i, i = 1, . . . , r, respectively. For the PLS case, we choose t_1 ∈ S_X and p_1 ∈ S_Y so that their covariance, i.e., the quantity t_1^T p_1, is maximum.
Then t2 and p2 are chosen so they too have maximum covariance, under
the constraint they are each orthogonal to their counterparts of the first set.
The remaining basis vectors are found in a similar manner. The CCA case
is similar, except we choose to maximize correlations instead of covariances.
By choosing the latent variables in this manner, we provide the best possible
fit between the X and Y subspaces and therefore new values xN of X are
more likely to lead to “good” predictions of the corresponding Y –values.

For the time being, we consider only the covariance or PLS case – the CCA
case is addressed later. To determine t ∈ SX and p ∈ SY with maximum
covariance, we identify unit–norm vectors s ∈ Rn and q ∈ Rk , such that the
covariance between t = Xs and p = Y q is maximum. Posing the problem
in this manner guarantees the solutions t∗ and p∗ belong to their respective
subspaces. This problem may be expressed in the form of the following
constrained optimization problem:

[s^*, q^*] = \arg\max_{s,q} s^T X^T Y q \equiv \arg\max_{s,q} s^T R_{xy} q, \qquad (9.14)

subject to

\|q\|_2 = 1, \quad \|s\|_2 = 1.

The Lagrangian corresponding to this problem is given by

s^T R_{xy} q + \gamma_{i,1}\left[1 - (s^T s)^{1/2}\right] + \gamma_{i,2}\left[1 - (q^T q)^{1/2}\right]. \qquad (9.15)

We differentiate (9.15) with respect to s and q. With regard to the first term,
using a procedure similar to that outlined in Sect. 2.8, it is straightforward
to show that
\frac{d}{dq}\, s^T R_{xy} q = s^T R_{xy}, \qquad \frac{d}{ds}\, s^T R_{xy} q = R_{xy} q.

Differentiation of the second term of (9.15) with respect to s is straightforward using the chain rule. It is readily verified that

\frac{d}{ds}\, \gamma_{i,1}\left[1 - (s^T s)^{1/2}\right] = -\gamma_{i,1} \frac{\bar{s}}{\|\bar{s}\|_2},

where \bar{s} is a vector with the same direction as s but whose 2–norm is arbitrary. Therefore we assign the vector s = \bar{s}/\|\bar{s}\|_2 as the normalized version of \bar{s} having unit 2–norm. A corresponding result holds for the last term:

\frac{d}{dq}\, \gamma_{i,2}\left[1 - (q^T q)^{1/2}\right] = -\gamma_{i,2} \frac{\bar{q}}{\|\bar{q}\|_2}.

We also define q = \bar{q}/\|\bar{q}\|_2, which is the normalized version of \bar{q}.

Assembling these derivative terms with respect to s and q individually, and


setting the result to zero for each case, we have

R_{xy} q = \gamma_{i,1} s \qquad (9.16)

R_{xy}^T s = \gamma_{i,2} q, \qquad (9.17)

respectively, where we have transposed both sides of the second line above.
Comparing (9.16) and (9.17) to (9.7) and (9.8), the latter of which are the defining relations for the SVD, we see that the stationary points (s, q) of (9.14) are respectively the left and right singular vectors of R_{xy}. Let the SVD of R_{xy} be expressed as R_{xy} = U \Sigma V^T. Therefore the optimal set satisfying (9.14) are the first r left and right singular vectors U_r = [u_1 . . . u_r] and V_r = [v_1 . . . v_r] respectively. Note that the required orthogonality property of the solutions follows directly from the orthonormality of U and V.

The corresponding vector sets in SX and SY respectively with maximum


covariance are therefore given as

T r = [t1 . . . , tr ] = XU r , (9.18)

and
P r = [p1 . . . , pr ] = Y V r . (9.19)
Note that T_r and P_r are both m × r. The T_r and P_r are the desired
latent variable bases for SX and SY respectively.

It is interesting to evaluate the covariance values corresponding to the opti-


mal solution. To do so we evaluate the quantity tTi pi as follows:

t_i^T p_i = u_i^T X^T Y v_i = u_i^T R_{xy} v_i = u_i^T U \Sigma V^T v_i = \sigma_i. \qquad (9.20)

It is seen that the r maximum covariances between X and Y are the largest
r singular values of Rxy , which are the σi . The directions in SX and SY
which result in this largest covariance are given by ti = Xui and pi = Y v i
respectively.

The development for the CCA case is identical to that of the PLS case,
except that we use the variables X̃ and Ỹ in place of X and Y throughout
the development above. Because PLS is related to the covariance between
X and Y , the PLS latent variablces are formed from a combination of the
directions of major variation in both X and Y , as well as the angles between
the latent vectors in the two respective subspaces. Because CCA is derived
from the correlation between the variables, the CCA latent variables are de-
termined solely from the angles between the subspaces and are independent
of the directions of major variation. This is a direct consequence of the fact
the columns of both X̃ and Ỹ are orthonormal. In the CCA case only, it
can be shown [1] that the σi , i = 1, . . . , r are the cosines of the r angles
specifying the relative orientations between SX and SY . In this vein, it may
be shown (Problem 3) that 0 ≤ σi ≤ 1.

The steps involved in the evaluation of the latent variables for the PLS or
CCA methods can be summarized as follows:

• Given a training set consisting of X and Y data, evaluate R_{xy} or \tilde{R}_{xy} according to (9.12) or (9.13), depending on whether the PLS or CCA method is being used.

• Calculate the SVD of the R–matrix above to give the values U_r and V_r.

• Finally, calculate the latent variables T_r and P_r according to (9.18) and (9.19) respectively.

These latent variables are used to predict y–values corresponding to new


x–values xTN as explained in the following section.

The presentation here for identifying the PLS and CCA latent variables is
quite different from the usual treatment in the literature. Most methods
e.g.[20, 22] use the nonlinear iterative partial least squares (NIPALS) algo-
rithm for extracting the latent variables. However, the method presented

here using the SVD on the matrix Rxy as in (9.14) yields identical LVs, and
affords a simpler presentation.
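A sketch of the three steps listed above, for the PLS case, is shown below with synthetic data and illustrative names; for CCA, X and Y would first be replaced by their orthonormalized versions as in (9.13). The final check illustrates (9.20): the covariances t_i^T p_i equal the leading singular values of R_xy.

\begin{verbatim}
import numpy as np

def pls_latent_variables(X, Y, r):
    """Latent variable bases T_r, P_r via the SVD of R_xy = X^T Y, eqs. (9.18)-(9.19)."""
    U, s, Vt = np.linalg.svd(X.T @ Y)
    Ur, Vr = U[:, :r], Vt[:r].T
    return Ur, Vr, X @ Ur, Y @ Vr, s[:r]

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
Y = X @ rng.standard_normal((6, 4)) + 0.1 * rng.standard_normal((40, 4))
Ur, Vr, Tr, Pr, s = pls_latent_variables(X, Y, r=3)

print(np.allclose(np.diag(Tr.T @ Pr), s))   # t_i^T p_i = sigma_i, eq. (9.20)
\end{verbatim}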

9.3.1 Prediction for the PLS and CCA cases

We now consider the problem where the latent variables for either one of the
two methods discussed are available, and we wish to determine the row(s)
of prediction estimates ŷ T corresponding to a new observation (row(s)) xTN
of X–values. The prediction procedure for both the PLS and CCA methods
are identical given the set of their respective latent variables.

We represent X and Y in their r–dimensional respective LV subspaces us-


ing T r and P r from (9.18) and (9.19). We assume there is an underlying
relationship between these LV spaces, that is reflective of the underlying re-
lationship between X and Y that we are trying to model. This relationship
may be expressed in the form of a regression equation P r = T r β P + E. The
value of β P is therefore given from the normal equations as
β P = (T Tr T r )−1 T Tr P r . (9.21)
The identification of β P is part of the training process, obtained using the
available training data X and Y . Given a new row (or rows) xTN of data that
has not been used in the training process, we first form the LV representation
tTN in the X–space as tTN = xTN U r , in a manner corresponding to (9.18).
Then the corresponding LV representation pTN ∈ Rr in the Y –space is given
as
pTN = tTN β P . (9.22)
We now must convert these pTN values of length r in the LV space into a
corresponding row in the original Y –space, which is length k. To do this,
we establish the relationship between P r and Y . Assuming they are related
as Y = P r β y + E y , then β y is given as

β y = (P Tr P r )−1 P Tr Y . (9.23)

Then we can compute our predicted row ŷ T of Y corresponding to the new


data xTN as
ŷ T = pTN β y . (9.24)
The prediction process for the PLS or the CCA methods may now be sum-
marized. Given a training set of X and Y values for which their respective

latent variables have been computed as described in the previous section,
we compute predicted values ŷ T for Y corresponding to new values xTN of
X in the following manner:

• Compute the latent variable representation of the new variables as t_N^T = x_N^T U_r.

• Given β_P from (9.21), transform t_N^T from S_X into S_Y to give p_N^T using (9.22), representing the predicted values of Y in the latent subspace.

• Transform the p_N^T from the latent space into the Y–space using (9.24), where β_y is given by (9.23), to yield the predicted values ŷ^T of Y.

The prediction algorithm presented here is considerably simpler than the


prediction method that appears in most of the literature e.g. [20,22]. Simu-
lation results for the two approaches are indistinguishable from one another.
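The following sketch assembles the training quantities (9.21) and (9.23) together with the prediction steps (9.22) and (9.24) into a single PLS fit/predict pair. It repeats the latent-variable extraction so that it runs on its own; the data and names are again illustrative.

\begin{verbatim}
import numpy as np

def pls_fit(X, Y, r):
    """Training: U_r, V_r from the SVD of X^T Y, then beta_P (9.21) and beta_y (9.23)."""
    U, s, Vt = np.linalg.svd(X.T @ Y)
    Ur, Vr = U[:, :r], Vt[:r].T
    Tr, Pr = X @ Ur, Y @ Vr
    beta_P = np.linalg.solve(Tr.T @ Tr, Tr.T @ Pr)
    beta_y = np.linalg.solve(Pr.T @ Pr, Pr.T @ Y)
    return Ur, beta_P, beta_y

def pls_predict(x_new, Ur, beta_P, beta_y):
    """Prediction: t_N = x_new U_r, p_N = t_N beta_P (9.22), y_hat = p_N beta_y (9.24)."""
    return (x_new @ Ur) @ beta_P @ beta_y

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
Y = X @ rng.standard_normal((6, 4)) + 0.1 * rng.standard_normal((40, 4))
Ur, beta_P, beta_y = pls_fit(X, Y, r=4)
print(pls_predict(rng.standard_normal(6), Ur, beta_P, beta_y))   # predicted row of Y
\end{verbatim}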

A last topic for this section is to introduce the terminology “loadings and
scores”, with respect to latent variables that is in common use in the sta-
tistical literature. The variables X and Y are typically represented in their
rank-r latent variable bases as

X = T P T + Ex
Y = U QT + E y .

The matrices T and U are bases for the column spaces of X and Y re-
spectively, whereas P and Q are the corresponding row space bases. The
matrices T and U are referred to as “scores”, whereas P and Q are referred
to as “loadings”.

9.3.2 Simulation Example

We present a simulation example to compare the relative performances of the


three latent variable methods we have discussed, with respect to prediction
of Y values corresponding to previously unseen values of X. First, we
construct a matrix X ∈ Rm×n whose elements are independent, zero mean,
unit variance Gaussian random variables using the “randn” command in
matlab. In our simulations, n = 8 and m = 25.

With this present simulation scenario, the singular values of X are typically in the range between 1 and 10. To introduce near linear dependence among the columns of X and corresponding poor conditioning (which is necessary to illustrate the effectiveness of LV methods), we perform an SVD on X, and replace the three smallest singular values of Σ with the value 1 × 10^{-8}.
A new matrix X is then reassembled from its SVD components using the
modified version of Σ.
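The construction of a nearly rank-deficient X by truncating its smallest singular values can be sketched as follows (NumPy rather than matlab; the random seed is arbitrary).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
m, n = 25, 8
X = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
s[-3:] = 1e-8                          # replace the three smallest singular values
X = U @ np.diag(s) @ Vt                # reassemble the poorly conditioned X

print(np.linalg.svd(X, compute_uv=False))   # last three values are 1e-8
print(np.linalg.cond(X))                    # condition number is very large
\end{verbatim}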

An m × k matrix Y where k = 6 was constructed as follows:

Y = XB + σE (9.25)

where

• B is an n × k matrix whose elements are independent, zero–mean, unit variance Gaussian random variables, drawn from the matlab “randn” command.

• E ∈ R^{m×k} is additive Gaussian noise, also with zero mean and unit variance.

• σ controls the signal–to–noise ratio (SNR) of Y.

The SNR is defined as


\mathrm{SNR} = \frac{\|XB\|_F^2}{\|\sigma E\|_F^2}.
The simulation to determine the prediction error between the true and es-
timated values of Y consists of an inner and outer loop, as shown in Fig.
9.1. In each iteration of the inner loop, a noise sample E is generated by
the “randn” command in matlab for a given set of values for X and B and
used to form a new sample of Y according to (9.25). The X and Y arrays
are then split into training and test sets, giving X train and X test ; likewise
for Y . The latent variables and the β–values are determined in the LV anal-
ysis block for each of the three methods, using only the training set data.
The predicted values Ŷ corresponding to X test are then predicted using the
methods discussed in Sect. 9.3.1 and compared to the true test values Y test
to generate the normalized prediction error given by

\text{normalized prediction error} = \frac{\|Y_{test} - \hat{Y}\|_F^2}{\|Y_{test}\|_F^2}.

Then after 100 iterations of the inner loop, a new iteration of the outer loop
proceeds, where new values of X and B are calculated and 100 iterations
of the inner loop are repeated for these new values of X and B. After 50
iterations of the outer loop, all the normalized prediction errors are averaged
together. In this manner, the final results reflect LV performance over many
settings of X and β values. We show the normalized prediction errors
vs. r (the number of latent components) for SNR = 6 and 20 dB in the
following figures, for each of the three LV methods discussed. Results from
the ordinary normal equations are not shown since their accuracies are very
poor, due to the poor conditioning of X.

The nominal rank for X is 5, since there are 8 columns but the three smallest
singular values were set to very small values, thus making the effective rank
equal to 5. It may be seen that in all cases (except perhaps CCA at 6 dB
SNR) the prediction error drops to a plateau, whose value depends mostly
on the SNR. In this case, the dimensionality of the latent variable subspaces
is high enough to form an accurate model, and this prediction error is low.
Below r = 5, the prediction errors rise sharply due to underfitting. In these
cases, PLS uniformly performs better than PCA as expected, since PLS is
inherently a more flexible model. The CCA performance is approximately
comparable to that of PLS, except CCA performance drops off for low values
of SNR. It is apparent that the prediction performance of all three methods
is highly dependent on the SNR value.

Figure 9.1. An overview of the simulation process used to compare the three forms of LV
methods.

Figure 9.2. Normalized prediction error (average relative error in prediction) vs. r, the number of latent components, for the PCA, PLS and CCA methods at SNR = 6 dB.

Figure 9.3. As above, for SNR = 20 dB.

9.4 Problems

1. Explain the effect of (independently) varying m and n with respect to


the matrix X in an LV least squares problem.

2. Let X = U ΣV T . The PCA method can also be performed by forming


a rank–r m × n matrix X r = U r U Tr X, where U r consists of the first
r columns of U . Show this approach is identical to the pseudo-inverse
method for prediction. This method is based on projecting columns of
X onto the principal column space U r of X. Show that the identical
result can also be given by taking X r = XV r V Tr , where V r contains
the principal r columns of V . This method is based on projecting
rows of X onto the principal row space. Since we have seen the PCA
method presented in Sect. 9.2 is also equivalent to the pseudo–inverse
approach for prediction, we can conclude that all four methods are
identical, although the implementation differs.

3. Given Z = X T Y , find transformations T x on X and T y on Y (post


multiplications are implied) so that T Tx ZT y is diagonal. What are the
diagonal values?

4. Prove that the canonical correlation coefficients σi in (9.20) satisfy


0 ≤ |σi | ≤ 1, i = 1, . . . , r.

5. (From Applied Linear Algebra, by James W. Demmel.) Let A, B and


C be matrices with dimensions such that the product AT CB T is
defined. Let χ be the set of matrices X minimizing ||AXB − C||F ,
and let X o be the unique member of χ with minimum ||X||F . Show
that X o = A+ CB + . Hint: Use the SVDs of A and B.

6. Give a simplified form for the quantity P Tr P r in (9.23).

7. In our discussion on latent variable methods, our focus was on predict-


ing ŷ T corresponding to new values xTN . We did not discuss estimation
of β. Propose a method based on PLS for this purpose.

Chapter 10

Regularization

The methods we have discussed until now improve stability of the model
by purposely reducing the effective rank of X and/or Y . An alternative
approach is regularization, which improves the modelling process by incor-
porating prior information into the model in some form. In the LS example,
regularization helps to mitigate the effects of poor conditioning, which as
the reader may recall, occurs when X T X is poorly conditioned. This leads
to large variances in the LS estimates, as we have seen in previous chapters.

Regularization is an extensive topic in the computer science/mathematical


literature. There are many different forms, although usually they fall into
the generalized framework referred to as Tikhonov regularization [refs]. The
regularization methods we discuss here are ridge regression, smoothness reg-
ularization and sparsity regularization, which fall under the Tikhonov um-
brella. Tikhonov regularization methods are also referred to as shrinkage
methods for reasons which will become apparent in the sequel.

LV methods are one form of regularization, since they impose prior infor-
mation by assuming low rank approximations to X and Y . All forms exist
to mitigate the effects of poor conditioning, which results when the columns
of X become close to linear dependence, resulting in near rank deficiency
and hence at least one small eigenvalue. This implies that the variables cor-
responding to each column are too dependent on one another, or in other
words, there is not enough joint information in the columns/variables to

create a stable model. Regularization imposes additional prior information
on the solution to help mitigate this situation.

We consider only three types of regularization in common use; these are


ridge regression, smoothness regularization and regularization by sparsity.

10.1 Ridge Regression

If X is poorly conditioned, then some eigenvalues of X T X are relatively


small, and therefore some eigenvalues of (X T X)−1 become large, result-
ing in (X T X)−1 having large elements. Therefore in solving the ordinary
normal equations β LS = (X T X)−1 X T y, elements of the solution β LS can
become inappropriately large. The idea behind ridge regression is to impose
a constraint on ||β LS ||2 to encourage a more stable solution with a more
moderate norm. In this vein, we modify the ordinary LS objective function
to the form
\[ \|X\beta - y\|_2^2 + \lambda\|\beta\|_2^2. \qquad (10.1) \]
The second term of this objective function encourages a solution where ||β||_2^2
is small. In effect, this modified objective function trades off fit (the first
term) for a small–norm solution (imposed by the second term). The param-
eter λ ≥ 0 controls the degree of this tradeoff, where a larger value places
more emphasis on the norm of the solution being small, and less weight to
the fit, and vice–versa.

A universal value for λ in general cannot be determined beforehand. Its


value is typically determined by examining the effect of these tradeoffs on a
case–by–case basis, using trial–and–error or cross–validation techniques. A
useful, more structured approach to the determination of λ is the L–curve
method presented in [28].

An alternative form for (10.1) is the following:
\[
\hat{\beta}^{rr} = \arg\min_{\beta} \|X\beta - y\|_2^2 \quad \text{subject to } \|\beta\|_2 \le t \qquad (10.2)
\]
for some value of t, where β̂^{rr} is the ridge regression estimate of β. Eq.
(10.2) expresses (10.1) in the form of a strict upper bound on ||β||_2. It may

be shown [25] there is a corresponding value of t in (10.2) for which the
solutions are identical.

There is an analytic solution to (10.1). The derivative with respect to β of


the first term is, as before, 2X^T Xβ − 2X^T y. It is straightforward to verify
that the derivative of the second term is 2λβ = 2λIβ. Adding these terms
and setting the result to zero, we obtain a modified set of normal equations
given by
\[ (X^T X + \lambda I)\beta = X^T y, \]
and therefore
\[ \hat{\beta}^{rr} = (X^T X + \lambda I)^{-1} X^T y. \qquad (10.3) \]

Thus, the ridge regression method effectively adds the value λ to the diagonal
elements of X^T X. Recall from the Properties of Eigenvalues in Chapter 2 that
adding a constant term λ to the diagonal elements of a matrix has the effect
of adding the same value to each of its eigenvalues; i.e., each λ_i is replaced
by λ_i + λ.¹ Further recall the discussion on the condition number K_2(A) of
a matrix A in Ch. 4. When solving a system of equations Ax = b, the
quantity K_2(A) is the worst–case magnification factor by which errors in A
or b appear in the solution x. K_2(A) in the 2-norm sense is given by
\[ K_2(A) = \frac{|\lambda_1|}{|\lambda_n|}, \]
i.e., the ratio of the absolute values of the largest to smallest eigenvalues of
A. After regularization, the modified condition number K_2'(A) therefore
becomes
\[ K_2'(A) = \frac{|\lambda_1 + \lambda|}{|\lambda_n + \lambda|}. \]
In a poorly conditioned LS problem, λ_1 ≫ λ_n, and so if λ is significantly
greater than λ_n, K_2'(X) can be significantly less than K_2(X) without
significantly perturbing the matrix X^T X, and therefore also the integrity of
the solution.

¹A clarification on notation: λ_i (with a subscript) denotes an eigenvalue, whereas λ
without a subscript denotes the ridge regression regularization parameter.
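To make the effect of λ concrete, the following minimal numerical sketch (our own
illustration in Python/NumPy, with arbitrarily chosen problem sizes and λ) builds a
poorly conditioned X, solves both the ordinary and the modified normal equations
(10.3), and prints the condition numbers before and after adding λ:

import numpy as np

rng = np.random.default_rng(0)

# Build a poorly conditioned 30 x 6 data matrix by shrinking its smallest singular value.
X = rng.standard_normal((30, 6))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
s[-1] *= 1e-4                      # force near rank deficiency
X = U @ np.diag(s) @ Vt

beta_true = np.ones(6)
y = X @ beta_true + 0.1 * rng.standard_normal(30)

lam = 1.0                           # illustrative ridge parameter (normally tuned)
I = np.eye(X.shape[1])

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)            # ordinary normal equations
beta_rr = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # ridge regression, eq. (10.3)

print("K(X^T X)         =", np.linalg.cond(X.T @ X))
print("K(X^T X + lam I) =", np.linalg.cond(X.T @ X + lam * I))
print("error, ordinary LS:", np.linalg.norm(beta_ls - beta_true))
print("error, ridge:      ", np.linalg.norm(beta_rr - beta_true))

In practice λ would be tuned by trial and error, cross–validation, or the L–curve method
mentioned above.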

An additional interpretation of ridge regression is as follows. Using the
ordinary normal equations, the quantity Xβ_n is given by
\[ X\beta_n = X(X^T X)^{-1} X^T y, \]
where β_n is the normal equation estimate of β. When we substitute the SVD
of X = UΣV^T into the above, we obtain the simplified form
\[ X\beta_n = U U^T y, \]
which, when we apply the outer product rule for matrix multiplication, may
be expressed as
\[ X\beta_n = \sum_{i=1}^{n} u_i u_i^T y. \qquad (10.4) \]

It is interesting to note that the PCA solution β_pca can also be expressed in
the form
\[ X\beta_{pca} = U_r U_r^T y \]
where U_r = [u_1, . . . , u_r]. Using the outer–product rule for matrix multiplication,
this can be written in the form
\[ X\beta_{pca} = \sum_{i=1}^{r} u_i u_i^T y. \qquad (10.5) \]

Thus, by comparing (10.4) and (10.5), we see that the PCA solution is
similar in form to the normal equation solution, but PCA applies a hard
thresholding procedure to eliminate the components [ur+1 . . . un ] which are
associated with the smaller singular values of X.

Now we look at the ridge regression solution in the light of (10.4) and (10.5).
From the ridge regression estimate (10.3), we get
\[ X\hat{\beta}^{rr} = X(X^T X + \lambda I)^{-1} X^T y. \]
Substituting the SVD for X as before and simplifying, we have
\[ X\hat{\beta}^{rr} = U\Sigma(\Sigma^2 + \lambda I)^{-1}\Sigma U^T y. \]
Because the inner matrices are diagonal, we can express the above using the
outer–product rule for matrix multiplication as
\[ X\hat{\beta}^{rr} = \sum_{i=1}^{n} u_i \left(\frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right) u_i^T y. \qquad (10.6) \]

Since λ > 0, the term in the round brackets above is always less than
1. For suitably–chosen λ, this term is close to one for the larger singular
values and small for the small singular values. By comparing (10.5) and
(10.6), we see that the ridge regression approach is similar to the PCA
approach, but ridge regression applies a soft instead of a hard thresholding
function to suppress the effect of the components that are associated with the
small singular values. It may be seen from (10.4) that the ordinary normal
equation approach on the other hand applies no thresholding procedure at
all.

We note that both the PCA and ridge regression methods involve a process
which forces particular singular values of X to become smaller. This same
phenomenon also holds for other forms of regularization which are not dis-
cussed here. This process of reduction of the eigenvalues is the origin of the
term “shrinkage”, which is a term often used in the machine learning and
statistical literature to describe the regularization procedure.
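The hard versus soft thresholding behaviour can be tabulated directly from the
singular values. The short sketch below (our own illustration; the singular values,
λ and r are hypothetical) prints the weight applied to each component u_i u_i^T y by
the ordinary normal equations (10.4), by PCA (10.5) and by ridge regression (10.6):

import numpy as np

# Hypothetical singular values of X, in descending order
sigma = np.array([10.0, 5.0, 2.0, 0.5, 0.05, 0.01])
lam = 0.25        # illustrative ridge parameter
r = 3             # number of principal components retained by PCA

w_normal = np.ones_like(sigma)                    # eq. (10.4): no thresholding
w_pca    = (np.arange(len(sigma)) < r) * 1.0      # eq. (10.5): hard threshold
w_ridge  = sigma**2 / (sigma**2 + lam)            # eq. (10.6): soft threshold

for i, (s, wn, wp, wr) in enumerate(zip(sigma, w_normal, w_pca, w_ridge), start=1):
    print(f"sigma_{i}={s:6.2f}  normal={wn:4.2f}  pca={wp:4.2f}  ridge={wr:5.3f}")

The ridge weights decay smoothly towards zero as the singular values shrink, whereas
the PCA weights switch abruptly from one to zero.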

10.2 Regularization using a smoothness penalty

This approach is useful if it is known that the solution β is smooth, i.e.,


changes in successive elements of β are small relative to ||β||2 . Consider a
matrix B defined as
\[
B = \begin{bmatrix}
1 & -1 & & & \\
  & 1 & -1 & & \\
  & & \ddots & \ddots & \\
  & & & 1 & -1 \\
  & & & & 1
\end{bmatrix},
\]

i.e., an identity matrix with the first upper diagonal replaced with -1. Then
the elements of z = Bx measure differences in successive elements of a
vector x. If the solution is to be smooth, then we want ||z||22 to be small. We
can therefore modify the LS objective function to incorporate a smoothness
constraint by adopting the following form:
\[ \|X\beta - y\|_2^2 + \lambda\|B\beta\|_2^2. \]
Differentiating and setting the result to zero, we obtain
\[ \left(X^T X + \lambda B^T B\right)\beta = X^T y. \]
The solution β̂^s to this form of the normal equations penalizes a non–smooth
solution. As in (10.1), the above may also be expressed in the form
\[
\hat{\beta}^{s} = \arg\min_{\beta} \|X\beta - y\|_2^2 \quad \text{subject to } \|B\beta\|_2 \le t.
\]
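A minimal sketch of this construction is given below (our own Python/NumPy illustration,
separate from the simulation example that follows; the matrix B is the first–difference
operator defined above):

import numpy as np

def smooth_solve(X, y, lam):
    """Solve (X^T X + lam * B^T B) beta = X^T y, where B is the identity
    with its first upper diagonal replaced by -1."""
    n = X.shape[1]
    B = np.eye(n) - np.eye(n, k=1)          # rows measure differences of successive elements
    return np.linalg.solve(X.T @ X + lam * (B.T @ B), X.T @ y)

# Small illustrative problem with a smooth underlying beta
rng = np.random.default_rng(1)
beta_true = np.array([1.0, 1.1, 1.1, 1.0, 0.9])
X = rng.standard_normal((40, 5))
y = X @ beta_true + 0.5 * rng.standard_normal(40)

print(smooth_solve(X, y, lam=2.0))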

Simulation Example: We simulate an LS problem where the initial X is


a 30 × 6 matrix of independent Gaussian random variables with mean zero
and unit variance. In order to create a poorly–conditioned X, an SVD was
performed on this matrix, and the smallest two singular values were both
set to 0.001× their original values. Then a new X was reconstructed as
X = U Σ0 V T , where Σ0 is the modified matrix of singular values. The pa-
rameter β = [1, 1.1, 1.1, 1.1, 1.1, 1, 0.9]T , which can be verified by inspection
to be smooth. A response vector y was simulated, such that y = Xβ + ,
where  consists of zero–mean, independent Gaussian noise samples with a
standard deviation such that the ratio of ||Xβ||2 to ||||2 is approximately
2:1. The β parameter was then estimated using the following methods, over
1000 iterations, each with a different value of : smoothness constraint, ridge
regression, the pseudo–inverse and the ordinary normal equations. The av-
eraged relative error in the respective solutions is given in Table 10.1, after
tuning the respective λ’s for minimum error.
Table 10.1. Relative errors in β for different forms of regularization, when the solution is
known to be smooth.

Method Relative Error


smoothness 0.13735
ridge regression 0.83789
pseudo–inverse 0.83318
ordinary normal eqs. 118.2612

It is seen that imposing a smoothness penalty in this situation helps reduce


the relative error in the solution by a significant margin. It also may be
observed that the ordinary normal equations give a meaningless result, which
arises due to the lack of regularization, relatively high noise levels, and
poor conditioning. The remaining regularized methods (ridge regression
and pseudo–inverse) fare much better than the ordinary normal equation
approach, but because they do not exploit the knowledge that the solution
is smooth, cannot perform as well as smoothness regularization.

10.3 Sparsity regularization

Regularization by sparsity is also known as the least absolute shrinkage and


selection operator (the lasso) [25]. This form of regularization imposes a
solution that has as few non-zero elements as possible. A solution having
this property is referred to as sparse. An example illustrating why sparsity
is a useful form of penalty is given later in this section. The lasso is also
referred to as basis pursuit in some of the signal processing literature.

The least squares objective function for sparsity regularization is given by


\[ \|X\beta - y\|_2^2 + \lambda\|\beta\|_S, \qquad (10.7) \]
where || · ||S denotes a norm which induces sparsity. Unlike the previous
forms of regularization, the lasso has no closed–form solution.

We now examine suitable sparsity–inducing norms for this purpose. Ideally,


we would like a norm which simply counts the number of non–zero elements
in the solution. Such a norm exists in the form of a modified p = 0 norm.
However, the use of this norm results in a computationally intractable op-
timization problem and so other forms of norm are more favoured. As may
be seen from Fig. 10.1, the p = 2 norm penalty function is small for small
values of its argument, and so it is ineffective at forcing small elements of
the solution towards zero. On the other hand, it may be seen that the
p = 1 norm penalty function imposes a significantly larger penalty for small
values, and so may be more effective at inducing sparsity. It may also be
observed from the figure that the penalty imposed for p < 1 is larger than
that for the 1–norm case for small values. The problem however is that the
optimization problem of (10.7) for p < 1 becomes non-convex and therefore
is more difficult to compute. Therefore in practice, the lasso is implemented
using the 1–norm penalty. In this case the objective function of (10.7) is a
convex quadratic program and can therefore be solved using readily available
optimization packages.
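For readers who wish to experiment, one simple way to minimize the 1–norm penalized
objective (10.7) is iterative soft–thresholding (ISTA); the sketch below is our own
minimal implementation and is not drawn from any particular optimization package:

import numpy as np

def soft_threshold(v, t):
    """Elementwise soft thresholding: shrink v towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||X b - y||_2^2 + lam * ||b||_1 by iterative soft thresholding."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2     # Lipschitz constant of the gradient of the fit term
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ b - y)      # gradient of the quadratic term
        b = soft_threshold(b - grad / L, lam / L)
    return b

# Example: sparse recovery from a few noisy measurements
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 20))
b_true = np.zeros(20)
b_true[[2, 7, 15]] = [1.5, -2.0, 1.0]
y = X @ b_true + 0.1 * rng.standard_normal(50)
print(np.round(lasso_ista(X, y, lam=1.0), 2))

The soft-threshold step is exactly the mechanism that drives most coefficients to zero.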

We now show a further illustration of the effect of using the 1-norm as a


penalty function. In a manner similar to (10.1), the lasso solution corre-
sponding to the objective function (10.7) can also be written in the form
\[
\beta^{lasso} = \arg\min_{\beta} \|X\beta - y\|_2^2 \quad \text{subject to } \|\beta\|_1 \le t. \qquad (10.8)
\]

Figure 10.1. Curves of ||x||p vs. x for various values of p for the one–dimensional case.

Figure 10.2 shows the elliptical contours of the joint confidence regions of
the estimates β lasso for various values of α, as discussed in Sect. 7.4. These
ellipses are the contours for which (β lasso − β o )T X T X(β lasso − β o ) = k,
where the curves for various values of k are shown. The solution to (10.8)
corresponds to the case where the ellipse with the lowest possible k just
touches the constraint function ||β||1 ≤ t, which is the diamond region in
the top figure of Fig. 10.2. As can be seen, the ellipse touches the constraint
function on the β1 axis, where β2 = 0, thus inducing a sparse solution in the
present two dimensional system. When this situation is extended to multi-
ple dimensions, the “pointy” nature of the 1-norm constraint encourages a
solution along one of the co–ordinate axes, where most of the elements of β
are zero, again promoting sparsity in the solution.

On the other hand, we see from the lower figure in Fig. 10.2 that the
ellipse touches the circular constraint function ||β||2 ≤ t at a point away
from a co–ordinate axis, thus admitting a small value of β2 to exist in the
solution. From this example we see that a 2–norm constraint is ineffective
at encouraging a sparse solution.

We now present a simulation example. We consider a waveform which is a su-


perposition of Gaussian pulses as shown in Fig. 10.3. This waveform loosely
resembles a clean version of an event–related potential (ERP) recorded from
an electroencephalogram (EEG) in response to a deviant stimulus tone. We
generated 1000 such pulses, where in each waveform the pulses are subjected
to timing jitter and amplitude variation, as well as additive noise at an SNR
of approximately 0 dB. These corrupted waveforms are characteristic of ac-

Figure 10.2. Illustration of the effect of a 1-norm penalty in the two dimensional case. The
interior of the diamond region in the top figure is the set of points for which ||β||1 ≤ t,
whereas the circular region in the lower figure corresponds to ||β||2 ≤ t. The ellipses are
the contours of the LS error function.

Figure 10.3. An ERP waveform consisting of a superposition of Gaussian pulses for a lasso
simulation (amplitude in volts vs. time in samples).

tual recorded EEG signals. We show 50 of the 1000 corrupted waveforms


superimposed in Fig. 10.4.

In this simulation example the objective is to estimate a single uncorrupted


waveform (of the type shown in Fig. 10.3), while still retaining its delay
value, using the observed corrupted signals shown in Fig. 10.4. We first
apply the principal component method as discussed in Ch. 2, Sect. 2.6 to par-
tially denoise the observed ERP signals x. We then apply a lasso technique
on the denoised signal to model each specific waveform.

In this vein, we construct a dictionary matrix of Gaussian pulses, as shown


in Fig. 10.5, with each pulse having its own unique delay value. Here we
assume the width (standard deviation) of the pulses corresponds to those
of the observed signals, as in Fig. 10.4. We model each observed ERP
waveform x_i ∈ R^m as
\[ x_i = D a_i + n_i \qquad (10.9) \]

where D ∈ Rm×n is the dictionary matrix and n is the number of entries


(pulses) in D. As such, every column of D represents a Gaussian pulse with
its unique value of delay. Altogether there are n = 120 pulses (columns)

Figure 10.4. 50 superimposed corrupted waveforms simulating real ERP pulses (amplitude in
microvolts vs. time in samples).

Figure 10.5. A dictionary of Gaussian pulses, each with its own unique delay.

in D. The vector a ∈ Rn is the vector of amplitudes associated with each
pulse. The vector n is the additive noise. A naive approach for constructing
the model Da is then to solve the following optimization problem:

\[ a^{\star} = \arg\min_{a} \|x - Da\|_2^2. \qquad (10.10) \]
The model Da^{\star} therefore represents the linear combination of pulses from
D that best fit the observation x. The problem with this naive approach is
that there is no penalty on the complexity of the model (which in this case
is the number of non-zero elements in a? ). In the presence of noise, a very
large number of dictionary elements will be selected to minimize (10.10).
What happens is that the approach in (10.10) will choose enough elements
from D to fit the noise as well as the signal, with the result that the model
will not suppress the noise. On the other hand, the parsimonious model is
one which is as simple as possible without seriously degrading the expected
fit, and therefore tends to generalize well. We can encourage parsimony in
(10.10) by introducing a lasso penalty:

\[ a^{lasso} = \arg\min_{a} \|x - Da\|_2^2 + \lambda\|a\|_1. \]
This has the effect of significantly reducing the number of non-zero elements
in a, thus simplifying the model. The results of this process are shown in
Fig. 10.6 for a single observation, using the value λ = 0.5. It is seen
that even in the presence of significant noise, the relative error in the lasso
reconstruction is 0.0709, which in most circumstances could be considered as
an acceptable level of error. It is interesting to note that of the 160 values
in a, only 10 values of the optimized solution exceed the threshold value
of 0.03 microvolts. This illustrates the effectiveness of the lasso penalty
in restricting the number of non-zero elements (or close to zero) in the
optimized solution.
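A rough sketch of the dictionary construction and sparse fit described above is given
below. It is a simplified version of the simulation (the pulse width, delay grid, test
waveform and λ are our own illustrative values), and it reuses the lasso_ista routine
sketched earlier in this section:

import numpy as np

m, n_pulses, width = 100, 120, 3.0
t = np.arange(m)

# Dictionary of unit-amplitude Gaussian pulses, one column per delay (cf. Fig. 10.5)
delays = np.linspace(0, m - 1, n_pulses)
D = np.exp(-0.5 * ((t[:, None] - delays[None, :]) / width) ** 2)

# A simulated observation: two pulses plus additive noise
rng = np.random.default_rng(3)
x_clean = 1.0 * D[:, 30] + 0.6 * D[:, 55]
x = x_clean + 0.2 * rng.standard_normal(m)

a_hat = lasso_ista(D, x, lam=1.0)        # sparse amplitude vector (routine sketched above)
x_model = D @ a_hat                      # lasso reconstruction of the waveform

print("pulses retained:", np.count_nonzero(np.abs(a_hat) > 0.03))
print("relative error vs. clean signal:",
      np.linalg.norm(x_model - x_clean) / np.linalg.norm(x_clean))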

Figure 10.6. A lasso reconstruction of an ERP signal. Shown are the original noise–free
ERP signal (green), its noisy version (black), and the lasso reconstruction (red).

10.4 Problems

Chapter 11

Toeplitz Systems

In this chapter we derive two different O(n²) algorithms for solving Toeplitz
systems. We start from the idea of forward and backward linear prediction
of an autoregressive process, which leads to a Toeplitz system of equations if
the process is stationary. These equations are then solved using a recursive
technique, where the dimension of the system is increased until the desired
size is obtained.

This analysis leads to several new ways of looking at Toeplitz systems. We


develop simple expressions for the determinant and the inverse of Toeplitz
matrices. Also, a very interesting method of orthonormalizing an AR pro-
cess, based on the lattice structure, is presented.

11.1 Toeplitz Systems [3]

In this section we study the solution of Toeplitz1 systems of equations.


Specifically, we focus attention on the normal equations which yield esti-
mates of the coefficients of a stationary autoregressive (AR) process. These
normal equations are in the form of a Toeplitz system of equations.
1
A Toeplitz matrix is one where all elements on a diagonal are equal. This is a
useful and significant form of matrix because covariance matrices of stationary signals are
Toeplitz.

The ostensible objective of this section is to exploit the structure of a
Toeplitz system to develop a fast O(n2 ) technique for computing its so-
lution. This compares with O(n3 ) complexity when Gaussian elimination
or Cholesky methods are used to solve the same system. However, it turns
out this fast algorithm is not the only dividend we receive in pursuing these
studies. In developing the Toeplitz solution, we are also led to new insights
and very useful and interesting interpretations of AR systems, such as lattice
filter structures, and other special techniques for signal processing. These
structures lead to very powerful methods for adaptive filtering applications.

We first study AR processes in some detail. We then discuss the Levinson-


Durbin recursion (LDR) for solving Toeplitz systems. Then, we discuss
several further insights and developments which derive from the LDR, such
as lattice filters, and inverse and determinant expressions.

11.1.1 Autoregressive Processes

The analysis of this section is an extension of Example 2 of Chapter 7, which


we briefly review here. Here, we look at autoregressive processes and their
relationship to Toeplitz systems.

Many random processes, either man-made or naturally occurring, are actu-


ally AR processes, or can at least be closely approximated by them. Ex-
amples are voice and video signals, sinewaves-plus-noise, signals in control
systems, signals induced by earthquakes, etc. Modelling these signals as
AR processes leads to useful processing techniques such as signal compres-
sion, parameter estimation, identification, spectral estimation, etc. A very
successful form of speech and video coding, referred to as linear predictive
coding, is based on the idea that a human voice signal can be successfully
modelled as an autoregressive process. Thus, being able to analyze and
model AR processes is a very important signal processing tool, and finds
diverse application in many fields of engineering.

There are also moving average (MA) processes. These are the output of an
all-zero filter in response to a white-noise input. There are also autoregressive-
moving average (ARMA) processes which are the output of a filter with both
poles and zeros, in response to a white noise input. MA and ARMA pro-
cesses are not directly considered in this lecture. The interested reader is

referred to [29] for a deeper examination of these subjects.

Consider an all-pole filter driven by a white noise process w(n) with output
x(n). We define the denominator polynomial H(z) of the filter transfer
function as
\[ H(z) = 1 - \sum_{k=1}^{K} h_k z^{-k}. \qquad (11.1) \]
Thus, the filter transfer function is 1/H(z). Taking z-transforms of the input-
output relationships we have:
\[ X(z) = \frac{1}{H(z)} W(z) \quad \text{or} \quad X(z)H(z) = W(z). \qquad (11.2) \]

Converting the above relationship back into the time domain, and realizing
that multiplication in the z-domain is convolution in time, we have using
(11.1)
\[ x(n) - \sum_{k=1}^{K} h(k)\, x(n-k) = w(n) \qquad (11.3) \]
or
\[ x(n) = \sum_{k=1}^{K} h(k)\, x(n-k) + w(n). \qquad (11.4) \]

This equation offers a very useful interpretation in that the present output
x(n) is predictable from a linear combination of past outputs within an
error w(n). This property is derived directly as a consequence of the all-
pole characteristic of the filter.

The sequence x(n) defined in this manner is referred to as an autoregressive


process. The quantity K is referred to as the order of the process, whereas
the quantities h(·) are referred to as the coefficients of the AR process.
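A minimal sketch of generating a synthetic AR process directly from the recursion
(11.4) is given below (our own illustration; the order, coefficients and noise level
are arbitrary, and the chosen coefficients correspond to poles inside the unit circle
so that the process is stationary):

import numpy as np

def generate_ar(h, N, noise_std=1.0, seed=0):
    """Generate N samples of an AR process x(n) = sum_k h[k] x(n-k) + w(n)."""
    rng = np.random.default_rng(seed)
    K = len(h)
    x = np.zeros(N + K)                     # zero initial conditions
    w = noise_std * rng.standard_normal(N + K)
    for n in range(K, N + K):
        # most recent past sample first, matching the ordering of h
        x[n] = np.dot(h, x[n - K:n][::-1]) + w[n]
    return x[K:]

# Example: a 2nd-order AR process
h = np.array([1.2, -0.6])
x = generate_ar(h, N=1000)
print(x[:5])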

If w(n) is small in comparison to the first sum term on the right of (11.4)
most of the time, then the predicted value x̂(n) of x(n), given from (11.4)
as
\[ \hat{x}(n) = \sum_{k=1}^{K} h(k)\, x(n-k), \qquad (11.5) \]

is accurate and so we can say the system is well modelled by (11.4). We

Figure 11.1. The prediction–error filter configuration.

define the prediction polynomial P(z) as
\[ P(z) = \sum_{k=1}^{K} h(k)\, z^{-k}. \qquad (11.6) \]
Using this form we can define a prediction error filter (PEF) by re–arranging
(11.4) as
\[ w(n) = x(n) - \hat{x}(n) = x(n) - \sum_{k=1}^{K} h(k)\, x(n-k), \]

which in the z–domain becomes W(z) = (1 − P(z))X(z), as shown in Fig.
11.1.² To model the system as accurately as possible, we choose the coeffi-
cients h(k) so that the prediction error variance ||w||22 is minimized, where
w is the vector containing all relevant values of w(n). More on this later.

An AR process is completely characterized by the coefficients hk . In the


following section we determine these coefficients by solving the respective
normal equations. We show that these normal equations have a Toeplitz
structure that can be exploited to yield a computationally efficient solution.

²With the PEF filter, the input is an autoregressive process x(n) and the output is a
white noise process w(n). The transfer function of the PEF is W(z)/X(z) = 1 − P(z). On the
other hand, the input to the AR generating filter is w(n) and the output is x(n), as shown
in Fig. 11.2. The AR transfer function can be written in the form 1/(1 − P(z)). Therefore the
AR generating filter and the PEF are inverses of each other.

Figure 11.2. The AR generating filter configuration, which is the inverse of the PEF con-
figuration.

The Forward Prediction Error Equations

Here, we assume we have an observation of length N samples of a real,


stationary, ergodic AR sequence. For generalization to the nonstationary or
complex case, see [3, 29]. Rewriting (11.4) we have

\[ x(n) = \sum_{k=1}^{K} h_k\, x(n-k) + w(n), \quad n = K, \ldots, N, \quad N \gg K. \qquad (11.7) \]

The above may be expressed in matrix form as

\[ x_p = Xh + w \qquad (11.8) \]
where
\[
x_p = \begin{bmatrix} x_{K+1} \\ x_{K+2} \\ \vdots \\ x_N \end{bmatrix}, \quad
w = \begin{bmatrix} w_K \\ w_{K+1} \\ \vdots \\ w_N \end{bmatrix}, \quad
X = \begin{bmatrix} x_K & \cdots & x_1 \\ x_{K+1} & \cdots & x_2 \\ \vdots & & \vdots \\ x_{N-1} & \cdots & x_{N-K} \end{bmatrix}, \quad
h = \begin{bmatrix} h_1 \\ \vdots \\ h_K \end{bmatrix}.
\]

We see that (11.8) is a regression equation, where the variables x are re-
gressed onto themselves. This is the origin of the term “autoregressive”.
As discussed, we choose the coefficients h to minimize the prediction error

power. The coefficients h_{LS} found in such a manner minimize ||x_p − Xh||_2^2
and are therefore given as the solution to the normal equations:
\[ X^T X h_{LS} = X^T x_p. \qquad (11.9) \]
Taking expectations in (11.9) we get:
\[ E(X^T X)\, h_{LS} = E(X^T x_p). \qquad (11.10) \]

If the sequence x(n) is stationary and ergodic, the matrix E(X^T X) becomes
\[
E(X^T X) = \begin{bmatrix}
r_0 & r_{-1} & r_{-2} & \cdots & r_{-K+1} \\
r_1 & r_0 & r_{-1} & & \\
r_2 & r_1 & r_0 & \ddots & \\
\vdots & & \ddots & \ddots & \\
r_{K-1} & & & r_1 & r_0
\end{bmatrix} = R \qquad (11.11)
\]
where r_i = E(x_{n+i} x_n) is the autocorrelation function of x at lag i, and R is
the covariance matrix of x(n).

Likewise, from E(X^T x_p) in (11.10),
\[
E(X^T x_p) = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_K \end{bmatrix} \stackrel{\Delta}{=} r_p. \qquad (11.12)
\]

Equation (11.10) is therefore represented as

\[ R h_{LS} = r_p. \qquad (11.13) \]

Eq. (11.13) is the expectation of the normal equations used to determine the
coefficients of a stationary AR process. The finite–sample version of (11.13)
is referred to as the Yule–Walker equations. It is apparent from (11.11)
that (11.13) is a Toeplitz symmetric system of equations. We describe an
efficient O(n2 ) method of computing the solution to (11.13). But in the
process of developing this solution, we also uncover a great deal about the
underlying structure of AR processes.

Since (11.13) involves expectations, it only holds for the ideal case when
an infinite amount of data is available to form the covariance matrix R of

{x}. In the practical case where we have finite N , the normal equations
corresponding to (11.13) are not exactly Toeplitz. However, in the following
treatment, we still treat the finite case as if it were exactly Toeplitz. While
this form of treatment does not necessarily minimize the prediction error for
the finite case, it imposes an asymptotic structure to the finite N solution
which tends to produce more stable results.
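For reference, a direct O(K³) solution of (11.13) from sample autocorrelation
estimates can be sketched as follows (our own illustration; generate_ar is the AR
simulator sketched earlier in this chapter). The fast recursion developed next
produces the same solution with O(K²) operations:

import numpy as np

def sample_autocorr(x, max_lag):
    """Biased sample autocorrelation estimates r_0, ..., r_max_lag."""
    N = len(x)
    return np.array([np.dot(x[:N - i], x[i:]) / N for i in range(max_lag + 1)])

def yule_walker_direct(x, K):
    """Estimate AR(K) coefficients by solving R h = r_p, eq. (11.13), directly."""
    r = sample_autocorr(x, K)
    R = np.array([[r[abs(i - j)] for j in range(K)] for i in range(K)])  # Toeplitz
    r_p = r[1:K + 1]
    h = np.linalg.solve(R, r_p)
    sigma2 = r[0] - r_p @ h                 # prediction error power, eq. (11.14)
    return h, sigma2

h_hat, sigma2 = yule_walker_direct(generate_ar(np.array([1.2, -0.6]), N=5000), K=2)
print(h_hat, sigma2)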

Before discussing how to solve (11.13) in an efficient way, we need an ex-


pression for the variance σ 2 of the noise term w(n). We see that σ 2 is the
minimum of the quantity ||w||22 from (11.8), obtained for h = hLS .

We have from the regression equation (11.8)
\[
\sigma^2 = E(w^T w) = E(x_p - Xh_{LS})^T(x_p - Xh_{LS})
= E\left(x_p^T x_p - x_p^T X h_{LS} - h_{LS}^T X^T x_p + h_{LS}^T X^T X h_{LS}\right).
\]
Substituting (11.13) into the above, where
\[ E(X^T x_p) = r_p = R h_{LS}, \]
and using the fact that E(x_p^T x_p) = r_0, we get
\[
\sigma^2 = r_0 - r_p^T h_{LS} - h_{LS}^T r_p + h_{LS}^T r_p = r_0 - r_p^T h_{LS}. \qquad (11.14)
\]

We can combine (11.13) and (11.14) together into one matrix equation as
follows:
\[
\begin{bmatrix}
r_0 & r_{-1} & \cdots & r_{-K} \\
r_1 & r_0 & \cdots & r_{-K+1} \\
\vdots & & \ddots & \vdots \\
r_K & \cdots & r_1 & r_0
\end{bmatrix}
\begin{bmatrix} 1 \\ -h_1 \\ \vdots \\ -h_K \end{bmatrix}
=
\begin{bmatrix} \sigma^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\qquad (11.15)
\]
Here the first row is given by (11.14), and the remaining rows are given by (11.13),
i.e., by r_p − R h_{LS} = 0.
These are called the forward prediction-error equations. They are developed
directly from (11.4) where we have predicted x(n) in a forward direction
from a linear combination of past values.

The Backward Prediction Equations

The forward prediction analysis just developed is varied slightly to develop


the idea of backward prediction. Prediction in the backward direction may
seem to be a rather odd concept at first, since prediction into the past does
not seem to have much intuitive relevance. But we see later it is a crit-
ical idea in the development of the Levinson–Durbin algorithm to follow.
Mathematically, the idea of “predicting” a previous sample from a linear
combination of future values (backward prediction) has just as much rele-
vance as predicting a future value from a linear combination of past values
(forward prediction).

The idea of backward prediction can be explained by reversing the ordering


of the input sequence x(n) and applying it to the same prediction filter we
considered earlier. In this case, we are effectively predicting past values of
x(n) from its future values. The mathematical description of the backward
prediction operation is
\[ x(n-K) = \sum_{k=1}^{K} h(k)\, x(n-K+k) + w(n-K), \quad n = N, \ldots, K+1. \qquad (11.16) \]

As we see later, it is more convenient in the development of the Levinson–


Durbin algorithm to have the input sequences for both types of prediction
arranged in the same time ordering. To accomplish this, we reverse the
order of the input sequence again back to its original ascending order, but
keep the same relative ordering between x(n) and the prediction coefficients
hk . It is easily verified that the modified backward prediction equation is
given by
\[ x(n-K) = \sum_{k=1}^{K} h(k)\, x(n-K+k) + w(n-K), \quad n = K+1, \ldots, N, \qquad (11.17) \]

which, for a given index n is seen to be identical to (11.16).

We can group the equations corresponding to each value of the index n


in (11.17) into a backwards regression equation analogous to (11.8). This
leads directly to a backward set of normal equations analogous to (11.13).
Furthermore, we can augment these backward normal equations with the
backward prediction error variance using an expression analogous to (11.14).

The resulting backward prediction error equations are given as
\[
\begin{bmatrix}
r_0 & \cdots & r_{-K} \\
\vdots & \ddots & \vdots \\
r_K & \cdots & r_0
\end{bmatrix}
\begin{bmatrix} -h_K \\ \vdots \\ -h_1 \\ 1 \end{bmatrix}
=
\begin{bmatrix} 0 \\ \vdots \\ 0 \\ \sigma^2 \end{bmatrix}.
\qquad (11.18)
\]

By noting the matrix R is symmetric, we see that the equation corresponding


to the first row of (11.15) is identical to that corresponding to the bottom
row of (11.18). In general, the equation corresponding to ith row of (11.15)
is identical to that of the (K −i+1)th row of (11.18). Hence, we see that the
forward prediction coefficients hLS in (11.15) are identical to the backward
coefficients of (11.18). If the data are complex, the forward and backward
coefficients are complex conjugates of each other.

Equations (11.15) and (11.18) may be solved jointly using the Levinson–
Durbin recursion, which requires only O(n2 ) flops, and is explained as fol-
lows.

11.1.2 The Levinson-Durbin Recursion (LDR)

The idea of the Levinson-Durbin recursion is to start with a simple 1 × 1


system of equations. Then by induction we use that result to solve a 2 × 2
system, and recursively iterate until the solution to a K × K system is
obtained. It is assumed the value K, which is the order of the AR process
or the number of poles in the all-pole generating filter of Fig. 11.2, is known
beforehand.

We use the index m, m ∈ [1, . . . , K] to denote the stage of the iteration. At


the (m − 1)th stage of the algorithm, we have the (m − 1)th-order forward
prediction-error equations from (11.15):
\[
\begin{bmatrix}
r_0 & \cdots & r_{-m+1} \\
\vdots & \ddots & \vdots \\
r_{m-1} & \cdots & r_0
\end{bmatrix}
\begin{bmatrix} 1 \\ -h_1^{(m-1)} \\ \vdots \\ -h_{m-1}^{(m-1)} \end{bmatrix}
=
\begin{bmatrix} \sigma_{(m-1)}^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\qquad (11.19)
\]

where σ_{(m-1)}^2 is the prediction error power at stage m − 1. The backward
equations are written from (11.18) as
\[
\begin{bmatrix}
r_0 & \cdots & r_{-m+1} \\
\vdots & \ddots & \vdots \\
r_{m-1} & \cdots & r_0
\end{bmatrix}
\begin{bmatrix} -h_{m-1}^{(m-1)} \\ \vdots \\ -h_1^{(m-1)} \\ 1 \end{bmatrix}
=
\begin{bmatrix} 0 \\ \vdots \\ 0 \\ \sigma_{(m-1)}^2 \end{bmatrix}.
\qquad (11.20)
\]

The Toeplitz structure of equations (11.19) and (11.20) leads to an efficient


O(n2 ) algorithm for solving the system of equations. To develop the al-
gorithm, we assume by induction that the mth–order coefficients may be
written in terms of the (m − 1)th-order coefficients as
\[
\begin{bmatrix} 1 \\ -h_1^{(m)} \\ \vdots \\ -h_{m-1}^{(m)} \\ -h_m^{(m)} \end{bmatrix}
=
\begin{bmatrix} 1 \\ -h_1^{(m-1)} \\ \vdots \\ -h_{m-1}^{(m-1)} \\ 0 \end{bmatrix}
+ \rho_m
\begin{bmatrix} 0 \\ -h_{m-1}^{(m-1)} \\ \vdots \\ -h_1^{(m-1)} \\ 1 \end{bmatrix}
\qquad (11.21)
\]

where the parameter ρm is to be determined and has special significance.


Note that the true mth–order forward prediction-error equations can be
written in the form:
\[
\begin{bmatrix}
r_0 & \cdots & r_{-m} \\
\vdots & \ddots & \vdots \\
r_m & \cdots & r_0
\end{bmatrix}
\begin{bmatrix} 1 \\ -h_1^{(m)} \\ \vdots \\ -h_m^{(m)} \end{bmatrix}
=
\begin{bmatrix} \sigma_{(m)}^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\qquad (11.22)
\]

By substituting (11.21) into (11.22), and comparing the result to (11.19)


and (11.20), we have
\[
\begin{bmatrix}
r_0 & \cdots & r_{-m} \\
\vdots & \ddots & \vdots \\
r_m & \cdots & r_0
\end{bmatrix}
\left\{
\begin{bmatrix} 1 \\ -h_1^{(m-1)} \\ \vdots \\ -h_{m-1}^{(m-1)} \\ 0 \end{bmatrix}
+ \rho_m
\begin{bmatrix} 0 \\ -h_{m-1}^{(m-1)} \\ \vdots \\ -h_1^{(m-1)} \\ 1 \end{bmatrix}
\right\}
=
\left\{
\begin{bmatrix} \sigma_{(m-1)}^2 \\ 0 \\ \vdots \\ 0 \\ \Delta^{(m-1)} \end{bmatrix}
+ \rho_m
\begin{bmatrix} \Delta^{(m-1)} \\ 0 \\ \vdots \\ 0 \\ \sigma_{(m-1)}^2 \end{bmatrix}
\right\}
\qquad (11.23)
\]

Equation (11.23) is a combination of both equations (11.19) and (11.20): the


corresponding terms on the left in the brace-brackets represent the forward
prediction error equations (11.19) (where R has been augmented in order
by 1), and the terms on the right in the brace-brackets represent (11.20) in
an equivalent way.

Looking only at the forward portion of (11.23), because of the zero at the
bottom of the vector of unknowns on the left of the equals sign, the first
m−1 equations of (11.23) are identical to those of (11.19). Only the last row
of the mth-order system is different; because this equation does not occur in
the (m − 1)th–order system, we denote the right-hand side of this equation
as the special quantity ∆^{(m−1)}, which is defined from (11.23) as
\[ \Delta^{(m-1)} = -\sum_{i=1}^{m} r_i\, h^{(m-1)}(m-i). \qquad (11.24) \]


In the above, h(0) = 1. We see that the backward portion of (11.23) is
analogous to the forward part, except everything is reversed top-to-bottom.

Comparing coefficients between (11.22) and (11.23), we have
\[ h_i^{(m)} = h_i^{(m-1)} + \rho_m h_{m-i}^{(m-1)}, \quad i = 1, \ldots, m-1 \qquad (11.25) \]
\[ h_m^{(m)} = -\rho_m. \qquad (11.26) \]

Notice that (11.25) gives the mth–order coefficients in terms of the (m−1)th–
order coefficients and the quantity ρm . Thus, once ρm is determined, we can
complete the iteration from the (m − 1)th to the mth stage. To determine

ρ_m, we compare the right-hand sides of (11.22) and (11.23), to obtain
\[ \sigma_{(m-1)}^2 + \rho_m \Delta^{(m-1)} = \sigma_{(m)}^2 \qquad (11.27) \]
\[ \Delta^{(m-1)} + \rho_m \sigma_{(m-1)}^2 = 0. \qquad (11.28) \]
The quantity ∆^{(m−1)} may be eliminated between these equations to give
\[ \sigma_{(m)}^2 = \sigma_{(m-1)}^2\left(1 - \rho_m^2\right). \qquad (11.29) \]
Using this form in (11.27) we obtain
\[ \rho_m = -\frac{\Delta^{(m-1)}}{\sigma_{(m-1)}^2}. \qquad (11.30) \]

Equations (11.30), (11.29), (11.25) and (11.26) define the recursion from the
(m − 1)th to the mth step. The induction process is complete by noting that
∆^{(0)} = r_1, h_0 = 1, and σ_0^2 = r_0.

We now summarize the LDR. Starting with the above initial conditions, and
m = 1, we proceed as follows:

1. calculate ρm from (11.30).

2. The mth–order prediction coefficients are computed from the (m − 1)th–order
coefficients using (11.25) and (11.26).

3. ∆^{(m)} is then calculated from (11.24).

4. σ_{(m)}^2 is given from (11.29).

5. If m < K, increment m and repeat.

The description of the basic Levinson-Durbin recursion for solving a Toeplitz


system of equations is now complete. The most computationally intensive
steps are the calculation of ∆^{(m)} and the prediction coefficients. Each of
these requires m multiply/accumulates at each stage over K stages. Therefore
the total number of multiply/accumulates required for these two evaluations
is 2\sum_{m=1}^{K} m \approx K^2. This number compares very favourably with
K^3/3 for Gaussian elimination.
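A minimal implementation of the recursion just summarized is sketched below (our own
code; it takes the autocorrelation values r_0, . . . , r_K as input and returns the
Kth–order prediction coefficients, the reflection coefficients ρ_m, and the final
prediction error power):

import numpy as np

def levinson_durbin(r, K):
    """Solve the Toeplitz system R h = r_p of (11.13) by the Levinson-Durbin
    recursion.  r = [r_0, r_1, ..., r_K] are autocorrelation values."""
    h = np.zeros(K)               # prediction coefficients h_1..h_K (order grows with m)
    rho = np.zeros(K)             # reflection (partial correlation) coefficients
    sigma2 = r[0]                 # sigma^2_(0) = r_0
    for m in range(1, K + 1):
        # Delta^(m-1): the extra term generated in the last row of (11.23)
        delta = r[m] - np.dot(h[:m - 1], r[m - 1:0:-1])
        rho_m = -delta / sigma2                    # eq. (11.30)
        h_prev = h[:m - 1].copy()
        h[:m - 1] = h_prev + rho_m * h_prev[::-1]  # eq. (11.25)
        h[m - 1] = -rho_m                          # eq. (11.26)
        sigma2 *= (1.0 - rho_m ** 2)               # eq. (11.29)
        rho[m - 1] = rho_m
    return h, rho, sigma2

# Example (reusing the sample_autocorr and generate_ar sketches given earlier):
# r = sample_autocorr(generate_ar(np.array([1.2, -0.6]), N=5000), max_lag=2)
# print(levinson_durbin(r, K=2))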

11.1.3 Further Analysis on Toeplitz Systems

There are many significant repercussions which result from this previous
analysis. In the following, we present several aspects of Toeplitz system
analysis as it relates to the field of signal processing.

Prediction Error Power

First, it is interesting to examine the behaviour of the prediction-error power
σ_{(m)}^2 vs. the prediction order m. We may gain insight into this behaviour
by looking at equation (11.4), reproduced here for convenience:
\[ x(n) = \sum_{k=1}^{K} h(k)\, x(n-k) + w(n). \]

The quantity K is the true order of the system, whereas m is an index which
iterates as m = 1, . . . , K, according to the LDR. Thus, at the mth stage of
the LDR, equation (11.4) is effectively replaced by

\[ x(n) = \sum_{k=1}^{m} h(k)\, x(n-k) + w(n). \qquad (11.31) \]

The quantity σ_{(m)}^2 is the power of the noise term w(n) at the mth stage; i.e.,
σ_{(m)}^2 = E(w^2(n)). Thus, with the initial value m = 1, only one past value of
x is used to predict the present value, when in fact K past values are required
to predict as accurately as possible. Thus, for m = 1, the prediction process
indicated by (11.31) is not very accurate and the resulting noise power σ_{(1)}^2
is large. As m increases, more terms are used in the prediction, hence
σ_{(m)}^2 diminishes with m until m = K. At this stage, the model (11.31) is
accurate, and σ_{(K)}^2 is reduced to its minimum possible value. At this stage,
the sequence w(n) is white, since it was assumed x(n) was generated by
applying white noise to a Kth-order all-pole filter. For the value m = K,
all possible predictive capability of equation (11.31) has been exploited. In
an expected sense, there is no further reduction in σ_{(m)}^2 for m > K.

The Partial Correlation Coefficients ρm

The quantities ρm are significant. They are referred to as the partial corre-
lation coefficients, or by analogy of (11.29) to power reflected from a load
on a transmission line, they are sometimes referred to as the reflection coef-
ficients. From (11.29), ρm indicates the reduction in prediction error power
in going from the (m − 1)th to the mth stage. In accordance with previous
discussion, ρm = 0 for m > K.

It is interesting to note that the Kth-order covariance matrix R in (11.13)
and (11.15) has a determinant given by [29]
\[
\det R = r_0 \prod_{m=1}^{K} \left(1 - |\rho_m|^2\right)
       = r_0 \prod_{m=1}^{K} \sigma_{(m)}^2 \qquad (11.32)
\]
where r_0 is the autocorrelation at lag 0. Later, at the end of this section,
we prove (11.32).

From (11.32), we see that if |ρm | > 1 for any m, then det R could be less than
zero. However, we know that covariance matrices are positive semi-definite
and must have determinants greater than or equal to zero. Furthermore, it
may be shown that if |ρm | > 1, then some of the poles of the all-pole filter
which generates the observed AR process are outside the unit circle. This of
course will lead to instability and a non-stationary process whose covariance
matrix does not exist. Therefore we must have |ρm | ≤ 1 for m = 1, . . . , K.
However, with a finite sample of data, it is not guaranteed that the LDR
will always yield values ρm such that |ρm | ≤ 1.

Therefore it is desirable to modify the LDR in such a way that |ρm | ≤ 1.


This was accomplished by J.P. Burg [30] in 1979. Before discussing the
Burg algorithm, we bring out one more point associated with (11.32). At
any stage j, for j = 1, . . . , K, if |ρ_j| = 1, then according to (11.29), σ_{(j)}^2 = 0.
This means that the process is perfectly predictable. Therefore, according to
(11.32), the determinant of the covariance matrix of a perfectly predictable
AR process is zero.

11.1.4 The Burg Recursion

Here we look at the forward and backward prediction errors in further detail.
The forward prediction errors (i.e., the output of the forward PEF) at the
mth stage, referred to as w_{f,m}(n), can be inferred through (11.7) as:
\[
w_{f,m}(n) = x(n) - \sum_{k=1}^{m} h^{(m)}(k)\, x(n-k)
           = \sum_{k=0}^{m} a^{(m)}(k)\, x(n-k), \qquad m = 1, \ldots, K \qquad (11.33)
\]
where
\[
a(k) = \begin{cases} 1 & k = 0 \\ -h(k) & k = 1, \ldots, m. \end{cases} \qquad (11.34)
\]

Likewise, the backward prediction errors at the mth stage, which are the
outputs of the backward PEF, denoted wb,m (n), can be inferred through
(11.17):
\[ w_{b,m}(n) = \sum_{k=0}^{m} a_{m-k}^{(m)}\, x(n-k). \qquad (11.35) \]

For ease of notation, we define the forward prediction error power at the
mth stage as P_{f,m} (formerly σ_{(m)}^2) and the backward prediction error power
as P_{b,m}. Burg's idea is, for each m = 1, . . . , K, to choose ρ_m in such a way
that the quantity
\[ P = \frac{1}{2}\left(P_{f,m} + P_{b,m}\right) \qquad (11.36) \]
2
is minimized. In effect, the Burg algorithm performs a sequence of K
one-dimensional minimizations of P in solving for the AR coefficients. In
contrast, the LDR or Yule–Walker equations essentially perform one K–
dimensional minimization to obtain the same quantities. In view of these
differing philosophies, the results obtained from the two methods are not
expected to be identical.

To accomplish this, we must express Pf,m and Pb,m in terms of ρm . This may
be done by first developing new expressions for the forward and backward
prediction errors wf,m (n) and wb,m (n) in terms of ρm .

Substituting (11.34) into (11.25) and (11.26), we have
\[
a_i^{(m)} = \begin{cases} a_i^{(m-1)} + \rho_m a_{m-i}^{(m-1)}, & i = 1, \ldots, m-1 \\ 0, & i > m \end{cases} \qquad (11.37)
\]
\[ a_m^{(m)} = \rho_m \qquad (11.38) \]
\[ a_0^{(m)} = 1. \qquad (11.39) \]

Then, substituting (11.37)–(11.39) into (11.33), we have
\[
w_{f,m}(n) = \sum_{k=0}^{m} a_k^{(m-1)} x(n-k) + \rho_m \sum_{k=0}^{m} a_{m-k}^{(m-1)} x(n-k)
           = \sum_{k=0}^{m-1} a_k^{(m-1)} x(n-k) + \rho_m \sum_{k=1}^{m} a_{m-k}^{(m-1)} x(n-k) \qquad (11.40)
\]
where in the last line we have used the fact that a_m^{(m-1)} = 0. The first term
on the right of (11.40) is recognized to be w_{f,m-1}(n). By substituting k for
k − 1 in the second term, we realize the summation is equal to the (m − 1)th
order backward prediction error but delayed by one unit. Therefore,
\[ w_{f,m}(n) = w_{f,m-1}(n) + \rho_m w_{b,m-1}(n-1). \qquad (11.41) \]

We now perform a similar operation on the backward prediction errors. By


substituting k for m − k in (11.35), we have
\[ w_{b,m}(n) = \sum_{k=0}^{m} a_k^{(m)}\, x(n-m+k). \qquad (11.42) \]
Next, we substitute (11.37) into (11.42) to obtain
\[
w_{b,m}(n) = \sum_{k=0}^{m} a_k^{(m-1)} x(n-m+k) + \rho_m \sum_{k=0}^{m} a_{m-k}^{(m-1)} x(n-m+k)
           = \sum_{k=0}^{m-1} a_k^{(m-1)} x(n-m+k) + \rho_m \sum_{k=1}^{m} a_{m-k}^{(m-1)} x(n-m+k). \qquad (11.43)
\]

By substituting k for m − k − 1 in the first term on the right of (11.43)


and comparing the result to (11.35), we realize that we obtain the backward
prediction error of order m − 1, delayed by one time unit, wb,m−1 (n − 1).

Also, by substituting k for m − k in the second summation term, we get the
forward prediction error wf,m−1 (n). Therefore (11.43) may be written as
\[ w_{b,m}(n) = w_{b,m-1}(n-1) + \rho_m w_{f,m-1}(n). \qquad (11.44) \]

Equations (11.41) and (11.44) are the desired expressions for w_{f,m}(n) and
w_{b,m}(n) in terms of the coefficient ρ_m. Figure 11.3 shows how the forward
and backward prediction errors at order m are formed from those at order
(m − 1). We now return to the discussion on choosing ρ_m to minimize (11.36).

By definition,
\[ P_{f,m} = \frac{1}{N-m}\sum_{n=m+1}^{N}\left(w_{f,m}(n)\right)^2 \qquad (11.45) \]
and
\[ P_{b,m} = \frac{1}{N-m}\sum_{n=1}^{N-m+1}\left(w_{b,m}(n)\right)^2 \qquad (11.46) \]
where N is the length of the original observation x(n). Differentiating
(11.45) with respect to ρ_m, we get
\[
\frac{\partial P_{f,m}}{\partial \rho_m}
= \frac{2}{N-m}\sum_{n} w_{f,m}(n)\cdot\frac{\partial w_{f,m}(n)}{\partial \rho_m} \qquad (11.47)
\]
\[
= \frac{2}{N-m}\sum_{n} w_{f,m}(n)\cdot w_{b,m-1}(n-1) \qquad (11.48)
\]
where (11.41) was used to evaluate the derivative in (11.47). Substituting
(11.41) into (11.48) to express all quantities at the (m − 1)th order, we have
\[
\frac{\partial P_{f,m}}{\partial \rho_m}
= \frac{2}{N-m}\sum_{n}\left(w_{f,m-1}(n) + \rho_m w_{b,m-1}(n-1)\right) w_{b,m-1}(n-1)
\]
\[
= \frac{2}{N-m}\sum_{n} w_{f,m-1}(n)\cdot w_{b,m-1}(n-1)
+ \rho_m \frac{2}{N-m}\sum_{n}\left(w_{b,m-1}(n-1)\right)^2. \qquad (11.49)
\]
In a similar way, we can determine ∂P_{b,m}/∂ρ_m as
\[
\frac{\partial P_{b,m}}{\partial \rho_m}
= \frac{2}{N-m}\sum_{n} w_{f,m-1}(n)\cdot w_{b,m-1}(n-1)
+ \rho_m \frac{2}{N-m}\sum_{n}\left(w_{f,m-1}(n)\right)^2. \qquad (11.50)
\]
Substituting (11.49) and (11.50) into (11.36), and setting the result to zero,
we have the final desired expression for ρ_m:
\[
\rho_m = \frac{-2\sum_{n=m-1}^{N} w_{f,m-1}(n)\, w_{b,m-1}(n-1)}
{\sum_{n=m-1}^{N}\left(w_{f,m-1}(n)\right)^2 + \sum_{n=m-1}^{N}\left(w_{b,m-1}(n-1)\right)^2}. \qquad (11.51)
\]

Notice that this expression for ρm at the mth stage is a function only of the
prediction errors at stage m − 1. Hence, the quantity ρm may be calculated
using only the signals available at stage m − 1. The mth-order coefficients
may be immediately determined from (11.37)-(11.39) once ρm is known.

The denominator of (11.51) with the factor of 2 on the numerator is the


average of the mean-squared values of the forward and backward prediction
error sequences at the mth stage. The numerator is the cross-correlation of
the forward prediction error sequence and the backward sequence delayed
by one time unit. The denominator normalizes this cross-correlation. Thus,
the quantity ρm is a form of normalized cross-correlation quantity.

The Burg iteration procedure may be expressed as follows. Given a se-


quence x(n), n = 1, . . . , N , we may determine its prediction coefficients
by performing the following procedure:

Initialize:
   m = 0
   a_0^{(m)} = 1
   {w_{f,m}} = {x}
   {w_{b,m}} = {x}
   P_{f,m} = r_0
   m = 1
Iterate for m = 1, . . . , K:
   determine ρ_m from (11.51)
   calculate w_{f,m}(n) and w_{b,m}(n) from (11.41) and (11.44) respectively
   calculate the mth-order coefficients from (11.37)–(11.39)
   if desired, calculate P_{f,m} from (11.29)
end.

As mentioned previously, this new procedure has the advantage that |ρ_m| ≤
1 for any reasonable sample size of data. To prove this point, consider the
matrix
\[
W = \begin{bmatrix} w_{f,m} & w_{b,m} \\ w_{b,m} & w_{f,m} \end{bmatrix} \qquad (11.52)
\]
where w_{f,m} = [w_{f,m}(m+1), . . . , w_{f,m}(N)]^T, and w_{b,m} = [w_{b,m}(m), . . . , w_{b,m}(N−1)]^T.
Then, the matrix W^T W, which is
\[
W^T W = \begin{bmatrix}
\sum_{n=m+1}^{N}\left(w_{f,m}^2(n) + w_{b,m}^2(n-1)\right) & 2\sum_{n=m+1}^{N} w_{f,m}(n)\,w_{b,m}(n-1) \\
2\sum_{n=m+1}^{N} w_{f,m}(n)\,w_{b,m}(n-1) & \sum_{n=m+1}^{N}\left(w_{f,m}^2(n) + w_{b,m}^2(n-1)\right)
\end{bmatrix}, \qquad (11.53)
\]
is positive semi-definite. Therefore its determinant, which is the product of
the eigenvalues, is non-negative; hence,
\[
2\sum_{n=m+1}^{N} w_{f,m}(n)\,w_{b,m}(n-1) \le \sum_{n=m+1}^{N}\left(w_{f,m}^2(n) + w_{b,m}^2(n-1)\right). \qquad (11.54)
\]

Therefore, by comparing (11.54) with (11.51), we find that |ρm | ≤ 1 as


desired.
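A compact sketch of the Burg recursion is given below (our own implementation,
following eqs. (11.41), (11.44), (11.51) and (11.37)–(11.39); it returns the
prediction coefficients h_k = −a_k and the reflection coefficients ρ_m):

import numpy as np

def burg(x, K):
    """Estimate AR(K) coefficients from data x using Burg's method."""
    x = np.asarray(x, dtype=float)
    wf = x.copy()                 # forward prediction errors, order 0
    wb = x.copy()                 # backward prediction errors, order 0
    a = np.array([1.0])           # a^(0) = [1]
    rho = np.zeros(K)
    for m in range(1, K + 1):
        f = wf[1:]                # w_f,m-1(n)
        b = wb[:-1]               # w_b,m-1(n-1)
        rho_m = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))   # eq. (11.51)
        wf = f + rho_m * b        # eq. (11.41)
        wb = b + rho_m * f        # eq. (11.44)
        v = np.concatenate([a, [0.0]])      # a^(m-1) padded with a zero
        a = v + rho_m * v[::-1]             # eqs. (11.37)-(11.39)
        rho[m - 1] = rho_m
    h = -a[1:]                    # h_k = -a_k
    return h, rho

# Example (reusing the generate_ar sketch from Sect. 11.1.1):
# h_hat, rho = burg(generate_ar(np.array([1.2, -0.6]), N=5000), K=2)
# print(h_hat, rho)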

11.1.5 Lattice Filters

Equations (11.41) and (11.44) lead to a very interesting and useful interpre-
tation of the earlier prediction error filter structure.

The mth-order forward and backward prediction errors can be represented


in terms of the (m − 1)th order prediction errors as shown in Fig. 11.3,
which is derived directly from equations (11.41) and (11.44). By cascading
these sections, for m = 0, . . . , K, and realizing that both the forward and
backward zeroth-order prediction error sequences are the original data se-
quence x(n), we arrive at the structure shown in Fig. 11.4 as an alternative
representation of the PEF structure of Fig. 11.1.

In infinite precision arithmetic, the lattice filter of Fig. 11.4 is identical to


that of Fig. 11.1. However, we have seen |ρ_m| ≤ 1, m = 1, . . . , K, whereas
the |a_k^{(K)}|, k = 1, . . . , K, can be much larger, especially if some of the poles
of the AR generating filter of Fig. 11.2 are near the unit circle. Therefore,

Figure 11.3. Representation of a single stage of a lattice filter, corresponding to eqs. (11.41)
and (11.44).

Figure 11.4. The prediction error filter implemented as a cascaded lattice filter. Each
section is implemented as shown in Fig. 11.3.

for a given number of bits, the ρ_m's can be represented with higher precision
than the a_k^{(K)} coefficients; i.e., they are less sensitive to finite precision effects
and the lattice filter of Fig. 11.4 is generally the preferred structure.
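A block–processing sketch of the cascaded lattice PEF of Fig. 11.4 is shown below (our
own illustration of eqs. (11.41) and (11.44); a sample–by–sample implementation with
explicit delay states, as used in adaptive filtering, is omitted for brevity):

import numpy as np

def lattice_pef(x, rho):
    """Run the cascaded lattice prediction error filter of Fig. 11.4.
    Returns the final forward prediction error sequence (block processing)."""
    wf = np.asarray(x, dtype=float)       # zeroth-order forward errors
    wb = wf.copy()                        # zeroth-order backward errors
    for rho_m in rho:
        f, b = wf[1:], wb[:-1]            # align w_f,m-1(n) with w_b,m-1(n-1)
        wf = f + rho_m * b                # eq. (11.41)
        wb = b + rho_m * f                # eq. (11.44)
    return wf                             # approximately white if rho matches the data

# Example (reusing the earlier sketches):
# x = generate_ar(np.array([1.2, -0.6]), N=2000)
# _, rho = burg(x, K=2)
# print(np.var(lattice_pef(x, rho)))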

11.1.6 Application of AR Analysis to Speech Coding

This is a very important consideration in speech coding. In this application,


an entire phoneme of speech, consisting of roughly 200 – 400 samples, can
be represented with a single set of coefficients ρ_k, or a_k^{(K)}, k = 1, . . . , K. A
typical value of K is about 12. Thus we see that ideally, 200 – 400 samples
can be compressed into about 12 parameters. Further, since the ρ_m's can be
represented using fewer bits than the a_k^{(K)} coefficients, these 12 parameters
can be transmitted with less information using the ρ's instead of the a's.

11.1.7 Toeplitz Factorizations

We now look at some very interesting factorizations of the covariance matrix


R corresponding to an autoregressive process. We rewrite (11.15) here for
convenience:
\[
\begin{bmatrix}
r_0 & r_{-1} & \cdots & r_{-K} \\
r_1 & r_0 & & \\
\vdots & & \ddots & \\
r_K & & & r_0
\end{bmatrix}
\begin{bmatrix} 1 \\ a_1^{(K)} \\ \vdots \\ a_K^{(K)} \end{bmatrix}
=
\begin{bmatrix} P_K \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\qquad (11.55)
\]

We append new columns to each side of (11.55), corresponding to prediction
orders m = K, . . . , 1. We obtain
\[
\begin{bmatrix}
r_0 & r_{-1} & \cdots & r_{-K} \\
r_1 & r_0 & & \\
\vdots & & \ddots & \\
r_K & & & r_0
\end{bmatrix}
\begin{bmatrix}
1 & & & & \\
a_1^{(K)} & 1 & & & \\
a_2^{(K)} & a_1^{(K-1)} & \ddots & & \\
\vdots & \vdots & & 1 & \\
a_K^{(K)} & a_{K-1}^{(K-1)} & \cdots & a_1^{(1)} & 1
\end{bmatrix}
=
\begin{bmatrix}
P_K & & & U & \\
0 & P_{K-1} & & & \\
\vdots & & \ddots & & \\
0 & 0 & & P_1 & \\
0 & 0 & \cdots & 0 & P_0
\end{bmatrix}
\qquad (11.56)
\]

where U is an undetermined upper-triangular matrix whose value will soon
become apparent. We denote the second matrix on the left of (11.56) as A,
and the matrix product on the left as C. We premultiply each side of (11.56)
by AT . The right-hand side of this product is AT C. The matrices AT and
C are both upper triangular. This follows because AT is upper triangular
from its definition; that C is upper triangular follows from the right-hand
side of (11.56). Since A^T has ones on its main diagonal, the diagonal entries
of the product A^T C are the same as those of C. We therefore have
\[
A^T C = A^T R A =
\begin{bmatrix}
P_K & & & U' \\
& P_{K-1} & & \\
& & \ddots & \\
0 & & & P_1 \\
& & & & P_0
\end{bmatrix}
\stackrel{\Delta}{=} P. \qquad (11.57)
\]

But the matrix A^T R A is symmetric; therefore U' must be 0. Thus, the


matrix AT RA = P is diagonal; i.e., P = diag[PK , PK−1 , . . . , P0 ]. This is a
significant result.

The first reason why (11.57) is significant is as follows. We define the upper-
triangular matrix B as
\[ B = AS \qquad (11.58) \]
where S = P^{-1/2}. Then, from (11.57), we have
\[
R = A^{-T} P A^{-1} = B^{-T} B^{-1}. \qquad (11.59)
\]

Therefore B is the inverse Cholesky factor of R. We have seen previously


that the inverse Cholesky factor of the covariance matrix whitens the cor-
responding vector sequence. We now investigate this property further, to
show that it indeed whitens our observed AR sequence x(n).

We therefore define a matrix X from the original sequence x(n) as
\[
X = \frac{1}{(N-K-1)^{1/2}}
\begin{bmatrix}
x_1 & \cdots & x_{K+1} \\
x_2 & \cdots & x_{K+2} \\
\vdots & & \vdots \\
x_{N-K-1} & \cdots & x_N
\end{bmatrix}
\qquad (11.60)
\]

where N ≫ K. It is clear that R = X^T X. We can perform a QR decomposition
of X as
\[ X = QU \qquad (11.61) \]
where U is the upper-triangular factor and Q has orthonormal columns.
Then,
\[ R = U^T U. \qquad (11.62) \]

Comparing (11.59) and (11.62), we see that U = B −1 . Therefore, from


(11.61), the matrix XB = Q has orthonormal columns.

This fact may also be seen directly from (11.57):

\[ A^T R A = A^T X^T X A = P = \text{diag}. \qquad (11.63) \]

Therefore, the matrix XA has orthogonal columns whose squared 2-norms


are the corresponding prediction error power. Post-multiplying XA by S
normalizes the columns to unit norm; therefore XAS = XB has orthonor-
mal columns. Therefore, (XB)T (XB) = I.

This idea may be extended even further. By comparing the operation in-
volved in the matrix product XB with (11.35), we see that the columns of
XA are the backward prediction errors wb,m (n), m = K, . . . , 1. We can
thus write
XB = [w̃b,K , w̃b,K−1 , . . . , w̃b,0 ]

where wb,m = [wb,m (m + 1) . . . wb,m (N )]T , m = 0, . . . , K is the vector of


mth-order backward prediction errors, and the superscript tilde represents
power normalization. Therefore, the matrix XB, whose columns are the
normalized backward prediction error sequences, is the Gram-Schmidt or-
thonormalization of X. This is a very useful and interesting result, since
the QR or Gram-Schmidt orthonormalization of the sequence x(n) can be
computed much more readily using the lattice filter structure discussed here,
rather than the classical Gram-Schmidt procedure discussed earlier.

An important corollary of the above is that the backward prediction error


sequences of different order are orthogonal; i.e.,
\[
w_{b,m}^T\, w_{b,n} = \begin{cases} P_m & m = n \\ 0 & m \ne n. \end{cases}
\]

This is equivalent to saying the matrix XA has orthogonal columns. This
orthogonality has important consequences in the field of adaptive filtering,
but is not considered further here.

We now have enough background where we can easily prove (11.32). From
(11.57) we see that R = A^{-T} P A^{-1}. The matrices A^{-1} and A^{-T} are upper
triangular with ones on the main diagonal, so their determinant is unity.
Because the determinant of a product is the product of determinants, we
therefore see that det R = \prod_i P_i. The second half of (11.32) follows from
(11.29).

Eq. (11.57) also leads to a computationally efficient means of calculating


R−1 . From R = A−T P A−1 , we have

\[ R^{-1} = A P^{-1} A^T. \qquad (11.64) \]

The computation of the term P^{-1} is easy, since P is diagonal.
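The factorization (11.64) can be verified numerically using the Levinson–Durbin sketch
given earlier in this chapter: building A from the prediction coefficient vectors of
orders K, K − 1, . . . , 0 and P from the corresponding error powers reproduces R^{-1}.
The code below is our own illustration of (11.56), (11.57) and (11.64):

import numpy as np

def toeplitz_inverse(r, K):
    """Form A and P of (11.56)-(11.57) from Levinson-Durbin solutions of
    orders K, K-1, ..., 0, and return R^{-1} = A P^{-1} A^T, eq. (11.64)."""
    A = np.zeros((K + 1, K + 1))
    P = np.zeros(K + 1)
    for j, order in enumerate(range(K, -1, -1)):      # columns: order K down to 0
        if order > 0:
            h, _, sigma2 = levinson_durbin(r, order)  # sketch from Sect. 11.1.2
            a = np.concatenate([[1.0], -h])           # [1, a_1, ..., a_order]
        else:
            a, sigma2 = np.array([1.0]), r[0]
        A[j:j + order + 1, j] = a                     # column j holds the order-`order` vector
        P[j] = sigma2
    return A @ np.diag(1.0 / P) @ A.T

# Check against a direct inverse of the Toeplitz matrix R built from r_0, ..., r_K
r = np.array([2.0, 1.2, 0.6])                          # illustrative autocorrelation values
K = 2
R = np.array([[r[abs(i - j)] for j in range(K + 1)] for i in range(K + 1)])
print(np.allclose(toeplitz_inverse(r, K), np.linalg.inv(R)))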

Bibliography

[1] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. The Johns
Hopkins University Press, 1996.
[2] S. Marple Jr., Digital Spectral Analysis. Prentice-Hall, 1987.
[3] S. Haykin, Nonlinear methods of spectral analysis. Springer Science &
Business Media, 2006, vol. 34.
[4] A. J. Laub, Matrix analysis for scientists and engineers. Siam, 2005,
vol. 91.
[5] N. K. Sinha and G. J. Lastman, Microcomputer-based numerical meth-
ods for science and engineering, 1988.
[6] K. Petersen, M. Pedersen et al., “The matrix cookbook, vol. 7,” Tech-
nical University of Denmark, vol. 15, 2008.
[7] A. Papoulis, Random variables and stochastic processes. McGraw Hill,
1994.
[8] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice Hall, Englewood
Cliffs, NJ, USA, 2001.
[9] R. Schmidt, “Multiple emitter location and signal parameter estima-
tion,” IEEE transactions on antennas and propagation, vol. 34, no. 3,
pp. 276–280, 1986.
[10] S. Haykin, J. Reilly, V. Kezys, and E. Vertatschitsch, “Some aspects
of array signal processing,” in IEE Proceedings F (Radar and Signal
Processing), vol. 139, no. 1. IET, 1992, pp. 1–26.
[11] T. M. Cover and J. Thomas, Elements of information theory. John
Wiley & Sons, 1999.

[12] H. L. Van Trees, Detection, estimation, and modulation theory, part I:
detection, estimation, and linear modulation theory. John Wiley &
Sons, 2004.
[13] L. L. Scharf and C. Demeure, Statistical signal processing: detection,
estimation, and time series analysis. Prentice Hall, 1991.
[14] J. H. Wilkinson, The algebraic eigenvalue problem. Clarendon press
Oxford, 1965, vol. 87.
[15] G. Strang, Linear Algebra and its Applications. Harcourt Brace Jo-
vanovich College Publishers, 1988.
[16] S. Haykin, Communication systems. John Wiley & Sons, 2008.
[17] Z. J. Koles, “The quantitative extraction and topographic mapping of
the abnormal components in the clinical eeg,” Electroencephalography
and clinical Neurophysiology, vol. 79, no. 6, pp. 440–447, 1991.
[18] P. L. Nunez, R. Srinivasan et al., Electric fields of the brain: the neu-
rophysics of EEG. Oxford University Press, USA, 2006.
[19] I. T. Jolliffe, “Principal components in regression analysis,” in Principal
component analysis. Springer, 1986, pp. 129–155.
[20] R. Rosipal and N. Krämer, “Overview and recent advances in partial
least squares,” in International Statistical and Optimization Perspec-
tives Workshop” Subspace, Latent Structure and Feature Selection”.
Springer, 2005, pp. 34–51.
[21] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,”
Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp.
37–52, 1987.
[22] H. Abdi, “Partial least square regression (pls regression),” Encyclopedia
for research methods for the social sciences, vol. 6, no. 4, pp. 792–795,
2003.
[23] P. Geladi and B. R. Kowalski, “Partial least-squares regression: a tu-
torial,” Analytica chimica acta, vol. 185, pp. 1–17, 1986.
[24] S. Wold, A. Ruhe, H. Wold, and W. Dunn III, “The collinearity problem
in linear regression. the partial least squares (pls) approach to general-
ized inverses,” SIAM Journal on Scientific and Statistical Computing,
vol. 5, no. 3, pp. 735–743, 1984.

[25] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical
learning: data mining, inference, and prediction. Springer Science &
Business Media, 2009.

[26] C. M. Bishop, Pattern recognition and machine learning. springer,


2006.

[27] H. Peng, F. Long, and C. Ding, “Feature selection based on mu-


tual information criteria of max-dependency, max-relevance, and min-
redundancy,” IEEE Transactions on pattern analysis and machine in-
telligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[28] P. C. Hansen, “The l-curve and its use in the numerical treatment of
inverse problems,” 1999.

[29] S. L. Marple Jr and W. M. Carey, “Digital spectral analysis with ap-


plications,” 1989.

[30] J. P. Burg, Maximum entropy spectral analysis. Stanford University,


1975.

